
West University of Timisoara
Faculty of Mathematics and Computer Science
Department of Computer Science

Ph.D. Thesis
Adaptive Scheduling for Distributed Systems

Candidate: Marc Eduard Frîncu
Scientific advisor: Prof. Dr. Dana Petcu

Timisoara, May 2011

Universitatea de Vest Timișoara
Facultatea de Matematică și Informatică
Departamentul de Informatică

Teză de Doctorat
Planificare Adaptivă pentru Sisteme Distribuite

Candidat: Marc Eduard Frîncu
Coordonator Științific: Prof. Dr. Dana Petcu

Timișoara, Mai 2011

Adaptive Scheduling for Distributed Systems

Marc Eduard Frîncu

A Thesis submitted for the degree of Doctor of Philosophy

Faculty of Mathematics and Computer Science

West University of Timisoara, Romania

Jury:
Reviewers: Costin Badica - University of Craiova, Romania
Valentin Cristea - Technical University of Bucharest, Romania
Daniela Zaharie - West University of Timisoara, Romania
Advisor: Dana Petcu - West University of Timisoara, Romania
President: Viorel Negru - West University of Timisoara, Romania

May 2011

Contents

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . ix

1 Introduction 9
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
  1.1.1 Grid and Cloud Computing . . . . . . . . . . . . . . . . . . 9
  1.1.2 Grid and Cloud Scheduling . . . . . . . . . . . . . . . . . . 11
1.2 Motivation of the Work . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Structure of this thesis . . . . . . . . . . . . . . . . . . . . . 15

2 State of Art 17
2.1 Formalism Description . . . . . . . . . . . . . . . . . . . . . . 17
  2.1.1 The Mathematical Model . . . . . . . . . . . . . . . . . . . 17
  2.1.2 Commonly Used Terms . . . . . . . . . . . . . . . . . . . . . 21
2.2 Scheduling Algorithms and Systems . . . . . . . . . . . . . . . . 25
  2.2.1 Scheduling Algorithms Overview . . . . . . . . . . . . . . . 26
  2.2.2 Resource Management Systems Overview . . . . . . . . . . . . 40
2.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3 Scheduling Heuristics for Heterogeneous Systems 45
3.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Dynamic Scheduling Algorithms for Heterogeneous Environments . . . 52
3.3 MinQL Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 54
  3.3.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . 54
  3.3.2 MinQL Tests . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 DMECT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 65
  3.4.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . 65
  3.4.2 Convergence of the Algorithm . . . . . . . . . . . . . . . . 70
  3.4.3 Stability of the Algorithm . . . . . . . . . . . . . . . . . 73
  3.4.4 Physical Movement Condition for Tasks on New Queues . . . . . 73
  3.4.5 Population based DMECT and MinQL . . . . . . . . . . . . . . 75
  3.4.6 DMECT Tests Using EET Approximations . . . . . . . . . . . . 77
  3.4.7 DMECT Tests Using Lateness . . . . . . . . . . . . . . . . . 81
  3.4.8 Population Based Tests for DMECT and MinQL . . . . . . . . . 82
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4 Rule Based Formalism for Expressing Scheduling Heuristics 89
4.1 Language and Platform for Executing SAs . . . . . . . . . . . . . 91
  4.1.1 Rule Based Language . . . . . . . . . . . . . . . . . . . . . 91
  4.1.2 Executing SiLK rules: the OSyRIS engine . . . . . . . . . . . 98
  4.1.3 Distributed OSyRIS . . . . . . . . . . . . . . . . . . . . . 100
  4.1.4 Auxiliary Tools . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Expressing Scheduling Heuristics Using a Rule Based Approach . . . 109
  4.2.1 Representing SAs Using Chain Reactions . . . . . . . . . . . 113
  4.2.2 Distributing the Scheduling Heuristics . . . . . . . . . . . 117
  4.2.3 Test scenarios and results . . . . . . . . . . . . . . . . . 122
4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5 Self-Healing Multi-Agent System for Task Scheduling 129
5.1 Platform Overview . . . . . . . . . . . . . . . . . . . . . . . . 131
  5.1.1 Agent Modules . . . . . . . . . . . . . . . . . . . . . . . . 133
  5.1.2 Managed Resources and Touch-points . . . . . . . . . . . . . 133
  5.1.3 Scheduling Agents . . . . . . . . . . . . . . . . . . . . . . 136
  5.1.4 Self-Healing Agents . . . . . . . . . . . . . . . . . . . . . 145
5.2 Reducing the Number of Agents used in the Negotiation Phase . . . 147
5.3 Dynamically Changing the Scheduling Policy at Runtime . . . . . . 150
5.4 Platform Testing Scenario and Results . . . . . . . . . . . . . . 159
  5.4.1 Optimal Parameters Tests . . . . . . . . . . . . . . . . . . 159
  5.4.2 Recovery Time Tests . . . . . . . . . . . . . . . . . . . . . 161
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6 Conclusions 165

Bibliography 167


List of Figures

2.2.1 Classification of Scheduling Algorithms . . . 26
2.2.2 List of Best Effort and QoS algorithms . . . 29
2.2.3 List of algorithms for independent and dependent tasks (DA06) . . . 32
3.1.1 Example of a simple two cluster interconnection . . . 49
3.1.2 Graphical example of a simple two cluster interconnection . . . 49
3.1.3 Example showing how traces can be added to CPU and network bandwidth . . . 50
3.2.1 Simple workflow example . . . 53
3.2.2 Task execution order for a simple workflow example . . . 54
3.3.1 Makespan of MinQL compared to other heuristics . . . 60
  (a) For a platform with h = 2 and 5 clusters . . . 60
  (b) For a platform with h = 18 and 5 clusters . . . 60
  (c) For a platform with h = 80 and 21 clusters . . . 60
  (d) For a platform with h = 80 and 5 clusters . . . 60
3.3.2 Compactness of MinQL compared to other heuristics . . . 61
  (a) For a platform with h = 2 and 5 clusters . . . 61
  (b) For a platform with h = 18 and 5 clusters . . . 61
  (c) For a platform with h = 80 and 21 clusters . . . 61
  (d) For a platform with h = 80 and 5 clusters . . . 61
3.3.3 Average waiting time per resource when the MinQL is used at both levels . . . 62
  (a) Scenario: 10 submitted online workflows . . . 62
  (b) Scenario: 20 submitted online workflows . . . 62
3.3.4 Average waiting time per resource for 20 workflows when different SAs are used at the two levels . . . 63
  (a) MinQL-Max-Min . . . 63
  (b) MinQL-Min-Min . . . 63
3.3.5 Average load per resource when the MinQL is used at both levels . . . 63
  (a) Scenario: 10 submitted online workflows . . . 63
  (b) Scenario: 20 submitted online workflows . . . 63
3.3.6 Average load per CAS for 20 workflows when different SAs are used at the two levels . . . 64
  (a) MinQL-Max-Min . . . 64
  (b) MinQL-Min-Min . . . 64
3.4.1 Comparison of the time constraint behaviour for several σ flavours . . . 68
  (a) Time constraint for a constant σ = 10 and a rescheduling interval of 1 time unit . . . 68
  (b) Time constraint for a priority based σ and a rescheduling interval of 1 time unit . . . 68
  (c) Time constraint for a deadline based σ and a rescheduling interval of 1 time unit . . . 68
3.4.2 Comparison of various DMECT flavours with other SAs . . . 80
  (a) Scenario 1: Makespan comparison on platforms with h = 2 . . . 80
  (b) Scenario 1: Compactness comparison on platforms with h = 2 . . . 80
  (c) Scenario 1: Avg. schedule runtime comparison on platforms with h = 2 . . . 80
  (d) Scenario 2: Makespan comparison on platforms with h = 18 . . . 80
  (e) Scenario 2: Compactness comparison on platforms with h = 18 . . . 80
  (f) Scenario 2: Avg. schedule runtime comparison on platforms with h = 18 . . . 80
3.4.3 Lateness comparison for deadline constraint DMECT . . . 85
  (a) Scenario 1: Comparison against other SAs on platforms with h = 2 . . . 85
  (b) Scenario 1: Comparison against other SAs on platforms with h = 18 . . . 85
  (c) Scenario 1: Comparison against other SAs on platforms with h = 80 . . . 85
  (d) Scenario 2: Comparison against other SAs on platforms with h = 2 . . . 85
  (e) Scenario 2: Comparison against other SAs on platforms with h = 18 . . . 85
  (f) Scenario 2: Comparison against other SAs on platforms with h = 80 . . . 85
3.4.4 Average no. of task movements comparison for deadline constraint DMECT . . . 86
  (a) Scenario 1: Comparison against other SAs on platforms with h = 2 . . . 86
  (b) Scenario 1: Comparison against other SAs on platforms with h = 18 . . . 86
  (c) Scenario 1: Comparison against other SAs on platforms with h = 80 . . . 86
  (d) Scenario 2: Comparison against other SAs on platforms with h = 2 . . . 86
  (e) Scenario 2: Comparison against other SAs on platforms with h = 18 . . . 86
  (f) Scenario 2: Comparison against other SAs on platforms with h = 80 . . . 86
4.1.1 Several types of elementary workflow constructs . . . 94
  (a) Sequence . . . 94
  (b) Split . . . 94
  (c) Parallel . . . 94
  (d) Join . . . 94
  (e) Decision . . . 94
  (f) Loop . . . 94
4.1.2 Mapping between the epsilon function from Relation 4.1.3 and the SiLK rule . . . 95
4.1.3 Switching between rule domains . . . 98
4.1.4 Architecture of the OSyRIS platform . . . 99
4.1.5 Distributed Workflow Engine . . . 101
4.1.6 Message flow and platform recovery . . . 103
4.1.7 The feedback loop for component recovery . . . 105
4.1.8 Workflow execution times when centralized and decentralized engines are used . . . 105
4.1.9 Rule Base Chaining Example . . . 108
4.1.10 User interface for the Visual Workflow Designer . . . 108
4.1.11 User interface for the Visual Workflow Manager . . . 109
4.2.1 Layers of the Distributed Rule Based Scheduling Platform . . . 110
4.2.2 Reaction chain inside the ECTMin solution . . . 113
4.2.3 Reaction chain inside the ContinueAndCheck solution . . . 114
4.2.4 Reaction chain inside the UnassignTask solution . . . 115
4.2.5 Reaction chain inside the MinMin solution . . . 116
4.2.6 Reaction chain inside the DMECT and TaskIterator solutions . . . 117
4.2.7 Data consistency issues when locking resource related molecules for synchronized access . . . 119
4.2.8 Solving data consistency issues when locking resource related molecules for synchronized access . . . 120
4.2.9 Example of moving a task from the current queue to the smallest using the DMECT sequential algorithm . . . 121
4.2.10 Example of moving a task from the current queue to the smallest available one using the DMECT distributed algorithm (the dotted squares represent resources locked by other schedulers) . . . 121
4.2.11 Makespan comparison among the Min-Min, Max-Min, Suffrage, DMECT sequential and the non-locking distributed version of DMECT . . . 125
4.2.12 Locking vs. non-locking distributed DMECT . . . 126
  (a) Scenario 1: Makespan comparison for online scheduling . . . 126
  (b) Scenario 1: Lateness comparison for online scheduling . . . 126
  (c) Scenario 1: Avg. schedule runtime comparison for online scheduling . . . 126
  (d) Scenario 2: Makespan comparison for offline scheduling . . . 126
  (e) Scenario 2: Lateness comparison for offline scheduling . . . 126
  (f) Scenario 2: Avg. schedule runtime comparison for offline scheduling . . . 126
5.1.1 Self-healing MAS scheduling platform: high level architecture . . . 132
5.1.2 Message format for the MAS platform . . . 135
5.1.3 Content format for a message containing task information . . . 136
5.1.4 Content format for a message containing agent module information . . . 137
5.1.5 Content format for a message containing the winner agent information . . . 137
5.1.6 Content format for a message containing platform information . . . 138
5.1.7 The feedback loop for the task rescheduling process . . . 138
5.1.8 State transitions between the communication language during the negotiation phase . . . 140
5.1.9 Example of a task description requirements using a Condor-like format . . . 142
5.1.10 State transitions between the task statuses . . . 143
5.1.11 The feedback loop for the module recovery process . . . 147
5.2.1 The influence of the number of clusters over the makespan produced by DMECT . . . 149
5.3.1 Data distribution for the case of h = 0 . . . 152
5.3.2 Data distribution for the case of h = 42 . . . 153
5.3.3 The gain with regard to the Best Selection strategy of several scheduling heuristics . . . 155
5.4.1 Occurrence of false healing events with regard to the ping time and message batch size . . . 161
  (a) False recovery events related with the ping interval size for hit = 1, miti = 1, msgtimeout = 1 . . . 161
  (b) False recovery events related with the ping interval size for hit = 1, miti = 500, msgtimeout = 1 . . . 161
5.4.2 Recovery time vs. number of failed modules in the platform . . . 162

List of Tables

2.1.1 Inconsistent EET matrix for a set of 3 tasks and 3 available resources . . . 22
2.1.2 Consistent EET matrix for a set of 3 tasks and 3 available resources . . . 22
3.3.1 Test platforms used in the experiments . . . 57
3.3.2 Gain of MinQL with regard to other SAs for a platform with h = 2 and 5 clusters . . . 59
3.3.3 Gain of MinQL with regard to other SAs for a platform with h = 18 and 5 clusters . . . 59
3.3.4 Gain of MinQL with regard to other SAs for a platform with h = 80 and 21 clusters . . . 62
3.3.5 Gain of MinQL with regard to other SAs for a platform with h = 80 and 5 clusters . . . 62
3.3.6 Makespan comparison for the case MinQL is used in a multi-level SP . . . 62
3.4.1 σ condition example for p = 1, 2 and a = 5 . . . 67
3.4.2 σ condition example for a task with initial TUD = 5 when using Relation 3.4.5 . . . 67
3.4.3 Characteristics of the strategies used to construct initial schedules . . . 76
3.4.4 Characteristics of the strategies used to perturb the schedules . . . 76
3.4.5 Gain of DMECT with regard to other SAs for the platform with h = 2 and 5 clusters . . . 79
3.4.6 Gain of DMECT with regard to other SAs for the platform with h = 18 and 5 clusters . . . 79
3.4.7 Average makespan obtained by online scheduling heuristics and their population based variants . . . 84
4.2.1 Atomic service methods for SAs . . . 113
4.2.2 Statistics on the rule based scheduling heuristics . . . 124
4.2.3 Speed-up for DMECT-par-6 scheduling time compared with the sequential version for 10 resources . . . 124
5.2.1 Agent grouping per cluster . . . 150
5.3.1 Confusion matrix for the case of using a h = 42 training set with 8 hidden layer elements and a learning rate of 0.3 . . . 156
5.3.2 True Positive and False Positive percentage for the platform with h = 42, training set with 8 hidden layer elements and a learning rate of 0.3 . . . 156
5.3.3 Confusion matrix for the case of using a h = 0 training set with 8 hidden layer elements and a learning rate of 0.3 . . . 157
5.3.4 True Positive and False Positive percentage for the platform with h = 0, training set with 8 hidden layer elements and a learning rate of 0.3 . . . 157
5.3.5 Confusion matrix for the case of using a mixed training set with 100 hidden layer elements and a learning rate of 1 . . . 158
5.3.6 True Positive and False Positive percentage for the platform with a mixed training set having 100 hidden layer elements and a learning rate of 1 . . . 158
5.3.7 Average schedule selection runtime in seconds . . . 159

List of Abbreviations

AMQP Advanced Message Queuing Protocol

ATT Approximate Turnaround Time

CAS Computer Algebra System

CG Computational Grid

DMECT Dynamic Minimisation of Estimated Completion Time

DS Distributed System

ECA Event Condition Action

ECT Estimated Completion Time

EET Estimated Execution Time

FCL Feedback Control Loop

GC Grid Computing

GS Grid Service

GWA Grid Workflow Archive

HDFS Hadoop Distributed File System

HTTP Hyper Text Transfer Protocol

IaaS Infrastructure as a Service

IO Input/Output

IP Infrastructure Provider

JSON JavaScript Object Notation

LHS Left Hand Side

LWT Local Waiting Time

MAPE Monitor-Analyse-Plan-Execute

MAS Multi-Agent System

MinQL Minimization of Queue Length

OSyRIS Orchestrating System using a Rule based Inference Solution

P2P Peer to Peer


PaaS Platform as a Service

PDA Platform Description Archive

RBS Rule Based System

RBSch Rule Based Scheduler

RBWS Rule Based Workflow System

REST Representational State Transfer

RHS Right Hand Side

RMI Remote Method Invocation

RU Resource Utilization

SA Scheduling Algorithm

SaaS Software as a Service

SiLK Simple Language for Workflows

SOA Service Oriented Architecture

SOAP Simple Object Access Protocol

SP Scheduling Platform

TC Transfer Cost

TWT Total Waiting Time

VO Virtual Organisation

WS Web Service

XML eXtensible Markup Language


“If people never did silly things, nothing intelligent would ever get done.”
(Ludwig Wittgenstein, philosopher)



Acknowledgments

Work on this thesis was supported by several international projects in which I was included as a member: FP6-2005-Infrastructure SCIEnce (2006–2010, RII3-CT-2005-026133; much of the work in Chapter 3), GiSHEO (2008–2010, European Space Agency PECS Programme; much of the work in Chapter 4) and FP7-ICT mOSAIC (2010–2013, grant no. 256910; much of the work in Chapter 5). To this list I also add FP6 SPRERS (2010–2012), a project that funded some of my visits to foreign research institutions.

However, despite the funding offered by these projects, my work would never have grown into this form without the aid of several people, to whom I would like to offer my thanks. Given the large number of people involved, the following list is not intended to be exhaustive.

First, I would like to express my deepest gratitude to my Ph.D. adviser, Prof. Dr. Dana Petcu, for the countless pieces of advice and the support offered throughout the entire process of researching, writing and submitting papers to international conferences and journals.

Second, I would like to thank my entire family, especially my dad, col. inf. Guerino Frîncu, for the advice and support in dealing with the mathematical models of the algorithms presented in this thesis.

Third, many thanks and immeasurable appreciation to Simina-Ioana Giurginca, for her understanding and for the hours spent proofreading this thesis and its accompanying abstract.

Many thanks to my colleagues from both the West University and the e-Austria Research Institute (including but not limited to Ciprian Craciun, Georgiana Macariu, Alexandru Carstea, Silviu Panica and Marian Neagul) for their advice and the fruitful meetings that led me to draw some important conclusions during the writing of this work.

Also many thanks to the people I met during my short visits to INRIA (Institut national de recherche en informatique et en automatique), France, who also helped shape the researcher I am today: Norha Villegas (University of Victoria, Canada), whom I met during my stay as a visiting researcher at INRIA Lille, team ADAM (2010); and Frederic Suter (Centre national de la recherche scientifique, CNRS, Lyon, France) and Martin Quinson (Nancy Universite, Nancy, France), both of whom co-advised me during my four-month internship at INRIA Nancy, team Algorille (2007–2008), a visit after which I decided to pursue research dedicated to scheduling algorithms. Many thanks also to the unlisted people who eased my stay and advised me during my visits to INRIA.

Last but not least, I dedicate this work to my late grandfather, teacher Pavel Frîncu, who I know would have liked to see me become a doctor but did not have the chance.



Abstract

Această teză abordează câteva din principalele probleme de planificare ce pot apărea în sistemele distribuite eterogene. Printre acestea, prezintă interes problemele cauzate de influența acestor sisteme asupra euristicilor de planificare și cele datorate erorilor de funcționare în componentele platformei de planificare. Teza începe prin prezentarea pe scurt a literaturii de specialitate și a modelului matematic folosit, continuă cu prezentarea principalelor rezultate pe parcursul a trei capitole și se sfârșește prin enumerarea principalelor concluzii.

Capitolele ce conțin contribuțiile personale cuprind următoarele: Capitolul 3 abordează problematica euristicilor de planificare pentru medii distribuite și oferă o euristică stabilă din punct de vedere al comportamentului pe gama de scenarii testate. Capitolul 4 studiază problematica distribuirii euristicilor și a separării acestora de platforma în sine. Pentru aceasta, este propus un limbaj bazat pe reguli de inferență pentru descrierea euristicilor de planificare, precum și o platformă pentru execuția acestora. Capitolul 5 prezintă o platformă de planificare bazată pe agenți ce oferă posibilitatea de a se auto-vindeca în caz de erori. Platforma folosește ca euristică de planificare la meta-nivel algoritmul descris în Capitolul 3. Totodată, ea se bazează pe modelul descris în Capitolul 4 pentru reprezentarea și execuția algoritmilor atât la nivel de furnizor cât și la meta-nivel. În Capitolul 5, este abordată și problematica alegerii dinamice a algoritmului de planificare folosind informații legate de aplicații și resurse extrase din platformă.

Rezultatele obținute extind cunoștințele din domeniu, prezentând un algoritm de planificare stabil, un model inovator bazat pe reguli de inferență pentru reprezentarea euristicilor de planificare, o platformă de planificare ce oferă un mecanism de auto-vindecare folosind bucle de auto-control unic în domeniu, precum și o metodă nouă de schimbare dinamică a euristicii de planificare.



Abstract

This thesis addresses some of the main issues related to task scheduling in distributed heterogeneous systems. Among them, particular interest is given to those caused by the influence of system volatility on the scheduling heuristics, and to the problems caused by failures in some of the scheduling platform components. The thesis starts with a brief state of the art and a description of the mathematical model used, continues by presenting the main achievements of our work over the course of three chapters, and ends by presenting some of the main conclusions.

The three chapters presenting the thesis's main achievements address the following: Chapter 3 tackles the problem of scheduling in heterogeneous environments and presents a scheduling heuristic that is stable with regard to the tested scenarios. Chapter 4 studies the issues of distributing the scheduling heuristics and of separating them from the scheduling platform logic. A solution based on an inference language is presented together with a rule execution platform. Chapter 5 presents a self-healing multi-agent scheduling platform. The platform uses at meta-level the scheduling heuristic described in Chapter 3 and relies on the model given in Chapter 4 to represent the scheduling heuristics at both local and meta-level. In Chapter 5 we also address the problem of dynamically selecting the scheduling policy at runtime. To achieve this, we propose a method based on information regarding the tasks and resources that are part of the distributed system.

The results obtained in this thesis expand the existing knowledge in the area of scheduling systems and algorithms for distributed systems by presenting: a stable scheduling heuristic; a novel model for representing distributed scheduling algorithms based on inference rules; a scheduling platform relying on a self-healing component based on feedback control loops, unique in the domain; as well as a new method for dynamically switching the scheduling algorithm at runtime.



1 INTRODUCTION

Contents
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.1.1 Grid and Cloud Computing . . . . . . . . . . . . . . . . . . . . . . 9

1.1.2 Grid and Cloud Scheduling . . . . . . . . . . . . . . . . . . . . . . 11

1.2 Motivation of the Work . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3 Structure of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 15

1.1 Introduction

1.1.1 Grid and Cloud Computing

Grid Computing (GC) is a form of distributed computing where the “virtual supercomputer” used for processing data is composed of loosely coupled computers. More specifically, it means on the one hand coordinating resource sharing and on the other hand solving problems in large, multi-institutional Virtual Organizations (VOs) (FK99). A Grid consists of heterogeneous machines, each of them offering its resources to the public. Access to resources is done in a transparent manner: each resource can be accessed by any client under certain security rights. Depending on their design objectives, Grids can be divided (KBM02) into: Computational Grids (CGs), Data Grids and Service Grids. CGs offer high aggregate computational resources for executing a large number of tasks in a short amount of time. They can be further divided into Distributed Supercomputing and High Throughput Systems. Data Grids provide an infrastructure for processing and storing large amounts of data kept in either a distributed file system or a large network. Service Grids offer access to resources provided by more than one machine. A recent trend in this direction is represented by the Software as a Service (SaaS) paradigm, which allows access to virtual resources through services. SaaS is part of what is currently called Cloud computing.

Milojicic et al. (MKL+03) include GC in the larger family called Distributed Systems (DSs), which comprises, among others, P2P computing and client-server systems. According to the same work, GC is sometimes associated with P2P computing. However, there are some fundamental differences arising from the very basis of their definition and usage, which offer counter-arguments to the previous affirmation. According to the definition proposed by



FOLDOC1, a DS is “a collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent. Distributed systems usually use some kind of client-server organization”. This definition implies that a DS is a controlled and managed network which acts as a single, logical computing resource, whereas P2P is neither controlled nor managed and the existence of several peers is not hidden from the user. Moreover, the contrast between Grids and P2P is also visible in the way they have been designed: the Grid (FK99) is a “coordinated resource sharing and problem solving in dynamic multi-institutional virtual organizations”. P2P systems are similar in that they also work towards solving a common goal, but they are neither coordinated nor based on institutions. As a general overview, we can enumerate the following positive and negative aspects: Grids are persistent, address security issues, use powerful resources, are data intensive and standards based, but face problems of autonomic configuration and management, as well as scalability issues. By contrast, P2P systems ensure much higher scalability, fault tolerance and self-configuration, yet have security problems, do not have the same infrastructure as Grids, and are less concerned with quality of service.

Recently, an increasing interest has been shown in Service Oriented Architectures (SOA) and virtualisation. Consequently, a considerable number of organizations have chosen to expose their resources, both hardware and software, using the “* as a service” (*aaS) paradigm, which includes SaaS, Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). The Grid community has itself been oriented in this direction, with solutions offering stateful access to services being integrated inside Grid middleware. For instance, Grid Services (GSs) have been designed as stateful Web Services (WSs). GSs also provide higher security than normal WSs. As a result of the constantly growing number of solutions for virtualisation and *aaS, the concept of Cloud emerged. Vendors of virtualised solutions and services are called Cloud Providers (CPs). Seen from a general perspective, Clouds and Grids are close, as both try to offer the same functionality: transparent access to virtually unlimited resources. FOLDOC2 gives the following definition for Cloud Computing: “a loosely defined term for any system providing access via the Internet to processing power, storage, software or other computing services, often via a web browser. Typically these services will be rented from an external company that hosts and manages them”. Clouds have managed to overcome some of the limitations and disadvantages of Grids, including: licensing; legal or political issues; lack of virtualisation support; complex architectures, tools and technologies, etc. Due to the potential of virtualisation and *aaS, Clouds receive a growing interest from industry, whereas Grids are mostly used by the academic community for large scale computing and storage.

¹ http://foldoc.org
² http://foldoc.org

Adaptive Scheduling for Distributed Systems 11

1.1.2 Grid and Cloud Scheduling

When one deals with large numbers of processes that race to obtain resources (CPU, memory, storage, network) in order to complete their tasks, it becomes clear that an optimal mapping between processes and resources is needed. This problem is however NP-complete (Bru07) and thus sub-optimal solutions are searched for instead. These solutions are generated by Scheduling Algorithms (SAs), usually as part of a scheduler. SAs cannot control the DS's resources directly and are hence similar to brokers or agents. According to Oliker et al. (OBSS04) a resource broker has the responsibility of selecting resources and processes so that all tasks get – given the scheduling heuristics – the best possible resources. Among the broker's tasks we can note: resource discovery (e.g., through the Grid Information Service – GIS – or the Universal Description Discovery and Integration services – UDDI), resource selection, and task (re)scheduling and migration.

Grid Scheduling is aimed at solving the problem of task scheduling inside heterogeneous environments. Oliker et al. (OBSS04) propose an architecture where the scheduling is done in a P2P manner together with a pseudo-hierarchy of schedulers. The upper layer would consist of Grid Schedulers which deal with task migration to the lower layer – meta-scheduling – made up of Local Schedulers that schedule access to local resources. The latter handle both tasks received from the Grid Schedulers and local tasks. This approach increases fault-tolerance as well as the scheduler's scalability. Fault tolerance is achieved as all tasks assigned to a Local Scheduler would continue executing even when either level fails. Scalability is obtained as it is possible to either add additional layers to the system (e.g., Cluster Schedulers which act as intermediaries between the Grid Schedulers and Local Schedulers) or increase the number of schedulers at either level.

Algorithms for these systems deal mostly with task migration between schedulers, while trying to optimize a certain objective function which depends on the scenario. This approach is preferable to the centralized solution, where scaling is almost impossible to achieve and fault tolerance is low. A third way of achieving scheduling is by offering a completely decentralized solution. In this scenario there are no resource controllers and the resource scheduling and allocation is directly determined by the resource requestors and providers.

Grid Scheduling can be used in several scenarios, including: parameter sweep applications (FH04), multi-site computing (EHS+02), P2P computing, etc.

• Parameter sweep applications are applications which are meant to be executed with distinct sets of input data. Such an application usually consists of a set containing the same task repeated with different input parameters;

• Multi-site computing means that a task is executed in parallel at different sites. As a result a greater number of usable resources becomes available. One possible drawback might be the overhead introduced by network bandwidth and latency;

• P2P and Grid computing involve solving jobs consisting of several tasks by distributing them on several machines spread across the network. These tasks usually use the cumulative power and bandwidth of the peers to solve their objectives. In a pure P2P network there are no client or server notions as all peers are equal.

Cloud Scheduling is similar to Grid Scheduling but focuses more on scheduling virtual machines by creating and destroying them in order to maximize load balancing and efficiency.

Both Grids and Clouds can extend over multiple VOs and CPs which are linked together, forming a federation. Their resources can be exposed in two ways: directly or through WS/GS interfaces. When dealing with a federation a common problem is that generally each involved party wants to maintain its individual scheduling, access and security policies. Hence scheduling becomes more difficult and requires cooperation and negotiation. SOA makes this procedure even more problematic, as executing and scheduling tasks inside these environments is hard for several reasons: schedulers find it difficult to retrieve information about the resource and network characteristics needed for scheduling decisions; and the SOA paradigm means that resources and applications cannot be accessed directly and usually expose an interface with limited functionality. As a result it is almost impossible to estimate the information required by schedulers (e.g., task run times and deadlines). Moreover, in what concerns availability, services are as unpredictable as the resources they run on. Transfer costs are not easy to determine either, since the route to the service is most of the time unknown and the service could act merely as a gateway to another hidden DS.

1.2 Motivation of the Work

DSs are defined by heterogeneity and dynamism. Consequently the behaviour of SAs is drastically – and continuously – affected by the system's configuration. Even though not as often as in P2P systems, Grids and Clouds are also susceptible to resource failures. As a result, SAs without rescheduling could cause tasks not to get executed. Dynamic rescheduling solves this problem only partially. Nonetheless, when a centralized planner is used or task replication is ignored, the system is still susceptible to failures, as the scheduler's execution depends on a single resource. Furthermore, due to the computational and communication heterogeneity found in DSs it is hard to estimate task execution and transfer times on resource queues. Also, because of wrong estimates, the mappings provided by SAs could negatively influence the cost function.

Most current solutions for Resource Management Systems (RMS) have been developed for cluster or intra-provider usage, where access policies are subject to the local authority's decisions (TTL05; Fos05; CKKG99; CD97). However, when considering inter-provider environments, these solutions need to be adapted accordingly in order to facilitate task migration under certain access and sharing rules. Multi-Agent Systems (MAS) offer a natural extension to RMS as they allow multiple providers to inter-operate autonomically through negotiation (SLGW02; OGM+05). Intelligent agents act in their own interest when solving a problem. Most MAS schedulers rely on specialized agent hierarchies (SFT00; CSJN05). Yet these multilevel approaches are not always necessary and usually imply a strict hierarchy which, once broken, leads to failures in the overall functioning of the scheduler. Besides, they require each VO/CP to use the existing agents and thus to change their scheduling policies. In contrast, a completely decentralized approach with no strict hierarchies, in which agents act as ants inside a colony and are able to heal themselves, offers a more fault-tolerant scheduling environment. Likewise, in order to allow VOs/CPs to maintain their autonomy, only the communication protocol should be standardised. In this way any member of the DS could expose its own custom-built agent – in terms of scheduling, access and security policies – to deal with the task negotiations.

SAs are influenced by both the network topology and user estimates of execution and transfer costs. As far as the latter are concerned, the work of Lee et al. (LSHS05) shows that no matter how wrong user estimates are, there exists an upper limit on how much the schedule cost function can deteriorate. However, due to the intrinsic nature of SAs to offer sub-optimal solutions, opinions are divided, with some authors arguing that the accuracy of user estimates does influence the cost function. For instance, Mualem et al. (MF01) argue that inexact estimates actually improve the cost function, while Chiang et al. (CADV02) give evidence to the contrary.

To get a global idea of how SAs behave, several resource and task configurations need to be tested. Maheswaran et al. (MAS+99) study eight SAs, including Minimum Completion Time (BSB01), Minimum Execution Time (BSB01), Opportunistic Load Balancing (BSB01), Switching Algorithm (MAS+99), K-Percent Best (MAS+99), Min-Min (MAS+99), Max-Min (MAS+99) and Suffrage (CLZB00), in various task and resource heterogeneity scenarios. Results show that no single SA can have a permanent advantage in all possible scenarios. It is proven that the relative performance of the previously listed scheduling heuristics depends on: the consistency of the Estimated Execution Time matrix (BSB01; MAS+99), the requirement to optimize certain performance metrics (either system or application oriented) and the arrival rate of new tasks.

Braun et al. (BSB01) also perform a comparison study including 11 SAs, as well as some meta-heuristics which had been omitted in the work of (MAS+99). They nevertheless exclude the Suffrage heuristic from testing. The tested meta-heuristics consist of: Genetic Algorithms (HAR94; WSRM97; WYJL04; ZWM99), Simulated Annealing (LA87), Genetic Simulated Annealing and Tabu Search (Bru07). An A∗ based technique (CL91) has also been considered. Twelve testing scenarios have been envisioned, including various Estimated Execution Time matrix configurations. The Genetic Algorithm proved to offer the best results under the simulated conditions, with Min-Min and A∗ following it.

Results given by Casanova et al. (CLZB00) show an example of how the Suffrage heuristic can perform better on heterogeneous platforms and worse on (almost) homogeneous ones – resources are considered to be clusters. The latter has been shown to be true for data intensive tasks, as Suffrage tends to schedule tasks based on their Suffrage value, which in this case would be close to 0 for all resources. Thus all tasks would get approximately the same priorities and consequently the scheduling heuristic could degenerate into a First In First Served technique. In the studied case of using networks of clusters and large input data, a task could fail to be scheduled on the cluster storing its large input data due to the high probability that two of that cluster's nodes are the best and second best candidates for the Suffrage value³. Eventually this could lead the task to be scheduled on another cluster, and the large amount of input data would have to be transported to it, causing a deterioration of the schedule's cost function.
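The degeneration described above is easy to see in code. The following Python sketch is ours, not from the thesis; the dictionary-based ECT representation and the function name are illustrative. It orders task-resource assignments by decreasing Suffrage value:

```python
def suffrage_order(ect):
    """Order task->resource assignments by decreasing Suffrage value.

    ect: dict task -> dict resource -> estimated completion time
    (at least two resources per task). The Suffrage value is the gap
    between the best and second-best ECT of a task.
    """
    entries = []
    for task, costs in ect.items():
        # rank resources from fastest to slowest for this task
        ranked = sorted(costs.items(), key=lambda kv: kv[1])
        (best_res, best), (_, second) = ranked[0], ranked[1]
        entries.append((second - best, task, best_res))
    entries.sort(reverse=True)  # tasks that "suffer" most go first
    return [(task, res) for _, task, res in entries]
```

On a near-homogeneous ECT matrix every Suffrage value approaches 0, so the ordering carries almost no information and the heuristic degenerates towards a first-come-first-served behaviour, as noted above.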

Schedules can also be affected by the heterogeneity of the tasks themselves – in terms of number of operations, instruction parallelism, input data, transmission rate between dependent tasks, etc. For instance, the Min-Min algorithm could perform worse than Max-Min in cases where there are many short tasks and just a few long ones (MAS+99). When applying the Max-Min algorithm in this scenario, the chance of executing many short tasks concurrently with long ones increases. This is in contrast with Min-Min, where the long tasks would be scheduled last, a behaviour which could deteriorate the cost function.
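A minimal sketch may help contrast the two heuristics. The Python below is illustrative code of ours (the function name and the dict-based ECT input are assumptions), implementing both Min-Min and Max-Min as list scheduling over a shared resource-availability vector:

```python
def _best(ect, task, avail):
    """Best resource for a task: minimum completion time given current load."""
    res = min(ect[task], key=lambda r: avail[r] + ect[task][r])
    return res, avail[res] + ect[task][res]

def greedy_schedule(ect, pick_max=False):
    """Min-Min (pick_max=False) or Max-Min (pick_max=True) list scheduling.

    ect: dict task -> dict resource -> estimated execution time.
    Returns (assignments, makespan).
    """
    avail = {r: 0.0 for costs in ect.values() for r in costs}
    tasks = set(ect)
    assignments = []
    while tasks:
        # minimum completion time of every unscheduled task on its best resource
        choices = {t: _best(ect, t, avail) for t in sorted(tasks)}
        # Min-Min picks the smallest of these minima, Max-Min the largest
        pick = (max if pick_max else min)(choices, key=lambda t: choices[t][1])
        res, done = choices[pick]
        assignments.append((pick, res))
        avail[res] = done
        tasks.remove(pick)
    return assignments, max(avail.values())
```

On a workload with many short tasks and one long task, Max-Min starts the long task first and overlaps the short ones with it, typically yielding the shorter makespan, which matches the observation above.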

Scheduling tasks inside a DS when a federation of VOs/CPs is assumed also raises the issue of negotiation when relocating tasks. The negotiation (SLGW02) needs to take place at specific moments that are correlated with the rescheduling process. Generally, negotiating for resources inside a federation is not different from locally selecting the best resource. Thus the meta-scheduler must be able to incorporate the negotiation within itself in order to treat both external and internal resources equally.

Consequently two main ideas can be emphasised. The first is that, given the presented work, no SA has an advantage over the rest when considering the dynamism of the DS; the second is that a good scheduler should have the ability to adapt to changes in the DS configuration or to failures in the system itself. These adaptations should occur at well chosen moments, so that no CPU time is wasted and the impact of the new decisions is maximized. A distributed MAS scheduler enhanced with negotiation and self-healing abilities, which has the capacity to adapt itself to the new DS configuration by changing the used heuristics – for instance based on the results outlined by the previous studies detailed in (BSB01; MAS+99) – could provide a solution to the problem. Given the current scientific literature, finding such a solution is not trivial, although some work towards creating switching SAs (MAS+99; EN07; SS08; CJ10) and self-healing schedulers (PKN04) has been done. Because of the large number of available SAs and the tendency to discover new improved versions, creating a super-scheduling algorithm which contains conditional branches to existing heuristics is inappropriate. Furthermore, this would require constant editing and recompilation of the algorithm whenever new scenarios are investigated. A new approach is therefore required. Rule-based systems could offer an answer as they allow a clear separation of the logic, represented by the SA, from the data, represented by objects. By using a rule-based approach when defining scheduling (meta-)heuristics it is possible to allow the SA to automatically adapt to changes in the DS/task configuration. These self-adaptations could occur by inserting and retracting rules from the SA rule base. In addition, the decision of when to perform such an action could be dictated by either previous tests or extrapolated results obtained on various scheduling scenarios. Rules also offer implicit parallelism, which could allow the scheduling decision to be parallelized so that multiple tasks requiring rescheduling would be simultaneously reassigned to new resources. This, however, is a possible cause of bottlenecks and inconsistency, as tasks could simultaneously require access to the same resource data when computing the cost function.

³ The Suffrage value is the difference between the cost offered by the best and the second best resource. In the literature the name can also be found spelled as Sufferage.

As mentioned in the previous paragraph, distributed schedulers need to be able to recover from failures either in the DS fabric or in the scheduler itself. Self-healing mechanisms can be implemented by using Feedback Control Loops (FCL) (MKS09). When applied to MAS, FCLs allow agents to recover from expected failures (e.g., scheduling and security policy modifications) and, more importantly, from unexpected ones (e.g., network or resource failures, agent availability). Working MAS for scheduling that integrate healing abilities are hard to find. (PKN04) proposes a distributed self-adaptive and failure tolerant system based on an n-ary tree. Each tree node represents a scheduler while leaves stand for processing nodes. Still, the system lacks a negotiation protocol between schedulers, communication is not truly decentralized as scheduling nodes can only communicate with their parents or children, and adaptivity is restricted to reassigning tasks to parent schedulers in case a scheduling node fails. Recently, Caprarescu et al. (CCDND10) proposed a self-adaptive MAS for optimizing the service distribution in a Cloud. The system uses dynamic service allocation and load balancing to optimize resource usage.

In summary, several approaches have been proposed toward creating either adaptive scheduling strategies or MAS for task scheduling. However, most approaches focus on individual aspects such as negotiation, SLA management, (adaptive) scheduling and self-healing. Building a functional Scheduling Platform (SP) which is fully distributed in terms of communication, storage, scheduling and negotiation and which is also self-healing is obviously still an open research challenge. To face this challenge, we propose in this thesis an adaptive inter-provider MAS SP based on an inference language for defining scheduling heuristics and on intelligent agent modules implemented as autonomous FCLs.

To treat negotiation and scheduling in a unitary manner inside a federated DS, we also propose an efficient scheduling heuristic capable of working at both meta and local levels. As DSs are dynamic in nature, each task is treated independently and has its own rescheduling time. To fully address the problem of providing a scheduling system capable of adapting to changes occurring inside heterogeneous and dynamic DSs, a rule base containing several heuristics is used. Selection of the best candidate is done by using a new method of selecting the heuristic which provided the best results in a scenario which closely resembles the current DS configuration.

1.3 Structure of this thesis

The remainder of the thesis is structured in 5 chapters as follows:

Chapter 2 presents the mathematical model used throughout the rest of the thesis, as well as a taxonomy of SAs and RMSs. Based on previous works the list is updated and several new categories are introduced.


Chapter 3 presents two SAs, MinQL (cf. Sect. 3.3) and DMECT (cf. Sect. 3.4), for heterogeneous distributed environments. The simulation environment (cf. Sect. 3.1) used for testing the algorithms is also described. The algorithms are formalized and tested against known policies in several scenarios. A population based version of the proposed SAs is also addressed with the aim of studying its improvement over the single-element versions.

Chapter 4 describes a formalism for expressing SAs using inference rules. This allows an easier customization, mixing and distribution of the scheduling heuristics. The rule based language and engine are depicted in Sect. 4.1. In Sect. 4.2 several SAs, including DMECT, are rewritten using the formalism and tested – both centralized and decentralized – for efficiency against the non rule-based versions.

Chapter 5 presents a modular self-healing MAS SP. The platform uses the formalism and SAs described in the previous chapters during negotiation and task scheduling. Emphasis is put on its two healing aspects, task rescheduling and module recovery, which are depicted in Sects. 5.1.3 and 5.1.4. The self-healing is based on FCLs and the modules that make up the loop are explained in detail. Optimization issues, including agent reduction during the negotiation phase and dynamic selection of the SA during runtime, are tackled as well. Both problems are solved by using clustering techniques. Section 5.2 shows how the agent reduction is made without deteriorating the schedule, while Sect. 5.3 describes how SAs can be switched at runtime based on current system configurations and previous knowledge of the best schedule provider. Tests on the platform recovery times are depicted in Sect. 5.4.

Finally, Chapter 6 outlines the main achievements and sets out future research directions.

2 State of the Art

Contents
2.1 Formalism Description
    2.1.1 The Mathematical Model
    2.1.2 Commonly Used Terms
2.2 Scheduling Algorithms and Systems
    2.2.1 Scheduling Algorithms Overview
    2.2.2 Resource Management Systems Overview
2.3 Conclusions

2.1 Formalism Description

This section presents the mathematical model and the main terms used in the field of scheduling algorithms and systems, as well as in the remainder of this thesis.

2.1.1 The Mathematical Model

In what follows we present a mathematical model for describing scheduling problems, based on the model presented by Brucker (Bru07). Several new notations are introduced as well, in order to integrate scheduling systems with MAS, as Brucker's work focused chiefly on scheduling problems.

Let M = {M1, M2, ..., Mm} be the set of machines which have to process n jobs represented by the set J = {J1, J2, ..., Jn}. A job Ji is usually made of several tasks {T1, ..., Tk}. This model differs from the one described in (Bru07), where jobs are made of operations. The reason is that we consider tasks to be part of greater jobs (workflows) and operations (or instructions) to be the atomic elements of tasks. Workflow tasks can be represented as Directed Acyclic Graphs (DAGs) or Petri Nets in case an interdependency between them exists.

As mentioned in the previous paragraph, each task Ti consists of ni operations {Oi1, Oi2, ..., Oini}. To every Oij a processing requirement pij (Bru07) is associated. In the particular case in which task Ti is made up of only Oi1 we identify Ti with Oi1. Each operation Oij is associated with a set of machines µij ⊆ M that can be either dedicated or parallel. Dedicated machines are general-purpose computer systems confined to performing only one function for reasons of efficiency or convenience. Parallel machines are computers which carry out many operations in parallel for efficiency reasons. In order to generalize the problem we consider that an operation Oij can be processed on any machine capable of solving it, in which case the machine is called a multi-purpose machine (Bru07). It is also possible for operation Oij to simultaneously use all the machines in µij, in which case we deal with multiprocessor task scheduling.

According to Brucker (Bru07) a schedule for a task Ti is an allocation of one or more time slices to one or several machines. The goal of any scheduling problem is to determine an allocation which satisfies certain restrictions. Schedules can be represented using Gantt charts (Cla52), which may be either machine or job oriented.

Additionally, a set of characteristics can be attached to each machine Mj ∈ M and represented as MCj = {s, p, c, m, d, o}, where s stands for the speed of the machine in flop/s, p represents the number of processors, c the number of cores, m the available memory in bytes, d the available disk space and o other relevant information such as the operating system, installed software, etc. Elements m and d are likely to vary over time due to machine workload and may need periodic updates.

The machine environment is specified by α = α1α2 (Bru07), where the possible values for α1 are {◦, P, Q, R, PMPM, QMPM, G, X, O, J, F} and α2 is ◦ or any positive integer. The symbol ◦ indicates the empty value in the case of α1 and an arbitrary value in the case of α2. Parameter α2 gives the number of machines in the machine environment while α1 indicates the type of the machines, as follows:

• α1 = ◦ means we deal with dedicated machines and for each job the machine must be specified;

• α1 ∈ {P, Q, R} means we deal with parallel machines and each task Ti can be processed on any machine Mj ∈ M. Three scenarios of parallel machines are usually taken into consideration: identical parallel machines (α1 = P), uniform parallel machines (α1 = Q) and unrelated parallel machines (α1 = R). Considering a processing time of pij for task Ti on machine Mj, we have pij = pi for α1 = P, pij = pi/sj for α1 = Q and pij = pi/sij for α1 = R, where sj represents the speed of machine Mj and sij represents the task-dependent speed of machine Mj;

• α1 ∈ {PMPM, QMPM} means we deal with Multi-Purpose Machines (MPMs). For PMPM we have identical MPMs while for QMPM we deal with uniform MPMs;

• α1 ∈ {G, X, O, J, F} means we deal with a multi-operation model. In this case each task Ti comprises a set of operations {Oi1, Oi2, ..., Oini} and the machines are dedicated. Depending on the value of α1 we have general shop (α1 = G), job shop (α1 = J), flow shop (α1 = F), open shop (α1 = O) or mixed shop (α1 = X) models. In a general shop model there exist precedence relations between arbitrary tasks; all the remaining multi-operational shops are special cases of it. A job shop, for example, has a special precedence relation of the form Oi1 → Oi2 → ... → Oini, ∀i = 1..n. Depending on whether µij equals µij+1, ∀j = 1..ni−1, the job shop is called with or without machine repetition. The flow shop, open shop and mixed shop are particular cases of the job shop.
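The three parallel machine models differ only in how the processing time pij is derived from the task length pi, which can be captured in a few lines (illustrative Python of ours; the function and parameter names are assumptions):

```python
def processing_time(p_i, alpha1, s_j=1.0, s_ij=None):
    """Processing time p_ij of task i on machine j for the parallel models.

    alpha1 = "P": identical machines,  p_ij = p_i
    alpha1 = "Q": uniform machines,    p_ij = p_i / s_j  (machine speed s_j)
    alpha1 = "R": unrelated machines,  p_ij = p_i / s_ij (task-dependent speed)
    """
    if alpha1 == "P":
        return p_i
    if alpha1 == "Q":
        return p_i / s_j
    if alpha1 == "R":
        return p_i / s_ij
    raise ValueError("expected one of the parallel machine models: P, Q, R")
```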

Tasks usually have some characteristics, normally described by a set β made up of at most six elements (Bru07):

• β1 is optional and indicates whether pre-emption is allowed or not. If it is missing from the set then the task cannot be pre-empted; otherwise it must be set to the value pmtn;

• β2 indicates any precedence relations between tasks. Usually such relations are described by a DAG G = (V, A), where V = T and the pair (i, j) ∈ A iff Ti must be completed before Tj. Section 2.2.1 presents this scenario in greater detail. Given the definition of a job we can also state that ⋃ G = J;

• β3 is optional and represents the release date of the specified task. It specifies the time when the first operation belonging to that task becomes available for processing. If it is missing then the release date is not set and the task is available immediately;

• β4 indicates restrictions on the processing time or on the number of operations allowed;

• β5 indicates the deadline for executing task Ti;

• β6 is optional and indicates whether or not we deal with a batching problem. A batch is generally made up of several tasks which are to be scheduled together, but it can also consist of one single task. In (Bru07) it is stated that these batches must be solved jointly on a machine; however, this can be generalized such that more than one machine could be used, which is also the case for the SAs described in Section 2.2.1. Mainly we deal with two types of such problems: p-batching problems and s-batching problems. In the case of p-batching problems the length of the batch is equal to the maximum of the processing times of all tasks comprised in the batch. Similarly, s-batching problems consider the sum of the processing times of all tasks in the batch. Consequently the only admissible values for β6 are p-batch and s-batch.
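The two batch lengths can be sketched directly (illustrative Python of ours, names are assumptions):

```python
def batch_length(processing_times, kind):
    """Length of a batch of tasks.

    kind = "p-batch": the length is the maximum processing time in the batch;
    kind = "s-batch": the length is the sum of the processing times.
    """
    if kind == "p-batch":
        return max(processing_times)
    if kind == "s-batch":
        return sum(processing_times)
    raise ValueError("kind must be 'p-batch' or 's-batch'")
```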

We now introduce the notion of resource, which is usually identified with a machine. However, in this thesis a resource is identified with a network node which, in our model, can be either an individual machine or a cluster. Hence we define the set of resources R = {R1, R2, ..., Rl}, where in general Rk = {⋃ Mj | Mj ∈ M} ⊆ M. Furthermore, in this thesis we only deal with non-pre-emptive tasks (β1 is missing from the task characteristics). Each resource Rj has a queue Qj ∈ Q = {Q1, Q2, ..., Ql} attached. The correspondence between resources and queues is one-to-one. Every Qj contains a set of tasks T′ ⊆ T assigned to be solved on resource Rj.

Definition 2.1.1 A schedule is defined as a function f : T → R which maps every task Ti ∈ T on a resource Rj ∈ R linked to a queue Qj.


The aim of every schedule is to minimize the total cost function (Bru07). Two examples of such functions are:

g_max(C) = max{ g_i(C_i) | i = 1, ..., n }    (2.1.1)

g_sum(C) = Σ_{i=1}^{n} g_i(C_i)    (2.1.2)

where n represents the number of tasks, C_i represents the finishing time of T_i and g_i(C_i) its associated cost.

According to Brucker (Bru07) the most important cost objective functions are the makespan and the maximum lateness. The makespan is defined as the time needed to complete the schedule; formally, C_max = max{ C_i | i = 1, ..., n }. The maximum lateness is defined as L_max = max_{i=1..n}(C_i − d_i), where d_i is the deadline of task T_i.
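Both objectives are straightforward to compute from the task completion times (illustrative Python of ours; completion and deadline are assumed to be dicts keyed by task):

```python
def makespan(completion):
    """C_max: the largest task finishing time in the schedule."""
    return max(completion.values())

def max_lateness(completion, deadline):
    """L_max = max_i (C_i - d_i); a positive value means a missed deadline."""
    return max(completion[t] - deadline[t] for t in completion)
```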

As an alternative to the makespan we can use the TPCC (Total Processor Consumption Cycles) (FH04), which represents the total computing power consumed by an application (Relation 2.1.3). Its advantage is that it is barely affected by the variance of resource performance. A schedule with a good TPCC is also a schedule with a good makespan. When using this objective function the length of a task is represented by the number of instructions (flop) in it, while the speed of the processor is expressed in flop/s.

When C_max is assumed, the TPCC can be expressed as follows:

TPCC = Σ_{j=1}^{m} Σ_{t=0}^{⌊C_max⌋−1} s_{j,t} + Σ_{j=1}^{m} (C_max − ⌊C_max⌋) s_{j,⌊C_max⌋}    (2.1.3)

where m represents the number of resources and s_{j,t} stands for the speed of machine M_j in the time interval [t, t+1). We can safely assume, without loss of generality, that the processor speed does not vary inside the specified interval.
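Relation 2.1.3 can be sketched as follows (illustrative Python of ours; speeds[j][t] is assumed to hold the per-interval machine speeds and c_max ≥ 0):

```python
def tpcc(speeds, c_max):
    """Total Processor Consumption Cycles for a schedule of length c_max.

    speeds[j][t] is the speed (flop/s) of machine j during the interval
    [t, t+1); the speed is assumed constant inside each interval.
    """
    whole = int(c_max)  # floor of c_max for c_max >= 0
    total = 0.0
    for machine in speeds:
        total += sum(machine[t] for t in range(whole))  # full unit intervals
        if c_max > whole:  # trailing fractional interval
            total += (c_max - whole) * machine[whole]
    return total
```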

Optimality for schedules is defined according to Definition 2.1.2:

When considering makespan and maximum lateness, γ equals C_max and L_max, respectively. Brucker (Bru07) defines a class of scheduling problems as follows:

Definition 2.1.3 Given the machine environment (α), task characteristics (β) and optimality criterion (γ), we define a class of scheduling problems as a triplet α|β|γ.

In this thesis we deal with problems falling into one of the following six classes: Q|s-batch|Cmax, Q|prec|Cmax, Q||Cmax, Q|s-batch|Lmax, Q|prec|Lmax and Q||Lmax.

Because in DS resources are spread out and possibly under security policies, they require an independent and autonomous method for communication. Autonomous intelligent agents can be used to handle task scheduling, migration and negotiation. They have the advantage of three main characteristics (SFT00): reactivity, which enables agents to react to events from the environment; pro-activeness, which allows agents to take goal-driven initiatives; and social ability, which enables agents to interact with other agents. An agent is defined as A_l = (A_l^1, A_l^2), where A_l^1 = R′ ⊂ R represents the set of resources handled by the agent, with ⋂_l A_l^1 = ∅ and ⋃_l A_l^1 = R, and A_l^2 represents the set of capabilities possessed by the agent.

2.1.2 Commonly Used Terms

Besides the previously defined formalism we also need to introduce a couple of terms well known in the scientific literature.

The Transfer Cost (TC) of task Ti from queue Qk to queue Qj can be determined in many ways, depending on the complexity of the scenario. Equation (2.1.4) gives the simplest method for estimating the time required by a task Ti to move from the current queue to the new one. It is also the least realistic estimation. Examples closer to a real scenario include wormhole and Max-Min fairness variations.

TC_i^{kj} = Σ_{l=1}^{n} ( lat_l + T_i^size / bw_l )    (2.1.4)

where T_i^size represents the size of T_i in bytes, n represents the number of links in the route between the old and the new computing node, and lat_l and bw_l stand for the latency and bandwidth of link_l in the network route. Their values can be determined by using either topology mapping tools or data extracted from an on-line catalogue.
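Equation (2.1.4) translates directly into code (illustrative Python of ours; links is assumed to be a list of per-hop (latency, bandwidth) pairs):

```python
def transfer_cost(task_size, links):
    """Estimated transfer time of a task over a network route (Relation 2.1.4).

    task_size: size of the task in bytes;
    links: list of (latency_seconds, bandwidth_bytes_per_second) per hop.
    """
    return sum(lat + task_size / bw for lat, bw in links)
```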

The Estimated Completion Time (ECT) is the estimated time when task T_i completes on resource R_j. It is defined as in Equation (2.1.5):

ECT_i^j = max{ ITT, EWT_i^j, IST_j } + EET_i^j    (2.1.5)

where ITT represents the Input Transfer Time in seconds, EWT_i^j is the Estimated Waiting Time of task T_i on the queue Q_j associated with resource R_j until all tasks in front of T_i get executed, IST_j is the time needed to start the task on resource R_j, and EET_i^j stands for the Estimated Execution Time of T_i on R_j.
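A sketch of Relation 2.1.5 (illustrative Python of ours):

```python
def estimated_completion_time(itt, ewt, ist, eet):
    """ECT of a task on a resource, as in Relation 2.1.5.

    itt: Input Transfer Time, ewt: Estimated Waiting Time in the queue,
    ist: resource start time, eet: Estimated Execution Time. The max
    reflects the assumption that the preparatory phases start at the
    same moment.
    """
    return max(itt, ewt, ist) + eet
```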

In cases where a task $T_i$ is executed on multiple processors, a formula based on Amdahl's law (Amd67) can be used. An example of computing the EET of a task is given in paper (RvG01) (Relation 2.1.6):

$$EET^{N_p}_i = \left(\alpha + \frac{1-\alpha}{N_p}\right)EET_i \quad (2.1.6)$$

where $N_p$ represents the number of processors which execute $T_i$, $\alpha$ is the fraction of $T_i$ which executes serially, and $EET_i$ represents the estimated execution time on one processor.
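As a quick numeric illustration of Relation (2.1.6) (the function name and the sample values are ours):

```python
# Sketch of Relation (2.1.6): scaling a single-processor EET to Np
# processors given the serial fraction alpha (Amdahl's law).

def eet_on_np(eet_single, alpha, np_):
    return (alpha + (1 - alpha) / np_) * eet_single

# A 100 s task that is 10% serial, run on 4 processors:
print(eet_on_np(100.0, 0.1, 4))  # 32.5
```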

22 Marc E. Frıncu

Related to EET, a commonly used notation is the EET matrix (BSB01; MAS+99). Each entry $EET^j_i$ represents the EET of task $T_i$ on resource $R_j$. EET matrices can be classified as consistent, inconsistent and partially-consistent matrices. The first ones (Table 2.1.2) are matrices whose rows rank the resources identically, i.e., the resource that is fastest for one task is the fastest for all tasks. The second ones (Table 2.1.1) represent situations where some resources in $R$ are faster for some tasks and slower for others. The third case refers to inconsistent matrices that include a consistent sub-matrix of a predefined size.

     R1    R2    R3
T1   200   150   250
T2   17    21    15
T3   1000  800   300

Table 2.1.1: Inconsistent EET matrix for a set of 3 tasks and 3 available resources

     R1   R2   R3
T1   150  200  250
T2   15   17   21
T3   300  800  1000

Table 2.1.2: Consistent EET matrix for a set of 3 tasks and 3 available resources
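The consistency property can be checked mechanically: a matrix is consistent when every row ranks the resources in the same order. A minimal sketch (rows are tasks, columns are resources in a fixed order; the helper name is ours):

```python
# Sketch: an EET matrix is consistent when every row ranks the resources
# in the same order (one resource is fastest for all tasks, and so on).

def is_consistent(eet_matrix):
    def ranking(row):
        # Resource indices sorted from fastest to slowest for this task.
        return sorted(range(len(row)), key=lambda j: row[j])
    first = ranking(eet_matrix[0])
    return all(ranking(row) == first for row in eet_matrix[1:])

inconsistent = [[200, 150, 250], [17, 21, 15], [1000, 800, 300]]   # Table 2.1.1
consistent   = [[150, 200, 250], [15, 17, 21], [300, 800, 1000]]   # Table 2.1.2
print(is_consistent(inconsistent), is_consistent(consistent))  # False True
```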

The Approximate Turnaround Time (OBSS04) of a task $T_i$ sent from resource $R_k$ to resource $R_j$ and back is computed as follows:

$$ATT^j_i = \max(EWT^j_i, TC^{kj}_i, IST_j) + EET^j_i + TC^{jk}_i \quad (2.1.7)$$

Additionally, we have the following terms related to a task's execution: the Estimated Start Time ($EST^j_i$) and the Real Start Time ($RST^j_i$) of task $T_i$ on resource $R_j$.

The Resource Utilization (OBSS04) of resource $R_j$ is given by the ratio:

$$RU_j = \frac{p^{used}_j}{p^{total}_j} \quad (2.1.8)$$

where $p^{used}_j$ and $p^{total}_j$ represent the number of used, respectively total, processors on resource $R_j$.

Resource utilization can also be linked to $C_{max}$ and ECT:

$$RU = \frac{\sum_{M_j \in M} completion_j}{C_{max} \cdot |M|} \quad (2.1.9)$$

where $completion_j$ represents the time when machine $M_j$ finishes executing its assigned tasks.

The reason we chose to use the max function instead of a sum in Relations 2.1.5 and 2.1.7 is that all operations listed in them are assumed to start at the same moment. If a sequential start-up had been assumed we would have a sum instead of a max.

The Matching Proximity (MP) (XA10) metric can be used to match tasks to resources that best fit the selected criteria. For instance, when the fastest resources are targeted for every $T_i$ this metric becomes:

$$MP = \frac{\sum_{T_i \in T} ECT^{f(T_i)}_i}{\sum_{T_i \in T} ECT^{\lambda}_i} \quad (2.1.10)$$

Adaptive Scheduling for Distributed Systems 23

where $\lambda = \{R_k : ECT^k_i \le ECT^j_i, \forall R_j \in R\}$. The larger MP is, the larger the number of tasks assigned to the fastest resources.

Next we define some additional metrics which will help in the analysis of the algorithms. Some of these metrics cannot, however, be seen as optimality criteria, as they arise naturally from the scheduling heuristics of the algorithms.
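Relation (2.1.10) can be sketched as follows; the `assignment` mapping $f$ and the function name are hypothetical, introduced only to make the formula concrete:

```python
# Sketch of Matching Proximity (Relation 2.1.10): compare the ECTs of an
# actual assignment f against the ECTs obtained by always picking, for
# every task, the resource lambda with the minimum ECT.

def matching_proximity(ect, assignment):
    """ect[i][j] = ECT of task i on resource j; assignment[i] = chosen j."""
    assigned = sum(ect[i][assignment[i]] for i in range(len(ect)))
    best = sum(min(row) for row in ect)
    return assigned / best

ect = [[150, 200, 250], [15, 17, 21]]
print(matching_proximity(ect, [0, 0]))  # every task on its fastest resource -> 1.0
```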

The schedule's throughput represents the rate at which tasks are executed per time unit. The optimal value is achieved when the total throughput of a job is maximized:

$$\rho^{*(i)} = \sum_{j=1}^{m} \rho^{*(i)}_j \quad (2.1.11)$$

where $\rho^{*(i)}_j$ represents the throughput of $M_j$ for $J_i$.

Remark 2.1.1 Finding the optimal throughput requires the solution to be characterized bya linear program over rational numbers, a problem which is of polynomial complexity.

A solution to this problem is given by Relation 2.1.12. The full proof can be found in (BMP+10):

$$\rho^*(J_i) = \min\left\{\frac{BW}{\delta_i}, \sum_{j=1}^{m}\min\left\{\frac{s_{ij}}{w_i}, \frac{bw}{\delta_i}\right\}\right\} \quad (2.1.12)$$

where $m$ represents the number of machines, $BW$ symbolizes the bandwidth of the sender's network card, $\delta_i$ represents the size of $J_i$ in bytes, $bw$ stands for the bandwidth between the sender and the receiver machine, $w_i$ represents the number of flops required to process $J_i$, and $s_{ij}$ represents the processing speed of machine $M_j$ when executing $J_i$ in the unrelated parallel machines model.

In the general case of having several applications concurrently executing on several machines, the solution is to find a point $(\rho^{J_i}_{M_j \to M_l}(t_k, t_{k+1}), \rho^{J_i}_{M_l}(t_k, t_{k+1}))$ – where $\rho^{J_i}_{M_j \to M_l}$ represents the communication throughput between $M_j$ and $M_l$, and $\rho^{J_i}_{M_l}$ represents the computational throughput of $M_l$ – in the convex polyhedron defined by several constraints, as detailed in (BMP+10).

Having defined the throughput, we can now express the schedule makespan as:

$$C_{max} = \max\left\{C^{J_i}_{max} = \frac{N_{J_i}}{\rho^{J_i}}, \forall J_i \in J\right\} \quad (2.1.13)$$

where $N_{J_i}$ represents the number of tasks in $J_i$.

The stretch of a job represents its slowdown and is defined as:

$$S_{J_i} = \frac{C^{J_i}_{max}}{C^{optimal}_{max}} \quad (2.1.14)$$

where $C^{optimal}_{max}$ represents the optimal makespan when only $J_i$ is executed.


The objective function in this case is represented by $S = \max_{J_i \in J} S_{J_i}$.

The compactness (Fr9b) $c$ of a SA is defined as the ratio between the queue with the smallest ECT and the one with the largest ECT. As a result, a SA is said to be ideal in terms of node workload if $c \nearrow 1$; $c$ is recomputed each time the queues are rebalanced.

A measure of the improvement of one SA relative to another in terms of makespan is given by its gain. Given two SAs, $SA_1$ and $SA_2$, the gain is given by the ratio:

$$gain = \frac{C^{SA_1}_{max}}{C^{SA_2}_{max}} \quad (2.1.15)$$

Assuming $gain > 1$, the bigger its value is, the better algorithm $SA_2$ is.

When dealing with DS we usually refer to them as heterogeneous platforms. This property can be quantified by the heterogeneity factor $h$, based on the definition found in paper (SC07) (Relation 2.1.16):

$$h = \left(\frac{s_{max}}{s_{min}} - 1\right) \times 100 \quad (2.1.16)$$

where $s_{max}$ and $s_{min}$ represent the maximum, respectively minimum, processor speed in the platform, expressed in flops/s.

The value of h can be interpreted as the maximum relative difference in percentage ofcomputing power found inside a DS.

Having defined the heterogeneity factor of a platform, we can now define a homogeneous platform as one with $h = 0$. In a DS we usually do not deal with the latter case; instead we could have what is called an almost homogeneous platform, defined as having $h < 10$.
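As a quick numeric illustration of Relations (2.1.15) and (2.1.16) (function names and the sample speeds are ours):

```python
# Sketch of Relations (2.1.15) and (2.1.16): the makespan gain of SA2
# over SA1, and the heterogeneity factor h of a platform.

def gain(cmax_sa1, cmax_sa2):
    return cmax_sa1 / cmax_sa2

def heterogeneity(speeds_flops):
    return (max(speeds_flops) / min(speeds_flops) - 1) * 100

print(gain(400, 320))                  # 1.25: SA2 improves the makespan
print(heterogeneity([2e9, 2e9, 2e9]))  # 0.0: homogeneous platform
print(heterogeneity([1e9, 4e9]))       # 300.0: 300% relative difference
```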


2.2 Scheduling Algorithms and Systems

At the core of any scheduler, whether centralized or not, lies a SA responsible for mappingtasks on resources.

SAs usually try to optimize one or more optimality criteria (cf. Section 2.1.1), using the following approaches as guidelines:

• Optimizing performance without regarding the cost and vice-versa;

• Optimizing performance within a specific cost and/or time constraint;

• Optimizing cost within a specific performance and/or time constraint.

Task scheduling is not an easy problem and also depends on how one views the scheduling model. Ernemann et al. (EHS+02) identify the following basic models: the site model, the machine model and the job model. The site model considers the DS as a set of independent machines, each with its own resources and local users that submit jobs to the Local Scheduler. Jobs coming from the DS are distributed by the Global Schedulers, and resources are controlled exclusively by them. The next model is the machine model, in which each site is viewed as a Massive Parallel Processor (MPP) consisting of several homogeneous (or almost homogeneous) nodes. The machines in this case support space-sharing and run jobs in an exclusive way, neither pre-empted nor time-shared. Thirdly, the authors identify the job model, where jobs are submitted on local machines, producing as a result a stream of jobs, also called batch jobs. Jobs are later sent to other sites by the Global Schedulers unless certain conditions are met on the local machine.

All of these models deal with additional problems not related to computing resources. These problems arise mostly from network characteristics and can add overhead to the total processing time of a particular job because of time delays in data transport. The issue of time delays can be addressed by pre/post-fetching data before the actual job execution. Nonetheless, this can create problems if the job later has to be rescheduled by the Global Schedulers due to high waiting times.

From these models one can obtain several job execution scenarios which do not necessarilyrelate to actual job planning algorithms but can rather be viewed as the execution spacefor a particular job. These scenarios include: simple local job processing where each job issolved locally and cannot migrate from one local scheduler to another, inter-site job sharingwhere jobs can migrate from one site to another, and site-sharing in which a job does notexecute exclusively on one site but can use several sites in the scenario in which it does notfind enough resources on a single one.

The way in which jobs are chosen to migrate depends on the SA. One of the major problems SAs deal with is the heterogeneity of the underlying DS. This often leads to different capabilities for job processing and data access. Another problem is that sites are usually autonomous, and local jobs might temporarily take control and push a job coming from the Global Scheduler onto a waiting queue. The resource broker cannot intervene in the logic of the Local Scheduler as this would violate one of the site's basic properties, autonomy.


2.2.1 Scheduling Algorithms Overview

According to Dong et al. (DA06), SAs can be classified by several criteria (cf. Fig. 2.2.1):

Figure 2.2.1: Classification of Scheduling Algorithms

Local vs. Global: This classification divides SAs according to where the schedulingdecisions are made: at machine level – operation usually performed by the Local Schedulers– or at inter-site level – the operation is handled by the Global Scheduler.

Static vs. Dynamic: This classification divides SAs according to the moment scheduling decisions are taken. In the case of static SAs scheduling is performed before the jobs are submitted, whereas in the case of dynamic SAs the decisions are taken while the job is being executed. The first approach assumes that all the information about resources is known a priori and that resource properties do not vary over time. The second approach takes into consideration the dynamics of the DS and is generally preferred, as resource properties may change unpredictably during a job's execution. Among the changes that could influence the outcome of the schedule are computational node failures, node overloads and bandwidth bottlenecks. Since the cost of an assignment is not available when dealing with dynamic scheduling, a natural way to keep the whole system healthy is periodic load rebalancing. The advantage of dynamic load balancing over static scheduling is that the system does not need to be aware of the run-time behaviour of the job before execution. It is particularly useful in a system where the primary performance goal is maximizing resource utilization, rather than minimizing runtime for individual tasks. Depending on who initiates


the balancing process there are three kinds of approaches to dynamic scheduling (OBSS04):

• Sender initiated (SI): the Global Scheduler sends the resource requirements of taskTi to all clusters and waits for estimates such as ATT (cf. Relation 2.1.7) or RU ofresource Rj (cf. Relation 2.1.8).

In case several ATTs are smaller than a given threshold t, the cluster node with the smallest RU is selected. An improvement of this approach is to allow jobs to migrate from one cluster to another once they are scheduled.

• Receiver initiated (RI): each cluster periodically checks its Cluster Resource Utilization(CRU). If CRU falls under a given threshold c then it will volunteer itself for jobs byinforming the Global Scheduler. This in turn informs the other clusters of the newlyavailable resource. Each of the computing nodes in the informed clusters then checksthe ATT value of the first job queued and compares it with the ATT s of the volunteer’snodes. Only those nodes with a small RU from the volunteer cluster will be taken intoconsideration. If a volunteer node which minimizes the ATT is found the job will bemigrated on it.

• Symmetrically initiated (SyI): an approach that combines the previous two. As in RI, each node periodically checks for underused resources and informs the Global Scheduler. The main difference appears when a job exceeds its LWT but no volunteer node is found. While in RI it waits passively for a volunteer, in SyI it actively sends a request to the clusters using the SI approach.
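The selection step of the sender initiated policy can be sketched as follows; the threshold `t`, the tuple layout of the cluster estimates and the function name are all hypothetical illustration choices:

```python
# Sketch of the sender initiated (SI) policy: among clusters whose ATT
# estimate for the task falls below threshold t, pick the one with the
# smallest resource utilization RU.

def si_select(estimates, t):
    """estimates: list of (cluster_id, att_seconds, ru) tuples."""
    eligible = [e for e in estimates if e[1] < t]
    if not eligible:
        return None  # no cluster can meet the ATT threshold
    return min(eligible, key=lambda e: e[2])[0]

estimates = [("c1", 120.0, 0.9), ("c2", 80.0, 0.4), ("c3", 95.0, 0.2)]
print(si_select(estimates, t=100.0))  # c3: under the threshold with lowest RU
```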

Some of the well known solutions in the case of dynamic load balancing are the following(DA06):

• Unconstrained FIFO : the resource with the currently shortest waiting queue or thesmallest waiting queue time is selected as destination for the new task. This policy isalso called Opportunistic Load Balancing (OLB) (MAS+99);

• Balanced Constrained : loads on resources are periodically re-balanced inside a neigh-bourhood where access to resources does not introduce a time overhead due to networkconnections. This is an important condition in the case of large systems such as Gridsor Clouds where task migration could introduce high transfer costs;

• Cost Constrained : this approach is similar to the previous one, the only difference being that the communication cost between jobs is also taken into consideration. More precisely, for a task whose communication cost is greater than the decrease in execution time, the scheduler will decide not to move the task;

• Static-Dynamic Hybrid : combines the two scheduling approaches so that static scheduling is applied to those parts that always execute and the dynamic approach is applied to all the rest. The idea is to take into account some of the unpredictable behaviour of the resources while at the same time taking advantage of static scheduling.


Dynamic load balancing usually implies rescheduling either at given intervals or when certain events occur. Paper (Fr9b) presents an event driven rescheduling policy based on the task EWT, whereas (CFMP10) details a rescheduling mechanism that occurs only when tasks arrive or finish executing.

Optimal vs. Suboptimal: these two approaches require that all information regarding the jobs and the resources be known a priori in order to make an assignment based on some objective function such as makespan or RU. However, such assumptions are hard to make due to the NP-complete nature of the problem (Bru07). Consequently, researchers try to find suboptimal solutions by using scheduling heuristics or approximate algorithms.

Distributed vs. Centralized: SAs adhering to either of these criteria rely on taking the scheduling decision on a single scheduler or on multiple schedulers. The first approach is simpler but is not scalable and has low fault tolerance. Having multiple schedulers allows sub-parts of the system to work independently even when the connection between peer schedulers fails. They will continue to function and schedule tasks in their own part of the DS. The transmission of results to resources inside the network area where connections have been interrupted will resume once the broken connections are re-established. New incoming tasks are subject to the same treatment. Distributed schedulers can be viewed as either cooperative or non-cooperative, depending on whether scheduling decisions are taken with regard to other schedulers or not. Shan et al. (SOBS04) give an example of a cooperative distributed scheduler and compare the results with centralized and local scheduling.

Application vs. Resource centric: application centric schedulers attempt to optimize the performance of individual applications, with time (i.e., makespan) or economic costs playing a major role in decision taking. Resource centric schedulers aim at optimizing RU by taking into consideration the throughput – how many tasks a node can process inside a given time frame – and the utilization, that is, the percentage of time the resource is busy. Condor (TTL03), for example, uses throughput as the criterion for taking scheduling decisions.

Best effort vs. QoS constraint (YB07): best-effort based scheduling tries to minimize the execution time while ignoring other factors such as the monetary cost of accessing resources and various user QoS satisfaction levels. By contrast, QoS constraint based scheduling attempts to optimize performance under several important QoS constraints, for instance time minimization under budget constraints or cost minimization under deadline constraints. Figure 2.2.2 (DA06) shows a list of SAs falling in this category.

Best Effort based SAs are derived from either heuristics based or meta-heuristics based approaches. The difference between scheduling heuristics and meta-heuristics is that the latter try to provide a general solution method for developing a specific heuristics that fits a particular set of problems, while the former intend to provide a SA which matches only a particular type of problem. Scheduling heuristics based algorithms include: Opportunistic Load Balancing (BSB01), Round Robin (FH04), Min-Min (MAS+99), Max-Min (MAS+99), XSuffrage (CLZB00), Minimum Execution Time (BSB01), Minimum Completion Time (Myopic) (BSB01), HEFT (ZS03), TANH (BA04), Hybrid (SZ04a) and variations of these. Scheduling meta-heuristics include: Genetic Algorithms (GRH05), Simulated Annealing (YD02) and Greedy Randomized Adaptive Search Procedure (GRASP) (BHLR01).

Figure 2.2.2: List of Best Effort and QoS algorithms

QoS constraint based SAs are built on top of different constraints such as deadline orbudget. Deadline constraint SAs include Back-tracking (YB07) and Deadline distribution(YB07). The budget constraint SAs include LOSS and GAIN (SZTD07), and GeneticAlgorithms (YB05).

User EET required vs. no user EET required: this is a new classification introduced in this thesis. The reason is that most SAs fall into one of these two categories. SAs which do not require any knowledge of task EET are generally concerned with optimizing the resource workload. Among these algorithms we find Opportunistic Load Balancing (BSB01), Round-Robin (FH04), Balance-Constrained (DA06) and Back-Filling (LS02) SAs. On the other hand, SAs which require EET are used to optimize criteria such as makespan, throughput or lateness. Most of the algorithms listed in the previous paragraphs are EET based. Their advantage is that with foreknowledge of the task runtimes the scheduler can optimize the schedule. Nonetheless, DS are dynamic by nature and consequently these estimates might vary over short periods of time. As a result they should be used together with efficient algorithms for estimating the required system/task properties. This can be achieved either by estimating from historical data (ABC+04; SFT98) or by foreseeing the future configuration of the DS and the job demand (JH08; JHY08).

Computation vs. Data Scheduling: Grid projects related to astrophysics, bioinformatics and Earth sciences usually require or produce large amounts of data, consisting of several hundreds of GBs or more. Due to the geographical distribution of both computational and storage resources, a balance between where to run the application and where to store its data needs to be achieved. Assigning the task to the fastest possible resource might not produce the expected result, as the cost overhead induced by the data transfer could be quite significant. Casanova et al. (CLZB00) give as an example Suffrage, a heuristics which provides bad results in Grids of clusters in the case of data intensive tasks.

When dealing with scenarios where large amounts of data are needed, two basic approaches exist: with (APR99; DV05; RF02) or without (DBG+03; COBW00) data replication. Data replication is intended to reduce communication bandwidth consumption and data access bottlenecks by copying the data across the DS.

Classic vs. Nature inspired: nature has also been an inspiration for scheduling meta-heuristics. Recent papers such as (RL04; LSACi07) address the problem of task scheduling by offering meta-heuristics inspired from behavioural patterns observed in ant colonies. This technique, also called Ant Colony Optimization (ACO), relies on the fact that ants inside a colony act as independent agents which try to find the best available resource inside their space by using global search techniques. Each time such an agent finds a resource better than the existing one it marks its path with pheromones. These attract other ants which in turn start using the same resource until a better one is found. The most considerable disadvantage of ACO compared with other approaches is that it only works well for static scheduling. The reason is that rescheduling requires a lot of time until an optimal scenario is reached through intensive training over multiple iterations. Because distributed environments are both unpredictable and heterogeneous, each time a change is noticed the entire system needs to be trained again, a process that could last several hours. Such a large amount of time is not acceptable when tasks are scheduled under deadline constraints. An improvement might be obtained by mixing the time consuming global search with a local search when minor changes occur inside the distributed system. However, defining the notion of minor changes is still an open issue.

Genetic Algorithms (GRH05) and Simulated Annealing (YD02) are other examples of nature inspired meta-heuristics. Braun et al. (BSB01) even show that in some situations genetic algorithms perform best. Micota et al. (ZFZ11) also present a population based meta-heuristic which outperforms classic ones for online scheduling. The authors also show that, in general, population based meta-heuristics provide, through mutations, solutions that are improved compared to their classic versions.

Single objective vs. Multi-objective: DS are normally multi-objective and hence a good trade-off needs to be obtained when objective functions are considered together (XA10). Two approaches exist in this direction: hierarchical and simultaneous. In the hierarchical approach criteria are sorted in the order of their priorities, so if γi is less important than γj, the value of the latter cannot be varied while optimizing the former. The simultaneous approach considers that any improvement in one criterion would deteriorate another. This problem is usually solved through Pareto optimization theory (EG04; Sch04) by using either the weighted sum approach or the general approach. The weighted sum approach combines all criteria into a single aggregate function which is solved by using heuristic, meta-heuristic or hybrid approaches. The general approach relies on computing the Pareto optimal front (DPAM02; EG04; Sch04). Multi-objective SAs include (DPAM02; XC07; Xha07), while single-objective SAs include (BMP+10; TTL03) – for throughput optimization, (BSB01; FH04) – for load balancing, (BSB01; CLZB00; MAS+99) – for ECT based makespan, (KMSN07) – for TPCC based makespan and (CKKG99) – for lateness.

Grid based vs. Cloud based : with the emergence of Clouds a new kind of SA designed for them was required. Clouds require special SAs as users generally need to pay when accessing resources. Therefore SAs need to effectively optimize the allocation of resources and make sure they remain balanced. Otherwise users could end up paying for resources they do not require. Another issue which emerged from Clouds is that of energy efficient resource utilization, an aspect SAs are beginning to take into consideration as an optimality criterion. Since most of the SAs described in this section have been designed for Grids, in what follows we restrict ourselves exclusively to SAs designed for Clouds.

Banerjee et al. (BMM09) propose an ant based approach for initiating the service load distribution inside clouds. Simulation results on Google Application Engine (Cor09a) and Microsoft Live Mesh (Cor09b) have shown a slight improvement in the throughput of cloud services when using the proposed modified ACO algorithm.

High Performance Computing (HPC) task scheduling inside clouds is an aspect tackled in (GYAB09). Energy consumption is important both with regard to user costs and in relation to carbon emissions. The proposed meta-scheduler takes into consideration factors such as energy costs, carbon emission rate, CPU efficiency and resource workload when selecting an appropriate data center belonging to a cloud provider. The designed energy based scheduling heuristics shows a significant increase in energy savings compared with other policies.

Work targeting multi-objective online workflow scheduling for cloud systems has also received attention. For instance, Wei et al. (WVZX10) present a game theoretic approach for multi-criteria scheduling, while Xu et al. (XCWB09) focus on online scheduling and propose a bi-criteria – execution time and cost – algorithm for workflow scheduling.

Independent vs. Dependent Tasks: one problem when dealing with job schedulingin DS concerns the level at which we handle the problem. These levels can be found if weexamine the composition of the requests. Requests usually comprise jobs which in turn aremade up of tasks. Tasks can be further divided into operations but in this thesis we preferto view tasks as the atoms in DS scheduling and the operations as the building blocks ofthe actual processing application. When dealing with task scheduling an important role isplayed by task dependencies. There are cases in which a certain order is required in orderto execute tasks. When this happens we generally deal with a workflow. In such situationsthe scheduling becomes more difficult as additional criteria such as precedence needs to beconsidered. In the case of precedence conditions a task cannot start until all its parents havecompleted their executions.

Based on the scheduling techniques, SAs can be divided into (DA06): independent scheduling, list scheduling (e.g., batch mode, dependency mode and mixed mode) and cluster based (cf. Fig. 2.2.3).

Figure 2.2.3: List of algorithms for independent and dependent tasks (DA06)

• Independent scheduling : the algorithms belonging to this category make their decisionsbased only on individual tasks. Among the most well known examples we notice:Opportunistic Load Balancing (BSB01) and Round-Robin (FH04);

• List scheduling : assigns priorities to tasks and schedules them based on these priorities. There are two major phases in list scheduling heuristics: the task prioritizing phase and the resource selection phase. Three categories of SAs fall under this classification: batch mode algorithms, dependency mode algorithms and mixed mode algorithms.

Batch mode SAs were initially designed for scheduling parallel independent tasks but can also be adapted for workflow oriented SAs. They group workflow tasks into sets of independent tasks and consider only the tasks in the current set. The algorithms which fall in this category include Min-Min (MAS+99), Max-Min (MAS+99) and XSuffrage (CLZB00);

Dependency mode SAs are derived from algorithms dealing with graphs for interdepen-dent tasks on DS. They provide strategies to schedule workflow tasks on heterogeneousresources based on the analysis of the entire task graph. The approach is to prioritizethe tasks by ranking them and to start the execution in the order given by the rankvalues. SAs in this category base their scheduling heuristics on the weights of tasknodes and edges of the workflow graph. In a homogeneous environment such as acluster, the weights are set as follows: ECT for the nodes and TC for edges, as theyare the same for all computational resources. The situation changes however whenwe deal with heterogeneous environments. In this case an approximation approach toweight tasks and edges is taken to compute the rank value. This approximation isbased on either the average, median, maximum or minimum ECT and TC over all


computational resources (ZS03). Shi et al. (SD06) describe a different approach inwhich higher weights are assigned to tasks with less capable computing resources. Thisis due to the fact that tasks falling in this category may cause longer delays if they arenot scheduled first. One of the most well known SAs for dependency mode schedulingis the Heterogeneous Earliest Finish Time (HEFT) algorithm (ZS03; SD06).

Mixed algorithms combine batch with dependency algorithms. Sakellariou and Zhao(SZ04a) propose an algorithm in which tasks are first ranked and sorted by rank indescending order, followed by their grouping into sets of independent tasks. Whengrouping takes place a task having a dependence with another one in the same groupis automatically placed inside a new group.

• Cluster and duplication based scheduling : is designed to avoid the TC inflicted bythe transfer of results between interdependent tasks. As a result the overall ECTis reduced and tasks with heavy intercommunication are clustered together. As thename suggests it consists of two parts: clustering and duplication. The former specifiesthat tasks are grouped and assigned to the same cluster (or resource), while the latterrefers to the action of duplicating some parent tasks on other resources. A knownSA following this approach is the Task Duplication Based Scheduling Algorithm forNetwork of Heterogeneous Systems (TANH) (BA04).

Independent tasks are tasks on which we cannot apply any precedence rules. In thisscenario the goal of the SA is to assign tasks to computing resources so that the cost functionis minimized. Several SAs can be applied in this case. Here are some of these:

• Opportunistic Load Balancing (OLB) (BSB01): is a simple load balancing SA, requiring no EET information, which assigns each task to the next available machine in arbitrary order. Due to this simple technique it achieves a good resource workload balance, but it might lead to a high makespan as it considers neither processing speed nor task EET.

• Minimum Execution Time (MET) (BSB01): the goal of this SA is to assign each task to the best existing computing resource (cf. Relation 2.2.1), regardless of whether this resource is available at the present time or not. It is an enhanced version of OLB. Its main disadvantage is that it can cause large load imbalances across the resources.

$$MinEET^{R_{ik}}_i = \min(EET^{R_1}_i, EET^{R_2}_i, \ldots, EET^{R_m}_i) \quad (2.2.1)$$

• Minimum Completion Time (MCT) (BSB01): assigns each task to the computing resource with the minimum expected completion time $MinECT^{R_{ik}}_i$ (cf. Relation 2.2.2). One of its goals is to combine MET with OLB.

$$MinECT^{R_{ik}}_i = \min(ECT^{R_1}_i, ECT^{R_2}_i, \ldots, ECT^{R_m}_i) \quad (2.2.2)$$


• Switching Algorithm (MAS+99): combines MET with MCT in an attempt to get thebest out of both. It makes use of the compactness c (cf. Section 2.1.2) across allresources and assigns two threshold values cmin < cmax ∈ [0, 1). Starting with c = 0it begins assigning tasks using the MCT scheduling heuristics. Once cmin has beenreached it switches to MET until the compactness reaches cmax at which moment itswitches back to MCT . The cycle continues until there are no more tasks to map.Other switching algorithms include the work of Etminani (EN07) and more recentlyChauman (CJ10). Both of these rely on a switching method between the Min-Min andMax-Min heuristics (MAS+99).

• K-Percent Best (KPB) (MAS+99): considers only a subset of $\frac{km}{100}$ resources ($\frac{100}{m} \le k \le 100$) when trying to map tasks. The scheduling heuristics assigns the task to the resource with the best $MinECT^{R_{ik}}_i$ (cf. Relation 2.2.2) within this subset of superior resources. Its aim is not only to assign a task to a better resource but also to avoid assigning it to a resource which might be best for a later arriving task.

• Round Robin (RR) (FH04): is a simple SA based on the Round Robin algorithm for scheduling tasks inside a CPU (Aas05). It schedules tasks by using a ring of tasks and assigning them to the next available resource. Fujimoto and Hagihara (FH04) propose a dynamic version in which each task is mapped on a resource until no resources are available. As soon as a task is completed the next one in the ring of tasks is assigned to that resource. The process continues until there are no more tasks to schedule. At that moment a number of k tasks remain unfinished. Each of those tasks is replicated on the next available resource by placing them in a ring of tasks: once a task finishes on $R_i$, all its running clones are removed from the resources they are running on and the next available task in the ring is started in its place on $R_i$. A task replication based SA, relying on TPCC, is also presented in (KMSN07).

• Min-Min (MAS+99): is a SA which schedules tasks in an iterative way. For each iterative step, it computes the ECT of each task Ti on every available resource Rj ∈ R′ ⊂ R and obtains the minimum MinECT_{T_i}^{R_{ik}} (cf. Relation 2.2.2). A task Tk having the minimum ECT_k^{R_k} (cf. Relation 2.2.3) value is chosen to be scheduled first at this iteration. The task is assigned to the resource Rk which is expected to complete it at the earliest time.

ECT_k^{R_k} = min(MinECT_{T_1}^{R_{1k}}, MinECT_{T_2}^{R_{2k}}, ..., MinECT_{T_n}^{R_{nk}})    (2.2.3)

• Max-Min (MAS+99): is the same as the Min-Min algorithm with the difference that it schedules the task Tk with the greatest ECT_k^{R_k} (cf. Relation 2.2.4) on resource Rk.

ECT_k^{R_k} = max(MinECT_{T_1}^{R_{1k}}, MinECT_{T_2}^{R_{2k}}, ..., MinECT_{T_n}^{R_{nk}})    (2.2.4)
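The two heuristics can be sketched together over a toy ECT matrix; `ect[task][resource]` holds the estimated completion time, and the naive load update after each mapping is an assumption for illustration, not the thesis model:

```python
def min_min(ect, maximum=False):
    """Iteratively pick the task whose best (minimum-ECT) resource is the
    overall minimum (Min-Min) or maximum (Max-Min) and map it there.

    `ect` is a toy dict of dicts: ect[task][resource] -> estimated
    completion time. ECTs of the remaining tasks on the chosen resource
    grow by the scheduled task's ECT (a simplistic load update)."""
    ect = {t: dict(rs) for t, rs in ect.items()}  # work on a copy
    schedule = {}
    select = max if maximum else min
    while ect:
        # best resource (MinECT) for every still-unscheduled task
        best = {t: min(rs, key=rs.get) for t, rs in ect.items()}
        # Min-Min: task with smallest MinECT; Max-Min: with the largest
        task = select(best, key=lambda t: ect[t][best[t]])
        resource = best[task]
        cost = ect[task][resource]
        schedule[task] = resource
        del ect[task]
        for rs in ect.values():  # naive load update on that resource
            rs[resource] += cost
    return schedule
```

Calling `min_min(ect, maximum=True)` yields Max-Min, matching the single-line difference between Relations 2.2.3 and 2.2.4.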


• Duplex (AHK98; BSB01): combines the Min-Min and Max-Min scheduling heuristics by running both and choosing the schedule with the best improvement of the optimality criterion γ. It is useful in scenarios where the two perform roughly the same.

• (X)Suffrage (CLZB00): this SA is based on the scheduling heuristics called Suffrage, which takes into consideration the suffrage value, i.e., the difference between the best and second best MinECT. The rationale behind Suffrage is that a task should be assigned to a certain host because, if it does not go to that host, it will suffer the most. Tasks with high suffrage values take precedence. However, this scheduling heuristics suffers when clusters (of almost homogeneous nodes) are taken into consideration. In this case tasks are likely to be assigned low suffrage values and thus low priorities. As a result an improvement – XSuffrage – has been introduced by Casanova et al. (CLZB00), where the authors propose a cluster level suffrage value for each task with the hope that files present in a cluster can be maximally reused. Experiments have shown that the new scheduling heuristics gives better results even if resource information cannot be predicted very accurately.
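A minimal sketch of the suffrage value and the resulting ordering; `ect` is a hypothetical per-task dict of completion times, and the load update is omitted for brevity:

```python
def suffrage_value(task_ect):
    """Suffrage of a task: second best minus best estimated completion
    time, over `task_ect` (a toy dict: resource -> ECT of the task)."""
    best, second = sorted(task_ect.values())[:2]
    return second - best

def suffrage_schedule(ect):
    """Repeatedly map the unscheduled task with the highest suffrage to
    its best resource (no load update here, for brevity)."""
    schedule, remaining = {}, dict(ect)
    while remaining:
        task = max(remaining, key=lambda t: suffrage_value(remaining[t]))
        schedule[task] = min(remaining[task], key=remaining[task].get)
        del remaining[task]
    return schedule
```

On near-homogeneous resources all second-best ECTs approach the best ones, which is exactly the low-suffrage degeneration that motivated XSuffrage.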

Min-Min, Max-Min and Suffrage can be augmented by taking into consideration aging (MAS+99). This is done to avoid starvation (i.e., to prevent a task from being reassigned to a new resource several successive times without actually beginning its execution). The ageing is implemented by using an ageing factor:

ξ = 1 + age/σ    (2.2.5)

where age is incremented each time a task is remapped and σ is a constant set to 10 in the tests performed by Maheswaran et al. (MAS+99). Once ξ is computed for each Ti mapped on Rj, ECT_i^j is multiplied by a factor of 1/ξ. Thus the value of the ECT is diminished and the chance for a new task to be mapped on a resource before an older one is minimized.

Dependent tasks (YB07) are tasks that have a parent-child relationship which can be modelled using DAGs or Petri Nets (Jen94; vdA98). Usually these scenarios are handled by either dependency-mode or mixed SAs.

In what follows we consider task dependency modelled as DAGs. Therefore we define a graph G = (V, A) in which the set V = T contains the list of nodes and (i, j) ∈ A is a transition from task Ti to Tj if Ti must be executed before Tj. This is similar to the approach in the case of dependent jobs described in Section 2.1.1.

• HEFT (ZS03): is one of the most well known SAs for dependent tasks. The algorithm first computes the average ECT for each task and the average TC for two consecutive tasks. The tasks are then ordered based on a rank function as described in what follows. First, to each task Ti ∈ V and transition (i, j) ∈ A a weight wi, respectively Wij, is assigned. These weights are computed based on either the average, median, maximum or minimum EET and TC (ZS03). Tests (ZS03) conducted on different approximations of these weights have proven that their performance varies depending on the application and no universal approximation can be used. For example the average weight of the EET of a task Ti over all resources can be computed as in Relation 2.2.6:

w_i = (∑_{R_j ∈ R′} EET_i^j) / |R′|    (2.2.6)

Similarly the average weight W_{jk} of the TC between two dependent tasks Tj and Tk is computed as in Relation 2.2.7:

W_{jk} = (∑_{R_j ∈ R′, R_k ∈ R′′} TC_{jk}) / (|R′| · |R′′|)    (2.2.7)

where R′ and R′′ are the sets of resources available for Tj, respectively Tk.

Finally the set of ordered tasks Ti1, Ti2, ..., Tin is obtained by computing the rank of each task Ti ∈ T and ordering the tasks descending by their ranks as follows. For the exit task Tk (i.e., ∄(k, i) ∈ A, ∀i = 1, n, i ≠ k) the rank is computed as rank(Tk) = wk. The rest of the tasks' ranks are computed as described in Relation 2.2.8:

rank(T_j) = w_j + max_{(j,k) ∈ A} (W_{jk} + rank(T_k))    (2.2.8)
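The upward-rank recursion can be sketched as follows, assuming the node and edge weights have already been averaged; `w`, `W` and `succ` are toy inputs (task weights, edge weights, successor lists), not thesis code:

```python
from functools import lru_cache

def heft_order(w, W, succ):
    """Order tasks by decreasing upward rank:
    rank(j) = w[j] + max over successors k of (W[(j, k)] + rank(k)),
    with rank(exit) = w[exit] for tasks without successors."""
    @lru_cache(maxsize=None)
    def rank(j):
        followers = succ.get(j, [])
        if not followers:
            return w[j]  # exit task
        return w[j] + max(W[(j, k)] + rank(k) for k in followers)
    return sorted(w, key=rank, reverse=True)
```

The memoization makes the recursion linear in the number of edges, which matters for large DAGs.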

N’Takpe et al. (TS07) present three improvements of HEFT – called M-HEFT-* – for the case of scheduling mixed-parallel applications. The classic M-HEFT computes average task EETs over all 1-processor allocations and average TC between all possible 1-processor allocations. Tasks are then scheduled on the set of processors which minimizes the ECT the most by taking into account the costs of data communication and redistribution. The three improved versions described in (TS07) offer distinct conditions for increasing the allocation of a task by one processor. In the case of M-HEFT-IMP the processor allocation is increased only if the task's EET improvement is beyond a given threshold, M-HEFT-EFF targets the task's parallel efficiency and M-HEFT-MAX takes into account the percentage of cluster usage by a single task.

Paper (SZ04c) proposes a low cost rescheduling technique which starts with a DAG and a schedule provided by any task dependent SA. During runtime, prior to each node's execution, it takes rescheduling decisions by evaluating the node against its delay (cf. Relation 2.2.9) and slack (cf. Relation 2.2.10):

delay(i) = EST_i^j − RST_i^j    (2.2.9)

where the terms have been defined in Section 2.1.2.


slack(i) = min_{j:(i,j) ∈ A} (slack(j) + spare(i, j))    (2.2.10)

where slack(k) = 0 for the exit task Tk, defined as in the previous paragraph, and spare(i, j) is defined as in Relation 2.2.11:

spare(i, j) = min(spare_DAG(i, j), spare_SameResource(i, j))    (2.2.11)

where spare_DAG(i, j) and spare_SameResource(i, j) represent the spare time between the end of task Ti and the start of the next task in the DAG, respectively on the same resource.

As an alternative to the slack, minSpare(i) = min_{j:(i,j) ∈ A} spare(i, j) could also be used.
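The slack recursion of Relation 2.2.10 can be sketched as below, assuming the spare values have already been computed; `succ` and `spare` are toy dicts (hypothetical names, not the thesis code):

```python
def compute_slack(succ, spare):
    """Slack recursion: slack(i) = min over successors j of
    (slack(j) + spare[(i, j)]), with slack = 0 for exit tasks
    (tasks without successors)."""
    memo = {}
    def slack(i):
        if i not in memo:
            followers = succ.get(i, [])
            memo[i] = 0 if not followers else min(
                slack(j) + spare[(i, j)] for j in followers)
        return memo[i]
    tasks = set(succ) | {j for js in succ.values() for j in js}
    return {i: slack(i) for i in tasks}
```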

• Hybrid (SZ04b): is a mixed SA which combines HEFT and independent task scheduling by creating groups Gi ∈ G composed of independent tasks – ∀Ti, Tj ∈ Gk, i ≠ j: ∄(i, j) ∈ A. Tasks in each group Gi will be scheduled by using an independent task SA.

• CPA (RvG01) (Critical Path and Allocation): is a SA for mixed task and data parallelism. It tries to achieve the best balance between the critical path, CP (cf. Relation 2.2.12), in the task dependency DAG and the average processor usage, APU (cf. Relation 2.2.13), by using a two phase algorithm consisting of (1) determining the number of processors for each task Ti ∈ V and (2) scheduling the tasks on the processors found at the previous step.

CP = max_{T_i ∈ V} {T_b(i)}    (2.2.12)

APU = (1/P) ∑_{T_i ∈ V} (EET_i^{N_p(i)} × N_p(i))    (2.2.13)

T_b(i) is defined as the longest path including task Ti to any exit task, P represents the total number of available processors, N_p(i) represents the number of processors allocated to Ti and EET_i^{N_p(i)} was defined in Relation 2.1.6. T_b(i) depends on network characteristics, the amount of transferred data and the processors allocated to each task (node) in the path. It can be decreased by assigning more processors to the tasks belonging to it.
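The APU of Relation 2.2.13 is a simple weighted sum; a one-line sketch with toy dicts (`eet[t]` for EET under the current allocation, `np_alloc[t]` for N_p; the names are assumptions, not thesis code):

```python
def average_processor_usage(eet, np_alloc, total_processors):
    """APU from Relation 2.2.13: (1/P) * sum over tasks of
    EET_i^{Np(i)} * Np(i)."""
    return sum(eet[t] * np_alloc[t] for t in eet) / total_processors
```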

An improvement – called HCPA – for heterogeneous environments is given in paper (TS07). The reason for improving this heuristics is that in such environments processors have different speeds and allocations can be somewhat difficult. To overcome this problem an equivalent reference homogeneous cluster is introduced. The speed of the processors forming this cluster is set to be equal to the speed of the slowest processor in the original heterogeneous platform. Also the number of these processors is higher than the one in the original platform. Allocations are done using these reference processors and afterwards a translation to allocations on the original clusters is made. This translation is based on an application of Amdahl's law which states that given an allocation on a reference cluster we can determine the equivalent allocation on any other cluster. Furthermore this new allocation leads to the same task EET. Once all its allocations are translated the task is scheduled on the cluster which minimizes its ECT.

• TANH (BA04): is another algorithm falling in this category. It clusters tasks based on several node parameters such as: earliest start and completion time; latest start and completion time; critical immediate parent task; and best resource. It then makes a decision based on the number of created clusters. If this number is greater than the number of computing resources then it is scaled down to the number of resources; otherwise it uses the idle time to clone some tasks in order to minimize the overall execution time.

The previously listed SAs were heuristics based. Another method of finding solutions to scheduling problems is to use scheduling meta-heuristics. These are usually applied to problems requiring large amounts of data and can be used both for independent and dependent task scheduling problems. Among these methods we note the local (neighbourhood) search techniques (Bru07). They provide solutions which are not guaranteed to be optimal, but instead the deviation from the optimal objective can be bounded if a lower (or upper) limit is set. Simulated annealing and Tabu search algorithms go in this category. In what follows we present these two scheduling techniques in greater detail.

Local search techniques are useful for solving discrete optimization problems (Bru07), which in turn can be described as follows: given a finite set S and a cost function c we have to find a solution s∗ ∈ S such that c(s∗) ≤ c(s), ∀s ∈ S. The cost function c can be viewed for example as being the same as gmax in Relation 2.1.1, with the objective of minimizing Cmax. Basically this technique iterates – under some restrictions – through several solutions until a condition is met. In order to describe the restrictions for moving from one solution to the next we need to define (Bru07) a neighbourhood structure as N : S → 2^S such that ∀s ∈ S we have N(s) – called the neighbourhood of s – describing the solutions reachable in the next step by moving from s. Having these defined we can now describe the local search method. At every iteration we start with a solution s ∈ S, choose s′ ∈ N(s) and calculate c(s) and c(s′). Based on the computed values we decide the starting solution of the next iteration. Depending on the criteria used for selecting the starting solution we obtain different methods of local search. The simplest such method is the iterative improvement algorithm, which simply chooses as the next solution the s′ ∈ N(s) with the smallest value of c(s′).
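The iterative improvement step described above can be sketched as follows; `cost` plays the role of c and `neighbours` of N(s), both hypothetical callables:

```python
def iterative_improvement(s, cost, neighbours):
    """Move repeatedly to the cheapest neighbour; stop when no
    neighbour improves the cost (a local optimum)."""
    while True:
        candidates = neighbours(s)
        if not candidates:
            return s
        best = min(candidates, key=cost)
        if cost(best) >= cost(s):
            return s  # no neighbour improves the cost: local optimum
        s = best
```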

Another method is simulated annealing (LA87). The main difference from the previous method is that it chooses s′ ∈ N(s) randomly and accepts it with probability min{1, e^{−(c(s′)−c(s))/c_i}}, where c_{i+1} = g(c_i), with g a predefined function, c_0 set initially to a high value and lim_{i→∞} c_i = 0. The acceptance condition states that the new solution is accepted unconditionally if it is better than the old one, and with some probability otherwise. The algorithm continues until a certain stop criterion is reached. Paper (LA87) describes the technique in greater detail. As an example, the stopping criterion can be chosen to be either a given amount of time or a small value for c_i such that a certain equilibrium is reached. Simulated annealing is based on the Metropolis Monte Carlo algorithm (MRR+53).
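A sketch of this acceptance rule; the geometric cooling schedule g(c) = 0.95·c and the fixed step budget are assumptions for illustration (the text does not fix g or the stop criterion), and `neighbour(s, rng)` is a hypothetical callable:

```python
import math
import random

def simulated_annealing(s, cost, neighbour, c0=10.0, cooling=0.95,
                        steps=1000, rng=random):
    """A random neighbour s' is accepted with probability
    min(1, exp(-(c(s') - c(s)) / c_i)); the control parameter follows
    the assumed geometric schedule c_{i+1} = cooling * c_i -> 0."""
    best, temp = s, c0
    for _ in range(steps):
        candidate = neighbour(s, rng)
        delta = cost(candidate) - cost(s)
        # better solutions accepted unconditionally, worse ones with
        # probability exp(-delta / c_i)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            s = candidate
        if cost(s) < cost(best):
            best = s
        temp *= cooling
    return best
```

As the temperature vanishes the method degenerates into pure iterative improvement, which is why it escapes local optima only early in the run.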

Variations of the simulated annealing method include the threshold acceptance method and tabu search (Bru07). The former differs only in the acceptance condition – it only accepts a solution s′ ∈ N(s) if c(s′) − c(s) < t, where t is a positive threshold which is gradually decreased. The latter has the advantage of eliminating the risk of revisiting already considered solutions. It does this by keeping a tabu list containing attributes which define the set of solutions. This list contains attributes corresponding to each recently visited solution and all selections of solutions characterized by these attributes are denied.

Genetic algorithms are generalizations of the previous local search methods. They combine the exploration of new solutions in the solution space with the knowledge of the previously obtained best solutions by maintaining a population of individuals which evolve over generations. A fitness function is used to determine the quality of each individual in the population. An example of such a function is given in (HAR94): fitness(I) = Cmax − CT(I), where CT(I) represents the Completion Time (CT) of individual I. Genetic algorithms have been developed for homogeneous and dedicated systems, belonging to classes PMPM||Cmax, P||Cmax and O||Cmax (WYJL04; ZWM99), as well as for heterogeneous environments (WSRM97) (class R||Cmax).

The Greedy Randomized Adaptive Search Procedure (GRASP) (BHLR01; FR95) is another method based on an iterative greedy randomized local search. At each step the best solution is kept and the iteration cycle stops when a stop criterion is satisfied. Each iteration has two parts: construction of the solution and local search. In the construction phase the solution is built based on several criteria (e.g., ∀Ti, Tj ∈ S ⇒ i ≠ j) and a Restricted Candidate List (RCL) containing resources is kept for each task. As soon as a solution is constructed, a local search is applied in order to improve it based on an optimality criterion (e.g., Cmax). Each time an improved solution is found it replaces the previously existing one.
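A skeleton of the GRASP iteration cycle; `construct` and `local_search` are hypothetical callables standing in for the two phases, and the fixed iteration count is an assumed stop criterion:

```python
import random

def grasp(construct, local_search, cost, iterations=10, rng=random):
    """Every iteration (1) builds a greedy randomized solution and
    (2) improves it by local search; the overall best is kept."""
    best = None
    for _ in range(iterations):
        s = local_search(construct(rng))
        if best is None or cost(s) < cost(best):
            best = s
    return best
```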

Another method for scheduling jobs (tasks) is known as branch and bound (Bru07). Assuming that the problem to be solved, P, is a minimization problem, we consider sub-problems of it defined by subsets S′ ⊂ S of the solutions to problem P. The principle behind the branch and bound method consists of two essential parts: branching and lower/upper bounding. In the branching process the set S is replaced by sub-problems S_1, ..., S_k such that ⋃_{i=1}^{k} S_i = S, and a branching tree is created so that the root is represented by P and the nodes are sub-problems. The bounding stage is made up of two parts. First the upper bound UB of the objective value of problem P is computed. This value is usually computed by using a scheduling heuristics which provides a good small value for it. Then the lower bounds LB_i for the objective values of all solutions of a sub-problem are computed by using a given algorithm. Only the sub-problems yielding a lower bound smaller than the upper bound are taken into consideration as possible solutions. Once a sub-problem with a single such solution has been found, the upper bound UB is set to the value of its corresponding LB_i and that solution is selected as the current best. Paper (Bru07) presents this more closely and describes a branch and bound algorithm for the NP-hard problem F2||∑Ci.

2.2.2 Resource Management Systems Overview

Inside a DS the SA plays an important role in deciding where to schedule incoming or already submitted tasks. However its influence is restricted to taking decisions and it cannot actually apply them. In order to apply the scheduling decisions taken by a SA, a RMS is required. The RMS provides a set of services which vary depending on the system but usually involve taking tasks and physically assigning them to resources based on the logical assignment done by the SA. Inside a DS we can generally view the SA as the legislative entity and the RMS as the executive part.

Grid based vs. Cloud based: Most of the work concerning RMSs has evolved around the assumption of applying them onto Grids and not Clouds. This can be explained by two facts. The first one is that there are many similarities between a Cloud and a Grid, and a RMS developed for one type could also work well on the other. The second one is related to age: as Grids emerged earlier than Clouds, most of the solutions have been developed for the former. Nonetheless several of the Grid oriented RMSs could be adapted to work for Clouds too. Examples of RMSs for Grid scheduling include (KBM02): Condor (TTL03; TTL05), Darwin (CFK98), Globus (Fos05), MSHN (HKSJ+99), Netsolve (CD97; CD98), Nimrod/G (BAG00) and Legion (CKKG99). Cloud RMSs are still in their infancy and include CloudScheduler (TUoV10) and OpenNebula (SMLF09).

Condor is a computing environment for high-throughput applications which harnesses idle time on managed resources and has capabilities for their sharing. It uses resource management services for sequential and parallel applications and allows jobs to preserve their originating resource environment on the execution resource. A collector is responsible for information storage and listens for service advertisements. The resources are advertised by a resource agent which periodically informs the collector on the available services. A matchmaker agent is responsible for determining which resource advertised by the collector matches the desired job. It is also responsible for performing the scheduling of jobs inside a Condor pool (a set of machines following a flat organization). Through a mechanism called flocking, jobs can be distributed across different pools. Generally it can be said that Condor uses a centralized scheduler and a centralized resource discovery.

Darwin is a RMS for network services. Despite being oriented towards scheduling computations on network based resources it also offers a mechanism for non-network nodes. QoS is supported by default as Darwin runs in routers and can control bandwidth at network level. The core component is the request broker called Xena, which is responsible for the global resource allocation. While the resource organization is flat, the scheduling policy is hierarchical with a non-predictive state estimation and uses a hierarchical fair service curve (H-FSC) scheduling algorithm.

The Globus toolkit is a middleware which allows viewing distributed resources as a single machine. It is constructed as a multi-layered architecture in which high level services can be developed using core level services. Through the Metacomputing Directory Service (MDS), resource brokers or schedulers gain access to registered resources. It also offers WS-GRAM (Web Service - Grid Resource Allocation and Management) which provides an interface for requesting and controlling remote resources for job execution. WS-GRAM is however not a scheduling system, but allows for integration with third party resource brokers such as Condor/G, Nimrod/G, NetSolve or AppLeS.

MSHN, or Management System for Heterogeneous Networks, is a research project aimed at developing a RMS for distributed heterogeneous environments. As a result the resource manager is designed to work in a Grid environment where each machine has its own operating system. It also focuses on applications which can adapt to various resource characteristics. As a consequence there can be many application versions, each supporting a particular resource configuration. A scheduling advisor component provides a centralized event driven rescheduling service which uses a predictive scheduling heuristics for state estimation.

Netsolve, the Network Enabled Computational Kernel, is a client-agent-server solution designed to solve computational problems in DSs. It allows resource discovery and brokerage as well as best resource selection by making use of agents. In order to ensure a relatively good performance a load balancing policy is used. Furthermore a retry mechanism is used for fault tolerance during job execution.

Nimrod/G is a Grid resource broker which allows handling of task farming on CGs. More precisely it allows parameter sweep applications to be tested using DSs. It relies on other middleware such as Globus or Legion for resource discovery, and through the components of GRACE (Grid Architecture for Computational Economy) it enables task scheduling or bidding actions. GRACE also allows for QoS through computational economy services and resource reservation. Load balancing is achieved through periodic task rescheduling.

Legion is a Grid operating system which provides a software infrastructure that permits the components forming a DS to easily interact with each other. Its RMS is hierarchical and relies on decentralized scheduling policies. By default it uses system oriented schedulers, but through the use of brokers they can be extended to cover user oriented cases involving deadline based scheduling and computational economy.

OpenNebula provides the functionality needed to deploy, monitor and control VMs on a pool of distributed physical resources. The backend scheduling is governed by Haizea (SKF08), which uses renegotiable lease agreements as a fundamental resource provisioning abstraction. The leases are implemented as virtual machines, and the overhead of using virtual machines is taken into account when scheduling them.

CloudScheduler allows users to set up a Virtual Machine (VM) and submit jobs to a Condor pool. The cloud scheduler then looks into the job queue and creates the required VM. The VM will be replicated on machines and used as a container for executing the jobs. The aim is to support clouds such as Nimbus, OpenNebula, Eucalyptus or EC2 at the backend.


Classic vs. Agent based approaches: DSs are characterized by unpredictable changes in resource and network workloads. Yet they are rigid and inflexible in what concerns interoperability and interactivity. One solution to these problems could be agent based approaches. Agents are characterized by flexibility, agility and autonomy and, as depicted in (FJK04), they can act as the brain for scheduling tasks inside the Grid infrastructure, whereas the latter can be seen as the brawn of the whole system.

Agent based scheduling usually implies a (semi)decentralized environment in which agents view the environment from their individual point of view and try to minimize a local objective function by communicating with each other. At the same time a global objective needs to be met. This implies a lot of communication overhead (WARW04) which does not occur in centralized planning systems. Centralized systems only work well if the underlying environment remains relatively stable. Decentralized systems on the other hand have at first glance the disadvantage of permanently requiring to communicate changes in each component to the other partners – a behaviour which could lead to system crashes. For agents to take scheduling decisions they must be able to quickly adapt to DS changes or failures and communicate with others in order to relocate a task.

Despite the existence of numerous classic RMSs, including some of the ones previously mentioned, there is little work on agent based approaches. Examples include: ARMS (CJS+02), Nimrod/G (ABG02), AppLeS (BWC+03; COBW00) and TRACE (FW01).

The ARMS (CJS+02) system represents an example of an agent based RMS. It uses PACE (CKPN00) for application performance predictions which are later used as inputs to the scheduling mechanism.

Nimrod/G also uses agents to handle the setup of the running environment, the transport of the task to the site, its execution and the return of its results to the client. Agents can also record information acquired during task execution such as CPU time, memory consumption etc.

AppLeS (Application-Level Scheduling) is an example of a methodology for adaptive scheduling also relying on agents. Applications using AppLeS share a common architecture and are scheduled adaptively by a customized scheduling agent. The agent follows several well established steps in order to obtain a schedule for an application: resource discovery, resource selection, schedule selection, application execution and schedule adaptation. While the schedulers are centralized, the actual execution is decentralized and is taken care of by the local resource schedulers, in a way similar to that of Nimrod/G. Scheduling is done by using an online predictive scheduling heuristics estimation model. The changes in resource performance are monitored through the Network Weather Service (NWS), and agents use both static and dynamic application and resource information when selecting the set of viable resources.

TRACE uses on demand allocation of resources and agents.

Other MAS RMS include the work of (AMA09; CSJN05; SFT00; SLGW02; TZ06).

Sauer et al. (SFT00) propose a multi site agent based scheduling approach consisting of two distinct decision levels, one global and another one local. Each of these levels has a predictive and a reactive component for dealing with workload distribution and for reacting to changes in the workloads.

In Cao et al. (CSJN05) a Grid load balancing approach combining both intelligent agents and multi-agent approaches is presented. Each existing agent is responsible for handling task scheduling over multiple resources within a Grid. As in (SFT00) there also exists a hierarchy of agents which cooperate with each other in a peer to peer manner towards their common goal of finding new resources for their tasks. This hierarchy is composed of a broker, several coordinators and simple agents. By using evolutionary processes the SAs are able to cope with changes in the number of tasks or resources.

Shen et al. (SLGW02) describe a system which can automatically select from various negotiation models, protocols or strategies the best one for the current computational needs and changes in the resource environment. It does this by building on the two main issues of Grids as identified by (CJS+02): scalability and adaptability. The work carried out in (SLGW02) creates an architecture which uses several specialized agents for applications, resources, yellow pages and jobs. Job agents, for example, are responsible for handling a job from its submission until its execution and their lifespan is restricted to that interval. The framework offers several negotiation models between job and resource agents including contract net protocol, auction and game theory based strategies.

Tang and Zang (TZ06) present a service-oriented peer-to-peer MAS relying on a simplistic scheduling mechanism. The MAS proposed by Amoon et al. (AMA09) uses a single fixed scheduling agent responsible for planning tasks on resources. Tasks are then moved to and from these resources by using mobile agents.


2.3 Conclusions

In this chapter a short introduction into the problem of task scheduling inside heterogeneous DSs has been given. In this direction a taxonomy of existing SAs was given in Sect. 2.2.1. Some of the existing RMSs have also been categorized and described in Sect. 2.2.2. The mathematical model used to present the SAs, and which will be used throughout the rest of the thesis, was detailed in Sect. 2.1. The model is based on the one described in (Bru07) and was slightly modified to fit the problem studied in this thesis. Several well known terms were also introduced. The overview of the SAs was based on the work of Dong et al. (DA06) and advanced new classification criteria which were missing from the original work: User EET required vs. no user EET required, Classic vs. Nature inspired, Single objective vs. Multi-objective and Grid based vs. Cloud based. Each SA is formalized using the introduced mathematical model so that a unitary view over all algorithms can be achieved. The RMS taxonomy also brought in two new criteria for classifying the systems: Grid based vs. Cloud based and Classic vs. Agent based approaches.

3 SCHEDULING HEURISTICS FOR HETEROGENEOUS SYSTEMS

Contents
3.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Dynamic Scheduling Algorithms for Heterogeneous Environments . . . . . . . 52

3.3 MinQL Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.3.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.3.2 MinQL Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.4 DMECT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4.2 Convergence of the Algorithm . . . . . . . . . . . . . . . . . . . . . 70

3.4.3 Stability of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 73

3.4.4 Physical Movement Condition for Tasks on New Queues . . . . . . 73

3.4.5 Population based DMECT and MinQL . . . . . . . . . . . . . . . 75

3.4.6 DMECT Tests Using EET Approximations . . . . . . . . . . . . . 77

3.4.7 DMECT Tests Using Lateness . . . . . . . . . . . . . . . . . . . . 81

3.4.8 Population Based Tests for DMECT and MinQL . . . . . . . . . 82

3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

As already described in Chapter 2, scheduling tasks in DSs is not an easy task. The difficulty arises from the heterogeneous – in network, resource and task properties – and dynamic – resource and network load and availability vary in time – nature of the underlying systems. Furthermore, specific SAs are known to offer optimized solutions for particular system configurations (BSB01; CLZB00; MAS+99) and thus they could fail to provide even sub-optimal solutions for configurations that arise from the very nature of the DS itself. This chapter presents a SA designed for online task scheduling inside heterogeneous DSs. The aim is to offer a unitary scheduling heuristics which provides solutions to a wide range of system configurations. Three cost objectives have been considered when designing it: resource load, execution time and lateness. The motivation behind these choices lies in the current objectives used by various systems:



• most RMSs use load balancing algorithms. Examples include: the load balancing based on priorities used by Condor (TTL03; TTL05); the hierarchical fair service curve (H-FSC) scheduling algorithm from Darwin (CFK98), which guarantees real-time, adaptive best-effort and hierarchical link-sharing services; the periodic relocation of tasks implemented in Nimrod/G (BAG00); etc.;

• Cloud providers usually lend resources to users based on their requirements. Thus there is a symbiosis between the users, who need to accurately estimate the requirements of the VMs they wish to create in order to minimize their costs, and the provider, which needs to suitably (re)allocate the created VMs in order to minimize the additional monetary costs (by reducing the lateness).

Any SA needs to be simulated before actually being deployed on a SP. Hence this chapter starts by presenting the SP used in both simulation and actual scheduling (cf. Sect. 3.1). As most RMSs use various load balancing techniques, we present and test a SA for DSs against known scheduling heuristics (cf. Sect. 3.3). The motivation for a new policy came from the need to adapt existing load balancing algorithms without EET estimates (LS02) to online scheduling and heterogeneous systems. The algorithm is generalized in Sect. 3.4 by offering support for the Cmax and Lmax cost objectives. The SA is rigorously defined by describing the mathematical model and by studying its convergence and stability. The modelled SA is distinct from other scheduling heuristics (MAS+99) as it allows per task rescheduling without affecting the rest of the enqueued tasks. From this point of view the generalized algorithm resembles the backfilling strategy (GRC09) but adds an additional constraint on the time when rescheduling is needed. The pro-active rescheduling makes the algorithm suited for meta-scheduling inside MASs and can be easily integrated in the negotiation strategy (cf. Chapter 5).

3.1 Simulation Environment

The simulation environment consists of three types of information: task, network and resource related. It has been presented in the author's work described in (CFMP10) as part of a two-level scheduling simulator for validating scheduling results on a distributed system that binds together workflow-based Computer Algebra Systems (CAS).

Task information: simulating task and resource characteristics in a realistic manner is essential for every simulator aiming at producing relevant results for the scientific community. In order to allow maximum flexibility for users we have integrated several known probabilistic methods for generating both parallel and workstation tasks.

A wide range of task characteristics can be considered: minimum and maximum memory requirements, estimated completion and execution times, size, transfer costs, resolve times, number of symbols and methods used by the task, arrival rate of tasks, and task dependencies. Each of these characteristics is pre-set with a default value, as described in the following paragraphs.

Adaptive Scheduling for Distributed Systems 47

In the work of Feitelson (Fei10) a comprehensive list of probabilistic methods for determining task characteristics is given. These methods can be roughly divided into methods for workstation tasks (i.e., tasks arriving from end users operating desktops) and parallel tasks (i.e., tasks coming from parallel applications including MPI, PVM, OpenMP, etc.). For both task types the most important characteristics are: the arrival rate, the number of tasks generated during each arrival, the estimated execution cost and the size of the task.

For workstation tasks the arrival rate is modelled by a degree-8 extrapolation polynomial (cf. Relation 3.1.1) which provides a means of simulating task arrivals between 8:30 AM and 6:00 PM, since this is usually the interval in which regular users submit tasks. Execution times and task sizes are usually modelled by a Pareto or, less often, a lognormal distribution – the two distributions being closely related (Mit03).

arrivalRate(t) = 3.1 − 8.5t + 24.7t^2 + 130.8t^3 + 107.7t^4 − 804.2t^5 − 2038.5t^6 + 1856.8t^7 + 4618.6t^8

(3.1.1)

In the case of parallel tasks the arrival rate is described by a Weibull or lognormal distribution, while the execution time can be modelled by a hyper-exponential, a hyper-Erlang or a log-uniform probability function. Task size is modelled in the same way as for the workstation tasks.
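The generative models above can be sketched in Python. The coefficient list follows Relation 3.1.1; the Pareto and Weibull parameters (`alpha`, `x_min`, `shape`, `scale`) are illustrative placeholders, not values prescribed by (Fei10):

```python
import random

# Coefficients of the degree-8 arrival-rate polynomial (Relation 3.1.1),
# lowest degree first; t is the (normalized) time of day.
ARRIVAL_COEFFS = [3.1, -8.5, 24.7, 130.8, 107.7, -804.2, -2038.5, 1856.8, 4618.6]

def arrival_rate(t):
    """Evaluate Relation 3.1.1 with Horner's scheme."""
    rate = 0.0
    for c in reversed(ARRIVAL_COEFFS):
        rate = rate * t + c
    return rate

def workstation_exec_time(rng, alpha=1.5, x_min=1.0):
    """Draw a Pareto-distributed execution time (always >= x_min),
    as commonly used for workstation tasks; alpha/x_min are assumed values."""
    return x_min * rng.paretovariate(alpha)

def parallel_interarrival(rng, shape=0.5, scale=10.0):
    """Draw a Weibull-distributed inter-arrival time for parallel tasks;
    shape/scale are assumed values."""
    return rng.weibullvariate(scale, shape)
```

At `t = 0` the polynomial reduces to its constant term, which makes the sketch easy to sanity-check.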

The previous methods for generating task information are based on data analysis of existing workload traces and are obtained from either descriptive or generative models (Fei10). The former describe the phenomena observed in the workload (i.e., statistical summaries of existing data) while the latter try to mimic the process that generated the workload in the first place.

Once the characteristics have been set, dependencies between tasks need to be generated. The chosen model follows the samepred approach (TH02), where each new node can be linked, with a certain probability, to any of the already existing nodes.
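A minimal sketch of this samepred-style dependency generation, assuming a single linking probability `p_link` for every (existing node, new node) pair:

```python
import random

def samepred_dag(n_tasks, p_link=0.3, seed=0):
    """Generate task dependencies in the samepred style: every new node j
    may become a successor of each already existing node i with probability
    p_link. Returns the edge set {(i, j)}, meaning task i must finish
    before task j. Since edges always point from lower to higher indices,
    the result is acyclic by construction."""
    rng = random.Random(seed)
    edges = set()
    for j in range(1, n_tasks):
        for i in range(j):
            if rng.random() < p_link:
                edges.add((i, j))
    return edges
```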

Network and resource information: in a simulated environment physical resources play a major role as well. The characteristics of interest here include the CPU, the memory, the disk and the network. The work presented in (YCA04) describes how computing resources can be modelled in a natural way based on extrapolations from real traces. The network connecting the resources can be produced using various topology generation methods, including random (M.88; ZCD97), structural (TGJ+02) or power-law based (BR99) models.

The produced topology files are rendered using the previously mentioned techniques for resource generation. A full description of the format can be found in the author's work described in (FQS08b; FQS08a). The formalism is primarily used to describe resources used by the SimGrid simulator (CQ08). Besides synthetically generated topologies, real ones can also be used; some examples can be found at the Platform Description Archive (PDA) (SQF). The description formalism is XML based and allows for better:

• expressiveness - the information described with the formalism should not limit itself to only the resources simulated by the specific environment, but should also allow expressing additional data that carries useful information to end users. Among this extra data one might consider host properties such as disk, OS, memory, etc. The prop tag is extremely useful in this case. Adding heterogeneity (random tag) to clusters and implementing trace features (trace tag) might also prove helpful;

• readability - achieved through the use of XML. The files should be easy to parse by computers and also easy to read by human users. Tags like host, link, cluster and route are used to express computational and network resources;

• reusability - the platforms described with this formalism should be easy to run several times and should be independent of particular machines. For example, the formalism is intended to be used both with the SimGrid platform and with the deployment tool DIET1 (Distributed Interactive Engineering Toolbox). The final aim of the work is to propose this formalism to the Open Grid Forum as a way of representing Grid topologies;

• compactness - the data expressed using the formalism should be compact enough not to occupy too much space, while at the same time not losing its meaning. For example, when dealing with a cluster it is much more intuitive to use a specialized tag such as cluster rather than a large number of host tags. In addition, when dealing with hierarchical DS such as cluster networks or clusters of clusters, defining the links between all of the hosts could end up occupying more than 70% of the resulting platform file. To cope with clusters the route:multi tag has been introduced.

Figures 3.1.1 – with the corresponding visual linkage in Fig. 3.1.2 – and 3.1.3 show how cluster based topologies can be created and also how CPU power and network bandwidth and latency can be adjusted to mimic existing traces.

Another important aspect regarding resources is their availability. This is modelled using the probability Pr(X < e^(−t/MTBF)), where MTBF represents the Mean Time Between Failures and t represents the time elapsed since the last downtime moment.
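The availability model can be sketched as follows; the negative exponent is an assumption of this sketch, chosen so that the expression remains a valid probability that decays with the time since the last downtime:

```python
import math
import random

def availability_prob(t_since_last_down, mtbf):
    """Probability that a resource is still up t time units after its last
    downtime, under the exponential reliability model e^(-t/MTBF)."""
    return math.exp(-t_since_last_down / mtbf)

def is_available(t_since_last_down, mtbf, rng):
    """Sample availability: draw X ~ U(0,1) and test X < e^(-t/MTBF)."""
    return rng.random() < availability_prob(t_since_last_down, mtbf)
```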

Scheduling simulator: one of the most important components of a simulation system is the SP. Ideally this component should be used, unmodified, in both simulated and real-life scenarios. Its purpose is to apply the SA on the existing tasks and to obtain a schedule, which can then be used by other system components to execute the tasks.

Consequently, experiments can be run using various testing and simulation platforms, including real platforms, simulators and emulators. Real platforms offer the advantage of raw power but lack several aspects essential to running experiments, such as the ability to reproduce environmental conditions and experiment set-ups. Examples include: Grid'5000 (Con10), a French initiative linking 9 sites into a national grid; DAS-3 (DAS10), the Dutch grid linking 5 sites; the Worldwide LHC Computing Grid (LCG10), which offers data storage and analysis infrastructure for CERN in more than 170 centers; GridPP (Gri10),

1http://graal.ens-lyon.fr/~diet/


<platform version="2">
  <cluster id="science_cluster"
           prefix="science" suffix=".uvt.ro"
           radical="1-7" power="2700000000"
           bw="125000000" lat="5E-5"
           bb_bw="250000000" bb_lat="5E-4"/>
  <cluster id="mediogrid_cluster"
           prefix="mediogrid" suffix=".uvt.ro"
           radical="1-4" power="2000000000"
           bw="125000000" lat="5E-5"
           bb_bw="250000000" bb_lat="5E-4"/>
  <link id="backbone" bandwidth="1250000000" latency="5E-4"/>
  <route:multi src="science_cluster" dst="mediogrid_cluster">
    <link:ctn id="backbone"/>
    <link:ctn id="$dst"/>
  </route:multi>
  <route:multi src="mediogrid_cluster" dst="science_cluster">
    <link:ctn id="backbone"/>
    <link:ctn id="$dst"/>
  </route:multi>
</platform>

Figure 3.1.1: Example of a simple two cluster interconnection

[Figure: nodes SCIEnce 1–7 and MedioGrid 1–4, each cluster with its own intra-cluster backbone, connected through an inter-cluster backbone]

Figure 3.1.2: Graphical example of a simple two cluster interconnection

which represents a grid network in the UK; and PlanetLab (Pla), which comprises 500 globally-distributed virtualized nodes.

Emulators allow applications to execute by intercepting major system calls; the performance can thus be adjusted to mimic any kind of platform.


<platform version="2">
  <host id="nanosim1" power="2.0E9">
    <prop key="memory" value="2E9"/>
    <prop key="disk" value="80E9"/>
    <prop key="OS" value="Linux Gentoo"/>
  </host>
  <random id="myRandGenerator" generator="DRAND48"
          seed="0" min="1E9" max="2E9" mean="1.6E9"
          std_deviation="1E6"/>
  <host id="nanosim2" power="$rand(myRandGenerator)"/>
  <link id="link1" bandwidth="1.25E8" latency="5E-5"/>
  <trace id="myTrace" file="link1.load"/>
  <trace:connect element="link1" kind="bandwidth" trace="myTrace"/>
  <route src="nanosim1" dst="nanosim2" symmetric="yes">
    <link:ctn id="link1"/>
  </route>
</platform>

Figure 3.1.3: Example showing how traces can be added to CPU and network bandwidth

For instance, MicroGrid (XDCC04) allows users to execute applications written for the Globus toolkit (Fos05). Other examples include eWAN (PGOE04) and ModelNet (VYW+02), which emulate the network component, or Virtual Machine technology, which allows CPU throttling to control the performance of a node.

In a simulation, both the environment and the application need to be modelled. This can be either an advantage or a disadvantage, depending on whether we want to simulate a model of an application or the application itself. Examples include Bricks (TMN+99), SimGrid (CQ08), OptorSim (CMN+04), GridSim (BM02), GridNet (LSSD03), Wrekavoc (CDGJ09), etc.

Although all previously mentioned solutions provide means of running experiments inside distributed environments, we need a custom-built simulator able to cope with several particular requirements:

• meta-scheduling inside MAS;

• parallel tests for various SAs and scheduling strategies;

• simulation and deployment without modifying the algorithm, using the same simulation platform;


• multiple levels of scheduling each with its own SA;

• various models for computing transfer and execution costs;

• etc.

Achieving this goal is not trivial with the solutions presented in the previous paragraphs. Furthermore, the particular architecture of the SymGrid-Services framework2 – for which the simulator has been initially built (CFMP10) – is more appropriate for a custom solution. This requirement is mandatory as we need to be able to control every aspect a task can encounter from its submission until its completion. The non-deterministic behaviour of CAS algorithms and the need to facilitate the customization of network traffic and resource availability modelling were also decisive in the choice of designing a custom simulator.

In what follows we briefly explain the designed Scheduling Platform for simulating SAs.

The SP relies on scheduling heuristics to map tasks on resources. These can be plugged in dynamically at runtime by specifying the class that implements them. In this way a user can easily switch between various scheduling algorithms and observe the results. The scheduling heuristics base class provides transparent access to the database holding system meta-data regarding tasks and computing resources. Thus it allows users to quickly integrate their own policy without needing to bother with details such as how to handle data from external sources (i.e., databases, file systems). This allows even more flexibility for the scheduler, as it can be used on its own, independently from both the simulator and the real system.
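The plug-in mechanism described above can be sketched as follows; the names (`SchedulingHeuristic`, `load_heuristic`) are illustrative, not the actual SymGrid-Services API:

```python
import importlib
from abc import ABC, abstractmethod

class SchedulingHeuristic(ABC):
    """Base class all pluggable heuristics derive from; in the real platform
    it would also hide access to the task/resource meta-data store."""

    @abstractmethod
    def schedule(self, tasks, queues):
        """Return a mapping task -> queue."""

def load_heuristic(qualified_name):
    """Instantiate a heuristic from its fully qualified class name,
    e.g. 'mypkg.minql.MinQL' (a hypothetical module path)."""
    module_name, _, class_name = qualified_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()

class RoundRobin(SchedulingHeuristic):
    """Trivial built-in policy, used here only to exercise the plug-in API."""
    def schedule(self, tasks, queues):
        return {t: queues[i % len(queues)] for i, t in enumerate(tasks)}
```

Switching heuristics then amounts to changing a configuration string passed to `load_heuristic`.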

Topology files are used for computing transfer and execution costs. These files usually represent the underlying network and resources on which the system executes. They can be generated either automatically, based on the sensors deployed on the platform (EDQ07), or statically in cases when the topology does not change often. In the case of SymGrid-Services the corresponding data is taken automatically from the database and the topology platform is created prior to running the simulation.

Rescheduling is accomplished either at fixed time intervals or when events such as task has finished or new task has arrived occur. Choosing between these two methods depends on the application. In systems where the scheduling and executor components are tightly coupled, an event-based re-scheduler is an appropriate choice. However, in loosely coupled systems where the executor does not inform the scheduler of the completion of its tasks, the latter would have no information on when to start the rescheduling. Hence, a time-interval based approach is more suited in this scenario.

Given a task's characteristics and its initial estimates, the scheduler will determine the new costs based on the characteristics of each considered network route and computational resource. In order to simulate the non-determinism exhibited by some real systems (e.g., CAS), the new costs can be computed with the help of random variables. These variables give the probability for the cost to be in a certain interval. The computation proceeds as follows:

2SymGrid-Services is a platform for exposing CAS functionality through WS/GS. It also allows users to compose mathematical problems and to solve them in a distributed fashion on various CASs. More information on the project can be found at http://www.symbolic-computation.org/The_SCIEnce_Project


• compute the new cost based on resource and task characteristics: C_i^j = f(Ti, Rj);

• compute the cost interval, whose limits are chosen when the scheduling heuristic is defined: CI_i^j = [C_i^j − σ, C_i^j + σ], where σ > 0 is arbitrarily chosen;

• once all costs have been computed on every available resource, a multiset of cost intervals is created. Similar intervals are treated as being identical; two intervals are considered similar if their boundaries differ by at most an arbitrarily chosen ε value;

• a random variable is used to specify which cost interval is likely to occur. This variable is defined as X : (1 2 ... k ; p1 p2 ... pk), where k represents the number of entries in the multiset, pk represents the probability that cost group k is selected, and Σ_{i=1}^{k} p_i = 1. Here p_i = noSimilarInterval_i / numberResources, where noSimilarInterval_i represents the number of similar resource costs belonging to multiset element i and numberResources represents the total number of considered resources.

Because resources are often similar, there can be several almost-identical intervals (within ε) in the list. In this case the chance for that particular interval to be selected increases. If such an interval is selected, the resource to be assigned to the task is picked randomly from among the resources that provided similar results.

This approach for selecting a resource basically clusters resources and favours the largest clusters – in terms of number of resources. Each cluster is comprised of a set of almost homogeneous resources. Homogeneity refers here not only to the physical characteristics but also to the services running on top of the resources and their exposed functionality. Although favouring the largest cluster to the detriment of the one offering the best result may seem odd, it offers a solution to the non-determinism of certain systems, including CASs.
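The interval construction, ε-clustering and probabilistic selection described above can be sketched as follows (function names are illustrative):

```python
import random

def cost_intervals(costs, sigma):
    """Cost interval CI_i = [C_i - sigma, C_i + sigma] for each resource cost."""
    return [(c - sigma, c + sigma) for c in costs]

def cluster_intervals(intervals, eps):
    """Group intervals whose boundaries differ by at most eps; each group
    keeps the indices of the resources that produced a similar cost."""
    groups = []
    for idx, (lo, hi) in enumerate(intervals):
        for rep, members in groups:
            if abs(rep[0] - lo) <= eps and abs(rep[1] - hi) <= eps:
                members.append(idx)
                break
        else:
            groups.append(((lo, hi), [idx]))
    return groups

def pick_resource(costs, sigma, eps, rng):
    """Select a cost group with probability |group| / #resources, then a
    random resource inside the winning group (favouring large clusters)."""
    groups = cluster_intervals(cost_intervals(costs, sigma), eps)
    weights = [len(members) for _, members in groups]
    _, members = rng.choices(groups, weights=weights, k=1)[0]
    return rng.choice(members)
```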

Nonetheless, in order to offer a general solution, the cost can be computed using any cost function defined by the user.

To simulate task execution, a deviation – drawn from a normal distribution – from the generated EET is used. This time is used to determine how long the resource is kept busy. All tasks are assumed to be non-preemptive.

3.2 Dynamic Scheduling Algorithms for Heterogeneous Environments

This thesis treats the case of online task scheduling. As most scheduling involves either independent or dependent task scheduling, a unitary view of both needs to be defined. One approach is to offer a dynamic SA in which tasks coming from different workflows are treated as a single batch. This approach is similar to the hybrid SA for DAG scheduling presented in (SZ04a), where a batch is created for each set of independent tasks in a DAG. In what follows we present an extension which supports tasks from multiple DAGs for use in online task scheduling. A first restriction we impose is to fix the number of task batches to one. The batch is updated periodically as new tasks arrive. The new tasks can be either initial tasks from newly submitted workflows or tasks from old workflows whose precedent tasks have already completed. Chapter 5 will detail the role of task statuses during the scheduling process. In addition, a condition is imposed on the tasks belonging to the batch B at any given time, as follows:

∀ (Ti, Tj ∈ B ∧ Ti, Tj ∈ V) ∄ ((i, j) ∈ A ∨ (j, i) ∈ A)   (3.2.1)

where G = (V, A) is a DAG in which the set V = T contains the list of nodes, and (i, j) ∈ A is a transition from task Ti to Tj if Ti must be executed before Tj. For simplicity, in what follows we write (i, j) instead of Ti → Tj.

The above condition indicates that more than one task can belong to the same workflow in a batch, but there cannot be a precedence relationship between them.

It can be argued that the previous relation suffices, as there is no need to schedule a chain of tasks having a dependency relation between them – T1 → T2 → ... → Tn, Ti ∈ V – as in the offline HEFT algorithm for instance (ZS03), because in online scheduling it is impossible to know a priori all the required task information from a workflow.

As an example of how we can schedule tasks inside a workflow without the need for a whole-workflow schedule, let us consider Fig. 3.2.1. The tasks belonging to it get scheduled as shown in Fig. 3.2.2. The full circles represent tasks that can be scheduled while the empty circles represent tasks that will be scheduled in the future. Tasks T1 and T2 get scheduled and executed at the same time since there is no transition T1 → T2 or T2 → T1. Tasks T3 and T4 get nevertheless scheduled and executed in different steps, as there exist transitions T1,2 → T3 and T3 → T4.

[Figure: tasks T1 and T2 both precede T3, which precedes T4]

Figure 3.2.1: Simple workflow example

For the remainder of this thesis this model will be used implicitly and no explicit reference to dependent or independent task scheduling will be made.
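A sketch of how a single batch obeying Condition 3.2.1 can be extracted from several workflows; tasks are identified by (workflow, node) pairs and each DAG is given as its edge set:

```python
def ready_batch(dags, done):
    """Collect into one batch every task whose predecessors have all finished.
    `dags` maps a workflow id to its edge set {(i, j)}; `done` is the set of
    completed (workflow, task) pairs. Condition 3.2.1 holds by construction:
    a task with a pending predecessor is never placed in the batch, so no
    two batch members are linked by a precedence edge.
    Only tasks appearing in at least one edge are considered."""
    batch = []
    for wf, edges in dags.items():
        tasks = {t for e in edges for t in e}
        for t in sorted(tasks):
            if (wf, t) in done:
                continue
            preds = {i for (i, j) in edges if j == t}
            if all((wf, p) in done for p in preds):
                batch.append((wf, t))
    return batch
```

For the workflow of Fig. 3.2.1 (T1 and T2 preceding T3, which precedes T4) the batch first contains T1 and T2, then T3, then T4.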

The previous model has been presented in Frıncu et al. (FMC09). The next two sections present the load balancing MinQL algorithm (cf. Sect. 3.3) and the general case of DMECT (cf. Sect. 3.4).


[Figure: STEP 1 sets as SCHEDULED the tasks which need to be executed in parallel (T1, T2); STEPS 2 and 3 set as SCHEDULED the next sequential task which needs to be executed (T3, then T4)]

Figure 3.2.2: Task execution order for a simple workflow example

3.3 MinQL Algorithm

MinQL (Minimization of Queue Length) was introduced in Frıncu et al. (FMC09). The algorithm is a flavour of the backfilling algorithm (LS02) that works without EET requirements. Recent work on backfilling strategies (GRC09) has tackled the problem of taking resource usage into account when rescheduling, but fails to consider the heterogeneity of resources when applying the policy. As mentioned in Sect. 3.2, MinQL is a particular case of the DMECT scheduling heuristic and consequently the mathematical model will be detailed only for the latter (cf. Sect. 3.4.1). Next we present the algorithm, followed by its formalization (cf. Sect. 3.3.1) and some test results (cf. Sect. 3.3.2).

Tasks are supposed to arrive from multiple running workflows. The arrival rate is assumed to follow one of the two models presented in Sect. 3.1. All existing tasks are part of the batch introduced in Sect. 3.2 and obey Condition 3.2.1.

MinQL takes into consideration ageing due to the online task arrival rate. This is important as old tasks should avoid starvation (i.e., being continuously delayed from executing). Hence a relocated task will be placed on its new queue such that the tasks in it remain ordered decreasingly by their Total Waiting Time (TWT), where TWT represents the time elapsed since the task's initial submission. Newly arrived tasks are placed randomly in the queues.
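The TWT-ordered insertion used when relocating a task can be sketched as follows (representing a task as a dict with a `submitted` timestamp is an assumption of this sketch):

```python
def insert_by_twt(queue, task, now):
    """Insert `task` so the queue stays ordered by decreasing Total Waiting
    Time (now - submission time), i.e. oldest submissions first."""
    twt = lambda t: now - t["submitted"]
    pos = len(queue)  # default: append at the end
    for i, queued in enumerate(queue):
        if twt(queued) < twt(task):  # first task younger than the newcomer
            pos = i
            break
    queue.insert(pos, task)
    return queue
```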

3.3.1 Mathematical Model

The algorithm requires two sets: (1) the list of tasks not yet executed, T′ ⊆ T, and (2) the list of queues, Q. Both have been defined in Sect. 2.1.

As the SA periodically rebalances the queues, the only necessary condition is the one for the selection of the new queue – Qmin – where Ti from Qk will be moved:

Qmin = {Qj : (|Qj| < |Qk|) ∧ isSupported(Ti, Qj), ∀Qj ∈ Q \ {Qk}}   (3.3.1)


Queues are rebalanced either after the arrival of a new set of tasks or due to the successful execution of some tasks.

Additionally, multi-criteria conditions could be added. A selection based on the smallest queue having the fastest processing power is given next:

Qmin = {Qj : (|Qj| < |Qk|) ∧ isSupported(Ti, Qj) ∧ (Qj^speed ≥ Qk^speed), ∀Qj ∈ Q \ {Qk}, Ti ∈ Qk}   (3.3.2)

The isSupported(Ti, Qj) method is used to decide whether a specific task Ti (e.g., a mathematical problem) can be executed or not on queue Qj.

These optional conditions represent the main improvements brought by this algorithm over existing backfilling variants such as those described in Larson et al. (LS02) and Guim et al. (GRC09). The backfill is done at the end of the queue and not by replacing the lead task (GRC09). The reason for this choice is that we give priority to older tasks and do not consider the starting time of a task when scheduling. The reason for the latter is that we assume in this thesis a processing-intensive scenario in which the processor is always busy executing enqueued tasks.

In the case of mathematical problems which need to be solved on remotely located CASs, the different capabilities of the mathematical systems need to be taken into consideration. The task can be encoded either in an XML-based language called the Symbolic Computation Software Composability Protocol (SCSCP) (KL07) or in plain text, by specifying the CAS function to be called and its arguments. When using the SCSCP protocol, messages are encoded using OpenMath symbols and CASs must support this protocol in order to use it. When deciding whether Ti can run on Qj or not, we must take into consideration:

• whether the task is SCSCP encoded or not;

• which CAS supports this protocol and the OpenMath symbols encoded inside of it;

• whether the memory and computational requirements specified for this task are available or not on the node running the chosen CAS;

• etc.

The pseudocode for the MinQL SA is given in Algorithm 3.3.1.

Proposition 3.3.1 The complexity of MinQL is O(mn/2).

Remark: The complexity of MinQL can be reduced to O(n ln(m)/2) if a self-balancing binary tree such as an AVL or Red-Black tree is used for ordering the queues by their cardinality.


Algorithm 3.3.1 The MinQL pseudocode

Require: T′ the set of newly submitted tasks
Require: T the set of already submitted tasks
Require: Q the set of queues attached to resources
Ensure: ||Qi| − |Qj|| ≤ 1, ∀Qi, Qj ∈ Q

 1: repeat
 2:   while T′ ≠ ∅ do
 3:     t ∈ T′;
 4:     Choose randomly Qk ∈ Q for which isSupported(t, Qk);
 5:     Qk = Qk ∪ {t}, inserting t so that ∀i < j, Ti, Tj ∈ Qk : TWT_Ti ≥ TWT_Tj;
 6:   end while
 7:   Qprocessed = ∅;
 8:   while Q ≠ ∅ do
 9:     Qmax = {Qj : |Qj| ≥ |Qk|, ∀Qk ∈ Q};
10:     Qmin = ∅;
11:     found = true;
12:     while |Qmax| − |Qmin| > 1 and found do
13:       t = last task in Qmax;
14:       Qmin = {Qj : |Qj| ≤ |Qk| ∧ isSupported(t, Qj), ∀Qk, Qj ∈ Q};
15:       if Qmin ≠ Qmax then
16:         Qmax = Qmax \ {t};
17:         Qmin = Qmin ∪ {t}, inserting t so that ∀i < j, Ti, Tj ∈ Qmin : TWT_Ti ≥ TWT_Tj;
18:       else
19:         found = false;
20:       end if
21:     end while
22:     Q = Q \ {Qmax};
23:     Qprocessed = Qprocessed ∪ {Qmax};
24:   end while
25:   Q = Qprocessed;
26:   Qprocessed = ∅;
27: until ∄ Qi, Qj ∈ Q : |Qi| − |Qj| > 1;
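The rebalancing loop of Algorithm 3.3.1 can be sketched as follows. For brevity the sketch moves the last task of the longest queue to the shortest supporting queue and stops when no move reduces the imbalance, omitting the TWT-ordered insertion and the Qprocessed bookkeeping:

```python
def rebalance(queues, is_supported):
    """MinQL-style rebalancing over a dict queue_id -> task list.
    Repeatedly move the last task of the longest queue to the shortest
    queue that supports it, as long as the move strictly reduces the
    length difference (i.e. the two queues differ by more than one task)."""
    moved = True
    while moved:
        moved = False
        longest = max(queues, key=lambda q: len(queues[q]))
        if not queues[longest]:
            break  # all queues empty
        task = queues[longest][-1]
        # Queues that support the task and would stay strictly shorter
        # than the source even after receiving it.
        candidates = [q for q in queues
                      if q != longest and is_supported(task, q)
                      and len(queues[q]) + 1 < len(queues[longest])]
        if candidates:
            target = min(candidates, key=lambda q: len(queues[q]))
            queues[target].append(queues[longest].pop())
            moved = True
    return queues
```

Each move strictly lowers the imbalance between the two queues involved, so the loop always terminates.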

3.3.2 MinQL Tests

MinQL is a scheduling heuristic intended to provide a solution for scheduling online tasks without requiring any user EETs, as there are cases when these either cannot be relied upon (cf. Sect. 1.2) or cannot be supplied because little is known about task or resource properties (e.g., in the case of mathematical problems). The simple scheduling heuristic has been designed primarily for dealing with CAS problems. We now present a test scenario, followed by a discussion of the obtained results.


Table 3.3.1: Test platforms used in the experiments

Name        #clusters  #procs (total/used)  h (cf. Relation 2.1.16)
GridPP      21         7948/42              80
sub-GridPP  5          1865/10              80
DAS-3       5          272/10               18
sub-G'5000  5          339/10               2

Test Scenarios

For the testing scenario, six scheduling heuristics were evaluated using generated EETs as described in Sect. 3.1: Max-Min, Min-Min, Suffrage, Round Robin, and two MinQL variants – one using the CPU speed as relocation condition and another one, called MinQL-Plain, which considers all resources capable of executing tasks. For the four scheduling heuristics used as comparison with MinQL, the EET is considered to be known a priori and can be placed inside the EET matrix (cf. Sect. 2.1.2). The matrix is considered to be consistent. The SP used for testing is the one described in Sect. 3.1. Given a random resource assigned to queue Qj, we generate a normally distributed random N(µ_Ci, σ_Ci) value for the EET of a task Ti on that resource. For the tests we chose µ_Ci = i × 150, ∀i = 1..10, and σ = 100. For all other resources Qi, i ≠ j, the EETi of the same task is computed as Qj^speed × EETj / Qi^speed. This formula does not take into consideration CPU fluctuations during scheduling. More precisely, if we take the column belonging to a resource i and the column belonging to a faster resource j, then all elements from column j are smaller than the elements belonging to column i. Normally generated EET values can then be easily transformed into log-normal ones.

In MinQL's case the EET is not actually taken into consideration, as it is a load balancing algorithm; the EET is used only to compute the schedule's makespan for statistics.
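The consistent EET matrix generation used in the tests can be sketched as follows; the reference-resource choice and the clamping of negative normal draws are assumptions of this sketch:

```python
import random

def consistent_eet_matrix(n_tasks, speeds, mu_step=150.0, sigma=100.0, seed=0):
    """Generate a consistent EET matrix: task i gets a normally distributed
    EET N(i * mu_step, sigma) on a randomly chosen reference resource, and
    its EET on every other resource scales inversely with relative speed,
    so a faster resource always yields a smaller EET for every task."""
    rng = random.Random(seed)
    eet = []
    for i in range(1, n_tasks + 1):
        ref = rng.randrange(len(speeds))
        base = max(1.0, rng.gauss(i * mu_step, sigma))  # clamp negative draws
        eet.append([base * speeds[ref] / s for s in speeds])
    return eet
```

Because every row is a single base value rescaled by speed ratios, the column ordering is identical for all tasks, which is exactly the consistency property assumed for the comparison heuristics.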

The platforms used for testing represent real Grids and are extracted from the Platform Description Archive (SQF). We have used four platform configurations (Table 3.3.1) representing the GridPP (Gri10) platform, DAS-3 (DAS10), and two sub-platforms belonging to the G'5000 Grid (Con10) and GridPP. For the simulation we have mapped one cluster to one queue, with each queue having a number of executors equal to the number of processors (or nodes, in case the number of processors was unavailable) in the corresponding cluster. Scheduling is done at meta-level. The SP automatically assigns a task to an available node in the cluster at a particular time. In case all nodes are occupied, tasks remain on their queues until either the simulator tries to start them or the SA performs a rescheduling operation.

The underlying platforms were chosen such that the SAs would be tested against a wide range of scenarios varying from low to high heterogeneity.

Besides testing MinQL against other SAs, a multi-level approach has also been considered. The reason for this is that MAS usually schedule at two levels: global (meta) and local. During tests we considered only MinQL at the global level, while at the local level either Min-Min, Max-Min or MinQL is used. These choices were driven by the fact that the tests were performed on SymGrid-Services. In this framework inter-cluster scheduling aims only at levelling the load distribution, while intra-cluster scheduling requires more complex scheduling heuristics optimized for various cost objectives. The platform topology was based on the SCIEnce platform (sci), which is formed of two clusters, each with eight computational nodes. Tests aimed at determining the makespan, the average resource load and the average waiting time on a resource queue.

For the single-level tests the experiments consisted of batches ranging from 10 up to 500 tasks and were repeated 20 times, the average over the runs being selected as the final result. The multi-level tests targeted online scheduling with 10, respectively 20, workflows – comprised of a varying number of tasks modelled as described in Sect. 3.1 – being generated.

Test Results

Experiments based on the previously defined setup have shown that MinQL gives good results in all envisioned scenarios. Furthermore, the two versions, MinQL and MinQL-Plain, performed roughly the same. This behaviour confirmed the results described by Frıncu et al. (FMC09), i.e., that in the case of this SA taking into account the processor speed when moving tasks does not offer a significant improvement. Tests have evidenced that the resulting makespan is comparable with that of Min-Min, Max-Min or Round Robin (cf. Fig. 3.3.1(a)) in the case of the platform with h = 2 and 5 clusters, and closely matches that of the Suffrage algorithm for the h = 18 and h = 80 heterogeneity factor platforms (cf. Figs. 3.3.1(b), 3.3.1(d) and 3.3.1(c)). They have also re-confirmed that Suffrage performs better when the DS is heterogeneous. The other tested SA which does not require EET, Round Robin, performed increasingly poorly as the heterogeneity of the platform increased.

An interesting result was obtained in the case of the two platforms with h = 80. MinQL behaved the same in both cases, yet Round Robin gave a better makespan in the 5-cluster tests (cf. Fig. 3.3.1(d)) than in the 21-cluster tests, where it offered the worst makespan. Min-Min outperformed Max-Min in the 5-cluster tests, which was not the case in the 21-cluster experiments. This result reinforces the observation that varying the number of resources affects the performance of some scheduling heuristics.

Tables 3.3.2, 3.3.3, 3.3.4 and 3.3.5 show the gain (cf. Sect. 2.1.2) of the algorithm in the four tested scenarios. From these tables we can easily notice the improvement offered by MinQL compared to the rest of the studied SAs. It can also be noticed that for the h = 18 and h = 80 heterogeneity factors (cf. Tables 3.3.3, 3.3.4 and 3.3.5) Suffrage and MinQL perform almost the same.

MinQL also performed well when taking into account the queue compactness (cf. Figs. 3.3.2(a), 3.3.2(b) and 3.3.2(c)). The average results are 0.9 (h = 2 and 5 clusters), 0.9 (h = 18 and 5 clusters), 0.88 (h = 80 and 5 clusters) and 0.69 (h = 80 and 21 clusters). Tests on MinQL-Plain produced average values of 0.9 (h = 2 and 5 clusters), 0.91 (h = 18 and 5 clusters), 0.89 (h = 80 and 5 clusters) and 0.69 (h = 80 and 21 clusters). Given that a value of one means that the system load has been perfectly balanced, we can deduce that the algorithm also performs well in this respect.

Results for the multi-level approach are depicted in Figs. 3.3.3, 3.3.4, 3.3.5 and 3.3.6 and in Table 3.3.6.

If we consider that tasks at both the global and local level are scheduled with the MinQL algorithm and we submit 10, respectively 20, execution workflows, the average waiting time in our simulation is relatively small for all servers, as can be seen in Fig. 3.3.3. This demonstrates that the SAs behave as expected and that the values are similar for the two cases.

When using different SAs at the local level we notice a slight modification in the average waiting time profile (cf. Figs. 3.3.4 and 3.3.3(b)). This is due to the fact that Min-Min and Max-Min are not load balancing algorithms, which results in a higher average waiting time for certain resources.

Two facts can be underlined from Fig. 3.3.5: (1) the resource load when MinQL is used is not affected by the number of executed workflows, and (2) a better resource load is achieved when MinQL is used at the meta-level. The same conclusion cannot be drawn for the situation in which we use the Min-Min or Max-Min algorithms. These two algorithms led to unbalanced loads on the resources (cf. Fig. 3.3.6).

Table 3.3.6 depicts the average makespan (including the standard deviation). It can be noticed that when Max-Min or Min-Min is used at the local level the makespan is better compared with the case when MinQL is used at both levels. This is because the Min-Min and Max-Min algorithms use task EET when taking scheduling decisions, while MinQL focuses on load balancing.

No. tasks   Max-Min   Min-Min   RR     Suff.
10          1.22      1.51      1.09   1.62
50          1.05      1.15      1.06   1.36
100         1.02      1.08      1.03   1.3
150         1.02      1.05      1.03   1.29
200         1.02      1.04      1.02   1.28
250         1.01      1.03      1.02   1.27
300         1.01      1.03      1.02   1.26
350         1.01      1.03      1.02   1.26
400         1.01      1.02      1.01   1.26
450         1         1.02      1.02   1.25
500         1         1.01      1.01   1.25

Table 3.3.2: Gain of MinQL with regard to other SAs for a platform with h = 2 and 5 clusters

No. tasks   Max-Min   Min-Min   RR     Suff.
10          1.34      1.73      1.54   1.43
50          1.13      1.22      1.14   1.11
100         1.1       1.18      1.12   1.05
150         1.1       1.16      1.12   1.03
200         1.09      1.15      1.11   1.02
250         1.09      1.14      1.09   1.02
300         1.09      1.16      1.1    1.02
350         1.09      1.15      1.1    1.01
400         1.09      1.15      1.09   1.01
450         1.1       1.16      1.09   1.01
500         1.1       1.15      1.09   1.01

Table 3.3.3: Gain of MinQL with regard to other SAs for a platform with h = 18 and 5 clusters


[Figure omitted: four panels plotting makespan (ms) against the number of tasks for MinQL, MinQL-Plain, Round Robin, Suffrage, Min-Min and Max-Min: (a) for a platform with h = 2 and 5 clusters; (b) h = 18 and 5 clusters; (c) h = 80 and 21 clusters; (d) h = 80 and 5 clusters.]

Figure 3.3.1: Makespan of MinQL compared to other heuristics


[Figure omitted: four panels plotting compactness against the number of tasks for MinQL, MinQL-Plain, Round Robin, Suffrage, Min-Min and Max-Min: (a) for a platform with h = 2 and 5 clusters; (b) h = 18 and 5 clusters; (c) h = 80 and 21 clusters; (d) h = 80 and 5 clusters.]

Figure 3.3.2: Compactness of MinQL compared to other heuristics


No. tasks   Max-Min   Min-Min   RR     Suff.
10          4.25      5.37      0      3.82
50          1.25      1.68      2.26   1.26
100         1.32      1.57      1.89   1.08
150         1.3       1.46      1.69   1.02
200         1.35      1.47      1.64   1.04
250         1.34      1.45      1.61   1.01
300         1.35      1.44      1.57   1
350         1.35      1.43      1.55   1
400         1.36      1.42      1.54   1
450         1.36      1.41      1.51   1
500         1.38      1.42      1.51   1

Table 3.3.4: Gain of MinQL with regard to other SAs for a platform with h = 80 and 21 clusters

No. tasks   Max-Min   Min-Min   RR     Suff.
10          1.57      1.99      1.81   1.52
50          1.42      1.51      1.47   1.09
100         1.44      1.45      1.42   1.04
150         1.48      1.41      1.41   1.02
200         1.5       1.39      1.38   1.01
250         1.51      1.39      1.4    1.01
300         1.51      1.41      1.39   1.01
350         1.52      1.4       1.39   1
400         1.51      1.38      1.38   1
450         1.53      1.39      1.38   1
500         1.49      1.4       1.38   1

Table 3.3.5: Gain of MinQL with regard to other SAs for a platform with h = 80 and 5 clusters

[Figure omitted: two panels plotting the average waiting time (ms) against the resource index: (a) scenario with 10 submitted online workflows (MinQL-MinQL-10); (b) scenario with 20 submitted online workflows (MinQL-MinQL-20).]

Figure 3.3.3: Average waiting time per resource when MinQL is used at both levels

Workflow no.   MinQL-MinQL    MinQL-Max-Min   MinQL-Min-Min
10             22758 ± 6026   21230 ± 5425    24397 ± 5092
20             42247 ± 5398   37935 ± 8327    37450 ± 6261

Table 3.3.6: Makespan comparison for the case MinQL is used in a multi-level SP


[Figure omitted: two panels plotting the average waiting time (ms) against the resource index: (a) MinQL-Max-Min (MinQL-MaxMin-20); (b) MinQL-Min-Min (MinQL-MinMin-20).]

Figure 3.3.4: Average waiting time per resource for 20 workflows when different SAs are used at the two levels

[Figure omitted: two panels plotting the load against the resource index: (a) scenario with 10 submitted online workflows (MinQL-MinQL-10); (b) scenario with 20 submitted online workflows (MinQL-MinQL-20).]

Figure 3.3.5: Average load per resource when MinQL is used at both levels


[Figure omitted: two panels plotting the load against the resource index: (a) MinQL-Max-Min (MinQL-MaxMin-20); (b) MinQL-Min-Min (MinQL-MinMin-20).]

Figure 3.3.6: Average load per CAS for 20 workflows when different SAs are used at the two levels


3.4 DMECT Algorithm

This section describes the Dynamic Minimization of the Estimated Completion Time (DMECT) scheduling heuristics. The algorithm was presented in Frîncu (Fr9b), together with the mathematical model (cf. Sect. 3.4.1), the proof of convergence (cf. Sect. 3.4.2) and the threshold value for the task's actual relocation (cf. Sect. 3.4.4). In this section we also study the stability of the algorithm (cf. Sect. 3.4.3). Detailed tests involving deadline constraint flavours have also been analysed in Frîncu (Fr9a). A population based approach that uses DMECT as well as MinQL was considered in Micota, Frîncu and Zaharie (ZFZ11) and is presented in Sect. 3.4.5, together with some tests in Sect. 3.4.8.

DMECT treats each task individually when attempting to relocate it. The criterion used when rescheduling is the ECT of individual tasks. Because of its application in online scheduling, the SA considers the Total Waiting Time (TWT) of each task and uses this value to keep the queues ordered descending by it. The decision to move a task is based on whether its Local Waiting Time (LWT), i.e., the time since the task has been assigned to its current queue, has reached a certain threshold. The SA makes use of three conditions: time (when to move), position (where to move) and force (give priority to older tasks). The time condition introduces a new customizable method for relocating a task. The exact time can depend on internal policies or can be adjusted at runtime based on various criteria. Results from the performed tests have shown that it does not matter how tasks are initially distributed on queues, as this criterion does not impact the overall schedule objective function (i.e., makespan or lateness). Tests also addressed two aspects of scheduling: (1) cases when user ECT is required (cf. Sect. 3.4.6) and (2) cases when task lateness is needed instead (cf. Sect. 3.4.7).

3.4.1 Mathematical Model

Let $Q$ be the set of queues, $R$ the set of computing resources and $T$ the set of tasks. Every $Q_j \in Q$ belongs to one $R_j \in R$ and is an ordered set (Remark 3.4.1) containing tasks. A schedule is a function $f : T \to R$ which maps every task $T_i \in T$ onto a resource $r \in R$.

Given a task $T_i^k \in Q_k$, we define $LWT_i^k$ and $TWT_i$ as the Local Waiting Time of the task on $Q_k$, respectively the Total Waiting Time of the same task. The $LWT_i^k$ is always reset to 0 after each relocation.

Remark 3.4.1 Queues are ordered sets such that $\forall Q_k \in Q, \forall T_i, T_j \in Q_k\,(i > j \Rightarrow TWT_i \le TWT_j)$.

The previous remark restricts younger tasks from being scheduled ahead of older ones. The purpose is to prevent older tasks from starving due to the constant insertion ahead of them of younger tasks with smaller ECTs. In the case of batch jobs arriving at the same moment, the remark does not influence the scheduling decisions in any way, as all $TWT_i$ would be equal.
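As an illustration, the insertion point required by Remark 3.4.1 can be computed in a few lines. This is a sketch under our own naming (the thesis does not prescribe an implementation); a queue is represented simply as the list of its tasks' TWT values, sorted descending:

```python
def insert_position(twts, new_twt):
    """Index at which a task with TWT `new_twt` must be inserted so that the
    queue stays ordered by descending TWT (Remark 3.4.1): older tasks, i.e.
    larger TWT values, remain ahead of younger ones."""
    i = 0
    while i < len(twts) and twts[i] >= new_twt:
        i += 1
    return i

# A batch of tasks arriving at the same moment (equal TWTs) is appended at
# the tail, so the remark does not reorder simultaneous submissions.
```

For a queue with TWTs [9, 7, 4], a task with TWT 5 is inserted at index 2, behind both older tasks.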

In addition, and in contrast with the MinQL model (cf. Sect. 3.3.1), we assume that the queue selection is done so that each $Q_j \in Q$ supports each and every task assigned to it. As a result we will not explicitly use the $isSupported(T_i, Q_j)$ function (cf. Sect. 3.3.1) during the queue selection phase. The three conditions imposed on the algorithm and listed in Section 3.4 are modelled as follows:

Condition 1 is represented by the time when $T_i^j$ can be moved from $Q_j$ and is given by Relation (3.4.1):

$$e^{\sigma - LWT_i^j} \in \begin{cases} [0, 1), & \text{move} \\ [1, \infty), & \text{keep} \end{cases} \quad (3.4.1)$$

where $\sigma < \infty$ is arbitrarily selected. The previous relation simply sets a time threshold on the moment when a decision for moving $T_i^j$ from $Q_j$ to a new queue should be made. There is no rule for choosing a proper value for $\sigma$, but tests (cf. Sect. 3.4.6) have shown that the makespan is directly influenced by it. In our tests we have considered two approaches for computing the $\sigma$ value: (1) one using EET and (2) another using deadline constraints for cases where EET is hard or even impossible to determine. A similar relocation condition is given in Chiang et al. (CADV02), where it is used as a priority function inside several backfilling strategies. However, in our approach we consider the LWT and not the TWT of a task, as we feel that a task relocation should be linked to the failure of a resource to execute the task inside a given interval and not to the time since it was submitted.

Remark 3.4.2 A necessary and sufficient condition for Relation 3.4.1 to reach values inside the $[0, 1)$ interval is that $\sigma - LWT_i^j \searrow L$ with $L < 0$.

EET based $\sigma$: for example, if we want to move $T_i$ to a new queue when its LWT exceeds its ECT, under the assumption that all tasks have the same EET as $T_i$, we obtain a condition like the one in Relation 3.4.2:

$$\sigma = \begin{cases} EET_i^j \cdot (i - 1), & i > 0 \\ LWT_i^j, & i = 0 \end{cases} \quad (3.4.2)$$

Variations of $\sigma$ could use the smallest EET, the actual ECT, a priority based approach, or could simply reschedule tasks at a fixed time interval (i.e., $\sigma = a$, $a \in (0, \infty]$ fixed). In case we choose a system based on priorities and assume a priority $p_i$ for $T_i$ that is increased each time the queue is changed, the relocation condition could be as described by Relation 3.4.3:

$$\sigma = \begin{cases} a \in (1, \infty), & p_i = 1 \\ \dfrac{TWT_i}{p_i}, & p_i > 1 \end{cases} \quad (3.4.3)$$

where $p_i$ is initially 1 and increases by one unit each time the task is logically relocated to another queue. Table 3.4.1 exemplifies how the SA would behave in this case.

Deadline based $\sigma$: uses the natural condition of moving tasks to another queue at the moment when their ECT value is greater than their Time Until Deadline (TUD) (cf. Relation 3.4.4):

$$\sigma = TUD_i - ECT_i^j \quad (3.4.4)$$


Table 3.4.1: σ condition example for p = 1, 2 and a = 5

TWT        0     3     6     9     12    15    18    21
LWT        0     3     6     0     3     6     9     12
σ          5     5     5     4.5   6     7.5   9     10.5
σ − LWT    5     2     -1    4.5   3     1.5   0     -1.5
p          1     1     1     2     2     2     2     2
Decision   stay  stay  move  stay  stay  stay  stay  move
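The decisions in Table 3.4.1 can be reproduced with a short simulation of Relations 3.4.1 and 3.4.3. The sketch below is illustrative (the function names are ours, not from the thesis implementation); the time condition is checked every 3 time units with a = 5:

```python
import math

def priority_sigma(twt, p, a=5.0):
    # Relation 3.4.3: a fixed constant while p = 1, TWT/p after relocations.
    return a if p == 1 else twt / p

def must_move(sigma, lwt):
    # Time condition (Relation 3.4.1): move when e^(sigma - LWT) < 1.
    return math.exp(sigma - lwt) < 1

decisions, p, lwt = [], 1, 0
for twt in range(0, 24, 3):              # rescheduling check every 3 units
    s = priority_sigma(twt, p)
    if must_move(s, lwt):
        decisions.append("move")
        p, lwt = p + 1, 0                # priority grows, LWT resets
    else:
        decisions.append("stay")
        lwt += 3
```

Running it yields the Decision row of Table 3.4.1: stay, stay, move, stay, stay, stay, stay, move.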

As previously mentioned, computing the ECT is not easy in cases where insight into the executor is difficult to obtain (e.g., a CAS in which there is no knowledge of the complexity of the algorithm used for solving a certain task, or Service Oriented Environments where the service interface hides relevant details from the rest of the SP). In this case a priority based approach, where older tasks are rescheduled faster as their TUD decreases, seems at first glance an appropriate solution. In order to execute tasks as fast as possible without inflicting too much lateness, we attempt to relocate them faster once their deadline has been exceeded. Another issue we deal with is minimizing the number of times tasks relocate. Relations 3.4.5 and 3.4.6 formalise these conditions by allowing tasks which are far from their deadline to wait longer on their queues than tasks approaching the time limit. As a result we have:

$$\sigma = \frac{TUD_i}{TWT_i^j} \quad (3.4.5)$$

such that:

$$e^{\sigma - LWT_i^j} \in [0, 1) \quad (3.4.6)$$

represents the condition for when to attempt task relocation. Table 3.4.2 exemplifies the behaviour of DMECT when using a value for $\sigma$ as shown in Relation 3.4.5. It also shows in what way (given a rescheduling interval of 1 second) a task would be relocated each time Relation 3.4.6 is satisfied.

Table 3.4.2: σ condition example for a task with initial TUD = 5 when using Relation 3.4.5

TWT        1     2      3     4      5     6      7
TUD        4     3      2     1      0     -1     -2
LWT        1     2      0     1      0     1      0
σ          4     1.5    0.66  0.25   0     -0.16  -0.28
σ − LWT    3     -0.5   0.66  -0.75  0     -1.16  -0.28
Decision   stay  move   stay  move   stay  move   move
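The deadline flavour of Table 3.4.2 can be replayed the same way. A minimal sketch (again with our own naming) of Relations 3.4.5 and 3.4.6, for an initial TUD of 5 and a rescheduling interval of 1 time unit:

```python
import math

INITIAL_TUD = 5
decisions, lwt = [], 1               # first check is 1 unit after submission
for twt in range(1, 8):
    tud = INITIAL_TUD - twt          # time until deadline shrinks as TWT grows
    sigma = tud / twt                # Relation 3.4.5
    if math.exp(sigma - lwt) < 1:    # Relation 3.4.6: attempt relocation
        decisions.append("move")
        lwt = 0                      # LWT resets after a (logical) move
    else:
        decisions.append("stay")
        lwt += 1
```

This reproduces the Decision row of Table 3.4.2 (stay, move, stay, move, stay, move, move): as the TUD approaches and passes the deadline, σ shrinks and relocations become more frequent.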

Figures 3.4.1(a), 3.4.1(b) and 3.4.1(c) present some graphs which exemplify the time condition for a single task during an interval of 100 time units. The rescheduling interval has been chosen equal to 1 time unit and the initial $\sigma = 10$. Each time $\sigma - LWT$ drops below 0 a rescheduling event takes place (cf. Relation 3.4.1). It can easily be noted that in the case of a constant $\sigma$ value the rescheduling time is fixed (cf. Fig. 3.4.1(a)), while for a priority based approach (cf. Relation 3.4.3) the rescheduling time increases with the priority (cf. Fig. 3.4.1(b)). In the case of a deadline approach (cf. Relation 3.4.5) the rescheduling time decreases when $TUD \to DLT$, with DLT being the Deadline Time of the task.

[Figure omitted: three panels plotting σ − LWT (s) against TWT (s): (a) time constraint for a constant σ = 10 and a rescheduling interval of 1 time unit; (b) time constraint for a priority based σ and a rescheduling interval of 1 time unit; (c) time constraint for a deadline based σ and a rescheduling interval of 1 time unit.]

Figure 3.4.1: Comparison of the time constraint behaviour for several σ flavours

Condition 2 Let $ECT_{Q_j} = \sum_{l=0}^{|Q_j|-1} EET_l^j$ be the ECT of $Q_j$ and $\theta_k = \frac{ECT_i^j}{ECT_i^k}$. Choosing the new $Q_k$ where to move $T_i^j$ means finding $k = \{i : \forall 0 \le j < n, j \ne i\,(\theta_j \le \theta_i)\}$ and satisfying Relation (3.4.7):

$$\left[\left(ECT_{Q_j} - EET_{T_i^j}^j\right) \ge \left(ECT_{Q_k} + EET_{T_i^j}^k\right)\right] \wedge \left[\frac{ECT_i^j}{ECT_i^k} \ge 1\right] \quad (3.4.7)$$

where the first term ensures that the resulting $ECT_{Q_k}$ will be smaller than the remaining $ECT_{Q_j}$. The second term of the condition restricts us from moving a task to a $Q_k$ where $ECT_i^k \ge ECT_i^j$. This situation may occur when $\forall Q_k \in Q \setminus \{Q_j\}\,(ECT_i^k \ge ECT_i^j)$.

Remark 3.4.3 Before computing the ECT on a new queue, a task is considered as being inserted at a position satisfying Remark 3.4.1. Based on this position the ECT value is then calculated.

Remark 3.4.4 New incoming tasks will be placed on a waiting queue. They will be moved onto a queue assigned to a resource only if $\|T\|$ (Definitions 3.4.1 and 3.4.2) is smaller than or equal to $\|T\|$ at the previous step. This is necessary in order to obey Definition 3.4.4.
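A compact sketch of the position condition follows. It is an illustrative reading of Relation 3.4.7 (the function name and the queue representation are our own assumptions): each queue is a list of task EETs, the queue ECT is their sum, and a task appended to $Q_k$ would obtain $ECT_i^k = ECT_{Q_k} + EET_i$:

```python
def select_queue(queues, j, eet_i):
    """Destination queue for the task with EET `eet_i` currently on queues[j],
    or None if no move satisfies Relation 3.4.7. Picks the queue maximizing
    theta among the admissible ones."""
    ect_j = sum(queues[j])            # ECT of the source queue (task included)
    best_k, best_theta = None, 1.0    # theta >= 1 encodes the second term
    for k, qk in enumerate(queues):
        if k == j:
            continue
        ect_k = sum(qk) + eet_i       # task's ECT if appended to Q_k
        theta = ect_j / ect_k
        # first term of (3.4.7): the source queue must remain at least as
        # loaded as the destination after the move
        if (ect_j - eet_i) >= ect_k and theta >= best_theta:
            best_k, best_theta = k, theta
    return best_k
```

For instance, with queues holding EETs [5, 5, 5, 5], [2] and [3, 3], a task of EET 5 on the first queue is sent to the second (the emptiest) one; when no candidate satisfies both terms, the task stays put.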

For the deadline flavours (cf. Relation 3.4.6), Relation 3.4.7 is difficult to satisfy due to the unavailability of the task's EET. As a result we consider the following criterion for queue selection:

$$\exists Q_m \forall Q_k \in Q\,(|Q_m| \le |Q_k|) \quad (3.4.8)$$

For environments in which information on the executor is hard to obtain, a selection based on the fastest resource was considered. The queue $Q_m$ with the smallest number of tasks is considered to be the fastest (cf. Relation 3.4.8).

This approach has disadvantages, as scheduling data intensive tasks can lead to high relocation costs when nothing is known about the involved resources. Nonetheless, the algorithm would eventually relocate all the tasks belonging to the queue executing the data intensive tasks. This, however, could induce a high lateness:

$$L_i = CT_i - DLT_i \quad (3.4.9)$$

Relation 3.4.8 also lies at the foundation of the MinQL algorithm (cf. Sect. 3.3), which periodically relocates the last tasks from large queues to smaller ones that also support their execution.
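The deadline-flavour selection rule itself reduces to a one-liner; a sketch under our own naming:

```python
def smallest_queue(queues):
    """Relation 3.4.8: with no EET information available, treat the queue
    with the fewest waiting tasks as the fastest one and select it."""
    return min(range(len(queues)), key=lambda i: len(queues[i]))
```

For queues of lengths 2, 1 and 3 the middle queue (index 1) is selected.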

Remark 3.4.5 MinQL is a particular flavour of DMECT in which the place condition is expressed by Relation 3.4.8, the force condition is identical to the one given in Condition 3, and the σ condition is arbitrary but fixed for all tasks.

Condition 3 The condition for executing tasks is given by $f_i = TWT_i$. It is chosen such that in each $Q_k$ tasks are executed in the descending order of their TWT. This condition can be expressed as:

$$\forall Q_k \in Q\ \forall\, 0 \le i \le |Q_k|,\ 0 \le j \le |Q_k|,\ i < j\,(TWT_j \le TWT_i) \quad (3.4.10)$$


3.4.2 Convergence of the Algorithm

In order to establish whether DMECT optimizes the cost function under given scenarios, we need to study if the objective function is minimized. In our case we study two functions: makespan and lateness. We need to prove that they always converge decreasingly to a limit. In this direction we introduce the following definitions:

Definition 3.4.1 Given $Q$ the set of queues and $T$ the set of tasks, we define in this thesis a transition inside a schedule as the process of moving one task from a queue to another. More formally, if:

$$\exists Q_j, Q_k \in Q,\ T_i \in Q_j:\ \left(e^{\sigma - LWT_{T_i}^j} < 1\right) \wedge \left(\frac{ECT_i^j}{ECT_i^k} \ge 1\right) \wedge \left[\left(ECT_{Q_j} - EET_{T_i^j}^j\right) \ge \left(ECT_{Q_k} + EET_{T_i^j}^k\right)\right]$$

where $k = \{i : \forall 0 \le j < n, j \ne i\,(\theta_j \le \theta_i)\}$, then a transition $\mathcal{T}(T_i) : Q_j \to Q_k$ is defined by the following set of operations:

$$\mathcal{T} = \begin{cases} Q_j = Q_j \setminus \{T_i\} \\ Q_k = Q_k \cup \{T_i\} \end{cases} \quad (3.4.11)$$

Informally, $\mathcal{T}(T_i)$ means taking $T_i$ from queue $Q_j$ and placing it on queue $Q_k$.

Definition 3.4.2 Given $C_{max}$ (makespan) and $L_{max}$ (lateness) we define:

• For $C_{max}$: let $s_t = \|T_t\| = \max\left(\sum_{l=0}^{|Q_1|-1} EET_l^1, \sum_{l=0}^{|Q_2|-1} EET_l^2, \ldots, \sum_{l=0}^{|Q_n|-1} EET_l^n\right)$ at moment $t$ in time;

• For $L_{max}$: let $s_t = \|T_t\| = \max(|Q_1|, |Q_2|, \ldots, |Q_n|)$ at moment $t$ in time.

Definition 3.4.3 Given the monotone function $s : [0, \tau] \to [0, \|T_0\|]$, we define the Riemann integral

$$\int_0^\tau s(t)\,dt$$

as a means of describing the evolution of $C_{max}$, respectively $L_{max}$, during $[0, \tau]$ as a result of applying the scheduling heuristics.

Definition 3.4.4 We say that the iterative process described by a dynamic SA is convergent if $s_t \searrow l \ge 0$.

In the case of the lateness we need to prove that under the given resource selection (cf. Relation 3.4.8) the lateness is minimized (cf. Proposition 3.4.2). However, in order to prove this we first need to prove Proposition 3.4.1:

Proposition 3.4.1 Choosing the best ECT for a task when rescheduling minimizes its lateness.


Proof Obvious both from Relation 3.4.9 and from the fact that at each reschedule the ECT is minimized.

We can now prove that $L_{max}$ is minimized:

Proposition 3.4.2 Task lateness is minimized if and only if

$$\left\lfloor \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rfloor (\tau_2 - \tau_1) \le \int_{\tau_1}^{\tau_2} s(t)\,dt \le \left\lceil \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rceil (\tau_2 - \tau_1)$$

where $T_{\tau_1 \to \tau_2}$ represents the set of tasks existing in the system inside the $[\tau_1, \tau_2]$ interval.

Proof ⇒ Assume that the lateness is minimized. We need to show that the evolution of the optimality criterion is bounded by the given interval. Given Relation 3.4.8, we can infer that smaller queues are faster and consequently the ECTs of their tasks are smaller (cf. Proposition 3.4.1). At every moment $t$ the scheduling heuristics applies a series of transitions $\mathcal{T}$ until $\left||Q_i| - |Q_{min}|\right| \le 1$, $\forall 0 \le i \le |Q|$, $Q_i \in Q$. It is easily observable that:

$$\forall 0 \le i \le |Q|\quad |Q_i| \in \left[\left\lfloor \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rfloor, \left\lceil \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rceil\right] \quad (3.4.12)$$

The integral can be written as $\int_{\tau_1}^{\tau_2} s(t)\,dt = \sum_{i=\tau_1}^{\tau_2 - 1} s(x_i)(t_{i+1} - t_i)$. If $t_{i+1} - t_i = 1$ then the integral becomes equal to $\sum_{i=\tau_1}^{\tau_2 - 1} s(x_i)$. It is easily noticeable from Definition 3.4.2 and Relation 3.4.12 that:

$$\left\lfloor \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rfloor (\tau_2 - \tau_1) \le \sum_{i=\tau_1}^{\tau_2 - 1} s(x_i) \le \left\lceil \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rceil (\tau_2 - \tau_1)$$

Hence:

$$\left\lfloor \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rfloor (\tau_2 - \tau_1) \le \int_{\tau_1}^{\tau_2} s(t)\,dt \le \left\lceil \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rceil (\tau_2 - \tau_1)$$

is satisfied.

⇐ Now we prove that if the inequality holds then the lateness is minimized. The inequality can be rewritten as:

$$\left\lfloor \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rfloor \le \frac{\int_{\tau_1}^{\tau_2} s(t)\,dt}{\tau_2 - \tau_1} \le \left\lceil \frac{|T_{\tau_1 \to \tau_2}|}{|R|} \right\rceil$$

Given that $\frac{|T_{\tau_1 \to \tau_2}|}{|R|}$ represents the ideal load balancing of tasks on the existing resources, we notice that the average load produced inside $[\tau_1, \tau_2]$ is between the floor and the ceiling of this ideal value. The ideal load is achieved by balancing tasks from slower queues to faster ones. Hence, by Proposition 3.4.1, the lateness is minimized.

Proposition 3.4.2 is of importance for our case as it shows that the lateness will be minimized even when the task's ECT is ignored when reassigning tasks.

Proposition 3.4.3 The DMECT scheduling heuristics is convergent.

Proof From the conditions defined in Section 3.4.1 we have that the sequence $s_t$ from Definition 3.4.2 is bounded by 0 and $\|T_0\|$. Furthermore, it is a descending monotonic sequence. As a result the sequence has a limit and, according to Definition 3.4.4, the SA is convergent. The proof stands both for the $C_{max}$ case (obvious) and for the $L_{max}$ case (by Proposition 3.4.2).

Proposition 3.4.3 shows that DMECT always converges to a limit. Determining this limit is however extremely difficult. Nonetheless, we can show that the algorithm converges to a value close to the optimal one (it is asymptotically optimal) in cases in which a large number of tasks is considered and their granularity $g \to 0$. Granularity redefines the task characteristics as follows:

$$T_{size}^g = g \times T_{size}, \qquad T_\omega^g = g \times T_\omega, \qquad n^g = \frac{n}{g}$$

where $T_\omega$ represents the computing requirements of the task expressed in flops, $T_{size}$ represents the task size in bytes and $n$ represents the number of tasks to be scheduled.

Proposition 3.4.4 For $g \to 0$, $C_{max} \to C_{max}^{optimal}$.

Proof We know that $C_{max}^{optimal} \le s_t$ and $EET_i^j = T_\omega^g / s_j$. Hence for $g \to 0$ we have $s_t \to 0$, because $EET_i^j \to 0$. We obtain $C_{max}^{optimal} \le s_t \to 0$.

Proposition 3.4.5 For $g \to 0$, $L_{max} \to 0$.

Proof The proof is similar to the one given in Benoit et al. (BMP+10). To summarize: $L_{max} < \max\{CT_i - DLT_i\} \le \max\{|ECT_i^j - DLT_i|\} + \varepsilon \to 0$ for $g \to 0$, where $\varepsilon$ represents the deviation of the ECT from the CT.

Proposition 3.4.6 For any $\varepsilon > 0$ there exists a $C_{max}$ such that DMECT can achieve it and $|C_{max} - C_{max}^{optimal}| < \varepsilon$.

Proof Obvious from $C_{max}^{optimal} \le s_t \to 0$.

Once the convergence of the algorithm has been established, there remains the issue of its stability. Section 3.4.3 addresses this issue from a theoretical point of view, while Sections 3.4.6 and 3.4.7 deal with it from the experimental perspective.


3.4.3 Stability of the Algorithm

We now study the stability of the algorithm and give the following proposition:

Proposition 3.4.7 Given the mathematical model and Proposition 3.4.3 DMECT is stable.

Proof As already mentioned, we consider a discrete time model where each value $s_t$ depends on the previous $s_{t-1}$ through the $\mathcal{T}$ transitions. In order to prove the stability of the SA we need to prove that, given bounded inputs, the algorithm always produces bounded outputs. This model follows the Bounded Input Bounded Output (BIBO) (ZTF98) approach, where given the input $x[t]$ and the impulse response $h[t]$, the output $y[t]$ must satisfy the relation $y[t] = h[t] \otimes x[t]$.

For our case we consider the input to be $s_t$, the impulse response to be the relation describing the time condition (cf. Relation 3.4.1) and the output to be the tasks' ECTs. In order to prove BIBO stability we need to prove that the impulse response is absolutely summable:

$$\sum_{m=-\infty}^{\infty} |h[m]| < \infty \quad (3.4.13)$$

or to show that $|y[m]| < \infty$. Proving Relation 3.4.13 is however particularly difficult for general functions such as the one in our case (cf. Relation 3.4.1). As an alternative, we can easily prove that $|y[t]| < \infty$ by Proposition 3.4.3. The proposition states that $s_t$ is convergent and consequently $|s_t|$ adheres to the same property.

For the ECT based scenarios, the $s_t$ value is nothing else than the schedule's makespan (i.e., the ECT of the last task). Because $s_t$ is bounded by 0 and $\|T_0\|$ we can infer that every task ECT is finite as well. Thus $|y[t]| < \infty$.

In the case of deadline constraint scenarios, the proof follows a path similar to the one given for Proposition 3.4.2. Moving tasks from larger queues to smaller ones implies that the smaller ones are faster in executing tasks, and consequently the tasks' ECTs are smaller and finite.

3.4.4 Physical Movement Condition for Tasks on New Queues

As mentioned in Section 3.4, we now describe the actual time when task relocation (data and logic) should occur. Moving tasks between queues after each relocation implies an overhead in bandwidth usage and may be a source of bottleneck problems. This is why tasks should be kept on their initially allocated queues until, with a high degree of probability, the task is set to execute on the assigned resource. Because we need to keep track of the queue load, we use task references during rescheduling. The references consist of the data required for rescheduling decisions (i.e., EET, size of the task including required data, additional resource requirements such as OS, memory, CPU, etc.). The threshold value for the actual relocation should be chosen such that the chance of the task being executed on the new queue is maximized. Thus we assume that after a task has been moved it will not be rescheduled on a new queue before being executed. For this to happen, the chances for other tasks to be inserted ahead of our task need to be minimized. Furthermore, the ECT of the task at the moment of the physical relocation should be as close as possible to its TC. Formally this can be expressed as in Definition 3.4.5. The interval which bounds the limit is defined in Proposition 3.4.8.

Definition 3.4.5 We define the threshold value $\varphi$ for physically moving $T_i$ to $Q_k = \max_{Q_k \in Q \setminus \{Q_j\}}(\theta_k)$, the queue where the task will execute, when the following conditions are met:

$$e^{\sigma - LWT_i^k} \searrow 1$$

$$\left(ECT_i^k - EET_i^k\right) \searrow TC_i$$

$$\forall Q_j \in Q \setminus \{Q_k\}\ \nexists\, task \in Q_j \left[(TWT_{task} > TWT_{T_i}) \wedge \left(e^{\sigma - LWT_{task}^j} < 1\right) \wedge \left(\frac{ECT_i^j}{ECT_i^k} \ge 1\right) \wedge \left(ECT_{Q_j} - EET_{T_i^j}^j \ge ECT_{Q_k} + EET_{T_i^j}^k\right)\right]$$

Proposition 3.4.8 The only admissible threshold values are located inside the interval $\varphi \in (\sigma - \delta, \sigma)$, $\forall \delta > 0$, such that the relations in Definition 3.4.5 hold and $\varphi > 0$.

Proof In what follows we assume that we have a threshold $\varphi$ as in Definition 3.4.5. Let $f(x) = e^{g(x)}$, where $g(x) = \sigma - x$ and $x = \varphi$. From Definition 3.4.5 we have that $\lim_{x \to \sigma} f(x) = 1$ and $f(x) \searrow 1$. These imply that $\lim_{x \to \sigma} g(x) = 0$ and $x < \sigma$.

From the $\varepsilon$-$\delta$ definition of the limit we obtain that $\sigma - \delta < x < \sigma + \delta$. Thus we can conclude that the threshold $\varphi \in (\sigma - \delta, \sigma)$, $\forall \delta \in [0, \sigma]$, and the conditions from Definition 3.4.5 are met.

More generally, Proposition 3.4.8 states two main facts. Firstly, $T_i$ can be physically moved to its designated queue at any time between its assignment and the moment we need to relocate it again, if and only if there are no more tasks which could be added ahead of it. Secondly, the time remaining until the start of its execution is almost equal to its transfer time.

DMECT Pseudo-code

Given the model described in Sect. 3.4.1 we can create an algorithm for a centralized dynamic SA. The algorithm relies on user estimates to compute the value of the EET, which will be used for taking task relocation decisions.

As stated in the beginning of Sect. 3.4, the decision on where to assign new incoming tasks does not affect the overall schedule result. Therefore all new tasks will be assigned to random queues. This has no influence on the SA, as all tasks which do not obey the time condition will be reassigned according to the position condition expressed in Relation (3.4.7). Each time a processor/core on a resource becomes available, the first task in the queue will be scheduled to run on it non-preemptively. The pseudocode is described in Algorithm 3.4.1.


Algorithm 3.4.1 The DMECT pseudocode

Require: T the set of newly submitted tasks
Require: Q the set of queues attached to resources
Ensure: Remark 3.4.1 and Condition 2

1:  while true do
2:    for Ti ∈ T do
3:      if Remark 3.4.4 is met then
4:        assign task to random queue;
5:        T = T \ Ti;
6:      end if
7:    end for
8:    for Qj ∈ Q do
9:      Qj must obey Remark 3.4.1;
10:   end for
11:   repeat
12:     for Qj ∈ Q do
13:       for Ti ∈ Qj do
14:         if it is time to move Ti by Condition 1 then
15:           find new Qk based on Condition 2;
16:           find position to insert Ti such that Remark 3.4.1 is met;
17:           insert Ti in the found position;
18:         end if
19:       end for
20:     end for
21:   until Condition 1 is not met ∀Ti∀Qj (Ti ∈ Qj)
22: end while

Proposition 3.4.9 The complexity of DMECT is O(m × n).

Remark The complexity of DMECT can be reduced to O(ln(m) × n) if a self-balancing binary tree such as an AVL or a Red-Black tree is used for ordering the queues by their ECT.

3.4.5 Population based DMECT and MinQL

In what follows we study the behaviour of DMECT and MinQL when a population based scheduling selection is used. Population based SAs are typically genetic algorithms, and since Braun et al. (BSB01) have shown that they can generate good solutions for task scheduling problems, a lot of progress has been made (cf. Sect. 2.2.1). DMECT and MinQL could also benefit from this approach by running multiple scheduling scenarios in parallel and selecting the best one to be applied to the system.

The construction of a (sub-)optimal schedule is usually based on creating an initial schedule which is then iteratively improved. Two decisions need to be taken in the initial construction phase: (1) the order in which the tasks are assigned to processors; and (2) the criterion used to select the processor corresponding to each task. Table 3.4.3 presents several strategies for taking these decisions. Each of these strategies generates initial schedules with a specific potential of being improved. Therefore it would be beneficial to use not just one strategy but a population of initial schedules constructed through different strategies.

Table 3.4.3: Characteristics of the strategies used to construct initial schedules

Task selection       Processor selection   Strategy
Random               Random                Random
Random               Min CT                OLB
Random               Min EET               MET
Random               Min ECT               MCT
Increasing min ECT   Min ECT               Min-Min
Decreasing min ECT   Min ECT               Max-Min
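As an example, the MCT row of Table 3.4.3 (random task order, minimum-ECT processor selection) can be sketched as follows. The function name is illustrative, and we assume `eet[t][j]` holds the estimated execution time of task t on processor j:

```python
import random

def mct_initial_schedule(eet, seed=0):
    """Build an initial schedule with the MCT strategy of Table 3.4.3:
    visit tasks in random order and place each on the processor giving
    the minimum completion time."""
    rng = random.Random(seed)
    n_procs = len(eet[0])
    order = list(range(len(eet)))
    rng.shuffle(order)                   # random task selection
    ready = [0.0] * n_procs              # current ECT of each processor
    schedule = [[] for _ in range(n_procs)]
    for t in order:
        # minimum-ECT processor selection
        p = min(range(n_procs), key=lambda j: ready[j] + eet[t][j])
        schedule[p].append(t)
        ready[p] += eet[t][p]
    return schedule
```

Swapping the task-ordering and processor-selection lines yields the other rows of the table; e.g., sorting tasks by increasing minimum ECT instead of shuffling gives Min-Min.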

Initial schedules usually need to be improved through mutations (e.g., moving or swapping tasks between resource queues). According to Xhafa et al. (XA10) many strategies to perturb a schedule exist. Table 3.4.4 depicts those we took into consideration, for reasons of simplicity, efficiency and randomness/greediness balance. The random move corresponds to the local move operator (Xha07) and is similar to the mutation operator used in evolutionary algorithms. The greedy move operator is related to the steepest local move in (Xha07), but with a higher greediness since it always involves the most loaded processor. The greedy swap is similar to the steepest local swap in (Xha07), but it is less greedy and less expensive since it does not involve a search over the set of tasks.

Table 3.4.4: Characteristics of the strategies used to perturb the schedules

Source                           Destination                       Strategy
Processor             Task       Processor              Task
Random                Random     Random                 -          Random Move
Most loaded (max CT)  Random     Best improvement       -          Greedy Move
Most loaded (max CT)  Random     Least loaded (min CT)  Random     Greedy Swap

Iterations on the obtained schedule population continue until either n iterations have been executed (each task has the chance to be moved) or a maximal number, gp, of unsuccessful perturbations is reached. Algorithm 3.4.2 depicts the general structure of the population based scheduler. The perturb(Si) function represents the point at which the specific SA is applied; in our case this function applies either the DMECT or the MinQL scheduling heuristics. Section 3.4.8 presents and discusses some test results for online scheduling based on this algorithm.

Applying classic scheduling heuristics as perturbation operators inside genetic algorithms differs from the traditional approaches depicted in (Xha07), where perturbation is achieved through swapping, moving or probabilistic mutations.


Algorithm 3.4.2 The general structure of the population based scheduler

1: Generate the set of initial schedules:
2: S ← {S1, . . . , SN}
3: while 〈the stopping condition is false〉 do
4:   for i = 1, N do
5:     S′i ← perturb(Si)
6:   end for
7:   S ← select(S, {S′1, . . . , S′N})
8: end while
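Algorithm 3.4.2 translates almost directly into Python; `perturb`, `select` and the stopping predicate are passed in as functions (the thesis plugs the DMECT or MinQL heuristics into `perturb`).

```python
def population_scheduler(initial, perturb, select, stop):
    """General structure of the population based scheduler (Algorithm 3.4.2).

    initial: the set {S1, ..., SN} of starting schedules.
    perturb: embeds the specific SA applied to each schedule.
    select:  chooses the next population from current and perturbed schedules.
    stop:    the stopping predicate over the current population.
    """
    population = list(initial)
    while not stop(population):
        perturbed = [perturb(s) for s in population]
        population = select(population, perturbed)
    return population
```

As a toy usage, scalar "schedules" with a decrementing perturbation and an elementwise-min selection converge to the stopping threshold.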

3.4.6 DMECT Tests Using EET Approximations

The following tests aim to offer insight into DMECT's behaviour in cases in which EET based scheduling is used. The EET can be: (1) supplied by the user; (2) deduced from historical data (ABC+04; SFT98); or (3) forecast (JH08; JHY08) based on future system configurations. The first part of this section describes the testing scenarios, while the second details and analyses the obtained results.

Test Scenarios

Simulations of DMECT using a value for σ as described in Relation 3.4.2 have been run on two sub-platforms: (Scenario 1) part of Grid'5000 with h = 2 and (Scenario 2) part of DAS-3 with h = 18. Four additional variations of σ (cf. Condition 2 in Sect. 2.1.1) have also been tested in order to study their influence on the resulting makespan:

• DMECT2 : sets σ = 1;

• DMECT3 : uses a priority approach (cf. Relation 3.4.3);

• DMECT4 : takes into consideration the minimum EET found in the current queue;

• DMECT5 : sets σ to be equal to the task's current ECT.

As discussed in Sect. 1.2 user estimates influence the outcome of the SA. Therefore we used a testing scenario in which tasks are divided into classes. Each class has an average EET value and a standard deviation which are used for generating normally distributed random values. Every generated task belongs to one class Cn and has an EET value drawn from the normal distribution N(µCn , σCn). To further refine the results each test was repeated 20 times and the average of the outputs was taken as the final result.
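The class-based EET generation described above can be sketched as follows. The clamping of (rare) negative normal draws to a positive floor is an assumption, not something stated in the thesis.

```python
import random

def generate_task_eets(n_tasks, classes, seed=None):
    """Assign each task to a random class Cn and draw its EET from
    N(mu_Cn, sigma_Cn), as in the DMECT test scenarios.

    classes: list of (mu, sigma) pairs, one per task class.
    Negative draws are clamped to a small positive floor (assumption).
    """
    rng = random.Random(seed)
    eets = []
    for _ in range(n_tasks):
        mu, sigma = rng.choice(classes)      # pick the task's class Cn
        eets.append(max(1.0, rng.gauss(mu, sigma)))
    return eets
```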

The five DMECT flavours were tested against versions of the Min-Min, Max-Min, Suffrage and Round Robin SAs that have been adapted to support ageing (as described in Sect. 2.2.1). For makespan we also considered MinQL as comparison. Tests used batch scheduling, with each individual test consisting of batches of 10 up to 500 tasks.


Test Results

During testing several schedule characteristics were studied: makespan (cf. Figs. 3.4.2(a) and 3.4.2(d)), gain (cf. Tables 3.4.5 and 3.4.6), compactness (cf. Figs. 3.4.2(b) and 3.4.2(e)) and runtime (cf. Figs. 3.4.2(c) and 3.4.2(f)).

Results have shown that in Scenario 1 the DMECT flavour performed better in all cases, providing an average gain of 1.79 (cf. Table 3.4.5) against the second best, Suffrage. Regarding compactness (cf. Fig. 3.4.2(b)) an average value of 0.6 was obtained for DMECT during tests. This shows that DMECT tends to offer a solution at a time when its largest queue has an ECT of almost double that of its smallest queue. Among the DMECT flavours DMECT3 performed best in these tests and offered an average value of 0.81. DMECT also performed well with regard to the overall time required for assigning the submitted tasks (cf. Fig. 3.4.2(c)), as it gave the best schedule build time when compared with Suffrage, Min-Min and Max-Min.
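Under the reading suggested by the text (a compactness of 0.6 meaning the largest queue's ECT is close to double the smallest one's), compactness can be sketched as the ratio of the smallest to the largest per-queue ECT. The exact definition is given elsewhere in the thesis; this is a plausible reconstruction, not the authoritative formula.

```python
def compactness(queue_ects):
    """Hypothetical compactness metric: ratio of the smallest to the largest
    per-queue ECT. A value near 0.5-0.6 then means the largest queue's ECT
    is roughly double the smallest one's, matching the text's reading."""
    return min(queue_ects) / max(queue_ects)
```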

When considering Scenario 2 the DMECT flavour was slightly outperformed by Suffrage when makespan was considered as the optimality criterion, as can be seen from Table 3.4.6. The improvement in the behaviour of Suffrage was expected, as it tends to perform better when heterogeneous platforms are used. Nonetheless Suffrage did not perform as well in the compactness tests (cf. Fig. 3.4.2(e)), where DMECT kept an average value of 0.55. This value is similar to the one obtained during the tests performed for Scenario 1. Out of the tested DMECT flavours, DMECT3 performed best, giving an average value of 0.80 (0.81 in the Scenario 1 tests). Concerning the rescheduling time for tasks, the DMECT flavours performed best as they usually do not require reassigning the entire batch of tasks.


No. tasks   Round Robin   Suffrage   Min-Min   Max-Min   MinQL
10          1.19          0.90       0.93      1.10      1.09
50          1.84          1.59       1.75      1.89      1.81
100         2.44          1.93       2.33      2.44      2.40
150         2.47          1.92       2.40      2.51      2.44
200         2.60          1.93       2.54      2.59      2.58
250         2.68          1.86       2.61      2.65      2.63
300         2.98          2.00       2.94      2.99      2.95
350         3.22          2.07       3.16      3.20      3.18
400         3.19          1.94       3.14      3.16      3.14
450         2.91          1.70       2.87      2.89      2.88
500         3.27          1.81       3.23      3.25      3.22

Table 3.4.5: Gain of DMECT with regard to other SAs for the platform with h = 2 and 5 clusters

No. tasks   Round Robin   Suffrage   Min-Min   Max-Min   MinQL
10          1.27          0.93       0.85      1.00      0.90
50          2.44          0.78       2.22      2.37      2.08
100         3.51          0.80       3.32      3.50      3.13
150         3.66          1.04       3.57      3.67      3.30
200         3.69          1.05       3.63      3.73      3.32
250         3.64          1.00       3.62      3.68      3.32
300         3.49          0.97       3.48      3.60      3.19
350         3.69          0.94       3.70      3.77      3.38
400         3.97          0.93       4.03      4.28      3.65
450         4.05          0.81       4.14      4.40      3.72
500         4.50          0.81       4.60      4.90      4.13

Table 3.4.6: Gain of DMECT with regard to other SAs for the platform with h = 18 and 5 clusters


[Figure 3.4.2 consists of six plots; only the panel captions are recoverable:]
(a) Scenario 1: Makespan comparison on platforms with h = 2
(b) Scenario 1: Compactness comparison on platforms with h = 2
(c) Scenario 1: Avg. schedule runtime comparison on platforms with h = 2
(d) Scenario 2: Makespan comparison on platforms with h = 18
(e) Scenario 2: Compactness comparison on platforms with h = 18
(f) Scenario 2: Avg. schedule runtime comparison on platforms with h = 18

Figure 3.4.2: Comparison of various DMECT flavours with other SAs


3.4.7 DMECT Tests Using Lateness

The aim of the tests conducted under deadline constraints was both to study the efficiency of DMECT and to determine the number of required task relocations before a task is executed. Deadline constrained situations are used in cases where either users cannot provide task EETs or some QoS is required. The first part of this section describes the testing scenario, while the second presents and analyses the results.

Test Scenarios

Two deadline based DMECT flavours have been compared with Min-Min, Max-Min, Suffrage, Round Robin and MinQL, as well as with an EET based DMECT flavour (cf. Relation 3.4.2). The two deadline based DMECT flavours rely on Relation 3.4.6 (DMECT-deadline) and on a fixed forced relocation attempt (for all tasks) at each rescheduling interval (DMECT2).

Out of the rest of the tested candidates, Round Robin and MinQL are the only ones which do not rely on task EET. Min-Min, Max-Min and Suffrage were modified versions of the originals which take ageing into account.

Each individual test consisted of batches of 10 up to 500 tasks, with actual execution times generated from a normal distribution N(µCn , σCn), where Cn represents a class of tasks, each class having a certain mean and standard deviation for the execution time. For the tests 10 task classes have been used, generated with σCi = i × 10 and µCi = i × 150 where i = 1, 10. Each task is then assigned to one of these classes by generating an execution time with the corresponding normal distribution. In order to compute the task lateness a deadline must be specified for each task. This is also generated with a given normal distribution and two scenarios have been proposed: (1) with values close to the task's ET, computed as N(EET, 50) (Scenario 1), and (2) with values far from it, given as N(10000, 500 · Ci) (Scenario 2).
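The two deadline scenarios, together with a plausible lateness definition (completion time minus deadline, which matches the negative values visible on the lateness plots), can be sketched as follows; `cls` stands for the task's class index Ci, and the lateness formula is an assumption rather than a quoted definition.

```python
import random

def draw_deadline(eet, cls, scenario, rng):
    """Draw a task deadline: close to the task's EET (Scenario 1) or
    far from it (Scenario 2), as in the test setup."""
    if scenario == 1:
        return rng.gauss(eet, 50)           # N(EET, 50)
    return rng.gauss(10000, 500 * cls)      # N(10000, 500 * Ci)

def lateness(completion_time, deadline):
    """Lateness: how long after its deadline a task finishes
    (negative when the task finishes early) -- assumed definition."""
    return completion_time - deadline
```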

For each testing scenario three distributed platforms with different heterogeneity factors have been used: Grid'5000 with h = 2, DAS-3 with h = 18, and GridPP with h = 80. Each of these platforms is made up of 5 clusters with 5 nodes each.

Test Results

Six batches of tests have been performed. Each batch has been repeated 20 times and its average results have been taken into consideration. During the tests we were primarily interested in task lateness and the average number of task movements between queues until their actual execution. Tests have shown (cf. Figs. 3.4.3(a) to 3.4.3(f)) that when lateness is considered Round Robin performs best during the six batches. MinQL fails to provide good lateness when considering platforms with h = 18 and h = 80 during Scenario 2 (cf. Figs. 3.4.3(e) and 3.4.3(f)). This leads us to assume that MinQL works well with homogeneous platforms but fails to offer good lateness in heterogeneous ones. Out of the three DMECT flavours DMECT performed best while DMECT-deadline obtained the third best lateness. However this flavour uses task EET and so it is difficult to use in SOA or CAS based environments. Min-Min, Max-Min and Suffrage failed to provide a good lateness in any batch; Max-Min always provided the worst result.

When considering the average number of task movements between queues we have taken into consideration exclusively the DMECT flavours together with MinQL and Round Robin, due to the poor lateness results given by the other three SAs. Round Robin, which gave excellent results during the lateness tests, provided the worst results here, while MinQL was by far the best candidate (cf. Figs. 3.4.4(a) to 3.4.4(f)). The three DMECT flavours provided relatively similar results. The reason for the surprising behaviour of Round Robin is that at each rescheduling tasks are reassigned using a ring approach, so the probability of tasks ending up on different queues is high. In contrast MinQL only moves the tasks from the larger queues to the smaller ones, while the three DMECT flavours move tasks only when Relation 3.4.6 is satisfied.

As far as the overall results are concerned, the three DMECT flavours have proved to be the most stable because their behaviour has been relatively similar throughout the tests. While in the makespan tests performed in Sect. 3.4.6 the DMECT flavour was best, in the tests considering lateness and the average number of task movements DMECT2 was the winner, with DMECT-deadline close behind it. As mentioned, DMECT2 uses a simple time constraint where the value of σ is always set to a fixed number for every task. In our tests this value was set to 200 ms. The reason for selecting a small value was that we wanted a scenario in which tasks would be relocated fast, without having to wait long for the relocation to proceed. From these tests we can conclude that, as far as DMECT is concerned, the best lateness results are offered by flavours where the time constraint is as small as possible (i.e., the limit is the rescheduling interval) and not by ones such as DMECT-deadline which use a time constraint that decreases as the task's TWT increases. The small number of task movements observed for DMECT2 can be explained by the fact that, due to the selection condition (cf. Relation 3.4.8), only a small fraction of the tasks get relocated at each rescheduling. This observation is also valid for MinQL, which relies on the same selection condition.

3.4.8 Population Based Tests for DMECT and MinQL

The main aim of the numerical tests was to analyse whether or not population based versions of DMECT and MinQL can improve schedule quality with an acceptable loss in scheduling time. The results were presented in (ZFZ11).

Test Scenarios

The population based tests considered that task EETs follow a Pareto distribution with α = 2. The task arrival rate is modelled based on statistical results extrapolated from real world traces (Fei10). A total of 500 tasks were generated for every test. Rescheduling was done every 250 ms given a minimal execution time of 1,000 ms. All tests were repeated 20 times in order to collect statistics.
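Sampling Pareto-distributed EETs with α = 2 can be sketched as below; tying the scale (minimum value) to the 1,000 ms minimal execution time is an assumption made for illustration.

```python
import random

def pareto_eet(n_tasks, alpha=2.0, scale=1000.0, seed=None):
    """Draw task EETs from a Pareto distribution with shape alpha = 2,
    as in the population based test scenarios. paretovariate() returns
    values >= 1, so `scale` acts as the distribution's minimum."""
    rng = random.Random(seed)
    return [scale * rng.paretovariate(alpha) for _ in range(n_tasks)]
```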


Several dynamic scheduling heuristics with ageing have been tested against their corresponding population based versions, which were constructed by using the specific scheduling heuristics as perturbation operators in Algorithm 3.4.2. Their behaviour has also been compared with a Simple Population Scheduler (SPS) based on a non-iterated hybrid perturbation: at each perturbation step in Algorithm 3.4.2 the hybrid perturbation is applied only once (cf. Algorithm 3.4.3). SPS uses the perturbation as its key operator. This hybrid perturbation has a structure similar to the re-balancing mutation described in Xhafa et al. (Xha07). However there are some differences between them. In (Xha07) the swap perturbation is applied before the move perturbation, while in the hybrid perturbation the order is reversed. This apparently minor difference influences the overall cost of the perturbation, as applying the move operation is less costly than applying swap and it can induce a larger gain in the makespan.

Among the online scheduling algorithms we tested a flavour of DMECT, MinQL, Min-Min, Max-Min and Suffrage.

The population variants of DMECT (pDMECT) and MinQL (pMinQL), as well as SPS, use a population of 25 elements initialized both with random schedules (60%) and by using the Min-Min heuristics (40%). The schedules corresponding to the next iterative step (i.e., generation) are selected from the sets of current and perturbed schedules using a binary tournament approach: the schedule with the smallest makespan from a randomly selected pair of schedules is selected. To ensure elitism, the best element of the population is preserved. An analysis of the role of crossover in generating good schedules illustrates that no significant gain is obtained by using crossover (at least uniform and one cut-point crossover). The procedure stops when an improvement in Cmax of at least 10% is no longer noticed after a given number of iterations (e.g., 600).
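The binary tournament selection with elitism can be sketched as follows. Pairing over the union of current and perturbed schedules is an assumption about a detail the text leaves open; `makespan` is whatever function evaluates a schedule's Cmax.

```python
import random

def tournament_select(current, perturbed, makespan, rng=random):
    """Binary tournament over the union of current and perturbed schedules:
    each slot is filled by the smaller-makespan schedule of a randomly
    drawn pair; the overall best schedule is always preserved (elitism)."""
    pool = current + perturbed
    best = min(pool, key=makespan)
    nxt = [best]                                   # elitism
    while len(nxt) < len(current):
        a, b = rng.sample(pool, 2)                 # random pair
        nxt.append(a if makespan(a) <= makespan(b) else b)
    return nxt
```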

Test Results

Table 3.4.7 presents the main benefits of the population based scheduling heuristics (pDMECT and pMinQL) when used in online scheduling with Cmax as the optimality criterion. Both pDMECT and pMinQL obtained significantly better results than their non-population variants, with pDMECT having a behaviour similar to SPS; the best values in Table 3.4.7 are bold-faced and they were validated using a t-test with 0.05 as the level of significance. The only notable difference in the behaviour of pDMECT and SPS was that of speed: pDMECT required almost 30 s to build a schedule, while the simple population based scheduler needed only three seconds on average. The reason why the population based versions of DMECT and MinQL perform better (i.e., a gain of 1.34 and 1.40 respectively) than the non-population variants is that in a population based environment a wider range of sub-optimal schedules is explored. This is possible due to the initialization phase. The large amount of time required by pDMECT and pMinQL is explained by the complexity of the algorithms, the population size and the number of iterations.


Algorithm 3.4.3 The Hybrid Perturbation

1: i ← 0; fail ← 0
2: while i < n and fail < gp do
3:   i ← i + 1
4:   if GreedyMove(S) is successful then
5:     fail ← 0; S ← GreedyMove(S)
6:   else
7:     if GreedySwap(S) is successful then
8:       fail ← 0; S ← GreedySwap(S)
9:     else
10:      fail ← fail + 1
11:      if random(0, 1) < pm then
12:        S ← RandomMove(S)
13:      else
14:        S ← RandomSwap(S)
15:      end if
16:    end if
17:  end if
18: end while
19: return S
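Algorithm 3.4.3 can be transcribed into Python as below, with one assumption: the greedy operators are modelled as returning a `(new_schedule, improved)` pair, which stands in for the pseudocode's "GreedyMove(S) is successful" test.

```python
import random

def hybrid_perturb(s, n, gp, pm, greedy_move, greedy_swap,
                   random_move, random_swap, rng=random):
    """Hybrid perturbation (Algorithm 3.4.3): try a greedy move first, then
    a greedy swap; on failure apply a random move (with probability pm) or
    a random swap. Stops after n iterations or gp consecutive greedy
    failures. Greedy operators return (new_schedule, improved)."""
    i = fail = 0
    while i < n and fail < gp:
        i += 1
        new, ok = greedy_move(s)
        if ok:
            fail, s = 0, new
        else:
            new, ok = greedy_swap(s)
            if ok:
                fail, s = 0, new
            else:
                fail += 1
                s = random_move(s) if rng.random() < pm else random_swap(s)
    return s
```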

Table 3.4.7: Average makespan obtained by online scheduling heuristics and their population based variants

            DMECT      pDMECT       MinQL      pMinQL      SPS        MaxMin     MinMin     Suffrage
Cmax (ms)   66556      49409        76564      54332       46996      61165      68774      74244
            ± 15097    ± 9522       ± 18114    ± 9891      ± 8812     ± 11936    ± 15101    ± 18783
Time (ms)   66.56      28343.04     3.06       2254.64     2777.70    684.49     669.21     606.27
            ± 15.50    ± 10702.15   ± 2.52     ± 314.45    ± 578.22   ± 242.15   ± 209.99   ± 154.79


[Figure 3.4.3 consists of six plots; only the panel captions are recoverable:]
(a) Scenario 1: Comparison against other SAs on platforms with h = 2
(b) Scenario 1: Comparison against other SAs on platforms with h = 18
(c) Scenario 1: Comparison against other SAs on platforms with h = 80
(d) Scenario 2: Comparison against other SAs on platforms with h = 2
(e) Scenario 2: Comparison against other SAs on platforms with h = 18
(f) Scenario 2: Comparison against other SAs on platforms with h = 80

Figure 3.4.3: Lateness comparison for deadline constrained DMECT


[Figure 3.4.4 consists of six plots; only the panel captions are recoverable:]
(a) Scenario 1: Comparison against other SAs on platforms with h = 2
(b) Scenario 1: Comparison against other SAs on platforms with h = 18
(c) Scenario 1: Comparison against other SAs on platforms with h = 80
(d) Scenario 2: Comparison against other SAs on platforms with h = 2
(e) Scenario 2: Comparison against other SAs on platforms with h = 18
(f) Scenario 2: Comparison against other SAs on platforms with h = 80

Figure 3.4.4: Average no. of task movements comparison for deadline constrained DMECT


3.5 Conclusions

The work depicted in this chapter focused on two SAs designed to work in both online and offline (batch) scheduling for heterogeneous DS. Section 3.3 presented a load balancing algorithm called MinQL. The algorithm is based on a backfilling strategy adapted to online scheduling and heterogeneous environments. In Sect. 3.4 a generalized version called DMECT was presented. The algorithm differs from existing non-backfilling EET based policies (CLZB00; MAS+99) as it considers per task relocation, instead of a global reassignment, when rescheduling. In this direction the algorithm proposes a new customizable method for determining the moment when to reschedule certain tasks. This moment can be selected a priori based on certain policies or at runtime based on system characteristics.

The mathematical formalism behind the algorithms has proven that the SAs are stable and always converge decreasingly to a better Cmax value. It has also shown that, given a small granularity, the DMECT algorithm is asymptotically optimal. Contrary to other algorithms we also proposed a relocation condition which prevents a task from being constantly relocated until the chances of executing it on the assigned resource are maximized.

Testing has been done based on the simulation environment described in Sect. 3.1. The designed SP offers several advantages over existing ones:

• customizable plug-and-play algorithms;

• customizable transfer and execution cost models;

• ability to be used as both simulator and actual SP without modifying the SA or theplatform;

• both event driven and periodical rescheduling policies;

• the possibility to define custom scenarios ranging from workstation scheduling to Cluster, Grid, Cloud and Sky Computing3.

Both MinQL and DMECT have been tested in batch environments (cf. Sects. 3.3.2, 3.4.6 and 3.4.7) as well as in online cases (cf. Sect. 3.4.8).

Overall the performed tests have shown that:

• for MinQL, using conditions related to CPU speed when relocating does not produce significant improvements; the reason for this is the periodical queue rebalancing;

• when considering Lmax as the optimality criterion MinQL performs better in homogeneous environments, while DMECT gives better results when the relocation condition does not consider the deadline constraint. In all tests both DMECT and MinQL outperformed the classic Min-Min, Max-Min and Suffrage algorithms;

3Sky Computing is a form of Cloud Computing where the services are offered by multiple providers


• DMECT and MinQL are more stable, in terms of Cmax, under the considered scenarios than the rest of the tested SAs. Their stability does not however guarantee that they will also obtain the best cost value under the tested scenarios. It only states that scheduling results are relatively the same, with no large deteriorations or improvements under the various test cases. This is especially important when considering volatile DS environments, as it guarantees that the schedules produced by DMECT do not fluctuate;

• when using population based schedulers that use classic SAs as perturbation factors, the results are drastically improved. Furthermore, the results of the population based DMECT are comparable with those of an enhanced genetic algorithm.

4 RULE BASED FORMALISM FOR EXPRESSING SCHEDULING HEURISTICS

Contents

4.1 Language and Platform for Executing SAs . . . . . . . . . . . . . . . . . 91

4.1.1 Rule Based Language . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.1.2 Executing SiLK rules: the OSyRIS engine . . . . . . . . . . . . . . 98

4.1.3 Distributed OSyRIS . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.1.4 Auxiliary Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.2 Expressing Scheduling Heuristics Using a Rule Based Approach 109

4.2.1 Representing SAs Using Chain Reactions . . . . . . . . . . . . . . 113

4.2.2 Distributing the Scheduling Heuristics . . . . . . . . . . . . . . . . 117

4.2.3 Test scenarios and results . . . . . . . . . . . . . . . . . . . . . . . 122

4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Distributed SAs offer several advantages centralized algorithms do not have. Firstly, distributed SAs allow increased failure tolerance through data distribution by eliminating the single point of failure issue. Secondly, they avoid the possible bottlenecks exhibited by centralized scheduling heuristics. These bottlenecks are often caused by the complexity of the algorithms, which is usually linear with regard to the number of tasks and resources. Thirdly, distributed SAs allow multi-institutional VOs (VO federations) to handle parts of the SA in accordance with their own requirements (e.g., security and access policies). Still, distributed SAs come with disadvantages too. Among them we emphasize the increased network traffic and the possible data inconsistencies due to the outdated view of the resource information the distributed algorithm shares.

Distributed SAs should not be confused with distributed RMSs. The former rely on distributing the scheduling heuristics themselves, while the latter are characterized by the distribution of the scheduling mechanism. An RBS, distributed or centralized, can be used to execute the distributed SA.

To facilitate the distribution and management of the SA the algorithm's logic needs to be separated from its data. Rule Based Systems (RBS) offer a solution to this problem by allowing the application logic to be expressed as inference rules which can be managed (loaded, unloaded, modified) by the system functionality. In an RBS the rules are separated from the data (i.e., objects). Inference rules permit the SA to be easily modified by adding and retracting rules without any need for further application modifications.

As the *aaS paradigm becomes more widely used, it is only natural for distributed SAs to take advantage of it. Hence the data needed by an SA could be provided by means of a remotely located service. In addition, the distributed components of the SA could also be provided through specialized service interfaces. Thus members of a VO federation could provide distinct functionalities which, put together, enact the distributed SA. Exposing functionality as services allows the SA to become even more customizable, as the functionality of the scheduling heuristics can be changed without altering the algorithm itself. As long as the interfaces exposing the scheduling logic remain the same, the global SA that accesses them remains unchanged.

The first step in representing an SA using rules is to select a suitable language. Many rule based implementations exist: Prolog (pro10), CLIPS (cli10), JESS (jes10), OPS5 (ops10), etc. An RBS usually consists of the following: a rule base, an inference engine, a temporary working memory and a user interface for connecting the system to the outside world. However, since SAs usually require rules to be chained together, a Rule Based Workflow System (RBWS) seems a more appropriate choice. RBWSs usually follow an Event Condition Action (ECA) paradigm. Examples of RBWSs include AgentWork (MGR04), VIDRE (NRD06), FARAO (WvdHH08) and Drools Flow (Bal09).

Rule based SAs also need to be easily distributed if we want to take advantage of multi-provider distributed environments. Existing RBWSs such as the ones previously listed are hard, if not impossible, to distribute. Fortunately the γ-calculus, inspired by a chemical paradigm and introduced by Banatre et al. (BLM93), offers a natural way of distributing workflows based on chemical solutions. The formalism offers a non-deterministic and inherently concurrent environment. However, the model is hard to implement in any computer language and no working solution exists. Because of this we felt the need to simplify it and create a simpler language and an execution platform for it. Section 4.1 describes them in detail.

Based on the designed language, the formalism together with some examples and tests are presented in Sect. 4.1.1. The first problem that needed to be solved was to identify the main atoms that make up each SA. Starting from them a structure common to all algorithms could be devised. This is useful especially when designing new scheduling heuristics. Once these atoms had been identified the rules making up the SA were created and distributed. Because of the problems arising from concurrent access when scheduling, special attention was given to this issue as well.

The chapter ends with the essential conclusions of the work (cf. Sect. 4.3).

The main results presented in Sect. 4.1 have been published in Frıncu and Petcu (FP10),Frıncu and Craciun (FC10), Frıncu (Fr1) and Frıncu et al. (FPNP09). Frıncu (Fr0a) hasoutlined the results from Sect. 4.2.


4.1 Language and Platform for Executing SAs

This section presents the rule based language designed to allow SAs to be expressed as inference rules (cf. Sect. 4.1.1). The language has been named SiLK (Simple Language for worKflows) and is based on the γ-calculus introduced by Banatre et al. (BLM93). Because SiLK permits rules to be chained, it essentially creates scientific workflows; the language is thus in fact a workflow language. The inference engine used to execute rules, called OSyRIS (Orchestration System using a Rule based Inference Solution), is introduced in Sect. 4.1.2. As will be further discussed, the engine relies on Drools (Bal09) to trigger rule firing based on existing facts. Because one of the aims of this chapter is to obtain a distributed SA as well as a distributed SP, a decentralized version of the engine, D-OSyRIS, is detailed in Sect. 4.1.3. D-OSyRIS allows rules to be distributed across several engines and to be executed without any central coordination. Section 4.1.4 presents some auxiliary tools such as a workflow rule generator, a rule extractor and an execution overseer. These tools were developed in order to ease the handling and supervision of SiLK workflow execution.

4.1.1 Rule Based Language

As shown in Sect. 4, RBWSs are the natural choice for depicting SAs. ECA based engines allow an alternative declarative approach by introducing (WvdHH08): languages having intuitive formal semantics through the use of a limited set of primitives; direct support for business and science policies; flexibility, by allowing self-adaptation through the use of rules which permit alternative execution paths; adaptability, through the insertion and/or retraction of rules; and reusability, thanks to the rules' property of being isolated from the process context. Although non-ECA approaches can also be modified to incorporate these advantages, ECA solutions are preferred due to their concept of rule based programming. Non-ECA engines usually have limited predefined and built-in control constructs such as sequence, parallel, split, join, loops or events. In contrast, ECA engines have one single built-in control construct, namely the inference rule, with all the other constructs naturally deriving from it. ECA approaches also offer other advantages such as separation of logic (rules) from data (objects), declarative programming, scalability and centralization of knowledge.

Because of the previously mentioned advantages of ECA based solutions, we favoured such an approach when designing a model for executing distributed SAs.

Chemical Formalism

It is only recently that attention has been given to nature inspired workflows, although their advantages include non-deterministic rule firing and implicit parallelism.

Among the few formalisations of nature inspired workflow enactment we notice the work of Banatre et al. (BFR07) and Nemeth et al. (NPP05; NPP06). Their work relies on a chemical metaphor based on the Gamma (General Abstract Model for Multiset Manipulation) (BLM93) language. Gamma follows the chemical model in which there is

92 Marc E. Frıncu

no concept of centralization, serialization or ordering, and the computations are accomplished non-deterministically. γ-calculus is a formal definition of the Gamma paradigm. Its fundamental structure is represented by the multiset M. Terms (or molecules) are represented by: variables x, γ-abstractions (reactive molecules) γ〈x〉⌊C⌋.M (where C is the condition required for reducing the abstraction), multisets Mi and solutions 〈M〉.

The reduction of the γ-abstraction is performed by reactions inside a solution 〈M〉. The reduction simply means capturing molecules that match given parameters (variables) and replacing them with instantiated values inside the specified solution. More specifically, γ〈x〉⌊C⌋.M means replacing every occurrence of x in solution 〈x〉 by M if C holds.
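As a concrete aside illustrating the Gamma paradigm (a sketch, not part of OSyRIS), consider the classic "maximum" reaction γ〈x, y〉⌊x ≤ y⌋.{y}, which repeatedly captures any pair of molecules and replaces it with the larger one; because the final result is the same whichever pair is captured, the order of reduction is irrelevant, mirroring Gamma's non-determinism:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a Gamma-style reduction: molecules are integers,
// and the reaction gamma<x, y>[x <= y].{y} replaces any pair by its maximum.
public class GammaMax {
    // Repeatedly apply the reaction until a single molecule remains;
    // which pair is captured is irrelevant, so taking the first two suffices.
    public static int reduce(List<Integer> multiset) {
        List<Integer> m = new ArrayList<>(multiset);
        while (m.size() > 1) {
            int x = m.remove(0);
            int y = m.remove(0);
            m.add(Math.max(x, y)); // the reaction product re-enters the solution
        }
        return m.get(0);
    }
}
```

Running `reduce` over the multiset {3, 7, 1, 9, 4} yields 9 regardless of capture order.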

Using γ-calculus, a task execution can be modelled as:

γ〈Ti : x, ωT〉.(γ〈id : r, Res, ωRes〉.execute Ti using x on r)    (4.1.1)

Based on the previously described Gamma formalism the following workflow notation can be introduced:

W = (V, µ,M1, ...,Mn, (R1, ρ1), ..., (Rn, ρn), s0) (4.1.2)

where: V is the set of molecules (i.e., tasks); µ represents the solution structure containing all the other n solutions; Mi, i = 1, n, stand for the multisets over V belonging to the n solutions; Ri symbolizes the reactive molecules (or simply the reactions); ρi, i = 1, n, represent priorities among the reactions; and s0 is the outermost solution.

Given the previous workflow notation (cf. Relation 4.1.2) we easily notice that the definition closely resembles that of the P-System (ZFM01). The sole major difference is that chemical solutions are used instead of cellular membranes.

Although Relation 4.1.1 offers a representation for task execution within chemical solutions, it is difficult to implement an easy-to-use computer language based on it. Caeiro-Rodriguez et al. (CRNP08) have proposed a CLIPS based solution which follows the Gamma formalism. However, calling WS from CLIPS when firing rules is not an easy task. Hence we further refined the rule representation and chose a RBS which can easily trigger remote task execution through WS or any other means. The RBS we opted for is called Drools Expert (Bal09). It uses a modified object oriented version of the RETE (For90) algorithm for rule matching and is based on the Java programming language.

The concept of conditional execution has been simplified through the use of a function similar to the one presented in Relation 4.1.3:

ε : M*i −(C,ρi)→ Mi    (4.1.3)

where M*i = Mi ∪ {TI} and task TI is an initialization task (fact) required by the engine in order to start the workflow execution by creating the instance of the first actual task (fact). By using Relation 4.1.3, workflows can also be defined as transitions between different tasks, with ε being a transition function.

Having a set of reactions is however not sufficient, as we need to ensure that we have a complete chaining which leads to the desired goal. Consequently the notion of consistent workflow can be defined as in Definition 4.1.1:


Definition 4.1.1 A workflow in which the reaction chain reaches its goal is called consistent.

Any workflow not obeying the previous definition is, under our assumption, inconsistent.

Proposition 4.1.1 Given a consistent workflow, the previously defined ε function is surjective.

Proof Obvious, as given a consistent workflow each task in the co-domain Mi has at least one predecessor task from the domain M*i.

The previous proposition is useful both for workflow validation using a backwards chaining method and for generating workflows by starting from a given goal.
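As an illustration of such a backwards-chaining validation (a sketch with a hypothetical rule encoding, not the actual OSyRIS validator): represent each rule by the set of LHS (predecessor) tasks that produce a given RHS task, then walk backwards from the goal and require every visited task to be eventually producible from the initialization task TI:

```java
import java.util.*;

// Sketch of backwards-chaining workflow validation: a workflow is consistent
// when every task on a path back from the goal is produced by some rule,
// bottoming out at the initialization task "TI" (an assumed sentinel name).
public class WorkflowValidator {
    // predecessors: RHS task -> set of LHS tasks producing it (hypothetical encoding)
    public static boolean isConsistent(Map<String, Set<String>> predecessors, String goal) {
        Deque<String> pending = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        pending.push(goal);
        while (!pending.isEmpty()) {
            String task = pending.pop();
            if (task.equals("TI") || !visited.add(task)) continue;
            Set<String> preds = predecessors.get(task);
            if (preds == null || preds.isEmpty()) return false; // dead end: no producing rule
            preds.forEach(pending::push);
        }
        return true;
    }
}
```

The same traversal, run forward from TI, could serve as the goal-driven workflow generator mentioned above.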

Given a solution which can contain several sub-solutions, reactions occur in a parallel and non-deterministic way. They also proceed automatically to the next surrounding solution once all reactions within the current one have finished.

Next we focus on describing the SiLK language and introduce to that extent a simple notation – based on the ε function – that links ECA approaches to the Gamma formalism.

Simple Language for worKflows

Workflow languages should not be verbose but should have a simple syntax, so that inessential syntactic overhead is avoided. They also need to be expressive in order to cope with a wide range of constructs and situations without having to introduce new constructs. SiLK is the result of several design attempts involving both XML (WSB07) and rule based workflow languages (Bal09). The initial goal was to develop a simplified language for handling image processing requests (FPNP09) without losing its applicability to other fields, including that of scheduling problems.

XML based approaches (vdAtH05; WSB07) ignore syntax simplicity, as they add excessive, unnecessary information and offer a limited set of control constructs, making them inherently verbose. Therefore it is the user who has to modify both the language and the parser whenever new constructs are required. Alternatively, ECA languages – such as the Higher Order Chemical Language (HOCL) (NPP05), whose syntax we found too complicated for actual user oriented usage and more suitable for formalism description – are better suited as they separate logic from data.

SiLK offers elementary constructs such as sequence, parallel, split, join, decision or loop (cf. Fig. 4.1.1).

Besides these constructs it also allows tasks to be defined. Tasks are the correspondents of molecules within chemical reactions and are viewed as black boxes having some mandatory attributes and an arbitrary number of optional meta-attributes. To maintain the generality of the language, the only mandatory attributes of a task are its input/output ports and a special meta-attribute which will be addressed in greater detail in what follows. Each task belonging to a workflow must have at least one input and one output port defined. Besides these attributes, users can also define workflow specific meta-attributes. Meta-attributes are defined as "name"="value" pairs.

94 Marc E. Frıncu

Figure 4.1.1: Several types of elementary workflow constructs: (a) sequence, (b) split, (c) parallel, (d) join, (e) decision, (f) loop

The following code fragment shows how we can define a task with one input initialized with a predefined value, one output port and one meta-attribute:

A := [i1:input="0", o1:output, "processing"="getNextTaskId"];

A value may be stored inside a meta-attribute or used as the initial value of a certain port. It can contain either plain text data to be sent directly to the service, or a reference (e.g., an ID) to a database (e.g., table, element) or file (e.g., line) which is interpreted by the custom Executor class as explained in Sect. 4.1.2. The value of the reference is either forwarded to the service for handling or stored in a database.

Another use for meta-attributes is to define and use task semantics and ontologies. Ontologies require users to define a set of concepts – tasks in our case – the attributes related to them, and the relationships between them. SiLK allows task semantics and ontologies to be easily defined by using task attributes, meta-attributes and task relationships expressed through inference rules.

A SiLK file represents not only a workflow formalization but also an actual task ontology. In addition to this, users can also embed relationships within task definitions as


Figure 4.1.2: Mapping between the epsilon function from Relation 4.1.3 and the SiLK rule

meta-attributes. Hence meta-attributes also allow users to define task relationships without using inference rules. This approach is however not safe, as it can lead to problems such as inconsistent workflows due to user errors. As an alternative, a backwards chaining generating module could prove more suited for the job (cf. Sect. 4.1.4).

Within a SiLK file task definitions must precede rules, as the latter are checked by the OSyRIS engine for validity against the former. More precisely, any rule containing an undefined task is considered wrong and the loading of the rule base is cancelled.

SiLK rules are defined based on the ε function defined in Relation 4.1.3. The new notation allows us to express transitions from one multiset of tasks to another as follows:

LHS -> RHS | condition , salience

where the condition and the salience are optional. The RHS (Right Hand Side) and LHS (Left Hand Side) are made up of tasks separated by commas and linked with each other through the use of variables attached to ports:

A[a=o1], B[b=o1] -> C[i1=a#i2=b]

In the previous example a and b represent variables which bind the output ports o1 of tasks A and B to the input ports i1 and i2 of task C. More than one port can be bound to a variable inside a rule, in which case the bindings are separated by the # character. If more than one task exists in the RHS then each of them gets executed in parallel by the OSyRIS engine inside a separate thread.
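For illustration, the variable-to-port wiring of such a rule can be recovered with a small parser (a simplified, hypothetical sketch that handles only key=value bindings, not the actual SiLK2Rules translator):

```java
import java.util.*;
import java.util.regex.*;

// Sketch of parsing SiLK port bindings such as "A[a=o1], B[b=o1] -> C[i1=a#i2=b]".
// Returns, for each RHS input port (as "Task.inPort"), the "Task.outPort" it is wired to.
public class SilkBindings {
    private static final Pattern TASK = Pattern.compile("(\\w+)\\[([^\\]]*)\\]");

    public static Map<String, String> parse(String rule) {
        String[] sides = rule.split("->");
        Map<String, String> varToSource = new HashMap<>();   // variable -> "Task.outPort"
        Matcher lhs = TASK.matcher(sides[0]);
        while (lhs.find())
            for (String b : lhs.group(2).split("#")) {
                String[] kv = b.split("=");                  // e.g. "a=o1"
                varToSource.put(kv[0].trim(), lhs.group(1) + "." + kv[1].trim());
            }
        Map<String, String> wiring = new LinkedHashMap<>();  // "Task.inPort" -> "Task.outPort"
        Matcher rhs = TASK.matcher(sides[1]);
        while (rhs.find())
            for (String b : rhs.group(2).split("#")) {
                String[] kv = b.split("=");                  // e.g. "i1=a"
                wiring.put(rhs.group(1) + "." + kv[0].trim(), varToSource.get(kv[1].trim()));
            }
        return wiring;
    }
}
```

Applied to the example rule above, the sketch wires C.i1 to A.o1 and C.i2 to B.o1.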

The mapping between the ε function (Relation 4.1.3) and the SiLK rule is shown in Figure 4.1.2.

Any RHS task linked to multiple variables must have each of its variables linked to different input ports. Generally it can be said that it is prohibited to have multiple incoming edges attached to a single input port. This restriction does not hold for LHS tasks, where the same output port can be linked to more than one RHS input port, as happens in a split construct.

As previously mentioned, a rule can have an optional condition and salience. The salience is represented by an integer number – negative or positive – and indicates the importance


of the rule, with larger values making rules more important and increasing their chance of execution in case multiple rules are ready for firing at the same time. The condition can be any logical statement which uses integer, floating point or string operands. Boolean values need to be expressed as strings (e.g., "true"). The following example shows a set of rules which mimic a loop construct by using conditions:

A[a=o1] -> B[i1=a];
B[b=o1] -> C[i1=b];
C[c=o1] -> A[i1=c] | c < 10;
C[c=o1] -> D[i1=c] | c >= 10;

The loop construct can also be viewed as a chain of reactions where we eventually end up with the initial reactant. This in turn triggers the entire reaction chain all over again.

Inside rules users can also manipulate task behaviours. More precisely, the behaviour of a LHS task refers to whether or not its instance gets consumed after a rule has been fired. Similarly, the number of RHS task instances can be predetermined. The default behaviour is consume=true for LHS tasks and instances=1 for RHS ones. The following example shows how we can tell the engine not to consume the LHS task and to create 5 instances of the resulting RHS task:

A[a=o1#consume=false] -> C[i1=a#instances=5];

Task instances are simply a programming construct used to represent the notion of multisets.
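The bookkeeping implied by consume and instances can be sketched as a counter-based multiset (an illustrative helper, not the engine's internal data structure): firing a rule consumes a LHS instance unless consume=false, and the requested RHS instances add to those already present:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of task-instance bookkeeping: a multiset of task instances updated on rule firing.
public class InstanceStore {
    private final Map<String, Integer> counts = new HashMap<>();

    public int count(String task) { return counts.getOrDefault(task, 0); }

    public void seed(String task, int n) { counts.merge(task, n, Integer::sum); }

    // Fires a single-LHS rule if an instance of the LHS task is available.
    // consume=false leaves the LHS instance in place; 'instances' RHS copies are added.
    public boolean fire(String lhsTask, boolean consume, String rhsTask, int instances) {
        if (count(lhsTask) < 1) return false;            // rule not activated
        if (consume) counts.merge(lhsTask, -1, Integer::sum);
        counts.merge(rhsTask, instances, Integer::sum);  // new instances add to existing ones
        return true;
    }
}
```

With one seeded instance and consume=false, firing two rules that create two, respectively three, instances of the same task leaves five instances, matching the accumulation behaviour described in the text.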

Because rules allow implicit parallelism, the concept of task instances also lets us create explicit rule sequencing. As an example, consider two rules where LHS task A has only one initial instance and each of them produces another instance of task A. In this case the rules cannot fire simultaneously and one of them needs to wait for the other to produce the necessary task instance. Newly created task instances are added to the already existing ones. Considering two rules which create two, respectively three, instances of the same task A, with no instance of it being consumed in the process, we end up with five instances of the same task. The number of task instances can also be set during task definition by using a special meta-attribute called instances which receives a numerical value. The default behaviour is to create zero instances during task definition. Task instances also allow users to specify loop constructs with a finite number of steps, such as the for loop in Java. In this case the exit condition is marked by the number of task instances left to trigger rules:


# Rule 1:
A[a=o1] -> B[i1=a#instances=10], C[i1=a];
# Rule 2:
C[c=o1], B -> C[i1=c];

In addition, task instances are also useful when considering scheduling heuristics based on task replication, as they allow multiple instances of the same task to execute on different services. For instance, the following lines of code show how five instances of the same task B are created while only one is used for executing task C:

A[a=o1] -> B[i1=a#instances=5];
B[b=o1] -> C[i1=b];

The correct destruction of the remaining four tasks needs however to be dealt with from inside the custom task Executor class.

Task instances make it easy to manipulate workflow related operations such as pause, resume, restart and abort. The reason for this is that task instances are the main rule triggers in OSyRIS.

As previously mentioned, by using a single rule construct it is possible to simulate basic workflow constructs such as sequence, parallel, split, join, single or multiple decision, synchronization, loop or timers. All of them except the timer can be constructed using variations of rules containing one or more input/output ports and conditions.

Timers are not explicitly defined in the SiLK language but can be simulated by using so-called idle services which receive as argument a numerical constant representing time in milliseconds and return the result after that particular time period.

When dealing with reactions within a solution we can also assume that we have more than one sub-solution inside the main one. Banatre et al. (BFR07) detail how reactions within such solutions occur. Starting from a sub-solution, reactions consume the task instances inside it and only then proceed to the parent solutions, triggering other reactions and so forth until all possible reactions have been fired. SiLK offers the possibility to mimic this behaviour by allowing users to specify rule domains. Each rule domain is identified by an integer number. Unless otherwise specified, rules are considered to belong by default to domain 0. At any given time only one domain can be active and only the rules inside it can execute. However, users can define transition rules where some LHS tasks belong to a domain different from the other tasks. In this case, after firing the rule, OSyRIS automatically selects the new domain – to which the RHS tasks belong – as the active one. Thus from this point on only rules in that domain will fire.
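The domain-gating behaviour can be sketched as follows (an illustrative reduction of the mechanism to its two essentials: rules outside the active domain are inert, and a firing rule switches the active domain to that of its RHS tasks):

```java
// Sketch of SiLK rule domains: only rules in the active domain may fire, and a rule
// whose RHS tasks belong to another domain switches the active domain when it fires.
public class DomainGate {
    private int activeDomain;

    public DomainGate(int initialDomain) { this.activeDomain = initialDomain; }

    public int activeDomain() { return activeDomain; }

    // Returns true if the rule fired; rhsDomain then becomes the new active domain.
    public boolean fire(int ruleDomain, int rhsDomain) {
        if (ruleDomain != activeDomain) return false; // rules outside the active domain are inert
        activeDomain = rhsDomain;                     // transition rules may switch domains
        return true;
    }
}
```

In a three-rule setup like the one discussed next, a domain-1 rule whose RHS lies in domain 2 deactivates the remaining domain-1 rules once it fires.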

Figure 4.1.3 shows three rules belonging to a workflow. The first two belong to domain 1 and the third one belongs to domain 2. Assuming that the first rule in domain 1 fires, its activation determines the selection of domain 2. Subsequently rule 3 – and not rule 2, which belongs to domain 1 – will fire. The next code fragment shows the representation of the example in SiLK:


# Rule 1 in domain 1:
1 : A[a=o1] -> B[i1=a], 2:C[i1=a];
# Rule 2 in domain 1:
1 : C[c=o1] -> D[i1=c];
# Rule 3 in domain 2:
2 : C[c=o1] -> E[i1=c];

Figure 4.1.3: Switching between rule domains

As can be noticed from the previous example, SiLK rule domains allow constructing rule sets without inducing a strict hierarchy. Hence SiLK domains are more complex than the solutions introduced by Banatre et al. (BFR07) (cf. Sect. 4.1.1), as they allow the workflow to switch freely between multiple rule sets when necessary. In contrast, the formalism introduced by Banatre et al. allows transitions between sub-solutions to take place only when there are no more reactions left in the current sub-solution.

4.1.2 Executing SiLK rules: the OSyRIS engine

OSyRIS was built on top of Drools Expert and targets the execution of SiLK rules. The engine is capable of adapting to failures and changes in the DS – both resource and network – by allowing task relocation to available services and workflow restarts from previous failure points.

The OSyRIS engine (cf. Fig. 4.1.4) can either be embedded inside any Java application or exposed as a service which can be invoked by any SOAP or REST enabled client. Submitted SiLK rules are translated by the engine into Drools rules. Any syntactic errors are handled at this point. Once the rules have been successfully converted, the Drools inference engine starts executing them.

Before executing particular workflows, two abstract classes need to be implemented.

The first class – OSyRISwf – represents the application specific extension of OSyRIS. It offers basic functionality such as: workflow execution; result queries – the result of the last executed task in the workflow; finding the output of a particular workflow task; and retrieving the workflow (or task) status. Each workflow is uniquely identified by an ID which can be used in case asynchronous workflow execution is needed. Asynchronous execution means that the workflow is either started in a different thread or exposed as a service. Users can also implement their own custom extensions according to the application's needs.
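A possible shape for such an extension is sketched below; the class and method names are illustrative assumptions, not the actual OSyRISwf API:

```java
import java.util.UUID;

// Sketch of an application-specific workflow extension in the spirit of OSyRISwf.
// All names here are illustrative; only the responsibilities mirror the text.
public abstract class AbstractWorkflow {
    // Unique workflow ID, usable for asynchronous execution and status queries.
    protected final String workflowId = UUID.randomUUID().toString();

    public String id() { return workflowId; }

    // Application-specific hook: submit the SiLK rule base and start inference.
    public abstract String execute(String silkRules);

    // Result of the last executed task in the workflow.
    public abstract String lastResult();

    // Output of a particular workflow task, by task name.
    public abstract String taskOutput(String taskName);

    // Workflow (or task) status query.
    public abstract String status();
}
```

An application would subclass this, implementing the hooks against its own services; the ID allows the workflow to be started in a separate thread or behind a service endpoint and polled later.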


Figure 4.1.4: Architecture of the OSyRIS platform

The second class – Executor – handles application specific details, including discovering proper resources for task execution and the communication between services and the engine.

An issue not easily dismissed is inter-task communication. In our case, OSyRIS offers two kinds of communication formats: SOAP based and message queue based. SOAP is predominantly used in cases where the engine is exposed as a WS, while a message queue service is the preferable choice when the engine is embedded inside Java applications or when several engines run distributed and need to cooperate with each other. The message system used by OSyRIS is RabbitMQ (VMW10), an implementation of the Advanced Message Queuing Protocol (AMQP) standard (Gro10), a wire-level protocol that defines the format of the data sent over the network. Thus it opens the way for inter-operability, as any software adhering to this format is able to attach itself to the global system. RabbitMQ was the natural choice as it is currently the solution closest to the AMQP specification. Communication is handled by the Executor class, which is responsible for sending requests to and receiving answers from remote services.

When using an ECA approach, the workflow result is represented by the output of the last executed task. This task could be different from the real final task in case inconsistent workflows (cf. Sect. 4.1.1) are executed or when task incompatibility occurs. Task incompatibility means that the output of a LHS task cannot be used as input by a RHS task because of description format incompatibilities. For instance, an image processing task which requires an image as input will not be compatible with a task producing a mathematical result. This problem can also occur due to the unavailability of specialized services for solving RHS tasks. It can be avoided at design time by applying model checking or by using a visual tool to prevent incompatibilities. Runtime checks can also be performed, but they will mainly lead to exceptions which halt the workflow execution in case no other valid rule can be triggered.


Much of the workflow information is stored in a database (cf. Fig. 4.1.4) for data queries and reports. This information includes task ETs, IO data for each task, the number of instances for each task, task execution status and resource usage. Workflow execution traces are also stored in logs for debugging purposes.

Resource Selection and Discovery Resource selection is an important aspect of RBWS, as it facilitates load balancing in DS. OSyRIS implements several ways of selecting resources, but it is up to the user to pick the choice best suited to the situation at hand. The task selection API is implemented inside the Scheduling module, as seen in Figure 4.1.4.

The first method involves using an external resource selector. This approach relies on the fact that each OSyRIS task is exposed as an external service. An intermediate broker service added between the actual service and the engine is responsible for dispatching tasks from the engine to appropriate services. Additional scheduling capability can optionally be integrated inside the broker.

The second method is to implement the additional functionality and link it directly with the Executor class. In this case the scheduling mechanism is not handled by the broker service but instead implemented in the customized OSyRIS engine. After taking a decision the Executor sends the task directly to the service without using any intermediate brokers.

The third method is to incorporate resource selection decisions inside rules. In this scenario the scheduling heuristics are inserted directly inside the rules. Two different approaches can be considered in this case. The first one requires modifying the SiLK2Rules class responsible for the SiLK to Drools translation and using the WFResource collection which contains a list of available services. In this case a distinct rule condition must be added inside each of the OSyRIS generated Drools rules. This condition incorporates all the required scheduling decisions and returns one or more potential services for the task that needs to execute. The second approach implies using a separate service similar to the broker service. The sole purpose of this service is to deal with scheduling the other tasks. In addition, the service is treated as a normal task that is attached to the LHS of each rule. Its output, consisting of a set of available services, is transmitted as input to all the RHS tasks. The service itself relies on an OSyRIS instance where rules represent scheduling heuristics. This approach will be described in greater detail in Section 4.2.1.
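As a sketch of the kind of decision such a scheduling condition could take over the WFResource collection (the Resource fields below are illustrative assumptions, not the actual WFResource layout), consider picking the available service with the smallest latency:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of a scheduling decision over a collection of candidate services:
// filter out unavailable ones and pick the service with the smallest latency.
public class ServicePicker {
    // Hypothetical stand-in for a WFResource entry.
    public static class Resource {
        final String url; final double latencyMillis; final boolean available;
        public Resource(String url, double latencyMillis, boolean available) {
            this.url = url; this.latencyMillis = latencyMillis; this.available = available;
        }
    }

    public static Optional<String> pick(List<Resource> resources) {
        return resources.stream()
                .filter(r -> r.available)
                .min(Comparator.comparingDouble(r -> r.latencyMillis))
                .map(r -> r.url);
    }
}
```

Embedded as a rule condition, such a check would return the chosen service (or none, leaving the rule unfired) for the task about to execute.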

Besides service selection, service discovery plays an important role too, as it enables the engine to have a pool of services to choose from. Similarly to Taverna (OGAea06), services can be discovered using various techniques including: UDDIs, direct URL input, previous workflow introspection, semantic searches, etc. OSyRIS uses UDDIs by default, but extensible plug-ins can easily be integrated to cope with the rest of the choices.

4.1.3 Distributed OSyRIS

A centralized workflow engine exposes applications to:

• single points of failure, as the entire application relies on a central command entity;


• communication bottlenecks and resource overloads caused by possible scalability issues which arise from system and hardware limitations.

One solution to the previous issues is to rely on decentralized (distributed) workflow engines (cf. Fig. 4.1.5), each executing a partial workflow. Distributed workflow systems also have the advantage of giving providers the autonomy to manage their partial workflows (e.g., resource selection, task scheduling) according to their internal policies.

Figure 4.1.5: Distributed Workflow Engine

Decentralized workflow engines must also be: dynamic in choosing the service to invoke when a task needs to be executed, and adaptive to changes in both the workflow structure and the surrounding environment.

Most centralized workflow engines can be set up to run distributed engines by replacing individual task invocations with calls to services which expose other workflow engines. We argue that this approach is not truly decentralized, as the workflow distribution is achieved by creating sub-workflows in the form of tasks subject to the authority of the main workflow engine.

Work on distributed workflow engines has scarcely been approached. Among recent results we could mention Woodman et al. (WPSW04), in which a distributed enactment engine called DECS is presented. Although the authors argue the system is adaptive to failures, little insight is given into it. The workflow to be executed is also static, without runtime adaptation possibilities. Self-Serv (BSD03) is a P2P based distributed workflow engine. A state coordinator for every service needs to be run on resources owned by the service provider. Workflow execution is driven by messages sent from state coordinators when their task state has changed. Services are selected dynamically, based on a score function, from a service container. The system supports fault tolerance in case of service invocation errors, but the mechanism is centralized and is not extended to support engine recoveries. WORM (SLB06) is another distributed workflow engine and relies on the Active Network Technology (ANTS) (TW07) to distribute and execute the workflow tasks. ANTS's inbuilt routing allows task migration to be achieved. Nonetheless the system lacks support for parallel task execution and failure recovery.

D-OSyRIS addresses the limitations – related to failure recovery – found in the previously presented work and offers a self-healing distributed workflow platform supporting both


runtime workflow adaptations and dynamic service selection.

The chemical metaphor on which SiLK is based allows workflows to be naturally distributed by means of solutions. However, in order to obtain a distributed engine, OSyRIS needs to be augmented with asynchronous communication mechanisms as well as with autonomic self-healing modules. Distributed OSyRIS (D-OSyRIS) consists, on one hand, of several engines working together to solve a given workflow and, on the other hand, of healing modules responsible for engine recovery. Workflow deployment is assumed to take place inside a trusted environment where access is granted by means of digital certificates.

Communication and Synchronisation

D-OSyRIS engines need to be able to communicate with each other and to synchronise data transmission and rule execution. Furthermore, they must be able to periodically notify their presence to existing healing modules. Communication needs to be failure tolerant and provide asynchronous read/write functionality. Hence standard solutions such as HTTP, RMI or SOAP are unsuitable, while distributed message queuing systems provide a viable solution.

D-OSyRIS relies on RabbitMQ to communicate between engine peers. RabbitMQ has the advantage of not requiring prior knowledge of the number of registered exchanges. This feature is useful when checking module availability or when attempting to activate modules. As messages are published to exchanges, any module bound to an exchange by a queue can read messages from it. Moreover, RabbitMQ exchanges are persistent and thus any unprocessed message is safely stored when failures occur. Once the communication system is restored, the messages can be safely recovered.

Messages sent between workflow engines are encoded using JavaScript Object Notation (JSON) (JSO10). A message consists of five elements:

• sender workflow ID – the UUID of the workflow that initiated the message exchange;

• recipient workflow ID – the UUID of the workflow that is intended to receive the message;

• processing name – represents the operation a task is required to perform. It is usually identical to the processing meta-attribute used when defining tasks;

• message content – stores the content of the message and can represent either the actual task input data or a reference to a remote file containing that data (in case of large data);

• message type – identifies the type of the message and can have one of the following values: REQUEST (the message contains a request to create a remote task), RESPONSE (the message contains a response to a request call), PING (the message is a ping intended for the healing module).
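For illustration only, a message carrying these five elements could be serialized as follows; the JSON field names are assumptions, as the exact wire format is not reproduced here:

```java
// Sketch of encoding the five-element D-OSyRIS message as a JSON string.
// Field names are illustrative; a real implementation would use a JSON library
// with proper escaping rather than hand-built string formatting.
public class MessageCodec {
    public static String encode(String senderId, String recipientId,
                                String processing, String content, String type) {
        return String.format(
            "{\"sender\":\"%s\",\"recipient\":\"%s\",\"processing\":\"%s\"," +
            "\"content\":\"%s\",\"type\":\"%s\"}",
            senderId, recipientId, processing, content, type);
    }
}
```

A REQUEST encoded this way would be published to a RabbitMQ exchange and decoded symmetrically by the receiving engine or healing module.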


Figure 4.1.6: Message flow and platform recovery

Figure 4.1.6 depicts the main message flow inside the system. A D-OSyRIS engine automatically broadcasts a REQUEST message when the task to execute contains the "remote"="true" and "destination"="2" meta-attributes. The first specifies that the task should be created remotely, while the second indicates the solution ID. Every registered engine receives the broadcast, but only the one having a solution ID that matches the destination activates. This triggers, inside the remote solution, the creation of an atom corresponding to the processing name. The input ports of this atom are then populated with data extracted from the message content, and the execution of consequent rules is triggered.

Synchronisation between engines is accomplished by the creation of remote atoms. Thus, if several engines need to wait for each other in order to continue execution, D-OSyRIS puts the workflow in an idle state until a REQUEST message is received and an atom – which consequently triggers the rules that resume the workflow – is created.

Self-Healing Mechanism

Errors in communication and resource failures make DS susceptible to component failures. Distributed RBWS are no exception, and failures in any of the distributed engines could break the workflow chain.

Self-healing capabilities inside D-OSyRIS are implemented in the form of intelligent Feedback Control Loops (FCL). Every FCL is designed as an independent module and supports healing activities by implementing the foundational steps of an autonomic MAPE (Monitor-Analyse-Plan-Execute) loop (KC03). As depicted in Fig. 4.1.7, the monitor is responsible for gathering context information from the surrounding environment. The information arrives from context sources represented by monitored engines and contains data regarding messages sent between modules, the status of the tasks and the status of the engines. The collected data is then sent to the analyser, which decides on the action to perform next. The decision is based on whether or not an engine has failed to provide a ping during a certain time interval. In case of a ping time-out the engine is presumed to be down. In this way the analyser treats failed and overloaded nodes similarly. Once a failed engine is


found, the analyser sends a broadcast message requesting a forced ping (PING REQUEST) from it. If the engine again fails to signal its existence, the analyser decides that the engine should be marked as failed and notifies the planner. The planner then takes the following steps:

• it sends a broadcast message to the failed engine containing a shut down message – needed in case the engine is actually overloaded and not down;

• it searches for an available resource on which to deploy the new cloned engine;

• it selects the best candidate (e.g., the resource with the smallest latency);

• it sends the details of the new engine to the Executor. Based on the details received from the planner, the executor transfers the new engine onto the selected resource and executes it. The new engine is an identical copy of the failed one. Because D-OSyRIS allows workflows to resume execution from previously saved checkpoints, the new engine can resume the halted workflow without having to restart it. Checkpoint data is stored inside a database each time a task is executed and includes the output of the task, the number of instances and the workflow to which it belongs.
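The analyser's ping time-out logic can be sketched as follows (an illustrative reduction; the actual module also issues broadcasts and distinguishes overload cases):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the analyser's ping-timeout detection: an engine missing a ping for
// longer than the timeout is suspected, force-pinged, and then declared failed
// if the forced PING REQUEST also goes unanswered.
public class PingAnalyser {
    private final long timeoutMillis;
    private final Map<String, Long> lastPing = new HashMap<>();

    public PingAnalyser(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Called by the monitor whenever a PING message arrives from an engine.
    public void onPing(String engineId, long now) { lastPing.put(engineId, now); }

    // True when the engine has not pinged within the timeout window.
    public boolean isSuspect(String engineId, long now) {
        Long t = lastPing.get(engineId);
        return t == null || now - t > timeoutMillis;
    }

    // After a forced PING REQUEST, a still-silent engine is declared failed
    // and the planner is notified to clone and redeploy it.
    public boolean confirmFailed(String engineId, long nowAfterForcedPing) {
        return isSuspect(engineId, nowAfterForcedPing);
    }
}
```

Note that an overloaded engine that simply cannot answer in time is treated identically to a crashed one, which is why the planner's first step broadcasts a shut down message.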

Because self-healing modules are themselves subject to failures, they also need to heal themselves. To achieve this, modules register themselves with other partner modules. In this way each healing module monitors its siblings and is able to recover failed partners. The recovery process is identical to the one presented for workflow engines.

A special case when recovering failed engines is represented by failures that occur right after a RESPONSE has been sent to a REQUEST message. As the checkpoint has not yet been created, the restored engine loses the message information and is unable to execute the workflow. Thus the workflow is deadlocked indefinitely, as engines that require REQUEST messages from the recovered component will never receive them. To solve this issue, REQUEST messages are also sent to the healing modules (cf. Fig. 4.1.6). When the engine is restarted, the data of the initial task – contained in the REQUEST message – is sent along with the application data, thus ensuring a proper workflow execution after restart.

Task Execution Delay Induced by D-OSyRIS

Of particular interest when dealing with distributed RBWS is the time delay induced by the inter-engine communication. For the tests, two solutions which simply send messages between them have been used. Average runtimes are depicted in Fig. 4.1.8. An average of 2 s is required for executing the rules when the centralized approach is used and n = 1. The distributed version takes an average of 4 s for the same configuration. The extra 2 s correspond (1) to the time between sending the message from solution 1 and receiving the answer from solution 2, and (2) to the firing of the additional rule which was added to solution 2. The actual message transmission time is around 0.8 s. On average a rule takes around 0.7 s to fire. The cause for this delay is the rule firing mechanism of Drools Expert. The

Adaptive Scheduling for Distributed Systems 105

Figure 4.1.7: The feedback loop for component recovery

[Plot: workflow execution time [s] on the y-axis vs. number of inter-solution calls on the x-axis, with one curve for the decentralized and one for the centralized engine]

Figure 4.1.8: Workflow execution times when centralized and decentralized engines are used

relatively linear evolution of the times required to complete the workflow for various n values (cf. Fig. 4.1.8) shows that the transfer and execution times are invariant to the number of REQUEST messages. Sending multiple messages in parallel to several different engines also obeys this behaviour, as each engine has to process only one message containing all the data needed for rule activation. The message transmission time does not depend on the task, as the transmitted REQUEST usually contains meta-data in the form of references to large input data. For long running workflows, the additional 0.8 s induced each time an inter-engine message is sent is probably negligible with regard to the actual time needed to process the task.
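Assuming the figures quoted above (a centralized baseline of roughly 2 s and roughly 2 s of per-call overhead: message transfer plus the extra rule firing), the linear trend in Fig. 4.1.8 can be modelled by a hypothetical one-liner; the function name and parameters are invented for illustration:

```python
def decentralized_runtime(n, base=2.0, per_call_overhead=2.0):
    """Estimated decentralized workflow runtime for n inter-solution calls.

    base:              centralized runtime for the same workflow (~2 s in the tests)
    per_call_overhead: message transfer (~0.8 s) plus extra rule firings (~0.7 s each),
                       rounded to the ~2 s gap measured for n = 1.
    The linear form mirrors the trend visible in Fig. 4.1.8; it is a model
    of the reported measurements, not measured data itself.
    """
    return base + n * per_call_overhead
```

For n = 1 the model reproduces the measured 4 s, and the constant slope reflects the claim that transfer and execution times do not depend on the number of REQUEST messages.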

4.1.4 Auxiliary Tools

In order to create a fully functional RBWS, additional tools (cf. Fig. 4.1.4) are needed. They include a visual workflow designer, a visual administrator and an automatic workflow generator. Each of them is detailed in what follows.


Automatic Workflow Generator

There are cases when, due to a lack of workflow knowledge or of time to design the rule chain, the possibility to automatically generate the workflow given its goal is desired. Automatic workflow generation is required in cases in which the user knows the desired outcome but does not know what the workflow should look like. For instance, image processing could require several steps to achieve a certain result, and all the user might know is the desired final effect. Workflow generation is also possible for cases in which runtime solution selection is made. In this way a SP changing the scheduling heuristics would select the new policy, but the rules for it would have to be extracted automatically from the rule base.

OSyRIS offers the WorkflowExtractor class as part of the workflow generator module (cf. Fig. 4.1.4) for cases where automatic generation is required. There are three basic approaches to this problem: using task semantic information, relying exclusively on rules, and mixing the two.

The first one concerns using task semantic information (CAA02; LBL06) stored inside the task definitions. This approach has a major disadvantage, since it might happen that not all tasks have the semantic information required for backwards chaining. Moreover, in the absence of a standard for representing ontologies, different users might use different information for representing tasks, so the engine might fail to recognize a relationship between two otherwise linked tasks. The next example – used in (FPNP09) to link image processing tasks – shows how additional information can be added inside task definitions to form a simple ontology. While XML text can be embedded inside the meta-attributes, the example uses a plain text approach. The format serves to identify corresponding services through a UDDI:

A0:=[o1:output="image normal", "instances"="1"];
A:=[i2:input, o1:output, "processing"="image red extract-band(band,image normal)", "argument-list"="<band=red>"];
B:=[i2:input, o1:output, "processing"="image infrared extract-band(band,image normal)", "argument-list"="<band=infrared>"];
C:=[i1:input, i2:input, o1:output, "processing"="image ndvi compute-ndvi(image red, image infrared)"];

In the previous case the task information details the processing each task is going to do, its output type and the arguments of the call. Ports have indices corresponding to the positions of the function arguments (e.g., input port i2 is linked to the second argument of the extract-band function, called image), and optionally arguments can have initial default values, as specified by the argument-list meta-attribute. It can also be noted that the example uses a C-like function naming for describing task processing. Starting from the task meta-attribute descriptions, the backwards chaining creates corresponding workflows. One of the possible workflows derived from the example is presented in the next code fragment.

The second approach relies solely on rules described in the rule base. While this approach


A0[a=o1] -> A[i2=a], B[i2=a];
A[a=o1], B[b=o1] -> C[i1=a#i2=b];

is somewhat easier, because the rules already offer a precedence relationship between tasks, it could produce wrong workflows, as the rule base grows by receiving more and more rules which could conflict with older existing ones.

The last approach combines the previous two, in the sense that the generator also considers the task description during rule selection.

All the above described approaches require an initial rule base. From it, a set of disjoint graphs – essential for obtaining the list of relevant workflows for a given goal – is determined. The backwards chaining phase is performed by using a depth-first algorithm. The user first chooses the desired goal from a list of available tasks extracted from the existing RHS tasks. The final rules are then extracted from the rule base using backwards chaining. Finally, the user selects one of the generated workflows and executes it.

As an example we can consider the following code that shows a simple rule base:

A:=[i1:input="initial input A", o1:output, "proc"="operation-A"];
B:=[i1:input, o1:output, "proc"="operation-B"];
C:=[i1:input, o1:output, "proc"="operation-C"];
D:=[i1:input, o1:output, "proc"="operation-D"];
E:=[i1:input, o1:output, "proc"="operation-E"];
M:=[i1:input, o1:output, "proc"="operation-M"];
F:=[i1:input, i2:input, o1:output, "proc"="operation-F"];
P:=[i1:input="initial input P", o1:output, "proc"="operation-P"];
A[a=o1] -> B[i1=a],C[i1=a];
B[b=o1] -> D[i1=b];
D[d=o1] -> E[i1=d];
C[c=o1] -> E[i1=c];
E[e=o1],C[c=o1] -> F[i1=e#i2=c];
M[m=o1] -> F[i1=m];
P[p=o1] -> M[i1=p];

This rule base can be represented as the graph seen in Figure 4.1.9.
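The depth-first backwards chaining over such a rule base can be sketched as follows. The function and the (LHS tasks, RHS tasks) encoding of rules are illustrative assumptions, not the actual WorkflowExtractor API:

```python
def extract_workflow(rules, goal):
    """Backwards-chain from `goal` over a rule base given as a list of
    (lhs_tasks, rhs_tasks) pairs, using depth-first search.
    Returns the indices of the rules relevant to reaching the goal."""
    selected, pending, seen = [], [goal], set()
    while pending:
        task = pending.pop()          # depth-first: last in, first out
        if task in seen:
            continue
        seen.add(task)
        for i, (lhs, rhs) in enumerate(rules):
            if task in rhs and i not in selected:
                selected.append(i)
                pending.extend(lhs)   # preconditions become subgoals
    return selected
```

Run on the example rule base above with goal F, the sketch selects every rule, since F depends (directly or transitively) on all of them; with goal D it selects only the B -> D and A -> B,C rules.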

The Visual Workflow Designer

Visual workflow design aids users by allowing them to seamlessly create both syntactically and semantically correct workflows. The OSyRIS platform offers such an option (cf. Fig. 4.1.4). The designer lets users create workflows without needing any knowledge of the SiLK language. The graphical representation can be saved, loaded and exported to various formats which can be used as input to the engine. Users create workflows by using two elementary visual constructs: tasks and links between them. By default each task has one


[Graph with nodes A, B, C, D, E, F, M, P; edges as given by the rule base above]

Figure 4.1.9: Rule Base Chaining Example

input and one output port, but the user is free to attach more. Conditions can optionally be attached to links in order to simulate decision and loop constructs. Any restrictions imposed by the SiLK language (cf. Sect. 4.1.1) are dealt with at this design stage. Figure 4.1.10 shows the interface together with a simple workflow made up of a sequence of image processing tasks.

Figure 4.1.10: User interface for the Visual Workflow Designer


The Workflow Manager

Workflow tasks are executed by remotely located services. This can lead to problems such as response delays, due to either the time required to transmit large amounts of data or network overuse, and task failures, due to resource unavailability caused by service failures. Even though these problems can be overcome by taking adaptive measures such as dynamic task rebalancing, task cloning or other methods, it is important to keep track of all of them for statistics, task rescheduling or other administration policies. OSyRIS provides an administrative interface (cf. Fig. 4.1.11) that supplies information including: the status of each workflow task (NOT RUNNING, RUNNING, COMPLETED, PAUSED or ABORTED); the service which executes it; how long it took to execute the task; or how often a certain service is used.

Figure 4.1.11: User interface for the Visual Workflow Manager

4.2 Expressing Scheduling Heuristics Using a Rule Based Approach

Rules are necessary in order to achieve a complete separation of the scheduling heuristics' logic from data. However, they are not sufficient. The tasks performed as a consequence of rule firing also need to be separated from the regular applications a VO federation offers. Services offer a viable solution, as they allow their logic to be accessed through standard interfaces. Hence the actual implementation can be freely changed and customized by every provider. Figure 4.2.1 depicts the layered structure of our proposed Rule Based Scheduler (RBSch).


Figure 4.2.1: Layers of the Distributed Rule Based Scheduling Platform

When the tasks that form a SA are exposed as services, it is important to provide basic functionality as service methods in order to make the logic as customizable and reusable as possible. Thus a wide range of SAs can use the same service interface. This opens the way to a bottom-up approach to building scheduling heuristics, as well as to deriving new algorithms from existing ones. To obtain this behaviour, the atomic methods existing in every SA need to be identified. These methods can be viewed as the smallest possible operations applicable to schedule data.

In what follows we study the case of four online scheduling heuristics: Min-Min, Max-Min, Suffrage and DMECT. The first three algorithms have been augmented to consider ageing (cf. Relation 2.2.5).

Four types of information relevant to SAs have been determined: task, resource, mixed and algorithm specific data.

In the case of task related information, the methods are limited to retrieving either specific task ID based data or a reference (e.g., an ID) to the next available task. Resource information methods are similar, as they offer means of retrieving either information on a given resource or a reference to the next available one. Methods that tackle both task and resource data comprise: checking whether a task is assigned to a resource, unassigning a task from a resource, computing a task's EET, etc.

Other possible methods are specific to each SA. For instance, the Min-Min heuristics uses two specific atomic methods. The first one computes, for each task, the fastest resource, while the second one assigns the fastest task to the resource responsible for this result. Max-Min also introduces two atomic methods that are similar to Min-Min's, but instead of the fastest task it uses the slowest one. Suffrage's two specialized atomic methods refer to computing the Suffrage value and to assigning the task with the highest Suffrage value to the corresponding resource. DMECT uses one atomic method for determining if it is time to move a task from a resource queue and another one to assign a task to the resource queue that would execute it the fastest.

Once the atomic methods have been identified, the services and the data they rely on


need to be designed. There are three types of services that need to be considered: for task, for resource and for mixed operations. The specific atomic methods have been incorporated in the mixed operations category.

Task services must be able to access information about either all or a subset of the tasks considered for scheduling. In the latter case several services are required. The minimum amount of information needed for every task is represented by: a task ID; a boolean value indicating whether or not it is the first task in the list; a boolean value indicating whether it is assigned to a resource and, if so, the resource ID; and a list of <resource ID, ECT> pairs containing the estimated completion times and the corresponding resources.

Resource services need to have access to resource related information. Similar to the task case, the resource set can be the whole resource pool or just a subset. Every resource has attached to it the following information: a resource ID; a boolean value indicating if it is the first resource in the list; additional information related to physical characteristics such as CPU speed, memory load, network cards, etc.; and the list of assigned tasks.

The mixed services have to access information related to both tasks and resources, as they usually perform operations requiring both. These services chiefly access the task and resource services for obtaining the information needed in order to perform their operations.

Usually task and resource data is kept inside databases, and every atomic method queries the database for information.
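The minimum task and resource records described above might be sketched as follows; the field names and types are illustrative, not the actual service schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    """Minimum per-task information, per the description above (sketch only)."""
    task_id: str
    is_first: bool = False                    # first task in the (circular) list
    assigned_to: Optional[str] = None         # resource ID when assigned
    ect: dict = field(default_factory=dict)   # resource ID -> estimated completion time

@dataclass
class Resource:
    """Minimum per-resource information (sketch only)."""
    resource_id: str
    is_first: bool = False
    cpu_speed: float = 0.0                    # stand-in for physical characteristics
    queue: list = field(default_factory=list)  # IDs of assigned tasks
```

In the actual platform these records would live behind the task and resource services and be fetched by ID rather than held in memory.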

In what follows, the main atomic service methods used by the previously described SAs are presented. In order to reduce network traffic, almost all methods receive as arguments references to the task/resource they use.

General purpose atomic service methods include:

• getT(TID) – used to retrieve information on the task identified by TID;

• getNextT(TID) – retrieves the ID of the task following the one identified by TID. The list of tasks is treated as a circular list. As a consequence, when TID refers to the ID of the last task in the list, the ID of the first task will be returned. In this case a special marker will be attached to the message, informing the SA that all the tasks in the list have been iterated over;

• getR(RID) – retrieves information on the resource identified by RID;

• getNextR(RID) – is used to obtain the ID of the resource following the one represented by RID. Its behaviour is similar to that of getNextT(TID);

• calcECT(TID,RID) – computes the ECT of task TID on resource RID. The result is stored as information belonging to task TID;

• unassignT(TID) – unassigns a task from its assigned resource. The internal reference to the resource is kept while the task is removed from the resource's queue. The internal reference is overwritten when an assignment operation occurs;

• assign(TID,RID) – assigns a task to a resource;


• checkAssignedT(TID) – checks whether a task is assigned to a resource or not.
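The circular-list semantics of getNextT can be illustrated with a hypothetical helper; the real method operates on service-side data rather than an in-memory list:

```python
def get_next_t(task_ids, tid):
    """Sketch of getNextT's circular-list behaviour: return the next task ID
    together with a flag standing in for the special marker that tells the SA
    the whole list has been iterated over (illustrative, not the service API)."""
    i = (task_ids.index(tid) + 1) % len(task_ids)
    wrapped = (i == 0)  # asking for the successor of the last task wraps around
    return task_ids[i], wrapped
```

Asking for the successor of the last task returns the first one plus the "all tasks visited" marker, which is how the reaction chains below detect the end of a pass.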

Min-Min specific atomic methods are calcMin(TID,RID) and assignMin(TID,RID). The calcMin(TID,RID) method computes the minimum ECT for task TID. The computation proceeds as follows: first, the ECT on resource RID is computed; then the result is compared to the already existing minimum ECT value – initialized at the beginning with the highest possible value; if the new ECT is smaller, it replaces the current one together with the RID of the resource that provided it. The task's ID is returned after the computations are done.

The assignMin(TID,RID) method assigns task TID to resource RID. Task TID is assumed to have the smallest ECT among all existing tasks. The method returns the task ID of the first task in the list.
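A minimal in-memory sketch of these two Min-Min methods, under the assumption that tasks are plain dicts rather than service-side records:

```python
def calc_min(task, rid, ect):
    """Running-minimum step of calcMin: compare the ECT of `task` on resource
    `rid` against the best value seen so far and keep the smaller one.
    `task` starts with min_ect = +inf, as described above (sketch only)."""
    if ect < task["min_ect"]:
        task["min_ect"], task["min_resource"] = ect, rid
    return task["id"]

def assign_min(tasks, tid, queues):
    """assignMin: put the task with the overall smallest ECT on the resource
    that produced it, then return the ID of the first task in the list."""
    task = tasks[tid]
    task["assigned"] = task["min_resource"]
    queues.setdefault(task["min_resource"], []).append(tid)
    return next(iter(tasks))
```

After three calc_min calls with ECTs 5.0, 3.0 and 4.0, the task remembers the resource that gave 3.0, and assign_min places it on that resource's queue.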

Max-Min specific atomic methods are similar to those of Min-Min and include calcMin(TID,RID) and assignMax(TID,RID). The former is identical to the analogous method belonging to Min-Min. assignMax(TID,RID) resembles Min-Min's assignMin(TID,RID), but assigns to a resource the task TID with the largest ECT among the existing tasks.

Suffrage's specific atomic methods are represented by calcS(TID,RID) and assignMaxS(TID,RID). The calcS(TID,RID) method computes the Suffrage value of task TID. The computation proceeds as follows: first, the ECT on resource RID is computed; then it is compared with the highest and second highest of the already existing values. If it is larger than the highest ECT, it replaces it. If it is between the highest and the second highest, it replaces the latter. As a final step, the Suffrage value is computed from the new values. The TID is returned after the computations are done. assignMaxS(TID,RID) assigns task TID to resource RID. Task TID is assumed to have the largest Suffrage value among the available tasks. The method returns the task ID of the first task in the list.

Finally, DMECT contains two specific atomic methods: canMoveT(TID) and assignMin(TID,RID). The former checks if task TID can be moved to another queue by comparing its LWT with a predefined threshold. The latter assigns task TID to resource RID provided that RID gave an ECT for task TID smaller than the current value. In this case, the given RID replaces the current one for future comparisons, and the task is added to resource RID's queue. Unless the ECT is better, the task is left assigned to the current resource.
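The two DMECT methods can be sketched in the same dict-based style; names and data layout are illustrative:

```python
def can_move_t(waiting_time, threshold):
    """canMoveT: a task may leave its queue once its local waiting time (LWT)
    exceeds a predefined threshold (sketch of the described behaviour)."""
    return waiting_time > threshold

def assign_min_dmect(task, rid, new_ect):
    """DMECT's assignMin: move the task to resource `rid` only when it offers
    a smaller ECT than the current assignment; otherwise leave it in place."""
    if new_ect < task["ect"]:
        task["ect"], task["resource"] = new_ect, rid
    return task
```

A task with ECT 8.0 on r1 moves to r2 when r2 offers 6.0, but a later offer of 7.0 from r3 is rejected, mirroring the "unless the ECT is better" rule above.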

Table 4.2.1 summarizes the main atomic methods discussed in the previous paragraphs.

When exposing these atomic methods as service methods, the best approach is to group them according to their utility and relationship to a particular SA. For instance, the general purpose methods can be split between two services: one for task and one for resource related operations. In addition, each atomic method set belonging to a specific SA can have its own service. Depending on the size and distribution of the resource and task providers, we can have several services exposing the same methods but returning different data. This is suited to cases where a VO federation is sharing its tasks and resources. In order to maintain a complete set of tasks and resources, the scheduling platform would have to query all the available services responsible for task and resource management.


Table 4.2.1: Atomic service methods for SAs

General methods    Min-Min    Max-Min    Suffrage     DMECT
getT               calcMin    calcMin    calcS        canMoveT
getNextT           assignMin  assignMax  assignMaxS   assignMin
getR
getNextR
calcECT
assignT
unassignT
checkAssignedT

4.2.1 Representing SAs Using Chain Reactions

In what follows we detail, based on the atomic methods identified in the previous section, the solutions that form the Min-Min, Max-Min, Suffrage and DMECT SAs. As the differences between Min-Min, Max-Min and Suffrage are small, we restrict ourselves to presenting the Min-Min heuristics only. Where applicable, the modifications specific to Max-Min and Suffrage are also described. The complexity of the initial SAs remains unmodified, as they are simply rewritten using the chemical metaphor.

Every molecule is mapped to an atomic service method. The creation of a new molecule therefore has the effect of invoking the corresponding atomic service method. Where applicable, several output and input ports are used. This is also the case of the getNextT molecule, where both the reference to the task ID and a boolean value indicating whether or not the task is the first one in the list are returned.

The Min-Min heuristics can be expressed in terms of chemical reactions by Algorithm 4.2.2. The molecules used in the reaction chain are depicted in Algorithm 4.2.1.

Figure 4.2.2: Reaction chain inside the ECTMin solution


Algorithm 4.2.1 The Min-Min algorithm molecules using SiLK

# Molecules
T:=[o1:output="1",i1:input,"processing"="getT","instances"="1"];
R:=[o1:output="1",i1:input,"processing"="getR","instances"="1"];
NT:=[o1:output,o2:output,i1:input,"processing"="getNextT"];
NR:=[o1:output,o2:output,i1:input,"processing"="getNextR"];
CAT:=[o1:output,o2:output,o3:output,i1:input,"processing"="checkAssignedT"];
UAT:=[o1:output,i1:input,"processing"="unassignT"];
ECT:=[o1:output,i1:input,i2:input,"processing"="calcECT"];
CM:=[o1:output,i1:input,i2:input,"processing"="calcMin"];
AM:=[o1:output,i1:input,i2:input,"processing"="assignMin"];
ECTDir:=[o1:input,i1:input,"processing"="gotoECTSol"];
MinDir:=[o1:input,i1:input,"processing"="gotoMinSol"];
# Ports:
# a - is first; b - is assigned;
# t - task; r - resource
# Initialization reaction

Figure 4.2.3: Reaction chain inside the ContinueAndCheck solution

Before the reaction chain can proceed, an ignition reaction must be introduced. In our case this reaction simply triggers the first rule inside the ECTMin solution and creates two molecules that hold the IDs of the first task and of the first resource in the corresponding lists.

The ECTMin solution (cf. Fig. 4.2.2) computes the ECT, and subsequently the minimum ECT over all resources, for each task. The minimum ECT is computed in one pass by comparing each obtained value against the existing minimum. The ECT computation over every available resource ends when the reference to the next returned resource (getNextR) belongs to the first one in the list. At this point the cycle restarts with the next task in the list.


Algorithm 4.2.2 The Min-Min algorithm using SiLK

T[t=o1],R[r=o1] -> 1:T[i1=t], R[i1=r];
# ECTMin solution
1:T[t=o1],R[r=o1] -> ECT[i1=t#i2=r],NR[i1=r];
1:ECT[t=o1#r=o2] -> CM[i1=t#i2=r];
1:NR[r=o1#a=o2],CM[t=o1] -> R[i1=r],NT[i1=t] | a == "true";
1:NT[t=o1#a=o2] -> 2:T[i1=t],ECTDir[i1=t] | a == "false";
1:NT[t=o1#a=o2] -> 4:T[i1=t] | a == "true";
1:NR[r=o1#a=o2],AM[t=o1] -> R[i1=r],T[i1=t] | a == "false";
# ContinueAndCheck solution
2:NT[t=o1] -> Error[i1=t] | t == "null";
2:NT[t=o1#a=o2] -> T[i1=t] | t != "null" and a == "false";
2:T[t=o1] -> CAT[i1=t];
2:CAT[t=o1#b=o3],ECTDir -> 1:T[i1=t] | b == "false";
2:CAT[t=o1#b=o3],MinDir -> 4:T[i1=t] | b == "false";
2:CAT[t=o1#b=o3],MinMinDir -> 5:T[i1=t] | b == "false";
2:CAT[t=o1#b=o3] -> NT[i1=t] | b == "true";
2:NT[t=o1#a=o2] -> 3:UAT[i1=t] | t != "null" and a == "true";
# UnassignTask solution
3:UAT[t=o1] -> NT[i1=t];
3:NT[t=o1#a=o2] -> UAT[i1=t] | a == "false";
3:NT[t=o1#a=o2] -> 1:T[i1=t] | a == "true";
# SortAscMin solution
4: ...
# MinMin solution
5:T[t=o1],R[r=o1] -> AM[i1=t#i2=r],R[i1=r];
5:AM[t=o1] -> 2:T[i1=t],ECTDir[i1=t];

Figure 4.2.4: Reaction chain inside the UnassignTask solution


Figure 4.2.5: Reaction chain inside the MinMin solution

However, as getNextT does not distinguish between assigned and non-assigned tasks, the assigned ones need to be skipped. This step is handled by the ContinueAndCheck solution (cf. Fig. 4.2.3). If the reference to the first task is returned, it means that all of the existing tasks have been assigned. When rescheduling tasks we need to first unassign them by using the UnassignTask solution (cf. Fig. 4.2.4) and then return to the ECTMin solution in order to proceed with the ECT computations. As soon as all the unassigned tasks have had their ECT and minimum ECT computed, the reactions proceed to the SortAscMin solution. This solution sorts the unassigned tasks in ascending order of their minimum ECT and returns the first task in the list. That task is then assigned to the resource which provided the minimum value, and the cycle restarts from the ECTMin solution.

As already mentioned, the differences between the Min-Min, Max-Min and Suffrage heuristics are small. To adjust the previous Min-Min solutions to fit Max-Min, first the assignMin molecule (cf. Fig. 4.2.5) needs to be replaced with the assignMax one. In addition, the SortAscMin solution must be replaced with a solution that sorts tasks in descending order of their ECTs and returns the first task in the list.

In the case of Suffrage three modifications have to be performed. Firstly, the calcMin molecule inside the ECTMin solution needs to be replaced with the calcS molecule. Secondly, the assignMin molecule inside the MinMin solution must be replaced with the assignMaxS molecule. Finally, the SortAscMin solution is replaced with a solution that sorts tasks in descending order of their Suffrage values and returns the first task in the list.

Given the above, it can be noticed how easily a SA can be modified with only a couple of changes to the rule set.

The DMECT heuristics is even simpler, and its reaction chain (cf. Fig. 4.2.6) is depicted by Algorithm 4.2.3.

As in the case of Min-Min, the reaction chain starts from an ignition reaction that activates the TaskIterator solution and sets the IDs of the first task and first resource. Every task is checked against the time threshold – cf. the DMECT solution – to determine whether it should be relocated or not. If relocation is possible, the minimum ECT is determined by iterating over all existing resources. During every iteration the task is reassigned either to a new resource – if the ECT is better – or back to the current one. Once the ECT is computed (i.e., the next resource is the first in the resource list), the next available task is selected.

The rule based DMECT is slightly different from the normal version. Instead of moving the task reference directly to the smallest resource queue, it moves it gradually every time a


Figure 4.2.6: Reaction chain inside the DMECT and TaskIterator solutions

better one is found.

4.2.2 Distributing the Scheduling Heuristics

Separating both the data a SA requires and the heuristics from the SP facilitates the distribution and even the parallelization of the scheduling process. One major problem is represented by concurrent data access. Next we detail how mixing SOA with RBWS based scheduling solves this issue.

The services exposing atomic methods offer synchronized access to their data. More precisely, as soon as a scheduler has acquired a lock on a service method, no other scheduler can invoke the method until it is unlocked by the scheduler that gained exclusive access. Nevertheless, this is a problem in SOA, since multiple clients could access the service at the same moment and thus all of them would obtain a lock on the atomic method. A queuing mechanism allowing the serialization of all requests has consequently been devised.
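A single-threaded sketch of such a request-serializing lock follows; the class and method names are invented for illustration and stand in for the service-side mechanism:

```python
from collections import deque

class SerializedLock:
    """Sketch of the queuing mechanism described above: concurrent lock
    requests are enqueued and granted strictly one at a time, so exactly one
    scheduler holds the atomic method at any moment."""

    def __init__(self):
        self.queue = deque()
        self.holder = None

    def request(self, scheduler_id):
        """Enqueue a request; return True only if this scheduler now holds the lock."""
        self.queue.append(scheduler_id)
        if self.holder is None:            # grant immediately when free
            self.holder = self.queue.popleft()
        return self.holder == scheduler_id

    def release(self, scheduler_id):
        """Hand the lock to the next queued scheduler, if any."""
        assert self.holder == scheduler_id
        self.holder = self.queue.popleft() if self.queue else None
```

If scheduler s1 holds the lock, s2's request is queued rather than granted, and s2 becomes the holder only once s1 releases.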

To distribute the SA, every solution that is part of it can be spread across the DS by using D-OSyRIS.

In what follows we focus strictly on the study of DMECT.

In our model, task data is split into equal and distinct parts among multiple schedulers.

Only access to resource data is shared. However, this approach does not solve the data consistency problems induced by the reactions themselves. A concrete example is represented by the ECT[t=o1#r=o2] -> UAT[i1=t], R[i1=r] and UAT[t=o1],R[r=o1] -> AM[i1=t#i2=r] reaction chain inside the DMECT solution.

The reason why the previous reaction inflicts data inconsistencies is that the computed

ECT could be modified before the AM molecule is created. Inconsistencies are possible, as there is a variable time interval between the moment the ECT molecule computes the result and the creation of the AM molecule, when the resource is unlocked. During this time period any other scheduler is free to modify the resource's ECT, and thus the value computed earlier


Algorithm 4.2.3 The DMECT algorithm using SiLK

T:=[o1:output="1",i1:input,"processing"="getT","instances"="1"];
R:=[o1:output="1",i1:input,"processing"="getR","instances"="1"];
NT:=[o1:output,o2:output,i1:input,"processing"="getNextT"];
NR:=[o1:output,o2:output,i1:input,"processing"="getNextR"];
ECT:=[o1:output,o2:output,i1:input,i2:input,"processing"="calcECT"];
UAT:=[o1:output,i1:input,i2:input,"processing"="unassignT"];
AM:=[o1:output,i1:input,i2:input,"processing"="assignMin"];
IsTime:=[o1:output,o2:output,i1:input,"processing"="canMoveT"];
# Ports:
# a - is first;
# t - task; r - resource
# Initialization reaction
T[t=o1],R[r=o1] -> 1:T[i1=t], R[i1=r];
# TaskIterator solution
1:T[t=o1],R[r=o1] -> 2:T[i1=t],R[i1=r];
1:NT[t=o1#a=o2] -> T[i1=t] | a == "false";
1:NT[t=o1#a=o2] -> 3:Error[i1=t] | a == "true";
# DMECT solution
2:T[t=o1] -> IsTime[i1=t];
2:IsTime[t=o1#i=o2],R[r=o1] -> ECT[i1=t#i2=r],NR[i1=r] | i == "true";
2:IsTime[t=o1#i=o2],R[r=o1] -> 1:NT[i1=t],R[i1=r] | i == "false";
2:ECT[t=o1#r=o2] -> UAT[i1=t], R[i1=r];
2:UAT[t=o1],R[r=o1] -> AM[i1=t#i2=r];
2:NR[r=o1#a=o2],AM[t=o1] -> R[i1=r],T[i1=t] | a == "false";
2:NR[r=o1#a=o2],AM[t=o1] -> 1:NT[i1=t],R[i1=r] | a == "true";

is invalidated. Figure 4.2.7 illustrates this aspect.

To solve the data consistency problem, the molecules belonging to unsafe reactions are replaced with a single compound molecule that performs all the operations in one synchronized step. Once a resource lock is acquired, it is kept during the entire computation interval (cf. Fig. 4.2.8).
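The compound molecule's effect can be sketched as a single critical section; the simplistic ECT formula and all field names are simplifying assumptions, not the thesis implementation:

```python
import threading

def calc_ect_and_assign_min(task, resource, lock):
    """Sketch of the compound molecule: the resource lock is held for the whole
    compute-compare-assign sequence, so no other scheduler can change the queue
    between the ECT computation and the assignment."""
    with lock:                                        # one synchronized step
        ect = resource["ready_time"] + task["eet"]    # toy ECT model: queue ready time + EET
        if ect < task["ect"]:
            task["ect"], task["resource"] = ect, resource["id"]
            resource["ready_time"] = ect              # the task joins this queue
    return task
```

Because the comparison and the assignment happen under the same lock, the race between the ECT and AM molecules described above cannot occur in this sketch.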

Algorithm 4.2.4 shows how the newly introduced compound molecule modifies the DMECT solution.

Each resource can be in one of three states: AVAILABLE (resource is on-line), NOT AVAILABLE (resource is off-line) and LOCKED (resource is used by another scheduler). These states allow the SP to easily check whether a resource can be accessed or not.

In the sequential case DMECT assigns each task to the smallest resource queue (cf. Fig. 4.2.9). Still, in the parallel case this is not always possible, as smaller queues might be locked by other schedulers (cf. Fig. 4.2.10). Each time the distributed DMECT discovers a queue providing a smaller ECT, it assigns the task to that queue. Hence task


Figure 4.2.7: Data consistency issues when locking resource related molecules for synchronized access


Figure 4.2.8: Solving data consistency issues when locking resource related molecules for synchronized access


Algorithm 4.2.4 The modified DMECT algorithm using SiLK

ECT_AM:=[o1:output,o2:output,i1:input,i2:input,"processing"="calcECTandAssignMin"];
# DMECT solution
2:T[t=o1] -> IsTime[i1=t];
2:IsTime[t=o1#i=o2],R[r=o1] -> ECT_AM[i1=t#i2=r],NR[i1=r] | i == "true";
2:NR[r=o1#a=o2],ECT_AM[t=o1] -> R[i1=r],T[i1=t] | a == "false";
2:IsTime[t=o1#i=o2],R[r=o1] -> 1:NT[i1=t],R[i1=r] | i == "false";
2:NR[r=o1#a=o2],AM[t=o1] -> 1:NT[i1=t],R[i1=r] | a == "true";

Figure 4.2.9: Example of moving a task from the current queue to the smallest using the DMECT sequential algorithm

Figure 4.2.10: Example of moving a task from the current queue to the smallest available one using the DMECT distributed algorithm (the dotted squares represent resources locked by other schedulers)

relocation is done gradually and not necessarily towards the final task-to-resource mapping given by the sequential DMECT. Even though at first glance this relocation seems costly, it should be noted that DMECT only relocates task pointers. The actual physical relocation is accomplished only when the task is known to be executed on the assigned resource (cf. Proposition 3.4.8 in Sect. 3.4.4).

To summarize, the distributed DMECT uses subsets Qj ⊂ Q such that ⋃Qj = Q and ⋂Qj = ∅. Every Qj contains the list of tasks belonging to the contained queues. The number of subsets is equal to the number of schedulers. Each scheduler can move exclusively tasks belonging to its subset. However, the destination is not restricted to this subset,


with all unlocked resource queues being potential destinations. To increase the chance of minimizing the ECT, the task is relocated every time a better queue is found. This differs from the sequential DMECT, where the task is assigned only to the best queue. Yet, as this queue could be temporary, the time-to-move condition remains true until no better queue is found. Every scheduling step stops when no more tasks can be moved (i.e., no task exceeds its LWT on the queue). The sequential DMECT was modified to follow the previous rule when transformed into a rule based algorithm (cf. Fig. 4.2.6).
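A minimal sketch of one such relocation step, assuming queues are described by hypothetical (ECT, locked) pairs; function and variable names are illustrative:

```python
def relocate_task(current_queue, queues):
    """One relocation step of the distributed DMECT sketch.

    `queues` maps a queue id to a (ect, locked) pair. The task pointer moves
    to the unlocked queue with the smallest ECT, provided it improves on the
    current queue; queues locked by other schedulers are skipped, so the
    relocation proceeds gradually toward the best available mapping.
    """
    candidates = {q: ect for q, (ect, locked) in queues.items() if not locked}
    if not candidates:
        return current_queue  # every other queue is locked right now
    best = min(candidates, key=candidates.get)
    if candidates[best] < candidates.get(current_queue, float("inf")):
        return best           # move only the task pointer, not the task itself
    return current_queue
```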

The locking mechanism prevents the SA from dealing with inconsistent data regarding task ECTs. A different approach would be to completely ignore this inconsistency and assign tasks based on the information available at the moment the ECTs have been calculated. Section 4.2.3 compares the two approaches.

4.2.3 Test scenarios and results

For testing the proposed distributed RBSch two scenarios have been considered. In the first case we tested the efficiency of using the locking mechanism against a non-locking version of the distributed DMECT. Both online and offline scheduling have been considered. In the second case, the efficiency of the distributed DMECT against the sequential version and other classic SAs has been tested. The gain obtained when applying a distributed version relative to a sequential one has also been determined.

Comparing the locking version of the distributed DMECT with the non-locking version

The purpose of the following tests is to determine the behaviour of the locking DMECT (DMECT-par-lock-6) compared with the non-locking version (DMECT-par-6). Six distributed schedulers have been used for comparison. Each scheduler deals with three resource queues, as there are 18 resources in the tested Grid'5000 platform. Three aspects were of interest: the makespan, the lateness and the average schedule time. Two scenarios, called Scenario 1 for online scheduling and Scenario 2 for offline scheduling, have been considered. In all tests the sequential DMECT is given as reference.

Concerning the makespan, the results for Scenario 1 show a similar outcome for all cases (cf. Fig. 4.2.12(a)), while those for Scenario 2 (cf. Fig. 4.2.12(d)) show DMECT-par-6 producing a better result. The reason could be that when using DMECT-par-lock-6, tasks need to wait longer until a resource queue becomes unlocked, and thus their chances of being assigned to a better resource diminish.

Lateness tests have shown that DMECT-par-6 and DMECT-par-lock-6 behave relatively the same in both scenarios. DMECT-par-6 has, however, a slight advantage in Scenario 2. This is also evidenced by the graphs in Figs. 4.2.12(b) and 4.2.12(e). Nonetheless, they are surpassed in both cases by the sequential DMECT.

Average scheduling time tests have also shown a similarity between the two versions. However, the values obtained in Scenario 1 are high compared with the sequential DMECT. This is not the case for Scenario 2.


After analysing the test results, the following conclusions can be drawn:

• when considering the makespan in online scheduling either of the two versions can be used; yet for offline scheduling DMECT-par-6 seems the appropriate choice;

• in lateness tests either version could be used in both scenarios as the results are similar;

• the average runtime shows that the distributed versions suffer from time loss when considering online scheduling. However, the distributed versions had to deal with 6n tasks – n per scheduler – while the sequential version dealt with only n tasks. Section 4.2.3 will show that there is actually an improvement when considering an equal number of tasks for both the sequential DMECT and DMECT-par-6;

• as for the overall results, the DMECT-par-lock-6 version has no specific advantage over DMECT-par-6, with cases – such as the offline makespan results – in which the latter is even better than the former.

Comparing the non-locking version of DMECT with the sequential DMECT and other classic SAs

In this scenario the sequential DMECT was tested against three distributed non-locking cases with 2 (DMECT-par-2), 4 (DMECT-par-4) and 6 (DMECT-par-6) distributed DMECT schedulers running in parallel. These versions have also been compared with Min-Min, Max-Min and Suffrage. The motives for choosing a non-locking distributed version of DMECT have been discussed at the end of Section 4.2.3. Tests were restricted to online scheduling.

Two cases related to the location of the scheduler, the service and the database the service accesses have been studied. Scenario 1 uses the same machine to accommodate all three components. Scenario 2 uses one machine to store the SA and another to store the service and the afferent database. The two machines are linked inside a LAN.

Table 4.2.2 presents some statistics regarding the resource consumption and execution time of the rule based Min-Min, the sequential DMECT and the distributed scheduling heuristics.

The speed-up gained by accessing a database and service placed on the same machine as the scheduler is approximately 5.9. An increase in CPU utilization from 44% to 81% is also noticed when the service and database are placed on the same resource. This is because the scheduler keeps the CPU busy with data access operations.

Tests have also shown an almost linear dependency of the scheduling time on the number of rules fired. In Scenario 2 – the usual case in SOA – the scheduling process needs to be made more efficient. Two solutions exist: either reduce the number of rules to be fired by optimizing the scheduling heuristics, or distribute the SA. Our tests focused on the second approach, and the distributed DMECT versions have been tested with regard to the schedule makespan and the gain (relative to the sequential version) in schedule creation time.


Table 4.2.2: Statistics on the rule based scheduling heuristics

Algorithm   Total time   CPU utilization   Network (up/down)   Tasks/Resources   Scenario
MinMin      2.0 sec      ≈ 81%             n/a                 10/10             1
MinMin      26.6 sec     ≈ 81%             n/a                 100/10            1
MinMin      0.33 min     ≈ 44%             0.6/0.3 MiB         10/10             2
MinMin      4.32 min     ≈ 44%             5.8/3.2 MiB         100/10            2
DMECT       3.8 sec      ≈ 81%             n/a                 10/10             1
DMECT       43.2 sec     ≈ 81%             n/a                 100/10            1
DMECT       0.42 min     ≈ 44%             0.5/0.3 MiB         10/10             2
DMECT       4.31 min     ≈ 44%             6.5/3.5 MiB         100/10            2

Table 4.2.3: Speed-up for the DMECT-par-6 scheduling time compared with the sequential version for 10 resources

No. tasks   100    300    500    700
Speed-up    5.56   5.32   4.77   3.72

Figure 4.2.11 shows the results of the makespan tests, while Table 4.2.3 shows the speed-up in total scheduling time obtained when using the DMECT-par-6 version on 10 resources for Scenario 2. The similarity in terms of makespan between the sequential and the distributed DMECT versions is clearly visible. The tests also showed that the sequential DMECT performs better than the other dynamic heuristics for heterogeneous DS.

The gain of DMECT-par-6 relative to the sequential version shows that a set of distributed schedulers could indeed solve the scheduling time issue (cf. Table 4.2.2) observed in the non-distributed case.


Figure 4.2.11: Makespan comparison among the Min-Min, Max-Min, Suffrage, the sequential DMECT and the non-locking distributed versions of DMECT


Figure 4.2.12: Locking vs. non-locking distributed DMECT. Panels: (a) Scenario 1: makespan comparison for online scheduling; (b) Scenario 1: lateness comparison for online scheduling; (c) Scenario 1: average schedule runtime comparison for online scheduling; (d) Scenario 2: makespan comparison for offline scheduling; (e) Scenario 2: lateness comparison for offline scheduling; (f) Scenario 2: average schedule runtime comparison for offline scheduling


4.3 Conclusions

This chapter focused on describing a model for representing SAs using a rule based chemical paradigm, as well as a platform for executing them under the same design. The goal was to create a platform where the scheduling heuristics can be separated from the SP. This approach provides two main advantages: first, it allows SAs to be easily defined without having to recompile any code; second, it opens the way for adapting the heuristics at runtime by inserting and retracting policy rules.

To provide a platform for representing scheduling heuristics as rules we have first defined the SiLK language. The language, which is a simplified version of the γ-formalism, has several advantages over existing workflow languages:

• it allows for workflows to be naturally distributed by using chemical solutions (called domains in SiLK);

• it uses only a single construct from which the rest are derived. Hence it is simpler than the existing XML-based languages;

• it allows for implicit parallelism and non-determinism when firing rules.

The OSyRIS execution platform also has the merit of being:

• the first workflow system based on the chemical paradigm;

• one of the few workflow systems capable of distributing the workflow;

• self-healing, based on a recovery FCL and on the ability to reassign tasks in case the executing resource fails.

The platform can also be used for general purpose workflows, not only for representing SAs.

To test the distributed SAs we modelled several known SAs: Max-Min, Min-Min, Suffrage and DMECT. As mentioned in the beginning of this section, the main advantages are the reusability, adaptation and distribution of both rules and data. Following this model, providers can group scheduling heuristics based not on the algorithm but on their atoms or sub-algorithms. In this way providers can become specialized sellers of scheduling data – atoms and information. SA distribution also allows various providers to inter-cooperate by sharing tasks, resources and policies, which is important as it lowers the negotiation to the scheduling level by integrating the offer/demand logic into the SA itself.

Tests performed on the rule-based SAs have shown (cf. Sect. 4.2.3) that a RBWS causes a large processing overhead due to the high number of rules that need to be fired. On average, each rule takes around 0.7 s to fire. Despite the apparently small time interval, the large number of tasks to be scheduled inflicts a linear delay on the execution of the scheduling heuristics. Consequently, besides distributing the data required by the scheduler, we also devised a model to distribute the SAs. For the case study, we have opted for


DMECT, which was the main subject of discussion in Chapter 3. The tests concluded that the distributed DMECT versions reduce the time needed to schedule the tasks while maintaining the efficiency of the sequential version.
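The linear overhead argument can be made concrete with a simple back-of-the-envelope model; the 0.7 s constant is the average firing time reported in the tests, while the function name is illustrative:

```python
AVG_RULE_FIRE_TIME = 0.7  # seconds per rule firing, as measured on average

def scheduling_overhead(rules_fired):
    """Linear delay model: total rule-engine time for one scheduling pass
    grows proportionally with the number of rules fired."""
    return rules_fired * AVG_RULE_FIRE_TIME
```

For instance, a pass firing a few hundred rules already adds minutes of overhead, which motivates distributing the SA.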

Overall, transfer time tests on the distributed platform show that the extra added time makes it unsuited for cyclic workflows. For linear workflows this value is negligible, especially when dealing with long running tasks. The problem of cyclic workflows can be solved by distributing the data across several engines or by no longer relying on Drools to execute the rules.

Platform recovery time tests show that each component can be recovered fast enough so as not to affect the system behaviour. Fast recovery, together with the ability to resume from previously defined checkpoints, is particularly important for long running workflows, in which any failure can cause significant damage to the workflow.

5 SELF-HEALING MULTI-AGENT SYSTEM FOR TASK SCHEDULING

Contents
5.1 Platform Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 131

5.1.1 Agent Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

5.1.2 Managed Resources and Touch-points . . . . . . . . . . . . . . . . 133

5.1.3 Scheduling Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.1.4 Self-Healing Agents . . . . . . . . . . . . . . . . . . . . . . . . . . 145

5.2 Reducing the Number of Agents used in the Negotiation Phase 147

5.3 Dynamically Changing the Scheduling Policy at Runtime . . . 150

5.4 Platform Testing Scenario and Results . . . . . . . . . . . . . . 159

5.4.1 Optimal Parameters Tests . . . . . . . . . . . . . . . . . . . . . . . 159

5.4.2 Recovery Time Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 161

5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

One of the problems encountered by most RMS is that they have been targeted for cluster and inter-provider usage (TTL05; Fos05; CKKG99; CD97). However, with the constant evolution of Clouds, and further on of Sky Computing, these solutions are no longer viable and need to be adapted to facilitate task migration under certain access and sharing rules. Infrastructure Providers (IP) working inside a federation have to deal with the uncertain, heterogeneous, transient and volatile environmental conditions that can affect the behaviour of these systems (NFG+06). Hence these systems require effective mechanisms to manage, irrespective of any user intervention, the provisioning of resources. Regardless of whether IPs are VOs or Cloud providers, their objective is to offer reliable, secure and efficient computational resources, ensuring that the underlying systems are fault tolerant by exposing self-healing capabilities under dynamic environmental conditions (VM10). Moreover, providers need to maintain their autonomy by keeping their own scheduling, security and negotiation policies. Therefore, under these complex dynamics, self-healing and autonomous behaviour are essential in any multi-provider RMS.

Autonomy can be implemented using MASs. They rely on autonomous entities called agents to interact and make decisions based on internal logic (Woo99). MAS offer a natural extension to RMS as they allow multiple providers to inter-operate through negotiation



(SLGW02; OGM+05). The foundations of multi-site MAS scheduling are briefly presented by Sauer et al. (SFT00). They describe the problem as a hierarchical two-level structure that reflects the typical layout found in inter-site systems. The upper level consists of the global scheduler responsible for coordinating the lower level, which contains local schedulers working on individual locations.

Well-known MAS for RMS include Nimrod/G, which uses agents for handling the setup of the running environment, transporting the task to the site, executing it and returning the result to the client (ABG02); TRACE, which dynamically allocates resources and agents based on the demand (FW01); ARMS (CJS+02), which uses PACE (CKPN00) for application performance predictions; and AppLeS (Application-Level Scheduling), which also implements adaptive capabilities (BWC+03).

Tang and Zang proposed a service-oriented peer-to-peer MAS (TZ06). However, their system relies on a simplistic scheduling mechanism and does not implement self-adaptation. The MAS proposed by Amoon et al. uses a single fixed scheduling agent responsible for planning tasks on resources (AMA09). Tasks are then moved to and from these resources by using mobile agents. While the approach offers an advantage by using mobile agents for migrating and executing tasks, its main drawback is the use of a single static scheduling agent. This compromises the robustness of the system by introducing a bottleneck as well as a single point of failure for the entire scheduling mechanism. Cao et al. proposed a load balancing method for MAS aimed at Grids (CSJN05). Although their system implements scheduling algorithms based on artificial intelligence techniques, it lacks support both for overcoming security issues when multiple providers are used and for a customizable negotiation module. The approach advanced by Ouelhad et al. focuses on Service Level Agreement (SLA) protocols for MAS and presents a protocol for negotiation as well as for renegotiation in the presence of uncertainty (OGM+05). The SLA is designed following the two level inter-site approach (SFT00) and could thus be easily adapted to multi-provider scenarios. Shen et al. (SLGW02) present a negotiation protocol founded on a case-based learning and reasoning mechanism. The four negotiation models included in the experiments involve the Contract Net Protocol, the Auction Model, the Game Theory Based Model and the Discrete Optimal Control Model. This work could be used as a starting point for building adaptive negotiation agents, but it lacks experiments validating the benefits of such an approach.

As already described in Sect. 1.2, one way of implementing self-healing mechanisms is through feedback control loops (MKS09). These offer a way of monitoring, analysing, planning and applying the necessary changes inside the system.

As can be noticed from the previous paragraphs, several approaches have been proposed to create MAS for task scheduling. However, most of them focus on individual aspects such as negotiation, SLA management, scheduling or self-healing. Providing a completely distributed, self-healing and customizable scheduling solution for inter-provider usage is a challenge from both technical and scientific points of view. Technical issues arise due to compatibility problems when binding together various software for distributed storage or communication. Another reason may be that current solutions are still at an early stage of


development and do not yet meet the required expectations. From a scientific perspective, we require a model which allows us to represent the system components as autonomous, self-healing, fully customizable entities.

As a general conclusion, one can notice the lack of a working solution that addresses all the issues previously mentioned. To face the challenge of building an autonomous self-healing RMS platform, we tackle both technical and scientific aspects and propose an adaptive inter-provider MAS scheduling platform able to:

• offer fully distributed storage and communication mechanisms (cf. Sect. 5.1.2);

• self-heal in order to support fault tolerance by implementing agents as recoverable modules of feedback control loops (cf. Sects. 5.1, 5.1.3 and 5.1.4);

• support autonomy among providers by separating the dynamically changing scheduling policy from the application by means of an inference engine (cf. Sect. 4.2);

• adapt/change the negotiation policy by implementing negotiators as pluggable modules. To minimize the impact of rescheduling tasks on the system, we reduce the number of agents needed during the negotiation phase without affecting the scheduling objective function (cf. Sect. 5.2);

• adapt/change the scheduling policy at runtime based on the current system configuration (cf. Sect. 5.3).

Moreover, we validate the system by testing the time needed to heal the platform after module failures. Finally, we perform a study on finding the optimal platform setup parameters that would not produce false healing events (cf. Sect. 5.4).

The main conclusions of the work are presented in Sect. 5.5. The work described in this chapter has been partially published in Frıncu (Fr9a), Frıncu (Fr0b) and Frıncu et al. (FVP+11).

5.1 Platform Overview

A MAS intended for task scheduling usually consists of several intelligent agents working together towards efficient task scheduling. In what follows we propose a self-adaptive distributed scheduling platform composed of intelligent agents enhanced with custom modules (cf. Fig. 5.1.1). Agents are defined as intelligent control loops made of one or more modules. Every control loop supports scheduling activities by implementing the foundational steps performed by an autonomic MAPE (Monitor-Analyse-Plan-Execute) loop (KC03). As depicted in Fig. 5.1.1, the modules that define an agent correspond to the monitor, analyser, planner and executor. Analysis and planning activities are performed by the negotiator and the scheduler modules. Thus an intelligent feedback loop supports not only scheduling but also re-scheduling activities. Results from monitoring the environment received through


Figure 5.1.1: Self-healing MAS scheduling platform: high level architecture (the monitor, negotiator, scheduling and executor modules form the scheduling feedback loop; a monitor and a self-healing module form the self-healing loop; both connect through a message queue system – the sensors and effectors – to the managed infrastructure: distributed storage, distributed DB, execution resources and resource queues)

the monitor module are analysed by the negotiator and the scheduler modules, which act accordingly by defining scheduling plans. Finally, tasks are allocated and executed by the executor module. The monitoring of resources and tasks feeds the system back to ensure the rescheduling process. Finally, a self-healing module built on top of the MAS endows the infrastructure with self-healing capabilities that allow module recovery.
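The MAPE loop described above can be illustrated with a minimal, single-process sketch; in the actual platform the phases are distributed modules communicating over message queues, and every name below is hypothetical:

```python
class SchedulingAgent:
    """Single-process sketch of the MAPE feedback loop; the four phases map
    to the monitor, negotiator/scheduler and executor modules."""

    def monitor(self, environment):
        # Monitor: observe the current tasks and resources.
        return {"pending": environment["pending_tasks"],
                "free": environment["free_resources"]}

    def analyse_and_plan(self, state):
        # Analyse & Plan (negotiator + scheduler modules): build a
        # task-to-resource mapping; here a naive pairwise assignment.
        return list(zip(state["pending"], state["free"]))

    def execute(self, plan):
        # Execute: allocate each task to its assigned resource.
        return [f"run {task} on {resource}" for task, resource in plan]

    def loop_once(self, environment):
        # One pass of the Monitor-Analyse-Plan-Execute cycle.
        return self.execute(self.analyse_and_plan(self.monitor(environment)))
```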

Agent distribution across multiple providers is supported by communication mechanisms implemented on top of AMQP. This queue system not only enables the communication among agents and modules but also acts as the sensors and effectors required to support self-healing capabilities. As proposed in the Autonomic Computing Reference Architecture (IBM06), the queue system implements a level of indirection that supports the gathering of contextual events related to the scheduling infrastructure behaviour and the execution of module recovery tasks. Depending on the case, and in order to increase failure tolerance, agents can be made up of single modules spread across the system. Following the modular approach, an agent can comprise all the functionality of the system (e.g., self-healing, negotiating and scheduling/executing tasks) or be a specialized agent (e.g., only for task scheduling).

The MAS scheduling platform we propose allows each provider to maintain its own internal scheduling policies. Moreover, agents can also use their own scheduling policies at meta-scheduling level. Hence each agent is autonomous in deciding when to request/send tasks from/to agents of different providers.

Similar to WS, which are discovered based on UDDIs, MAS require a Yellow Pages directory to enable agent discovery and registration. In our system this is no longer necessary as AMQP includes built-in facilities such as message broadcasting and topic


based exchanges (Gro10). This eliminates the need for agents to be aware of each other. Therefore, all that agents need to know is the topic to which they should send messages and from which they should receive them.
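The topic-based discovery idea can be illustrated with an in-memory stand-in for an AMQP topic exchange; this is a sketch, not the RabbitMQ API, and the class, method and topic names are assumptions:

```python
from collections import defaultdict

class TopicExchange:
    """In-memory stand-in for an AMQP topic exchange: agent modules bind
    listeners to a topic and receive everything published there, so no
    Yellow Pages directory of agents is needed."""

    def __init__(self):
        self._bindings = defaultdict(list)

    def bind(self, topic, listener):
        # An agent module registers interest in a topic.
        self._bindings[topic].append(listener)

    def publish(self, topic, message):
        # Broadcast the message to every module bound to the topic.
        for listener in self._bindings[topic]:
            listener(message)

# Hypothetical usage: two scheduler modules listen on the same bid topic.
exchange = TopicExchange()
received = []
exchange.bind("scheduler.bids", received.append)
exchange.bind("scheduler.bids", received.append)
exchange.publish("scheduler.bids", "BID_REQUEST")
```

Neither publisher nor listener knows the other's identity; the topic name is their only shared knowledge.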

5.1.1 Agent Modules

Modules are designed as autonomous entities that continue working even when the connection with their partners has been severed. To increase the platform's fault tolerance we propose an approach where each agent is implemented as an intelligent FCL composed of several distributed modules that communicate with each other using asynchronous message queues. These FCLs enable the scheduling infrastructure to self-adapt to both changes in the resource/task characteristics and module failures. Based on these FCLs we classify agents as either scheduling or self-healing. Both types are defined as a set of modules that implement the intelligent FCL's phases (cf. Sects. 5.1.3 and 5.1.4).

The distributed and autonomous approach allows agents to spread across several resource nodes and to continue functioning when some of their modules fail. Moreover, every provider can create custom agents tailored to its own requirements by mixing existing modules or by designing new ones. The MAS relies on the following modules: monitor, negotiator, scheduling, executor and healing modules.

5.1.2 Managed Resources and Touch-points

Managed resources help the platform store and handle relevant data, including task, resource and module information. Managed resources include storage, database and execution resources (cf. Fig. 5.1.1). A special case is represented by the resource queues (the dotted box in the figure), which are not actual resources but are strongly linked to the execution resources, as they provide the order in which tasks execute. In the proposed MAS they are implemented as simple ordered list constructs that can be accessed directly by means of a service.

Choosing a correct storage environment is essential to any DS. The storage system needs to be reliable and to provide fast access to its data. Because DS can be volatile in terms of node availability and network traffic, we have opted for a distributed file system where failures of single nodes do not affect the overall system behaviour. The Hadoop Distributed File System (HDFS) (Fou10b) is a suitable choice as it provides fault tolerance, scalability up to several thousands of nodes and data rebalancing on cluster nodes, and it can take into consideration the physical location of a node when allocating storage. HDFS was initially developed for clusters. However, it can be extended to inter-cluster usage as long as the same trust policies exist between peers. In a multi-provider environment this is somewhat difficult to implement as partners usually want to keep their internal policies. This problem was solved by using an FTP server that allows agents to connect to any remote HDFS. Hence the FTP server works as a middle layer between the SP and the HDFS.

Storing information related to task-to-resource mappings and task ordering in resource queues is also important, as it allows the SP to take its scheduling decisions. To this aim


the MAS relies on a distributed database system called HBase (Fou10a) that is based on a key-value access model. This database is primarily used by the scheduler, the executor and the monitor modules.

Execution resources represent the resources on which modules run or where tasks execute. The choice of the actual task deployment platform is left to the IP. Examples of deployment platforms include service oriented (e.g., any WS/GS platform), component oriented (e.g., Frascati (tIL10)) or job oriented (e.g., Condor (TTL05), NetSolve (CD97), Legion (CKKG99)) ones.

Platform touch-points represent the means of binding the modules to the managed resources (i.e., the dotted arrows in Fig. 5.1.1). Usually they are specific to the software solution used for implementing the managed resource. For instance, the touch-point between HDFS and the MAS is achieved through FTP; the touch-point between HBase and the MAS is achieved through the HBase query language, etc. A distinct case of touch-points is represented by the inter-module communication.

Communication is essential in any distributed system and facilitates the dialogue between the system's components. Messages are handled by RabbitMQ (VMW10). Section 4.1.2 lists the reasons for selecting it as the messaging system. As already mentioned, RabbitMQ has the advantage of not requiring a priori knowledge of the number of registered exchanges (i.e., stateless routing tables). This built-in feature is especially useful when issuing negotiation bids, checking the availability of every existing module or attempting to activate modules. It also acts as the de facto Yellow Pages directory where all agents are published. The procedure only requires the agent modules to bind to an exchange and implement the corresponding listeners. The platform relies on broadcast messages to issue negotiation bids and agent notification events. Nonetheless, there are cases in which messages need to be directed to a more particular sub-set of agents. Registration messages, for instance, are routed to the corresponding healing modules by using a routing key specific to each parent IP. The reason for this behaviour is that modules should only notify their existence to self-healing modules inside the parent organization.

As RabbitMQ exchanges are persistent, any unprocessed messages are safely stored when failures occur, either in the communication infrastructure or in the modules themselves.

Data is sent between communication partners by using JSON (JSO10). A message consists of five elements, as depicted in Fig. 5.1.2. This format allows modules to easily recognize whether a message is intended for them and where it originated from. Any message not conforming to this format is automatically ignored.

The meaning of the message components is described in what follows:

• from id – represents the ID of the agent that sent the message;

• to id – represents the ID of the destination agent. In case the message is intended as a broadcast, the symbol * is used;

• content – represents the actual content of the message. The platform supports four standard content formats for describing: a task (cf. Fig. 5.1.3), an agent module (cf.


1: {
2:   "from id" : UUID
3:   "to id" : UUID | *
4:   "content" : value
5:   "message type" : value
6:   "processing name" : value | null
7: }

Figure 5.1.2: Message format for the MAS platform

Fig. 5.1.4), a negotiation winner agent (cf. Fig. 5.1.5) and information related to a provider's resources (cf. Fig. 5.1.6);

• message type – represents a value which specifies the type of the message. It can haveone of the following values:

– NEW TASK – specifies a new task submitted to the platform;

– RESCHEDULING TASK – specifies information about a task that needs reschedul-ing;

– TASK COMPLETED – specifies that a task has been successfully executed bythe executor module;

– TASK ABORTED – specifies that a task has been aborted by the executor moduleand that it should be rescheduled;

– BID RESPONSE – specifies a bid response from one of the scheduling modules;

– BID REQUEST – specifies a bid request from one of the scheduling modules;

– BID WINNER – specifies a bid winner;

– AGENT MODULE ACTIVATION REQUEST – specifies an activation requestfrom a module;

– AGENT MODULE ACTIVATION REPLY – specifies an activation reply froma module;

– AGENT MODULE REGISTRATION REQUEST – specifies a registration re-quest from a new/existing module;

– AGENT MODULE REGISTRATION REPLY – specifies a registration reply fromthe healing module that registered the module;

– AGENT MODULE CHECK SIMILAR REQUEST – specifies a request for iden-tifying if the receiver module is similar with the sender. Two modules are similaronly if they have the same ID, workflow ID and type;

– AGENT MODULE CHECK SIMILAR REPLY – specifies a response after a sim-ilarity check;

136 Marc E. Frıncu

– PLATFORM INFO – specifies a message containing platform information;

– SCHEDULING POLICY CHANGE – specifies a request for changing the schedul-ing policy of a given scheduling module;

– AGENT MODULE PAUSE REQUEST NO REPLY – specifies a no-reply request for pausing a given module;

– AGENT MODULE HEALING REGISTRATION REQUEST – specifies a registration request from a healing module to another.

• processing name – represents the name of the SiLK (FP10) task (atom) to be created after receiving a message. This argument is also useful when using D-OSyRIS as it allows tasks to be created remotely, which in turn start the corresponding workflows.
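As an illustration, a message following the Fig. 5.1.2 format could be assembled and serialized as below. This is a hedged sketch: the make_message helper and the JSON wire encoding are our assumptions, not part of the platform specification.

```python
import json
import uuid

# One of the message types listed above, spelled as in the text.
NEW_TASK = "NEW TASK"

def make_message(from_id, to_id, content, message_type, processing_name=None):
    """Build a message following the field layout of Fig. 5.1.2.

    to_id may be a concrete agent UUID or "*" for a broadcast;
    processing_name names the SiLK atom to create on receipt (or None).
    """
    return {
        "from id": str(from_id),
        "to id": to_id,
        "content": content,
        "message type": message_type,
        "processing name": processing_name,
    }

msg = make_message(uuid.uuid4(), "*", {"id": str(uuid.uuid4())}, NEW_TASK)
wire = json.dumps(msg)  # the message as it would travel between modules
```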

{
    [...]
    "content" :
    {
        "id" : UUID
        "original agent id" : UUID
        "submitter agent id" : UUID
        "description" : value
        "executable location" : value
        "dependencies" : value1#value2#...
        "position on resource queue" : value|-1
        "submission time local" : value
        "submission time total" : value
        "estimated completion time" : value|-1
        "status" : value
    }
    [...]
}

Figure 5.1.3: Content format for a message containing task information

5.1.3 Scheduling Agents

Each scheduling agent implements a scheduling FCL composed of at least one monitor, one negotiator, one scheduler and one executor module (cf. Fig. 5.1.7). To accomplish their business objectives, the scheduling agents must be able to adjust their behaviour according to environmental conditions that affect the state of computational resources. Feedback loops are important models in the engineering of adaptive scheduling systems. They define how the interactions among an agent's modules ensure successful task scheduling according to SLAs


{
    [...]
    "content" :
    {
        "uuid" : UUID
        "workflow uuid" : UUID
        "module type" : UUID
        "status" : value
        "agent parent id" : value
    }
    [...]
}

Figure 5.1.4: Content format for a message containing agent module information

{
    [...]
    "content" :
    {
        "winner uuid" : UUID
    }
    [...]
}

Figure 5.1.5: Content format for a message containing the winner agent information

and desired properties. Moreover, context information that characterizes the situation of resources and tasks must be monitored in order to control the rescheduling process. In what follows we detail the modules that compose every scheduling agent.

The Monitor Module

From a dynamic scheduling perspective, context can be defined as any information that characterizes the state of entities that affect the behaviour of computational resources and tasks. This context information must be modelled so that it can be discovered, pre-processed after its acquisition from the environment and handled to be provisioned based on the system's requirements (VM10). Instances of context entities relevant for scheduling platforms are services, processing nodes, SLAs and users.

Monitor modules that are part of scheduling agents are responsible for gathering information regarding the current situation of tasks and computational resources. Context acquisitors deployed as context probes gather, at given events or time intervals, information related to: resource load; system heterogeneity; task size in bytes; task execution requirements for a particular system state (e.g., required processing power in flops); and task failures


{
    [...]
    "content" :
    {
        "heterogeneity factor" : value
        "avg task size" : value
        "avg task EET" : value
        "avg std dev task EET" : value
        "avg std dev task size" : value
        "no tasks" : value
        "no long tasks" : value
        "no resources" : value
    }
    [...]
}

Figure 5.1.6: Content format for a message containing platform information

Figure 5.1.7: The feedback loop for the task rescheduling process

from logs. Monitoring actions are triggered by task arrivals/completions or by changes in resource availability. Whenever a new task is ready to be scheduled, the monitor gathers the relevant task information and sends it to the negotiator module, which acts as the analyser in the scheduling feedback loop. Monitored information about task failures includes wrong input files, invalid dependencies and resource failures. Resource availability is periodically monitored by sending ping signals to the task executor module.


The Negotiator Module

The Negotiator Module is part of the analyser component. The first step in the negotiation process is the selection of the desired policy to guide the intra-scheduling process. The selection is based on a best cost provider approach where every available policy is evaluated in the light of the current system configuration. This configuration is gathered from the data contained in the PLATFORM INFO messages sent periodically by each underlying platform. Once the policy that provides the best solution is selected, a SCHEDULING POLICY CHANGE message is sent to the scheduler requesting the loading of the corresponding SA rule-based definition. The message contains the location of the new rule base to be loaded. Section 5.3 will present this selection policy in greater detail.

The negotiator module's main purpose is inter-provider task relocation. Relocation usually follows a scheduling policy that extends/complements the possibly different heuristics at intra-provider level. In order to reduce the number of messages the negotiator relies on DMECT. This policy only relocates tasks that have exceeded a certain LWT threshold on the local resources. The reason for trying to minimize the number of task relocations is a direct consequence of the results of Tumer et al. (TL09), which have shown that a MAS obtains the best scheduling results when the agents' impact on the system is minimized. We argue that in our system this goal is achieved when both the number of bid requests and the number of scheduling agents involved in the negotiation are reduced. The former can be adjusted by tuning the rate of bid requests inside the scheduling policy – as described in Section 3.4 – while the latter can be adjusted by reducing the number of agents involved in the negotiation phase. Section 5.2 will detail the agent reduction process.

After the proper scheduling policies have been selected, the actual negotiation between scheduling modules can begin. Every negotiation flow comprises several events linked in a workflow (cf. Fig. 5.1.8) that can start in two different ways depending on whether the scheduling module pro-actively requests or passively awaits new tasks. When using DMECT the default behaviour is to pro-actively request task relocations; this built-in feature of the algorithm is the main reason for this conduct. The negotiation follows an auction-based model (SLGW02) where each scheduling module provides a bid for every negotiation request. The bid represents a value of the task cost function (e.g., makespan or lateness). The winner of the bid is designated to be the module that provided the lowest cost value within a certain amount of time. As agents are not aware of each other it is impossible for a negotiation request issuer to know the exact number of existing partners. For this reason negotiation ends either when a specific time-out has been reached or when a minimal number of registered scheduling modules has provided bids.

Inside the system, task relocation commences when a task exceeds a certain LWT on its currently assigned resource queue. The agent overseeing the task requests cost bids from all partner agents by broadcasting a message containing information related to the task at hand. The broadcast is received by all scheduling modules, including the module that initiated the request. Each scheduling module adds the task to its list, marks it as temporary, executes the scheduling algorithm and assigns it to a resource queue. The cost function estimate is then sent back to the negotiating module that handles the bid request.
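The termination rule described above – lowest cost wins, stop either on a time-out or once a minimal number of bids has arrived – can be sketched as follows. The collect_bids helper and the in-process queue standing in for the messaging layer are illustrative assumptions.

```python
import time
from queue import Empty, Queue

def collect_bids(bid_queue, min_bids, timeout):
    """Collect (module_id, cost) bids until `min_bids` responses arrive or
    `timeout` seconds elapse, then pick the lowest-cost bidder. Agents do
    not know how many partners exist, so waiting for "all" bids is not an
    option; the two stop conditions mirror the text."""
    bids = []
    deadline = time.monotonic() + timeout
    while len(bids) < min_bids:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time-out reached
        try:
            bids.append(bid_queue.get(timeout=remaining))
        except Empty:
            break
    if not bids:
        return None  # nobody answered in time
    # the returned ID is what would be broadcast in the BID WINNER message
    return min(bids, key=lambda bid: bid[1])[0]

q = Queue()
for bid in [("module-a", 420.0), ("module-b", 130.0), ("module-c", 250.0)]:
    q.put(bid)
winner = collect_bids(q, min_bids=3, timeout=0.5)
```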


Figure 5.1.8: State transitions between the communication language during the negotiation phase

The final scheduling decision is achieved by broadcasting the ID of the winner module to all the registered scheduling modules. The modules that do not match the winner's ID erase the temporary task from the list.

This negotiation protocol is similar to the DynCNET protocol described by Weyns et al. (WBHD07). DynCNET is a protocol for dynamic task assignments in MAS. It also consists of a four-step negotiation flow: (1) sending a call for proposals; (2) answers to the call from the participants; (3) notification of the winner; and (4) task execution. Another


similarity with our protocol is that DynCNET allows changing at runtime the participant assigned to execute the task and supports message synchronization by using confirmation messages. The usage of the latter in our proposed system will be detailed in Sect. 5.1.4.

Despite the similarities between DynCNET and our negotiation policy, there are a couple of differences, including optimizations in the number of partners used in the call for proposals (cf. Sect. 5.2) and the capacity to extend the protocol beyond this one-step negotiation version by using plug-ins. In this way each provider can create and use its own policies for accepting/issuing bids.

The convergence of the negotiation protocol (i.e., ensuring that tasks do not get endlessly reallocated between the same two agents) is guaranteed by the convergence of the DMECT algorithm, as demonstrated in Sect. 3.4.2.

Task relocation does not imply moving all the task files and dependencies during the negotiation phase. In order to minimize transfer costs during negotiation only task meta-data is sent between modules. The data itself (i.e., the actual task) is sent only when the executor module submits the task for execution.

The Scheduling Module

Scheduling modules are in charge of the actual task-to-resource mapping. In order to maintain autonomy each provider is allowed to use its own internal scheduling policy. Based on this policy the module can decide on two things: (1) to relocate/accept tasks to/from other agents and (2) to reschedule some tasks on the provider's resource queues. The scheduling module operates at meta-level (i.e., inter-provider scheduling). Once a task is submitted to a resource for execution the scheduling module relinquishes any control over it and the scheduling proceeds according to the rules of the internal schedulers (e.g., Condor, Legion, NetSolve).

A scheduling module is usually attached to one or more resources (e.g., that form a cluster) belonging to a single provider. Modules can even share the same resources for faster processing, as shown in Sect. 4.2.2 where the distributed DMECT was discussed.

To facilitate management and the separation of the scheduling policy from the scheduler itself, the algorithms are implemented using the SiLK (cf. Sect. 4.2.1) language. SiLK SA rules can be easily loaded and unloaded from the rule base without the need to restart the module's OSyRIS engine. Rule bases are usually changed in the following cases: failures of the inference engine; changes in the scheduling policy as dictated by the provider; and adaptations to new platform characteristics which require different scheduling heuristics to optimize the schedule. Once the rule base has been loaded, OSyRIS fires all the rules that have their conditions matched in the working memory. The working memory is updated automatically after every successful rule execution. The process continues until either an error occurs or a special rule that puts the system into an idle state is fired. In either case the working memory is cleared and the cycle restarts. When triggering rules, service invocations are made for specific methods that provide information related to resources and tasks (cf. Sect. 4.2).


{
    [...]
    "content" :
    {
        [...]
        "description" :
            "Requirements = Memory >= 32 && OpSys == 'LINUX' && Arch == 'INTEL'
             Rank = Memory >= 64
             Image Size = 28 Meg
             Error = err.file
             Input = in.file
             Output = out.file
             Log = foo.log"
    }
}

Figure 5.1.9: Example of task description requirements using a Condor-like format

Rescheduling task issues (i.e., failures and optimization) are generally solved by relocating tasks to different resources. Task rescheduling is mandatory in any DS because of both the volatile nature of resource nodes and the online task arrival rate. Hence the system needs to adapt to various system changes.

Special failure cases are represented by already running tasks that failed to finish before the cost for the task has been computed and by resources that have gone off-line during task execution. In both cases the tasks are considered to be aborted and the scheduler will attempt to reschedule them. Because it is not trivial to distinguish dead resources from overloaded ones, it is likely to end up with several instances of the same task running in parallel. In this case we opted for an approach where only the first received result is considered valid. Once an instance has completed, all the executor modules running clone tasks are informed that their tasks can be stopped (if possible) or ignored after finishing.
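The first-result-wins policy can be sketched as below. CloneTracker is a hypothetical name: the real platform tracks completions through its messages and the distributed database rather than an in-memory object.

```python
class CloneTracker:
    """First-received-result policy for clone tasks: when a dead resource
    cannot be told apart from an overloaded one, the same task may end up
    running on several resources in parallel; only the first completed
    instance is accepted, and the remaining clones are stopped or ignored."""

    def __init__(self):
        self._accepted = {}  # task_id -> (executor_id, result)

    def report_completion(self, task_id, executor_id, result):
        """Return True if this instance is the first one, i.e., its result
        is the one considered valid."""
        if task_id in self._accepted:
            return False  # a clone already finished; discard this result
        self._accepted[task_id] = (executor_id, result)
        return True

    def accepted_result(self, task_id):
        return self._accepted[task_id][1]

tracker = CloneTracker()
first = tracker.report_completion("t1", "exec-1", "out-A")
second = tracker.report_completion("t1", "exec-2", "out-B")
```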

An important aspect of the scheduling process is the computation of the task cost. This is computed based on the resource characteristics at a given moment and on the task requirements specified in the task description. The format for describing tasks is similar to that used in Condor submission description files. For instance we can describe: the input, output, error or log files; and the runtime requirements such as CPU, memory or preferred architecture (cf. Fig. 5.1.9).

The rescheduling process is linked to the negotiation phase. Every time tasks get rescheduled the scheduler sends a RESCHEDULING TASK message to the negotiator. After receiving the message the negotiator broadcasts BID REQUESTS to the registered scheduling modules. These add the temporary task contained in the BID REQUEST message and execute the scheduling policy. Once the task costs have been computed a reply containing a


BID RESPONSE is sent back to the negotiator. The message contains the cost for the temporary task. The scheduling modules then wait for the BID WINNER message containing the winning scheduling module ID. Only the winning module keeps the temporary task in its queues.

Modifying the previously described DMECT rules to fit the negotiation scheme implies some minor changes. These comprise a set of new rules which allow the MAS to send RESCHEDULING TASK and BID REQUEST messages to the negotiator and to receive BID WINNER messages from it. To achieve this we first need to integrate the scheduling policy with the agent. Therefore we introduce the agentUuidSelf and getAgentIDByUuid atoms, which let us identify the owner agent of the scheduling module. Furthermore we also introduce the sendTaskToNegotiator and the sendBidResponseToNegotiator atoms. These are used for sending the RESCHEDULING TASK and the BID REQUEST messages to the negotiator module. Between the moment a task is sent to the negotiator and the moment a bid winner is designated, the task goes through several statuses, as described in the following paragraphs. Figure 5.1.10 also details these transitions.

Figure 5.1.10: State transitions between the task statuses

A task sent from the negotiation module to the scheduling modules is in the SUBMITTED status. Newly submitted tasks also share this state because they are handled by the negotiator module before being assigned to any resource. This state also marks the task as a temporary task, since it is not permanently assigned to a resource until a bid winner is designated. Once the scheduling module gives a cost estimate, a SUBMITTED task becomes RESOLVED. The winner module will change the status of the RESOLVED task to ASSIGNED while the rest of the modules will erase the temporary task from the task list. During processing, every SUBMITTED and ASSIGNED task will be locked by switching its status to LOCKED SUBMITTED, respectively LOCKED ASSIGNED.
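The status transitions described above can be sketched as a small state machine. The transition table is our reading of the text (a subset of Fig. 5.1.10) and the advance helper is illustrative.

```python
from enum import Enum, auto

class TaskStatus(Enum):
    SUBMITTED = auto()         # temporary: not yet permanently placed
    RESOLVED = auto()          # a cost estimate has been produced
    ASSIGNED = auto()          # this module won the bid for the task
    LOCKED_SUBMITTED = auto()  # SUBMITTED, locked during processing
    LOCKED_ASSIGNED = auto()   # ASSIGNED, locked during processing

# Allowed transitions as described in the text (subset of Fig. 5.1.10).
ALLOWED = {
    TaskStatus.SUBMITTED: {TaskStatus.RESOLVED, TaskStatus.LOCKED_SUBMITTED},
    TaskStatus.LOCKED_SUBMITTED: {TaskStatus.SUBMITTED},
    TaskStatus.RESOLVED: {TaskStatus.ASSIGNED},  # winner only; losers drop the task
    TaskStatus.ASSIGNED: {TaskStatus.LOCKED_ASSIGNED},
    TaskStatus.LOCKED_ASSIGNED: {TaskStatus.ASSIGNED},
}

def advance(status, new_status):
    """Move a task to new_status, rejecting transitions not in the table."""
    if new_status not in ALLOWED.get(status, set()):
        raise ValueError(f"illegal transition {status.name} -> {new_status.name}")
    return new_status

s = advance(TaskStatus.SUBMITTED, TaskStatus.RESOLVED)
s = advance(s, TaskStatus.ASSIGNED)
```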

Algorithm 5.1.1 shows the new rules that have been introduced inside the DMECT heuristics in order to adapt the policy to the MAS negotiation.

The new rules change the existing policy described in Sect. 4.2.1 as follows: during the initialization phase an atom which holds information about the scheduling module's agent is created. This atom is essential as it lets modules identify tasks belonging to them based


Algorithm 5.1.1 The new rules added to the DMECT policy in order to enable it for agent-based negotiation

1: I:=[o1:output,i1:input,"processing"="init","instances"="1"];
2: AU:=[i1:input,o1:output,"processing"="agentUuidSelf"];
3: AI:=[i1:input,o1:output,"processing"="getAgentIDByUuid"];
4: ST:=[i1:input,i2:input,o1:output,"processing"="sendTaskToNegotiator"];
5: SBR:=[i1:input,o1:output,"processing"="sendBidResponseToNegotiator"];
6: I, AU[a=o1] -> 1:AI[i1=a];
7: 1:I, AI[a=o1#consume=false] -> FT[i1=a], FR[i1=a] | a != "-1";
8: 2:NR[r=o1#a=o2], AM[t=o1], AI[b=o1#consume=false] -> UT[i1=t], SBR[i1=t], 6:NT[i1=t#i2=b], R[i1=r] | a == "true" and r != "-1" and b != "-1";
9: 4:IsTime[t=o1#i=o2#a=o9#m=o14], RT[rt=o1#consume=false], AI[u=o1#consume=false] -> NMT[i1=m#instances=1], 7:ST[i1=t#i2=u] | i == "true" and a == "false";
10: 7:ST[t=o1], AI[a=o1#consume=false] -> 6:NT[i1=t#i2=a];
11: 5:T, NMT[t=o1] -> Sleep[i1=t];
12: 5:Sleep, AI[a=o1#consume=false] -> FT[i1=a], FR[i1=a] | a != "-1";

on the agent ID. The ID of the first task and the ID of the resource are then obtained. The third rule (line 8) specifies the events and conditions that will trigger a BID RESPONSE message. The rule specifies that only when all the resources have been considered (i.e., a == "true") for computing the minimal cost will the task be sent to the negotiator (the SBR[i1=t] atom). Afterwards the algorithm proceeds to the next task in the list starting from the first resource (the atoms 6:NT[i1=t#i2=b], R[i1=r]). The UT[i1=t] atom specifies that the current task can be unlocked from exclusive usage. The fourth rule (line 9) specifies when a RESCHEDULING TASK message containing task information needs to be sent to the negotiator. The message can only be sent when the task waiting time on the local queue (atom IsTime[t=o1#i=o2#a=o9#m=o14]) has been exceeded (i.e., i == "true" and a == "false"). The fifth rule (line 10) specifies that after a task has been sent to the negotiator the algorithm can proceed to the next task in the list.

To relieve the scheduler of the burden of rescheduling tasks at fixed intervals – which can lead to iterations where no task is rescheduled but the sweep is still performed on all existing tasks – an event-driven mechanism has been devised. As a result rescheduling is performed in two cases:

• when a BID REQUEST is received from the negotiator. In this case an initialisation atom I which triggers the workflow execution is created inside the solution. A BID REQUEST follows a NEW TASK, TASK COMPLETED or TASK ABORTED message received by the negotiator. Rescheduling follows these messages because they are the only times when the resource load changes;

• when the first is time to move (cf. Relation 3.4.1) is reached. This time is computed during each rescheduling when the IsTime atom reacts to trigger further rules through


the NMT[i1=m] atom. When all the tasks have been iterated a sleep event causes the rule firing to cease for a certain amount of time (lines 11 and 12).

The Executor Module

The purpose of the executor modules is to take tasks from resource queues and start their execution. When the resources are exposed as services the module's role is to submit and monitor the tasks' execution by using the service's interface. Before task execution, the task's data, including the executable/jar archive and its dependencies, is retrieved from HDFS and copied to the resource.

To determine the set of tasks to be executed, the executor accesses the resource queue by querying HBase, processes the first n tasks, ordered by execution number (or TWT in the case of DMECT and MinQL), and starts the task transfer. The status of every task that is executed is changed to LOCKED EXECUTING (cf. Fig. 5.1.10).

Once the archive is copied to the corresponding resource, the executor extracts it and starts the execution. It is up to the provider to specify a means for executing the tasks. The executor is custom tailored to fit the provider and acts merely as a liaison between the actual deployment platform and the MAS. As soon as the execution is completed, the task status is changed to COMPLETED and the results are archived and sent back to HDFS. A completed execution is not guaranteed to be a successful one. The actual state can be queried by the monitor module from the execution logs. In case of failure the task will be rescheduled by sending to the negotiator module a TASK ABORTED message which contains information about the aborted task. Once this message is received, the negotiator will issue a BID REQUEST to the available schedulers to determine where to reschedule the failed task. From this point the process is similar to the one taken during a negotiation step.
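A hedged sketch of this executor loop is shown below. The HBase query and HDFS staging are stubbed out, and pick_tasks, run and execute_batch are illustrative names, not the platform's API.

```python
def pick_tasks(queue, n):
    """First n tasks ordered by execution number (TWT would be the sort
    key instead under DMECT/MinQL). Stands in for the HBase query."""
    return sorted(queue, key=lambda t: t["execution_number"])[:n]

def run(task):
    # stand-in for staging the archive to the resource and executing it;
    # success or failure would really be read back from the execution logs
    return task["payload"] != "broken"

def execute_batch(queue, n, notify):
    for task in pick_tasks(queue, n):
        task["status"] = "LOCKED EXECUTING"  # cf. Fig. 5.1.10
        ok = run(task)
        task["status"] = "COMPLETED"         # completed, not necessarily successful
        if not ok:
            # trigger rescheduling through the negotiator
            notify({"message type": "TASK ABORTED", "content": task["id"]})

events = []
queue = [
    {"id": "t2", "execution_number": 2, "payload": "ok", "status": "QUEUED"},
    {"id": "t1", "execution_number": 1, "payload": "broken", "status": "QUEUED"},
]
execute_batch(queue, n=2, notify=events.append)
```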

5.1.4 Self-Healing Agents

Self-healing agents built on top of the MAS SP provide the infrastructure with an intelligent FCL that supports module recovery capabilities. Self-healing agents are designed similarly to the multi-agent feedback loop that is part of the autonomic system proposed by Caprarescu and Petcu (CP09). Module failures are caused by the execution breakdown of any of the modules that make up the MAS.

Self-healing agents implement a self-healing FCL composed of at least one monitor module and one self-healing module that implements the analyser, the planner and the executor required to support module recovery capabilities.

As presented in Fig. 5.1.11, the monitor module is responsible for gathering information about the status of modules. Data gathering mechanisms are supported either actively, by sending requests or pings, or passively, by receiving messages from the monitored modules. Context sources are represented by the distributed database and message queues. The first one provides information for determining the load of the system and the DS topology while


the second one offers data on the existing modules. The monitored information is sent to the analyser which determines the current state of the monitored modules.

Modules that become active send an AGENT MODULE REGISTRATION REQUEST message to the monitor modules of the self-healing agents. The self-healing agents that receive the message belong to the same provider as the sender module. This type of message is periodically sent to the self-healing agent in order to notify it of the registered module's availability. The messages are followed by AGENT MODULE REGISTRATION REPLY message replies. By using the monitored information the analyser checks whether or not any of the registered modules has timed out. Once a time-out has been reached, the analyser sends the relevant details about the failure to the planner. In addition an AGENT MODULE PAUSE REQUEST NO REPLY message is sent to the failed module, which has the effect of pausing it. This is necessary in case false healing events are triggered. Section 5.4 will detail this problem by showing how it can be avoided by fine tuning the platform parameters. The planner is responsible for searching for a paused clone module that can be used to replace the failed one. Clone modules are registered modules that are in an idle state. Their status is set to PAUSED. They receive messages from the environment and also send notification pings, yet do not execute any of their usual operations.

After a clone module has been identified, the planner sends the details to the executor, which activates it by setting its status to READY FOR RUNNING. The activation request is encoded in an AGENT MODULE ACTIVATION REQUEST message. In case no such clone exists the MAS sends a multicast message searching for resources willing to deploy modules on demand. If at least one resource is found, its endpoint identification (e.g., IP address) is sent to the executor, which transfers the module logic and data to it and executes them. On-demand deployed agent modules resemble mobile agents (AMA09).

To avoid running several identical modules, every registered module having a READY FOR RUNNING state sends a broadcast AGENT MODULE CHECK SIMILAR REQUEST message searching for identical siblings. This is followed by a change of its status to WAITING TO RUN. Every registered module checks to see if it is a sibling clone, parent or child of the module that issued the similarity check. If a module is in one of these three cases it will reply with an AGENT MODULE CHECK SIMILAR REPLY. Depending on the broadcast outcome (i.e., the number of AGENT MODULE CHECK SIMILAR REPLY message replies) the module either activates by switching its status to RUNNING or becomes idle with its status set to PAUSED. Independently of the outcome, the activation request is followed by an AGENT MODULE ACTIVATION REPLY which informs the self-healing agent of the targeted module's new status.
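The similarity test and the resulting activation decision can be sketched as follows. The field names mirror the module content format of Fig. 5.1.4, and the similar and decide_activation helpers are illustrative.

```python
def similar(a, b):
    """Two modules are similar only if they share the same ID,
    workflow ID and type (the AGENT MODULE CHECK SIMILAR test)."""
    return (a["uuid"] == b["uuid"]
            and a["workflow_uuid"] == b["workflow_uuid"]
            and a["module_type"] == b["module_type"])

def decide_activation(candidate, registered):
    """A module in READY FOR RUNNING activates only when no sibling
    clone answered the similarity broadcast; otherwise it goes idle."""
    replies = [m for m in registered
               if m is not candidate and similar(m, candidate)]
    return "RUNNING" if not replies else "PAUSED"

mods = [
    {"uuid": "m1", "workflow_uuid": "w1", "module_type": "scheduler", "status": "RUNNING"},
    {"uuid": "m1", "workflow_uuid": "w1", "module_type": "scheduler", "status": "READY FOR RUNNING"},
    {"uuid": "m2", "workflow_uuid": "w1", "module_type": "monitor", "status": "READY FOR RUNNING"},
]
status_clone = decide_activation(mods[1], mods)  # an identical sibling exists
status_fresh = decide_activation(mods[2], mods)  # no identical module registered
```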

Self-healing modules are also able to recover themselves from failures. As all modules are registered during their activation, self-healing modules are also monitored by partner modules and are subject to the same recovery mechanisms. Self-healing modules register to partner modules by issuing AGENT MODULE HEALING REGISTRATION REQUEST messages. These messages are similar to the AGENT MODULE REGISTRATION REQUEST messages sent by regular modules.

A special category of failures consists of infrastructure failures. These are failures that


Figure 5.1.11: The feedback loop for the module recovery process

cannot be handled explicitly by the self-healing mechanisms implemented on top of the MAS SP. They include most of the problems that arise from infrastructure-related issues. In this work the infrastructure is seen as the bundle of existing physical resources, network fabric, communication system, database storage and file system.

The first three problems are solved by the autonomic nature of the MAS and are similar in effects and responses to module failures. The MAS can be viewed as a tree with the negotiator at the root, the providers' schedulers at the second level and the executors as leaves. Failures at any level allow the system to continue functioning partially even without recovery, as follows: without a negotiator the providers' schedulers would continue scheduling tasks without inter-provider migration; without schedulers the executors would continue executing tasks in the order provided by the last schedule; without some of the executors the scheduler would reschedule tasks to resources still having executors attached to them. Other combinations of failures would lead to similar results.

Failures in the distributed database and file system are dealt with automatically by the system itself as it usually replicates data on several nodes to increase fault tolerance.

Self-healing modules that have been isolated from the platform will continue monitoring and healing reachable modules. Hence isolated islands of modules can continue functioning even when separated from the platform.

5.2 Reducing the Number of Agents used in the Negotiation Phase

Having a large number of agents communicating with each other can lead to network bottlenecks, especially when large amounts of data need to be relocated often.


As a result a MAS would benefit from a smart reduction of the number of agents involved in the scheduling process. This reduction needs to minimize the number of agents so that the schedule's cost function is not affected by it. Reducing the number of agents diminishes the number of sent messages and the amount of transferred task data, as well as the number of resources involved in the scheduling. This is important as most scheduling heuristics have a complexity of O(m × n). In spite of the fact that the number of resources m is significantly smaller than the number of tasks n, the reduction in resource numbers nonetheless improves the algorithm's performance.

In this work we use a selection method based on the Fuzzy C-Mean algorithm (CLSS01) for clustering agents. We opted for this solution because clustering algorithms allow data to be partitioned based on similarities found inside it. The algorithm requires several parameters, depicted in what follows: acceptTasks, which specifies whether the agent module accepts (1) or not (0) tasks; offerTasks, which specifies whether the agent module offers (1) or not (0) tasks; noTasks, which represents a normalized value of the number of tasks handled by this module at a given moment in time; and a list in which an element operation_k indicates whether the agent module supports (1) or not (0) the respective operation, for k = 1..n. Here n indicates the total number of operations supported by the platform and is used mostly when tasks are submitted for execution through a service interface that supports a restricted number of operations.

When trying to find a suitable cluster for a specific task the previous parameters are set as follows: acceptTasks=1, offerTasks=0.5, noTasks=0, and operation_k = 1 for all operations required by the task.
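Building the n+3 feature vector for a module, and the query vector just described, can be sketched as follows. The operation names are hypothetical and the helper is illustrative.

```python
def feature_vector(accept_tasks, offer_tasks, no_tasks, supported_ops, all_ops):
    """Build the n+3 attribute vector used for clustering a scheduling
    module: acceptTasks, offerTasks, the normalized noTasks, then one
    0/1 entry per platform operation (operation_k)."""
    return ([float(accept_tasks), float(offer_tasks), float(no_tasks)]
            + [1.0 if op in supported_ops else 0.0 for op in all_ops])

# Hypothetical platform operations exposed through the service interface.
ALL_OPS = ["render", "transcode", "simulate"]

# Query vector for placing a task that needs "render" and "simulate":
# acceptTasks=1, offerTasks=0.5, noTasks=0, operation_k=1 for required ops.
query = feature_vector(1, 0.5, 0, {"render", "simulate"}, ALL_OPS)
```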

In order to model the problem using the C-Mean algorithm we also need to define a set of parameters. For every existing scheduling module a vector x_j consisting of the n+3 parameters is created. So we obtain X = {x_1, x_2, ..., x_m}, a set consisting of the feature attributes belonging to the m agents. The centroid matrix C = {c_1, c_2, ..., c_l} represents the centroids of every existing cluster. Initially all values are randomly chosen inside the [0, 1) interval. The fuzzy membership matrix is represented by U = [u_{ij}], i = 1..l, j = 1..m, where \sum_{i=1}^{l} u_{ij} = 1 for every j and \sum_{j=1}^{m} u_{ij} > 0 for every i. Every u_{ij} element is computed according to Equation 5.2.1:

u_{ij} = \frac{\left[ \frac{1}{d^2(x_j, c_i)} \right]^{\frac{1}{q-1}}}{\sum_{k=1}^{l} \left[ \frac{1}{d^2(x_j, c_k)} \right]^{\frac{1}{q-1}}}    (5.2.1)

where q > 1 is the weighting exponent and controls the fuzziness of the resulting memberships. The u_{ij} elements represent the degree of membership of element x_j in cluster c_i. In the case of the fuzzy C-Mean algorithm the following function must be minimized:

J_q(U, C) = \sum_j \sum_i u_{ij}^q \, d^2(x_j, c_i)    (5.2.2)

where q > 1.

Tests on the algorithm used a fuzziness value q = 1.25 and a stopping condition of 10^{-10}. The number of agents was set to 100 and tests were repeated a number



Figure 5.2.1: The influence of the number of clusters over the makespan produced by DMECT

of 20 times. The number of clusters increased with each test from one to ten clusters and the pair <mean makespan, standard makespan deviation> was retained after each test. The distance d was selected to be the Euclidean distance. For each scheduling module 250 tasks were generated, each with an estimated completion time following a Pareto distribution with a minimum EET value of 1,000 ms and a shape parameter of 2. For the task size generation a Pareto shape parameter of 1.3 was used. This provided 25,000 tasks to be scheduled for the entire platform. The task arrival rate was modelled by using a degree-8 extrapolation polynomial (cf. Sect. 3.1). The Grid'5000 network was used as topology. At the meta-level the DMECT algorithm was used, while at infrastructure provider level a simple FIFO policy was in place. The cost function was set to C_max.

Figure 5.2.1 depicts the main test results. It shows the evolution of the makespan when clustering is used. The plot shows the mean makespan obtained from the DMECT heuristics, the standard deviation of the makespan (grey polygon) and the minimum and maximum values obtained in the tests. Tests reveal that the cost function obtained by applying the chosen clustering technique does not differ significantly in the 2-cluster case (-5%), while a maximum mean degradation (relative to the best case) is observed for 4 (+47%), respectively 10 (+44%) clusters. Test results also allow us to infer that the safest way of reducing the number of involved schedulers without negatively influencing the makespan is to permit the negotiator to use 2 clusters during the scheduling module selection phase. Despite the fact that the clustering method does not provide a way of reducing the number of agents by a predetermined amount, the tests that we performed show (cf. Table 5.2.1) that in the case of 2 clusters the agents are divided into two groups consisting of approximately 50.5 ± 4.02, respectively 49.5 ± 4.02 agents. As the groups in the 3-cluster case – which exhibits a makespan deterioration of 25.97% – are made up of approximately 31 agents each, we can assume that the number of agents suited (i.e., that support the task and offer better task costs) for executing the selected tasks is somewhere between 31 and 51.
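The clustering step itself can be sketched as a plain fuzzy C-means loop that alternates the membership update of Equation 5.2.1 with the standard centroid update minimizing Equation 5.2.2. This is an illustrative re-implementation on toy data, not the code used in the experiments.

```python
import random

def fcm(points, n_clusters, q=1.25, eps=1e-10, max_iter=500, seed=0):
    """Toy fuzzy C-means: alternate the membership update of Eq. 5.2.1
    with the centroid update until J_q (Eq. 5.2.2) stops decreasing by
    more than eps. Centroids start at random values in [0, 1)."""
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]

    def d2(x, c):
        # squared Euclidean distance, floored to avoid division by zero
        return sum((a - b) ** 2 for a, b in zip(x, c)) or 1e-12

    prev_j = float("inf")
    u = []
    for _ in range(max_iter):
        # Eq. 5.2.1: u_ij ~ (1/d^2(x_j, c_i))^(1/(q-1)), normalized over clusters
        u = []
        for x in points:
            w = [(1.0 / d2(x, c)) ** (1.0 / (q - 1)) for c in centroids]
            s = sum(w)
            u.append([wi / s for wi in w])
        # centroid update that minimizes Eq. 5.2.2 for fixed memberships
        for i in range(n_clusters):
            den = sum(u[j][i] ** q for j in range(len(points)))
            centroids[i] = [
                sum(u[j][i] ** q * points[j][k] for j in range(len(points))) / den
                for k in range(dim)
            ]
        j_val = sum(u[j][i] ** q * d2(points[j], centroids[i])
                    for j in range(len(points)) for i in range(n_clusters))
        if prev_j - j_val < eps:
            break
        prev_j = j_val
    return u, centroids

# two well-separated groups of 2-D feature vectors
pts = [[0.1, 0.1], [0.2, 0.15], [0.9, 0.85], [0.95, 0.9]]
memberships, centers = fcm(pts, 2)
```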

150 Marc E. Frıncu

Table 5.2.1: Agent grouping per cluster

# clusters   Agent repartition
1            100
2            50.5±4.02, 49.5±4.02
3            35.5±10.87, 31.7±10.58, 32.8±11.85
4            27±3.2, 24.6±4.68, 24.5±5.1, 23.9±3.75
5            18.95±9.93, 22.35±7.17, 20.65±7.24, 19.75±7.25, 18.3±9.78
6            19.15±9.85, 11.8±9.42, 22.55±6.88, 15.3±8.57, 16.35±9.85, 14.85±10.84
7            11.6±8.89, 14.9±8.09, 12.8±9.1, 13.05±7.6, 16.05±9.5, 15.75±10.02, 15.85±10.53
8            14.15±6.68, 9.8±8.59, 13.7±7.12, 11.15±8.88, 12.3±8.95, 11.05±7.8, 14.8±7.36, 13.05±9.99
9            12±9.9, 9.65±7.73, 10.85±10.2, 10.6±6.82, 12.8±6.83, 13.45±7.88, 9.85±8.74, 9.85±8.33, 10.95±9.71
10           12.55±8.71, 9.2±7.87, 8.75±7.68, 8.4±7.26, 10.85±8.7, 11.55±7.92, 8.8±7.97, 12.45±7.46, 8.3±6.57, 9.15±7.16

5.3 Dynamically Changing the Scheduling Policy at Runtime

DS are dynamic and require constant optimization or change of the cost function in order to satisfy user needs. This optimization can be obtained either by using evolutionary approaches based on neural networks or by applying a best policy selection approach. The latter has recently been addressed in the works of Chauhan et al. (CJ10), Etminani et al. (EN07) and Singh et al. (SS08). Their work, however, only addresses the issue of switching between Min-Min and Max-Min and does not offer a generalized solution to the problem. Furthermore, switching and best policy approaches are usually time consuming as they require the algorithm to test every existing scheduling heuristic before taking a decision.

In what follows we present a new approach in which we select the best scheduling strategy to be applied based on clustering (unsupervised learning) and on a neural network (supervised learning). To use these techniques we require a training set composed of characteristics which define both the platform and the tasks for certain platform configurations. Previous work (CLZB00; SS08; MAS+99) has shown that the task size and platform heterogeneity influence the SAs. We therefore selected a set of parameters that reflect these characteristics: the time when the schedule was completed; the mean task EET (in seconds); the mean standard deviation of the EET; the mean task ECT; the mean standard deviation of the ECT; the mean task size (in bytes); the mean standard deviation of the task size; the total number of tasks; and the number of long tasks used in the experiment. The number of long tasks is computed as follows. First the standard deviation, in terms of EET, of the task set is computed. The tasks are ordered ascending by EET and the differences between consecutive values are compared against the standard deviation. When a difference greater than the standard deviation is found, the tasks remaining in the set are considered to be long. If no such difference is found, the standard deviation is compared against a predefined threshold; if the threshold is greater, all tasks are considered to be small. The parameters were chosen to reflect the distribution of the execution times and the number of long tasks, as these factors are known to influence scheduling heuristics. The platform heterogeneity factor h has also been included, as it is known to influence SAs too.
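The long-task counting rule above can be written down directly. A minimal sketch follows, using the population standard deviation and assuming (since the text leaves the case open) that all tasks count as long when the deviation exceeds the threshold:

```python
import statistics

def count_long_tasks(eets, threshold):
    """Count the 'long' tasks in a set of estimated execution times."""
    eets = sorted(eets)                  # order ascending by EET
    dev = statistics.pstdev(eets)
    for i in range(1, len(eets)):
        if eets[i] - eets[i - 1] > dev:  # gap wider than the std. deviation:
            return len(eets) - i         # everything after it is long
    # no such gap: all tasks are small when the threshold dominates the deviation
    return 0 if threshold > dev else len(eets)

print(count_long_tasks([1, 2, 3, 100], threshold=5))  # → 1
```

With EETs {1, 2, 3, 100} the gap 3→100 exceeds the deviation, so exactly the last task is counted as long.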

The previous information can be extracted from traces found at the Grid Workloads Archive (GWA) (ILD+). The format contains information about the submission time, waiting time, running time, number of allocated/used processors, requested execution time/memory, etc. From this data the characteristics used in our model can easily be extracted. The GWA, however, does not contain information regarding the SA used by the underlying system, which is somewhat natural as the algorithm could vary depending on the cluster or site. Because of this limitation we cannot use the real traces found on the GWA website1. As a result we need a method of synthesizing data that models the real traces. A solution is to use the simulator presented in Sect. 3.1. The network topology was selected to be that of Grid'5000. The platform's resources were modified to fit an h=0 and an h=42 network and the tasks were generated using an online model following an arrival rate modelled by Relation 3.1.1. The EETs were generated using a Pareto shape parameter of 2 and a minimum value of 1 s. For the task sizes a Pareto shape parameter of 1.3 with a minimum value of 10^6 bytes was used. This generation model is suited for workstation tasks. A total of 500 tasks were generated this way.

Rescheduling was done at every task completion or task arrival event. Before the actual rescheduling, a Best Selection policy was applied to determine the best SA for the existing configuration. The SAs used for determining the best policy were: Max-Min, Min-Min, Suffrage, MinQL, MinQL-Plain, DMECT and DMECT2. After each step the best resulting SA, together with the parameters described at the beginning of this section, was stored. A total of 303 training data elements (for h = 0) and 366 training data elements (for h = 42) were generated in this way. Figures 5.3.1 and 5.3.2 depict the distribution of the data with regard to some of the input parameters. The generated input data is normalized before being used for either clustering or neural network based classification.

1 http://gwa.ewi.tudelft.nl

Figure 5.3.1: Data distribution for the case of h=0. The panels plot the mean estimated completion time, the mean task size and the mean EET standard deviation against the mean estimated execution time, the estimated execution time against the number of tasks, and the time since start against the mean estimated completion time.


Figure 5.3.2: Data distribution for the case of h=42. The panels plot the mean estimated execution time against the mean estimated completion time, the mean EET standard deviation and the number of tasks, the mean task size against the mean estimated execution time, and the time since start against the mean estimated completion time.

Clustering based technique

The clustering technique used to identify the new scheduling policy is based on the same algorithm as the one used for reducing the number of scheduling agents (cf. Sect. 5.2). Fuzzy C-Mean requires the minimization of Equation 5.2.2 before returning the clusters. The parameters used to create the xj parameter vectors are identical to the ones described previously. As mentioned in the previous section, every xj corresponds


to a configuration of the system given by the insertion of a new set of tasks, generated by the online task generator, into the already existing resource queues. These were used to create 7 clusters. The number of clusters was chosen to be equal to the number of scheduling heuristics. The ideal scenario would be for every cluster to consist of xi elements that share the same best cost value scheduling heuristic. In the real world, however, the clusters are made up of xi elements that have different best cost value scheduling heuristics.

Once the clusters have been computed, a new element xk, k > j, can be classified. This step involves computing the U' = [u'_i0] matrix for the element xk. This matrix is similar to the U matrix, the only difference being that in this case the centroids Ci are already computed and there is a single element in the X vector, namely xk. The parent cluster can thus be selected as the i-th cluster where i : max(u'_i0). Selecting the scheduling policy from the ones found among the cluster's xj elements can be done in two simple ways. The first follows a random approach where the scheduling heuristic is chosen with uniform probability. The second uses a highest number of occurrences approach, where the winner is the scheduling heuristic which provided the best cost for the largest number of xj elements. Our simulated experiments used the first approach.
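This classification step can be sketched as follows, assuming Euclidean distances and the standard Fuzzy C-Mean membership formula with fuzziness m (the thesis's exact U' computation follows Relation 5.2.2, which is not reproduced here):

```python
import math

def memberships(x, centroids, m=2.0):
    """Membership degrees u'_i0 of a single new point x w.r.t. fixed centroids."""
    d = [math.dist(x, c) for c in centroids]
    if any(di == 0.0 for di in d):  # x coincides with a centroid
        return [1.0 if di == 0.0 else 0.0 for di in d]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((d[i] / d[j]) ** p for j in range(len(d)))
            for i in range(len(d))]

def parent_cluster(x, centroids, m=2.0):
    """Index i of the parent cluster, i.e. i : max(u'_i0)."""
    u = memberships(x, centroids, m)
    return max(range(len(u)), key=u.__getitem__)

print(parent_cluster((1.0, 1.0), [(0.0, 0.0), (10.0, 10.0)]))  # → 0
```

The memberships of a point always sum to one; the argmax picks the cluster whose stored scheduling heuristics are then used for policy selection.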

The fuzziness parameter together with the stopping conditions had the same values as the ones used in Sect. 5.2.

The results for the two platform scenarios are depicted in Fig. 5.3.3. They show the average gain of four scheduling heuristics (the other three being Max-Min, MinQL and DMECT) used as comparison with the Best Selection strategy. The gain is computed in this case as the ratio between the makespan provided by the Best Selection strategy and the one given by the tested scheduling heuristic; a gain greater than 1 means that the tested heuristic is better. Tests concluded that among the four tested algorithms only the clustering algorithm provides better results, with an average gain of 1.0061 for the h = 0 scenario and 1.0088 for the h = 0.42 scenario. The close-to-one values indicate that the clustering based selection algorithm is as good as the time consuming Best Selection strategy. Tests concluded that in an online scenario the Best Selection strategy produces the best makespan. In addition it can be inferred that the proposed clustering technique offers a viable alternative to the Best Selection strategy, as it does not have a negative impact on the makespan while freeing the platform from the burden of running every scheduling algorithm each time in order to determine the one which provides the best cost.

The correct classification percentage ranges from 63.93 ± 10.59% (h = 0) to 74.81 ± 6.58% (h = 42). Consequently, even though the SAs selected by using the clustering strategy do not always match the best algorithm for the given system configuration, the matching percentage is high enough not to negatively influence the resulting makespan.

Fuzzy C-Mean was also compared with the K-Means clustering algorithm in order to observe the fluctuation in the correct classification percentage. K-Means gave a matching of 68.31% for h = 0 and 55.46% for h = 42. A reason for the less effective behaviour of K-Means could be that the structure of the training set makes exclusive clustering less efficient than the overlapping approach. As observable from Figs.


Figure 5.3.3: The gain with regard to the Best Selection strategy of several scheduling heuristics (Max-Min, MinQL, DMECT and the clustering approach) for the h=0 and h=0.42 scenarios.

5.3.1 and 5.3.2, the close relationship between EET, ECT and task size could give a fuzzy technique greater flexibility and efficiency in determining the cluster sets than an exclusive one.

The decrease in matching efficiency observed for K-Means as the heterogeneity of the platform increases can affect the schedule makespan for highly heterogeneous platforms. The reason for this behaviour is the drop in the matching percentage below the observed 48% threshold, under which the clustering technique becomes less effective. The deterioration threshold was determined experimentally by forcing the algorithm to stop considering correct matches once their number reaches a certain limit. Below this percentage the classification method has a gain below 0.97, which indicates that the makespan has deteriorated.

Neural Network based technique

Besides the unsupervised Fuzzy C-Mean algorithm we have also performed tests on a MultiLayer Perceptron (MLP) neural network (Arb95). This kind of network is a feedforward backpropagation neural network consisting of one input layer, one output layer and several hidden layer elements. The number of input neurons xj is equal to the number of attributes considered during classification, in our case 10. The number of output neurons yk is equal to the number of SAs (i.e., seven) used as classification classes. The number of hidden layer elements and the learning rate were varied between 1 and 100, respectively between 0.1 and 1, in order to determine the best matching percentage. The number of epochs until the training is considered complete was set to a fixed value of 500.

Results when using the training set for the h = 42 platform have shown that the matching percentage varies between 60.65% and 82.24%, with the best results obtained when the learning rate is set to 1 and the number of hidden layer elements is 100. However, the time required to produce a result in this case reaches 24.73 s. The best trade-off was obtained with a learning rate of 0.3 and 8 hidden layer elements: a matching percentage of 81.42% obtained within 2.41 s. Table 5.3.1 shows the confusion matrix for this setup. Ideally this matrix would be diagonal; still, as can be noticed, there are cases in which a SA has been wrongly classified. Table 5.3.2 depicts the True Positive (TP) rate (percentage of correctly classified scenarios) and False Positive (FP) rate (percentage of falsely classified scenarios). As can be seen, the h = 42 training set works well for identifying Max-Min (TP=0.969) and DMECT (TP=0.978). It is, however, not suited for identifying the scenarios in which Min-Min should be used. It can also be noticed from the two tables that there is no matching for MinQL-Plain; the reason is that MinQL-Plain was not found to be the best SA in any of the 366 training scenarios.

Table 5.3.1: Confusion matrix for the case of using an h = 42 training set with 8 hidden layer elements and a learning rate of 0.3

  a    b    c    d    e    f   ← classified as
 45    7    3    2    3    2   a = MinQL
  1  127    0    2    0    1   b = Max-Min
  5   10   23    0    1    0   c = Suffrage
  4   22    0    5    0    0   d = Min-Min
  0    2    0    0   90    0   e = DMECT
  1    1    0    0    0    9   f = DMECT2

Table 5.3.2: True Positive and False Positive percentage for the platform with h = 42, training set with 8 hidden layer elements and a learning rate of 0.3

TP Rate   FP Rate   Class
0.726     0.036     MinQL
0.969     0.179     Max-Min
0.59      0.009     Suffrage
0.161     0.012     Min-Min
0.978     0.015     DMECT
0.818     0.008     DMECT2

For the h = 0 training set, results have shown that the matching percentage ranges from 71.94% to 80.85% for the same configurations as in the h = 42 scenario. The best matching percentage was obtained with a learning rate of 0.3 and 8 hidden layer elements, the solution being obtained in 2.05 s. It can be noticed that this configuration is identical to the one which gave the best matching percentage in the h = 42 scenario. Table 5.3.3 shows the confusion matrix


for the h = 0 scenario. It can be noticed that the training set was unable to determine the class for Min-Min, MinQL-Plain and DMECT2. Despite this, as seen in Table 5.3.4, it provides a good source for identifying the DMECT (TP=0.946) and Max-Min (TP=0.846) algorithms. The case of Suffrage is interesting, as both training sets give similar percentages for correctly classifying it (0.59 and 0.656 respectively). However, this percentage leaves enough Suffrage scenarios to be misidentified as belonging to either the DMECT or the Max-Min class.

Table 5.3.3: Confusion matrix for the case of using an h = 0 training set with 8 hidden layer elements and a learning rate of 0.3

  a    b    c    d    e    f   ← classified as
 70    4    0    0    0    0   a = DMECT
 28  154    0    0    0    0   b = Max-Min
 11    0    0    0    0    0   c = Min-Min
  1    0    0    0    2    0   d = MinQL-Plain
  1   10    0    0   21    0   e = Suffrage
  0    0    0    0    1    0   f = DMECT2

Table 5.3.4: True Positive and False Positive percentage for the platform with h = 0, training set with 8 hidden layer elements and a learning rate of 0.3

TP Rate   FP Rate   Class
0.946     0.179     DMECT
0.846     0.116     Max-Min
0         0         Min-Min
0         0         MinQL-Plain
0.656     0.011     Suffrage
0         0         DMECT2

When using a combined training set made up of the data for both h = 0 and h = 42, the matching percentage varies between 65.32% and 76.98%. The best matching, however, requires around 42 s. To obtain a match similar to the best ones given by the algorithm in the two previous cases we needed to increase the number of epochs to 1,000 and keep both the learning rate and the number of hidden layer elements at the maximum value of the tested interval. This setup gave us an 81.01% matching in 83.71 s. As was to be expected, in this case the confusion matrix (cf. Table 5.3.5) provides a good method for identifying DMECT (TP=0.91, cf. Table 5.3.6) and Max-Min (TP=0.92, cf. Table 5.3.6) and is less useful for the rest.

The success rate of above 80% when classifying system configurations into the correct SA class makes the MultiLayer Perceptron a good method for dynamic scheduling heuristic selection. The method is especially useful for identifying the cases in which DMECT and Max-Min need to be used. Nevertheless, as these algorithms are considered the best in


Table 5.3.5: Confusion matrix for the case of using a mixed training set with 100 hidden layer elements and a learning rate of 1

  a    b    c    d    e    f    g   ← classified as
 43   11    3    2    2    1    0   a = MinQL
  1  288    1    5   17    1    0   b = Max-Min
  2   27   42    0    0    0    0   c = Suffrage
  1   17    3   10   11    0    0   d = Min-Min
  0   15    0    0  151    0    0   e = DMECT
  0    3    1    0    1    7    0   f = DMECT2
  1    0    0    0    1    0    1   g = MinQL-Plain

Table 5.3.6: True Positive and False Positive percentage for the platform with a mixed training set having 100 hidden layer elements and a learning rate of 1

TP Rate   FP Rate   Class
0.694     0.008     MinQL
0.92      0.205     Max-Min
0.592     0.013     Suffrage
0.238     0.011     Min-Min
0.91      0.064     DMECT
0.583     0.003     DMECT2
0.333     0         MinQL-Plain

71.59% of the total training scenarios, and the TP rate is higher than 0.5 for the next most dominant SAs (MinQL and Suffrage, which constitute 19.88% of the total training scenarios), the method gives, in the best configuration and with high probability, a classification success higher than 80%. Given the results depicted for the clustering scenario (cf. Fig. 5.3.3), this percentage suffices for producing a schedule with a makespan close to the best one, i.e. the one given by selecting the best SA during every rescheduling step.

The MultiLayer Perceptron was also compared with a Radial Basis Function network (Arb95). Results have shown a success rate in correctly classifying the algorithms of 67.98% for the h = 0 network and of 50.54% for the h = 42 network. The Radial Basis Function network used the K-Means clustering algorithm, with eight clusters, to provide the basis functions.

Several of the parameters have proven to play an important part in selecting the SAs. The total number of tasks is an important parameter for selecting DMECT, while the number of long tasks is essential in classifying Max-Min. The former is a consequence of the fact that DMECT works well when the number of tasks in the queues is small. The latter was also noticed empirically by Maheswaran et al. (MAS+99). The rest of the parameters share their effectiveness in determining either DMECT or Max-Min.

In terms of runtime, Table 5.3.7 compares the brute-force Best Selection strategy


against the supervised and unsupervised techniques. Results show that the learning techniques offer a faster solution than the Best Selection strategy. The advantage comes from the fact that the Best Selection strategy's runtime increases as the number of algorithms grows. For instance, in our case the Best Selection is composed of Min-Min (0.8 s), Max-Min (1.8 s), DMECT and DMECT2 (0.9 s each), MinQL and MinQL-Plain (0.2 s each) and Suffrage (1.2 s). Although the learning techniques' runtimes are influenced by the setup parameters, and can thus become larger than that of the Best Selection strategy, they allow a more flexible and easier method of integrating new SAs into the training set, as they do not require constantly reapplying the policy. Furthermore, they provide a classification success higher than the deterioration threshold throughout the tested setup parameters (i.e., even for the most time efficient tested versions, which do not always provide the best matching percentage).
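As a quick check, the per-algorithm runtimes listed above indeed add up to the 6 s Best Selection figure reported in Table 5.3.7:

```python
# Per-algorithm scheduling runtimes (in seconds) quoted in the text above.
RUNTIMES_S = {"Min-Min": 0.8, "Max-Min": 1.8, "DMECT": 0.9, "DMECT2": 0.9,
              "MinQL": 0.2, "MinQL-Plain": 0.2, "Suffrage": 1.2}

# Best Selection must run every heuristic once, so its cost is the sum.
best_selection_runtime = round(sum(RUNTIMES_S.values()), 1)
print(best_selection_runtime)  # → 6.0
```

This also makes the scaling argument concrete: every additional SA added to the pool increases the Best Selection cost by its own runtime, while the learning techniques keep a roughly constant selection time.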

Table 5.3.7: Average schedule selection runtime in seconds

Method                 Time
Best Selection         6
Fuzzy C-Mean h=0       1.9
Fuzzy C-Mean h=42      2.1
MLP h=0                2.05
MLP h=42               2.41
MLP combined           2.3

5.4 Platform Testing Scenario and Results

In order to test the recovery times and parameter setup of the proposed MAS we have used a real-life platform made up of 12 distributed virtual machines (VM). The testing platform is based on the infrastructure services provided by the Cloud-related project mOSAIC (mOS10). The scenario used 12 agents: 2 healers, 1 negotiator, 2 schedulers and 8 executors. Tests were aimed at determining the time needed to heal the platform after partial or total agent failure. A study on finding the optimal platform parameters that would not produce false healing events was also conducted. These events usually happen when a running module's ping is not read inside the time-out interval; consequently, the module can be wrongly paused and another one restarted. Although this does not influence the platform's performance, it has the drawback of creating, in the worst case, a large number of clone modules.

5.4.1 Optimal Parameters Tests

Finding optimal parameters for the platform is crucial for it to function correctly. By this we mean not only that the platform should self-heal in the smallest amount of time, but also that it should not produce false self-healing events due to a poor setting of the time-out between two


consecutive module pings (ping_i) (i.e., the time interval inside which a module is considered active), of the module idle times (mit_i), of the message receive time-out (msg_timeout) or of the batch size for the number of messages read (msg_batch) during one healing iteration. Before proceeding with the tests we need to describe the main mathematical relations that govern the platform's message reads.

To avoid false healing events we need to properly adjust the message reading rate to the provided parameters. To achieve this we have to identify the length of every read heart beat (i.e., the time between two message reads). The events occurring in the designed MAS can be mapped to a discrete time space. Thus we can deduce the following relationship between two time intervals: ∆t = hit + msg_batch · msg_timeout + τ_H^k, where hit represents the healing module's idle time and τ_H^k represents the time, starting from moment k, taken by module operations other than message reads. The difference ∆t = t_{k+1} − t_k represents the read heart beat. After each ∆t a certain number of unread messages are left, equal to msg_unread_{k+1} = max(msg_unread_k − msg_batch + Σ_{i=1}^{n} ∆t/mit_i, 0), where the sum gives the number of messages sent to the modules during ∆t.
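A small simulation of this recursion, sketched here under the assumption of identical module idle times mit_i = mit (the same simplification used in the text) and with hypothetical parameter values, shows when the unread-message backlog stays empty versus when it grows without bound:

```python
def unread_after(steps, msg_batch, mit, n, hit, msg_timeout, tau_h):
    """Iterate msg_unread_{k+1} = max(msg_unread_k - msg_batch + n*dt/mit, 0),
    with dt = hit + msg_batch*msg_timeout + tau_h, for n modules sharing a
    single idle time mit (all times in ms; values below are hypothetical)."""
    dt = hit + msg_batch * msg_timeout + tau_h  # read heart beat length
    unread = 0.0
    for _ in range(steps):
        unread = max(unread - msg_batch + n * dt / mit, 0.0)
    return unread

# Slow senders (mit=10): each beat produces 0.4 messages, batch of 1 keeps up.
print(unread_after(100, msg_batch=1, mit=10, n=2, hit=1, msg_timeout=1, tau_h=0))  # → 0.0
# Fast senders (mit=1): 4 messages per beat vs. batch of 1, backlog grows by 3 each step.
print(unread_after(5, msg_batch=1, mit=1, n=2, hit=1, msg_timeout=1, tau_h=0))     # → 15.0
```

The boundary between the two regimes is exactly the condition msg_batch = Σ ∆t/mit_i used below to derive the minimal batch size.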

In order to maintain msg_unread_k = 0 we need to always have:

msg_batch = Σ_{i=1}^{n} ∆t / mit_i        (5.4.1)

Because of the high number of variables involved in computing ∆t we need to simplify the equation in order to solve it easily. The simplest way is to assume hit = mit_i, mit_i = mit_j ∀i ≠ j and τ_H^k = τ_H^{k+1} ∀k. This is useful in systems having the same idle times for all modules. Solving Relation 5.4.1 gives msg_batch = (n · τ_H)/(hit − n · msg_timeout), where n stands for the number of registered modules.

It can be noticed that this relationship is independent of the ping_i time. In general cases, however, this identity is not useful and a relationship between ping_i, ∆t and n must be found. In this case the ideal scenario would be to always read at least a number of messages equal to the number of registered modules between two pings. There are cases in which small module idle times are used; these produce large bursts of pings coming from a subset of modules, which could take up the whole ping_i interval during processing. As a result we need to ensure that at least one ping from every module is read during each interval. The necessary and sufficient condition for achieving this is:

∀ msg_k with arrival time of msg_k ∈ [last ping time, next ping time] ⇒ |{sender(msg_k)}| = n        (5.4.2)

To avoid violating condition (5.4.2) we need to determine the minimal functioning configuration for the simplest case (i.e., all modules send pings at the same moment) and increase the ping interval until a sufficient limit is reached. The following tests start from this scenario and provide some conclusions regarding the platform's minimal setup parameters.

For testing we varied all of the previously listed parameters as follows: mit_i ∈ {1, 500} ms, msg_batch ∈ {1, …, 20}, msg_timeout = 1 ms and ping_i ∈ {1000 · i | i = 1, …, 10} ms.


Figures 5.4.1(a) and 5.4.1(b) show some of the main results achieved during testing. Times are expressed in milliseconds. Each plot represents the average result obtained after 10 tests on each configuration. Results have shown that configurations with a minimal ping_i time-out of 0.9 s have a low chance of producing false module healing events. The reason for this can be that too small ping intervals cannot deal with the likely cases when condition (5.4.2) is not met. This is highly probable, as modules are usually set with different idle times and started at different moments in time. The batch interval is maximized when hit and mit_i both have the smallest possible values (cf. Fig. 5.4.1(a)). As shown by the results depicted in Figs. 5.4.1(a) and 5.4.1(b), the delay becomes irrelevant when both the ping_i time-out and the batch size are large enough, since in this case the healing module has enough time to obey condition (5.4.2).

Figure 5.4.1: Occurrence of false healing events with regard to the ping time and message batch size: (a) false recovery events related to the ping interval size for hit = 1, mit_i = 1, msg_timeout = 1; (b) the same for hit = 1, mit_i = 500, msg_timeout = 1. Both panels plot the batch size against the ping time-out (1,000-10,000 ms), distinguishing false events with hit < ping, false events with hit > ping, and configurations with no false events.

5.4.2 Recovery Time Tests

The recovery time of the platform after a partial or total failure of the involved modules is crucial to its efficiency. In order to fully recover from a module failure event the MAS needs to converge to a stable form. Convergence is possible if and only if there are enough self-healing modules, idle modules and on-demand resources available to perform the recovery. Formally this can be expressed as:

∀φ ∃ξ (φ ∈ F ∧ ξ ∈ M ∧ isHealer(ξ) ∧ canHeal(ξ, φ) ∧ (∃µ (µ ∈ I ∧ µ_type = φ_type) ∨ (∃R_j (R_j ∈ R) ∧ ∃µ (µ ∈ I ∧ µ_type = φ_type))))        (5.4.3)

where F represents the set of failed modules, M the set of running modules, I the set of idle modules and R the set of available on-demand resources. isHealer(ξ) and canHeal(ξ, φ) are boolean methods: the former indicates whether or not module ξ is a self-healing module, while the latter specifies the healing ability of module ξ over the failed module φ.

Figure 5.4.2: Recovery time vs. number of failed modules in the platform (two series: 1 healer with on-demand deployment and 1 healer with idle clones)
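Relation 5.4.3 can be read as a simple recoverability predicate. A sketch follows, with hypothetical names and with module identity collapsed to its type for brevity:

```python
def can_recover(failed_types, healers, idle_types, on_demand_ok):
    """True iff every failed module has a healer able to heal it AND either an
    idle clone of its type or on-demand deployment capacity (cf. Relation 5.4.3).
    healers is a list of canHeal-style predicates over a module type."""
    for phi in failed_types:
        healer_ok = any(can_heal(phi) for can_heal in healers)
        replacement_ok = phi in idle_types or on_demand_ok
        if not (healer_ok and replacement_ok):
            return False
    return True

heal_anything = lambda module_type: True
print(can_recover(["scheduler"], [heal_anything], {"executor"}, on_demand_ok=True))  # → True
```

The two-way disjunction mirrors the recovery paths tested below: an idle clone of the failed module's type, or an on-demand resource on which a new module can be deployed.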

The following tests have used the optimal setup parameters: hit = 1 ms, mit_i = 1 ms, msg_batch = 20, msg_timeout = 1 ms and ping = 20 s. Two scenarios were of interest: the first uses one healing agent and only on-demand deployment, while the second uses one healing agent and only idle clones. Test results are depicted in Fig. 5.4.2. As can be noticed, the recovery times when using idle clones are significantly better than in the on-demand deployment scenario. The reason is that when using idle clones there are no other time consuming operations besides sending activation messages to the clone modules.

The average recovery time of a module depends on its type. Executor and negotiator modules require an average of 4.2 s, respectively 3.6 s, while schedulers need around 15.3 s to fully start. This difference is caused by the fact that schedulers also need to initialize the rule engine and load the rule base into memory before starting; these two operations take around 12 s for the DMECT algorithm.

A complete healer heart beat takes around 1.7 s, of which 75.29% is used to process the messages, 1.17% for sleeping and the remaining 23.54% for executing other module logic. When starting remote modules an iteration can take up to 3.3 times longer due to transfer times.


5.5 Conclusions

In this chapter we presented a MAS for task scheduling inside a VO federation. The novelty of the platform is its self-healing capability, achieved by combining a series of existing software solutions with software engineering concepts. Two main aspects were considered essential in this scope: the distributivity of the MAS components (cf. Sects. 5.1.1 and 5.1.2) and the component monitoring and decision interface (cf. Sects. 5.1.3 and 5.1.4). The resulting MAS SP consists of the following:

• a distributed storage and communication mechanism based on RabbitMQ, HDFS and HBase;

• fault-tolerant and self-healing agents, designed as modular intelligent FCL for handling task rescheduling and module recovery;

• clone and on-demand based module redeployment capabilities;

• independent resource providers. Each provider can use its own custom tailored scheduling policies. In addition, in order to separate the policies from the scheduling module's logic, a rule based approach following the model described in Chapter 4 was taken;

• a customizable negotiation policy. The policy is designed as a plug-in module which can be easily and safely changed whenever the requirements demand it;

• dynamic selection of the scheduling policy, relying on a novel clustering and neural network based technique that uses platform related parameters to classify a system configuration into a specific class of SAs.

With the emergence of inter-provider computing and the lack of a complete SP, in terms of negotiation, self-healing, autonomy and distributivity, to handle task migration between providers, our solution provides one of the first steps towards offering such a platform. The usage of FCL to handle both rescheduling and self-healing has proven the utility of this concept in the field of distributed scheduling. Furthermore, the decentralized approach based on message queues for inter-component communication takes the next step from the classic HTTP/SOAP-based service-oriented model used for communication between distributed applications. It has also shown the advantages of an asynchronous communication system in volatile systems such as Clouds, where synchronous communication causes messages to be lost if the receiver or sender encounters errors during execution.

The ability of the platform to redeploy modules by using both idle clone modules and on-demand deployment increases its dynamism and allows it to recover faster from failures. Furthermore, to our knowledge the proposed system is the only existing SP with self-healing capabilities.

As a novel approach we have also shown how the number of agents involved in the negotiation phase can be reduced by using a clustering algorithm. This is crucial especially inside large systems, as it allows the system to choose in an intelligent way only the most important agent subset (i.e., the subset which is guaranteed to improve the cost function).
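The agent-reduction idea can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual clustering algorithm: the `Agent` structure, the single `est_cost` feature and the `gap` threshold are all assumptions made for the example. Agents whose cost estimates are nearly identical form a cluster, and only the cheapest member of each cluster is kept for negotiation, so the best achievable cost is preserved while far fewer agents take part.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    est_cost: float  # agent's estimated cost for the negotiated task (illustrative feature)

def select_negotiation_subset(agents, gap=0.5):
    """Group agents whose cost estimates lie within `gap` of the cluster's
    cheapest member and keep only that representative, shrinking the
    negotiation set without raising the best reachable cost."""
    ordered = sorted(agents, key=lambda a: a.est_cost)
    subset, last = [], None
    for agent in ordered:
        if last is None or agent.est_cost - last > gap:
            subset.append(agent)      # new cluster -> keep its cheapest member
            last = agent.est_cost
    return subset

agents = [Agent("a1", 1.0), Agent("a2", 1.2), Agent("a3", 3.0), Agent("a4", 3.1)]
reps = select_negotiation_subset(agents)  # keeps one representative per cluster
```

In a real deployment the feature vector would include more than a cost estimate (load, latency, reliability), but the principle of negotiating only with cluster representatives is the same.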

In a similar way, a method for selecting the best scheduling heuristic for a given system configuration was described. The method uses several parameters extracted from task and resource information and is faster than current solutions (e.g., switching algorithms), while being as efficient as the best selection strategy. Using clustering and neural networks to determine the best SA from a set of previously known scenarios is an important step towards building adaptive scheduling algorithms that do not rely on brute-force checking but rather on (un)supervised techniques.
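The classification step can be illustrated with a nearest-centroid sketch: past scenarios (platform parameter vectors) are labelled with the SA that performed best, and a new configuration is assigned to the class of the closest centroid. The feature names, values and SA labels below are illustrative assumptions; the thesis's actual pipeline combines clustering with a neural network.

```python
import math

# Hypothetical training history: platform parameter vectors
# (task arrival rate, resource heterogeneity, mean task size)
# labelled with the SA that dominated that scenario.
HISTORY = {
    "MinQL": [(0.9, 0.1, 1.0), (0.8, 0.2, 1.2)],
    "DMECT": [(0.2, 0.9, 5.0), (0.3, 0.8, 4.5)],
}

def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

CENTROIDS = {sa: centroid(pts) for sa, pts in HISTORY.items()}

def select_sa(config):
    """Classify a new system configuration into the class of the nearest
    centroid, i.e. pick the SA that dominated similar past scenarios."""
    return min(CENTROIDS, key=lambda sa: math.dist(config, CENTROIDS[sa]))
```

Because the choice is a single distance computation over precomputed centroids, selection is far cheaper than re-running every candidate heuristic on the current configuration, which is the advantage claimed over switching algorithms.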

The simulated tests (cf. Sect. 5.4) conducted in order to determine the behaviour of the platform have proven its healing capabilities. To facilitate these tests, a study on the minimal optimal platform parameters was undertaken. These optimal parameters allow the platform to work without producing any false self-healing events, which could cause unnecessary delays by activating additional modules. For the platform recovery tests both on-demand deployment and idle modules have been used. Results have shown that on average the idle clones permit the platform to recover faster than when on-demand deployment is used. However, by mixing the two we ensure that the platform is restored even when no idle clones exist.
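The mixed recovery strategy reduces to a simple fallback rule: take an idle clone if one exists (fast takeover), otherwise deploy a fresh instance on demand (slower, but always available). The sketch below is a schematic under assumed names (`Clone`, `recover_module`); it is not the platform's actual API.

```python
class Clone:
    """Stand-in for a deployed platform module instance."""
    def __init__(self):
        self.role = None
    def activate(self, role):
        self.role = role

def deploy_on_demand(role):
    # Slow path: provision and start a fresh instance, then assign the role.
    clone = Clone()
    clone.activate(role)
    return clone

def recover_module(role, idle_clones):
    """Prefer an idle clone (near-instant takeover); fall back to on-demand
    deployment so recovery succeeds even when the clone pool is empty."""
    if idle_clones:
        clone = idle_clones.pop()  # fast path: reuse a pre-started clone
        clone.activate(role)
        return clone
    return deploy_on_demand(role)

pool = [Clone()]                            # one idle clone available
fast = recover_module("scheduler", pool)    # served from the pool
slow = recover_module("executor", pool)     # pool now empty -> on-demand path
```

The two paths correspond directly to the two recovery mechanisms compared in the simulations: the pool gives speed, the on-demand fallback gives completeness.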

6 CONCLUSIONS

Due to the unpredictable behaviour of heterogeneous DS there is a constant need for reliable systems for executing tasks. Scheduling plays an important role in this environment, as it allows tasks to be mapped onto resources in an efficient manner. Efficiency is the keyword, as the number of tasks is usually much higher than that of the available resources. Furthermore, tasks can be linked by precedence relations and generally belong to various users with different priorities and goals. Based on these goals, a task (batch) might need to finish before a given deadline or simply in the shortest amount of available time. Because of the volatile nature DS exhibit, classic SAs in which a single policy is used experience difficulties, as shown in Sect. 1.2, in optimizing the cost function. Task rescheduling offers an improvement over static scheduling, but certain scheduling heuristics fail when the resource heterogeneity factors change. In this case a combination of several SAs could prove more successful, as it allows scheduling policies to be switched easily based on the new system configuration. An alternative to offering a set of existing scheduling policies would be to design a SA which provides a stable (i.e., not influenced by system fluctuations) cost value in a wide range of platform and task scenarios. In addition to allowing dynamic cost function optimization, a scheduling/execution platform should be able to cope with the various system failures which can occur inside the DS. The best solution is to allow the platform to self-heal. The implicit autonomy of self-* systems is essential, as it frees users from having to keep a constant watch on the platform. It is also useful when a VO federation or multi-cloud platform uses the SP, as it allows each provider to maintain its own scheduling, access and negotiation policies.

This thesis has tackled the previously mentioned problems and described incrementally, during the course of its chapters, the main problems and the needed solutions. The three chapters dedicated to the thesis contribution start by presenting (cf. Chapter 3) a new scheduling heuristic designed for heterogeneous environments with stability as its main goal. Based on the new algorithm, a new model for representing distributed SAs for heterogeneous systems has been devised (cf. Chapter 4). The innovative self-healing MAS SP (cf. Chapter 5) relies on the proposed model to represent SAs at the intra/inter-provider level, while the inter-provider negotiation decisions are left to the newly designed scheduling heuristic.

The main contributions of the thesis can be summarized as follows:

• Chapter 3 presented and tested under various scenarios a new SA called DMECT that proved to be the most stable of the tested algorithms. A simplified case for load balancing strategies – MinQL – was also studied. The novelty introduced by DMECT is its customizable task relocation condition. Hence the decision on when to move the task can not only be adjusted depending on the needs of the resource provider, but also be adapted at runtime based on specific requirements;

• Chapter 4 presented a new model for formalising SAs based on a chemical metaphor. The model allows, in a new and efficient way, scheduling heuristics to be expressed as inference rules and distributed across several providers. Distributing the SA itself represents a new kind of approach, as until now the policies have been centralized and only the scheduler could be distributed. This opens the way for collaborative scheduling inside multi-provider environments. The language for expressing SAs as workflows is called SiLK and, together with the OSyRIS engine, represents the first working composition platform based on a nature-inspired paradigm. Besides, it is among the few distributed workflow engines capable of self-healing;

• Chapter 5 described a MAS-based SP that brings as a novelty a self-healing mechanism working both for rescheduling tasks and for recovering from component failures. The platform relies on OSyRIS to execute decentralized scheduling heuristics. Two new features are presented as part of the platform. The first is the mechanism for optimizing the negotiation by reducing, without affecting the cost function, the number of scheduling agents involved in the process. The second is an original approach for dynamically selecting the scheduling heuristic at runtime based on platform parameters that include task and resource information.
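The customizable relocation condition highlighted for DMECT in the Chapter 3 contribution above can be illustrated as a pluggable predicate. The concrete form below (relocate once waiting time exceeds `factor` times the estimated execution time) and the field names are assumptions for illustration; the actual condition is defined in Chapter 3.

```python
def default_relocation_condition(wait_time, est_exec_time, factor=2.0):
    """Illustrative policy: relocate once the task has waited longer than
    `factor` times its estimated execution time on the current resource."""
    return wait_time > factor * est_exec_time

def maybe_relocate(task, condition=default_relocation_condition):
    # The predicate is a plug-in: a provider can swap in its own condition,
    # or tune `factor` at runtime, without touching the scheduler itself.
    return condition(task["wait"], task["est_exec"])
```

Keeping the condition behind a function boundary is what allows both provider-specific adjustment and runtime adaptation, as claimed for DMECT.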

As a general conclusion we argue that in this thesis we have identified and solved some of the problems found when dealing with dynamic heterogeneous DS. However, despite the novelty of the self-healing MAS SP and the steps taken in the direction of self-adapting SAs, the road ahead is still long. The problem of designing an efficient self-adapting SA is still open, and the platform-parameter-based solution needs to be further investigated by taking into consideration a wider spectrum of system characteristics. Although DMECT offers a stable scheduling heuristic which could be a viable replacement for dynamic algorithm selection, it does not offer the best schedules in every considered scenario.

Future work will attempt to further consolidate the problem of self-adapting SAs by studying multi-objective adaptive scheduling and by broadening (and possibly including) the classification (in terms of performance and complexity) of the SAs used in the dynamic selection phase.

Bibliography

[Aas05] J. Aas, Understanding the linux 2.6.8.1 cpu scheduler, Silicon Graphics, Inc., 2005.

[ABC+04] A. Anjum, J. Bunn, R. Cavanaugh, F. van Lingen, M. A. Mehmood, C. Steenberg, andI. Willers, Predicting resource requirements of a job submission, Proceedings of the Confer-ence on Computing in High Energy and Nuclear Physics, 2004, p. 273.

[ABG02] D. Abramson, R. Buyya, and J. Giddy, A computational economy for grid computing and itsimplementation in the nimrod-g resource broker, Future Gener. Comput. Syst. 18 (2002), no. 8,1061–1074.

[AHK98] R. Armstrong, D. Hensgen, and T. Kidd, The relative performance of various mapping algo-rithms is independent of sizable variances in run-time predictions, 7th IEEE HeterogeneousComputing Workshop (HCW 98, 1998, pp. 79–87.

[AMA09] M. Amoon, M. Mowafy, and T. Altameem, A multiagent-based system for scheduling jobsin computational grids, ICGST International Journal on Artificial Intelligence and MachineLearning 9 (2009), no. 2, 19–27.

[Amd67] G. Amdahl, Validity of the single processor approach to achieving large-scale computing capa-bilities, AFIPS Conference Proceedings, 1967, pp. 483–485.

[APR99] A. H. Alhusaini, V. K. Prasanna, and C. S. Raghavendra, A unified resource scheduling frame-work for heterogeneous computing environments, 8th Heterogeneous Computing Workshop,1999, pp. 156–165.

[Arb95] M. Arbib, The handbook of brain theory and neural networks 2nd ed., MIT Press, 1995.

[BA04] R. Bajaj and D. P. Agrawal, Improving scheduling of tasks in a heterogeneous environment,IEEE Transactions on Parallel and Distributed Systems 15 (2004), 107–118.

[BAG00] R. Buyya, D. Abramson, and J. Giddy, Nimrod/g: an architecture for a resource managementand scheduling system in a global computational grid, Proceedings of the 4th InternationalConference on High Performance Computing in Asia-Pacific Region, IEEE Computer Press,2000, pp. 283–289.

[Bal09] M. Bali, Drools jboss rules 5.0 developer’s guide, Packt Publishing, 2009.

[BFR07] J. P. Banatre, P. Fradet, and Y. Radenac, Programming self-organizing systems with the higher-order chemical language, International Journal of Unconventional Computing 3 (2007), no. 3,161–177.

[BHLR01] S. Binato, W. J. Hery, D. M. Loewenstern, and M. G. C. Resende, A grasp for job shopscheduling, Essays and surveys on meta-heuristics (2001), 59–79.

[BLM93] J. P. Banatre and D. Le Metayer, Programming by multiset transformation, Commun. ACM36 (1993), no. 1, 98–111.

167

168 Marc E. Frıncu

[BM02] R. Buyya and M. M. Murshed, Gridsim: A toolkit for the modeling and simulation of dis-tributed resource management and scheduling for grid computing, Journal of Concurrency andComputation: Practice and Experience 14 (2002), 13–15.

[BMM09] S. Banerjee, I. Mukherjee, and P. K. Mahanti, Cloud computing initiative using modified antcolony framework, World Academy of Science, Engineering and Technology 56 (2009), 221–224.

[BMP+10] A. Benoit, L. Marchal, J. F. Pineau, Y. Robert, and F. Vivien, Scheduling concurrent bag-of-tasks applications on heterogeneous platforms, IEEE Transactions on Computers 59 (2010),202–217.

[BR99] A. Barabasi and A. Reka, Emergence of scaling in random networks, Science 286 (1999),no. 5439, 509–512.

[Bru07] P. Bruckner, Scheduling algorithms 5th edition, Springer, 2007.

[BSB01] T. D. Braun, H. J. Siegel, and N. Beck, A comparison of eleven static heuristics for mapping aclass of independent tasks onto heterogeneous distributed computing systems, Journal of Paralleland Distributed Computing 61 (2001), no. 6, 801–837.

[BSD03] B. Benatallah, Q. Z. Sheng, and M. Dumas, The self-serv environment for web services com-position, IEEE Internet Computing (2003), 40–49.

[BWC+03] F. Berman, R. Wolski, H Casanova, D. Cirne, and et al., Adaptive computing on the grid usingapples, IEEE Trans. Parallel Distrib. Syst. 14 (2003), no. 4, 369–382.

[CAA02] S. A. Chun, V. Atluri, and N. R. Adam, Domain knowledge-based automatic workflow gen-eration, Proceedings of the 13th International Conference on Database and Expert SystemsApplications (DEXA’02) (London, UK), Springer-Verlag, 2002, pp. 81–92.

[CADV02] S. Chiang, A. C. Arpaci-Dusseau, and M. K. Vernon, The impact of more accurate requestedruntimes on production job scheduling performance, Revised Papers from the 8th InternationalWorkshop on Job Scheduling Strategies for Parallel Processing (London, UK), Lecture Notesin Computer Science, Springer-Verlag, 2002, pp. 103–127.

[CCDND10] B. A. Caprarescu, N. M. Calcavecchia, E. Di Nitto, and D. J. Dubois, Sos cloud: Self-organizingservices in the cloud, Proceedings of the 5th International ICST Conference on Bio-InspiredModels of Network, Information, and Computing Systems, Springer-Verlag, 2010, pp. x–x.

[CD97] H. Casanova and J. Dongarra, Netsolve: A network server for solving computational scienceproblems, Intl. Journal of Supercomputing Applications and High Performance Computing 11(1997), no. 3, 212–223.

[CD98] , Applying netsolve’s network-enabled server, IEEE Comput. Sci. Eng. 5 (1998), no. 3,57–67.

[CDGJ09] L. C. Canon, O. Dubuisson, J. Gustedt, and E. Jeannot, Defining and control-ling the heterogeneity of a cluster: the wrekavoc tool, Tech. report, INRIA, 2009,http://www.labri.fr/perso/ejeannot/publications/RR-7135.pdf (accessed Jun 23rd 2010).

[CFK98] P. Chandra, A. Fisher, and C. et al. Kosak, Darwin: Customizable resource management forvalue-added network services, ICNP ’98: Proceedings of the Sixth International Conference onNetwork Protocols (Washington, DC, USA), IEEE Computer Society, 1998, p. 177.

Adaptive Scheduling for Distributed Systems 169

[CFMP10] A. Carstea, M. Frıncu, G. Macariu, and D. Petcu, Validation of symgrid-services frameworkthrough event based simulation, International Journal of Grid and Utility Computing 2 (2011),no. 1, 33–44.

[CJ10] S. S. Chauhan and R. C. Joshi, A weighted mean time min-min max-min selective schedulingstrategy for independent tasks on grid, IACC ’10: Proceedings of the IEEE 2nd InternationalAdvance Computing Conference, 2010, pp. 4–9.

[CJS+02] J. Cao, S. A. Jarvis, S. Saini, D. J. Kerbyson, and G. R. Nudd, Arms: An agent-based resourcemanagement system for grid computing, Scientific Programming 10 (2002), no. 2, 135–148.

[CKKG99] S. J. Chapin, D. Katramatos, J. Karpovich, and A. Grimshaw, Resource management in legion,Future Generation Computer Systems 15 (1999), no. 5, 583–594.

[CKPN00] J. Cao, D. J. Kerbyso, E. Papaefstathiou, and G. R. Nudd, Performance modeling of paralleland distributed computing using pace, Proceedings of 19th IEEE International Performance,Computing and Communication Conference, Lecture Notes in Computer Science, IEEE Com-puter Press, 2000, pp. 485–492.

[CL91] K. W. Chow and B. Liu, On mapping signal processing algorithms to a heterogeneous multipro-cessor system, ICASSP ’91: Proceedings of the Acoustics, Speech, and Signal Processing, 1991.ICASSP-91., 1991 International Conference (Washington, DC, USA), vol. 3, IEEE ComputerSociety, 1991, pp. 1585–1588.

[Cla52] W. Clark, The gantt chart 3rd edition, Pitman and Sons, London, 1952.

[cli10] Clips, 2010, http://clipsrules.sourceforge.net/.

[CLSS01] A. S. Chuai, C. Lursinsap, P. Sophasathit, and S. Siripant, Fuzzy c-mean: A statistical featureclassification of text and image segmentation, International Journal of Uncertainty, Fuzzinessand Knowledge-Based Systems 9 (2001), no. 6, 661–671.

[CLZB00] H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman, Heuristics for scheduling parame-ter sweep applications in grid environments, Procs. 9th Heterogeneous Computing Workshop(HCW) (Cancun, Mexico), May 2000, pp. 349–363.

[CMN+04] D. G. Cameron, A. P. Millar, C. Nicholson, R. Carvajal-Schiaffino, F. Zini, and K. Stockinger,Optorsim: a simulation tool for scheduling and replica optimisation in data grids, 2004.

[COBW00] H. Casanova, G. Obertelli, F. Berman, and R. Wolski, The apples parameter sweep template:User-level middleware for the grid, Procs. of Super Computing SC’00, 2000.

[Con10] G’5000 Consortium, The grid’5000 experimental testbed, 2010, https://www.grid5000.fr (ac-cessed Jun 23rd 2010).

[Cor09a] Google Corporation, Google application engine, 2009, http://appengine.google.com (accessedJanuary 9 2010).

[Cor09b] Microsoft Corporation, Microsoft live mesh, 2009, http://www.mesh.com (accessed January 92010).

[CP09] B. A. Caprarescu and D. Petcu, A self-organizing feedback loop for autonomic computing,Proceedings of the International Conference on Adaptive and Self-Adaptive Systems and Ap-plications, 2009, pp. 126–131.

170 Marc E. Frıncu

[CQ08] A. Casanova, H. Legrand and M. Quinson, Simgrid: a generic framework for large-scale dis-tributed experiments, 10th IEEE International Conference on Computer Modeling and Simula-tion, 2008.

[CRNP08] M. Caeiro-Rodriguez, Z. Nemeth, and T. Priol, A chemical workflow engine to support scientificworkflows with dynamicity support, Proceedings of the 3rd Workshop on Workflows in Supportof Large-Scale Science, IEEE, November 2008, to appear.

[CSJN05] J. Cao, D. P. Spooner, S. A. Jarvis, and G. R. Nudd, Grid load balancing using intelligentagents, Future Generation Computer Systems 21 (2005), no. 1, 135–149.

[DA06] F. Dong and S. G. Akl, Scheduling algorithms for grid computing: State of the art and openproblems, Tech. report, Queen’s University, 2006, http://www.cs.queensu.ca/TechReports/Reports/2006-504.pdf.

[DAS10] DAS-3, The distributed asci supermcomputer 3, 2010, http://www.cs.vu.nl/das3/ (accessed Jun23rd 2010).

[DBG+03] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini,A. Arbree, R. Cavanaugh, and S. Koranda, Mapping abstract complex workflows onto gridenvironments, Journal of Grid Computing V1 (2003), no. 1, 25–39.

[DPAM02] K. D. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, A fast and elitist multiobjective geneticalgorithm : NSGA-II, IEEE Transactions on Evolutionary Computation 6 (2002), no. 2, 182–197.

[DV05] F. Desprez and A. Vernois, Simultaneous scheduling of replication and computation for bioin-formatic applications on the grid, Procs. of Workshop on Challanges of Large Applications inDistributed Environment CLADE 2005, 2005, pp. 66–74.

[EDQ07] L. Eyraud-Dubois and M. Quinson, Assessing the quality of automatically built network repre-sentations, Proceedings of the Seventh IEEE International Symposium on Cluster Computingand the Grid (Washington, DC, USA), CCGRID ’07, IEEE Computer Society, 2007, pp. 795–800.

[EG04] M. Ehrgott and X. Gandibleux, Approximative solution methods for multiobjective combinato-rial optimization, TOP 12 (2004), 1–63, 10.1007/BF02578918.

[EHS+02] C. Ernemann, V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour, On advantagesof grid computing for parallel job scheduling, 2002.

[EN07] K. Etminani and M. Naghibzadeh, A min-min max-min selective algorihtm for grid taskscheduling, ICI ’07: Proceedings of the 3rd IEEE/IFIP International Conference in CentralAsia on Internet, IEEE Computer Society, 2007, pp. 1–7.

[FC10] M. E. Frıncu and C. Craciun, Dynamic and adaptive rule-based workflow engine for scientificproblems in distributed environments, Cloud Computing and Software Services: Theory andTechniques (S. Ahson and M. Ilyas, eds.), Taylor & Francis, 2010, pp. 227–252.

[Fei10] D. G. Feitelson, Workload modeling for computer systems performance evaluation, September2010.

Adaptive Scheduling for Distributed Systems 171

[FH04] N. Fujimoto and K. Hagihara, A comparison among grid scheduling algorithms for independentcoarse-grained tasks, Proc. of the 2004 Symp. on Applications and the Internet-Workshops,IEEE Computer Society Press, 2004, pp. 674–680.

[FJK04] I. Foster, N. R. Jennings, and C. Kesselman, Brain meets brawn:why grid and agents need eachother, Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004. Proceedings of theThird International Joint Conference on, 2004, pp. 8–15.

[FK99] I. Foster and C. Kesselman, The grid: Blueprint for a new computing infrastructure, MorganKaufman Publishers, Inc. San Francisco, California, 1999.

[FMC09] M. Frıncu, G Macariu, and A. Carstea, Dynamic and adaptive workflow execution platformfor symbolic computations, Pollack Periodica 4 (2009), no. 1, 145–156.

[For90] C. Forgy, Rete: a fast algorithm for the many pattern/many object pattern match problem,Expert systems: a software methodology for modern applications (1990), 324–341.

[Fos05] I. Foster, Globus toolkit version 4: Software for service-oriented systems., NPC (Hai Jin,Daniel A. Reed, and Wenbin Jiang, eds.), Lecture Notes in Computer Science, vol. 3779,Springer, 2005, pp. 2–13.

[Fou10a] Apache Software Foundation, Hadoop database, 2010, http://hbase.apache.org/ (accessed Oc-tomber 14, 2010).

[Fou10b] , The hadoop distributed file system, 2010, http://hadoop.apache.org/hdfs/docs/r0.21.0/(accessed September 28, 2010).

[FP10] M. Frıncu and D. Petcu, Osyris: a nature inspired workflow engine for service oriented envi-ronments, Scalable Computing: Practice and Experience 11 (2010), no. 1, 81–97.

[FPNP09] M. Frıncu, S. Panica, M. Neagul, and D. Petcu, Gisheo: On demand grid service-based plat-form for earth observation data processing, Procs. of HiperGRID’09, Politehnica Press, 2009,pp. 415–422.

[FQS08a] M. E. Frıncu, M. Quinson, and F. Suter, A formalism for the description of large scalecomputing platforms, 2008, two page poster.

[FQS08b] , Handling very large platforms with the new simgrid platform description formalism,Technical Report 0348, INRIA, 02 2008.

[FR95] T. A. Feo and M. G. C. Resende, Greedy randomized adaptive search procedures, Journal ofGlobal Optimization 6 (1995), 109–133.

[Fr9a] M. E. Frıncu, Distributed scheduling policy in service oriented environments, Procs. of the 11thInt. Symposium on Symbolic and Numeric Algorithms for Scientific Computing SYNASC’09,IEEE Conference Publishing Series, IEEE Press, 2009.

[Fr9b] , Dynamic scheduling algorithm for heterogeneous environments with regular task inputfrom multiple requests, Procs. of the 4th Int. Conf. in Grid and Pervasive Computing GPC’09,Lecture Notes in Computer Science, vol. 5529, Springer-Verlag, 2009, pp. 199–210.

[Fr0a] , A method for distributing scheduling heuristics inside service oriented environmentsusing a nature-inspired approach, Proceedings of the 9th International Symposium on Paralleland Distributed Computing, 2010, pp. 211–218.

172 Marc E. Frıncu

[Fr0b] , Scheduling service oriented workflows inside clouds using an adaptive agent basedapproach, Handbook of Cloud Computing (B Furht and A. Escalante, eds.), Springer US,2010, pp. 159–182.

[Fr1] , D-osyris: A self-healing distributed workflow engine, Proceedings of the 10th Interna-tional Symposium on Parallel and Distributed Computing, 2011, in print.

[FVP+11] M. Frıncu, N. Villegas, D. Petcu, H. Muller, and R. Rouvoy, Self-healing distributed schedulingplatform, Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud andGrid Computing, 2011, in print.

[FW01] S. S. Fatima and M. Wooldridge, Adaptive task resources allocation in multi-agent systems,AGENTS ’01: Proceedings of the fifth international conference on Autonomous agents (NewYork, NY, USA), ACM, 2001, pp. 537–544.

[GRC09] Francesc Guim, Ivan Rodero, and Julita Corbalan, Job scheduling strategies for parallelprocessing, Job Scheduling Strategies for Parallel Processing (Eitan Frachtenberg and UweSchwiegelshohn, eds.), Springer-Verlag, Berlin, Heidelberg, 2009, pp. 59–79.

[GRH05] Y. Gao, H. Rong, and J. Huang, Adaptive grid job scheduling with genetic algorithms, FutureGeneration Computer Systems 21 (2005), no. 1, 151–161.

[Gri10] GridPP, The uk computing for particle physics grid, 2010, http://www.gridpp.ac.uk/ (accessedJun 23rd 2010).

[Gro10] AMQP Working Group, Advanced message queuing protocol, 2010,http://www.amqp.org/confluence/download/attachments/720900/amqp-master.1-0r0.pdf (accessed September 28, 2010).

[GYAB09] S. K. Garg, C. S. Yeo, A. Anandasivam, and R. Buyya, Energy-efficient scheduling of hpc appli-cations in cloud computing environments, Tech. report, The Cloud Computing and DistributedSystems (CLOUDS) Laboratory, University of Melbourne, 2009.

[HAR94] E. Hou, N. Ansari, and H. Ren, A genetic algorithm for multiprocessor scheduling, IEEE Trans.Parallel Distrib. Syst. 5 (1994), no. 2, 113–120.

[HKSJ+99] D. A. Hensgen, T. Kidd, D. St. John, M. .C. Schnaidt, and et al., An overview of mshn: Themanagement system for heterogeneous networks, 8th IEEE Heterogeneous Computing Work-shop (HCW 99, 1999, pp. 184–198.

[IBM06] IBM Corporation, An architectural blueprint for autonomic computing, Tech. report, IBM Cor-poration, 2006.

[ILD+] A. Iosup, H. Li, C. Dumitrescu, L. Wolters, and D. H. J. Epema, Grid workflow archive,http://gwa.ewi.tudelft.nl/TheGridWorkloadFormat.pdf (accessed March, 2011).

[Jen94] K. Jensen, An introduction to the theoretical aspects of coloured petri nets, A Decade of Concur-rency, Reflections and Perspectives, REX School/Symposium (London, UK), Springer-Verlag,1994, pp. 230–272.

[jes10] Jess, 2010, http://www.jessrules.com/.

[JH08] V. Janjic and K. Hammond, Prescient scheduling of parallel functional programs on the grid,Intl. Symp. on Trends in Functional Programming (TFP 2008), 2008.

Adaptive Scheduling for Distributed Systems 173

[JHY08] V. Janjic, K. Hammond, and Y. Yang, Using application information to drive adaptive gridmiddleware scheduling decisions, MAI ’08: Proceedings of the 2nd workshop on Middleware-application interaction (New York, NY, USA), ACM, 2008, pp. 7–12.

[JSO10] JSON, Java simple object notation, 2010, http://www.json.org/ (accessed September 28, 2010).

[KBM02] K. Krauter, R. Buyya, and M. Maheswaran, A taxonomy and survey of grid resource manage-ment systems, Software Practice and Experience 32 (2002), 135–164.

[KC03] J. O. Kephart and D. M. Chess, The vision of autonomic computing, Computer 36 (2003),no. 1, 41–50.

[KL07] A. Konovalov and S. Linton, Symbolic computation software composability protocol spec-ification, circa preprint 2007/5 university of st. andrews, 2007, http://www-circa.mcs.st-and.ac.uk/preprints.html (accessed November, 2008).

[KMSN07] N. Kiran, V. Maheswaran, M. Shyam, and P. Narayanasamy, A novel taskreplica based resource scheduling algorithm in grid computing, 2007, (online)http://hipc.org/hipc2007/posters/resource-sched.pdf (accessed January, 2011).

[LA87] P. J. M. Laarhoven and E. H. L. Aarts, Simulated annealing: theory and applications, KluwerAcademic Publishers, Norwell, MA, USA, 1987.

[LBL06] S. Lu, A. Bernstein, and P. Lewis, Automatic workflow verification and generation, TheoreticalComputer Science 353 (2006), no. 1, 71–92.

[LCG10] LCG, The worldwide lhc computing grid, 2010, http://lcg.web.cern.ch/lcg/ (accessed Jun 23rd2010).

[LS02] B.G. Lawson and E. Smirni, Multiple-queue backfilling scheduling with priorities and reserva-tions for parallel systems, Lecture Notes in Computer Science, vol. 2862, Springer-Verlag, 2002,pp. 72–87.

[LSACi07] S. Lorpunmanee, M. N. Sap, A. H. Abdullah, and C. Chompoo-inwai, An ant colony optimiza-tion for dynamic job scheduling in grid environment, Procs. World Academy of Science, vol. 23,Engineering and Technology, 2007.

[LSHS05] C. Lee, Y. Schartzman, J. Hardy, and A. Snavely, Are user runtime estimates inherently inac-curate?, Job Scheduling Strategies for Parallel Processing (Berlin/Heidelberg), Lecture Notesin Computer Science, Springer-Verlag, 2005, pp. 253–263.

[LSSD03] H. Lamehamedi, Z. Shentu, B. Szymanskia, and E. Deelman, Simulation of dynamic data repli-cation strategies in data grids, the 17th International Symposium on Parallel and DistributedProcessing, 2003, pp. 10pp+.

[M.88] Waxman B. M., Routing of multipoint connections, Journal on Selected Areas in Communica-tions 6 (1988), no. 9, 1617–1622.

[MAS+99] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund, Dynamic mapping ofa class of independent tasks onto heterogeneous computing systems, Journal of Parallel andDistributed Computing 59 (1999), 107–131.

174 Marc E. Frıncu

[MF01] A. W. Mu’alem and D. G. Feitelson, Utilization, predictability, workloads, and user runtimeestimates in scheduling the ibm sp2 with backfilling, IEEE Transactions in Parallel DistributedSystems 12 (2001), no. 6, 529–543.

[MGR04] R. Muller, G. Greiner, and E. Rahm, Agent work: a workflow system supporting rule-basedworkflow adaptation, Data Knowl. Eng. 51 (2004), no. 2, 223–256.

[Mit03] M. Mitzenmacher, A brief history of generative models for power law and lognormal distribu-tions, Internet Mathematics 1 (2003), no. 2, 226–251.

[MKL+03] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins,and Z. Xu, Peer-to-peer computing, Tech. report, HP Laboraties Palo Alto, 2003,http://www.hpl.hp.com/techreports/2002/HPL-2002-57R1.pdf.

[MKS09] H. A. Muller, H. M. Kienle, and U. Stege, Autonomic computing: Now you see it, now youdon’t—design and evolution of autonomic software systems, International Summer School onSoftware Engineering (ISSE) 2006-2008 (A. De Lucia and F. Ferrucci, eds.), LNCS, vol. 5413,Springer, 2009, pp. 32–54.

[mOS10] mOSAIC, The mosaic fp7 project, 2010, http://www.mosaic-project.eu.

[MRR+53] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, Equation of StateCalculations by Fast Computing Machines, Journal of Chemical Physics 21 (1953), 1087–1092.

[NFG+06] L. Northrop, P. Feiler, R. Gabriel, J. Goodenough, and et al., Ultra-large-scale systems—Thesoftware challenge of the future, Tech. report, Carnegie Mellon University Software EngineeringInstitute, 2006.

[NPP05] Z. Nemeth, C. Perez, and T. Priol, Workflow enactment based on a chemical metaphor, SEFM’05: Proceedings of the Third IEEE International Conference on Software Engineering andFormal Methods (Washington, DC, USA), IEEE Computer Society, 2005, pp. 127–136.

[NPP06] Z. Nemeth, C. Perez, and T. Priol, Distributed workflow coordination: molecules and reactions,Parallel and Distributed Processing Symposium, International 0 (2006), 260–268.

[NRD06] C. Nagl, F. Rosenberg, and S. Dustdar, Vidre–a distributed service-oriented business rule enginebased on ruleml, Proceedings of the 10th IEEE International Enterprise Distributed ObjectComputing Conference (EDOC ’06) (Washington, DC, USA), IEEE Computer Society, 2006,pp. 35–44.

[OBSS04] L. Oliker, R. Biswas, H. Shan, and W. Smith, Job scheduling in a heterogenous grid environ-ments, February 2004.

[OGAea06] T. M. Oinn, M. Greenwood, M. Addis, and et al., Taverna: lessons in creating a workflowenvironment for the life sciences: Research articles, Concurrency and Computation: Practiceand Experience 18 (2006), no. 10, 1067–1100.

[OGM+05] D. Ouelhadj, G. M. Garibaldi, J. MacLaren, R. Sakellariou, and K. Krishnakumar, A multi-agent infrastructure and a service level agreement negotiation protocol for robust scheduling ingrid computing, EGC, 2005, pp. 651–660.

[ops10] Ops5, 2010, http://www.pcai.com/web/ai info/pcai ops.html.

Adaptive Scheduling for Distributed Systems 175

[PGOE04] V. B. Primet, O. Gluck, C. Otal, and F. Echantillac, Emulation d’un nuage reseau de grillesde calcul: ewan, Tech. report, Ecole Normale Superieur Lyon, 2004.

[PKN04] A. Page, T. Keane, and T. J. Naughton, Adaptive scheduling across a distributed computa-tion platform, Third International Symposium on Parallel and Distributed Computing (Cork,Ireland) (John P. Morrisson, ed.), IEEE Computer Society, July 2004, pp. 141–149.

[Pla] PlanetLab, Planet-lab, http://www.planet-lab.org/ (accessed Jun 23rd 2010).

[pro10] Prolog, 2010, http://pauillac.inria.fr/ deransar/prolog/docs.html.

[RF02] K. Ranganathana and I. Foster, Decoupling computation and data scheduling in distributed data-intensive applications, HPDC ’02: Proceedings of the 11th IEEE International Symposium onHigh Performance Distributed Computing (Washington, DC, USA), IEEE Computer Society,2002, p. 352.

[RL04] G. R. S. Ritchie and J. Levine, A hybrid ant algorithm for scheduling independent jobs inheterogeneous computing environments, Procs. the 23rd Workshop of the UK Planning andScheduling Special Interest Group, 2004.

[RvG01] A. Radulescu and A. van Gemund, A low-cost approach towards mixed task and data parallelscheduling, 2001.

[SC07] F. Suter and H. Casanova, Extracting synthetic multi-cluster platform configurations fromgrid’5000 for driving simulation experiments, Tech. report, INRIA, 2007, https://halinria.fr/inria-00166181.

[Sch04] J. M. Schopf, Grid resource management, Grid resource management (Jarek Nabrzyski, Jen-nifer M. Schopf, and Jan Weglarz, eds.), Kluwer Academic Publishers, Norwell, MA, USA,2004, pp. 15–23.

[sci] Symbolic computation infrastructure for Europe, http://www.symbolic-computation.org/ (accessed Jun 23rd 2010).

[SD06] Z. Shi and J. J. Dongarra, Scheduling workflow applications on processors with different capabilities, Future Gener. Comput. Syst. 22 (2006), no. 6, 665–675.

[SFT98] W. Smith, I. T. Foster, and V. E. Taylor, Predicting application run times using historical information, Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, vol. 1459, Springer-Verlag, 1998, pp. 122–142.

[SFT00] J. Sauer, T. Freese, and T. Teschke, Towards agent-based multi-site scheduling, Proceedings of the ECAI 2000 Workshop on New Results in Planning, Scheduling, and Design (Berlin), 2000, pp. 123–130.

[SKF08] B. Sotomayor, K. Keahey, and I. Foster, Combining batch execution and leasing using virtual machines, Proceedings of the 17th International Symposium on High Performance Distributed Computing (New York, NY, USA), HPDC '08, ACM, 2008, pp. 87–96.

[SLB06] J. Schneider, B. Linnert, and L. O. Burchard, Distributed workflow management for large-scale grid environments, IEEE/IPSJ International Symposium on Applications and the Internet, 2006, pp. 229–235.

176 Marc E. Frîncu

[SLGW02] W. Shen, Y. Li, H. Ghenniwa, and C. Wang, Adaptive negotiation for agent-based grid computing, Proceedings of the Agentcities/AAMAS'02, 2002, pp. 32–36.

[SMLF09] B. Sotomayor, R. S. Montero, I. M. Llorente, and I. Foster, Capacity Leasing in Cloud Systems using the OpenNebula Engine, Cloud Computing and Applications 2008 (CCA08), 2009.

[SOBS04] H. Shan, L. Oliker, R. Biswas, and W. Smith, Scheduling in heterogeneous grid environments: The effects of data migration, Proc. of ADCOM 2004: International Conference on Advanced Computing and Communication, December 2004.

[SQF] F. Suter, M. Quinson, and M. Frîncu, Platform description archive, http://pda.gforge.inria.fr/ (accessed January, 2011).

[SS08] M. Singh and P. K. Suri, A QoS based predictive max-min, min-min switcher algorithm for job scheduling in a grid, Information Technology Journal 7 (2008), no. 8, 1176–1181.

[SZ04a] R. Sakellariou and H. Zhao, A hybrid heuristic for dag scheduling on heterogeneous systems,IPDPS 02 (2004), 111b.

[SZ04b] , A hybrid heuristic for dag scheduling on heterogeneous systems, Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, 2004.

[SZ04c] , A low-cost rescheduling policy for efficient mapping of workflows on grid systems, Sci.Program. 12 (2004), no. 4, 253–262.

[SZTD07] R. Sakellariou, H. Zhao, E. Tsiakkouri, and M. Dikaiakos, Scheduling workflows with budget constraints, Integrated Research in GRID Computing (Sergei Gorlatch and Marco Danelutto, eds.), CoreGRID, Springer-Verlag, 2007, pp. 189–202.

[TGJ+02] H. Tangmunarunkit, R. Govindan, S. Jamin, S. Shenker, and W. Willinger, Network topology generators: Degree-based vs. structural, Proceedings of ACM SIGCOMM'2002, 2002.

[TH02] T. Tobita and H. Kasahara, A standard task graph set for fair evaluation of multiprocessor scheduling algorithms, Journal of Scheduling 5 (2002), no. 5, 379–394.

[tIL10] ADAM team, INRIA Lille, FraSCAti, 2010, https://wiki.ow2.org/frascati/Wiki.jsp?page=FraSCAti (accessed November 17, 2010).

[TL09] K. Tumer and J. Lawson, Multiagent coordination for multiple resource job scheduling, Adaptive and Learning Agents (M. Taylor and K. Tuyls, eds.), Lecture Notes in AI, Springer, 2009.

[TMN+99] A. Takefusa, S. Matsuoka, H. Nakada, K. Aida, and U. Nagashima, Overview of a performance evaluation system for global computing scheduling algorithms, 8th IEEE International Symposium on High Performance Distributed Computing (Redondo Beach, California, USA), 1999.

[TS07] T. N'Takpé and F. Suter, A comparison of scheduling approaches for mixed-parallel applications on heterogeneous platforms, Proc. of the 2007 Int. Symp. on Parallel and Distributed Computing, IEEE Computer Society Press, 2007, pp. 250–257.

[TTL03] D. Thain, T. Tannenbaum, and M. Livny, Condor and the grid, Grid Computing: Making the Global Infrastructure a Reality (F. Berman, G. Fox, and A. Hey, eds.), John Wiley & Sons, 2003, pp. 301–335.


[TTL05] , Distributed computing in practice: the Condor experience, Concurrency - Practice and Experience 17 (2005), no. 2-4, 323–356.

[TUoV10] The University of Victoria, Canada, Cloud Scheduler, 2010, http://cloudscheduler.org/ (accessed 7 Jan 2010).

[TW07] D. L. Tennenhouse and D. J. Wetherall, Towards an active network architecture, SIGCOMM Comput. Commun. Rev. 37 (2007), 81–94.

[TZ06] J. Tang and M. Zhang, An agent-based peer-to-peer grid computing architecture: convergence of grid and peer-to-peer computing, ACSW Frontiers '06: Proceedings of the 2006 Australasian workshops on Grid computing and e-research (Darlinghurst, Australia), Australian Computer Society, Inc., 2006, pp. 33–39.

[vdA98] W. M. P. van der Aalst, The application of Petri nets to workflow management, The Journal of Circuits, Systems and Computers 8 (1998), no. 1, 21–66.

[vdAtH05] W. M. P. van der Aalst and A. H. M. ter Hofstede, YAWL: yet another workflow language, Information Systems 30 (2005), no. 4, 245–275.

[VM10] N. M. Villegas and H. A. Müller, Managing dynamic context to optimize smart interactions and services, 1st ed., LNCS, vol. 6400, Springer, 2010, ISBN 978-3-642-16598-6.

[VMW10] VMware, RabbitMQ, 2010, http://www.rabbitmq.com/documentation.html (accessed September 28, 2010).

[VYW+02] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostić, J. Chase, and D. Becker, Scalability and accuracy in a large-scale network emulator, SIGOPS Oper. Syst. Rev., vol. 36, 2002, pp. 271–284.

[WARW04] G. Weichhart, M. Affenzeller, A. Reitbauer, and S. Wagner, Modelling of an agent-based schedule optimisation system, Proc. of the IMS International Forum (Como, Italy), 2004, pp. 79–87.

[WBHD07] D. Weyns, N. Boucke, T. Holvoet, and B. Demarsin, DynCNET: A protocol for dynamic task assignment in multiagent systems, Proceedings of the First International Conference on Self-Adaptive and Self-Organizing Systems (Washington, DC, USA), SASO '07, IEEE Computer Society, 2007, pp. 281–284.

[Woo99] M. Wooldridge, Intelligent agents, Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence (G. Weiss, ed.), MIT Press, 1999, pp. 27–77.

[WPSW04] S. J. Woodman, D. J. Palmer, S. K. Shrivastava, and S. M. Wheater, Distributed enactment of composite web services, Tech. report, University of Newcastle upon Tyne, 2004, http://www.cs.ncl.ac.uk/publications/trs/papers/848.pdf.

[WSB07] WSBPEL, Web services business process execution language version 2.0, 2007, http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf (accessed March 30, 2009).

[WSRM97] L. Wang, H. Siegel, V. Roychowdhury, and A. Maciejewski, Task matching and scheduling in heterogeneous computing environments using a genetic-algorithm-based approach, J. Parallel Distrib. Comput. 47 (1997), no. 1, 8–22.


[WvdHH08] H. Weigand, W. J. van den Heuvel, and M. Hiel, Rule-based service composition and service-oriented business rule management, Proceedings of the International Workshop on Regulations Modelling and Deployment (ReMoD'08) (2008), 1–12.

[WVZX10] G. Wei, A. Vasilakos, Y. Zheng, and N. Xiong, A game-theoretic method of fair resource allocation for cloud computing services, The Journal of Supercomputing 54 (2010), 252–269, DOI 10.1007/s11227-009-0318-1.

[WYJL04] A. Wu, H. Yu, S. Jin, K. Lin, and G. Schiavone, An incremental genetic algorithm approach to multiprocessor scheduling, IEEE Trans. Parallel Distrib. Syst. 15 (2004), no. 9, 824–834.

[XA10] F. Xhafa and A. Abraham, Computational models and heuristic methods for grid scheduling problems, Future Gener. Comput. Syst. 26 (2010), 608–621.

[XC07] F. Xhafa, J. Carretero, and A. Abraham, Genetic algorithm based schedulers for grid computing systems, International Journal of Innovative Computing, Information and Control 3 (2007), no. 5, 1053–1071.

[XCWB09] M. Xu, L. Cui, H. Wang, and Y. Bi, A multiple QoS constrained scheduling strategy of multiple workflows for cloud computing, International Symposium on Parallel and Distributed Processing with Applications, 2009, pp. 629–634.

[XDCC04] H. Xia, H. Dail, H. Casanova, and A. A. Chien, The MicroGrid: Using online simulation to predict application performance in diverse grid network environments, Challenges of Large Applications in Distributed Environments (CLADE), 2004, p. 52.

[Xha07] F. Xhafa, A hybrid evolutionary heuristic for job scheduling on computational grids, Hybrid Evolutionary Algorithms (Ajith Abraham, Crina Grosan, and Hisao Ishibuchi, eds.), Studies in Computational Intelligence, vol. 75, Springer Berlin / Heidelberg, 2007, DOI 10.1007/978-3-540-73297-6_11, pp. 269–311.

[YB05] J. Yu and R. Buyya, Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms, Scientific Programming 14 (2005), no. 3-4, 217–230.

[YB07] , Workflow scheduling algorithms for grid computing, Tech. report, The University of Melbourne, 2007, http://www.gridbus.org/reports/WorkflowSchedulingAlgs2007.pdf.

[YCA04] S. K. Yang, H. Casanova, and A. Chien, Realistic modeling and synthesis of resources for computational grids, Proceedings of SuperComputing'04, 2004, http://vgrads.rice.edu/publications/pdfs/SC04-Kee.pdf.

[YD02] A. YarKhan and J. J. Dongarra, Experiments with scheduling using simulated annealing in a grid environment, Proceedings of the Third International Workshop on Grid Computing, Lecture Notes in Computer Science, vol. 2536, Springer-Verlag, 2002, pp. 232–242.

[ZCD97] E. W. Zegura, K. L. Calvert, and M. J. Donahoo, A quantitative comparison of graph-based models for internet topology, IEEE/ACM Trans. Netw. 5 (1997), 770–783.

[ZFM01] C. Zandron, C. Ferretti, and G. Mauri, Using membrane features in P systems, Romanian Journal of Information Science and Technology 4 (2001), no. 2, 241–257.


[ZFZ11] F. Zamfirache, M. Frîncu, and D. Zaharie, Population-based metaheuristics for tasks scheduling in heterogeneous distributed systems, NMA '10: Proceedings of the 7th International Conference on Numerical Methods and Applications, Lecture Notes in Computer Science, vol. 6046, Springer-Verlag, 2011, pp. 321–328.

[ZS03] H. Zhao and R. Sakellariou, An experimental investigation into the rank function of the heterogeneous earliest finish time scheduling algorithm, Euro-Par, Lecture Notes in Computer Science, vol. 2790, Springer-Verlag, 2003.

[ZTF98] R. E. Ziemer, W. H. Tranter, and R. D. Fannin, Signals and Systems: Continuous and Discrete, 4th ed., Prentice Hall, 1998.

[ZWM99] A. Zomaya, C. Ward, and B. Macey, Genetic scheduling for parallel processor systems: Comparative studies and performance issues, IEEE Trans. Parallel Distrib. Syst. 10 (1999), no. 8, 795–812.