Parallel Computing Systems
Part III: Job Scheduling

Dror Feitelson
Hebrew University
©2003 Dror Feitelson

Types of Scheduling

• Task scheduling
  – Application is partitioned into tasks
  – Tasks have precedence constraints
  – Need to map tasks to processors
  – Need to consider communication too
  – Part of creating an application

• Job scheduling
  – Scheduling competing jobs belonging to different users
  – Part of the operating system

We’ll Focus on Job Scheduling

Dimensions of Scheduling

• Space slicing
  – Partition the machine into disjoint parts
  – Jobs get exclusive use of a partition

• Time slicing
  – Multitasking on each processor
  – Similar to conventional systems

• Use both together

• Use none – batch scheduling on a dedicated machine

Feitelson, RC 19790, 1997

Space Slicing

• Fixed: predefined partitions
  – Used on the CM-5

• Variable: carve out the number requested
  – Used on most systems: Paragon, SP, …
  – Some restrictions may apply, e.g. on a torus

• Adaptive: modify the request size according to system considerations
  – Fewer nodes if more jobs are present

• Dynamic: modify the size at runtime too

Time Slicing

• Uncoordinated: each PE schedules on its own
  – Local queue: processes allocated to PEs; requires load balancing
  – Global queue: provides automatic load sharing, but the queue may become a bottleneck

• Coordinated across multiple PEs
  – Explicit gang scheduling
  – Implicit co-scheduling

[Taxonomy table of example systems in these two dimensions — rows: run to completion, time slicing with a local queue, with a global queue, gang scheduling; columns: full machine, fixed partitions or powers of 2, flexible partitions, two-level scheduling. Example systems include the Illiac IV, MPP, GF11, StarOS, Tera, NYU Ultracomputer, Dynix, Alliant FX/8, hypercubes (CM-2, iPSC/2, nCUBE), CM-5, Cedar, the IBM SP, transputers, Chrysalis, Mach, ParPar, and the LLNL gang scheduler.]

Scheduling Framework

[Diagram: arriving jobs enter a queue; allocation assigns them to the machine; terminating jobs free their processors]

Partitioning with run-to-completion
– Order of taking jobs from the queue
– Re-definition of job size

Scheduling Framework

[Diagram: as before, with preemption returning running jobs to the queue]

Time slicing with preemption
– Setting time quanta and priorities
– May jobs migrate or change size when preempted?

Memory Considerations

• The processes of a parallel application typically communicate

• To make good progress, they should all run simultaneously

• A process that suffers a page fault is unavailable for communication

• Paging should therefore be avoided

Scheduling Framework

[Diagram: jobs pass through memory allocation and then dispatching]

Two stages of scheduling
– Or three stages, with swapping

Variable Partitioning

Batch Systems

• Define a system of queues with different combinations of resource bounds

• Schedule FCFS from these queues

• Different queues are active at prime vs. non-prime time

• Sophisticated/complex services provided
  – Accounting and limits on users/groups
  – Staging of data in and out of the machine
  – Political prioritization as needed

Example – SDSC Paragon

Queues by node count and time limit (lowercase q = 16MB nodes, uppercase Q = 32MB nodes; qstb/Qstb are low-priority standby queues):

time    1      4      8          16          32          64          128           256
15m     q4t
1h             q4s    q8s/Q8s    q16s/Q16s   q32s/Q32s   q64s
4h                                           q32m/Q32m   q64m/Q64m   q128m/Q128m   q256m/Q256m
12h     q1l                                  q32l/Q32l   q64l/Q64l   q128l/Q128l   Q256l
*       qstb/Qstb

Wan et al., JSSPP 1996

The Problem

• Fragmentation
  – If the first queued job needs more processors than are available, it must wait for more to be freed
  – Available processors remain idle during the wait

• FCFS (first come first served)
  – Short jobs may be stuck behind long jobs in the queue

The Solution

• Out-of-order scheduling
  – Allows better packing of jobs
  – Allows prioritization according to desired considerations

Backfilling

• Allow jobs from the back of the queue to jump over previous jobs

• Make reservations for jobs at the head of the queue to prevent starvation

• Requires estimates of job runtimes

Lifka, JSSPP 1995

Example

[Figure: four jobs scheduled FCFS in processor–time space; processors stay idle while a wide job waits]

[Figure: the same jobs with backfilling — a later job runs early in the idle processors, and a reservation protects the first queued job]

Parameters

• Order for going over the queue
  – FCFS
  – Some prioritized order (Maui)

• How many reservations to make
  – Only one (EASY)
  – For all skipped jobs (Conservative)
  – According to need

• Lookahead
  – Consider one job at a time
  – Or look deeper into the queue

EASY Backfilling

Extensible Argonne Scheduling System (first large IBM SP installation)

• Definitions:
  – Shadow time: time at which the first queued job can run
  – Extra processors: processors left over when the first job runs

• Backfill if either
  – the job will terminate by the shadow time, or
  – the job needs no more than the extra processors

Lifka, JSSPP 1995
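The EASY test can be sketched in a few lines of Python. This is a minimal illustration, not Lifka's implementation: the data layout (running jobs as `(procs, end_time)` pairs) and all function names are my own.

```python
# Sketch of the EASY backfill test. Running jobs are (procs, end_time)
# pairs; 'free' is the number of currently idle processors.

def shadow_and_extra(running, free, first_procs):
    """Earliest time the first queued job can start (shadow time), and the
    processors left over ("extra") once it does."""
    avail = free
    shadow = 0.0
    for procs, end in sorted(running, key=lambda j: j[1]):
        if avail >= first_procs:
            break
        avail += procs      # this job's processors free up at time 'end'
        shadow = end
    return shadow, avail - first_procs

def can_backfill(job_procs, job_est, now, running, free, first_procs):
    if job_procs > free:                 # must fit right now
        return False
    shadow, extra = shadow_and_extra(running, free, first_procs)
    # Backfill if the job terminates by the shadow time,
    # or if it fits within the extra processors.
    return now + job_est <= shadow or job_procs <= extra
```

For example, with a 64-processor job running until t=10 on a 128-processor machine and a 96-processor job first in the queue, a 32-processor job backfills only if its estimate keeps it under the t=10 shadow time.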

First Case

[Figure: a job backfills because it will terminate by the shadow time]

Second Case

[Figure: a job backfills because it needs no more than the extra processors]

Properties

• Unbounded delay
  – Backfilled jobs will not delay the first queued job
  – But they may delay other queued jobs…

Mu’alem & Feitelson, IEEE TPDS 2001

Delay

[Figures: a backfilled job 4 does not delay the first queued job, but its execution delays other queued jobs (2 and 3)]

Properties

• Unbounded delay
  – Backfilled jobs will not delay the first queued job
  – But they may delay other queued jobs…

• No starvation
  – Delay of the first queued job is bounded by the runtime of the current jobs
  – When it runs, the second queued job becomes first
  – It is then immune to further delays

Mu’alem & Feitelson, IEEE TPDS 2001

User Runtime Estimates

• Small estimates allow a job to backfill and skip the queue

• Too-short estimates risk the job being killed for exceeding its time

• So estimates may be expected to be accurate

They Aren’t

Mu’alem & Feitelson, IEEE TPDS 2001

Surprising Consequence

Performance is actually better if runtime estimates are inaccurate!

Experiment: replace user estimates by up to f times the actual runtime (data for KTH):

f      resp. time  slowdown
0      15001       67.6
1      14717       67.0
3      14645       62.7
10     14880       63.7
30     15028       64.7
100    15110       64.9
users  15568       84.0

Exercise

Understand why this happens:

• Run simulations of EASY backfilling with real workloads

• Insert instrumentation to record detailed behavior

• Try to find why f=10 is better than f=1

• Try to find why user estimates are so bad

Hint

• It may be beneficial to look at different job classes

• Example: EASY vs. Conservative
  – EASY favors small long jobs: they can backfill despite delaying non-first jobs
  – This comes at the expense of larger short jobs
  – Happens more with user estimates than with accurate estimates

[2×2 grid of job classes: small/large × short/long]

Another Surprise

Possible to improve performance by multiplying user estimates by 2! (table shows the reduction, in %)

                   EASY    Conserv.
Bounded slowdown
  KTH              -4.8%   -23.0%
  CTC              -7.9%   -18.0%
  SDSC             +4.6%   -14.2%
Response time
  KTH              -3.3%   -7.0%
  CTC              -0.9%   -1.6%
  SDSC             -1.6%   -10.9%

The MAUI Scheduler

Queue order depends on:

• Waiting time in queue
  – Promotes equitable service

• Fair-share status

• Political priority

• Job parameters
  – Favor small/large jobs etc.

• Number of times skipped by backfilling
  – Prevents starvation

• Problem: conflicts are possible; it is hard to figure out what will happen

Jackson et al., JSSPP 2001
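One way to picture this kind of queue ordering is a weighted sum of the components listed above. This is purely illustrative — the component names and weights below are hypothetical, not Maui's actual configuration:

```python
# Hypothetical weighted-priority function in the spirit of Maui's queue
# ordering; weights and field names are illustrative.
WEIGHTS = {"wait": 1.0, "fairshare": 100.0, "political": 1000.0,
           "size": 0.1, "bypassed": 50.0}

def priority(job):
    return (WEIGHTS["wait"] * job["wait_minutes"]           # equitable service
            + WEIGHTS["fairshare"] * job["fairshare_delta"]
            + WEIGHTS["political"] * job["qos_level"]
            + WEIGHTS["size"] * job["nodes"]                # favor large jobs
            + WEIGHTS["bypassed"] * job["times_bypassed"])  # prevent starvation

jobs = [
    {"wait_minutes": 600, "fairshare_delta": 0.0, "qos_level": 0,
     "nodes": 4, "times_bypassed": 0},
    {"wait_minutes": 10, "fairshare_delta": 0.0, "qos_level": 1,
     "nodes": 64, "times_bypassed": 0},
]
queue = sorted(jobs, key=priority, reverse=True)
```

The "conflicts" problem on the slide is visible here: a large political weight can override any amount of waiting time, so the interaction of the components is hard to predict.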


Fair Share

• Actually unfair: strives for a specific share

• Based on comparison with historical data

• Parameters:
  – How long to keep information
  – How to decay old information
  – Specifying shares per user or group
  – Shares as an upper bound, a lower bound, or both

• Multiple resources handled by the maximal “PE equivalent” (usage out of the total available)
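The decay of old information can be sketched as a geometrically weighted average of past usage. The decay factor, window count, and function names here are assumptions for illustration, not any scheduler's actual parameters:

```python
# Sketch: decayed historical usage compared against a target share.
DECAY = 0.9     # weight retained per accounting window (assumed)
WINDOWS = 7     # how long to keep information (assumed)

def fairshare_usage(history):
    """history[0] is the most recent window's usage fraction."""
    kept = history[:WINDOWS]
    num = sum(DECAY**i * u for i, u in enumerate(kept))
    den = sum(DECAY**i for i in range(len(kept)))
    return num / den

def fairshare_delta(target, history):
    # Positive: the user is under their share and should be prioritized.
    return target - fairshare_usage(history)
```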


Lookahead

• EASY uses a greedy algorithm and considers jobs in one given order

• The alternative is to consider a set of jobs at once and try to derive an optimal packing

Dynamic Programming

• Outer loop: the number of jobs being considered

• Inner loop: the number of processors available

[Table: rows = no job, Job 1, Job 2, Job 3; columns = p=1…p=4; cell u(2,3) holds the achievable utilization on 3 processors using only the first 2 jobs]

Edi Shmueli, IBM Haifa

Cell Update

• If j.size > p: the job is too big to consider
  – u[j,p] = u[j−1,p]  (j is not selected)

• Else consider adding job j:
  – u′ = u[j−1, p − j.size] + j.size
  – If u′ > u[j−1,p]: u[j,p] = u′  (j is selected)
  – Else: u[j,p] = u[j−1,p]  (j is not selected)
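The cell-update rule above is the classic knapsack recurrence, and can be sketched as a table-filling routine (Python; the function name and backtracking details are my own):

```python
# Dynamic-programming sketch of lookahead packing: maximize the number of
# processors used out of 'procs', choosing among the first j jobs.
def max_util(jobs, procs):
    """jobs: list of sizes; returns (best utilization, selected job indices)."""
    u = [[0] * (procs + 1) for _ in range(len(jobs) + 1)]
    sel = [[False] * (procs + 1) for _ in range(len(jobs) + 1)]
    for j, size in enumerate(jobs, start=1):
        for p in range(procs + 1):
            u[j][p] = u[j - 1][p]              # default: j is not selected
            if size <= p:                      # else j is too big to consider
                cand = u[j - 1][p - size] + size
                if cand > u[j][p]:
                    u[j][p] = cand             # j is selected
                    sel[j][p] = True
    # The bottom-right cell holds the maximal achievable utilization;
    # recover the job set by walking the path of selected jobs.
    chosen, p = [], procs
    for j in range(len(jobs), 0, -1):
        if sel[j][p]:
            chosen.append(j - 1)
            p -= jobs[j - 1]
    return u[len(jobs)][procs], chosen[::-1]
```

For instance, with jobs of sizes 2, 3, and 4 on 5 free processors, the table yields utilization 5 by selecting the first two jobs.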


Preventing Starvation

• Option I: only use jobs that will terminate by the shadow time

• Option II: make a reservation for the first queued job (as in EASY)
  – Requires a 3D data structure:
    1. Jobs being considered
    2. Processors being used now
    3. Extra processors used at the shadow time

Dynamic Programming

• In the end the bottom-right cell contains the maximal achievable utilization

• The set of jobs to schedule is obtained by the path of selected jobs


Performance

• Backfilling leads to significant performance gains relative to FCFS

• More reservations reduce performance somewhat (EASY better than conservative)

• Lookahead improves performance somewhat


Dynamic Partitioning


Two-Level Scheduling

• Bottom level – processor allocation
  – Done by the system
  – Balances requests with availability
  – Can change at runtime

• Top level – process scheduling
  – Done by the application
  – Uses knowledge about priorities, lock holding, etc.

Programming Model

• Applications are required to handle arbitrary changes in allocated processors

• Workpile model
  – Easy to change the number of worker threads

• Scheduler activations
  – Any change causes an upcall into the application, which can reconsider what to run

Equipartitioning

• Strive to give all applications equal numbers of processors
  – When a job arrives, take some processors from each running job
  – When it terminates, give some to each remaining job

• Fair, and similar to processor sharing

• Caveats
  – Applications may have a maximal number of processors they can use efficiently
  – Applications may need a minimal number of processors due to memory constraints
  – Reconfigurations require many process migrations
    • Not an issue for shared memory
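An equipartition allocator with the min/max caveats above can be sketched as follows. This is a minimal illustration under assumed data structures, not a description of any real system:

```python
# Equipartitioning sketch: satisfy each job's minimum (memory constraint),
# then grow the smallest allocations first, capped at each job's maximum
# (efficiency constraint). Assumes the minima fit in 'total'.
def equipartition(jobs, total):
    """jobs: list of dicts with 'min' and 'max' processor counts."""
    alloc = [j["min"] for j in jobs]
    left = total - sum(alloc)
    while left > 0:
        growable = [i for i in range(len(jobs)) if alloc[i] < jobs[i]["max"]]
        if not growable:
            break                      # everyone is at their maximum
        i = min(growable, key=lambda k: alloc[k])
        alloc[i] += 1                  # one processor to the smallest job
        left -= 1
    return alloc
```

Handing out processors one at a time to the smallest allocation converges to equal shares, which is what makes the policy behave like processor sharing.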


Folding

• Reduce processor preemptions by selecting a partition and dividing it in half

• All partition sizes are powers of 2

• Easier for applications: when halved, multitask two processes on each processor

McCann & Zahorjan, SIGMETRICS 1994


The Bottom Line

• Places restrictions on the programming model
  – OK for workpile, Cray autotasking
  – Not suitable for MPI

• Very efficient at the system level
  – No fragmentation
  – Load leads to smaller partitions and reduced overheads for parallelism

• Of academic interest only, in shared-memory architectures

Gang Scheduling


Definition

• Processes are mapped one-to-one onto processors

• Time slicing is used for multiprogramming

• Context switching is coordinated across processors
  – All processes are switched at the same time
  – Either all run or none do

• This applies to gangs, typically all the processes in a job

CoScheduling

• A variant in which an attempt is made to schedule all the processes, but subsets may also be scheduled

• Assumes a “process working set” that should run together to make progress

• Does this make sense?
  – All processes are active entities
  – Are some more important than others?

Ousterhout, ICDCS 1982

Advantages

• Compensates for lack of knowledge
  – If runtimes are not known in advance, preemption prevents short jobs from being stuck
  – Same as in conventional systems

• Retains the dedicated-machine model
  – The application doesn’t need to handle interference by the system (as in dynamic partitioning)
  – Allows use of hardware support for fine-grain communication and synchronization

• Improves utilization

Utilization

• Assume a 128-node machine

• A 64-node job is running

• A 32-node job and a 128-node job are queued

• Should the 32-node job be started?

Feitelson & Jette, JSSPP 1997

Best Case

Start the 32-node job: 96 of 128 nodes are busy, for 75% utilization

[Figure: 64-node and 32-node jobs running side by side; 32 nodes left idle]

Worst Case

Start the 32-node job, but then the 64-node job terminates: only 32 of 128 nodes are busy, for 25% utilization, while the 128-node job still cannot start

[Figure: 32-node job running alone; 96 nodes left idle]

With Gang Scheduling

Start the 32-node job in the slot with the 64-node job, and the 128-node job in another slot. Utilization is 87.5% (or 62.5% if the 64-node job terminates)

[Figure: two time slots alternating between {64, 32} and {128}]
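The utilizations quoted in this example can be checked directly:

```python
# Utilization arithmetic for the 128-node example.
TOTAL = 128

best = (64 + 32) / TOTAL               # 32-node job fills in next to the 64-node job
worst = 32 / TOTAL                     # 64-node job terminates right after the decision
gang = (64 + 32 + 128) / (2 * TOTAL)   # two time slots: {64, 32} and {128}
gang_worst = (32 + 128) / (2 * TOTAL)  # same, after the 64-node job terminates

print(best, worst, gang, gang_worst)   # 0.75 0.25 0.875 0.625
```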


Disadvantages

• Overhead for context switching

• Overhead for coordinating context switching across many processors

• Reduced cache efficiency

• Memory pressure – more jobs need to be memory resident


Implementation

• Pack jobs in processor–time space
  – Assign processes to processors
  – Is migration allowed?

• Perform coordinated context switching
  – Decide on time slices: are all equal?

Packing

• Ousterhout matrix
  – Rows represent time slices
  – Columns represent processors

• Each job is mapped to a single row with enough space

• Optimizations
  – Slot unification when occupancy is complementary
  – Alternate scheduling
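First-fit packing into an Ousterhout matrix, with slot unification, can be sketched as follows. The representation (a list of rows, `None` for a free column) and function names are my own illustration:

```python
# Ousterhout matrix sketch: rows are time slices, columns are processors.
def pack(matrix, job_id, size, procs):
    """First-fit: place the job in one row with enough free columns."""
    for row in matrix:
        free = [c for c in range(procs) if row[c] is None]
        if len(free) >= size:
            for c in free[:size]:
                row[c] = job_id
            return
    matrix.append([job_id if c < size else None for c in range(procs)])

def unify(matrix):
    """Slot unification: merge two rows whose occupancy is complementary."""
    i = 0
    while i < len(matrix):
        for k in range(i + 1, len(matrix)):
            if all(a is None or b is None for a, b in zip(matrix[i], matrix[k])):
                matrix[i] = [a if a is not None else b
                             for a, b in zip(matrix[i], matrix[k])]
                del matrix[k]
                break          # re-check row i against the remaining rows
        else:
            i += 1
    return matrix
```

Fewer rows means each job runs more often, which is why unification (and good packing in general) matters.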


Example

[Figure: an example Ousterhout matrix]

Ousterhout, ICDCS 1982

Fragmentation

There can be unused space in each slot

(Unused slots are not a problem)


Alternate Scheduling

Effect can be reduced by running jobs in additional slots

This depends on good packing

Ousterhout, ICDCS 1982


Buddy System

• Allocate processors according to a buddy system
  – Successive partitioning into halves as needed

• Used and unused processors in different slots will tend to be aligned

• This facilitates alternate scheduling

Feitelson & Rudolph, Computer 1990
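A minimal sketch of buddy allocation (assuming requests are already rounded up to powers of 2; the free-list representation is my own):

```python
# Buddy allocation of processors: all partition sizes are powers of 2.
def buddy_alloc(free, size):
    """free: dict mapping block size -> list of block start offsets.
    Returns the start of an allocated block, or None if none fits."""
    s = size
    while s not in free or not free[s]:
        s *= 2                   # look for a larger block to split
        if s > max(free):
            return None          # no block large enough
    start = free[s].pop()
    while s > size:              # split into halves down to the request
        s //= 2
        free.setdefault(s, []).append(start + s)   # buddy goes on the free list
    return start
```

Because splits always happen at the same power-of-2 boundaries, allocations in different time slots tend to align — which is what makes alternate scheduling effective.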


Results

Feitelson, JSSPP 1996


Coordinated Context Switching

• Typically coordinated by a central manager

• Executed by local daemons

• Use SIGSTOP/SIGCONT to leave only one runnable process
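A local daemon's side of the switch can be sketched with POSIX signals. This is an illustrative toy (two one-process "gangs" running `sleep`), not a real daemon:

```python
# Coordinated switch via SIGSTOP/SIGCONT (POSIX): stop every process,
# then resume only the gang selected for the current time slot.
import signal
import subprocess

gangs = [[subprocess.Popen(["sleep", "30"])],
         [subprocess.Popen(["sleep", "30"])]]

def switch_to(slot):
    for gang in gangs:
        for p in gang:
            p.send_signal(signal.SIGSTOP)   # leave no process runnable...
    for p in gangs[slot]:
        p.send_signal(signal.SIGCONT)       # ...except the selected gang

switch_to(0)    # gang 0 runs, gang 1 is stopped
switch_to(1)    # coordinated switch: gang 1 runs

for gang in gangs:                          # clean up the toy processes
    for p in gang:
        p.send_signal(signal.SIGCONT)
        p.terminate()
```

In a real system the central manager broadcasts the slot number and the daemons perform this locally, so the switch is only as coordinated as the broadcast is fast.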


STORM

• Bases job execution and scheduling on 3 primitives
  – xfer-&-signal (broadcast)
  – test-event (optional block)
  – comp-&-write (on global variables)

• Implemented efficiently using NIC and network support

Frachtenberg et al., SC 2002

Flexible Co-Scheduling


Flexibility

Idea: reduce the constraints of strict gang scheduling

• DCS: demand-based coscheduling

• ICS: implicit coscheduling

• FCS: flexible coscheduling

• Paired gang scheduling

The common factor: all involve local scheduling

DCS

• Coscheduling is good for threads that communicate or synchronize

• Prioritizing threads that communicate will cause them to be coscheduled on different nodes

Sobalvarro & Weihl, JSSPP 1995


Algorithmic Details

• Switch to a thread that receives a message

• But only if that thread has received less than its fair share of the CPU
  – To prevent jobs with more messages from monopolizing the system

• And only if the arriving message does not belong to a previous epoch
  – A new epoch starts when some node switches spontaneously
  – Prevents thrashing and allows a job to gain control of the full machine

ICS

• Priority-based schedulers automatically give priority to threads waiting on communication

• But need to ensure that sending thread stays scheduled until a reply arrives

• Do this with two-phase blocking, and wait for about 5 context-switch durations

Dusseau & Culler, SIGMETRICS 1996


FCS

• Retain coordinated context switching across the machine

• But allow local scheduler to override global scheduling instructions according to local data

• Use local classification of processes into those that require gang scheduling and those that don’t

Frachtenberg et al., IPDPS 2003


Implementation Details

• Instrument MPI library to measure compute time between calls (granularity) and wait time for communication to complete

• Use this to classify processes

• Data is aged exponentially

• Upon a coordinated context switch, decide whether to abide or use local scheduler
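The measurement side can be sketched as follows. The aging factor and classification thresholds here are illustrative assumptions, not values from the FCS paper; the class names (CS, F, DC) are the ones used on the next slide:

```python
# Sketch: exponential aging of per-iteration measurements, and the
# classification an MPI wrapper might feed. Parameter values are assumed.
AGING = 0.5     # weight of the newest sample

def aged(avg, sample, alpha=AGING):
    """Exponentially age old data: new samples dominate, history decays."""
    return alpha * sample + (1 - alpha) * avg

def classify(granularity, wait, fine=0.01, long_wait=0.005):
    """Classify a process from its compute time between MPI calls
    (granularity) and its communication wait time; thresholds in seconds."""
    if granularity < fine and wait < long_wait:
        return "CS"   # fine-grain, low wait: coscheduling is effective
    if granularity < fine:
        return "F"    # frustrated: don't coschedule, but give priority
    return "DC"       # doesn't care: usable as filler
```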


Process Classification

[Figure: processes classified in the plane of computation time per iteration (x-axis) vs. communication wait time per iteration (y-axis)]

• CS – fine-grain and low wait: coscheduling is effective

• F – fine-grain but long wait: the process is frustrated; don’t coschedule, but give it priority

• DC – other processes don’t care; use them as filler

Paired Gang Scheduling

• Monitor CPU activity of processes as they run

• Low CPU implies heavy I/O or communication

• Schedule pairs of complementary gangs together

Wiseman & Feitelson, IEEE TPDS 2003


Scheduling Memory


Memory Pressure

• With partitioning: memory requirements may set a minimal partition size
  – May lead to underutilization of processors

• With gang scheduling: memory requirements may lead to excessive paging
  – No local context switching: processors remain idle waiting for pages
  – However, no thrashing

Paging

• Paging is asynchronous

• If two nodes communicate, and one suffers a page fault, the other will have to wait

• The whole application makes progress at the rate of the slowest process at each instant

• Effect is worse for finer grained applications


Solutions

• Admission control
  – Allow jobs into the system only as long as memory suffices (variant: allow only 3 jobs…)
  – Jobs that do not fit may wait a long time

• Swapping
  – Perform long-range scheduling to give queued jobs a chance to run

Admission Control

[Diagram: memory allocation acts as the admission point, followed by dispatching]

Problems

• Assessing memory requirements
  – Data size as noted in the executable
    • Good for static Fortran code
  – Historical data from previous runs
  – Problem of dynamic allocations

• Blocking of short jobs

Batat & Feitelson, IPPS/SPDP 1999

Performance

Prevention of paging compensates for the blocking of short jobs

[Graph: average response time (0–6000) vs. system load (0.3–0.9) for memory-size multipliers 1.25, 1.5, 1.75, and 1.99, plus a 1.5 variant with memory considerations added (higher locality)]

Batat & Feitelson, IPPS/SPDP 1999

Swapping

[Diagram: memory allocation and dispatching, with swapping added as a third stage]

Swapping Overhead

• m – memory size

• r – swapping bandwidth

• Swapping time is then 2m/r

[Figure: timeline — thread A running, swapping A out, swapping B in, thread B running; the switch takes 2m/r]

Alverson et al., JSSPP 1995
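The 2m/r cost is easy to make concrete: the outgoing job's memory is written out and the incoming job's is read in over the same channel. The numbers below are an illustrative example, not from the paper:

```python
# Swap time 2m/r: m bytes written out plus m bytes read in at r bytes/s.
def swap_time(m_bytes, r_bytes_per_sec):
    return 2 * m_bytes / r_bytes_per_sec

# e.g. a 256 MB job over a 100 MB/s channel:
print(swap_time(256e6, 100e6))   # 5.12 seconds
```

At these rates the switch costs several seconds, which is why swapping is only worthwhile for long-range scheduling, not per-time-slice switching.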


Swapping Overhead

• When swapping multiple jobs, each needs all its memory in order to run

• Memory utilization is best when handling jobs one at a time

[Figure: memory–time layout of jobs A–E; overlapping resident jobs leave unused memory]

Alverson et al., JSSPP 1995