Dynamic Fractional Resource Scheduling for HPC Workloads -- 2009, Lyon


Dynamic Fractional Resource Scheduling for HPC Workloads

Mark Stillwell [1], Frédéric Vivien [2,1], Henri Casanova [1]

[1] Department of Information and Computer Sciences, University of Hawai’i at Mānoa

[2] INRIA, France

Invited Talk, October 8, 2009


Formalization

HPC Job Scheduling Problem

0 < N homogeneous nodes
0 < J jobs; each job j has:
  an arrival time 0 ≤ rj
  0 < tj ≤ N tasks
  a compute time 0 < cj
J is not known in advance
rj and tj are not known before rj
cj is not known until j completes
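A minimal Python sketch of this job model, just to fix notation; the class and field names are mine, not from the talk:

```python
from dataclasses import dataclass

@dataclass
class Job:
    """One job in the on-line model: the scheduler learns of the job only at rj,
    and learns cj only when the job completes."""
    arrival: float       # rj, with 0 <= rj
    tasks: int           # tj, with 0 < tj <= N
    compute_time: float  # cj, with 0 < cj (hidden from the scheduler)

# Example instance on a cluster of N = 4 homogeneous nodes.
N = 4
jobs = [Job(arrival=0.0, tasks=2, compute_time=100.0),
        Job(arrival=5.0, tasks=4, compute_time=30.0)]
```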


Formalization

Schedule Evaluation

makespan is not relevant for unrelated jobs
flow time over-emphasizes very long jobs
stretch re-balances in favor of short jobs
average stretch is prone to starvation
max stretch helps the average while bounding the worst case
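As a concrete illustration of why max stretch is the metric of choice here, a small sketch (function names are mine) computing flow time and stretch for two finished jobs:

```python
def flow_time(arrival, completion):
    """Time the job spends in the system."""
    return completion - arrival

def stretch(arrival, completion, compute_time):
    """Flow time divided by the time the job would take on a dedicated system."""
    return flow_time(arrival, completion) / compute_time

# A long job, and a short job that had to wait behind it.
finished = [
    {"arrival": 0.0, "completion": 1000.0, "compute_time": 1000.0},  # stretch 1.0
    {"arrival": 0.0, "completion": 1010.0, "compute_time": 10.0},    # stretch 101.0
]
stretches = [stretch(j["arrival"], j["completion"], j["compute_time"]) for j in finished]
print(sum(stretches) / len(stretches))  # average stretch, dominated by the short job
print(max(stretches))                   # max stretch bounds the worst case
```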


Current Approaches

Current Approaches

Batch Scheduling, which no one likes:
  usually FCFS with backfilling
  backfilling needs (unreliable) compute time estimates
  unbounded wait times
  poor resource utilization
  no particular objective

Gang Scheduling, which no one uses:
  globally coordinated time sharing
  complicated and slow
  memory pressure is a concern


Dynamic Fractional Resource Scheduling

VM Technology

basically, time sharing
pooling of discrete resources (e.g., multiple CPUs)
hard limits on resource consumption
job preemption and task migration


Dynamic Fractional Resource Scheduling

Problem Formulation

extends the basic HPC problem
jobs now have a per-task CPU need αj and memory requirement mj
multiple tasks can run on one node if their total memory requirement is ≤ 100%
a job's tasks must be assigned equal amounts of CPU resource
assigning less than the need results in proportional slowdown
assigned allocations can change
no run-time estimates, so we need another metric to optimize
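A hedged sketch of the two constraints just described, with illustrative numbers of my own choosing:

```python
def fits_on_node(task_mems):
    """Tasks can share a node only if their memory requirements total at most 100%."""
    return sum(task_mems) <= 1.0

def slowdown(cpu_alloc, cpu_need):
    """Giving a task less CPU than its need slows its job down proportionally."""
    assert 0.0 < cpu_alloc <= cpu_need
    return cpu_need / cpu_alloc

print(fits_on_node([0.5, 0.4]))  # True: 90% of node memory
print(fits_on_node([0.5, 0.6]))  # False: would exceed 100%
print(slowdown(0.3, 0.6))        # 2.0: the job runs at half speed
```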


Dynamic Fractional Resource Scheduling

Yield

Definition

The yield yj(t) of job j at time t is the ratio of the CPU allocation given to the job to the job's CPU need.

requires no knowledge of flow or compute times
can be optimized at each scheduling event
maximizing the minimum yield is related to minimizing the maximum stretch
How do we keep track of job progress when the yield can vary?
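A tiny sketch of the resulting objective, maximizing the minimum yield over all running jobs (the numbers are made up):

```python
def job_yield(cpu_alloc, cpu_need):
    """Yield = fraction of its CPU need that a job actually receives (1.0 = full speed)."""
    return cpu_alloc / cpu_need

allocations = [0.30, 0.50, 0.20]
needs       = [0.60, 0.50, 0.40]
yields = [job_yield(a, n) for a, n in zip(allocations, needs)]
print(yields)       # [0.5, 1.0, 0.5]
print(min(yields))  # the minimum yield is what the scheduler tries to maximize
```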


Dynamic Fractional Resource Scheduling

Virtual Time

Definition

The virtual time vj(t) of job j at time t is the subjective time experienced by the job:

vj(t) = ∫_{rj}^{t} yj(τ) dτ

The job completes when vj(t) = cj.
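With piecewise-constant yields between scheduling events, the integral becomes a sum; a small sketch (the interval representation is mine):

```python
def virtual_time(yield_intervals, r_j, t):
    """Integrate job j's yield from its release r_j up to time t.
    yield_intervals: list of (start, end, yield) with constant yield on each interval."""
    v = 0.0
    for start, end, y in yield_intervals:
        lo, hi = max(start, r_j), min(end, t)
        if hi > lo:
            v += y * (hi - lo)
    return v

# Job released at t = 0 with cj = 10: full speed for 5 time units, then half speed.
intervals = [(0.0, 5.0, 1.0), (5.0, 20.0, 0.5)]
print(virtual_time(intervals, 0.0, 5.0))   # 5.0
print(virtual_time(intervals, 0.0, 15.0))  # 10.0 -> vj(t) = cj, the job completes at t = 15
```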


Dynamic Fractional Resource Scheduling

The Need for Preemption

the final goal is to minimize the maximum stretch
without preemption, the max stretch of non-clairvoyant on-line algorithms is unbounded

consider 2 jobs:
  both require all of the system resources
  one has cj = 1
  the other has cj = ∆
  if the short job arrives just after the long one starts, it must wait roughly ∆, so its stretch grows without bound as ∆ grows

we need a criterion to decide which jobs should be preempted


Dynamic Fractional Resource Scheduling

Priority

Jobs should be preempted in order of increasing priority.
newly arrived jobs may have infinite priority
1/vj(t) performs well, but is subject to starvation
(t − rj)/vj(t) avoids starvation, but does not perform well
(t − rj)/vj(t)² seems a reasonable compromise (see the sketch below)
other possibilities exist
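A sketch of the compromise priority, used to pick preemption victims in increasing order; the job values are invented:

```python
def priority(t, r_j, v_j):
    """(t - rj) / vj(t)^2; a job with no progress yet gets effectively infinite priority."""
    if v_j <= 0.0:
        return float("inf")
    return (t - r_j) / (v_j * v_j)

# An old, well-served job versus a recently arrived, starved job at time t = 100.
jobs = [("old, well served", 0.0, 90.0), ("recent, starved", 80.0, 2.0)]
now = 100.0
for name, r_j, v_j in sorted(jobs, key=lambda j: priority(now, j[1], j[2])):
    print(name, priority(now, r_j, v_j))  # the old job comes first, so it is preempted first
```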


Greedy Heuristics

Greedy Scheduling Heuristics

GREEDY – put tasks on the host with the lowest CPU demand on which they fit in memory; new jobs may have to be resubmitted using bounded exponential backoff
GREEDY-PMTN – like GREEDY, but older tasks may be preempted
GREEDY-PMTN-MIGR – like GREEDY-PMTN, but older tasks may be migrated as well as preempted
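A hedged sketch of the GREEDY placement rule for a single task; the host bookkeeping and the backoff/resubmission handling are simplified assumptions of mine:

```python
def greedy_place(task_cpu, task_mem, hosts):
    """Place one task on the host with the lowest current CPU demand among those
    where it still fits in memory; return the host index, or None if it fits nowhere
    (the job would then be resubmitted with bounded exponential backoff)."""
    candidates = [i for i, h in enumerate(hosts) if h["mem_used"] + task_mem <= 1.0]
    if not candidates:
        return None
    best = min(candidates, key=lambda i: hosts[i]["cpu_demand"])
    hosts[best]["cpu_demand"] += task_cpu
    hosts[best]["mem_used"] += task_mem
    return best

hosts = [{"cpu_demand": 0.8, "mem_used": 0.7},
         {"cpu_demand": 0.2, "mem_used": 0.5}]
print(greedy_place(0.4, 0.3, hosts))  # 1: host 1 has the lower CPU demand and the task fits
```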


MCB Heuristics

Connection to multi-capacity bin packing

For each discrete scheduling event:
  the problem is similar to multi-capacity (vector) bin packing, but has an optimization target and variable CPU allocations
  it can be formulated as an MILP [Stillwell et al., 2009] (NP-complete)
  relaxed LP heuristics are slow and give low-quality solutions


MCB Heuristics

Applying MCB heuristics

yield is continuous, so choose a granularity (0.01)
perform a binary search on the yield, seeking to maximize it
for each fixed yield, set the CPU requirements and apply the packing heuristic
the found yield is the maximized minimum; leftover CPU is used to improve the average
if a solution cannot be found at any yield, remove the lowest-priority job and try again
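A sketch of the binary search driver; `try_to_pack` stands in for the MCB packing heuristic, and the toy version below only checks total CPU, so treat it as an illustration rather than the real feasibility test:

```python
def max_min_yield(jobs, try_to_pack, granularity=0.01):
    """Largest yield at which try_to_pack(jobs, y) still succeeds, found by binary
    search; returns None if no yield works (the caller would then drop the
    lowest-priority job and retry)."""
    lo, hi, best = 0.0, 1.0, None
    while hi - lo > granularity:
        mid = (lo + hi) / 2.0
        if try_to_pack(jobs, mid):
            best, lo = mid, mid
        else:
            hi = mid
    return best

# Toy feasibility test: the scaled CPU needs must fit into a 4-node cluster.
def toy_pack(cpu_needs, y):
    return sum(y * c for c in cpu_needs) <= 4.0

print(max_min_yield([1.0] * 6, toy_pack))  # about 0.66
```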


MCB Heuristics

MCB8 Heuristic

Based on [Leinberger et al., 1999], simplified to the 2-dimensional case:

1. Put job tasks in two lists: CPU-intensive and memory-intensive.

2. Sort the lists by “some criterion” (MCB8: descending order by maximum).

3. Starting with the first host, pick tasks that fit, in order, from the list that goes against the current imbalance. Example: the current host's tasks total 50% CPU and 60% memory, so assign the next task that fits from the list of CPU-intensive jobs.

4. When no tasks can fit on a host, go to the next host.

5. If all tasks can be placed, then success; otherwise failure.
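A hedged sketch of these five steps for the 2-dimensional case; the data layout and tie-breaking details are my own simplifications:

```python
def mcb8_pack(tasks, num_hosts):
    """tasks: list of (cpu, mem) requirements in [0, 1].
    Returns {task_index: host_index} on success, or None on failure."""
    # Steps 1-2: two lists of task indices, CPU-intensive vs. memory-intensive,
    # each sorted in descending order by the task's larger requirement.
    cpu_list = sorted((i for i, (c, m) in enumerate(tasks) if c >= m),
                      key=lambda i: max(tasks[i]), reverse=True)
    mem_list = sorted((i for i, (c, m) in enumerate(tasks) if c < m),
                      key=lambda i: max(tasks[i]), reverse=True)
    placement = {}
    for host in range(num_hosts):
        cpu_used = mem_used = 0.0
        while True:
            # Step 3: draw from the list that goes against the current imbalance
            # (more CPU used than memory -> prefer a memory-intensive task, etc.).
            order = [mem_list, cpu_list] if cpu_used > mem_used else [cpu_list, mem_list]
            picked = next(((lst, i) for lst in order for i in lst
                           if cpu_used + tasks[i][0] <= 1.0
                           and mem_used + tasks[i][1] <= 1.0), None)
            if picked is None:
                break  # Step 4: nothing else fits here, go to the next host
            lst, i = picked
            lst.remove(i)
            cpu_used += tasks[i][0]
            mem_used += tasks[i][1]
            placement[i] = host
    # Step 5: success only if every task has been placed.
    return placement if not cpu_list and not mem_list else None

print(mcb8_pack([(0.6, 0.3), (0.3, 0.6), (0.5, 0.5), (0.4, 0.2)], num_hosts=2))
```

On success the function returns a task-to-host map, which is what the binary search above would accept or reject at each candidate yield.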


MCB Heuristics

MCB8 Scheduling Heuristics

DYNMCB8 – apply the heuristic on every event
DYNMCB8-PER – apply the heuristic periodically
DYNMCB8-ASAP-PER – like DYNMCB8-PER, but try to greedily schedule incoming jobs
DYNMCB8-STRETCH-PER – like DYNMCB8-PER, but try to optimize worst-case max stretch


Methodology

Methodology

a discrete event simulator takes a list of jobs and returns stretch values
workloads are based on synthetic and real traces
synthetic workload arrival times are scaled to show performance under different load conditions
algorithms are evaluated by a per-trace degradation factor
experiments with "free" preemption/migration, and experiments where preemption/migration costs the job a constant amount of wall clock time
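The talk does not spell out how the degradation factor is computed; a sketch under the assumption that it is each algorithm's max stretch divided by the best (lowest) max stretch obtained for that trace:

```python
def degradation_factors(max_stretch_by_alg):
    """Per-trace degradation factor, assuming it is each algorithm's max stretch
    divided by the best (lowest) max stretch achieved for that trace by any
    algorithm; the slides do not state the exact reference value."""
    best = min(max_stretch_by_alg.values())
    return {alg: s / best for alg, s in max_stretch_by_alg.items()}

# One trace, hypothetical max-stretch values for three schedulers.
print(degradation_factors({"GREEDY": 42.0, "DYNMCB8": 3.5, "EASY": 70.0}))
```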


Results

Average Maximum Yield, No Preemption/Migration Penalty

[Figure: per-trace degradation factor (log scale, 1 to 10000) versus load (0.1 to 0.9)]


Results

Average Maximum Yield, 5 Minute Preemption/Migration Penalty

[Figure: per-trace degradation factor (log scale, 1 to 10000) versus load (0.1 to 0.9)]


Results

Comparison of Synthetic vs. Real workload results

           Scaled synth.     Unscaled synth.   Real-world
Algs       Deg. factor       Deg. factor       Deg. factor
           avg.    max       avg.    max       avg.    max
EASY       167     560       139     443       94      1476
FCFS       186     569       154     476       118     2219
greedy     294     1093      249     1050      153     1527
greedyp    41      875       35      785       9       147
greedypm   62      835       37      773       17      759
mcb        32      162       11      162       11      231
mcbp       1       12        2       21        3       20
gmcbp      1       9         2       22        2       20
mcbsp      1       12        2       21        3       23


Results

Computation Times

Most scheduling events involve 10 or fewer jobs and require negligible time for all schedulers.
Even when there are about 100 jobs, the time for MCB8 is under 5 seconds on a 3.2 GHz machine.


Results

Costs

Greedy approaches use significantly less bandwidth than MCB approaches (< 1 GB/s in the worst case)
MCB approaches cause jobs to be preempted around 5 times on average
DYNMCB8 uses 1.3 GB/s on average, 5.1 GB/s maximum
the periodic algorithms use 0.6 GB/s on average, 2.1 GB/s maximum


Conclusions

DFRS is potentially much better than batch scheduling
multi-capacity bin packing heuristics perform best
targeting yield does as well as targeting worst-case max stretch
periodic MCB approaches perform nearly as well as aggressive ones when there is no migration cost, and much better when there is a fixed migration cost
adding an opportunistic greedy scheduling heuristic to DYNMCB8-PER gives no real benefit to max stretch
MCB approaches can calculate resource allocations reasonably quickly
MCB approaches need to try to mitigate migration/preemption costs


Appendix

For Further Reading

References I

Leinberger, W., Karypis, G., and Kumar, V. (1999). Multi-capacity bin packing algorithms with applications to job scheduling under multiple constraints. In Proc. of the Intl. Conf. on Parallel Processing, pages 404–412. IEEE.

Stillwell, M., Schanzenbach, D., Vivien, F., and Casanova, H. (2009). Resource Allocation using Virtual Clusters. In Proc. of CCGrid 2009, pages 260–267. IEEE.
