dynamic fractional resource scheduling practical issues and future directions -- 2011, cetraro
DESCRIPTION
TRANSCRIPT
Introduction Implementation Concluding Remarks Appendix
Dynamic Fractional Resource SchedulingPractical Issues and Future Directions
Mark Stillwell Frederic Vivien Henri Casanova
INRIA, Universite de Lyon, LIP
International Research Workshop on Advanced HighPerformance Computing Systems
Cetraro, ItalyJune 27–29, 2011
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Outline
Introduction
Implementation
Concluding Remarks
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Outline
Introduction
Implementation
Concluding Remarks
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Dynamic Fractional Resource SchedulingI HPC Scheduling Problem:
I homogeneous nodes with high-speed interconnectI network file systemI release date and number of tasks known at submissionI computation time known only after completionI goal: minimize maximum stretch
I approach:I based on virtual machine technology
I fractional resource allocationsI preemption / migration
I on-line scheduling problem → sequence of off-line resourceallocation problems
I computable, on-line performance metric related to maxstretch
I heuristics for task placement/resource allocation
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Task Placement and Resource Allocation Problem
I nodes comprise a number of resources (CPU cores,memory, bandwith, etc...)
I tasks have rigid requirements and fluid needsI job tasks have the same requirements and needsI rigid requirements specify minimum resource allocationsI fluid needs specify minimum allocations without
performance degradationI yield of a job is a value between 0 and 1
I correlated with performanceI jobs allocated product of yield and need in each resource
I Goal: find an allocation that maximizes the minimum yield
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Task Placement Heuristics
I Greedy Task Placement – incremental, puts each task onthe node with the lowest computational load on which itcan fit without violating memory constraints
I MCB Task Placement – global, iteratively appliesmulti-capacity (vector) bin-packing heuristics during abinary search for the maximized minimum yield
I achieves higher minimum yield values than GreedyI can potentially cause lots of migration
I what if the system is oversubscribed?I use a priority function to decide which jobs to run
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Priority Function?
I Virtual Time: subjective time experienced by a jobI first idea: 1
VIRTUAL TIMEI informed by ideas about fairnessI lead to good resultsI theoretically prone to starvation
I second idea: FLOW TIMEVIRTUAL TIME
I addresses starvation problemI lead to poor performance
I Third Idea: FLOW TIME(VIRTUAL TIME)2
I combines idea #1 and idea #2I addresses starvationI performs about the same as first priority function
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
When to apply Heuristics
We consider a number of different options:I Job Submission – heuristics can use greedy or bin packing
approachesI Job Completion – as above, can help with throughput
when there are lots of short running jobsI Periodically – some heuristics periodically apply vector
packing to improve overall job placement
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Methodology
I experiments conducted using discrete event simulatorI mix of synthetic and real trace dataI considered only memory requirements and CPU needsI preemption/migration of jobs causes 300 second penalty to
runtimeI periodic approaches use a 600 second (10 minute) periodI absolute bound on max stretch computed for each instanceI performance comparison based on degradation from max
stretch bound
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Simulation Algorithms
I FCFSI EASYI Greedy*I GreedyPM*I /perI GreedyPM*/perI MCB*/perI GreedyPM*/per/mvtI MCB*/per/mvtI /stretch-per
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Max Stretch Degradation vs. Load, 5 minute penalty
1
10
100
1000
10000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Max
stre
tch
degr
adat
ion
Fro
m B
ound
Load
FCFSEASYGreedy*
GreedyPM*
/per
GreedyPM*/per
MCB*/per
/stretch-per
MCB*/per/mvtGreedyPM*/per/mvt
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Bandwidth vs. Period
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
2000 4000 6000 8000 10000 12000
Ave
rage
Ban
dwid
th (
GB
/s)
Period (seconds)
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Max Stretch Degradation vs. Period
5
6
7
8
9
10
11
12
13
14
2000 4000 6000 8000 10000 12000
Ave
rage
Max
stre
tch
Deg
rada
tion
Period (seconds)
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Outline
Introduction
Implementation
Concluding Remarks
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Development Platform
I Grid5000 (http://www.grid5000.fr)I Kadeploy imagesI Genepi Nodes: Quad Core Xeon 2.5GHz w/ 8GB ram
I Open-Source SoftwareI Xen 4.0I Debian Squeeze / Linux 2.6.32
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Goals
I “small” experiment topics:I VM overheadI resource requirement/needs estimationI time sharing performance impactI preemption/migration costs
I bandwidth consumptionI performance impact
I migration planning strategiesI “large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Goals
I “small” experiment topics:I VM overheadI resource requirement/needs estimationI time sharing performance impactI preemption/migration costs
I bandwidth consumptionI performance impact
I migration planning strategiesI “large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Goals
I “small” experiment topics:I VM overheadI resource requirement/needs estimationI time sharing performance impactI preemption/migration costs
I bandwidth consumptionI performance impact
I migration planning strategiesI “large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Goals
I “small” experiment topics:I VM overheadI resource requirement/needs estimationI time sharing performance impactI preemption/migration costs
I bandwidth consumptionI performance impact
I migration planning strategiesI “large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Goals
I “small” experiment topics:I VM overheadI resource requirement/needs estimationI time sharing performance impactI preemption/migration costs
I bandwidth consumptionI performance impact
I migration planning strategiesI “large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Estimating Resource Needs
I Techniques:I repeated runs for benchmarking / model buildingI online model buildingI introspection / configuration variation
I Models:I constant valuesI time-varying valuesI random variables / probability distributions
I Questions:I how accurate can we be?I how accurate do we need to be?
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Benchmark Task VM Memory Requirements
bench/tasks 1 2 3 4CG 477M 251M – 147MEP <64M <64M <64M <64MFT 1930M 977M – 499MIS 637M 329M – 176MLU 218M 125M – 80MMG 485M 259M – 147MSP 369M – – 123M
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Time Sharing Performance Impact
I Process thrashing!!!I Gang SchedulingI Flexible Co-scheduling
I Uncoordinated might not be so bad...
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
VM Consolidation
2
4
6
8
10
12
14
1 2 3 4 5 6 7 8
slo
wd
ow
n
consolidation factor (maximum tasks/core)
y = xCGEPFTIS
LUMG
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Parallel Task Preemption/Migration
I preemption of parallel distributed jobs is theoreticallydifficult
I in the general case this may require coordinated shutdownI on the target platform everything happens quickly enough
that it isn’t an issueI migration can be done by preempt/restore or “live”
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Preemption/Migration Timing Results
0
10
20
30
40
50
60
70
80
90
100
400 600 800 1000 1200 1400 1600 1800 2000
Tim
e (
se
co
nd
s)
Memory Allocation (MB)
PreemptRestore
Preempt+RestoreMigrate
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Migration Planning
I using only live migration is NP-hardI could fail (circular dependencies)I heuristic: try to move the highest priority job
I if it can’t move, try next highestI if none can move, preempt the lowest priority job scheduled
for migration and try againI restore any preempted jobs at the end
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Planned Implementation Validation Experiment
I generate traces using an accepted modelI annotate them with CPU/memory valuesI ”fill in” the traces using benchmarks...
I NAS parallel benchmarks (HPC)I TCPP benchmarks (Service Hosting)
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Future Work
I complete development of practical implementationI extend the model
I nonlocal resourcesI heterogeneous nodesI varying communications delays
I new or arbitrary optimization targetsI fault toleranceI energy or other costsI formal definitions of fairness
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
Summary
I DFRS performs well in simulationI orders of magnitude better than batchI close to a theoretical bound
I prototype in developmentI preliminary results are promising
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
Introduction Implementation Concluding Remarks Appendix
References I
Stillwell, M., Schanzenbach, D., Vivien, F., and Casanova,H. (2010).Resource allocation algorithms for virtualized servicehosting platforms.Journal of Parallel and Distributed Computing,70(9):962–974.
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice