Boost your efficiency when dealing with multiple jobs on the Cray XC40 supercomputer Shaheen II
Samuel KORTAS, June 5th 2016
KAUST Supercomputing Laboratory, KSL Workshop Series
Agenda
• A few tips when dealing with numerous jobs
• Slurm way (up to a limit)
• Four KSL tools to move you further
– Breakit (1 to 10000s, all same)
– KTF (1 to 100, tuned)
– Avati (1 to 1000s, programmed)
– Decimate (dependent jobs)
• Hands-on session: /scratch/tmp/ksl_workshop
• Documentation on hpc.kaust.edu.sa/1001_jobs (to be completed today)
• Conclusion
Launching thousands of jobs…
• Some of our users use Shaheen for parameter sweeps involving thousands of jobs that save thousands of temporary files
• They need a result in a guaranteed time
• They are not HPC experts, but their workloads are challenging in terms of scheduling and file system stress
• They implement complex workflows, sending the output of one code into the input of others and producing a lot of small files
Scheduling thousands of jobs
• KSL does its best… but it's not that easy, folks! → The Tetris game gets rough with long rectangles ;-(
[Figure: jobs packed as rectangles, 6144 nodes available versus time, × 1000s of jobs]
Let's help the scheduler! (1/5)
Putting the right elapsed time
Let's help the scheduler! (2/5)
Let's share resources better among us
● Current policy of the scheduler is first in, first served
● Your priority increases as long as you are waiting 'actively' in the queue
● Held or dependent jobs are not counted
● Slurm takes into account your backfilling potential
● But we have to share, guys… → the number of jobs in the queue is limited
● Fair share Slurm implementation is reported to work well with only a small number of projects
Let's help the scheduler! (3/5)
Let's lower the stress on the filesystem
● Each one of the 1000s of jobs may need to read, probe or write a file.
● We have a single filesystem shared by all the jobs, let's save it
● Lustre is not tuned for small files
● → Let's use ramdisk when it's possible and save to Lustre only the data that matters (see next slide)
● → Let's communicate in memory instead of via files
● → Let's choose the right stripe count
Let's help the scheduler! (4/5)
How to use ramdisk?
● On each Shaheen II compute node, /tmp is a ramdisk, a POSIX filesystem hosted directly in memory
– → starting at 64 GB, it shrinks as your program uses more and more memory
– → an additional memory request or a write in /tmp fails when: size(OS) + size(program instructions) + size(program variables) + size(/tmp) > 128 GB
– Still, /tmp is the fastest filesystem of all (compared to Lustre and DataWarp)
– But it is distributed (one /tmp per node) and lost at the end of the job.
● → think of storing temporary files in /tmp and saving them at the end of the job
● → think of storing frequently accessed files in /tmp
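The "work in /tmp, save at the end" advice can be sketched in Python 3 as below. This is only an illustration under stated assumptions: the function name run_in_ramdisk and its arguments are hypothetical, and the caller decides which files matter enough to be copied back to Lustre.

```python
import os
import shutil
import tempfile


def run_in_ramdisk(compute, final_dir, keep, ramdisk_root="/tmp"):
    """Run compute(workdir) in a scratch directory, then save only the files named in keep.

    Hypothetical helper: ramdisk_root is /tmp on a compute node,
    final_dir would typically live on Lustre.
    """
    # Private scratch directory under the ramdisk root.
    workdir = tempfile.mkdtemp(prefix="job_", dir=ramdisk_root)
    try:
        compute(workdir)  # may create many small temporary files
        os.makedirs(final_dir, exist_ok=True)
        saved = []
        for name in keep:
            src = os.path.join(workdir, name)
            if os.path.isfile(src):
                # Only the data that matters reaches the shared filesystem.
                shutil.copy2(src, os.path.join(final_dir, name))
                saved.append(name)
        return saved
    finally:
        # Files in /tmp count against node memory, so free them early.
        shutil.rmtree(workdir, ignore_errors=True)
```

Only the files listed in keep touch the shared filesystem; every other temporary file lives and dies in memory.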
Let's help the scheduler! (5/5)
Off-loading the cdls to compute nodes
● You may need to
– Pre/postprocess
– Monitor a job
– Relaunch it
– Get notified when it's starting or ending...
● Automate all this and move the load from the cdl to the compute nodes
– Use #SBATCH --mail-user
– Use breakit, ktf, maestro, decimate
– Ask KSL team for help: it's only a script away
Managing 1001 jobs
1 - the Slurm way: submitting arrays...
Slurm Way (1/3)
● Slurm can submit and manage collections of similar jobs easily → job_array
● To submit a 500-element job array:
sbatch --array=1-500 -N1 -i my_in_%a -o my_out_%a job.sh
where “%a” in the file name is mapped to the array task ID (1 – 500)
● squeue -r --user=<my_user_name> 'unfolds' jobs queued as a job array
● More info at http://slurm.schedmd.com/job_array.html
Slurm Way (2/3)
Job environment variables
• squeue and scancel commands plus some scontrol options can operate on an entire job array or on selected task IDs
• squeue -r option prints each task ID separately
Slurm Way (3/3)
Job example
Possible commands:
sbatch --array=1-16 my_job
sbatch --array=1-500%20 my_job   (only allows 20 active running jobs at a given time)
Taken from https://rcc.uchicago.edu/docs/running-jobs/array/index.html
Slurm Way But …
● Slurm counts each job of the array as a job per se: as of now the total number of jobs in the queue is limited to 800 jobs per user
● Pending jobs are not gaining priority
● Only one parameter can vary
– → if you need to work on several parameters, the script itself has to deduce them from the number in the array...
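As an illustration of deducing several parameters from the single array number, here is a minimal Python 3 sketch. The parameter grid and the function name params_for_task are hypothetical; the only Slurm fact relied upon is that each array task sees its index in the SLURM_ARRAY_TASK_ID environment variable.

```python
import itertools
import os

# Hypothetical parameter grid: 4 x 3 x 2 = 24 combinations,
# to be submitted with: sbatch --array=1-24 ...
GRID = {
    "nx": [64, 128, 256, 512],
    "solver": ["cg", "gmres", "bicgstab"],
    "threads": [1, 2],
}


def params_for_task(task_id, grid=GRID, first_id=1):
    """Map an array task ID (first_id, first_id + 1, ...) to one combination of the grid."""
    names = list(grid)
    # The last parameter varies fastest, like nested loops.
    combos = list(itertools.product(*(grid[name] for name in names)))
    index = task_id - first_id
    if not 0 <= index < len(combos):
        raise ValueError("task ID %d outside %d..%d"
                         % (task_id, first_id, first_id + len(combos) - 1))
    return dict(zip(names, combos[index]))


if __name__ == "__main__":
    # Defaults to task 1 when run outside of a job array.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(params_for_task(task_id))
```

The array range passed to sbatch has to match the number of combinations, which is why the size of the grid is recalled in the comment.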
Slurm Way hands-on…
● Submit the job
/scratch/tmp/ksl_workshop/slurm/demo.job
As an array of 20 occurrences, check
– the script,
– its output
– The queue
– Cancel it
Slurm Way hands-on… solution
● Submit the job
/scratch/tmp/ksl_workshop/slurm/demo.job
sbatch --array=1-20 /scratch/tmp/ksl_workshop/slurm/demo.job
As an array of 20 occurrences, check
– the script,
– its output,
– The queue, → squeue -r --user=<my_user>
– Cancel it → scancel -n <my_job_name>
Managing 1001 jobs?
4 KSL open source tools
Why? Ease your life and centralize some common developments
breakit ktf maestro decimate
● Soon available at https://bitbucket.org/kaust_KSL/ (GNU GPL License)
● Written in Python 2.7
● Installed on Shaheen II, portable to workstations, Noor…
● All share a common API and internal library engine, also available on bitbucket.org/kaust_KSL
● Maintained by KSL (samuel.kortas (at) kaust.edu.sa)
[Figure labels: available on Shaheen as modules; under development for 2 PIs, released soon on bitbucket.org]
Our Goal: Hiding Complexity
Managing 1001 jobs
Using the breakit wrapper
Breakit (1/3)
Idea and status
● To allow you to cope seamlessly with the limit of 800 jobs
● No need to change your job array
● Breakit automatically monitors the process for you
● → version 0.1 → I need your feedback!
Slurm way (1/2)
How to handle it with Slurm?
[Figure: a queue bounded by the max number of jobs, fed by you or by a program on the cdl]
Slurm way (2/2)
How to handle it with Slurm?
[Figure: a queue bounded by the max number of jobs, fed by you or by a program on the cdl]
Breakit (2/3)
How does it work?
[Figure sequence: a queue bounded by the max number of jobs. Frame captions, in order:]
– breakit
– Gone! Breakit is not active anymore!
– The jobs are starting
– They submit the next jobs with a dependency
– The first ones are done, the dependency is solved, the next ones are pending
– They submit the next jobs with a dependency
● Instead of submitting all the jobs at once, they are submitted by chunks
– Chunk #n is running or pending
– Chunk #n+1 depends on chunk #n:
● it starts only when every job of chunk #n has completed
● it submits chunk #n+2, setting a dependency on chunk #n+1
● … We did offload some tasks from the cdl onto compute nodes ;-)
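The chunking idea can be sketched in Python 3 as follows. This is not breakit's actual code: plan_chunks and sbatch_args are hypothetical helpers, and the sketch only builds command lines (it does not submit anything). It relies on two standard sbatch options, --array and --dependency=afterany:<jobid>.

```python
def plan_chunks(n_jobs, chunk_size):
    """Split task IDs 1..n_jobs into consecutive (first, last) ranges of at most chunk_size."""
    if n_jobs < 1 or chunk_size < 1:
        raise ValueError("n_jobs and chunk_size must be positive")
    return [(first, min(first + chunk_size - 1, n_jobs))
            for first in range(1, n_jobs + 1, chunk_size)]


def sbatch_args(job_script, first, last, after_job_id=None):
    """Build the sbatch command line for one chunk, optionally held behind a previous array."""
    args = ["sbatch", "--array=%d-%d" % (first, last)]
    if after_job_id is not None:
        # afterany: eligible once every task of the previous array has ended.
        args.append("--dependency=afterany:%s" % after_job_id)
    args.append(job_script)
    return args


if __name__ == "__main__":
    # 100 jobs, at most 16 per chunk; the job ID 1234 is a made-up placeholder.
    for first, last in plan_chunks(100, 16):
        print(" ".join(sbatch_args("demo.job", first, last, after_job_id=1234)))
```

In breakit the running jobs themselves submit the next chunk, so only two chunks sit in the queue at any time; the sketch just shows how the ranges and the dependency flag fit together.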
Breakit (3/3)
How to use it?
1) Load the breakit module
module load breakit
man breakit (to be completed)
breakit -h
2) Launch your job:
breakit --job=your.job --array=<nb of jobs> --chunk=<max_nb_of_jobs_in_queue>
3) Manage it:
squeue -r -u <user> -n <job_name>
scancel -n <job_name>
Breakit Hands on
• Via breakit, submit an array of 100 occurrences of job /scratch/tmp/breakit/demo.job, only having 16 jobs simultaneously in the queue
Breakit Hands on (solution)
• Via breakit, submit an array of 100 occurrences of job /scratch/tmp/breakit/demo.job, only having 16 jobs simultaneously in the queue
module load breakit
breakit --job=/scratch/tmp/breakit/demo.job --range=100 --chunk=16
Breakit Next steps
• Find a better name!
• Support all array range (not only 1-n)
• Provide an easy restart
• Provide an easier way to kill jobs
Managing 101 jobs Using KTF
KTF Idea
● At a certain point, you may need:
– to evaluate the performance of a code under different conditions,
– to run a parametric study.
● the same executable is run several times with a different set of parameters
– Physical values characterizing the problem,
– number of processors, threads and/or nodes
– compiler used
– compiling option
– parameters passed on the srun command line to experiment different placement strategies
– …
● KTF (Kaust Test Framework) can help you on this!
What is KTF?
● KTF (Kaust Test Framework) has been designed and used during Shaheen II procurement in order to ease
– Generation
– Submission
– Monitoring
– Result collecting
of a set of jobs depending on a set of parameters to explore.
● Written in Python 2.7
● Self-contained and portable
● Available on bitbucket.org/kaust_KSL/ktf
How does KTF work?
A few definitions
● An 'experiment'
● A case is one single run of this experiment with a given set of parameters
● A test gathers a number of cases
How does KTF work?
● KTF relies on
– A centralized file listing all combinations of parameters to address, i.e. shaheen_cases.ktf
– A set of template files, all ending in .template, in which the parameters need to be replaced before the submission
KTF hands-on! (1/)
Initialize environment
1) Load the environment, and check that ktf is available
module load ktf
man ktf
ktf -h
2) Create and initialize your working directory
mkdir <my_test_dir>
cd <my_test_dir>
ktf --init
→ you should get a ktf-like tree structure with some examples of centralized case files and associated templates
3) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your change by listing all the combinations
ktf --exp
KTF Centralized case file (see file shaheen_zephyr0.ktf)
[Figure: case file with callouts: KTF comment; list of parameters; third test case]
According to this case file, for the third test case, in each file ending in .template:
✔ __Case__ will be replaced by 128
✔ __Experiment__ will be replaced by zephyr/strong
✔ __NX__ will be replaced by 255
✔ __NY__ will be replaced by 255
✔ __NB_CORES__ will be replaced by 128
✔ __ELLAPSED_TIME__ will be replaced by 0:05:00
● # is a comment → not parsed by KTF
● First line gives the names of the parameters
● Case and Experiment are absolutely mandatory
● Each following line is a test case, setting a value for EACH parameter
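The substitution rule above can be mimicked in a few lines of Python 3. This is a sketch, not KTF's implementation: parse_cases and fill_template are hypothetical names, the columns are assumed to be separated by whitespace, and both the case file content and the template below are made up (only the values of the third case come from the slide).

```python
# Hypothetical case file: first non-comment line names the parameters.
CASE_FILE = """\
# strong scaling of zephyr (comment, not parsed)
Case Experiment    NX  NY  NB_CORES ELLAPSED_TIME
32   zephyr/strong 255 255 32       0:05:00
64   zephyr/strong 255 255 64       0:05:00
128  zephyr/strong 255 255 128      0:05:00
"""

# Hypothetical content of a file ending in .template.
TEMPLATE = """\
#SBATCH --ntasks=__NB_CORES__
#SBATCH --time=__ELLAPSED_TIME__
./zephyr input
"""


def parse_cases(text):
    """Return one dict per test case, skipping blank lines and # comments."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    if not rows:
        return []
    names, cases = rows[0], rows[1:]
    for values in cases:
        if len(values) != len(names):
            raise ValueError("each case must set every parameter: %r" % (values,))
    return [dict(zip(names, values)) for values in cases]


def fill_template(template, case):
    """Replace every __NAME__ placeholder by the value of NAME for this case."""
    for name, value in case.items():
        template = template.replace("__%s__" % name, value)
    return template


if __name__ == "__main__":
    third_case = parse_cases(CASE_FILE)[2]
    print(fill_template(TEMPLATE, third_case))
```

Placeholders that do not appear in a given template are simply ignored, which is what lets one case file drive several template files.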
KTF Directory initial structure
[Figure: directory tree under ktf, with callouts: default case file; one directory per experiment; subdirectory containing files common to all the experiments]
KTF job.shaheen.template (see files in tests/zephyr/strong/)
[Figure: case file next to the file job.shaheen.template, which runs ./zephyr input; callouts: KTF comment; list of parameters; third test case]
KTF job.shaheen.template (see files in tests/zephyr/strong/)
[Figure: case file next to the file input.template; callouts: KTF comment; list of parameters; third test case]
KTF commands
ktf ...
… --help : get help on the command line
… --init : initialize the environment, copying example .template and .ktf files
… --build : generate all combinations listed in the case file
… --launch : generate all combinations listed in the case file and submit them
… --exp : list all combinations present in the case .ktf file
… --monitor : monitor all the experiments and display all results in a dashboard
… --kill : kill all jobs related to this ktf session
… --status : list all stamp dates and cases of the experiments made or currently occurring
KTF hands-on! (2/)
Prepare a first experiment
4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your change by listing all the combinations
ktf --exp
5) Build an experiment and check that the templated files have been well processed
ktf --build
→ should create one tests_ directory: tests_shaheen_<date>_<time>
KTF Directory after --build
[Figure: directory tree with callouts: initial template; third case]
KTF Directory after --launch
[Figure: directory tree with callouts: job.shaheen processed from job.shaheen.template; input processed from input.template; zephyr is copied from the common directory]
KTF Centralized case file
Handling constant parameters
[Figure: file shaheen_zephyr0.ktf is strictly identical to file shaheen_zephyr1.ktf, which uses a #KTF pragma declaring new parameters that will keep the same value ever after; callouts: KTF comment; list of parameters; third test case]
Another example
KTF case file
[Figure: case file with callouts: Case; Experiment]
KTF filters and flags
ktf --xxx ...
… --case-file=<case file> : use another case file than shaheen_cases.ktf
… --what=zzzz : filters on some cases
… --reservation=<reservation name> : submit within a reservation
● ktf --exp --what=128
● ktf --launch --what=64 --reservation=workshop
● ktf --exp --case-file=shaheen_zephyr1.ktf
KTF filters and flags
ktf --xxx ...
… --ktf-file=<case file> : use another case file than shaheen_cases.ktf
… --what=zzzz : filters on some cases
… --when=yyyy | --today | --now : filters on some date stamps
… --times=<nb>: repeat submission <nb> times
… --info : switch on informative traces
… --info-level=[0|1|2|3] : change informative trace level
… --debug : switch on debugging traces
… --debug-level=[0|1|2|3] : change debugging trace level
KTF hands-on! (3/)
Playing with the --what filter
4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your change by listing all the combinations, with or without filtering and using other case files
ktf --exp
ktf --exp --what=<your filter>
ktf --exp --case-file=shaheen_zephyr1.ktf
5) Build an experiment and check that the templated files have been well processed
ktf --build
ktf --build --what=<your filter>
→ should create two tests directories where you call ktf: tests_shaheen_<date>_<time>
KTF hands-on! (3/)
Launch and monitor our first experiment
6) Build an experiment and submit it
ktf --launch [ --reservation=workshop ]
→ should create a new tests directory and spawn the jobs: ./tests_shaheen_<date>_<time>
ktf --monitor
→ will monitor your current ktf session
→ check what shows in the R/ directory
7) Play with repeating experiments and filtering results
ktf --launch --what=<your filter> [ --reservation=workshop ]
ktf --launch --times=5 [ --reservation=workshop ]
ktf --monitor
ktf --monitor --what=<your case filter> --when=<your date filter>
→ check what shows in the R/ directory
KTF results dashboard
Reading the result dashboard
% ktf --monitor
[Figure: dashboard printed by ktf --monitor, with callouts: When; What; Status; Time; Subdir in R/; not finished yet; '!' = job.err not empty]
KTF R/ directory
Quick access to results
• This R/ directory is updated each time you call ktf --monitor
• It builds symbolic links to the results directory in order to provide you quick access to the results you want to check.
[Figure: content of the R/ directory]
KTF results configuration
Implementation and default printing
● In fact…
alias ktf = python run_test.py
alias ki = python run_test.py --init
alias km = python run_test.py --monitor
● run_test.py encodes the value to be displayed in the dashboard (printed when calling --monitor)
● By default, it is <elapsed time taken by the whole test>/<status of the test>
… with a '!' after the status if job.err is not empty
… with a '!' before the status if the job did not terminate properly
● Remember you can use cat, more or tail on R/*/job.err to scan all these files!
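The default cell format can be illustrated with a small Python 3 function. It is a hypothetical stand-in for what run_test.py computes (the real code is not shown in these slides), and the status strings used in the example are made up.

```python
def dashboard_cell(elapsed, status, job_err_empty=True, terminated_properly=True):
    """Format one dashboard cell as <elapsed>/<status>, with the '!' markers.

    Hypothetical helper: '!' goes before the status when the job did not
    terminate properly, and after it when job.err is not empty.
    """
    before = "" if terminated_properly else "!"
    after = "" if job_err_empty else "!"
    return "%s/%s%s%s" % (elapsed, before, status, after)


if __name__ == "__main__":
    print(dashboard_cell("0:03:12", "OK"))
    print(dashboard_cell("0:03:12", "OK", job_err_empty=False))
    print(dashboard_cell("0:05:00", "TIMEOUT", terminated_properly=False))
```

Keeping the formatting in one small function is what makes it easy to print other values (flops, iterations) or to trigger the '!' on other events.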
KTF results configuration
Changing default printing
● But you can change the displayed values at will! And adapt it to your own needs:
● Other values: flops, intermediate results, total number of iterations, convergence rate
● Several values: <flops>/<time>/<status>
● Other events to trigger the '!' sign
● Other typographic signs
● → how to do it…
KTF run_test.py file
KTF hands-on! (5/)
Modifying the result printed
8) Check what ktf prints of it:
ktf --monitor
and understand how run_test.py is working
9) Modify run_test.py in order to print the time per iteration
KTF Next steps
• Gather tests into campaigns
• Have a better display for the --monitor option, web interface, automated generation of plots
• Enrich the filtering feature: regular expressions, several filters possible
• Enable coding capability inside the case file
• Complete the documentation
• Save results into database and be able to compute statistics
• Cover the compiling step
KTF Next steps
• Support --clean and campaigns
• Chain several jobs into one
• Support job arrays, dependencies, mail to user
• Port to Noor and workstations
• Offload from workstation to Shaheen
• Better versioning of the template files
• Provide one initial ktf environment per scientific field
Managing 1001 jobs using Maestro
Maestro principles (1/2)
• Handling these studies should be the same on:
– A Linux box
– Shaheen, Noor, Stampede…
– A laptop under Windows or Mac OS
– A given set of Linux boxes
• The only prerequisite:
– Python > 2.4 and MPI on a supercomputer
– Python > 2.4 on a workstation
Maestro principles (2/2)
• Minimal or no knowledge of the HPC environment required
• Easy management of the jobs, handled as a whole.
A set of tools adapted to a distributed execution (1/3)
• No pre-installation needed on the machines: maestro is self-contained
• Easy and quick prototyping on a workstation with immediate porting to a supercomputer
• Global error signals easy to throw and trace
• Global handling of the jobs as a whole study (launching, monitoring, killing and restarting through one command)
A set of tools adapted to a distributed execution (2/3)
• All the flexibility of Python available to the user in a distributed environment (class inheritance, modules…) → production of robust, easy-to-read code, with an explicit error stack in case of a problem to debug
• Transparent replication of the environment on each of the compute nodes
• Work in /tmp of each compute node to minimize the stress on the filesystem
A set of tools adapted to a distributed execution (3/3)
• Extended grep (multi-line, multi-column, regular expressions) to postprocess the output files
• Centralized management of the templates to replace
• Global selection of files to be kept and parametrization of the receiving directory
• A console to easily explore the subdirectories where results are saved
• Each running process can write in a same global file
Maestro Principles
[Figure sequence: maestro allocates a pool of nodes and runs elementary jobs in it]
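The "pool of workers running elementary jobs over a domain" principle can be prototyped on a workstation with the Python 3 standard library. This sketch does not use maestro's API (which is not shown in these slides): sweep is a hypothetical function, and threads stand in for the pool of compute nodes.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def sweep(elementary_job, domain, workers=4):
    """Run elementary_job(**point) for every point of the domain on a pool of workers.

    domain maps each parameter name to the list of values to sweep.
    Returns a list of (point, result) pairs, in the order of the sweep.
    """
    names = list(domain)
    points = [dict(zip(names, values))
              for values in itertools.product(*(domain[name] for name in names))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the points.
        results = list(pool.map(lambda point: elementary_job(**point), points))
    return list(zip(points, results))


if __name__ == "__main__":
    # Made-up elementary computation over a 2 x 3 domain.
    for point, result in sweep(lambda x, z: x * z, {"x": [1, 2], "z": [10, 20, 30]}):
        print(point, result)
```

An exception raised by any elementary job is re-raised when the results are collected, which loosely mirrors the "stop at the first error" behaviour of a sequential run.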
An example
[Figure: annotated maestro script, with callouts: file to save; directory name where results are saved; elementary computation sending local and global messages; parametrized Z range; definition of the domain to sweep]
Command line options
<no option> : classical sequential run on 1 core, stopping at the first error encountered
--cores=<n> : parallel run on n cores
--depth=<p> : partial parallelisation up to level p
--stat : live status of ongoing computation
--reservation=<id> : run inside a reservation
--time=hh:mm:ss : set the elapsed duration of the overall job
--kill : kill the ongoing computation and clean the environment
--resume : resume a computation
--restart : restart a computation from scratch
--help : help screen
• Demo!
Next Steps
• Allowing maestro to launch multicore jobs
• More clever sweeping algorithms decime project
• Support of a given set of workstations
• Coupling maestro with a website
• Remote launching and dynamic off-loading from workstation to supercomputer
Managing dependent jobs in complex workflows
Using Decimate
Idea
• Some workflows involve several steps depending on one another
– → several jobs with a dependency between them
• Some intermediate steps may break
– → the dependency will break
– → the workflow will remain idle, requesting an action
• We want to automate it
What is decimate?
Add-ons and goodies
• Tool in Python written for two different PIs with the same need
• Launch, monitor, heal dependent jobs
• Make things automated and smooth
What is decimate?
• Add-ons
– Centralized log files,
– Global --resume, --status and --kill commands
– Sends a mail at any time to the user to keep him updated
– Can make a decision when a dependency is broken
● Relaunch the same job again and fix the dependency
● Change input data, relaunch and fix the dependency
● Cancel only this job and move on
● Cancel the whole workflow
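The four possible decisions can be pictured with a toy Python 3 model of a chain of dependent steps. Nothing here is decimate's actual code: Action, run_workflow and the decide callback are hypothetical names, and plain function calls stand in for Slurm jobs.

```python
from enum import Enum


class Action(Enum):
    RELAUNCH = "relaunch the same job and fix the dependency"
    CHANGE_INPUT = "change input data, relaunch and fix the dependency"
    SKIP = "cancel only this job and move on"
    CANCEL_ALL = "cancel the whole workflow"


def run_workflow(steps, decide, max_attempts=3):
    """Run dependent steps in order; on failure, decide(name, attempt) picks an Action.

    steps is a list of (name, function) pairs; function(changed_input) returns
    True on success. Returns a log of (name, outcome) pairs.
    """
    log = []
    for name, step in steps:
        changed_input = False
        for attempt in range(1, max_attempts + 1):
            if step(changed_input):
                log.append((name, "done"))
                break
            action = decide(name, attempt)
            if action is Action.CANCEL_ALL:
                log.append((name, "workflow cancelled"))
                return log
            if action is Action.SKIP:
                log.append((name, "skipped"))
                break
            # RELAUNCH or CHANGE_INPUT: try the same step again.
            changed_input = action is Action.CHANGE_INPUT
        else:
            # No break: every attempt failed, move on to the next step.
            log.append((name, "gave up after %d attempts" % max_attempts))
    return log
```

The decide callback is where a policy (or a mail to the user asking what to do) would plug in, so the workflow no longer sits idle waiting for a manual action.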
Some examples of workflows
Conclusion
               slurm          breakit        ktf        maestro    decimate
Typical #jobs  < 800          > 800          100        1-1000     ?
Jobs are       same           same           different  different  different
Parameters     1              1              several    many       any
#nodes/job     same           same           any        same       any
Dependent      one at a time  one at a time  no         no         yes
We have presented some useful tools to handle many jobs at a time
Your feedback is [email protected]