boost your efficiency when dealing with multiple jobs on ... · let's help the scheduler!...

Boost your efficiency when dealing with multiple jobs on the Cray XC40 supercomputer Shaheen II

Samuel KORTAS
June 5th, 2016
KAUST Supercomputing Laboratory
KSL Workshop Series

Agenda

• A few tips when dealing with numerous jobs
• Slurm way (up to a limit)
• Four KSL tools to move you further
– Breakit (1 to 10000s, all same)
– KTF (1 to 100, tuned)
– Avati (1 to 1000s, programmed)
– Decimate (dependent jobs)
• Hands-on session: /scratch/tmp/ksl_workshop
• Documentation on hpc.kaust.edu.sa/1001_jobs (to be completed today)
• Conclusion

Launching thousands of jobs…

• Some of our users use Shaheen for parameter sweeps involving thousands of jobs that save thousands of temporary files

• They need a result in a guaranteed time

• They are not HPC experts, but their workloads are a challenging problem in terms of scheduling and filesystem stress

• They implement complex workflows that send the output of one code into the input of others and produce a lot of small files

Scheduling thousands of jobs

• KSL does its best… but it's not that easy, folks!
→ The Tetris game gets rough with long rectangles ;-(

[Figure: job rectangles packed into a nodes-versus-time chart; 6144 nodes available; × 1000s of jobs]

Let's help the scheduler! (1/5)
Putting the right elapsed time

Let's help the scheduler! (2/5)
Let's share resources better among us

● The current scheduler policy is first come, first served
● Your priority increases as long as you are waiting 'actively' in the queue
● Held or dependent jobs are not counted
● Slurm takes into account your backfilling potential
● But we have to share, guys… → the number of jobs in the queue is limited
● The Slurm fair-share implementation is reported to work well with only a small number of projects

Let's help the scheduler! (3/5)
Let's lower the stress on the filesystem

● Each one of the 1000s of jobs may need to read, probe or write a file
● We have a single filesystem shared by all the jobs, so let's spare it
● Lustre is not tuned for small files
● → Let's use the ramdisk when possible and save only the data that matters to Lustre (see next slide)
● → Let's communicate in memory instead of via files
● → Let's choose the right stripe count

Let's help the scheduler! (4/5)
How to use ramdisk?

● On each Shaheen II compute node, /tmp is a ramdisk, a POSIX filesystem hosted directly in memory
– → starting at 64 GB, it shrinks as your program uses more and more memory
– → an additional memory request or a write in /tmp fails when: size(OS) + size(program instructions) + size(program variables) + size(/tmp) > 128 GB
– Still, /tmp is the fastest filesystem of all (compared to Lustre and DataWarp)
– But it is distributed (one per node) and lost at the end of the job
● → think of storing temporary files in /tmp and saving them at the end of the job
● → think of storing frequently accessed files in /tmp
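
The ramdisk tips above can be sketched in Python as follows. This is only an illustration under assumptions, not a KSL tool: the helper name `run_with_ramdisk_scratch`, the `keep` suffix filter and the file names in `demo_work` are all hypothetical. The idea is to do the small-file work in a scratch directory under /tmp and to copy only the files that matter to a permanent directory (for example on Lustre) before the scratch space disappears.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_ramdisk_scratch(work, results_dir, keep=(".out",), ramdisk="/tmp"):
    """Run work(scratch_dir) in a ramdisk directory, then save selected files.

    Hypothetical helper: only files whose suffix is in `keep` are copied
    to results_dir (e.g. a Lustre path); everything else is discarded.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    # The scratch directory is deleted when the block exits,
    # much like /tmp content is lost at the end of the job.
    with tempfile.TemporaryDirectory(dir=ramdisk) as scratch:
        scratch = Path(scratch)
        work(scratch)
        for path in sorted(scratch.iterdir()):
            if path.is_file() and path.suffix in keep:
                shutil.copy2(path, results_dir / path.name)
                saved.append(path.name)
    return saved


def demo_work(scratch):
    # Stand-in for a real computation producing many small temporary files.
    for i in range(100):
        (scratch / f"chunk_{i}.tmp").write_text(str(i))
    total = sum(int(p.read_text()) for p in scratch.glob("*.tmp"))
    (scratch / "result.out").write_text(str(total))
```

With this layout the hundred temporary files never touch the shared filesystem; only `result.out` is written to the permanent directory.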

Let's help the scheduler! (5/5)
Off-loading the cdls to compute nodes

● You may need to
– pre/postprocess
– monitor a job
– relaunch it
– get notified when it's starting or ending...
● Automate all this and move the load from the cdl to the compute nodes
– Use #SBATCH --mail-user
– Use breakit, ktf, maestro, decimate
– Ask the KSL team for help: it's only a script away

Managing 1001 jobs
1 - The Slurm way: submitting arrays...

Slurm Way (1/3)

● Slurm can submit and manage collections of similar jobs easily → job_array

● To submit a 500-element job array:
sbatch --array=1-500 -N1 -i my_in_%a -o my_out_%a job.sh

where “%a” in a file name is mapped to the array task ID (1 – 500)

● squeue -r --user=<my_user_name> 'unfolds' jobs queued as a job array

● More info at http://slurm.schedmd.com/job_array.html

Slurm Way (2/3)
Job environment variables

• squeue and scancel commands plus some scontrol options can operate on an entire job array or on selected task IDs

• squeue -r option prints each task ID separately

Slurm Way (3/3)
Job example

Possible commands:

sbatch --array=1-16 my_job

sbatch --array=1-500%20 my_job → only allows 20 active running jobs at a given time

Taken from https://rcc.uchicago.edu/docs/running-jobs/array/index.html

Slurm Way But …

● Slurm counts each job of the array as a job per se: for now, the total number of jobs in the queue is limited to 800 jobs per user
● Pending jobs do not gain priority
● Only one parameter can vary
– → if you need to work on several parameters, the script itself has to deduce them from the index in the array...
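
One way for a script to deduce several parameters from the single array index is sketched below in Python. It is a minimal sketch under assumptions: the function name `params_for_task` and the parameter grid are hypothetical, and only `SLURM_ARRAY_TASK_ID`, the environment variable Slurm sets in each array task, comes from Slurm itself. All combinations are enumerated in a fixed order, so task number k always picks the k-th combination.

```python
import itertools
import os


def params_for_task(task_id, grid, first_id=1):
    """Return the parameter combination for one array task.

    grid maps each parameter name to its list of values; combinations are
    enumerated in a fixed order so every task deduces its own set.
    """
    names = list(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    index = task_id - first_id
    if not 0 <= index < len(combos):
        raise IndexError(f"task {task_id} is outside the {len(combos)} combinations")
    return dict(zip(names, combos[index]))


if __name__ == "__main__":
    # Hypothetical parameter names and values, for illustration only.
    grid = {"nx": [128, 256], "ny": [128, 256], "threads": [1, 2, 4]}
    # Slurm sets SLURM_ARRAY_TASK_ID inside each task of a job array.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(params_for_task(task_id, grid))
```

This example grid has 2 × 2 × 3 = 12 combinations, so the matching submission would be an array of 12 tasks (sbatch --array=1-12).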

Slurm Way hands-on…

● Submit the job /scratch/tmp/ksl_workshop/slurm/demo.job as an array of 20 occurrences, then:
– check the script,
– check its output,
– check the queue,
– cancel it.

Slurm Way hands-on… solution

● Submit the job /scratch/tmp/ksl_workshop/slurm/demo.job as an array of 20 occurrences:

sbatch --array=1-20 /scratch/tmp/ksl_workshop/slurm/demo.job

● Then check
– the script,
– its output,
– the queue → squeue -r --user=<my_user>

● Cancel it → scancel -n <my_job_name>

Managing 1001 jobs?
4 KSL open source tools

Why? To ease your life and centralize some common developments

breakit, ktf, maestro, decimate

● Soon Available at https://bitbucket.org/kaust_KSL/ (GNU GPL License)

● Written in python 2.7

● Installed on Shaheen II, Portable on workstation, Noor…

● All share common api and internal library engine also available on bitbucket.org/kaust_KSL

● Maintained by KSL (samuel.kortas (at) kaust.edu.sa)

Available on Shaheen as modules
Under development for 2 PIs, released soon on bitbucket.org

Our goal: hiding complexity

Managing 1001 jobs
Using the breakit wrapper

Breakit (1/3)
Idea and status

● To allow you to cope seamlessly with the limit of 800 jobs
● No need to change your job array
● Breakit automatically monitors the process for you
● → version 0.1 → I need your feedback!

Slurm way (1/2) and (2/2)
How to handle it with Slurm?

[Figures: the queue with its maximum number of jobs, fed by you or by a program running on the cdl]

Breakit (2/3)
How does it work?

[Figure sequence: the queue, capped at the maximum number of jobs]
1. breakit submits the first jobs
2. Gone! Breakit is not active anymore!
3. The jobs are starting
4. They submit the next jobs with a dependency
5. The first ones are done, the dependency is solved, the next ones are pending
6. They submit the next jobs with a dependency

Breakit (2/3)
How does it work?

● Instead of submitting all the jobs at once, they are submitted by chunks
– Chunk #n is running or pending
– Chunk #n+1 depends on chunk #n:
● it starts only when every job of chunk #n has completed
● it submits chunk #n+2, setting a dependency on chunk #n+1
● … We did offload some tasks from the cdl to the compute nodes ;-)
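
The chunking idea can be sketched in Python as follows. This is an illustration of the principle only, not breakit's source code: the function names `split_into_chunks` and `submission_plan` are hypothetical, and the plan simply records which array range depends on which. With plain Slurm, such a dependency could be expressed with `sbatch --array=<range> --dependency=afterany:<job id of the previous chunk>`.

```python
def split_into_chunks(n_jobs, chunk_size):
    """Split array indices 1..n_jobs into consecutive (first, last) ranges."""
    if n_jobs < 1 or chunk_size < 1:
        raise ValueError("n_jobs and chunk_size must be positive")
    return [(first, min(first + chunk_size - 1, n_jobs))
            for first in range(1, n_jobs + 1, chunk_size)]


def submission_plan(n_jobs, chunk_size):
    """List (array_range, depends_on) pairs; depends_on is the previous chunk."""
    plan = []
    previous = None
    for first, last in split_into_chunks(n_jobs, chunk_size):
        current = f"{first}-{last}"
        # The first chunk has no dependency; each later one waits for its predecessor.
        plan.append((current, previous))
        previous = current
    return plan


if __name__ == "__main__":
    for array_range, depends_on in submission_plan(100, 16):
        print(array_range, "after", depends_on)
```

Only one or two chunks need to sit in the queue at any time, which is how the per-user limit on queued jobs stops being a constraint.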

Breakit (3/3)
How to use it?

1) Load the breakit module:

module load breakit
man breakit (to be completed)
breakit -h

2) Launch your job:

breakit --job=your.job --array=<nb of jobs> --chunk=<max_nb_of_jobs_in_queue>

3) Manage it:

squeue -r -u <user> -n <job_name>

scancel -n <job_name>

Breakit Hands on

• Via breakit, submit an array of 100 occurrences of the job /scratch/tmp/breakit/demo.job, with only 16 jobs in the queue simultaneously

Breakit Hands on (solution)

• Via breakit, submit an array of 100 occurrences of the job /scratch/tmp/breakit/demo.job, with only 16 jobs in the queue simultaneously

module load breakit

breakit --job=/scratch/tmp/breakit/demo.job --range=100 --chunk=16

Breakit Next steps

• Find a better name!

• Support all array range (not only 1-n)

• Provide an easy restart

• Provide an easier way to kill jobs

Managing 101 jobs Using KTF

KTF Idea

● At a certain point, you may need:

– to evaluate the performance of a code under different conditions,

– to run a parametric study.

● The same executable is run several times with a different set of parameters:

– physical values characterizing the problem,

– number of processors, threads and/or nodes

– compiler used

– compiling option

– parameters passed on the srun command line to experiment different placement strategies

– …

● KTF (Kaust Test Framework) can help you on this!

What is KTF?

● KTF (Kaust Test Framework) was designed and used during the Shaheen II procurement in order to ease the
– generation
– submission
– monitoring
– result collecting
of a set of jobs depending on a set of parameters to explore.

● Written in Python 2.7
● Self-contained and portable
● Available on bitbucket.org/kaust_KSL/ktf

How does KTF work?
A few definitions

● An 'experiment'
● A case is one single run of this experiment with a given set of parameters
● A test gathers a number of cases

How does KTF work?

● KTF relies on
– a centralized file listing all the combinations of parameters to address, i.e. shaheen_cases.ktf
– a set of template files, all ending in .template, in which the parameters need to be replaced before submission

KTF hands-on! (1/)
Initialize environment

1) Load the environment, and check that ktf is available

module load ktf
man ktf
ktf -h

2) Create and initialize your working directory

mkdir <my_test_dir>
cd <my_test_dir>
ktf --init

→ you should get a ktf-like tree structure with some examples of centralized case files and associated templates

3) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations

ktf --exp

KTF Centralized case file
(see file shaheen_zephyr0.ktf)

[Figure: case file shaheen_zephyr0.ktf, annotated with a KTF comment, the list of parameters and the third test case]

● # is a comment → not parsed by KTF
● The first line gives the names of the parameters
● Case and Experiment are absolutely mandatory
● Each following line is a test case, setting a value for EACH parameter

According to this case file, for the third test case, in each file ending with .template:
✔ __Case__ will be replaced by 128
✔ __Experiment__ will be replaced by zephyr/strong
✔ __NX__ will be replaced by 255
✔ __NY__ will be replaced by 255
✔ __NB_CORES__ will be replaced by 128
✔ __ELLAPSED_TIME__ will be replaced by 0:05:00
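
The placeholder substitution described on this slide can be sketched in Python as follows. This is a sketch under assumptions, not KTF's actual code: the function names `parse_cases` and `fill_template` are hypothetical, the case file is assumed to use whitespace-separated columns, and the template line in the example is made up. Only the parameter names and the values of the third test case come from the slide.

```python
import re


def parse_cases(text):
    """Parse case-file text into a list of {parameter: value} dicts.

    Assumes whitespace-separated columns: '#' lines are comments, the first
    remaining line names the parameters, each following line is one case.
    """
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    header, cases = rows[0], rows[1:]
    for name in ("Case", "Experiment"):
        if name not in header:
            raise ValueError(f"mandatory parameter {name} is missing")
    for row in cases:
        if len(row) != len(header):
            raise ValueError(f"case {row} must set every parameter")
    return [dict(zip(header, row)) for row in cases]


def fill_template(template, case):
    """Replace every __NAME__ placeholder by the value of NAME in this case."""
    # Placeholders that are not parameters of the case are left untouched.
    return re.sub(r"__([A-Za-z0-9]+(?:_[A-Za-z0-9]+)*)__",
                  lambda m: case.get(m.group(1), m.group(0)), template)


if __name__ == "__main__":
    # Values of the third test case quoted above; the file layout is an assumption.
    cases = parse_cases(
        "# a comment\n"
        "Case Experiment NX NY NB_CORES ELLAPSED_TIME\n"
        "128 zephyr/strong 255 255 128 0:05:00\n"
    )
    # Hypothetical template line, for illustration only.
    print(fill_template("srun -n __NB_CORES__ ./zephyr input", cases[0]))
```

Keeping all the values in one case file and all the text in templates is what lets one line of parameters generate a complete, ready-to-submit job directory.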

KTF Directory initial structure

[Figure: ktf directory tree after initialization, showing the default case file, one directory per experiment, and a subdirectory containing files common to all the experiments]

KTF job.shaheen.template
(see files in tests/zephyr/strong/)

[Figure: case file with a KTF comment and the third test case highlighted, next to the file job.shaheen.template, which runs ./zephyr input]

[Figure: the same case file next to the file input.template]

KTF commands
ktf ...

… --help : get help on the command line

… --init : initialize the environment, copying example .template and .ktf files

… --build : generate all combinations listed in the case file

… --launch : generate all combinations listed in the case file and submit them

… --exp : list all combinations present in the case .ktf file

… --monitor : monitor all the experiments and display all results in a dashboard

… --kill : kill all jobs related to this ktf session

… --status : list all stamp dates and cases of the experiments made or currently occurring

KTF hands-on! (2/)
Prepare a first experiment

4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations

ktf --exp

5) Build an experiment and check that the templated files have been processed correctly

ktf --build
→ should create one tests_ directory: tests_shaheen_<date>_<time>

KTF Directory after --build

[Figure: directory tree after ktf --build, showing the initial template and the directory of the third case]

KTF Directory after --launch

[Figure: directory tree after ktf --launch: job.shaheen is processed from job.shaheen.template, input is processed from input.template, and zephyr is copied from the common directory]

KTF Centralized case file
Handling constant parameters

[Figure: case file annotated with a KTF comment, the third test case, and a #KTF pragma declaring new parameters that keep the same value ever after. File shaheen_zephyr0.ktf is strictly identical to file shaheen_zephyr1.ktf]

Another example KTF case file

[Figure: case file with the Case and Experiment columns highlighted]

KTF filters and flags
ktf --xxx ...

… --case-file=<case file> : use a case file other than shaheen_cases.ktf

… --what=zzzz : filter on some cases

… --reservation=<reservation name> : submit within a reservation

● ktf --exp --what=128

● ktf --launch --what=64 --reservation=workshop

● ktf --exp --case-file=shaheen_zephyr1.ktf

KTF filters and flags ktf --xxx ...

… --ktf-file=<case file> : use a case file other than shaheen_cases.ktf

… --what=zzzz : filters on some cases

… --when=yyyy | --today | --now : filters on some date stamps

… --times=<nb>: repeat submission <nb> times

… --info : switch on informative traces

… --info-level=[0|1|2|3] : change informative trace level

… --debug : switch on debugging traces

… --debug-level=[0|1|2|3] : change debugging trace level

KTF hands-on! (3/)
Playing with the --what filter

4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations, with or without filtering, and using other case files

ktf --exp
ktf --exp --what=<your filter>
ktf --exp --case-file=shaheen_zephyr1.ktf

5) Build an experiment and check that the templated files have been processed correctly

ktf --build
ktf --build --what=<your filter>

→ should create two tests directories in the directory from which you call ktf: tests_shaheen_<date>_<time>

KTF hands-on! (4/)
Launch and monitor our first experiment

6) Build an experiment and submit it

ktf --launch [ --reservation=workshop ]
→ should create a new tests directory and spawn the jobs: ./tests_shaheen_<date>_<time>

ktf --monitor
→ will monitor your current ktf session
→ check what shows in the R/ directory

7) Play with repeating experiments and filtering results

ktf --launch --what=<your filter> [ --reservation=workshop ]
ktf --launch --times=5 [ --reservation=workshop ]
ktf --monitor
ktf --monitor --what=<your case filter> --when=<your date filter>

→ check what shows in the R/ directory

KTF results dashboard
Reading the result dashboard

% ktf --monitor

[Figure: dashboard printed by ktf --monitor. Annotations: When, What, Status, Time, Subdir in R/, "not finished yet", "!" when job.err is not empty]

KTF R/ directory
Quick access to results

• This R/ directory is updated each time you call ktf --monitor

• It builds symbolic links to the results directories in order to give you quick access to the results you want to check

[Figure: listing of the R/ directory]

KTF results configuration
Implementation and default printing

● In fact…
alias ktf = python run_test.py
alias ki = python run_test.py --init
alias km = python run_test.py --monitor

● The value displayed in the dashboard (printed when calling --monitor) is encoded in run_test.py

● By default, it is <elapsed time taken by the whole test>/<status of the test>
– with a '!' after the status if job.err is not empty
– with a '!' before the status if the job did not terminate properly

● Remember you can use cat, more or tail on R/*/job.err to scan all these files!
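
The default dashboard value described above can be written as a small Python function. This is a hypothetical sketch, not the content of run_test.py: the function name `dashboard_cell`, its arguments and the status strings used in the example are assumptions made for illustration.

```python
def dashboard_cell(elapsed, status, err_not_empty=False, terminated_properly=True):
    """Build '<elapsed>/<status>' with the '!' markers described above."""
    # '!' before the status: the job did not terminate properly.
    before = "" if terminated_properly else "!"
    # '!' after the status: job.err is not empty.
    after = "!" if err_not_empty else ""
    return f"{elapsed}/{before}{status}{after}"


if __name__ == "__main__":
    # Hypothetical elapsed times and status strings.
    print(dashboard_cell("0:03:12", "OK"))
    print(dashboard_cell("0:03:12", "OK", err_not_empty=True))
    print(dashboard_cell("0:05:00", "FAILED", terminated_properly=False))
```

Changing what the dashboard shows then amounts to changing what such a function returns, for example adding a flops value in front of the elapsed time.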

KTF results configuration
Changing default printing

● But you can change the displayed values at will and adapt them to your own needs:
– other values: flops, intermediate results, total number of iterations, convergence rate
– several values: <flops>/<time>/<status>
– other events to trigger the '!' sign
– other typographic signs

● → how to do it…

KTF run_test.py file

KTF hands-on! (5/)
Modifying the result printed

8) Check what ktf prints:

ktf --monitor

and understand how run_test.py works

9) Modify run_test.py in order to print the time per iteration

KTF Next steps

• Gather tests into campaigns

• Have a better display for the --monitor option: web interface, automated generation of plots

• Enrich the filtering feature: regular expressions, several filters possible

• Enable coding capability inside the case file

• Complete the documentation

• Save results into database and be able to compute statistics

• Cover the compiling step

KTF Next steps

• Support --clean and campaigns

• Chain several jobs into one

• Support job arrays, dependencies, mail to user

• Port to Noor and workstations

• Offload from workstation to Shaheen

• Better versioning of the template files

• Derive one initial ktf environment per science field

Managing 1001 jobs using Maestro

Maestro principles (1/2)

• Handling these studies should be the same on:
– a Linux box
– Shaheen, Noor, Stampede…
– a laptop under Windows or Mac OS
– a given set of Linux boxes

• The only prerequisite:
– Python > 2.4 and MPI on a supercomputer
– Python > 2.4 on a workstation

Maestro principles (2/2)

• Minimal or no knowledge of the HPC environment required

• Easy management of the jobs, handled as a whole.

A set of tools adapted to a distributed execution (1/3)

• No pre-installation needed on the machines: maestro is self-contained

• Easy and quick prototyping on a workstation with immediate porting to a supercomputer

• Global error signals that are easy to throw and trace

• Global handling of the jobs as a whole study (launching, monitoring, killing and restarting through one command)


• All the flexibility of Python available to the user in a distributed environment (class inheritance, modules…): production of robust, easy-to-read code with an explicit error stack to debug in case of problem

• Transparent replication of the environment on each of the compute nodes

• Work in /tmp of each compute node to minimize the stress on the filesystem


• Extended grep (multi-line, multi-column, regular expressions) to postprocess the output files

• Centralized management of the templates to replace

• Global selection of the files to be kept and parametrization of the receiving directory

• A console to easily explore the subdirectories where results are saved

• Each running process can write to the same global file

Maestro Principles

[Figure: maestro allocates a pool of nodes and runs elementary jobs in it]

An example

[Figure: annotated example script. Annotations: definition of the domain to sweep; parametrized Z range; elementary computation sending local and global messages; file to save; directory name where results are saved]

Command line options

<no option> : classical sequential run on 1 core, stopping at the first error encountered
--cores=<n> : parallel run on n cores
--depth=<p> : partial parallelisation up to level p
--stat : live status of ongoing computation
--reservation=<id> : run inside a reservation
--time=hh:mm:ss : set the elapsed duration of the overall job
--kill : kill the ongoing computation and clean the environment
--resume : resume a computation
--restart : restart a computation from scratch
--help : help screen

• Demo!

Next Steps

• Allowing maestro to launch multicore jobs

• More clever sweeping algorithms (decime project)

• Support of a given set of workstations

• Coupling maestro with a website

• Remote launching and dynamic off-loading from workstation to supercomputer

Managing dependent jobs in complex workflows
Using Decimate

Idea

• Some workflows involve several steps depending on one another

– → several jobs with a dependency between them

• Some intermediate steps may break

– → the dependency will break

– → the workflow will remain idle, waiting for an action

• We want to automate it

What is decimate?
Add-ons and goodies

• Tool in Python written for two different PIs with the same need
• Launch, monitor, heal dependent jobs
• Make things automated and smooth

What is decimate?

• Add-ons

– Centralized log files

– Global --resume, --status and --kill commands

– Sends a mail at any time to the user to keep them updated

– Can make a decision when a dependency is broken:
● relaunch the same job again and fix the dependency
● change the input data, relaunch and fix the dependency
● cancel only this job and move on
● cancel the whole workflow

Some examples of workflows

Conclusion

              slurm          breakit        ktf        maestro    decimate
Typical #job  < 800          > 800          100        1-1000     ?
Jobs are      same           same           different  different  different
Parameter     1              1              several    many       any
#nodes/job    same           same           any        same       any
Dependent     one at a time  one at a time  no         no         yes

We have presented some useful tools to handle many jobs at a time

Your feedback is [email protected]
