Running Jobs at Wang Hall
Daniel Udwary, NERSC Data Science Engagement Group
February 3, 2016
Outline
• Genepool move logistics
• Differences between Crays and Genepool
• Cori and Edison architecture and configurations
• Intro to SLURM
Why am I running this training session?
• During the Mendel move (next week!), we will have a period of reduced Genepool compute availability
• We want to encourage more JGI compute work on NERSC's flagship supercomputers, when it makes sense
  – Last year, less than half of the CPU-hour allocation was used
• NERSC wants to know what it can do to better enable bioinformatics work on those machines, and identify where future problems might lie
• Genepool may move to SLURM in the future
NERSC has moved to a new building
• All systems must move from Oakland to Berkeley
Resources at Wang Hall (aka CRT)
• New Mendel+ nodes
• New login nodes (genepool13 and genepool14)
• All filesystems (almost…)
• Cori
• Edison
Still at OSF:
  – Old Mendel nodes: moving starting Feb 8
  – Legacy Genepool nodes: to be shut down ~Feb 22
  – Tape archive: no plan to move (yet)
Move Schedule – Current Plan
[Figure: move schedule timeline. Recoverable labels: Mendel+, Mendel, Legacy Computes, Filesystems, Scheduler, SeqFS; dates Feb 8 and Feb 22?; legend: Outage, @CRT, @OSF; note: "Downtime for power work and networking maintenance".]
Key Differences Between Cori/Edison and Genepool
Cori and Edison
• Generally large, multi-node jobs
• Jobs are charged
• Wait time until job start measured in days
• Users generally compile and install their own software; few modules
• SLURM
Genepool
• Many small, single-node (or even single-CPU) jobs
• No job charging
• Wait time measured in hours, if not minutes
• Awesome JGI consultants manage bioinformatics software as modules
• UGE
Basics of NERSC Cray architecture
• Cori Phase I
  – Cray XC
  – 1630 nodes
  – 128 GB memory per node
  – 32 cores per node (2 x 16-core 2.3 GHz Haswell)
• Cori Phase II
  – >9300 nodes
  – Knights Landing CPUs
• Edison
  – Cray XC30
  – 5576 nodes
  – 64 GB memory per node
  – 24 cores per node (2 x 12-core 2.4 GHz Ivy Bridge)
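The per-node core counts above determine how many nodes a job of a given size occupies. The following is a minimal sketch of that arithmetic, assuming a job is allocated whole nodes; the function name `nodes_needed` is hypothetical and not a NERSC or SLURM tool.

```shell
#!/bin/bash
# nodes_needed CORES CORES_PER_NODE
# Print the number of whole nodes required for CORES cores (ceiling division).
# Hypothetical helper for illustration; assumes whole-node allocation.
nodes_needed() {
    local cores=$1 per_node=$2
    echo $(( (cores + per_node - 1) / per_node ))
}

# Per-node core counts taken from the slide above.
CORI_CORES_PER_NODE=32
EDISON_CORES_PER_NODE=24

echo "96 cores on Cori:   $(nodes_needed 96 "$CORI_CORES_PER_NODE") nodes"
echo "96 cores on Edison: $(nodes_needed 96 "$EDISON_CORES_PER_NODE") nodes"
```

The same core count maps to different node counts on the two machines (3 on Cori, 4 on Edison in this example), which is worth keeping in mind when moving a script between them.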
Edison Queue Structure
https://www.nersc.gov/users/computational-systems/edison/running-jobs/queues-and-policies/
So, use Edison for large parallel jobs using >682 nodes
Cori queue structure
• https://www.nersc.gov/users/computational-systems/cori/running-jobs/queues-and-policies/
What is SLURM?
• In simple words, SLURM is a workload manager, or a batch scheduler.
• SLURM stands for Simple Linux Utility for Resource Management.
• SLURM unites cluster resource management (such as Torque) and job scheduling (such as Moab) into one system, which avoids inter-tool complexity.
• As of June 2015, SLURM is used on 6 of the top 10 computers, including the #1 system, Tianhe-2, with over 3M cores.
• Cori was installed with SLURM, and Edison switched last November, after its move.
Advantages of Using SLURM
• Fully open source.
• SLURM is extensible (plugin architecture).
• Low-latency scheduling. Highly scalable.
• Integrated "serial" or "shared" queue
• Integrated Burst Buffer support
• Good memory management
• Built-in accounting and database support
• "Native" SLURM runs without Cray ALPS (Application Level Placement Scheduler)
  – Batch script runs on the head compute node directly
  – Easier to use. Less chance for contention compared to a shared MOM node.
SLURM User Commands
SLURM               UGE       Purpose
sbatch              qsub      submit a batch script
salloc              qlogin    request an interactive session
scancel             qdel      delete a batch job
scontrol hold       qhold     hold a job
scontrol release    qrls      release a job
sacct               qacct     display job accounting data
sqs                 qs        NERSC custom queue display
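For anyone translating existing UGE wrapper scripts, the command table above can be expressed as a small lookup. This is only an illustrative sketch: the function name `uge_to_slurm` is hypothetical, and it covers just the commands listed on this slide.

```shell
#!/bin/bash
# uge_to_slurm CMD
# Print the SLURM counterpart of a UGE command, per the table above.
# Hypothetical lookup helper for illustration only.
uge_to_slurm() {
    case "$1" in
        qsub)   echo "sbatch" ;;
        qlogin) echo "salloc" ;;
        qdel)   echo "scancel" ;;
        qhold)  echo "scontrol hold" ;;
        qrls)   echo "scontrol release" ;;
        qacct)  echo "sacct" ;;
        qs)     echo "sqs" ;;
        *)      echo "no mapping listed for: $1" >&2; return 1 ;;
    esac
}

uge_to_slurm qsub
uge_to_slurm qhold
```

Only the command names are mapped here; option flags differ between the two schedulers and still need to be translated by hand.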
Running with SLURM
• Use "sbatch" (as "qsub" in UGE) to submit a batch script, or "salloc" (as "qlogin" in UGE) to request an interactive batch session.
• Need to specify which shell to use for the batch script.
• Environment is automatically imported (as "qsub -V" in UGE)
• Lands on the submit directory
• Batch script runs on the head compute node
• No need to repeat flags in the srun command if already defined in SBATCH keywords.
• Hyperthreading is enabled by default. Jobs requesting more than 32 cores (MPI tasks * OpenMP threads) per node will use hyperthreads automatically.
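The hyperthreading rule in the last bullet is a simple product check. A minimal sketch, assuming the 32-core Cori Phase I node described earlier; the function name `uses_hyperthreads` is hypothetical:

```shell
#!/bin/bash
# uses_hyperthreads TASKS_PER_NODE THREADS_PER_TASK [PHYSICAL_CORES]
# Succeeds (exit 0) when tasks * threads exceeds the physical cores per node,
# i.e. when the job would spill onto hyperthreads.
# Hypothetical helper; the default of 32 is the Cori Phase I core count.
uses_hyperthreads() {
    local tasks=$1 threads=$2 physical=${3:-32}
    [ $(( tasks * threads )) -gt "$physical" ]
}

if uses_hyperthreads 8 8; then
    echo "8 tasks x 8 threads per node: hyperthreads in use"
fi
if ! uses_hyperthreads 4 8; then
    echo "4 tasks x 8 threads per node: physical cores only"
fi
```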
Running with SLURM continued
• Use "srun" to launch parallel jobs (as with "aprun" with Torque/Moab)
• srun flags overwrite SBATCH keywords
• srun does most of the optimal process and thread binding automatically. Only flags such as "-n" and "-c", along with OMP_NUM_THREADS, are needed for most applications. Advanced users can experiment with more options such as --ntasks-per-socket, --cpu_bind, --mem, etc.
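To show how "-n", "-c", and OMP_NUM_THREADS fit together for a hybrid MPI/OpenMP run, here is a rough sketch that prints (rather than executes) an srun line filling every physical core of the allocation. The function name `hybrid_srun_cmd` is hypothetical, and the 32 cores per node is the Cori Phase I figure; `./xthi` is the executable name used in the sample batch script in these slides.

```shell
#!/bin/bash
# hybrid_srun_cmd NODES CORES_PER_NODE THREADS EXE
# Print (not run) an srun line that fills NODES nodes with MPI tasks of
# THREADS OpenMP threads each. Hypothetical helper for illustration.
hybrid_srun_cmd() {
    local nodes=$1 cores_per_node=$2 threads=$3 exe=$4
    local tasks=$(( nodes * cores_per_node / threads ))
    echo "srun -n ${tasks} -c ${threads} ${exe}"
}

# -c should match the OpenMP thread count per task.
export OMP_NUM_THREADS=8
hybrid_srun_cmd 2 32 "$OMP_NUM_THREADS" ./xthi
# prints: srun -n 8 -c 8 ./xthi
```

With 2 nodes and 8 threads per task this gives 8 tasks, the same values as the hybrid line in the sample script.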
http://slurm.schedmd.com/rosetta.pdf
SLURM Task arrays
Task arrays work similarly to UGE
• sbatch --array=1-100
  – Would start a 100-task job array
• Job arrays will have two additional environment variables set:
  – $SLURM_ARRAY_JOB_ID will be set to the first job ID of the array.
  – $SLURM_ARRAY_TASK_ID will be set to the job array index value.
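A common use of the array index is to pick one input per task from a list. The sketch below assumes a file with one input path per line; the file name `inputs.txt`, the variable `INPUT_LIST`, the function `input_for_task`, and the commented-out `./my_tool` command are all hypothetical placeholders.

```shell
#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --array=1-100
#SBATCH --time=00:30:00

# input_for_task N LIST
# Print the Nth line of the file LIST (hypothetical helper).
input_for_task() {
    sed -n "${1}p" "$2"
}

# Inside an array job SLURM sets SLURM_ARRAY_TASK_ID; default to 1 so the
# script can also be dry-run by hand outside the scheduler.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
list="${INPUT_LIST:-inputs.txt}"   # hypothetical file, one input path per line

if [ -r "$list" ]; then
    input="$(input_for_task "$task_id" "$list")"
    echo "array job ${SLURM_ARRAY_JOB_ID:-none}, task ${task_id}: ${input}"
    # srun -n 1 ./my_tool "$input"   # hypothetical per-task command
fi
```

Each of the 100 array tasks runs the same script and differs only in the value of $SLURM_ARRAY_TASK_ID, so line 1 of the list goes to task 1, line 2 to task 2, and so on.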
Sample SLURM Batch Script
Long command options:

    #!/bin/bash -l
    #SBATCH --partition=regular
    #SBATCH --job-name=test
    #SBATCH --account=mpccc
    #SBATCH --nodes=2
    #SBATCH --time=00:30:00
    srun -n 16 ./mpi-hello
    export OMP_NUM_THREADS=8
    srun -n 8 -c 8 ./xthi

Short command options:

    #!/bin/bash -l
    #SBATCH -p regular
    #SBATCH -J test
    #SBATCH -A mpccc
    #SBATCH -N 2
    #SBATCH -t 00:30:00
    srun -n 16 ./mpi-hello
    export OMP_NUM_THREADS=8
    srun -n 8 -c 8 ./xthi

To submit a batch job:

    % sbatch mytest.sl
    Submitted batch job 15400
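Scripts that submit jobs often need the job ID from the "Submitted batch job" message, for example to cancel or hold the job later. A minimal sketch of extracting it; the function name `job_id_from_sbatch` is hypothetical, and the message text is hard-coded here so the example runs without a scheduler.

```shell
#!/bin/bash
# job_id_from_sbatch "OUTPUT"
# Extract the numeric job ID from sbatch's "Submitted batch job N" message.
# Prints nothing if the text does not match. Hypothetical helper.
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# In a real session the text would come from: out="$(sbatch mytest.sl)"
out="Submitted batch job 15400"
jobid="$(job_id_from_sbatch "$out")"
echo "job id: ${jobid}"
# The ID can then be passed to other commands, e.g. scancel "$jobid"
```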
SLURMmary
• SLURM provides equivalent or similar functionality to Torque/Moab and UGE.
• srun provides equivalent or similar process and thread affinity to aprun.
• Please let us know if you have an advanced or complicated workflow and anticipate potential porting issues. We can work with you to migrate your scripts.
• Batch configurations are still subject to tuning and modification before the system is in full production.
Documentation
• SchedMD web page:
  – http://www.schedmd.com/
• Running Jobs on Cori
  – https://www.nersc.gov/users/computational-systems/cori/running-jobs/
• Man pages for slurm, sbatch, salloc, squeue, sinfo, sacct, scontrol, scancel, etc.
• Torque/Moab vs. SLURM Comparisons
  – https://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-to-slurm-transition-guide/
• Running jobs on Babbage using SLURM:
  – https://www.nersc.gov/users/computational-systems/testbeds/babbage/running-jobs-under-slurm-on-babbage/
• Running jobs on Edison's test system (Alva) with native SLURM
  – https://www.nersc.gov/users/computational-systems/edison/alva-test-and-development-system-for-edison/#toc-anchor-7