Inria Sophia Nef cluster
Inria Sophia Antipolis Méditerranée – « ateliers thématiques » (thematic workshops)
SIC – SED v2.1 29 March 2016
Nef platform
Nef is a cluster computing platform :
● nodes now : 3 node types, 69 nodes, 1004 cores, 18 GPUs
● nodes 04-2016 : 7 node types, 118 nodes, 2148 cores, 18 GPUs
Legacy Nef : stops on 17 April 2016, its hardware is re-installed into Nef
● storage (~15TB for homes, ~150TB for data)
● fast network interconnect (Infiniband QDR, 32 Gbit/s)
● front-end servers
What for ?
● all computation needs for Inria Sophia teams' activity
● includes experimentation, "production", big data, parallel computations, sequential jobs, GPU
Nef platform
For whom ?
● all Inria Sophia research team users
● other Inria users
● academic and industrial partners of Inria (under agreement)
By whom ?
● financed by Inria, CPER, research teams
● scientific piloting committee – CSPP « Cluster, Grid, Cloud, HPC »
● technical team : https://helpdesk.inria.fr
What future ?
● a perennial, evolving platform
● CPER OPAL 2015-2020 : distributed meso-center with regional academic partners
Nef evolution in a nutshell
What changes from Legacy Nef to Nef :
             Legacy Nef                           Nef
scheduler    Torque/Maui                          OAR
queues       many, complex (see documentation)    default, besteffort, big
storage      /dfs                                 /data
nodes                                             Legacy Nef nodes + added hardware
system       Fedora 16                            CentOS 7
software                                          new versions, environment modules
Accessing Nef – account request
A Nef account is distinct from the Inria account (it requires a request and renewal)
Kali web portal https://kali.inria.fr is the preferred account management interface :
● click Sign in/up with CAS and use your Inria credentials
● go to Clusters > Overview page and apply for an account on Sophia New Nef
Kali is also a web portal for simple cluster usage :
● follow Kali online help to prepare and launch your jobs
Accessing Nef – ssh (1/2)
front-end             access from   job submission   development tools
nef-frontal           internet      yes              no
nef-devel2            Inria         yes              yes
nef-devel (04-2016)   Inria         (Legacy Nef)     (Legacy Nef)
Accessing Nef – ssh (2/2)
Example : successful connection from outside (better : use ~/.ssh/config)
mylaptop$ ssh myneflogin@nef-frontal.inria.fr   ## not needed from Inria network or VPN
nef-frontal$ ssh nef-devel2
nef-devel2$
Example : bad ssh key configuration in ~/.ssh/authorized_keys
mylaptop$ ssh myneflogin@nef-devel2
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
mylaptop$
Example : direct connection to nef-devel2 fails when not on the Inria network (go through nef-frontal, or configure ~/.ssh/config)
mylaptop$ ssh myneflogin@nef-devel2.inria.fr
ssh: connect to host nef-devel2.inria.fr port 22: Connection timed out
mylaptop$
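A minimal ~/.ssh/config sketch for reaching nef-devel2 through nef-frontal from outside Inria (the host aliases and the ProxyCommand hop are an assumption, adapt myneflogin to your login) :

Host nef-frontal
    HostName nef-frontal.inria.fr
    User myneflogin
Host nef-devel2
    HostName nef-devel2.inria.fr
    User myneflogin
    ProxyCommand ssh -W %h:%p nef-frontal   ## hop through nef-frontal, only needed outside the Inria network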
Resource manager / scheduler - OAR
OAR is an open source batch scheduler :
● submit a job to request resources for a given amount of time (walltime)
● core = most basic resource
● hierarchy of resources : /cluster/nodes/cpu/core
● when you reserve a core, you get a fraction of the memory of the node : total mem / total cores
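For example, on a Dell C6100 node with 12 cores and 96 GB of RAM (see the appendix), reserving one core grants roughly 96 / 12 = 8 GB of memory.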
(A few) changes from Torque :
● you can't ssh to the nodes (but you can use oarsh or oarsub -C)
● you can get all the cores of a node without specifying the number of cores
● you can reserve a given number of cores whatever the number of nodes
Requesting resources – oarsub (1/4)
Batch mode job (default, recommended) :
● oarsub -l /nodes=3,walltime=02:00:00 /path/to/my/script
Interactive mode job :
● oarsub -l /core=10 -I
Advance reservation (don't use it if you don't need it) :
● oarsub -l /nodes=1 -r "2016-06-12 14:00:00" /path/to/my/script
Requesting resources – oarsub (2/4)
Example - resource specification :
● oarsub -l /core=32 -I
● oarsub -l /nodes=4/core=2,walltime=00:30:00 ./runme
Example - properties :
● oarsub -p 'mem > 100000' -l /nodes=1 -I
● oarsub -p "cputype='xeon' and not cluster='dellc6220' " -l
"{mem_core > 6000}"/core=6 ./runme
● oarsub -p "gpu='YES' " -l /nodes=1 -I
GPUs inside a node are shared among cores, so you should reserve
complete nodes and not a few cores !
Requesting resources – oarsub (3/4)
Example - moldable jobs (either-or) :
● oarsub -l /nodes=4,walltime=2 -l /nodes=2,walltime=4 ./runme
Example - submission script :
● oarsub -S ./test2.sh
nef-devel2 $ cat ./test2.sh
#OAR -l /nodes=2,walltime=1
#OAR -p ibswitch='ibswy1nef'
#OAR -q default
/path/to/my/command
Example – job array with param file (one line per job) :
● oarsub --array-param-file ./param_file ./runme
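A minimal sketch of a param file, assuming ./runme takes two arguments per run (file names and values are illustrative) :

nef-devel2$ cat ./param_file
input1.dat 10
input2.dat 20
input3.dat 30
## launches an array of 3 jobs : ./runme input1.dat 10, ./runme input2.dat 20, ./runme input3.dat 30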
Requesting resources – oarsub (4/4)
Example - bad request, can't be satisfied by the cluster
● oarsub -l /nodes=20 -p "cluster='dellr900'" -I
● fails with "There are not enough resources for your request"
Example - bad request, doesn't comply with per user resource allocation limits
● oarsub -l /core=300 -I
● remains in "Waiting" state
Hint : use oarstat -fj OAR_JOB_ID or Monika for detailed information
Warning : with oarsub -l /core=4 you can get 1 core on each of 4 different nodes
For multithreaded runs, use oarsub -l /nodes=1/core=4
Obtaining resources – queues (1/3)
Jobs are submitted to a queue ; available queues and limits are :

queue name   max user resources   max duration (days)   priority   max user (hours*resources)
default      256                  30                    10         21504
big          1024                 30                    5          2000
besteffort   -                    30                    0          -

Example : 128 cores with 2x the default RAM/core during 3.5 days = 128 × 2 × 3.5 × 24 = 21504 hours*resources (the default queue limit)

Best effort jobs : not subject to per user limits, but can be killed while running
● Good practice : use them when appropriate (eg many short jobs)
Obtaining resources – queues (2/3)
Job priority order :
● higher priority queue first
● then user's Karma : last 30 days resource consumption
● includes resource consumed + resource requested (used and unused)
Good practice : adjust requested walltime, RAM and CPU to what the job actually uses :
● Colmet : http://nef-devel2.inria.fr:5000/ (from Inria network)
"Why is my job still 'Waiting' while there are unused resources ?"
"Why is my job still 'Waiting' while other jobs go 'Running' ?"
● hint : "best fit", per user limits, specific resource request, etc.
Obtaining resources – queues (3/3)
Submit a job to the default queue :
● oarsub ./myscript
Submit a job to the big queue :
● oarsub -q big ./myscript
Submit a best effort job :
● oarsub -t besteffort ./myscript
● oarsub -t besteffort -t idempotent ./myscript
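Note : in the second form, -t idempotent asks OAR to automatically resubmit the best effort job if it is killed, which suits restartable computations.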
Monitoring jobs and interacting (1/3)
Monika : view jobs/nodes status and properties
Monitoring jobs and interacting (2/3)
Drawgantt : display a Gantt chart of nodes and jobs, past and future
Monitoring jobs and interacting (3/3)
oarstat : print info about jobs
oardel : delete a job
oarpeek : show the stdout/stderr of a running job
oarnodes : print info about cluster nodes
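A minimal session sketch (the job id 1234567 is illustrative) :

nef-devel2$ oarstat -u myneflogin   ## list my jobs
nef-devel2$ oarstat -fj 1234567     ## full details of one job
nef-devel2$ oarpeek 1234567         ## stdout of the running job
nef-devel2$ oardel 1234567          ## delete the job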
Connect to a cluster node where jobid is running from nef-devel2 or nef-frontal
● oarsub -C jobid
● OAR_JOB_ID=jobid oarsh nodename
Managing data (1/4)
Data stored on the cluster is NOT backed up.
/home/myneflogin : home (default) directory
● visible cluster-wide (nodes, nef-devel2, nef-frontal), long term storage
● quota 150GB/user, check usage with quota -s
● hard limit 600GB, grace period of 4 weeks
Local storage on nodes (for a job's temporary files) :
● /tmp : local hard disk
● /dev/shm : RAM filesystem
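A minimal job script sketch using node-local storage ($OAR_JOB_ID is set by OAR ; the directory name and the --workdir option are illustrative assumptions) :

#!/bin/bash
TMP=/tmp/myneflogin.$OAR_JOB_ID      ## per-job scratch directory on the local disk
mkdir -p $TMP
/path/to/my/command --workdir $TMP
rm -rf $TMP                          ## clean up before the job ends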
Managing data (2/4)
Data stored on the cluster is NOT backed up.
/data : distributed scalable filesystem
● seen cluster wide (nodes, nef-devel2, nef-frontal)
● team directory : /data/myteamgroup/share
● user directory : /data/myteamgroup/user/myneflogin
● long term storage : 1TB/team + quota bought by the team
● tag with chgrp myteamgroup ./long_term_file (Unix group)
● scratch storage : no quota, variable size, may be purged periodically
● tag with chgrp scratch ./scratch_file (Unix group)
● check quota with sudo nef-getquota -g myteamgroup
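A minimal sketch of tagging data in the team space (paths and file names are illustrative) :

nef-devel2$ mkdir -p /data/myteamgroup/user/myneflogin/results
nef-devel2$ cp result.dat /data/myteamgroup/user/myneflogin/results/
nef-devel2$ chgrp myteamgroup /data/myteamgroup/user/myneflogin/results/result.dat   ## counted on the team quota
nef-devel2$ chgrp scratch /data/myteamgroup/user/myneflogin/tmp_output.dat           ## scratch, may be purged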
Managing data (3/4)
Copying files to/from the cluster using rsync :
# example : from nef to mylaptop on Inria Sophia network
# or user customized ~/.ssh/config
mylaptop$ rsync -av myneflogin@nef-devel2.inria.fr:nef_source_dir ./laptop_dest_dir
# example : from mylaptop on the Internet to nef
mylaptop$ rsync -av ./laptop_source_dir myneflogin@nef-frontal.inria.fr:nef_dest_dir
Good practice : avoid scp -r (follows symlinks)
Good practice : copy to/from nef-devel2 when possible (performance)
Managing data (4/4)
Accessing files on the cluster using sshfs :
# example : from mylaptop, Fedora, on Inria Sophia network
# or user customized ~/.ssh/config
mylaptop$ mkdir $XDG_RUNTIME_DIR/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2:/ $XDG_RUNTIME_DIR/nef
mylaptop$ fusermount -u $XDG_RUNTIME_DIR/nef
Using software (1/4)
Overview of the tools available :
● Allinea DDT (debugger for OpenMP/MPI) & MAP (profiler)
● Intel Parallel Studio (c/c++/fortran compilers, MPI, Vtune)
● Scientific libraries (petsc, trilinos, hypre, mumps, openblas, gmsh, …)
● GPU : cuda 7.5, caffe
● Many languages : GCC, Matlab, R, Python (scipy, numpy, pip), java, ...
● Recommended MPI : openmpi 1.10.1
● Visualization : Paraview ; vnc & virtualGL on a GPU node
You can also install your own software in your home directory :
e.g. with Python : pip install --user <package>
Using software (2/4)
Nef nodes and nef-devel2 run the Linux CentOS 7 64-bit distribution
Compilation : use nef-devel2 (or a node)
Environment modules : configure the user environment for using a tool
● module avail : list all available modules
● module load module_name : configure the current session for module_name
● module list : show loaded modules
● module purge : unload all modules
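A minimal session sketch (the MPI module name is the one used in the example on the next slides) :

nef-devel2$ module avail
nef-devel2$ module load mpi/openmpi-1.10.1-gcc
nef-devel2$ module list
Currently Loaded Modulefiles:
  1) mpi/openmpi-1.10.1-gcc
nef-devel2$ module purge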
Using software (3/4)
Example : PETSc / OpenMPI test code from the PETSc distribution
Compilation :
nef-devel2$ module load mpi/openmpi-1.10.1-gcc
nef-devel2$ module load petsc/3.6.3
nef-devel2$ ./configure   ## openmpi and petsc PATH/params come from the loaded modules
nef-devel2$ make test_code
nef-devel2$
Using software (4/4)
Example : PETSc / OpenMPI test code from the PETSc distribution (continued)
Job script :
nef-devel2$ cat job_script
#!/bin/bash
source /etc/profile.d/module.sh
module load mpi/openmpi-1.10.1-gcc
module load petsc/3.6.3
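## assumption : for a run spanning several nodes, pass the OAR node list to mpirun, e.g. with -machinefile $OAR_NODEFILE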
mpirun --prefix $MPI_HOME /path/to/test_code
nef-devel2$
Submitting job :
nef-devel2$ oarsub -l /core=20 /path/to/job_script
nef-devel2$
Appendix : Nef nodes (1/2)
Current Nef nodes :

nodes                CPU type         cores           memory   GPU                   HDD
8x Dell C6220        Xeon E5-2650v2   2x8 @ 2.6GHz    256 GB   -                     1TB SATA
44x Dell C6100       Xeon X5670       2x6 @ 2.93GHz   96 GB    -                     250GB SATA
13x Dell R900        Xeon E7450       4x6 @ 2.4GHz    64 GB    -                     146GB SAS
2x Carri 5600XLR8    Xeon X5650       2x6 @ 2.66GHz   72 GB    7 GPU (C2050/C2070)   160GB SSD
2x Dell C6100        Xeon X5670       1x6 @ 2.66GHz   24 GB    2 GPU (M2050)         250GB SATA
Appendix : Nef nodes (2/2)
Nodes to be added 04/2016 (currently in Legacy Nef) :

nodes              CPU type          cores           memory       GPU   HDD
16x Dell C6220     Xeon E5-2680 v2   2x10 @ 2.6GHz   192 GB       -     2TB SATA
6x Dell C6145      Opteron 6376      4x16 @ 2.3GHz   256 GB       -     500GB SATA
6x Dell R815       Opteron 6174      4x12 @ 2.2GHz   256/512 GB   -     600GB SAS
19x Dell PE1950    Xeon X5670        2x4             16 GB        -     73GB SAS
Thank you
wiki.inria.fr/ClustersSophia
Inria Sophia Antipolis Méditerranée
29/03/2016