TRANSCRIPT
TTU High Performance Computing User Training: Part 2 Srirangam Addepalli and David Chaffin, Ph.D.
Advanced Session: Outline
Cluster Architecture
File System and Storage
Lectures with Labs:
Advanced Batch Jobs
Compilers/Libraries/Optimization
Compiling/Running Parallel Jobs
Grid Computing
HPCC Clusters
hrothgar: 128 dual-processor 64-bit Xeons, 3.2 GHz, 4 GB memory, Infiniband and Gigabit Ethernet, CentOS 4.3 (Red Hat)
community cluster: 64 nodes, part of hrothgar; same configuration except no Infiniband. Owned by faculty members, controlled through the batch queues.
minigar: 20 nodes, 3.6 GHz, IB; for development, opening soon
Physics grid machine on order; some nodes will be available
poseidon: Opteron, 3 nodes, PathScale compilers
Several retired, test, and grid systems
Cluster Performance
Main factors:
1. Individual node performance, of course. SPECfp_rate2000 (www.spec.org) matches our applications well. The newest dual-core chips have 2x the cores at ~1.5x the performance per core, for ~3x the per-node performance vs. hrothgar.
2. Fabric latency (delay time of one message, in µs: IB = 6, GE = 40)
3. Fabric bandwidth (in MB/s: IB = 600, GE = 60)
Intel has the better CPU right now; AMD has better shared-memory performance. Overall they are about equal.
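As a rough illustration (a back-of-envelope model, not from the slides): transfer time ≈ latency + size/bandwidth. Sending 1 MB over IB takes about 6 µs + 1 MB / 600 MB/s ≈ 1.7 ms; over GE it takes about 40 µs + 1 MB / 60 MB/s ≈ 16.7 ms. Latency dominates for small messages, bandwidth for large ones.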
Cluster Architecture
An application example where the system is limited by interconnect performance:
GROMACS; performance is simulation time completed per unit of real time:
Hrothgar, 8 nodes, Gig-E: ~1200 ns/day
Hrothgar, 8 nodes, IB: ~2800 ns/day
Current dual-core systems have 3x the serial throughput of hrothgar, and quad-core systems are coming next year. These need more interconnect bandwidth; in the future, Gig-E will be suitable only for serial jobs.
Cluster Usage
ssh to hrothgar
scp files to hrothgar
compile on hrothgar
run on the compute nodes (only) using the LSF batch system (only)
example files: /home/shared/examples/
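A minimal session sketch of this workflow (the user name and program file are hypothetical placeholders):
scp myprog.c elvis@hrothgar:             # copy source from your workstation
ssh elvis@hrothgar                       # log in to the head node
icc -O -o myprog myprog.c                # compile on hrothgar
cp /home/shared/examples/mpi-basic.sh .  # start from an example batch script, then edit it to run ./myprog
bsub < mpi-basic.sh                      # jobs run only through LSF, never on the head node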
Some Useful LSF Commands
bjobs -w (-w for wide shows full node names)
bjobs -l [job#] (-l for long shows everything)
bqueues [-l] shows queues [everything]
bhist [job#] job history
bpeek [job#] stdout/err stored by LSF
bkill job# kill the job
-bash-3.00$ /home/shared/bin/check-hosts-batch.sh
hrothgar, 2 free=0 nodes, 0 cpus
hrothgar, 1 free=3 nodes, 3 cpus
hrothgar, 0 free=125 nodes
hrothgar, offline=0 nodes
Batch Queues on hrothgar
bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN
short 35 Open 56 56 - - 0 0 0
parallel 35 Open 224 40 - - 108 0 108
serial 30 Open 156 60 - - 204 140 64
parallel_long 25 Open 256 64 - - 16 0 16
idle 20 Open 256 256 - - 100 0 55
Every 30 seconds the scheduler cycles through the queued jobs. A job starts if:
(1) nodes are available (free, or running only idle-queue jobs)
(2) the user's CPUs are below the per-user queue limit (bqueues JL/U)
(3) the queue's CPUs are below the total queue limit (bqueues MAX)
(4) it is in the highest-priority eligible queue (short, parallel, serial, parallel_long, idle)
(5) fair share: the user with the smallest current usage goes first
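For example, to target a specific queue and inspect its limits (the queue choice here is illustrative):
bqueues -l parallel                # shows MAX, JL/U, and the full scheduling policy for one queue
bsub -q parallel < mpi-basic.sh    # submit to the parallel queue instead of the default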
Unix/Linux Compiling Common Features
[compiler] [options] [source files] [linker options]
(PathScale compilers are only on poseidon)
C compilers: gcc, icc, pathcc
C++: g++, icpc, pathCC
Fortran: g77, ifort, pathf90
Options: -O [optimize] -o outputfilename
Source files: new.f or *.f or *.c
Linker options: To link with libx.a or libx.so in /home/elvis/lib:
-L/home/elvis/lib -lx
Many programs need: -lm, -pthread
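Putting the pieces above together (using the slide's placeholder file and library names):
ifort -O -o myprog new.f -L/home/elvis/lib -lx -lm
This compiles new.f with optimization, names the executable myprog, and links against libx.a or libx.so in /home/elvis/lib plus the math library.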
MPI Compile: Path
. /home/shared/examples/new-bashrc [using bash]
source /home/shared/examples/new-cshrc [using tcsh]
hrothgar:dchaffin:dchaffin $ echo $PATH
/sbin:/bin:/usr/bin:/usr/sbin:/usr/X11R6/bin:\
/usr/share/bin:/opt/rocks/bin:/opt/rocks/sbin:\
/opt/lsfhpc/6.2/linux2.6-glibc2.3-x86_64/bin:\
/opt/intel/fce/9.0/bin:/opt/intel/cce/9.0/bin:\
/share/apps/mpich/IB-icc-ifort-64/bin:\
/opt/lsfhpc/6.2/linux2.6-glibc2.3-x86_64/bin
mpich builds exist for each combination: IB or GE; icc, gcc, or pathcc; ifort, g77, or pathf90
mpicc/mpif77/mpif90/mpiCC must match mpirun!
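To check that the wrappers and mpirun come from the same MPICH build (a quick sanity check; output abbreviated):
which mpicc mpirun
/share/apps/mpich/IB-icc-ifort-64/bin/mpicc
/share/apps/mpich/IB-icc-ifort-64/bin/mpirun
If the two point into different MPICH trees, the job will likely fail at startup.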
MPI Compile/Run
cp /home/shared/examples/mpi-basic.sh .
cp /home/shared/examples/cpi.c .
/opt/mpich/gnu/bin/mpicc cpi.c [or]
/share/apps/mpich/IB-icc-ifort-64/bin/mpicc cpi.c
vi mpi-basic.sh
Check the ptile setting, and comment out whichever mpirun line you are not using (either IB or the default)
You could also change the executable name
bsub < mpi-basic.sh
produces:
job#.out LSF output
job#.pgm.out mpirun output
job#.err LSF stderr
job#.pgm.err mpirun stderr
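A minimal sketch of what a script like mpi-basic.sh plausibly contains (the actual example's contents are an assumption; the #BSUB directives are standard LSF, and the mpirun invocation is simplified, since the real script may use a machine file or LSF's mpirun wrapper):
#!/bin/bash
#BSUB -J cpi                # job name
#BSUB -n 4                  # total CPUs requested
#BSUB -R "span[ptile=2]"    # CPUs per node (hrothgar nodes are dual-processor)
#BSUB -o %J.out             # LSF stdout (%J expands to the job number)
#BSUB -e %J.err             # LSF stderr
# keep the mpirun that matches the mpicc you compiled with; comment out the other
/share/apps/mpich/IB-icc-ifort-64/bin/mpirun -np 4 ./a.out
#/opt/mpich/gnu/bin/mpirun -np 4 ./a.out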
Exercise/Homework
Run an MPI benchmark on Infiniband, Ethernet, and shared memory. Compare latency and bandwidth. Research and briefly discuss reasons for the performance:
Hardware bandwidth (look it up)
Software layers (OS, interrupts, MPI, one-sided copy, two-sided copy)
Hardware:
Topspin Infiniband SDR, PCI-X
Xeon Nocona shared memory
Intel Gigabit, on board
Program: /home/shared/examples/mpilc.c or equivalent
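One way to set up the three runs (a sketch; the build choices and node layouts are assumptions):
# Infiniband: build with the IB mpicc, run 2 processes on 2 nodes (ptile=1)
/share/apps/mpich/IB-icc-ifort-64/bin/mpicc -O -o mpilc-ib /home/shared/examples/mpilc.c
# Ethernet: build with the default GE mpicc, same 2-node layout
/opt/mpich/gnu/bin/mpicc -O -o mpilc-ge /home/shared/examples/mpilc.c
# Shared memory: run both processes on one node (ptile=2) so messages never cross the fabric
Submit each variant with bsub, adjusting ptile and the mpirun line in the batch script.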