
Page 1: TTU High Performance Computing User Training: Part 2
Srirangam Addepalli and David Chaffin, Ph.D.

Advanced Session: Outline

Cluster Architecture

File System and Storage

Lectures with Labs:

Advanced Batch Jobs

Compilers/Libraries/Optimization

Compiling/Running Parallel Jobs

Grid Computing

Page 2: HPCC Clusters

hrothgar: 128 dual-processor 64-bit Xeon nodes, 3.2 GHz, 4 GB memory, Infiniband and Gigabit Ethernet, CentOS 4.3 (Red Hat)

Community cluster: 64 nodes, part of hrothgar, same hardware except no Infiniband. Owned by faculty members; access controlled by batch queues.

minigar: 20 nodes, 3.6 GHz, IB, for development; opening soon

Physics grid machine on order: some nodes available

poseidon: Opteron, 3 nodes, PathScale compilers

Several retired, test, and grid systems

Page 3: Cluster Performance

Main factors:

1. Individual node performance, of course. SPECfp_rate2000 (www.spec.org) matches our applications well. The newest dual-core processors have 2x the cores and ~1.5x the performance per core, for ~3x the per-node performance vs. hrothgar.

2. Fabric latency (delay time of one message, in microseconds: IB = 6, GE = 40)

3. Fabric bandwidth (in MB/s: IB = 600, GE = 60)

Intel has the better CPU right now; AMD has better shared-memory performance. Overall they are about equal.
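As a rough back-of-the-envelope illustration (a sketch, not from the slides), the time to move one message can be modeled as latency plus message size divided by bandwidth. The script below plugs the approximate IB and GE figures above into that model for an 8 KB message:

#!/bin/bash
# Rough model: message time = latency + size/bandwidth.
# Latency (us) and bandwidth (MB/s) are the slide's approximate figures;
# real numbers depend on the MPI stack and the message size.
for net in "IB 6 600" "GE 40 60"; do
  set -- $net
  awk -v name="$1" -v lat="$2" -v bw="$3" 'BEGIN {
    size = 8192                          # message size in bytes
    t = lat + size / (bw * 1e6) * 1e6    # total time in microseconds
    printf "%s: 8 KB message takes roughly %.0f us\n", name, t
  }'
done

With these numbers an 8 KB message costs roughly 20 us on IB versus roughly 180 us on Gig-E, which is why tightly coupled parallel jobs favor IB.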

Page 4: Cluster Architecture

An application example where the system is limited by interconnect performance:

GROMACS, measured as simulation time completed per unit of real time:

Hrothgar, 8 nodes, Gig-E: ~1200 ns/day

Hrothgar, 8 nodes, IB: ~2800 ns/day

Current dual-core systems have 3x the serial throughput of hrothgar, and quad-core systems are coming next year. They need more bandwidth: Gig-E will in the future be suitable only for serial jobs.

Page 5: Cluster Usage

ssh to hrothgar

scp files to hrothgar

compile on hrothgar

run on the compute nodes (only), using the LSF batch system (only)

example files: /home/shared/examples/
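A hedged sketch of that workflow (the hostname and username below are placeholders, not the cluster's real login address):

ssh username@hrothgar.example.ttu.edu             # log in to the head node
scp mycode.c username@hrothgar.example.ttu.edu:   # copy files up
# compile on hrothgar, then submit to the compute nodes through LSF:
bsub < myjob.sh                                   # do not run jobs on the head node itself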

Page 6: Some Useful LSF Commands

bjobs -w (-w, for wide, shows the full node name)

bjobs -l [job#] (-l, for long, shows everything)

bqueues [-l]: shows queues [everything]

bhist [job#]: job history

bpeek [job#]: stdout/stderr stored by LSF

bkill job#: kill it
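A typical monitoring sequence for a single job might look like this (1234 is a placeholder job number):

bjobs -w          # list your jobs, wide format with full node names
bjobs -l 1234     # everything LSF records about job 1234
bhist 1234        # submit/start/finish history
bpeek 1234        # stdout/stderr captured so far
bkill 1234        # kill the job if something is wrong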

-bash-3.00$ /home/shared/bin/check-hosts-batch.sh

hrothgar, 2 free=0 nodes, 0 cpus

hrothgar, 1 free=3 nodes, 3 cpus

hrothgar, 0 free=125 nodes

hrothgar, offline=0 nodes

Page 7: Batch Queues on hrothgar

bqueues

QUEUE_NAME     PRIO STATUS MAX  JL/U JL/P JL/H NJOBS PEND RUN
short            35 Open    56    56    -    -     0    0   0
parallel         35 Open   224    40    -    -   108    0 108
serial           30 Open   156    60    -    -   204  140  64
parallel_long    25 Open   256    64    -    -    16    0  16
idle             20 Open   256   256    -    -   100    0  55

Every 30 seconds the scheduler cycles through the queued jobs. A job starts if:

(1) Nodes are available (free, or running idle-queue jobs)

(2) The user's CPUs are under the per-user queue limit ("bqueues" JL/U)

(3) The queue's CPUs are under the total queue limit ("bqueues" MAX)

(4) Higher-priority queues go first (short, parallel, serial, parallel_long, idle)

(5) Fair share: the user with the smallest current usage goes first
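If a job sits in PEND, LSF can report which of these conditions is blocking it (the job number is a placeholder):

bjobs -p 1234          # pending jobs, with the scheduler's pending reasons
bqueues -l parallel    # per-queue limits (MAX, JL/U) and scheduling policy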

Page 8: Unix/Linux Compiling: Common Features

[compiler] [options] [source files] [linker options]    (PathScale compilers are only on poseidon)

C compilers: gcc, icc, pathcc

C++: g++, icpc, pathCC

Fortran: g77, ifort, pathf90

Options: -O [optimize] -o outputfilename

Source files: new.f or *.f or *.c

Linker options: To link with libx.a or libx.so in /home/elvis/lib:

-L/home/elvis/lib -lx

Many programs need: -lm, -pthread
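Putting the pieces together, a hedged example build line (the program name "prog" is made up for illustration; new.f and libx are the names used above):

# Compile new.f with optimization, name the executable prog, and link
# against libx in /home/elvis/lib plus the math library.
ifort -O -o prog new.f -L/home/elvis/lib -lx -lm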

Page 9: MPI Compile: Path

. /home/shared/examples/new-bashrc      [using bash]

source /home/shared/examples/new-cshrc [using tcsh]

hrothgar:dchaffin:dchaffin $ echo $PATH

/sbin:/bin:/usr/bin:/usr/sbin:/usr/X11R6/bin:\

/usr/share/bin:/opt/rocks/bin:/opt/rocks/sbin:\

/opt/lsfhpc/6.2/linux2.6-glibc2.3-x86_64/bin:\

/opt/intel/fce/9.0/bin:/opt/intel/cce/9.0/bin:\

/share/apps/mpich/IB-icc-ifort-64/bin:\

/opt/lsfhpc/6.2/linux2.6-glibc2.3-x86_64/bin

MPICH builds: IB or GE; icc, gcc, or pathcc; ifort, g77, or pathf90

mpicc/mpif77/mpif90/mpiCC must match mpirun!
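A quick sanity check after sourcing the example rc file: confirm that the wrapper and the launcher resolve to the same MPICH build (IB-icc-ifort-64 in the PATH above), since mixing builds is a common cause of jobs that hang at startup.

# Both should point into the same mpich directory,
# e.g. /share/apps/mpich/IB-icc-ifort-64/bin
which mpicc
which mpirun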

Page 10: MPI Compile/Run

cp /home/shared/examples/mpi-basic.sh .

cp /home/shared/examples/cpi.c .

/opt/mpich/gnu/bin/mpicc cpi.c [or]

/share/apps/mpich/IB-icc-ifort-64/bin/mpicc cpi.c

vi mpi-basic.sh

Check the ptile setting (processes per node) and comment out the mpirun line that you are not using (either IB or the default)

Could change executable name

bsub < mpi-basic.sh

produces:

job#.out      LSF output

job#.pgm.out  mpirun output

job#.err      LSF stderr

job#.pgm.err  mpirun stderr
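For reference, an LSF MPI job script of this kind is usually shaped like the sketch below. This is an assumption-laden outline, not the contents of the real mpi-basic.sh: the queue name, CPU count, and exact mpirun invocation depend on the local MPICH/LSF setup.

#!/bin/bash
# Sketch only; the real /home/shared/examples/mpi-basic.sh may differ.
#BSUB -J cpi                  # job name
#BSUB -q parallel             # queue (see the bqueues slide)
#BSUB -n 8                    # number of CPUs
#BSUB -o %J.out               # LSF stdout  -> job#.out
#BSUB -e %J.err               # LSF stderr  -> job#.err

# Use the mpirun that matches the mpicc that built a.out (here the
# IB-icc-ifort build); how ranks map onto the allocated nodes depends
# on the site's MPICH/LSF integration.
/share/apps/mpich/IB-icc-ifort-64/bin/mpirun -np 8 ./a.out \
    > "$LSB_JOBID.pgm.out" 2> "$LSB_JOBID.pgm.err"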

Page 11: Exercise/Homework

Run the MPI benchmark on Infiniband, Ethernet, and shared memory. Compare latency and bandwidth. Research and briefly discuss reasons for the performance:

Hardware bandwidth (look it up)

Software layers (OS, interrupts, MPI, one-sided copy, two-sided copy)

Hardware:

Topspin Infiniband SDR, PCI-X

Xeon Nocona shared memory

Intel Gigabit, on board

Program: /home/shared/examples/mpilc.c or equivalent
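One possible way to set up the three builds, using the compiler wrappers from the earlier slides (the gnu build is assumed to be the default Gig-E one; run each case through a batch script like the sketch on the previous page):

cp /home/shared/examples/mpilc.c .

# Infiniband build
/share/apps/mpich/IB-icc-ifort-64/bin/mpicc -O -o bench_ib mpilc.c

# Default (assumed Gigabit Ethernet) build
/opt/mpich/gnu/bin/mpicc -O -o bench_ge mpilc.c

# Shared memory: request both processes on one dual-processor node so the
# messages never cross the network; for IB and GE, run across two nodes.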