ECE 574 – Cluster Computing
Lecture 4
Vince Weaver
http://web.eece.maine.edu/~vweaver
31 January 2017
Announcements
• Don’t forget about homework #3
• I ran the HPCG benchmark on the Haswell-EP machine.
Theoretical peak: 16 DP FLOP/cycle × 16 cores × 2.6 GHz ≈ 666 GFLOPS
Linpack/OpenBLAS: 436 GFLOPS (65% of peak)
HPCG: 0.7 GFLOPS (0.1% of peak)
Running Linpack
• HPL solves a linear system of equations, Ax=b, via LU factorization.
• Download and install a BLAS. ATLAS? OpenBLAS? Intel?
Compiler? Intel? gcc? gfortran?
• Download and install MPI (we’ll talk about that later).
MPICH? OpenMPI?
• Download HPL. Current version 2.2?
• Modify a Makefile (not trivial); make sure it links to the proper BLAS. make arch=OpenBLAS
• For the above step, you might need to create a link from hpl in your home directory to the actual location.
• Create a bin/OpenBLAS directory with the default HPL.dat file
• Run it: ./xhpl, or if on a cluster mpirun -np 4 ./xhpl or similar.
• The result won't be very good at first; you need to tune HPL.dat.
• N is the problem size. In general you want the N×N matrix to fill most of RAM: each element is an 8-byte double, so take the RAM size in bytes, divide by 8, take the square root, and round down (see the sketch after this list).
• NB is the block size; it can be tuned.
• P×Q: on a cluster you can specify the process grid to use; P×Q should equal the number of MPI processes. Linpack works best with a grid as close to square as possible.
• Fiddle with all of these parameters until you get the highest result.
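Below is a minimal C sketch of the N sizing arithmetic (not from the slides; the 16 GB RAM size and the 10% back-off are assumptions for illustration). Compile with -lm for sqrt().

#include <stdio.h>
#include <math.h>

int main(void) {
	/* hypothetical node with 16 GB of RAM */
	double ram_bytes = 16.0 * 1024 * 1024 * 1024;

	/* each element of the NxN matrix is an 8-byte double,
	   so N = sqrt(RAM / 8) would fill RAM exactly */
	double n_full = sqrt(ram_bytes / 8.0);

	/* back off ~10% on N to leave room for the OS and
	   buffers; the exact margin is a guess to be tuned */
	int n = (int)(n_full * 0.9);

	printf("N to fill RAM: %.0f, suggested N: %d\n", n_full, n);

	return 0;
}

It is also often suggested to round N down to a multiple of your chosen NB.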
Commodity Cluster Layout
[Diagram: a login/compile node connected over the network to the compute nodes and storage]
• A simple cluster like the pi-cluster, or older ones I've made
• Commodity cluster design is a combination of ECE331/ECE435 material more than anything else
• Why have a head node?
• What kind of network? Ethernet? InfiniBand? Something fancier?
• Operating system? Do all nodes need a copy of the OS?
Linux? Windows? None?
• Booting: network boot, local disk boot.
• Network topology? Star? Direct-connect? Cube?
Hyper-cube?
• Disk: often shared network filesystem. Why? Simple:
NFS (network file system). More advanced cluster
filesystems available.
• Don’t forget power/cooling
• Running software?
Job Schedulers
• On a big cluster, how do you submit jobs?
• If everyone just logged in to nodes at random, it would be a mess
• Batch job scheduling
• Different queues (high priority, long running, etc)
• Resource management (make sure jobs don't overcommit cores, use too much RAM, etc.)
• Notify you when finished?
• Accounting (how much time is used per user, who is going to pay?)
Scheduling
• Different Queues Possible – Low priority? Normal? High
priority (paper deadline)? Friends/Family?
• FIFO – first in, first out
• Backfill – bypass the FIFO to efficiently use any remaining space (see the sketch after this list)
• Resources – how long a job can run before being killed, how many CPUs, how much RAM, how much power, etc.
• Heterogeneous Resources – not all nodes have to be the same: some have more cores, some older processors, some GPUs, etc.
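To make the backfill idea above concrete, here is a tiny C sketch (not from the slides; the job names and sizes are made up): a 4-core node whose FIFO head job needs more cores than are free. Pure FIFO would leave all 4 cores idle; backfill walks past the blocked head and starts the smaller jobs that fit. A real scheduler also checks that a backfilled job will finish before the head job's reserved start time, which is omitted here.

#include <stdio.h>

struct job {
	const char *name;
	int cores;
};

int main(void) {
	int free_cores = 4;
	struct job queue[] = {
		{ "J1", 8 },	/* FIFO head: too big to start now */
		{ "J2", 2 },
		{ "J3", 4 },
		{ "J4", 2 },
	};
	int n = sizeof(queue) / sizeof(queue[0]);

	/* backfill pass: start any queued job that fits in the
	   currently idle cores, even if it is behind the head */
	for (int i = 0; i < n; i++) {
		if (queue[i].cores <= free_cores) {
			printf("start %s on %d cores\n",
			       queue[i].name, queue[i].cores);
			free_cores -= queue[i].cores;
		} else {
			printf("%s waits (needs %d cores, %d free)\n",
			       queue[i].name, queue[i].cores, free_cores);
		}
	}
	return 0;
}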
Common Job Schedulers
• PBS (Portable Batch System) – OpenPBS/PBSPro/TORQUE
• NBS
• Slurm
• Moab
• Condor
• many others
Slurm
• http://slurm.schedmd.com/
• Slurm Workload Manager, originally "Simple Linux Utility for Resource Management"
(a Futurama joke? Slurm is the soft drink in the show)
• Developed originally at LLNL
• Over 60% of top 500 use it
sinfo
provides info on the cluster
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up infinite 1 idle haswell-ep
general* up infinite 1 idle haswell-ep
srun
starts a job, but interactively
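For example (a typical invocation, not from the slides, using the general partition from the sinfo output above):
srun -p general -n 1 --pty /bin/bash
This drops you into an interactive shell on a compute node.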
sbatch
submits a job to the job queue
#!/bin/bash
#SBATCH -p general # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH -n 8 # number of cores
#SBATCH -t 0-2:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
export OMP_NUM_THREADS=4
./xhpl
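Submit with something like sbatch job.sh (the script filename here is assumed); the job then shows up in squeue, as on the next slide.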
squeue
shows the jobs in the queue; in the ST column, PD means pending and R means running
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
63 general time_hpl ece574-0 PD 0:00 1 (Resources)
64 general time_hpl ece574-0 PD 0:00 1 (Resources)
65 general time_hpl ece574-0 PD 0:00 1 (Resources)
62 general time_hpl ece574-0 R 0:14 1 haswell-ep
scancel
kills a job, by job ID
scancel 65
Some sample code
int x[8][8];
int i, j;

/* zero an 8x8 matrix */
for (i = 0; i < 8; i++) {
	for (j = 0; j < 8; j++) {
		x[i][j] = 0;
	}
}
	mov	r0, #0			; i = 0
i_loop:
	mov	r1, #0			; j = 0
j_loop:
	mov	r3, #0			; value to store
	ldr	r4, =x			; base address of x
	add	r4, r4, r0, lsl #5	; + i*32 (a row is 8 ints = 32 bytes)
	add	r4, r4, r1, lsl #2	; + j*4  (an int is 4 bytes)
	str	r3, [r4]		; x[i][j] = 0
	add	r1, r1, #1		; j++
	cmp	r1, #8
	blt	j_loop
	add	r0, r0, #1		; i++
	cmp	r0, #8
	blt	i_loop
Parallel Computing – Single Core
Simple CPUs
• Ran one instruction at a time.
• Could take one or multiple cycles (IPC of 1.0 or less)
• Example – a single instruction takes 1-5 cycles?
[Diagram: a simple CPU (PC, control, registers, ALU) connected to memory]
Pipelined CPUs
• 5-stage MIPS pipeline
• From 2 stages up to the Pentium 4's 31 stages
• Example – a single instruction always takes 5 cycles of latency? But what about on average? In an ideal pipeline, N instructions take 5 + (N − 1) cycles, so the average approaches one instruction per cycle.
IF ID EX MEM WB
Pipelined CPUs
• IF = Instruction Fetch.
Fetch 32-bit instruction from L1-cache
• ID = Decode
• EX = Execute (ALU, maybe shifter, multiplier, divider)
Memory addresses are calculated here
• MEM = Memory – if memory has to be accessed, it happens now.
• WB = register values written back to the register file
Data Hazards
Happen because instructions might depend on results from
instructions ahead of them in the pipeline that haven’t been
written back yet.
• RAW – “true” dependency – problem. Bypassing?
• WAR – “anti” dependency – not a problem if commit in
order
• WAW – “output” dependency – not a problem as long
as ordered
• RAR – not a problem
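A tiny C fragment (not from the slides) where each hazard type appears between neighboring statements, if you picture them flowing down a pipeline:

#include <stdio.h>

int main(void) {
	int b = 1, c = 2, e = 3;

	int a = b + c;	/* writes a */
	int d = a + e;	/* RAW on a: true dependency, wants bypassing */
	e = b * 2;	/* WAR on e: anti dependency (e was read above) */
	a = d - 1;	/* WAW on a: output dependency (a written above) */

	printf("%d %d %d\n", a, d, e);
	return 0;
}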
Structural Hazards
• The CPU can't always provide enough hardware; for example, not enough multipliers to issue two multiplies at once
Control Hazards
• How quickly can we know the outcome of a branch?
• Branch prediction? Branch delay slot?
Branch Prediction
• Predict (guess) if a branch is taken or not.
• What do we do if guess wrong? (have to have some way
to cancel and start over)
• Modern predictors can be very good, greater than 99%
• Designs are complex and could fill an entire class
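For a flavor of the simplest design, here is a minimal C sketch (not from the slides) of a 2-bit saturating counter, the classic building block of real predictors: states 0-1 predict not-taken, states 2-3 predict taken.

#include <stdio.h>

static int counter = 2;		/* start in "weakly taken" */

static int predict(void)
{
	return counter >= 2;	/* 1 = predict taken */
}

static void update(int taken)
{
	if (taken && counter < 3)
		counter++;
	if (!taken && counter > 0)
		counter--;
}

int main(void)
{
	/* a loop branch: taken 7 times, then falls through once */
	int outcome[8] = { 1, 1, 1, 1, 1, 1, 1, 0 };
	int correct = 0;

	for (int i = 0; i < 8; i++) {
		if (predict() == outcome[i])
			correct++;
		update(outcome[i]);
	}
	printf("correct: %d of 8\n", correct);	/* predicts 7 of 8 */
	return 0;
}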
Memory Delay
• Memory/cache is slow
• Need to insert bubbles (stalls), or use a memory delay slot
The Memory Wall
• Wulf and McKee
• Processors getting faster more quickly than memory
• Processors can spend large amounts of time waiting for
memory to be available
• How do we hide this?
Caches
• The basic idea: you have small memories closer to the CPU that are much faster than main memory
• Data from main memory is cached in these caches
• Data is automatically brought in as needed.
It can also be pre-fetched, either explicitly by the program or by the hardware guessing (see the sketch after this list).
• What are the downsides of pre-fetching?
• Modern systems often have multiple levels of cache: usually a small (32k or so each) L1 instruction and data cache, a larger (128k?) shared L2, then L3 and even L4.
• Modern systems also might share caches between
processors, more on that later
• Again, could teach a whole class on caches
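Here is a minimal sketch of explicit software prefetching (not from the slides), using the __builtin_prefetch builtin available in GCC and Clang; the 16-element lookahead is a guess that would need tuning per machine:

#include <stdio.h>

#define N (1 << 20)

static double a[N];

int main(void)
{
	double sum = 0.0;

	for (int i = 0; i < N; i++) {
		/* request the data ~16 elements ahead so it is
		   (hopefully) already in cache when we get there */
		if (i + 16 < N)
			__builtin_prefetch(&a[i + 16]);
		sum += a[i];
	}

	printf("sum = %f\n", sum);
	return 0;
}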