Parallel Programming
AMANO, Hideharu
Parallel Programming
Message Passing: PVM, MPI
Shared Memory: POSIX thread, OpenMP, CUDA/OpenCL
Automatic Parallelizing Compilers
Message passing (blocking: rendezvous)
Send Receive Send Receive
Message passing (with buffer)
Send Receive Send Receive
Message passing (non-blocking)
Send Receive Other Job
PVM (Parallel Virtual Machine)
A buffer is provided for the sender.
Both blocking and non-blocking receives are provided.
Barrier synchronization is supported.
MPI (Message Passing Interface)
A superset of PVM for one-to-one communication.
Group communication is supported.
Various communication patterns are supported.
Errors are checked with communication tags.
Programming style using MPI
SPMD (Single Program Multiple Data Streams)
Multiple processes execute the same program.
Independent processing is done based on the process number.
Program execution using MPI
The specified number of processes is generated.
They are distributed to the nodes of the NORA machine or PC cluster.
Communication methods
Point-to-point communication
A sender and a receiver each execute a function for sending or receiving.
The send and receive calls must match strictly.
Collective communication
Communication among multiple processes.
The same function is executed by multiple processes.
It can be replaced with a sequence of point-to-point communications, but a collective call is sometimes more efficient.
Fundamental MPI functions
Most programs can be described using six fundamental functions:
MPI_Init() … MPI initialization
MPI_Comm_rank() … get the process number (rank)
MPI_Comm_size() … get the total number of processes
MPI_Send() … message send
MPI_Recv() … message receive
MPI_Finalize() … MPI termination
Other MPI functions
Functions for measurement
MPI_Barrier() … barrier synchronization
MPI_Wtime() … get the clock time
Non-blocking functions
Each consists of a communication request and a completion check.
Other calculations can be executed while waiting.
An Example
#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    }
    else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}
Initialize and Terminate
int MPI_Init(
int *argc, /* pointer to argc */
char ***argv /* pointer to argv */ );
mpi_init(ierr)
integer ierr ! return code
The command-line arguments must be passed on unchanged, as pointers to argc and argv.
int MPI_Finalize();
mpi_finalize(ierr)
integer ierr ! return code
Communicator functions
MPI_Comm_rank returns the rank (process ID) in the communicator comm.
int MPI_Comm_rank(
    MPI_Comm comm, /* communicator */
    int *rank /* process ID (output) */ );
mpi_comm_rank(comm, rank, ierr)
    integer comm, rank
    integer ierr ! return code
MPI_Comm_size returns the total number of processes in the communicator comm.
int MPI_Comm_size(
    MPI_Comm comm, /* communicator */
    int *size /* number of processes (output) */ );
mpi_comm_size(comm, size, ierr)
    integer comm, size
    integer ierr ! return code
Communicators are used for sharing a communication space among a subset of processes. MPI_COMM_WORLD is the predefined communicator that contains all processes.
MPI_Send
It sends data to the process "dest".
int MPI_Send(
    void *buf, /* send buffer */
    int count, /* # of elements to send */
    MPI_Datatype datatype, /* datatype of elements */
    int dest, /* destination (receiver) process ID */
    int tag, /* tag */
    MPI_Comm comm /* communicator */ );
mpi_send(buf, count, datatype, dest, tag, comm, ierr)
    <type> buf(*)
    integer count, datatype, dest, tag, comm
    integer ierr ! return code
Tags are used to identify messages.
MPI_Recv
int MPI_Recv(
    void *buf, /* receive buffer */
    int count, /* # of elements to receive */
    MPI_Datatype datatype, /* datatype of elements */
    int source, /* source (sender) process ID */
    int tag, /* tag */
    MPI_Comm comm, /* communicator */
    MPI_Status *status /* status (output) */ );
mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
    <type> buf(*)
    integer count, datatype, source, tag, comm, status(mpi_status_size)
    integer ierr ! return code
The same tag as the sender's must be passed to MPI_Recv.
Pass a pointer to a variable of type MPI_Status. It is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the process ID of the sender, the tag and the error code.
datatype and count
The size of the message is determined by count and datatype.
MPI_CHAR    char
MPI_INT     int
MPI_FLOAT   float
MPI_DOUBLE  double
… etc.
Compile and Execution
% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)
POSIX Thread
POSIX (Portable Operating System Interface) threads are the standard API on Linux for controlling threads.
Thread handling: pthread_create(); pthread_join(); pthread_exit();
Synchronization
Mutex: pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock();
Condition variable (semaphore-like signaling): pthread_cond_signal(); pthread_cond_wait(); etc.
OpenMP
#include <stdio.h>
#include <omp.h>
int main()
{
#pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}
Multiple threads are generated by the pragma. Globally declared variables are shared among the threads.
Convenient pragma for parallel execution
#pragma omp parallel
{
#pragma omp for
    for (i=0; i<1000; i++){
        c[i] = a[i] + b[i];
    }
}
The assignment of iterations i to threads is adjusted automatically so that the load on each thread is even.
CUDA/OpenCL
CUDA is developed for GPGPU programming.
SPMD (Single Program Multiple Data)
3-D management of threads
32 threads are managed as a warp
SIMD programming
Architecture-dependent memory model
OpenCL is a standard language for heterogeneous accelerators.
Heterogeneous Programming with CUDA
[Figure: execution alternates between host and device]
Host: serial code
Device: parallel kernel KernelA(args);
Host: serial code
Device: parallel kernel KernelB(args);
Threads and thread blocks
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads threadID = 0 to 7; every thread runs the code below]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
Kernel = grid of thread blocks; each thread executes the same code.
Threads in the same block may synchronize with barriers: __syncthreads();
Thread blocks cannot synchronize with each other, so the order in which blocks execute depends on the machine.
Memory Hierarchy
Thread: per-thread local memory
Block: per-block shared memory
Sequential kernels (Kernel 0, Kernel 1, …): per-device global memory
cudaMemcpy() is used for transfers to and from host memory.
CUDA extensions
Declaration specifiers:
__global__ void kernelFunc(…); // kernel function, runs on device
__device__ int GlobalVar; // variable in device memory
__shared__ int sharedVar; // variable in per-block shared memory
Extended function invocation syntax for parallel kernel launch:
KernelFunc<<<dimGrid, dimBlock>>>(…); // launch dimGrid blocks with dimBlock threads each
Special variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
Barrier synchronization between threads:
__syncthreads();
CUDA runtime
Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()
Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()
Graphics interoperability: cudaGLMapBufferObject(), cudaD3D9MapResources()
Texture management: cudaBindTexture(), cudaBindTextureToArray()
Example: Increment Array Elements
void increment_cpu(float *a, float b, int N)
{
for(int idx=0;idx<N;idx++)
a[idx]=a[idx]+b;
}
int main()
{
…
increment_cpu(a,b,N);
}
// CUDA version: one thread per array element
__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) // guard: the last block may have threads beyond N
        a[idx] = a[idx] + b;
}
int main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
Example: Increment Array Elements
blockIdx.x=0, blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=0,1,2,3
blockIdx.x=1, blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=4,5,6,7
blockIdx.x=2, blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=8,9,10,11
blockIdx.x=3, blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=12,13,14,15
Let’s assume N=16, blockDim=4.
int idx = blockDim.x * blockIdx.x + threadIdx.x;
maps the local index threadIdx to a global index.
blockDim should be >= 32 in real code!
Using a larger number of blocks hides the memory latency of the GPU.
Host code
// allocate host memory
unsigned int numBytes = N * sizeof(float);
float *h_A = (float *) malloc(numBytes);
// allocate device memory
float *d_A = 0;
cudaMalloc((void**)&d_A, numBytes);
// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
// execute the kernel (the block count is rounded up so that every element is covered)
increment_gpu<<<(N + blockSize - 1) / blockSize, blockSize>>>(d_A, b, N);
// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
// free device memory
cudaFree(d_A);
[Figure: block diagram of the GeForce GTX280 (240 cores): Host, Input Assembler, Thread Execution Manager, groups of Thread Processors each with PBSM, Load/Store, Global Memory]
Hardware Implementation: Execution Model
[Figure: the host launches Kernel1 and Kernel2 on the device as Grid1 and Grid2; each grid consists of Blocks (0,0) to (2,1). Block (1,1) is expanded into Threads (0,0) to (63,2), grouped into Warp 0 to Warp 5.]
A multiprocessor executes the same instruction on a group of threads called a warp.
Warp size = the number of threads in a warp.
Automatic parallelizing compilers
They automatically translate code written for uniprocessors into code for multiprocessors.
Loop-level parallelism is the main target of parallelization.
Fortran codes have been the main targets:
No pointers
The array structure is simple
Recently, restricted C has become a target language.
Oscar Compiler (Waseda Univ.), COINS
Shared memory model vs. message passing model
Benefits
Shared memory
A distributed OS is easy to implement.
Automatic parallelizing compilers
POSIX thread, OpenMP
Message passing
Formal verification is easy (blocking)
No side effects (a shared variable is itself a side effect)
Small cost
Parallel Programming Contest
In this lecture, a parallel programming contest will be held.
All students who want to get the credit must join it.
At a minimum, the program must run correctly.
Students with good results will be given the credit unconditionally.