Lab 7: Parallel Programming Models, MPI Continued


Parallel Programming Models

• Programming Models

• MPI Programming

Fundamental Design Issues

• Design of the user-system and hardware-software interfaces (proposed by Culler et al.)

• Functional Issues:
  – Naming: how are logically shared data referenced?
  – Operations: what operations are provided on shared data?
  – Ordering: how are accesses to data ordered and coordinated?

• Performance Issues:
  – Replication: how are data replicated to reduce communication?
  – Communication Cost: latency, bandwidth, overhead

• Others
  – Node granularity

Sequential Programming Model

• Functional Issues
  – Naming: can name any variable in the virtual address space
    • Hardware (and perhaps compilers) does the translation to physical addresses
  – Operations: loads and stores
  – Ordering: sequential program order

• Performance
  – Rely on dependences on a single location (mostly): dependence order
  – Compilers and hardware violate other orders without getting caught
  – Compiler: reordering and register allocation
  – Hardware: out-of-order execution, pipeline bypassing, write buffers
  – Transparent replication in caches

SAS Programming Model

[Figure: SAS programming model. Threads (processes) issue read(X) and write(X) on a shared variable X held in the system's memory.]

Shared Address Space Programming Model

• Naming
  – Any process can name any variable in the shared space

• Operations
  – Loads and stores, plus those needed for ordering

• Simplest Ordering Model
  – Within a process/thread: sequential program order
  – Across threads: some interleaving (as in time-sharing)
  – Additional orders through synchronization
  – Again, compilers/hardware can violate orders without getting caught

Synchronization

• Mutual exclusion (locks)
  – Ensure certain operations on certain data can be performed by only one process at a time
  – Like a room that only one person can enter at a time
  – No ordering guarantees

• Event synchronization
  – Ordering of events to preserve dependences, e.g. producer -> consumer of data
  – 3 main types:
    • point-to-point
    • global
    • group

MP Programming Model

[Figure: MP programming model. A process on Node A executes send(Y); the message travels to a process on Node B, which executes receive(Y') into its own copy Y'. Each node has its own processor and memory.]

Message Passing Programming Model

• Naming
  – Processes can name private data directly
  – No shared address space

• Operations
  – Explicit communication: "send" and "receive"
  – "Send" transfers data from the private address space to another process
  – "Receive" copies data from another process into the private address space

• Ordering
  – Program order within a process
  – Send and receive can provide point-to-point synchronization between processes
  – Mutual exclusion is inherent

• Can construct a global address space
  – Process number + address within the process address space
  – But no direct operations on these names

Message Passing Basics

What is MPI? The Message-Passing Interface (MPI) standard defines a library that lets you solve problems in parallel, using message passing to communicate between processes.

• Library: It is not a language (like FORTRAN 90, C or HPF) or even an extension to a language. Instead, it is a library that you use with your native, standard, serial compiler (f77, f90, cc, CC).

• Message Passing: Message passing is sometimes referred to as a paradigm itself. But it is really just a method of passing data between processes that is flexible enough to implement most paradigms (Data Parallel, Work Sharing, etc.).

• Communicate: This communication may be via a dedicated MPP torus network, or merely an office LAN. To the MPI programmer, it looks much the same.

• Processes: These can be 512 PEs on a T3E, or 4 processes on a single workstation.

Basic MPI

In order to do parallel programming, you require some basic functionality, namely, the ability to:

• Start Processes

• Send Messages

• Receive Messages

• Synchronize

With these four capabilities, you can construct any program. We will look at the basic versions of the MPI routines that implement this. Of course, MPI offers over 125 functions. Many of these are more convenient and efficient for certain tasks. However, with what we learn here, we will be able to implement just about any algorithm. Moreover, the vast majority of MPI codes are built using primarily these routines.

Starting Processes on the Cluster

On the Clusters, the fundamental control of processes is fairly simple. There is always one process for each PE that your code is running on. At run time, you specify how many PEs you require and then your code is copied to each PE and run simultaneously. In other words, a 512 PE Cluster code has 512 copies of the same code running on it from start to finish.

At first the idea that the same code must run on every node seems very limiting. We'll see in a bit that this is not at all the case.

MPI Programming Fundamentals

• Environment

• Point-to-point communication

• Collective communication

• Derived data type

• Group management

MPI Terms

• Blocking
  – A procedure is blocking if return from it indicates that the user is allowed to reuse the resources specified in the call

• Non-blocking
  – A procedure is non-blocking if it may return before the operation completes, and before the user is allowed to reuse the resources specified in the call

• Collective
  – A procedure is collective if all processes in a process group need to invoke it

• Message envelope
  – Information used to distinguish messages and selectively receive them
  – <source, destination, tag, communicator>
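The role of the message envelope in selective receiving can be illustrated with a small serial sketch. It makes no MPI calls: the Envelope struct, the envelope_matches function and the ANY_SOURCE / ANY_TAG constants are hypothetical stand-ins (real MPI provides the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG), and the communicator is modelled as a plain integer purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG. */
enum { ANY_SOURCE = -1, ANY_TAG = -1 };

/* The four envelope fields. The communicator is modelled as an
   integer context id, which is an assumption of this sketch. */
typedef struct {
    int source;
    int dest;
    int tag;
    int context;
} Envelope;

/* Returns true if a receive posted by rank `receiver` with the given
   source, tag and context selectors would accept message `msg`. */
bool envelope_matches(const Envelope *msg, int receiver,
                      int want_source, int want_tag, int want_context)
{
    assert(msg != NULL);
    if (msg->dest != receiver) return false;
    if (msg->context != want_context) return false;  /* no wildcard here */
    if (want_source != ANY_SOURCE && msg->source != want_source) return false;
    if (want_tag != ANY_TAG && msg->tag != want_tag) return false;
    return true;
}
```

A message from rank 1 to rank 0 with tag 10 would be accepted by a receive on rank 0 that asks for (ANY_SOURCE, ANY_TAG) or for (1, 10) in the same context, and rejected by one that asks for tag 11 or for a different context.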

MPI Terms (Cont’d)

• Communicator
  – The communication context for a communication operation
  – Messages are always received within the context they were sent
  – Messages sent in different contexts do not interfere
  – MPI_COMM_WORLD

• Process group
  – The communicator specifies the set of processes that share this communication context
  – This process group is ordered and processes are identified by their rank within this group

MPI Environment Calls

• MPI_INIT

• MPI_COMM_SIZE

• MPI_COMM_RANK

• MPI_FINALIZE

• MPI_ABORT

MPI_INIT

• Usage
    int MPI_Init( int*   argc_ptr,      /* in */
                  char** argv_ptr[] );  /* in */

• Description
  – Initializes MPI
  – All MPI programs must call this routine once and only once, before any other MPI routine

MPI_COMM_SIZE

• Usage
    int MPI_Comm_size( MPI_Comm comm,   /* in */
                       int*     size ); /* out */

• Description
  – Returns the number of processes in the group associated with a communicator

MPI_COMM_RANK

• Usage
    int MPI_Comm_rank( MPI_Comm comm,   /* in */
                       int*     rank ); /* out */

• Description
  – Returns the rank of the local process in the group associated with a communicator
  – The rank of the calling process is in the range 0 … size - 1

MPI_FINALIZE

• Usage
    int MPI_Finalize( void );

• Description
  – Terminates all MPI processing
  – Make sure this routine is the last MPI call
  – All pending communications involving a process must have completed before the process calls MPI_FINALIZE

MPI_ABORT

• Usage
    int MPI_Abort( MPI_Comm comm,        /* in */
                   int      errorcode ); /* in */

• Description
  – Forces all processes of an MPI job to terminate

Too Simple Program

#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank;
    int nproc;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* your code goes here */

    MPI_Finalize();
    return 0;
}

Point-To-Point Communication

• MPI_SEND

• MPI_RECV

• MPI_ISEND

• MPI_IRECV

• MPI_WAIT

• MPI_GET_COUNT

4 Communication Modes in MPI

• Standard mode
  – It is up to MPI to decide whether outgoing messages will be buffered
  – Non-local operation
  – Buffered or synchronous?

• Buffered (asynchronous) mode
  – A send operation can be started whether or not a matching receive has been posted
  – It may complete before a matching receive is posted
  – Local operation

4 Communication Modes in MPI (Cont’d)

• Synchronous mode
  – A send operation can be started whether or not a matching receive was posted
  – The send will complete successfully only if a matching receive was posted and the receive operation has started to receive the message
  – The completion of a synchronous send not only indicates that the send buffer can be reused but also indicates that the receiver has reached a certain point in its execution
  – Non-local operation

4 Communication Modes in MPI (Cont’d)

• Ready mode
  – A send operation may be started only if the matching receive is already posted
  – The completion of the send operation does not depend on the status of a matching receive and merely indicates that the send buffer can be reused
  – EAGER_LIMIT of SP system

MPI_SEND

• Usage
    int MPI_Send( void*        buf,      /* in */
                  int          count,    /* in */
                  MPI_Datatype datatype, /* in */
                  int          dest,     /* in */
                  int          tag,      /* in */
                  MPI_Comm     comm );   /* in */

• Description
  – Performs a blocking standard mode send operation
  – The message can be received by either MPI_RECV or MPI_IRECV

MPI_RECV

• Usage
    int MPI_Recv( void*        buf,      /* out */
                  int          count,    /* in */
                  MPI_Datatype datatype, /* in */
                  int          source,   /* in */
                  int          tag,      /* in */
                  MPI_Comm     comm,     /* in */
                  MPI_Status*  status ); /* out */

• Description
  – Performs a blocking receive operation
  – The message received must be less than or equal to the length of the receive buffer
  – MPI_RECV can receive a message sent by either MPI_SEND or MPI_ISEND

Blocking Operations

Sample Program for Blocking Operations

#include <stdio.h>
#include "mpi.h"

#define TAG 1   /* any value works as long as sender and receiver agree */

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isbuf, irbuf;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* MPI_INT is the C datatype; MPI_INTEGER is its Fortran counterpart */
    if (rank == 0) {
        isbuf = 9;
        MPI_Send( &isbuf, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD );
    } else if (rank == 1) {
        MPI_Recv( &irbuf, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status );
        printf( "%d\n", irbuf );
    }

    MPI_Finalize();
    return 0;
}

MPI_ISEND

• Usage
    int MPI_Isend( void*        buf,       /* in */
                   int          count,     /* in */
                   MPI_Datatype datatype,  /* in */
                   int          dest,      /* in */
                   int          tag,       /* in */
                   MPI_Comm     comm,      /* in */
                   MPI_Request* request ); /* out */

• Description
  – Performs a nonblocking standard mode send operation
  – The send buffer may not be modified until the request has been completed by MPI_WAIT or MPI_TEST
  – The message can be received by either MPI_RECV or MPI_IRECV

MPI_IRECV

• Usage
    int MPI_Irecv( void*        buf,       /* out */
                   int          count,     /* in */
                   MPI_Datatype datatype,  /* in */
                   int          source,    /* in */
                   int          tag,       /* in */
                   MPI_Comm     comm,      /* in */
                   MPI_Request* request ); /* out */

MPI_IRECV (Cont’d)

• Description
  – Performs a nonblocking receive operation
  – Do not access any part of the receive buffer until the receive is complete
  – The message received must be less than or equal to the length of the receive buffer
  – MPI_IRECV can receive a message sent by either MPI_SEND or MPI_ISEND

MPI_WAIT

• Usage
    int MPI_Wait( MPI_Request* request,  /* inout */
                  MPI_Status*  status ); /* out */

• Description
  – Waits for a nonblocking operation to complete
  – Information on the completed operation is found in status
  – If wildcards were used by the receive for either the source or tag, the actual source and tag can be retrieved from status->MPI_SOURCE and status->MPI_TAG

Non-Blocking Operations

MPI_GET_COUNT

• Usage
    int MPI_Get_count( MPI_Status*  status,   /* in */
                       MPI_Datatype datatype, /* in */
                       int*         count );  /* out */

• Description
  – Returns the number of elements in a message
  – The datatype argument should match the datatype argument of the call that set the status variable

Sample Program for Non-Blocking Operations

#include <stdio.h>
#include "mpi.h"

#define TAG 1   /* any value works as long as sender and receiver agree */

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isbuf, irbuf, count;
    MPI_Request request;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if (rank == 0) {
        isbuf = 9;
        MPI_Isend( &isbuf, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD, &request );
        MPI_Wait( &request, &status );   /* isbuf may be reused only after this */
    } else if (rank == 1) {
        MPI_Irecv( &irbuf, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &request );
        MPI_Wait( &request, &status );   /* irbuf is valid only after this */
        MPI_Get_count( &status, MPI_INT, &count );
        printf( "irbuf = %d source = %d tag = %d count = %d\n",
                irbuf, status.MPI_SOURCE, status.MPI_TAG, count );
    }

    MPI_Finalize();
    return 0;
}

Collective Communication

• MPI_BCAST

• MPI_SCATTER

• MPI_SCATTERV

• MPI_GATHER

• MPI_GATHERV

• MPI_ALLGATHER

• MPI_ALLGATHERV

• MPI_ALLTOALL

MPI_BCAST

• Usage
    int MPI_Bcast( void*        buffer,   /* inout */
                   int          count,    /* in */
                   MPI_Datatype datatype, /* in */
                   int          root,     /* in */
                   MPI_Comm     comm );   /* in */

• Description
  – Broadcasts a message from root to all processes in communicator
  – The type signature of count, datatype on any process must be equal to the type signature of count, datatype at the root

Example of MPI_BCAST

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank;
    int ibuf;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if (rank == 0) ibuf = 12345;
    else           ibuf = 0;

    MPI_Bcast( &ibuf, 1, MPI_INT, 0, MPI_COMM_WORLD );
    printf( "ibuf = %d\n", ibuf );

    MPI_Finalize();
    return 0;
}

MPI_SCATTER

• Usage
    int MPI_Scatter( void*        sendbuf,   /* in */
                     int          sendcount, /* in */
                     MPI_Datatype sendtype,  /* in */
                     void*        recvbuf,   /* out */
                     int          recvcount, /* in */
                     MPI_Datatype recvtype,  /* in */
                     int          root,      /* in */
                     MPI_Comm     comm );    /* in */

• Description
  – Distributes individual messages from root to each process in communicator
  – Inverse operation to MPI_GATHER

Example of MPI_SCATTER

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: isend holds one element per process. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if (rank == 0) {
        for (i = 0; i < nproc; i++)
            isend[i] = i + 1;
    }

    MPI_Scatter( isend, 1, MPI_INT, &irecv, 1, MPI_INT, 0, MPI_COMM_WORLD );
    printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}

MPI_SCATTERV

• Usage
    int MPI_Scatterv( void*        sendbuf,    /* in */
                      int*         sendcounts, /* in */
                      int*         displs,     /* in */
                      MPI_Datatype sendtype,   /* in */
                      void*        recvbuf,    /* out */
                      int          recvcount,  /* in */
                      MPI_Datatype recvtype,   /* in */
                      int          root,       /* in */
                      MPI_Comm     comm );     /* in */

• Description
  – Distributes individual messages from root to each process in communicator
  – Messages can have different sizes and displacements

Example of MPI_SCATTERV

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: rank r receives r + 1 elements. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int iscnt[3] = {1,2,3}, idisp[3] = {0,1,3};
    int isend[6] = {1,2,2,3,3,3}, irecv[3];
    int ircnt;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    ircnt = rank + 1;
    MPI_Scatterv( isend, iscnt, idisp, MPI_INT,
                  irecv, ircnt, MPI_INT, 0, MPI_COMM_WORLD );
    for (i = 0; i < ircnt; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
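The displacements {0,1,3} used with the counts {1,2,3} in this example are simply the running totals of the counts, which is the usual choice when the blocks are packed back to back in the root's buffer. The helper below is a hypothetical, MPI-free sketch of that calculation; displs_from_counts is not an MPI routine, and gaps between blocks (which the displacement array also permits) are not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Fills displs[i] with the offset of block i when blocks of counts[i]
   elements are stored contiguously, one after another. Returns the
   total number of elements, i.e. the minimum length of the root buffer. */
int displs_from_counts(const int *counts, int nproc, int *displs)
{
    int total = 0;
    assert(counts != NULL && displs != NULL && nproc >= 0);
    for (int i = 0; i < nproc; i++) {
        assert(counts[i] >= 0);
        displs[i] = total;     /* block i starts where block i-1 ended */
        total += counts[i];
    }
    return total;
}
```

For counts {1,2,3} this yields displacements {0,1,3} and a total of 6, matching the six-element send buffer in the example.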

MPI_GATHER

• Usage
    int MPI_Gather( void*        sendbuf,   /* in */
                    int          sendcount, /* in */
                    MPI_Datatype sendtype,  /* in */
                    void*        recvbuf,   /* out */
                    int          recvcount, /* in */
                    MPI_Datatype recvtype,  /* in */
                    int          root,      /* in */
                    MPI_Comm     comm );    /* in */

• Description
  – Collects individual messages from each process in communicator at the root process and stores them in rank order

Example of MPI_GATHER

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: irecv holds one element per process. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend, irecv[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    MPI_Gather( &isend, 1, MPI_INT, irecv, 1, MPI_INT, 0, MPI_COMM_WORLD );

    if (rank == 0) {
        for (i = 0; i < 3; i++)
            printf( "irecv = %d\n", irecv[i] );
    }

    MPI_Finalize();
    return 0;
}

MPI_GATHERV

• Usage
    int MPI_Gatherv( void*        sendbuf,    /* in */
                     int          sendcount,  /* in */
                     MPI_Datatype sendtype,   /* in */
                     void*        recvbuf,    /* out */
                     int*         recvcounts, /* in */
                     int*         displs,     /* in */
                     MPI_Datatype recvtype,   /* in */
                     int          root,       /* in */
                     MPI_Comm     comm );     /* in */

• Description
  – Collects individual messages from each process in communicator at the root process and stores them in rank order
  – Messages can have different sizes and displacements

Example of MPI_GATHERV

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: rank r contributes r + 1 elements. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv[6];
    int ircnt[3] = {1,2,3}, idisp[3] = {0,1,3};
    int iscnt;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for (i = 0; i < rank + 1; i++)
        isend[i] = rank + 1;
    iscnt = rank + 1;

    MPI_Gatherv( isend, iscnt, MPI_INT,
                 irecv, ircnt, idisp, MPI_INT, 0, MPI_COMM_WORLD );

    if (rank == 0) {
        for (i = 0; i < 6; i++)
            printf( "irecv = %d\n", irecv[i] );
    }

    MPI_Finalize();
    return 0;
}

MPI_ALLGATHER

• Usage
    int MPI_Allgather( void*        sendbuf,   /* in */
                       int          sendcount, /* in */
                       MPI_Datatype sendtype,  /* in */
                       void*        recvbuf,   /* out */
                       int          recvcount, /* in */
                       MPI_Datatype recvtype,  /* in */
                       MPI_Comm     comm );    /* in */

• Description
  – Gathers individual messages from each process in communicator and distributes the resulting message to each process
  – Similar to MPI_GATHER except that all processes receive the result

Example of MPI_ALLGATHER

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: irecv holds one element per process. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend, irecv[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    MPI_Allgather( &isend, 1, MPI_INT, irecv, 1, MPI_INT, MPI_COMM_WORLD );
    for (i = 0; i < 3; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}

MPI_ALLGATHERV

• Usage
    int MPI_Allgatherv( void*        sendbuf,    /* in */
                        int          sendcount,  /* in */
                        MPI_Datatype sendtype,   /* in */
                        void*        recvbuf,    /* out */
                        int*         recvcounts, /* in */
                        int*         displs,     /* in */
                        MPI_Datatype recvtype,   /* in */
                        MPI_Comm     comm );     /* in */

• Description
  – Collects individual messages from each process in communicator and distributes the resulting message to all processes
  – Messages can have different sizes and displacements

Example of MPI_ALLGATHERV

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: rank r contributes r + 1 elements. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv[6];
    int ircnt[3] = {1,2,3}, idisp[3] = {0,1,3};
    int iscnt;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for (i = 0; i < rank + 1; i++)
        isend[i] = rank + 1;
    iscnt = rank + 1;

    MPI_Allgatherv( isend, iscnt, MPI_INT,
                    irecv, ircnt, idisp, MPI_INT, MPI_COMM_WORLD );

    for (i = 0; i < 6; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}

MPI_ALLTOALL

• Usage
    int MPI_Alltoall( void*        sendbuf,   /* in */
                      int          sendcount, /* in */
                      MPI_Datatype sendtype,  /* in */
                      void*        recvbuf,   /* out */
                      int          recvcount, /* in */
                      MPI_Datatype recvtype,  /* in */
                      MPI_Comm     comm );    /* in */

• Description
  – Sends a distinct message from each process to every other process
  – The j-th block of data sent from process i is received by process j and placed in the i-th block of the buffer recvbuf

Example of MPI_ALLTOALL

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: each buffer holds one element per process. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for (i = 0; i < nproc; i++)
        isend[i] = i + nproc * rank;

    MPI_Alltoall( isend, 1, MPI_INT, irecv, 1, MPI_INT, MPI_COMM_WORLD );
    for (i = 0; i < 3; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
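The block mapping of MPI_ALLTOALL (block j of sender i lands in block i of receiver j) amounts to a matrix transpose across the processes. The following serial sketch, which makes no MPI calls, models this with one element per block. The function name simulate_alltoall and the row-major layout (row r standing in for the buffer of rank r) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* send and recv are nproc x nproc matrices stored row-major:
   row r plays the role of the send or receive buffer of rank r,
   with one element per block. */
void simulate_alltoall(const int *send, int *recv, int nproc)
{
    assert(send != NULL && recv != NULL && nproc > 0);
    for (int i = 0; i < nproc; i++)          /* sending rank   */
        for (int j = 0; j < nproc; j++)      /* receiving rank */
            recv[j * nproc + i] = send[i * nproc + j];
}
```

With the values from the example on three processes (rank r sends i + 3 * r in block i), rank 0 ends up with 0, 3, 6, rank 1 with 1, 4, 7, and rank 2 with 2, 5, 8.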

MPI_ALLTOALLV

• Usage
    int MPI_Alltoallv( void*        sendbuf,    /* in */
                       int*         sendcounts, /* in */
                       int*         sdispls,    /* in */
                       MPI_Datatype sendtype,   /* in */
                       void*        recvbuf,    /* out */
                       int*         recvcounts, /* in */
                       int*         rdispls,    /* in */
                       MPI_Datatype recvtype,   /* in */
                       MPI_Comm     comm );     /* in */

• Description
  – Sends a distinct message from each process to every process
  – Messages can have different sizes and displacements

MPI_ALLTOALLV (Cont’d)

• Description (Cont’d)
  – The type signature associated with sendcounts[j], sendtype at process i must be equal to the type signature associated with recvcounts[i], recvtype at process j

Example of MPI_ALLTOALLV

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: rank r receives r + 1 elements from each process. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[6] = {1,2,2,3,3,3}, irecv[9];
    int iscnt[3] = {1,2,3}, isdsp[3] = {0,1,3}, ircnt[3], irdsp[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for (i = 0; i < 6; i++)
        isend[i] = isend[i] + nproc * rank;

    for (i = 0; i < nproc; i++) {
        ircnt[i] = rank + 1;
        irdsp[i] = i * (rank + 1);
    }

    MPI_Alltoallv( isend, iscnt, isdsp, MPI_INT,
                   irecv, ircnt, irdsp, MPI_INT, MPI_COMM_WORLD );

    for (i = 0; i < iscnt[rank] * nproc; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}

MPI_REDUCE

• Usage
    int MPI_Reduce( void*        sendbuf,  /* in */
                    void*        recvbuf,  /* out */
                    int          count,    /* in */
                    MPI_Datatype datatype, /* in */
                    MPI_Op       op,       /* in */
                    int          root,     /* in */
                    MPI_Comm     comm );   /* in */

• Description
  – Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and places the result in recvbuf on root

MPI_REDUCE (Cont’d)

• Description (Cont’d)
  – Both the input and output buffers have the same number of elements with the same type
  – Users may define their own operations or use the predefined operations provided by MPI

• Predefined operations
  – MPI_SUM, MPI_PROD
  – MPI_MAX, MPI_MIN
  – MPI_MAXLOC, MPI_MINLOC
  – MPI_LAND, MPI_LOR, MPI_LXOR
  – MPI_BAND, MPI_BOR, MPI_BXOR
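Most of the predefined operations are self-explanatory, but MPI_MAXLOC and MPI_MINLOC work on (value, index) pairs. As a rough serial sketch of what MPI_MAXLOC computes when each process contributes one value paired with its own rank: the result is the largest value together with the lowest rank that holds it. The ValueRank type and both function names below are hypothetical, and no MPI calls are made.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int value; int rank; } ValueRank;

/* Combines two pairs the way MPI_MAXLOC does: the larger value wins,
   and on a tie the pair with the smaller index wins. */
ValueRank maxloc_combine(ValueRank a, ValueRank b)
{
    if (a.value > b.value) return a;
    if (b.value > a.value) return b;
    return (a.rank < b.rank) ? a : b;
}

/* Reduces values[0..nproc-1], where values[r] is the contribution
   of rank r, to a single (max value, owning rank) pair. */
ValueRank reduce_maxloc(const int *values, int nproc)
{
    assert(values != NULL && nproc > 0);
    ValueRank acc = { values[0], 0 };
    for (int r = 1; r < nproc; r++) {
        ValueRank next = { values[r], r };
        acc = maxloc_combine(acc, next);
    }
    return acc;
}
```

For contributions 3, 9, 9, 1 from ranks 0 to 3, the result is the value 9 with rank 1. MPI_MINLOC is the mirror image with the smaller value winning.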

Example of MPI_REDUCE

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isend, irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    /* root is rank 0: only rank 0 receives the sum */
    MPI_Reduce( &isend, &irecv, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
    if (rank == 0)
        printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}

MPI_REDUCE for scalars

MPI_REDUCE for arrays

MPI_ALLREDUCE

• Usage
    int MPI_Allreduce( void*        sendbuf,  /* in */
                       void*        recvbuf,  /* out */
                       int          count,    /* in */
                       MPI_Datatype datatype, /* in */
                       MPI_Op       op,       /* in */
                       MPI_Comm     comm );   /* in */

• Description
  – Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and places the result in recvbuf on all the processes in communicator
  – This routine is similar to MPI_REDUCE except that the result is returned to the receive buffer of all the group members

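Example of MPI_ALLREDUCE

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isend, irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    MPI_Allreduce( &isend, &irecv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}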

MPI_SCAN

• Usage
    int MPI_Scan( void*        sendbuf,  /* in */
                  void*        recvbuf,  /* out */
                  int          count,    /* in */
                  MPI_Datatype datatype, /* in */
                  MPI_Op       op,       /* in */
                  MPI_Comm     comm );   /* in */

• Description
  – Performs a parallel prefix reduction on data distributed across a group
  – The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0…i

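Example of MPI_SCAN

#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isend, irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    MPI_Scan( &isend, &irecv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}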
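What each rank receives from MPI_SCAN with MPI_SUM can be worked out with a serial loop. The sketch below makes no MPI calls; simulate_scan_sum is a hypothetical name, and send[r] and recv[r] stand in for the one-element send and receive buffers of rank r.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive prefix sum: recv[r] becomes send[0] + ... + send[r],
   which is what rank r would receive from a sum scan. */
void simulate_scan_sum(const int *send, int *recv, int nproc)
{
    int running = 0;
    assert(send != NULL && recv != NULL && nproc >= 0);
    for (int r = 0; r < nproc; r++) {
        running += send[r];   /* includes rank r's own contribution */
        recv[r] = running;
    }
}
```

With isend = rank + 1 on three processes, as in the example, ranks 0, 1 and 2 would print 1, 3 and 6 respectively. The last rank's result equals what MPI_REDUCE with MPI_SUM would deliver to its root.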

MPI_REDUCE_SCATTER

• Usage
    int MPI_Reduce_scatter( void*        sendbuf,    /* in */
                            void*        recvbuf,    /* out */
                            int*         recvcounts, /* in */
                            MPI_Datatype datatype,   /* in */
                            MPI_Op       op,         /* in */
                            MPI_Comm     comm );     /* in */

• Description
  – Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and scatters the result according to the values in recvcounts
  – Functionally equivalent to MPI_REDUCE with count equal to the sum of recvcounts[i], followed by MPI_SCATTERV with sendcounts equal to recvcounts

Example of MPI_REDUCE_SCATTER

#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: rank r receives ircnt[r] elements of the result. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[6], irecv[3];
    int ircnt[3] = {1,2,3};

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for (i = 0; i < 6; i++)
        isend[i] = 1 + rank * 10;

    MPI_Reduce_scatter( isend, irecv, ircnt, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    for (i = 0; i < ircnt[rank]; i++)   /* only ircnt[rank] elements are valid */
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
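The "reduce, then scatterv" equivalence stated in the description can be traced with a serial sketch. It makes no MPI calls; simulate_reduce_scatter_sum is a hypothetical name, and the row-major matrix send (row r standing in for the send buffer of rank r) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* send has nproc rows of `total` ints, where total is the sum of
   recvcounts. Computes the element-wise sum over all rows and copies
   the slice that belongs to `rank` into out. Returns the number of
   elements written, which is recvcounts[rank]. */
int simulate_reduce_scatter_sum(const int *send, int nproc,
                                const int *recvcounts, int rank, int *out)
{
    int total = 0, offset = 0;
    assert(send != NULL && recvcounts != NULL && out != NULL);
    assert(nproc > 0 && rank >= 0 && rank < nproc);

    for (int r = 0; r < nproc; r++) {
        if (r < rank) offset += recvcounts[r];  /* start of this rank's slice */
        total += recvcounts[r];
    }
    for (int k = 0; k < recvcounts[rank]; k++) {
        int sum = 0;
        for (int r = 0; r < nproc; r++)         /* the "reduce" step */
            sum += send[r * total + offset + k];
        out[k] = sum;                           /* the "scatterv" step */
    }
    return recvcounts[rank];
}
```

With the example's data on three processes, every element of the reduced vector is 1 + 11 + 21 = 33, so rank 0 receives one 33, rank 1 two, and rank 2 three.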

MPI_OP_CREATE

• Usage
    int MPI_Op_create( MPI_User_function* function, /* in */
                       int                commute,  /* in */
                       MPI_Op*            op );     /* out */

• Description
  – Binds a user-defined reduction operation to an op handle
  – The handle can then be used in MPI_REDUCE, MPI_ALLREDUCE, MPI_REDUCE_SCATTER and MPI_SCAN
  – If commute is true, then the operation must be both commutative and associative
  – MPI_User_function
    • typedef void MPI_User_function( void* invec, void* inoutvec, int* len, MPI_Datatype* datatype );

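Example of MPI_OP_CREATE

#include "mpi.h"

/* Two single-precision reals, matching the layout of MPI_COMPLEX */
typedef struct { float re, im; } COMPLEX;

/* User-defined reduction: element-wise complex sum */
void my_sum( void* cin, void* cinout, int* len, MPI_Datatype* type )
{
    COMPLEX* in    = (COMPLEX*) cin;
    COMPLEX* inout = (COMPLEX*) cinout;
    int i;

    for (i = 0; i < *len; i++) {
        inout[i].re = inout[i].re + in[i].re;
        inout[i].im = inout[i].im + in[i].im;
    }
}

int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    MPI_Op isum;
    COMPLEX c[2], csum[2];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for (i = 0; i < 2; i++) {        /* give c a defined value */
        c[i].re = (float) rank;
        c[i].im = (float) i;
    }

    MPI_Op_create( my_sum, 1, &isum );   /* 1 = commutative */
    MPI_Reduce( c, csum, 2, MPI_COMPLEX, isum, 0, MPI_COMM_WORLD );

    MPI_Finalize();
    return 0;
}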

MPI_BARRIER

• Usage
    int MPI_Barrier( MPI_Comm comm ); /* in */

• Description
  – Blocks each process in communicator until all processes have called it

Hello World: C Code

The easiest way to see exactly how a parallel code is put together and run is to write the classic "Hello World" program in parallel. In this case it simply means that every PE will say hello to us. Let's take a look at the code to do this.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv){
    int my_PE_num;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
    printf("Hello from %d.\n", my_PE_num);
    MPI_Finalize();
    return 0;
}

Hello World: Fortran Code

      program shifter
      include 'mpif.h'

      integer my_pe_num, errcode

      call MPI_INIT(errcode)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)
      print *, 'Hello from ', my_pe_num, '.'
      call MPI_FINALIZE(errcode)
      end

Output

Hello from 5.
Hello from 3.
Hello from 1.
Hello from 2.
Hello from 7.
Hello from 0.
Hello from 6.
Hello from 4.

There are two issues here that may not have been expected. The most obvious is that the output might seem out of order. The response to that is "what order were you expecting?" Remember, the code was started on all nodes practically simultaneously. There was no reason to expect one node to finish before another. Indeed, if we rerun the code we will probably get a different order. Sometimes it may seem that there is a very repeatable order. But, one important rule of parallel computing is don't assume that there is any particular order to events unless there is something to guarantee it. Later on we will see how we could force a particular order on this output.

Master and Slave PEs

The much more common case is to have a single PE that is used for some sort of coordination purpose, while the other PEs run code that is the same, although the data will be different. This is how one would implement a master/slave or host/node paradigm.

    if (my_PE_num == 0)
        MasterCodeRoutine
    else
        SlaveCodeRoutine

Of course, the Hello World code above is the trivial case of

    EveryBodyRunThisRoutine

and consequently the only difference will be in the output, as it actually uses the PE number.
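"Same code, different data" is commonly arranged by letting every PE derive its own slice of the work from its PE number. The helper below is one hypothetical way to do that, a block partition of n items over nproc PEs. The name block_range and the choice of a block (rather than cyclic) split are assumptions of this sketch, and nothing in it calls MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the half-open range [*lo, *hi) of the n items owned by
   `rank` when the items are split as evenly as possible over nproc
   PEs. The first n % nproc ranks get one extra item. */
void block_range(int rank, int nproc, int n, int *lo, int *hi)
{
    assert(lo != NULL && hi != NULL);
    assert(nproc > 0 && rank >= 0 && rank < nproc && n >= 0);

    int base  = n / nproc;
    int extra = n % nproc;

    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}
```

Splitting 10 items over 4 PEs gives the ranges [0,3), [3,6), [6,8) and [8,10). In an MPI program, each PE would call this with the rank obtained from MPI_Comm_rank and the size obtained from MPI_Comm_size.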

MPI_COMM_WORLD

In the Hello World program, we see that the first parameter in MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num) is MPI_COMM_WORLD. MPI_COMM_WORLD is known as the "communicator" and can be found in many of the MPI routines. In general, it is used so that one can divide up the PEs into subsets for various algorithmic purposes. For example, if we had an array distributed across the PEs and wished to find its determinant, we might define a subset of the PEs that holds a certain column of the array so that we could address only those PEs conveniently.

However, this is a convenience that can often be dispensed with. As such, one will often see the value MPI_COMM_WORLD used anywhere that a communicator is required. This is simply the global set, and using it states that we don't really care to deal with any particular subset here.

Compiling and Running

Well, now that we may have some idea how the above code will perform, let's compile it and run it to see if it meets our expectations. We compile using a normal ANSI C or Fortran 90 compiler (C++ is also available). While logged in to the cluster master:

For C codes:       cc -lmpi hello.c
For Fortran codes: f90 -lmpi hello.f90

We now have an executable. To run on the Cluster we must tell the machine how many copies we wish to run. You can choose any number. We'll try 8:

Some Clusters use: mpprun -n8 a.out
Others use:        prun -n8 a.out

Where Will The Output Go?

The second issue, although you may have taken it for granted, is "where will the output go?". This is another question that MPI dodges because it is so implementation dependent. On the Cluster, the I/O is structured in about the simplest way possible. All PEs can read and write (files as well as console I/O) through the standard channels. This is very convenient, and in our case results in all of the "standard output" going back to your terminal window on the Cluster.

In general, it can be much more complex. For instance, suppose you were running this on a cluster of 8 workstations. Would the output go to eight separate consoles? Or, in a more typical situation, suppose you wished to write results out to a file:

• With the workstations, you would probably end up with eight separate files on eight separate disks.

• With the Cluster, they can all access the same file simultaneously.

There are some good reasons why you would want to exercise some restraint even on the Cluster. 512 PEs accessing the same file would be extremely inefficient.

Sending and Receiving Messages

Hello world might be illustrative, but we haven't really done any message passing yet.

Let's write the simplest possible message passing program.

It will run on 2 PEs and will send a simple message (the number 42) from PE 1 to PE 0. PE 0 will then print this out.

Sending a Message

Sending a message is a simple procedure. In our case the routine will look like this in C (the standard man pages are in C, so you should get used to seeing this format):

MPI_Send( &numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD)

Sending a Message Cont’d

Let's look at the parameters individually:

&numbertosend: A pointer to whatever we wish to send. In this case it is simply an integer. It could be anything from a character string to a column of an array or a structure. It is even possible to pack several different data types in one message.

1: The number of items we wish to send. If we were sending a vector of 10 ints, we would point to the first one in the above parameter and set this to the size of the array.

MPI_INT: The type of object we are sending. Possible values are: MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED. Most of these are obvious in use. MPI_BYTE will send raw bytes (on a heterogeneous workstation cluster this will suppress any data conversion). MPI_PACKED can be used to pack multiple data types in one message, but it does require a few additional routines we won't go into (those of you familiar with PVM will recognize this).

0: Destination of the message. In this case PE 0.

10: Message tag. All messages have a tag attached to them that can be useful for sorting messages. For example, one could give high-priority control messages a different tag than data messages. When receiving, the program would check for messages that use the control tag first. We just picked 10 at random.

MPI_COMM_WORLD: We don't really care about any subsets of PEs here. So, we just chose this "default".

Receiving a Message

Receiving a message is equally simple. In our case it will look like:MPI_Recv( &numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD, &status)

&numbertoreceive A pointer to the variable that will receive the item. In our case it is simply an integer that has has some undefined value until now.

1 Number of items to receive. Just 1 here.

MPI_INT Datatype. Better be an int, since that's what we sent.

MPI_ANY_SOURCE The node to receive from. We could use 1 here since the message is coming from there, but we'll illustrate the "wild card" method of receiving a message from anywhere.

MPI_ANY_TAG We could use a value of 10 here to filter out any other messages (there aren't any) but, again, this was a convenient place to show how to receive any tag.

MPI_COMM_WORLD Just using default set of all PEs.

&status A structure that receives the status data, which includes the source and tag of the message.

Send and Receive C Code

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv){
    int my_PE_num, numbertoreceive, numbertosend=42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

    if (my_PE_num==0){
        /* PE 0 accepts one message from any source, with any tag. */
        MPI_Recv( &numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("Number received is: %d\n", numbertoreceive);
    }
    else
        /* Every other PE sends its number to PE 0 with tag 10. */
        MPI_Send( &numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Send and Receive Fortran Code

program shifter
implicit none
include 'mpif.h'

integer my_pe_num, errcode, numbertoreceive, numbertosend
integer status(MPI_STATUS_SIZE)

call MPI_INIT(errcode)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)

numbertosend = 42

if (my_PE_num.EQ.0) then
    call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, MPI_ANY_SOURCE, &
                   MPI_ANY_TAG, MPI_COMM_WORLD, status, errcode)
    print *, 'Number received is:', numbertoreceive
endif

if (my_PE_num.EQ.1) then
    call MPI_Send( numbertosend, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, errcode)
endif

call MPI_FINALIZE(errcode)
end

Non-Blocking Receives

All of the receives that we will use are blocking. This means that they will wait until a message matching their requirements for source and tag has been received. It is possible to use non-blocking communications. This means a receive will return immediately and it is up to the code to determine when the data actually arrives using additional routines.

In most cases this additional coding is not worth it in terms of performance and code robustness. However, for certain algorithms this can be useful to keep in mind.

Communication Modes

There are four possible modes (with slightly differently named MPI_Xsend routines) for buffering and sending messages in MPI. We use the standard mode here, and you may find this sufficient for the majority of your needs. However, the other modes can allow for substantial optimization in the right circumstances:

Standard mode (MPI_Send) Send will usually not block even if a receive for that message has not occurred. The exception is when there are resource limitations (buffer space).

Buffered mode (MPI_Bsend) Similar to standard mode, but will never block; it returns an error if the buffer space is insufficient.

Synchronous mode (MPI_Ssend) Will only return when the matching receive has started.

Ready mode (MPI_Rsend) Will only work if the matching receive is already waiting.

Synchronization

We are going to write one more code which will employ the remaining tool that we need for general parallel programming: synchronization. Many algorithms require that you be able to get all of the nodes into some controlled state before proceeding to the next stage. This is usually done with a synchronization point that requires all of the nodes (or at least some specified subset) to reach a certain point before proceeding. Sometimes the manner in which messages block will achieve this same result implicitly, but it is often necessary to do this explicitly. Debugging is also often greatly aided by inserting synchronization points, which are later removed for the sake of efficiency.

Our code will perform the rather pointless operation of having PE 0 send a number to the other 3 PEs and have them multiply that number by their own PE number. They will then print the results out (in order, remember the hello world program?) and send them back to PE 0 which will print out the sum.

Synchronization: C Code

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv){
    int my_PE_num, numbertoreceive, numbertosend=4, index, result=0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

    /* PE 0 sends the number to PEs 1-3; each multiplies it by its own rank. */
    if (my_PE_num==0){
        for (index=1; index<4; index++)
            MPI_Send( &numbertosend, 1, MPI_INT, index, 10, MPI_COMM_WORLD);
    }
    else{
        MPI_Recv( &numbertoreceive, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
        result = numbertoreceive * my_PE_num;
    }

    /* One barrier per turn, intended to make the PEs print in rank order. */
    for (index=1; index<4; index++){
        MPI_Barrier(MPI_COMM_WORLD);
        if (index==my_PE_num)
            printf("PE %d's result is %d.\n", my_PE_num, result);
    }

    /* PE 0 collects the results and prints their sum. */
    if (my_PE_num==0){
        for (index=1; index<4; index++){
            MPI_Recv( &numbertoreceive, 1, MPI_INT, index, 10, MPI_COMM_WORLD, &status);
            result += numbertoreceive;
        }
        printf("Total is %d.\n", result);
    }
    else
        MPI_Send( &result, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Synchronization: Fortran Code

program shifter
implicit none
include 'mpif.h'

integer my_pe_num, errcode, numbertoreceive, numbertosend
integer index, result
integer status(MPI_STATUS_SIZE)

call MPI_INIT(errcode)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)

numbertosend = 4
result = 0

if (my_PE_num.EQ.0) then
    do index=1,3
        call MPI_Send( numbertosend, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, errcode)
    enddo
else
    call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, status, errcode)
    result = numbertoreceive * my_PE_num
endif

do index=1,3
    call MPI_Barrier(MPI_COMM_WORLD, errcode)
    if (my_PE_num.EQ.index) then
        print *, 'PE ',my_PE_num,'''s result is ',result,'.'
    endif
enddo

if (my_PE_num.EQ.0) then
    do index=1,3
        call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, status, errcode)
        result = result + numbertoreceive
    enddo
    print *,'Total is ',result,'.'
else
    call MPI_Send( result, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, errcode)
endif

call MPI_FINALIZE(errcode)
end

Results of “Synchronization”

The output you get when running this code with 4 PEs (what will happen if you run with more or fewer?) is the following:

PE 1's result is 4.

PE 2's result is 8.

PE 3's result is 12.

Total is 24.

Analysis of “Synchronization”

The best way to make sure that you understand what is happening in the code above is to look at things from the perspective of each PE in turn. THIS IS THE WAY TO DEBUG ANY MESSAGE-PASSING (or MIMD) CODE.

Follow from the top to the bottom of the code as PE 0, and do likewise for PE 1. See exactly where one PE is dependent on another to proceed. Look at each PE's progress as though it is 100 times faster or slower than the other nodes. Would this affect the final program flow? It shouldn't, unless you made assumptions that are not always valid.
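One way to check a hand trace is to write down a serial model of the arithmetic and compare it with what the parallel run prints. The sketch below does that for the "Synchronization" program. It involves no MPI at all, and the function name model_synchronization is hypothetical, introduced only for this illustration.

```c
/* Serial model of the "Synchronization" program (no MPI involved).
 * PE 0 sends `numbertosend` to PEs 1..nworkers, PE k computes
 * numbertosend * k, and PE 0 adds up the values that come back.
 * results[k-1] receives PE k's value; the return value is PE 0's total. */
int model_synchronization(int numbertosend, int nworkers, int results[])
{
    int total = 0;                            /* PE 0 starts with result = 0 */
    for (int pe = 1; pe <= nworkers; pe++) {
        results[pe - 1] = numbertosend * pe;  /* the value PE `pe` prints */
        total += results[pe - 1];             /* what PE 0 accumulates */
    }
    return total;
}
```

Called with numbertosend = 4 and nworkers = 3, the model gives the per-PE values and the total that the 4-PE run should report. Because the model has no timing at all, any difference between it and the parallel output points at an ordering or matching assumption in the message-passing code.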

Reduction

MPI_Reduce: Reduces values on all processes to a single value.

Synopsis

#include "mpi.h"
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm);

Reduction Cont'd

Input Parameters:

sendbuf address of send buffer

count number of elements in send buffer (integer)

datatype data type of elements of send buffer (handle)

op reduce operation (handle)

root rank of root process (integer)

comm communicator (handle)

Output Parameter:

recvbuf address of receive buffer (choice, significant only at root)

Algorithm: This implementation currently uses a simple tree algorithm.
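The note above says only that a "simple tree algorithm" is used. As a rough illustration of what a tree reduction looks like, the sketch below simulates a binomial-tree sum in plain C, with one array slot standing in for each process. The binomial pattern and the function name tree_reduce_sum are assumptions made for this example, not a description of any particular MPI implementation.

```c
/* Sum-reduce vals[0..nprocs-1] onto "rank" 0 using a binomial-tree pattern.
 * In round stride = 1, 2, 4, ... every rank r that is a multiple of
 * 2*stride absorbs the partial result held by rank r + stride, if that
 * rank exists. Returns the number of rounds; the total ends up in vals[0]. */
int tree_reduce_sum(double vals[], int nprocs)
{
    int rounds = 0;
    for (int stride = 1; stride < nprocs; stride *= 2) {
        for (int r = 0; r + stride < nprocs; r += 2 * stride)
            vals[r] += vals[r + stride];   /* stands in for one message */
        rounds++;
    }
    return rounds;
}
```

With 8 simulated processes the total reaches slot 0 after 3 rounds, compared with the 7 sequential receives a root would need if every process sent to it directly, as PE 0 does in the synchronization example.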

Finding Pi

Our last example will find the value of pi by integrating 4/(1 + x^2) from 0 to 1.

The integrand is the derivative of 4*arctan(x), so the integral is 4*arctan(1) = pi. The master process (0) will query for a number of intervals to use, and then broadcast this number to all of the other processors.

With n intervals and size processors, each processor will then add up every size-th interval, evaluating the integrand at the midpoints x = (rank + 0.5)/n, (rank + 0.5 + size)/n, and so on.

Finally, the sums computed by each processor are added together using a new type of MPI operation, a reduction.
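The decomposition can be checked without MPI by computing each rank's partial sum in an ordinary loop and adding the pieces up. The sketch below mirrors the midpoint rule on [0, 1] and the cyclic assignment of intervals used by the Fortran loop. The function names partial_pi and combined_pi are hypothetical, and double precision is used here where the Fortran code uses real.

```c
/* Partial midpoint-rule sum for the integral of 4/(1+x^2) over [0,1],
 * restricted to the intervals one rank owns under cyclic distribution:
 * rank+1, rank+1+size, rank+1+2*size, ... up to n. */
double partial_pi(int rank, int size, int n)
{
    double h = 1.0 / n;
    double sum = 0.0;
    for (int index = rank + 1; index <= n; index += size) {
        double x = h * (index - 0.5);   /* midpoint of interval `index` */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Stand-in for MPI_Reduce with MPI_SUM: add every rank's partial result. */
double combined_pi(int size, int n)
{
    double pi = 0.0;
    for (int rank = 0; rank < size; rank++)
        pi += partial_pi(rank, size, n);
    return pi;
}
```

For a fixed n, the combined value should not depend on the number of ranks beyond floating-point rounding, since every interval is owned by exactly one rank. Comparing combined_pi(1, n) with combined_pi(4, n) is a quick way to confirm that.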

Finding Pi

program FindPI
implicit none
include 'mpif.h'

integer n, my_pe_num, numprocs, index, errcode
real mypi, pi, h, sum, x

call MPI_Init(errcode)
call MPI_Comm_size(MPI_COMM_WORLD, numprocs, errcode)
call MPI_Comm_rank(MPI_COMM_WORLD, my_pe_num, errcode)

if (my_pe_num.EQ.0) then
    print *,'How many intervals?:'
    read *, n
endif

call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, errcode)

h = 1.0 / n
sum = 0.0

do index = my_pe_num+1, n, numprocs
    x = h * (index - 0.5)
    sum = sum + 4.0 / (1.0 + x*x)
enddo
mypi = h * sum

call MPI_Reduce(mypi, pi, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, errcode)

if (my_pe_num.EQ.0) then
    print *,'pi is approximately ',pi
    print *,'Error is ',pi-3.14159265358979323846
endif

call MPI_Finalize(errcode)
end

Do Not Make Any Assumptions

Do not make any assumptions about the mechanics of the actual message-passing. Remember that MPI is designed to operate not only on fast MPP networks, but also on Internet-size meta-computers. As such, the order and timing of messages may be considerably skewed.

MPI makes only one guarantee: two messages sent from one process to another process will arrive in that relative order. However, a message sent later from another process may arrive before, or between, those two messages.

References

There is a wide variety of material available on the Web, some of which is intended to be used as hardcopy manuals and tutorials.

http://www.psc.edu/htbin/software_by_category.pl/hetero_software

You may wish to start at one of the MPI home pages at

http://www.mcs.anl.gov/Projects/mpi/index.html

from which you can find a lot of useful information without traveling too far. To learn the syntax of MPI calls, access the index for the Message Passing Interface Standard at:

http://www-unix.mcs.anl.gov/mpi/www/

Books:

• Parallel Programming with MPI. Peter S. Pacheco. San Francisco: Morgan Kaufmann Publishers, Inc., 1997.

• PVM: a users' guide and tutorial for networked parallel computing. Al Geist, Adam Beguelin, Jack Dongarra et al. MIT Press, 1996.

• Using MPI: portable parallel programming with the message-passing interface. William Gropp, Ewing Lusk, Anthony Skjellum. MIT Press, 1996.
