lab 7 -parallel programming models -mpi continued
Post on 29-Dec-2015
LAB 7
-Parallel Programming Models-MPI Continued
Parallel Programming Models
• Programming Models
• MPI Programming
Fundamental Design Issues
• Design of user-system and hardware-software interface
  – Proposed by Culler et al.
• Functional Issues:
  – Naming: how are logically shared data referenced?
  – Operations: what operations are provided on shared data?
  – Ordering: how are accesses to data ordered and coordinated?
• Performance Issues:
  – Replication: how are data replicated to reduce communication?
  – Communication Cost: latency, bandwidth, overhead
• Others
  – Node granularity
Sequential Programming Model
• Functional Issues
  – Naming: Can name any variable in virtual address space
    • Hardware (and perhaps compilers) does translation to physical addresses
  – Operations: Loads and Stores
  – Ordering: Sequential program order
• Performance
  – Rely on dependences on single location (mostly): dependence order
  – Compilers and hardware violate other orders without getting caught
  – Compiler: reordering and register allocation
  – Hardware: out of order, pipeline bypassing, write buffers
  – Transparent replication in caches
SAS Programming Model
[Figure: several threads (processes) access a shared variable X in the system's shared memory with read(X) and write(X).]
Shared Address Space Programming Model
• Naming
  – Any process can name any variable in shared space
• Operations
  – Loads and stores, plus those needed for ordering
• Simplest Ordering Model
  – Within a process/thread: sequential program order
  – Across threads: some interleaving (as in time-sharing)
  – Additional orders through synchronization
  – Again, compilers/hardware can violate orders without getting caught
Synchronization
• Mutual exclusion (locks)
  – Ensure certain operations on certain data can be performed by only one process at a time
  – Room that only one person can enter at a time
  – No ordering guarantees
• Event synchronization
  – Ordering of events to preserve dependences
  – e.g. producer -> consumer of data
  – 3 main types:
    • point-to-point
    • global
    • group
MP Programming Model
[Figure: a process on Node A executes send(Y); the message travels to a process on Node B, which executes receive(Y') and stores the data in its own memory.]
Message Passing Programming Model
• Naming
  – Processes can name private data directly.
  – No shared address space
• Operations
  – Explicit communication: "send" and "receive"
  – "Send" transfers data from private address space to another process
  – "Receive" copies data from process to private address space
• Ordering
  – Program order within a process
  – Send and receive can provide pt-to-pt synch between processes
  – Mutual exclusion inherent
• Can construct global address space
  – Process number + address within process address space
  – But no direct operations on these names
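The last bullet can be made concrete with a small sketch. It assumes a block distribution of a global array in which every process holds the same number of consecutive elements; the distribution, the struct `global_addr`, and the helper `locate` are illustrative names and are not part of any message-passing library. A global index then maps to a (process number, local offset) pair. Such a name cannot be loaded or stored directly: the owning process still has to send the value.

```c
#include <assert.h>

/* Hypothetical global name: which process owns the element, and where
   it sits inside that process's private array. */
typedef struct {
    int rank;    /* owning process number */
    int offset;  /* index within the owner's local array */
} global_addr;

/* Map a global index to its owner under a block distribution in which
   every process holds `block` consecutive elements (an assumption of
   this sketch). */
global_addr locate(int global_index, int block)
{
    assert(global_index >= 0 && block > 0);
    global_addr a;
    a.rank = global_index / block;
    a.offset = global_index % block;
    return a;
}
```

For example, with 4 elements per process, global element 7 lives on process 1 at local offset 3.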
Message Passing Basics
What is MPI? The Message-Passing Interface Standard (MPI) is a library that allows you to solve problems in parallel using message passing to communicate between processes.
• Library
  It is not a language (like FORTRAN 90, C or HPF) or even an extension to a language. Instead, it is a library that your native, standard, serial compiler (f77, f90, cc, CC) uses.
• Message Passing
  Message passing is sometimes referred to as a paradigm itself. But it is really just a method of passing data between processes that is flexible enough to implement most paradigms (Data Parallel, Work Sharing, etc.).
• Communicate
  This communication may be via a dedicated MPP torus network, or merely an office LAN. To the MPI programmer, it looks much the same.
• Processes
  These can be 512 PEs on a T3E, or 4 processes on a single workstation.
Basic MPI
In order to do parallel programming, you require some basic functionality, namely, the ability to:
• Start Processes
• Send Messages
• Receive Messages
• Synchronize
With these four capabilities, you can construct any program. We will look at the basic versions of the MPI routines that implement this. Of course, MPI offers over 125 functions. Many of these are more convenient and efficient for certain tasks. However, with what we learn here, we will be able to implement just about any algorithm. Moreover, the vast majority of MPI codes are built using primarily these routines.
Starting Processes on the Cluster
On the Clusters, the fundamental control of processes is fairly simple. There is always one process for each PE that your code is running on. At run time, you specify how many PEs you require and then your code is copied to each PE and run simultaneously. In other words, a 512 PE Cluster code has 512 copies of the same code running on it from start to finish.
At first the idea that the same code must run on every node seems very limiting. We'll see in a bit that this is not at all the case.
MPI Programming Fundamentals
• Environment
• Point-to-point communication
• Collective communication
• Derived data type
• Group management
MPI Terms
• Blocking
  – A procedure is blocking if return from it indicates that the user is allowed to reuse resources specified in the call
• Non-blocking
  – A procedure is non-blocking if it may return before the operation completes, and before the user is allowed to reuse resources specified in the call
• Collective
  – A procedure is collective if all processes in a process group need to invoke it
• Message envelope
  – Information used to distinguish messages and selectively receive them
  – <source, destination, tag, communicator>
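How the envelope is used to receive selectively can be sketched in plain C. This is a model of the matching rule only, not MPI's implementation: the `envelope` struct, the `envelope_matches` function, and the two wildcard constants are hypothetical stand-ins (real programs use MPI_ANY_SOURCE and MPI_ANY_TAG from mpi.h), and the communicator is reduced to a plain integer id.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG; the real constants are
   defined by mpi.h and their values are implementation specific. */
enum { ANY_SOURCE = -1, ANY_TAG = -2 };

/* The parts of the envelope a receive can select on. The destination is
   implicit: it is the process that posts the receive. */
typedef struct {
    int source;  /* rank of the sender */
    int tag;     /* user-chosen message label */
    int comm;    /* communicator id (a plain int in this sketch) */
} envelope;

/* Return true if a receive posted with (source, tag, comm) may accept a
   message carrying envelope `msg`. */
bool envelope_matches(int source, int tag, int comm, envelope msg)
{
    /* A message that was actually sent never carries wildcards. */
    assert(msg.source >= 0 && msg.tag >= 0);
    if (comm != msg.comm)
        return false;                 /* different contexts never mix */
    if (source != ANY_SOURCE && source != msg.source)
        return false;
    if (tag != ANY_TAG && tag != msg.tag)
        return false;
    return true;
}
```

A receive that names source 1 and tag 10 accepts only that message, while one posted with both wildcards accepts any message sent in the same communicator.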
MPI Terms (Cont’d)
• Communicator
  – The communication context for a communication operation
  – Messages are always received within the context they were sent
  – Messages sent in different contexts do not interfere
  – MPI_COMM_WORLD
• Process group
  – The communicator specifies the set of processes that share this communication context.
  – This process group is ordered and processes are identified by their rank within this group
MPI Environment Calls
• MPI_INIT
• MPI_COMM_SIZE
• MPI_COMM_RANK
• MPI_FINALIZE
• MPI_ABORT
MPI_INIT
• Usage
  – int MPI_Init( int* argc_ptr,       /* in */
                  char*** argv_ptr );  /* in */
• Description
  – Initialize MPI
  – All MPI programs must call this routine once and only once before any other MPI routines
MPI_COMM_SIZE
• Usage
  – int MPI_Comm_size( MPI_Comm comm,  /* in */
                       int* size );    /* out */
• Description
  – Return the number of processes in the group associated with a communicator
MPI_COMM_RANK
• Usage
  – int MPI_Comm_rank( MPI_Comm comm,  /* in */
                       int* rank );    /* out */
• Description
  – Returns the rank of the local process in the group associated with a communicator
  – The rank of the process that calls it is in the range from 0 … size - 1
MPI_FINALIZE
• Usage
  – int MPI_Finalize( void );
• Description
  – Terminates all MPI processing
  – Make sure this routine is the last MPI call.
  – All pending communications involving a process must have completed before the process calls MPI_FINALIZE
MPI_ABORT
• Usage
  – int MPI_Abort( MPI_Comm comm,    /* in */
                   int errorcode );  /* in */
• Description
  – Forces all processes of an MPI job to terminate
Too Simple Program
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank;
    int nproc;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* your code goes here */

    MPI_Finalize();
    return 0;
}
Point-To-Point Communication
• MPI_SEND
• MPI_RECV
• MPI_ISEND
• MPI_IRECV
• MPI_WAIT
• MPI_GET_COUNT
4 Communication Modes in MPI
• Standard mode
  – It is up to MPI to decide whether outgoing messages will be buffered
  – Non-local operation
  – Buffered or synchronous?
• Buffered (asynchronous) mode
  – A send operation can be started whether or not a matching receive has been posted
  – It may complete before a matching receive is posted
  – Local operation
4 Communication Modes in MPI (Cont’d)
• Synchronous mode
  – A send operation can be started whether or not a matching receive was posted
  – The send will complete successfully only if a matching receive was posted and the receive operation has started to receive the message
  – The completion of a synchronous send not only indicates that the send buffer can be reused but also indicates that the receiver has reached a certain point in its execution
  – Non-local operation
4 Communication Modes in MPI (Cont’d)
• Ready mode
  – A send operation may be started only if the matching receive is already posted
  – The completion of the send operation does not depend on the status of a matching receive and merely indicates the send buffer can be reused
  – EAGER_LIMIT of SP system
MPI_SEND
• Usage
  – int MPI_Send( void* buf,              /* in */
                  int count,              /* in */
                  MPI_Datatype datatype,  /* in */
                  int dest,               /* in */
                  int tag,                /* in */
                  MPI_Comm comm );        /* in */
• Description
  – Performs a blocking standard mode send operation
  – The message can be received by either MPI_RECV or MPI_IRECV
MPI_RECV
• Usage
  – int MPI_Recv( void* buf,              /* out */
                  int count,              /* in */
                  MPI_Datatype datatype,  /* in */
                  int source,             /* in */
                  int tag,                /* in */
                  MPI_Comm comm,          /* in */
                  MPI_Status* status );   /* out */
• Description
  – Performs a blocking receive operation
  – The length of the message received must be less than or equal to the length of the receive buffer
  – MPI_RECV can receive a message sent by either MPI_SEND or MPI_ISEND
Blocking Operations
Sample Program for Blocking Operations
#include <stdio.h>
#include "mpi.h"

#define TAG 0   /* message tag; any value the sender and receiver agree on */

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isbuf, irbuf;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if(rank == 0) {
        /* rank 0 sends one int to rank 1 */
        isbuf = 9;
        MPI_Send( &isbuf, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD );
    } else if(rank == 1) {
        /* rank 1 blocks until the message from rank 0 has arrived */
        MPI_Recv( &irbuf, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status );
        printf( "%d\n", irbuf );
    }
    MPI_Finalize();
    return 0;
}
MPI_ISEND
• Usage
  – int MPI_Isend( void* buf,               /* in */
                   int count,               /* in */
                   MPI_Datatype datatype,   /* in */
                   int dest,                /* in */
                   int tag,                 /* in */
                   MPI_Comm comm,           /* in */
                   MPI_Request* request );  /* out */
• Description
  – Performs a nonblocking standard mode send operation
  – The send buffer may not be modified until the request has been completed by MPI_WAIT or MPI_TEST
  – The message can be received by either MPI_RECV or MPI_IRECV.
MPI_IRECV
• Usage
  – int MPI_Irecv( void* buf,               /* out */
                   int count,               /* in */
                   MPI_Datatype datatype,   /* in */
                   int source,              /* in */
                   int tag,                 /* in */
                   MPI_Comm comm,           /* in */
                   MPI_Request* request );  /* out */
• Description
  – Performs a nonblocking receive operation
  – Do not access any part of the receive buffer until the receive is complete
  – The length of the message received must be less than or equal to the length of the receive buffer
  – MPI_IRECV can receive a message sent by either MPI_SEND or MPI_ISEND
MPI_WAIT
• Usage
  – int MPI_Wait( MPI_Request* request,  /* inout */
                  MPI_Status* status );  /* out */
• Description
  – Waits for a nonblocking operation to complete
  – Information on the completed operation is found in status.
  – If wildcards were used by the receive for either the source or tag, the actual source and tag can be retrieved by status->MPI_SOURCE and status->MPI_TAG
Non-Blocking Operations
MPI_GET_COUNT
• Usage
  – int MPI_Get_count( MPI_Status* status,     /* in */
                       MPI_Datatype datatype,  /* in */
                       int* count );           /* out */
• Description
  – Returns the number of elements in a message
  – The datatype argument and the argument provided by the call that set the status variable should match
Sample Program for Non-Blocking Operations
#include <stdio.h>
#include "mpi.h"

#define TAG 0   /* message tag; any value the sender and receiver agree on */

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isbuf, irbuf, count;
    MPI_Request request;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if(rank == 0) {
        isbuf = 9;
        MPI_Isend( &isbuf, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD, &request );
        /* the send must be completed before isbuf is reused or MPI is finalized */
        MPI_Wait( &request, &status );
    } else if(rank == 1) {
        MPI_Irecv( &irbuf, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &request );
        /* irbuf may only be read after the receive has completed */
        MPI_Wait( &request, &status );
        MPI_Get_count( &status, MPI_INT, &count );
        printf( "irbuf = %d source = %d tag = %d count = %d\n",
                irbuf, status.MPI_SOURCE, status.MPI_TAG, count );
    }
    MPI_Finalize();
    return 0;
}
Collective Communication
• MPI_BCAST
• MPI_SCATTER
• MPI_SCATTERV
• MPI_GATHER
• MPI_GATHERV
• MPI_ALLGATHER
• MPI_ALLGATHERV
• MPI_ALLTOALL
MPI_BCAST
• Usage
  – int MPI_Bcast( void* buffer,           /* inout */
                   int count,              /* in */
                   MPI_Datatype datatype,  /* in */
                   int root,               /* in */
                   MPI_Comm comm );        /* in */
• Description
  – Broadcasts a message from root to all processes in communicator
  – The type signature of count, datatype on any process must be equal to the type signature of count, datatype at the root
Example of MPI_BCAST
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank;
    int ibuf;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if(rank == 0) ibuf = 12345;
    else ibuf = 0;
    /* after the call every process holds the root's value */
    MPI_Bcast( &ibuf, 1, MPI_INT, 0, MPI_COMM_WORLD );
    printf( "ibuf = %d\n", ibuf );

    MPI_Finalize();
    return 0;
}
MPI_SCATTER
• Usage
  – int MPI_Scatter( void* sendbuf,          /* in */
                     int sendcount,          /* in */
                     MPI_Datatype sendtype,  /* in */
                     void* recvbuf,          /* out */
                     int recvcount,          /* in */
                     MPI_Datatype recvtype,  /* in */
                     int root,               /* in */
                     MPI_Comm comm );        /* in */
• Description
  – Distributes individual messages from root to each process in communicator
  – Inverse operation to MPI_GATHER
Example of MPI_SCATTER
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if(rank == 0) {
        for(i=0; i<nproc; i++)
            isend[i] = i+1;
    }
    /* each process receives one element of the root's array */
    MPI_Scatter( isend, 1, MPI_INT, &irecv, 1, MPI_INT, 0, MPI_COMM_WORLD );
    printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}
MPI_SCATTERV
• Usage
  – int MPI_Scatterv( void* sendbuf,          /* in */
                      int* sendcounts,        /* in */
                      int* displs,            /* in */
                      MPI_Datatype sendtype,  /* in */
                      void* recvbuf,          /* out */
                      int recvcount,          /* in */
                      MPI_Datatype recvtype,  /* in */
                      int root,               /* in */
                      MPI_Comm comm );        /* in */
• Description
  – Distributes individual messages from root to each process in communicator
  – Messages can have different sizes and displacements
Example of MPI_SCATTERV
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int iscnt[3] = {1,2,3}, idisp[3] = {0,1,3};
    int isend[6] = {1,2,2,3,3,3}, irecv[3];
    int ircnt;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* rank r receives r+1 elements, matching iscnt[r] at the root */
    ircnt = rank + 1;
    MPI_Scatterv( isend, iscnt, idisp, MPI_INT,
                  irecv, ircnt, MPI_INT, 0, MPI_COMM_WORLD );
    for(i=0; i<ircnt; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
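The displacement array in the example ({0,1,3} for counts {1,2,3}) is just a running sum of the counts. When the blocks are packed back to back in the send buffer, a small helper can derive it instead of hard-coding it. The helper name `counts_to_displs` is an illustrative choice, not an MPI routine, and the sketch assumes contiguous packing with no gaps between blocks.

```c
#include <assert.h>

/* Fill displs[i] with the offset at which block i starts when blocks of
   counts[i] elements are stored contiguously, one after another. */
void counts_to_displs(const int *counts, int n, int *displs)
{
    assert(n >= 0);
    int offset = 0;
    for (int i = 0; i < n; i++) {
        displs[i] = offset;    /* block i starts where block i-1 ended */
        offset += counts[i];
    }
}
```

The same helper serves MPI_GATHERV and MPI_ALLGATHERV, whose displacement arrays describe the layout of the receive buffer instead.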
MPI_GATHER
• Usage
  – int MPI_Gather( void* sendbuf,          /* in */
                    int sendcount,          /* in */
                    MPI_Datatype sendtype,  /* in */
                    void* recvbuf,          /* out */
                    int recvcount,          /* in */
                    MPI_Datatype recvtype,  /* in */
                    int root,               /* in */
                    MPI_Comm comm );        /* in */
• Description
  – Collects individual messages from each process in communicator to the root process and stores them in rank order
Example of MPI_GATHER
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend, irecv[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    MPI_Gather( &isend, 1, MPI_INT, irecv, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* only the root holds the gathered values */
    if(rank == 0) {
        for(i=0; i<3; i++)
            printf( "irecv = %d\n", irecv[i] );
    }

    MPI_Finalize();
    return 0;
}
MPI_GATHERV
• Usage
  – int MPI_Gatherv( void* sendbuf,          /* in */
                     int sendcount,          /* in */
                     MPI_Datatype sendtype,  /* in */
                     void* recvbuf,          /* out */
                     int* recvcounts,        /* in */
                     int* displs,            /* in */
                     MPI_Datatype recvtype,  /* in */
                     int root,               /* in */
                     MPI_Comm comm );        /* in */
• Description
  – Collects individual messages from each process in communicator to the root process and stores them in rank order
  – Messages can have different sizes and displacements
Example of MPI_GATHERV
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv[6];
    int ircnt[3] = {1,2,3}, idisp[3] = {0,1,3};
    int iscnt;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* rank r contributes r+1 copies of the value r+1 */
    for(i=0; i<rank+1; i++)
        isend[i] = rank + 1;
    iscnt = rank + 1;

    MPI_Gatherv( isend, iscnt, MPI_INT,
                 irecv, ircnt, idisp, MPI_INT, 0, MPI_COMM_WORLD );

    if(rank == 0) {
        for(i=0; i<6; i++)
            printf( "irecv = %d\n", irecv[i] );
    }

    MPI_Finalize();
    return 0;
}
MPI_ALLGATHER
• Usage
  – int MPI_Allgather( void* sendbuf,          /* in */
                       int sendcount,          /* in */
                       MPI_Datatype sendtype,  /* in */
                       void* recvbuf,          /* out */
                       int recvcount,          /* in */
                       MPI_Datatype recvtype,  /* in */
                       MPI_Comm comm );        /* in */
• Description
  – Gathers individual messages from each process in communicator and distributes the resulting message to each process
  – Similar to MPI_GATHER except that all processes receive the result
Example of MPI_ALLGATHER
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend, irecv[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    MPI_Allgather( &isend, 1, MPI_INT, irecv, 1, MPI_INT, MPI_COMM_WORLD );
    /* every process prints the full gathered array */
    for(i=0; i<3; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
MPI_ALLGATHERV
• Usage
  – int MPI_Allgatherv( void* sendbuf,          /* in */
                        int sendcount,          /* in */
                        MPI_Datatype sendtype,  /* in */
                        void* recvbuf,          /* out */
                        int* recvcounts,        /* in */
                        int* displs,            /* in */
                        MPI_Datatype recvtype,  /* in */
                        MPI_Comm comm );        /* in */
• Description
  – Collects individual messages from each process in communicator and distributes the resulting message to all processes
  – Messages can have different sizes and displacements
Example of MPI_ALLGATHERV
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv[6];
    int ircnt[3] = {1,2,3}, idisp[3] = {0,1,3};
    int iscnt;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* rank r contributes r+1 copies of the value r+1 */
    for(i=0; i<rank+1; i++)
        isend[i] = rank + 1;
    iscnt = rank + 1;

    MPI_Allgatherv( isend, iscnt, MPI_INT,
                    irecv, ircnt, idisp, MPI_INT, MPI_COMM_WORLD );

    for(i=0; i<6; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
MPI_ALLTOALL
• Usage
  – int MPI_Alltoall( void* sendbuf,          /* in */
                      int sendcount,          /* in */
                      MPI_Datatype sendtype,  /* in */
                      void* recvbuf,          /* out */
                      int recvcount,          /* in */
                      MPI_Datatype recvtype,  /* in */
                      MPI_Comm comm );        /* in */
• Description
  – Sends a distinct message from each process to every other process
  – The j-th block of data sent from process i is received by process j and placed in the i-th block of the buffer recvbuf
Example of MPI_ALLTOALL
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[3], irecv[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* element i is the block destined for process i */
    for(i=0; i<nproc; i++)
        isend[i] = i + nproc * rank;

    MPI_Alltoall( isend, 1, MPI_INT, irecv, 1, MPI_INT, MPI_COMM_WORLD );
    for(i=0; i<3; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
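The block mapping of MPI_ALLTOALL is easiest to see as a matrix transpose: if row i of a matrix is the send buffer of process i, then row j of the transposed matrix is the receive buffer of process j. The following serial model makes that explicit. It is a sketch of the data movement only, with one int per block; `alltoall_model` is a hypothetical helper, not an MPI call, and no communication takes place.

```c
#include <assert.h>

/* Serial model of MPI_Alltoall with one int per block.
   send[i * nproc + j] is the j-th element of process i's send buffer;
   recv[j * nproc + i] is the i-th element of process j's receive buffer. */
void alltoall_model(int nproc, const int *send, int *recv)
{
    assert(nproc > 0);
    for (int i = 0; i < nproc; i++)        /* sending process */
        for (int j = 0; j < nproc; j++)    /* receiving process */
            recv[j * nproc + i] = send[i * nproc + j];
}
```

With three processes and isend[i] = i + nproc * rank as in the example, process 0 ends up with 0, 3, 6, process 1 with 1, 4, 7, and process 2 with 2, 5, 8.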
MPI_ALLTOALLV
• Usage
  – int MPI_Alltoallv( void* sendbuf,          /* in */
                       int* sendcounts,        /* in */
                       int* sdispls,           /* in */
                       MPI_Datatype sendtype,  /* in */
                       void* recvbuf,          /* out */
                       int* recvcounts,        /* in */
                       int* rdispls,           /* in */
                       MPI_Datatype recvtype,  /* in */
                       MPI_Comm comm );        /* in */
• Description
  – Sends a distinct message from each process to every process
  – Messages can have different sizes and displacements
  – The type signature associated with sendcounts(j), sendtype at process i must be equal to the type signature associated with recvcounts(i), recvtype at process j
Example of MPI_ALLTOALLV
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[6] = {1,2,2,3,3,3}, irecv[9];
    int iscnt[3] = {1,2,3}, isdsp[3] = {0,1,3}, ircnt[3], irdsp[3];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for(i=0; i<6; i++)
        isend[i] = isend[i] + nproc * rank;

    /* every process sends rank+1 elements to this rank */
    for(i=0; i<nproc; i++) {
        ircnt[i] = rank + 1;
        irdsp[i] = i * (rank + 1);
    }

    MPI_Alltoallv( isend, iscnt, isdsp, MPI_INT,
                   irecv, ircnt, irdsp, MPI_INT, MPI_COMM_WORLD );

    for(i=0; i<iscnt[rank] * nproc; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
MPI_REDUCE
• Usage
  – int MPI_Reduce( void* sendbuf,          /* in */
                    void* recvbuf,          /* out */
                    int count,              /* in */
                    MPI_Datatype datatype,  /* in */
                    MPI_Op op,              /* in */
                    int root,               /* in */
                    MPI_Comm comm );        /* in */
• Description
  – Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and places the result in recvbuf on root
  – Both the input and output buffers have the same number of elements with the same type
  – Users may define their own operations or use the predefined operations provided by MPI
• Predefined operations
  – MPI_SUM, MPI_PROD
  – MPI_MAX, MPI_MIN
  – MPI_MAXLOC, MPI_MINLOC
  – MPI_LAND, MPI_LOR, MPI_LXOR
  – MPI_BAND, MPI_BOR, MPI_BXOR
Example of MPI_REDUCE
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isend, irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    /* sum the isend values of all processes into irecv on rank 0 */
    MPI_Reduce( &isend, &irecv, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
    if(rank == 0)
        printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}
MPI_REDUCE for scalars
MPI_REDUCE for arrays
MPI_ALLREDUCE
• Usage
  – int MPI_Allreduce( void* sendbuf,          /* in */
                       void* recvbuf,          /* out */
                       int count,              /* in */
                       MPI_Datatype datatype,  /* in */
                       MPI_Op op,              /* in */
                       MPI_Comm comm );        /* in */
• Description
  – Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and places the result in recvbuf on all the processes in communicator
  – This routine is similar to MPI_REDUCE except the result is returned to the receive buffer of all the group members
Example of MPI_ALLREDUCE
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isend, irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    /* every process receives the sum */
    MPI_Allreduce( &isend, &irecv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}
MPI_SCAN
• Usage
  – int MPI_Scan( void* sendbuf,          /* in */
                  void* recvbuf,          /* out */
                  int count,              /* in */
                  MPI_Datatype datatype,  /* in */
                  MPI_Op op,              /* in */
                  MPI_Comm comm );        /* in */
• Description
  – Performs a parallel prefix reduction on data distributed across a group
  – The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0…i
Example of MPI_SCAN
#include <stdio.h>
#include "mpi.h"

int main( int argc, char* argv[] )
{
    int rank, nproc;
    int isend, irecv;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    isend = rank + 1;
    /* rank i receives the sum of the isend values of ranks 0..i */
    MPI_Scan( &isend, &irecv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    printf( "irecv = %d\n", irecv );

    MPI_Finalize();
    return 0;
}
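What each rank should expect from MPI_SCAN can be checked against a serial model. The sketch below computes the inclusive prefix sum that a scan with MPI_SUM and a count of 1 would deliver; `scan_sum_model` is a hypothetical helper written for illustration, and it runs in a single process without MPI.

```c
#include <assert.h>

/* Serial model of MPI_Scan with MPI_SUM and count = 1.
   send[r] is the value contributed by rank r; on return recv[r] holds
   what rank r would find in its receive buffer. */
void scan_sum_model(int nproc, const int *send, int *recv)
{
    assert(nproc >= 0);
    int running = 0;
    for (int r = 0; r < nproc; r++) {
        running += send[r];   /* inclusive: rank r's own value counts */
        recv[r] = running;
    }
}
```

With three processes and isend = rank + 1 as in the example, ranks 0, 1 and 2 would print 1, 3 and 6.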
MPI_REDUCE_SCATTER
• Usage
  – int MPI_Reduce_scatter( void* sendbuf,          /* in */
                            void* recvbuf,          /* out */
                            int* recvcounts,        /* in */
                            MPI_Datatype datatype,  /* in */
                            MPI_Op op,              /* in */
                            MPI_Comm comm );        /* in */
• Description
  – Applies a reduction operation to the vector sendbuf over the set of processes specified by communicator and scatters the result according to the values in recvcounts
  – Functionally equivalent to MPI_REDUCE with count equal to the sum of recvcounts(i) followed by MPI_SCATTERV with sendcounts equal to recvcounts
Example of MPI_REDUCE_SCATTER
#include <stdio.h>
#include "mpi.h"

/* Run with 3 processes: the buffers are sized for nproc = 3. */
int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    int isend[6], irecv[3];
    int ircnt[3] = {1,2,3};

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for(i=0; i<6; i++)
        isend[i] = 1 + rank * 10;

    MPI_Reduce_scatter( isend, irecv, ircnt, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    /* rank r received ircnt[r] elements of the reduced vector */
    for(i=0; i<ircnt[rank]; i++)
        printf( "irecv = %d\n", irecv[i] );

    MPI_Finalize();
    return 0;
}
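The "reduce, then scatter" equivalence stated in the description can be traced with a serial model. The sketch below computes what a single rank would receive from a reduce-scatter with MPI_SUM; `reduce_scatter_sum_model` is a hypothetical helper for illustration, the send buffers of all ranks are assumed to be laid out row by row in one array, and no MPI calls are made.

```c
#include <assert.h>

/* Serial model of MPI_Reduce_scatter with MPI_SUM, seen from one rank.
   send[r * total + k] is element k of rank r's send buffer, where total
   is the sum of recvcounts. On return recv[0..recvcounts[rank]-1] holds
   the slice of the reduced vector that `rank` would receive. */
void reduce_scatter_sum_model(int nproc, const int *send,
                              const int *recvcounts, int rank, int *recv)
{
    assert(rank >= 0 && rank < nproc);
    int total = 0, offset = 0;
    for (int r = 0; r < nproc; r++) {
        if (r < rank)
            offset += recvcounts[r];   /* where this rank's slice starts */
        total += recvcounts[r];        /* length of every send buffer */
    }
    for (int k = 0; k < recvcounts[rank]; k++) {
        recv[k] = 0;
        for (int r = 0; r < nproc; r++)   /* element-wise reduction */
            recv[k] += send[r * total + offset + k];
    }
}
```

In the example every element of rank r's send buffer is 1 + rank * 10, so with three processes each reduced element is 1 + 11 + 21 = 33, and rank r receives ircnt[r] copies of it.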
MPI_OP_CREATE
• Usage
  – int MPI_Op_create( MPI_User_function* function,  /* in */
                       int commute,                  /* in */
                       MPI_Op* op );                 /* out */
• Description
  – Binds a user-defined reduction operation to an op handle
  – The handle can be used in MPI_REDUCE, MPI_ALLREDUCE, MPI_REDUCE_SCATTER and MPI_SCAN
  – If commute is true, then the operation must be both commutative and associative.
  – MPI_User_function
    • typedef void MPI_User_function( void* invec, void* inoutvec, int* len, MPI_Datatype* datatype );
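Example of MPI_OP_CREATE
#include "mpi.h"

typedef struct { double re, im; } COMPLEX;

/* User-defined reduction: element-wise complex addition */
void my_sum( void* cin, void* cinout, int* len, MPI_Datatype* type )
{
    int i;
    COMPLEX* in = (COMPLEX*) cin;
    COMPLEX* inout = (COMPLEX*) cinout;

    for(i=0; i<*len; i++) {
        inout[i].re = inout[i].re + in[i].re;
        inout[i].im = inout[i].im + in[i].im;
    }
}

int main( int argc, char* argv[] )
{
    int i;
    int rank, nproc;
    MPI_Op isum;
    MPI_Datatype ctype;
    COMPLEX c[2], csum[2];

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    for(i=0; i<2; i++) { c[i].re = rank; c[i].im = i; }

    /* describe COMPLEX to MPI as two contiguous doubles */
    MPI_Type_contiguous( 2, MPI_DOUBLE, &ctype );
    MPI_Type_commit( &ctype );

    MPI_Op_create( my_sum, 1, &isum );   /* 1 = operation is commutative */
    MPI_Reduce( c, csum, 2, ctype, isum, 0, MPI_COMM_WORLD );

    MPI_Finalize();
    return 0;
}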
MPI_BARRIER
• Usage
  – int MPI_Barrier( MPI_Comm comm );  /* in */
• Description
  – Blocks each process in communicator until all processes have called it
Hello World: C Code
The easiest way to see exactly how a parallel code is put together and run is to write the classic "Hello World" program in parallel. In this case it simply means that every PE will say hello to us. Let's take a look at the code to do this.
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int my_PE_num;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
    printf("Hello from %d.\n", my_PE_num);
    MPI_Finalize();
    return 0;
}
Hello World: Fortran Code
program shifter
include 'mpif.h'

integer my_pe_num, errcode

call MPI_INIT(errcode)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)
print *, 'Hello from ', my_pe_num, '.'
call MPI_FINALIZE(errcode)
end
Output
Hello from 5.
Hello from 3.
Hello from 1.
Hello from 2.
Hello from 7.
Hello from 0.
Hello from 6.
Hello from 4.
There are two issues here that may not have been expected. The most obvious is that the output might seem out of order. The response to that is "what order were you expecting?" Remember, the code was started on all nodes practically simultaneously. There was no reason to expect one node to finish before another. Indeed, if we rerun the code we will probably get a different order. Sometimes it may seem that there is a very repeatable order. But, one important rule of parallel computing is don't assume that there is any particular order to events unless there is something to guarantee it. Later on we will see how we could force a particular order on this output.
Master and Slaves PEs
The much more common case is to have a single PE that is used for some sort of coordination purpose, and the other PEs run code that is the same, although the data will be different. This is how one would implement a master/slave or host/node paradigm.
if (my_PE_num == 0)
    MasterCodeRoutine
else
    SlaveCodeRoutine
Of course, the Hello World code above is the trivial case of
EveryBodyRunThisRoutine
and consequently the only difference will be in the output, as it actually uses the PE number.
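"The same code, although the data will be different" usually comes down to each PE deriving its own share of the work from its PE number. One common way to do that, sketched here under the assumption that the work is a list of n independent items, is a block decomposition. The helper `block_range` is an illustrative name, not an MPI routine; in a real program its `rank` and `nproc` arguments would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>

/* Compute the half-open range [*lo, *hi) of items that `rank` should
   process when n items are divided as evenly as possible among nproc
   processes. The first n % nproc ranks get one extra item. */
void block_range(int n, int nproc, int rank, int *lo, int *hi)
{
    assert(n >= 0 && nproc > 0 && rank >= 0 && rank < nproc);
    int base = n / nproc;     /* items every rank gets */
    int extra = n % nproc;    /* leftover items, one each for low ranks */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}
```

Every PE calls the same function with its own PE number, so 10 items on 4 PEs are split into the ranges [0,3), [3,6), [6,8) and [8,10) without any communication.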
MPI_COMM_WORLD
In the Hello World program, we see that the first parameter in MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num) is MPI_COMM_WORLD. MPI_COMM_WORLD is known as the "communicator" and can be found in many of the MPI routines. In general, it is used so that one can divide up the PEs into subsets for various algorithmic purposes. For example, if we had an array that we wished to find the determinant of distributed across the PEs, we might wish to define some subset of the PEs that holds a certain column of the array so that we could address only those PEs conveniently.
However, this is a convenience that can often be dispensed with. As such, one will often see the value MPI_COMM_WORLD used anywhere that a communicator is required. This is simply the global set that states we don't really care to deal with any particular subset here.
Compiling and Running
Well, now that we may have some idea how the above code will perform, let's compile it and run it to see if it meets our expectations. We compile using a normal ANSI C or Fortran 90 compiler (C++ is also available). While logged in to the cluster master:
For C codes: cc -lmpi hello.c
For Fortran codes: f90 -lmpi hello.f90
We now have an executable. To run on the Cluster we must tell the machine how many copies we wish to run. You can choose any number. We'll try 8:
Some Clusters use: mpprun -n8 a.out
Others use: prun -n8 a.out
Where Will The Output Go?
The second issue, although you may have taken it for granted, is "where will the output go?". This is another question that MPI dodges because it is so implementation dependent. On the Cluster, the I/O is structured in about the simplest way possible. All PEs can read and write (files as well as console I/O) through the standard channels. This is very convenient, and in our case results in all of the "standard output" going back to your terminal window on the Cluster.
In general, it can be much more complex. For instance, suppose you were running this on a cluster of 8 workstations. Would the output go to eight separate consoles? Or, in a more typical situation, suppose you wished to write results out to a file:
With the workstations, you would probably end up with eight separate files on eight separate disks.
With the Cluster, they can all access the same file simultaneously.
There are some good reasons why you would want to exercise some constraint even on the Cluster. 512 PEs accessing the same file would be extremely inefficient.
Sending and Receiving Messages
Hello world might be illustrative, but we haven't really done any message passing yet.
Let's write the simplest possible message passing program.
It will run on 2 PEs and will send a simple message (the number 42) from PE 1 to PE 0. PE 0 will then print this out.
Sending a Message
Sending a message is a simple procedure. In our case the routine will look like this in C (the standard man pages are in C, so you should get used to seeing this format):
MPI_Send( &numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);
Let's look at the parameters individually:
&numbertosend
  A pointer to whatever we wish to send. In this case it is simply an integer. It could be anything from a character string to a column of an array or a structure. It is even possible to pack several different data types in one message.
1
  The number of items we wish to send. If we were sending a vector of 10 ints, we would point to the first one in the above parameter and set this to the size of the array.
MPI_INT
  The type of object we are sending. Possible values are: MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED. Most of these are obvious in use. MPI_BYTE will send raw bytes (on a heterogeneous workstation cluster this will suppress any data conversion). MPI_PACKED can be used to pack multiple data types in one message, but it does require a few additional routines we won't go into (those of you familiar with PVM will recognize this).
0
  Destination of the message. In this case PE 0.
10
  Message tag. All messages have a tag attached to them that can be useful for sorting messages. For example, one could give high priority control messages a different tag than data messages. When receiving, the program would check for messages that use the control tag first. We just picked 10 at random.
MPI_COMM_WORLD
  We don't really care about any subsets of PEs here. So, we just chose this "default".
Receiving a Message
Receiving a message is equally simple. In our case it will look like:MPI_Recv( &numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,MPI_COMM_WORLD, &status)
&numbertoreceive A pointer to the variable that will receive the item. In our case it is simply an integer that has has some undefined value until now.
1 Number of items to receive. Just 1 here.
MPI_INT Datatype. Better be an int, since that's what we sent.
MPI_ANY_SOURCE The node to receive from. We could use 1 here since the message is coming from there, but we'll illustrate the "wild card" method of receiving a message from anywhere.
MPI_ANY_TAG We could use a value of 10 here to filter out any other messages (there aren't any) but, again, this was a convenient place to show how to receive any tag.
MPI_COMM_WORLD Just using default set of all PEs.
&status A structure that receives the status data, which includes the source and tag of the message.
Send and Receive C Code
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv){
    int my_PE_num, numbertoreceive, numbertosend=42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

    /* PE 0 receives one integer from whichever PE sends it. */
    if (my_PE_num==0){
        MPI_Recv( &numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("Number received is: %d\n", numbertoreceive);
    }
    /* Only PE 1 sends, so the single receive above has exactly one match. */
    else if (my_PE_num==1)
        MPI_Send( &numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
Send and Receive Fortran Code
program shifter
  implicit none
  include 'mpif.h'

  integer my_pe_num, errcode, numbertoreceive, numbertosend
  integer status(MPI_STATUS_SIZE)

  call MPI_INIT(errcode)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)

  numbertosend = 42

  if (my_PE_num.EQ.0) then
    call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, MPI_ANY_SOURCE, &
                   MPI_ANY_TAG, MPI_COMM_WORLD, status, errcode)
    print *, 'Number received is:', numbertoreceive
  endif

  if (my_PE_num.EQ.1) then
    call MPI_Send( numbertosend, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, errcode)
  endif

  call MPI_FINALIZE(errcode)
end
Non-Blocking Receives
All of the receives that we will use are blocking. This means that they will wait until a message matching their requirements for source and tag has been received. It is possible to use non-blocking communications. This means a receive will return immediately and it is up to the code to determine when the data actually arrives using additional routines.
In most cases this additional coding is not worth it in terms of performance and code robustness. However, for certain algorithms this can be useful to keep in mind.
Communication Modes
There are four possible modes (with slightly differently named routines: MPI_Send, MPI_Bsend, MPI_Ssend, and MPI_Rsend) for buffering and sending messages in MPI. We use the standard mode here, and you may find this sufficient for the majority of your needs. However, these other modes can allow for substantial optimization in the right circumstances:
Standard Mode Send will usually not block even if a receive for that message has not occurred. The exception is if there are resource limitations (buffer space).
Buffered Mode Similar to the above, but will never block; if buffer space runs out it returns an error instead.
Synchronous Mode Will only return when the matching receive has started.
Ready Mode Will only work if the matching receive is already waiting.
Synchronization
We are going to write one more code, which will employ the remaining tool that we need for general parallel programming: synchronization. Many algorithms require that you be able to get all of the nodes into some controlled state before proceeding to the next stage. This is usually done with a synchronization point that requires all of the nodes (or at least some specified subset) to reach a certain point before proceeding. Sometimes the manner in which messages block will achieve this same result implicitly, but it is often necessary to do this explicitly. Debugging is also often greatly aided by inserting synchronization points, which are later removed for the sake of efficiency.
Our code will perform the rather pointless operation of having PE 0 send a number to the other 3 PEs and have them multiply that number by their own PE number. They will then print the results out (in order, remember the hello world program?) and send them back to PE 0 which will print out the sum.
Synchronization: C Code
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv){
    int my_PE_num, numbertoreceive, numbertosend=4, index, result=0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

    /* PE 0 sends the number to PEs 1-3; each of them multiplies it by its own PE number. */
    if (my_PE_num==0)
        for (index=1; index<4; index++)
            MPI_Send( &numbertosend, 1, MPI_INT, index, 10, MPI_COMM_WORLD);
    else{
        MPI_Recv( &numbertoreceive, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
        result = numbertoreceive * my_PE_num;
    }

    /* One barrier per turn: PE 'index' prints after the barrier for its turn. */
    for (index=1; index<4; index++){
        MPI_Barrier(MPI_COMM_WORLD);
        if (index==my_PE_num)
            printf("PE %d's result is %d.\n", my_PE_num, result);
    }

    /* PE 0 collects and sums the results; everyone else sends theirs back. */
    if (my_PE_num==0){
        for (index=1; index<4; index++){
            MPI_Recv( &numbertoreceive, 1, MPI_INT, index, 10, MPI_COMM_WORLD, &status);
            result += numbertoreceive;
        }
        printf("Total is %d.\n", result);
    }
    else
        MPI_Send( &result, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
Synchronization: Fortran Code
program shifter
  implicit none
  include 'mpif.h'

  integer my_pe_num, errcode, numbertoreceive, numbertosend
  integer index, result
  integer status(MPI_STATUS_SIZE)

  call MPI_INIT(errcode)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe_num, errcode)

  numbertosend = 4
  result = 0

  if (my_PE_num.EQ.0) then
    do index=1,3
      call MPI_Send( numbertosend, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, errcode)
    enddo
  else
    call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, status, errcode)
    result = numbertoreceive * my_PE_num
  endif

  do index=1,3
    call MPI_Barrier(MPI_COMM_WORLD, errcode)
    if (my_PE_num.EQ.index) then
      print *, 'PE ',my_PE_num,'''s result is ',result,'.'
    endif
  enddo

  if (my_PE_num.EQ.0) then
    do index=1,3
      call MPI_Recv( numbertoreceive, 1, MPI_INTEGER, index, 10, MPI_COMM_WORLD, status, errcode)
      result = result + numbertoreceive
    enddo
    print *,'Total is ',result,'.'
  else
    call MPI_Send( result, 1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, errcode)
  endif

  call MPI_FINALIZE(errcode)
end
Results of “Synchronization”
The output you get when running this code with 4 PEs (what will happen if you run with more or fewer?) is the following:
PE 1's result is 4.
PE 2's result is 8.
PE 3's result is 12.
Total is 24.
Analysis of “Synchronization”
The best way to make sure that you understand what is happening in the code above is to look at things from the perspective of each PE in turn. THIS IS THE WAY TO DEBUG ANY MESSAGE-PASSING (or MIMD) CODE.
Follow from the top to the bottom of the code as PE 0, and do likewise for PE 1. See exactly where one PE is dependent on another to proceed. Look at each PE's progress as though it is 100 times faster or slower than the other nodes. Would this affect the final program flow? It shouldn't, unless you made assumptions that are not always valid.
Reduction
MPI_Reduce: Reduces values on all processes to a single value.
Synopsis
#include "mpi.h"
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
Reduction Cont’d
Input Parameters:
sendbuf address of send buffer
count number of elements in send buffer (integer)
datatype data type of elements of send buffer (handle)
op reduce operation (handle)
root rank of root process (integer)
comm communicator (handle)
Output Parameter:
recvbuf address of receive buffer (choice, significant only at root)
Algorithm: This implementation currently uses a simple tree algorithm.
Finding Pi
Our last example will find the value of pi by integrating 4/(1 + x^2) from 0 to 1.
Since the antiderivative of 1/(1 + x^2) is arctan(x), this integral equals 4*arctan(1) = pi. The master process (0) will query for a number of intervals to use, and then broadcast this number to all of the other processors.
Each processor will then add up every numprocs-th interval, starting with interval rank+1 and evaluating the integrand at the interval midpoint x = h*(i - 0.5), where h = 1/n.
Finally, the sums computed by each processor are added together using a new type of MPI operation, a reduction.
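The way the intervals are dealt out among processors can be checked without MPI at all. The sketch below is plain C under stated assumptions: it follows the Fortran program's loop bounds (integration from 0 to 1, midpoints, a stride equal to the number of processes), and the function names partial_pi and simulated_reduce are hypothetical. The second function merely stands in for the MPI_SUM reduction by adding the partial results in a loop.

```c
/* Partial sum computed by one process: intervals rank+1, rank+1+size, ...
 * out of n equal intervals on [0, 1], using the midpoint rule. */
double partial_pi(int n, int rank, int size)
{
    double h = 1.0 / n;          /* width of one interval */
    double sum = 0.0;

    for (int i = rank + 1; i <= n; i += size) {
        double x = h * (i - 0.5);            /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Serial stand-in for the reduction: add up what each of the `size`
 * processes would have computed. This is NOT an MPI call. */
double simulated_reduce(int n, int size)
{
    double pi = 0.0;

    for (int rank = 0; rank < size; rank++)
        pi += partial_pi(n, rank, size);
    return pi;
}
```

Because every interval from 1 to n is visited by exactly one rank, the combined result does not depend on the number of processes, apart from floating-point rounding.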
Finding Pi
program FindPI
  implicit none
  include 'mpif.h'

  integer n, my_pe_num, numprocs, index, errcode
  real mypi, pi, h, sum, x

  call MPI_Init(errcode)
  call MPI_Comm_size(MPI_COMM_WORLD, numprocs, errcode)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_pe_num, errcode)

  if (my_pe_num.EQ.0) then
    print *,'How many intervals?:'
    read *, n
  endif

  call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, errcode)

  h = 1.0 / n
  sum = 0.0
  do index = my_pe_num+1, n, numprocs
    x = h * (index - 0.5)
    sum = sum + 4.0 / (1.0 + x*x)
  enddo
  mypi = h * sum

  call MPI_Reduce(mypi, pi, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, errcode)

  if (my_pe_num.EQ.0) then
    print *,'pi is approximately ',pi
    print *,'Error is ',pi-3.14159265358979323846
  endif

  call MPI_Finalize(errcode)
end
Do Not Make Any Assumptions
Do not make any assumptions about the mechanics of the actual message-passing. Remember that MPI is designed to operate not only on fast MPP networks, but also on Internet-sized meta-computers. As such, the order and timing of messages may be considerably skewed.
MPI makes only one guarantee: two messages sent from one process to another process will arrive in that relative order (provided both use the same communicator and could match the same receive). However, a message sent later from another process may arrive before, or between, those two messages.
References
There is a wide variety of material available on the Web, some of which is intended to be used as hardcopy manuals and tutorials.
http://www.psc.edu/htbin/software_by_category.pl/hetero_software
You may wish to start at one of the MPI home pages at
http://www.mcs.anl.gov/Projects/mpi/index.html
from which you can find a lot of useful information without traveling too far. To learn the syntax of MPI calls, access the index for the Message Passing Interface Standard at:
http://www-unix.mcs.anl.gov/mpi/www/
Books:
• Parallel Programming with MPI. Peter S. Pacheco. San Francisco: Morgan Kaufmann Publishers, Inc., 1997.
• PVM: a users' guide and tutorial for networked parallel computing. Al Geist, Adam Beguelin, Jack Dongarra et al. MIT Press, 1996.
• Using MPI: portable parallel programming with the message-passing interface. William Gropp, Ewing Lusk, Anthony Skjellum. MIT Press, 1996.