Message Passing: MPI on Origin Systems

TRANSCRIPT

Page 1: Message Passing: MPI on Origin Systems

Page 2: MPI Programming Model

Page 3: Compiling MPI Programs

cc -64 compute.c -lmpi

f77 -64 -LANG:recursive=on compute.f -lmpi

f90 -64 -LANG:recursive=on compute.f -lmpi

CC -64 compute.c -lmpi++ -lmpi

The -64 ABI is NOT required but improves functionality and optimization

With compiler release 7.2.1 or higher, the flag

-auto_use mpi_interface

can be used with f77 / f90 for compile-time subroutine interface checking
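For example, the interface-checking flag can simply be added to the usual compile line (a sketch combining the options listed above):

  f90 -64 -LANG:recursive=on -auto_use mpi_interface compute.f -lmpi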

Page 4: Compiling MPI Programs

• Must use the header file from /usr/include, since the SGI libraries were built with it (do not use a public domain version)
  – FORTRAN: mpif.h or USE MPI
  – C: mpi.h
  – C++: mpi++.h

• The mpi_init version must match the language of the main program (if MPI is called from multiple shared-memory threads, mpi_init_thread must be used instead)
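If MPI will be called from multiple shared-memory threads, the initialization might look like this (a minimal sketch; the requested level MPI_THREAD_MULTIPLE is an assumption, use whatever level the program actually needs):

      integer iprovided, ierr
      call mpi_init_thread(MPI_THREAD_MULTIPLE, iprovided, ierr)
      if (iprovided .lt. MPI_THREAD_MULTIPLE)
     &   print *, 'requested thread support level not available'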

Page 5: Compiling MPI Programs

• MPI definitions:
  – FORTRAN: MPI_XXXX (not case sensitive)
  – C: MPI_Xxxx (upper and lower case)
  – C++: Xxxx (part of the MPI:: namespace)

• Every entry point MPI_ in the MPI library has a “shadow” entry point PMPI_ to aid with the implementation of user profiling (a wrapper sketch follows after this list)

• Array Services required to run MPI (arrayd)
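As an illustration of the PMPI shadow entry points, a user-level profiling wrapper can intercept an MPI routine and then call the real implementation through its PMPI_ name. A minimal sketch (the counter and common block names are made up for illustration):

      subroutine mpi_send(buf, count, datatype, dest, tag, comm, ierr)
      integer count, datatype, dest, tag, comm, ierr
      real buf(*)
      integer nsend
      common /profcnt/ nsend            ! hypothetical call counter
      nsend = nsend + 1
      call pmpi_send(buf, count, datatype, dest, tag, comm, ierr)
      end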

Page 6: Basic MPI Features

Page 7: Basic MPI Features

Page 8: Basic MPI Features

Page 9: MPI Basic Calls

MPI has a large number of calls. The following are the most basic:

• every MPI program has to start and finish with these calls (the first and the last executable statements):
  mpi_init, mpi_finalize
• essential inquiry about the environment:
  mpi_comm_size, mpi_comm_rank
• basic communication calls:
  mpi_send, mpi_recv
• basic synchronization call:
  mpi_barrier

      program mpitest
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, np, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, id, ierr)
      do I = 0, np-1
         if (I .eq. id) print *, 'np, id', np, id
         call mpi_barrier(MPI_COMM_WORLD, ierr)
      enddo
      call mpi_finalize(ierr)
      stop
      end

Compile with:
  f77 -o mpitest -LANG:recursive=on mpitest.f -lmpi
Run with:
  mpirun -np N [-stats -prefix "%g"] mpitest

Page 10: MPI send and receive Calls

mpi_send(buf,count,datatype,dest,tag,comm,ierr)

mpi_recv(buf,count,datatype,source,tag,comm,stat,ierr)

buf          data to be sent/received
count        number of items to send; size of buf for recv
datatype     type of the data items (MPI_INTEGER, MPI_FLOAT, MPI_DOUBLE_PRECISION, etc.)
dest/source  rank of the peer process (MPI_ANY_SOURCE allowed on receive)
tag          integer mark of the message (MPI_ANY_TAG allowed on receive)
comm         communicator handle (MPI_COMM_WORLD)
stat         status of the message, of MPI_STATUS type; in Fortran: INTEGER stat(MPI_STATUS_SIZE).
             The actual item count can be queried with
               call mpi_get_count(stat, MPI_REAL, nitems, ierr)
             where nitems can be <= count.

check for errors: if (ierr .ne. MPI_SUCCESS) call abort()

(buf, count, datatype describe the message data; dest, tag, comm form the message envelope)

Page 11: Using send and receive Calls

Rules of use:
• mpi_send/recv are defined as blocking calls

  – the program should not assume blocking behaviour (small messages are buffered)
  – when these calls return, the buffers can be (re-)used

• the arrival order of messages sent from A and B to C is not determined; two messages from A to B will arrive in the order sent

• Message Passing programming models are non-deterministic.

Example:

      if (mod(id,2) .eq. 0) then
         idst = mod(id+1, np)
         itag = 0
         call mpi_send(A, N, MPI_REAL, idst, itag, MPI_COMM_WORLD, ierr)
         if (ierr .ne. MPI_SUCCESS) print *, 'error from', id, np, ierr
      else
         isrc = mod(id-1+np, np)
         itag = MPI_ANY_TAG
         call mpi_recv(B, NSIZE, MPI_REAL, isrc, itag, MPI_COMM_WORLD, stat, ierr)
         if (ierr .ne. MPI_SUCCESS) print *, 'error from', id, np, ierr
         call mpi_get_count(stat, MPI_REAL, N, ierr)
      endif

Page 12: Another Simple Example

Page 13: MPI send/receive: Buffering

An MPI program should not assume buffering of messages. The following program is erroneous:

Running on an Origin2000 on 2 CPUs, the program blocks after reaching the size I=2100 (because of the buffering constraint MPI_BUFFER_MAX=16384, i.e. 2048 items of real*8).

      program long_messages
      include 'mpif.h'
      real*8 h(4000)
      integer stat(MPI_STATUS_SIZE)

      call mpi_init(info)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, info)
      call mpi_comm_size(MPI_COMM_WORLD, npes, info)

      do I = 1000, 4000, 100             ! increasing size of the message
         call mpi_barrier(MPI_COMM_WORLD, info)
         print *, 'mype=', mype, ' before send', I
         call mpi_send(h, I, MPI_REAL8, mod(mype+1,npes), I,
     &                 MPI_COMM_WORLD, info)
         call mpi_barrier(MPI_COMM_WORLD, info)
         call mpi_recv(h, I, MPI_REAL8, mod(mype-1+npes,npes), I,
     &                 MPI_COMM_WORLD, stat, info)
      enddo
      call mpi_finalize(info)
      end
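One way to make this exchange correct without relying on buffering is to let MPI pair the send and the receive, for instance with mpi_sendrecv. A sketch only; it assumes a separate receive buffer g so the send buffer h is not overwritten:

      real*8 g(4000)                     ! separate receive buffer (added for the sketch)
!     ... mpi_init / rank / size calls as above ...
      do I = 1000, 4000, 100
         call mpi_sendrecv(h, I, MPI_REAL8, mod(mype+1,npes), I,
     &                     g, I, MPI_REAL8, mod(mype-1+npes,npes), I,
     &                     MPI_COMM_WORLD, stat, info)
      enddo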

Page 14: MPI Asynchronous send/receive

Non-blocking send and receive calls are available:
  mpi_isend(buf,count,datatype,dest,tag,comm,req,ierr)
  mpi_irecv(buf,count,datatype,source,tag,comm,req,ierr)

  buf, count, datatype   message content
  dest, tag, comm        message envelope
  req                    integer holding the request id

The asynchronous call returns a request id after registering the buffer. The request id can be used in the probe and wait calls:
  mpi_wait(req,stat,ierr)
• blocks until the MPI send or receive with the given request id completes
  mpi_waitall(count,array-of-req,array-of-stat,ierr)
• waits for all the given communications to complete (a blocking call)
• the (array of) stat can be probed for the items received; the data can be retrieved with the recv call (or irecv, or any other receive variety)

NOTE: although this interface announces asynchronous communication, the actual copy of the buffers happens only at the time of the receive and wait calls.

Page 15: MPI Asynchronous: Example

Buffer management with asynchronous communication:

• buffers declared in isend/irecv can be (re-)used only after the communication has actually completed.

• Requests should be freed (mpi_test, mpi_wait, mpi_request_free) for all the isend calls in the program, otherwise mpi_finalize might hang

      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE,10)
      integer req(10)
      real B1(NB1,10)

      if (mype .eq. 0) then              ! master receives from all slaves
         do ip = 1, npes-1
            call mpi_irecv(B1(1,ip), NB1, MPI_REAL,
     &                     ip, MPI_ANY_TAG, MPI_COMM_WORLD, req(ip), info)
         enddo
         nreq = npes-1
      else                               ! slaves send to the master
         call mpi_isend(B1(1,mype), NB1, MPI_REAL, 0, itag,
     &                  MPI_COMM_WORLD, req, info)
         nreq = 1
      endif

      ...                                ! some unrelated calculations

      call mpi_waitall(nreq, req, stat, ierr)

      ...                                ! data is available in B1 in the master process
      ...                                ! buffer B1 can be reused in the slave processes

Page 16: Performance of Asynchronous Communication

Page 17: MPI Functionality

Page 18: MPI Most Important Functions

Synchronous communication:
  mpi_send
  mpi_recv
  mpi_sendrecv

Asynchronous communication:
  mpi_isend
  mpi_irecv
  mpi_iprobe
  mpi_wait / mpi_waitall

Collective communication:
  mpi_barrier
  mpi_bcast
  mpi_gather / mpi_scatter
  mpi_reduce / mpi_allreduce
  mpi_alltoall

Creating communicators:
  mpi_comm_dup
  mpi_comm_split
  mpi_comm_free
  mpi_intercomm_create

Derived data types:
  mpi_type_contiguous
  mpi_type_vector
  mpi_type_indexed
  mpi_type_pack
  mpi_type_commit
  mpi_type_free
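As an illustration of the derived-datatype calls listed above, a strided row of a Fortran array can be described once and then sent with a single mpi_send (a sketch; the array shape and the names nrows, ncols, irow, idest, itag are illustrative):

      real*8 a(nrows, ncols)
      integer rowtype, ierr
!     one element from each column, nrows elements apart in memory
      call mpi_type_vector(ncols, 1, nrows, MPI_REAL8, rowtype, ierr)
      call mpi_type_commit(rowtype, ierr)
      call mpi_send(a(irow,1), 1, rowtype, idest, itag, MPI_COMM_WORLD, ierr)
      call mpi_type_free(rowtype, ierr)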

Page 19: MPI Most Important Functions

One-sided communication:
  mpi_win_create
  mpi_put
  mpi_get
  mpi_win_fence

Miscellaneous:
  MPI_Wtime()
• Based on the SGI_CYCLE clock with 0.8 microsecond resolution
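MPI_Wtime returns wall-clock seconds as a double precision value, so a code region can be timed like this (a sketch; the work routine is hypothetical):

      double precision t0, t1
      t0 = MPI_WTIME()
      call do_work()                     ! hypothetical work routine
      t1 = MPI_WTIME()
      print *, 'elapsed seconds:', t1 - t0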

Page 20: MPI Run Time System on SGI

• On SGI, all MPI programs are launched with the mpirun command
  – mpirun -np N executable-name arguments
    is the syntax on a single host; multi-host execution of different executables is also possible

• mpirun establishes a connection with the Array Daemon through the socket interface.

• The Array Daemon launches the mpi executable.

• N+1 threads are started. The additional one is the “lazy” thread, which is blocked in the mpi_init() call and terminates when all the other threads call mpi_finalize()

• mpirun -cpr (or -miser) works on a single host and avoids the socket interface to the Array Daemon (needed for the Checkpoint/Restart facility)

Note: start MPI programs with N < #procs

[Diagram: mpirun passes the program name, path and environment variables to the Array daemon on each host; each Array daemon forks t.exe N times (ranks 0 … N-1), and the hosts communicate over the HiPPI optimized communication path.]

  mpirun -np N t.exe
  mpirun Host_A -np N a.out : Host_B -np M b.out

Page 21: MPI Run Time on SGI

Page 22: MPI Run Time on SGI

Page 23: MPI Run Time on SGI

Page 24: MPI Implementation on SGI

• In C, mpi_init ignores all arguments passed to it

• All MPI processes are required to call mpi_finalize at exit

• I/O streams:
  – stdin is enabled only for the master thread (process with rank 0)
  – stdout and stderr are enabled for all the threads and line buffered
  – output from different MPI threads can be prepended with the -prefix argument; output is sent to the mpirun process
    example: mpirun -prefix “<proc %g out of %G> “ prints:
      <proc 0 out of 2> Hello World
      <proc 1 out of 2> Hello World
  – see man mpi(5) and man mpirun(1) for a complete description

• Systems with the HIPPI software installed will trigger usage of the HIPPI optimized communication (HIPPI bypass). If the hardware is not installed it is necessary to switch the HIPPI bypass off (setenv MPI_BYPASS_OFF TRUE)

• With f77/f90, the -auto_use mpi_interface flag is available to check the consistency of mpi arguments at compile time

• With -64 compilation, the MPI run time maps out the address space such that shared memory optimizations are available to circumvent the double copy problem. In particular, communication involving static data (i.e. common blocks) can be sped up.

Page 25: SGI Message-Passing Software

• SGI Message Passing Toolkit (MPT 1.5)

• MPI, SHMEM, PVM components

• Packaged with Array Services software

• MPT external web page: – http://www.sgi.com/software/mpt/

• MPT engineering internal web page:
  – http://wwwmn.americas.sgi.com/mpi/

Page 26: SGI Message-Passing Toolkit

• Fully MPI 1.2 standard compliant (based on MPICH)

• SHMEM API for one-sided communication

• Support for selected MPI-2 features; will continue enhancing as customer needs dictate:
  – MPI I/O (ROMIO version 1.0.2)
  – MPI one-sided communication
  – Thread safety
  – Fortran 90 bindings: USE MPI
  – C++ bindings

• PVM available on IRIX (Public Domain version)

Page 27: MPT: Supported Platforms

Now:
• IRIX SSI

• IRIX clusters (GSN, Hippi, Ethernet)

• IA32 and IA64 SSI with Linux

• IA32 cluster (Myrinet, Ethernet) with Linux

Soon:
• Partitioned IRIX (NUMAlink interconnect)

• IRIX clusters (Myrinet)

• Partitioned SN IA (NUMAlink interconnect)

• IA64 cluster (Myrinet, Ethernet)

Page 28: Convenience Features in MPT

• MPI job management with LSF, NQE, PBS, others

• Totalview debugger interoperability

• Fortran MPI subroutine interface checking at compile time with USE MPI

• Aborted cluster jobs are cleaned up automatically

• Array Services provides job control for cluster jobs

• Array Services and MPI work together to propagate user signals to all slaves

• Use shell modules to install multiple versions of MPT on the same system.

Page 29: MPI Performance

• Low latency and high bandwidth.

• Fetchop-assisted fast message queuing

• Fast fetchop tree barriers

• Very fast MPI and SHMEM one-sided communication

• Interoperability with SHMEM

• Support for SSI to 512 P

• Automatic NUMA placement

• Optimized MPI collectives

• Internal MPI statistics reporting

• Integration with PCP

• Direct send/recv transfers

• No-impact thread safety support

• Runtime MPI tuning

Page 30: NUMAlink Implementation

• Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all

• Fetch-Op-variables on Hub provide fast synchronization for flat and tree barrier methods

• The Fetch-Op AMO helped reduce MPI send/recv latency from 12 to 8 usec

[Diagram: two CPUs, each attached to a Hub connected through a router; the fetch-op variable resides on the Hub.]

Page 31: NUMAlink-based MPI Performance

MPI performance on Origin 2000 (Origin 3000 values in parentheses):

  send/recv latency        8 (5) usec
  Peak bandwidth           150 (280) Mbytes/sec
  One-sided get latency    2 (1) usec
  Barrier sync on 128 P    9 (6) usec
  Barrier sync on 484 P    26 (17) usec

Origin 3000 performance numbers subject to further verification.

Page 32: SHMEM Model

Page 33: SHMEM API

Page 34: SHMEM API

Page 35: SHMEM API

Page 36: One-Sided Communication Pattern

[Diagram: processes 0 … N-1 alternate COMPUTE and COMMUNICATE phases along the time axis, separated by barriers.]

Page 37: MPI Message Exchange (on host)

[Diagram: MPI_Send(src,len,…) in process 0 and MPI_Recv(dst,len,…) in process 1 exchange data through shared memory; message headers and data buffers are placed on per-process message queues coordinated by a fetchop variable, and the data passes from src through a shared buffer into dst.]

Page 38: MPI Message Exchange using Single Copy (on host)

[Diagram: the same exchange as on the previous page, but without intermediate data buffers; message headers are queued and the data is copied directly from src in process 0 to dst in process 1.]

Page 39: Performance of Synchronous Communication

Page 40: Performance of Synchronous Communication

Page 41: Using Single Copy send/recv

• Set MPI_BUFFER_MAX to N

• any message with size > N bytes will be transferred by direct copy if
  – MPI semantics allow it
  – the -64 ABI is used
  – the memory region it is allocated in is a globally accessible location
• N=2000 seems to work well
  – shorter messages don’t benefit from the direct copy transfer method

• Look at stats to verify that direct copy was used.
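For example, to request the direct-copy path for messages longer than 2000 bytes and then check whether it was used, one might run (csh syntax; the program name and process count are illustrative):

  setenv MPI_BUFFER_MAX 2000
  mpirun -np 8 -stats ./compute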

Page 42: Making Memory Globally Accessible for Single Copy send/recv

• User’s send buffer must reside in one of the following regions:
  – static memory (-static / common blocks / DATA / SAVE)
  – symmetric heap (allocated with SHPALLOC or shmalloc)
  – global heap (allocated with the f90 ALLOCATE statement and SMA_GLOBAL_ALLOC, MIPSpro version 7.3.1.1m), as sketched below

• When SMA_GLOBAL_ALLOC is set, it is usually necessary to increase the global heap size by setting SMA_GLOBAL_HEAP_SIZE
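For instance, a transfer buffer can be made remotely accessible simply by giving it static storage, or by allocating it from the global heap when SMA_GLOBAL_ALLOC is set (a sketch; the names and sizes are illustrative):

      real*8 buf(100000)
      common /xferbuf/ buf               ! static memory: remotely accessible

!     or, with the f90 global heap (SMA_GLOBAL_ALLOC set at run time):
      real*8, allocatable :: gbuf(:)
      allocate(gbuf(100000))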

Page 43: Global Communication Test

The ALL-to-ALL communication test (known as COMMS3 in the Parkbench suite):

[Diagram: processes p0, p1, …, pn each send blocks of iw items from their send array A to every other process and receive the corresponding blocks into their receive array B.]

Page 44: Global Communication

The ALL-to-ALL communication test:

C     every processor sends a message to every other processor
C     then every processor receives the messages directed to it

MPI version:

      T0 = MPI_WTIME()
      do I = 1, NREPT
         call mpi_alltoall(A, iw, MPI_DOUBLE_PRECISION,
     &                     B, iw, MPI_DOUBLE_PRECISION,
     &                     MPI_COMM_WORLD, ier)
      end do
      T1 = MPI_WTIME()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))     ! NP processes send NP-1 messages each

SHMEM version:

      T0 = MPI_WTIME()
      do I = 1, NREPT
         call shmem_barrier_all()
         do j = 0, NP-1
            other = MOD(my_rank+j, NP)
            call shmem_put8(B(1+iw*my_rank), A(1+iw*other), iw, other)
         enddo
      enddo
      T1 = MPI_WTIME()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))     ! NP processes send NP-1 messages each

Page 45: Global Communication

Performance of the global communication test

The test case shows cache effects since every operation is performed 50 times.

The global communication routines already use a single-copy algorithm for remotely accessible variables as of MPT 1.4.0.0.

AlltoAll bandwidth for R12K@300MHz:

  Std MPI   ~45 MB/s
  SC MPI    ~95 MB/s
  SHMEM     ~95 MB/s

Actions:
• convert to SHMEM
• use single-copy versions on remotely accessible variables

Page 46: Global Communication

Conclusions: implement critical data exchanges in MPI programs with SHMEM or single-copy MPI on static or shmalloc/shpalloc-allocated data.

[Chart comparing single copy and double copy performance.]

Page 47: MPI get/put

• For codes that are latency sensitive, try using one-sided MPI (get/put).

• latency over NUMAlink on O3000:
  – send/recv: 5 microseconds
  – mpi_get: 0.7 microseconds

• if portability isn’t an issue, use SHMEM instead
  – shmem_get latency: 0.5 microseconds (estimate by the MPT group)
  – much easier to write code
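For comparison, a SHMEM one-sided read is a single library call; the source array must be remotely accessible (symmetric), e.g. in a common block. A sketch, with illustrative names and counts:

      real*8 local(1000), remote(1000)
      common /symm/ remote               ! symmetric: same address on every PE
      call shmem_get8(local, remote, 1000, ipe)   ! read 1000 words from PE ipe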

Page 48: Transposition with SHMEM vs. send/recv

SHMEM version:

      call shmem_barrier_all()
      do 150 kk = 1, lmtot
         ktag = ksendto(kk)
         call shmem_put8(y(1+(ktag-1)*len), x(1,ksnding(kk)), len,
     &                   ipsndto(kk))
  150 continue
      call shmem_barrier_all()

send/recv version:

      ltag = 0
      do 150 kk = 1, lmtot
         ltag = ltag + 1
         ktag = ksendto(kk)
         call mpi_isend(x(1,ksnding(kk)), len, mpireal, ipsndto(kk),
     &                  ktag, mpicomm, iss(ltag), istat)
         ltag = ltag + 1
         ktag = krcving(kk)
         call mpi_irecv(y(1,krcving(kk)), len, mpireal, iprcvfr(kk),
     &                  ktag, mpicomm, iss(ltag), istat)
  150 continue
      call mpi_waitall(ltag, iss, istatm, istat)

Page 49: Transposition with MPI_put

      common /buffer/ yg(length)
      integer(kind=MPI_ADDRESS_KIND) winsize, target_disp

! Setup: create a window for array yg since we will do puts into it
      call mpi_type_extent(MPI_REAL8, isizereal8, ierr)
      winsize = isizereal8*length
      call mpi_win_create(yg, winsize, isizereal8, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, iwin, ierr)

Page 50: Transposition with MPI_put

      call mpi_barrier(MPI_COMM_WORLD, ierr)

      do 150 kk = 1, lmtot
         ktag = ksendto(kk)
         target_disp = (1+(ktag-1)*len) - 1
         call mpi_put(x(1,ksnding(kk)), len, MPI_REAL8, ipsndto(kk),
     &                target_disp, len, MPI_REAL8, iwin, ierr)
  150 continue

      call mpi_win_fence(0, iwin, ierr)

      do kk = 1, len*lmtot
         y(kk) = yg(kk)
      end do

! Cleanup - destroy the window
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      call mpi_win_free(iwin, ierr)

Page 51: Performance of One-Sided Communication

Page 52: Performance of One-Sided Communication

Page 53: Performance of the Message Passing Libraries

• Latency is the time it takes to pass a very short (zero length) message
• Bandwidth is the sustained performance passing long messages

• the “single” test uses the send/recv pair; the “multiple” test uses the equivalent of the sendrecv primitive

• Note that single bcopy speed on Origin2000 is about 150 MB/s
• MPI suffers a performance disadvantage with respect to SHMEM because MPI semantics require separate address spaces between threads; the MPI implementation therefore requires a “double copy” to pass messages
• SHMEM is optimized for one-sided communication, as is done for SMP programming, and therefore shows very good latency

MPI-1 vs. SHMEM on Origin2000 (R10000@195MHz):

                      MPI-1                SHMEM
                      single    multiple   single    multiple
  Latency [usec]      8.5       13.0       0.9       1.7
  Bandwidth [MB/s]    99        80         140       180

MPI-2 on Origin3000 (R12000@400MHz):

                      Single       Async        One-sided
                      send/recv    send/recv    put+fence
  Latency [usec]      4.3          5.4          0.7
  Bandwidth [MB/s]    250          250          310

Page 54: MPI Tips for Performance

• Use ABI 64 for additional memory cross-mapping MPI optimizations

• Use cpusets for best reproducible results in batch environment

• Avoid over-subscription of tasks to physical CPUs in a throughput benchmark

• Use the -stats option and MPI tuning variables

Page 55: MPI Tips for Performance

• Try direct-copy send/receive for memory bandwidth improvement and collective calls

• Use one-sided communication for latency (& memory bandwidth) improvement

• Try setting MPI_DSM_MUSTRUN or SMA_DSM_MUSTRUN to maintain CPU / memory affinity

• Do NOT use bsend/ssend, and avoid wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) in message envelopes

Page 56: Important Environment Variables

MPI_DSM_MUSTRUN

MPI_REQUEST_MAX

MPI_GM_ON

MPI_BAR_DISSEM

MPI_BUFS_PER_PROC

MPI_BUFS_PER_HOST

MPI_BUFFER_MAX

“-stats” mpirun option / Totalview display
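These are set in the environment before launching the job; for example (csh syntax, with illustrative values):

  setenv MPI_BUFS_PER_PROC 64
  setenv MPI_REQUEST_MAX 65536
  mpirun -np 16 -stats ./a.out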

Page 57: MPI Performance Experiments

Performance data on MPI programs can be collected with:
  mpirun -np N perfex -a -y -mp -o perfex.out prog-args
• the -o option produces a perfex.out.#procid file with event counts for every MPI thread, and perfex.out contains the aggregate for all the threads together

Profiling data on MPI programs can be collected with:
  mpirun -np N ssrun -experiment program prog-args
• the experiment is one of the usual experiments (pcsamp, usertime, etc.) or mpi:
  – mpirun -np N ssrun -workshop -mpi prog will produce N prog.mpi.f#procid files; these files can be aggregated with the ssaggregate tool and viewed interactively with the cvperf tool
  – ssaggregate -e prog.mpi.f* -o prog.mpi_all
  – cvperf prog.mpi_all or prof prog.mpi_all
  – the following routines are traced (see man ssrun(1)):
    MPI_Barrier(3), MPI_Send(3), MPI_Bsend(3), MPI_Ssend(3), MPI_Rsend(3), MPI_Isend(3), MPI_Ibsend(3), MPI_Issend(3), MPI_Irsend(3), MPI_Sendrecv(3), MPI_Sendrecv_replace(3), MPI_Bcast(3), MPI_Recv(3), MPI_Irecv(3), MPI_Wait(3), MPI_Waitall(3), MPI_Waitany(3), MPI_Waitsome(3), MPI_Test(3), MPI_Testall(3), MPI_Testany(3), MPI_Testsome(3), MPI_Request_free(3), MPI_Cancel(3), MPI_Pcontrol(3)

Page 58: MPI versus OpenMP

Page 59: MPI versus OpenMP

Page 60: MPI versus OpenMP

Page 61: MPI versus OpenMP

Page 62: MPI versus OpenMP

TM

SGI Message-Passing ReferencesSGI Message-Passing References

•“relnotes mpt” gives information about new features

• “man mpi” tells about all the environment variables
• “man shmem” tells about the SHMEM API
• MPI Reference Manuals viewable with the insight viewer
  – “Message Passing Toolkit: MPI Programmer’s Manual” (document # 007-3687-005)
• MPT web page:
  – http://www.sgi.com/software/mpt
• MPI web sites:
  – http://www.mpi-forum.org
  – http://www.mcs.anl.gov/mpi/index.html

Page 64: Summary

• It is important to understand the semantics of MPI

• The send/receive calls provide for data synchronization, not necessarily process synchronization

• A correct MPI program cannot depend on buffering for messages

• For a highly optimized MPI program, it is important to use only a few well-optimized subroutines from the MPI library, typically the straight send/receive variants

• The SGI implementation of MPI uses N+1 processes for an N-process job, so for scalability it is better to run MPI with fewer processes than there are physical processors in the machine

• Proprietary message passing libraries (e.g. SHMEM) perform better than MPI on the Origin, because MPI’s generic interface makes it much harder to optimize