Parallel Computing Using MPI
DESCRIPTION
Parallel Computing Using MPI. AnCAD, Inc. (逸奇科技).
Main sources of the Material
• MacDonald, N., Minty, E., Harding, T. and Brown, S. “Writing Message-Passing Parallel Programs with MPI,” Edinburgh Parallel Computing Centre, United Kingdom. (Available at http://www.lrz-muenchen.de/services/software/parallel/mpi/epcc-course/)
• “MPICH-A Portable MPI Implementation,” Web page available at http://www-unix.mcs.anl.gov/mpi/mpich/, Argonne National Laboratory, USA.
• “Message Passing Interface (MPI) Forum,” Web page available at http://www.mpi-forum.org/, Message Passing Interface Forum.
• Gropp, W., Lusk, E. and Skjellum, A. (1999). Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, USA.
• Grama, A., Gupta, A., Karypis, G. and Kumar, V. (2003). Introduction to Parallel Computing 2nd Edition, Addison Wesley, USA.
Computational Science
• Computer – A stimulus for the growth of science
• Computational Science – A branch of science between Theoretical Science and Experimental Science
A supercomputer system (ASCI Red)
A nuclear surface test
Trends of Computing (1)
• Requirements never stop
• Scientific computing
– Fluid dynamics
– Structural analysis
– Electromagnetic analysis
– Optimization of design
– Bio-informatics
– Astrophysics
• Commercial applications
– Network security
– Web and database data mining
– Optimizing business and market decisions
F-117 fighter (2D analysis in 1970s)
B-2 bomber (3D analysis in 1980s)
Trends of Computing (2)
• CPU performance
– Moore’s Law: the number of transistors doubles every 18 months
– A single CPU’s performance grows exponentially
– The cost of an advanced single-CPU computer increases more rapidly than its power
Progress of Intel CPUs
Typical cost-performance curves (a sketch)
Trends of Computing (3)
• Solutions
– A bigger supercomputer, or
– A cluster of small computers
• Today’s supercomputers
– Also use multiple processors
– Need parallel computing skills to exploit their performance
• A cluster of small computers
– Less budget
– More flexibility
– Needs more knowledge of parallel computing
Avalon (http://www.lanl.gov)
Parallel Computer Architectures
• Shared-memory architecture
– Typical SMP (Symmetric Multi-Processing)
– Programmers use threads or OpenMP
• Distributed-memory architecture
– Needs message passing among processors
– Programmers use MPI (Message Passing Interface) or PVM (Parallel Virtual Machine)
Shared-memory architecture
Distributed-memory architecture
Distributed Shared-Memory Computers
• Assemblages of shared-memory parallel computers
• Constellations:
– Np > Nn*Nn (more processors per node than nodes)
• Clusters or MPP:
– Np <= Nn*Nn
A Constellation
Np: # of processors (CPUs)
Nn: # of nodes (shared-mem. computers)
MPP: Massively Parallel Processors
A Cluster or MPP
ASCI Q (http://www.lanl.gov)
Trend of Parallel Architectures
• Clusters are rapidly getting popular.
– Clusters have appeared in the TOP500 list since 1997.
– More than 40% of the TOP500 were clusters in Nov. 2003.
– 170 were Intel-based clusters.
• More than 70% of the TOP500 are message-passing based.
• All TOP500 supercomputers require message passing to exploit their performance.
Trend of Top500 supercomputers’ architectures(http://www.top500.org)
Top 3 of the TOP500 (Nov. 2003)
• Earth Simulator
– 35.86 TFlops
– Earth Simulator Center (Japan)
• ASCI Q
– 13.88 TFlops
– Los Alamos Nat’l Lab. (USA)
• Virginia Tech’s X
– 10.28 TFlops
– Virginia Tech (USA)
(http://www.top500.org)
Earth Simulator
• For climate simulation
• Massively Parallel Processors
• 640 computing nodes
• Each node:
– 8 NEC vector CPUs
– 16 GB shared memory
• Network
– 16 GB/s inter-node bandwidth
• Parallel tools:
– Automatic vectorizing in CPU
– OpenMP in each node
– MPI or HPF across nodes
Earth Simulator site
Earth Simulator interconnection
(http://www.es.jamstec.go.jp/)
ASCI Q
• For numerical nuclear testing
• Cluster
– Each node has an independent operating system
• 2048 computing nodes
– An option to expand to 3072
• Each node:
– 4 HP Alpha processors
– 10 GB memory
• Network
– 250 MB/s bandwidth
– ~5 µs latency
ASCI Q site
ASCI Q
(http://www.lanl.gov/asci)
Virginia Tech’s X
• Scientific computing
• Cluster
• 1100 computing nodes
• Each node:
– 2 × 2 GHz PowerPC CPUs
– 4 GB memory
– Mac OS X
• Network
– 1.25 GB/s bandwidth
– 4.5 µs latency
• Parallel tools
– MPI
(http://www.eng.vt.edu/tcf/)
MPI is a …
• Library
– Not a language
– Functions or subroutines for Fortran and C/C++ programmers
• Specification
– Not a particular implementation
– Computer vendors may provide specific implementations.
– Free and open source code may be available. (http://www.mpi-forum.org/)
Six-Function Version of MPI
MPI_Init()
MPI_Comm_rank()
MPI_Comm_size()
MPI_Send()
MPI_Recv()
MPI_Finalize()
• MPI_Init– Initialize MPI– A must-do function at the beginning
• MPI_Comm_size– Number of processors (Np)
• MPI_Comm_rank– “This” processor’s ID
• MPI_Send– Sending data to a (specific) processor
• MPI_Recv– Receiving data from a (specific) processor
• MPI_Finalize– Terminate MPI– A must-do function at the end
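The six functions above suffice for a complete message-passing program. A minimal sketch, assuming an MPI installation and at least two processes (e.g. launched with `mpirun -np 2 ./a.out`); rank 0 sends one integer to rank 1:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size >= 2) {
        if (rank == 0) {
            int data = 42;  /* arbitrary payload */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int data;
            MPI_Status status;
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Rank 1 received %d\n", data);
        }
    }

    MPI_Finalize();
    return 0;
}
```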
Message-Passing Programming Paradigm
MPI on message-passing architecture(Edinburgh Parallel Computing Centre)
Message-Passing Programming Paradigm
MPI on shared-memory architecture(Edinburgh Parallel Computing Centre)
Single Program Multiple Data
• Single program– The same program runs on each processor.– Single program eases compilation and maintenance.
• Local variables– All variables are local. No shared data.
• Different at MPI_Comm_rank – Each processor gets different response from MPI_Comm_rank.
• Using sequential language– Programmers write in Fortran / C / C++.
• Message passing– Message passing can be the only way to communicate.
Hello World!
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Hello world.\n");
    MPI_Finalize();
    return 0;
}
An MPI program
Output of the program
Emulating General Message Passing with SPMD
#include <mpi.h>

int main(int argc, char **argv)
{
    int Size, Rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &Size);
    MPI_Comm_rank(MPI_COMM_WORLD, &Rank);
    if (Rank == 0) {
        Controller(argc, argv);
    } else {
        Workers(argc, argv);
    }
    MPI_Finalize();
    return 0;
}
(Edinburgh Parallel Computing Centre)
Emulating General Message Passing with SPMD
      PROGRAM main
      INCLUDE 'mpif.h'
      CALL MPI_INIT(iErr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, iSize, iErr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, iErr)
      IF (iRank .EQ. 0) THEN
          CALL Controller()
      ELSE
          CALL Workers()
      END IF
      CALL MPI_FINALIZE(iErr)
      END
(Edinburgh Parallel Computing Centre)
MPI Messages
• Messages are packets of data moving between processors
• The message passing system has to be told the following information:
– Sending processor
– Location of data in sending processor
– Data type
– Data length
– Receiving processor(s)
– Destination location in receiving processor(s)
– Destination size
Message Sending Modes
• How do we communicate?
– Phone: Information is synchronized. While you are talking, you need to concentrate.
– Fax: Information is synchronized. While faxing, you can do other work. After making sure the fax completed, you can leave.
– Mail: After sending the mail, you have no idea about its completion.
Synchronous/Asynchronous Sends
• Synchronous Sends
– Completion implies a receipt.
• Asynchronous Sends
– The sender only knows when the message has left.
(Edinburgh Parallel Computing Centre)
An asynchronous send
Synchronous Blocking Sends
• Returns from the subroutine call only when the operation has completed.
• Program waits.
• Easy and safe.
• Probably not efficient.
A blocking send
(Edinburgh Parallel Computing Centre)
Synchronous Non-Blocking Sends
• Returns straight away.
• Allows the program to continue with other work.
• Program can test or wait.
• More flexible.
• Probably more efficient.
• Programmers should be careful.
A non-blocking send
(Edinburgh Parallel Computing Centre)
Non-Blocking Sends w/o wait
• All non-blocking sends should have matching wait operations.
• Do not modify or release the sending data until completion.
Non-blocking send without test and wait
(Edinburgh Parallel Computing Centre)
Collective Communications
• Collective communication routines are higher level routines involving several processes at a time.
• Can be built out of point-to-point communications.
Broadcasting: a popular collective communication
(Edinburgh Parallel Computing Centre)
Collective Communications
Broadcasting
Barrier
Reduction (e.g., summation)
(Edinburgh Parallel Computing Centre)
Writing MPI Programs
Most of the following original MPI training materials were developed under the Joint Information Systems Committee (JISC) New Technologies Initiative by the Training and Education Centre at Edinburgh Parallel Computing Centre (EPCC-TEC), University of Edinburgh, United Kingdom.
MPI in C and FORTRAN
                 C/C++                   FORTRAN
Header file      #include <mpi.h>        INCLUDE 'mpif.h'
Function calls   ierr = MPI_Xxx(...);    CALL MPI_XXX(..., IERROR)
(Edinburgh Parallel Computing Centre)
Initialization of MPI
• In C
int MPI_Init(int *argc, char ***argv);
• In FORTRANMPI_INIT(IERROR)
• Must be the first routine called
(Edinburgh Parallel Computing Centre)
MPI Communicator
• A required argument for all MPI communication calls
• Processes can communicate only if they share the same communicator(s).
• MPI_COMM_WORLD is available.
• Programmer-defined communicators are allowed.
(Edinburgh Parallel Computing Centre)
Size and Rank
• Size
Q: How many processes are contained in a communicator?
A: MPI_Comm_size(MPI_Comm comm, int *size);
• Rank
Q: How do I identify different processes in my MPI program?
A: MPI_Comm_rank(MPI_Comm comm, int *rank);
(Edinburgh Parallel Computing Centre)
Exiting MPI
• In Cint MPI_Finalize ();
• In FORTRANMPI_FINALIZE(IERROR)
• Must be the last MPI routine called
(Edinburgh Parallel Computing Centre)
Contents of Typical MPI Messages
• Data to send
• Data length
• Data type
• Destination process
• Tag
• Communicator
(Edinburgh Parallel Computing Centre)
Why is Data Type Needed ?
• For heterogeneous parallel architecture
(Figure: the value 3.1416 is sent as raw binary bits; on a machine with a different data representation, the same bit pattern decodes to garbage.)
(Edinburgh Parallel Computing Centre)
Point-to-Point Communication
• Communication between two processes.
• Source process sends message to destination process.
• Communication takes place within a communicator.
• Destination process is identified by its rank in communicator.
(Edinburgh Parallel Computing Centre)
Sending/Receiving A Message
• Sending a message
int MPI_Ssend(
void *buffer,
int count,
MPI_Datatype datatype,
int destination,
int tag,
MPI_Comm comm
);
• Receiving a message
int MPI_Recv(
void *buffer,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm, MPI_Status *status
);
(Edinburgh Parallel Computing Centre)
Wildcarding
• Receiver can wildcard– From any source (MPI_ANY_SOURCE)– With any tag (MPI_ANY_TAG)
• Actual source and tag are returned in receiver’s status parameter.
• Sender cannot use wildcards– Destination (receiver) must be specified.– Tag must be specified.
(Edinburgh Parallel Computing Centre)
Message Order Preservation
• Messages do not overtake each other.
• This is true even for non-synchronous sends.
(Edinburgh Parallel Computing Centre)
Blocking/Non-Blocking Sends
• Blocking send
– Returns after completion.
• Non-blocking send
– Returns immediately.
– Allows test and wait.
A non-blocking send
(Edinburgh Parallel Computing Centre)
Synchronous? Blocking?
• Synchronous/Asynchronous sends
– Synchronous sends: Completion implies the receive has completed.
– Asynchronous sends: Completion means the buffer can be re-used, irrespective of whether the receive has completed.
• Blocking/Non-blocking sends
– Blocking sends: Return only after completion.
– Non-blocking sends: Return immediately.
(Edinburgh Parallel Computing Centre)
MPI Sending Commands
               Synchronous send   Asynchronous send        Receive
Blocking       MPI_Ssend          MPI_Bsend, MPI_Rsend     MPI_Recv
Non-blocking   MPI_Issend         MPI_Ibsend, MPI_Irsend   MPI_Irecv
(Edinburgh Parallel Computing Centre)
Non-Blocking Sends
• Avoids deadlock
• May improve efficiency
Deadlock
(Edinburgh Parallel Computing Centre)
Non-Blocking Communications
• Step 1:
– Initiate non-blocking communication.
– E.g., MPI_Isend or MPI_Irecv
• Step 2:
– Do other work.
• Step 3:
– Wait for the non-blocking communication to complete.
– E.g., MPI_Wait
(Edinburgh Parallel Computing Centre)
MPI Non-Blocking Commands
• Sending a message
int MPI_Isend(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm comm,
    MPI_Request *request);
• Receiving a message
int MPI_Irecv(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm comm,
    MPI_Request *request);
• Waiting for completion
int MPI_Wait(
    MPI_Request *request,
    MPI_Status *status);
(Edinburgh Parallel Computing Centre)
Blocking and Non-Blocking
• A blocking send can be used with a non-blocking receive, and vice versa.
(Edinburgh Parallel Computing Centre)
Derived Data Type
• Construct new data typeMPI_Type_vector (…)
MPI_Type_struct (…)
…
• Commit new data typeMPI_Type_commit (…)
• Release a data type MPI_Type_free (…)
(Edinburgh Parallel Computing Centre)
MPI Collective Communications
• Communications involving a group of processes.
• Called by all processes in a communicator.
• Blocking communications.
• Examples:– Barrier synchronization– Broadcast, scatter, gather.– Global summation, global maximum, etc.
(Edinburgh Parallel Computing Centre)
MPI Broadcast
int MPI_Bcast(buffer, count, datatype, root, comm);
(Edinburgh Parallel Computing Centre)
MPI Scatter
int MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
(Edinburgh Parallel Computing Centre)
MPI Gather
int MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
(Edinburgh Parallel Computing Centre)
MPI Reduction Operations
int MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
The op can be:
– MPI_MAX
– MPI_MIN
– MPI_SUM, etc.
– A user-defined operation
(Edinburgh Parallel Computing Centre)
Installation of MPI
• MPICH
– Probably the most popular MPI implementation
– Provides both Unix and Windows versions
– Freely available (try http://www-unix.mcs.anl.gov/mpi/)
• In Windows
– Execute the installation file and follow the instructions
• In most Unix
– “configure” and “make”
Windows MPICH Installation
Unix MPICH Installation
Logging and GUI Timeline
• MPE (Multi-Processing Environment)
– A library with a set of functions
– Generates a logging file
• Jumpshot
– A GUI program
– Shows timeline figures
• MPE and Jumpshot are included in MPICH
MPE_Init_log()
MPE_Describe_state()
MPE_Log_event()
MPE_Finish_log()
Four-function version of MPE
Example of Jumpshot-4(http://www-unix.mcs.anl.gov/perfvis/)