
Page 1: Parallel Programming and MPI

An Advanced Simulation & Computing (ASC) Academic Strategic Alliances Program (ASAP) Center

at The University of Chicago

The Center for Astrophysical Thermonuclear Flashes

FLASH Tutorial, May 13, 2004

Parallel Computing and MPI

Page 2: Parallel Programming and MPI

What is Parallel Computing? And why is it useful?

Parallel computing is more than one CPU working together on one problem.
It is useful when the problem is large and could take very long, or when the data are too big to fit in the memory of one processor.
When to parallelize: when the problem can be subdivided into relatively independent tasks.
How much to parallelize: as long as the speedup relative to a single processor remains of the order of the number of processors.

Page 3: Parallel Programming and MPI

Parallel paradigms

SIMD (single instruction, multiple data): processors work in lock-step.
MIMD (multiple instruction, multiple data): processors do their own thing, with occasional synchronization.
Shared memory: one-sided communication.
Distributed memory: message passing.
Loosely coupled: the process on each CPU is fairly self-contained and relatively independent of processes on other CPUs.
Tightly coupled: CPUs need to communicate with each other frequently.

Page 4: Parallel Programming and MPI

How to Parallelize

Divide a problem into a set of mostly independent tasks:
Partition the problem: tasks get their own data.
Localize each task: it operates on its own data for the most part; try to make it self-contained.
Occasionally data may be needed from other tasks (inter-process communication), or synchronization may be required between tasks (global operations).
Map tasks to different processors: one processor may get more than one task, and the task distribution should be well balanced.

Page 5: Parallel Programming and MPI

New Code Components

Initialization
Query the parallel state: identify this process, identify the number of processes
Exchange data between processes: local, global
Synchronization: barriers, blocking communication, locks
Finalization
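
A minimal sketch of these components in C follows (FLASH itself is written in Fortran; this C version and its printed message are only for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;

        /* Initialization */
        MPI_Init(&argc, &argv);

        /* Query the parallel state: who am I, and how many processes are there? */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        printf("Process %d of %d is alive\n", rank, nprocs);

        /* Synchronization: wait until every process reaches this point */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Finalization */
        MPI_Finalize();
        return 0;
    }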

Page 6: Parallel Programming and MPI

MPI

Message Passing Interface: the standard for the distributed-memory model of parallelism.

MPI-2 adds one-sided communication, commonly associated with shared-memory operations.

Works with communicators: a communicator is a collection of processes; MPI_COMM_WORLD is the default.

Supports both low-level communication operations and composite (collective) operations.

Has blocking and non-blocking operations.

Page 7: Parallel Programming and MPI

Communicators

[Figure: the processes of MPI_COMM_WORLD divided into two sub-communicators, COMM1 and COMM2]
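
One common way to create sub-communicators like COMM1 and COMM2 is MPI_Comm_split (the slides do not show which call was used); a minimal C sketch, with the half-and-half split chosen purely for illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Comm subcomm;   /* plays the role of COMM1 or COMM2 */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Split MPI_COMM_WORLD in two by "color":
           lower ranks form one sub-communicator, upper ranks the other. */
        int color = (rank < nprocs / 2) ? 0 : 1;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

        int subrank, subsize;
        MPI_Comm_rank(subcomm, &subrank);
        MPI_Comm_size(subcomm, &subsize);
        printf("World rank %d is rank %d of %d in sub-communicator %d\n",
               rank, subrank, subsize, color);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }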

Page 8: Parallel Programming and MPI

Low level Operations in MPI

MPI_Init
MPI_Comm_size: find the number of processes
MPI_Comm_rank: find my process number (rank)
MPI_Send / MPI_Recv: communicate with other processes one at a time
MPI_Bcast: global data transmission
MPI_Barrier: synchronization
MPI_Finalize
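
A short C sketch combining these low-level calls (the message contents, tag values, and printed output are made up for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Point-to-point: rank 0 sends one value to rank 1 */
        if (nprocs > 1) {
            double x = 3.14;
            if (rank == 0)
                MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        /* Global data transmission: rank 0 broadcasts a parameter to all */
        int nsteps = (rank == 0) ? 100 : 0;
        MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Synchronization before reporting */
        MPI_Barrier(MPI_COMM_WORLD);
        printf("Rank %d sees nsteps = %d\n", rank, nsteps);

        MPI_Finalize();
        return 0;
    }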

Page 9: Parallel Programming and MPI

Advanced Constructs in MPI

Composite operations: Gather/Scatter, Allreduce, Alltoall
Cartesian grid operations: Shift
Communicators: creating subgroups of processors to operate on
User-defined datatypes
I/O: parallel file operations
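
As an illustration of the composite operations, a hedged C sketch of MPI_Scatter and MPI_Gather (the data, the squaring step, and the choice of rank 0 as root are arbitrary, not anything FLASH prescribes):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* The root scatters one value to each process ... */
        double *senddata = NULL;
        if (rank == 0) {
            senddata = malloc(nprocs * sizeof(double));
            for (int i = 0; i < nprocs; i++) senddata[i] = (double)i;
        }
        double mine;
        MPI_Scatter(senddata, 1, MPI_DOUBLE, &mine, 1, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        /* ... each process works on its own piece ... */
        mine = mine * mine;

        /* ... and the results are gathered back on the root. */
        double *results = (rank == 0) ? malloc(nprocs * sizeof(double)) : NULL;
        MPI_Gather(&mine, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < nprocs; i++)
                printf("result[%d] = %g\n", i, results[i]);
            free(senddata);
            free(results);
        }
        MPI_Finalize();
        return 0;
    }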

Page 10: Parallel Programming and MPI

Communication Patterns

[Figure: schematic diagrams of processes 0-3 illustrating the patterns: Point to Point, One to All Broadcast, Collective, Shift, and All to All]

Page 11: Parallel Programming and MPI

Communication Overheads

Latency vs. bandwidth
Blocking vs. non-blocking: overlap, buffering and copying
Scale of communication: nearest neighbor, short range, long range
Volume of data: resource contention for links
Efficiency: hardware, software, communication method
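
To show how non-blocking operations allow overlap, a C sketch that posts MPI_Irecv/MPI_Isend, does unrelated local work, and only then waits (the ring exchange and the dummy compute loop are made-up examples):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Exchange a value around a ring, overlapping communication with work. */
        int right = (rank + 1) % nprocs;
        int left  = (rank - 1 + nprocs) % nprocs;
        double sendbuf = (double)rank, recvbuf = -1.0;
        MPI_Request reqs[2];

        /* Post the communication first (non-blocking) ... */
        MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... do local computation that does not need the incoming data ... */
        double local = 0.0;
        for (int i = 0; i < 1000000; i++) local += 1e-6;

        /* ... then wait for the messages before using recvbuf. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("Rank %d received %g from rank %d (local work = %g)\n",
               rank, recvbuf, left, local);

        MPI_Finalize();
        return 0;
    }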

Page 12: Parallel Programming and MPI

Parallelism in FLASH

Short-range communications: nearest neighbor
Long-range communications: regridding
Other global operations: all-reduce operations on physical quantities
Specific to solvers: multipole method, FFT-based solvers

Page 13: Parallel Programming and MPI

Domain Decomposition

[Figure: the computational domain split into four patches, one each for processors P0, P1, P2, and P3]
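
A minimal C sketch of such a decomposition in one dimension (the grid size NX is made up, and FLASH actually distributes AMR blocks rather than slices of a single uniform array; this only illustrates assigning contiguous pieces of the domain to processors):

    #include <mpi.h>
    #include <stdio.h>

    /* Split NX cells as evenly as possible among the processes; the first
       (NX % nprocs) ranks each get one extra cell. */
    #define NX 1000

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int base   = NX / nprocs, extra = NX % nprocs;
        int nlocal = base + (rank < extra ? 1 : 0);
        int start  = rank * base + (rank < extra ? rank : extra);

        printf("Rank %d owns cells [%d, %d)\n", rank, start, start + nlocal);

        MPI_Finalize();
        return 0;
    }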

Page 14: Parallel Programming and MPI

Border Cells / Ghost Points

When the solution data (solnData) is split across processors, each processor needs data from its neighbors.
A layer of border (ghost) cells is kept from each neighboring processor.
These ghost cells must be updated every time step.

Page 15: Parallel Programming and MPI

Border/Ghost Cells

Short-range communication

Page 16: Parallel Programming and MPI

Two MPI Methods for Ghost-Cell Exchange

Method 1: Cartesian topology (sketched in code below)
MPI_Cart_create: create the topology
MPE_Decomp1d: domain decomposition on the topology
MPI_Cart_shift: who is on my left/right?
MPI_Sendrecv: exchange ghost cells with the left neighbor
MPI_Sendrecv: exchange ghost cells with the right neighbor

Method 2: manual decomposition
MPI_Comm_rank, MPI_Comm_size: manually decompose the grid over processors
Calculate the left/right neighbors
MPI_Send / MPI_Recv, ordered carefully to avoid deadlocks
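
A condensed C sketch of the first method, using a 1-D Cartesian topology and MPI_Sendrecv to fill one ghost cell on each side (the local array size and data values are made up, and the MPE_Decomp1d step is replaced by a fixed per-process size):

    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 8   /* interior cells per process (made-up size) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Create a 1-D, non-periodic Cartesian topology over all processes. */
        MPI_Comm cart;
        int dims[1] = { nprocs }, periods[1] = { 0 };
        MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &cart);
        MPI_Comm_rank(cart, &rank);

        /* Who is on my left and right?  MPI_PROC_NULL at the domain edges. */
        int left, right;
        MPI_Cart_shift(cart, 0, 1, &left, &right);

        /* Local array: u[0] and u[NLOCAL+1] are the ghost cells. */
        double u[NLOCAL + 2];
        for (int i = 1; i <= NLOCAL; i++) u[i] = rank;
        u[0] = u[NLOCAL + 1] = -1.0;

        /* Send my rightmost interior cell to the right neighbor while
           receiving the left neighbor's rightmost cell into my left ghost. */
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],      1, MPI_DOUBLE, left,  0,
                     cart, MPI_STATUS_IGNORE);
        /* Send my leftmost interior cell to the left neighbor while
           receiving the right neighbor's leftmost cell into my right ghost. */
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     cart, MPI_STATUS_IGNORE);

        printf("Rank %d ghosts: left = %g, right = %g\n",
               rank, u[0], u[NLOCAL + 1]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }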

Page 17: Parallel Programming and MPI

Adaptive Grid Issues

Discretization is not uniform.
Simple left-right guard-cell fills are inadequate.
Adjacent grid points may not be mapped to nearest neighbors in the processor topology.
Redistribution of work is necessary.

Page 18: Parallel Programming and MPI

Regridding

Change in the number of cells/blocks: some processors get more work than others, causing load imbalance.
Redistribute data to even out the work on all processors.
This involves long-range communications and large quantities of data being moved.

Page 19: Parallel Programming and MPI

Regridding

Page 20: Parallel Programming and MPI

Other parallel operations in FLASH

Global max/sum, etc. (Allreduce): physical quantities, in solvers, performance monitoring
Alltoall: FFT-based solver on the uniform grid (UG)
User-defined datatypes and file operations: parallel I/O
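
As an example of the global reductions, a C sketch of an all-reduce that computes a global maximum (the quantity and its use are illustrative assumptions, not FLASH's actual call sites):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process computes a local maximum over its own data
           (here a made-up value derived from the rank). */
        double local_max = 1.0 + 0.1 * rank;

        /* Global maximum across all processes; every process gets the result. */
        double global_max;
        MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE,
                      MPI_MAX, MPI_COMM_WORLD);

        printf("Rank %d: local max = %g, global max = %g\n",
               rank, local_max, global_max);

        MPI_Finalize();
        return 0;
    }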