CS 471 Final Project: 2D Advection/Wave Equation Using Fourier Methods
December 10, 2003
Jose L. Rodriguez
[email protected]


Page 1: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

CS 471 Final Project
2D Advection/Wave Equation Using Fourier Methods
December 10, 2003

Jose L. Rodriguez

[email protected]

Page 2: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Project Description

• Use a Spectral Method (Fourier Method) for the equation:

$$\frac{\partial C}{\partial t} + a_x \frac{\partial C}{\partial x} + a_y \frac{\partial C}{\partial y} = 0$$

• Use the JST Runge-Kutta Time Integrator for each time step.

Page 3: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Algorithm

• For each time step that we take, we do s sub-stages (a minimal C sketch follows):

Set C = C^n
for k = s : -1 : 1
    C = C^n - (dt/k) * (a_x ∂C/∂x + a_y ∂C/∂y)
end
C^{n+1} = C
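As a rough illustration, here is a minimal serial C sketch of this loop. The function name jst_rk_step and the helper advection_rhs are hypothetical, not taken from the project code:

#include <string.h>

/* One JST Runge-Kutta time step: s sub-stages, each restarting from C^n.
 * advection_rhs() is a hypothetical helper that evaluates
 * a_x*dC/dx + a_y*dC/dy into rhs. */
void jst_rk_step(double *c, double *cn, double *rhs, int npts,
                 int s, double dt,
                 void (*advection_rhs)(const double *c, double *rhs, int npts))
{
    int j, k;
    memcpy(cn, c, npts * sizeof(double));      /* save C^n */
    for (k = s; k >= 1; k--) {
        advection_rhs(c, rhs, npts);           /* a_x C_x + a_y C_y */
        for (j = 0; j < npts; j++)
            c[j] = cn[j] - (dt / (double) k) * rhs[j];
    }
    /* c now holds C^{n+1} */
}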

Page 4: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Algorithm with Spectral Representation

Set c = C^n
for k = s : -1 : 1
    c_hat = fft(c)
    dhat_x = D_x c_hat
    dhat_y = D_y c_hat
    d_x = ifft(dhat_x)
    d_y = ifft(dhat_y)
    for j = 1, ..., N^2
        c_j = C^n_j - (dt/k) * (a_x d_x,j + a_y d_y,j)
    end
end
C^{n+1} = c

Here D_x and D_y are the diagonal spectral differentiation operators, i.e. multiplication by i*k_x and i*k_y in Fourier space (a C sketch of the D_x step follows).
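A sketch of the D_x step using the FFTW 2.x serial interface (the version current in 2003). This assumes a 2*pi-periodic domain, takes the first (row) dimension as x, and omits special handling of the Nyquist mode:

#include <fftw.h>

/* Spectral x-derivative of an N x N row-major complex field c,
 * using chat as scratch: chat = fft(c); chat *= i*kx; c = ifft(chat)/N^2. */
void spectral_dx(fftw_complex *c, fftw_complex *chat, int N)
{
    fftwnd_plan fwd = fftw2d_create_plan(N, N, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftwnd_plan inv = fftw2d_create_plan(N, N, FFTW_BACKWARD, FFTW_ESTIMATE);
    int i, j;

    fftwnd_one(fwd, c, chat);                      /* chat = fft(c) */
    for (i = 0; i < N; i++) {
        double kx = (i <= N / 2) ? i : i - N;      /* wavenumber of row i */
        for (j = 0; j < N; j++) {
            double re = chat[i*N + j].re, im = chat[i*N + j].im;
            chat[i*N + j].re = -kx * im;           /* multiply by i*kx */
            chat[i*N + j].im =  kx * re;
        }
    }
    fftwnd_one(inv, chat, c);                      /* unnormalized ifft */
    for (i = 0; i < N * N; i++) {                  /* FFTW leaves factor N^2 */
        c[i].re /= (double) (N * N);
        c[i].im /= (double) (N * N);
    }
    fftwnd_destroy_plan(fwd);
    fftwnd_destroy_plan(inv);
}

The same pattern with i*k_y applied along the second dimension gives the d_y term.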

Page 5: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Code Development

• Develop serial C code, based on the given Matlab code, using the FFTW libraries for the fft and ifft calls.
• Very straightforward.
• The code was verified simply by comparing its output with the Matlab result.
• Develop parallel C code based on the serial C code.
• The FFTW libraries provide fft and ifft calls that do all the MPI calls for you.
• The tricky part of this development was placing the data correctly on each processor for the fft and ifft calls.
• The code was again verified by comparison with the Matlab result.

Page 6: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Results: N=512, 1000 Iterations

Page 7: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Results: N=512, 1000 Iterations

Page 8: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Results: N=512, 1000 Iterations

Page 9: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Results: N=512, 1000 Iterations

Page 10: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Usage of FFTW Libraries in Parallel: Function Calls

Notice: Message Passing is transparent to the user

Page 11: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Usage of FFTW Libraries in Parallel: MPI Data Layout

• The transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over.

• In particular, the array is divided according to the rows (first dimension) of the data: each process gets a subset of the rows of the data. (This is sometimes called a "slab decomposition.") One consequence of this is that you can't take advantage of more processors than you have rows (e.g. a 64x64x64 matrix can use at most 64 processors). This isn't usually much of a limitation, however, as each processor needs a fair amount of data in order for the parallel-computation benefits to outweigh the communication costs.

Taken from the FFTW website/documentation.

Page 12: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Usage of FFTW Libraries in Parallel: MPI Data Layout

These calls are needed to create the fft and ifft plans, and to find out how much memory must be allocated locally (sketched below).
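A sketch of what those calls look like in the FFTW 2.x MPI interface; the function name setup_mpi_fft is hypothetical, and the variable names follow the slides:

#include <stdlib.h>
#include <mpi.h>
#include <fftw_mpi.h>

/* Create forward/inverse plans for an N x N transform and ask FFTW how
 * the slab decomposition falls on this process and how much local
 * storage must be allocated. */
fftw_complex *setup_mpi_fft(int N, fftwnd_mpi_plan *fwd, fftwnd_mpi_plan *inv,
                            int *ilocal_nx, int *ilocal_x_start)
{
    int local_ny_after_transpose, local_y_start_after_transpose;
    int total_local_size;

    *fwd = fftw2d_mpi_create_plan(MPI_COMM_WORLD, N, N,
                                  FFTW_FORWARD,  FFTW_ESTIMATE);
    *inv = fftw2d_mpi_create_plan(MPI_COMM_WORLD, N, N,
                                  FFTW_BACKWARD, FFTW_ESTIMATE);

    fftwnd_mpi_local_sizes(*fwd, ilocal_nx, ilocal_x_start,
                           &local_ny_after_transpose,
                           &local_y_start_after_transpose,
                           &total_local_size);

    /* total_local_size may exceed ilocal_nx * N; allocate what FFTW asks. */
    return (fftw_complex *) malloc(total_local_size * sizeof(fftw_complex));
}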

Page 13: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Usage of FFTW Libraries in Parallel: MPI Data Layout

ilocal_x_start tells us where we are (which row) in the global 2D array, and ilocal_nx tells us how many rows the current processor holds.

The data uses row-major format (see the indexing sketch below).
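For illustration, a hypothetical sketch of filling the local slab with an initial condition f(x, y) using exactly this layout; a 2*pi-periodic domain is assumed, and rows are taken as the x direction to match ilocal_nx and ilocal_x_start:

#include <math.h>
#include <fftw.h>

/* Fill this process's slab of the global N x N grid, row-major:
 * local row i corresponds to global row ilocal_x_start + i. */
void fill_local_slab(fftw_complex *data, int N,
                     int ilocal_nx, int ilocal_x_start,
                     double (*f)(double x, double y))
{
    int i, j;
    for (i = 0; i < ilocal_nx; i++) {
        int iglobal = ilocal_x_start + i;             /* global row */
        for (j = 0; j < N; j++) {
            double x = 2.0 * M_PI * iglobal / (double) N;
            double y = 2.0 * M_PI * j       / (double) N;
            data[i * N + j].re = f(x, y);             /* row-major index */
            data[i * N + j].im = 0.0;
        }
    }
}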

Page 14: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Notice: Message Passing is transparent to the user

Page 15: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Parallel Results

• Two versions were written (the two call shapes are sketched after this list).
• A Non-Efficient version that is not optimized for the FFTW MPI calls:
• No extra work array is used.
• An extra un-transposing of the data is done before coming out of the fft calls.
• An Efficient version that is optimized for the FFTW MPI calls:
• An extra work array is used.
• The data is left transposed, so the extra communication step of un-transposing it is avoided.
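In FFTW 2.x terms, the difference plausibly comes down to how fftwnd_mpi is called. A sketch (not the project's actual code), reusing fwd_plan, data, and a work array of total_local_size elements from the setup sketch earlier:

/* Non-Efficient version: no work array, and the output is returned in
 * normal order, which costs an extra un-transposing communication. */
fftwnd_mpi(fwd_plan, 1, data, NULL, FFTW_NORMAL_ORDER);

/* Efficient version: a work array is supplied and the output is left
 * transposed, skipping that extra communication step; later stages must
 * then index the data in its transposed layout. */
fftwnd_mpi(fwd_plan, 1, data, work, FFTW_TRANSPOSED_ORDER);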

Page 16: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Notice the slight differences.

Page 17: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

[Figure: two plots for N=256, Iterations=100. Left: time taken (seconds) vs. number of processors on a log-log scale, for the Efficient and Non-Efficient versions. Right: efficiency vs. number of processors (0-35) for both versions.]

The Efficient Version is both faster and more efficient.

Page 18: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

[Figure: two plots for N=512, Iterations=100. Left: time taken (seconds) vs. number of processors on a log-log scale, for the Efficient and Non-Efficient versions. Right: efficiency vs. number of processors (0-35) for both versions.]

We begin to see some scaling; however, efficiency starts to taper off, indicating that much of the time is spent in communication.

Page 19: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

[Figure: two plots for N=1024, Iterations=100. Left: time taken (seconds) vs. number of processors on a log-log scale, for the Efficient and Non-Efficient versions. Right: efficiency vs. number of processors (0-35) for both versions.]

Overall, we see the same trend as N increases: some scaling as the number of processors grows, but the speedup starts to flatten and efficiency steadily decreases.

Page 20: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

The Sea of Black for the Non-Efficient Version

N=256, 10 Iterations

Page 21: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

A lot of communication between processors.

Page 22: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Communication goes on between each pair of processors with MPI_Sendrecv, since each processor needs data from every other processor. We can actually see here when an fft is being performed. (A sketch of this exchange pattern follows.)
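A hypothetical sketch of that pairwise pattern, illustrative only (not FFTW's internal code): each process swaps one block with every other process in turn.

#include <mpi.h>

void pairwise_exchange(double *sendbuf, double *recvbuf,
                       int blocksize, int rank, int nprocs)
{
    MPI_Status status;
    int p;
    for (p = 0; p < nprocs; p++) {
        if (p == rank) continue;
        /* Swap block p: send ours to process p, receive its block back. */
        MPI_Sendrecv(sendbuf + p * blocksize, blocksize, MPI_DOUBLE, p, 0,
                     recvbuf + p * blocksize, blocksize, MPI_DOUBLE, p, 0,
                     MPI_COMM_WORLD, &status);
    }
}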

Page 23: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

8 processors and 16 processors: same trend of communication.

Page 24: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

The Sea of White for the Efficient Version

N=256, 10 Iterations

Page 25: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

The Efficient Version uses MPI_Alltoall for its communication between all processors (a one-call sketch follows).
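The same exchange expressed as one collective call; again a hypothetical illustration rather than the project's code. Handing the whole exchange to the MPI implementation in a single call lets it schedule all the transfers at once, which matches the much whiter trace:

#include <mpi.h>

/* Each process sends block p of sendbuf to process p and receives one
 * block from every process, all in a single collective. */
void alltoall_exchange(double *sendbuf, double *recvbuf, int blocksize)
{
    MPI_Alltoall(sendbuf, blocksize, MPI_DOUBLE,
                 recvbuf, blocksize, MPI_DOUBLE, MPI_COMM_WORLD);
}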

Page 26: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Again, the white bars for each process show when an fft call is being performed.

Page 27: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

8 processors and 16 processors: same trend of communication.

Page 28: CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods

Conclusions

• A lot of time is spent in communication, since each process communicates with every other process.
• Efficiency goes down as a result: as the number of processes increases for a given size N, more communication is needed.
• We saw some scaling, but it starts to drop off as the number of processors increases (efficiency issues).
• Time spent on this project:
• Code development: ~8 hours, including debugging
• Data collection: ~2 days
• Overall: quite a bit of time