Introduction to Parallel Programming (Message Passing)

Francisco Almeida ([email protected]), Parallel Computing Group


Page 1: Introduction to Parallel Programming  (Message Passing)

Introduction to Parallel Programming (Message Passing)

Francisco Almeida

[email protected]

Parallel Computing Group

Page 2: Introduction to Parallel Programming  (Message Passing)

Beowulf Computers

• Distributed Memory

• COTS: Commercial-Off-The-Shelf computers

Page 3: Introduction to Parallel Programming  (Message Passing)
Page 4: Introduction to Parallel Programming  (Message Passing)

The Parallel Model

• Computational Models: PRAM, BSP, LogP

• Programming Models: PVM, MPI, HPF, Threads, OpenMP

• Architectural Models: Parallel Architectures

Page 5: Introduction to Parallel Programming  (Message Passing)

The Message Passing Model

Figure: several processors connected by an interconnection network; processes cooperate by exchanging messages with Send(parameters) and Recv(parameters).

Page 6: Introduction to Parallel Programming  (Message Passing)

Network of Workstations (Hardware)

• Sun Sparc Ultra 1, 143 MHz
• Etherswitch
• Distributed Memory
• Non-Shared Memory Space
• Star Topology

Page 7: Introduction to Parallel Programming  (Message Passing)

SGI Origin 2000 (Hardware)

• C4-CEPBA
• 64 R10000 processors
• 8 GB memory
• 32 Gflop/s
• Shared Distributed Memory
• Hypercubic Topology

Page 8: Introduction to Parallel Programming  (Message Passing)

Digital AlphaServer 8400 (Hardware)

• C4-CEPBA
• 10 Alpha 21164 processors
• 2 GB memory
• 8.8 Gflop/s
• Shared Memory
• Bus Topology

Page 9: Introduction to Parallel Programming  (Message Passing)

Drawbacks that Arise when Solving Problems Using Parallelism

Parallel programming is more complex than sequential programming.

Results may vary as a consequence of intrinsic non-determinism.

New problems appear: deadlocks, starvation...

It is more difficult to debug parallel programs.

Parallel programs are less portable.

Page 10: Introduction to Parallel Programming  (Message Passing)

MPI

Figure: MPI as the convergence of earlier message passing systems (EUI, p4, PVM, Express, Zipcode, CMMD, PARMACS), serving parallel applications, parallel libraries and parallel languages.

Page 11: Introduction to Parallel Programming  (Message Passing)

MPI

• What is MPI?
  • The Message Passing Interface standard
  • The first standard and portable message passing library with good performance
  • "Standard" by consensus of MPI Forum participants from over 40 organizations
  • Finished and published in May 1994, updated in June 1995

• What does MPI offer?
  • Standardization - on many levels
  • Portability - to existing and new systems
  • Performance - comparable to vendors' proprietary libraries
  • Richness - extensive functionality, many quality implementations

Page 12: Introduction to Parallel Programming  (Message Passing)

MPI hello.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
  int name, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &name);   /* name = rank of this process */
  MPI_Comm_size(MPI_COMM_WORLD, &p);      /* p = number of processes     */

  if (name != 0) {
    printf("Processor %d of %d\n", name, p);
    sprintf(message, "greetings from process %d!", name);
    dest = 0;
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    printf("processor 0, p = %d ", p);
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}

Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4 greetings from process 1!
greetings from process 2!
greetings from process 3!

A Simple MPI Program

mpicc -o hello hello.c

mpirun -np 4 hello

Page 13: Introduction to Parallel Programming  (Message Passing)

Basic Communication Operations

Page 14: Introduction to Parallel Programming  (Message Passing)

One-to-all Broadcast / Single-node Accumulation

Figure: a one-to-all broadcast sends a message M from node 0 to all p nodes; the dual operation, single-node accumulation, combines a message from every node into node 0, proceeding in steps 1, 2, ..., p.

Page 15: Introduction to Parallel Programming  (Message Passing)

Broadcast on Hypercubes

Figure: first and second steps of a broadcast on a three-dimensional hypercube (nodes 0-7); in each step, every node that already holds the message forwards it along one additional dimension.

Page 16: Introduction to Parallel Programming  (Message Passing)

Broadcast on Hypercubes

Figure: third step of the hypercube broadcast; after log2(8) = 3 steps all eight nodes hold the message.

Page 17: Introduction to Parallel Programming  (Message Passing)

MPI Broadcast

int MPI_Bcast(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm comm
);

Broadcasts a message from the process with rank "root" to all other processes of the group.
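As an illustration (not from the slides; the value of n and the variable names are made up), a complete program in which process 0 supplies n and MPI_Bcast makes it available to every process:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        n = 1000;                       /* only the root knows the value initially */
    /* after the call every process holds n == 1000 */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d: n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}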

Page 18: Introduction to Parallel Programming  (Message Passing)

Reduction on Hypercubes

@ is a commutative and associative operator.

A_i resides in processor i. Every processor has to obtain A0 @ A1 @ ... @ A(p-1).

Figure: reduction on a three-dimensional hypercube (processors 000 to 111). In each step, processors exchange partial results along one dimension; after three steps every processor holds A0 @ A1 @ ... @ A7.

Page 19: Introduction to Parallel Programming  (Message Passing)

Reductions with MPI

int MPI_Reduce(
    void *sendbuf,
    void *recvbuf,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm comm
);

Reduces values on all processes to a single value on the root process.

int MPI_Allreduce(
    void *sendbuf,
    void *recvbuf,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm comm
);

Combines values from all processes and distributes the result back to all processes.
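A hedged sketch of the difference between the two calls (the local values are made up): MPI_Reduce leaves the sum only on the root, while MPI_Allreduce leaves it on every process.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, p;
    int local, total_on_root = 0, total_everywhere = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    local = rank + 1;                    /* each process contributes rank + 1 */

    /* only process 0 receives the sum 1 + 2 + ... + p */
    MPI_Reduce(&local, &total_on_root, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("reduce: total on root = %d\n", total_on_root);

    /* every process receives the same sum */
    MPI_Allreduce(&local, &total_everywhere, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("allreduce: process %d sees total = %d\n", rank, total_everywhere);

    MPI_Finalize();
    return 0;
}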

Page 20: Introduction to Parallel Programming  (Message Passing)

All-To-All BroadcastMultinode Accumulation

. . .0 p1 . . .0 p1

M1All-to-all broadcast

Single-node AccumulationM0

Mp

M1

M2 Mp

M0 M0

M1 M1

MpMp

Reductions, Prefixsums

Page 21: Introduction to Parallel Programming  (Message Passing)

MPI Collective Operations

MPI Operator Operation

---------------------------------------------------------------

MPI_MAX maximum

MPI_MIN minimum

MPI_SUM sum

MPI_PROD product

MPI_LAND logical and

MPI_BAND bitwise and

MPI_LOR logical or

MPI_BOR bitwise or

MPI_LXOR logical exclusive or

MPI_BXOR bitwise exclusive or

MPI_MAXLOC max value and location

MPI_MINLOC min value and location
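MPI_MAXLOC and MPI_MINLOC reduce value/location pairs and therefore use paired datatypes such as MPI_DOUBLE_INT. A small illustrative sketch (the local values are arbitrary) that finds the largest value and the rank that owns it:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    struct { double value; int rank; } in, out;   /* layout required by MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.value = (double) ((rank * 7) % 5);   /* some local value */
    in.rank  = rank;                        /* the location to report */

    /* out.value is the global maximum, out.rank the rank that holds it */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %.1f found on process %d\n", out.value, out.rank);
    MPI_Finalize();
    return 0;
}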

Page 22: Introduction to Parallel Programming  (Message Passing)

The Master Slave Paradigm

Figure: a master process distributes work to, and collects results from, a set of slave processes.

Page 23: Introduction to Parallel Programming  (Message Passing)

Computing π

π = ∫_0^1 4/(1 + x^2) dx

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
  x = h * ((double) i - 0.5);
  sum += f(x);                 /* f(x) = 4 / (1 + x^2) */
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

mpirun -np 3 cpi
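The fragment above omits the declarations, the definition of f and the handling of n; a self-contained version along the lines of the classic cpi example (fixing n in the program instead of reading it is an assumption made here for brevity):

#include <stdio.h>
#include "mpi.h"

/* the function being integrated: 4 / (1 + x^2) */
static double f(double x) { return 4.0 / (1.0 + x * x); }

int main(int argc, char *argv[]) {
    int myid, numprocs, i, n = 10000;    /* number of intervals; process 0 would normally read it */
    double h, x, sum, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {   /* cyclic distribution of the intervals */
        x = h * ((double) i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("pi is approximately %.16f\n", pi);
    MPI_Finalize();
    return 0;
}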

Page 24: Introduction to Parallel Programming  (Message Passing)

The Portability of the Efficiency

Page 25: Introduction to Parallel Programming  (Message Passing)

The Sequential Algorithm

void mochila01_sec (void)
{
  unsigned v1;
  int c, k;

  for (c = 0; c <= C; c++)
    f[0][c] = 0;
  for (k = 1; k <= N; k++) {
    for (c = 0; c <= C; c++) {
      f[k][c] = f[k-1][c];
      if (c >= w[k]) {
        v1 = f[k-1][c - w[k]] + p[k];
        if (f[k][c] < v1)      /* keep the larger benefit */
          f[k][c] = v1;
      }
    }
  }
}

f[k][c] = max { f[k-1][c], f[k-1][c - w[k]] + p[k] }   for c >= w[k]

Figure: the dynamic programming table with n rows and C columns; row f[k] is computed from row f[k-1]. Complexity: O(n*C).

Page 26: Introduction to Parallel Programming  (Message Passing)

The Parallel Algorithm

void transition (int stage)
{
  unsigned x;
  int c, k;

  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);                              /* receive f[k-1][c] from the previous stage */
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));     /* forward f[k][c] to the next stage */
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}

Figure: processor k - 1 computes and sends f[k-1][c]; processor k receives it and computes f[k][c], for c = 0, ..., C.

f[k][c] = max { f[k-1][c], f[k-1][c - w[k]] + p[k] }
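IN and OUT are the communication primitives of the authors' pipeline tool (llp, used again in later slides), not MPI calls. Purely as an illustration, a minimal sketch of how a linear pipeline could provide them on top of MPI point-to-point operations, assuming each processor receives from rank - 1 and sends to rank + 1, and that rank and p have been set with MPI_Comm_rank and MPI_Comm_size after MPI_Init:

#include "mpi.h"

static int rank, p;    /* assumed to be initialized after MPI_Init */

/* receive one unsigned from the previous stage */
static void IN(unsigned *x) {
    if (rank == 0) {
        *x = 0;        /* no predecessor: matches the boundary condition f[0][c] = 0 */
    } else {
        MPI_Status status;
        MPI_Recv(x, 1, MPI_UNSIGNED, rank - 1, 0, MPI_COMM_WORLD, &status);
    }
}

/* forward values to the next stage (the last stage keeps its results) */
static void OUT(unsigned *x, int count, int size) {
    (void) size;       /* llp passes the element size; not needed with MPI_UNSIGNED */
    if (rank < p - 1)
        MPI_Send(x, count, MPI_UNSIGNED, rank + 1, 0, MPI_COMM_WORLD);
}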

Page 27: Introduction to Parallel Programming  (Message Passing)

The Evolution of the Pipeline

Figure: snapshots of the pipeline advancing over the n x C table.

Page 28: Introduction to Parallel Programming  (Message Passing)

The Running Time

With one stage per processor, the running time is n - 1 + C steps: n - 1 steps to fill the pipeline plus C steps for the last stage.

Page 29: Introduction to Parallel Programming  (Message Passing)

Processor Virtualization

Figure: block mapping; each of the processors 0, 1, 2, ... is assigned a block of n/p consecutive stages, each stage spanning the C columns.

Page 30: Introduction to Parallel Programming  (Message Passing)

Processor Virtualization

Figure: a later snapshot of the same block mapping.

Page 31: Introduction to Parallel Programming  (Message Passing)

Processor Virtualization

Figure: a further snapshot of the block mapping animation.

Page 32: Introduction to Parallel Programming  (Message Passing)

The Running Time

Figure: with block mapping, if each processor sweeps every one of its n/p stages over all C columns before passing results on, its successor can only start after (n/p - 1)C steps; the total running time is

(p - 1)(n/p - 1)C + nC/p ≈ nC

so almost no speedup is obtained.

Page 33: Introduction to Parallel Programming  (Message Passing)

Processor Virtualization

Figure: block mapping revisited; each processor handles its block of n/p stages column by column over the C columns.

Page 34: Introduction to Parallel Programming  (Message Passing)

The Running Time

Figure: when each column is forwarded as soon as the block has processed it, the successor starts after only n/p steps; the total running time is

(p - 1)(n/p) + nC/p ≈ nC/p

Page 35: Introduction to Parallel Programming  (Message Passing)

Block Mapping

void transition (void)
{
  unsigned c, k, i, inData;

  for (c = 0; c <= C; c++) {
    IN(&inData);                       /* receive f[k-1][c] from the previous processor */
    k = calcInitStage();
    for (i = 0; i < width; k++, i++) {
      f[i][c] = max(f[i][c], inData);
      if (c + w[k] <= C)
        f[i][c + w[k]] = inData + p[k];
      inData = f[i][c];
    }
    OUT(&f[i-1][c], 1, sizeof(unsigned));   /* forward the last stage of the block */
  }
}

width = N / num_proc;
if (f_name < N % num_proc)   /* load balancing */
  width++;

int calcInitStage( void )
{
  return (f_name < N % num_proc) ?
         f_name * width :
         (f_name * width) + (N % num_proc);
}
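As an illustration of the load balancing: with N = 10 stages and num_proc = 3, processor 0 gets width = 4 and initial stage 0, processor 1 gets width = 3 and initial stage 4, and processor 2 gets width = 3 and initial stage 7, so the ten stages are split into the blocks 0-3, 4-6 and 7-9.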

Page 36: Introduction to Parallel Programming  (Message Passing)

Cyclic Mapping

Figure: stages are dealt to processors 0, 1, 2, ... in round-robin order; stages not yet started wait in a queue (cola).

Page 37: Introduction to Parallel Programming  (Message Passing)

The Running Time

Figure: with cyclic mapping (queue of stages), the pipeline fills after p - 1 steps and the running time is approximately (p - 1) + (n/p)C.

Page 38: Introduction to Parallel Programming  (Message Passing)

Cyclic Mapping

void transition (int stage)
{
  unsigned x;
  int c, k;

  k = stage;
  for (c = 0; c <= C; c++)
    f[c] = 0;
  for (c = 0; c <= C; c++) {
    IN(&x);
    f[c] = max(f[c], x);
    OUT(&f[c], 1, sizeof(unsigned));
    if (C >= c + w[k])
      f[c + w[k]] = x + p[k];
  }
}

int bands = num_bands(n);
for (i = 0; i < bands; i++) {
  stage = f_name + i * num_proc;   /* processor f_name runs stages f_name, f_name + p, ... */
  if (stage <= n - 1)
    transition (stage);
}

unsigned num_bands (unsigned n)
{
  float aux_f;
  unsigned aux;

  aux_f = (float) n / (float) num_proc;
  aux = (unsigned) aux_f;
  if (aux_f > aux)
    return (aux + 1);
  return (aux);
}

Page 39: Introduction to Parallel Programming  (Message Passing)

Advantages and Disadvantages

Block Distribution:
– Minimizes the number of communications
– Penalizes the startup time of the pipeline

Cyclic Distribution:
– Minimizes the startup time of the pipeline
– May produce communication overhead

Page 40: Introduction to Parallel Programming  (Message Passing)

Transputer Network vs. Local Area Network

Local Area Network:
– Coarse Grain
– Serial Communications

Transputer Network:
– Fine Grain
– Parallel Communications

Page 41: Introduction to Parallel Programming  (Message Passing)

Computational Results

Figure: running time versus number of processors (1 to 32) for problem sizes 4x8 up to 16x128, on the transputer network and on the local area network.

Page 42: Introduction to Parallel Programming  (Message Passing)

The Resource Allocation Problem

Given M units of an indivisible resource and a set of N tasks, f_j(x) is the benefit obtained when x units of the resource are allocated to task j.

maximize    sum_{j=1..N} f_j(x_j)

subject to  sum_{j=1..N} x_j = M,
            x_j integer, 0 <= x_j <= B_j,  j = 1, ..., N.

Page 43: Introduction to Parallel Programming  (Message Passing)

RAP- The Sequential Algorithm

G[k][m] = max { G[k-1][m-i] + f_k(i) : 0 <= i <= m }

int rap_seq(void)
{
  int i, k, m;

  for (m = 0; m <= M; m++)
    G[0][m] = 0;
  for (k = 1; k <= N; k++) {
    for (m = 0; m <= M; m++) {
      G[k][m] = 0;                     /* benefits are assumed nonnegative */
      for (i = 0; i <= m; i++)
        G[k][m] = max(G[k][m], G[k-1][i] + f(k, m - i));
    }
  }
  return G[N][M];
}

Figure: the N x M dynamic programming table; stage k is computed from stage k - 1. Complexity: O(N*M^2).

Page 44: Introduction to Parallel Programming  (Message Passing)

RAP - The Parallel Algorithm

void transition (int stage)
{
  int m, j, x, k;

  for (m = 0; m <= M; m++)
    G[m] = 0;
  k = stage;
  for (m = 0; m <= M; m++) {
    IN(&x);                              /* receive G[k-1][m] from the previous stage */
    G[m] = max(G[m], x + f(k-1, 0));
    OUT(&G[m], 1, sizeof(int));          /* forward G[k][m] to the next stage */
    for (j = m + 1; j <= M; j++)
      G[j] = max(G[j], x + f(k - 1, j - m));
  }
}

Figure: processor k - 1 computes and sends G[k-1][m]; processor k receives it and computes G[k][m], for m = 0, ..., M.

G[k][m] = max { G[k-1][m-i] + f_k(i) : 0 <= i <= m }

Page 45: Introduction to Parallel Programming  (Message Passing)

The Cray T3E

CRAY T3E:
– Shared Address Space
– Three-Dimensional Toroidal Network

Page 46: Introduction to Parallel Programming  (Message Passing)

Block - Cyclic Mapping

Figure: block-cyclic mapping with grain g; blocks of g consecutive stages are dealt to the processors 0, 1, 2, ... in round-robin order, and pending blocks wait in a queue. The running time shown is approximately g(p - 1) + g M^2 (n/(g p)).

Page 47: Introduction to Parallel Programming  (Message Passing)

Computational Results

Figure: results on the Cray T3E. Running time and speedup as a function of the grain (1, 2, 5, 10, 20, 40) for 2, 4, 8 and 16 processors, and as a function of the number of processors for problem sizes 10x100, 100x1000, 400x1000 and 1000x1000.

Page 48: Introduction to Parallel Programming  (Message Passing)

Linear Model to Predict Communication Performance

Time to send n bytes = (per-byte cost) * n + (startup latency)

Fitted models for the two machines: 5E-08 n + 5E-05 and 7E-07 n + 0.0003 (seconds).

Figure: measured communication time versus message size (up to about 1,000,000 bytes) on BEOULL and the CRAY T3E, on logarithmic and linear scales.
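The two constants of such a linear model are typically measured with a ping-pong experiment and then fitted by least squares; a sketch of the measurement step (message sizes and repetition count are arbitrary choices, and at least two MPI processes are assumed):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define REPS 100

int main(int argc, char *argv[]) {
    int rank, sizes[] = { 1024, 8192, 65536, 524288 };
    char *buf = malloc(524288);
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int s = 0; s < 4; s++) {
        int n = sizes[s];
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {              /* ping-pong between ranks 0 and 1 */
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)                    /* average one-way time for n bytes */
            printf("%d bytes: %g s\n", n, (MPI_Wtime() - t0) / (2.0 * REPS));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Fitting a straight line through the (size, time) points then gives the per-byte cost and the latency.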

Page 50: Introduction to Parallel Programming  (Message Passing)

Buffering Data

Virtual process name runs on real processor fname if (name / grain) mod p == fname.

Figure: with P = 2 and Grain = 3, virtual processes 0, 1, 2 run on processor 0, processes 3, 4, 5 on processor 1, processes 6, 7, 8 on processor 0 again, and so on. SET_BUFIO(1, size) sets the size B of the buffer used by the IN and OUT calls.
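The mapping rule can be checked with a few lines; an illustrative snippet that prints the owner of each virtual process for the values of the figure (P = 2, Grain = 3):

#include <stdio.h>

int main(void) {
    int p = 2, grain = 3;
    /* virtual process "name" runs on real processor (name / grain) mod p */
    for (int name = 0; name < 9; name++)
        printf("virtual %d -> processor %d\n", name, (name / grain) % p);
    return 0;
}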

Page 51: Introduction to Parallel Programming  (Message Passing)

The Knapsack Problem (N = 12800, M = 12800), Cray T3E

Figure: running time (sec) of problem p128-128.knp as a function of the grain and the buffer size, for np = 2, 4, 8 and 16 processors.

Page 52: Introduction to Parallel Programming  (Message Passing)

The Resource Allocation Problem, Cray T3E

Figure: running time (sec) of the 1000x1000 problem as a function of the grain and the buffer size, for np = 2, 4, 8 and 16 processors.

Page 53: Introduction to Parallel Programming  (Message Passing)

Portability of the Efficiency

A disappointing contrast in parallel computing is the gap between the peak performance of parallel systems and the actual performance of parallel applications.

Metrics, techniques and tools have been developed to understand the sources of performance degradation.

An effective parallel program development cycle may iterate many times before achieving the desired performance.

Performance prediction is important for achieving efficient execution of parallel programs, since it avoids the cost of coding and debugging inefficient strategies.

Most of the approaches to performance analysis fall into two categories: Analytical Modeling and Performance Profiling.

Page 54: Introduction to Parallel Programming  (Message Passing)

Performance Analysis

Profiling may be conducted on an existing parallel system to recognize current performance bottlenecks, correct them, and identify and prevent potential future performance problems.

It is architecture dependent: the majority of the performance metrics and tools devised reflect their orientation towards the measurement-modify paradigm.

Examples of tools: PICL, Dimemas, Kpi; ParaGraph, Vampir, Paraver.

Figure: the profiling cycle: instrumentation, computation, profile analysis, new tuning parameters, with run time and error predictions as outputs.

Page 55: Introduction to Parallel Programming  (Message Passing)

Performance Analysis

Analytical Modeling:
– Provides a structured way for understanding performance problems
– Architecture independent
– Has predictive ability
– Modeling is not a trivial task: the model must be simple enough to be tractable, and sufficiently detailed to be accurate
– PRAM, LogP, BSP, BSPWB, etc.

Figure: the modeling cycle: computation, analytical modeling, optimal parameter prediction, with run time and error predictions as outputs.

Page 56: Introduction to Parallel Programming  (Message Passing)

Standard Loop on a Pipeline Algorithm

void f() {
  Compute(body0);
  while (running) {
    Receive();
    Compute(body1);
    Send();
    Compute(body2);
  }
}

body0 takes constant time; body1 and body2 depend on the iteration of the loop.

Analytical Model: numerical solutions for every case.

Page 57: Introduction to Parallel Programming  (Message Passing)

The Analytical Model

Ts denotes the startup time between two processors:

Ts = t0*(G - 1) + G * sum_{i=1..B-1} (t1_i + t2_i) + 2*I*(G - 1)*B + E*B + (communication cost of one packet of size B)

Tc denotes the whole evaluation of G processes, including the time to send M/B packets of size B:

Tc = t0*G + G * sum_{i=1..M} (t1_i + t2_i) + 2*I*(G - 1)*M + E*M + (communication cost of one packet of size B) * M/B

Figure: each processor evaluates a band of G virtual processes over buffers of B elements, sending M/B packets in total.

Page 58: Introduction to Parallel Programming  (Message Passing)

The Analytical Model

T1(G, B) = Ts * (p - 1) + Tc * N/(G*p),   for 1 <= G <= N/p and 1 <= B <= M

T2(G, B) = Ts * (N/G - 1) + Tc

R1 and R2 are the regions of (G, B) values determined by comparing Ts * p with Tc; T1 applies in one region and T2 in the other.

Figure: the p processors 0, 1, ..., p - 1, each evaluating bands of G stages with buffers of size B.

Page 59: Introduction to Parallel Programming  (Message Passing)

Validation of The Model

Figure: predicted (model) versus best measured running times on 2, 4, 8 and 16 processors, for the knapsack problem and for the RAP problem.

Page 60: Introduction to Parallel Programming  (Message Passing)

The Tuning Problem

Given an algorithm A, F_A is the input/output function computed by the algorithm, defined on D = D1 x ... x Dn; F_A(z) is the output value of the algorithm A for an input z belonging to D.

Time_M(A(z)) is the execution time of the algorithm A over the input z on a machine M. CTime_M(A(z)) is the analytical complexity-time formula that approximates Time_M(A(z)).

T = D1 x ... x Dk is the set of tuning parameters and I = D(k+1) x ... x Dn is the set of input parameters: x belongs to T if and only if x has an impact only on the performance of the algorithm and not on its output.

F_A(x, z) = F_A(y, z) for any x and y in T, while Time_M(A(x, z)) and Time_M(A(y, z)) may differ.

The tuning problem is to find x0 in T such that CTime_M(A(x0, z)) = min { CTime_M(A(x, z)) : x in T }.

Page 61: Introduction to Parallel Programming  (Message Passing)

Tuning Parameters

The list of tuning parameters in parallel computing is extensive:

– The most obvious tuning parameter is the number of processors.
– The size of the buffers used during data exchange.
– Under the master-slave paradigm, the size and the number of data items generated by the master.
– In the parallel divide and conquer technique, the size of a subproblem to be considered trivial and the processor assignment policy.
– On regular numerical HPF-like algorithms, the block size allocation.

Page 62: Introduction to Parallel Programming  (Message Passing)

The Methodology

Profile the execution to compute the parameters needed for the complexity-time function CTime_M(A(x, z)).

Compute x0 in T that minimizes the complexity-time function:

CTime_M(A(x0, z)) = min { CTime_M(A(x, z)) : x in T }

At this point, the predictive ability of the complexity-time function can be used to predict the execution time Time_M(A(z)) of an optimal execution, or to execute the algorithm with the tuning parameters set to x0.

Figure: the cycle: instrumentation, analytical modeling, optimal parameter computation, run time prediction, error prediction computation.
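A minimal sketch of the minimization step, assuming the analytical model of the previous slides is available as a function T(p, N, G, B); the coefficients below are placeholders, since in practice Ts and Tc are built from the measured constants t0, t1 and t2:

#include <stdio.h>
#include <float.h>

/* Stand-in for the analytical model T1(G, B) = Ts*(p-1) + Tc*N/(G*p).
   The coefficients are hypothetical; the real ones come from profiling. */
static double T(int p, int N, int G, int B) {
    double Ts = 1e-4 * G + 1e-5 * B;           /* hypothetical startup time      */
    double Tc = 1e-3 * G * B + 1e-2 * G / B;   /* hypothetical band evaluation   */
    return Ts * (p - 1) + Tc * ((double) N / (G * p));
}

int main(void) {
    int p = 4, N = 1000, M = 100;
    int bestG = 1, bestB = 1;
    double best = DBL_MAX;

    /* scan 1 <= G <= N/p and 1 <= B <= M for the minimizing pair */
    for (int G = 1; G <= N / p; G++)
        for (int B = 1; B <= M; B++) {
            double t = T(p, N, G, B);
            if (t < best) { best = t; bestG = G; bestB = B; }
        }
    printf("optimal grain G = %d, buffer B = %d, predicted time %g\n",
           bestG, bestB, best);
    return 0;
}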

Page 63: Introduction to Parallel Programming  (Message Passing)

llp Solver

Figure: the llp solver. The llp communication calls (IN, OUT) are instrumented with gettime() to measure the constants t0, t1 and t2; the llp analytical model then computes Min(T(p, G, B)), from which run time and error predictions are obtained.

Page 64: Introduction to Parallel Programming  (Message Passing)


The MALLBA Infrastructure

Page 65: Introduction to Parallel Programming  (Message Passing)

Performance Prediction: BA - ULL

Fitted linear models for the BA - ULL link (BAULL-1 and BAULL-2): 0.0001 n - 0.0151 and 9E-05 n + 0.005.

Figure: measured communication time versus message size (up to about 1,000,000 bytes) for BAULL-1 and BAULL-2, compared with BEOULL and the CRAY T3E.

Page 66: Introduction to Parallel Programming  (Message Passing)

The MALLBA Project

Library for the resolution of combinatorial optimisation problems.

– 3 types of resolution techniques:
  • Exact
  • Heuristic
  • Hybrid
– 3 implementations:
  • Sequential
  • LAN
  • WAN

Goals:
– Genericity
– Ease of utilization
– Locally- and geographically-distributed computation

Page 67: Introduction to Parallel Programming  (Message Passing)

References

Wilkinson B., Allen M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. 1999. Prentice-Hall.

Gropp W., Lusk E., Skjellum A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. 1999. The MIT Press.

Pacheco P. Parallel Programming with MPI. 1997. Morgan Kaufmann Publishers.

Wu X. Performance Evaluation, Prediction and Visualization of Parallel Systems.

nereida.deioc.ull.es