
Page 1: Programming Distributed Memory Systems Using OpenMP

Programming Distributed Memory Systems Using OpenMP

Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min

School of Electrical and Computer Engineering, Purdue University

http://www.ece.purdue.edu/ParaMount

Page 2: Programming Distributed Memory Systems Using OpenMP

Is OpenMP a useful programming model for distributed systems?

OpenMP is a parallel programming model that assumes a shared address space:

#pragma omp parallel for
for (i = 1; i < n; i++) { a[i] = b[i]; }

Why is it difficult to implement OpenMP for distributed processors? The compiler or runtime system will need to
partition and place data onto the distributed memories, and
send/receive messages to orchestrate remote data accesses.
HPF (High Performance Fortran) was a large-scale effort to do so, without success.

So, why should we try (again)?

OpenMP is an easier (higher-productivity?) programming model. It
allows programs to be incrementally parallelized starting from the serial version, and
relieves the programmer of the task of managing the movement of logically shared data.
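For contrast, here is a minimal hand-written MPI sketch of the kind of partitioning and message orchestration an OpenMP programmer never has to write. It assumes a block distribution of b[] and a loop a[i] = b[i+1] that needs one element from the right neighbor; all names (shift_copy, halo, local_n) are illustrative, not from the talk.

/* Illustrative sketch only: explicit partitioning and messaging for a
   loop that reads one element owned by the right neighbor.
   Assumes MPI_Init has been called and b[] is block-distributed. */
#include <mpi.h>

void shift_copy(double *a, const double *b, int local_n) {
    int pid, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (pid > 0)          ? pid - 1 : MPI_PROC_NULL;
    int right = (pid < nprocs - 1) ? pid + 1 : MPI_PROC_NULL;

    /* Send my first element to the left neighbor; receive the right
       neighbor's first element into a halo cell. */
    double halo = 0.0;
    MPI_Sendrecv(&b[0], 1, MPI_DOUBLE, left,  0,
                 &halo, 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int i = 0; i < local_n; i++)      /* each process: its own block */
        a[i] = (i + 1 < local_n) ? b[i + 1] : halo;
    /* (The global right boundary is ignored in this sketch.) */
}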

Page 3: Programming Distributed Memory Systems Using OpenMP

Two Translation Approaches

Use a Software Distributed Shared Memory System

Translate OpenMP directly to MPI

Page 4: Programming Distributed Memory Systems Using OpenMP

Approach 1: Compiling OpenMP for Software Distributed Shared Memory

Page 5: Programming Distributed Memory Systems Using OpenMP

Inter-procedural Shared Data Analysis

SUBROUTINE DCDTZ(A, B, C)
  INTEGER A, B, C
C$OMP PARALLEL
C$OMP+PRIVATE (B, C)
  A = ...
  CALL CCRANK
  ...
C$OMP END PARALLEL
END

SUBROUTINE DUDTZ(X, Y, Z)
  INTEGER X, Y, Z
C$OMP PARALLEL
C$OMP+REDUCTION(+:X)
  X = X + ...
C$OMP END PARALLEL
END

SUBROUTINE SUB0
  INTEGER DELTAT
  CALL DCDTZ(DELTAT, ...)
  CALL DUDTZ(DELTAT, ...)
END

SUBROUTINE CCRANK()
  ...
  beta = 1 - alpha
  ...
END

The argument DELTAT is written inside parallel regions of both callees, so the analysis must classify it as shared data across procedure boundaries.

Page 6: Programming Distributed Memory Systems Using OpenMP

Access Pattern Analysis

DO istep = 1, itmax, 1

!$OMP PARALLEL DO
      rsd(i, j, k) = ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      rsd(i, j, k) = ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      u(i, j, k) = rsd(i, j, k)
!$OMP END PARALLEL DO

      CALL RHS()

ENDDO

SUBROUTINE RHS()

!$OMP PARALLEL DO
      u(i, j, k) = ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = ...
!$OMP END PARALLEL DO

Page 7: Programming Distributed Memory Systems Using OpenMP

=> Data Distribution-Aware Optimization

DO istep = 1, itmax, 1

!$OMP PARALLEL DO
      rsd(i, j, k) = ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      rsd(i, j, k) = ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      u(i, j, k) = rsd(i, j, k)
!$OMP END PARALLEL DO

      CALL RHS()

ENDDO

SUBROUTINE RHS()

!$OMP PARALLEL DO
      u(i, j, k) = ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = rsd(i, j, k) ...
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = ...
!$OMP END PARALLEL DO

Page 8: Programming Distributed Memory Systems Using OpenMP

Adding Redundant Computation to Eliminate Communication

OpenMP Program:

DO k = 1, z
!$OMP PARALLEL DO
  DO j = 1, N, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
!$OMP PARALLEL DO
  DO j = 1, N, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + (flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

S-DSM Program:

init00 = (N/proc_num)*(pid-1)
...
limit00 = (N/proc_num)*pid
...
DO k = 1, z
  DO j = init00, limit00, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
  CALL TMK_BARRIER(0)
  DO j = init00, limit00, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + (flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

Optimized S-DSM Code:

init00 = (N/proc_num)*(pid-1)
...
limit00 = (N/proc_num)*pid
...
new_init = init00 - 1
new_limit = limit00 + 1
DO k = 1, z
  DO j = new_init, new_limit, 1
    flux(m, j) = u(3, i, j, k) + ...
  ENDDO
  CALL TMK_BARRIER(0)
  DO j = init00, limit00, 1
    DO m = 1, 5, 1
      rsd(m, i, j, k) = ... + (flux(m, j+1) - flux(m, j-1))
    ENDDO
  ENDDO
ENDDO

In the optimized code, each process redundantly computes one extra flux element on each side of its block (new_init, new_limit), so the boundary references flux(m, j-1) and flux(m, j+1) no longer require communication.

Page 9: Programming Distributed Memory Systems Using OpenMP

Access Privatization

Example from equake (SPEC OMPM2001)

Before: ARCHnodes and ARCHduration are read-only shared variables.

if (master) {
  shared->ARCHnodes = ...;
  shared->ARCHduration = ...;
}

/* Parallel Region */
N = shared->ARCHnodes;
iter = shared->ARCHduration;
...

After: they become private variables, initialized by all nodes.

/* Done by all nodes */
{
  ARCHnodes = ...;
  ARCHduration = ...;
}

/* Parallel Region */
N = ARCHnodes;
iter = ARCHduration;
...

Page 10: Programming Distributed Memory Systems Using OpenMP

Optimized Performance of OMPM2001 Benchmarks

[Figure: "SPEC OMP2001M Performance" bar chart (y-axis 0 to 6) on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, art, equake, and applu, comparing Baseline Performance and Optimized Performance.]

Page 11: Programming Distributed Memory Systems Using OpenMP

A Key Question: How Close Are We to MPI Performance?

[Figure: "SPEC OMP2001 Performance" bar chart (y-axis 0 to 8) on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, and applu, comparing Baseline Performance, Optimized Performance, and MPI Performance.]

Page 12: Programming Distributed Memory Systems Using OpenMP

Towards Adaptive Optimization: A Combined Compiler-Runtime Scheme

The compiler identifies repetitive access patterns. The runtime system learns the actual remote addresses and sends data early.

Ideal program characteristics (see the sketch below):

Outer, serial loop
Inner, parallel loops
Communication points at barriers
Data addresses are invariant, or a linear sequence, with respect to the outer loop
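A minimal C sketch of this ideal shape (the array names, sizes, and stencil are illustrative assumptions, not code from the talk):

#define NSTEPS 100
#define N      1024
static double a[N], b[N];

void time_step_loop(void) {
    for (int step = 0; step < NSTEPS; step++) {   /* outer, serial loop */
#pragma omp parallel for                          /* inner, parallel loop */
        for (int i = 1; i < N - 1; i++)
            a[i] = 0.5 * (b[i - 1] + b[i + 1]);
        /* Implicit barrier: the communication point.  The remote elements
           each process needs here are the same in every step (invariant
           w.r.t. the outer loop), so the runtime can learn them once and
           send them early in later iterations. */
#pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            b[i] = a[i];
    }
}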

Page 13: Programming Distributed Memory Systems Using OpenMP

Current Best Performance of OpenMP for S-DSM

[Figure: Bar chart (y-axis 0 to 6) on 1, 2, 4, and 8 processors for wupwise, swim, applu, SpMul, and CG, comparing Baseline (No Opt.), Locality Opt, and Locality Opt + Comp/Run Opt.]

Page 14: Programming Distributed Memory Systems Using OpenMP

Approach 2: Translating OpenMP Directly to MPI

Baseline translation
Overlapping computation and communication for irregular accesses

Page 15: Programming Distributed Memory Systems Using OpenMP

Baseline Translation of OpenMP to MPI

Execution Model: SPMD model
Serial regions are replicated on all processes.
Iterations of parallel for loops are distributed using static block scheduling (a minimal sketch of the generated code follows).
Shared data is allocated on all nodes.
There is no concept of "owner"; there are only producers and consumers of shared data.
At the end of a parallel loop, producers communicate shared data to "potential" future consumers.
Array section analysis is used for summarizing array accesses.
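A minimal sketch of the SPMD code such a translation might generate for one parallel loop, assuming the static block scheduling described above (pid, nprocs, lb, ub and the loop body are illustrative):

/* OpenMP source:  #pragma omp parallel for
                   for (i = 0; i < n; i++) a[i] = 2.0 * b[i];  */
#include <mpi.h>

void translated_loop(double *a, const double *b, int n) {
    int pid, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = (n + nprocs - 1) / nprocs;     /* static block scheduling */
    int lb = pid * chunk;
    int ub = (lb + chunk < n) ? lb + chunk : n;

    for (int i = lb; i < ub; i++)              /* this process's block */
        a[i] = 2.0 * b[i];

    /* Here the generated code would send the block a[lb..ub-1] to any
       process identified as a potential future consumer (message set
       generation is sketched on the following slides). */
}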

Page 16: Programming Distributed Memory Systems Using OpenMP

Baseline Translation

Translation Steps:
1. Identify all shared data.
2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses; see the example below).
3. Use interprocedural data flow analysis to identify potential consumers; incorporate OpenMP's relaxed consistency specifications.
4. Create message sets to communicate data between producers and consumers.
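For illustration only, using the <access, array, dimension, lower, upper> regular-section notation of the next slide (the loop and the block bounds lb(p), ub(p) are invented for this example):

#pragma omp parallel for
for (i = 0; i < N; i++)
    A[i] = B[i] + 1.0;

/* RSD annotations recorded for process p, whose block of iterations
   is [lb(p), ub(p)]:
       <write, A, 1, lb(p), ub(p)>
       <read,  B, 1, lb(p), ub(p)>                                  */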

Page 17: Programming Distributed Memory Systems Using OpenMP

Message Set Generation

For every write, determine all future reads.

Consider the RSD annotations for array A, where the write at vertex V1 is followed in the control-flow graph by several reads and an intervening write:

V1:  <write, A, 1, l1(p), u1(p)>

     <read,  A, 1, l2(p), u2(p)>
     <write, A, 1, l3(p), u3(p)>
     <read,  A, 1, l4(p), u4(p)>
     <read,  A, 1, l5(p), u5(p)>

The message set at RSD vertex V1, for array A, from process p to process q, is computed as

S_A^{pq} = elements of A with subscripts in the set
    {[l1(p), u1(p)] ∩ [l2(q), u2(q)]}
  ∪ {[l1(p), u1(p)] ∩ [l4(q), u4(q)]}
  ∪ ([l1(p), u1(p)] ∩ {[l5(q), u5(q)] - [l3(p), u3(p)]})

The set difference in the last term reflects that the elements re-produced by the intervening write [l3(p), u3(p)] do not have to be sent from V1.
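A worked toy instance of this formula (all numbers invented for illustration): suppose A has 100 elements on 4 processes numbered 0 to 3, the write at V1 covers each process's own block, l1(p) = 25p, u1(p) = 25p + 24, and the only future read is a halo read with l2(q) = 25q - 1, u2(q) = 25q + 25. Then the message from process p = 1 to process q = 2 is

S_A^{12} = [25, 49] ∩ [49, 75] = {49},

so after the loop at V1, process 1 sends only element A(49) to process 2.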

Page 18: Programming Distributed Memory Systems Using OpenMP

Baseline Translation of Irregular Accesses

Irregular accesses: A[B[i]], A[f(i)]
Reads: assume the whole array is accessed.
Writes: inspect at runtime, communicate at the end of the parallel loop.
We can often do better than "conservative": monotonic subscript-array values let us sharpen the access regions (see the example below).
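A hypothetical illustration of the monotonicity argument (the loop and names are not from the talk):

/* If the subscript array B is known to be monotonically non-decreasing,
   the elements of A read by process p's iteration block [lb, ub] all lie
   in the section A[B[lb] .. B[ub]], so only that section, rather than
   the whole of A, has to be made available to process p. */
for (int i = lb; i <= ub; i++)
    x[i] = A[B[i]];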

Page 19: Programming Distributed Memory Systems Using OpenMP

Optimizations Based on Collective Communication

Recognition of reduction idioms: translate to MPI_Reduce / MPI_Allreduce functions (see the sketch below).

Casting sends/receives in terms of all-to-all calls: beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers.
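A minimal sketch of the reduction idiom under the baseline translation (the block bounds lb and ub are illustrative):

/* OpenMP:  #pragma omp parallel for reduction(+:sum)
   A possible MPI translation: each process reduces over its block,
   then the partial sums are combined with MPI_Allreduce. */
#include <mpi.h>

double global_sum(const double *a, int lb, int ub) {
    double local = 0.0, total = 0.0;
    for (int i = lb; i < ub; i++)          /* this process's iterations */
        local += a[i];
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;                          /* every process gets the result */
}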

Page 20: Programming Distributed Memory Systems Using OpenMP

Performance of the Baseline OpenMP to MPI Translation

Platform II: sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.

Page 21: Programming Distributed Memory Systems Using OpenMP

We Can Do More for Irregular Applications?

L1: #pragma omp parallel for
    for (i = 0; i < 10; i++)
        A[i] = ...

L2: #pragma omp parallel for
    for (j = 0; j < 20; j++)
        B[j] = A[C[j]] + ...

Subscripts of accesses to shared arrays are not always analyzable at compile time.

Baseline OpenMP-to-MPI translation:
Conservatively estimate that each process accesses the entire array.
Try to deduce properties such as monotonicity of the irregular subscript to refine the estimate.
Still, there may be redundant communication.
Runtime tests (inspection) are needed to resolve accesses.

[Figure: Array A is divided into a part produced by process 1 and a part produced by process 2. The subscript array C (e.g., 1, 3, 5, 0, 2, ... accessed on process 1 and 2, 4, 8, 1, 2, ... on process 2) determines which elements of A each process actually reads in loop L2.]

Page 22: Programming Distributed Memory Systems Using OpenMP

Inspection

Inspection allows accesses to be differentiated (at runtime) as local and non-local accesses.

Inspection can also map iterations to accesses. This mapping can then be used to reorder iterations so that iterations with the same data source are grouped together. Communication of remote data can then be overlapped with the computation of iterations that access local data (or data already received). A sketch of such an inspector follows the figure.

[Figure: The index array C[i] (1, 3, 5, 0, 2, ... accessed on process 1; 2, 5, 8, 1, 2, ... on process 2) maps iteration indices i (0, 1, 2, 3, 4, ...; 10, 11, 12, 13, 14, ...) to elements of array A. Reordering the iterations (e.g., to 3, 0, 4, 1, 2, ...; 11, 12, 13, 10, 14, ...) groups together iterations whose data comes from the same source process.]
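A minimal sketch of such an inspector, assuming a block-owned array A and the loop B[j] = A[C[j]] + ... from the earlier slide; the helper owner() and all other names are illustrative assumptions:

/* Illustrative inspector: classify each iteration of
       for (j = lb; j < ub; j++) B[j] = A[C[j]] + ...
   as local or remote, so remote fetches can be issued early and
   overlapped with the local iterations. */
int owner(int index, int n, int nprocs) {        /* block owner of A[index] */
    int chunk = (n + nprocs - 1) / nprocs;
    return index / chunk;
}

void inspect(const int *C, int lb, int ub, int n, int nprocs, int pid,
             int *local_iters, int *n_local,
             int *remote_iters, int *n_remote) {
    *n_local = *n_remote = 0;
    for (int j = lb; j < ub; j++) {
        if (owner(C[j], n, nprocs) == pid)
            local_iters[(*n_local)++] = j;       /* data already here */
        else
            remote_iters[(*n_remote)++] = j;     /* needs communication */
    }
    /* The executor would then post receives for the remote elements,
       run local_iters[], wait, and finally run remote_iters[]. */
}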

Page 23: Programming Distributed Memory Systems Using OpenMP

Loop Restructuring

Simple iteration reordering may not be sufficient to expose the full set of possibilities for computation-communication overlap.

L1: #pragma omp parallel for
    for (i = 0; i < N; i++)
        p[i] = x[i] + alpha*r[i];

L2: #pragma omp parallel for
    for (j = 0; j < N; j++) {
        w[j] = 0;
        for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:         w[j] = w[j] + a[k]*p[col[k]];
    }

Reordering loop L2 may still not group together accesses from different sources.

Distribute loop L2 to form loops L2-1 and L2-2:

L1: #pragma omp parallel for
    for (i = 0; i < N; i++)
        p[i] = x[i] + alpha*r[i];

L2-1: #pragma omp parallel for
      for (j = 0; j < N; j++) { w[j] = 0; }

L2-2: #pragma omp parallel for
      for (j = 0; j < N; j++) {
          for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:           w[j] = w[j] + a[k]*p[col[k]];
      }

Page 24: Programming Distributed Memory Systems Using OpenMP

Loop Restructuring (contd.)

L1: #pragma omp parallel for
    for (i = 0; i < N; i++)
        p[i] = x[i] + alpha*r[i];

L2-1: #pragma omp parallel for
      for (j = 0; j < N; j++) { w[j] = 0; }

L2-2: #pragma omp parallel for
      for (j = 0; j < N; j++) {
          for (k = rowstr[j]; k < rowstr[j+1]; k++)
S2:           w[j] = w[j] + a[k]*p[col[k]];
      }

Coalesce nested loop L2-2 to form loop L3, then reorder the iterations of loop L3 to achieve computation-communication overlap.

Final restructured and reordered loop (the T[i] data structure is created and filled in by the inspector; see the sketch below):

L1: #pragma omp parallel for
    for (i = 0; i < N; i++)
        p[i] = x[i] + alpha*r[i];

L2-1: #pragma omp parallel for
      for (j = 0; j < N; j++) { w[j] = 0; }

L3: for (i = 0; i < num_iter; i++)
        w[T[i].j] = w[T[i].j] + a[T[i].k]*p[T[i].col];
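A minimal sketch of what the inspector-built table and the executor for L3 might look like; the field names follow the T[i].j, T[i].k, and T[i].col references above, while everything else is an illustrative assumption:

/* One entry per (j, k) pair of the coalesced loop L3.  The inspector
   fills the table and orders it so that entries whose p[col] value is
   already local come first, followed by entries grouped by the process
   that supplies p[col]. */
typedef struct {
    int j;      /* row index: which w[j] is updated         */
    int k;      /* position in a[] for this product         */
    int col;    /* column index: which p[col] is read       */
} iter_entry;

/* Executor for loop L3, assuming T[] has been built and reordered: */
void run_L3(const iter_entry *T, int num_iter,
            double *w, const double *a, const double *p) {
    for (int i = 0; i < num_iter; i++)
        w[T[i].j] += a[T[i].k] * p[T[i].col];
}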

Page 25: Programming Distributed Memory Systems Using OpenMP

Achieving Actual Overlap of Computation and Communication

Non-blocking send/recv calls may not actually progress concurrently with computation.
Use a multi-threaded runtime system with separate computation and communication threads; on dual-CPU machines these threads can progress concurrently.
The compiler extracts the send/recvs, along with the packing/unpacking of message buffers, into a communication thread (a sketch follows).
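A minimal sketch of the idea, with a POSIX communication thread driving non-blocking MPI calls; it assumes MPI was initialized with MPI_Init_thread and at least MPI_THREAD_SERIALIZED support, and all names and the single send/recv pair are illustrative:

#include <mpi.h>
#include <pthread.h>

typedef struct {                 /* work handed to the communication thread */
    double *sendbuf; int scount; int dest;
    double *recvbuf; int rcount; int src;
} comm_work;

static void *comm_thread(void *arg) {
    comm_work *w = (comm_work *)arg;
    MPI_Request req[2];
    /* Pack message buffers (omitted), then post the non-blocking transfers
       and keep them progressing while the computation thread works. */
    MPI_Isend(w->sendbuf, w->scount, MPI_DOUBLE, w->dest, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(w->recvbuf, w->rcount, MPI_DOUBLE, w->src,  0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    return NULL;
}

void overlapped_step(comm_work *w,
                     void (*compute_local)(void),
                     void (*compute_remote)(void)) {
    pthread_t tid;
    pthread_create(&tid, NULL, comm_thread, w);   /* wake communication thread */
    compute_local();            /* iterations that touch only local data       */
    pthread_join(&tid, NULL);   /* wait for the receives to complete           */
    compute_remote();           /* iterations that use the received data       */
}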

Page 26: Programming Distributed Memory Systems Using OpenMP

[Timeline diagram: program timeline for process p, with a computation thread and a communication thread.

Computation thread on process p: wake up the communication thread and initiate sends to processes q and r; execute iterations that access local data (t_comp-p); wait for the receives from process q to complete (t_wait-q); execute iterations that access data received from process q (t_comp-q); wait for the receives from process r to complete (t_wait-r); execute iterations that access data received from process r (t_comp-r).

Communication thread on process p: pack data and send to processes q and r (t_send); receive data from process q (t_recv-q); receive data from process r (t_recv-r).]

Page 27: Programming Distributed Memory Systems Using OpenMP

Performance of Equake

[Figure: Execution time in seconds (0 to 1200) on 1, 2, 4, 8, and 16 nodes, comparing Hand-Coded MPI, Baseline (No Inspection), Inspection (No Reordering), and Inspection and Reordering.]

Computation-communication overlap in Equake

[Figure: Time in seconds (0 to 450) on 1, 2, 4, 8, and 16 processors, showing Actual Time Spent in Send/Recv, Computation Available for Overlapping, and Actual Wait Time.]

Page 28: Programming Distributed Memory Systems Using OpenMP

Performance of Moldyn

[Figure: Execution time in seconds (0 to 140) on 1, 2, 4, 8, and 16 nodes, comparing Hand-coded MPI, Baseline, Inspector without Reordering, and Inspection and Reordering.]

Computation-communication overlap in Moldyn

[Figure: Time in seconds (0 to 12) on 1, 2, 4, 8, and 16 nodes, showing Time Spent in Send/Recv, Computation Available for Overlapping, and Actual Wait Time.]

Page 29: Programming Distributed Memory Systems Using OpenMP

Performance of CG

[Figure: Execution time in seconds (0 to 1400) on 1, 2, 4, 8, and 16 nodes, comparing NPB-2.3-MPI, Baseline Translation, Inspector without Reordering, and Inspector with Iteration Reordering.]

Computation-communication overlap in CG

[Figure: Time in seconds (0 to 100) on 1, 2, 4, 8, and 16 nodes, showing Time Spent in Send/Recv, Computation Available for Overlap, and Actual Wait Time.]

Page 30: Programming Distributed Memory Systems Using OpenMP

Conclusions

There is hope for easier programming models on distributed systems

OpenMP can be translated effectively onto DPS; we have used benchmarks from SPEC OMP, NAS, and additional irregular codes.

Direct translation of OpenMP to MPI outperforms translation via S-DSM; the "fall back" of S-DSM for irregular accesses incurs significant overhead.

Caveats:
Data scalability is an issue.
Black-belt programmers will always be able to do better.
Advanced compiler technology is involved; there will be performance surprises.