
Delivering High Performance to Parallel Applications Using Advanced Scheduling

Parallel Computing 2003

Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris

National Technical University of Athens

Computing Systems Laboratory

{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr

www.cslab.ece.ntua.gr


Overview

Introduction
Background
Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
Experimental Results
Conclusions – Future Work


Introduction

Motivation:

• A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results
• There is no complete method to generate code for non-rectangular tiles


Introduction

Contribution:

• Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
• Simulation of blocking and non-blocking communication primitives
• Experimental evaluation of the proposed scheduling scheme


Background

Algorithmic Model:

FOR j1 = min1 TO max1 DO
  …
    FOR jn = minn TO maxn DO
      Computation(j1,…,jn);
    ENDFOR
  …
ENDFOR

• Perfectly nested loops
• Constant flow data dependencies (D)


Background

Tiling:

• Popular loop transformation
• Groups iterations into atomic units
• Enhances locality in uniprocessors
• Enables coarse-grain parallelism in distributed memory systems
• Valid tiling matrix H: HD ≥ 0
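A tiling matrix H is valid when it does not reverse any dependence across tile boundaries, i.e. every entry of the product HD is non-negative. A minimal C sketch of this check (an illustrative helper, not part of the paper's tool chain):

/* Returns 1 if the n x n tiling matrix H is valid for the n x m
 * dependence matrix D, i.e. all entries of H*D are non-negative. */
int tiling_is_valid(int n, int m, const double H[n][n], const double D[n][m])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            double hd = 0.0;
            for (int k = 0; k < n; k++)
                hd += H[i][k] * D[k][j];
            if (hd < 0.0)
                return 0;   /* some dependence would cross tiles backwards */
        }
    return 1;
}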


Tiling Transformation

Example:

FOR j1 = 0 TO 11 DO
  FOR j2 = 0 TO 8 DO
    A[j1,j2] := A[j1-1,j2] + A[j1-1,j2-1];
  ENDFOR
ENDFOR

Dependence matrix: D = [1 1; 0 1], i.e. dependence vectors (1,0) and (1,1)
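For illustration, the same loop nest tiled with 3x3 rectangular tiles (matching the P = [3 0; 0 3] transformation of the following slide) could be written as follows; this sketch is not taken from the paper:

#define MIN(a, b) ((a) < (b) ? (a) : (b))

void tiled_example(double A[12][9])
{
    /* The two outer loops enumerate 3x3 tiles, the two inner loops
     * traverse the iterations inside the current tile.             */
    for (int jj1 = 0; jj1 <= 11; jj1 += 3)
        for (int jj2 = 0; jj2 <= 8; jj2 += 3)
            for (int j1 = jj1; j1 <= MIN(jj1 + 2, 11); j1++)
                for (int j2 = jj2; j2 <= MIN(jj2 + 2, 8); j2++)
                    if (j1 >= 1 && j2 >= 1)   /* skip reads outside the array */
                        A[j1][j2] = A[j1-1][j2] + A[j1-1][j2-1];
}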


Rectangular Tiling Transformation

[Figure: the (j1, j2) iteration space partitioned into 3x3 rectangular tiles]

P = [3 0; 0 3],  H = P^-1 = [1/3 0; 0 1/3]


Non-rectangular Tiling Transformation

[Figure: the (j1, j2) iteration space partitioned into parallelogram (skewed) tiles]

P = [3 3; 0 3],  H = P^-1 = [1/3 -1/3; 0 1/3]
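The tile that contains an iteration point j is given by floor(H·j). A small sketch for the non-rectangular example above (illustrative helper, not from the paper):

#include <math.h>

/* Tile coordinates of iteration point (j1, j2) under the non-rectangular
 * tiling H = [1/3 -1/3; 0 1/3]: tile = floor(H * j).                     */
void tile_of(int j1, int j2, int *t1, int *t2)
{
    *t1 = (int)floor((j1 - j2) / 3.0);
    *t2 = (int)floor(j2 / 3.0);
}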


Why Non-rectangular Tiling?

Reduces communication

Enables more efficient scheduling schemes

[Figure: for the example space, rectangular tiling needs 8 communication points and 6 time steps, while non-rectangular tiling needs 6 communication points and 5 time steps]


Computation Distribution

We map tiles along the longest dimension to the same processor because:

• It reduces the number of processors required
• It simplifies message-passing
• It reduces total execution time when computation is overlapped with communication
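Equivalently, a tile's owner is determined by all of its coordinates except the one along the longest (mapping) dimension, which the same processor traverses over time. A minimal sketch for the 3D case, assuming dimension 0 is the longest (hypothetical helper, not the paper's code):

/* Owner of tile (t0, t1, t2) when tiles are mapped along dimension 0:
 * tiles differing only in t0 go to the same processor of a virtual
 * P1 x P2 processor grid.                                           */
int owner_of_tile(int t1, int t2, int P2)
{
    return t1 * P2 + t2;   /* t0 is executed sequentially by this owner */
}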


Computation Distribution

[Figure: the tiled (j1, j2) iteration space with each chain of tiles along the longest dimension assigned to one of the processors P1, P2, P3]


Data Distribution

• Computes-owns rule: each processor owns the data it computes
• Arbitrary convex iteration space, arbitrary tiling
• Rectangular local iteration and data spaces
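Because each processor owns exactly the data it computes, it can keep all of it in one rectangular local array (the Local Data Space) sized for its own tiles plus the data received from neighbours. A rough sketch with hypothetical sizes (not the paper's exact layout):

#include <stdlib.h>

/* Local Data Space: a single rectangular array per processor holding the
 * locally computed iterations (computation storage) plus a halo region
 * for data received from neighbouring processors (communication storage). */
double *alloc_lds(int local_n1, int local_n2, int halo)
{
    size_t n1 = (size_t)local_n1 + (size_t)halo;
    size_t n2 = (size_t)local_n2 + (size_t)halo;
    return calloc(n1 * n2, sizeof(double));
}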


Data Distribution

[Figure: a tile in original coordinates (j1, j2) forms the Tile Iteration Space (TIS); the transformation H' maps it to the rectangular Transformed Tile Iteration Space (TTIS) in coordinates (j'1, j'2), and P' maps the TTIS back to the TIS]


Data Distribution

[Figure: the TTIS (coordinates j'1, j'2) is mapped by map() onto the Local Data Space (LDS, coordinates j''1, j''2) and back by map^-1(); successive tiles t = 0, 1, 2, 3 are laid out along the mapping dimension with offsets off1, off2, and the LDS is divided into computation storage, communication storage and unused space]


Data Distribution

[Figure: the local data spaces LDS(pid_x) and LDS(pid_y) of two processors are related to the global data space DS (coordinates j1, j2, extents w1, w2) and the iteration space J^n through the functions loc(), loc^-1() and fw()]


Communication Schemes

With whom do I communicate?

[Figures: the tiled (j1, j2) iteration space mapped onto processors P1, P2, P3, and the resulting pairs of processors that need to exchange data]

Communication Schemes

What do I send?

[Figure: the dependence vectors expressed in the Tile Iteration Space (TIS, coordinates j1, j2) and their images in the Transformed Tile Iteration Space (TTIS, coordinates j'1, j'2) determine the data each processor must send]
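Concretely, an iteration j of the current tile produces data that must be sent whenever some dependence d makes j + d land outside the tile. A minimal 2D test for rectangular tiles (for non-rectangular tiles the same test is applied in the TTIS); this is an illustration, not the paper's generated code:

/* Does iteration (j1, j2) of the tile starting at (t1, t2) with extents
 * (s1, s2) produce data needed by another tile along dependence (d1, d2)? */
int must_send(int j1, int j2, int t1, int t2, int s1, int s2, int d1, int d2)
{
    return (j1 + d1 >= t1 + s1) || (j2 + d2 >= t2 + s2);
}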


Blocking Scheme

[Figure: with blocking communication, processors P1, P2, P3 alternate strictly between computation and communication over the tiled (j1, j2) space; the example schedule takes 12 time steps]


Non-blocking Scheme

[Figure: with non-blocking communication, each processor overlaps computation of the current tile with communication for neighbouring tiles; the same example takes 6 time steps]


Code Generation Summary

Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme

Sequential Code → (Dependence Analysis, Tiling Transformation) → Sequential Tiled Code → (Parallelization) → Parallel SPMD Code

Parallelization consists of:

• Computation Distribution
• Data Distribution
• Communication Primitives


Code Summary – Blocking Scheme

FORACROSS pid1 = min S1 TO max S1 DO
  …
  FORACROSS pidn-1 = min Sn-1 TO max Sn-1 DO
    /* Sequential execution of tiles */
    FOR tS = min Sn TO max Sn DO
      /* Receive data from neighbouring tiles */
      BLOCKING_RECV(pid, tS, DS, CC);
      /* Traverse the interior of the tile */
      FOR j1 = l1 TO u1 STEP c1 DO
        …
        FOR jn = ln TO un STEP cn DO
          /* Perform computations on LDS */
          t := tS - lnS;
          LA[map(j,t)] := F(LA[map(j-d1,t)], …, LA[map(j-dq,t)]);
        ENDFOR
        …
      ENDFOR
      /* Send data to neighbouring processors */
      BLOCKING_SEND(pid, tS, Dm, CC);
    ENDFOR
  ENDFORACROSS
  …
ENDFORACROSS
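In MPI terms, one tile step of the blocking scheme is a plain MPI_Recv, the tile computation, and then an MPI_Send, so communication and computation never overlap. A minimal sketch with illustrative names and buffers (not the generated code):

#include <mpi.h>

/* One tile step of the blocking scheme: receive the incoming halo,
 * compute the tile, then send the outgoing halo. Ranks equal to
 * MPI_PROC_NULL turn the corresponding call into a no-op.          */
void blocking_tile_step(double *halo_in, double *halo_out, int count,
                        int pred, int succ, void (*compute_tile)(void))
{
    MPI_Status status;

    MPI_Recv(halo_in, count, MPI_DOUBLE, pred, 0, MPI_COMM_WORLD, &status);
    compute_tile();                      /* traverse the tile's iterations */
    MPI_Send(halo_out, count, MPI_DOUBLE, succ, 0, MPI_COMM_WORLD);
}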


Code Summary – Non-blocking Scheme

FORACROSS pid1 = min S1 TO max S1 DO
  …
  FORACROSS pidn-1 = min Sn-1 TO max Sn-1 DO
    /* Sequential execution of tiles */
    FOR tS = min Sn TO max Sn DO
      /* Receive data for the next tile */
      NON_BLOCKING_RECV(pid, tS+1, DS, CC);
      /* Send data computed at the previous tile */
      NON_BLOCKING_SEND(pid, tS-1, Dm, CC);
      /* Compute current tile */
      FOR j1 = l1 TO u1 STEP c1 DO
        …
        FOR jn = ln TO un STEP cn DO
          /* Perform computations on LDS */
          t := tS - lnS;
          LA[map(j,t)] := F(LA[map(j-d1,t)], …, LA[map(j-dq,t)]);
        ENDFOR
        …
      ENDFOR
      /* Wait for communication completion */
      WAIT_ALL;
    ENDFOR
  ENDFORACROSS
  …
ENDFORACROSS
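One way to realize the non-blocking scheme with standard MPI calls is MPI_Irecv / MPI_Isend around the tile computation followed by MPI_Waitall, so the transfers for the neighbouring tiles overlap with the current tile's computation (the presentation notes that the communication primitives were simulated). A minimal sketch with illustrative names, not the generated code:

#include <mpi.h>

/* One tile step of the non-blocking scheme: post the receive for the next
 * tile and the send of the previous tile's results, compute the current
 * tile, then wait for both transfers to complete.                        */
void nonblocking_tile_step(double *halo_next, double *halo_prev, int count,
                           int pred, int succ, void (*compute_tile)(void))
{
    MPI_Request req[2];
    MPI_Status  st[2];

    MPI_Irecv(halo_next, count, MPI_DOUBLE, pred, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(halo_prev, count, MPI_DOUBLE, succ, 0, MPI_COMM_WORLD, &req[1]);

    compute_tile();                 /* overlapped with the transfers above */

    MPI_Waitall(2, req, st);
}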


Experimental Results

• 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
• MPICH v.1.2.5 (--with-device=p4, --with-comm=shared)
• g++ compiler v.2.95.4 (-O3)
• FastEthernet interconnection
• 2 micro-kernel benchmarks (3D):
  • Gauss Successive Over-Relaxation (SOR)
  • Texture Smoothing Code (TSC)
• Simulation of communication schemes


SOR

• Iteration space: M x N x N
• Dependence matrix D: 5 constant dependence vectors
• Rectangular tiling: diagonal tiling matrix Pr with tile sizes x, y, z
• Non-rectangular tiling: skewed (non-diagonal) tiling matrix Pnr
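For reference, a textbook Gauss-Seidel SOR sweep over an M x N x N space has the following structure; this is only an illustration of the benchmark, and the exact stencil and dependence vectors used in the paper may differ:

/* One possible Gauss-Seidel SOR formulation over an M x N x N space:
 * the outer dimension counts relaxation sweeps, the two inner
 * dimensions update the grid using a mix of current and previous
 * sweep values (this is what creates the constant flow dependencies). */
void sor(int M, int N, double w, double A[M][N][N])
{
    for (int t = 1; t < M; t++)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                A[t][i][j] = (1.0 - w) * A[t-1][i][j]
                           + (w / 4.0) * (A[t][i-1][j] + A[t][i][j-1]
                                        + A[t-1][i+1][j] + A[t-1][i][j+1]);
}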

[Figures: SOR experimental results]

TSC

• Iteration space: T x N x N
• Dependence matrix D: 8 constant dependence vectors
• Rectangular tiling: diagonal tiling matrix Pr with tile sizes x, y, z
• Non-rectangular tiling: skewed (non-diagonal) tiling matrix Pnr

[Figures: TSC experimental results]


Conclusions

Automatic code generation for arbitrarily tiled iteration spaces can be efficient. High performance can be achieved by means of:

• a suitable tiling transformation
• overlapping computation with communication


Future Work

• Application of the methodology to imperfectly nested loops and non-constant dependencies
• Investigation of hybrid programming models (MPI+OpenMP)
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)


Questions?

http://www.cslab.ece.ntua.gr/~ndros
