Recurrence Chain Partitioning of Non-Uniform Dependences

Aug 15-18, 2004, Montreal, Canada. Recurrence Chain Partitioning of Non-Uniform Dependences. Yijun Yu, Erik H. D’Hollander


Page 1: Recurrence Chain Partitioning of Non-Uniform Dependences

Aug 15-18, Montreal, Canada 1

Recurrence Chain Partitioning of Non-Uniform Dependences

Yijun Yu, Erik H. D’Hollander

Page 2: Recurrence Chain Partitioning of Non-Uniform Dependences

Overview

1. Dependence and Parallelism
2. Non-Uniform Loop Dependences
3. Recurrence Chains Partitioning
4. Related work
5. Implementations
6. Experiment Results
7. Summary

Page 3: Recurrence Chain Partitioning of Non-Uniform Dependences

1. Background: Dependence vs. Parallelism

Sequential program:

DO I = 1,3
  A(I) = A(I-1)
ENDDO

execution trace      shared memory A(0) A(1) A(2) A(3)
(initial)            0 1 2 3
A(1) = A(0)          0 0 2 3
A(2) = A(1)          0 0 0 3
A(3) = A(2)          0 0 0 0

Parallel program:

DOALL I = 1,3
  A(I) = A(I-1)
ENDDO

execution trace      shared memory A(0) A(1) A(2) A(3)
(initial)            0 1 2 3
A(2) = A(1)          0 1 1 3
A(1) = A(0)          0 0 1 3
A(3) = A(2)          0 0 1 1
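The two traces can be replayed with a short simulation. This is a minimal sketch in Python rather than the slide's Fortran; the helper name `run` is hypothetical, the starting values A(0..3) = 0, 1, 2, 3 are read off the trace, and the order 2, 1, 3 is just the one DOALL interleaving shown on the slide, not the only possible one.

```python
def run(order):
    """Execute A(I) = A(I-1) for the iterations in the given order."""
    a = [0, 1, 2, 3]          # shared memory A(0..3), as in the trace
    for i in order:
        a[i] = a[i - 1]
    return a

sequential = run([1, 2, 3])   # DO loop: iterations in program order
reordered = run([2, 1, 3])    # one possible DOALL interleaving

print(sequential)  # [0, 0, 0, 0]
print(reordered)   # [0, 0, 1, 1]
```

The two results differ, which is why a loop carrying this dependence cannot simply be turned into a DOALL.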

Page 4: Recurrence Chain Partitioning of Non-Uniform Dependences

The CFD application @ WTCM

Computational Fluid Dynamics (CFD)
Navier-Stokes equations
Successive Over-Relaxation (SOR)

temperature
3D geometry + 1D time

Page 5: Recurrence Chain Partitioning of Non-Uniform Dependences

The visualized uniform dependences and transformations for the 4D loop

Before transformation / After transformation

A 3-D unimodular transformation is found after visualizing the 4D loop nest, which has 177 array references at run time for each iteration. Here we use a regular shape. The transformation makes it possible to speed up the program around $N^2/6$ times, where $N$ is the diameter of the geometry. (Yu, Parco99)

Page 6: Recurrence Chain Partitioning of Non-Uniform Dependences

2. Non-uniform dependences

Uniform loop dependences
Dependent iterations are a uniform distance apart in the iteration space: a set of distance vectors can predict the dependences and indicate the affine index loop transformation that reveals the maximal loop parallelism.

Non-uniform dependences
Irregular; can be caused by complex subscripts, compile-time unknowns, etc.
But not rare: in the SPECfp95 benchmarks, 46% of nested loops and 12.8% of the coupled subscripts
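The difference between the two kinds of dependence can be seen by enumerating distances by brute force. The sketch below is an illustration only: the helper `dependence_distances` and the two one-dimensional loops, `A(I) = A(I-1)` and `A(2*I) = A(I)`, are assumptions chosen for brevity and are not taken from the benchmarks.

```python
def dependence_distances(write_idx, read_idx, n):
    """Distances j - i where iteration i writes the element iteration j reads."""
    writers = {}
    for i in range(1, n + 1):
        writers.setdefault(write_idx(i), []).append(i)
    distances = set()
    for j in range(1, n + 1):
        for i in writers.get(read_idx(j), []):
            if i != j:
                distances.add(j - i)
    return distances

# A(I) = A(I-1): every dependence has the same distance
uniform = dependence_distances(lambda i: i, lambda i: i - 1, 8)
# A(2*I) = A(I): the distance grows with the iteration number
non_uniform = dependence_distances(lambda i: 2 * i, lambda i: i, 8)

print(uniform)      # {1}
print(non_uniform)  # {1, 2, 3, 4}
```

A single distance vector describes the first loop completely; no finite set of constant distances describes the second one for all loop bounds.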

Page 7: Recurrence Chain Partitioning of Non-Uniform Dependences


Non-uniform dependences

Tip of the iceberg

Page 8: Recurrence Chain Partitioning of Non-Uniform Dependences

Irregular dependence

Dependences have non-uniform distance

Parallelism analysis: 200 iterations over 15 data flow steps

Speedup: 13.3
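The speedup follows from the two numbers above, assuming every iteration takes one time unit and there are enough processors to run each data flow step fully in parallel (an idealization, not a measured result):

```latex
\[
  \text{speedup}
  = \frac{\text{iterations}}{\text{data flow steps}}
  = \frac{200}{15}
  \approx 13.3
\]
```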

Problem: How to exploit it?

Page 9: Recurrence Chain Partitioning of Non-Uniform Dependences

3. Recurrence Chain Partitioning

Research objectives

If DO loops fail to reveal the optimal parallelism for irregular dependences, can one use WHILE loops?

WHEN can one apply WHILE loops?
HOW to construct WHILE loops?
WHAT to do when one cannot apply WHILE loops?
HOW MUCH can be achieved by an evaluation purposes?

Page 10: Recurrence Chain Partitioning of Non-Uniform Dependences

3.1 How to Generate code?

DOALL I = INIT(I)
  WHILE !TERMINATE(I) DO
    S(I)
    I = NEXT(I)
  END DO
ENDDOALL

INIT(I) = ?
TERMINATE(I) = ?
NEXT(I) = ?
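The DOALL/WHILE skeleton can be sketched with the three unknowns passed in as functions. This is a hedged illustration, not the generated code of the paper: `run_chain`, `doall_while` and the example loop `A(2*I) = A(I)` are hypothetical, and Python threads only mirror the structure (they do not give a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def run_chain(start, terminate, next_iter, body):
    """WHILE loop: follow one chain sequentially from its first iteration."""
    i = start
    while not terminate(i):
        body(i)
        i = next_iter(i)

def doall_while(init, terminate, next_iter, body, workers=4):
    """DOALL over the chain heads; each chain runs as its own task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all chains and re-raises any exception
        list(pool.map(lambda s: run_chain(s, terminate, next_iter, body), init))

# Hypothetical loop: DO I = 1,N ; A(2*I) = A(I) ; ENDDO
N = 16
expected = list(range(2 * N + 1))
for i in range(1, N + 1):                 # sequential reference run
    expected[2 * i] = expected[i]

a = list(range(2 * N + 1))

def body(i):
    a[2 * i] = a[i]

doall_while(
    init=range(1, N + 1, 2),              # odd I: no iteration writes A(I)
    terminate=lambda i: i > N,            # chain leaves the iteration space
    next_iter=lambda i: 2 * i,            # iteration 2*I reads what I wrote
    body=body,
)
assert a == expected
```

Each chain touches its own elements (an odd start times powers of two), so the chains need no synchronization among themselves.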

Page 11: Recurrence Chain Partitioning of Non-Uniform Dependences

3.2 Solving recurrence equations in the unified iteration space

Dependence equations: $iA + a = jB + b$

Recurrence equations: $j = iT + t$ or $i = (j - t)T^{-1} = jT^{-1} - tT^{-1}$, where

$T = AB^{-1}$

$t = (a - b)B^{-1}$

A recurrence chain is a sequence of dependent iterations such that

$i_{k+1} = i_k T + t$, or $i_{k+1} = (i_k - t)T^{-1}$

$i_0 \in \{\, i \mid \not\exists\, j : iA + a = jB + b \text{ or } iB + b = jA + a \,\}$

We have a variable dependence distance $d_k = i_{k+1} - i_k$:

$d_{k+1} = d_k T$ or $d_k = d_{k+1} T^{-1}$

$d$ is not constant; it grows exponentially with base $\alpha = \max(1/|T|, |T|)$, so the dependence chain length is $O(\log_{\alpha} L)$, where $L$ is the diameter of the iteration space.

When $|T|$ is negative, one can cut the recurrence chain to 2 iterations by lexicographical ordering.
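As a worked instance, T and t can be computed for the loop of section 5.2 (write `a(2*I+3,J+1)`, read `a(I+2*J+1,I+J+3)`). The sketch below assumes iterations are row vectors i = (I, J) and that column c of A or B holds the coefficients of subscript c; the helper names `inv2`, `vecmat`, `matmul` and `recurrence` are hypothetical, and exact rationals are used because T need not be integral in general.

```python
from fractions import Fraction

def inv2(m):
    """Inverse of a 2x2 matrix, computed exactly."""
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]

def vecmat(v, m):
    """Row vector times 2x2 matrix."""
    return tuple(sum(v[r] * m[r][c] for r in range(2)) for c in range(2))

def matmul(x, y):
    return [list(vecmat(row, y)) for row in x]

def recurrence(A, a, B, b):
    """Solve iA + a = jB + b for j = iT + t."""
    B_inv = inv2(B)
    T = matmul(A, B_inv)
    t = vecmat((a[0] - b[0], a[1] - b[1]), B_inv)
    return T, t

A, a = [[2, 0], [0, 1]], (3, 1)     # write subscripts (2I+3, J+1)
B, b = [[1, 1], [2, 1]], (1, 3)     # read subscripts (I+2J+1, I+J+3)
T, t = recurrence(A, a, B, b)
det_T = T[0][0] * T[1][1] - T[0][1] * T[1][0]

# Spot check: iteration (2,6) writes a(7,7); j = iT + t = (2,2) reads a(7,7)
j = tuple(x + y for x, y in zip(vecmat((2, 6), T), t))
assert j == (2, 2)
```

Here T = [[-2, 2], [2, -1]] and t = (-6, 4), so the determinant of T is -2, which is the negative case mentioned above.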

Page 12: Recurrence Chain Partitioning of Non-Uniform Dependences

3.3 Generate code ?

DOALL I = i0
  WHILE (I is in Iteration Space) DO
    S(I)
    I = IT+t or I = (I-t)T^{-1}
  ENDDO
ENDDOALL

Problem: How to tell which index update respects the dependency order?

Page 13: Recurrence Chain Partitioning of Non-Uniform Dependences

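[Figure: the iteration space (axes I1, I2) partitioned into an initial set, an intermediate set and a final set, with regions R1 to R4. Chains start at initial iterations i0 and continue through i1, i2, i3, i4; iterations are marked independent or cyclic, and solutions integer or non-integer.]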

Page 14: Recurrence Chain Partitioning of Non-Uniform Dependences

3.3 Generate code !

DOALL I in P1
  IF (IT+t < I)
    T = T^{-1}; t = -tT
  ENDIF
  WHILE (I is in Iteration Space) DO
    S(I)
    I = IT+t
  ENDDO
ENDDOALL
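The direction test and the chain walk can be sketched as follows, using the recurrence of the loop in section 5.1, where T = [[3, 2], [0, 1]] and t = (-2, -2) under a row-vector convention. Several points are assumptions of this sketch rather than statements from the slides: `<` is read as lexicographic order, the reversed offset is taken as -t T^{-1} (from i = (j - t) T^{-1}), a fixed point is treated as the end of a chain, and the helper names are hypothetical.

```python
from fractions import Fraction

def vecmat(v, m):
    """Row vector times 2x2 matrix."""
    return tuple(sum(v[r] * m[r][c] for r in range(2)) for c in range(2))

def inv2(m):
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    return [[s / det, -q / det], [-r / det, p / det]]

def step(i, T, t):
    x = vecmat(i, T)
    return (x[0] + t[0], x[1] + t[1])

def orient(i, T, t):
    """Pick the update that moves lexicographically forward from i."""
    if step(i, T, t) < i:                 # tuples compare lexicographically
        T_inv = inv2(T)
        return T_inv, tuple(-x for x in vecmat(t, T_inv))
    return T, t

def walk(i0, T, t, n1, n2):
    """Iterations of the chain starting at i0 inside [1,n1] x [1,n2]."""
    T, t = orient(i0, T, t)
    chain, i = [], i0
    while (1 <= i[0] <= n1 and 1 <= i[1] <= n2
           and i[0] == int(i[0]) and i[1] == int(i[1])):
        chain.append((int(i[0]), int(i[1])))
        nxt = step(i, T, t)
        if nxt == i:                      # self-dependence: stop (assumption)
            break
        i = nxt
    return chain

T, t = [[3, 2], [0, 1]], (-2, -2)
chain = walk((2, 1), T, t, 20, 20)
print(chain)  # [(2, 1), (4, 3), (10, 9)]

# Starting from the inverse recurrence gives the same chain after orientation
T_inv = inv2(T)
t_inv = tuple(-x for x in vecmat(t, T_inv))
assert walk((2, 1), T_inv, t_inv, 20, 20) == chain
```

The integrality test ends a chain as soon as the update leaves the integer lattice, which matters once the inverse update (with rational entries) is the one being followed.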

Page 15: Recurrence Chain Partitioning of Non-Uniform Dependences

4. Related work: Strength of REC

(1) Scalability

LEN = length of the chain
In comparison, unique-set oriented methods have to deal with LEN = 2, 3, … differently…
In REC, the WHILE loops adjust their steps automatically…

Page 16: Recurrence Chain Partitioning of Non-Uniform Dependences

4. Related work: Strength of REC

(2) Outermost loop parallelism

Set-oriented:

DOALL I in P1
  S(I)
DOALL I in P2
  S(I)
…
DOALL I in Pn
  D(I)

Recurrence Chain:

DOALL I in P1
  IF (I > IT+t) T = T^{-1}; t = -tT
  WHILE (I in IS) DO
    S(I)
    I = IT+t
  ENDDO
ENDDOALL

Page 17: Recurrence Chain Partitioning of Non-Uniform Dependences

4. Related work

Shortcoming and alternatives

Restriction on the number of dependence equations

Fall back to the following algorithms:
A recursive 3-sets partitioning (3P), similar to unique-sets partitioning but more accurate: it can reuse the calculations for P1, P2, P3.
PDM and other uniformization techniques: PDM is light-weight and can be applied first, followed by 3P.

Page 18: Recurrence Chain Partitioning of Non-Uniform Dependences

[Figure: goal model relating loop parallelization approaches. Legend: Goal, Task, Softgoal; link labels: Make, Help, Hurt.]

Nodes: Uniform Dependence Loop Parallelization; Non-uniform Dependence Loop Parallelization; Uniformization; DOALL; DOACROSS; Set-Oriented; Unique Sets splitting; Recurrence Chain partitioning; Outermost parallelism; Maximal parallelism; Minimal synchronization; Maximal Coverage; Affine Loop bounds; Multiple references; Affine Array subscripts; Non-perfectly Nested Loop; Multiple dimension of loop nests; Statement-level parallelism; Finest Partitioning; Load Balancing; Schedule Regularity.

Citations shown: Yu & D’Hollander 04; Ju & Chaudhary, 97; Cho & , 97; Pean & Chen, 01; Yu & D’Hollander, 04; Tzen & Ni, 91; Chen & Yew, 96; Punyamurtula et al 99; Shang et al 96; Lim & Lam 99; Yu & D’Hollander 00; Wolfe, 87; Wolf, 91; Banerjee, 93; D’Hollander, 92.

Page 19: Recurrence Chain Partitioning of Non-Uniform Dependences

[Figure: the same goal model as on page 18, with evaluation labels added: sat, den; partly, fully.]


Page 22: Recurrence Chain Partitioning of Non-Uniform Dependences

4. Implementations

Front end: source to source transformations
PDM/PL in FPT
Set-oriented algorithms in FPT <-> XML/XSLT <-> OC

Back end: Intel Fortran compiler + OpenMP directives

Experiments on an EPICMP 4-CPU server

Page 23: Recurrence Chain Partitioning of Non-Uniform Dependences

5. Results

5.1 Yu, ICPP00

DO I1=1,N1
  DO I2=1,N2
    a(3*I1+1,2*I1+I2-1) = a(I1+3,I2+1)
  ENDDO
ENDDO
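The dependences of this loop can be enumerated by brute force and compared with a closed form. Solving 3*I1+1 = J1+3 and 2*I1+I2-1 = J2+1 gives the sink (J1, J2) = (3*I1-2, 2*I1+I2-2) for a source (I1, I2). The sketch below is only a check under assumed bounds N1 = N2 = 10; the function names are hypothetical.

```python
def dependences(n1, n2):
    """Map each source iteration to the iteration reading the element it wrote."""
    writer = {}
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            writer[(3 * i1 + 1, 2 * i1 + i2 - 1)] = (i1, i2)
    succ = {}
    for j1 in range(1, n1 + 1):
        for j2 in range(1, n2 + 1):
            src = writer.get((j1 + 3, j2 + 1))
            if src is not None and src != (j1, j2):   # skip self-dependences
                succ[src] = (j1, j2)
    return succ

def longest_chain(succ):
    """Number of iterations in the longest chain of dependent iterations."""
    heads = set(succ) - set(succ.values())
    best = 0
    for i in heads:
        length = 1
        while i in succ:
            i = succ[i]
            length += 1
        best = max(best, length)
    return best

succ = dependences(10, 10)
closed_form_ok = all(
    j == (3 * i[0] - 2, 2 * i[0] + i[1] - 2) for i, j in succ.items()
)
distances = {(j[0] - i[0], j[1] - i[1]) for i, j in succ.items()}

print(closed_form_ok)        # True
print(sorted(distances))     # [(2, 2), (4, 4), (6, 6)]
print(longest_chain(succ))   # 3
```

The distance (2*I1-2, 2*I1-2) grows with I1, so the dependences are non-uniform, while the chains stay short: the first index roughly triples at every step.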

Page 24: Recurrence Chain Partitioning of Non-Uniform Dependences

5.1 Nonfull-rank PDM

[Figure: dependences of example 5.1; axis labels j1, j2, i2.]

Page 25: Recurrence Chain Partitioning of Non-Uniform Dependences


Page 26: Recurrence Chain Partitioning of Non-Uniform Dependences

5.2 Ju, 1997’s example

DO I=1,N
  DO J=1,N
    a(2*I+3,J+1) =
    … = a(I+2*J+1,I+J+3)
  ENDDO
ENDDO

det(PDM) = 2

Page 27: Recurrence Chain Partitioning of Non-Uniform Dependences


UNIQUE vs REC partitioning


Page 28: Recurrence Chain Partitioning of Non-Uniform Dependences

Ju’s Example

Comparison

We corrected the loop-bounds flaw in Ju’s 97 paper; 5 unique sets were derived for this case when N = 12.

But theoretically $O(2^{\log_2 N}) = O(N)$ UNIQUE sets are needed.

In REC partitioning, just one set P1 needs to be calculated for the initial i0.

Page 29: Recurrence Chain Partitioning of Non-Uniform Dependences


Page 30: Recurrence Chain Partitioning of Non-Uniform Dependences

5.3 Chen, 96’s Example

DO I=1,N
  DO J=1,I
    DO K=J,I
      ... = a(I+2*K+5,4*K-J)
    ENDDO
    a(I-J,I+J) = ...
  ENDDO
ENDDO

Page 31: Recurrence Chain Partitioning of Non-Uniform Dependences

Chen’s Example

A special case
It is a non-perfectly nested loop.
First convert it into the unified iteration space.
Then symbolically calculate P1, P2, P3 and find P2 = empty.
Therefore the recurrence chains are at most 1 iteration long, regardless of the loop bounds.

Both REC and three-region partitioning lead to the same optimal solution.

Page 32: Recurrence Chain Partitioning of Non-Uniform Dependences


Page 33: Recurrence Chain Partitioning of Non-Uniform Dependences

5.4 Cholesky kernel (I,K,J,L)

C THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
      DO 7 K = 0, N
      DO 8 L = 0, NMAT
8     B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 7 J = 1, MIN (M, N-K)
      DO 7 L = 0, NMAT
7     B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
      DO 6 K = N, 0, -1
      DO 9 L = 0, NMAT
9     B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 6 J = 1, MIN (M, K)
      DO 6 L = 0, NMAT
6     B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

Loop Fusion

      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
      DO 1 J = 0,I0
C$DOISV
      DO 1 L = 0,NMAT
        IF (K.LE.N) THEN
          IF (J.EQ.0) THEN
8           B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
7           B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
          ENDIF
        ELSE
          IF (J.EQ.0) THEN
9           B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
6           B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
          ENDIF
        ENDIF
1     CONTINUE

Page 34: Recurrence Chain Partitioning of Non-Uniform Dependences

Cholesky Kernel (I,K,J,L)

[Figure: loop projections of the iteration space onto the (I,K,J) and (I,K,L) axes; plane L=0.]

Page 35: Recurrence Chain Partitioning of Non-Uniform Dependences


Recursive Three Region partitioning

After loop fusion

Page 36: Recurrence Chain Partitioning of Non-Uniform Dependences

6. Summary

Recurrence Chain partitioning is scalable to any size of the iteration space.

REC partitioning reveals outermost parallelism, with no synchronization between partitioned regions.

The limitation of REC partitioning and its compensation: we provide fall-back alternatives if REC cannot be applied:
(1) PDM + Minimal distance (always applicable)
(2) Recursive three-region partitioning (applicable for constant loop bounds; in some cases, e.g. Chen’s example, for any loop bounds)

[Figure labels: PDM, 3R, REC]