Post on 19-Dec-2015
![Page 1: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/1.jpg)
Loop Tiling for Iterative Stencil Computations
Marta Jiménez
![Page 2: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/2.jpg)
What is an Iterative Stencil Computation?
• ISC often performed for PDE, GM, IP
– swim, tomcatv, mgrid (from SPEC95 benchmark)
– Jacobi

    DO K = 1, NITER          /* time-step loop */
      do J = ...
        do I = ...
          {A(I,J), A(I+1,J), …}
        enddo
      enddo
      {wrapped-around computations}
    ENDDO

(Figure: matrix A)
![Page 3: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/3.jpg)
Loop Tiling
• Loop Tiling
– divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
– can be applied hierarchically (Multilevel Tiling)
• Current algorithms for Loop Tiling are limited to loops that:
– are “perfectly” nested
– are fully permutable
– define a rectangular IS
• However, in iterative stencil computations, loops are:
– NOT perfectly nested
– NOT fully permutable
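
As a point of reference for the restrictions listed above, here is a minimal sketch of tiling in the easy case: a perfectly nested, fully permutable, rectangular 2D loop nest. The function names and tile sizes are illustrative assumptions, not taken from the talk.

```python
def untiled(n, m):
    """Visit order of a plain rectangular J/I loop nest."""
    return [(j, i) for j in range(n) for i in range(m)]


def tiled(n, m, tj, ti):
    """Same iteration space, visited tile by tile (tile size tj x ti)."""
    order = []
    for jj in range(0, n, tj):              # loop over tiles in J
        for ii in range(0, m, ti):          # loop over tiles in I
            for j in range(jj, min(jj + tj, n)):      # iterations inside one tile
                for i in range(ii, min(ii + ti, m)):
                    order.append((j, i))
    return order


if __name__ == "__main__":
    # Tiling only reorders the iterations; none is lost or duplicated.
    print(sorted(tiled(5, 7, 2, 3)) == untiled(5, 7))
```

The reordering is only legal here because the nest has no dependencies that the new order could break, which is exactly what iterative stencil computations do not guarantee.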
![Page 4: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/4.jpg)
Today’s talk
• Show how Loop Tiling can be applied to iterative stencil computations
– based on Song & Li’s paper [PLDI99]
• Define a Program Model
• 1 level of 1D-Tiling (cache)
– program example: SWIM
• 2 levels of Tiling
– 2D-Tiling at the cache level
– 1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98])
• Performance Results
– Loop Tiling on EV5 & EV6
![Page 5: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/5.jpg)
Steps
1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
![Page 6: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/6.jpg)
1st Step: achieve desired program model
Program Model:

    DO K = 1, NITER          /* time-step loop */
      do J1 = LJ1, UJ1
        do I1 = LI1, UI1
          {A(I,J), A(I+1,J), …}
        enddo
      enddo
      . . .
      do Jm = LJm, UJm
        do Im = LIm, UIm
          {A(I,J), A(I+1,J), …}
        enddo
      enddo
    ENDDO

Usually, programs are NOT directly written in this form
– We must apply a set of transformations to achieve this program model
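
A small program that already has this shape may help fix ideas. The sketch below is a Jacobi relaxation written as a time-step loop containing two JI-loop nests (compute, then copy back). The grid size, the 4-point stencil and the name `jacobi` are assumptions chosen for illustration.

```python
def jacobi(grid, niter):
    """Time-step loop around a sequence of two JI-loop nests."""
    n, m = len(grid), len(grid[0])
    a = [row[:] for row in grid]
    b = [row[:] for row in grid]
    for k in range(niter):                  # DO K = 1, NITER
        for j in range(1, n - 1):           # JI-loop nest 1: stencil
            for i in range(1, m - 1):
                b[j][i] = 0.25 * (a[j - 1][i] + a[j + 1][i]
                                  + a[j][i - 1] + a[j][i + 1])
        for j in range(1, n - 1):           # JI-loop nest 2: copy back
            for i in range(1, m - 1):
                a[j][i] = b[j][i]
    return a


if __name__ == "__main__":
    # 4x4 grid: boundary fixed at 1.0, interior starts at 0.0
    start = [[1.0] * 4,
             [1.0, 0.0, 0.0, 1.0],
             [1.0, 0.0, 0.0, 1.0],
             [1.0] * 4]
    print(jacobi(start, 1)[1][1])
```

The K loop is not perfectly nested (it holds two nests) and cannot simply be interchanged with J or I, which is why the tiling steps that follow need extra machinery.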
![Page 7: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/7.jpg)
SWIM original code
    initializations
90  NCYCLE = NCYCLE + 1
    CALL CALC1
    CALL CALC2
    IF (NCYCLE >= ITMAX) STOP
    IF (NCYCLE <= 1) THEN
      CALL CALC3Z
    ELSE
      CALL CALC3
    ENDIF
    GO TO 90

    SUBROUTINE CALCX
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
c     wrapped-around computations
      do J = 1, N
        ...
      enddo
      do I = 1, M
        ...
      enddo
      ...

Transformations
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
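
The GO TO conversion and the peeling can be checked on the control flow alone. The sketch below records only the order of the subroutine calls. It assumes NCYCLE starts at 0 and ITMAX >= 2, and the function names are hypothetical.

```python
def original_trace(itmax):
    """Call order of the GO TO version (NCYCLE assumed to start at 0)."""
    trace = []
    ncycle = 0
    while True:                              # label 90 ... GO TO 90
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:
            break                            # STOP
        if ncycle <= 1:
            trace.append("CALC3Z")
        else:
            trace.append("CALC3")
    return trace


def transformed_trace(itmax):
    """DO-loop version with first and last iterations peeled (ITMAX >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]     # peeled iteration K = 1
    for k in range(2, itmax):                # DO K = 2, ITMAX-1
        trace += ["CALC1", "CALC2", "CALC3"]
    trace += ["CALC1", "CALC2"]              # peeled iteration K = ITMAX
    return trace


if __name__ == "__main__":
    print(all(original_trace(t) == transformed_trace(t) for t in range(2, 10)))
```

After peeling, the body of the remaining K loop contains no IF on NCYCLE, so every iteration executes the same sequence of loop nests.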
![Page 8: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/8.jpg)
Wrapped-around Computations
    DO K = 2, ITMAX-1
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
c     wrapped-around comp
      do J = 1, N
        ...
      enddo
      do I = 1, M
        ...
      enddo
      ...
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
      ...
      ...
    ENDDO

(Figure: iteration spaces over axes I and J; loop nests labeled CALC1, CALC2, CALC3)
![Page 9: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/9.jpg)
Wrapped-around Computations
Projection along direction I

    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
c     wrapped-around comp
      do J = 1, N
        ...
      enddo
      do J = 1,N
        ...
      enddo
c     wrapped-around comp
      do J = 1, N
        ...
      enddo
      ...
    ENDDO

(Figure: axis J)

Another way of dealing with the wrapped-around computations is to perform code sinking
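
One possible reading of code sinking for the wrapped-around computations is sketched below: the boundary copies that follow the main nest are moved into an enlarged nest and guarded by conditions. The array shape, the filled-in values and the function names are assumptions for illustration. The talk does not show SWIM's actual sunk code.

```python
def wrap_separate(m, n):
    """Main JI nest followed by separate wrapped-around loops."""
    u = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n):
        for i in range(m):
            u[i][j] = 10 * i + j            # stand-in for the real loop body
    for j in range(n):                      # wrap around in I
        u[m][j] = u[0][j]
    for i in range(m + 1):                  # wrap around in J
        u[i][n] = u[i][0]
    return u


def wrap_sunk(m, n):
    """Same computation with the wrap statements sunk into one guarded nest."""
    u = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        for i in range(m + 1):
            if j < n and i < m:
                u[i][j] = 10 * i + j        # original body
            elif j < n:
                u[m][j] = u[0][j]           # i == m: wrap around in I
            else:
                u[i][n] = u[i][0]           # j == n: wrap around in J
    return u


if __name__ == "__main__":
    print(wrap_separate(3, 4) == wrap_sunk(3, 4))
```

The sunk form has a single perfectly nested loop at the cost of guards in the innermost body, which is the usual trade-off of this transformation.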
![Page 10: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/10.jpg)
1st Step: achieved program model

    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
    ENDDO

Flow dependencies & iteration space for SWIM (projection along direction I)
(Figure: rows CALC1, CALC2, CALC3 over J = 1..N, repeated for K=2 and K=3 along the K-loop (time) axis)
![Page 11: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/11.jpg)
Steps
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
![Page 12: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/12.jpg)
1D-Tiling
Dependencies are violated
Tiling parameters: SLOPE, OFFSETS-i
(Figure: tiles drawn over J = 1..N for K=2, K=3, K=4; labels SLOPE and OFFSET-i)
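
Why a slope is needed can be seen on a toy example. The sketch below uses an in-place 3-point stencil (an assumption, not SWIM) and tiles loop J across time steps. With `slope=0` the tiles are rectangular and a value is read before its neighbor has been updated. With `slope=1` each tile shifts left by one column per time step and the result matches the untiled sweep. OFFSET-i is not modeled here.

```python
def sweep_reference(values, steps):
    """Untiled in-place sweep: K outer, J inner."""
    a = list(values)
    n = len(a)
    for k in range(steps):
        for j in range(1, n - 1):
            a[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
    return a


def sweep_tiled(values, steps, tile, slope):
    """1D tiling of J with tiles shifted by `slope` columns per time step."""
    a = list(values)
    n = len(a)
    last = (n - 2) + slope * (steps - 1)    # largest skewed coordinate j + slope*k
    for jj in range(1, last + 1, tile):     # loop over tiles
        for k in range(steps):              # all time steps inside one tile
            lo = max(1, jj - slope * k)
            hi = min(n - 2, jj + tile - 1 - slope * k)
            for j in range(lo, hi + 1):
                a[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
    return a


if __name__ == "__main__":
    data = [float(j * j) for j in range(16)]
    ref = sweep_reference(data, 3)
    print(sweep_tiled(data, 3, 4, slope=1) == ref)   # skewed tiles
    print(sweep_tiled(data, 3, 4, slope=0) == ref)   # rectangular tiles
```

In this toy case the slope equals the largest backward dependence distance in J per time step, which is 1. The SLOPE used for SWIM in the talk is derived from the JI-loop distance subgraph instead.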
![Page 13: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/13.jpg)
2D-Tiling
Tiling parameters: SLOPE, OFFSETS-i for each tiled dimension (J and I)
Computed using the JI-loop distance subgraph
(Figure: I-J planes with I = 1..M and J = 1..N, stacked along K (time-step loop))
![Page 14: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/14.jpg)
JI-loop Distance Subgraph
Each node represents a JI-loop nest
Each edge represents a dependence (distance vector)
(Figure: nodes JI1-loop, JI2-loop, JI3-loop; edge types: flow dependencies, anti-dependencies, output dependencies; edge labels: [1,-1,0], [1,0,-1], [1,-1,-1], [1,0,0], [0,0,0])
![Page 15: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/15.jpg)
SWIM: Projection along direction I
Wrapped-around Computations
Backward dependencies with large distances make Tiling not profitable
– apply Circular Loop Skewing to shorten backward dependencies

    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
    ENDDO

(Figure: J = 1..N against the K-loop (time), K=2 and K=3)
![Page 16: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/16.jpg)
Circular Loop Skewing
Shortens backward dependencies by changing the iteration order
CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)
(Figure: iteration order over J = 1..N at K=2 and K=3, with iterations numbered 1 to 4; labels BETA-i and DELTA)
![Page 17: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/17.jpg)
Circular Loop Skewing
    DO K = 2, ITMAX-1
      do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
      do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
      do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
    ENDDO

(Figure: iteration order over J = 1..N at K=2 and K=3, with iterations numbered 1 to 4; labels BETA-i and DELTA)
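
The effect of the JX/MOD index mapping can be checked in isolation. The sketch below lists the order in which one skewed loop visits J for a given time step. `cls_order` and the sample parameter values are illustrative, and DELTA is treated as a scalar multiplied by (K-2).

```python
def cls_order(n, beta, delta, k, k_first=2):
    """Order in which J = 1..N is visited by one circularly skewed loop."""
    start = 1 + beta + delta * (k - k_first)
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, cls_order(8, beta=1, delta=2, k=k))
```

Each loop still executes every J exactly once. Only the starting point rotates, by BETA-i between loop nests and by DELTA between time steps.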
![Page 18: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/18.jpg)
2nd Step: 2D-Tiling for cache level

    DO JJ = ...
      DO II = ...
        DO K = ...
          if (first tile) then
            do JX = ...
              offsets iter.
            enddo
          endif
          do JX = ...
            Iter. inside tile
          enddo
          do JX = ...
            Iter. inside tile
          enddo
          do JX = ...
            Iter. inside tile
          enddo
        ENDDO
      ENDDO
    ENDDO

SWIM: projection along direction I
CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0
(Figure: tiles numbered 1, 2, 3 over J = 1..N at K=2, K=3, K=4)
![Page 19: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/19.jpg)
Steps
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
![Page 20: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/20.jpg)
3rd Step: 1D-Tiling for register level
    DO JJ = ...
      DO II = ...
        DO K = ...
          ...
          do JX = LJ, UJ
            J = MOD (JX-1, N)+1
            do IX = LI, UI
              I = MOD (IX-1, M)+1
              [loop body: {I,J}]
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

The MOD operation introduced by CLS prevents us from fully unrolling the loop
First apply Index Set Splitting to loop J
(Figure: I-J iteration space with I running 1, 2, ..., M-2, M-1, M and J wrapping through N-2, N-1, N, 1, 2; the unrolled region is marked)
![Page 21: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/21.jpg)
Index Set Splitting
ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space

    DO JJ = ...
      DO II = ...
        DO K = ...
          ...
          do JX = LJ, min(N,UJ)
            J = JX
            do IX = ...
            enddo
          enddo
          do JX = max(N+1,LJ), UJ
            J = JX-N
            do IX = ...
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

(Figure: I-J iteration space with I running 1, 2, ..., M-2, M-1, M and J wrapping through N-2, N-1, N, 1, 2; the ISS split point is marked)
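
The split can be checked against the MOD form directly. The sketch below compares the sequence of J values produced by both versions. It assumes 1 <= LJ and UJ <= 2N, so that JX wraps around at most once. The function names are hypothetical.

```python
def j_with_mod(lj, uj, n):
    """J values visited by the original loop with the MOD mapping."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def j_with_iss(lj, uj, n):
    """J values visited after Index Set Splitting (no MOD in either loop)."""
    visited = []
    for jx in range(lj, min(n, uj) + 1):        # part before the wrap: J = JX
        visited.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):    # part after the wrap: J = JX - N
        visited.append(jx - n)
    return visited


if __name__ == "__main__":
    print(j_with_mod(6, 11, 8), j_with_iss(6, 11, 8))
```

In both new loops J is an affine function of JX, so each of them can be unrolled in the usual way.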
![Page 22: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/22.jpg)
3rd Step: 1D-Tiling for register level

    DO JJ = ...
      DO II = ...
        DO K = ...
          ...
          do JX = LJ, min(N,UJ)-3+1, 3
            J = JX
            do IX = ...
              [loop body: {J}]
              [loop body: {J+1}]
              [loop body: {J+2}]
            enddo
          enddo
          do JX = JX, min(N,UJ)
            J = JX
            do IX = ...
              [loop body: {J}]
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

(Figure: I-J iteration space with I running 1, 2, ..., M-2, M-1, M and J wrapping through N-2, N-1, N, 1, 2; the ISS split point is marked)
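
The register-level tile is an unroll of loop J by 3 with the copies fused into the inner I loop, plus a cleanup loop for the leftover iterations. The sketch below records the (J, I) pairs visited. The bounds and the function names are assumptions for illustration, and the first part of the ISS split is taken as given.

```python
def visit_plain(lj, uj, li, ui):
    """(J, I) pairs visited by the loop nest before register tiling."""
    return [(j, i) for j in range(lj, uj + 1) for i in range(li, ui + 1)]


def visit_register_tiled(lj, uj, li, ui):
    """Loop J unrolled by 3 and jammed into loop I, followed by a cleanup loop."""
    visited = []
    jx = lj
    while jx <= uj - 3 + 1:                 # do JX = LJ, UJ-3+1, 3
        for i in range(li, ui + 1):
            visited.append((jx, i))         # loop body: {J}
            visited.append((jx + 1, i))     # loop body: {J+1}
            visited.append((jx + 2, i))     # loop body: {J+2}
        jx += 3
    for j in range(jx, uj + 1):             # cleanup: do JX = JX, UJ
        for i in range(li, ui + 1):
            visited.append((j, i))
    return visited


if __name__ == "__main__":
    tiled = visit_register_tiled(1, 8, 1, 4)
    print(sorted(tiled) == visit_plain(1, 8, 1, 4))
```

The three body copies in the inner loop touch neighboring J values for the same I, which is what lets a compiler keep the shared operands in registers.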
![Page 23: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/23.jpg)
Code Transformations Summary
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements
2- Perform 2D-Tiling for the Cache Level
– Construct JI-loop distance subgraph
– Compute DELTA and BETAs and apply CLS to shorten backward dependencies
– Update JI-loop distance subgraph
– Compute OFFSETs and SLOPE and tile the IS
3- Perform 1D-Tiling for the Register Level
– Index Set Splitting
– Tiling in a straightforward manner
![Page 24: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/24.jpg)
Performance Results (SWIM)
• Architecture: EV56 (500MHz, L1: 8KB, L2: 96KB), EV6 (500MHz, L1: 64KB, L2: 4MB)
• Compiler Invocation:
– f77 -O5 -arch ev56 (EV5)
– kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
• Programs:
– 1D-Tiling for the Cache Level: loop J, TS = 4 (EV5), TS = 8 (EV6)
– 2D-Tiling for the Cache Level: TSIxJ = 32x16 (EV5), TSIxJ = 40x12 (EV6)
– 1D-Tiling for the register level: loop J, TS = 4 (EV5 & EV6)

(Figure: bar chart of Speedup, axis 0.5 to 2.5, series EV6 and EV5, for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT)

Execution time:

           ORI      ORI + RT   1D       1D + RT   2D       2D + RT
    EV6    439s     658s       294s     371s      578s     296s
    EV5    1519s    1533s      1023s    999s      1009s    677s
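
The speedups plotted in the chart can be recomputed from the execution times on this slide. The sketch below divides the ORI time by each variant's time. Which row of times belongs to which machine is an assumption here (the faster row is taken to be the EV6), since the slide layout did not survive extraction.

```python
VARIANTS = ["ORI", "ORI + RT", "1D", "1D + RT", "2D", "2D + RT"]

# Execution times in seconds, as listed on the slide (row labels assumed).
TIMES = {
    "EV6": [439, 658, 294, 371, 578, 296],
    "EV5": [1519, 1533, 1023, 999, 1009, 677],
}


def speedups(times):
    """Speedup of every variant over the first entry (ORI)."""
    base = times[0]
    return [round(base / t, 2) for t in times]


if __name__ == "__main__":
    for machine, times in TIMES.items():
        print(machine, dict(zip(VARIANTS, speedups(times))))
```

Under that assumption, the largest gain is on the EV5 with 2D tiling plus register tiling, at roughly 2.2x over the original code.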
![Page 25: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/25.jpg)
Performance Results EV5 (SWIM)
• Architecture: EV56 (500MHz, L1: 8KB, L2: 96KB)
• Compiler invocations:
– base: kf77 -O5 -arch ev56
– no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …..

(Figure: bar chart of Speedup over ORI (base), axis 0.5 to 2.5, series base and no_prefetch, for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT)
![Page 26: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/26.jpg)
Performance Results EV6 (SWIM)
• Architecture: EV6 (500MHz, L1: 64KB, L2: 4MB)
• Compiler invocations:
– base: f77 -O5 -arch ev6
– no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …..

(Figure: bar chart of Speedup over ORI (base), axis 0 to 2.5, series base and no_prefetch, for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT)
![Page 27: Loop Tiling for Iterative Stencil Computations Marta Jiménez](https://reader030.vdocuments.site/reader030/viewer/2022033022/56649d385503460f94a1136c/html5/thumbnails/27.jpg)
Code for Result Verification

    DO K = 2, ITMAX-1
      ...
      do J = 1,N
        ...
      enddo
c     result verification
      IF (MOD(K,MPRINT).eq.0) THEN
        do I = ...
          do J = ...
            UCHECK = UCHECK + {UNEW(I,J)}
          enddo
          UNEW (I,I) = . . .
        enddo
        PRINTS
      ENDIF
      do J = 1,N
        ...
      enddo
    ENDDO

Apply strip-mining to loop K (only useful if MPRINT is large)
NEW in SPEC2000!!
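
A simplified sketch of the strip-mining idea follows. Loop K is cut into strips that end at multiples of MPRINT, so the verification moves out of the inner K loop and each strip is a run of IF-free time steps. As a simplifying assumption, the verification is placed at the end of an iteration rather than between two loop nests, and the function names are hypothetical.

```python
def original_schedule(itmax, mprint):
    """Time steps with the verification IF inside loop K."""
    events = []
    for k in range(2, itmax):                       # DO K = 2, ITMAX-1
        events.append(("step", k))
        if k % mprint == 0:
            events.append(("verify", k))
    return events


def strip_mined_schedule(itmax, mprint):
    """Loop K strip-mined so that verification only happens between strips."""
    events = []
    k_lo = 2
    while k_lo <= itmax - 1:
        # the strip ends at the next multiple of MPRINT or at the last iteration
        k_hi = min(((k_lo + mprint - 1) // mprint) * mprint, itmax - 1)
        for k in range(k_lo, k_hi + 1):             # IF-free inner K loop
            events.append(("step", k))
        if k_hi % mprint == 0:
            events.append(("verify", k_hi))
        k_lo = k_hi + 1
    return events


if __name__ == "__main__":
    print(original_schedule(12, 5) == strip_mined_schedule(12, 5))
```

Each strip holds at most MPRINT time steps, so a small MPRINT leaves little temporal reuse to exploit, which is consistent with the remark that this is only useful when MPRINT is large.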