Single-dimension Software Pipelining for Multi- dimensional Loops IFIP Tele-seminar June 1, 2004 Hongbo Rong Zhizhong Tang Alban Douillet Ramaswamy Govindarajan Guang R. Gao Presented by: Hongbo Rong

Single-dimension Software Pipelining for Multi-dimensional Loops

IFIP Tele-seminar June 1, 2004

Hongbo Rong

Zhizhong Tang

Alban Douillet

Ramaswamy Govindarajan

Guang R. Gao

Presented by: Hongbo Rong

Introduction

• Loops and software pipelining are important
• Innermost loops are not enough [Burger&Goodman04]
    • Billion-transistor architectures tend to have much more parallelism
• Previous methods for scheduling multi-dimensional loops are meeting new challenges

Motivating Example

    int U[N1+1][N2+1], V[N1+1][N2+1];
    L1: for (i1=0; i1<N1; i1++) {
    L2:   for (i2=0; i2<N2; i2++) {
    a:      U[i1+1][i2] = V[i1][i2] + U[i1][i2];
    b:      V[i1][i2+1] = U[i1+1][i2];
          }
        }

Dependence graph (distance vectors <d1,d2>): a -> b <0,0>; b -> a <0,1>; a -> a <1,0>.

A strong cycle in the inner loop: no parallelism.

Loop Interchange Followed by Modulo Scheduling of the Inner Loop

• Why not select a better loop to software pipeline?
• Which and how?

Dependence graph after interchange: a -> b <0,0>; b -> a <1,0>; a -> a <0,1>.
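A minimal sketch of why interchange is an option for the motivating example: every distance vector stays lexicographically non-negative when its two components are swapped, so running i2 outermost should leave U and V unchanged. The array sizes, initial values and helper names below are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <string.h>

enum { N1 = 6, N2 = 3 };   /* illustrative trip counts */

typedef struct {
    int U[N1 + 1][N2 + 1];
    int V[N1 + 1][N2 + 1];
} Arrays;

/* Arbitrary but deterministic initial contents (assumed values). */
static void init_arrays(Arrays *m) {
    for (int i = 0; i <= N1; i++)
        for (int j = 0; j <= N2; j++) {
            m->U[i][j] = i + 2 * j + 1;
            m->V[i][j] = 3 * i - j;
        }
}

/* Statements a and b of the motivating example. */
static void stmt_a(Arrays *m, int i1, int i2) {
    m->U[i1 + 1][i2] = m->V[i1][i2] + m->U[i1][i2];
}
static void stmt_b(Arrays *m, int i1, int i2) {
    m->V[i1][i2 + 1] = m->U[i1 + 1][i2];
}

/* Original order: i1 outermost, i2 innermost. */
static void run_original(Arrays *m) {
    for (int i1 = 0; i1 < N1; i1++)
        for (int i2 = 0; i2 < N2; i2++) {
            stmt_a(m, i1, i2);
            stmt_b(m, i1, i2);
        }
}

/* Interchanged order: i2 outermost, i1 innermost. */
static void run_interchanged(Arrays *m) {
    for (int i2 = 0; i2 < N2; i2++)
        for (int i1 = 0; i1 < N1; i1++) {
            stmt_a(m, i1, i2);
            stmt_b(m, i1, i2);
        }
}

/* Returns 1 when both loop orders leave identical U and V. */
int interchange_preserves_result(void) {
    Arrays x, y;
    init_arrays(&x);
    init_arrays(&y);
    run_original(&x);
    run_interchanged(&y);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

The check only confirms that the interchanged order is legal for this example; it says nothing about which loop is more profitable to pipeline, which is the question the slide raises.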

Starting from A Naïve Approach

[Figure: schedule of operations a(i1,i2) and b(i1,i2) for i1 = 0..5 and i2 = 0..2; axes: Cycle (0 to 12) and i1 (0 to 5). Annotation: resource conflicts.]

Parameters: 2 function units; a: 1 cycle; b: 2 cycles; N2=3.

Dependence graph: a -> b <0,0>; b -> a <0,1>; a -> a <1,0>.

Looking from Another Angle

[Figure: the same operations a(i1,i2) and b(i1,i2), i1 = 0..5 and i2 = 0..2, grouped into Slice 1, Slice 2 and Slice 3; axes: Cycle (0 to 12) and i1 (0 to 5). Annotations: initiation interval T=1; kernel, with S=3 stages; resource conflicts.]

SSP (Single-dimension Software Pipelining)

[Figure: schedule of operations a(i1,i2) and b(i1,i2), i1 = 0..5 and i2 = 0..2; axes: Cycle (0 to 12) and i1 (0 to 5). Annotations: initiation interval T=1; kernel, with S=3 stages; Delay = (N2-1)*S*T.]

SSP (Single-dimension Software Pipelining)

• An iteration point per cycle
• Filling & draining naturally overlapped
• Dependences are still respected!
• Resources fully used
• Data reuse exploited!

[Figure: final schedule of operations a(i1,i2) and b(i1,i2), i1 = 0..5 and i2 = 0..2; axes: Cycle (0 to 12) and i1 (0 to 5). Annotations: initiation interval T=1; kernel, with S=3 stages; Delay = (N2-1)*S*T.]

Loop Rewriting

    int U[N1+1][N2+1], V[N1+1][N2+1];
    L1': for (i1=0; i1<N1; i1+=3) {
             b(i1-1, N2-1)    a(i1, 0)
             b(i1, 0)         a(i1+1, 0)
             b(i1+1, 0)       a(i1+2, 0)
    L2':     for (i2=1; i2<N2; i2++) {
                 a(i1, i2)        b(i1+2, i2-1)
                 b(i1, i2)        a(i1+1, i2)
                 b(i1+1, i2)      a(i1+2, i2)
             }
         }
         b(i1-1, N2-1)
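The rewritten loop can be sanity-checked with a small sketch that executes its operations one after another in the order listed on the slide and compares the arrays with those produced by the original nest. This assumes N1 is a multiple of 3, that b(i1-1, N2-1) is simply skipped when i1-1 is negative, and illustrative initial values; it checks ordering only, not the cycle-level timing.

```c
#include <assert.h>
#include <string.h>

enum { N1 = 6, N2 = 3, S = 3 };   /* N1 assumed to be a multiple of S */

typedef struct {
    int U[N1 + 1][N2 + 1];
    int V[N1 + 1][N2 + 1];
} Arrays;

/* Arbitrary but deterministic initial contents (assumed values). */
static void init_arrays(Arrays *m) {
    for (int i = 0; i <= N1; i++)
        for (int j = 0; j <= N2; j++) {
            m->U[i][j] = i + 2 * j + 1;
            m->V[i][j] = 3 * i - j;
        }
}

static void op_a(Arrays *m, int i1, int i2) {
    m->U[i1 + 1][i2] = m->V[i1][i2] + m->U[i1][i2];
}

/* Assumed guard: nothing is pending before the first group, so i1 < 0 is a no-op. */
static void op_b(Arrays *m, int i1, int i2) {
    if (i1 < 0) return;
    m->V[i1][i2 + 1] = m->U[i1 + 1][i2];
}

static void run_original(Arrays *m) {
    for (int i1 = 0; i1 < N1; i1++)
        for (int i2 = 0; i2 < N2; i2++) {
            op_a(m, i1, i2);
            op_b(m, i1, i2);
        }
}

/* Operations in the order of the rewritten loop, one slide row per line. */
static void run_rewritten(Arrays *m) {
    int i1;
    for (i1 = 0; i1 < N1; i1 += S) {
        op_b(m, i1 - 1, N2 - 1); op_a(m, i1, 0);
        op_b(m, i1, 0);          op_a(m, i1 + 1, 0);
        op_b(m, i1 + 1, 0);      op_a(m, i1 + 2, 0);
        for (int i2 = 1; i2 < N2; i2++) {
            op_a(m, i1, i2);     op_b(m, i1 + 2, i2 - 1);
            op_b(m, i1, i2);     op_a(m, i1 + 1, i2);
            op_b(m, i1 + 1, i2); op_a(m, i1 + 2, i2);
        }
    }
    op_b(m, i1 - 1, N2 - 1);   /* drain the last pending b */
}

/* Returns 1 when the rewritten order reproduces the original result. */
int rewritten_preserves_result(void) {
    Arrays x, y;
    init_arrays(&x);
    init_arrays(&y);
    run_original(&x);
    run_rewritten(&y);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

Each group of three outermost iterations leaves one b unfinished, which is why the slide starts every group, and ends the whole loop, with b(i1-1, N2-1).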

Outline

• Motivation
• Problem Formulation & Perspective
• Properties
• Extensions
• Current and Future work
• Code Generation and experiments

Problem Formulation

Given a loop nest L composed of n loops L1, …, Ln, identify the most profitable loop level Lx, with 1 <= x <= n, and software pipeline it.

• Which loop to software pipeline?
• How to software pipeline the selected loop?
    • How to handle the n-D dependences?
    • How to enforce resource constraints?
    • How can we guarantee that repeating patterns will definitely appear?

Single-dimension Software Pipelining

• A resource-constrained scheduling method for loop nests
• Can schedule at an arbitrary level
• Simplify n-D dependences to 1-D
• 3 steps
    • Loop Selection
    • Dependence Simplification and 1-D Schedule Construction
    • Final schedule computation

Perspective

• Which loop to software pipeline?
    • Most profitable one in terms of parallelism, data reuse, or others
• How to software pipeline the selected loop?
    • Allocate iteration points to slices
    • Software pipeline each slice
    • Partition slices into groups
    • Delay groups until resources available
• Enforce resource constraints in two steps

Perspective (Cont.)

• How to handle dependences?
    • If a dependence is respected before pushing down the groups, it will be respected afterwards
    • Simplify dependences from n-D to 1-D

How to handle dependences?

[Figure: schedule of operations a(i1,i2) and b(i1,i2), i1 = 0..5 and i2 = 0..2; axes: Cycle (0 to 12) and i1 (0 to 5); shown with the dependence graph a -> b <0,0>; b -> a <0,1>; a -> a <1,0>. Annotations: dependences within a slice; dependences between slices; still respected after pushing down.]

Simplify n-D Dependences

[Figure: iteration points (i1, 0, …, 0, 0), (i1+1, 0, …, 0, 0), … and (i1, 0, …, 0, 1), (i1+1, 0, …, 0, 1), … plotted against Cycle.]

• Only the first distance useful
• Dependence graph: a -> b <0,0>; a -> a <1,0>; b -> a <0,1> (ignorable)
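One way to read the simplification on this slide is sketched below. It assumes the pipelined loop is the first dimension of each distance vector and that a dependence with a zero first distance but a non-zero remainder is the "ignorable" kind (as with b -> a <0,1>); every other dependence keeps only its first distance. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DIM = 8 };   /* illustrative upper bound on nest depth */

typedef struct {
    int src, dst;        /* operation ids, e.g. 0 for a and 1 for b */
    int dist[MAX_DIM];   /* n-D distance vector, pipelined loop first */
} DepND;

typedef struct {
    int src, dst;
    int dist;            /* distance along the pipelined loop only */
} Dep1D;

/* Writes the simplified dependences to out and returns how many were kept.
   Assumed rule, taken from the slide's example rather than a stated definition. */
size_t simplify_deps(const DepND *in, size_t count, int n, Dep1D *out) {
    assert(n >= 1 && n <= MAX_DIM);
    size_t kept = 0;
    for (size_t k = 0; k < count; k++) {
        assert(in[k].dist[0] >= 0);   /* first distance must not be negative */
        int inner_all_zero = 1;
        for (int x = 1; x < n; x++)
            if (in[k].dist[x] != 0) inner_all_zero = 0;
        if (in[k].dist[0] == 0 && !inner_all_zero)
            continue;                 /* like b -> a <0,1>: dropped */
        out[kept].src = in[k].src;
        out[kept].dst = in[k].dst;
        out[kept].dist = in[k].dist[0];
        kept++;
    }
    return kept;
}
```

For the motivating example this keeps a -> b <0> and a -> a <1>, which matches the 1-D graph shown for the outer loop on the later slides.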

Step 1: Loop Selection

• Scan each loop.
• Evaluate parallelism
    • Recurrence Minimum II (RecMII) from the cycles in 1-D DDG
• Evaluate data reuse
    • Average memory accesses of an S*S tile from the future final schedule (optimized iteration space).

Example: Evaluate Parallelism

• Inner loop: RecMII=3. 1-D DDG: a -> b <0>; b -> a <1>.
• Outer loop: RecMII=1. 1-D DDG: a -> b <0>; a -> a <1>.
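The two RecMII values can be reproduced with the usual definition: the maximum, over all dependence cycles, of the cycle's total latency divided by its total distance, rounded up. The sketch below assumes the latencies given earlier (a: 1 cycle, b: 2 cycles) and takes the cycles as pre-summarised input; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int latency;    /* sum of operation latencies around the cycle */
    int distance;   /* sum of 1-D dependence distances around the cycle */
} DepCycle;

/* Recurrence-constrained minimum initiation interval.
   Returns 0 when the 1-D DDG has no cycles at all. */
int rec_mii(const DepCycle *cycles, size_t count) {
    int best = 0;
    for (size_t k = 0; k < count; k++) {
        /* A cycle with zero total distance could never be scheduled. */
        assert(cycles[k].distance > 0);
        int bound = (cycles[k].latency + cycles[k].distance - 1)
                    / cycles[k].distance;   /* ceiling division */
        if (bound > best) best = bound;
    }
    return best;
}
```

For the inner loop, the cycle a -> b -> a has latency 1 + 2 = 3 and distance 0 + 1 = 1, giving 3. For the outer loop, the only cycle is a -> a with latency 1 and distance 1, giving 1.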

Evaluate Data Reuse

• Symbolic parameters
    • S: total stages
    • l: cache line size
• Evaluate data reuse [WolfLam91]
    • Localized space = span{(0,1),(1,0)}
    • Calculate equivalence classes for temporal and spatial reuse space
    • Average accesses = 2/l

[Figure: optimized iteration space; axes: i1 (0, 1, …, S-1, S, S+1, …, 2S-1, …, N1-1) and Cycle.]

Step 2: Dependence Simplification and 1-D Schedule Construction

• Dependence Simplification
    • n-D DDG: a -> b <0,0>; b -> a <0,1>; a -> a <1,0>
    • 1-D DDG: a -> b <0>; a -> a <1>
• 1-D schedule construction
    • Modulo property
    • Resource constraints
    • Sequential constraints
    • Dependence constraints

[Figure: 1-D schedule of operations a and b, with initiation interval T.]

Final Schedule Computation

Example: a(5,2)

[Figure: final schedule of operations a(i1,i2) and b(i1,i2), i1 = 0..5 and i2 = 0..2; axes: Cycle (0 to 12) and i1 (0 to 5).]

• Modulo schedule time = 5
• Distance = i2*S*T = 2*3*1 = 6
• Delay = (N2-1)*S*T = (3-1)*3*1 = 6
• Final schedule time = 5+6+6 = 17

Step 3: Final Schedule Computation

For any operation o and iteration point I = (i1, i2, …, in),

$$
f(o, I) = \sigma(o, i_1)
  + \sum_{x=2}^{n} \Bigl( i_x \prod_{y=x+1}^{n} N_y \Bigr) S\,T
  + \Bigl\lfloor \frac{i_1}{S} \Bigr\rfloor \Bigl( \prod_{x=2}^{n} N_x - 1 \Bigr) S\,T
$$

• First term: modulo schedule time
• Second term: distance between o(i1, 0, …, 0) and o(i1, i2, …, in)
• Third term: delay from pushing down
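The three terms can be written as a short function, sketched here under the assumption that the modulo schedule time σ(o, i1) is computed elsewhere and passed in, and that all indices are non-negative. The function name and argument layout are illustrative.

```c
#include <assert.h>

/* sigma : modulo schedule time of the operation in outermost iteration i1
   idx   : iteration point (i1, ..., in), idx[0] is i1
   trip  : trip counts (N1, ..., Nn), trip[0] is N1
   n     : nest depth; S: number of stages; T: initiation interval */
long final_schedule_time(long sigma, const int *idx, const int *trip,
                         int n, int S, int T) {
    assert(n >= 1 && S > 0 && T > 0);
    long distance = 0;       /* sum over x = 2..n of i_x * prod_{y > x} N_y */
    long inner_points = 1;   /* ends as prod over x = 2..n of N_x */
    for (int x = n - 1; x >= 1; x--) {   /* innermost loop first */
        distance += (long)idx[x] * inner_points;
        inner_points *= trip[x];
    }
    long groups_before = idx[0] / S;     /* floor(i1 / S) for i1 >= 0 */
    return sigma
         + distance * S * T
         + groups_before * (inner_points - 1) * S * T;
}
```

With the example a(5,2), σ = 5, N2 = 3, S = 3 and T = 1, the terms are 5 + 6 + 6 = 17. For a single loop (n = 1) the second and third terms vanish and the result is just σ(o, i1).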


Correctness of the Final Schedule

• Respects the original n-D dependences
    • Although we use 1-D dependences in scheduling
• No resource competition
• Repeating patterns definitely appear

Efficiency of the Final Schedule

• Schedule length <= the innermost-centric approach
    • One iteration point per T cycles
    • Draining and filling of pipelines naturally overlapped
• Execution time: even better
    • Data reuse exploited from outermost and innermost dimensions

Relation with Modulo Scheduling

• The classical MS for single loops is subsumed as a special case of SSP
    • No sequential constraints
    • f(o,I) = modulo schedule time (σ(o, i1))


SSP for Imperfect Loop Nest

• Loop selection
• Dependence simplification and 1-D schedule construction
    • Sequential constraints
• Final schedule

SSP for Imperfect Loop Nest (Cont.)

[Figure: schedule of operations a, b, c and d for i1 = 0..5; a and b appear only with i2 = 0, c and d with i2 = 0..2; axes: Cycle (0 to 12) and i1 (0 to 5). Annotations: initiation interval T=1; kernel, with S=3 stages; "Push from here" (marked twice).]


Compiler Platform Under Construction

[Figure: compiler flow diagram. Labels: Front End, Middle End, Back End; C/C++/Fortran; gfec/gfecc/f90; Very High WHIRL, High WHIRL, Middle WHIRL, Low WHIRL, Very Low WHIRL; Pre-Loop Selection; Consistency Maintenance; Loop Selection; Selected Loop; Dependence Simplification; 1-D DDG; 1-D Schedule Construction; Intermediate kernel; Bundling; Bundled kernel; Register Allocation; Register-allocated kernel; Code generation; Assembly code.]

Current and Future Work

• Register allocation
• Implementation and evaluation
• Interaction and comparison with pre-transforming the loop nest
    • Unroll-and-jam
    • Tiling
    • Loop interchange
    • Loop skewing and peeling
    • …

An (Incomplete) Taxonomy of Software Pipelining

[Figure: taxonomy tree rooted at "Software Pipelining". Branch labels: For 1-dimensional loops; For n-dimensional loops; Innermost-loop centric; Resource-constrained; Parallelism-oriented.]

Methods named in the tree:

• Modulo scheduling and others
• Hierarchical reduction [Lam88]
• Pipelining-dovetailing [WangGao96]
• Outer Loop Pipelining [MuthukumarDoshi01]
• SSP
• Affine-by-statement scheduling [DarteEtal00,94]
• Statement-level rational affine scheduling [Ramanujam94]
• Linear scheduling with constants [DarteEtal00,94]
• r-periodic scheduling [GaoEtAl93]
• Juggling problem [DarteEtAl02]


Code Generation

Flow: Loop nest in CGIR -> SSP -> Intermediate kernel -> Register allocation -> Register-allocated kernel -> Code generation -> Final code

Code generation issues
• Register assignment
• Predicated execution
• Loop and drain control
• Generating prolog and epilog
• Generating outermost loop pattern
• Generating innermost loop pattern
• Code-size optimizations

Problem Statement

Given a register-allocated kernel generated by SSP and a target architecture, generate the SSP final schedule, while reducing code size and loop control overheads.

Code Generation: Challenges

• Multiple repeating patterns
    • Code emission algorithms
• Register assignment
    • Lack of multiple rotating register files
    • Mix of rotating registers and static register renaming techniques
• Loop and drain control
    • Predicated execution
    • Loop counters
    • Branch instructions
• Code size increase
    • Code compression techniques

Experiments: Setup

• Stand-alone module at assembly level.
• Software pipelining using Huff's modulo scheduling.
• SSP kernel generation & register allocation by hand.
• Scheduling algorithms: MS, xMS, SSP, CS-SSP
• Other optimizations: unroll-and-jam, loop tiling
• Benchmarks: MM, HD, LU, SOR
• Itanium workstation, 733MHz, 16KB/96KB/2MB/2GB

Experiments: Relative Speedup

• Speedup between 1.1 and 4.24, average 2.1.
• Better performance: better parallelism and/or better data reuse.
• Code-size optimized version performs as well as original version.
• Code duplication and code size do not degrade performance.

Experiments: Bundle Density

• Bundle density measures the average number of non-NOP operations in a bundle.
• Average: MS-xMS: 1.90, SSP: 1.91, CS-SSP: 2.1
• CS-SSP produces denser code.
• CS-SSP makes better use of available resources.
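The metric itself is a simple average, sketched below. Itanium bundles hold three instruction slots; the opcode encoding (0 for a NOP) and the type and function names are assumptions made for illustration, not the measurement tool used in the experiments.

```c
#include <assert.h>
#include <stddef.h>

enum { SLOTS_PER_BUNDLE = 3 };   /* an Itanium bundle has three slots */
enum { OP_NOP = 0 };             /* hypothetical encoding: 0 marks a NOP */

typedef struct {
    int slot[SLOTS_PER_BUNDLE];  /* opcode placed in each slot */
} Bundle;

/* Average number of non-NOP operations per bundle. */
double bundle_density(const Bundle *code, size_t count) {
    assert(count > 0);
    size_t useful = 0;
    for (size_t i = 0; i < count; i++)
        for (int s = 0; s < SLOTS_PER_BUNDLE; s++)
            if (code[i].slot[s] != OP_NOP)
                useful++;
    return (double)useful / (double)count;
}
```

A density of 3.0 would mean every slot carries useful work, so the reported averages of about 1.9 to 2.1 leave roughly a third of the slots filled with NOPs.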

Experiments: Relative Code Size

• SSP code is between 3.6 and 9.0 times bigger than MS/xMS.
• CS-SSP code is between 2 and 6.85 times bigger than MS/xMS.
• This is because of multiple patterns and code duplication in the innermost loop.
• However, the entire code (~4KB) easily fits in the L1 instruction cache.

Acknowledgement

• Prof. Bogong Su, Dr. Hongbo Yang
• Anonymous reviewers
• Chan, Sun C.
• NSF, DOE agencies

Appendix

The following slides are for the detailed performance analysis of SSP.

Exploiting Parallelism from the Whole Iteration Space

• Represents a class of important applications
• Strong dependence cycle in the innermost loop
• The middle loop has a negative dependence, but it can be removed.

(Matrix size is N*N)


Exploiting Data Reuse from the Whole Iteration Space

Advantage of Code Generation

[Figure: speedup plotted against N.]

Exploiting Parallelism from the Whole Iteration Space (Cont.)

Both have dependence cycles in the innermost loop


Exploiting Data Reuse from the Whole Iteration Space

Exploiting Data Reuse from the Whole Iteration Space (Cont.)

Exploiting Data Reuse from the Whole Iteration Space (Cont.)

(Matrix size is jn*jn)

Advantage of Code Generation

SSP considers all operations in constructing the 1-D schedule, and thus effectively offsets the overhead of operations outside the innermost loop.

[Figure: speedup plotted against N.]

Performance Analysis from L2 Cache misses

[Figure: bar chart of L2 cache misses relative to MS (axis from 0 to 4.5) for MS, xMS, SSP and CS-SSP on ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT and jik+T.]

Performance Analysis from L3 Cache misses

[Figure: bar chart of L3 cache misses relative to MS (axis from 0 to 1.2) for MS, xMS, SSP and CS-SSP on ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT and jik+T.]

Comparison with Linear Schedule

• Linear schedule
    • Traditionally applied to multi-processing, systolic arrays, etc., not to uniprocessors
    • Parallelism oriented. Does not consider
        • Fine-grain resource constraints
        • Register usage
        • Data reuse
    • Code generation
        • Communicate values through memory, or message passing, etc.

Optimized Iteration Space of A Linear Schedule

[Figure: iteration space with axes labeled i1 and Cycle; tick values 0 to 9.]