CMPUT 680 - Compiler Design and Optimization. CMPUT680 - Winter 2006. Topic E: Software Pipelining. José Nelson Amaral. http://www.cs.ualberta.ca/~amaral/courses/680

Upload: cynthia-harcrow

Post on 01-Apr-2015


CMPUT 680 - Compiler Design and Optimization


CMPUT680 - Winter 2006

Topic E: Software Pipelining

José Nelson Amaral

http://www.cs.ualberta.ca/~amaral/courses/680


Reading List

Tiger book: chapter 20

Other papers such as: GovindAltmanGao97, RutenbergAtAl97


Software Pipeline

Software pipelining is a technique that reduces the execution time of important loops by interleaving operations from many iterations to optimize the use of resources.

[Figure: timeline over cycles 0-16 showing the operations ldf, fadds, stf, sub, cmp, and bg of overlapping loop iterations.]


Software Pipeline

What limits the speed of a loop?
• Data dependencies: recurrence initiation interval (rec_mii)
• Processor resources: resource initiation interval (res_mii)
• Memory accesses: memory initiation interval (mem_mii)

[Figure: the same timeline (cycles 0-16; operations ldf, fadds, stf, sub, cmp, bg), with the initiation interval between the starts of consecutive iterations marked.]


Problem Formulation (I)

Given a weighted dependence graph, derive a schedule which is “time-optimal” under a machine model M.

Def: A schedule S of a loop L is time-optimal if among all “legal” schedules of L, no schedule is faster than S.

Note: There may be more than one time-optimal schedule.


Example: The Inner Product

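    Q = 0.0
    DO k = 1, N
      Q = Q + Z(k)*X(k)
    ENDDO

The same loop in dynamic single assignment form:

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}
      v_k ← load x_{k-1}
      w_k ← u_k * v_k
      q_k ← q_{k-1} + w_k
      z_k ← z_{k-1} + 4
      x_k ← x_{k-1} + 4
    END DO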

(Dehnert, J. and Towle, R. A., “Compiling for the Cydra 5”)

Dynamic Single Assignment (DSA): Uses an expanded virtual register (EVR), which is an infinite, linearly ordered set of virtual registers.

A program in DSA has no anti-dependencies and no output dependencies.


Machine Model and Resource Constraints

Which unit does each operation in the loop use?

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}      MEM
      v_k ← load x_{k-1}      MEM
      w_k ← u_k * v_k         FMULT
      q_k ← q_{k-1} + w_k     FADD
      z_k ← z_{k-1} + 4       ADDR
      x_k ← x_{k-1} + 4       ADDR
    END DO

Machine Model

    Unit    Latency
    MEM1    6
    MEM2    6
    ADDR1   1
    ADDR2   1
    FMULT   2
    FADD    2

Without instruction-level parallelism, how long does the loop take to execute? (6+6+2+2+1+1)*N = 18*N cycles.


Resource Minimum Initiation Interval (ResMII)

Each processor resource defines a minimum initiation interval for the execution of the loop.

For instance, in the machine model of the previous example, a loop that requires the computation of 6 addresses has ResMII(ADDR) = 6*1/2 = 3.

The Resource Minimum Initiation Interval of a loop is given by:

$$\mathrm{ResMII} = \max_i \mathrm{ResMII}(R_i)$$
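
The per-resource bound can be sketched in a few lines of Python. This is only an illustration under one assumption: the units are fully pipelined unless stated otherwise, so an operation occupies a unit for one cycle regardless of its latency. The function name `res_mii` and its parameters are not from the slides.

```python
from math import ceil

def res_mii(uses, units, occupancy=None):
    """Resource-constrained lower bound on the initiation interval.

    uses:      {resource class: operations per iteration that need it}
    units:     {resource class: number of identical units}
    occupancy: {resource class: cycles one operation keeps a unit busy};
               assumed to be 1 when omitted (fully pipelined units)
    Returns (ResMII, per-resource bounds).
    """
    occupancy = occupancy or {}
    per_resource = {
        r: ceil(n * occupancy.get(r, 1) / units[r])
        for r, n in uses.items()
    }
    return max(per_resource.values()), per_resource

# Six address computations on two ADDR units: 6*1/2 = 3.
print(res_mii({"ADDR": 6}, {"ADDR": 2}))   # (3, {'ADDR': 3})

# Inner-product loop: 2 loads, 2 address updates, 1 multiply, 1 add.
loop_uses = {"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1}
machine = {"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1}
print(res_mii(loop_uses, machine)[0])      # 1
```

Passing an `occupancy` larger than 1 models non-pipelined units, where each operation blocks the unit for several cycles.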


ResMII

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}      MEM
      v_k ← load x_{k-1}      MEM
      w_k ← u_k * v_k         FMULT
      q_k ← q_{k-1} + w_k     FADD
      z_k ← z_{k-1} + 4       ADDR
      x_k ← x_{k-1} + 4       ADDR
    END DO

Machine Model

    Unit    Latency
    MEM1    6
    MEM2    6
    ADDR1   1
    ADDR2   1
    FMULT   2
    FADD    2

There are enough units to schedule all the instructions of the loop in the same cycle. Therefore ResMII = 1. Can we execute the loop in N+C cycles (C = a small constant)?


Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      (a) u_k ← load z_{k-1}
      (b) v_k ← load x_{k-1}
      (c) w_k ← u_k * v_k
      (d) q_k ← q_{k-1} + w_k
      (e) z_k ← z_{k-1} + 4
      (f) x_k ← x_{k-1} + 4
    END DO

[Figure: dependence graph of operations a-f unrolled for iterations k=1, k=2, k=3, and onward; the loop-carried edges are labeled with iteration distance (1).]


Recurrence Minimum Initiation Interval (RecMII)

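[Figure: dependence graph of operations a-f; the loop-carried edges are labeled (dist,lat): one edge (1,2) and four edges (1,1).]

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                     Unit    Lat.
      (a) u_k ← load z_{k-1}        MEM     (6)
      (b) v_k ← load x_{k-1}        MEM     (6)
      (c) w_k ← u_k * v_k           FMULT   (2)
      (d) q_k ← q_{k-1} + w_k       FADD    (2)
      (e) z_k ← z_{k-1} + 4         ADDR    (1)
      (f) x_k ← x_{k-1} + 4         ADDR    (1)
    END DO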


Recurrence Minimum Initiation Interval (RecMII)

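[Figure: the same dependence graph of operations a-f, with loop-carried edges labeled (dist,lat): one edge (1,2) and four edges (1,1).]

The recurrence minimum initiation interval (rec_mii) is given by:

$$\mathrm{rec\_mii} = \max_{\forall\,\text{cycle}\ \theta} \left\lceil \frac{\mathrm{latency}(\theta)}{\mathrm{iteration\ distance}(\theta)} \right\rceil$$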

Quiz: What is the rec_mii for the example?
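
The formula can be checked mechanically on a small graph. The sketch below enumerates every simple cycle by depth-first search, which is only reasonable for graphs this small. The edge list is one reading of the inner-product loop (intra-iteration edges carry the latency of the producing operation), and the name `rec_mii` and the tuple layout are illustrative assumptions.

```python
from math import ceil

def rec_mii(edges):
    """edges: list of (src, dst, distance, latency) dependence edges.

    Returns the maximum, over all simple cycles, of
    ceil(total latency / total iteration distance).
    """
    succ = {}
    for s, d, dist, lat in edges:
        succ.setdefault(s, []).append((d, dist, lat))
    nodes = sorted(set(succ) | {d for _, d, _, _ in edges})
    best = 0

    def dfs(start, node, dist, lat, on_path):
        nonlocal best
        for nxt, e_dist, e_lat in succ.get(node, []):
            if nxt == start:
                total_dist = dist + e_dist
                if total_dist == 0:
                    raise ValueError("cycle with zero iteration distance")
                best = max(best, ceil((lat + e_lat) / total_dist))
            elif nxt > start and nxt not in on_path:
                # Only visit nodes larger than start, so each cycle is
                # enumerated once, from its smallest node.
                dfs(start, nxt, dist + e_dist, lat + e_lat, on_path | {nxt})

    for n in nodes:
        dfs(n, n, 0, 0, {n})
    return best

inner_product = [
    ("a", "c", 0, 6), ("b", "c", 0, 6), ("c", "d", 0, 2),  # within an iteration
    ("d", "d", 1, 2),                                      # q_k needs q_{k-1}
    ("e", "e", 1, 1), ("f", "f", 1, 1),                    # address updates
    ("e", "a", 1, 1), ("f", "b", 1, 1),                    # loads need z_{k-1}, x_{k-1}
]
print(rec_mii(inner_product))   # 2
```

Only the self-loops on d, e, and f form cycles here, and the floating-point add on d, with latency 2 and distance 1, is the one that sets the bound.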


Minimum Initiation Interval

The Minimum Initiation Interval (MII) for a loop is constrained both by resources and recurrences; therefore, it is given by:

$$\mathrm{MII} = \max(\mathrm{ResMII}, \mathrm{RecMII})$$

In our example we have MII = max(1,2) = 2. Therefore the best that we can do without transforming the loop is to execute it in 2*N+C cycles.


Modulo Schedule

In modulo scheduling, we:
(1) start with the first instruction;
(2) schedule as many instructions as we can in every cycle, limited only by the resources available and by the dependences.

When a pattern emerges, we adopt the pattern as our modulo schedule.

Instructions before this pattern form the loop prologue.

Instructions after this pattern form the loop epilogue.


Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                     Lat.
      (a) u_k ← load z_{k-1}        (6)
      (b) v_k ← load x_{k-1}        (6)
      (c) w_k ← u_k * v_k           (2)
      (d) q_k ← q_{k-1} + w_k       (2)
      (e) z_k ← z_{k-1} + 4         (1)
      (f) x_k ← x_{k-1} + 4         (1)
    END DO

    cycle  MEM1  MEM2  ADD1  ADD2  FMLT  FADD
      1    a1    b1    e1    f1
      2    a2    b2    e2    f2
      3    a3    b3    e3    f3
      4    a4    b4    e4    f4
      5    a5    b5    e5    f5
      6    a6    b6    e6    f6
      7    a7    b7    e7    f7    c1
      8    a8    b8    e8    f8    c2
      9    a9    b9    e9    f9    c3    d1
     10    a10   b10   e10   f10   c4
     11    a11   b11   e11   f11   c5    d2
      …    …     …     …     …     …     …
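
The table can be reproduced with a small simulation that issues every operation as early as its operands allow. This is a sketch under the latencies listed with the loop; resources are not modeled, because this schedule never needs more units than the machine has. The function name `eager_issue_cycles` and the dictionary layout are illustrative.

```python
def eager_issue_cycles(n, lat_mem=6, lat_fmul=2, lat_fadd=2, lat_addr=1):
    """Issue cycle of a..f for iterations 1..n under eager scheduling."""
    sched = []
    prev = None
    for k in range(1, n + 1):
        # Loads and address updates all read z_{k-1} / x_{k-1}, which the
        # previous iteration's e and f produce after lat_addr cycles.
        a = 1 if prev is None else prev["e"] + lat_addr
        b = 1 if prev is None else prev["f"] + lat_addr
        e, f = a, b
        c = max(a, b) + lat_mem                  # multiply waits for both loads
        d = c + lat_fmul                         # add waits for the product ...
        if prev is not None:
            d = max(d, prev["d"] + lat_fadd)     # ... and for q_{k-1}
        prev = {"k": k, "a": a, "b": b, "c": c, "d": d, "e": e, "f": f}
        sched.append(prev)
    return sched

for row in eager_issue_cycles(5):
    print(row["k"], "c at", row["c"], "d at", row["d"], "lag", row["d"] - row["c"])
```

The multiplies issue one per cycle while the adds issue one every two cycles, so the lag between c and d grows with k, and so does the number of products that must be kept live.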


Why an eager scheduler fails in our example

[Figure: iterations (1-18) versus cycles (0-23) for the eager schedule. The b and c operations issue once per cycle, while the d operations issue only every second cycle (d1 at cycle 9, d8 at cycle 23), so d falls progressively behind c.]


Why an eager scheduler fails in our example

[Figure: iterations (1-18) versus cycles (0-23) when a new iteration starts every two cycles. The b, c, and d operations all advance at the same rate, with d1 at cycle 9 and d8 at cycle 23.]

Therefore we can do it in 2*N+9 cycles.


Collision vectors

Given the reservation tables for two operations A and B, the set of forbidden intervals, i.e., the distances at which operations A and B cannot both be issued, is called the collision vector for the reservation tables.
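
As an illustration, the forbidden intervals can be computed directly from two reservation tables. This sketch represents a reservation table as a mapping from resource name to the set of cycles, relative to issue, in which the operation occupies that resource. The representation, the function name, and the two example tables are assumptions made for this example and do not come from the slides.

```python
def forbidden_intervals(table_a, table_b):
    """Distances d >= 0 such that issuing B d cycles after A makes both
    operations need the same resource in the same cycle."""
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for ta in cycles_a:
            for tb in table_b.get(resource, ()):
                d = ta - tb          # A is busy at ta, B is busy at d + tb
                if d >= 0:
                    forbidden.add(d)
    return sorted(forbidden)

# Hypothetical tables: a non-pipelined two-cycle add that puts its result
# on the bus in cycle 2, and a one-cycle copy that only uses the bus.
add = {"adder": {0, 1}, "bus": {2}}
copy = {"bus": {0}}
print(forbidden_intervals(add, copy))   # [2]
print(forbidden_intervals(add, add))    # [0, 1]
```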


A Simplistic Modulo Scheduling Algorithm

1. Compute the MII as discussed.
2. Use a modified list scheduling algorithm to generate a modulo schedule. The scheduling algorithm must obey the following restriction:

If an operation P is scheduled at time t, the resources it occupies are also reserved at every time t ± k*II, for any k ≥ 0, so no other operation that needs them can be scheduled at those times.

The Modulo Reservation Table has II rows, representing the cycles of the initiation interval, and as many columns as the resources that it needs to keep track of.
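
A modulo reservation table is easy to sketch as a small class. The version below is a minimal illustration, not the data structure of any particular compiler: an operation's resource usage is given as a list of (resource, offset) pairs, and all names are hypothetical.

```python
class ModuloReservationTable:
    """II rows (cycle mod II), one column per tracked resource."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.rows = [{r: None for r in resources} for _ in range(ii)]

    def _slots(self, time, usage):
        # usage: (resource, offset) pairs the operation occupies after issue
        return [((time + off) % self.ii, res) for res, off in usage]

    def fits(self, time, usage):
        slots = self._slots(time, usage)
        # An operation that wraps onto its own slot would collide with the
        # instance of itself from another iteration.
        if len(set(slots)) != len(slots):
            return False
        return all(self.rows[row][res] is None for row, res in slots)

    def place(self, op, time, usage):
        """Reserve the slots for op if they are free; report success."""
        if not self.fits(time, usage):
            return False
        for row, res in self._slots(time, usage):
            self.rows[row][res] = op
        return True

mrt = ModuloReservationTable(2, ["adder", "bus"])
print(mrt.place("A", 0, [("adder", 0)]))   # True
print(mrt.place("B", 2, [("adder", 0)]))   # False: 2 mod 2 == 0 is taken by A
print(mrt.place("B", 3, [("adder", 0)]))   # True
```

Storing only the row `time mod II` is what enforces the restriction on times that differ by a multiple of II.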


Heuristic Method for Modulo Scheduling

Why might a simple variant of list scheduling not work?

Problem: Generate a modulo schedule of a loop by scheduling instructions until a pattern emerges.


Counter Example I: List Scheduling May Fail

[Figure: dependence graph over operations A, B, C, and D with edge labels (dist,lat) = (0,4), (0,2), (0,2), and (1,2).]

There is only one cycle in the dependence graph; therefore RecMII is given by:

$$\mathrm{RecMII} = \frac{2+2}{1+0} = 4$$

Therefore, in a machine with infinite resources, we must be able to schedule the loop in 4 cycles.


Counter Example I: List Scheduling May Fail

[Figure: the same dependence graph (A, B, C, D; edge labels (0,4), (0,2), (0,2), (1,2)) next to a 4-row modulo schedule (rows 0-3) in which A, C, and D are placed but B has no legal slot (marked ???).]

List scheduling: a greedy algorithm that schedules each operation at its earliest possible time.

B must be scheduled after the A of the current iteration and before the C of the next iteration.

We are deadlocked!


Counter Example I: List Scheduling May Fail

[Figure: the same dependence graph next to a schedule over cycles 0-7 and beyond: A(0), C(0), D(0), then A(1), B(0), C(1), and so on, ending with D(N) and B(N). The regions of the schedule are marked prologue, kernel, and epilogue.]

The solution is to create a kernel with operations from different iterations, and use a prologue and an epilogue.


Counter Example II: List Scheduling May Fail

[Figure: dependence graph over operations A1, C2, A3, A4, M5, and M6 with labels (0,2), (0,1), (0,2), (0,2), (0,3), and (0,3).]

A1, A3, and A4 are non-pipelined adds that take two cycles at the adder.

M5 and M6 are non-pipelined multiply operations that take three cycles each on the multiplier.

C2 is a copy operation that uses the bus for one cycle.

What is the ResMII for these operations in a machine that has one adder, one multiplier, and one bus?

ResMII(Adder) = 6
ResMII(Multiplier) = 6
ResMII(Bus) = 1

ResMII = 6


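Counter Example II: List Scheduling May Fail

[Figure: the same dependence graph (A1, C2, A3, A4, M5, M6) next to a 6-row modulo reservation table (rows 0-5; columns Adder, Mult, Bus) that is partially filled by greedy placement; no legal slot remains for A4 (marked ???).]

We cannot schedule A4 and achieve an MII = ResMII = 6!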


Counter Example II: List Scheduling May Fail

[Figure: the same dependence graph next to a completed 6-row modulo reservation table (rows 0-5; columns Adder, Mult, Bus) that holds A1, A3, A4, M5, M6, and C2.]

Although it seems counter-intuitive, we obtain a modulo schedule with MII = 6 if we initially schedule both M6 and A3 one cycle later than the earliest possible time for these operations.


Complex Reservation Tables

Consider three independent operations with the reservation tables shown below.

[Figure: reservation tables (columns Add, Mult, Bus) for the operations A1 (0,2), M2 (0,3), and MA3 (0,4).]

What is the MII for a loop formed by these three operations?

ResMII(Add) = 1 + 0 + 1 = 2
ResMII(Mult) = 0 + 1 + 1 = 2
ResMII(Bus) = 1 + 1 + 0 = 2

ResMII = 2


Is the MII = 2 Feasible?

[Figure: the reservation tables for A1 (0,2), M2 (0,3), and MA3 (0,4) next to a 2-row modulo reservation table (rows 0-1; columns Adder, Mult, Bus) that holds A1 and M2.]

Deadlocked. Cannot allocate MA3. Even though MII = max(ResMII, RecMII) = 2, MII = 2 is not feasible!


Does Increasing the MII to 3 Help?

[Figure: the reservation tables for A1 (0,2), M2 (0,3), and MA3 (0,4) next to a 3-row modulo reservation table (rows 0-2; columns Adder, Mult, Bus) that holds A1, M2, and MA3.]

We find a modulo schedule with MII = 3!


Interaction Between Recurrence Constraints and Resource Constraints

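[Figure: dependence cycle through A1, A2, A3, and A4 with three edges labeled (0,2) and one loop-carried edge labeled (2,2); each operation uses the reservation table A (0,2) over the columns Add, Mult, Bus.]

What is the RecMII for this loop?

RecMII = (2+2+2+2)/2 = 4

What is the ResMII for the loop?

ResMII(Add) = 1+1+1+1 = 4
ResMII(Mult) = 0+0+0+0 = 0
ResMII(Bus) = 1+1+1+1 = 4

ResMII = 4

Therefore MII = max(ResMII, RecMII) = 4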


Is the MII = 4 feasible?

[Figure: the same dependence cycle (A1, A2, A3, A4) and reservation table A, next to a 4-row modulo reservation table (rows 0-3; columns Adder, Mult, Bus) that holds A1 and A2.]

In order to finish A4 in time to produce the result for two iterations later, A3 must be scheduled at time 4.

But 4 modulo 4 = 0, which conflicts with A1.

Therefore there is no feasible schedule with MII = 4.


Scheduling Strategy

An exhaustive search will eventually reveal that the calculated MII is not feasible, but it might take too long.

In practice, we compute the MII and spend a pre-allocated budget of time trying to find a schedule with the MII. If we don’t find one, we increase the MII.

In some commercial compilers, the search for the smallest feasible II is a binary search, where the II is doubled at each step until a feasible one is found, at which point a linear search between the last infeasible II and the feasible one is conducted.
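
The doubling-then-linear search can be sketched as follows. Here `try_schedule(ii)` is a hypothetical callback standing in for a modulo scheduler that gives up after its time budget and returns `None`; the function name `find_ii` and the `max_ii` cap are likewise assumptions of this sketch, and it expects `mii >= 1`.

```python
def find_ii(mii, try_schedule, max_ii=1024):
    """Return (ii, schedule) for a small feasible II, or None."""
    ii, last_bad = mii, mii - 1
    sched = try_schedule(ii)
    # Phase 1: double the II until some schedule is found.
    while sched is None:
        last_bad = ii
        ii *= 2
        if ii > max_ii:
            return None
        sched = try_schedule(ii)
    # Phase 2: linear scan between the last failure and the success.
    for cand in range(last_bad + 1, ii):
        s = try_schedule(cand)
        if s is not None:
            return cand, s
    return ii, sched

# Toy stand-in for a scheduler: pretend every II of at least 5 is feasible.
print(find_ii(4, lambda ii: "ok" if ii >= 5 else None))   # (5, 'ok')
```

Because each attempt is budget-limited, a failure at some II does not prove that II is infeasible, so the result is the smallest II for which the scheduler succeeded, not necessarily the true minimum.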


Previous Approaches

Approach I (Operational): “Emulate” the loop execution under the machine model until a “pattern” eventually occurs [AikenNic88, EbciogluNic89, GaoEtAl91].

Approach II (Periodic scheduling): Formulate the scheduling problem as a periodic scheduling problem and find an optimal solution [Lam88, RauEtAl81, GovindAltmanGao94].


Software Pipelining

  Operational Approach
    Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.)
    Formal Model (GaoWonNin91)

  Periodic Scheduling (Modulo Scheduling)
    Non-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAl92, Huff93, DehnertTow93, Rau94, WanEis93)
    Exact Method
      Basic Formulation (DongenGao92)
      ILP based
        Register Optimal (NingGao91, NingGao93, Ning93)
        Resource Constrained (GovindAltGao94)
        Resource & Register (GovindAltGao95, Altman95, EichenbergerDav95)
      Exhaustive Search (Altman95, AltmanGao96)
    “Showdown” (RuttenbergGaoStouchininWoody96)