CA226 — Advanced Computer Architecture

Stephen Blott <stephen.blott@dcu.ie>

Load RAW Stalls

        ld    r1,0(r2)
        dadd  r4,r3,r1          ; unavoidable stall (on r1)

The value needed by the second instruction in Ex is available only after the first instruction has completed Mem.

Branch RAW Stalls

        dsub  r1,r2,r3
        beqz  r1,target

We can’t forward the necessary value until after Ex:

• hence, a stall of one cycle (whether the branch is taken or not)

A "Double Whammy" Stall

        ld    r1,0(r2)
        beqz  r1,target

Stall of two cycles. This is a combination of the two previous stalls.

The Pipeline

The pipeline:

• is essentially a miniature graph of parallel-processing elements

• instructions flow from node to node

Consider this …

        dadd  r3,r1,r2
        dadd  r4,r1,r2

They flow:

• smoothly through the pipeline, no stalls

Now consider this …

        dadd  r3,r1,r2
        dmul  r4,r1,r2

They flow:

• more slowly (more cycles), but still no stalls (multiplication is expensive)

And this …

        dmul  r3,r1,r2
        dmul  r4,r1,r2
        dmul  r5,r1,r2

They flow:

• again, more slowly (more cycles), but still no stalls

• all three instructions flow through the multiplier

And this …

        dmul  r4,r1,r2
        dadd  r3,r1,r2

The dadd is not blocked by the dmul:

• the dadd overtakes the dmul in the pipeline

• still no stalls

Write-After-Write (WAW) Stalls

        dmul  r3,r1,r2
        dadd  r3,r1,r2

The dadd is now blocked by the dmul:

• were the dadd to overtake the dmul, r3 would have the incorrect final value

This is known as a:

• write-after-write (WAW) stall
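
To see why the order of the two writes matters, here is a minimal Python sketch (not part of the original slides). It models the register file as a dictionary and applies the two write-backs in a given completion order. The helper name `final_r3` and the sample register values are assumptions made purely for illustration.

```python
def final_r3(completion_order, r1=6, r2=7):
    """Apply write-backs to r3 in the given completion order and return r3.

    'dmul' writes r1 * r2 and 'dadd' writes r1 + r2, mirroring the pair
    dmul r3,r1,r2 / dadd r3,r1,r2. This is an illustrative model only.
    """
    results = {"dmul": r1 * r2, "dadd": r1 + r2}
    regs = {"r3": 0}
    for op in completion_order:
        regs["r3"] = results[op]  # the last write wins
    return regs["r3"]


# Program order: the dadd is the later instruction, so r3 must end up as r1 + r2.
in_order = final_r3(["dmul", "dadd"])

# If the dadd overtook the slower dmul, the dmul's write-back would land last.
overtaken = final_r3(["dadd", "dmul"])

print(in_order, overtaken)  # 13 42
```

With the sample values, program order leaves the sum in r3, whereas letting the dadd complete first leaves the stale product there. Stalling the dadd is what prevents this.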

Write-After-Write (WAW) Stalls

        dmul  r3,r1,r2
        dadd  r3,r1,r2
        daddi r5,r0,100

Note:

• subsequent, independent instructions are also blocked!

Another topic…

Example

Consider:

        for (i=0; i<1000; i+=1)
            a[i] += 1;          // where a[i] is an integer

Example

        .data                   ; psched3.s
N:      .word  8000             ; N = 1000 iterations
a:      .space 8000             ; 1000 64-bit words

        .text
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   ld    r3,a(r1)          ; r3 = a[i/8]
        daddi r3,r3,1           ; r3 = r3 + 1
        sd    r3,a(r1)          ; a[i/8] = r3
        daddi r1,r1,8           ; i = i + 8
        bne   r1,r2,loop        ; repeat, unless i == N
        nop
        halt

So …

Stalls per iteration:

• one load RAW stall on r3

• one branch RAW stall on r1

• plus 999 wasted nop cycles (in delay slot)

8007 cycles in total
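
The total can be checked with a little arithmetic. The sketch below is not from the slides: it assumes a five-stage pipeline, so that four extra cycles are needed on top of one cycle per instruction, and it counts instructions directly from the psched3.s listing. The function and parameter names are made up for illustration.

```python
def total_cycles(setup, body, iterations, stalls_per_iteration, tail=1, fill=4):
    """Estimate total cycles for a simple counted loop.

    setup: instructions executed before the loop
    body:  instructions per iteration (including the delay-slot instruction)
    tail:  instructions after the loop (the halt)
    fill:  extra cycles for the pipeline to fill
           (assumed to be 4 for a five-stage pipeline)
    """
    instructions = setup + body * iterations + tail
    stalls = stalls_per_iteration * iterations
    return instructions + stalls + fill


# psched3.s: 2 setup instructions, 6 per iteration (5 useful plus the nop),
# and 2 stall cycles per iteration (load RAW and branch RAW).
psched3 = total_cycles(setup=2, body=6, iterations=1000, stalls_per_iteration=2)
print(psched3)  # 8007
```

Under these assumptions, roughly a quarter of the cycles are stalls and another eighth are spent on the nop, which is what pipeline scheduling sets out to recover.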

After pipeline scheduling, …

        .data                   ; psched4.s
N:      .word  8000
b:      .word  0                ; Address of a, minus 8 (hack)
a:      .space 8000

        .text
        daddi r1,r0,0
        ld    r2,N(r0)

loop:   ld    r3,a(r1)
        daddi r1,r1,8           ; moved up
        daddi r3,r3,1
        bne   r1,r2,loop
        sd    r3,b(r1)          ; moved down, and adjusted
        halt
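
The `b` label works because it sits 8 bytes before `a`, so `b(r1)` evaluated after `r1` has been incremented names the same location that `a(r1)` did before the increment. A small Python check of that address arithmetic follows. The base address `A_BASE` is an arbitrary assumed value, not something taken from the simulator, and the function names are hypothetical.

```python
A_BASE = 0x0100        # assumed address of a (any value would do)
B_BASE = A_BASE - 8    # b is declared immediately before a, 8 bytes earlier


def store_address_original(i):
    """Address used by 'sd r3,a(r1)' before r1 is incremented."""
    return A_BASE + i


def store_address_scheduled(i):
    """Address used by 'sd r3,b(r1)' after 'daddi r1,r1,8' has run."""
    r1 = i + 8
    return B_BASE + r1


# Every byte offset visited by the loop maps to the same address either way.
same = all(store_address_original(i) == store_address_scheduled(i)
           for i in range(0, 8000, 8))
print(same)  # True
```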

So …

No stalls!

• just 5007 cycles, a substantial improvement

• CPI of 1.001

I suspect:

• we can’t do much better than that! (every instruction/cycle does something which has to be done)
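
The CPI figure follows from dividing cycles by instructions executed. As a rough check (a sketch under stated assumptions, not output from the simulator), the scheduled loop executes 2 setup instructions, 5 instructions per iteration, and the final halt. The function name `cpi` is illustrative.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


# Instruction count read off the scheduled listing (psched4.s):
# 2 before the loop, 5 per iteration for 1000 iterations, plus the halt.
instructions = 2 + 5 * 1000 + 1   # 5003
cycles = 5007                     # figure quoted on the slide

print(round(cpi(cycles, instructions), 3))  # 1.001
```

The four cycles by which the total exceeds the instruction count are consistent with the time a five-stage pipeline takes to fill, which is why the CPI sits just above 1.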

Now, …

Let’s try the same thing:

• but with floating point numbers

Example: Floating Point

        .data                   ; psched5.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 1000 iterations
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d   f10,one(r0)       ; f10 = 1 (one)
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   l.d   f3,a(r1)          ; f3 = a[i/8]
        add.d f3,f3,f10         ; f3 = f3 + one
        s.d   f3,a(r1)          ; a[i/8] = f3
        daddi r1,r1,8           ; i = i + 8
        bne   r1,r2,loop        ; repeat, unless i == N
        nop
        halt

Stalls?

Stalls:

• 5000 RAW stalls

• 1000 structural stalls

Overall: 11008 cycles.

As Before: Reorder Operations

        .data                   ; psched6.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 1000 iterations
b:      .word   0               ; Address of a, minus 8
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d   f10,one(r0)       ; f10 = 1 (one)
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   l.d   f3,a(r1)          ; f3 = a[i/8]
        daddi r1,r1,8           ; i = i + 8
        add.d f3,f3,f10         ; f3 = f3 + one
        bne   r1,r2,loop        ; repeat, unless i == N
        s.d   f3,b(r1)          ; a[i/8] = f3
        halt

Stalls?

Stalls:

• 1000 RAW stalls

• 1000 structural stalls

Overall: 7008 cycles; better than 11008, previously.

Stalls?

Remaining stalls, both RAW and structural, are from:

        add.d f3,f3,f10         ; f3 = f3 + one
        ...
        s.d   f3,b(r1)          ; a[i] = f3

It takes four cycles for the add.d to move through the floating point adder.

The s.d (a read after write) arrives too soon, and is blocked for two cycles (per iteration).
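
The two-cycle figure is consistent with the overall cycle count. The following Python sketch backs the stall cycles out of the quoted total of 7008. It assumes 4 cycles of pipeline fill and an instruction count read off the psched6.s listing (3 setup instructions, 5 per iteration, 1 halt). The function name is hypothetical.

```python
def stall_cycles_per_iteration(total, instructions, iterations, fill=4):
    """Stall cycles per iteration implied by a total cycle count.

    Assumes every cycle is either one instruction, one stall,
    or one of the `fill` cycles needed to fill the pipeline.
    """
    return (total - instructions - fill) / iterations


instructions = 3 + 5 * 1000 + 1          # 5004, counted from psched6.s
per_iteration = stall_cycles_per_iteration(7008, instructions, 1000)
print(per_iteration)  # 2.0
```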

So, …

How can we eliminate the remaining stalls?

Loop Unrolling

Originally:

        for (i=0; i<1000; i+=1)
            a[i] += 1;

Unroll the loop:

        for (i=0; i<1000; i+=4)
        {
            a[i+0] += 1;
            a[i+1] += 1;
            a[i+2] += 1;
            a[i+3] += 1;
        }

Example

        .data                   ; psched7.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 1000 iterations
b:      .word   0               ; Address of a, minus 8
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d   f10,one(r0)       ; f10 = 1 (one)
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   l.d   f3,a(r1)          ; 1
        daddi r1,r1,8
        add.d f3,f3,f10
        s.d   f3,b(r1)

        l.d   f4,a(r1)          ; 2 - now using f4, instead of f3
        daddi r1,r1,8
        add.d f4,f4,f10
        s.d   f4,b(r1)

        l.d   f5,a(r1)          ; 3 - now using f5, instead of f3
        daddi r1,r1,8
        add.d f5,f5,f10
        s.d   f5,b(r1)

        l.d   f6,a(r1)          ; 4 - now using f6, instead of f3
        daddi r1,r1,8
        add.d f6,f6,f10
        s.d   f6,b(r1)

        bne   r1,r2,loop        ; repeat, unless i == N
        nop
        halt

Now, …

That doesn’t help:

• but many of these operations are now independent

• they can be reordered

First:

• we don’t need all those daddi instructions

• and we can use the delay slot for something useful

Example

        .data                   ; psched8.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld    r2,N(r0)
        l.d   f10,one(r0)
        daddi r11,r0,a          ; r11 = address of a
        dadd  r12,r11,r2        ; r12 = address just past the end of a

loop:   l.d   f3,0(r11)
        add.d f3,f3,f10
        s.d   f3,0(r11)

        l.d   f4,8(r11)         ; adjust displacement
        add.d f4,f4,f10
        s.d   f4,8(r11)         ; adjust displacement

        l.d   f5,16(r11)        ; adjust displacement
        add.d f5,f5,f10
        s.d   f5,16(r11)        ; adjust displacement

        l.d   f6,24(r11)        ; adjust displacement
        add.d f6,f6,f10

        daddi r11,r11,32        ; collect all four daddi-s into one

        bne   r11,r12,loop
        s.d   f6,-8(r11)        ; 24 - 32 == -8, adjust displacement
        halt

Hmm, …

Still:

• 7009 cycles

• fewer instructions, more stalls, and essentially the same number of cycles (7009 against 7008 for the reordered loop)

We need to:

• look more carefully at how the pipeline is operating

• explore more options for reordering operations

Best we can do?

        .data                   ; psched9.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld    r2,N(r0)
        l.d   f10,one(r0)
        daddi r11,r0,a
        dadd  r12,r11,r2

loop:   l.d   f3,0(r11)         ; i%4==0 load
        l.d   f4,8(r11)         ; i%4==1 load

        add.d f3,f3,f10         ; i%4==0 add
        l.d   f5,16(r11)        ; i%4==2 load
        l.d   f6,24(r11)        ; i%4==3 load

        add.d f4,f4,f10         ; i%4==1 add
        s.d   f3,0(r11)         ; i%4==0 store
        daddi r11,r11,32        ; increment loop counter

        add.d f5,f5,f10         ; i%4==2 add
        add.d f6,f6,f10         ; i%4==3 add

        s.d   f4,-24(r11)       ; i%4==1 store (8-32 == -24)
        s.d   f5,-16(r11)       ; i%4==2 store (16-32 == -16)

        bne   r11,r12,loop
        s.d   f6,-8(r11)        ; i%4==3 store (24-32 == -8)
        halt

Differences …

Differences to previous version:

• reorder independent instructions, but preserve the order of dependent instructions

• increment the loop counter sooner and adjust subsequent offsets

• carefully interleave FP and non-FP instructions, allowing non-FP instructions to flow around FP instructions in the FP adder

Performance

Now:

• 4009 cycles, CPI of 1.144

• just 500 structural stalls

Previously:

• 11008 cycles, CPI of 1.833

• so about 64% fewer cycles now (a speedup of roughly 2.7x)
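
The comparison can be made precise with the two cycle counts quoted above. This short Python sketch computes both the speedup factor and the relative reduction in cycles. It uses only the figures from the slides, and the variable names are illustrative.

```python
before = 11008   # cycles for the original floating-point loop
after = 4009     # cycles after unrolling and scheduling

speedup = before / after                 # how many times faster
reduction = (before - after) / before    # fraction of cycles saved

print(f"{speedup:.2f}x faster, {reduction:.1%} fewer cycles")
# 2.75x faster, 63.6% fewer cycles
```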

More Generally

Unroll the loop some number of times (four, here):

        for (i=0; i<N-N%4; i+=4)
        {
            a[i+0] += 1;
            a[i+1] += 1;
            a[i+2] += 1;
            a[i+3] += 1;
        }

        // Perform any remaining iterations...
        for (i=N-N%4; i<N; i+=1)
            a[i] += 1;
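
A runnable version of the same idea, written in Python rather than C purely for illustration: the main loop handles four elements per trip, and a cleanup loop handles the leftover `N % 4` elements. The function names are assumptions of this sketch.

```python
def increment_all(a):
    """Reference version: one element per iteration."""
    for i in range(len(a)):
        a[i] += 1
    return a


def increment_all_unrolled(a):
    """Unrolled by four, with a cleanup loop for the remainder."""
    n = len(a)
    limit = n - n % 4
    for i in range(0, limit, 4):
        a[i + 0] += 1
        a[i + 1] += 1
        a[i + 2] += 1
        a[i + 3] += 1
    # Perform any remaining iterations (at most three).
    for i in range(limit, n):
        a[i] += 1
    return a


# Both versions agree, including for lengths that are not multiples of four.
for n in (0, 1, 4, 7, 1000):
    data = list(range(n))
    assert increment_all(list(data)) == increment_all_unrolled(list(data))
```

The cleanup loop is what lets the unrolled version accept any array length, at the cost of a little more code.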

Costs?

Costs:

• increase in code size (so this is a space-time trade off)

• requires more registers (a limited resource)

In fact:

• this type of pipeline scheduling is only possible because we have a large number of general-purpose registers

Done