
CA226 — Advanced Computer Architecture

Stephen Blott <stephen.blott@dcu.ie>


Load RAW Stalls

    ld r1,0(r2)
    dadd r4,r3,r1           ; unavoidable stall (on r1)

The value needed by the second instruction in Ex is available only after the first instruction has completed Mem.

Branch RAW Stalls

    dsub r1,r2,r3
    beqz r1,target

We can’t forward the necessary value until after Ex:

• hence, a stall of one cycle (whether the branch is taken or not)

A "Double Whammy" Stall

    ld r1,0(r2)
    beqz r1,target

Stall of two cycles. This is a combination of the two previous stalls.
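The three cases above can be summarised as a small lookup. This is a minimal Python sketch, assuming a five-stage pipeline with forwarding in which a dependent ALU instruction directly after another ALU instruction does not stall. The function name `raw_stall_cycles` and the category labels are illustrative only and do not come from any simulator.

```python
def raw_stall_cycles(producer: str, consumer: str) -> int:
    """Stall cycles when `consumer` immediately follows `producer`
    and reads the register that `producer` writes.

    producer: "alu" (e.g. dsub, dadd) or "load" (e.g. ld)
    consumer: "alu" (e.g. dadd) or "branch" (e.g. beqz)
    """
    if producer not in ("alu", "load") or consumer not in ("alu", "branch"):
        raise ValueError("unknown instruction category")
    stalls = 0
    if producer == "load":
        stalls += 1   # the loaded value is available only after Mem
    if consumer == "branch":
        stalls += 1   # the value cannot be forwarded to the branch until after Ex
    return stalls


print(raw_stall_cycles("load", "alu"))      # 1: load RAW stall
print(raw_stall_cycles("alu", "branch"))    # 1: branch RAW stall
print(raw_stall_cycles("load", "branch"))   # 2: the "double whammy"
```

The two penalties simply add, which is why the load followed by a branch costs two cycles.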

The Pipeline

The pipeline:

• is essentially a miniature graph of parallel-processing elements

• instructions flow from node to node

The Pipeline

Consider this …

    dadd r3,r1,r2
    dadd r4,r1,r2

They flow:

• smoothly through the pipeline, no stalls

Now consider this …

    dadd r3,r1,r2
    dmul r4,r1,r2

They flow:

• more slowly (more cycles), but still no stalls (multiplication is expensive)

And this …

    dmul r3,r1,r2
    dmul r4,r1,r2
    dmul r5,r1,r2

They flow:

• again, more slowly (more cycles), but still no stalls

• all three instructions flow through the multiplier

And this …

    dmul r4,r1,r2
    dadd r3,r1,r2

The dadd is not blocked by the dmul:

• the dadd overtakes the dmul in the pipeline

• still no stalls

Write-After-Write (WAW) Stalls

    dmul r3,r1,r2
    dadd r3,r1,r2

The dadd is now blocked by the dmul:

• were the dadd to overtake the dmul, r3 would have the incorrect final value

This is known as a:

• write-after-write (WAW) stall

Write-After-Write (WAW) Stalls

    dmul r3,r1,r2
    dadd r3,r1,r2
    daddi r5,r0,100

• subsequent, independent instructions are also blocked!
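To make the WAW condition concrete, here is a rough Python sketch that scans a short instruction list for a later, faster instruction writing the same destination register as an earlier, slower one. The latency table is an assumption chosen for illustration (a multi-cycle multiplier, single-cycle integer adds), and the check ignores how far apart the two instructions are, so it is only meaningful for very short sequences like the ones above.

```python
# Assumed Ex latencies in cycles; illustrative values, not simulator settings.
EX_LATENCY = {"dadd": 1, "daddi": 1, "dmul": 7}


def parse(instr: str):
    """Split 'dmul r3,r1,r2' into (opcode, destination register)."""
    opcode, operands = instr.split(None, 1)
    dest = operands.split(",")[0].strip()
    return opcode, dest


def waw_hazards(program):
    """Return (earlier, later) index pairs where a later, shorter instruction
    writes the same register as an earlier, longer-running one."""
    hazards = []
    for i, first in enumerate(program):
        op_i, dest_i = parse(first)
        for j in range(i + 1, len(program)):
            op_j, dest_j = parse(program[j])
            if dest_i == dest_j and EX_LATENCY[op_j] < EX_LATENCY[op_i]:
                hazards.append((i, j))
    return hazards


print(waw_hazards(["dmul r4,r1,r2", "dadd r3,r1,r2"]))   # []
print(waw_hazards(["dmul r3,r1,r2", "dadd r3,r1,r2"]))   # [(0, 1)]
```

The first call corresponds to the case where the dadd may overtake the dmul; the second is the WAW case, where it must wait.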

Another topic…

Example

Consider:

for (i=0; i<1000; i+=1) a[i] += 1; // where a[i] is an integer

Example

            .data                   ; psched3.s
    N:      .word 8000              ; N = 8000 (1000 iterations x 8 bytes)
    a:      .space 8000             ; 1000 64-bit words

            .text
            daddi r1,r0,0           ; r1 = i = 0
            ld r2,N(r0)             ; r2 = N

    loop:   ld r3,a(r1)             ; r3 = a[i/8]
            daddi r3,r3,1           ; r3 = r3 + 1
            sd r3,a(r1)             ; a[i/8] = r3
            daddi r1,r1,8           ; i = i + 8
            bne r1,r2,loop          ; repeat, unless i == N
            nop                     ; branch delay slot
            halt

So …

Stalls per iteration:

• one load RAW stall on r3

• one branch RAW stall on r1

Plus, over the whole run:

• 999 wasted nop cycles (in the delay slot)

8007 cycles in total
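One way to account for that total is sketched below in Python. It assumes one cycle per instruction, one cycle per stall, and four extra cycles to fill a five-stage pipeline, and it counts halt as an instruction. These are assumptions of the sketch, chosen because they reproduce the reported figure; the helper name `total_cycles` is illustrative.

```python
def total_cycles(instructions: int, stall_cycles: int, stages: int = 5) -> int:
    """Cycles = one per instruction + stall cycles + (stages - 1) to fill the pipe."""
    return instructions + stall_cycles + (stages - 1)


iterations = 1000
setup = 2          # daddi and ld before the loop
body = 6           # ld, daddi, sd, daddi, bne, nop
instructions = setup + body * iterations + 1   # +1 for halt
stalls = 2 * iterations                        # one load RAW + one branch RAW

print(total_cycles(instructions, stalls))      # 8007
```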

After pipeline scheduling, …

            .data                   ; psched4.s
    N:      .word 8000
    b:      .word 0                 ; Address of a, minus 8 (hack)
    a:      .space 8000

            .text
            daddi r1,r0,0
            ld r2,N(r0)

    loop:   ld r3,a(r1)
            daddi r1,r1,8           ; moved up
            daddi r3,r3,1
            bne r1,r2,loop
            sd r3,b(r1)             ; moved down, and adjusted
            halt
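The displacement adjustment is the subtle part: because r1 is incremented before the store, the store must use a base address 8 bytes lower. The following Python sketch models both instruction orders on a plain list and is only an illustration of the idea, not a model of the pipeline. The function names are made up for this example.

```python
WORD = 8  # bytes per 64-bit word


def original_loop(a):
    """psched3 order: load, add, store, then advance the byte offset."""
    a = list(a)
    offset = 0
    while offset != len(a) * WORD:
        value = a[offset // WORD] + 1
        a[offset // WORD] = value
        offset += WORD
    return a


def scheduled_loop(a):
    """psched4 order: the offset is advanced before the store, so the store
    uses a base 8 bytes lower (label b) to reach the same element.
    Like the assembly, this runs the body at least once (a must be non-empty)."""
    a = list(a)
    offset = 0
    while True:
        value = a[offset // WORD]            # ld    r3,a(r1)
        offset += WORD                       # daddi r1,r1,8   (moved up)
        value += 1                           # daddi r3,r3,1
        done = offset == len(a) * WORD       # bne   r1,r2,loop
        a[(offset - WORD) // WORD] = value   # sd    r3,b(r1)  (delay slot)
        if done:
            break
    return a


print(original_loop([10, 20, 30]))    # [11, 21, 31]
print(scheduled_loop([10, 20, 30]))   # [11, 21, 31]
```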

So …

No stalls!

• just 5007 cycles, a substantial improvement

• CPI of 1.001

I suspect:

• we can’t do much better than that! (every instruction/cycle does something which has to be done)

Now, …

Let’s try the same thing:

• but with floating point numbers

Example — Floating Point

            .data                   ; psched5.s
    one:    .double 1.0             ; one
    N:      .word 8000              ; N = 8000 (1000 iterations x 8 bytes)
    a:      .space 8000             ; 1000 64-bit floating point values

            .text
            l.d f10,one(r0)         ; f10 = 1 (one)
            daddi r1,r0,0           ; r1 = i = 0
            ld r2,N(r0)             ; r2 = N

    loop:   l.d f3,a(r1)            ; f3 = a[i/8]
            add.d f3,f3,f10         ; f3 = f3 + one
            s.d f3,a(r1)            ; a[i/8] = f3
            daddi r1,r1,8           ; i = i + 8
            bne r1,r2,loop          ; repeat, unless i == N
            nop                     ; branch delay slot
            halt

Stalls?

Stalls:

• 5000 RAW stalls

• 1000 structural stalls

Overall: 11008 cycles.

As Before: Reorder Operations

            .data                   ; psched6.s
    one:    .double 1.0             ; one
    N:      .word 8000              ; N = 8000 (1000 iterations x 8 bytes)
    b:      .word 0                 ; Address of a, minus 8
    a:      .space 8000             ; 1000 64-bit floating point values

    loop:   l.d f3,a(r1)            ; f3 = a[i/8]
            daddi r1,r1,8           ; i = i + 8
            add.d f3,f3,f10         ; f3 = f3 + one
            bne r1,r2,loop          ; repeat, unless i == N
            s.d f3,b(r1)            ; a[i/8] = f3 (in the delay slot)
            halt

Stalls?

Stalls:

• 1000 RAW stalls

• 1000 structural stalls

Overall: 7008 cycles; better than 11008, previously.

Stalls?

Remaining stalls, both RAW and structural, are from:

    add.d f3,f3,f10         ; f3 = f3 + one
    ...
    s.d f3,b(r1)            ; a[i] = f3

It takes four cycles for the add.d to move through the floating point adder.

The s.d (a read after write) arrives too soon, and is blocked for two cycles (per iteration).
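A simplified way to think about this: each independent instruction placed between the add.d and its s.d hides one cycle of the adder's latency. The Python sketch below encodes that rule of thumb. The formula is an assumption, calibrated so that one intervening instruction (the bne in psched6) gives the two-cycle block described above. It ignores structural stalls and is not the simulator's exact rule.

```python
def store_stall_cycles(gap: int, adder_cycles: int = 4) -> int:
    """Cycles a dependent s.d waits, given `gap` independent instructions
    issued between the add.d and the s.d (simplified model)."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    return max(0, adder_cycles - 1 - gap)


print(store_stall_cycles(1))   # 2: psched6, only the bne sits in between
print(store_stall_cycles(3))   # 0: enough independent work to hide the latency
```

Under this model, removing the stall means finding three independent instructions to place between each add.d and its s.d.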

So, …

How can we eliminate the remaining stalls?

Loop Unrolling

Originally:

    for (i=0; i<1000; i+=1)
        a[i] += 1;

Unroll the loop:

    for (i=0; i<1000; i+=4) {
        a[i+0] += 1;
        a[i+1] += 1;
        a[i+2] += 1;
        a[i+3] += 1;
    }

Example

            .data                   ; psched7.s
    one:    .double 1.0             ; one
    N:      .word 8000              ; N = 8000 (1000 iterations x 8 bytes)
    b:      .word 0                 ; Address of a, minus 8
    a:      .space 8000             ; 1000 64-bit floating point values

    loop:   l.d f3,a(r1)            ; 1
            daddi r1,r1,8
            add.d f3,f3,f10
            s.d f3,b(r1)

            l.d f4,a(r1)            ; 2 - now using f4, instead of f3
            daddi r1,r1,8
            add.d f4,f4,f10
            s.d f4,b(r1)

            l.d f5,a(r1)            ; 3 - now using f5, instead of f3
            daddi r1,r1,8
            add.d f5,f5,f10
            s.d f5,b(r1)

            l.d f6,a(r1)            ; 4 - now using f6, instead of f3
            daddi r1,r1,8
            add.d f6,f6,f10
            s.d f6,b(r1)

            bne r1,r2,loop          ; repeat, unless i == N
            nop                     ; branch delay slot
            halt

Now, …

That doesn’t help:

• but many of these operations are now independent

• they can be reordered

First:

• we don’t need all those daddi instructions

• and we can use the delay slot for something useful

Example

            .data                   ; psched8.s
    one:    .double 1.0
    N:      .word 8000
    a:      .space 8000

            .text
            ld r2,N(r0)
            l.d f10,one(r0)
            daddi r11,r0,a
            dadd r12,r11,r2

    loop:   l.d f3,0(r11)
            add.d f3,f3,f10
            s.d f3,0(r11)

            l.d f4,8(r11)           ; adjust displacement
            add.d f4,f4,f10
            s.d f4,8(r11)           ; adjust displacement

            l.d f5,16(r11)          ; adjust displacement
            add.d f5,f5,f10
            s.d f5,16(r11)          ; adjust displacement

            l.d f6,24(r11)          ; adjust displacement
            add.d f6,f6,f10

            daddi r11,r11,32        ; collect all four daddi-s into one

            bne r11,r12,loop
            s.d f6,-8(r11)          ; 24 - 32 == -8, adjust displacement
            halt

Hmm, …

Still:

• 7009 cycles

• fewer instructions, more stalls, same number of cycles

We need to:

• look more carefully at how the pipeline is operating

• explore more options for reordering operations

Best we can do?

            .data                   ; psched9.s
    one:    .double 1.0
    N:      .word 8000
    a:      .space 8000

            .text
            ld r2,N(r0)
            l.d f10,one(r0)
            daddi r11,r0,a
            dadd r12,r11,r2

    loop:   l.d f3,0(r11)           ; i%4==0 load
            l.d f4,8(r11)           ; i%4==1 load

            add.d f3,f3,f10         ; i%4==0 add
            l.d f5,16(r11)          ; i%4==2 load
            l.d f6,24(r11)          ; i%4==3 load

            add.d f4,f4,f10         ; i%4==1 add
            s.d f3,0(r11)           ; i%4==0 store
            daddi r11,r11,32        ; increment loop counter

            add.d f5,f5,f10         ; i%4==2 add
            add.d f6,f6,f10         ; i%4==3 add

            s.d f4,-24(r11)         ; i%4==1 store (8-32 == -24)
            s.d f5,-16(r11)         ; i%4==2 store (16-32 == -16)

            bne r11,r12,loop
            s.d f6,-8(r11)          ; i%4==3 store (24-32 == -8)
            halt
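With this much reordering it is worth checking that every store still reaches the right element. The Python sketch below interprets one pass of the loop body, in the order listed, against a dictionary standing in for memory. It models only the values, not pipeline timing. The tuple encoding and the name `run_body` are inventions for this example, and the bne is left out because a single pass is executed.

```python
def run_body(memory, base, one=1.0):
    """Execute one pass of the psched9 loop body.

    memory maps byte addresses to values; base is the initial r11.
    Returns the value of r11 after the pass."""
    f = {}          # floating-point registers
    r11 = base      # pointer into the array
    schedule = [
        ("l.d", "f3", 0), ("l.d", "f4", 8),
        ("add.d", "f3"), ("l.d", "f5", 16), ("l.d", "f6", 24),
        ("add.d", "f4"), ("s.d", "f3", 0), ("daddi", 32),
        ("add.d", "f5"), ("add.d", "f6"),
        ("s.d", "f4", -24), ("s.d", "f5", -16),
        ("s.d", "f6", -8),      # executed in the branch delay slot
    ]
    for op, *args in schedule:
        if op == "l.d":
            reg, disp = args
            f[reg] = memory[r11 + disp]
        elif op == "add.d":
            (reg,) = args
            f[reg] = f[reg] + one
        elif op == "s.d":
            reg, disp = args
            memory[r11 + disp] = f[reg]
        elif op == "daddi":
            (step,) = args
            r11 += step
    return r11


memory = {addr: float(i) for i, addr in enumerate(range(0, 32, 8))}
next_base = run_body(memory, base=0)
print(memory)      # {0: 1.0, 8: 2.0, 16: 3.0, 24: 4.0}
print(next_base)   # 32
```

Each of the four elements is incremented exactly once, which confirms that the negative displacements compensate for the early daddi.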

Differences …

Differences from the previous version:

• reorder independent instructions, but preserve the order of dependent instructions

• increment the loop counter sooner and adjust subsequent offsets

• carefully interleave FP and non-FP instructions, allowing non-FP instructions to flow around FP instructions in the FP adder

Performance

Now:

• 4009 cycles, CPI of 1.144

• just 500 structural stalls

Previously:

• 11008 cycles, CPI of 1.833

• roughly 64% fewer cycles (a speedup of about 2.7x)
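The CPI figures and the overall improvement can be checked from the cycle counts. In the Python sketch below, the instruction counts are read off the listings (setup instructions, loop body times iterations, plus halt). That count is an assumption of this sketch rather than a number reported by the simulator.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction."""
    return cycles / instructions


# Assumed instruction counts, taken from the listings:
# psched5: 3 setup + 6 per iteration * 1000 iterations + halt
# psched9: 4 setup + 14 per unrolled pass * 250 passes + halt
before_cycles, before_instructions = 11008, 3 + 6 * 1000 + 1
after_cycles, after_instructions = 4009, 4 + 14 * 250 + 1

print(round(cpi(before_cycles, before_instructions), 3))   # 1.833
print(round(cpi(after_cycles, after_instructions), 3))     # 1.144
print(round(before_cycles / after_cycles, 2))              # 2.75 (speedup factor)
print(round(1 - after_cycles / before_cycles, 3))          # 0.636 (fraction of cycles saved)
```

Note that the final version also executes fewer instructions, so the gain comes from both a lower CPI and a shorter instruction stream.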

More Generally

Unroll the loop some number of times (four, here):

    for (i=0; i<N-N%4; i+=4) {
        a[i+0] += 1;
        a[i+1] += 1;
        a[i+2] += 1;
        a[i+3] += 1;
    }

    // Perform any remaining iterations...
    for (i=N-N%4; i<N; i+=1)
        a[i] += 1;
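The same structure, written as a runnable Python sketch so that the clean-up loop can be tested for lengths that are not a multiple of four. The function names are illustrative.

```python
def increment_all(a):
    """Reference version: one element per iteration."""
    for i in range(len(a)):
        a[i] += 1


def increment_all_unrolled(a):
    """Unrolled four times, with a clean-up loop for the last N % 4 elements."""
    n = len(a)
    main = n - n % 4
    for i in range(0, main, 4):
        a[i + 0] += 1
        a[i + 1] += 1
        a[i + 2] += 1
        a[i + 3] += 1
    # Perform any remaining iterations
    for i in range(main, n):
        a[i] += 1


values = [10, 20, 30, 40, 50, 60]   # N = 6: one unrolled pass, two leftovers
increment_all_unrolled(values)
print(values)                        # [11, 21, 31, 41, 51, 61]
```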

Costs?

Costs:

• increase in code size (so this is a space-time trade-off)

• requires more registers (a limited resource)

In fact:

• this type of pipeline scheduling is only possible because we have a large number of general-purpose registers

Done
