Post on 22-Jan-2020

CA226 — Advanced Computer Architecture

Stephen Blott <stephen.blott@dcu.ie>

Types of Hazard

Structural hazards: resource conflicts; the hardware cannot support all instruction combinations simultaneously

Data hazards: when one instruction depends upon the result (which is not yet available) of a previous instruction

Control hazards: when the address of the next instruction cannot be determined immediately (branch, jump instructions; today's topic)

Control Hazards

Control hazards:

• arise from pipelining of branch (and jump) instructions

As described thus far, branching decisions:

• are made during the Mem stage of the pipeline

A naive approach:

• stall until branch decision is known

Terminology

Whenever we encounter a branch:

• it is:

• either taken, or not taken

• the cost may be different in each case

[Figure: Control Hazards]

Naive Branching

Branch not taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               stall  stall  stall  **IF   ID     Ex
branch+8                      stall  stall  stall  IF     ID

Branch taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
target                 stall  stall  stall  **IF   ID     Ex
target+4                      stall  stall  stall  IF     ID

Unfortunately …

This will result in:

• the pipeline being stalled for three cycles every time a branch is encountered

• and branch instructions are common
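To get a feel for what a fixed three-cycle stall per branch costs, here is a minimal sketch of the usual cycles-per-instruction estimate. The function name `effective_cpi`, the ideal base CPI of 1, and the branch frequencies used below are all illustrative assumptions; the slides give no measured branch frequency.

```python
def effective_cpi(branch_fraction: float, stall_cycles: int, base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when each branch adds `stall_cycles` stalls.

    branch_fraction is the (assumed) fraction of executed instructions that are branches.
    """
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    return base_cpi + branch_fraction * stall_cycles


if __name__ == "__main__":
    # Hypothetical branch frequencies, chosen only to show the trend.
    for fraction in (0.1, 0.2, 0.3):
        print(f"{fraction:.0%} branches -> CPI {effective_cpi(fraction, 3):.2f}")
```

Under these assumptions the penalty grows linearly with how common branches are, which is why the following slides work on reducing the stall per branch.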

What might help is:

• a prediction

Predict that a branch will either be:

• taken, or not taken

Easiest thing to do:

• predict branch not taken

• simply allow subsequent instructions to continue to flow into the pipeline

Predict Not Taken

Table 1. And branch is indeed not taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               IF     ID     Ex     **Mem  WB
branch+8                      IF     ID     Ex     Mem    WB
branch+12                            IF     ID     Ex     Mem

Perfect!

• But what if the branch is in fact taken?

Predict Not Taken

Table 2. But branch is in fact taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               IF     ID     Ex     **Mem  WB
branch+8                      IF     ID     **Ex   Mem    WB
branch+12                            IF     **ID   Ex     Mem
target                                      **IF

Predict Not Taken

Table 3. But branch is in fact taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               IF     ID     Ex     **nop  nop
branch+8                      IF     ID     **nop  nop    nop
branch+12                            IF     **nop  nop    nop
target                                      **IF

Observe:

• none of the subsequent instructions has yet changed memory or any registers; that’s helpful! Replace them with nop instructions

(Still a stall of three cycles when branch taken.)

Slightly Better

When a branch instruction is detected:

• route the Branch Taken condition:

• from Ex (instead of from Mem)

• to ID (instead of to IF)

[Figure: MIPS Pipeline]

Example

Table 4. Branch not taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex**   Mem    WB
branch+4               IF     stall  **ID   Ex     Mem    WB
branch+8                             IF     ID     Ex     Mem
branch+12                                   IF     ID     Ex

Note

We save two stalls: one because we learn the decision one cycle sooner, and one because we allow the subsequent instruction into IF.

Example — Branch Taken

Table 5. Branch taken:

                1      2      3      4      5      6      7
branch          IF     ID     Ex**   Mem    WB
branch+4               IF     nop    **ID   nop    nop    nop
target                               **IF   ID     Ex     Mem
target+4                                    IF     ID     Ex

Note

An effective stall of two cycles, but one better than before, because we learn if the branch is taken one cycle sooner.

Where do we stand?

If a branch is not taken:

• we have a stall of one cycle

If a branch is taken:

• we have a stall of two cycles

In Practice

Unfortunately:

• branches are common, and most branches are taken (which is indeed unfortunate)

In Practice

Add additional hardware in ID:

• detect branches

• decode the target address:
  target = IF/ID.nPC + (sign-extend(IF/ID.IR(0..15)) << 2)
  (so we need at least an adder)

• calculate whether the branch is taken; we need to:

• test equality, and for zero (and perhaps a couple of other tests)
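The extra work done in ID can be sketched in a few lines of Python. This is only an illustrative model of the formula above, not the hardware itself: the function names (`sign_extend_16`, `branch_target`, `branch_taken`) are hypothetical, and the set of tests covers just the four branch instructions that appear in these slides.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(npc: int, instruction: int) -> int:
    """target = nPC + (sign-extend(IR bits 0..15) << 2)."""
    immediate = sign_extend_16(instruction & 0xFFFF)
    return npc + (immediate << 2)


def branch_taken(opcode: str, rs: int, rt: int = 0) -> bool:
    """The comparisons ID must perform: equality tests and tests against zero."""
    tests = {
        "beq": rs == rt,
        "bne": rs != rt,
        "beqz": rs == 0,
        "bnez": rs != 0,
    }
    return tests[opcode]


if __name__ == "__main__":
    # An immediate of -1 (0xFFFF) branches back to the branch instruction itself.
    print(hex(branch_target(0x1004, 0x0000FFFF)))
    print(branch_taken("bnez", 0), branch_taken("beqz", 0))
```

The adder appears as the `+` in `branch_target`; the comparator appears as the tests in `branch_taken`.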

So:

• branching is so common and the cost of stalls so great,

• that it is worth the cost and complexity of additional hardware in the ID pipeline stage

So:

• we determine one stage earlier still whether a branch is taken or not (in ID, now, instead of in Ex)

So, we have:

• no stall if the branch is not taken, and

• a one-cycle stall if the branch is taken

Now…

Table 6. Branch not taken:

                1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4               IF     **ID   Ex     Mem    WB

Now…

Table 7. Branch taken:

                1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4               IF     **nop  nop    nop    nop
target                        **IF   ID     Ex     Mem    WB
target+4                             IF     ID     Ex     Mem

Try these in the simulator …

        bnez  r0,target   ; no stall
        daddi r1,r0,1

        beqz  r0,target   ; branch taken, stall of 1 cycle
        daddi r1,r1,1

Note to self:

• see branch.s

Predict Not Taken

In effect:

• we’re guessing, here, that the branch will not be taken

• so this strategy is known as predict not taken

So:

• no stall if the branch is not taken

• a stall of one cycle if the branch is taken

What might the average number of stall cycles for branch instructions be?

Unfortunately, …

The common case in practice is …

• that the branch is taken!

• so the average number of stalls per branch, in practice, approaches 1

Because …

    for (i=0; i<N; i+=1)
    {
        // do stuff
    }

Whenever we have such a loop:

• the branch is taken more often than not taken

Because …

        daddi r1,r0,0     ; i=0;
        beq   r1,r2,done  ; if (i==N) goto done;
loop:                     ; do stuff
        daddi r1,r1,1     ; i+=1;
        bne   r1,r2,loop  ; if (i!=N) goto loop;
done:

The bne instruction:

• is executed about N times, and the branch is taken every time but the last, so the stalls-per-branch approaches 1

Might we do better?

A predict branch taken strategy:

• would be helpful

• unfortunately, this is not possible on MIPS:

• we only learn the target address after the ID stage

• so a cycle has already been wasted

Hmm:

• Wasted.

• Or is it?

How might we:

• make good use of that "wasted" cycle?

The "Branch Delay Slot"

A branch delay slot is:

• the instruction following any branch (or jump) instruction

Approach:

• the instruction in the delay slot is always executed, whether the branch is taken or not

The "Delay Slot"

Table 8. Branch not taken:

                1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4 (BDS)         IF     **ID   Ex     Mem    WB
branch+8                      IF     ID     Ex     Mem    WB

The instruction after the branch:

• is always executed: good!

The "Delay Slot"

Table 9. Branch taken:

                1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4 (BDS)         IF     ID     Ex     Mem    WB
target                        **IF   ID     Ex     Mem    WB
target+4                             IF     ID     Ex     Mem

The instruction after the branch:

• is always executed: "branch+4" is executed anyway, so there is no stall!

The "Delay Slot"

On such hardware, compilers:

• must insert a suitable instruction into the delay slot

• or, if that is not possible, then a nop (poor solution)

Some Cases — nop

This:

        dadd r1,r2,r3
        bnez r2,somewhere

Becomes:

        dadd r1,r2,r3
        bnez r2,somewhere
        nop               ; poor solution, effectively a stall

Note

Correct, but not great. The nop is in effect a stall.

Some Cases — Independent Instruction

This:

        dadd r1,r2,r3
        bnez r2,somewhere

Becomes:

        bnez r2,somewhere
        dadd r1,r2,r3     ; the branch does not depend on r1
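The condition that makes this reordering safe can be stated as a small check. This is a simplified sketch, not a compiler's actual scheduler: `Instr` and `can_fill_delay_slot` are hypothetical names, and the check assumes the candidate is the instruction immediately before the branch and is not itself a branch or jump.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str]          # register written, or None
    sources: Tuple[str, ...]     # registers read


def can_fill_delay_slot(candidate: Instr, branch: Instr) -> bool:
    """True if `candidate` may move from just before `branch` into its delay slot.

    The move is safe when the branch does not read the register the candidate writes,
    so the branch decision is unchanged by executing the candidate afterwards.
    """
    return candidate.dest is None or candidate.dest not in branch.sources


if __name__ == "__main__":
    dadd = Instr("dadd", "r1", ("r2", "r3"))
    print(can_fill_delay_slot(dadd, Instr("bnez", None, ("r2",))))  # branch reads r2
    print(can_fill_delay_slot(dadd, Instr("bnez", None, ("r1",))))  # branch reads r1
```

When the check fails, as with `bnez r1,...`, the compiler has to look elsewhere for an instruction to place in the slot, or fall back to a nop.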

Some Cases — Temporary Registers

This:

        dadd r1,r2,r3
        or   r20,r2,r3    ; r20 is temporary register within this loop
        bnez r1,target
        ...
target:
        dsub r4,r5,r6

Becomes:

        dadd r1,r2,r3
        bnez r1,target
        or   r20,r2,r3    ; doesn't matter if executed
        ...               ; again, the delay cycle is effectively lost
target:                   ; but only if the branch is taken! (no nop)
        dsub r4,r5,r6

Loop — Far Better

This:

target:
        dsub  r4,r5,r6    ; assume r4 is a temporary register
        ...               ; do stuff
        daddi r1,r1,-1
        bnez  r1,target   ; branch depends on r1
        nop               ; BDS: we want to use this slot

Loop — Far Better

This:

target:
        dsub  r4,r5,r6    ; assume r4 is a temporary register
        ...               ; do stuff
        daddi r1,r1,-1
        bnez  r1,target

Becomes:

        dsub  r4,r5,r6    ; moved up
target:
        ...               ; do stuff
        daddi r1,r1,-1
        bnez  r1,target
        dsub  r4,r5,r6    ; repeated, from above

Try these in the simulator, again …

        bnez  r0,target   ; no stall
        daddi r1,r0,1

        beqz  r0,target   ; branch taken, no stall with branch delay slot
        daddi r1,r1,1

Note

This time with the branch delay slot enabled.

More Insurmountable Stalls

Example:

        dadd r1,r2,r3
        bnez r1,target    ; stall one cycle

        ld   r1,N(r0)
        bnez r1,target    ; stall two cycles

The branch:

• depends upon an immediately preceding arithmetic instruction (stall one cycle)

• depends upon an immediately preceding load (stall two cycles)

Another Insurmountable Stall

Table 10. If branch taken is resolved in Ex:

                1      2      3      4      5      6      7
dadd r1,r2,r3   IF     ID     Ex**   Mem    WB
bnez r1,target         IF     ID     **Ex   Mem    WB
delay slot                    IF     ID     Ex     Mem    WB

No problem:

• r1 can be forwarded, as before

Another Insurmountable Stall

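Table 11. If branch taken is resolved in ID:

                1      2      3      4      5      6      7
dadd r1,r2,r3   IF     ID     Ex**   Mem    WB
bnez r1,target         IF     **ID   Ex     Mem    WB
delay slot                    IF     ID     Ex     Mem    WB

Oops:

• forwarding can’t help here: the branch needs r1 in ID during the same cycle in which dadd is still computing it in Ex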

Another Insurmountable Stall

Table 12. If branch taken is resolved in ID:

                1      2      3      4      5      6      7
dadd r1,r2,r3   IF     ID     Ex**   Mem    WB
bnez r1,target         IF     stall  **ID   Ex     Mem    WB
delay slot                           IF     ID     Ex     Mem

Such a RAW dependency:

• results in a stall of one cycle

(Try to find another instruction which can be inserted in between.)
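The stall counts in Tables 10 to 12, and the two-cycle load case from the earlier example, follow from a simple timing rule. The sketch below is an illustrative model under stated assumptions: full forwarding, an ALU result available at the end of Ex, a loaded value available at the end of Mem, and the producer fetched in cycle 1. The names `RESULT_READY_AFTER`, `NOMINAL_CYCLE` and `branch_raw_stalls` are hypothetical.

```python
# Cycle at the end of which the producer's result exists (producer fetched in cycle 1).
RESULT_READY_AFTER = {"alu": 3, "load": 4}

# Cycle in which an instruction fetched in cycle 1 occupies the given stage.
NOMINAL_CYCLE = {"ID": 2, "Ex": 3}


def branch_raw_stalls(producer: str, resolve_stage: str, distance: int = 1) -> int:
    """Stall cycles for a branch that reads the producer's result.

    distance is how many instructions after the producer the branch is issued
    (1 means immediately following). The branch's resolving stage must fall in a
    cycle strictly after the one in which the result becomes available.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    ready = RESULT_READY_AFTER[producer]
    nominal = NOMINAL_CYCLE[resolve_stage] + distance
    return max(0, ready + 1 - nominal)


if __name__ == "__main__":
    print(branch_raw_stalls("alu", "Ex"))               # Table 10
    print(branch_raw_stalls("alu", "ID"))               # Table 12
    print(branch_raw_stalls("load", "ID"))              # ld followed by bnez
    print(branch_raw_stalls("alu", "ID", distance=2))   # one instruction in between
```

The last call models the suggestion above: placing one independent instruction between the dadd and the branch removes the stall in this model.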

Jumps

Jumps:

• are handled the same way: we learn the target address in ID, and the instruction in the delay slot is always executed

Jumps

Table 13. Jumps are always taken:

                1      2      3      4      5      6      7
jump            IF     ID**   Ex     Mem    WB
delay slot             IF     ID     Ex     Mem    WB
target                        **IF   ID     Ex     Mem    WB
target+4                             IF     ID     Ex     Mem

Example

Note to self:

• take a look at ../winmips64/reverse-with-nops.s

Done