CA226 — Advanced Computer Architecture
…
Today:
• data hazards
…
Recall:
• the MIPS pipeline implements instruction-level parallelism
• ideally, up to five instructions are executed (in part) on any clock cycle
• if one instruction were to exit the pipeline on each cycle:
  • then the CPI would be 1
    and, ideally, the MIPS pipeline approaches a CPI of 1
MIPS Pipeline
Example
    daddi r1,r1,1
    daddi r2,r2,1
    daddi r3,r3,1
    daddi r4,r4,1
    daddi r5,r5,1
Note
Note to self: see pipeline.s.
Speedup
Ideally:
• each instruction takes 5 cycles to execute
• however, 5 instructions are in the pipeline
• so the number of cycles per instruction approaches 1
Note
Note to self: observe the effect on CPI of repeating the previous block of instructions.
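The arithmetic behind "approaches 1" can be sketched in Python. This is a minimal sketch, assuming a pipeline with no stalls at all; the function name `ideal_cpi` is illustrative and not from the slides. The first instruction needs 5 cycles, and each later instruction completes one cycle after its predecessor:

```python
def ideal_cpi(n_instructions, stages=5):
    """Cycles per instruction for a stall-free pipeline (illustrative model)."""
    # The first instruction needs `stages` cycles; each later instruction
    # finishes exactly one cycle after its predecessor.
    total_cycles = stages + (n_instructions - 1)
    return total_cycles / n_instructions


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, ideal_cpi(n))
```

Repeating the block of five instructions many times makes the 4 fill cycles negligible, so the CPI tends towards 1.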
Hazards
The major hurdle to effective pipeline implementation is:
• hazards
Types of Hazard
Structural hazards
    resource conflicts; the hardware cannot support all instruction combinations simultaneously
Data hazards
    when one instruction depends upon the result (which is not yet available) of a previous instruction (today)
Control hazards
    when the address of the next instruction cannot be determined immediately
Data Hazards — Example
Consider:
    dadd r1,r2,r3      ; instruction 1
    dsub r4,r1,r5      ; instruction 2
    and r6,r1,r5       ; instruction 3
    or r8,r1,r9        ; instruction 4
    xor r10,r1,r11     ; instruction 5
Instructions 2, 3, 4 and 5:
• each depends upon the result of instruction 1 (register r1)
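These dependencies can be found mechanically. The following is a minimal Python sketch; the tuple encoding of instructions and the name `raw_dependencies` are assumptions made for illustration. It records the latest writer of each register and reports every later read of that register:

```python
# Each instruction: (mnemonic, destination register, source registers).
program = [
    ("dadd", "r1", ("r2", "r3")),    # instruction 1
    ("dsub", "r4", ("r1", "r5")),    # instruction 2
    ("and", "r6", ("r1", "r5")),     # instruction 3
    ("or", "r8", ("r1", "r9")),      # instruction 4
    ("xor", "r10", ("r1", "r11")),   # instruction 5
]


def raw_dependencies(program):
    """Return (producer, consumer, register) triples, numbered from 1."""
    deps = []
    last_writer = {}  # register -> number of the latest instruction writing it
    for number, (_, dest, sources) in enumerate(program, start=1):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], number, reg))
        last_writer[dest] = number
    return deps


if __name__ == "__main__":
    for producer, consumer, reg in raw_dependencies(program):
        print(f"instruction {consumer} reads {reg} written by instruction {producer}")
```

Whether a dependency actually causes a stall depends on how far apart the two instructions are and on whether forwarding is enabled.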
Ok …
Turn off forwarding, and let’s try running that …
Note to self:
• see hazards1.s.
Illustration
Table 1. Two Read-After-Write (RAW) pipeline stalls:
                 1      2      3      4      5      6      7
dadd r1,r2,r3    IF     ID     Ex     Mem    WB*
dsub r4,r1,r5           IF     ID     RAW    RAW    *Ex
and r6,r1,r5                   IF     stall  stall  ID
or r8,r1,r9                                         IF
Note
This assumes that we can both write and read the register file in a single clock cycle. Typically, the write happens in the first half of the cycle, and the read in the second half.
Observations
This is known as a read-after-write (RAW) stall:
• instruction 2 is blocked at ID because one of its arguments (registers) is not yet available
• in this case, all subsequent instructions are blocked too,
  which is known as a pipeline stall
Next, …
Consider:
• the effect of replacing instruction 2 with a nop instruction
  (or any other non-dependent instruction)
Illustration
Table 2. Still one RAW stall:
                 1      2      3      4      5      6      7
dadd r1,r2,r3    IF     ID     Ex     Mem    WB*
nop                     IF     ID     Ex     Mem    WB
and r6,r1,r5                   IF     ID     RAW    *Ex    Mem
or r8,r1,r9                           IF     stall  ID     Ex
Next, …
Finally, consider:
• the effect of replacing instruction 3 with a nop instruction
  (or any other non-dependent instruction)
Illustration
Table 3. No stalls:
                 1      2      3      4      5      6      7
dadd r1,r2,r3    IF     ID     Ex     Mem    WB*
nop                     IF     ID     Ex     Mem    WB
nop                            IF     ID     Ex     Mem
or r8,r1,r9                           IF     ID     *Ex    Mem
…
We could:
• find (two) other (independent) instructions to insert between such write-read dependencies
• but such dependencies are common,
  and we rarely have enough instructions to fill the gaps
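Tables 1 to 3 follow a simple pattern, sketched here in Python. The function name is illustrative, and the model assumes the 5-stage pipeline above with forwarding turned off and a register file that is written in the first half of a cycle and read in the second:

```python
def raw_stalls_without_forwarding(distance):
    """Stall cycles for a consumer `distance` instructions after its producer.

    distance=1 means the consumer immediately follows the producer.
    """
    # The producer writes the register in WB, three stages after its own ID.
    # The consumer can only complete ID in that cycle or later, so every
    # intervening instruction hides one of the three cycles.
    return max(0, 3 - distance)


if __name__ == "__main__":
    for distance in (1, 2, 3, 4):
        print(distance, raw_stalls_without_forwarding(distance))
```

A distance of 1 gives the two stalls of Table 1, a distance of 2 the single stall of Table 2, and a distance of 3 the stall-free case of Table 3.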
…
However, such hazards are not insurmountable:
• the ALU produces the necessary value in cycle 3
  (although it is not written back to the register file until cycle 5)
• that value is not needed by instruction 2 until cycle 4
…
Table 4. The value is available after cycle 3:
                 1      2      3      4      5      6      7
dadd r1,r2,r3    IF     ID     Ex**   Mem    WB*
dsub r4,r1,r5           IF     ID     RAW    RAW    *Ex
and r6,r1,r5                   IF     stall  stall  ID
or r8,r1,r9                                         IF
Forwarding
Solution:
• data paths are added:
  • EX/Mem.ALUOutput → ID/EX.A (output)
  • EX/Mem.ALUOutput → ID/EX.B (output)
  • Mem/WB.ALUOutput → ID/EX.A (output)
  • Mem/WB.ALUOutput → ID/EX.B (output)
• when a read-after-write is detected, the ALU input (either ID/EX.A or ID/EX.B)
  is switched to one of the two available ALUOutput pipeline registers (EX/Mem or Mem/WB)
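The selection logic can be sketched in Python. This is a simplified model, not the actual hardware: the function name and its parameters are hypothetical, only ALU results are considered (no loads), and each destination argument is `None` when the instruction in that pipeline register writes no register:

```python
def select_alu_input(source_reg, ex_mem_dest, mem_wb_dest):
    """Choose where one ALU operand (ID/EX.A or ID/EX.B) should come from."""
    # r0 is hard-wired to zero in MIPS, so it never needs forwarding.
    if source_reg == "r0":
        return "ID/EX"
    # Prefer the most recent result: the immediately preceding instruction,
    # whose ALU output now sits in the EX/Mem pipeline register.
    if source_reg == ex_mem_dest:
        return "EX/Mem.ALUOutput"
    # Otherwise try the instruction before that, now in Mem/WB.
    if source_reg == mem_wb_dest:
        return "Mem/WB.ALUOutput"
    # No hazard: use the value read from the register file during ID.
    return "ID/EX"


if __name__ == "__main__":
    # dsub r4,r1,r5 in Ex, directly after dadd r1,r2,r3:
    print(select_alu_input("r1", "r1", None))  # operand A
    print(select_alu_input("r5", "r1", None))  # operand B
```

Checking EX/Mem before Mem/WB matters when both hold a result for the same register: the newer value must win.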
MIPS Pipeline
Forwarding
                 1      2      3      4      5      6      7
dadd r1,r2,r3    IF     ID     Ex**   Mem    WB
dsub r4,r1,r5           IF     ID     **Ex   Mem    WB
and r6,r1,r5                   IF     ID     Ex     Mem    WB
or r8,r1,r9                           IF     ID     Ex     Mem
One of:
• EX/Mem.ALUOutput → ID/EX.A
• EX/Mem.ALUOutput → ID/EX.B
Forwarding
                 1      2      3      4      5      6      7
dadd r1,r2,r3    IF     ID     Ex     Mem**  WB
nop                     IF     ID     Ex     Mem    WB
and r6,r1,r5                   IF     ID     **Ex   Mem    WB
or r8,r1,r9                           IF     ID     Ex     Mem
One of:
• Mem/WB.ALUOutput → ID/EX.A
• Mem/WB.ALUOutput → ID/EX.B
The WinMIPS64 Simulator
The WinMIPS64 simulator:
• supports forwarding,
  which can be either enabled or disabled
• see: Configure/Enable Forwarding
…
Try turning on forwarding:
• and running the example again … (hazards1.s)
Now, consider the following …
    daddi r1,r2,123    ; instruction 1
    ld r4,0(r1)        ; instruction 2
    sd r4,8(r1)        ; instruction 3
Here:
• there is a RAW dependency between the daddi instruction and the address calculation in both of the following instructions
• the address calculation is handled by the ALU,
  so these are handled by forwarding, as before
Illustration
Table 5. No stalls due to address calculation:
                 1      2      3      4      5      6      7
daddi r1,r2,123  IF     ID     Ex**   Mem++  WB
ld r4,0(r1)             IF     ID     **Ex   Mem    WB
sd r4,8(r1)                    IF     ID     ++Ex   Mem    WB
• EX/Mem.ALUOutput → ID/EX.A for cycle 4
• Mem/WB.ALUOutput → ID/EX.A for cycle 5
And, again …
    daddi r1,r2,123    ; instruction 1
    ld r4,0(r1)        ; instruction 2
    sd r4,8(r1)        ; instruction 3
Also:
• the sd instruction depends upon the result of the ld
…
Table 6. This can be solved by forwarding too:
                 1      2      3      4      5      6      7
daddi r1,r2,123  IF     ID     Ex     Mem    WB
ld r4,0(r1)             IF     ID     Ex     Mem**  WB
sd r4,8(r1)                    IF     ID     Ex     **Mem  WB
Here:
• Mem/WB.LMD → EX/MEM.B for cycle 6
In full …
                 1      2      3      4      5      6      7
daddi r1,r2,123  IF     ID     Ex++   Mem==  WB
ld r4,0(r1)             IF     ID     ++Ex   Mem**  WB
sd r4,8(r1)                    IF     ID     ==Ex   **Mem  WB
• EX/Mem.ALUOutput → ID/EX.A for cycle 4
• Mem/WB.ALUOutput → ID/EX.A for cycle 5
• Mem/WB.LMD → EX/MEM.B for cycle 6
…
In all:
• four pipeline stalls are eliminated
  (note to self: see stalls1.s)
MIPS Pipeline
Unfortunately …
Forwarding cannot solve all RAW problems:
    ld r1,n(r0)
    dadd r2,r1,r0
…
Table 7. You can’t forward backwards in time:
                 1      2      3      4      5      6      7
ld r1,n(r0)      IF     ID     Ex     Mem**  WB
dadd r2,r1,r0           IF     ID     **Ex   Mem    WB
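Clearly:
• this is not possible:
  the load only delivers its value at the end of Mem (cycle 4), but dadd would need it at the start of Ex (also cycle 4)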
An Insurmountable Stall
Table 8. An inevitable stall of one cycle:
                 1      2      3      4      5      6      7
ld r1,n(r0)      IF     ID     Ex     Mem**  WB
dadd r2,r1,r0           IF     ID     RAW    **Ex   Mem
More generally, …
Unlike arithmetic instructions:
• loads yield values only after the Mem stage of the pipeline,
  so a stall at Ex cannot be avoided when the next instruction needs the loaded value
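The difference between ALU results and loaded values can be put as a small Python sketch. It assumes forwarding is enabled and that the consumer needs the value at the start of its Ex stage; the function name is illustrative:

```python
def stalls_with_forwarding(producer_is_load, distance):
    """Stall cycles for a consumer `distance` instructions after its producer.

    distance=1 means the consumer immediately follows the producer.
    """
    # An ALU result exists at the end of the producer's Ex stage.
    # A loaded value exists only at the end of Mem, one cycle later.
    extra_latency = 1 if producer_is_load else 0
    return max(0, extra_latency + 1 - distance)


if __name__ == "__main__":
    print(stalls_with_forwarding(False, 1))  # dadd followed by a dependent dsub
    print(stalls_with_forwarding(True, 1))   # ld followed by a dependent dadd
    print(stalls_with_forwarding(True, 2))   # one unrelated instruction in between
```

The model does not cover a store that needs the loaded value only as its data operand: that value is consumed in Mem rather than Ex, so, as in Table 6, forwarding removes the stall.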
Suggestion
When possible, replace:
    dadd r3,r2,r1      ; some other, unrelated instruction
    ld r4,N(r0)
    dadd r6,r5,r4      ; stall - can't forward backwards!
Suggestion
With:
    ld r4,N(r0)
    dadd r3,r2,r1      ; some other, unrelated instruction
    dadd r6,r5,r4      ; doesn't stall - can forward from ld
Now:
• when the final dadd reaches Ex:
  Mem/WB.LMD is available for forwarding
…
Note
A good compiler (or you!) should be able to spot such stalls and reorder the operations.
We spot such stalls by observing that an ALU instruction immediately follows a load upon which it depends.
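That rule is easy to automate. The following Python sketch uses a hypothetical tuple encoding `(mnemonic, destination, ex_sources)`, where `ex_sources` lists only the registers needed at the start of Ex (ALU operands and address bases, not the data register of a store):

```python
def load_use_stalls(program):
    """Return the indices of instructions that stall on the preceding load."""
    stalled = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = program[i - 1]
        _, _, ex_sources = program[i]
        # A stall occurs when the previous instruction is a load and this
        # instruction needs the loaded register in its Ex stage.
        if prev_op == "ld" and prev_dest in ex_sources:
            stalled.append(i)
    return stalled


before = [
    ("dadd", "r3", ("r2", "r1")),  # some other, unrelated instruction
    ("ld", "r4", ("r0",)),
    ("dadd", "r6", ("r5", "r4")),  # stalls
]
after = [
    ("ld", "r4", ("r0",)),
    ("dadd", "r3", ("r2", "r1")),  # fills the slot after the load
    ("dadd", "r6", ("r5", "r4")),  # no stall
]

if __name__ == "__main__":
    print(load_use_stalls(before))
    print(load_use_stalls(after))
```

Indices are zero-based positions in the list, so the stalled instruction in `before` is reported as index 2.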
Example
Compile:
    int a = b + c;
    int d = e + f;
Note to self:
• see psched1.s and psched2.s.
Example
First, spot the problem:
    ld r1,b(r0)        ; a = b + c
    ld r2,c(r0)
    dadd r5,r1,r2
    sd r5,a(r0)

    ld r1,e(r0)        ; d = e + f
    ld r2,f(r0)
    dadd r5,r1,r2
    sd r5,d(r0)
Example
Then, rewrite instructions such that there are no stalls:
    ld r1,b(r0)        ; a = b + c
    ld r2,c(r0)
    dadd r5,r1,r2      ; stall, r2 not ready
    sd r5,a(r0)

    ld r1,e(r0)        ; d = e + f
    ld r2,f(r0)
    dadd r5,r1,r2      ; stall, r2 not ready
    sd r5,d(r0)
Example
Well, it’s helpful to use different registers:
    ld r1,b(r0)        ; a = b + c
    ld r2,c(r0)
    dadd r5,r1,r2      ; stall, r2 not ready
    sd r5,a(r0)

    ld r3,e(r0)        ; d = e + f
    ld r4,f(r0)
    dadd r5,r3,r4      ; stall, r4 not ready
    sd r5,d(r0)
Example
No stalls:
    ld r1,b(r0)
    ld r2,c(r0)
    ld r3,e(r0)        ; prevent stall (pulled up)
    dadd r5,r1,r2      ; no stall

    ld r4,f(r0)
    sd r5,a(r0)        ; prevent stall (pushed down)
    dadd r5,r3,r4      ; no stall
    sd r5,d(r0)
…
This is known as:
• pipeline scheduling
In this case:
• use two extra registers
• avoid two stalls
• 13 cycles, instead of 15
Aside
The "13 versus 15 cycles" statement is misleading:
• it includes cycles for the pipeline to fill and empty
Actually:
• disregarding the filling of the pipeline:
  • it’s 8 cycles, instead of 10,
    so a speedup of 1.25
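The 8-versus-10 figure can be reproduced with a short Python sketch. It assumes forwarding is enabled, so the only stalls are load-use stalls; the tuple encoding `(mnemonic, destination, ex_sources)` and the name `issue_cycles` are illustrative, and `ex_sources` lists only the registers needed in Ex (so not the data register of a store):

```python
def issue_cycles(program):
    """Instructions plus load-use stalls, ignoring pipeline fill and drain."""
    stalls = sum(
        1
        for prev, cur in zip(program, program[1:])
        if prev[0] == "ld" and prev[1] in cur[2]
    )
    return len(program) + stalls


unscheduled = [
    ("ld", "r1", ("r0",)), ("ld", "r2", ("r0",)),
    ("dadd", "r5", ("r1", "r2")), ("sd", None, ("r0",)),
    ("ld", "r3", ("r0",)), ("ld", "r4", ("r0",)),
    ("dadd", "r5", ("r3", "r4")), ("sd", None, ("r0",)),
]
scheduled = [
    ("ld", "r1", ("r0",)), ("ld", "r2", ("r0",)), ("ld", "r3", ("r0",)),
    ("dadd", "r5", ("r1", "r2")), ("ld", "r4", ("r0",)),
    ("sd", None, ("r0",)), ("dadd", "r5", ("r3", "r4")), ("sd", None, ("r0",)),
]

if __name__ == "__main__":
    slow, fast = issue_cycles(unscheduled), issue_cycles(scheduled)
    print(slow, fast, slow / fast)
```

Both versions execute the same 8 instructions; the speedup comes entirely from removing the two stall cycles.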
Summary 1
Forwarding is simple:
• if the necessary data is available somewhere in the pipeline by the time it is needed,
  then it can be forwarded to where it’s needed
The implementation in hardware of these strategies is an engineering decision:
• it is correct, in all cases, to stall the pipeline when such hazards are detected
• forwarding, however, improves performance at the cost of some additional complexity
Summary 2
Some types of (RAW) stall are unavoidable:
• however, it is often possible to reorder instructions such that they do not occur
Done