Chapter 4
The Processor
CprE 381 Computer Organization and Assembly Level Programming, Fall 2013
Zhao Zhang
Iowa State University
Revised from original slides provided by MKP
Week 10 Overview
  Expected project progress: Complete Mini-Project B, part 1
  ALU data hazard and forwarding
  MEM data hazard, forwarding, and pipeline stall
  Control hazard and branch execution
Chapter 1 — Computer Abstractions and Technology — 2
Chapter 4 — The Processor — 3
Data Hazards from ALU Instructions
  An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  Consider this sequence:
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Data Hazards from ALU Instructions
  A naïve approach is to insert nops to wait out the dependence
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  Change to
    add $s0, $t0, $t1
    nop
    nop
    sub $t2, $s0, $t3
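The nop-padding approach can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the tuple layout (text, destination, sources), the name insert_nops, and the default gap of 2 are assumptions. The gap of 2 assumes the register file is written in the first half of a cycle and read in the second half, so a consumer three instructions behind its producer already reads the new value.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction record: (text, destination register or None, source registers)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]

NOP: Instr = ("nop", None, ())


def insert_nops(program: List[Instr], gap: int = 2) -> List[Instr]:
    """Pad with nops so that no instruction reads a register written by
    one of the `gap` instruction slots immediately before it."""
    out: List[Instr] = []
    for instr in program:
        sources = instr[2]
        # Keep padding while any of the last `gap` slots writes one of our sources.
        while any(
            prev[1] is not None
            and prev[1] not in ("$0", "$zero")
            and prev[1] in sources
            for prev in out[-gap:]
        ):
            out.append(NOP)
        out.append(instr)
    return out


if __name__ == "__main__":
    program = [
        ("add $s0, $t0, $t1", "$s0", ("$t0", "$t1")),
        ("sub $t2, $s0, $t3", "$t2", ("$s0", "$t3")),
    ]
    for text, _, _ in insert_nops(program):
        print(text)
```

Running the sketch on the add/sub pair places two nops between them, matching the slide. Each nop is a wasted cycle, which is why forwarding is preferred.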
Data Hazards in ALU Instructions
  Another naïve approach is to stall the 2nd instruction in the dependence
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
Data Hazards in ALU Instructions
  Observations on this scenario
    The first instruction, an ALU instruction, produces a register value
    The following instruction(s) consume the register value
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Data Hazards in ALU Instructions
  What exactly is the problem?
    A register value is written to the register file in the WB stage, two cycles after the EX stage
    The following instructions read the register value at the beginning of the ID stage
  Pipeline snapshots, one row per cycle:
    IF   ID   EX   MEM  WB
    and  sub  …    …    …      sub reads $1, $3
    or   and  sub  …    …      AND reads old $2
    add  or   and  sub  …      OR reads old $2
    sw   add  or   and  sub    sub writes to $2, add reads new $2
Data Hazards in ALU Instructions
Forwarding (aka Bypassing)
  Use result when it is computed
    The result is already in the pipeline
    Don’t wait for it to be stored in a register
    Requires extra connections in the datapath
Dependencies & Forwarding
Data Forwarding
  To what place: the two ALU inputs in the EX stage datapath
    Forwarded register value may replace the values from ID
  From where: the destination register value in pipeline registers
    Source 1: EX/MEM register
    Source 2: MEM/WB register
Data Forwarding
  When to forward: data dependence detected between
    Instructions at the EX and MEM stages
    Instructions at the EX and WB stages
  How to detect: compare source and destination register numbers
Data Forwarding Example
    sub $2, $1, $3
    and $12, $2, $5      # MEM => EX forwarding of $2
    or  $13, $6, $2      # WB => EX forwarding of $2
    add $14, $2, $2
    sw  $15, 100($2)
  Pipeline snapshots, one row per cycle:
    IF   ID   EX   MEM  WB
    or   and  sub  …    …
    add  or   and  sub  …      AND gets forwarded new $2 value
    sw   add  or   and  sub    OR gets forwarded new $2 value
Data Forwarding Logic Design
    sub $2, $1, $3
    and $12, $2, $5      # compare $2 with $2, $5
    or  $13, $6, $2      # compare $2 with $6, $2
  Detection: compare rs and rt at EX with rd at MEM and rd at WB
    Those register numbers are in the ID/EX, EX/MEM, and MEM/WB registers
    rs was not in the ID/EX register; we can add it
Data Forwarding Logic Design
  Register numbers in pipeline
    Source registers of the instruction at the EX stage
      ID/EX.RegisterRs, ID/EX.RegisterRt
    Destination register of the instruction at the MEM stage
      EX/MEM.RegisterRd
    Destination register of the instruction at the WB stage
      MEM/WB.RegisterRd
  Potential data hazards when
    1a. EX/MEM.RegisterRd = ID/EX.RegisterRs    (fwd from EX/MEM pipeline reg)
    1b. EX/MEM.RegisterRd = ID/EX.RegisterRt    (fwd from EX/MEM pipeline reg)
    2a. MEM/WB.RegisterRd = ID/EX.RegisterRs    (fwd from MEM/WB pipeline reg)
    2b. MEM/WB.RegisterRd = ID/EX.RegisterRt    (fwd from MEM/WB pipeline reg)
Data Forwarding Logic Design
  But only if the forwarding instruction will write to a register!
    EX/MEM.RegWrite = 1, MEM/WB.RegWrite = 1
    It’s possible an instruction has a matching rd but doesn’t write to a register
  And only if Rd for that instruction is not $zero
    EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
    An instruction is allowed to name $0 as its destination, but $0 always reads as zero, so that result must not be forwarded
Forwarding Paths
  The forwarding unit accesses three pipeline registers
  Note: rs is added to the ID/EX pipeline register
Forwarding Conditions
  EX hazard: data forwarding from EX/MEM register
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
  MEM hazard: data forwarding from MEM/WB register
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
  This is not the final version (see slide 23)
Datapath with Forwarding
Caveats
  Data forwarding happens at the beginning of the cycle
    The forwarding unit is in the EX stage, with its inputs from three pipeline stages
    A small overhead is added to the critical path latency of the EX stage
  For EX hazard, data forwarding is from MEM to EX
    Precisely, the register value of the instruction being executed at the MEM stage is forwarded to the instruction being executed at the EX stage
Caveats
  For MEM hazard, the forwarding is from WB to EX
    From the instruction at WB to the instruction at EX
  Data forwarding is to EX, not to ID
    An instruction may read obsolete register values at ID, with the values latched in the ID/EX register
    The correct values may be in the EX/MEM register (EX hazard) or the MEM/WB register (MEM hazard)
    Any obsolete values get replaced at EX
  There is no WB hazard
    Register write at WB and register read at ID, for the same register, may complete within one cycle
Double Data Hazard
  Consider the sequence:
    add $1, $1, $2
    add $1, $1, $3
    add $1, $1, $4
  Both hazards occur
    Want to use the most recent
  Revise MEM hazard condition
    Only fwd if EX hazard condition isn’t true
Revised Forwarding Condition
  MEM hazard (revision from slide 18)
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
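The EX and MEM hazard conditions, including the revision that gives the EX hazard priority, can be modeled as a small Python function. This is a sketch under stated assumptions rather than the course's reference design: the PipeRegs field names are hypothetical stand-ins for the pipeline register fields, and the select value 00 is assumed to mean "use the operand read from the register file in ID".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PipeRegs:
    """Hypothetical snapshot of the fields the forwarding unit reads."""
    id_ex_rs: int           # ID/EX.RegisterRs
    id_ex_rt: int           # ID/EX.RegisterRt
    ex_mem_reg_write: bool  # EX/MEM.RegWrite
    ex_mem_rd: int          # EX/MEM.RegisterRd
    mem_wb_reg_write: bool  # MEM/WB.RegWrite
    mem_wb_rd: int          # MEM/WB.RegisterRd


def forward_select(src: int, p: PipeRegs) -> int:
    """Return the 2-bit mux select for one ALU operand."""
    ex_hazard = p.ex_mem_reg_write and p.ex_mem_rd != 0 and p.ex_mem_rd == src
    mem_hazard = p.mem_wb_reg_write and p.mem_wb_rd != 0 and p.mem_wb_rd == src
    if ex_hazard:
        return 0b10  # forward from EX/MEM (most recent result wins)
    if mem_hazard:
        return 0b01  # forward from MEM/WB only if there is no EX hazard
    return 0b00      # assumed encoding: no forwarding


def forwarding_unit(p: PipeRegs) -> Tuple[int, int]:
    """Return (ForwardA, ForwardB) for the instruction in EX."""
    return forward_select(p.id_ex_rs, p), forward_select(p.id_ex_rt, p)


if __name__ == "__main__":
    # Third add of the double-hazard sequence is in EX: add $1, $1, $4
    regs = PipeRegs(id_ex_rs=1, id_ex_rt=4,
                    ex_mem_reg_write=True, ex_mem_rd=1,
                    mem_wb_reg_write=True, mem_wb_rd=1)
    print(forwarding_unit(regs))
```

Checking the EX hazard first is equivalent to the "and not (EX hazard)" clause in the revised MEM condition; in hardware the same priority is expressed in the combinational logic rather than by statement order.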
Load-Use Data Hazard
  Load-use data hazard: a load instruction is followed immediately by an instruction using the loaded value
  How is a load instruction different from an ALU instruction?
    ALU inst: destination register value available at the end of the EX stage
    Load inst: destination register value available at the end of the MEM stage
    Note the next instruction may need the value at the beginning of its EX stage
Load-Use Data Hazard
  Can’t always avoid stalls by forwarding
    If value not computed when needed
    Can’t forward backward in time!
Load-Use Data Hazard
Need to stall for one cycle
Load-Use Hazard
  How to insert a pipeline bubble (lost cycle)?
    lw  $2, 20($1)
    sub $4, $2, $5
    or  $8, $2, $6
  When the load instruction is at the EX stage
    Hold the instruction at the IF stage
      Do not update the PC
    Hold the instruction at the ID stage
      Do not change the IF/ID register
    Insert a nop at the EX stage
      Set all control signals in the ID/EX register to zero
      In particular, RegWrite = 0 and MemWrite = 0
    Let the instructions in MEM and WB move forward
Load-Use Hazard Detection
  To detect, check if
    A load instruction is at the EX stage
      ID/EX.MemRead = 1
    The instruction at the ID stage reads the register value of the load
      ID/EX.RegisterRt = IF/ID.RegisterRs, or
      ID/EX.RegisterRt = IF/ID.RegisterRt (for R-type)
  If detected, stall IF and ID, insert a bubble at EX, and let MEM and WB move forward
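The detection rule can be written as a short Python function. This is a hedged sketch, not the project's required interface: the function name, the if_id_uses_rt flag (a stand-in for "the instruction in ID actually reads rt", as R-type instructions do), and the output names are all assumptions.

```python
from typing import Dict


def hazard_detection(id_ex_mem_read: bool, id_ex_rt: int,
                     if_id_rs: int, if_id_rt: int,
                     if_id_uses_rt: bool = True) -> Dict[str, bool]:
    """Model of the load-use hazard detection unit.

    Returns hypothetical control outputs:
      PCWrite      - False holds the PC (stall IF)
      IF_ID_Write  - False holds the IF/ID register (stall ID)
      ID_EX_Bubble - True zeroes the control signals entering EX
    """
    stall = bool(
        id_ex_mem_read
        and (id_ex_rt == if_id_rs
             or (if_id_uses_rt and id_ex_rt == if_id_rt))
    )
    return {"PCWrite": not stall, "IF_ID_Write": not stall, "ID_EX_Bubble": stall}


if __name__ == "__main__":
    # lw $2, 20($1) is in EX; sub $4, $2, $5 is in ID
    print(hazard_detection(True, 2, 2, 5))
    # One cycle later the bubble is in EX (MemRead = 0); sub is still in ID
    print(hazard_detection(False, 0, 2, 5))
```

After the one-cycle stall the load is in MEM and the bubble occupies EX, so the check no longer fires and the loaded value reaches sub through WB-to-EX forwarding.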
Pipeline Stall
  The nop has all control signals set to zero
    It does nothing at EX, MEM and WB
  Prevent update of PC and IF/ID register
    Using instruction is decoded again (OK)
    Following instruction is fetched again (OK)
    1-cycle stall allows MEM to read data for lw
      Can subsequently forward from WB to EX
  Need to add new control lines
    PCWrite for holding or updating PC
    IF/IDWrite for holding or updating IF/ID register
Stall/Bubble in the Pipeline
Stall inserted here
Stall/Bubble in the Pipeline
Or, more accurately…
Datapath with Hazard Detection
Stalls and Performance
  Stalls reduce performance
    But are required to get correct results
  Compiler can arrange code to avoid hazards and stalls
    Requires knowledge of the pipeline structure
Code Scheduling to Avoid Stalls
  Reorder code to avoid use of load result in the next instruction
  C code for A = B + E; C = B + F;
  Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    add $t3, $t1, $t2      # stall
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    add $t5, $t1, $t4      # stall
    sw  $t5, 16($t0)
  Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
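The 13-cycle and 11-cycle figures can be checked with a short calculation. The Python sketch below is illustrative only: it assumes a 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the load immediately before it. The tuple layout and the name count_cycles are assumptions.

```python
from typing import List, Tuple

# Hypothetical record: (opcode, destination register or "", source registers)
Instr = Tuple[str, str, Tuple[str, ...]]


def count_cycles(program: List[Instr], stages: int = 5) -> int:
    """Cycles to run `program`: one per instruction, plus pipeline fill,
    plus one stall for every load-use pair of adjacent instructions."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return len(program) + (stages - 1) + stalls


original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", "", ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", "", ("$t5", "$t0")),
]

reordered = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", "", ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", "", ("$t5", "$t0")),
]

if __name__ == "__main__":
    print(count_cycles(original), count_cycles(reordered))
```

Seven instructions need 7 + 4 = 11 cycles to drain a 5-stage pipeline; the original order adds two load-use stalls, giving 13.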
Control Hazards
  Branch determines flow of control
    Two branch outcomes: Taken or Not-Taken
    Fetching next instruction depends on branch outcome
    Pipeline can’t always fetch correct instruction
      Still working on ID stage of branch
  In MIPS pipeline
    Need to compare registers and compute target early in the pipeline
    Add hardware to do it in ID stage
Control Hazards
  Several caveats
    The CPU doesn’t recognize a branch until it reaches the end of the ID stage
    Every cycle, the CPU has to fetch one instruction
      Cannot afford to wait and see
      Must predict the next PC every cycle
    The CPU may predict “always not-taken” (MIPS 5-stage pipeline)
    Alternatively, the CPU may predict branch outcome dynamically (advanced CPU design)
Control Hazards
  This MIPS pipeline always predicts Not-Taken
    Easy prediction: the next PC is the current PC plus 4
    No need to design a complex branch prediction unit
    Drawback: branches are Taken more often than Not-Taken in most programs
  What happens if the prediction is wrong?
    Will have mis-fetched instructions
    Flush those instructions before they take effect
      i.e. before they write to memory or a register
  A Taken branch incurs a performance penalty
Performance Impact
  If branch outcome determined in MEM
    [Figure: the instructions fetched after the branch are flushed (control values set to 0)]
  Three cycles wasted on a taken branch
Performance Impact
  The performance loss is 3 cycles per taken branch
    If branch outcome determined in MEM
  Move execution of branch to the ID stage!
    Only beq and bne are supported in original MIPS
    Testing equality and inequality is very fast; do it at the end of ID
  Branch target can be calculated in ID
    Branch target = PC + 4 + (sign-extended offset × 4)
    PC and offset are known at the beginning of ID
  At the end of ID, the CPU knows the branch outcome and branch target
Reducing Branch Delay
  Move hardware to determine outcome to ID stage
    Target address adder
    Register comparator
  Example code with branch taken
    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
        ...
    72: lw  $4, 50($7)
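The target address in this example follows from the MIPS branch encoding: the 16-bit offset counts words relative to the instruction after the branch. The Python sketch below works that out; the function names are assumptions made for illustration.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of `imm` as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """Target of a MIPS beq/bne at address `pc` with 16-bit offset `imm16`."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    # beq $1, $3, 7 at address 40: 40 + 4 + 7 * 4
    print(branch_target(40, 7))
```

For the beq at address 40 with offset 7 this gives 40 + 4 + 28 = 72, the address of the lw in the listing. The adder and the shift-left-by-2 are the hardware that moves into the ID stage.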
Example: Branch Taken
Early Branch Outcome
  Pipeline changes for early branch outcome
    2nd PC adder and the shifter moved to ID
    Comparator added to ID
    The Zero output of the ALU is no longer used for branches
  CPU flushes one instruction for every taken branch
    CPU detects taken branch at ID
    The instruction at IF will be flushed
    1 lost cycle instead of 3 lost cycles per taken branch
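The benefit of cutting the penalty from 3 cycles to 1 can be estimated with the usual CPI formula. The Python sketch below is a back-of-the-envelope model; the branch frequency (20%) and taken fraction (60%) are made-up illustrative numbers, not measurements from the slides, and effective_cpi is a hypothetical helper name.

```python
def effective_cpi(base_cpi: float, branch_freq: float,
                  taken_frac: float, penalty_cycles: int) -> float:
    """CPI including the cycles lost to flushed instructions on taken branches."""
    return base_cpi + branch_freq * taken_frac * penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% of instructions are branches, 60% of those are taken.
    resolved_in_mem = effective_cpi(1.0, 0.20, 0.60, 3)
    resolved_in_id = effective_cpi(1.0, 0.20, 0.60, 1)
    print(round(resolved_in_mem, 2), round(resolved_in_id, 2))
```

Under these assumed numbers the CPI drops from about 1.36 to about 1.12 when the branch is resolved in ID, ignoring the extra data-hazard stalls that early branch resolution can introduce.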
Pipeline Flushing
  When CPU detects a taken branch at ID
    Update PC with branch target (already have)
    Flush the instruction in the IF stage
  Add flush signal to IF/ID pipeline register
    When flushing, convert the instruction in the IF/ID register to 32-bit zeros
    0x00000000 is “sll $0, $0, 0”, effectively a nop
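Decoding the all-zero word field by field shows why it is harmless. In the MIPS R-type format, opcode 0 with funct 0 is the sll encoding, so 0x00000000 reads as sll $0, $0, 0, whose only effect would be a write to $0. The Python sketch below illustrates this; decode_rtype and flush_if_id are hypothetical helper names, and flush_if_id models only the instruction field of the IF/ID register.

```python
from typing import Dict


def decode_rtype(word: int) -> Dict[str, int]:
    """Split a 32-bit MIPS instruction word into its R-type fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def flush_if_id(if_id_instr: int, branch_taken: bool) -> int:
    """Value latched into IF/ID: zeros when the fetched instruction is flushed."""
    return 0x00000000 if branch_taken else if_id_instr


if __name__ == "__main__":
    # and $12, $2, $5 encoded as R-type: rs=2, rt=5, rd=12, funct=0x24
    and_word = (2 << 21) | (5 << 16) | (12 << 11) | 0x24
    print(decode_rtype(flush_if_id(and_word, branch_taken=True)))
    print(decode_rtype(flush_if_id(and_word, branch_taken=False)))
```

Because every field of the flushed word is zero, the decoded instruction names $0 as its destination and changes no architectural state.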
Example: Branch Taken
Note: Branch does nothing in EX, MEM and WB
Pipeline Bubble on Branch
  Taken branch incurs a pipeline bubble because of instruction flushing
Data Hazards for Branches
  Moving branch execution to ID is not so easy
  May need another forwarding unit
    The forwarding unit has to be in the ID stage
    The current forwarding unit, in the EX stage, obviously doesn't work
  Need extensions to the hazard detection unit, and more pipeline stalls
    Branch uses register values at ID; ALU and load instructions produce register values at EX and MEM
Data Hazards for Branches
  If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
    add $1, $2, $3        IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    …                             IF  ID  EX  MEM WB
    beq $1, $4, target                IF  ID  EX  MEM WB
  Can resolve using forwarding
    From MEM to ID, and from WB to ID
Data Hazards for Branches
  If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
    May need 1 stall cycle
    However, beq needs the value at the end of ID
    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                ID  EX  MEM WB
Data Hazards for Branches
  If a comparison register is a destination of immediately preceding load instruction
    May need 2 stall cycles
    Again, beq needs the value at the end of ID, so it’s possible to reduce stall to one cycle
    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                       ID
    beq $1, $0, target                    ID  EX  MEM WB
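The stall counts for a branch resolved in ID can be summarized by one rule: an ALU result can be forwarded to ID once its producer is at least two instructions ahead, and a loaded value once its producer is at least three ahead. The Python sketch below encodes that rule. It is an illustration under assumptions: it models the baseline design that needs the operand at the start of ID (not the possible one-cycle optimization mentioned above), and the function name and the "alu"/"load" labels are hypothetical.

```python
def branch_stall_cycles(producer_kind: str, distance: int) -> int:
    """Stall cycles for a branch compared in ID.

    producer_kind: "alu" or "load", the instruction writing the compared register
    distance: 1 if the producer immediately precedes the branch, 2 if it is
              the 2nd preceding instruction, and so on
    """
    # Distance at which the value is available to ID without stalling:
    # ALU results come from MEM (2 ahead), loaded data from WB (3 ahead).
    ready_distance = {"alu": 2, "load": 3}[producer_kind]
    return max(0, ready_distance - distance)


if __name__ == "__main__":
    for kind in ("alu", "load"):
        for distance in (1, 2, 3):
            print(kind, distance, branch_stall_cycles(kind, distance))
```

The results match the three cases on the slides: no stall for the 2nd or 3rd preceding ALU instruction, one stall for the preceding ALU instruction or the 2nd preceding load, and two stalls for the immediately preceding load.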
Mini-Project C
  In Mini-Project C, implement
    The simple MIPS pipeline
    Data forwarding and hazard detection
    Not-taken branch prediction with pipeline flushing
Delayed Branch
  Delayed branch may remove the one-cycle stall
    The instruction right after the beq is executed whether or not the branch is taken (the sub instruction in the example)
    Put another way, the effect of beq is delayed by one cycle
      sub $10, $4, $8            beq $1, $3, 7
      beq $1, $3, 7       =>     sub $10, $4, $8
      and $12, $2, $5            and $12, $2, $5
  Must find an independent instruction, otherwise
    May have to fill in a nop instruction, or
    Need two variants of beq, delayed and not delayed
Branch Prediction
  We’ve actually studied one form of branch prediction: always not-taken
  Longer pipelines can’t readily determine branch outcome early
    Stall penalty becomes unacceptable
  Predict outcome of branch
    Only stall if prediction is wrong
  In MIPS pipeline
    Can predict branches not taken
    Fetch instruction after branch, with no delay