computer architecture pipeline
DESCRIPTION
Computer Architecture Pipeline. By Yoav Etsion & Dan Tsafrir Presentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport. Pipeline idea: keep everyone busy. Pipeline: more accurately…. Expert in cutting bread. Expert in placing roast beef. - PowerPoint PPT PresentationTRANSCRIPT
Computer Architecture 2014– Pipeline1
Computer Architecture
Pipeline
By Yoav Etsion & Dan TsafrirPresentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport
Computer Architecture 2014– Pipeline2
Pipeline idea: keep everyone busy
Computer Architecture 2014– Pipeline3
Pipeline: more accurately…
Expert in placing tomatoand closing the sandwich
Expert in placing roast beef
Expert in cutting bread
Pipelining elsewhere• Unix shell
grep string File | wc -l• Assembling cars• Whenever want to keep
functional units busy
Computer Architecture 2014– Pipeline4
DataAccess
DataAccess
Pipeline: microarchitecture2 4 6 8 10 12 14 16 18
. ..
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
InstFetch
8 ns
8 ns
8 ns
TimeProgramexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
befo
re
Computer Architecture 2014– Pipeline5
DataAccess
DataAccess
Pipeline: microarchitecture2 4 6 8 10 12 14 16 18
. ..
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
InstFetch
8 ns
8 ns
8 ns
TimeProgramexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
befo
re
// R1 = mem[0+100]
fetch
decode & bringregs to ALU
100+R0
access mem write back result to R1
Computer Architecture 2014– Pipeline7
DataAccess
DataAccess
DataAccess
DataAccess
DataAccess
Pipeline: microarchitecture
• Speed set by slowest component (instruction takes longer in pipeline)• First commercial use in 1985• In Intel chips since 486 (until then, serial execution)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
. ..
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
InstFetch
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
2 ns
2 ns
2 ns 2 ns 2 ns 2 ns 2 ns
8 ns
8 ns
8 ns
TimeProgramexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
TimeProgramexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
befo
reaf
ter
// R1 = mem[0+100]
fetch
decode & bringregs to ALU
100+R0
access mem write back result to R1
Computer Architecture 2014– Pipeline8
MIPS
Introduced in 1981 by Hennessy (of “Patterson & Hennessy”) Idea suggested earlier, e.g., by John Cocke and friends at IBM
in the 1970s, but not developed in full
MIPS = Microprocessor without Interlocked Pipeline Stages RISC Often used in computer architecture courses Was very successful (e.g., inspired the Alpha ISA)
Interlocks (“without interlocks”) Mechanisms to allow stages to indicate they are busy E.g., ‘divide’ & ‘multiply’ required interlocks
=> paused other stages upstream With MIPS, every sub-phase of all instructions fits into 1 cycle No die area wasted on pausing mechanisms => faster cycle
Computer Architecture 2014– Pipeline9
Pipeline: principles Ideal speedup = num of pipeline stages
An instruction finishes every clock cycle Namely, IPC of an ideal pipelined machine is 1
Increase throughput rather than reduce latency One instruction still takes the same (or longer)
Since max speedup = num of stages &Latency determined by slowest stage, should:Þ Partition pipe to many stagesÞ Balance work across stagesÞ Shorten longest stage as much as possible
Computer Architecture 2014– Pipeline10
Pipeline: overheads & limitations Can increase per-instruction latency
Due to stages imbalance
Requires more logic than serial execution
Time to “fill” pipe reduces speedupTime to “drain” pipe reduces speedup E.g., upon interrupt or context switch
Stalls when there are dependencies
Computer Architecture 2014– Pipeline11
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipelined CPU
Computer Architecture 2014– Pipeline12
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: fetchbring next instructionfrom memory; 4 bytes(32 bit) per instruction
Instruction saved inregister, in preparationof next pipe stage
when not branching,next instruction is innext word
Computer Architecture 2014– Pipeline13
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: decode + regs fetch• decode source reg numbers• read their values from reg file• reg IDs are 5 bits (2^5 = 32)
Computer Architecture 2014– Pipeline14
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: decode + regs fetchdecode & sign-extend immediate (from 16 bit to 32)
Computer Architecture 2014– Pipeline15
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: decode + regs fetchdecode destination reg (can be one of two, depending on op) & save in register for next stage…
Computer Architecture 2014– Pipeline16
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: decode + regs fetchdecode destination reg (can be one of two, depending on op) & save in latch for next stage…
…based on the op type, next phase will determine, which reg of the two is the destination
Computer Architecture 2014– Pipeline17
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: execute
ALU computes – “R” operation(the “shift” field is missing from this illustration)
reg1
reg2
func(6bit)
to reg3
Computer Architecture 2014– Pipeline18
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: execute
ALU computes – “I” operation(not branch & not load/store)
reg1
imm
opcode
to reg2
Computer Architecture 2014– Pipeline19
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: execute
ALU computes – “I” operationconditional branch BEQ or BNE[ if (reg1==reg2) pc = pc+4 + (imm<<2) ]
reg1
imm
opcode
reg2
Bra
nch?
Computer Architecture 2014– Pipeline20
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: execute
ALU computes – “I” operationload (store is similar)( reg2 = mem[reg1+imm] )
reg1
imm to reg2
Computer Architecture 2014– Pipeline21
ALUSrc
6
ALUresult
zero
AddShift left 2
ALUctrl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
signexten
sio
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
instructionmemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipeline: updating PCno branch:just add 4 to PC
unconditional branch:add immediate to PC+4(type J operation)
conditional branch:depends on resultof ALU
Computer Architecture 2014– Pipeline22
Pipelined CPU with Control
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EXEX/MEM
MEM/WBIn
stru
ctio
n
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
Con
trol
WBMEM
WBMEM
WB
EXE
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
Computer Architecture 2014– Pipeline23
Pipeline Example: cycle 1
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction lwPC4
4
Computer Architecture 2014– Pipeline24
Pipeline Example: cycle 2
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC8
8 4
sub[R1]
910
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Architecture 2014– Pipeline25
Pipeline Example: cycle 3
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC12
12 8
and[R2]
[R1]+9
sub4
[R3]
1011 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Architecture 2014– Pipeline26
Pipeline Example: cycle 4
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC16
12 8
or[R4]
Data frommemoryaddress[R1]+9
sub4
[R5]
1112
and16
10
[R2]-[R3]
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Architecture 2014– Pipeline27
Pipeline Hazards:
1. Structural Hazards
Computer Architecture 2014– Pipeline28
Structural Hazard Two instructions attempt to use same resource simultaneously
Problem: register-file accessed in 2 stages Write during stage 5 (WB) Read during stage 2 (ID)=> Resource (RF) conflict
Solution Split stage into two sub-stages Do write in first half Do reads in second half 2 read ports, 1 write port (separate)
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
Computer Architecture 2014– Pipeline29
Structural Hazard Problem: memory accessed in 2 stages
Fetch (stage 1), when reading instructions from memory
Memory (stage 4), when datais read/written from/tomemory
Princeton architecture
Solution Split data/inst. Memories
Harvard architecture Today, separate instruction cache and data cache
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
Computer Architecture 2014– Pipeline30
Pipeline Hazards:
2. Data Hazards
Computer Architecture 2014– Pipeline31
When two instructions access the same register
RAW: Read-After-Write True dependency
WAR: Write-After-Read Anti-dependency
WAW: Wrtie-After-Write False-dependency
Key problem with regular in-order pipelines is RAW We will also learn about out-of-order pipelines
Data Dependencies
Computer Architecture 2014– Pipeline32
Problem with starting next instruction before first is finished dependencies that “go backward in time” are data
hazards
Data Dependencies
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
sub R2, R1, R3
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 0–20Value of R2 10 10 10 -20 -20 -20 -20
Computer Architecture 2014– Pipeline33
IM bubble bubble bubble bubble
IM bubble bubble bubble bubble
IM bubble bubble bubble bubble
RAW Hazard: HW Solution 1 - Add Stalls
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 0–20Value of R2
DM Reg
IM DM RegReg
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
sub R2, R1, R3
stall
stall
stall
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
10 10 10 -20 -20 -20 -20
• Let the hardware detect hazard and add stalls if needed
Problem: slow! Solution: forwarding whenever possible
Computer Architecture 2014– Pipeline34
Use temporary results, don’t wait for them to be written to the register file register file forwarding to handle read/write to same register ALU forwarding
RAW Hazard: HW Solution 2 - Forwarding
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
X X X –20 X X X X XX X X X –20 X X X X
DM
sub R2, R1, R3
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 0–20Value of R2 10 10 10 -20 -20 -20 -20
Value EX/MEMValue MEM/WB
Computer Architecture 2014– Pipeline35
Forwarding Hardware
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0muxB
1
2
0muxA
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
EX/M
EM.R
egW
rite
MEM
/WB
.Reg
Writ
e
Computer Architecture 2014– Pipeline36
Forwarding Hardware
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0muxB
1
2
0muxA
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
EX/M
EM.R
egW
rite
MEM
/WB
.Reg
Writ
e
• Added 2 mux units before ALU• Each mux gets 3 inputs, from:
1. Prev stage (ID/EX)2. Next stage (EX/MEM)3. The one after (MEM/WB)
• Forward unit tells the 2 mux units which input to use
Computer Architecture 2014– Pipeline37
Forwarding ControlEX Hazard:
A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1
B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1
MEM Hazard: if (not A and MEM/WB.RegWrite
(MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2 if (not B and MEM/WB.RegWrite and
(MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2
Computer Architecture 2014– Pipeline38
Forwarding ControlEX Hazard:
A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1
B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1
MEM Hazard: if (not A and MEM/WB.RegWrite
(MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2 if (not B and MEM/WB.RegWrite and
(MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2
If, in memory stage, we’re writing the output to a register
And the reg we’re writing to also happens to be inp_reg1 for the execute stage
Then mux_A should select inp_1,namely, the ALU should feed itself
Computer Architecture 2014– Pipeline39
Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2
lw R11,9(R1) sub R10,R2, R3and R12,R10,R11
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
lw[R10]
Data frommemoryaddress[R1]+9
sub
[R11]
1012
and
11
[R2]-[R3]
1110
load op => read from “1”
Computer Architecture 2014– Pipeline40
Forwarding Hardware Example #2: Bypassing From WB to Src2
sub R10,R2, R3xxxand R12,R10,R11
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
sub[R11]
xxx
[R10]
12
and
10
[R2]-[R3]
1110
not load op => read from “0”
Computer Architecture 2014– Pipeline42
“load” op can cause “un-forwardable” hazards load value to R In the next instruction, use R as input
Þ A bigger problem in longer pipelines
Can't always forward (stall inevitable)
Reg
IM
Reg
Reg
IM
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 9Programexecutionorder
lw R2, 30(R1)
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Computer Architecture 2014– Pipeline44
Hazard Detection (Stall) Logicif ( (ID/EX.RegWrite) and (ID/EX.opcode == lw) and ( (ID/EX.WriteReg == IF/ID.ReadReg1) or
(ID/EX.WriteReg == IF/ID.ReadReg2) )then stall IF/ID
Computer Architecture 2014– Pipeline45
Forwarding + Hazard Detection Unit
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
HazardDetection
Unit:Scoreboard
0
ID/EX.MemRead
PC W
rite
0mux
1
IF/ID
Writ
e
ID/EX.Rt
Computer Architecture 2014– Pipeline46
Example: code for (assume all variables are in memory):a = b + c;d = e – f;
Slow codeLW Rb,bLW Rc,c
Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f
Stall SUB Rd,Re,RfSW d,Rd
Instruction order can be changed as long as correctness is kept (no dependencies violated)
Compiler scheduling helps avoid load hazards (when possible)
Fast codeLW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd
Computer Architecture 2014– Pipeline47
Pipeline Hazards:
3. Control Hazards
Computer Architecture 2014– Pipeline48
Branch, but where?
The decision to branch happens deep within the pipeline Likewise, the target of the branch becomes known deep
within the pipeline How does this effect the pipeline logic? For example…
Computer Architecture 2014– Pipeline49
Executing a BEQ Instruction (i)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
R4R5 -27
812
12
beqand
Assume this program state
Computer Architecture 2014– Pipeline50
Executing a BEQ Instruction (i)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
R4R5 -27
812
12
beqand
We know:• Values of registersWe don’t know:• If branch will be taken• What is its target
Computer Architecture 2014– Pipeline51
Executing a BEQ Instruction (ii)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction R4-R5=0-
8+SignExt(27)*4
1216
16
beqandsw
…Now we know, but only in next cycle will
this effect PC
Calculate branch condition = compute R4-R5 & compare to 0
Calculate branch target
Computer Architecture 2014– Pipeline52
Executing a BEQ Instruction (iii)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
8+SignExt(27)*4
16
20 or 116
beqandsw
20
sub
Finally, if taken, branch sets the PC
Computer Architecture 2014– Pipeline53
Control Hazard on Branches
And
Beq
sub
sw
Inst from target
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
Outcome: The 3 instructions following the branch are in the pipeline even if branch is taken!
Computer Architecture 2014– Pipeline54
Traps, Exceptions and Interrupts
Indication of events that require a higher authority to intervene (i.e. the operating system)
Atomically changes the protection mode and branches to OS Protection mode determines what the running is allowed to do (access
devices, memory, etc).
Traps: Synchronous The program asks for OS services (e.g. access a device)
Exceptions: Synchronous The program did something bad (divide-by-zero; prot. violation)
Interrupts: Asynchronous An external device needs OS attention (finished an operation)
Can these be handled like regular branches?
Computer Architecture 2014– Pipeline55
Stall
Easiest solution: Stall pipe when branch encountered until resolved
But there’s a prices. Assume: CPI = 1 20% of instructions are branches (realistic) Stall 3 cycles on every branch (extra 3 cycles for each branch)
Then the price is: CPI new = 1 + 0.2 × 3 = 1.6 // 1 = all instr., including branch [ CPI new = CPI Ideal + avg. stall cycles / instr. ]
Namely: We lose 60% of the performance!
Computer Architecture 2014– Pipeline56
Branch Prediction and Speculative Execution
Computer Architecture 2014– Pipeline57
Static prediction: branch not taken
Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid
If branch actually taken Flush the fall-through path instructions before they change the
machine state (memory / registers) Fetch the instructions from the correct (taken) path
Assuming ~50% branches not taken on average CPI new = 1 + (0.2 × 0.5) × 3 = 1.3 30% slowdown instead of 60% What happens in longer pipelines?
Computer Architecture 2014– Pipeline58
Dynamic branch prediction
Branch prediction is a key impediment to performance Modern processors employ complex branch predictors Often achieve < 3% misprediction rate
Given an instruction, we need to predict Is it a branch? Branch taken? Target address?
To avoid stalling Prediction needed at end of ‘fetch’ Before we even now what’s the instruction…
A simple mechanism: Branch Target Buffer (BTB)
Computer Architecture 2014– Pipeline59
BTB – the idea
fast lookup table
PC of fetched instruction
?= Predicted branchtaken or not taken?(last few times)
No => we don’t know,so we don’t predict
Yes => instructionis a branch, so let’spredict it
Branch PC Target PC History
Predicted Target
(Works in a straightforward manner only for direct branches, otherwise target PC changes)
Computer Architecture 2014– Pipeline60
How it works in a nutshell
Until proven otherwise, assume branches are not taken Fall through instructions (assume branch has no effect)
Upon the first time a branch is taken Pay the price (in terms of stalls), but Save the details of the branch in the BTB
(= PC, target PC, and whether or not branch was taken)
While fetching, HW checks in parallel Whether PC is in BTB
If found, make a prediction Taken? Address?
Upon misprediction Flush (throw out) pipeline content & start over from right PC
Computer Architecture 2014– Pipeline61
Prediction steps
1. Allocate Insert instruction to BTB once identified as taken branch Do not insert not-taken branches
Implicitly predict they’d continue not to be taken Insert both conditional & unconditional
To identify, and to save arithmetic2. Predict
BTB lookup done in parallel to PC-lookup, providing: Indication whether PC is a branch (=> BTB “hit”) Branch target Branch direction (forward or backward in program) Branch type (conditional or not)
3. Update (when branch taken & its outcome becomes known) Branch target, history (taken or not)
Computer Architecture 2014– Pipeline62
Misprediction
Occurs when Predict = not taken, reality = taken Predict = taken, reality = not taken Branch taken as predicted, but wrong target (indirect, as in
the jmp register)
Must flush pipeline Reset pipeline registers (similar to turning all into NOPs)
Commonly, other flush methods are easier to implement Set the PC to the correct path Start fetching instruction from correct path
Computer Architecture 2014– Pipeline63
CPI
Assuming a fraction of p correct predictions CPI_new = 1 + (0.2 × (1-p)) × 3
Example, p=0.7 CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18
Example, p=0.98 CPI_new = 1 + (0.2 × 0.02) × 3 = 1.012 (But this is a simplistic model; in reality the price can
sometimes be much higher.)
Computer Architecture 2014– Pipeline64
History & prediction algorithm
“Always backward” prediction Works for long loops
Some branches exhibit “locality” Typically behave as the last time they were invoked Typically depend on their previous outcome (& it alone)
Can save a history window What happened last time, and before that, and before… The bigger the window, the greater the complexity
Some branches regularly alternate between taken & untaken Taken, then untaken, then taken, … Need only one history bit to identify this
Some branches are correlated with previous branches Those that lead to them
Computer Architecture 2014– Pipeline65
Adding a BTB to the Pipeline
Shift left 2
ALUSrc
6
ALUresultZero
+
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EXEX/MEM MEM
/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4+
IF/ID
PC
0
1
mux
0
1
mux
0mux
1
0
mux
Inst.Memory
Address
Instruction
BTB
12
pred targetpred dir
PC+4 (Not-taken target)taken target
3
Mis-predict
DetectionUnit
Flush
predicted targetPC+4 (Not-taken target)
predicted direction
−4
address
targetdirection
alloc/updt
Computer Architecture 2014– Pipeline66
Using The BTBPC moves to next instruction
Inst Mem gets PCand fetches new inst
BTB gets PCand looks it up
IF/ID latch loadedwith new inst
BTB Hit ?
Br taken ?
PC PC + 4PC pred addr
IF
IDIF/ID latch loaded
with pred instIF/ID latch loaded
with seq. instBranch ?
yes no
noyes
noyesEXE
Computer Architecture 2014– Pipeline67
Using The BTB (cont.)ID
EXE
MEM
WB
Branch ?
Calculate brcond & trgt
Flush pipe &update PC
Corect pred ?
yes no
IF/ID latch loadedwith correct inst
continue
Update BTB
yes no
continue
Computer Architecture 2014– Pipeline68
Prediction algorithm
Can do an entire course on this issue Still actively researched
As noted, modern predictors can often achieve misprediction < 2%
Still, it has been shown that these 2% can sometimes significantly worsen performance A real problem in out-of-order pipelines
We did not talk about the issue of indirect branches As in virtual function calls (object oriented) Where the branch target is written in memory, elsewhere