TRANSCRIPT
Computer Architectures
DLX ISA: Pipelined Implementation
The Pipelining Principle
Pipelining is nowadays the main basic technique deployed to "speed up" a CPU.
The key idea behind pipelining is general, and is currently applied in several industry fields (production lines, oil pipelines, …).
A system S has to execute a task A, N times:
A1, A2, A3, …, AN -> S -> R1, R2, R3, …, RN
Latency: the time occurring between the beginning and the end of task A (TA).
Throughput: the rate at which tasks are completed.
The Pipelining Principle
1) Sequential System
[Timing diagram: tasks A1, A2, A3, …, AN are executed one after the other, each taking TA]
Latency (execution time of a single task) = TA
Throughput(1) = 1 / TA
2) Pipelined System
[Diagram: the system S is split into four pipeline stages S1, S2, S3, S4; Si: pipeline stage]
The Pipelining Principle
[Timing diagram: tasks A1 … An flow through the stages S1-S4; once the pipeline is full, one task completes every TP]
Latency(2) = 4 * TP = TA
Throughput(2) = 1 / TP = 4 / TA = 4 * Throughput(1)
TP: pipeline cycle
The Pipelining Principle (2)
• Pipelining does not decrease the amount of time needed for carrying out each single task:
Latency(2) = Latency(1)
• Pipelining, instead, increases the Throughput by a factor K equal to the number of stages of the pipeline:
Throughput(2) = K * Throughput(1)
• This yields a reduction, by the same factor K, of the total execution time of a sequence of N tasks (TN):
TN = N / Throughput
TN(1) = N / Throughput(1),  TN(2) = N / Throughput(2)
Speedup(2 vs 1) = TN(1) / TN(2) = Throughput(2) / Throughput(1) = K
The Pipelining Principle (2)
• Ideal case (perfectly balanced pipeline): TP = TPi = TA / K  ->  Speedup = K
• Real case ((slightly) unbalanced pipeline): TP = max(TP1, TP2, …, TPK)  ->  Speedup < K
• Example:
TA = 20 t (t: time unit)
TP1 = 5t, TP2 = 5t, TP3 = 6t, TP4 = 4t  ->  TP = 6t
Speedup(2 vs 1) = TA / TP = 20t / 6t ≈ 3.33 (< 4)
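The speedup arithmetic above is easy to check in a few lines of code. This is an illustrative sketch (the function name is ours, not from the slides), using the stage delays of the example:

```python
# Speedup of a pipelined system vs. a sequential one.
# The sequential latency TA is the sum of the stage delays;
# the pipeline cycle TP is the delay of the slowest stage.

def pipeline_speedup(stage_delays):
    ta = sum(stage_delays)      # TA: latency of one task
    tp = max(stage_delays)      # TP: pipeline cycle
    return ta / tp

balanced = [5, 5, 5, 5]         # perfectly balanced: speedup == K == 4
unbalanced = [5, 5, 6, 4]       # the slide's example: speedup < 4

print(pipeline_speedup(balanced))    # 4.0
print(pipeline_speedup(unbalanced))  # ~3.33
```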
Pipelining in a CPU (DLX)
Tasks: A1, A2, A3, …, AN  ->  Instructions: I1, I2, I3, …, IN
[Diagram: the CPU datapath is split into the five stages IF, ID, EX, MEM, WB, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
CPI = 1 (ideally!)
Pipeline Cycle = Clock Cycle = delay of the slowest stage
Each stage is a combinatorial circuit; the stages are separated by registers (pipeline registers, D flip-flops).
N.B. this architecture is COMPLETELY different from the sequential one.
Pipeline in the DLX
[Diagram: instructions i, i+1, …, i+4 are overlapped, each shifted by one clock cycle through IF, ID, EX, MEM, WB]
CPI (ideally) = 1
Clock cycle: Tclk = Td + TP + Tsu, where
Td: delay of the input stage register
TP: delay of the slowest combinatorial stage
Tsu: set-up time of the output stage register
Td and Tsu are the overhead introduced by the pipeline registers.
Requirements for implementation of the pipeline
Each stage has to be active during each clock cycle.
The PC has to be incremented in the IF stage (instead of ID): an adder has to be introduced in the IF stage (PC <- PC + 4, i.e. PC <- PC + 1 in word terms). Since instructions are aligned, a 30-bit register (counter) is incremented each clock cycle (the 2 least-significant bits are always 0).
Two MDRs are required (referred to as LMDR and SMDR) to handle the situation where a LOAD is immediately followed by a STORE (WB-MEM overlapping): two data items waiting to be written (one to memory, the other to the RF) would otherwise overlap.
At every clock cycle, it has to be possible to execute 2 memory accesses (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a "Harvard" architecture.
The CPU clock is determined by the slowest stage: IM and DM have to be cache memories (on-chip).
Pipeline registers store both data and control information (the Control Unit is "distributed" among the pipeline stages).
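The word-aligned PC update described above can be sketched as follows (a minimal illustration with names of our choosing, not the actual hardware):

```python
# The IF-stage PC is a 30-bit word counter: since instructions are aligned,
# the two least-significant (byte) bits of the address are always 0.

WORD_BITS = 30

def next_pc_word(pc_word):
    """PC <- PC + 1 in word terms (PC <- PC + 4 in byte terms)."""
    return (pc_word + 1) % (1 << WORD_BITS)

def byte_address(pc_word):
    """Full 32-bit byte address: word counter ## 00."""
    return pc_word << 2

pc = 0x100                        # current word-counter value
pc = next_pc_word(pc)
print(hex(byte_address(pc)))      # 0x404
```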
DLX Pipelined Datapath
[Datapath diagram across IF, ID, EX, MEM, WB: PC, Instruction Memory, decoder (DEC), Register File (RF, ports RS1, RS2, RD, D), Sign Extension (SE), ALU with input MUXes, "=0?" tests, Data Memory, and the "+4" adder with its MUX, separated by the IF/ID, ID/EX, EX/MEM, MEM/WB registers]
Annotations on the diagram:
• SE: sign extension, for operations with immediates
• The PC is actually a counter, since its two least-significant bits are always 0
• The number of the destination register travels with the instruction (LOAD and ALU instructions), together with the data to be written back
• JL and JLR: the PC is stored in R31
• The PC is carried along the pipeline for computing the new PC when branching
• The PC MUX is driven if jumping
• "=0?": for SCn (also <0 and >0) it acts on the ALU output; a second "=0?" test is used for Branch
ID stage
[Diagram of the ID stage, between the IF/ID and ID/EX registers: the IR and PC coming from IF/ID feed the Register File, the decoder and the sign extension; the outputs A, B, PC2, IR2 are latched into ID/EX. Info travels with the instruction.]
• IR25-21 -> RS1, IR20-16 -> RS2: source register numbers; the RF outputs the 32-bit values A and B
• IR15-0 (16 bits): Offset / Immediate / JR / Branch / Load destination register; sign-extended (SE) to 32 bits
• IR25-0 (26 bits): J and JL
• IR31-26 (6 bits): Opcode, decoded by DEC; IR10-00: function field of R instructions
• IR15 and IR25: sign bits of the 16-bit and 26-bit fields used by the sign extension
• RD, D: number of the destination register and data, coming from the WB stage
• PC31-0 (32 bits): carried along (JAL, JALR)
DLX Pipelined Datapath
[Full datapath diagram across IF, ID, EX, MEM, WB with the IF/ID, ID/EX, EX/MEM, MEM/WB registers and the latches PC1-PC4, IR1-IR4, A, B, COND, X, Y, SMDR, LMDR; the Data Memory receives Address (X) and Data (SMDR); the number of the destination register travels down the pipeline]
Legend:
• SMDR: Store Memory Data Register
• LMDR: Load Memory Data Register
• IRi: Instruction Register of stage i
• X: ALU output, or DMAR, or Branch Target Address
• Y: data computed by the previous stages
• "=0?": for SCn (also <0 and >0) it acts on the ALU output; a second "=0?" test is used for Branch
• JL, JLR: the PC is stored in R31
Pipelined execution of an "ALU" instruction
X: "ALUOUTPUT" (in EX/MEM), Y: "ALUOUTPUT1"
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- A op B, or X <- A op (IR2_15)^16 ## IR2_15..0   [PC3 <- PC2; IR3 <- IR2]
MEM: Y <- X (temporary storage, waiting for WB)   [PC4 <- PC3; IR4 <- IR3]
WB:  RD <- Y
The decoded opcode is carried along all the stages.
NOTE: for these instructions, RS2/RD need to be carried along the pipeline up to the WB stage.
NOTE: the IRi bits that are no longer needed are dropped in the successive stages; from one stage to the next, only the bits that are still needed are kept.
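The notation (IR2_15)^16 ## IR2_15..0 used above means: replicate bit 15 of the 16-bit immediate sixteen times and concatenate the result with bits 15..0. A direct translation in code (illustrative helper, not DLX hardware):

```python
# 16-bit sign extension: (IR_15)^16 ## IR_15..0, i.e. replicate the sign
# bit of the immediate sixteen times in front of the low 16 bits.

def sign_extend_16(imm16):
    """Sign-extend a 16-bit immediate to a 32-bit unsigned value."""
    if imm16 & 0x8000:                  # bit 15 set: negative immediate
        return imm16 | 0xFFFF0000       # prepend sixteen 1s
    return imm16                        # prepend sixteen 0s (no change)

print(hex(sign_extend_16(0x7FFF)))  # 0x7fff
print(hex(sign_extend_16(0x8000)))  # 0xffff8000
```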
Pipelined execution of a "MEM" instruction
X: "DMAR" (Data Memory Address Register)
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- A op (IR2_15)^16 ## IR2_15..0; SMDR <- B   [PC3 <- PC2; IR3 <- IR2]
MEM: LMDR <- M[X] (LOAD), or M[X] <- SMDR (STORE)   [PC4 <- PC3; IR4 <- IR3]
WB:  RD <- LMDR (LOAD) [sign ext.]
The decoded opcode is carried along all the stages.
Pipelined execution of a "BRANCH" instruction (normally after an SCn instruction)
X: "BTA" (Branch Target Address)
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- PC2 op (IR_15)^16 ## IR_15..0; Cond <- A op 0   [PC3 <- PC2; IR3 <- IR2]
MEM: if (Cond) PC <- X   [PC4 <- PC3; IR4 <- IR3]
WB:  (NOP)
The decoded opcode is carried along all the stages.
The branch is performed on the current value of register A; if the branch is taken, the PC is overwritten in the MEM stage.
Pipelined execution of a "JR" instruction
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- A   [PC3 <- PC2; IR3 <- IR2]
MEM: PC <- X   [PC4 <- PC3; IR4 <- IR3]
WB:  (NOP)
The decoded opcode is carried along all the stages.
What would the stage sequence be for a J instruction?
Pipelined execution of a "JL or JLR" instruction
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- instruction decode
EX:  PC3 <- PC2; X <- A (if JLR), or X <- PC2 + (IR_25)^6 ## IR_25..0 (if JL)   [IR3 <- IR2]
MEM: PC <- X; PC4 <- PC3   [IR4 <- IR3]
WB:  R31 <- PC4
The decoded opcode is carried along all the stages. In this case the PCi values are used.
NOTE: writing R31 can NOT be done on-the-fly, since it could overlap with another register write operation.
What would the sequence be in case of SCn (e.g. SLT R1,R2,R3)?
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- instruction decode
EX:  ?
MEM: ?
WB:  ?
Pipeline hazards
• Structural Hazards: the same resource is needed by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
• Data Hazards: due to dependencies between instructions. For example, an instruction needs to read a register not yet written by a previous instruction (Read After Write - RAW).
• Control Hazards: the instructions that follow a branch depend on the branch outcome (taken/not taken).
A "hazard" occurs when, in a specific clock cycle, an instruction currently flowing through a pipeline stage cannot be executed in that clock cycle.
The instruction that cannot be executed has to be stopped ("pipeline stall" or "pipeline bubbling"), together with all the following instructions, while the previous instructions proceed normally (so as to eliminate the hazard).
Hazards and stalls
[Pipeline diagram over clock cycles 1-12: instructions Ii-3, Ii-2, Ii-1 proceed normally through IF ID EX MEM WB; Ii is held in IF for three stall cycles (S S S) and Ii+1 is held behind it]
The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (registers are read in the ID stage), it has to wait until after the WB of Ii-1.
Stall: the clock signal for Ii, Ii+1, … is stopped for three cycles.
For the 5 instructions of the example:
T5 = 8 * CLK = (5 + 3) * CLK = 5 * (1 + 3/5) * CLK
where 1 is the ideal CPI and 3/5 the stalls per instruction. In general:
TN = N * 1 * CLK (no stalls)
TN = N * (1 + S) * CLK, where (1 + S) is the effective CPI and S the average number of stalls per instruction.
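The TN formula above translates directly into code; a minimal sketch (function and parameter names are ours):

```python
# Total execution time with stalls: TN = N * (1 + S) * CLK, where the
# ideal CPI of the DLX pipeline is 1 and S is the average number of
# stall cycles per instruction.

def total_time(n_instr, stalls_per_instr, clk=1.0):
    effective_cpi = 1 + stalls_per_instr
    return n_instr * effective_cpi * clk

# The slide's example: 5 instructions, 3 stall cycles in total.
print(total_time(5, 3 / 5))   # ~8 clock cycles
```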
Forwarding
Forwarding allows eliminating almost all RAW hazards of the DLX pipeline without stalling it.
(NOTE: in the DLX, registers are modified only in WB.)
ADD R3, R1, R4
SUB R7, R3, R5    hazard
OR  R1, R3, R5    hazard
LW  R6, 100(R3)   hazard (here too the data is not yet in the RF, since R3 is written on the positive clock edge at the end of WB, while register values are read in ID)
AND R9, R5, R3    no hazard
[Pipeline diagram over clock cycles 1-9: the result of ADD is forwarded from the EX/MEM and MEM/WB registers to the EX stage of the instructions that need R3]
Forwarding implementation
[Diagram: a Forwarding Unit (FU) receives RS1/RS2 and the opcode from ID/EX, RD1 (destination register) and the opcode from EX/MEM, RD2 and the opcode from MEM/WB; it drives the MUXes at the ALU inputs (A, B, Offset) so that the operands are taken from EX/MEM or MEM/WB instead of from the RF]
• The FU compares RS1 and RS2 with RD1 and RD2, and examines the opcodes.
• The MUX control can be "anticipated" and registered on ID/EX: the IF/ID opcode and the comparison of RD with RS1 and RS2 (in IF/ID) are used.
• Write-before-read is often performed inside the RF; alternatively, split-cycle (see next).
Forwarding Unit
• Within the Forwarding Unit, the opcodes of the instructions in the EX, MEM and WB stages are decoded.
• If the instruction in the EX stage needs a register value (either A or B, i.e. an ALU instruction, NOT a J or Branch instruction), the opcodes of the instructions in the MEM and WB stages are examined. If they require a register update, the number of the involved register is compared with the register numbers of the instruction in the EX stage. If there is a match, the corresponding data is forwarded to the EX stage, replacing the data read from the register file.
• The bypass MUXes (at the inputs of the ID/EX barrier) are needed because a fetched instruction can require the contents of registers whose numbers match that of the instruction in the WB stage (if it must store a register value). In this case the data must be read from the MEM/WB barrier instead of from the register file.
• Alternatively, split-cycle: the register is written in the first half-period of the clock and read in the second half-period.
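The comparisons performed by the Forwarding Unit can be sketched in code. This is an illustrative model (signal names are ours, not the actual control signals):

```python
# For each ALU source of the instruction in EX, the Forwarding Unit
# compares the source register number with the destination register of
# the instruction in MEM (RD1) and in WB (RD2); on a match, the ALU-input
# MUX selects the forwarded value instead of the one read from the RF.

def forward_select(src, rd_mem, mem_writes, rd_wb, wb_writes):
    """Return which value the ALU-input MUX should pick for one source."""
    if mem_writes and rd_mem == src and src != 0:   # R0 is hardwired to 0
        return "EX/MEM"        # forward from MEM (the most recent result)
    if wb_writes and rd_wb == src and src != 0:
        return "MEM/WB"        # forward from WB
    return "RF"                # use the value read from the register file

# ADD R3,R1,R4 ; SUB R7,R3,R5 -> RS1 of SUB matches RD in MEM: forward.
print(forward_select(3, rd_mem=3, mem_writes=True, rd_wb=0, wb_writes=False))
# EX/MEM
```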
Data hazard due to LOAD instructions
LW  R1,32(R6)
ADD R4,R1,R7
SUB R5,R1,R8
AND R6,R1,R7
NOTE: the datum required by the ADD is available only at the end of the MEM stage. The hazard cannot be eliminated by means of forwarding (unless there is an additional input in the MUXes between memory and ALU and everything is done in the same clock cycle: this adds delay, with a memory access, already slow by itself, in the path!).
The pipeline needs to be stalled:
[Pipeline diagram: LW proceeds IF ID EX MEM WB; ADD is stalled one cycle (S), and the stall propagates to SUB and AND; from the end of the MEM stage of LW onwards, standard MEM->EX forwarding applies]
As a matter of fact, the clock signal is not generated; the clock block is propagated along the pipeline one stage at a time.
Delayed load
In many RISC CPUs, the hazard associated with the LOAD instruction is not handled by the hardware through pipeline stalling; instead, it is handled in software by the compiler (delayed load):
LOAD instruction
delay slot
next instruction
The compiler tries to fill in the delay slot with a "useful" instruction (worst case: a NOP). For example, an independent LOAD is moved into the slot:
Original:           Compiled:
LW  R1,32(R6)       LW  R1,32(R6)
LW  R3,10(R4)       LW  R3,10(R4)
ADD R5,R1,R3        LW  R6,20(R7)
LW  R6,20(R7)       ADD R5,R1,R3
LW  R8,40(R9)       LW  R8,40(R9)
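The compiler's legality check for filling the slot can be sketched as a toy model (the (dest, src1, src2) tuple encoding is our illustrative choice):

```python
# A candidate instruction may be hoisted into the load delay slot only if
# it neither reads nor writes the register being loaded.

def can_fill_delay_slot(loaded_reg, candidate):
    dest, src1, src2 = candidate
    return loaded_reg not in (src1, src2) and dest != loaded_reg

# LW R1,32(R6) followed by ADD R5,R1,R3: ADD reads R1, cannot fill the slot.
print(can_fill_delay_slot(1, (5, 1, 3)))     # False
# LW R6,20(R7) is independent of R1 and can be moved into the slot.
print(can_fill_delay_slot(1, (6, 7, None)))  # True
```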
Control Hazards
PC        BEQZ R4, 200
PC+4      SUB R7, R3, R5
PC+8      OR  R1, R3, R5
PC+12     LW  R6, 100(R8)
…
PC+4+200  AND R9, R5, R3   (BTA)
Next instruction address:
R4 = 0: Branch Target Address (taken)
R4 ≠ 0: PC + 4 (not taken)
[Pipeline diagram over clock cycles 1-8: BEQZ proceeds IF ID EX MEM WB; SUB, OR, LW enter the pipeline behind it. The new PC value is computed by the ALU in EX (ALUout), the new value is sampled by the PC one clock after, and only then does the fetch with the new PC start.]
DLX Pipelined Datapath (Branch or JMP)
[Datapath diagram for BEQZ R4, 200: the "=0?" test and the BTA computed by the ALU reach the PC MUX only from the EX/MEM register]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, the required stalls would obviously be 2, but at the price of a slower clock!
Handling the Control Hazards
• Always Stall (a three-clock block is propagated):
[Pipeline diagram: BEQZ R4,200 proceeds IF ID EX MEM WB; three stall cycles (S S S) follow, then the fetch at the new PC. In the real situation the IF is actually repeated (PC <- PC - 4), since the previously fetched instruction has not been decoded yet.]
Hyp.: Branch Freq. = 25 %  ->  CPI = (1 + S) = (1 + 3 * 0.25) = 1.75
• Predict Not Taken:
[Pipeline diagram: BEQZ R4,200 proceeds IF ID EX MEM WB; SUB, OR and LW are fetched as if the branch were not taken. At branch completion, if the branch is taken, the three instructions are flushed (they become NOPs) and the new value is sampled by the PC.]
No problem arises, since none of the flushed instructions has gone through WB!
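The CPI impact of the two policies above can be compared numerically. A sketch, assuming (as in the slide) a branch frequency of 25 % and a 3-cycle penalty; the 60 % taken fraction is an illustrative assumption, not from the slides:

```python
# CPI = 1 + branch_freq * penalty * fraction_paying, where fraction_paying
# is the fraction of branches that actually pay the penalty:
#   - always stall:        every branch pays (fraction = 1.0)
#   - predict not taken:   only taken branches pay the flush

def cpi(branch_freq, penalty, fraction_paying):
    return 1 + branch_freq * penalty * fraction_paying

print(cpi(0.25, 3, 1.0))   # always stall: 1.75
print(cpi(0.25, 3, 0.6))   # predict not taken, 60 % taken: 1.45
```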
Stalls with jumps (1/3)
[Datapath diagram: the jumping condition is resolved late in the pipeline; "forced NOP for jumping" inputs reach the IF/ID, ID/EX and EX/MEM registers]
On the first positive clock edge after sampling the assertion of the jumping condition, 3 NOPs must be inserted to replace the 3 unwanted instructions already present in the pipeline.
Stalls when jumping (2/3)
[Datapath diagram: "forced NOP when jumping" inputs reach the IF/ID and ID/EX registers]
On the first positive clock edge after sampling the assertion of the jumping condition, 2 NOPs must be inserted to replace the 2 unwanted instructions.
NOTE: in this case the jump condition and the new PC are sent to the PC MUX in the same clock cycle as the processing of the condition.
Stalls when jumping (3/3)
[Datapath diagram: a single "NOP when jumping" input reaches only the IF/ID register]
On the first positive clock edge after the assertion of the jumping condition, a single NOP is inserted to replace the instruction currently in the IF/ID stage.
NOTE: in this case the jumping condition and the new PC control the PC MUX in the same clock cycle as the processing of the condition.
Independent ALU for BRANCH/JMP
To reduce the number of stalls, an additional full adder computes the target address in the ID stage:
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- decode; ID/EX <- opcode ext.
     BTA <- PC1 + (IR_15)^16 ## IR_15-0 / (IR_25)^6 ## IR_25..0
     if Branch: if (RS1 op 0) PC <- BTA; if JMP: always PC <- BTA
EX, MEM, WB: -------------------------
(New fetch: only one stall.)
N.B. the full adder is separate from the "+4" adder (so the addition overlaps with the one required to compute the next instruction address!); otherwise the same adder would have to be used together with some multiplexers (to select whether to add 4 or the offset, and whether to use PC or PC1).
NOTE: here there is only one "stall", since the new value is inserted into the PC on the positive clock edge that ends the ID stage, while in the previous case it was inserted after the MEM stage, that is, two clocks later!!
BRANCH/JMP – 1 stall
[Datapath diagram of the IF and ID stages: a dedicated adder performs a standard addition of PC1 (the PC of the Branch instruction, which actually coincides with the current value in the PC and could be avoided) and the sign-extended branch displacement (Offset and sign extension, ##); the "= 0 ?" test on A is used for Branches; the new PC is selected by the MUX according to the opcode and the value of the branch test register]
NOTE: for the "Unconditional Jump" instructions there is an analogous situation: we only need to provide further inputs to the MUXes of the PC, taking into consideration either the RS1 register (JR and JLR) or the 26 less-significant bits of the IR with SE (J and JL) to be added to the current PC.
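The BTA computed by the dedicated ID-stage adder can be sketched as follows (a minimal illustration; names are ours):

```python
# BTA = PC1 + sign-extended 16-bit displacement, computed in ID by a full
# adder separate from the "+4" adder, so a taken branch costs one stall.

def sign_extend(value, bits):
    """Sign-extend a `bits`-wide field to a signed Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def branch_target(pc1, offset16):
    # (IR_15)^16 ## IR_15..0 added to PC1, wrapped to 32 bits
    return (pc1 + sign_extend(offset16, 16)) & 0xFFFFFFFF

print(hex(branch_target(0x1004, 0x0064)))  # 0x1068 (forward branch, +100)
print(hex(branch_target(0x1004, 0xFFFC)))  # 0x1000 (backward branch, -4)
```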
Delayed branch
Similarly to the LOAD case, in several RISC CPUs the hazard associated with BRANCH instructions is handled in software by the compiler (delayed branch):
BRANCH instruction
delay slot
delay slot
delay slot
next instruction
The compiler tries to fill in the delay slots with "useful" instructions (worst case: NOPs).
Delayed branch/jump
Original:                             Compiled:
Add R5, R4, R3                        Sne R1, R8, R9  ; branch condition
Sub R6, R5, R2                        Br  R1, +100
Or  R14, R6, R21                      Add R5, R4, R3
Sne R1, R8, R9  ; branch condition    Sub R6, R5, R2
Br  R1, +100                          Or  R14, R6, R21
The Add, Sub and Or instructions are executed in both cases. Obviously, in this group of instructions there must be no jumps!
Instead of one or more "postponed" instructions, the compiler inserts NOPs in case no suitable instructions are available.
Handling the Control Hazards
Dynamic Prediction: Branch Target Buffer -> no stall (almost…)
[Diagram: the PC is compared (=) with the TAGS of the buffer; each entry also holds a T/NT prediction bit and a Predicted PC]
HIT: fetch with the predicted PC
MISS: fetch with PC + 4
Correct prediction: no stall
Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)
N.B. here the branch slot is selected during the IF clock cycle that loads IR1 into IF/ID.
Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed.
When one outcome predominates, each occurrence of the opposite outcome causes two consecutive errors. Example with nested loops (Loop1 containing Loop2): when exiting Loop2, the prediction fails (the branch is predicted taken but is actually untaken); it then fails again, predicting untaken, when Loop2 is entered once more.
Hence, usually two bits are used for branch prediction:
[State diagram of the 2-bit predictor: two "predict taken" states and two "predict untaken" states; a TAKEN outcome moves the counter toward "predict taken", an UNTAKEN outcome toward "predict untaken", so two consecutive mispredictions are needed to change the prediction]
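The behaviour of the two schemes can be compared with a small simulation (a sketch, assuming the 2-bit counter starts in the strongest "taken" state):

```python
# 1-bit predictor: predict whatever happened the last time.
def run_1bit(outcomes, pred=True):
    errors = 0
    for taken in outcomes:
        if pred != taken:
            errors += 1
        pred = taken                      # remember the last outcome
    return errors

# 2-bit saturating counter: states 0-1 predict untaken, 2-3 predict taken.
def run_2bit(outcomes, state=3):
    errors = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            errors += 1
        # saturating increment/decrement toward the actual outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return errors

# Inner-loop branch: taken 4 times, loop exit (untaken), loop re-entered.
pattern = [True] * 4 + [False] + [True] * 4
print(run_1bit(pattern))   # 2 errors: at the exit AND at the re-entry
print(run_2bit(pattern))   # 1 error: only at the exit
```

This reproduces the point made above: with one bit, a single loop exit causes two consecutive mispredictions; the 2-bit counter absorbs it with one.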