TRANSCRIPT
Computer Architectures
DLX ISA: Pipelined Implementation
The Pipelining Principle
Pipelining is nowadays the main basic technique deployed to "speed up" a CPU.
The key idea behind pipelining is general, and is currently applied in several industry fields (production lines, oil pipelines, …).
A system S has to execute a task A, N times:
A1, A2, A3, …, AN -> S -> R1, R2, R3, …, RN
Latency: the time occurring between the beginning and the end of task A (TA).
Throughput: the rate at which tasks are completed.
The Pipelining Principle
1) Sequential System
[Timing diagram: tasks A1, A2, A3, …, AN are executed one after the other, each taking TA]
Latency (execution time of a single task) = TA
Throughput(1) = 1 / TA
2) Pipelined System
[Diagram: the system S is split into four pipeline stages S1, S2, S3, S4; Si: pipeline stage]
The Pipelining Principle
[Timing diagram: tasks A1 … An flow through the stages S1-S4; once the pipeline is full, one task completes every TP]
Latency(2) = 4 * TP = TA
Throughput(2) = 1 / TP = 4 / TA = 4 * Throughput(1)
TP: pipeline cycle
The Pipelining Principle (2)
• Pipelining does not decrease the amount of time needed for carrying out each single task:
Latency(2) = Latency(1)
• Pipelining, instead, increases the Throughput by a factor K equal to the number of stages of the pipeline:
Throughput(2) = K * Throughput(1)
• This yields a reduction, by the same factor K, of the total execution time of a sequence of N tasks (TN):
TN = N / Throughput
TN(1) = N / Throughput(1),  TN(2) = N / Throughput(2)
Speedup(2 vs 1) = TN(1) / TN(2) = Throughput(2) / Throughput(1) = K
The Pipelining Principle (2)
• Ideal case (perfectly balanced pipeline): TP = TPi = TA / K  ->  Speedup = K
• Real case ((slightly) unbalanced pipeline): TP = max(TP1, TP2, …, TPK)  ->  Speedup < K
• Example:
TA = 20 t (t: time unit)
TP1 = 5t, TP2 = 5t, TP3 = 6t, TP4 = 4t  ->  TP = 6t
Speedup(2 vs 1) = TA / TP = 20t / 6t ≈ 3.33 (< 4)
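The speedup arithmetic above is easy to check in a few lines of code. This is an illustrative sketch (the function name is ours, not from the slides), using the stage delays of the example:

```python
# Speedup of a pipelined system vs. a sequential one.
# The sequential latency TA is the sum of the stage delays;
# the pipeline cycle TP is the delay of the slowest stage.

def pipeline_speedup(stage_delays):
    ta = sum(stage_delays)      # TA: latency of one task
    tp = max(stage_delays)      # TP: pipeline cycle
    return ta / tp

balanced = [5, 5, 5, 5]         # perfectly balanced: speedup == K == 4
unbalanced = [5, 5, 6, 4]       # the slide's example: speedup < 4

print(pipeline_speedup(balanced))    # 4.0
print(pipeline_speedup(unbalanced))  # ~3.33
```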
Pipelining in a CPU (DLX)
Tasks: A1, A2, A3, …, AN  ->  Instructions: I1, I2, I3, …, IN
[Diagram: the CPU datapath is split into the five stages IF, ID, EX, MEM, WB, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
CPI = 1 (ideally!)
Pipeline Cycle = Clock Cycle = delay of the slowest stage
Each stage is a combinatorial circuit; the stages are separated by registers (pipeline registers, D flip-flops).
N.B. this architecture is COMPLETELY different from the sequential one.
Pipeline in the DLX
[Diagram: instructions i, i+1, …, i+4 are overlapped, each shifted by one clock cycle through IF, ID, EX, MEM, WB]
CPI (ideally) = 1
Clock cycle: Tclk = Td + TP + Tsu, where
Td: delay of the input stage register
TP: delay of the slowest combinatorial stage
Tsu: set-up time of the output stage register
Td and Tsu are the overhead introduced by the pipeline registers.
Requirements for implementation of the pipeline
Each stage has to be active during each clock cycle.
The PC has to be incremented in the IF stage (instead of ID): an adder has to be introduced in the IF stage (PC <- PC + 4, i.e. PC <- PC + 1 in word terms). Since instructions are aligned, a 30-bit register (counter) is incremented each clock cycle (the 2 least-significant bits are always 0).
Two MDRs are required (referred to as LMDR and SMDR) to handle the situation where a LOAD is immediately followed by a STORE (WB-MEM overlapping): two data items waiting to be written (one to memory, the other to the RF) would otherwise overlap.
At every clock cycle, it has to be possible to execute 2 memory accesses (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a "Harvard" architecture.
The CPU clock is determined by the slowest stage: IM and DM have to be cache memories (on-chip).
Pipeline registers store both data and control information (the Control Unit is "distributed" among the pipeline stages).
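The word-aligned PC update described above can be sketched as follows (a minimal illustration with names of our choosing, not the actual hardware):

```python
# The IF-stage PC is a 30-bit word counter: since instructions are aligned,
# the two least-significant (byte) bits of the address are always 0.

WORD_BITS = 30

def next_pc_word(pc_word):
    """PC <- PC + 1 in word terms (PC <- PC + 4 in byte terms)."""
    return (pc_word + 1) % (1 << WORD_BITS)

def byte_address(pc_word):
    """Full 32-bit byte address: word counter ## 00."""
    return pc_word << 2

pc = 0x100                        # current word-counter value
pc = next_pc_word(pc)
print(hex(byte_address(pc)))      # 0x404
```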
DLX Pipelined Datapath
[Datapath diagram across IF, ID, EX, MEM, WB: PC, Instruction Memory, decoder (DEC), Register File (RF, ports RS1, RS2, RD, D), Sign Extension (SE), ALU with input MUXes, "=0?" tests, Data Memory, and the "+4" adder with its MUX, separated by the IF/ID, ID/EX, EX/MEM, MEM/WB registers]
Annotations on the diagram:
• SE: sign extension, for operations with immediates
• The PC is actually a counter, since its two least-significant bits are always 0
• The number of the destination register travels with the instruction (LOAD and ALU instructions), together with the data to be written back
• JL and JLR: the PC is stored in R31
• The PC is carried along the pipeline for computing the new PC when branching
• The PC MUX is driven if jumping
• "=0?": for SCn (also <0 and >0) it acts on the ALU output; a second "=0?" test is used for Branch
ID stage
[Diagram of the ID stage, between the IF/ID and ID/EX registers: the IR and PC coming from IF/ID feed the Register File, the decoder and the sign extension; the outputs A, B, PC2, IR2 are latched into ID/EX. Info travels with the instruction.]
• IR25-21 -> RS1, IR20-16 -> RS2: source register numbers; the RF outputs the 32-bit values A and B
• IR15-0 (16 bits): Offset / Immediate / JR / Branch / Load destination register; sign-extended (SE) to 32 bits
• IR25-0 (26 bits): J and JL
• IR31-26 (6 bits): Opcode, decoded by DEC; IR10-00: function field of R instructions
• IR15 and IR25: sign bits of the 16-bit and 26-bit fields used by the sign extension
• RD, D: number of the destination register and data, coming from the WB stage
• PC31-0 (32 bits): carried along (JAL, JALR)
DLX Pipelined Datapath
[Full datapath diagram across IF, ID, EX, MEM, WB with the IF/ID, ID/EX, EX/MEM, MEM/WB registers and the latches PC1-PC4, IR1-IR4, A, B, COND, X, Y, SMDR, LMDR; the Data Memory receives Address (X) and Data (SMDR); the number of the destination register travels down the pipeline]
Legend:
• SMDR: Store Memory Data Register
• LMDR: Load Memory Data Register
• IRi: Instruction Register of stage i
• X: ALU output, or DMAR, or Branch Target Address
• Y: data computed by the previous stages
• "=0?": for SCn (also <0 and >0) it acts on the ALU output; a second "=0?" test is used for Branch
• JL, JLR: the PC is stored in R31
Pipelined execution of an "ALU" instruction
X: "ALUOUTPUT" (in EX/MEM), Y: "ALUOUTPUT1"
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- A op B, or X <- A op (IR2_15)^16 ## IR2_15..0   [PC3 <- PC2; IR3 <- IR2]
MEM: Y <- X (temporary storage, waiting for WB)   [PC4 <- PC3; IR4 <- IR3]
WB:  RD <- Y
The decoded opcode is carried along all the stages.
NOTE: for these instructions, RS2/RD need to be carried along the pipeline up to the WB stage.
NOTE: the IRi bits that are no longer needed are dropped in the successive stages; from one stage to the next, only the bits that are still needed are kept.
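The notation (IR2_15)^16 ## IR2_15..0 used above means: replicate bit 15 of the 16-bit immediate sixteen times and concatenate the result with bits 15..0. A direct translation in code (illustrative helper, not DLX hardware):

```python
# 16-bit sign extension: (IR_15)^16 ## IR_15..0, i.e. replicate the sign
# bit of the immediate sixteen times in front of the low 16 bits.

def sign_extend_16(imm16):
    """Sign-extend a 16-bit immediate to a 32-bit unsigned value."""
    if imm16 & 0x8000:                  # bit 15 set: negative immediate
        return imm16 | 0xFFFF0000       # prepend sixteen 1s
    return imm16                        # prepend sixteen 0s (no change)

print(hex(sign_extend_16(0x7FFF)))  # 0x7fff
print(hex(sign_extend_16(0x8000)))  # 0xffff8000
```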
Pipelined execution of a "MEM" instruction
X: "DMAR" (Data Memory Address Register)
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- A op (IR2_15)^16 ## IR2_15..0; SMDR <- B   [PC3 <- PC2; IR3 <- IR2]
MEM: LMDR <- M[X] (LOAD), or M[X] <- SMDR (STORE)   [PC4 <- PC3; IR4 <- IR3]
WB:  RD <- LMDR (LOAD) [sign ext.]
The decoded opcode is carried along all the stages.
Pipelined execution of a "BRANCH" instruction (normally after an SCn instruction)
X: "BTA" (Branch Target Address)
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- PC2 op (IR_15)^16 ## IR_15..0; Cond <- A op 0   [PC3 <- PC2; IR3 <- IR2]
MEM: if (Cond) PC <- X   [PC4 <- PC3; IR4 <- IR3]
WB:  (NOP)
The decoded opcode is carried along all the stages.
The branch is performed on the current value of register A; if the branch is taken, the PC is overwritten in the MEM stage.
Pipelined execution of a "JR" instruction
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX:  X <- A   [PC3 <- PC2; IR3 <- IR2]
MEM: PC <- X   [PC4 <- PC3; IR4 <- IR3]
WB:  (NOP)
The decoded opcode is carried along all the stages.
What would the stage sequence be for a J instruction?
Pipelined execution of a "JL or JLR" instruction
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- instruction decode
EX:  PC3 <- PC2; X <- A (if JLR), or X <- PC2 + (IR_25)^6 ## IR_25..0 (if JL)   [IR3 <- IR2]
MEM: PC <- X; PC4 <- PC3   [IR4 <- IR3]
WB:  R31 <- PC4
The decoded opcode is carried along all the stages. In this case the PCi values are used.
NOTE: writing R31 can NOT be done on-the-fly, since it could overlap with another register write operation.
What would the sequence be in case of SCn (e.g. SLT R1,R2,R3)?
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- instruction decode
EX:  ?
MEM: ?
WB:  ?
Pipeline hazards
• Structural Hazards: the same resource is needed by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
• Data Hazards: due to dependencies between instructions. For example, an instruction needs to read a register not yet written by a previous instruction (Read After Write - RAW).
• Control Hazards: the instructions that follow a branch depend on the branch outcome (taken/not taken).
A "hazard" occurs when, in a specific clock cycle, an instruction currently flowing through a pipeline stage cannot be executed in that clock cycle.
The instruction that cannot be executed has to be stopped ("pipeline stall" or "pipeline bubbling"), together with all the following instructions, while the previous instructions proceed normally (so as to eliminate the hazard).
Hazards and stalls
[Pipeline diagram over clock cycles 1-12: instructions Ii-3, Ii-2, Ii-1 proceed normally through IF ID EX MEM WB; Ii is held in IF for three stall cycles (S S S) and Ii+1 is held behind it]
The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (registers are read in the ID stage), it has to wait until after the WB of Ii-1.
Stall: the clock signal for Ii, Ii+1, … is stopped for three cycles.
For the 5 instructions of the example:
T5 = 8 * CLK = (5 + 3) * CLK = 5 * (1 + 3/5) * CLK
where 1 is the ideal CPI and 3/5 the stalls per instruction. In general:
TN = N * 1 * CLK (no stalls)
TN = N * (1 + S) * CLK, where (1 + S) is the effective CPI and S the average number of stalls per instruction.
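The TN formula above translates directly into code; a minimal sketch (function and parameter names are ours):

```python
# Total execution time with stalls: TN = N * (1 + S) * CLK, where the
# ideal CPI of the DLX pipeline is 1 and S is the average number of
# stall cycles per instruction.

def total_time(n_instr, stalls_per_instr, clk=1.0):
    effective_cpi = 1 + stalls_per_instr
    return n_instr * effective_cpi * clk

# The slide's example: 5 instructions, 3 stall cycles in total.
print(total_time(5, 3 / 5))   # ~8 clock cycles
```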
Forwarding
Forwarding allows eliminating almost all RAW hazards of the DLX pipeline without stalling it.
(NOTE: in the DLX, registers are modified only in WB.)
ADD R3, R1, R4
SUB R7, R3, R5    hazard
OR  R1, R3, R5    hazard
LW  R6, 100(R3)   hazard (here too the data is not yet in the RF, since R3 is written on the positive clock edge at the end of WB, while register values are read in ID)
AND R9, R5, R3    no hazard
[Pipeline diagram over clock cycles 1-9: the result of ADD is forwarded from the EX/MEM and MEM/WB registers to the EX stage of the instructions that need R3]
Forwarding implementation
[Diagram: a Forwarding Unit (FU) receives RS1/RS2 and the opcode from ID/EX, RD1 (destination register) and the opcode from EX/MEM, RD2 and the opcode from MEM/WB; it drives the MUXes at the ALU inputs (A, B, Offset) so that the operands are taken from EX/MEM or MEM/WB instead of from the RF]
• The FU compares RS1 and RS2 with RD1 and RD2, and examines the opcodes.
• The MUX control can be "anticipated" and registered on ID/EX: the IF/ID opcode and the comparison of RD with RS1 and RS2 (in IF/ID) are used.
• Write-before-read is often performed inside the RF; alternatively, split-cycle (see next).
Forwarding Unit
• Within the Forwarding Unit, the opcodes of the instructions in the EX, MEM and WB stages are decoded.
• If the instruction in the EX stage needs a register value (either A or B, i.e. an ALU instruction, NOT a J or Branch instruction), the opcodes of the instructions in the MEM and WB stages are examined. If they require a register update, the number of the involved register is compared with the register numbers of the instruction in the EX stage. If there is a match, the corresponding data is forwarded to the EX stage, replacing the data read from the register file.
• The bypass MUXes (at the inputs of the ID/EX barrier) are needed because a fetched instruction can require the contents of registers whose numbers match that of the instruction in the WB stage (if it must store a register value). In this case the data must be read from the MEM/WB barrier instead of from the register file.
• Alternatively, split-cycle: the register is written in the first half-period of the clock and read in the second half-period.
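The comparisons performed by the Forwarding Unit can be sketched in code. This is an illustrative model (signal names are ours, not the actual control signals):

```python
# For each ALU source of the instruction in EX, the Forwarding Unit
# compares the source register number with the destination register of
# the instruction in MEM (RD1) and in WB (RD2); on a match, the ALU-input
# MUX selects the forwarded value instead of the one read from the RF.

def forward_select(src, rd_mem, mem_writes, rd_wb, wb_writes):
    """Return which value the ALU-input MUX should pick for one source."""
    if mem_writes and rd_mem == src and src != 0:   # R0 is hardwired to 0
        return "EX/MEM"        # forward from MEM (the most recent result)
    if wb_writes and rd_wb == src and src != 0:
        return "MEM/WB"        # forward from WB
    return "RF"                # use the value read from the register file

# ADD R3,R1,R4 ; SUB R7,R3,R5 -> RS1 of SUB matches RD in MEM: forward.
print(forward_select(3, rd_mem=3, mem_writes=True, rd_wb=0, wb_writes=False))
# EX/MEM
```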
Data hazard due to LOAD instructions
LW  R1,32(R6)
ADD R4,R1,R7
SUB R5,R1,R8
AND R6,R1,R7
NOTE: the datum required by the ADD is available only at the end of the MEM stage. The hazard cannot be eliminated by means of forwarding (unless there is an additional input in the MUXes between memory and ALU and everything is done in the same clock cycle: this adds delay, with a memory access, already slow by itself, in the path!).
The pipeline needs to be stalled:
[Pipeline diagram: LW proceeds IF ID EX MEM WB; ADD is stalled one cycle (S), and the stall propagates to SUB and AND; from the end of the MEM stage of LW onwards, standard MEM->EX forwarding applies]
As a matter of fact, the clock signal is not generated; the clock block is propagated along the pipeline one stage at a time.
Delayed load
In many RISC CPUs, the hazard associated with the LOAD instruction is not handled by the hardware through pipeline stalling; instead, it is handled in software by the compiler (delayed load):
LOAD instruction
delay slot
next instruction
The compiler tries to fill in the delay slot with a "useful" instruction (worst case: a NOP). For example, an independent LOAD is moved into the slot:
Original:           Compiled:
LW  R1,32(R6)       LW  R1,32(R6)
LW  R3,10(R4)       LW  R3,10(R4)
ADD R5,R1,R3        LW  R6,20(R7)
LW  R6,20(R7)       ADD R5,R1,R3
LW  R8,40(R9)       LW  R8,40(R9)
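The compiler's legality check for filling the slot can be sketched as a toy model (the (dest, src1, src2) tuple encoding is our illustrative choice):

```python
# A candidate instruction may be hoisted into the load delay slot only if
# it neither reads nor writes the register being loaded.

def can_fill_delay_slot(loaded_reg, candidate):
    dest, src1, src2 = candidate
    return loaded_reg not in (src1, src2) and dest != loaded_reg

# LW R1,32(R6) followed by ADD R5,R1,R3: ADD reads R1, cannot fill the slot.
print(can_fill_delay_slot(1, (5, 1, 3)))     # False
# LW R6,20(R7) is independent of R1 and can be moved into the slot.
print(can_fill_delay_slot(1, (6, 7, None)))  # True
```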
Control Hazards
PC        BEQZ R4, 200
PC+4      SUB R7, R3, R5
PC+8      OR  R1, R3, R5
PC+12     LW  R6, 100(R8)
…
PC+4+200  AND R9, R5, R3   (BTA)
Next instruction address:
R4 = 0: Branch Target Address (taken)
R4 ≠ 0: PC + 4 (not taken)
[Pipeline diagram over clock cycles 1-8: BEQZ proceeds IF ID EX MEM WB; SUB, OR, LW enter the pipeline behind it. The new PC value is computed by the ALU in EX (ALUout), the new value is sampled by the PC one clock after, and only then does the fetch with the new PC start.]
DLX Pipelined Datapath (Branch or JMP)
[Datapath diagram for BEQZ R4, 200: the "=0?" test and the BTA computed by the ALU reach the PC MUX only from the EX/MEM register]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, the required stalls would obviously be 2, but at the price of a slower clock!
Handling the Control Hazards
• Always Stall (a three-clock block is propagated):
[Pipeline diagram: BEQZ R4,200 proceeds IF ID EX MEM WB; three stall cycles (S S S) follow, then the fetch at the new PC. In the real situation the IF is actually repeated (PC <- PC - 4), since the previously fetched instruction has not been decoded yet.]
Hyp.: Branch Freq. = 25 %  ->  CPI = (1 + S) = (1 + 3 * 0.25) = 1.75
• Predict Not Taken:
[Pipeline diagram: BEQZ R4,200 proceeds IF ID EX MEM WB; SUB, OR and LW are fetched as if the branch were not taken. At branch completion, if the branch is taken, the three instructions are flushed (they become NOPs) and the new value is sampled by the PC.]
No problem arises, since none of the flushed instructions has gone through WB!
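The CPI impact of the two policies above can be compared numerically. A sketch, assuming (as in the slide) a branch frequency of 25 % and a 3-cycle penalty; the 60 % taken fraction is an illustrative assumption, not from the slides:

```python
# CPI = 1 + branch_freq * penalty * fraction_paying, where fraction_paying
# is the fraction of branches that actually pay the penalty:
#   - always stall:        every branch pays (fraction = 1.0)
#   - predict not taken:   only taken branches pay the flush

def cpi(branch_freq, penalty, fraction_paying):
    return 1 + branch_freq * penalty * fraction_paying

print(cpi(0.25, 3, 1.0))   # always stall: 1.75
print(cpi(0.25, 3, 0.6))   # predict not taken, 60 % taken: 1.45
```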
Stalls with jumps (1/3)
[Datapath diagram: the jumping condition is resolved late in the pipeline; "forced NOP for jumping" inputs reach the IF/ID, ID/EX and EX/MEM registers]
On the first positive clock edge after sampling the assertion of the jumping condition, 3 NOPs must be inserted to replace the 3 unwanted instructions already present in the pipeline.
Stalls when jumping (2/3)
[Datapath diagram: "forced NOP when jumping" inputs reach the IF/ID and ID/EX registers]
On the first positive clock edge after sampling the assertion of the jumping condition, 2 NOPs must be inserted to replace the 2 unwanted instructions.
NOTE: in this case the jump condition and the new PC are sent to the PC MUX in the same clock cycle as the processing of the condition.
Stalls when jumping (3/3)
[Datapath diagram: a single "NOP when jumping" input reaches only the IF/ID register]
On the first positive clock edge after the assertion of the jumping condition, a single NOP is inserted to replace the instruction currently in the IF/ID stage.
NOTE: in this case the jumping condition and the new PC control the PC MUX in the same clock cycle as the processing of the condition.
Independent ALU for BRANCH/JMP
To reduce the number of stalls, an additional full adder computes the target address in the ID stage:
IF:  IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:  A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- decode; ID/EX <- opcode ext.
     BTA <- PC1 + (IR_15)^16 ## IR_15-0 / (IR_25)^6 ## IR_25..0
     if Branch: if (RS1 op 0) PC <- BTA; if JMP: always PC <- BTA
EX, MEM, WB: -------------------------
(New fetch: only one stall.)
N.B. the full adder is separate from the "+4" adder (so the addition overlaps with the one required to compute the next instruction address!); otherwise the same adder would have to be used together with some multiplexers (to select whether to add 4 or the offset, and whether to use PC or PC1).
NOTE: here there is only one "stall", since the new value is inserted into the PC on the positive clock edge that ends the ID stage, while in the previous case it was inserted after the MEM stage, that is, two clocks later!!
BRANCH/JMP – 1 stall
[Datapath diagram of the IF and ID stages: a dedicated adder performs a standard addition of PC1 (the PC of the Branch instruction, which actually coincides with the current value in the PC and could be avoided) and the sign-extended branch displacement (Offset and sign extension, ##); the "= 0 ?" test on A is used for Branches; the new PC is selected by the MUX according to the opcode and the value of the branch test register]
NOTE: for the "Unconditional Jump" instructions there is an analogous situation: we only need to provide further inputs to the MUXes of the PC, taking into consideration either the RS1 register (JR and JLR) or the 26 less-significant bits of the IR with SE (J and JL) to be added to the current PC.
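The BTA computed by the dedicated ID-stage adder can be sketched as follows (a minimal illustration; names are ours):

```python
# BTA = PC1 + sign-extended 16-bit displacement, computed in ID by a full
# adder separate from the "+4" adder, so a taken branch costs one stall.

def sign_extend(value, bits):
    """Sign-extend a `bits`-wide field to a signed Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def branch_target(pc1, offset16):
    # (IR_15)^16 ## IR_15..0 added to PC1, wrapped to 32 bits
    return (pc1 + sign_extend(offset16, 16)) & 0xFFFFFFFF

print(hex(branch_target(0x1004, 0x0064)))  # 0x1068 (forward branch, +100)
print(hex(branch_target(0x1004, 0xFFFC)))  # 0x1000 (backward branch, -4)
```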
Delayed branch
Similarly to the LOAD case, in several RISC CPUs the hazard associated with BRANCH instructions is handled in software by the compiler (delayed branch):
BRANCH instruction
delay slot
delay slot
delay slot
next instruction
The compiler tries to fill in the delay slots with "useful" instructions (worst case: NOPs).
Delayed branch/jump
Original:                             Compiled:
Add R5, R4, R3                        Sne R1, R8, R9  ; branch condition
Sub R6, R5, R2                        Br  R1, +100
Or  R14, R6, R21                      Add R5, R4, R3
Sne R1, R8, R9  ; branch condition    Sub R6, R5, R2
Br  R1, +100                          Or  R14, R6, R21
The Add, Sub and Or instructions are executed in both cases. Obviously, in this group of instructions there must be no jumps!
Instead of one or more "postponed" instructions, the compiler inserts NOPs in case no suitable instructions are available.
Handling the Control Hazards
Dynamic Prediction: Branch Target Buffer -> no stall (almost…)
[Diagram: the PC is compared (=) with the TAGS of the buffer; each entry also holds a T/NT prediction bit and a Predicted PC]
HIT: fetch with the predicted PC
MISS: fetch with PC + 4
Correct prediction: no stall
Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)
N.B. here the branch slot is selected during the IF clock cycle that loads IR1 into IF/ID.
Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed.
When one outcome predominates, each occurrence of the opposite outcome causes two consecutive errors. Example with nested loops (Loop1 containing Loop2): when exiting Loop2, the prediction fails (the branch is predicted taken but is actually untaken); it then fails again, predicting untaken, when Loop2 is entered once more.
Hence, usually two bits are used for branch prediction:
[State diagram of the 2-bit predictor: two "predict taken" states and two "predict untaken" states; a TAKEN outcome moves the counter toward "predict taken", an UNTAKEN outcome toward "predict untaken", so two consecutive mispredictions are needed to change the prediction]
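The behaviour of the two schemes can be compared with a small simulation (a sketch, assuming the 2-bit counter starts in the strongest "taken" state):

```python
# 1-bit predictor: predict whatever happened the last time.
def run_1bit(outcomes, pred=True):
    errors = 0
    for taken in outcomes:
        if pred != taken:
            errors += 1
        pred = taken                      # remember the last outcome
    return errors

# 2-bit saturating counter: states 0-1 predict untaken, 2-3 predict taken.
def run_2bit(outcomes, state=3):
    errors = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            errors += 1
        # saturating increment/decrement toward the actual outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return errors

# Inner-loop branch: taken 4 times, loop exit (untaken), loop re-entered.
pattern = [True] * 4 + [False] + [True] * 4
print(run_1bit(pattern))   # 2 errors: at the exit AND at the re-entry
print(run_2bit(pattern))   # 1 error: only at the exit
```

This reproduces the point made above: with one bit, a single loop exit causes two consecutive mispredictions; the 2-bit counter absorbs it with one.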