computer architecture pipeline

Computer Architecture 2014– Pipeline1

Computer Architecture

Pipeline

By Yoav Etsion & Dan TsafrirPresentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport


Pipeline idea: keep everyone busy


Pipeline: more accurately…

Expert in placing tomatoand closing the sandwich

Expert in placing roast beef

Expert in cutting bread

Pipelining elsewhere• Unix shell

grep string File | wc -l• Assembling cars• Whenever want to keep

functional units busy


DataAccess

DataAccess

Pipeline: microarchitecture2 4 6 8 10 12 14 16 18

. ..

InstFetch Reg ALU Reg


InstFetch

8 ns

8 ns

8 ns

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

befo

re


DataAccess

DataAccess

Pipeline: microarchitecture2 4 6 8 10 12 14 16 18

. ..



InstFetch

8 ns

8 ns

8 ns


lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

befo

re

// R1 = mem[0+100]

fetch

decode & bringregs to ALU

100+R0

access mem write back result to R1


DataAccess

DataAccess

DataAccess

DataAccess

DataAccess

Pipeline: microarchitecture

• Speed set by slowest component (instruction takes longer in pipeline)• First commercial use in 1985• In Intel chips since 486 (until then, serial execution)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

. ..



InstFetch




2 ns

2 ns

2 ns 2 ns 2 ns 2 ns 2 ns

8 ns

8 ns

8 ns


lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)


lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

befo

reaf

ter

// R1 = mem[0+100]

fetch

decode & bringregs to ALU

100+R0

access mem write back result to R1


MIPS

Introduced in 1981 by Hennessy (of “Patterson & Hennessy”) Idea suggested earlier, e.g., by John Cocke and friends at IBM

in the 1970s, but not developed in full

MIPS = Microprocessor without Interlocked Pipeline Stages RISC Often used in computer architecture courses Was very successful (e.g., inspired the Alpha ISA)

Interlocks (“without interlocks”) Mechanisms to allow stages to indicate they are busy E.g., ‘divide’ & ‘multiply’ required interlocks

=> paused other stages upstream With MIPS, every sub-phase of all instructions fits into 1 cycle No die area wasted on pausing mechanisms => faster cycle


Pipeline: principles Ideal speedup = num of pipeline stages

An instruction finishes every clock cycle Namely, IPC of an ideal pipelined machine is 1

Increase throughput rather than reduce latency One instruction still takes the same (or longer)

Since max speedup = num of stages &Latency determined by slowest stage, should:Þ Partition pipe to many stagesÞ Balance work across stagesÞ Shorten longest stage as much as possible


Pipeline: overheads & limitations Can increase per-instruction latency

Due to stages imbalance

Requires more logic than serial execution

Time to “fill” pipe reduces speedupTime to “drain” pipe reduces speedup E.g., upon interrupt or context switch

Stalls when there are dependencies


Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipelined CPU


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: fetchbring next instructionfrom memory; 4 bytes(32 bit) per instruction

Instruction saved inregister, in preparationof next pipe stage

when not branching,next instruction is innext word


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetch• decode source reg numbers• read their values from reg file• reg IDs are 5 bits (2^5 = 32)


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetchdecode & sign-extend immediate (from 16 bit to 32)


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetchdecode destination reg (can be one of two, depending on op) & save in register for next stage…


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetchdecode destination reg (can be one of two, depending on op) & save in latch for next stage…

…based on the op type, next phase will determine, which reg of the two is the destination


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “R” operation(the “shift” field is missing from this illustration)

reg1

reg2

func(6bit)

to reg3


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “I” operation(not branch & not load/store)

reg1

imm

opcode

to reg2


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “I” operationconditional branch BEQ or BNE[ if (reg1==reg2) pc = pc+4 + (imm<<2) ]

reg1

imm

opcode

reg2

Bra

nch?


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “I” operationload (store is similar)( reg2 = mem[reg1+imm] )

reg1

imm to reg2


ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: updating PCno branch:just add 4 to PC

unconditional branch:add immediate to PC+4(type J operation)

conditional branch:depends on resultof ALU


Pipelined CPU with Control

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EXEX/MEM

MEM/WBIn

stru

ctio

n

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

Con

trol

WBMEM

WBMEM

WB

EXE

Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback


Pipeline Example: cycle 1

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction lwPC4

4



ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC8

8 4

sub[R1]

910




ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite


Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC12

12 8

and[R2]

[R1]+9

sub4

[R3]

1011 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7



ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite


Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC16

12 8

or[R4]

Data frommemoryaddress[R1]+9

sub4

[R5]

1112

and16

10

[R2]-[R3]



Pipeline Hazards:

1. Structural Hazards


Structural Hazard Two instructions attempt to use same resource simultaneously

Problem: register-file accessed in 2 stages Write during stage 5 (WB) Read during stage 2 (ID)=> Resource (RF) conflict

Solution Split stage into two sub-stages Do write in first half Do reads in second half 2 read ports, 1 write port (separate)

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM


Structural Hazard Problem: memory accessed in 2 stages

Fetch (stage 1), when reading instructions from memory

Memory (stage 4), when datais read/written from/tomemory

Princeton architecture

Solution Split data/inst. Memories

Harvard architecture Today, separate instruction cache and data cache

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM


Pipeline Hazards:

2. Data Hazards


When two instructions access the same register

RAW: Read-After-Write True dependency

WAR: Write-After-Read Anti-dependency

WAW: Wrtie-After-Write False-dependency

Key problem with regular in-order pipelines is RAW We will also learn about out-of-order pipelines

Data Dependencies


Problem with starting next instruction before first is finished dependencies that “go backward in time” are data

hazards

Data Dependencies

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 0–20Value of R2 10 10 10 -20 -20 -20 -20


IM bubble bubble bubble bubble



RAW Hazard: HW Solution 1 - Add Stalls

IM Reg


CC 7 CC 8 CC 910 0–20Value of R2

DM Reg

IM DM RegReg

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

sub R2, R1, R3

stall

stall

stall

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)


10 10 10 -20 -20 -20 -20

• Let the hardware detect hazard and add stalls if needed

Problem: slow! Solution: forwarding whenever possible


Use temporary results, don’t wait for them to be written to the register file register file forwarding to handle read/write to same register ALU forwarding

RAW Hazard: HW Solution 2 - Forwarding

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

X X X –20 X X X X XX X X X –20 X X X X

DM

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)



CC 7 CC 8 CC 910 0–20Value of R2 10 10 10 -20 -20 -20 -20

Value EX/MEMValue MEM/WB


Forwarding Hardware

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0muxB

1

2

0muxA

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

EX/M

EM.R

egW

rite

MEM

/WB

.Reg

Writ

e


Forwarding Hardware

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0muxB

1

2

0muxA

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC


Inst

ruct

ion


Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

EX/M

EM.R

egW

rite

MEM

/WB

.Reg

Writ

e

• Added 2 mux units before ALU• Each mux gets 3 inputs, from:

1. Prev stage (ID/EX)2. Next stage (EX/MEM)3. The one after (MEM/WB)

• Forward unit tells the 2 mux units which input to use


Forwarding ControlEX Hazard:

A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1

B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1

MEM Hazard: if (not A and MEM/WB.RegWrite

(MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2 if (not B and MEM/WB.RegWrite and

(MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2


Forwarding ControlEX Hazard:

A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1

B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1

MEM Hazard: if (not A and MEM/WB.RegWrite

(MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2 if (not B and MEM/WB.RegWrite and

(MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2

If, in memory stage, we’re writing the output to a register

And the reg we’re writing to also happens to be inp_reg1 for the execute stage

Then mux_A should select inp_1,namely, the ALU should feed itself


Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2

lw R11,9(R1) sub R10,R2, R3and R12,R10,R11

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC


Inst

ruct

ion


Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

lw[R10]

Data frommemoryaddress[R1]+9

sub

[R11]

1012

and

11

[R2]-[R3]

1110

load op => read from “1”


Forwarding Hardware Example #2: Bypassing From WB to Src2

sub R10,R2, R3xxxand R12,R10,R11

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC


Inst

ruct

ion


Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

sub[R11]

xxx

[R10]

12

and

10

[R2]-[R3]

1110

not load op => read from “0”


“load” op can cause “un-forwardable” hazards load value to R In the next instruction, use R as input

Þ A bigger problem in longer pipelines

Can't always forward (stall inevitable)

Reg

IM

Reg

Reg

IM

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

DM


CC 7 CC 8 CC 9Programexecutionorder

lw R2, 30(R1)

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)


Hazard Detection (Stall) Logicif ( (ID/EX.RegWrite) and (ID/EX.opcode == lw) and ( (ID/EX.WriteReg == IF/ID.ReadReg1) or

(ID/EX.WriteReg == IF/ID.ReadReg2) )then stall IF/ID


Forwarding + Hazard Detection Unit

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC


Inst

ruct

ion


Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

HazardDetection

Unit:Scoreboard

0

ID/EX.MemRead

PC W

rite

0mux

1

IF/ID

Writ

e

ID/EX.Rt


Example: code for (assume all variables are in memory):a = b + c;d = e – f;

Slow codeLW Rb,bLW Rc,c

Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f

Stall SUB Rd,Re,RfSW d,Rd

Instruction order can be changed as long as correctness is kept (no dependencies violated)

Compiler scheduling helps avoid load hazards (when possible)

Fast codeLW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd


Pipeline Hazards:

3. Control Hazards


Branch, but where?

The decision to branch happens deep within the pipeline Likewise, the target of the branch becomes known deep

within the pipeline How does this effect the pipeline logic? For example…


Executing a BEQ Instruction (i)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

R4R5 -27

812

12

beqand

Assume this program state


Executing a BEQ Instruction (i)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4


ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

R4R5 -27

812

12

beqand

We know:• Values of registersWe don’t know:• If branch will be taken• What is its target


Executing a BEQ Instruction (ii)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4


ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction R4-R5=0-

8+SignExt(27)*4

1216

16

beqandsw

…Now we know, but only in next cycle will

this effect PC

Calculate branch condition = compute R4-R5 & compare to 0

Calculate branch target


Executing a BEQ Instruction (iii)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4


ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

8+SignExt(27)*4

16

20 or 116

beqandsw

20

sub

Finally, if taken, branch sets the PC


Control Hazard on Branches

And

Beq

sub

sw

Inst from target

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

Outcome: The 3 instructions following the branch are in the pipeline even if branch is taken!


Traps, Exceptions and Interrupts

Indication of events that require a higher authority to intervene (i.e. the operating system)

Atomically changes the protection mode and branches to OS Protection mode determines what the running is allowed to do (access

devices, memory, etc).

Traps: Synchronous The program asks for OS services (e.g. access a device)

Exceptions: Synchronous The program did something bad (divide-by-zero; prot. violation)

Interrupts: Asynchronous An external device needs OS attention (finished an operation)

Can these be handled like regular branches?


Stall

Easiest solution: Stall pipe when branch encountered until resolved

But there’s a prices. Assume: CPI = 1 20% of instructions are branches (realistic) Stall 3 cycles on every branch (extra 3 cycles for each branch)

Then the price is: CPI new = 1 + 0.2 × 3 = 1.6 // 1 = all instr., including branch [ CPI new = CPI Ideal + avg. stall cycles / instr. ]

Namely: We lose 60% of the performance!


Branch Prediction and Speculative Execution


Static prediction: branch not taken

Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid

If branch actually taken Flush the fall-through path instructions before they change the

machine state (memory / registers) Fetch the instructions from the correct (taken) path

Assuming ~50% branches not taken on average CPI new = 1 + (0.2 × 0.5) × 3 = 1.3 30% slowdown instead of 60% What happens in longer pipelines?


Dynamic branch prediction

Branch prediction is a key impediment to performance Modern processors employ complex branch predictors Often achieve < 3% misprediction rate

Given an instruction, we need to predict Is it a branch? Branch taken? Target address?

To avoid stalling Prediction needed at end of ‘fetch’ Before we even now what’s the instruction…

A simple mechanism: Branch Target Buffer (BTB)


BTB – the idea

fast lookup table

PC of fetched instruction

?= Predicted branchtaken or not taken?(last few times)

No => we don’t know,so we don’t predict

Yes => instructionis a branch, so let’spredict it

Branch PC Target PC History

Predicted Target

(Works in a straightforward manner only for direct branches, otherwise target PC changes)


How it works in a nutshell

Until proven otherwise, assume branches are not taken Fall through instructions (assume branch has no effect)

Upon the first time a branch is taken Pay the price (in terms of stalls), but Save the details of the branch in the BTB

(= PC, target PC, and whether or not branch was taken)

While fetching, HW checks in parallel Whether PC is in BTB

If found, make a prediction Taken? Address?

Upon misprediction Flush (throw out) pipeline content & start over from right PC


Prediction steps

1. Allocate Insert instruction to BTB once identified as taken branch Do not insert not-taken branches

Implicitly predict they’d continue not to be taken Insert both conditional & unconditional

To identify, and to save arithmetic2. Predict

BTB lookup done in parallel to PC-lookup, providing: Indication whether PC is a branch (=> BTB “hit”) Branch target Branch direction (forward or backward in program) Branch type (conditional or not)

3. Update (when branch taken & its outcome becomes known) Branch target, history (taken or not)


Misprediction

Occurs when Predict = not taken, reality = taken Predict = taken, reality = not taken Branch taken as predicted, but wrong target (indirect, as in

the jmp register)

Must flush pipeline Reset pipeline registers (similar to turning all into NOPs)

Commonly, other flush methods are easier to implement Set the PC to the correct path Start fetching instruction from correct path


CPI

Assuming a fraction of p correct predictions CPI_new = 1 + (0.2 × (1-p)) × 3

Example, p=0.7 CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18

Example, p=0.98 CPI_new = 1 + (0.2 × 0.02) × 3 = 1.012 (But this is a simplistic model; in reality the price can

sometimes be much higher.)


History & prediction algorithm

“Always backward” prediction Works for long loops

Some branches exhibit “locality” Typically behave as the last time they were invoked Typically depend on their previous outcome (& it alone)

Can save a history window What happened last time, and before that, and before… The bigger the window, the greater the complexity

Some branches regularly alternate between taken & untaken Taken, then untaken, then taken, … Need only one history bit to identify this

Some branches are correlated with previous branches Those that lead to them


Adding a BTB to the Pipeline

Shift left 2

ALUSrc

6

ALUresultZero

+

ALUControl

ALUOp

RegDst

RegWrite


Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EXEX/MEM MEM

/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4+

IF/ID

PC

0

1

mux

0

1

mux

0mux

1

0

mux

Inst.Memory

Address

Instruction

BTB

12

pred targetpred dir

PC+4 (Not-taken target)taken target

3

Mis-predict

DetectionUnit

Flush

predicted targetPC+4 (Not-taken target)

predicted direction

−4

address

targetdirection

alloc/updt


Using The BTBPC moves to next instruction

Inst Mem gets PCand fetches new inst

BTB gets PCand looks it up

IF/ID latch loadedwith new inst

BTB Hit ?

Br taken ?

PC PC + 4PC pred addr

IF

IDIF/ID latch loaded

with pred instIF/ID latch loaded

with seq. instBranch ?

yes no

noyes

noyesEXE


Using The BTB (cont.)ID

EXE

MEM

WB

Branch ?

Calculate brcond & trgt

Flush pipe &update PC

Corect pred ?

yes no

IF/ID latch loadedwith correct inst

continue

Update BTB

yes no

continue


Prediction algorithm

Can do an entire course on this issue Still actively researched

As noted, modern predictors can often achieve misprediction < 2%

Still, it has been shown that these 2% can sometimes significantly worsen performance A real problem in out-of-order pipelines

We did not talk about the issue of indirect branches As in virtual function calls (object oriented) Where the branch target is written in memory, elsewhere

computer architecture pipeline

Documents