computer architecture pipeline

65
Computer Architecture 2014– Pipeline 1 Computer Architecture Pipeline By Yoav Etsion & Dan Tsafrir Presentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport

Upload: anisa

Post on 23-Feb-2016

78 views

Category:

Documents


0 download

DESCRIPTION

Computer Architecture Pipeline. By Yoav Etsion & Dan Tsafrir Presentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport. Pipeline idea: keep everyone busy. Pipeline: more accurately…. Expert in cutting bread. Expert in placing roast beef. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline1

Computer Architecture

Pipeline

By Yoav Etsion & Dan TsafrirPresentation based on slides by David Patterson, Avi Mendelson, Randi Katz, and Lihu Rappoport

Page 2: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline2

Pipeline idea: keep everyone busy

Page 3: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline3

Pipeline: more accurately…

Expert in placing tomatoand closing the sandwich

Expert in placing roast beef

Expert in cutting bread

Pipelining elsewhere• Unix shell

grep string File | wc -l• Assembling cars• Whenever want to keep

functional units busy

Page 4: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline4

DataAccess

DataAccess

Pipeline: microarchitecture2 4 6 8 10 12 14 16 18

. ..

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

InstFetch

8 ns

8 ns

8 ns

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

befo

re

Page 5: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline5

DataAccess

DataAccess

Pipeline: microarchitecture2 4 6 8 10 12 14 16 18

. ..

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

InstFetch

8 ns

8 ns

8 ns

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

befo

re

// R1 = mem[0+100]

fetch

decode & bringregs to ALU

100+R0

access mem write back result to R1

Page 6: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline7

DataAccess

DataAccess

DataAccess

DataAccess

DataAccess

Pipeline: microarchitecture

• Speed set by slowest component (instruction takes longer in pipeline)• First commercial use in 1985• In Intel chips since 486 (until then, serial execution)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

. ..

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

InstFetch

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

InstFetch Reg ALU Reg

2 ns

2 ns

2 ns 2 ns 2 ns 2 ns 2 ns

8 ns

8 ns

8 ns

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

TimeProgramexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

befo

reaf

ter

// R1 = mem[0+100]

fetch

decode & bringregs to ALU

100+R0

access mem write back result to R1

Page 7: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline8

MIPS

Introduced in 1981 by Hennessy (of “Patterson & Hennessy”) Idea suggested earlier, e.g., by John Cocke and friends at IBM

in the 1970s, but not developed in full

MIPS = Microprocessor without Interlocked Pipeline Stages RISC Often used in computer architecture courses Was very successful (e.g., inspired the Alpha ISA)

Interlocks (“without interlocks”) Mechanisms to allow stages to indicate they are busy E.g., ‘divide’ & ‘multiply’ required interlocks

=> paused other stages upstream With MIPS, every sub-phase of all instructions fits into 1 cycle No die area wasted on pausing mechanisms => faster cycle

Page 8: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline9

Pipeline: principles Ideal speedup = num of pipeline stages

An instruction finishes every clock cycle Namely, IPC of an ideal pipelined machine is 1

Increase throughput rather than reduce latency One instruction still takes the same (or longer)

Since max speedup = num of stages &Latency determined by slowest stage, should:Þ Partition pipe to many stagesÞ Balance work across stagesÞ Shorten longest stage as much as possible

Page 9: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline10

Pipeline: overheads & limitations Can increase per-instruction latency

Due to stages imbalance

Requires more logic than serial execution

Time to “fill” pipe reduces speedupTime to “drain” pipe reduces speedup E.g., upon interrupt or context switch

Stalls when there are dependencies

Page 10: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline11

Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipelined CPU

Page 11: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline12

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: fetchbring next instructionfrom memory; 4 bytes(32 bit) per instruction

Instruction saved inregister, in preparationof next pipe stage

when not branching,next instruction is innext word

Page 12: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline13

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetch• decode source reg numbers• read their values from reg file• reg IDs are 5 bits (2^5 = 32)

Page 13: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline14

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetchdecode & sign-extend immediate (from 16 bit to 32)

Page 14: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline15

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetchdecode destination reg (can be one of two, depending on op) & save in register for next stage…

Page 15: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline16

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: decode + regs fetchdecode destination reg (can be one of two, depending on op) & save in latch for next stage…

…based on the op type, next phase will determine, which reg of the two is the destination

Page 16: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline17

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “R” operation(the “shift” field is missing from this illustration)

reg1

reg2

func(6bit)

to reg3

Page 17: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline18

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “I” operation(not branch & not load/store)

reg1

imm

opcode

to reg2

Page 18: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline19

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “I” operationconditional branch BEQ or BNE[ if (reg1==reg2) pc = pc+4 + (imm<<2) ]

reg1

imm

opcode

reg2

Bra

nch?

Page 19: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline20

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: execute

ALU computes – “I” operationload (store is similar)( reg2 = mem[reg1+imm] )

reg1

imm to reg2

Page 20: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline21

ALUSrc

6

ALUresult

zero

AddShift left 2

ALUctrl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

signexten

sio

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

instructionmemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

Pipeline: updating PCno branch:just add 4 to PC

unconditional branch:add immediate to PC+4(type J operation)

conditional branch:depends on resultof ALU

Page 21: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline22

Pipelined CPU with Control

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EXEX/MEM

MEM/WBIn

stru

ctio

n

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

Con

trol

WBMEM

WBMEM

WB

EXE

Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

Page 22: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline23

Pipeline Example: cycle 1

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction lwPC4

4

Page 23: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline24

Pipeline Example: cycle 2

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC8

8 4

sub[R1]

910

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

Page 24: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline25

Pipeline Example: cycle 3

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC12

12 8

and[R2]

[R1]+9

sub4

[R3]

1011 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

Page 25: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline26

Pipeline Example: cycle 4

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

0

1

mux

0

1

mux

0

1

mux

1

0

mux

Instruction

lw

PC16

12 8

or[R4]

Data frommemoryaddress[R1]+9

sub4

[R5]

1112

and16

10

[R2]-[R3]

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

Page 26: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline27

Pipeline Hazards:

1. Structural Hazards

Page 27: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline28

Structural Hazard Two instructions attempt to use same resource simultaneously

Problem: register-file accessed in 2 stages Write during stage 5 (WB) Read during stage 2 (ID)=> Resource (RF) conflict

Solution Split stage into two sub-stages Do write in first half Do reads in second half 2 read ports, 1 write port (separate)

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

Page 28: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline29

Structural Hazard Problem: memory accessed in 2 stages

Fetch (stage 1), when reading instructions from memory

Memory (stage 4), when datais read/written from/tomemory

Princeton architecture

Solution Split data/inst. Memories

Harvard architecture Today, separate instruction cache and data cache

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

Page 29: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline30

Pipeline Hazards:

2. Data Hazards

Page 30: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline31

When two instructions access the same register

RAW: Read-After-Write True dependency

WAR: Write-After-Read Anti-dependency

WAW: Wrtie-After-Write False-dependency

Key problem with regular in-order pipelines is RAW We will also learn about out-of-order pipelines

Data Dependencies

Page 31: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline32

Problem with starting next instruction before first is finished dependencies that “go backward in time” are data

hazards

Data Dependencies

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 0–20Value of R2 10 10 10 -20 -20 -20 -20

Page 32: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline33

IM bubble bubble bubble bubble

IM bubble bubble bubble bubble

IM bubble bubble bubble bubble

RAW Hazard: HW Solution 1 - Add Stalls

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 0–20Value of R2

DM Reg

IM DM RegReg

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

sub R2, R1, R3

stall

stall

stall

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

10 10 10 -20 -20 -20 -20

• Let the hardware detect hazard and add stalls if needed

Problem: slow! Solution: forwarding whenever possible

Page 33: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline34

Use temporary results, don’t wait for them to be written to the register file register file forwarding to handle read/write to same register ALU forwarding

RAW Hazard: HW Solution 2 - Forwarding

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

X X X –20 X X X X XX X X X –20 X X X X

DM

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 0–20Value of R2 10 10 10 -20 -20 -20 -20

Value EX/MEMValue MEM/WB

Page 34: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline35

Forwarding Hardware

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0muxB

1

2

0muxA

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

EX/M

EM.R

egW

rite

MEM

/WB

.Reg

Writ

e

Page 35: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline36

Forwarding Hardware

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0muxB

1

2

0muxA

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

EX/M

EM.R

egW

rite

MEM

/WB

.Reg

Writ

e

• Added 2 mux units before ALU• Each mux gets 3 inputs, from:

1. Prev stage (ID/EX)2. Next stage (EX/MEM)3. The one after (MEM/WB)

• Forward unit tells the 2 mux units which input to use

Page 36: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline37

Forwarding ControlEX Hazard:

A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1

B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1

MEM Hazard: if (not A and MEM/WB.RegWrite

(MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2 if (not B and MEM/WB.RegWrite and

(MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2

Page 37: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline38

Forwarding ControlEX Hazard:

A. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 1

B. if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 1

MEM Hazard: if (not A and MEM/WB.RegWrite

(MEM/WB.WriteReg = ID/EX.ReadReg1)) then ALUSelA = 2 if (not B and MEM/WB.RegWrite and

(MEM/WB.WriteReg = ID/EX.ReadReg2)) then ALUSelB = 2

If, in memory stage, we’re writing the output to a register

And the reg we’re writing to also happens to be inp_reg1 for the execute stage

Then mux_A should select inp_1,namely, the ALU should feed itself

Page 38: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline39

Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2

lw R11,9(R1) sub R10,R2, R3and R12,R10,R11

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

lw[R10]

Data frommemoryaddress[R1]+9

sub

[R11]

1012

and

11

[R2]-[R3]

1110

load op => read from “1”

Page 39: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline40

Forwarding Hardware Example #2: Bypassing From WB to Src2

sub R10,R2, R3xxxand R12,R10,R11

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

sub[R11]

xxx

[R10]

12

and

10

[R2]-[R3]

1110

not load op => read from “0”

Page 40: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline42

“load” op can cause “un-forwardable” hazards load value to R In the next instruction, use R as input

Þ A bigger problem in longer pipelines

Can't always forward (stall inevitable)

Reg

IM

Reg

Reg

IM

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 9Programexecutionorder

lw R2, 30(R1)

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Page 41: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline44

Hazard Detection (Stall) Logicif ( (ID/EX.RegWrite) and (ID/EX.opcode == lw) and ( (ID/EX.WriteReg == IF/ID.ReadReg1) or

(ID/EX.WriteReg == IF/ID.ReadReg2) )then stall IF/ID

Page 42: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline45

Forwarding + Hazard Detection Unit

Con

trol

EX

M

WB

ForwardingUnit

1mux

0

0mux

1

0mux

1

2

0mux

1

2

DataMemoryA

LU

Register File

Inst

ruct

ion

Mem

ory

PC

ID/EX EX/MEM MEM/WBIF/ID

Inst

ruct

ion

IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd

Rs

RtRtRd

WB

M WB

EX/MEM.Rd

MEM/WB.Rd

HazardDetection

Unit:Scoreboard

0

ID/EX.MemRead

PC W

rite

0mux

1

IF/ID

Writ

e

ID/EX.Rt

Page 43: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline46

Example: code for (assume all variables are in memory):a = b + c;d = e – f;

Slow codeLW Rb,bLW Rc,c

Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f

Stall SUB Rd,Re,RfSW d,Rd

Instruction order can be changed as long as correctness is kept (no dependencies violated)

Compiler scheduling helps avoid load hazards (when possible)

Fast codeLW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd

Page 44: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline47

Pipeline Hazards:

3. Control Hazards

Page 45: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline48

Branch, but where?

The decision to branch happens deep within the pipeline Likewise, the target of the branch becomes known deep

within the pipeline How does this effect the pipeline logic? For example…

Page 46: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline49

Executing a BEQ Instruction (i)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

R4R5 -27

812

12

beqand

Assume this program state

Page 47: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline50

Executing a BEQ Instruction (i)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

R4R5 -27

812

12

beqand

We know:• Values of registersWe don’t know:• If branch will be taken• What is its target

Page 48: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline51

Executing a BEQ Instruction (ii)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction R4-R5=0-

8+SignExt(27)*4

1216

16

beqandsw

…Now we know, but only in next cycle will

this effect PC

Calculate branch condition = compute R4-R5 & compare to 0

Calculate branch target

Page 49: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline52

Executing a BEQ Instruction (iii)BEQ R4, R5, 27 → if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

0 or 4 beq R4, R5, 27 8 and12 sw16 sub

ALUSrc

6

ALUresultZero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2

WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EX EX/MEM MEM/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

8+SignExt(27)*4

16

20 or 116

beqandsw

20

sub

Finally, if taken, branch sets the PC

Page 50: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline53

Control Hazard on Branches

And

Beq

sub

sw

Inst from target

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

Outcome: The 3 instructions following the branch are in the pipeline even if branch is taken!

Page 51: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline54

Traps, Exceptions and Interrupts

Indication of events that require a higher authority to intervene (i.e. the operating system)

Atomically changes the protection mode and branches to OS Protection mode determines what the running is allowed to do (access

devices, memory, etc).

Traps: Synchronous The program asks for OS services (e.g. access a device)

Exceptions: Synchronous The program did something bad (divide-by-zero; prot. violation)

Interrupts: Asynchronous An external device needs OS attention (finished an operation)

Can these be handled like regular branches?

Page 52: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline55

Stall

Easiest solution: Stall pipe when branch encountered until resolved

But there’s a prices. Assume: CPI = 1 20% of instructions are branches (realistic) Stall 3 cycles on every branch (extra 3 cycles for each branch)

Then the price is: CPI new = 1 + 0.2 × 3 = 1.6 // 1 = all instr., including branch [ CPI new = CPI Ideal + avg. stall cycles / instr. ]

Namely: We lose 60% of the performance!

Page 53: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline56

Branch Prediction and Speculative Execution

Page 54: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline57

Static prediction: branch not taken

Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid

If branch actually taken Flush the fall-through path instructions before they change the

machine state (memory / registers) Fetch the instructions from the correct (taken) path

Assuming ~50% branches not taken on average CPI new = 1 + (0.2 × 0.5) × 3 = 1.3 30% slowdown instead of 60% What happens in longer pipelines?

Page 55: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline58

Dynamic branch prediction

Branch prediction is a key impediment to performance Modern processors employ complex branch predictors Often achieve < 3% misprediction rate

Given an instruction, we need to predict Is it a branch? Branch taken? Target address?

To avoid stalling Prediction needed at end of ‘fetch’ Before we even now what’s the instruction…

A simple mechanism: Branch Target Buffer (BTB)

Page 56: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline59

BTB – the idea

fast lookup table

PC of fetched instruction

?= Predicted branchtaken or not taken?(last few times)

No => we don’t know,so we don’t predict

Yes => instructionis a branch, so let’spredict it

Branch PC Target PC History

Predicted Target

(Works in a straightforward manner only for direct branches, otherwise target PC changes)

Page 57: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline60

How it works in a nutshell

Until proven otherwise, assume branches are not taken Fall through instructions (assume branch has no effect)

Upon the first time a branch is taken Pay the price (in terms of stalls), but Save the details of the branch in the BTB

(= PC, target PC, and whether or not branch was taken)

While fetching, HW checks in parallel Whether PC is in BTB

If found, make a prediction Taken? Address?

Upon misprediction Flush (throw out) pipeline content & start over from right PC

Page 58: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline61

Prediction steps

1. Allocate Insert instruction to BTB once identified as taken branch Do not insert not-taken branches

Implicitly predict they’d continue not to be taken Insert both conditional & unconditional

To identify, and to save arithmetic2. Predict

BTB lookup done in parallel to PC-lookup, providing: Indication whether PC is a branch (=> BTB “hit”) Branch target Branch direction (forward or backward in program) Branch type (conditional or not)

3. Update (when branch taken & its outcome becomes known) Branch target, history (taken or not)

Page 59: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline62

Misprediction

Occurs when Predict = not taken, reality = taken Predict = taken, reality = not taken Branch taken as predicted, but wrong target (indirect, as in

the jmp register)

Must flush pipeline Reset pipeline registers (similar to turning all into NOPs)

Commonly, other flush methods are easier to implement Set the PC to the correct path Start fetching instruction from correct path

Page 60: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline63

CPI

Assuming a fraction of p correct predictions CPI_new = 1 + (0.2 × (1-p)) × 3

Example, p=0.7 CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18

Example, p=0.98 CPI_new = 1 + (0.2 × 0.02) × 3 = 1.012 (But this is a simplistic model; in reality the price can

sometimes be much higher.)

Page 61: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline64

History & prediction algorithm

“Always backward” prediction Works for long loops

Some branches exhibit “locality” Typically behave as the last time they were invoked Typically depend on their previous outcome (& it alone)

Can save a history window What happened last time, and before that, and before… The bigger the window, the greater the complexity

Some branches regularly alternate between taken & untaken Taken, then untaken, then taken, … Need only one history bit to identify this

Some branches are correlated with previous branches Those that lead to them

Page 62: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline65

Adding a BTB to the Pipeline

Shift left 2

ALUSrc

6

ALUresultZero

+

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1Readreg 2WriteregWritedata

Readdata 1

Readdata 2

Reg

iste

r File

[15-0]

[20-16]

[15-11]

Signextend

16 32

ID/EXEX/MEM MEM

/WB

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4+

IF/ID

PC

0

1

mux

0

1

mux

0mux

1

0

mux

Inst.Memory

Address

Instruction

BTB

12

pred targetpred dir

PC+4 (Not-taken target)taken target

3

Mis-predict

DetectionUnit

Flush

predicted targetPC+4 (Not-taken target)

predicted direction

−4

address

targetdirection

alloc/updt

Page 63: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline66

Using The BTBPC moves to next instruction

Inst Mem gets PCand fetches new inst

BTB gets PCand looks it up

IF/ID latch loadedwith new inst

BTB Hit ?

Br taken ?

PC PC + 4PC pred addr

IF

IDIF/ID latch loaded

with pred instIF/ID latch loaded

with seq. instBranch ?

yes no

noyes

noyesEXE

Page 64: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline67

Using The BTB (cont.)ID

EXE

MEM

WB

Branch ?

Calculate brcond & trgt

Flush pipe &update PC

Corect pred ?

yes no

IF/ID latch loadedwith correct inst

continue

Update BTB

yes no

continue

Page 65: Computer Architecture Pipeline

Computer Architecture 2014– Pipeline68

Prediction algorithm

Can do an entire course on this issue Still actively researched

As noted, modern predictors can often achieve misprediction < 2%

Still, it has been shown that these 2% can sometimes significantly worsen performance A real problem in out-of-order pipelines

We did not talk about the issue of indirect branches As in virtual function calls (object oriented) Where the branch target is written in memory, elsewhere