lecture 3 processor - university of arkansas

Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook)

Chapter 4.2

Introduction Chapter 4.1

Chapter 4.3

Review: MIPS (RISC) Design Principles Simplicity favors regularity

fixed size instructions small number of instruction formats opcode always the first 6 bits

Smaller is faster limited instruction set limited number of registers in register file limited number of addressing modes

Make the common case fast arithmetic operands from the register file (load-store machine) allow instructions to contain immediate operands

Good design demands good compromises three instruction formats

Chapter 4.4

Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j

Generic implementation use the program counter (PC) to supply

the instruction address and fetch the instruction from memory (and update the PC)

decode the instruction (and read registers) execute the instruction

All instructions (except j) use the ALU after reading the registers

How? memory-reference? arithmetic? control flow?

The Processor: Datapath & Control

FetchPC = PC+4

DecodeExec

Chapter 4.5

Logic Design Conventions Chapter 4.2

Chapter 4.6

Logic Design Basics Information encoded in binary

Low voltage = 0, High voltage = 1 One wire per bit Multi-bit data encoded on multi-wire buses

Combinational components Operate on data Output is a function of input

State (sequential) components Store information

Chapter 4.7

Combinational Components

AND-gate Y = A & B

Adder Y = A + B

Multiplexer Y = S ? I1 : I0

Arithmetic/Logic Unit Y = F(A,B)

I0I1 Y

Chapter 4.8

Sequential Components Register: stores data in a circuit

Uses a clock signal to determine when to update the stored value

Edge-triggered: update when the clock signal changes the value- Rising edge (positive edge) : 0→1- Falling edge (negative edge) : 1→0

D flip-flop is one option to implement register

Chapter 4.9

Sequential Components Register with write control

Only updates on clock edge when write control input (e.g., Write Enable) is asserted

Chapter 4.10

Clocking Methodologies The clocking methodology defines when data in a state

element are valid and stable relative to the clock State elements – a memory element such as a register Edge-triggered – all state changes occur on a clock edge

Typical execution read contents of state elements → send values through

combinational logic → write results to one or more state elementsState

element1

Stateelement

Combinationallogic

one clock cycle

Assumes state elements are written on every clock cycle; if not, need explicit write control signal write occurs only when both the write control is asserted and the

clock edge occurs

Chapter 4.11

Building a Datapath Chapter 4.3

Chapter 4.12

Fetching Instructions Fetching instructions involves

reading the instruction from the Instruction Memory updating the PC value to be the address of the next

(sequential) instruction

ReadAddress

Instruction

InstructionMemory

PC is updated every clock cycle, so it does not need an explicit write control signal

FetchPC = PC+4

DecodeExec

Chapter 4.13

Decoding Instructions Decoding instructions involves

sending the fetched instruction’s opcode and function field bits to the control unit

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

ControlUnit

reading two values from the Register File- Register File addresses are contained in the instruction

FetchPC = PC+4

DecodeExec

Chapter 4.14

Executing R Format Operations R format operations (add, sub, slt, and, or)

perform operation (op and funct) on values in rs and rt store the result back into the Register File (into location rd)

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

overflowzero

ALU controlRegWrite

R-type:31 25 20 15 5 0

op rs rt rd functshamt

Note that Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File

FetchPC = PC+4

DecodeExec

Chapter 4.15

Executing Store Operations Store operations involve

compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction

store value (read from the Register File during decode) written to the Data Memory

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

overflowzero

ALU control

DataMemory

Address

Write Data

Read Data

SignExtend

MemWrite

Chapter 4.16

Executing Load Operations Load operations involve

compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction

load value (read from the Data Memory) written to the Register File

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

overflowzero

ALU controlRegWrite

DataMemory

Address

Write Data

Read Data

SignExtend

MemRead

Chapter 4.17

Executing Branch Operations Branch operations involve

compare the operands read from the Register File during decode for equality (zero ALU output)

compute the branch target address by adding the updated PC to the 16-bit signed-extended offset field in the instr

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

ALU control

SignExtend16 32

Shiftleft 2

Branchtargetaddress

(to branch control logic)

Chapter 4.18

Executing Jump Operations Jump operation involves

replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

ReadAddress

Instruction

InstructionMemory

Shiftleft 2

Jumpaddress

Chapter 4.19

Creating a Single Datapath from the Parts Assemble the datapath segments and add control lines

and multiplexors as needed Single cycle design – fetch, decode and execute each

instruction in one clock cycle no datapath resource can be used more than once per

instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)

multiplexors needed at the input of shared elements with control lines to do the selection

write signals to control writing to the Register File and Data Memory

Cycle time is determined by the length of the longest path

Chapter 4.20

Fetch, Register, and Memory Access Portions

MemtoReg

ReadAddress Instruction

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

ovfzero

ALU controlRegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemReadSign

Extend16 32

ALUSrc

zero flag of ALU- ‘1’: result is zero- ‘0’: result is not zero

Chapter 4.21

Full Datapath (without Jump)

Chapter 4.22

Adding Control Units Chapter 4.4, page 259

Chapter 4.23

Adding the Control Selecting the operations to perform (ALU, Register File

and Memory read/write) Controlling the flow of data (multiplexor inputs)

I-Type: op rs rt address offset31 25 20 15 0

R-type:31 25 20 15 5 0

op rs rt rd functshamt

Observations op field always

in bits 31-26

addr. of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16)

- in lw and sw rs is the base register

addr. of register to be written is in one of two places – in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions

offset for beq, lw, and sw always in bits 15-0

J-type:31 25 0

op target address

Chapter 4.24

The Main Control Unit Control signals derived from instruction

0 rs rt rd shamt funct31:26 5:025:21 20:16 15:11 10:6

35 or 43 rs rt address31:26 25:21 20:16 15:0

4 rs rt address31:26 25:21 20:16 15:0

R-type

Load/Store

Branch

opcode always read

always read

write for R-typeand load

sign-extend

Chapter 4.25

Single Cycle Datapath with Control Unit (w/o jump)

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.26

Single Cycle Datapath with Control Unit

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Shiftleft 2

32Instr[25-0]

26PC+4 [31-28]

Chapter 4.27

ALU Control ALU used for

Load/Store: Function = add Branch: Function = subtract R-type: Function depends on funct field

ALU control Function

0000 AND

0001 OR

0010 add

0110 subtract

0111 set-on-less-than

Chapter 4.28

ALU Control Assume 2-bit ALUOp derived from opcode

Combinational logic derives ALU control

opcode ALUOp Operation funct ALU function ALU control

lw 00 load word XXXXXX add 0010

sw 00 store word XXXXXX add 0010

beq 01 branch equal XXXXXX subtract 0110

R-type 10 add 100000 add 0010

subtract 100010 subtract 0110

AND 100100 AND 0000

OR 100101 OR 0001

set-on-less-than 101010 set-on-less-than 0111

Chapter 4.29

Single-cycle Implementation Chapter 4.4, page 264

Chapter 4.30

R-type Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.31

R-type Instruction Data/Control Flow (alt.)

Chapter 4.32

Load Word Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.33

Load Word Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.34

Load Word Instruction Data/Control Flow (alt.)

Chapter 4.35

Store Word Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.36

Store Word Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.37

Branch Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.38

Branch Instruction Data/Control Flow

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Chapter 4.39

Branch Instruction Data/Control Flow (alt.)

Chapter 4.40

Adding the Jump Operation

ReadAddress

Instr[31-0]

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

RegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemRead

SignExtend16 32

MemtoReg

ALUSrc

Shiftleft 2

RegDst

ALUcontrol

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15 -11]

ControlUnit

Instr[31-26]

Branch

Shiftleft 2

32Instr[25-0]

26PC+4[31-28]

Chapter 4.41

Jump Operation (alt.)

Chapter 4.42

Introduction to Pipelining Design Chapter 4.5

Chapter 4.43

Instruction Critical Paths

Instr. I Mem Reg Rd ALU Op D Mem Reg Wr TotalR-typeloadstorebeqjump

200 100 200 100 600

200 100 200 200 100 800

What is the clock cycle time assuming negligible delays for muxes, control unit, sign extension, PC access, shift left 2, wires, setup and hold times except:

Instruction Memory and Data Memory (200 ps) ALU and adders (200 ps) Register File access (reads or writes) (100 ps)

200 100 200 200 700200 100 200 500200 200

Chapter 4.44

Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently – the clock cycle must

be timed to accommodate the slowest instruction especially problematic for more complex instructions like

floating-point multiplication

Is slow

but Is simple and easy to understand

lw sw Waste

Cycle 1 Cycle 2

Chapter 4.45

How Can We Make It Faster? Start fetching and executing the next instruction before

the current one has completed Pipelining – modern processors are pipelined for performance Remember the performance equation:

CPU time = IC × CPI × CC

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages A five stage pipeline is nearly five times faster because the CC is

nearly five times faster- CPI=1 for single-cycle implementation- CPI≈1 for pipelined implementation

Chapter 4.46

Analogy: Assembly Line v.s. Mechanic Shop Mechanic Shop

The mechanic needs to do everything

It takes hours to fix just one car

Car assembly line Many workers work

together- Each worker just puts one

or a few components into the car

One assembly line can produce hundreds or thousands of cars per day

Chapter 4.47

The Five Stages of Executing Instruction

IFetch: Instruction Fetch and Update PC

Dec: Registers Fetch and Instruction Decode

Exec: Execute R-type; calculate memory address; etc.

Mem: Read/write the data from/to the Data Memory

WB: Write the result data into the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

IFetch Dec Exec Mem WBlw

Chapter 4.48

Why Pipeline? For Performance!

Instr.

Time (clock cycles)

Inst 0

Inst 1

Inst 2

Inst 4

Inst 3

ALUIM Reg DM Reg

Once the pipeline is full, one instruction

is completed every cycle, so

CPI = 1

Time to fill the pipeline

Chapter 4.49

A Pipelined MIPS Processor Start the next instruction before the current one has

completed improves throughput - total amount of work done in a given time instruction latency (execution time, delay time, response time -

time from the start of an instruction to its completion) is notreduced

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

IFetch Dec Exec Mem WBlw

Cycle 7Cycle 6 Cycle 8

sw IFetch Dec Exec Mem WB

R-type IFetch Dec Exec Mem WB

- clock cycle (pipeline stage time) is limited by the slowest stage for some stages don’t need the whole clock cycle (e.g., WB)

Chapter 4.50

Single Cycle versus Pipeline

lw IFetch Dec Exec Mem WBPipeline Implementation (CC = 200 ps):

IFetch Dec Exec Mem WBsw

IFetch Dec Exec Mem WBR-type

Single Cycle Implementation (CC = 800 ps):

lw sw Waste

Cycle 1 Cycle 2

To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why ?

How long does each take to complete 1,000,000 instrs ?

400 ps

Chapter 4.51

Pipelining the MIPS ISA

What makes it easy all instructions are the same length (32 bits)

- can fetch in the 1st stage and decode in the 2nd stage few instruction formats (three) memory operations occur only in loads and stores

- can use the execute stage to calculate memory addresses each instruction writes at most one result (i.e., changes the

machine state) and does it in the last few pipeline stages (MEM or WB)

operands must be aligned in memory so a single data transfer takes only one data memory access

Only cover the following 8 instructions as an example lw, sw, add, sub, and, or, slt, beq

Chapter 4.52

Pipelined Datapath Chapter 4.6, 286

Chapter 4.53

MIPS Pipelined Datapath

Chapter 4.54

Pipeline Registers Need registers between stages

Hold information produced in previous cycle

Chapter 4.55

Single-clock-cycle diagram Cycle-by-cycle flow of instructions through the pipelined

datapath

“Single-clock-cycle” pipeline diagram Show pipeline usage in a single cycle Highlight resources used in each cycle

We will look at “single-clock-cycle” diagrams for load & store instructions

Chapter 4.56

IF for Load & Store

Chapter 4.57

ID for Load & Store

Chapter 4.58

EX for Load & Store

Chapter 4.59

MEM for Load

Chapter 4.60

MEM for Store

Chapter 4.61

WB for Load

Wrongregisternumber

Chapter 4.62

Corrected Pipelined Datapath

Chapter 4.63

MIPS Pipeline Datapath State registers between each pipeline stage to isolate them

IF:IFetch ID:Dec EX:Execute MEM:MemAccess

WB:WriteBack

ReadAddress

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

Shiftleft 2

DataMemory

Address

Write Data

ReadData

SignExtend

ID/EX EX/MEM

MEM/WB

System Clock

Chapter 4.64

Graphically Representing MIPS Pipeline

Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed?

ALUIM Reg DM Reg

Chapter 4.65

Multi-Cycle Pipeline Diagram Showing the resource usage

Chapter 4.66

Multi-Cycle Pipeline Diagram Traditional form

Chapter 4.67

Pipeline Control Chapter 4.6, page 300

Chapter 4.68

Pipelined Control

Chapter 4.69

Pipelined Control Signals Control signals derived from instructions

As in single-cycle implementation

Chapter 4.70

Pipeline Control IF Stage: read Instr Memory (always asserted) and write

PC (on System Clock)

ID Stage: no control signals to set

EX Stage MEM Stage WB StageRegDst

ALUOp1

ALUOp0

ALUSrc

Brch MemRead

MemWrite

RegWrite

Mem toReg

R 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Chapter 4.71

MIPS Pipeline Control Path Modifications

Chapter 4.72

Pipeline Hazards

Data Hazard Chapter 4.7, page 303

Chapter 4.73

Can Pipelining Get Us Into Trouble? Yes: Pipeline Hazards

structural hazards: attempt to use the same resource by two different instructions at the same time

- Structural hazards are solved by duplicating the necessary components

data hazards: attempt to use data before it is ready- An instruction’s source operand(s) are produced by a prior

instruction still in the pipeline

control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated

- branch and jump instructions, exceptions

Can usually resolve hazards by waiting pipeline control must detect the hazard and take action to resolve hazards

Chapter 4.74

Instr.

Time (clock cycles)

Inst 2

Inst 3

Inst 5

Inst 4

ALUMem Reg Mem Reg

A Single Memory Would Be a Structural Hazard

Reading data from memory

Reading instruction from memory

Fix with separate instr and data memories (I$ and D$)

Chapter 4.75

How About Register File Access?

Instr.

Time (clock cycles)

Inst 2

Inst 3

ALUIM Reg DM Reg

Fix simple register file hazard by doing

writes in the first halfof the cycle and

reads in the second half

add $1,

add $2,$1,

Chapter 4.76

Register Usage Can Cause Data Hazards

Chapter 4.77

Register Usage Can Cause Data Hazards

ALUIM Reg DM Reg

Dependencies backward in time cause hazards

add $1,$8,$9

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

Read After Write data hazard

Instr.

Chapter 4.78

Loads Can Cause Data Hazards

Instr.

lw $1,4($2)

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9A

LUIM Reg DM Reg

ALUIM Reg DM Reg

Load-use data hazard● Another Read After Write hazard

Chapter 4.79

Formal Definitions of Data Hazards Consider two instructions i and j, with i occurring before j

in program order

Three data hazards RAW (read after write)

- j tries to read a source before i writes it, so j incorrectly gets the oldvalue

WAW (write after write)- j tries to write an operand before it is written by i, leaving the value

written by i rather than the value written by j in the destination WAR (write after read)

- j tries to write a destination before it is read by i, so i incorrectly gets the new value

In the basic 5-stage pipeline, WAW and WAR dependences do not cause any hazards

Chapter 4.80

stall (insert nop)

One Way to “Fix” a Data Hazard

Instr.

add $1,

ALUIM Reg DM Reg

sub $4,$1,$5

and $6,$1,$7

ALUIM Reg DM Reg

Can fix data hazard by

waiting – stall –but impacts CPI

Chapter 4.81

Another Way to “Fix” a Data Hazard

ALUIM Reg DM Reg

Instr.

add $1,

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

Fix data hazards by forwarding

results as soon as they are availableto where they are

needed

Chapter 4.82

Data Forwarding (aka Bypassing) Take the result from the earliest point where it exists in any

of the pipeline registers and forward it to the functional units (e.g., the ALU) that need it in that cycle

For ALU functional unit: the inputs can come from anypipeline register rather than just from ID/EX by adding multiplexors to both inputs of the ALU connecting the register write data in EX/MEM or MEM/WB to both

ALU mux inputs in the EX’s stage adding the proper control hardware to control the new muxes

With forwarding the processor can achieve a CPI close to 1 even in the presence of data dependencies

Chapter 4.83

Forwarding Illustration

Instr.

add $1,

sub $4,$1,$5

and $6,$7,$1

ALUIM Reg DM Reg

EX forwarding MEM forwarding

Chapter 4.84

Detecting the Need to Forward Pass register numbers along pipeline

e.g., ID/EX.RegisterRs: register number for Rs stored in ID/EX pipeline register

ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt

Data hazards only if1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1b. EX/MEM.RegisterRd = ID/EX.RegisterRt

2a. MEM/WB.RegisterRd = ID/EX.RegisterRs2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

Fwd from previous instr.,i.e., EX/MEM pipeline register

Fwd from second previous instruction, i.e., MEM/WB pipeline register

Chapter 4.85

Detecting the Need to Forward But only if forwarding instruction will write to a register!

EX/MEM.RegWrite MEM/WB.RegWrite

And only if Rd for that instruction is not $zero EX/MEM.RegisterRd ≠ 0 MEM/WB.RegisterRd ≠ 0

Chapter 4.86

Datapath with Forwarding HardwarePCSrc

ReadAddress

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

Shiftleft 2

DataMemory

Address

Write Data

ReadData

SignExtend

ID/EXEX/MEM

MEM/WB

Control

ALUcntrl

Branch

ForwardUnit

ForwardA

ForwardB

Chapter 4.87

Control Values for the Forwarding Multiplexors

Mux control Source Explanation

ForwardA=00 ID/EX The first ALU operand comes from the register file.

ForwardA=10 EX/MEM The first ALU operand is forwarded from the prior ALU result.

ForwardA=01 MEM/WB The first ALU operand is forwarded from the data memory or an earlier ALU result.

ForwardB=00 ID/EX The second ALU operand comes from the register file.

ForwardB=10 EX/MEM The second ALU operand is forwarded from the prior ALU result.

ForwardB=01 MEM/WB The second ALU operand is forwarded from the data memory or an earlier ALU result.

Chapter 4.88

Yet Another Complication!

Instr.

add $1,$1,$2

ALUIM Reg DM Reg

add $1,$1,$3

add $1,$1,$4

ALUIM Reg DM Reg

Another potential data hazard can occur when there is a conflict between the outputs of EX/MEM pipeline register and MEM/WB pipeline register – which should be forwarded?

Chapter 4.89

Yet Another Complication!

Instr.

add $1,$1,$2

ALUIM Reg DM Reg

add $1,$1,$3

add $1,$1,$4

ALUIM Reg DM Reg

Another potential data hazard can occur when there is a conflict between the outputs of EX/MEM pipeline register and MEM/WB pipeline register – which should be forwarded?

What we want

The forwarding we want to avoid

Chapter 4.90

Statement for Forwarding Control Signals (in C) 1. ForwardA:

if ( EX/MEM.RegWrite and(EX/MEM.RegisterRd != 0) and(EX/MEM.RegisterRd = ID/EX.RegisterRs))

ForwardA = 10;

else if ( MEM/WB.RegWrite and(MEM/WB.RegisterRd != 0) and(MEM/WB.RegisterRd = ID/EX.RegisterRs))

ForwardA = 01;

elseForwardA = 00;

2. ForwardB● The logic is similar

Forwards the result from the previous instr. to either input of the ALU

Forwards the result from the second previous instr. to either input of the ALU

No forwarding

Chapter 4.91

Datapath with Forwarding HardwarePCSrc

ReadAddress

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

Shiftleft 2

DataMemory

Address

Write Data

ReadData

SignExtend

ID/EXEX/MEM

MEM/WB

Control

ALUcntrl

Branch

ForwardUnit

ID/EX.RegisterRt

ID/EX.RegisterRs

EX/MEM.RegisterRd

MEM/WB.RegisterRd

Chapter 4.92

Data Hazards and Stalls Chapter 4.7, page 313

Chapter 4.93

Forwarding with Load-use Data Hazards (logical view)

Instr.

lw $1,4($2)

ALUIM Reg DM Reg

ALUIM Reg DM Regsub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

Chapter 4.94

Forwarding with Load-use Data Hazards (logical view)

Instr.

lw $1,4($2)

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9A

LUIM Reg DM RegA

LUIM Reg DM

ALUReg DM Reg

ALUIM Reg DM Reg

Will still need one stall cycle even with forwarding

sub $4,$1,$5

Chapter 4.95

Load-use Hazard Detection Unit Need a Hazard detection Unit in the ID stage that inserts

a stall between the load and its use1. ID Hazard detection Unit:if (ID/EX.MemRead and

((ID/EX.RegisterRt = IF/ID.RegisterRs) or(ID/EX.RegisterRt = IF/ID.RegisterRt)))

stall the pipeline (more accurate, stall the following instructions)

The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)

After this one cycle stall, the forwarding logic can handle the remaining data hazards

Chapter 4.96

Hazard/Stall Hardware Along with the Hazard Unit, we have to implement the stall Prevent the instructions in the IF and ID stages from

progressing down the pipeline – done by preventing the PC register and the IF/ID pipeline register from changing Hazard detection Unit controls the writing of the PC (PC.write)

and IF/ID (IF/ID.write) registers

Insert a “bubble” between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage) (i.e., insert a nop in the execution stream) Set the control bits in the EX, MEM, and WB control fields of the

ID/EX pipeline register to 0 (nop). The Hazard Unit controls the mux that chooses between the real control values and the 0’s.

Let the lw instruction and the following instructions in the pipeline proceed normally down the pipeline

Chapter 4.97

Adding the Hazard/Stall Hardware

ReadAddress

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

ReadData 1

ReadData 2

Shiftleft 2

DataMemory

Address

Write Data

ReadData

SignExtend

ID/EXEX/MEM

MEM/WB

Control

ALUcntrl

Branch

ForwardUnit

HazardUnit

ID/EX.RegisterRt

ID/EX.MemRead

Chapter 4.98

Stall/Bubble in the Pipeline

Chapter 4.99

A Challenge: Memory-to-Memory Copies

Instr.

lw $1,4($2)A

LUIM Reg DM Reg

sw $1,4($3)

ALUIM Reg DM Reg

For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input.

Would need to add a Forward Unit and a mux to the MEM stage

Chapter 4.100

Control Hazards Chapter 4.8

Chapter 4.101

MIPS Pipeline Control Path Modifications

Chapter 4.102

Control Hazards When the flow of instruction addresses is not sequential

(i.e., PC ≠ PC + 4); incurred by change of flow instructions Unconditional branches (j, jal, jr) Conditional branches (beq, bne) Exceptions

Possible approaches Stall (impacts CPI) Move decision point as early in the pipeline as possible, thereby

reducing the number of stall cycles Predict and hope for the best! Delay decision (requires compiler support)

Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

Chapter 4.103

Branch Instr’s Cause Control Hazards

Instr.

Inst 4

Inst 3

ALUIM Reg DM Reg

Chapter 4.104

One Way to “Fix” a Branch Control Hazard

Instr.

ALUIM Reg DM Reg

beq target

ALUIM Reg DM Reg

ALUInst 3

IM Reg DM

ALUIM Reg DM Reg

Fix branch hazard by waiting –

flush – but affects CPI

Chapter 4.105

Another Way to “Fix” a Branch Control Hazard

Instr.

beq target

ALUIM Reg DM Reg

Inst 3

ALUIM Reg DM

ALUIM Reg DM Reg

Move branch decision hardware as early in the pipeline as possible, i.e., during the decode cycle

ALUIM Reg DM Reg

Chapter 4.106

Reducing the Delay of Branches Add hardware to compute the branch target address and

evaluate the branch decision to the ID stage Reduces the number of stall (flush) cycles to one

(like with jumps)- But now need to add forwarding hardware in ID stage

Comparing and updating the PC adds 2 muxes, a comparator, and an and gate to the ID stage

- Computing branch target address can be done in parallel with RegFileread (done for all instructions – only used when needed)

- Comparing the registers can be done in the same clock cycle as RegFile read

For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls

Chapter 4.107

Supporting ID Stage Branches

ReadAddress

InstructionMemory

Write Data

Read Addr 1

Read Addr 2

Write Addr

RegFile

Read Data 1

ReadData 2

Shiftleft 2

DataMemory

Address

Write Data

Read Data

SignExtend

ID/EXEX/MEM

MEM/WB

Control

ALUcntrl

BranchPCSrc

ForwardUnit

HazardUnit

ForwardUnit

Chapter 4.108

Example: Branch Taken Example

36: sub $10, $4, $840: beq $1, $3, 744: and $12, $2, $548: or $13, $2, $652: add $14, $4, $256: slt $15, $6, $7

...68: add $1, $2, $372: lw $4, 50($7)76: and $3, $1, $2

Chapter 4.109

Example: Branch Taken

Chapter 4.110

Example: Branch Taken

Chapter 4.111

Data Hazards for Branches If a comparison register is a destination of 2nd preceding

ALU instruction Can resolve using forwarding

Inst 3

IF ID EX MEM WB

add $4, $5, $6

add $1, $2, $3

beq $1, $4, target

Chapter 4.112

Data Hazards for Branches If a comparison register is a destination of preceding

ALU instruction or 2nd preceding load instruction Need 1 stall cycle

beq stalled

IF ID EX MEM WB

ID EX MEM WB

add $4, $5, $6

lw $1, addr

beq $1, $4, target

Chapter 4.113

Data Hazards for Branches If a comparison register is a destination of immediately

preceding load instruction Need 2 stall cycles But no data forwarding

beq stalled

IF ID EX MEM WB

ID EX MEM WB

beq stalled

lw $1, addr

beq $1, $0, target

Chapter 4.114

Summary All modern processors use pipelining for performance

(a CPI close to 1 and fast clock frequency) Pipeline clock rate limited by slowest pipeline stage –

so designing a balanced pipeline is important Pipelining doesn’t decrease the latency of single task,

it increases the throughput of entire workload Must detect and resolve hazards

Structural hazards – resolved by designing the pipeline correctly

Data hazards- Stall (impacts CPI)- Forward (requires hardware support)

Control hazards – put the branch decision hardware in as early a stage in the pipeline as possible

- Stall (impacts CPI)- Static and dynamic prediction (requires hardware support)- Delay decision (requires compiler support)

lecture 3 processor - university of arkansas

Documents

biblical chronology - lecture 1 arkansas notes

ece669 l19: processor design april 8, 2004 ece 669 parallel...

csc 2400 computer systems i lecture 4 processor architecture

lecture 2 x86 processor...

lecture 2: introduction to cell processor

lecture 2: mips processor example - university of...

lecture 11: a basic snooping-based multi-processor

lecture 24: parallel processor architecture algorithms ·...

lecture 2 instructions.ppt - university of arkansas

lecture 5: inter processor...

new lecture 2: single-processor computing · 2020. 12....

processor design: how to implement mips simplicity...

lecture 2 single processor machines: memory hierarchies...

lecture 5. mips processor design single-cycle mips #2

lecture 10 processor microarchitecture (part 1)

eecs150 - digital design lecture 8 –mips processor...

lecture 9. mips processor design – instruction fetch

ece 669 parallel computer architecture lecture 19 processor...

lecture 3 processor: datapath and control - ida

it253: computer organization lecture 9: making a processor:...