lecture 3 processor - university of arkansas


Post on 04-May-2022


Page 1: Lecture 3 processor - University of Arkansas

Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook)

Introduction (Chapter 4.1)

Review: MIPS (RISC) Design Principles

Simplicity favors regularity
- fixed-size instructions
- small number of instruction formats
- opcode always the first 6 bits

Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes

Make the common case fast
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands

Good design demands good compromises
- three instruction formats

The Processor: Datapath & Control

Our implementation of the MIPS is simplified
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j

Generic implementation
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction

All instructions (except j) use the ALU after reading the registers
- How is it used by memory-reference, arithmetic, and control flow instructions?

[Figure: execution loop: Fetch (PC = PC + 4) → Decode → Exec]

Logic Design Conventions (Chapter 4.2)

Logic Design Basics

Information encoded in binary
- Low voltage = 0, high voltage = 1
- One wire per bit
- Multi-bit data encoded on multi-wire buses

Combinational components
- Operate on data
- Output is a function of input

State (sequential) components
- Store information

Combinational Components

- AND-gate: Y = A & B
- Adder: Y = A + B
- Multiplexer: Y = S ? I1 : I0
- Arithmetic/Logic Unit: Y = F(A, B)

[Figure: schematic symbols for the AND gate (inputs A, B), adder (inputs A, B), multiplexer (inputs I0, I1, select S), and ALU (inputs A, B, function select F), each with output Y]
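As a rough software analogy (a sketch, not part of the original slides), each combinational component can be written as a pure function of its inputs. The function names and the 32-bit bus width are assumptions made for illustration.

```c
#include <stdint.h>

/* AND gate applied bitwise to two 32-bit buses: Y = A & B */
uint32_t and_gate(uint32_t a, uint32_t b) { return a & b; }

/* Adder: Y = A + B (wraps modulo 2^32, like a 32-bit adder with the carry-out dropped) */
uint32_t adder(uint32_t a, uint32_t b) { return a + b; }

/* 2-to-1 multiplexer: Y = S ? I1 : I0 */
uint32_t mux2(uint32_t i0, uint32_t i1, int s) { return s ? i1 : i0; }
```

None of these functions keeps any state between calls, which is what makes them combinational: the output depends only on the current inputs.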

Sequential Components

Register: stores data in a circuit
- Uses a clock signal to determine when to update the stored value
- Edge-triggered: update when the clock signal changes value
  - Rising edge (positive edge): 0→1
  - Falling edge (negative edge): 1→0
- D flip-flop is one option to implement a register

[Figure: D flip-flop symbol (inputs D, Clk; output Q) and timing diagram of Clk, D, Q]

Register with write control
- Only updates on clock edge when write control input (e.g., Write Enable) is asserted

[Figure: register with write enable (inputs D, Clk, WE; output Q) and timing diagram of Clk, WE, D, Q]
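The write-enable behavior can be sketched in C as follows. This is a hypothetical software model, assuming a rising-edge-triggered register; the type `reg_we_t` and function `reg_we_step` are names invented here, not taken from the slides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a rising-edge register with write enable. */
typedef struct {
    uint32_t q;        /* stored value (output Q) */
    bool     prev_clk; /* clock level seen on the previous call */
} reg_we_t;

/* Present the current input levels. Q changes only on a 0->1 clock
   transition while WE is asserted. Returns Q after the call. */
uint32_t reg_we_step(reg_we_t *r, bool clk, bool we, uint32_t d) {
    bool rising = clk && !r->prev_clk;
    if (rising && we) {
        r->q = d;
    }
    r->prev_clk = clk;
    return r->q;
}
```

Calling `reg_we_step` with the clock held high, on a falling edge, or with WE deasserted leaves Q unchanged, matching the timing diagram described on the slide.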

Clocking Methodologies

The clocking methodology defines when data in a state element are valid and stable relative to the clock
- State elements: a memory element such as a register
- Edge-triggered: all state changes occur on a clock edge

Typical execution
- read contents of state elements → send values through combinational logic → write results to one or more state elements

[Figure: state element 1 → combinational logic → state element 2, both clocked, spanning one clock cycle]

Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed
- write occurs only when both the write control is asserted and the clock edge occurs

Building a Datapath (Chapter 4.3)

Fetching Instructions

Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction

[Figure: the PC drives the Read Address input of the Instruction Memory and an adder that computes PC + 4; the clock drives the PC]

PC is updated every clock cycle, so it does not need an explicit write control signal
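The fetch step can be sketched as below. This is a minimal model under stated assumptions: the Instruction Memory is a plain array of 32-bit words, addresses are byte addresses, and an out-of-range fetch returns 0. The name `fetch` and those choices are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fetch step: read the instruction at PC, then PC = PC + 4. */
uint32_t fetch(const uint32_t *imem, size_t n_words, uint32_t *pc) {
    uint32_t index = *pc / 4u;  /* byte address -> word index */
    /* Assumption: reads outside the modeled memory return 0. */
    uint32_t instr = (index < n_words) ? imem[index] : 0u;
    *pc += 4u;                  /* address of the next sequential instruction */
    return instr;
}
```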

Decoding Instructions

Decoding instructions involves
- sending the fetched instruction’s opcode and function field bits to the control unit
- reading two values from the Register File
  - Register File addresses are contained in the instruction

[Figure: the Instruction feeds the Control Unit and the Register File (inputs Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)]

Executing R Format Operations

R format operations (add, sub, slt, and, or)
- perform operation (op and funct) on values in rs and rt
- store the result back into the Register File (into location rd)

R-type format:
  field   op      rs      rt      rd      shamt   funct
  bits    31-26   25-21   20-16   15-11   10-6    5-0

[Figure: Register File outputs Read Data 1 and Read Data 2 feed the ALU (outputs: overflow, zero); the ALU result returns to Write Data; control signals ALU control and RegWrite]

Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File

Executing Store Operations

Store operations involve
- computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
- writing the store value (read from the Register File during decode) to the Data Memory

[Figure: Register File and ALU (outputs: overflow, zero; ALU control) with Sign Extend (16 → 32) supplying the offset; the ALU result drives the Data Memory Address, Read Data 2 drives Write Data; control signal MemWrite]

Executing Load Operations

Load operations involve
- computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
- writing the load value (read from the Data Memory) to the Register File

[Figure: same datapath as for stores; the Data Memory Read Data output returns to the Register File Write Data input; control signals MemRead and RegWrite]
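The address calculation shared by lw and sw can be sketched as follows. The helper names `sign_extend16` and `mem_address` are illustrative assumptions; the arithmetic is the base-plus-sign-extended-offset rule stated on the slides.

```c
#include <stdint.h>

/* Sign-extend the 16-bit offset field to 32 bits (the Sign Extend unit). */
uint32_t sign_extend16(uint16_t imm) {
    return (imm & 0x8000u) ? (0xFFFF0000u | imm) : (uint32_t)imm;
}

/* Effective address for lw/sw: base register + sign-extended offset. */
uint32_t mem_address(uint32_t base, uint16_t offset) {
    return base + sign_extend16(offset);
}
```

For example, an offset field of 0xFFFC encodes -4, so `mem_address(0x1000, 0xFFFC)` yields 0x0FFC.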

Executing Branch Operations

Branch operations involve
- comparing the operands read from the Register File during decode for equality (zero ALU output)
- computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction (shifted left by 2 bits)

[Figure: Register File outputs feed the ALU, whose zero output goes to the branch control logic; the offset passes through Sign Extend (16 → 32) and Shift left 2, and an adder adds it to PC + 4 to produce the branch target address]
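A sketch of the branch computation, assuming the datapath in the figure (sign extension, shift left 2, then addition to PC + 4). The function names are hypothetical, and equality is tested the way the ALU does it, by subtracting and checking for zero.

```c
#include <stdint.h>

/* Branch target = (PC + 4) + (sign-extended offset << 2). */
uint32_t branch_target(uint32_t pc, uint16_t offset) {
    uint32_t ext = (offset & 0x8000u) ? (0xFFFF0000u | offset) : (uint32_t)offset;
    return (pc + 4u) + (ext << 2);
}

/* Next PC for beq: the branch is taken only when the ALU subtraction
   of the two register operands produces zero. */
uint32_t next_pc_beq(uint32_t pc, uint32_t rs_val, uint32_t rt_val, uint16_t offset) {
    int zero = (rs_val - rt_val) == 0u;
    return zero ? branch_target(pc, offset) : pc + 4u;
}
```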

Executing Jump Operations

Jump operation involves
- replacing the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

[Figure: the low 26 bits of the Instruction Memory output pass through Shift left 2 to become 28 bits, which are joined with 4 bits from the PC + 4 adder to form the jump address]
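The jump address formation can be sketched as below. It assumes, following the labels on the later datapath figure (PC+4[31-28]), that the upper 4 bits come from PC + 4; `jump_target` is a name chosen for illustration.

```c
#include <stdint.h>

/* Jump target: upper 4 bits of PC + 4, followed by the 26-bit target
   field shifted left by 2 (giving the lower 28 bits). */
uint32_t jump_target(uint32_t pc, uint32_t instr) {
    uint32_t upper4  = (pc + 4u) & 0xF0000000u;
    uint32_t lower28 = (instr & 0x03FFFFFFu) << 2;
    return upper4 | lower28;
}
```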

Creating a Single Datapath from the Parts

Assemble the datapath segments and add control lines and multiplexors as needed

Single cycle design: fetch, decode and execute each instruction in one clock cycle
- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors needed at the input of shared elements with control lines to do the selection
- write signals to control writing to the Register File and Data Memory

Cycle time is determined by the length of the longest path

Fetch, Register, and Memory Access Portions

[Figure: combined datapath: PC, Instruction Memory, PC + 4 adder, Register File, Sign Extend (16 → 32), ALU (outputs: ovf, zero), Data Memory; control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg]

zero flag of ALU
- ‘1’: result is zero
- ‘0’: result is not zero

Full Datapath (without Jump)

[Figure: full single-cycle datapath without jump]

Adding Control Units (Chapter 4.4, page 259)

Adding the Control

- Selecting the operations to perform (ALU, Register File and Memory read/write)
- Controlling the flow of data (multiplexor inputs)

Instruction formats:
  R-type:  op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
  I-type:  op (31-26) | rs (25-21) | rt (20-16) | address offset (15-0)
  J-type:  op (31-26) | target address (25-0)

Observations
- op field always in bits 31-26
- addresses of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16)
  - in lw and sw, rs is the base register
- address of register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0
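The fixed bit positions listed above make field extraction a matter of shifts and masks. The sketch below pulls every field out of a 32-bit instruction word at once; the struct `fields_t` and function `decode_fields` are hypothetical names.

```c
#include <stdint.h>

/* All fields of the three formats, extracted unconditionally;
   which ones are meaningful depends on the opcode. */
typedef struct {
    uint8_t  op, rs, rt, rd, shamt, funct;
    uint16_t imm;     /* bits 15-0: offset for beq, lw, sw */
    uint32_t target;  /* bits 25-0: J-type target address */
} fields_t;

fields_t decode_fields(uint32_t instr) {
    fields_t f;
    f.op     = (uint8_t)((instr >> 26) & 0x3Fu);
    f.rs     = (uint8_t)((instr >> 21) & 0x1Fu);
    f.rt     = (uint8_t)((instr >> 16) & 0x1Fu);
    f.rd     = (uint8_t)((instr >> 11) & 0x1Fu);
    f.shamt  = (uint8_t)((instr >> 6) & 0x1Fu);
    f.funct  = (uint8_t)(instr & 0x3Fu);
    f.imm    = (uint16_t)(instr & 0xFFFFu);
    f.target = instr & 0x03FFFFFFu;
    return f;
}
```

For instance, the word 0x8C410004 decodes to op = 35, rs = 2, rt = 1, imm = 4, which is lw $1,4($2).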

The Main Control Unit

Control signals derived from instruction

  Type         31:26      25:21   20:16   15:11   10:6    5:0
  R-type       0          rs      rt      rd      shamt   funct
  Load/Store   35 or 43   rs      rt      address (15:0)
  Branch       4          rs      rt      address (15:0)

Field roles
- bits 31:26: opcode
- rs (25:21): always read
- rt (20:16): always read
- destination register: write for R-type and load
- address (15:0): sign-extend

Single Cycle Datapath with Control Unit (w/o jump)

[Figure: single-cycle datapath with control unit. Components: PC, Instruction Memory, two adders, Shift left 2, Register File, Sign Extend (16 → 32), ALU (outputs: ovf, zero), ALU control, Data Memory, Control Unit, and 0/1 multiplexors. Instruction fields: Instr[31-26] to the Control Unit; Instr[25-21], Instr[20-16], and Instr[15-11] to the Register File address inputs; Instr[15-0] to Sign Extend; Instr[5-0] to ALU control. Control signals: RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, PCSrc]

Single Cycle Datapath with Control Unit

[Figure: the same datapath extended for jump: Instr[25-0] (26 bits) passes through Shift left 2 (28 bits) and is combined with PC+4[31-28] to form the 32-bit jump address, selected by a 0/1 multiplexor controlled by the Jump signal]

ALU Control

ALU used for
- Load/Store: Function = add
- Branch: Function = subtract
- R-type: Function depends on funct field

  ALU control   Function
  0000          AND
  0001          OR
  0010          add
  0110          subtract
  0111          set-on-less-than

Assume 2-bit ALUOp derived from opcode
- Combinational logic derives ALU control

  opcode   ALUOp   Operation          funct    ALU function       ALU control
  lw       00      load word          XXXXXX   add                0010
  sw       00      store word         XXXXXX   add                0010
  beq      01      branch equal       XXXXXX   subtract           0110
  R-type   10      add                100000   add                0010
                   subtract           100010   subtract           0110
                   AND                100100   AND                0000
                   OR                 100101   OR                 0001
                   set-on-less-than   101010   set-on-less-than   0111
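The two tables translate directly into a pair of C functions: one that maps ALUOp and funct to the 4-bit ALU control code, and one that performs the selected operation. This is an illustrative sketch; the names `alu_control` and `alu`, the `-1` return for unlisted combinations, and the choice to return 0 for an unknown control code are assumptions.

```c
#include <stdint.h>

enum { ALU_AND = 0x0, ALU_OR = 0x1, ALU_ADD = 0x2, ALU_SUB = 0x6, ALU_SLT = 0x7 };

/* Map the 2-bit ALUOp and 6-bit funct field to the 4-bit ALU control code.
   Returns -1 for combinations that the table does not list. */
int alu_control(unsigned alu_op, unsigned funct) {
    switch (alu_op) {
    case 0x0: return ALU_ADD;               /* lw, sw: funct is a don't-care */
    case 0x1: return ALU_SUB;               /* beq: funct is a don't-care */
    case 0x2:                               /* R-type: decided by funct */
        switch (funct & 0x3Fu) {
        case 0x20: return ALU_ADD;          /* 100000 */
        case 0x22: return ALU_SUB;          /* 100010 */
        case 0x24: return ALU_AND;          /* 100100 */
        case 0x25: return ALU_OR;           /* 100101 */
        case 0x2A: return ALU_SLT;          /* 101010 */
        default:   return -1;
        }
    default: return -1;
    }
}

/* Y = F(A, B) for the five functions above; *zero is set to 1 when Y == 0. */
uint32_t alu(int ctl, uint32_t a, uint32_t b, int *zero) {
    uint32_t y = 0;
    switch (ctl) {
    case ALU_AND: y = a & b; break;
    case ALU_OR:  y = a | b; break;
    case ALU_ADD: y = a + b; break;
    case ALU_SUB: y = a - b; break;
    case ALU_SLT: y = ((int32_t)a < (int32_t)b) ? 1u : 0u; break;
    default: break;                         /* unknown code: Y stays 0 (assumption) */
    }
    *zero = (y == 0u);
    return y;
}
```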

Single-cycle Implementation (Chapter 4.4, page 264)

R-type Instruction Data/Control Flow

[Figure: single-cycle datapath with control unit, showing the data and control flow for an R-type instruction]

R-type Instruction Data/Control Flow (alt.)

[Figure: alternative drawing of the R-type data/control flow]

Load Word Instruction Data/Control Flow

[Figure: single-cycle datapath with control unit, showing the data and control flow for lw (two slides)]

Load Word Instruction Data/Control Flow (alt.)

[Figure: alternative drawing of the lw data/control flow]

Store Word Instruction Data/Control Flow

[Figure: single-cycle datapath with control unit, showing the data and control flow for sw (two slides)]

Branch Instruction Data/Control Flow

[Figure: single-cycle datapath with control unit, showing the data and control flow for beq (two slides)]

Branch Instruction Data/Control Flow (alt.)

[Figure: alternative drawing of the beq data/control flow]

Adding the Jump Operation

[Figure: single-cycle datapath with control unit plus the jump path: Instr[25-0] (26 bits) through Shift left 2 (28 bits), combined with PC+4[31-28] into a 32-bit jump address, selected by a 0/1 multiplexor controlled by Jump]

Jump Operation (alt.)

[Figure: alternative drawing of the jump data/control flow]

Introduction to Pipelining Design (Chapter 4.5)

Instruction Critical Paths

What is the clock cycle time assuming negligible delays for muxes, control unit, sign extension, PC access, shift left 2, wires, setup and hold times except:
- Instruction Memory and Data Memory (200 ps)
- ALU and adders (200 ps)
- Register File access (reads or writes) (100 ps)

  Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
  R-type   200     100      200              100      600
  load     200     100      200      200     100      800
  store    200     100      200      200              700
  beq      200     100      200                       500
  jump     200                                        200
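The totals can be checked mechanically: sum the delays of the units each instruction uses, then take the maximum, since a single-cycle clock must cover the longest path. The sketch below does this with the slide's delay values; the table layout and the names `path_t`, `path_delay`, and `cycle_time` are assumptions.

```c
#include <stddef.h>

/* Unit delays in ps, as given on the slide. */
enum { T_IMEM = 200, T_REG_RD = 100, T_ALU = 200, T_DMEM = 200, T_REG_WR = 100 };

/* Which units lie on each instruction's critical path (1 = used). */
typedef struct {
    const char *name;
    int imem, reg_rd, alu, dmem, reg_wr;
} path_t;

static const path_t PATHS[] = {
    { "R-type", 1, 1, 1, 0, 1 },
    { "load",   1, 1, 1, 1, 1 },
    { "store",  1, 1, 1, 1, 0 },
    { "beq",    1, 1, 1, 0, 0 },
    { "jump",   1, 0, 0, 0, 0 },
};

/* Total delay of one instruction's path in ps. */
int path_delay(const path_t *p) {
    return p->imem * T_IMEM + p->reg_rd * T_REG_RD + p->alu * T_ALU
         + p->dmem * T_DMEM + p->reg_wr * T_REG_WR;
}

/* The single-cycle clock period is the longest path over all instructions. */
int cycle_time(void) {
    int max = 0;
    for (size_t i = 0; i < sizeof PATHS / sizeof PATHS[0]; i++) {
        int d = path_delay(&PATHS[i]);
        if (d > max) max = d;
    }
    return max;
}
```

With these numbers the load path is the longest, so `cycle_time()` evaluates to 800 ps, the single-cycle clock period used later in the lecture.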

Single Cycle Disadvantages & Advantages

- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - especially problematic for more complex instructions like floating-point multiplication
- Is slow
- but is simple and easy to understand

[Figure: clock timeline: lw fills Cycle 1; sw finishes before the end of Cycle 2 and the remainder of the cycle is waste]

How Can We Make It Faster?

Start fetching and executing the next instruction before the current one has completed
- Pipelining: modern processors are pipelined for performance
- Remember the performance equation: CPU time = IC × CPI × CC

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
- A five-stage pipeline is nearly five times faster because the clock cycle (CC) is nearly five times shorter
  - CPI = 1 for single-cycle implementation
  - CPI ≈ 1 for pipelined implementation

Analogy: Assembly Line vs. Mechanic Shop

Mechanic shop
- The mechanic needs to do everything
- It takes hours to fix just one car

Car assembly line
- Many workers work together
  - Each worker just puts one or a few components into the car
- One assembly line can produce hundreds or thousands of cars per day

The Five Stages of Executing an Instruction

- IFetch: Instruction Fetch and Update PC
- Dec: Registers Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address; etc.
- Mem: Read/write the data from/to the Data Memory
- WB: Write the result data into the register file

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
  lw    IFetch    Dec       Exec      Mem       WB

Why Pipeline? For Performance!

[Figure: multi-cycle pipeline diagram, instruction order vs. time (clock cycles): Inst 0 through Inst 4 each pass through IM, Reg, ALU, DM, Reg; the initial cycles are labeled "Time to fill the pipeline"]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

A Pipelined MIPS Processor

Start the next instruction before the current one has completed
- improves throughput: total amount of work done in a given time
- instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced

           Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5   Cycle 6   Cycle 7   Cycle 8
  lw       IFetch    Dec       Exec      Mem       WB
  sw                 IFetch    Dec       Exec      Mem       WB
  R-type                       IFetch    Dec       Exec      Mem       WB

- clock cycle (pipeline stage time) is limited by the slowest stage
- some stages don’t need the whole clock cycle (e.g., WB)

Single Cycle versus Pipeline

Single Cycle Implementation (CC = 800 ps):
[Figure: clock timeline: lw fills Cycle 1; sw runs in Cycle 2 with the end of the cycle marked as waste]

Pipeline Implementation (CC = 200 ps):
[Figure: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB; an interval of 400 ps is marked]

- To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
- How long does each take to complete 1,000,000 instrs?
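One way to work the second question, as a sketch under idealized assumptions (no stalls or hazards, five stages, and the two clock periods given above). The function names are invented for illustration.

```c
#include <stdint.h>

/* Single-cycle: every instruction takes one full clock cycle. */
uint64_t single_cycle_time_ps(uint64_t n_instr, uint64_t cc_ps) {
    return n_instr * cc_ps;
}

/* Ideal pipeline (no stalls): the first instruction needs `stages` cycles,
   after which one instruction completes per cycle. */
uint64_t pipelined_time_ps(uint64_t n_instr, uint64_t stages, uint64_t cc_ps) {
    if (n_instr == 0) return 0;
    return (stages + n_instr - 1) * cc_ps;
}
```

Under these assumptions, 1,000,000 instructions take 1,000,000 × 800 ps = 800,000,000 ps single-cycle and (5 + 1,000,000 - 1) × 200 ps = 200,000,800 ps pipelined, a speedup of about 4. A single instruction takes 5 × 200 ps = 1000 ps in the pipeline because every stage is stretched to the length of the slowest one.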

Pipelining the MIPS ISA

What makes it easy
- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three)
- memory operations occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
- operands must be aligned in memory so a single data transfer takes only one data memory access

Only cover the following 8 instructions as an example: lw, sw, add, sub, and, or, slt, beq

Pipelined Datapath (Chapter 4.6, page 286)

MIPS Pipelined Datapath

[Figure: MIPS pipelined datapath]

Pipeline Registers
- Need registers between stages
  - Hold information produced in previous cycle

[Figure: pipelined datapath with pipeline registers between the stages]

Single-clock-cycle diagram
- Cycle-by-cycle flow of instructions through the pipelined datapath
- “Single-clock-cycle” pipeline diagram
  - Show pipeline usage in a single cycle
  - Highlight resources used in each cycle
- We will look at “single-clock-cycle” diagrams for load & store instructions

[Figures: single-clock-cycle diagrams of the pipelined datapath, one slide per stage]
- IF for Load & Store
- ID for Load & Store
- EX for Load & Store
- MEM for Load
- MEM for Store
- WB for Load (annotated: wrong register number)

Corrected Pipelined Datapath

[Figure: corrected pipelined datapath]

MIPS Pipeline Datapath

State registers between each pipeline stage to isolate them

Stages: IF: IFetch | ID: Dec | EX: Execute | MEM: MemAccess | WB: WriteBack

[Figure: pipelined datapath (PC, Instruction Memory, adders, Register File, Sign Extend 16 → 32, Shift left 2, ALU, Data Memory) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB driven by the System Clock]

Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg]

Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?

Multi-Cycle Pipeline Diagram
- Showing the resource usage

Multi-Cycle Pipeline Diagram
- Traditional form

Pipeline Control (Chapter 4.6, page 300)

Pipelined Control

[Figure: pipelined datapath with control]

Pipelined Control Signals
- Control signals derived from instructions
  - As in single-cycle implementation

Pipeline Control
- IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
- ID Stage: no control signals to set

          EX Stage                           MEM Stage                  WB Stage
  Instr   RegDst  ALUOp1  ALUOp0  ALUSrc     Brch  MemRead  MemWrite    RegWrite  MemtoReg
  R       1       1       0       0          0     0        0           1         0
  lw      0       0       0       1          0     1        0           1         1
  sw      X       0       0       1          0     0        1           0         X
  beq     X       0       1       0          1     0        0           0         X
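The control table can be held as a constant lookup table. In this sketch the don't-care entries (X) are encoded as -1; the struct `ctrl_t`, the table `CTRL`, and the lookup by mnemonic string are illustrative assumptions (real control logic decodes the opcode bits).

```c
#include <string.h>

/* One row of the control table; -1 encodes a don't-care (X). */
typedef struct {
    const char *instr;
    int reg_dst, alu_op1, alu_op0, alu_src;  /* EX stage  */
    int branch, mem_read, mem_write;         /* MEM stage */
    int reg_write, mem_to_reg;               /* WB stage  */
} ctrl_t;

static const ctrl_t CTRL[] = {
    { "R",    1, 1, 0, 0,  0, 0, 0,  1,  0 },
    { "lw",   0, 0, 0, 1,  0, 1, 0,  1,  1 },
    { "sw",  -1, 0, 0, 1,  0, 0, 1,  0, -1 },
    { "beq", -1, 0, 1, 0,  1, 0, 0,  0, -1 },
};

/* Look up a row by mnemonic; returns NULL if the instruction is not in the table. */
const ctrl_t *ctrl_lookup(const char *instr) {
    for (size_t i = 0; i < sizeof CTRL / sizeof CTRL[0]; i++) {
        if (strcmp(CTRL[i].instr, instr) == 0) return &CTRL[i];
    }
    return NULL;
}
```

Grouping the fields by stage mirrors how the signals travel: the EX group is consumed in the EX stage, while the MEM and WB groups ride along in the pipeline registers until their stage is reached.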

MIPS Pipeline Control Path Modifications

[Figure: pipelined datapath with the control path added]

Pipeline Hazards

Data Hazard (Chapter 4.7, page 303)

Can Pipelining Get Us Into Trouble?

Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
  - Structural hazards are solved by duplicating the necessary components
- data hazards: attempt to use data before it is ready
  - An instruction’s source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
  - branch and jump instructions, exceptions

Can usually resolve hazards by waiting
- pipeline control must detect the hazard and take action to resolve hazards

A Single Memory Would Be a Structural Hazard

[Figure: multi-cycle pipeline diagram, instruction order vs. time (clock cycles): lw followed by Inst 2 through Inst 5, each using Mem, Reg, ALU, Mem, Reg; in the same cycle one instruction is reading data from memory while another is reading an instruction from memory]

Fix with separate instr and data memories (I$ and D$)

How About Register File Access?

[Figure: multi-cycle pipeline diagram, instruction order vs. time (clock cycles): "add $1," then Inst 2 and Inst 3, then "add $2,$1,"; each instruction uses IM, Reg, ALU, DM, Reg]

Fix simple register file hazard by doing writes in the first half of the cycle and reads in the second half

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards

  add $1,$8,$9
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or  $8,$1,$9

[Figure: multi-cycle pipeline diagram of the five instructions in program order, each using IM, Reg, ALU, DM, Reg]

Read After Write data hazard

Loads Can Cause Data Hazards

Dependencies backward in time cause hazards

  lw  $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or  $8,$1,$9

[Figure: multi-cycle pipeline diagram of the five instructions in program order, each using IM, Reg, ALU, DM, Reg]

Load-use data hazard
- Another Read After Write hazard

Formal Definitions of Data Hazards

Consider two instructions i and j, with i occurring before j in program order

Three data hazards
- RAW (read after write)
  - j tries to read a source before i writes it, so j incorrectly gets the old value
- WAW (write after write)
  - j tries to write an operand before it is written by i, leaving the value written by i rather than the value written by j in the destination
- WAR (write after read)
  - j tries to write a destination before it is read by i, so i incorrectly gets the new value

In the basic 5-stage pipeline, WAW and WAR dependences do not cause any hazards

One Way to “Fix” a Data Hazard

  add $1,
  stall (insert nop)
  stall (insert nop)
  sub $4,$1,$5
  and $6,$1,$7

[Figure: multi-cycle pipeline diagram in program order; the instructions shown use IM, Reg, ALU, DM, Reg]

Can fix data hazard by waiting (stall), but this impacts CPI

Another Way to “Fix” a Data Hazard

  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or  $8,$1,$9

[Figure: multi-cycle pipeline diagram of the five instructions in program order, each using IM, Reg, ALU, DM, Reg]

Fix data hazards by forwarding results as soon as they are available to where they are needed

Data Forwarding (aka Bypassing)

Take the result from the earliest point where it exists in any of the pipeline registers and forward it to the functional units (e.g., the ALU) that need it in that cycle

For the ALU functional unit, the inputs can come from any pipeline register rather than just from ID/EX by
- adding multiplexors to both inputs of the ALU
- connecting the register write data in EX/MEM or MEM/WB to both ALU mux inputs in the EX stage
- adding the proper control hardware to control the new muxes

With forwarding the processor can achieve a CPI close to 1 even in the presence of data dependencies

Forwarding Illustration

  add $1,
  sub $4,$1,$5
  and $6,$7,$1

[Figure: multi-cycle pipeline diagram of the three instructions in program order, with paths labeled EX forwarding and MEM forwarding]

Detecting the Need to Forward

Pass register numbers along pipeline
- e.g., ID/EX.RegisterRs: register number for Rs stored in ID/EX pipeline register

ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt

Data hazards only if
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
      (forward from previous instruction, i.e., the EX/MEM pipeline register)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
      (forward from second previous instruction, i.e., the MEM/WB pipeline register)

But only if the forwarding instruction will write to a register!
- EX/MEM.RegWrite, MEM/WB.RegWrite

And only if Rd for that instruction is not $zero
- EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Datapath with Forwarding Hardware

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; Control generates the EX, M, and WB control fields carried through the pipeline registers; also shown are ALU cntrl, Branch, PCSrc, and a Forward Unit producing ForwardA and ForwardB]

Control Values for the Forwarding Multiplexors

  Mux control     Source    Explanation
  ForwardA = 00   ID/EX     The first ALU operand comes from the register file.
  ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result.
  ForwardA = 01   MEM/WB    The first ALU operand is forwarded from the data memory or an earlier ALU result.
  ForwardB = 00   ID/EX     The second ALU operand comes from the register file.
  ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result.
  ForwardB = 01   MEM/WB    The second ALU operand is forwarded from the data memory or an earlier ALU result.

Yet Another Complication!

Another potential data hazard can occur when there is a conflict between the outputs of the EX/MEM pipeline register and the MEM/WB pipeline register: which should be forwarded?

  add $1,$1,$2
  add $1,$1,$3
  add $1,$1,$4

[Figure: multi-cycle pipeline diagram of the three instructions, marking the forwarding path we want and the forwarding path we want to avoid]

Statement for Forwarding Control Signals (in C)

1. ForwardA:

   if (EX/MEM.RegWrite && (EX/MEM.RegisterRd != 0) &&
       (EX/MEM.RegisterRd == ID/EX.RegisterRs))
     ForwardA = 10;   /* forwards the result from the previous instr. to either input of the ALU */
   else if (MEM/WB.RegWrite && (MEM/WB.RegisterRd != 0) &&
            (MEM/WB.RegisterRd == ID/EX.RegisterRs))
     ForwardA = 01;   /* forwards the result from the second previous instr. to either input of the ALU */
   else
     ForwardA = 00;   /* no forwarding */

2. ForwardB: the logic is similar, with ID/EX.RegisterRt in place of ID/EX.RegisterRs.
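
The selection logic on this slide can be written as compilable C. This is only a sketch: the struct, field, and function names are hypothetical stand-ins for the pipeline-register fields named on the slide, and the mux codes 00/01/10 are represented as the integers 0, 1, and 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the pipeline-register fields the Forward Unit reads. */
typedef struct {
    bool    ex_mem_reg_write;  /* EX/MEM.RegWrite   */
    uint8_t ex_mem_rd;         /* EX/MEM.RegisterRd */
    bool    mem_wb_reg_write;  /* MEM/WB.RegWrite   */
    uint8_t mem_wb_rd;         /* MEM/WB.RegisterRd */
} ForwardInputs;

/* Mux control values: binary 00, 01, 10. */
enum { FWD_REGFILE = 0, FWD_MEM_WB = 1, FWD_EX_MEM = 2 };

/* Select the mux control for one ALU operand whose source register is src:
 * pass ID/EX.RegisterRs for ForwardA, ID/EX.RegisterRt for ForwardB. */
unsigned forward_select(const ForwardInputs *in, uint8_t src)
{
    /* The EX/MEM test comes first, so the most recent result wins
     * when both pipeline registers hold a value for src. */
    if (in->ex_mem_reg_write && in->ex_mem_rd != 0 && in->ex_mem_rd == src)
        return FWD_EX_MEM;
    if (in->mem_wb_reg_write && in->mem_wb_rd != 0 && in->mem_wb_rd == src)
        return FWD_MEM_WB;
    return FWD_REGFILE;
}
```

With both EX/MEM and MEM/WB holding a result for $1, as in three back-to-back adds into $1, the function returns the EX/MEM code because that test is evaluated first.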

Datapath with Forwarding Hardware

[Figure: pipelined datapath with a Forward Unit. Its inputs are ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd; the two ALU-input multiplexors have inputs labeled 00, 01, and 10.]

Data Hazards and Stalls Chapter 4.7, page 313

Forwarding with Load-use Data Hazards (logical view)

Instr. order:
  lw  $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or  $8,$1,$9

[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous one.]

Forwarding with Load-use Data Hazards (logical view)

Instr. order:
  lw  $1,4($2)
  sub $4,$1,$5    (fetched in IM, then held for one stall cycle)
  and $6,$1,$7
  xor $4,$1,$5
  or  $8,$1,$9

[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg; a stall is inserted after the lw.]

Will still need one stall cycle even with forwarding

Load-use Hazard Detection Unit

Need a Hazard Detection Unit in the ID stage that inserts a stall between the load and its use

1. ID Hazard Detection Unit:

   if (ID/EX.MemRead &&
       ((ID/EX.RegisterRt == IF/ID.RegisterRs) ||
        (ID/EX.RegisterRt == IF/ID.RegisterRt)))
     stall the pipeline (more accurately, stall the following instructions)

The first line tests whether the instruction now in the EX stage is a lw; the next two lines check whether the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)

After this one-cycle stall, the forwarding logic can handle the remaining data hazards
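
The detection condition above can be expressed as a small C predicate. The function and parameter names are hypothetical; each parameter corresponds to one pipeline-register field named on the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the instruction in ID must wait one cycle for a load in EX.
 * id_ex_mem_read : ID/EX.MemRead    (the instruction in EX is a lw)
 * id_ex_rt       : ID/EX.RegisterRt (destination register of that lw)
 * if_id_rs/rt    : IF/ID.RegisterRs and IF/ID.RegisterRt (sources of the instruction in ID) */
bool load_use_hazard(bool id_ex_mem_read, uint8_t id_ex_rt,
                     uint8_t if_id_rs, uint8_t if_id_rt)
{
    return id_ex_mem_read &&
           (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
}
```

For lw $1,4($2) followed by sub $4,$1,$5, the call load_use_hazard(true, 1, 1, 5) reports a hazard because the load's destination matches the first source of the sub.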

Hazard/Stall Hardware

Along with the Hazard Unit, we have to implement the stall

- Prevent the instructions in the IF and ID stages from progressing down the pipeline, done by preventing the PC register and the IF/ID pipeline register from changing
    - Hazard Detection Unit controls the writing of the PC (PC.write) and IF/ID (IF/ID.write) registers
- Insert a “bubble” between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage) (i.e., insert a nop in the execution stream)
    - Set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0 (nop). The Hazard Unit controls the mux that chooses between the real control values and the 0’s.
- Let the lw instruction and the following instructions in the pipeline proceed normally down the pipeline
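
The stall actions listed above can be sketched as one C function that takes the hazard signal and produces the write enables and the control bits passed into ID/EX. The ControlBits struct is a hypothetical subset of the real control fields, and all names are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical subset of the EX/MEM/WB control fields carried in ID/EX. */
typedef struct {
    bool reg_write;
    bool mem_read;
    bool mem_write;
    bool branch;
} ControlBits;

typedef struct {
    bool        pc_write;     /* PC.write: may the PC change this cycle?       */
    bool        if_id_write;  /* IF/ID.write: may IF/ID change this cycle?     */
    ControlBits id_ex_ctrl;   /* control values written into ID/EX             */
} StallOutputs;

/* On a hazard: freeze PC and IF/ID, and send all-zero control (a nop bubble)
 * into ID/EX. Otherwise pass the decoded control values through unchanged. */
StallOutputs apply_stall(bool hazard, ControlBits decoded)
{
    StallOutputs out;
    out.pc_write    = !hazard;
    out.if_id_write = !hazard;
    out.id_ex_ctrl  = hazard ? (ControlBits){0} : decoded;
    return out;
}
```

The ternary plays the role of the mux that chooses between the real control values and the 0's.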

Adding the Hazard/Stall Hardware

[Figure: forwarding datapath extended with a Hazard Unit. The Hazard Unit reads ID/EX.MemRead and ID/EX.RegisterRt and controls a mux that selects between the real control values and 0.]

Stall/Bubble in the Pipeline

A Challenge: Memory-to-Memory Copies

Instr. order:
  lw $1,4($2)
  sw $1,4($3)

[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg; the sw starts one cycle after the lw.]

For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input.

- Would need to add a Forward Unit and a mux to the MEM stage
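
The slides do not give the condition for this MEM-stage Forward Unit, so the following is only one plausible sketch. It assumes the store's rt register number is carried into EX/MEM (a hypothetical field, EX/MEM.RegisterRt) and mirrors the shape of the ALU forwarding test; all names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the data memory's Write Data input should take the
 * value in MEM/WB instead of the store data carried in EX/MEM.
 * mem_wb_reg_write : MEM/WB.RegWrite   (older instruction writes a register)
 * mem_wb_rd        : MEM/WB.RegisterRd (its destination register)
 * ex_mem_mem_write : EX/MEM.MemWrite   (instruction in MEM is a store)
 * ex_mem_rt        : assumed field holding the store's data register number */
bool forward_store_data(bool mem_wb_reg_write, uint8_t mem_wb_rd,
                        bool ex_mem_mem_write, uint8_t ex_mem_rt)
{
    return mem_wb_reg_write && mem_wb_rd != 0 &&
           ex_mem_mem_write && mem_wb_rd == ex_mem_rt;
}
```

For this to remove the stall, the load-use hazard detection would also have to skip stalling when the load's destination is only the store's data register and not its base register; that change is not shown here.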

Control Hazards Chapter 4.8

MIPS Pipeline Control Path Modifications

Control Hazards

When the flow of instruction addresses is not sequential (i.e., next PC ≠ PC + 4); incurred by change-of-flow instructions
- Unconditional branches (j, jal, jr)
- Conditional branches (beq, bne)
- Exceptions

Possible approaches
- Stall (impacts CPI)
- Move decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- Predict and hope for the best!
- Delay decision (requires compiler support)

Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards
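
To make "impacts CPI" concrete, a simple model adds the stall cycles per branch, weighted by how often branches occur, to the base CPI. This is a sketch under the assumption that every branch pays the full penalty; the function name and the branch fraction used below are illustrative, not figures from the lecture.

```c
/* Effective CPI when every branch stalls the pipeline:
 *   CPI = base CPI + (fraction of instructions that are branches)
 *                  * (stall cycles per branch)                     */
double effective_cpi(double base_cpi, double branch_fraction,
                     double stall_cycles_per_branch)
{
    return base_cpi + branch_fraction * stall_cycles_per_branch;
}
```

With an assumed branch fraction of 0.25, three stall cycles per branch give 1 + 0.25 × 3 = 1.75, while one stall cycle per branch gives 1 + 0.25 × 1 = 1.25, which is why moving the decision point earlier pays off.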

Branch Instr’s Cause Control Hazards

Instr. order:
  beq
  lw
  Inst 3
  Inst 4

[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous one.]

Dependencies backward in time cause hazards

One Way to “Fix” a Branch Control Hazard

Fix branch hazard by waiting (flush), but this affects CPI

Instr. order:
  beq
  flush
  flush
  flush
  beq target
  Inst 3

[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg.]

Another Way to “Fix” a Branch Control Hazard

Move branch decision hardware as early in the pipeline as possible, i.e., during the decode cycle

Instr. order:
  beq
  flush
  beq target
  Inst 3

[Pipeline diagram: each instruction passes through IM, Reg, ALU, DM, Reg.]

Reducing the Delay of Branches

- Add hardware to compute the branch target address and evaluate the branch decision to the ID stage
    - Reduces the number of stall (flush) cycles to one (like with jumps)
        - But now need to add forwarding hardware in the ID stage
    - Comparing and updating the PC adds 2 muxes, a comparator, and an AND gate to the ID stage
        - Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
        - Comparing the registers can be done in the same clock cycle as the RegFile read
- For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls
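
The ID-stage branch hardware described above can be sketched in C as a comparator, an AND with the Branch control signal, and a target adder. The names are hypothetical, and the sketch assumes the standard MIPS rule that the branch target is (PC + 4) plus the sign-extended 16-bit offset shifted left by 2.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     taken;   /* drives PCSrc and IF.Flush */
    uint32_t target;  /* branch target address     */
} BranchDecision;

/* branch   : Branch control signal (the instruction in ID is a beq)
 * rs_val   : Read Data 1, rt_val : Read Data 2
 * pc_plus4 : address of the beq plus 4, carried in IF/ID
 * imm16    : 16-bit offset field of the instruction                 */
BranchDecision resolve_beq_in_id(bool branch, uint32_t rs_val, uint32_t rt_val,
                                 uint32_t pc_plus4, uint16_t imm16)
{
    BranchDecision d;
    int32_t offset = (int32_t)(int16_t)imm16;           /* sign-extend 16 -> 32 */
    d.target = pc_plus4 + ((uint32_t)offset << 2);      /* shift left 2, then add */
    d.taken  = branch && (rs_val == rt_val);            /* comparator + AND gate */
    return d;
}
```

The target is computed for every instruction in parallel with the register read and is only used when taken is true.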

Supporting ID Stage Branches

[Figure: pipelined datapath with the branch hardware moved to the ID stage: a Compare unit on the register file outputs, a branch target adder, a second Forward Unit in ID, the Hazard Unit, and an IF.Flush signal into IF/ID.]

Example: Branch Taken

36: sub $10, $4, $8
40: beq $1, $3, 7      (PC-relative branch to 40 + 4 + 7 × 4 = 72)
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
68: add $1, $2, $3
72: lw $4, 50($7)
76: and $3, $1, $2

Data Hazards for Branches

If a comparison register is a destination of 2nd preceding ALU instruction
- Can resolve using forwarding

add $1, $2, $3        IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
Inst 3                        IF  ID  EX  MEM WB
beq $1, $4, target                IF  ID  EX  MEM WB

Data Hazards for Branches

If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
- Need 1 stall cycle

lw  $1, addr          IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
beq stalled                   IF  ID
beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches

If a comparison register is a destination of immediately preceding load instruction
- Need 2 stall cycles
- But no data forwarding is needed: after the two stalls, the beq reads the register file in the same cycle the lw writes it back

lw  $1, addr          IF  ID  EX  MEM WB
beq stalled               IF  ID
beq stalled                   ID
beq $1, $0, target                ID  EX  MEM WB
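
The three branch data-hazard cases can be summarized in one C function that returns the number of stall cycles a beq resolved in ID needs. This is a sketch of the rules stated on these slides only; the enum and function names are hypothetical, and distance counts how many instructions back the producer of the comparison register is (1 = immediately preceding).

```c
typedef enum { PRODUCER_ALU, PRODUCER_LOAD } ProducerKind;

/* Stall cycles needed by a beq whose comparison register is written by an
 * instruction `distance` positions earlier (1 = immediately preceding). */
int beq_stall_cycles(ProducerKind kind, int distance)
{
    if (kind == PRODUCER_LOAD) {
        if (distance == 1) return 2;   /* immediately preceding load */
        if (distance == 2) return 1;   /* 2nd preceding load         */
        return 0;
    }
    /* ALU producer: only the immediately preceding one forces a stall;
     * a 2nd preceding ALU result is resolved by forwarding. */
    return distance == 1 ? 1 : 0;
}
```

For lw $1, addr immediately followed by beq $1, $0, target, the call beq_stall_cycles(PRODUCER_LOAD, 1) yields the two stall cycles shown in the last diagram.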

Summary

- All modern processors use pipelining for performance (a CPI close to 1 and fast clock frequency)
- Pipeline clock rate limited by slowest pipeline stage, so designing a balanced pipeline is important
- Pipelining doesn’t decrease the latency of a single task; it increases the throughput of the entire workload
- Must detect and resolve hazards
    - Structural hazards: resolved by designing the pipeline correctly
    - Data hazards
        - Stall (impacts CPI)
        - Forward (requires hardware support)
    - Control hazards: put the branch decision hardware in as early a stage in the pipeline as possible
        - Stall (impacts CPI)
        - Static and dynamic prediction (requires hardware support)
        - Delay decision (requires compiler support)
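
The points about clock rate and latency can be checked with a few lines of C. This is a sketch with hypothetical function names; the stage latencies used in the note below are illustrative values, not measurements from the lecture.

```c
#include <stddef.h>

/* Pipeline clock period: the latency of the slowest stage. */
double pipeline_clock_period(const double *stage_latency, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i)
        if (stage_latency[i] > worst)
            worst = stage_latency[i];
    return worst;
}

/* Latency of one instruction without pipelining: the sum of all stage latencies. */
double unpipelined_latency(const double *stage_latency, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += stage_latency[i];
    return total;
}

/* Latency of one instruction in the pipeline: every stage takes a full clock period. */
double pipelined_latency(const double *stage_latency, size_t n)
{
    return (double)n * pipeline_clock_period(stage_latency, n);
}
```

With assumed stage latencies of 200, 100, 200, 200, and 100 ps, the clock period is 200 ps and a single instruction takes 1000 ps in the pipeline versus 800 ps without it, yet an instruction can complete every 200 ps instead of every 800 ps. Latency does not improve, throughput does, and the unbalanced 100 ps stages are what keep the gain below the ideal.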