lecture 3 processor - university of arkansas
Chapter 4.1
Lecture 3: The Processor (Chapter 4 of textbook)
Chapter 4.2
Introduction Chapter 4.1
Chapter 4.3
Review: MIPS (RISC) Design Principles
Simplicity favors regularity
- fixed-size instructions
- small number of instruction formats
- opcode always the first 6 bits
Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes
Make the common case fast
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands
Good design demands good compromises
- three instruction formats
Chapter 4.4
The Processor: Datapath & Control
Our implementation of the MIPS is simplified
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic implementation
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction
All instructions (except j) use the ALU after reading the registers
- How? memory-reference? arithmetic? control flow?
[Figure: Fetch (PC = PC+4), Decode, Exec cycle]
Chapter 4.5
Logic Design Conventions Chapter 4.2
Chapter 4.6
Logic Design Basics
Information encoded in binary
- Low voltage = 0, High voltage = 1
- One wire per bit
- Multi-bit data encoded on multi-wire buses
Combinational components
- Operate on data
- Output is a function of input
State (sequential) components
- Store information
Chapter 4.7
Combinational Components
- AND-gate: Y = A & B
- Adder: Y = A + B
- Multiplexer: Y = S ? I1 : I0
- Arithmetic/Logic Unit: Y = F(A,B)
[Figure: schematic symbols for the AND gate, adder, multiplexer, and ALU]
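The four combinational components can be modeled as pure functions of their inputs. This is only a behavioral sketch: the 32-bit default width and the string names used to select the ALU function are assumptions made for illustration, not part of the slides.

```python
def and_gate(a: int, b: int) -> int:
    """Y = A & B (bitwise, so it also works on multi-bit buses)."""
    return a & b


def adder(a: int, b: int, width: int = 32) -> int:
    """Y = A + B, truncated to the bus width as a hardware adder would."""
    return (a + b) & ((1 << width) - 1)


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0."""
    return i1 if s else i0


def alu(f: str, a: int, b: int, width: int = 32) -> int:
    """Y = F(A, B); F is chosen by a (hypothetical) function name."""
    mask = (1 << width) - 1
    results = {"and": a & b, "or": a | b, "add": a + b, "sub": a - b}
    return results[f] & mask
```

Because each output depends only on the current inputs, none of these functions keeps any state between calls.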
Chapter 4.8
Sequential Components
Register: stores data in a circuit
- Uses a clock signal to determine when to update the stored value
- Edge-triggered: update when the clock signal changes value
  - Rising edge (positive edge): 0→1
  - Falling edge (negative edge): 1→0
- A D flip-flop is one option for implementing a register
[Figure: D flip-flop symbol and timing diagram for Clk, D, and Q]
Chapter 4.9
Sequential Components
Register with write control
- Only updates on a clock edge when the write control input (e.g., Write Enable) is asserted
[Figure: register symbol with a WE input and timing diagram for Clk, WE, D, and Q]
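A rising-edge-triggered register with write enable can be sketched as a small class. The class name, the `tick` method, and the choice of the rising edge are assumptions for illustration; the slides only require that the stored value changes on a clock edge while the write control is asserted.

```python
class Register:
    """Behavioral model of an edge-triggered register with write enable."""

    def __init__(self, value: int = 0):
        self.q = value        # stored value (output Q)
        self._prev_clk = 0    # previous clock level, used to detect edges

    def tick(self, clk: int, d: int, we: int = 1) -> int:
        """Present clock level, data input D, and write enable WE.

        Q changes only on a 0 -> 1 clock transition while WE is asserted.
        """
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and we:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Calling `tick(1, 5, we=0)` on a fresh register leaves Q at 0, while a later rising edge with `we=1` captures the data input.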
Chapter 4.10
Clocking Methodologies
The clocking methodology defines when data in a state element are valid and stable relative to the clock
- State elements: a memory element such as a register
- Edge-triggered: all state changes occur on a clock edge
Typical execution
- read contents of state elements → send values through combinational logic → write results to one or more state elements
[Figure: State element 1 → Combinational logic → State element 2 within one clock cycle]
Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed
- the write occurs only when both the write control is asserted and the clock edge occurs
Chapter 4.11
Building a Datapath Chapter 4.3
Chapter 4.12
Fetching Instructions
Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction
[Figure: the PC drives the Instruction Memory read address and an adder that computes PC + 4]
The PC is updated every clock cycle, so it does not need an explicit write control signal
Chapter 4.13
Decoding Instructions
Decoding instructions involves
- sending the fetched instruction’s opcode and function field bits to the control unit
- reading two values from the Register File
  - Register File addresses are contained in the instruction
[Figure: instruction bits feed the Control Unit and the Register File ports Read Addr 1 and Read Addr 2]
Chapter 4.14
Executing R Format Operations
R format operations (add, sub, slt, and, or)
- perform the operation (op and funct) on values in rs and rt
- store the result back into the Register File (into location rd)
R-type format (bit positions):
  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
[Figure: Register File read data feeds the ALU (ALU control input, zero and overflow outputs); the ALU result returns to Write Data under RegWrite]
Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File
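The R-type field layout can be made concrete with a small decoder that slices a 32-bit instruction word by the bit positions given on the slide. The function name and the dictionary return type are illustrative choices.

```python
def decode_r_type(instr: int) -> dict:
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "op":    (instr >> 26) & 0x3F,  # bits 31-26
        "rs":    (instr >> 21) & 0x1F,  # bits 25-21
        "rt":    (instr >> 16) & 0x1F,  # bits 20-16
        "rd":    (instr >> 11) & 0x1F,  # bits 15-11
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,          # bits 5-0
    }


# add $1,$8,$9 encodes as op=0, rs=8, rt=9, rd=1, shamt=0, funct=100000 (binary)
fields = decode_r_type(0x01090820)
```

The rs and rt fields go to the two Register File read ports, and rd goes to the write port.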
Chapter 4.15
Executing Store Operations
Store operations involve
- computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
- writing the store value (read from the Register File during decode) to the Data Memory
[Figure: Register File, Sign Extend (16 → 32 bits), ALU, and Data Memory with the MemWrite control]
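The address computation shared by lw and sw can be sketched as follows. The helper names are hypothetical; results are wrapped to 32 bits to mimic the width of the datapath.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base: int, offset16: int) -> int:
    """Base register value plus sign-extended 16-bit offset, modulo 2**32."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF
```

For example, a base of 0x1000 with the offset field 0xFFFC (that is, -4) gives the address 0x0FFC.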
Chapter 4.16
Executing Load Operations
Load operations involve
- computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
- writing the load value (read from the Data Memory) to the Register File
[Figure: Register File with RegWrite, Sign Extend (16 → 32 bits), ALU, and Data Memory with the MemRead control; Read Data returns to the Register File]
Chapter 4.17
Executing Branch Operations
Branch operations involve
- comparing the operands read from the Register File during decode for equality (zero ALU output)
- computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction, shifted left by 2 bits
[Figure: the ALU zero output goes to the branch control logic; a separate adder sums PC + 4 and the sign-extended offset shifted left by 2 to produce the branch target address]
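A minimal sketch of the branch computation, assuming a 32-bit PC and a beq-style equality test. The function names are hypothetical, and the sign-extension helper is repeated so the block stands alone.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, offset16: int) -> int:
    """Updated PC (PC + 4) plus the sign-extended offset shifted left by 2."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc: int, offset16: int, rs_val: int, rt_val: int) -> int:
    """Select the branch target when the ALU subtraction yields zero."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & 0xFFFFFFFF
```

With a beq at address 40 and an offset field of 7, the target is 40 + 4 + 28 = 72.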
Chapter 4.18
Executing Jump Operations
The jump operation involves
- replacing the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: the 26-bit instruction field is shifted left by 2 to give 28 bits, which are joined with 4 upper bits to form the jump address]
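The jump address formation can be sketched in a few lines. Taking the upper four bits from PC + 4 follows the PC+4[31-28] label that appears on the later datapath slides; the function name is hypothetical.

```python
def jump_target(pc: int, instr: int) -> int:
    """Form the 32-bit jump address for a j instruction."""
    target26 = instr & 0x03FFFFFF      # lower 26 bits of the instruction
    upper4 = (pc + 4) & 0xF0000000     # bits 31-28 of PC + 4
    return upper4 | (target26 << 2)    # 4 bits + 28 bits = 32 bits
```

For a jump at 0x10000008 whose target field is 0x10, the result is 0x10000040.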
Chapter 4.19
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines and multiplexors as needed
Single cycle design: fetch, decode and execute each instruction in one clock cycle
- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors are needed at the input of shared elements, with control lines to do the selection
- write signals control writing to the Register File and Data Memory
Cycle time is determined by the length of the longest path
Chapter 4.20
Fetch, Register, and Memory Access Portions
[Figure: combined fetch, Register File, ALU, and Data Memory datapath with the control signals RegWrite, ALUSrc, MemWrite, MemRead, and MemtoReg]
zero flag of ALU
- ‘1’: result is zero
- ‘0’: result is not zero
Chapter 4.21
Full Datapath (without Jump)
Chapter 4.22
Adding Control Units Chapter 4.4, page 259
Chapter 4.23
Adding the Control
- Selecting the operations to perform (ALU, Register File and Memory read/write)
- Controlling the flow of data (multiplexor inputs)
Instruction formats (bit positions):
  R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
  I-type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
  J-type: op [31-26] | target address [25-0]
Observations
- op field is always in bits 31-26
- addresses of the registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16)
  - in lw and sw, rs is the base register
- address of the register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw is always in bits 15-0
Chapter 4.24
The Main Control Unit
Control signals derived from instruction
  R-type:     0 [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | shamt [10:6] | funct [5:0]
  Load/Store: 35 or 43 [31:26] | rs [25:21] | rt [20:16] | address [15:0]
  Branch:     4 [31:26] | rs [25:21] | rt [20:16] | address [15:0]
Field annotations
- bits 31:26: opcode
- rs and rt: always read
- destination register: write for R-type and load
- address: sign-extend
Chapter 4.25
Single Cycle Datapath with Control Unit (w/o jump)
[Figure: single-cycle datapath in which the Control Unit decodes Instr[31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; the ALU control takes ALUOp and Instr[5-0]; PCSrc selects between PC + 4 and the branch target]
Chapter 4.26
Single Cycle Datapath with Control Unit
[Figure: the same single-cycle datapath extended with a Jump control signal and a second mux; Instr[25-0] is shifted left by 2 to give 28 bits and combined with PC+4[31-28] to form the 32-bit jump address]
Chapter 4.27
ALU Control
ALU used for
- Load/Store: Function = add
- Branch: Function = subtract
- R-type: Function depends on funct field

ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract
0111         set-on-less-than
Chapter 4.28
ALU Control
Assume 2-bit ALUOp derived from opcode
- Combinational logic derives ALU control

opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
R-type  10     subtract          100010  subtract          0110
R-type  10     AND               100100  AND               0000
R-type  10     OR                100101  OR                0001
R-type  10     set-on-less-than  101010  set-on-less-than  0111
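The table above can be written directly as a lookup. This is a behavioral sketch of the combinational logic, not a gate-level design; the names `FUNCT_TO_ALU` and `alu_control` are hypothetical.

```python
# funct field (Instr[5-0]) -> 4-bit ALU control, used only when ALUOp = 10
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 4-bit ALU control from the 2-bit ALUOp and funct field."""
    if alu_op == 0b00:   # lw / sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:   # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:   # R-type: the funct field selects the operation
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")
```

Splitting the decision this way keeps the main control unit small: it only produces ALUOp, and the funct field is examined locally.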
Chapter 4.29
Single-cycle Implementation Chapter 4.4, page 264
Chapter 4.30
R-type Instruction Data/Control Flow
Chapter 4.31
R-type Instruction Data/Control Flow (alt.)
Chapter 4.32
Load Word Instruction Data/Control Flow
Chapter 4.33
Load Word Instruction Data/Control Flow
Chapter 4.34
Load Word Instruction Data/Control Flow (alt.)
Chapter 4.35
Store Word Instruction Data/Control Flow
Chapter 4.36
Store Word Instruction Data/Control Flow
Chapter 4.37
Branch Instruction Data/Control Flow
Chapter 4.38
Branch Instruction Data/Control Flow
Chapter 4.39
Branch Instruction Data/Control Flow (alt.)
Chapter 4.40
Adding the Jump Operation
[Figure: single-cycle datapath with the Jump control signal; Instr[25-0] is shifted left by 2 to give 28 bits and combined with PC+4[31-28] to form the 32-bit jump address]
Chapter 4.41
Jump Operation (alt.)
Chapter 4.42
Introduction to Pipelining Design Chapter 4.5
Chapter 4.43
Instruction Critical Paths
What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extension, PC access, shift left 2, wires, setup and hold times, except:
- Instruction Memory and Data Memory (200 ps)
- ALU and adders (200 ps)
- Register File access (reads or writes) (100 ps)

Instr.   I Mem  Reg Rd  ALU Op  D Mem  Reg Wr  Total
R-type   200    100     200            100     600
load     200    100     200     200    100     800
store    200    100     200     200            700
beq      200    100     200                    500
jump     200                                   200
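The totals in the table, and the resulting clock cycle time, can be checked with a short calculation. The dictionary names are hypothetical; the delays and the units used by each instruction class are taken from the slide.

```python
# Delay of each major unit in picoseconds
DELAY_PS = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Units on the critical path of each instruction class
PATH = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}


def critical_path_ps(kind: str) -> int:
    """Sum the unit delays along the path of one instruction class."""
    return sum(DELAY_PS[unit] for unit in PATH[kind])


# A single-cycle design must fit the slowest instruction into one cycle.
clock_cycle_ps = max(critical_path_ps(kind) for kind in PATH)
```

The maximum is the load path at 800 ps, so every instruction, including the 200 ps jump, is given an 800 ps cycle.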
Chapter 4.44
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
- especially problematic for more complex instructions like floating-point multiplication
[Figure: clock waveform with lw filling Cycle 1 and sw in Cycle 2, where the unused remainder of the cycle is marked Waste]
Is slow, but is simple and easy to understand
Chapter 4.45
How Can We Make It Faster?
Start fetching and executing the next instruction before the current one has completed
- Pipelining: modern processors are pipelined for performance
- Remember the performance equation: CPU time = IC × CPI × CC
Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
- A five-stage pipeline is nearly five times faster because the CC is nearly five times shorter
  - CPI = 1 for the single-cycle implementation
  - CPI ≈ 1 for the pipelined implementation
Chapter 4.46
Analogy: Assembly Line vs. Mechanic Shop
Mechanic shop
- The mechanic needs to do everything
- It takes hours to fix just one car
Car assembly line
- Many workers work together
  - Each worker just puts one or a few components into the car
- One assembly line can produce hundreds or thousands of cars per day
Chapter 4.47
The Five Stages of Executing an Instruction
- IFetch: Instruction Fetch and Update PC
- Dec: Registers Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address; etc.
- Mem: Read/write the data from/to the Data Memory
- WB: Write the result data into the register file
[Figure: lw occupies IFetch, Dec, Exec, Mem, and WB in Cycle 1 through Cycle 5]
Chapter 4.48
Why Pipeline? For Performance!
[Figure: pipeline diagram of Inst 0 through Inst 4 in program order over time (clock cycles); each instruction passes through IM, Reg, ALU, DM, Reg; the first cycles are marked as the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Chapter 4.49
A Pipelined MIPS Processor
Start the next instruction before the current one has completed
- improves throughput: total amount of work done in a given time
- instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced
[Figure: lw, sw, and an R-type instruction overlapped across Cycle 1 through Cycle 8, each passing through IFetch, Dec, Exec, Mem, WB]
- clock cycle (pipeline stage time) is limited by the slowest stage
- some stages don’t need the whole clock cycle (e.g., WB)
Chapter 4.50
Single Cycle versus Pipeline
[Figure: Single Cycle Implementation (CC = 800 ps): lw fills Cycle 1; sw runs in Cycle 2 with the unused remainder marked Waste]
[Figure: Pipeline Implementation (CC = 200 ps): lw, sw, and an R-type instruction overlap, each passing through IFetch, Dec, Exec, Mem, WB; one interval in the diagram is labeled 400 ps]
To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
How long does each take to complete 1,000,000 instrs?
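The second question can be worked through numerically. This sketch assumes an ideal five-stage pipeline with no stalls, so n instructions need n + 4 cycles (four cycles to fill the pipeline, then one completion per cycle); the function names are hypothetical.

```python
def single_cycle_time_ps(n_instr: int, cc_ps: int = 800) -> int:
    """Every instruction takes one full 800 ps cycle."""
    return n_instr * cc_ps


def pipelined_time_ps(n_instr: int, stages: int = 5, cc_ps: int = 200) -> int:
    """Ideal pipeline: (stages - 1) fill cycles, then one instruction per cycle."""
    return (n_instr + stages - 1) * cc_ps


n = 1_000_000
single = single_cycle_time_ps(n)   # 800,000,000 ps
piped = pipelined_time_ps(n)       # 200,000,800 ps
speedup = single / piped           # just under 4
```

A single instruction still needs 5 × 200 ps = 1000 ps to pass through the pipeline, since every stage is stretched to the slowest stage's time. The speedup is close to 4 rather than 5 here because the single-cycle clock (800 ps) is only four times the pipelined clock (200 ps).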
Chapter 4.51
Pipelining the MIPS ISA
What makes it easy
- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three)
- memory operations occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
- operands must be aligned in memory so a single data transfer takes only one data memory access
Only cover the following 8 instructions as an example
- lw, sw, add, sub, and, or, slt, beq
Chapter 4.52
Pipelined Datapath Chapter 4.6, 286
Chapter 4.53
MIPS Pipelined Datapath
Chapter 4.54
Pipeline Registers
Need registers between stages
- Hold information produced in previous cycle
Chapter 4.55
Single-clock-cycle diagram
Cycle-by-cycle flow of instructions through the pipelined datapath
- “Single-clock-cycle” pipeline diagram
  - Shows pipeline usage in a single cycle
  - Highlights resources used in each cycle
We will look at “single-clock-cycle” diagrams for load & store instructions
Chapter 4.56
IF for Load & Store
Chapter 4.57
ID for Load & Store
Chapter 4.58
EX for Load & Store
Chapter 4.59
MEM for Load
Chapter 4.60
MEM for Store
Chapter 4.61
WB for Load
Wrong register number
Chapter 4.62
Corrected Pipelined Datapath
Chapter 4.63
MIPS Pipeline Datapath
State registers between each pipeline stage to isolate them
Stages: IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), WB (WriteBack)
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, driven by the System Clock]
Chapter 4.64
Graphically Representing MIPS Pipeline
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?
[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg]
Chapter 4.65
Multi-Cycle Pipeline Diagram Showing the resource usage
Chapter 4.66
Multi-Cycle Pipeline Diagram Traditional form
Chapter 4.67
Pipeline Control Chapter 4.6, page 300
Chapter 4.68
Pipelined Control
Chapter 4.69
Pipelined Control Signals Control signals derived from instructions
As in single-cycle implementation
Chapter 4.70
Pipeline Control
IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
ID Stage: no control signals to set

        EX Stage                        MEM Stage               WB Stage
Instr   RegDst ALUOp1 ALUOp0 ALUSrc     Brch MemRead MemWrite   RegWrite MemtoReg
R       1      1      0      0          0    0       0          1        0
lw      0      0      0      1          0    1       0          1        1
sw      X      0      0      1          0    0       1          0        X
beq     X      0      1      0          1    0       0          0        X
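The control table can be expressed as a lookup keyed by opcode, grouped by the stage that consumes each signal. The opcode values (0, 35, 43, 4) come from the main control unit slide; the dictionary layout, the function name, and the use of `None` for a don't-care (X) entry are assumptions for illustration.

```python
X = None  # don't-care

PIPELINE_CONTROL = {
    0:  {"EX":  {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},   # R-type
         "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
         "WB":  {"RegWrite": 1, "MemtoReg": 0}},
    35: {"EX":  {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},   # lw
         "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
         "WB":  {"RegWrite": 1, "MemtoReg": 1}},
    43: {"EX":  {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},   # sw
         "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
         "WB":  {"RegWrite": 0, "MemtoReg": X}},
    4:  {"EX":  {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},   # beq
         "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
         "WB":  {"RegWrite": 0, "MemtoReg": X}},
}


def control_for(opcode: int) -> dict:
    """Return the EX, MEM, and WB control groups for one instruction."""
    return PIPELINE_CONTROL[opcode]
```

Grouping by stage mirrors how the signals travel: each group is carried in the pipeline registers until the stage that uses it.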
Chapter 4.71
MIPS Pipeline Control Path Modifications
Chapter 4.72
Pipeline Hazards
Data Hazard Chapter 4.7, page 303
Chapter 4.73
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
  - Structural hazards are solved by duplicating the necessary components
- data hazards: attempt to use data before it is ready
  - An instruction’s source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
  - branch and jump instructions, exceptions
Can usually resolve hazards by waiting
- pipeline control must detect the hazard and take action to resolve hazards
Chapter 4.74
A Single Memory Would Be a Structural Hazard
[Figure: pipeline diagram of lw and Inst 2 through Inst 5 over time (clock cycles), each using Mem, Reg, ALU, Mem, Reg; in one cycle an earlier instruction is reading data from memory while a later one is reading its instruction from the same memory]
Fix with separate instr and data memories (I$ and D$)
Chapter 4.75
How About Register File Access?
[Figure: pipeline diagram over time (clock cycles), each instruction using IM, Reg, ALU, DM, Reg; instructions shown: add $1, ; Inst 2; Inst 3; add $2,$1,]
Fix simple register file hazard by doing writes in the first half of the cycle and reads in the second half
Chapter 4.76
Register Usage Can Cause Data Hazards
Chapter 4.77
Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
Instr. order:
  add $1,$8,$9
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or $8,$1,$9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions]
Read After Write data hazard
Chapter 4.78
Loads Can Cause Data Hazards
Dependencies backward in time cause hazards
Instr. order:
  lw $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or $8,$1,$9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions]
Load-use data hazard
- Another Read After Write hazard
Chapter 4.79
Formal Definitions of Data Hazards
Consider two instructions i and j, with i occurring before j in program order
Three data hazards
- RAW (read after write)
  - j tries to read a source before i writes it, so j incorrectly gets the old value
- WAW (write after write)
  - j tries to write an operand before it is written by i, leaving the value written by i rather than the value written by j in the destination
- WAR (write after read)
  - j tries to write a destination before it is read by i, so i incorrectly gets the new value
In the basic 5-stage pipeline, WAW and WAR dependences do not cause any hazards
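The three definitions can be turned into a small check on register numbers. This sketch only reports which dependences exist between an earlier instruction i and a later instruction j; whether a dependence becomes an actual hazard depends on the pipeline, as noted above. The function name and argument layout are hypothetical, and register $zero is not treated specially here.

```python
def dependences(i_dst, i_srcs, j_dst, j_srcs) -> set:
    """Classify register dependences from earlier instr i to later instr j.

    i_dst / j_dst: destination register number, or None if none is written.
    i_srcs / j_srcs: iterable of source register numbers.
    """
    found = set()
    if i_dst is not None and i_dst in j_srcs:
        found.add("RAW")   # j reads what i writes
    if i_dst is not None and i_dst == j_dst:
        found.add("WAW")   # both write the same register
    if j_dst is not None and j_dst in i_srcs:
        found.add("WAR")   # j writes what i reads
    return found


# add $1,$8,$9 followed by sub $4,$1,$5
raw_only = dependences(1, (8, 9), 4, (1, 5))
```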
Chapter 4.80
One Way to “Fix” a Data Hazard
Instr. order:
  add $1,
  stall (insert nop)
  stall (insert nop)
  sub $4,$1,$5
  and $6,$1,$7
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the instructions above]
Can fix data hazard by waiting (stall), but this impacts CPI
Chapter 4.81
Another Way to “Fix” a Data Hazard
Instr. order:
  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or $8,$1,$9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions]
Fix data hazards by forwarding results as soon as they are available to where they are needed
Chapter 4.82
Data Forwarding (aka Bypassing)
Take the result from the earliest point where it exists in any of the pipeline registers and forward it to the functional units (e.g., the ALU) that need it in that cycle
For the ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by
- adding multiplexors to both inputs of the ALU
- connecting the register write data in EX/MEM or MEM/WB to both ALU mux inputs in the EX stage
- adding the proper control hardware to control the new muxes
With forwarding the processor can achieve a CPI close to 1 even in the presence of data dependencies
Chapter 4.83
Forwarding Illustration
Instr. order:
  add $1,
  sub $4,$1,$5
  and $6,$7,$1
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) with arrows labeled EX forwarding and MEM forwarding]
Chapter 4.84
Detecting the Need to Forward
Pass register numbers along pipeline
- e.g., ID/EX.RegisterRs: register number for Rs stored in ID/EX pipeline register
ALU operand register numbers in EX stage are given by
- ID/EX.RegisterRs, ID/EX.RegisterRt
Data hazards only if
- 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
- 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  (fwd from previous instr., i.e., EX/MEM pipeline register)
- 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
- 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
  (fwd from second previous instruction, i.e., MEM/WB pipeline register)
Chapter 4.85
Detecting the Need to Forward
But only if forwarding instruction will write to a register!
- EX/MEM.RegWrite, MEM/WB.RegWrite
And only if Rd for that instruction is not $zero
- EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
Chapter 4.86
Datapath with Forwarding Hardware
[Figure: pipelined datapath with control fields (EX, M, WB) carried in the pipeline registers and a Forward Unit that drives the ForwardA and ForwardB mux selects at the ALU inputs]
Chapter 4.87
Control Values for the Forwarding Multiplexors
Mux control Source Explanation
ForwardA=00 ID/EX The first ALU operand comes from the register file.
ForwardA=10 EX/MEM The first ALU operand is forwarded from the prior ALU result.
ForwardA=01 MEM/WB The first ALU operand is forwarded from the data memory or an earlier ALU result.
ForwardB=00 ID/EX The second ALU operand comes from the register file.
ForwardB=10 EX/MEM The second ALU operand is forwarded from the prior ALU result.
ForwardB=01 MEM/WB The second ALU operand is forwarded from the data memory or an earlier ALU result.
Chapter 4.88
Yet Another Complication!
Instr. order:
  add $1,$1,$2
  add $1,$1,$3
  add $1,$1,$4
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the three instructions]
Another potential data hazard can occur when there is a conflict between the outputs of EX/MEM pipeline register and MEM/WB pipeline register. Which should be forwarded?
Chapter 4.89
Yet Another Complication!
Instr. order:
  add $1,$1,$2
  add $1,$1,$3
  add $1,$1,$4
[Figure: the same pipeline diagram with one forwarding path marked as what we want and another marked as the forwarding we want to avoid]
Another potential data hazard can occur when there is a conflict between the outputs of EX/MEM pipeline register and MEM/WB pipeline register. Which should be forwarded?
Chapter 4.90
Statement for Forwarding Control Signals (C-style pseudocode)
1. ForwardA:
if (EX/MEM.RegWrite && (EX/MEM.RegisterRd != 0) && (EX/MEM.RegisterRd == ID/EX.RegisterRs))
    ForwardA = 10;   // forwards the result from the previous instr. to either input of the ALU
else if (MEM/WB.RegWrite && (MEM/WB.RegisterRd != 0) && (MEM/WB.RegisterRd == ID/EX.RegisterRs))
    ForwardA = 01;   // forwards the result from the second previous instr. to either input of the ALU
else
    ForwardA = 00;   // no forwarding
2. ForwardB
- The logic is similar, with ID/EX.RegisterRt in place of ID/EX.RegisterRs
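A runnable version of the forwarding decision, written as one function that serves both ForwardA and ForwardB depending on which source register number is passed in. The function and parameter names are hypothetical stand-ins for the pipeline register fields.

```python
def forward_select(ex_mem_regwrite: int, ex_mem_rd: int,
                   mem_wb_regwrite: int, mem_wb_rd: int,
                   id_ex_src: int) -> int:
    """Return the 2-bit mux select for one ALU operand.

    id_ex_src is ID/EX.RegisterRs for ForwardA or ID/EX.RegisterRt for ForwardB.
    """
    # The EX/MEM check comes first, so the most recent result wins
    # when both pipeline registers hold a matching destination.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return 0b10   # forward from EX/MEM (previous instruction)
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_src:
        return 0b01   # forward from MEM/WB (second previous instruction)
    return 0b00       # no forwarding: use the register file value


# Third add in "add $1,$1,$2; add $1,$1,$3; add $1,$1,$4":
# both EX/MEM and MEM/WB hold a result for $1, and EX/MEM is chosen.
sel = forward_select(1, 1, 1, 1, id_ex_src=1)
```

The if / else-if ordering is what resolves the conflict between the two pipeline registers: the MEM/WB value is used only when EX/MEM does not supply the operand.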
Chapter 4.91
Datapath with Forwarding Hardware
[Figure: pipelined datapath in which the Forward Unit takes ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd as inputs and selects among mux inputs 00, 01, and 10 for each ALU operand]
Chapter 4.92
Data Hazards and Stalls Chapter 4.7, page 313
Chapter 4.93
Forwarding with Load-use Data Hazards (logical view)
Instr. order:
  lw $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or $8,$1,$9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions]
Chapter 4.94
Forwarding with Load-use Data Hazards (logical view)
Instr. order:
  lw $1,4($2)
  stall
  sub $4,$1,$5
  and $6,$1,$7
  xor $4,$1,$5
  or $8,$1,$9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) in which sub $4,$1,$5 is held for one cycle after its IM stage]
Will still need one stall cycle even with forwarding
Chapter 4.95
Load-use Hazard Detection Unit
Need a Hazard detection Unit in the ID stage that inserts a stall between the load and its use
1. ID Hazard detection Unit:
if (ID/EX.MemRead &&
    ((ID/EX.RegisterRt == IF/ID.RegisterRs) || (ID/EX.RegisterRt == IF/ID.RegisterRt)))
    stall the pipeline (more accurately, stall the following instructions)
The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)
After this one-cycle stall, the forwarding logic can handle the remaining data hazards
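The detection condition can be sketched as a single predicate. The function and parameter names are hypothetical stand-ins for the pipeline register fields named on the slide.

```python
def must_stall(id_ex_memread: int, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID uses the result of a lw now in EX."""
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))


# lw $1,4($2) in EX (destination Rt = 1); sub $4,$1,$5 in ID (Rs = 1, Rt = 5)
stall = must_stall(1, 1, if_id_rs=1, if_id_rt=5)
```

ID/EX.MemRead is what identifies the instruction in EX as a load, so an ALU instruction writing the same register does not trigger a stall and is handled by forwarding instead.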
Chapter 4.96
Hazard/Stall Hardware
Along with the Hazard Unit, we have to implement the stall
Prevent the instructions in the IF and ID stages from progressing down the pipeline, done by preventing the PC register and the IF/ID pipeline register from changing
- Hazard detection Unit controls the writing of the PC (PC.write) and IF/ID (IF/ID.write) registers
Insert a “bubble” between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage) (i.e., insert a nop in the execution stream)
- Set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0 (nop). The Hazard Unit controls the mux that chooses between the real control values and the 0’s.
Let the lw instruction and the instructions ahead of it in the pipeline proceed normally down the pipeline
Chapter 4.97
Adding the Hazard/Stall Hardware
[Figure: pipelined datapath with the Forward Unit and a Hazard Unit that takes ID/EX.MemRead and ID/EX.RegisterRt as inputs and controls a mux that selects between the real control values and 0]
Chapter 4.98
Stall/Bubble in the Pipeline
Chapter 4.99
A Challenge: Memory-to-Memory Copies
Instr. order:
  lw $1,4($2)
  sw $1,4($3)
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the two instructions]
For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input.
- Would need to add a Forward Unit and a mux to the MEM stage
Chapter 4.100
Control Hazards Chapter 4.8
Chapter 4.101
MIPS Pipeline Control Path Modifications
Chapter 4.102
Control Hazards
When the flow of instruction addresses is not sequential (i.e., next PC ≠ PC + 4); incurred by change of flow instructions
- Unconditional branches (j, jal, jr)
- Conditional branches (beq, bne)
- Exceptions
Possible approaches
- Stall (impacts CPI)
- Move decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- Predict and hope for the best!
- Delay decision (requires compiler support)
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards
Chapter 4.103
Branch Instr’s Cause Control Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for beq, lw, Inst 3, and Inst 4]
Dependencies backward in time cause hazards
Chapter 4.104
One Way to “Fix” a Branch Control Hazard
Instr. order:
  beq
  flush
  flush
  flush
  beq target
  Inst 3
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence above]
Fix branch hazard by waiting (flush), but this affects CPI
Chapter 4.105
Another Way to “Fix” a Branch Control Hazard
Instr. order:
  beq
  flush
  beq target
  Inst 3
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence above]
Move branch decision hardware as early in the pipeline as possible, i.e., during the decode cycle
Chapter 4.106
Reducing the Delay of Branches
Add hardware to compute the branch target address and evaluate the branch decision to the ID stage
- Reduces the number of stall (flush) cycles to one (like with jumps)
  - But now need to add forwarding hardware in ID stage
- Comparing and updating the PC adds 2 muxes, a comparator, and an AND gate to the ID stage
  - Computing branch target address can be done in parallel with RegFile read (done for all instructions, only used when needed)
  - Comparing the registers can be done in the same clock cycle as RegFile read
For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls
Chapter 4.107
Supporting ID Stage Branches
[Figure: pipelined datapath with the branch adder and a Compare unit in the ID stage, a second Forward Unit for the ID-stage comparison, the Hazard Unit, and an IF.Flush control signal]
Chapter 4.108
Example: Branch Taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
68: add $1, $2, $3
72: lw $4, 50($7)
76: and $3, $1, $2
Chapter 4.109
Example: Branch Taken
Chapter 4.110
Example: Branch Taken
Chapter 4.111
Data Hazards for Branches
If a comparison register is a destination of the 2nd preceding ALU instruction
- Can resolve using forwarding
  add $1, $2, $3       IF ID EX MEM WB
  add $4, $5, $6          IF ID EX MEM WB
  Inst 3                     IF ID EX MEM WB
  beq $1, $4, target            IF ID EX MEM WB
Chapter 4.112
Data Hazards for Branches
If a comparison register is a destination of the preceding ALU instruction or the 2nd preceding load instruction
- Need 1 stall cycle
  lw $1, addr          IF ID EX MEM WB
  add $4, $5, $6          IF ID EX MEM WB
  beq stalled                IF ID
  beq $1, $4, target            ID EX MEM WB
Chapter 4.113
Data Hazards for Branches
If a comparison register is a destination of the immediately preceding load instruction
- Need 2 stall cycles
- But no data forwarding
  lw $1, addr          IF ID EX MEM WB
  beq stalled             IF ID
  beq stalled                ID
  beq $1, $0, target            ID EX MEM WB
Chapter 4.114
Summary
All modern processors use pipelining for performance (a CPI close to 1 and fast clock frequency)
Pipeline clock rate is limited by the slowest pipeline stage, so designing a balanced pipeline is important
Pipelining doesn’t decrease the latency of a single task; it increases the throughput of the entire workload
Must detect and resolve hazards
- Structural hazards: resolved by designing the pipeline correctly
- Data hazards
  - Stall (impacts CPI)
  - Forward (requires hardware support)
- Control hazards: put the branch decision hardware in as early a stage in the pipeline as possible
  - Stall (impacts CPI)
  - Static and dynamic prediction (requires hardware support)
  - Delay decision (requires compiler support)