Computer Architecture, EEL 4713/5764, Fall 2006. Dr. Linda DeBrunner. Module #16 – Pipeline Performance Limits (Part IV: Data Path and Control). Pipeline performance is limited by data and control dependencies; hardware provisions include data forwarding and branch prediction, while software remedies include the delayed branch and instruction reordering.
FAMU-FSU College of Engineering
Computer Architecture, EEL 4713/5764, Fall 2006
Dr. Linda DeBrunner
Module #16 – Pipeline Performance Limits
Mar. 2006 Computer Architecture, Data Path and Control Slide 2
Part IV: Data Path and Control
16 Pipeline Performance Limits
Pipeline performance limited by data & control dependencies
• Hardware provisions: data forwarding, branch prediction
• Software remedies: delayed branch, instruction reordering
Topics in This Chapter
16.1 Data Dependencies and Hazards
16.2 Data Forwarding
16.3 Pipeline Branch Hazards
16.4 Delayed Branch and Branch Prediction
16.5 Dealing with Exceptions
16.6 Advanced Pipelining
16.1 Data Dependencies and Hazards
Fig. 16.1 Data dependency in a pipeline.
(Diagram: five instructions overlapped across cycles 1–9, each passing through the instruction cache, register file, ALU, data cache, and register write-back. The first instruction computes $2 = $1 − $3; the three instructions that follow read register $2 before its result has been written back.)
Fig. 16.2 When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.
Resolving Data Dependencies via Forwarding
(Diagram: the ALU result of $2 = $1 − $3 is forwarded directly to the ALU inputs of the following instructions that read register $2, without waiting for write-back to the register file.)
Fig. 16.3 When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
Certain Data Dependencies Lead to Bubbles
(Diagram: lw $2,4($12) produces its result only after the data-cache access in stage 4, so an immediately following instruction that reads register $2 cannot be served by forwarding; a one-cycle bubble must be inserted.)
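The load-use condition illustrated by Fig. 16.3 can be sketched as a small predicate (a Python illustration only; the function and argument names are invented and not part of the MicroMIPS design):

```python
def must_stall(prev_is_load, prev_dest_reg, rs, rt):
    """Classic load-use hazard check: if the immediately preceding
    instruction is a load (e.g., lw $2, 4($12)) and the current
    instruction reads the loaded register, a one-cycle bubble is needed
    because the loaded value is not available until after the data-cache
    access in stage 4."""
    return prev_is_load and prev_dest_reg in (rs, rt)

# lw $2, 4($12) followed by an instruction that reads $2 -> stall
print(must_stall(True, 2, 2, 3))   # True: bubble required
# ... followed by an instruction that does not read $2 -> no stall
print(must_stall(True, 2, 1, 3))   # False
```

Note that a single bubble suffices here: after one stall cycle, the loaded value can be forwarded from stage 4.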
16.2 Data Forwarding
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path.
(Diagram: stages 2–5 of the pipelined MicroMIPS data path, augmented with upper and lower forwarding units. Each unit compares the stage-2 source register (s2 or t2) against the destinations d3 and d4 of the instructions in stages 3 and 4, and uses the control signals RetAddr3, RegWrite3, RegWrite4, RegInSrc3, and RegInSrc4 to choose among the register-file read-out (x2 or y2) and the forwarded values x3/y3 and x4/y4 for the ALU inputs.)
Design of the Data Forwarding Units
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path.
Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path.

RegWrite3  RegWrite4  s2matchesd3  s2matchesd4  RetAddr3  RegInSrc3  RegInSrc4  Choose
    0          0           x            x           x         x          x        x2
    0          1           x            0           x         x          x        x2
    0          1           x            1           x         x          0        x4
    0          1           x            1           x         x          1        y4
    1          0           1            x           0         1          x        x3
    1          0           1            x           1         1          x        y3
    1          1           1            1           0         1          x        x3
Let’s focus on designing the upper data forwarding unit. (Note: the corresponding table in the textbook is incorrect.)
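As a cross-check of Table 16.1, the selection can be written as a small priority function (a Python sketch, not hardware: the stage-3 value, being the newest, takes priority over the stage-4 value, which in turn takes priority over the register-file read-out):

```python
def upper_forward(RegWrite3, RegWrite4, s2matchesd3, s2matchesd4,
                  RetAddr3, RegInSrc3, RegInSrc4):
    """Select the upper ALU input, following Table 16.1 (a partial
    table, so input combinations outside it are unspecified).
    Priority: stage-3 match first, then stage-4 match, else x2."""
    if RegWrite3 and s2matchesd3 and RegInSrc3:
        return 'y3' if RetAddr3 else 'x3'   # newest value, from stage 3
    if RegWrite4 and s2matchesd4:
        return 'y4' if RegInSrc4 else 'x4'  # older value, from stage 4
    return 'x2'                             # no hazard: register file

print(upper_forward(1, 0, 1, 0, 0, 1, 0))  # x3: forward the stage-3 result
print(upper_forward(0, 1, 0, 1, 0, 0, 1))  # y4: forward the stage-4 result
```

Running the function over the seven rows of Table 16.1 reproduces the Choose column exactly.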
Hardware for Inserting Bubbles
Fig. 16.5 Data hazard detector for the pipelined MicroMIPS data path.
(Diagram: the data hazard detector sits in stage 2, comparing rs and rt of the newly fetched instruction against the destination of the instruction in stage 3, with DataRead2 flagging a load. On a load-use hazard it deasserts LoadPC to freeze the PC and instruction register, and forces the control signals from the decoder to all 0s, inserting a bubble.)
16.3 Pipeline Branch Hazards
Software-based solutions
Compiler inserts a “no-op” after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction: always predict that the branch is not taken; flush if mistaken
More elaborate branch prediction strategies possible
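The hardware scheme above amounts to predicting not-taken and paying a flush penalty on every taken branch. Its cost can be sketched with a toy cycle-count model (an ideal 1-CPI pipeline is assumed; the numbers below are illustrative, not from the slides):

```python
def cycles_with_flush(n_instructions, n_taken_branches, flush_penalty=1):
    """Total cycles when every taken branch costs `flush_penalty`
    bubble(s) for the flushed wrong-path instruction(s), on an
    otherwise ideal one-instruction-per-cycle pipeline."""
    return n_instructions + n_taken_branches * flush_penalty

# 100 instructions, 15 of them taken branches, 1 bubble per flush:
cycles = cycles_with_flush(100, 15)
print(cycles, cycles / 100)   # 115 cycles -> effective CPI of 1.15
```

This is why reducing mispredictions (or filling delay slots with useful work) translates directly into CPI improvement.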
16.4 Branch Prediction
Predicting whether a branch will be taken
Always predict that the branch will not be taken
Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
Allow programmer or compiler to supply clues
Decide based on past history (maintain a small history table); to be discussed later
Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
A Simple Branch Prediction Algorithm
Fig. 16.6 Four-state branch prediction scheme.
(State diagram: four states – “predict taken again,” “predict taken,” “predict not taken,” and “predict not taken again.” Each taken outcome moves the state toward the taken side and each not-taken outcome toward the not-taken side, so two consecutive mispredictions are needed to flip the prediction.)
Example 16.1
Impact of different branch prediction schemes on a doubly nested loop:

L1: ----
    ----
L2: ----
    ----
    br <c2> L2    (inner loop: 20 iterations)
    ----
    br <c1> L1    (outer loop: 10 iterations)

Solution
Always taken: 11 mispredictions, 94.8% accurate
1-bit history: 20 mispredictions, 90.5% accurate
2-bit history: Same as always taken
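The four-state scheme of Fig. 16.6 behaves like a 2-bit saturating counter. A minimal Python sketch (the 0–3 state encoding is an assumption made for illustration):

```python
class TwoBitPredictor:
    """Four-state branch predictor as in Fig. 16.6. States 0-1 predict
    not-taken, 2-3 predict taken; each actual outcome moves the counter
    one step toward that outcome, saturating at 0 and 3."""
    def __init__(self, state=0):
        self.state = state              # 0 = strongly not-taken ... 3 = strongly taken
    def predict(self):
        return self.state >= 2          # True means "predict taken"
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# An inner-loop branch pattern: taken 19 times, then not taken once.
p = TwoBitPredictor(state=3)            # assume warmed up: strongly taken
mispredicts = 0
for outcome in [True] * 19 + [False]:
    if p.predict() != outcome:
        mispredicts += 1
    p.update(outcome)
print(mispredicts)                      # 1: only the loop-exit branch misses
```

The single not-taken exit drops the counter only to the weakly taken state, so the next run of the inner loop starts with a correct taken prediction. That is why the 2-bit scheme matches “always taken” in Example 16.1, whereas a 1-bit predictor also mispredicts the first iteration of every run.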
Hardware Implementation of Branch Prediction
Fig. 16.7 Hardware elements for a branch prediction scheme.
The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches (Chapter 18)
(Diagram: a branch history table indexed by low-order PC bits. Each entry holds the address of a recent branch instruction, its target address, and its history bit(s); a full-address compare validates the entry. On a predicted-taken hit, the stored target address is selected as the next PC instead of the incremented PC.)
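The lookup in Fig. 16.7 can be sketched as a tiny direct-mapped table (a Python illustration; the table size, entry layout, and function names are invented, and PC+4 sequencing is assumed):

```python
TABLE_SIZE = 8                          # entries; index = low-order PC bits

# Each entry: branch address (tag), its target, and a 1-bit history.
table = [{"tag": None, "target": None, "taken": False}
         for _ in range(TABLE_SIZE)]

def record(pc, target, taken):
    """Update the entry for a branch once its outcome is known."""
    table[pc % TABLE_SIZE] = {"tag": pc, "target": target, "taken": taken}

def next_pc(pc):
    """Predict the next PC: the stored target on a valid, predicted-taken
    hit (full-address compare), else the incremented PC."""
    e = table[pc % TABLE_SIZE]
    if e["tag"] == pc and e["taken"]:
        return e["target"]
    return pc + 4

record(0x40, 0x100, True)               # a recently taken branch at PC 0x40
print(hex(next_pc(0x40)))               # 0x100: hit, predicted taken
print(hex(next_pc(0x44)))               # 0x48: miss, fall through to PC + 4
```

The index/compare structure is the same as in a direct-mapped cache, as the slide notes.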
16.5 Advanced Pipelining
Fig. 16.8 Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement.
Deep pipeline = superpipeline; also superpipelined, superpipelining
Parallel instruction issue = superscalar, j-way issue (2–4 is typical)
(Diagram: instruction fetch from the instruction cache and decode in stages 1–3, operand preparation and instruction issue in stages 4–5, then a variable number of stages through one of several function units, followed by retirement and commit in stages q − 2 through q.)
Performance Improvement for Deep Pipelines
Hardware-based methods
Lookahead past an instruction that will/may stall in the pipeline (out-of-order execution; requires in-order retirement)
Issue multiple instructions (requires more ports on register file)
Eliminate false data dependencies via register renaming
Predict branch outcomes more accurately, or speculate
Software-based method
Pipeline-aware compilation
Loop unrolling to reduce the number of branches
Original loop:
  Loop: Compute with index i
        Increment i by 1
        Go to Loop if not done

Unrolled by a factor of 2:
  Loop: Compute with index i
        Compute with index i + 1
        Increment i by 2
        Go to Loop if not done
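The same transformation in executable form (a Python sketch; in a compiled language the saving comes from halving the loop-back branches and index updates, as the comments note):

```python
def sum_rolled(a):
    """Rolled loop: one index update and one loop-back branch per element."""
    s, i, n = 0, 0, len(a)
    while i < n:
        s += a[i]
        i += 1                  # increment by 1; branch taken n times
    return s

def sum_unrolled2(a):
    """Unrolled by 2: roughly half as many index updates and branches."""
    s, i, n = 0, 0, len(a)
    while i + 1 < n:
        s += a[i]
        s += a[i + 1]
        i += 2                  # increment by 2; branch taken ~n/2 times
    if i < n:                   # clean-up code for an odd trip count
        s += a[i]
    return s

print(sum_rolled([1, 2, 3, 4, 5]), sum_unrolled2([1, 2, 3, 4, 5]))  # 15 15
```

Note the clean-up code after the loop: unrolling must still handle trip counts that are not a multiple of the unrolling factor.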
CPI Variations with Architectural Features
Table 16.2 Effect of processor architecture, branch prediction methods, and speculative execution on CPI.
Architecture Methods used in practice CPI
Nonpipelined, multicycle Strict in-order instruction issue and exec 5-10
Nonpipelined, overlapped In-order issue, with multiple function units 3-5
Pipelined, static In-order exec, simple branch prediction 2-3
Superpipelined, dynamic Out-of-order exec, adv branch prediction 1-2
Superscalar 2- to 4-way issue, interlock & speculation 0.5-1
Advanced superscalar 4- to 8-way issue, aggressive speculation 0.2-0.5
Development of Intel’s Desktop/Laptop Micros
In the beginning, there was the 8080; led to the 80x86 = IA32 ISA
Half a dozen or so pipeline stages
80286, 80386, 80486, Pentium (80586)
A dozen or so pipeline stages, with out-of-order instruction execution
Pentium Pro, Pentium II, Pentium III, Celeron
Two dozen or so pipeline stages
Pentium 4
More advanced technology
Instructions are broken into micro-ops which are executed out-of-order but retired in-order
16.6 Dealing with Exceptions
Exceptions present the same problems as branches
How to handle instructions that are ahead in the pipeline? (Let them run to completion and retire their results.)
What to do with instructions after the exception point? (Flush them out so that they do not affect the state.)
Precise versus imprecise exceptions
Precise exceptions hide the effects of pipelining and parallelism by forcing the same state as that of strict sequential execution (desirable, because exception handling is not complicated)
Imprecise exceptions are messy, but lead to faster hardware (the interrupt handler can clean up afterward to offer a precise exception)
The Three Hardware Designs for MicroMIPS
(Diagrams: the single-cycle, multicycle, and five-stage pipelined MicroMIPS data paths, all built from the same components – instruction cache, register file, ALU, data cache, and next-address logic – but with different register, multiplexer, and control-signal arrangements.)

Single-cycle: 125 MHz, CPI = 1
Multicycle: 500 MHz, CPI ≈ 4
Pipelined: 500 MHz, CPI ≈ 1.1
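The throughput of the three designs follows directly from these figures (a back-of-the-envelope comparison; MIPS = clock rate in MHz divided by CPI):

```python
def mips_rate(clock_mhz, cpi):
    """Native instruction throughput in MIPS = clock rate / CPI."""
    return clock_mhz / cpi

# The three MicroMIPS implementations compared above:
print(mips_rate(125, 1.0))          # single-cycle: 125.0 MIPS
print(mips_rate(500, 4.0))          # multicycle:   125.0 MIPS
print(round(mips_rate(500, 1.1)))   # pipelined:    455 MIPS
```

The multicycle design gains nothing over the single-cycle one here (its 4x clock is eaten by its 4x CPI); pipelining delivers roughly a 3.6x speedup over both.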
Where Do We Go from Here?
Memory Design: How to build a memory unit that responds in 1 clock cycle
Input and Output: Peripheral devices, I/O programming, interfacing, interrupts
Higher Performance: Vector/array processing, parallel processing
Before our next class meeting…
Homework #9 due on Nov. 2
Midterm Exam #2 is coming up!