Computer Architecture, EEL 4713/5764, Fall 2006. Dr. Linda DeBrunner. Module #16 – Pipeline Performance Limits (Part IV: Data Path and Control). Pipeline performance is limited by data and control dependencies; hardware provisions include data forwarding and branch prediction, while software remedies include the delayed branch and instruction reordering.
FAMU-FSU College of Engineering
Computer Architecture, EEL 4713/5764, Fall 2006
Dr. Linda DeBrunner
Module #16 – Pipeline Performance Limits
Mar. 2006 Computer Architecture, Data Path and Control Slide 2
Part IV: Data Path and Control
16 Pipeline Performance Limits
Pipeline performance limited by data & control dependencies
• Hardware provisions: data forwarding, branch prediction
• Software remedies: delayed branch, instruction reordering
Topics in This Chapter
16.1 Data Dependencies and Hazards
16.2 Data Forwarding
16.3 Pipeline Branch Hazards
16.4 Delayed Branch and Branch Prediction
16.5 Dealing with Exceptions
16.6 Advanced Pipelining
16.1 Data Dependencies and Hazards
Fig. 16.1 Data dependency in a pipeline.
(Diagram: five instructions overlapped across cycles 1–9, each passing through the instruction cache, register file, ALU, data cache, and register write-back. The first instruction computes $2 = $1 − $3; the three instructions that follow read register $2 before its result has been written back.)
Fig. 16.2 When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.
Resolving Data Dependencies via Forwarding
(Diagram: the ALU result of $2 = $1 − $3 is forwarded directly to the ALU inputs of the following instructions that read register $2, without waiting for write-back to the register file.)
Fig. 16.3 When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
Certain Data Dependencies Lead to Bubbles
(Diagram: lw $2,4($12) produces its result only after the data-cache access in stage 4, so an immediately following instruction that reads register $2 cannot be served by forwarding; a one-cycle bubble must be inserted.)
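The load-use condition illustrated by Fig. 16.3 can be sketched as a small predicate (a Python illustration only; the function and argument names are invented and not part of the MicroMIPS design):

```python
def must_stall(prev_is_load, prev_dest_reg, rs, rt):
    """Classic load-use hazard check: if the immediately preceding
    instruction is a load (e.g., lw $2, 4($12)) and the current
    instruction reads the loaded register, a one-cycle bubble is needed
    because the loaded value is not available until after the data-cache
    access in stage 4."""
    return prev_is_load and prev_dest_reg in (rs, rt)

# lw $2, 4($12) followed by an instruction that reads $2 -> stall
print(must_stall(True, 2, 2, 3))   # True: bubble required
# ... followed by an instruction that does not read $2 -> no stall
print(must_stall(True, 2, 1, 3))   # False
```

Note that a single bubble suffices here: after one stall cycle, the loaded value can be forwarded from stage 4.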
16.2 Data Forwarding
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path.
(Diagram: stages 2–5 of the pipelined MicroMIPS data path, augmented with upper and lower forwarding units. Each unit compares the stage-2 source register (s2 or t2) against the destinations d3 and d4 of the instructions in stages 3 and 4, and uses the control signals RetAddr3, RegWrite3, RegWrite4, RegInSrc3, and RegInSrc4 to choose among the register-file read-out (x2 or y2) and the forwarded values x3/y3 and x4/y4 for the ALU inputs.)
Design of the Data Forwarding Units
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path.
Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path.

RegWrite3  RegWrite4  s2matchesd3  s2matchesd4  RetAddr3  RegInSrc3  RegInSrc4  Choose
    0          0           x            x           x         x          x        x2
    0          1           x            0           x         x          x        x2
    0          1           x            1           x         x          0        x4
    0          1           x            1           x         x          1        y4
    1          0           1            x           0         1          x        x3
    1          0           1            x           1         1          x        y3
    1          1           1            1           0         1          x        x3
Let’s focus on designing the upper data forwarding unit. (Note: the corresponding table in the textbook is incorrect.)
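As a cross-check of Table 16.1, the selection can be written as a small priority function (a Python sketch, not hardware: the stage-3 value, being the newest, takes priority over the stage-4 value, which in turn takes priority over the register-file read-out):

```python
def upper_forward(RegWrite3, RegWrite4, s2matchesd3, s2matchesd4,
                  RetAddr3, RegInSrc3, RegInSrc4):
    """Select the upper ALU input, following Table 16.1 (a partial
    table, so input combinations outside it are unspecified).
    Priority: stage-3 match first, then stage-4 match, else x2."""
    if RegWrite3 and s2matchesd3 and RegInSrc3:
        return 'y3' if RetAddr3 else 'x3'   # newest value, from stage 3
    if RegWrite4 and s2matchesd4:
        return 'y4' if RegInSrc4 else 'x4'  # older value, from stage 4
    return 'x2'                             # no hazard: register file

print(upper_forward(1, 0, 1, 0, 0, 1, 0))  # x3: forward the stage-3 result
print(upper_forward(0, 1, 0, 1, 0, 0, 1))  # y4: forward the stage-4 result
```

Running the function over the seven rows of Table 16.1 reproduces the Choose column exactly.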
Hardware for Inserting Bubbles
Fig. 16.5 Data hazard detector for the pipelined MicroMIPS data path.
(Diagram: the data hazard detector sits in stage 2, comparing rs and rt of the newly fetched instruction against the destination of the instruction in stage 3, with DataRead2 flagging a load. On a load-use hazard it deasserts LoadPC to freeze the PC and instruction register, and forces the control signals from the decoder to all 0s, inserting a bubble.)
16.3 Pipeline Branch Hazards
Software-based solutions
Compiler inserts a “no-op” after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction: always predict that the branch is not taken; flush if mistaken
More elaborate branch prediction strategies possible
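The hardware scheme above amounts to predicting not-taken and paying a flush penalty on every taken branch. Its cost can be sketched with a toy cycle-count model (an ideal 1-CPI pipeline is assumed; the numbers below are illustrative, not from the slides):

```python
def cycles_with_flush(n_instructions, n_taken_branches, flush_penalty=1):
    """Total cycles when every taken branch costs `flush_penalty`
    bubble(s) for the flushed wrong-path instruction(s), on an
    otherwise ideal one-instruction-per-cycle pipeline."""
    return n_instructions + n_taken_branches * flush_penalty

# 100 instructions, 15 of them taken branches, 1 bubble per flush:
cycles = cycles_with_flush(100, 15)
print(cycles, cycles / 100)   # 115 cycles -> effective CPI of 1.15
```

This is why reducing mispredictions (or filling delay slots with useful work) translates directly into CPI improvement.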
16.4 Branch Prediction
Predicting whether a branch will be taken
Always predict that the branch will not be taken
Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
Allow programmer or compiler to supply clues
Decide based on past history (maintain a small history table); to be discussed later
Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
A Simple Branch Prediction Algorithm
Fig. 16.6 Four-state branch prediction scheme.
(State diagram: four states – “predict taken again,” “predict taken,” “predict not taken,” and “predict not taken again.” Each taken outcome moves the state toward the taken side and each not-taken outcome toward the not-taken side, so two consecutive mispredictions are needed to flip the prediction.)
Example 16.1
Impact of different branch prediction schemes on a doubly nested loop:

L1: ----
    ----
L2: ----
    ----
    br <c2> L2    (inner loop: 20 iterations)
    ----
    br <c1> L1    (outer loop: 10 iterations)

Solution
Always taken: 11 mispredictions, 94.8% accurate
1-bit history: 20 mispredictions, 90.5% accurate
2-bit history: Same as always taken
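The four-state scheme of Fig. 16.6 behaves like a 2-bit saturating counter. A minimal Python sketch (the 0–3 state encoding is an assumption made for illustration):

```python
class TwoBitPredictor:
    """Four-state branch predictor as in Fig. 16.6. States 0-1 predict
    not-taken, 2-3 predict taken; each actual outcome moves the counter
    one step toward that outcome, saturating at 0 and 3."""
    def __init__(self, state=0):
        self.state = state              # 0 = strongly not-taken ... 3 = strongly taken
    def predict(self):
        return self.state >= 2          # True means "predict taken"
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# An inner-loop branch pattern: taken 19 times, then not taken once.
p = TwoBitPredictor(state=3)            # assume warmed up: strongly taken
mispredicts = 0
for outcome in [True] * 19 + [False]:
    if p.predict() != outcome:
        mispredicts += 1
    p.update(outcome)
print(mispredicts)                      # 1: only the loop-exit branch misses
```

The single not-taken exit drops the counter only to the weakly taken state, so the next run of the inner loop starts with a correct taken prediction. That is why the 2-bit scheme matches “always taken” in Example 16.1, whereas a 1-bit predictor also mispredicts the first iteration of every run.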
Hardware Implementation of Branch Prediction
Fig. 16.7 Hardware elements for a branch prediction scheme.
The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches (Chapter 18)
(Diagram: a branch history table indexed by low-order PC bits. Each entry holds the address of a recent branch instruction, its target address, and its history bit(s); a full-address compare validates the entry. On a predicted-taken hit, the stored target address is selected as the next PC instead of the incremented PC.)
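The lookup in Fig. 16.7 can be sketched as a tiny direct-mapped table (a Python illustration; the table size, entry layout, and function names are invented, and PC+4 sequencing is assumed):

```python
TABLE_SIZE = 8                          # entries; index = low-order PC bits

# Each entry: branch address (tag), its target, and a 1-bit history.
table = [{"tag": None, "target": None, "taken": False}
         for _ in range(TABLE_SIZE)]

def record(pc, target, taken):
    """Update the entry for a branch once its outcome is known."""
    table[pc % TABLE_SIZE] = {"tag": pc, "target": target, "taken": taken}

def next_pc(pc):
    """Predict the next PC: the stored target on a valid, predicted-taken
    hit (full-address compare), else the incremented PC."""
    e = table[pc % TABLE_SIZE]
    if e["tag"] == pc and e["taken"]:
        return e["target"]
    return pc + 4

record(0x40, 0x100, True)               # a recently taken branch at PC 0x40
print(hex(next_pc(0x40)))               # 0x100: hit, predicted taken
print(hex(next_pc(0x44)))               # 0x48: miss, fall through to PC + 4
```

The index/compare structure is the same as in a direct-mapped cache, as the slide notes.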
16.5 Advanced Pipelining
Fig. 16.8 Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement.
Deep pipeline = superpipeline; also superpipelined, superpipelining
Parallel instruction issue = superscalar, j-way issue (2–4 is typical)
(Diagram: instruction fetch from the instruction cache and decode in stages 1–3, operand preparation and instruction issue in stages 4–5, then a variable number of stages through one of several function units, followed by retirement and commit in stages q − 2 through q.)
Performance Improvement for Deep Pipelines
Hardware-based methods
Lookahead past an instruction that will/may stall in the pipeline (out-of-order execution; requires in-order retirement)
Issue multiple instructions (requires more ports on register file)
Eliminate false data dependencies via register renaming
Predict branch outcomes more accurately, or speculate
Software-based method
Pipeline-aware compilation
Loop unrolling to reduce the number of branches
Original loop:
  Loop: Compute with index i
        Increment i by 1
        Go to Loop if not done

Unrolled by a factor of 2:
  Loop: Compute with index i
        Compute with index i + 1
        Increment i by 2
        Go to Loop if not done
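The same transformation in executable form (a Python sketch; in a compiled language the saving comes from halving the loop-back branches and index updates, as the comments note):

```python
def sum_rolled(a):
    """Rolled loop: one index update and one loop-back branch per element."""
    s, i, n = 0, 0, len(a)
    while i < n:
        s += a[i]
        i += 1                  # increment by 1; branch taken n times
    return s

def sum_unrolled2(a):
    """Unrolled by 2: roughly half as many index updates and branches."""
    s, i, n = 0, 0, len(a)
    while i + 1 < n:
        s += a[i]
        s += a[i + 1]
        i += 2                  # increment by 2; branch taken ~n/2 times
    if i < n:                   # clean-up code for an odd trip count
        s += a[i]
    return s

print(sum_rolled([1, 2, 3, 4, 5]), sum_unrolled2([1, 2, 3, 4, 5]))  # 15 15
```

Note the clean-up code after the loop: unrolling must still handle trip counts that are not a multiple of the unrolling factor.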
CPI Variations with Architectural Features
Table 16.2 Effect of processor architecture, branch prediction methods, and speculative execution on CPI.
Architecture Methods used in practice CPI
Nonpipelined, multicycle Strict in-order instruction issue and exec 5-10
Nonpipelined, overlapped In-order issue, with multiple function units 3-5
Pipelined, static In-order exec, simple branch prediction 2-3
Superpipelined, dynamic Out-of-order exec, adv branch prediction 1-2
Superscalar 2- to 4-way issue, interlock & speculation 0.5-1
Advanced superscalar 4- to 8-way issue, aggressive speculation 0.2-0.5
Development of Intel’s Desktop/Laptop Micros
In the beginning, there was the 8080; led to the 80x86 = IA32 ISA
Half a dozen or so pipeline stages
80286, 80386, 80486, Pentium (80586)
A dozen or so pipeline stages, with out-of-order instruction execution
Pentium Pro, Pentium II, Pentium III, Celeron
Two dozen or so pipeline stages
Pentium 4
More advanced technology
Instructions are broken into micro-ops which are executed out-of-order but retired in-order
16.6 Dealing with Exceptions
Exceptions present the same problems as branches
How to handle instructions that are ahead in the pipeline? (Let them run to completion and retire their results.)
What to do with instructions after the exception point? (Flush them out so that they do not affect the state.)
Precise versus imprecise exceptions
Precise exceptions hide the effects of pipelining and parallelism by forcing the same state as that of strict sequential execution (desirable, because exception handling is not complicated)
Imprecise exceptions are messy, but lead to faster hardware (the interrupt handler can clean up afterward to offer a precise exception)
The Three Hardware Designs for MicroMIPS
(Diagrams: the single-cycle, multicycle, and five-stage pipelined MicroMIPS data paths, all built from the same components – instruction cache, register file, ALU, data cache, and next-address logic – but with different register, multiplexer, and control-signal arrangements.)

Single-cycle: 125 MHz, CPI = 1
Multicycle: 500 MHz, CPI ≈ 4
Pipelined: 500 MHz, CPI ≈ 1.1
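The throughput of the three designs follows directly from these figures (a back-of-the-envelope comparison; MIPS = clock rate in MHz divided by CPI):

```python
def mips_rate(clock_mhz, cpi):
    """Native instruction throughput in MIPS = clock rate / CPI."""
    return clock_mhz / cpi

# The three MicroMIPS implementations compared above:
print(mips_rate(125, 1.0))          # single-cycle: 125.0 MIPS
print(mips_rate(500, 4.0))          # multicycle:   125.0 MIPS
print(round(mips_rate(500, 1.1)))   # pipelined:    455 MIPS
```

The multicycle design gains nothing over the single-cycle one here (its 4x clock is eaten by its 4x CPI); pipelining delivers roughly a 3.6x speedup over both.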
Where Do We Go from Here?
Memory Design: How to build a memory unit that responds in 1 clock cycle
Input and Output: Peripheral devices, I/O programming, interfacing, interrupts
Higher Performance: Vector/array processing, parallel processing
Before our next class meeting…
Homework #9 due on Nov. 2
Midterm Exam #2 is coming up!