EE382A Lecture 4: Instruction Fetch Techniques

Department of Electrical Engineering, Stanford University
EE382A – Autumn 2009, John P. Shen
http://eeclass.stanford.edu/ee382a



Page 2: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Announcements

• Start thinking about the research project
  – Team: 2-3 students
  – Topic: an open problem in computer architecture
    • See our suggestions or propose your own
    • Don’t limit yourselves to scope of lectures
  – Project proposal due by 10/14

• No new required paper
  – But you have one from previous lecture due today
  – One optional paper on-line, don’t forget textbook readings

• Class photos
  – Upload your picture on axess

Page 3: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

A Dynamic Superscalar Processor

[Figure: pipeline stages and buffers, top to bottom: Fetch, Instruction/decode buffer, Decode, Dispatch buffer, Dispatch, Reservation stations, Issue, Execute, Finish, Reorder/completion buffer, Complete, Store buffer, Retire. Fetch through Dispatch are marked "In order", Issue through Finish "Out of order", Complete and Retire "In order".]

Page 4: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Lecture 4 Outline

1. Superscalar Pipeline Implementation
   a. Instruction Fetch & Decode
   b. Instruction Dispatch & Issue
   c. Instruction Execute
   d. Instruction Complete & Retire

2. Instruction Flow Techniques

Page 5: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

1. Superscalar Pipeline Implementation

Page 6: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

a. Instruction Fetch & Decode

• Goal of instruction fetch & decode
  – Supply pipeline with maximum number of useful instructions per cycle
  – The fetch logic sets the maximum possible performance (IPCmax)

• Impediments
  – Instruction cache misses
  – Instruction alignment
  – Complex instruction sets
    • CISC (x86, 390, etc)
  – Branches and jumps
    • Determining the instruction address (branch direction & targets)
    • Will start in this lecture & finish discussion in next one…

[Figure: the PC indexes the instruction memory; 3 instructions fetched]

Page 7: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Instruction Cache Organization

[Figure: two I-cache organizations: 1 cache line = 1 physical row, and 1 cache line = 2 physical rows]

Page 8: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

The Fetch Alignment Problem

[Figure: PC = XX00001 indexes an I-cache array through a row decoder (rows 000 to 111, columns 00 to 11); the fetch group is shown against the row width]

Page 9: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Solving the Alignment Problem

• Software solution: align taken branch targets to I-cache row starts
  – Effect on code size?
  – Effect on the I-cache miss rate?
  – What happens when we go to the next chip?

• Hardware solution
  – Detect (mis)alignment case
  – Allow access of multiple rows (current and sequential next)
    • True multi-ported cache, over-clocked cache, multi-banked cache, …
    • Or keep around the cache line from previous access
      – Assuming a large basic block
  – Collapse the two fetched rows into one instruction group
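The alignment effect can be sketched in a few lines of Python. This is only an illustration, assuming fixed-width instructions and a cache row that holds `row_width` of them; the function names are made up for this sketch. Reading one row delivers only the instructions from the PC to the end of that row, while reading the current and next sequential rows lets the hardware collapse them into a full group.

```python
def fetched_from_one_row(pc_index: int, row_width: int) -> int:
    """Instructions delivered when only the row containing the PC is read."""
    offset = pc_index % row_width      # position of the PC within its row
    return row_width - offset          # instructions left before the row ends


def fetched_with_next_row(pc_index: int, row_width: int, fetch_width: int) -> int:
    """Instructions delivered when the current and next sequential rows are
    read and collapsed into one group (taken branches are ignored here)."""
    available = fetched_from_one_row(pc_index, row_width) + row_width
    return min(fetch_width, available)


# PC = XX00001 with 4 instructions per row: the PC sits at offset 1 of its row.
print(fetched_from_one_row(0b00001, 4))       # 3
print(fetched_with_next_row(0b00001, 4, 4))   # 4
```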

Page 10: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

IBM RS/6000 2-way I-Cache and Fetch Logic

[Figure: RS/6000 I-cache arrays with their tags, and the fetch logic]

Page 11: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Instruction Decoding Issues

• Primary decoding tasks
  – Identify individual instructions for CISC ISAs
  – Determine instruction types (especially branches)
    • Drop instructions after (predicted) taken branches and jumps
    • Potentially restart the fetch pipeline from target address
  – Determine dependences between instructions

• Two important factors
  – Instruction set architecture (CISC vs RISC)
    • Determines decoding difficulty
  – Pipeline width
    • Determines the number of comparators needed for dependence detection

Page 12: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Pentium Pro Fetch/Decode


Page 13: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

I-Cache Pre-decoding in the AMD K5

[Figure: predecode data path. From memory: 8 instruction bytes (64 bits) → predecode logic adds 5 bits per byte → 8 instruction bytes + predecode bits (64 + 40 bits) → I-Cache → 16 instruction bytes + predecode bits (128 + 80 bits) → decode, translate, and dispatch → up to 4 ROPs (ROP1 to ROP4)]

• Overheads of pre-decoding?

• Any uses for RISC processors?

Page 14: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

b. Instruction Dispatch & Issue

[Figure: instruction fetching → instruction decoding → instruction dispatching → instruction execution on functional units FU 1, FU 2, FU 3, …, FU n]

Page 15: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Centralized Reservation Station

[Figure: Dispatch (issue) from a centralized reservation station (dispatch buffer) → Execute → Completion buffer]

Page 16: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Distributed Reservation Station

[Figure: Dispatch buffer → Dispatch → distributed reservation stations → Issue → Execute → Finish → Completion buffer → Complete]

Page 17: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Instruction Fetch Buffer

[Figure: Fetch Unit → instruction fetch buffer → Out-of-order Core]

• Smoothes out the rate mismatch between fetch and execution
  – Neither the fetch bandwidth nor the execution bandwidth is consistent

• Fetch bandwidth should be higher than execution bandwidth
  – We prefer to have a stockpile of instructions in the buffer to hide cache miss latencies. This requires both raw cache bandwidth + control flow speculation

Page 18: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

c. Instruction Execute

• Current trends
  – More parallelism → bypassing very challenging
  – Deeper pipelines
  – More diversity

• Functional unit types
  – Integer
  – Floating point
  – Load/store → most difficult to make parallel
  – Branch
  – Specialized units (media)

Page 19: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Bypass Networks

[Figure: processor block diagram with PC, I-Cache, BR Scan, BR Predict, Fetch Q, Decode and Reorder Buffer feeding four issue queues (BR/CR, FX/LD 1, FX/LD 2, FP) and their units (CR, BR, FX1, LD1, LD2, FX2, FP1, FP2), plus StQ and D-Cache]

• O(n²) interconnect from/to FU inputs and outputs
• Associative tag-match to find operands
• Solutions (hurt IPC, help cycle time)
  – Use RF only (IBM Power4) with no bypass network
  – Decompose into clusters (Alpha 21264)

Page 20: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Specialized units

• Intel Pentium 4 staggered adders
  – Fireball
  – Run at 2x clock frequency
  – Two 16-bit bitslices
  – Dependent ops execute on half-cycle boundaries
  – Full result not available until full cycle later

[Figure: staggered adder slices linked by the carry]

Page 21: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Specialized units

• FP multiply-accumulate
  – R = (A x B) + C
  – Doubles FLOP/instruction
  – Lose RISC instruction format symmetry: 3 source operands
  – Widely used

[Figure: panels (a) and (b) with blocks labeled ALU 0, ALU 2, Shift, A × B, (A × B) + C, ALU C, and Round/Normalize]

Page 22: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

d. Instruction Complete & Retire

• Out-of-order execution
  – ALU instructions
  – Load/store instructions

• In-order completion/retirement
  – Precise exceptions
  – Memory coherence and consistency

• Solutions
  – Reorder buffer
  – Store buffer
  – Load queue snooping (later)

Page 23: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Exceptional Limitations

• Precise exceptions: when exception occurs on instruction i
  – Instruction i has not modified the processor state (regs, memory)
  – All older instructions have fully completed
  – No younger instruction has modified the processor state (regs, memory)

• How do we maintain precise exception in a simple pipelined processor?

• What makes precise exceptions difficult on a
  – In-order diversified pipeline?
  – Out-of-order pipeline?

Page 24: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Precise Exceptions & OOO Processors

• Solution: force instructions to update processor state in-order
  – Register and memory updates done in order
  – Usually called instruction retirement or graduation or commitment

• Implementation: Re-order Buffer (ROB)
  – A FIFO for instruction tracking
  – 1 entry per instruction
    • PC, register/memory address, new value, ready, exception

• ROB algorithm
  – Allocate ROB entries at dispatch time in-order (FIFO)
  – Instructions update their ROB entry when they complete
  – Examine head of ROB for in-order retirement
    • If done retire, otherwise wait (this forces order)
    • If exception, flush contents of ROB, restart from instruction PC after handler
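The ROB algorithm can be sketched as a small Python model. This is a simplified illustration, not a hardware description: it tracks register destinations only, and the class, field, and method names are invented for the sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    pc: int
    dest: str                      # destination register name
    value: Optional[int] = None    # new value, filled in at completion
    ready: bool = False
    exception: bool = False


class ReorderBuffer:
    def __init__(self) -> None:
        self.fifo: deque = deque()
        self.regs: dict = {}       # architectural register state

    def allocate(self, pc: int, dest: str) -> RobEntry:
        """Dispatch: allocate entries in program order at the tail."""
        entry = RobEntry(pc, dest)
        self.fifo.append(entry)
        return entry

    def complete(self, entry: RobEntry, value: int, exception: bool = False) -> None:
        """Execution finished (possibly out of order): record the result."""
        entry.value, entry.ready, entry.exception = value, True, exception

    def retire(self) -> Optional[int]:
        """Retire from the head while the head is ready.
        Returns the PC of an excepting instruction, or None."""
        while self.fifo and self.fifo[0].ready:
            head = self.fifo[0]
            if head.exception:
                self.fifo.clear()  # flush the excepting and all younger entries
                return head.pc
            self.regs[head.dest] = head.value
            self.fifo.popleft()
        return None


rob = ReorderBuffer()
a = rob.allocate(0x100, "r1")
b = rob.allocate(0x104, "r2")
rob.complete(b, 7)      # the younger instruction finishes first
rob.retire()
print(rob.regs)         # {} : r2 must wait behind r1
rob.complete(a, 5)
rob.retire()
print(rob.regs)         # {'r1': 5, 'r2': 7}
```

Because only the head is examined, a finished younger instruction cannot change architectural state until every older instruction has retired.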

Page 25: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Re-order Buffer

[Figure: re-order buffer with in-order allocate, out-of-order complete, and in-order retire. Entry fields: PC, Dst Value, Dst Address, Ready, Exception Info]

Page 26: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Re-order Buffer Issues

• Problem
  – Already computed results must wait in the reorder buffer when they may be needed by other instructions which could otherwise execute.
  – Data dependent instructions must wait until the result has been committed to register

• Solution
  – Forwarding from the re-order buffer
    • Allows data in ROB to be used in place of data in registers
  – Forwarding implementation 1: search ROB for values when registers read
    • Only latest entry in ROB can be used
    • Many comparators, but logic is conceptually simple
  – Forwarding implementation 2: use score-board to track results in ROB
    • Register scoreboard notes if latest value in ROB and # of ROB entry
    • Don’t need to track FU any more; an instruction is fully identified by ROB entry

Page 27: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

ROB Alternatives: History File

• A FIFO buffer operated similarly to the ROB but logs old register values
  – Entry format just like ROB

• Algorithm
  – Entries allocated in-order at dispatch
  – Entries updated out-of-order at completion time
    • Destination register updated immediately
    • Old value of register noted in re-order buffer
  – Examine head of history file in-order
    • If no exception just de-allocate
    • If exception, reverse history file and undo all register updates before flushing

• Advantage: no need for separate forwarding from ROB

• Disadvantage: slower recovery from exceptions
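A rough Python sketch of the history file follows. It assumes, for simplicity, that no two in-flight instructions write the same register; the class and field names are hypothetical. Registers are updated at completion, and an exception at the head is handled by walking the log from youngest to oldest and restoring the saved values.

```python
class HistoryFile:
    """Sketch only: assumes in-flight instructions write distinct registers."""

    def __init__(self, regs: dict) -> None:
        self.regs = regs          # register file, updated as soon as results exist
        self.entries: list = []   # program order, oldest first

    def allocate(self, dest: str) -> dict:
        """Dispatch: allocate an entry in program order."""
        entry = {"dest": dest, "old": None, "done": False,
                 "wrote": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry: dict, value: int, exception: bool = False) -> None:
        """Completion (any order): log the old value, update the register."""
        entry["done"], entry["exception"] = True, exception
        if not exception:
            entry["old"] = self.regs.get(entry["dest"])
            entry["wrote"] = True
            self.regs[entry["dest"]] = value

    def retire(self) -> bool:
        """Examine the head in order. Returns True if an exception was handled."""
        while self.entries and self.entries[0]["done"]:
            if self.entries[0]["exception"]:
                # Walk from youngest to oldest, undoing every register update.
                for entry in reversed(self.entries):
                    if entry["wrote"]:
                        self.regs[entry["dest"]] = entry["old"]
                self.entries.clear()
                return True
            self.entries.pop(0)   # no exception: just de-allocate
        return False


hf = HistoryFile({"r1": 10, "r2": 20})
i1 = hf.allocate("r1")
i2 = hf.allocate("r2")
hf.complete(i2, 99)                  # younger instruction updates r2 right away
hf.complete(i1, 0, exception=True)   # older instruction raises an exception
hf.retire()
print(hf.regs)                       # {'r1': 10, 'r2': 20}
```

The undo loop is what makes recovery slower than a ROB flush: its cost grows with the number of younger instructions that already updated registers.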

Page 28: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

ROB Alternatives: Future File

• Use two separate register files:
  – Architectural file ⇒ Represents sequential execution.
  – Future file ⇒ Updated immediately upon instruction execution and used as the working file.

• Algorithm
  – When instruction reaches the head of ROB, it is committed to the architectural file
  – On an exception changes are brought over from the architectural file to the future file based on which instructions are still represented in the ROB

• Advantage: no need for separate forwarding from ROB

• Disadvantage: slower recovery from exceptions

Page 29: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Impediments to High IPC

[Figure: three flows through the pipeline. Instruction flow: I-cache, branch predictor, FETCH, instruction buffer, DECODE. Register data flow: integer, floating-point, media and memory units in EXECUTE, reorder buffer (ROB). Memory data flow: store queue, D-cache, COMMIT.]

Page 30: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

2. Instruction Flow Techniques

Page 31: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Control Flow Graph

• Shows possible paths of control flow through basic blocks

    main: addi r2, r0, A
          addi r3, r0, B
          addi r4, r0, C        BB 1
          addi r5, r0, N
          add  r10,r0, r0
          bge  r10,r5, end
    loop: lw   r20, 0(r2)
          lw   r21, 0(r3)       BB 2
          bge  r20,r21,T1
          sw   r21, 0(r4)       BB 3
          b    T2
    T1:   sw   r20, 0(r4)       BB 4
    T2:   addi r10,r10,1
          addi r2, r2, 4
          addi r3, r3, 4        BB 5
          addi r4, r4, 4
          blt  r10,r5, loop
    end:

[Figure: control flow graph over BB 1 to BB 5]

• Control Dependence
  – Node X is control dependent on Node Y if the computation in Y determines whether X executes
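As an illustration, the control flow graph of the listing can be written as a successor map in Python. The block names follow the BB labels in the listing; the `paths` helper is an assumption of this sketch, added only to enumerate the loop-free paths from entry to exit.

```python
# Successors of each basic block in the listing ("end" is the exit label).
cfg = {
    "BB1": ["BB2", "end"],   # bge r10,r5,end: fall through to loop, or exit
    "BB2": ["BB3", "BB4"],   # bge r20,r21,T1: fall through, or jump to T1
    "BB3": ["BB5"],          # b T2
    "BB4": ["BB5"],          # falls through to T2
    "BB5": ["BB2", "end"],   # blt r10,r5,loop: back edge, or fall through
    "end": [],
}


def paths(cfg: dict, node: str, target: str, seen: tuple = ()) -> list:
    """All acyclic paths from node to target, as lists of block names."""
    if node == target:
        return [[node]]
    found = []
    for succ in cfg[node]:
        if succ not in seen:
            for rest in paths(cfg, succ, target, seen + (node,)):
                found.append([node] + rest)
    return found


for p in paths(cfg, "BB1", "end"):
    print(" -> ".join(p))
```

BB 3 and BB 4 are each reached only through one outcome of the branch that ends BB 2, so both are control dependent on BB 2.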

Page 32: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Mapping CFG to Linear Instruction Sequence

[Figure: a control flow graph over basic blocks A, B, C, D mapped to linear instruction sequences; blocks C and D appear in both orders]

Page 33: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Disruption of Sequential Control Flow

[Figure: the superscalar pipeline (Fetch, Instruction/Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute, Finish, Reorder/Completion Buffer, Complete, Store Buffer, Retire) with the Branch unit in the Execute stage]

Page 34: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Condition Resolution

[Figure: the same pipeline annotated with the two sources of the branch condition, "CC reg." and "GP reg. value comp.", with the Branch unit in the Execute stage]

Page 35: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Target Address Generation

[Figure: the same pipeline annotated with the three target addressing modes, "PC-rel.", "Reg. ind." and "Reg. ind. with offset", with the Branch unit in the Execute stage]

Page 36: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

What’s So Bad About Branches?

• Performance Penalties
  – Use up execution resources
  – Fragmentation of I-Cache lines
  – Disruption of sequential control flow
    • Need to determine branch direction (conditional branches)
    • Need to determine branch target

Robs instruction fetch bandwidth and ILP

• Example:
  – We have an N-way superscalar processor (N is very large)
  – A branch every 5 instructions and each branch takes 3 cycles to resolve
  – What is the effective fetch bandwidth?
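One way to read the example, offered as a hedged sketch rather than the lecture's official answer: if fetch cannot proceed past a branch until it resolves, every run of 5 instructions costs 3 cycles no matter how large N is. The function below encodes exactly that assumption; its name and parameters are illustrative.

```python
def effective_fetch_bandwidth(instrs_per_branch: int, resolve_cycles: int) -> float:
    """Average instructions fetched per cycle when fetch stalls at every branch.

    Assumption: the machine is wide enough to fetch a whole run of
    instrs_per_branch instructions at once, and each run then occupies
    resolve_cycles cycles before the next run can be fetched.
    """
    return instrs_per_branch / resolve_cycles


print(round(effective_fetch_bandwidth(5, 3), 2))   # 1.67
```

If the 3 resolve cycles are instead counted on top of the fetch cycle, the result is 5/4 = 1.25. Under either reading the bandwidth is set by the branches, not by N.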

Page 37: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Riseman and Foster’s Study

• 7 benchmark programs on CDC-3600

• Assume infinite machine:
  – Infinite memory and instruction stack, register file, functional units
  – Consider only true dependency at data-flow limit

• If bounded to single basic block, i.e. no bypassing of branches ⇒ maximum speedup is 1.72

• Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known such that branches do not impede instruction execution)

  Br. Bypassed:  0     1     2     8     32    128
  Max Speedup:   1.72  2.72  3.62  7.21  24.4  51.2

Page 38: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Determining Branch Direction

Problem: Cannot fetch subsequent instructions until branch direction is determined

• Minimize penalty
  – Move the instruction that computes the branch condition away from branch (ISA & compiler)
  – 3 branch components can be separated
    • Specify end of BB, specify condition, specify target

• Make use of penalty
  – Bias for not-taken
  – Fill delay slots with useful/safe instructions (ISA & compiler)
  – Follow both paths of execution (hardware)
  – Predict branch direction (hardware)

Page 39: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Determining Branch Target

Problem: Cannot fetch subsequent instructions until branch target is determined

• Minimize delay
  – Generate branch target early in the pipeline (ISA & compiler)

• Make use of delay
  – Bias for not taken
  – Predict branch target (hardware)

PC-relative vs Register Indirect targets

Page 40: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Branch Condition Speculation Options

• Software Prediction
  – Encode an extra bit in the branch instruction
    • Predict not taken: set bit to 0
    • Predict taken: set bit to 1
  – Bit set by compiler or user; can use profiling
  – Static prediction, same behavior every time

• Biased For Not Taken
  – Does not affect the instruction set architecture
  – Not effective for loops

• Prediction Based on Branch Offsets
  – Positive offset: predict not taken
  – Negative offset: predict taken

• Prediction Based on History
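The offset-based rule is simple enough to state as code. A minimal sketch, assuming the branch offset is available as a signed integer (the function name is illustrative):

```python
def predict_by_offset(branch_offset: int) -> bool:
    """Return True (predict taken) for backward branches, False otherwise."""
    return branch_offset < 0


print(predict_by_offset(-16))   # True: a backward branch, e.g. closing a loop
print(predict_by_offset(+8))    # False: a forward branch
```

Unlike the plain not-taken bias, this rule handles loop-closing branches, which jump backward and are usually taken.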

Page 41: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Branch Instruction Speculation

[Figure: the fetch stage sends nPC to the I-cache. An FA-mux selects between nPC(seq.) = PC+4 and the speculative target from the branch predictor (using a BTB), which also supplies the speculative condition. The Branch unit in Execute sends updates (target addr. and history) back to the BTB. Pipeline stages shown: Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute, Finish, Completion Buffer.]

Page 42: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Branch Target Buffer (BTB)

• A small “cache-like” memory in the instruction fetch stage

[Figure: the current PC is matched against BTB entries; each entry holds Branch Inst. Address (tag), Branch History, Branch Target (Most Recent)]

• Remembers previously executed branches, their addresses, information to aid prediction, and most recent target addresses

• I-fetch stage compares current PC against those in BTB to “guess” nPC
  – If matched then prediction is made else nPC=PC+4
  – If predict taken then nPC=target address in BTB else nPC=PC+4

• When branch is actually resolved, BTB is updated
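The lookup and update rules can be sketched in Python. This is an idealized model under stated assumptions: a dictionary stands in for the associative array (so there is no capacity limit or replacement), the "history" is just the last outcome, and instructions are 4 bytes. All names are invented for the sketch.

```python
class BranchTargetBuffer:
    def __init__(self) -> None:
        # tag (branch instruction address) -> [predict_taken, most recent target]
        self.entries: dict = {}

    def next_pc(self, pc: int) -> int:
        """Guess nPC in the fetch stage."""
        entry = self.entries.get(pc)
        if entry is None:                  # no match: no prediction is made
            return pc + 4
        predict_taken, target = entry
        return target if predict_taken else pc + 4

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Called when the branch is actually resolved."""
        self.entries[pc] = [taken, target]


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x40)))               # 0x44 (miss, fall through)
btb.update(0x40, taken=True, target=0x10)
print(hex(btb.next_pc(0x40)))               # 0x10 (hit, predicted taken)
```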

Page 43: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

More on BTB (aka BTAC)

• Typically a large associative structure
  – Pentium3: 512 entries, 4-way SA
  – Opteron: 2K entries, 4-way SA

• Entry format
  – Valid bit, address tag (PC), target address, fall-through BB address (length of BB), branch type info, branch direction prediction

• BTB provides both target and direction prediction
  – More on direction prediction later on

• Multi-cycle BTB access?
  – The case in most modern processors (2 cycle BTB)
  – Start BTB access along with I-cache in cycle 0
  – In cycle 1, fetch from BTB+N (predict not-taken)
  – In cycle 2, use BTB output to verify
    • 1 cycle fetch bubble if branch was taken

Page 44: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

I-Cache & BTB Integration

• Why does this make sense?

• Place a BTB entry in each cache line
  – Each cache line tells you which line to address next
  – Do not need the full PC, just an index into the cache + a way select
  – This is called way & line prediction

• Implemented in Alpha 21264 processor
  – On refills, prediction value points to the next sequential fetch line.
  – Prediction is trained later on as program executes…
    • When correct targets are known…
  – Line prediction is verified in next cycle. If line-way prediction is incorrect, slot stage is flushed and PC generated using instruction info and direction prediction

Page 45: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Alpha 21264 Line & Way Prediction

Source: IEEE Micro, March-April 1999

Page 46: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Branch Direction: UCB Study [Lee and Smith, 1984]

• Benchmarks
  – 26 programs (traces on IBM 370, DEC PDP-11, CDC 6400)
  – Use trace-driven simulation with parameterized machine models

• Branch types
  – Unconditional: always taken or always not taken
  – Subroutine call: always taken
  – Loop control: usually taken (loop back)
  – Decision: either way, e.g. IF-THEN-ELSE
  – Computed GOTO: always taken, with changing target
  – Supervisor call: always taken
  – “Execute”: always taken (IBM 370)

• Branch behavior: Taken vs Not Taken

        IBM1   IBM2   IBM3   IBM4   DEC    CDC    Average
    T   0.640  0.657  0.704  0.540  0.738  0.778  0.676
    NT  0.360  0.343  0.296  0.460  0.262  0.222  0.324

Page 47: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Branch Prediction Function

• Based on opcode only (%)

    IBM1  IBM2  IBM3  IBM4  DEC  CDC
    66    69    71    55    80   78

• Based on history of branch
  – Branch prediction function F (X1, X2, .... )
  – Use up to 5 previous branches for history (%)

       IBM1  IBM2  IBM3  IBM4  DEC   CDC
    0  64.1  64.4  70.4  54.0  73.8  77.8
    1  91.9  95.2  86.6  79.7  96.5  82.3
    2  93.3  96.5  90.8  83.4  97.5  90.6
    3  93.7  96.7  91.2  83.5  97.7  93.5
    4  94.5  97.0  92.0  83.7  98.1  95.3
    5  94.7  97.1  92.2  83.9  98.2  95.7

Page 48: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Example Prediction Algorithm based on History

• Prediction accuracy approaches maximum with as few as 2 preceding branch occurrences used as history

[Figure: state diagram over the last two branch outcomes (TT, NT, TN, NN) with T and N transitions; each state is labeled with the next prediction]

Results (%)

    IBM1  IBM2  IBM3  IBM4  DEC   CDC
    93.3  96.5  90.8  83.4  97.5  90.6

Page 49: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Other Prediction Algorithms Based on History

[Figure: two four-state prediction FSMs with states T, t?, n?, N and T/N transitions: a saturation counter and a hysteresis counter]

• Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

Page 50: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Designing History-based FSMs for Prediction

– 2-bit FSM as good as n-bit FSM
– Saturating counter as good as any FSM

IBM Study [Nair, 1992]

• 2^20 possible FSMs, 5248 actually interesting

    Benchmark  Optimal  “Counter”
    spice2g6   97.2     97.0
    doduc      94.3     94.3
    gcc        89.1     89.1
    espresso   89.1     89.1
    li         87.1     86.8
    eqntott    87.9     87.2

[Figure: state diagrams of the FSMs with T and N transitions; * marks the initial state, and states are marked Predict NT or Predict T]
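A 2-bit saturating counter is small enough to sketch directly. The state encoding below (0 and 1 predict not taken, 2 and 3 predict taken) is a common convention assumed for this sketch, and the loop trace is a made-up input, not data from the study.

```python
class SaturatingCounter:
    """2-bit counter: states 0,1 predict not taken; states 2,3 predict taken."""

    def __init__(self, state: int = 0) -> None:
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not taken


# A loop branch: taken 9 times, not taken once at loop exit, repeated 3 times.
counter = SaturatingCounter(state=3)
outcomes = ([True] * 9 + [False]) * 3
correct = 0
for taken in outcomes:
    correct += counter.predict() == taken
    counter.update(taken)
print(correct, "of", len(outcomes))   # 27 of 30
```

The single not-taken outcome at loop exit moves the counter from state 3 to state 2, which still predicts taken, so only the exit itself is mispredicted on each pass.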

Page 51: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Simple Branch History Table (BHT)

• Can be integrated with the BTB

• See any shortcoming?


Page 52: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

PowerPC 604 Branch Predictor

[Figure: the fetch address register (FAR) and FA-mux feed the fetch address (FA) to the I-cache, the Branch Target Address Cache (BTAC), and the Branch History Table (BHT). The mux chooses among the sequential address (+4), the BTAC prediction, and the BHT prediction. Pipeline: Decode Buffer, Decode, Dispatch Buffer, Dispatch, reservation stations for the BRN, SFX, SFX, CFX, FPU and LS units, Issue, Execute, Finish, Completion Buffer. The Branch unit sends BTAC and BHT updates.]

BTAC:
- 64 entries
- fully associative
- hit => predict taken

BHT:
- 512 entries
- direct mapped
- 2-bit saturating counter history based prediction
- overrides BTAC prediction

Page 53: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Branch Correlation

• So far, the prediction of each static branch instruction is based solely on its own past behavior and not the behaviors of other neighboring static branch instructions

• How about this one?

    x=0;
    if (someCondition) x=3;          /* Branch A */
    if (someOtherCondition) y+=19;   /* Branch B */
    if (x<=0) dosomething();         /* Branch C */

• Other correlation examples?

Page 54: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Global History Branch Predictor

• BHR: a shift register for global history (Shift in latest result in each cycle)

• Advantages & shortcomings?


Page 55: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Local History Branch Predictor

• What could be the patterns detected by such a predictor?

• Advantages & shortcomings?


Page 56: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

2-Level Adaptive Prediction [Yeh & Patt]

Nomenclature: {G,P}A{g,p,s}

[Figure: the Branch History Shift Register (BHSR, shift left when update) indexes the Pattern History Table (PHT), entries 00...00 to 11...11. The selected PHT bits give the prediction; FSM logic combines the old PHT bits with the branch result to produce the new bits.]

To achieve 97% average prediction accuracy:
  G (1) BHR: 18 bits; g (1) PHT: 2^18 x 2 bits; total = 524 kbits
  P (512x4) BHR: 12 bits; g (1) PHT: 2^12 x 2 bits; total = 33 kbits
  P (512x4) BHR: 6 bits; s (512) PHT: 2^6 x 2 bits; total = 78 kbits
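The three storage totals can be checked with a short calculation. The formula below is an assumption of this sketch (history register bits plus 2 bits per PHT entry, ignoring any tags), but it reproduces the totals quoted on the slide.

```python
def two_level_bits(num_bhrs: int, history_bits: int, num_phts: int) -> int:
    """Total storage in bits: all history registers plus all PHTs,
    where each PHT has 2**history_bits entries of 2 bits."""
    bhr_bits = num_bhrs * history_bits
    pht_bits = num_phts * (2 ** history_bits) * 2
    return bhr_bits + pht_bits


print(two_level_bits(1, 18, 1))          # 524306  (about 524 kbits)
print(two_level_bits(512 * 4, 12, 1))    # 32768   (about 33 kbits)
print(two_level_bits(512 * 4, 6, 512))   # 77824   (about 78 kbits)
```

Per-branch history lets the history registers be much shorter, which shrinks the PHT exponentially, at the cost of storing 512 x 4 history registers.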

Page 57: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Example: Global BHSR Scheme (GAs)

[Figure: j bits of the branch address and k bits of the Branch History Shift Register (BHSR) together index a BHT of 2 x 2^(j+k) bits, which gives the prediction]

Page 58: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Example: Per-Branch BHSR Scheme (PAs)

[Figure: i bits of the branch address select one of the per-branch Branch History Shift Registers (BHSR, k x 2^i bits in total). The selected k history bits and j bits of the branch address index a BHT of 2 x 2^(j+k) bits, which gives the prediction. One block is labeled "Standard BHT".]

Page 59: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Gshare Branch Prediction [McFarling]

[Figure: j bits of the branch address are XORed with k bits of the Branch History Shift Register (BHSR) to index a BHT of 2 x 2^max(j,k) bits, which gives the prediction]
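A compact Python sketch of gshare follows. It assumes j = k = `index_bits`, word-aligned 4-byte instructions, and 2-bit saturating counters initialized to weakly not taken; these choices and all names are assumptions of the sketch rather than details from the slide.

```python
class Gshare:
    def __init__(self, index_bits: int) -> None:
        self.mask = (1 << index_bits) - 1
        self.bhsr = 0                              # global branch history
        self.counters = [1] * (1 << index_bits)    # 2-bit counters

    def _index(self, pc: int) -> int:
        # Drop the 2 low bits of a word-aligned PC, then XOR with the history.
        return ((pc >> 2) ^ self.bhsr) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the latest outcome into the history register.
        self.bhsr = ((self.bhsr << 1) | int(taken)) & self.mask


predictor = Gshare(index_bits=4)
hits = 0
for n in range(200):
    taken = n % 2 == 0                 # a strictly alternating branch
    if n >= 100:                       # count only after warm-up
        hits += predictor.predict(0x400) == taken
    predictor.update(0x400, taken)
print(hits)                            # 100
```

A single 2-bit counter cannot track an alternating branch, but with history folded into the index the two phases of the pattern land on different counters and both become predictable.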

Page 60: EE382A Lecture 4: Instruction Fetch TechniquesInstruction

Next Lecture

• More Branch Prediction
