Instruction Fetch and Branch Prediction
CprE 581 Computer Systems Architecture
Readings: Textbook (4th ed. 2.3, 2.9); (5th ed. 3.3)
Frontend and Backend

[Pipeline diagram: Fetch -> Rename -> Wakeup/Select -> Regfile -> FUs (with bypass) -> D-cache; the schedule, execute, and commit stages form the backend.]
Frontend:
- Keep fetching n insts per cycle (in-order)
- Predict the next PC based on the current PC and past history of branch targets and directions
- Special handling of function returns

Backend:
- Execute instructions out-of-order
- Provide feedback info to the frontend
- Mis-prediction affects performance but not correctness

Feedback:
- Prediction correct or not: update the predictor info
- Incorrect: correct the next PC
Instruction Flow

Instruction flow must be continuous.
- Branch target prediction: what is the target PC; must be done at the fetch stage
- Branch prediction: what direction does a branch take; usually done at fetch
- Return address prediction: special target prediction for return instructions; may be done at fetch or decode
Instruction Flow

Design questions:
- What would happen if branch prediction were done after the fetch stage, e.g. at decode?
- At the fetch stage, how do we know whether an inst is a branch?
- How do we know whether an inst is a return inst?

[Diagram: PC -> Inst Memory -> INST -> Decode/Rename; the target, branch, and RA predictors form a single-cycle loop with the PC and are updated by feedback from the backend.]
Branch Prediction Buffer

[Diagram: in the IF stage (IF ID EX M WB pipeline), the fetch PC looks up a k-entry buffer of branch addresses A0 .. A(k-1), each with a 1-bit prediction (0/1). A fully associative lookup is expensive; instead, the BPB is indexed directly by log k bits of the PC, like a direct-mapped cache.]
Branch Target Buffer (BTB)

[Diagram: the PC of the instruction to fetch is compared (=) against the branch addresses stored in the entries of the branch-target buffer. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a branch, and the predicted PC stored in the entry is used as the next PC. Each entry also records whether the branch is predicted taken or untaken.]
Branch Prediction Steps

IF: Send the PC to memory and to the branch-target buffer. Entry found in the branch-target buffer?
- No: proceed as a normal instruction.
- Yes: send out the predicted PC.

ID (branch decoded):
- No entry, and the instruction is a taken branch: enter the branch address and next PC into the BTB.
- Entry found: check whether the branch is actually taken.

EX (branch resolved):
- No entry, not a taken branch: normal instruction execution.
- Entry found, branch taken as predicted: branch predicted correctly; continue execution with no stalls.
- Entry found, branch not taken: mispredicted branch; kill the fetched insts, restart fetch at the other target, and delete the entry from the BTB.
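The lookup and update steps above can be sketched as a small simulation. This is a hedged sketch, not the hardware organization: the dict-based table and the `BTB` class name are illustrative, and a fixed instruction size of 4 bytes is assumed for the fall-through PC.

```python
# Minimal branch-target-buffer sketch: maps a branch PC to its predicted
# target. A missing entry means "not predicted to be a branch", so the
# predicted next PC is simply the fall-through PC (PC + 4 assumed here).
class BTB:
    def __init__(self):
        self.table = {}  # branch PC -> predicted target PC

    def predict(self, pc):
        # IF stage: look up the fetch PC; a hit means "branch, use target".
        return self.table.get(pc, pc + 4)

    def update(self, pc, taken, target):
        # EX stage: the resolved branch provides feedback.
        if taken:
            self.table[pc] = target      # enter/refresh branch addr -> next PC
        else:
            self.table.pop(pc, None)     # delete the entry on a not-taken branch

btb = BTB()
print(hex(btb.predict(0x100)))           # miss: fall-through 0x104
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.predict(0x100)))           # hit: predicted target 0x40
```

Deleting the entry on a not-taken outcome mirrors the recovery step above; a real BTB is also finite and tagged, which this sketch omits.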
Copyright © 2012, Elsevier Inc. All rights reserved.
Branch Folding (Adv. Techniques for Instruction Delivery and Speculation)

Optimization: a larger branch-target buffer.
- Add the target instruction itself into the buffer, to deal with the longer decode time required by the larger buffer
- "Branch folding"
Mis-prediction Recovery

Pipeline flushing:
- A mis-prediction is detected when the branch is resolved
- May wait until the branch is to be committed, and then flush the pipeline
- Selective flushing: immediately and selectively flush the mis-fetched instructions

Fetch-stage flushing, for special cases, e.g.:
- A branch target was wrongly predicted; the correct target is known at decode for most branches
- An unconditional branch (jump) was predicted as not taken
Branch Prediction

Predict the branch direction: taken or not taken (T/NT).
- Static prediction: the compiler decides the direction
- Dynamic prediction: hardware decides the direction using dynamic information

1. 1-bit branch-prediction buffer
2. 2-bit branch-prediction buffer
3. Correlating branch-prediction buffer
4. Tournament branch predictor
5. and more …

Example: BNE R1, R2, L1 -- not taken falls through to the next instruction; taken jumps to L1.
Predictor for a Single Branch

General form: 1. access the predictor state by PC; 2. predict and output T/NT; 3. feed back the actual outcome (T/NT) to update the state.

1-bit prediction: two states. State 1 = Predict Taken, state 0 = Predict Not Taken; a taken outcome (T) moves to state 1, and a not-taken outcome (NT) moves to state 0.
1-bit BHT Accuracy

Example: in a loop, a 1-bit BHT causes 2 mispredictions per pass. Consider a loop of 10 iterations before exit:

    for (…) {
        for (i = 0; i < 10; i++)
            a[i] = a[i] * 2.0;
    }

The loop body is "LOOP: inst 1 … inst k; branch back to LOOP". On each pass of the outer loop, the inner-loop branch is taken 9 times and not taken 1 time. The 1-bit predictor mispredicts twice -- the first and last inner-loop iterations -- a 20% misprediction rate, i.e. only 80% accuracy.
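The loop example above can be checked with a few lines of simulation; the function name and the list-based trace are illustrative, not part of any real predictor interface.

```python
# 1-bit predictor: remembers only the last outcome of the branch.
def simulate_1bit(outcomes, init=True):
    pred, miss = init, 0
    for taken in outcomes:
        if pred != taken:
            miss += 1
        pred = taken            # feedback: remember the last outcome
    return miss

# Inner-loop branch of a 10-iteration loop, run repeatedly by the outer
# loop: taken 9 times, then not taken on exit.
one_pass = [True] * 9 + [False]
misses = simulate_1bit(one_pass * 3)   # three outer-loop passes
print(misses)   # 5: one miss in the first pass, two in each later pass
```

In the steady state each pass mispredicts twice (the exit and the re-entry), matching the 20% misprediction rate claimed above; with the predictor initialized to not-taken, a single pass also shows exactly the two mispredictions from the slide.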
Branch History Table of 1-bit Predictors

- BHT is also called the branch-prediction buffer in the textbook
- Could use just one 1-bit predictor for all branches, but accuracy would be low
- BHT: use a table of simple predictors, indexed by k bits of the branch address (2^k entries), similar to a direct-mapped cache
- More entries mean more cost, but fewer conflicts and higher accuracy
- A BHT can also contain more complex predictors
2-bit Saturating Counter

Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions (Figure 3.7, p. 249). This adds hysteresis to the decision-making process.

States 11 and 10: Predict Taken; states 01 and 00: Predict Not Taken. A taken outcome (T) moves the counter up toward 11, and a not-taken outcome (NT) moves it down toward 00, so two NT outcomes in a row are needed to flip a strongly-taken prediction (and vice versa).
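The saturating-counter FSM above translates directly to code. This is a sketch; the class name and the integer encoding (0..3, with 2 and 3 predicting taken) are just one conventional choice.

```python
# 2-bit saturating counter, matching the FSM above:
# states 11 (3) and 10 (2) predict taken; 01 (1) and 00 (0) predict not taken.
class TwoBit:
    def __init__(self, state=3):
        self.state = state                      # start strongly taken (11)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at 11
        else:
            self.state = max(0, self.state - 1)  # saturate at 00

def mispredicts(outcomes, init=3):
    c, miss = TwoBit(init), 0
    for t in outcomes:
        miss += (c.predict() != t)
        c.update(t)
    return miss

# Same 10-iteration loop as before: only the exit branch mispredicts.
print(mispredicts(([True] * 9 + [False]) * 3))   # 3: one per pass
```

One misprediction per pass (10% misprediction) instead of the 1-bit predictor's two per pass, because the single not-taken exit only weakens, not flips, the taken prediction.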
Correlating Branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.
In general, an (m, n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters. The plain 2-bit BHT is then a (0, 2) predictor.
Correlating Branches

    if (d == 0)
        d = 1;
    if (d == 1)
        …

          BNEZ  R1, L1        ; branch B1
          ADDI  R1, R0, #1
    L1:   SUBUI R3, R1, #1
          BNEZ  R3, L2        ; branch B2
          …
    L2:   …

Are B1 and B2 correlated? If B1 is not taken (d was 0, so d becomes 1), then B2 is certainly not taken.
Correlating Branch Predictor

Idea: the taken/not-taken outcomes of recently executed branches are related to the behavior of the next branch (along with that branch's own history). The behavior of recent branches then selects between, say, two predictions of the next branch, and only the selected prediction is updated.

(1,1) predictor: 1-bit global history, 1-bit local predictors.

[Diagram: a 4-bit branch address indexes a table of 1-bit per-branch local predictors; a 1-bit global branch history (0 = not taken) selects which of the two prediction columns to use.]
Correlating Branch Predictor

General form: an (m, n) predictor uses m bits of global history and n bits of local history.
- Records the correlation among m+1 branches
- Simple implementation: the global history can be stored in a shift register

Example: a (2,2) predictor -- 2-bit global history, 2-bit local counters.

[Diagram: a 4-bit branch address indexes 2-bit per-branch local predictors; the 2-bit global branch history (e.g. 01 = not taken then taken) selects among the prediction columns.]
Correlating Branch Example

Assume d alternates between 2 and 0.

Initial d | d==0? | b1        | d before b2 | d==1? | b2
0         | Yes   | Not taken | 1           | Yes   | Not taken
1         | No    | Taken     | 1           | Yes   | Not taken
2         | No    | Taken     | 2           | No    | Taken

With a 1-bit predictor per branch:

d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
2 | NT      | T         | T           | NT      | T         | T
0 | T       | NT        | NT          | T       | NT        | NT
2 | NT      | T         | T           | NT      | T         | T
0 | T       | NT        | NT          | T       | NT        | NT

The 1-bit predictor mispredicts every branch!
Correlating Branch Example

With a (1,1) predictor, each branch has two prediction bits:

Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | Not taken                           | Not taken
NT/T            | Not taken                           | Taken
T/NT            | Taken                               | Not taken
T/T             | Taken                               | Taken

Initial prediction: NT/NT.

d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
2 | NT/NT   | T         | T/NT        | NT/NT   | T         | NT/T
0 | T/NT    | NT        | T/NT        | NT/T    | NT        | NT/T
2 | T/NT    | T         | T/NT        | NT/T    | T         | NT/T
0 | T/NT    | NT        | T/NT        | NT/T    | NT        | NT/T

Only the first occurrence of each branch mispredicts; every later prediction is correct.
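The (1,1) example above can be reproduced with a short simulation. This is a sketch: the dict-based tables are illustrative, and the default not-taken entries mirror the NT/NT initial state in the table.

```python
# (1,1) correlating predictor: per branch, two 1-bit predictions,
# selected by the outcome of the most recently executed branch.
def simulate_1_1(trace):
    # trace: list of (branch_id, taken) in execution order
    pred = {}          # (branch_id, last_global_outcome) -> 1-bit prediction
    ghist = False      # last branch outcome; start as "not taken"
    miss = 0
    for b, taken in trace:
        p = pred.get((b, ghist), False)   # default: predict not taken (NT/NT)
        miss += (p != taken)
        pred[(b, ghist)] = taken          # train only the selected entry
        ghist = taken
    return miss

# d alternates 2, 0: b1 and b2 are both taken for d=2, both not taken for d=0.
trace = [('b1', True), ('b2', True), ('b1', False), ('b2', False)] * 3
print(simulate_1_1(trace))   # 2: only the first occurrence of each branch misses
```

This matches the table: two warm-up mispredictions, then perfect prediction, whereas the 1-bit predictor on the same trace mispredicted every branch.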
Correlating Branches

(2,2) predictor: the behavior of recent branches selects between four predictions (XX) of the next branch, updating just that prediction.

[Diagram: the branch address indexes a row of four 2-bit per-branch predictors; the 2-bit global branch history (00 / 01 / 10 / 11) selects which of the four predictions to use.]
Copyright 2001 UCB & Morgan Kaufmann
ECE668 .24 Adapted from Patterson, Katz and Culler © UCB
Gselect and Gshare Predictors

- Keep a global branch history register (GBHR) with the outcomes of the last k branches
- Use it in conjunction with the PC to index a pattern history table (PHT) of 2-bit predictors
- Gselect: concatenate the PC bits and the history; Gshare: XOR them (better)

[Diagram: the GBHR and PC bits combine to index the PHT; the selected 2-bit counter decodes to a taken/not-taken prediction, and the branch result is shifted into the GBHR.]

Copyright 2007 CAM
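The gshare variant can be sketched as follows. This is an assumption-laden sketch: the table size, the `>> 2` PC alignment shift, and the class name are illustrative choices, not a specification.

```python
# Gshare sketch: XOR the global branch history register with low PC bits
# to index a pattern history table (PHT) of 2-bit counters.
class Gshare:
    def __init__(self, bits=12):
        self.mask = (1 << bits) - 1
        self.pht = [1] * (1 << bits)     # 2-bit counters, init weakly not taken
        self.gbhr = 0                    # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.gbhr) & self.mask   # XOR, per gshare

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)              # same index the prediction used
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.gbhr = ((self.gbhr << 1) | int(taken)) & self.mask  # shift in result
```

A gselect sketch would differ only in `_index`, concatenating PC bits and history bits instead of XORing them; XOR spreads branches more evenly over the PHT, which is why gshare tends to do better.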
Accuracy of Different Schemes (Figure 2.7, page 87)

[Chart: frequency of mispredictions (y-axis 0%-18%) on the SPEC89 benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, comparing three schemes: a 4,096-entry BHT with 2 bits per entry, an unlimited-entry BHT with 2 bits per entry, and a 1,024-entry (2,2) BHT.]
Re-evaluating Correlation

Several SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

program  | branch % | static # | # for 90%
compress | 14%      | 236      | 13
eqntott  | 25%      | 494      | 5
gcc      | 15%      | 9531     | 2020
mpeg     | 10%      | 5598     | 532
real gcc | 13%      | 17361    | 3214

Real programs plus the OS look more like gcc -- are the benefits of correlation small beyond the benchmarks? A misprediction occurs because either:
- the guess for that branch was wrong, or
- the branch history of the wrong branch was used when indexing the table

For SPEC92, a 4096-entry table is about as good as an infinite table, so mispredictions are mostly due to wrong predictions. Can we improve using global history?
Estimate Branch Penalty

Example: the BHT correct rate is 95%, and the BTB hit rate is 95%. The average miss penalty is 1 cycle for a BTB miss and 6 cycles for a BHT misprediction. How much is the branch penalty?
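One plausible reading of the numbers above, under the assumption that BTB misses and BHT mispredictions are the only penalty sources and that each rate applies independently per branch:

```python
# Hedged worked example: average branch penalty as the sum of the two
# expected penalties (an assumed simple model, not the only possible one).
btb_hit, bht_correct = 0.95, 0.95
btb_miss_penalty, bht_miss_penalty = 1, 6   # cycles

penalty = (1 - btb_hit) * btb_miss_penalty + (1 - bht_correct) * bht_miss_penalty
print(round(penalty, 2))   # 0.35 cycles per branch
```

That is, 0.05 x 1 + 0.05 x 6 = 0.35 cycles of penalty per branch on average under this model; a model that charged the direction-misprediction penalty only on BTB hits would give a slightly smaller number.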
Return Address (RA) Prediction

- Returns are special register-indirect branches, and register-indirect branches are hard to predict: many callers, one callee
- A single return instruction jumps to multiple return addresses, so there is no PC-target correlation
- In SPEC89, 85% of such branches are procedure returns
- Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate
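The stack-like buffer described above can be sketched in a few lines; the `RAS` name, list-based storage, and drop-oldest overflow policy are illustrative assumptions (real designs often overwrite circularly).

```python
# Return-address stack sketch: push on a call, pop to predict a return.
class RAS:
    def __init__(self, size=16):
        self.size, self.stack = size, []

    def push(self, return_pc):
        # On a call instruction: remember where the matching return goes.
        if len(self.stack) == self.size:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # On a return instruction: the innermost saved address.
        return self.stack.pop() if self.stack else None

ras = RAS(size=8)
ras.push(0x100 + 4)                  # call at 0x100: return to 0x104
ras.push(0x200 + 4)                  # nested call at 0x200
print(hex(ras.predict_return()))     # 0x204: innermost return first
print(hex(ras.predict_return()))     # 0x104
```

The LIFO order is exactly why this works: nested calls return in the reverse of the order they were made, which a plain BTB entry (one target per PC) cannot capture.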
Accuracy of Return Address Predictor
Tournament Predictors

Motivation, from correlating branch predictors: the 2-bit local predictor failed on important branches; by adding global information, performance improved.

Tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector. The hope is to select the right predictor for the right branch (or the right context of a branch).
Tournament Branch Predictor

Used in the Alpha 21264: tracks both "local" and global history; intended for mixed types of applications. The global history is the T/NT history of the past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T).

[Diagram: the PC indexes a local predictor; the global history indexes a global predictor; a choice predictor drives a mux that selects between the two NT/T predictions.]
Tournament Predictor in Alpha 21264

- 4K 2-bit counters choose between a global predictor and a local predictor
- The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor
- In the 12-bit pattern, the i-th bit is 0 if the i-th prior branch was not taken, and 1 if it was taken

[Diagram: each entry of the 4K x 2-bit choice table is a saturating counter with "Use predictor 1" and "Use predictor 2" states; it moves toward the predictor that was correct when the two disagree.]
Tournament Predictor in Alpha 21264

The local predictor is itself a 2-level predictor:
- Top level: a local history table of 1024 10-bit entries; each 10-bit entry holds the most recent 10 branch outcomes for that entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table (1K x 10 bits) indexes a table of 1K 3-bit saturating counters (1K x 3 bits), which provide the local prediction.

Total size: 4K x 2 + 4K x 2 + 1K x 10 + 1K x 3 = 29K bits! (~180K transistors)
Tournament Branch Predictor

- Local predictor: the PC indexes the local history table (1K x 10); the selected 10-bit local history indexes the shared 3-bit counters (1K x 3), which produce the local NT/T prediction
- Global predictor: the 12-bit global history (e.g. 010101010101) indexes 4K 2-bit counters for the global NT/T prediction
- Choice predictor: the same 12-bit global history indexes another 4K 2-bit counters to select local vs. global
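The structure above can be sketched in miniature. This is a deliberately simplified sketch: the real 21264 uses a two-level local predictor and a history-indexed chooser, whereas here the local predictor is a single 2-bit counter per PC and the chooser is PC-indexed; all names and sizes are illustrative.

```python
# Scaled-down tournament sketch: a per-PC choice counter picks between
# a local 2-bit counter and a global-history-indexed 2-bit counter.
def ctr(v, up):
    """2-bit saturating counter update: count up toward 3 or down toward 0."""
    return min(3, v + 1) if up else max(0, v - 1)

class Tournament:
    def __init__(self, hbits=4):
        self.mask = (1 << hbits) - 1
        self.ghist = 0
        self.local = {}                   # pc -> 2-bit counter
        self.glob = [1] * (1 << hbits)    # history-indexed 2-bit counters
        self.choice = {}                  # pc -> 2-bit counter; >=2 = use global

    def predict(self, pc):
        lp = self.local.get(pc, 1) >= 2
        gp = self.glob[self.ghist] >= 2
        return gp if self.choice.get(pc, 1) >= 2 else lp

    def update(self, pc, taken):
        lp = self.local.get(pc, 1) >= 2
        gp = self.glob[self.ghist] >= 2
        if lp != gp:
            # Chooser learns only when the components disagree,
            # moving toward whichever one was correct.
            self.choice[pc] = ctr(self.choice.get(pc, 1), gp == taken)
        self.local[pc] = ctr(self.local.get(pc, 1), taken)
        self.glob[self.ghist] = ctr(self.glob[self.ghist], taken)
        self.ghist = ((self.ghist << 1) | int(taken)) & self.mask
```

Training the chooser only on disagreement is the key tournament idea: each branch migrates toward whichever component predicts it better.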
% of Predictions from the Local Predictor in the Tournament Prediction Scheme

nasa7      98%
matrix300  100%
tomcatv    94%
doduc      90%
spice      55%
fpppp      76%
gcc        72%
espresso   63%
eqntott    37%
li         69%
Accuracy of Branch Prediction (fig 3.40)

[Chart: SPEC89 branch-prediction accuracy (0%-100%) for gcc, espresso, li, fpppp, doduc, and tomcatv under three schemes: profile-based, 2-bit counter, and tournament. Accuracies range from roughly 70% to 100%.]

Profile-based: the branch profile comes from an earlier execution; the prediction is static in that it is encoded in the instruction, but it is derived from profiling.
Branch Prediction Performance
Patt-Yeh Predictor

The correlating branch predictors we have just studied work by combining local with global information. However, it is also possible to do quite well using only information about the current branch (local information).
Patt-Yeh Predictor

Example branch histories:
A: T N T N T N T N
B: T T T T T T T T

PT entries 01 and 10 are "trained" for A, and 11 is "trained" for B. In general, the Yeh-Patt predictor provides 96%-98% accuracy for integer code.
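A two-level local predictor in this style can be sketched as follows; the class name, a 2-bit history, and a single shared pattern table (PT) of 2-bit counters are simplifying assumptions chosen to reproduce the A/B example above.

```python
# Two-level (Yeh-Patt style) local predictor sketch: a per-branch history
# register indexes a pattern table (PT) of 2-bit saturating counters.
class TwoLevelLocal:
    def __init__(self, hbits=2):
        self.mask = (1 << hbits) - 1
        self.hist = {}                    # pc -> local history bits
        self.pt = [1] * (1 << hbits)      # shared PT, init weakly not taken

    def predict(self, pc):
        return self.pt[self.hist.get(pc, 0)] >= 2

    def update(self, pc, taken):
        h = self.hist.get(pc, 0)
        self.pt[h] = min(3, self.pt[h] + 1) if taken else max(0, self.pt[h] - 1)
        self.hist[pc] = ((h << 1) | int(taken)) & self.mask  # shift in outcome

# Branch A alternates T, N, ...; branch B is always taken, as above.
p = TwoLevelLocal(hbits=2)
for _ in range(8):
    for pc, taken in [(0xA, True), (0xB, True), (0xA, False), (0xB, True)]:
        p.update(pc, taken)
# PT[01] settles at 0 (A: after T, predict N), PT[10] at 3 (A: after N,
# predict T), and PT[11] at 3 (B: always taken) -- the "trained" entries.
```

Once trained, A's alternating pattern is predicted perfectly even though a plain 1- or 2-bit counter would keep mispredicting it.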
Branch Predictors

- Smith (bimodal) predictor
- Pattern-based predictors: two-level, gshare, bi-mode, gskewed, agree, …
- Predictors based on alternative contexts: alloyed history, path history, loop counting, …
- Hybrid predictors: multiple component predictors + selection/fusion -- tournament, multihybrid, prediction fusion, …

Reference book: Ch. 9, "Advanced Instruction Flow Techniques"
Branch Decoupling

    Loop: LD    F0,0(R1)   ;F0 = vector element
          ADDD  F4,F0,F2   ;add scalar from F2
          SD    0(R1),F4   ;store result
          SUBI  R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ  R1,Loop    ;branch if R1 != zero

Say 100 iterations. Can the branch be pre-computed for each loop iteration?

Branch Determining Instructions (BDIs): of LD, ADDD, SD, SUBI, and BNEZ, only SUBI and BNEZ determine the branch outcome.
Branch Decoupling

Original loop:

    Loop:  LD   F0,0(R1)
           ADDD F4,F0,F2
           SD   0(R1),F4
           SUBI R1,R1,8
           BNEZ R1,Loop

Branch stream:

    BLoop: SUBI R1,R1,8
           BNEZ R1,BLoop,PLoop

Program stream:

    PLoop: LD   F0,0(R1)
           ADDD F4,F0,F2
           SD   0(R1),F4
           SUBI R1,R1,8
Branch Decoupled Microarchitecture

[Diagram: a B-processor (B-reg file, B-PC, its own I-cache) runs the branch stream and feeds (target, block size) pairs through a PPC queue to the P-processor (P-reg file, PPC, block counter, I-cache, D-cache), which runs the program stream.]

PPC control:

    if (block counter != 0)
        decrement block counter; increment PPC;
    else when (PPCQ not empty)
        dequeue a (target, block size) entry from PPCQ;
        PPC <- target; block counter <- block size;
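The PPC control logic above can be sketched cycle by cycle. This is a sketch under stated assumptions: the PPC is assumed to advance by 4 bytes per instruction, and the function name and tuple-based state are illustrative.

```python
# Cycle-by-cycle sketch of the PPC control logic: run through the current
# block, then dequeue the next (target, block size) pair from the PPC queue.
def ppc_step(state, ppcq):
    ppc, counter = state
    if counter != 0:
        return (ppc + 4, counter - 1)        # advance within the block
    if ppcq:
        target, size = ppcq.pop(0)           # dequeue next (target, size)
        return (target, size)
    return (ppc, 0)                          # stall: queue empty

state = (0x100, 2)                 # two more insts in the current block
q = [(0x200, 1)]                   # B-processor has queued the next block
for _ in range(4):
    state = ppc_step(state, q)
print(hex(state[0]))               # 0x204: jumped to the queued target block
```

The point of the structure is visible here: the P-processor never predicts; it simply consumes the branch outcomes the B-processor resolved ahead of time.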
Branch Prediction with n-way Issue

1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl's Law: the relative impact of control stalls is larger given the lower potential CPI of an n-issue processor
Modern Design: Frontend and Backend

Frontend: instruction fetch and dispatch
- Supplies high-quality instructions to the backend; instructions flow in program order

Backend: schedule/execute, writeback, and commit
- Instructions are processed out-of-order

Frontend enhancements:
- Instruction prefetch: fetch ahead to deliver multiple instructions per cycle
- To handle multiple branches: may access multiple cache lines in one cycle, using prefetch to hide the cost
- Target and branch predictions may be integrated with the instruction cache, e.g. the Intel P4 trace cache
Pitfall: Sometimes bigger and dumber is better

The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits).

On the SPEC95 benchmarks, the 21264 outperforms the 21164:
- 21264: avg. 11.5 mispredictions per 1000 instructions
- 21164: avg. 16.5 mispredictions per 1000 instructions

Reversed for transaction processing (TP)!
- 21264: avg. 17 mispredictions per 1000 instructions
- 21164: avg. 15 mispredictions per 1000 instructions

TP code is much larger, and the 21164 holds 2x as many branch predictions based on local behavior (2K entries vs. the 1K-entry local predictor in the 21264).