Instruction Fetch and Branch Prediction
CprE 581 Computer Systems Architecture
Readings: Textbook (4th ed. 2.3, 2.9); (5th ed. 3.3)
Frontend and Backend

[Pipeline diagram: Fetch -> Rename -> Wakeup/Select -> Regfile -> FUs (with bypass) -> D-cache; the schedule, execute, and commit stages form the backend.]
Frontend:
- Keep fetching n insts per cycle (in-order)
- Predict the next PC based on the current PC and past history of branch targets and directions
- Special handling of function returns

Backend:
- Execute instructions out-of-order
- Provide feedback info to the frontend
- Mis-prediction affects performance but not correctness

Feedback:
- Prediction correct or not: update the predictor info
- Incorrect: correct the next PC
Instruction Flow

Instruction flow must be continuous.
- Branch target prediction: what is the target PC; must be done at the fetch stage
- Branch prediction: what direction does a branch take; usually done at fetch
- Return address prediction: special target prediction for return instructions; may be done at fetch or decode
Instruction Flow

Design questions:
- What would happen if branch prediction were done after the fetch stage, e.g. at decode?
- At the fetch stage, how do we know whether an inst is a branch?
- How do we know whether an inst is a return inst?

[Diagram: PC -> Inst Memory -> INST -> Decode/Rename; the target, branch, and RA predictors form a single-cycle loop with the PC and are updated by feedback from the backend.]
Branch Prediction Buffer

[Diagram: in the IF stage (IF ID EX M WB pipeline), the fetch PC looks up a k-entry buffer of branch addresses A0 .. A(k-1), each with a 1-bit prediction (0/1). A fully associative lookup is expensive; instead, the BPB is indexed directly by log k bits of the PC, like a direct-mapped cache.]
Branch Target Buffer (BTB)

[Diagram: the PC of the instruction to fetch is compared (=) against the branch addresses stored in the entries of the branch-target buffer. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a branch, and the predicted PC stored in the entry is used as the next PC. Each entry also records whether the branch is predicted taken or untaken.]
Branch Prediction Steps

IF: Send the PC to memory and to the branch-target buffer. Entry found in the branch-target buffer?
- No: proceed as a normal instruction.
- Yes: send out the predicted PC.

ID (branch decoded):
- No entry, and the instruction is a taken branch: enter the branch address and next PC into the BTB.
- Entry found: check whether the branch is actually taken.

EX (branch resolved):
- No entry, not a taken branch: normal instruction execution.
- Entry found, branch taken as predicted: branch predicted correctly; continue execution with no stalls.
- Entry found, branch not taken: mispredicted branch; kill the fetched insts, restart fetch at the other target, and delete the entry from the BTB.
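The lookup and update steps above can be sketched as a small simulation. This is a hedged sketch, not the hardware organization: the dict-based table and the `BTB` class name are illustrative, and a fixed instruction size of 4 bytes is assumed for the fall-through PC.

```python
# Minimal branch-target-buffer sketch: maps a branch PC to its predicted
# target. A missing entry means "not predicted to be a branch", so the
# predicted next PC is simply the fall-through PC (PC + 4 assumed here).
class BTB:
    def __init__(self):
        self.table = {}  # branch PC -> predicted target PC

    def predict(self, pc):
        # IF stage: look up the fetch PC; a hit means "branch, use target".
        return self.table.get(pc, pc + 4)

    def update(self, pc, taken, target):
        # EX stage: the resolved branch provides feedback.
        if taken:
            self.table[pc] = target      # enter/refresh branch addr -> next PC
        else:
            self.table.pop(pc, None)     # delete the entry on a not-taken branch

btb = BTB()
print(hex(btb.predict(0x100)))           # miss: fall-through 0x104
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.predict(0x100)))           # hit: predicted target 0x40
```

Deleting the entry on a not-taken outcome mirrors the recovery step above; a real BTB is also finite and tagged, which this sketch omits.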
Copyright © 2012, Elsevier Inc. All rights reserved.
Branch Folding (Adv. Techniques for Instruction Delivery and Speculation)

Optimization: a larger branch-target buffer.
- Add the target instruction itself into the buffer, to deal with the longer decode time required by the larger buffer
- "Branch folding"
Mis-prediction Recovery

Pipeline flushing:
- A mis-prediction is detected when the branch is resolved
- May wait until the branch is to be committed, and then flush the pipeline
- Selective flushing: immediately and selectively flush the mis-fetched instructions

Fetch-stage flushing, for special cases, e.g.:
- A branch target was wrongly predicted; the correct target is known at decode for most branches
- An unconditional branch (jump) was predicted as not taken
Branch Prediction

Predict the branch direction: taken or not taken (T/NT).
- Static prediction: the compiler decides the direction
- Dynamic prediction: hardware decides the direction using dynamic information

1. 1-bit branch-prediction buffer
2. 2-bit branch-prediction buffer
3. Correlating branch-prediction buffer
4. Tournament branch predictor
5. and more …

Example: BNE R1, R2, L1 -- not taken falls through to the next instruction; taken jumps to L1.
Predictor for a Single Branch

General form: 1. access the predictor state by PC; 2. predict and output T/NT; 3. feed back the actual outcome (T/NT) to update the state.

1-bit prediction: two states. State 1 = Predict Taken, state 0 = Predict Not Taken; a taken outcome (T) moves to state 1, and a not-taken outcome (NT) moves to state 0.
1-bit BHT Accuracy

Example: in a loop, a 1-bit BHT causes 2 mispredictions per pass. Consider a loop of 10 iterations before exit:

    for (…) {
        for (i = 0; i < 10; i++)
            a[i] = a[i] * 2.0;
    }

The loop body is "LOOP: inst 1 … inst k; branch back to LOOP". On each pass of the outer loop, the inner-loop branch is taken 9 times and not taken 1 time. The 1-bit predictor mispredicts twice -- the first and last inner-loop iterations -- a 20% misprediction rate, i.e. only 80% accuracy.
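The loop example above can be checked with a few lines of simulation; the function name and the list-based trace are illustrative, not part of any real predictor interface.

```python
# 1-bit predictor: remembers only the last outcome of the branch.
def simulate_1bit(outcomes, init=True):
    pred, miss = init, 0
    for taken in outcomes:
        if pred != taken:
            miss += 1
        pred = taken            # feedback: remember the last outcome
    return miss

# Inner-loop branch of a 10-iteration loop, run repeatedly by the outer
# loop: taken 9 times, then not taken on exit.
one_pass = [True] * 9 + [False]
misses = simulate_1bit(one_pass * 3)   # three outer-loop passes
print(misses)   # 5: one miss in the first pass, two in each later pass
```

In the steady state each pass mispredicts twice (the exit and the re-entry), matching the 20% misprediction rate claimed above; with the predictor initialized to not-taken, a single pass also shows exactly the two mispredictions from the slide.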
Branch History Table of 1-bit Predictors

- BHT is also called the branch-prediction buffer in the textbook
- Could use just one 1-bit predictor for all branches, but accuracy would be low
- BHT: use a table of simple predictors, indexed by k bits of the branch address (2^k entries), similar to a direct-mapped cache
- More entries mean more cost, but fewer conflicts and higher accuracy
- A BHT can also contain more complex predictors
2-bit Saturating Counter

Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions (Figure 3.7, p. 249). This adds hysteresis to the decision-making process.

States 11 and 10: Predict Taken; states 01 and 00: Predict Not Taken. A taken outcome (T) moves the counter up toward 11, and a not-taken outcome (NT) moves it down toward 00, so two NT outcomes in a row are needed to flip a strongly-taken prediction (and vice versa).
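The saturating-counter FSM above translates directly to code. This is a sketch; the class name and the integer encoding (0..3, with 2 and 3 predicting taken) are just one conventional choice.

```python
# 2-bit saturating counter, matching the FSM above:
# states 11 (3) and 10 (2) predict taken; 01 (1) and 00 (0) predict not taken.
class TwoBit:
    def __init__(self, state=3):
        self.state = state                      # start strongly taken (11)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at 11
        else:
            self.state = max(0, self.state - 1)  # saturate at 00

def mispredicts(outcomes, init=3):
    c, miss = TwoBit(init), 0
    for t in outcomes:
        miss += (c.predict() != t)
        c.update(t)
    return miss

# Same 10-iteration loop as before: only the exit branch mispredicts.
print(mispredicts(([True] * 9 + [False]) * 3))   # 3: one per pass
```

One misprediction per pass (10% misprediction) instead of the 1-bit predictor's two per pass, because the single not-taken exit only weakens, not flips, the taken prediction.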
Correlating Branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.
In general, an (m, n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters. The plain 2-bit BHT is then a (0, 2) predictor.
Correlating Branches

    if (d == 0)
        d = 1;
    if (d == 1)
        …

          BNEZ  R1, L1        ; branch B1
          ADDI  R1, R0, #1
    L1:   SUBUI R3, R1, #1
          BNEZ  R3, L2        ; branch B2
          …
    L2:   …

Are B1 and B2 correlated? If B1 is not taken (d was 0, so d becomes 1), then B2 is certainly not taken.
Correlating Branch Predictor

Idea: the taken/not-taken outcomes of recently executed branches are related to the behavior of the next branch (along with that branch's own history). The behavior of recent branches then selects between, say, two predictions of the next branch, and only the selected prediction is updated.

(1,1) predictor: 1-bit global history, 1-bit local predictors.

[Diagram: a 4-bit branch address indexes a table of 1-bit per-branch local predictors; a 1-bit global branch history (0 = not taken) selects which of the two prediction columns to use.]
Correlating Branch Predictor

General form: an (m, n) predictor uses m bits of global history and n bits of local history.
- Records the correlation among m+1 branches
- Simple implementation: the global history can be stored in a shift register

Example: a (2,2) predictor -- 2-bit global history, 2-bit local counters.

[Diagram: a 4-bit branch address indexes 2-bit per-branch local predictors; the 2-bit global branch history (e.g. 01 = not taken then taken) selects among the prediction columns.]
Correlating Branch Example

Assume d alternates between 2 and 0.

Initial d | d==0? | b1        | d before b2 | d==1? | b2
0         | Yes   | Not taken | 1           | Yes   | Not taken
1         | No    | Taken     | 1           | Yes   | Not taken
2         | No    | Taken     | 2           | No    | Taken

With a 1-bit predictor per branch:

d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
2 | NT      | T         | T           | NT      | T         | T
0 | T       | NT        | NT          | T       | NT        | NT
2 | NT      | T         | T           | NT      | T         | T
0 | T       | NT        | NT          | T       | NT        | NT

The 1-bit predictor mispredicts every branch!
Correlating Branch Example

With a (1,1) predictor, each branch has two prediction bits:

Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | Not taken                           | Not taken
NT/T            | Not taken                           | Taken
T/NT            | Taken                               | Not taken
T/T             | Taken                               | Taken

Initial prediction: NT/NT.

d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
2 | NT/NT   | T         | T/NT        | NT/NT   | T         | NT/T
0 | T/NT    | NT        | T/NT        | NT/T    | NT        | NT/T
2 | T/NT    | T         | T/NT        | NT/T    | T         | NT/T
0 | T/NT    | NT        | T/NT        | NT/T    | NT        | NT/T

Only the first occurrence of each branch mispredicts; every later prediction is correct.
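The (1,1) example above can be reproduced with a short simulation. This is a sketch: the dict-based tables are illustrative, and the default not-taken entries mirror the NT/NT initial state in the table.

```python
# (1,1) correlating predictor: per branch, two 1-bit predictions,
# selected by the outcome of the most recently executed branch.
def simulate_1_1(trace):
    # trace: list of (branch_id, taken) in execution order
    pred = {}          # (branch_id, last_global_outcome) -> 1-bit prediction
    ghist = False      # last branch outcome; start as "not taken"
    miss = 0
    for b, taken in trace:
        p = pred.get((b, ghist), False)   # default: predict not taken (NT/NT)
        miss += (p != taken)
        pred[(b, ghist)] = taken          # train only the selected entry
        ghist = taken
    return miss

# d alternates 2, 0: b1 and b2 are both taken for d=2, both not taken for d=0.
trace = [('b1', True), ('b2', True), ('b1', False), ('b2', False)] * 3
print(simulate_1_1(trace))   # 2: only the first occurrence of each branch misses
```

This matches the table: two warm-up mispredictions, then perfect prediction, whereas the 1-bit predictor on the same trace mispredicted every branch.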
Correlating Branches

(2,2) predictor: the behavior of recent branches selects between four predictions (XX) of the next branch, updating just that prediction.

[Diagram: the branch address indexes a row of four 2-bit per-branch predictors; the 2-bit global branch history (00 / 01 / 10 / 11) selects which of the four predictions to use.]
Copyright 2001 UCB & Morgan Kaufmann
ECE668 .24 Adapted from Patterson, Katz and Culler © UCB
Gselect and Gshare Predictors

- Keep a global branch history register (GBHR) with the outcomes of the last k branches
- Use it in conjunction with the PC to index a pattern history table (PHT) of 2-bit predictors
- Gselect: concatenate the PC bits and the history; Gshare: XOR them (better)

[Diagram: the GBHR and PC bits combine to index the PHT; the selected 2-bit counter decodes to a taken/not-taken prediction, and the branch result is shifted into the GBHR.]

Copyright 2007 CAM
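The gshare variant can be sketched as follows. This is an assumption-laden sketch: the table size, the `>> 2` PC alignment shift, and the class name are illustrative choices, not a specification.

```python
# Gshare sketch: XOR the global branch history register with low PC bits
# to index a pattern history table (PHT) of 2-bit counters.
class Gshare:
    def __init__(self, bits=12):
        self.mask = (1 << bits) - 1
        self.pht = [1] * (1 << bits)     # 2-bit counters, init weakly not taken
        self.gbhr = 0                    # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.gbhr) & self.mask   # XOR, per gshare

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)              # same index the prediction used
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.gbhr = ((self.gbhr << 1) | int(taken)) & self.mask  # shift in result
```

A gselect sketch would differ only in `_index`, concatenating PC bits and history bits instead of XORing them; XOR spreads branches more evenly over the PHT, which is why gshare tends to do better.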
Accuracy of Different Schemes (Figure 2.7, page 87)

[Chart: frequency of mispredictions (y-axis 0%-18%) on the SPEC89 benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, comparing three schemes: a 4,096-entry BHT with 2 bits per entry, an unlimited-entry BHT with 2 bits per entry, and a 1,024-entry (2,2) BHT.]
Re-evaluating Correlation

Several SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

program  | branch % | static # | # for 90%
compress | 14%      | 236      | 13
eqntott  | 25%      | 494      | 5
gcc      | 15%      | 9531     | 2020
mpeg     | 10%      | 5598     | 532
real gcc | 13%      | 17361    | 3214

Real programs plus the OS look more like gcc -- are the benefits of correlation small beyond the benchmarks? A misprediction occurs because either:
- the guess for that branch was wrong, or
- the branch history of the wrong branch was used when indexing the table

For SPEC92, a 4096-entry table is about as good as an infinite table, so mispredictions are mostly due to wrong predictions. Can we improve using global history?
Estimate Branch Penalty

Example: the BHT correct rate is 95%, and the BTB hit rate is 95%. The average miss penalty is 1 cycle for a BTB miss and 6 cycles for a BHT misprediction. How much is the branch penalty?
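One plausible reading of the numbers above, under the assumption that BTB misses and BHT mispredictions are the only penalty sources and that each rate applies independently per branch:

```python
# Hedged worked example: average branch penalty as the sum of the two
# expected penalties (an assumed simple model, not the only possible one).
btb_hit, bht_correct = 0.95, 0.95
btb_miss_penalty, bht_miss_penalty = 1, 6   # cycles

penalty = (1 - btb_hit) * btb_miss_penalty + (1 - bht_correct) * bht_miss_penalty
print(round(penalty, 2))   # 0.35 cycles per branch
```

That is, 0.05 x 1 + 0.05 x 6 = 0.35 cycles of penalty per branch on average under this model; a model that charged the direction-misprediction penalty only on BTB hits would give a slightly smaller number.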
Return Address (RA) Prediction

- Returns are special register-indirect branches, and register-indirect branches are hard to predict: many callers, one callee
- A single return instruction jumps to multiple return addresses, so there is no PC-target correlation
- In SPEC89, 85% of such branches are procedure returns
- Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate
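The stack-like buffer described above can be sketched in a few lines; the `RAS` name, list-based storage, and drop-oldest overflow policy are illustrative assumptions (real designs often overwrite circularly).

```python
# Return-address stack sketch: push on a call, pop to predict a return.
class RAS:
    def __init__(self, size=16):
        self.size, self.stack = size, []

    def push(self, return_pc):
        # On a call instruction: remember where the matching return goes.
        if len(self.stack) == self.size:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # On a return instruction: the innermost saved address.
        return self.stack.pop() if self.stack else None

ras = RAS(size=8)
ras.push(0x100 + 4)                  # call at 0x100: return to 0x104
ras.push(0x200 + 4)                  # nested call at 0x200
print(hex(ras.predict_return()))     # 0x204: innermost return first
print(hex(ras.predict_return()))     # 0x104
```

The LIFO order is exactly why this works: nested calls return in the reverse of the order they were made, which a plain BTB entry (one target per PC) cannot capture.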
Accuracy of Return Address Predictor
Tournament Predictors

Motivation, from correlating branch predictors: the 2-bit local predictor failed on important branches; by adding global information, performance improved.

Tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector. The hope is to select the right predictor for the right branch (or the right context of a branch).
Tournament Branch Predictor

Used in the Alpha 21264: tracks both "local" and global history; intended for mixed types of applications. The global history is the T/NT history of the past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T).

[Diagram: the PC indexes a local predictor; the global history indexes a global predictor; a choice predictor drives a mux that selects between the two NT/T predictions.]
Tournament Predictor in Alpha 21264

- 4K 2-bit counters choose between a global predictor and a local predictor
- The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor
- In the 12-bit pattern, the i-th bit is 0 if the i-th prior branch was not taken, and 1 if it was taken

[Diagram: each entry of the 4K x 2-bit choice table is a saturating counter with "Use predictor 1" and "Use predictor 2" states; it moves toward the predictor that was correct when the two disagree.]
Tournament Predictor in Alpha 21264

The local predictor is itself a 2-level predictor:
- Top level: a local history table of 1024 10-bit entries; each 10-bit entry holds the most recent 10 branch outcomes for that entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table (1K x 10 bits) indexes a table of 1K 3-bit saturating counters (1K x 3 bits), which provide the local prediction.

Total size: 4K x 2 + 4K x 2 + 1K x 10 + 1K x 3 = 29K bits! (~180K transistors)
Tournament Branch Predictor

- Local predictor: the PC indexes the local history table (1K x 10); the selected 10-bit local history indexes the shared 3-bit counters (1K x 3), which produce the local NT/T prediction
- Global predictor: the 12-bit global history (e.g. 010101010101) indexes 4K 2-bit counters for the global NT/T prediction
- Choice predictor: the same 12-bit global history indexes another 4K 2-bit counters to select local vs. global
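The structure above can be sketched in miniature. This is a deliberately simplified sketch: the real 21264 uses a two-level local predictor and a history-indexed chooser, whereas here the local predictor is a single 2-bit counter per PC and the chooser is PC-indexed; all names and sizes are illustrative.

```python
# Scaled-down tournament sketch: a per-PC choice counter picks between
# a local 2-bit counter and a global-history-indexed 2-bit counter.
def ctr(v, up):
    """2-bit saturating counter update: count up toward 3 or down toward 0."""
    return min(3, v + 1) if up else max(0, v - 1)

class Tournament:
    def __init__(self, hbits=4):
        self.mask = (1 << hbits) - 1
        self.ghist = 0
        self.local = {}                   # pc -> 2-bit counter
        self.glob = [1] * (1 << hbits)    # history-indexed 2-bit counters
        self.choice = {}                  # pc -> 2-bit counter; >=2 = use global

    def predict(self, pc):
        lp = self.local.get(pc, 1) >= 2
        gp = self.glob[self.ghist] >= 2
        return gp if self.choice.get(pc, 1) >= 2 else lp

    def update(self, pc, taken):
        lp = self.local.get(pc, 1) >= 2
        gp = self.glob[self.ghist] >= 2
        if lp != gp:
            # Chooser learns only when the components disagree,
            # moving toward whichever one was correct.
            self.choice[pc] = ctr(self.choice.get(pc, 1), gp == taken)
        self.local[pc] = ctr(self.local.get(pc, 1), taken)
        self.glob[self.ghist] = ctr(self.glob[self.ghist], taken)
        self.ghist = ((self.ghist << 1) | int(taken)) & self.mask
```

Training the chooser only on disagreement is the key tournament idea: each branch migrates toward whichever component predicts it better.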
% of Predictions from the Local Predictor in the Tournament Prediction Scheme

nasa7      98%
matrix300  100%
tomcatv    94%
doduc      90%
spice      55%
fpppp      76%
gcc        72%
espresso   63%
eqntott    37%
li         69%
Accuracy of Branch Prediction (fig 3.40)

[Chart: SPEC89 branch-prediction accuracy (0%-100%) for gcc, espresso, li, fpppp, doduc, and tomcatv under three schemes: profile-based, 2-bit counter, and tournament. Accuracies range from roughly 70% to 100%.]

Profile-based: the branch profile comes from an earlier execution; the prediction is static in that it is encoded in the instruction, but it is derived from profiling.
Branch Prediction Performance
Patt-Yeh Predictor

The correlating branch predictors we have just studied work by combining local with global information. However, it is also possible to do quite well using only information about the current branch (local information).
Patt-Yeh Predictor

Example branch histories:
A: T N T N T N T N
B: T T T T T T T T

PT entries 01 and 10 are "trained" for A, and 11 is "trained" for B. In general, the Yeh-Patt predictor provides 96%-98% accuracy for integer code.
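A two-level local predictor in this style can be sketched as follows; the class name, a 2-bit history, and a single shared pattern table (PT) of 2-bit counters are simplifying assumptions chosen to reproduce the A/B example above.

```python
# Two-level (Yeh-Patt style) local predictor sketch: a per-branch history
# register indexes a pattern table (PT) of 2-bit saturating counters.
class TwoLevelLocal:
    def __init__(self, hbits=2):
        self.mask = (1 << hbits) - 1
        self.hist = {}                    # pc -> local history bits
        self.pt = [1] * (1 << hbits)      # shared PT, init weakly not taken

    def predict(self, pc):
        return self.pt[self.hist.get(pc, 0)] >= 2

    def update(self, pc, taken):
        h = self.hist.get(pc, 0)
        self.pt[h] = min(3, self.pt[h] + 1) if taken else max(0, self.pt[h] - 1)
        self.hist[pc] = ((h << 1) | int(taken)) & self.mask  # shift in outcome

# Branch A alternates T, N, ...; branch B is always taken, as above.
p = TwoLevelLocal(hbits=2)
for _ in range(8):
    for pc, taken in [(0xA, True), (0xB, True), (0xA, False), (0xB, True)]:
        p.update(pc, taken)
# PT[01] settles at 0 (A: after T, predict N), PT[10] at 3 (A: after N,
# predict T), and PT[11] at 3 (B: always taken) -- the "trained" entries.
```

Once trained, A's alternating pattern is predicted perfectly even though a plain 1- or 2-bit counter would keep mispredicting it.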
Branch Predictors

- Smith (bimodal) predictor
- Pattern-based predictors: two-level, gshare, bi-mode, gskewed, agree, …
- Predictors based on alternative contexts: alloyed history, path history, loop counting, …
- Hybrid predictors: multiple component predictors + selection/fusion -- tournament, multihybrid, prediction fusion, …

Reference book: Ch. 9, "Advanced Instruction Flow Techniques"
Branch Decoupling

    Loop: LD    F0,0(R1)   ;F0 = vector element
          ADDD  F4,F0,F2   ;add scalar from F2
          SD    0(R1),F4   ;store result
          SUBI  R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ  R1,Loop    ;branch if R1 != zero

Say 100 iterations. Can the branch be pre-computed for each loop iteration?

Branch Determining Instructions (BDIs): of LD, ADDD, SD, SUBI, and BNEZ, only SUBI and BNEZ determine the branch outcome.
Branch Decoupling

Original loop:

    Loop:  LD   F0,0(R1)
           ADDD F4,F0,F2
           SD   0(R1),F4
           SUBI R1,R1,8
           BNEZ R1,Loop

Branch stream:

    BLoop: SUBI R1,R1,8
           BNEZ R1,BLoop,PLoop

Program stream:

    PLoop: LD   F0,0(R1)
           ADDD F4,F0,F2
           SD   0(R1),F4
           SUBI R1,R1,8
Branch Decoupled Microarchitecture

[Diagram: a B-processor (B-reg file, B-PC, its own I-cache) runs the branch stream and feeds (target, block size) pairs through a PPC queue to the P-processor (P-reg file, PPC, block counter, I-cache, D-cache), which runs the program stream.]

PPC control:

    if (block counter != 0)
        decrement block counter; increment PPC;
    else when (PPCQ not empty)
        dequeue a (target, block size) entry from PPCQ;
        PPC <- target; block counter <- block size;
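The PPC control logic above can be sketched cycle by cycle. This is a sketch under stated assumptions: the PPC is assumed to advance by 4 bytes per instruction, and the function name and tuple-based state are illustrative.

```python
# Cycle-by-cycle sketch of the PPC control logic: run through the current
# block, then dequeue the next (target, block size) pair from the PPC queue.
def ppc_step(state, ppcq):
    ppc, counter = state
    if counter != 0:
        return (ppc + 4, counter - 1)        # advance within the block
    if ppcq:
        target, size = ppcq.pop(0)           # dequeue next (target, size)
        return (target, size)
    return (ppc, 0)                          # stall: queue empty

state = (0x100, 2)                 # two more insts in the current block
q = [(0x200, 1)]                   # B-processor has queued the next block
for _ in range(4):
    state = ppc_step(state, q)
print(hex(state[0]))               # 0x204: jumped to the queued target block
```

The point of the structure is visible here: the P-processor never predicts; it simply consumes the branch outcomes the B-processor resolved ahead of time.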
Branch Prediction with n-way Issue

1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl's Law: the relative impact of control stalls is larger given the lower potential CPI of an n-issue processor
Modern Design: Frontend and Backend

Frontend: instruction fetch and dispatch
- Supplies high-quality instructions to the backend; instructions flow in program order

Backend: schedule/execute, writeback, and commit
- Instructions are processed out-of-order

Frontend enhancements:
- Instruction prefetch: fetch ahead to deliver multiple instructions per cycle
- To handle multiple branches: may access multiple cache lines in one cycle, using prefetch to hide the cost
- Target and branch predictions may be integrated with the instruction cache, e.g. the Intel P4 trace cache
Pitfall: Sometimes bigger and dumber is better

The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits).

On the SPEC95 benchmarks, the 21264 outperforms the 21164:
- 21264: avg. 11.5 mispredictions per 1000 instructions
- 21164: avg. 16.5 mispredictions per 1000 instructions

Reversed for transaction processing (TP)!
- 21264: avg. 17 mispredictions per 1000 instructions
- 21164: avg. 15 mispredictions per 1000 instructions

TP code is much larger, and the 21164 holds 2x as many branch predictions based on local behavior (2K entries vs. the 1K-entry local predictor in the 21264).