COMP 740: Computer Architecture and Implementation
Montek Singh
Thu, Feb 19, 2009
Topic: Instruction-Level Parallelism III (Dynamic Branch Prediction)



2

Why Do We Need Branch Prediction?

• Basic blocks are short, and we have done about all we can do for them with dynamic scheduling
  • control dependences now become the bottleneck
• Since branches disrupt sequential flow of instrs…
  • we need to be able to predict branch behavior to avoid stalling the pipeline
• What we must predict
  • Branch outcome (Is the branch taken?)
  • Branch Target Address (What is the next non-sequential PC value?)

3

A General Model of Branch Prediction

• T: probability of branch being taken
• p: fraction of branches that are predicted to be taken
• A: accuracy of prediction
• j, k, m, n: associated delays (penalties) for the four events (n is usually 0)

Delays for the four events:

                      Taken    Not Taken
Predict Taken           j          k
Predict Not Taken       m          n

Probabilities of the four events, where $P_{x/y}$ is the probability that the actual outcome is x and the prediction is y (t = taken, nt = not taken):

$$P_{t/t} + P_{t/nt} + P_{nt/t} + P_{nt/nt} = 1$$
$$T = P_{t/t} + P_{t/nt}, \qquad 1 - T = P_{nt/t} + P_{nt/nt}$$
$$p = P_{t/t} + P_{nt/t}, \qquad 1 - p = P_{t/nt} + P_{nt/nt}$$

Branch predictor accuracy:

$$A = P_{t/t} + P_{nt/nt}$$

Branch penalty of a particular prediction method:

$$\text{Branch penalty} = j\,P_{t/t} + k\,P_{nt/t} + m\,P_{t/nt} + n\,P_{nt/nt}$$
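The model can be tried out numerically. The sketch below is not from the slides: it assumes the four event probabilities are recovered from T, p and A by solving the definitions (all four sum to 1, T is the total for actual-taken, p is the total for predicted-taken, A is the total for correct predictions). The function names and the sample numbers are illustrative only.

```python
def event_probabilities(T, p, A):
    """Solve for the four joint probabilities from T, p and A.

    Keys are (actual, predicted) pairs, with "t" = taken and "nt" = not taken.
    The closed form is derived from the definitions, not quoted from the slide.
    """
    p_tt = (A + T + p - 1) / 2      # actual taken, predicted taken
    probs = {
        ("t", "t"): p_tt,
        ("t", "nt"): T - p_tt,      # actual taken, predicted not taken
        ("nt", "t"): p - p_tt,      # actual not taken, predicted taken
        ("nt", "nt"): A - p_tt,     # actual not taken, predicted not taken
    }
    if any(v < -1e-12 for v in probs.values()):
        raise ValueError("T, p and A are not mutually consistent")
    return probs


def branch_penalty(T, p, A, j, k, m, n=0):
    """Expected penalty per branch: j, k, m, n weight the four events."""
    pr = event_probabilities(T, p, A)
    return (j * pr[("t", "t")] + k * pr[("nt", "t")]
            + m * pr[("t", "nt")] + n * pr[("nt", "nt")])


if __name__ == "__main__":
    # Illustrative numbers only: 60% taken, 70% predicted taken, 85% accuracy.
    print(branch_penalty(T=0.6, p=0.7, A=0.85, j=1, k=3, m=3))
```

With perfect prediction (A = 1, p = T) and n = 0, the function reduces to j*T.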

4

Theoretical Limits of Branch Prediction

• Best case: branches are perfectly predicted (A = 1)
  • also assume that n = 0
  • minimum branch penalty = j*T
• Let s be the pipeline stage where BTA becomes known
  • Then j = s-1
  • See static prediction methods in Lecture 7
• Thus, performance of any branch prediction strategy is limited by
  • s, the location of the pipeline stage that develops BTA
  • A, the accuracy of the prediction

5

Review: Static Branch Prediction Methods

• Several static prediction strategies:
  • Predict all branches as NOT TAKEN
  • Predict all branches as TAKEN
  • Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN
  • Predict all forward branches as NOT TAKEN, and all backward branches as TAKEN
  • Opcodes have default predictions, which the compiler may reverse by setting a bit in the instruction
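As a small illustration of the forward/backward strategy in the list above, the direction can be read off by comparing the branch address with its target. This is a minimal sketch; the function name is hypothetical.

```python
def predict_backward_taken(branch_pc, target_pc):
    """Static rule: backward branches are predicted TAKEN, forward ones NOT TAKEN.

    Backward branches are typically loop-closing branches.
    """
    return target_pc < branch_pc


if __name__ == "__main__":
    print(predict_backward_taken(0x100, 0x080))   # True  (backward)
    print(predict_backward_taken(0x100, 0x120))   # False (forward)
```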

6

Dynamic Branch Prediction

• Premise: History of a branch instr’s outcome matters!
  • whether a branch will be taken depends greatly on the way previous dynamic instances of the same branch were decided
• Dynamic prediction methods:
  • take advantage of this fact by making their predictions dependent on the past behavior of the same branch instr
  • such methods are called Branch History Table (BHT) methods

7

BHT Methods for Branch Prediction

(1) Predict that each branch will be decided the same way as it was on its last execution (predict TAKEN the first time)

Cannot be implemented.

(2) Maintain a table of the k most recently used branches that were not taken. If a branch is in the table, predict that it will be NOT TAKEN, otherwise predict TAKEN. Purge table entries if taken, using LRU to add new entries

Needs a separate cache for full branch addresses.

(3) Maintain a bit for each instruction in the cache. If an instruction is a branch, the bit is used to record if it was taken on its last execution. The branch is predicted to behave as last time (bit is initialized to TAKEN)

Needs an extra access to I cache if prediction is wrong.

(4) Hash the branch instruction address to m bits, and use this index to address into a table containing single bits (outcome of most recent branch instructions). Predict that outcome will be the same. Update table entry if prediction is wrong.

Needs a separate addressable memory.

(5) Method (4) with more than one bit per table entry (up-down saturating counters): initialize with 0, increment when branch is taken, decrement when branch is not taken. Predict TAKEN when counter is positive, NOT TAKEN otherwise.

Needs a separate addressable memory of counters.

(6) Various 2-bit (or possibly 3-bit) schemes per table entry (one of them is a 2-bit counter). A popular scheme is described in HP3.

Needs a separate addressable memory of 2- or 3-bit entries.

(7) Two-level adaptive (or correlating) predictors: nine complex schemes, one described in HP3.

Needs shift registers and addressable memories of 2-bit counters.
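Method (5) can be sketched as a small table of counters. This is only a sketch under stated assumptions: the hash (dropping two byte-offset bits and keeping the low-order bits), the table size, and the signed saturation range are all hypothetical choices, since the slide does not fix them.

```python
class CounterBHT:
    """Branch history table of up-down saturating counters (method 5)."""

    def __init__(self, index_bits=4, counter_bits=2):
        self.mask = (1 << index_bits) - 1
        # Assumed signed range for a counter_bits-wide counter, e.g. -2..1.
        self.lo = -(1 << (counter_bits - 1))
        self.hi = (1 << (counter_bits - 1)) - 1
        self.table = [0] * (1 << index_bits)   # counters are initialized with 0

    def _index(self, pc):
        # Hypothetical hash: drop the 2 byte-offset bits, keep the low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        """Return True for TAKEN: the counter must be positive."""
        return self.table[self._index(pc)] > 0

    def update(self, pc, taken):
        """Increment on taken, decrement on not taken, saturating at the ends."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.hi, self.table[i] + 1)
        else:
            self.table[i] = max(self.lo, self.table[i] - 1)
```

Because only a few address bits index the table, two branches such as 0x40 and 0x80 share one counter in this configuration.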

8

A One-Bit Predictor

                         Branch outcome
Prediction   State    Taken    Not Taken
Taken          1        1          0
Not Taken      0        1          0

(entries give the next state)

Actual          T   T   T   NT  T   T   T   T   NT  T   NT  T   NT  T
State           1   1   1   1   0   1   1   1   1   0   1   0   1   0   1
Predicts        T   T   T   T   NT  T   T   T   T   NT  T   NT  T   NT
Hit/Miss        H   H   H   M   M   H   H   H   M   M   M   M   M   M

(State lists the state before each branch; the last entry is the final state)

• Predictor misses twice on typical loop branches
  • Once at the end of loop
  • Once at the end of the 1st iteration of next execution of loop
• The outcome sequence NT-T-NT-T makes it miss all the time

[State diagram: State 1 (Predict Taken) and State 0 (Predict Not Taken); T leads to State 1 and NT leads to State 0 from either state]
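The one-bit trace can be replayed in a few lines. This is a minimal sketch, with a hypothetical function name, that follows the transition table of the slide: the state is simply the last outcome.

```python
def simulate_one_bit(outcomes, state=1):
    """Run a 1-bit predictor; state 1 predicts taken, state 0 not taken."""
    results = []
    for actual in outcomes:
        predicted = "T" if state == 1 else "NT"
        results.append("H" if predicted == actual else "M")
        state = 1 if actual == "T" else 0   # remember only the last outcome
    return results, state


if __name__ == "__main__":
    trace = "T T T NT T T T T NT T NT T NT T".split()
    hits, final_state = simulate_one_bit(trace, state=1)
    print(" ".join(hits))   # H H H M M H H H M M M M M M
    print(final_state)      # 1
```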

9

A Two-Bit Predictor

                         Branch outcome
Prediction   State    Taken    Not Taken
Taken          3        3          2
Taken          2        3          0
Not Taken      0        1          0
Not Taken      1        3          0

(entries give the next state)

• A four-state Moore machine
• Predictor misses once on typical loop branches
  • hence popular
• Outcome sequence NT-NT-T-T-NT-NT-T-T makes it miss all the time

[State diagram: States 3 and 2 predict Taken; States 0 and 1 predict Not Taken; transitions as in the table above]

10

A Two-Bit Predictor

Actual          T   T   T   NT  T   T   T   T   NT  NT  T   T   NT  NT
State           3   3   3   3   2   3   3   3   3   2   0   1   3   2   0
Predicts        T   T   T   T   T   T   T   T   T   T   NT  NT  T   T
Hit/Miss        H   H   H   M   H   H   H   H   M   M   M   M   M   M

(State lists the state before each branch; the last entry is the final state)
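The two-bit trace can be checked the same way. The sketch below encodes the transition table of this slide as a dictionary (state 3: T to 3, NT to 2; state 2: T to 3, NT to 0; state 0: T to 1, NT to 0; state 1: T to 3, NT to 0); the names are hypothetical.

```python
# state -> (prediction, next state if taken, next state if not taken)
TWO_BIT = {
    3: ("T", 3, 2),
    2: ("T", 3, 0),
    0: ("NT", 1, 0),
    1: ("NT", 3, 0),
}


def simulate_two_bit(outcomes, state=3):
    """Run the four-state predictor and report a hit or miss per branch."""
    results = []
    for actual in outcomes:
        predicted, if_taken, if_not_taken = TWO_BIT[state]
        results.append("H" if predicted == actual else "M")
        state = if_taken if actual == "T" else if_not_taken
    return results, state


if __name__ == "__main__":
    trace = "T T T NT T T T T NT NT T T NT NT".split()
    hits, final_state = simulate_two_bit(trace, state=3)
    print(" ".join(hits))   # H H H M H H H H M M M M M M
    print(final_state)      # 0
```

Feeding it a loop pattern such as T T T NT repeated gives one miss per loop execution, and NT NT T T repeated from state 3 misses every time.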

11

Correlating Branch Outcome Predictors

• The history-based branch predictors seen so far base their predictions on the past history of the branch that is being predicted
• A completely different idea:
  • The outcome of a branch may well be predicted successfully based on the outcome of the last k branches executed
    • i.e., the path leading to the branch being predicted
• Much-quoted example from SPEC92 benchmark eqntott

    if (aa == 2) /*b1*/
        aa = 0;
    if (bb == 2) /*b2*/
        bb = 0;
    if (aa != bb) /*b3*/
        { … }

TAKEN(b1) && TAKEN(b2) implies NOT-TAKEN(b3)
(when both conditions hold, aa and bb are both set to 0, so aa != bb is false)

12

Another Example of Branch Correlation

    if (d == 0) //b1
        d = 1;
    if (d == 1) //b2
        ...

        BNEZ R1, L1
        ADDI R1, R0, 1
    L1: SUBI R3, R1, 1
        BNEZ R3, L2
        …
    L2:

• Assume multiple runs of code fragment
• d alternates between 2 and 0
• How would a 1-bit predictor initialized to state 0 behave?

d, or (R1), at b1 and then at b2 in each run:   2 2   0 1   2 2   0 1

b1:
Actual          T   NT  T   NT
State           0   1   0   1   0
Predicts        NT  T   NT  T
Hit/Miss        M   M   M   M

b2:
Actual          T   NT  T   NT
State           0   1   0   1   0
Predicts        NT  T   NT  T
Hit/Miss        M   M   M   M

13

A Correlating Branch Predictor

• Think of having a pair of 1-bit predictors [p0, p1] for each branch, where we choose between predictors (and update them) based on outcome of most recent branch (i.e., B1 for B2, and B2 for B1)
  • if most recent br was not taken, use and update (if needed) predictor p0
  • if most recent br was taken, use and update (if needed) predictor p1
• How would such (1,1) correlating predictors behave if initialized to [0,0]?

(R1) at b1 and then at b2 in each run:   2 2   0 1   2 2   0 1

b1:
Actual          T       NT      T       NT
State           [0,0]   [1,0]   [1,0]   [1,0]   [1,0]
Predicts        NT      NT      T       NT
Hit/Miss        M       H       H       H

b2:
Actual          T       NT      T       NT
State           [0,0]   [0,1]   [0,1]   [0,1]   [0,1]
Predicts        NT      NT      T       NT
Hit/Miss        M       H       H       H
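The (1,1) behavior on the d example can be replayed with a short simulation. This is a sketch with hypothetical names. One assumption is needed that the slide leaves implicit: before the first run, the most recent branch is treated as not taken, which is what the state [1,0] for the first branch after run 1 implies.

```python
def run_fragment(d_values):
    """Simulate b1 and b2 with (1,1) predictors, both initialized to [0, 0]."""
    predictors = {"b1": [0, 0], "b2": [0, 0]}   # [p0, p1] per branch
    last_taken = False   # assumption: history starts as "not taken"
    log = {"b1": [], "b2": []}

    def branch(name, taken):
        nonlocal last_taken
        pair = predictors[name]
        which = 1 if last_taken else 0          # pick p1 or p0
        predicted_taken = pair[which] == 1
        log[name].append("H" if predicted_taken == taken else "M")
        pair[which] = 1 if taken else 0         # 1-bit update of the chosen predictor
        last_taken = taken

    for d in d_values:
        branch("b1", d != 0)    # BNEZ R1, L1 is taken when d != 0
        if d == 0:
            d = 1
        branch("b2", d != 1)    # BNEZ R3, L2 is taken when d != 1
    return log, predictors


if __name__ == "__main__":
    log, predictors = run_fragment([2, 0, 2, 0])
    print(log["b1"], predictors["b1"])   # ['M', 'H', 'H', 'H'] [1, 0]
    print(log["b2"], predictors["b2"])   # ['M', 'H', 'H', 'H'] [0, 1]
```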

14

Organization of (m,n) Correlating Predictor

• Using the results of last m branches
  • 2^m outcomes
  • can be kept in m-bit shift register
• n-bit “self-history” predictor
• BHT addressed using
  • m bits of global history
    • select column (particular predictor)
  • some lower bits of branch address
    • select row (particular branch instr)
  • entry holds n previous outcomes
• Aliasing can occur since BHT uses only portion of branch instr address
  • state in various predictors in single row may correspond to different branches at different points of time
• m=0 is ordinary BHT

[Figure: 4 bits of the branch address and 2 bits of global branch history index a table of 2-bit branch predictors to produce the prediction]
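The row and column addressing can be sketched as a two-dimensional table. Several details here are assumptions rather than part of the slide: each n-bit entry is modeled as a saturating counter, the row index is taken from the low-order bits of the word address, and all names are hypothetical.

```python
class CorrelatingPredictor:
    """(m, n) predictor: 2**m n-bit counters per row, chosen by global history."""

    def __init__(self, m=2, n=2, row_bits=4):
        self.m, self.row_bits = m, row_bits
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)          # predict taken at or above this
        self.history = 0                       # m-bit global shift register
        rows, cols = 1 << row_bits, 1 << m
        self.table = [[0] * cols for _ in range(rows)]

    def _row(self, pc):
        # Hypothetical indexing: low-order bits of the word address.
        return (pc >> 2) & ((1 << self.row_bits) - 1)

    def predict(self, pc):
        """Row picked by the branch address, column by the global history."""
        return self.table[self._row(pc)][self.history] >= self.threshold

    def update(self, pc, taken):
        row, col = self._row(pc), self.history
        count = self.table[row][col]
        self.table[row][col] = (min(self.max_count, count + 1) if taken
                                else max(0, count - 1))
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

With m=0 the history mask is zero, every row has a single column, and the structure degenerates to an ordinary BHT.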

15

Improved Dynamic Branch Prediction

• Recall that, even with perfect accuracy of prediction, branch penalty of a prediction method is (s-1)*T
  • s is the pipeline stage where BTA is developed
  • T is the frequency of taken branches
• Further improvements can be obtained only by using a cache storing BTAs, and accessing it simultaneously with the I-cache
  • Such a cache is called a Branch Target Buffer (BTB)
• BHT and BTB can be used together
  • Coupled: one table holds all the information
  • Uncoupled: two independent tables

16

Using BTB and BHT Together

• Uncoupled solution
  • BTB stores only the BTAs of taken branches recently executed
  • No separate branch outcome prediction (the presence of an entry in BTB can be used as an implicit prediction of the branch being TAKEN next time)
  • Use the BHT in case of a BTB miss
• Coupled solution
  • Stores BTAs of all branches recently executed
  • Has separate branch outcome prediction for each table entry
  • Use BHT in case of BTB hit
  • Predict NOT TAKEN otherwise
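The two lookup policies can be contrasted in a few lines. This is a rough sketch under stated assumptions: the BTB is modeled as a plain dictionary keyed by branch address, the BHT as a callable, and every name here is hypothetical.

```python
def predict_uncoupled(pc, btb, bht_predicts_taken):
    """BTB holds targets of recently taken branches only.

    Returns (predict_taken, target); target is None when no BTA is cached.
    """
    if pc in btb:
        return True, btb[pc]                # a hit is an implicit TAKEN prediction
    return bht_predicts_taken(pc), None     # BTB miss: fall back on the BHT


def predict_coupled(pc, btb):
    """Each BTB entry holds (target, predict_taken) for a recent branch."""
    if pc in btb:
        target, predict_taken = btb[pc]
        return predict_taken, target        # history bits stored with the entry
    return False, None                      # BTB miss: predict NOT TAKEN


if __name__ == "__main__":
    print(predict_uncoupled(0x10, {0x10: 0x80}, lambda pc: False))   # (True, 128)
    print(predict_coupled(0x10, {0x10: (0x80, False)}))              # (False, 128)
```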

17

Parameters of Real Machines

Machine        BTB (entries/associativity)   Coupled?   BHT (entries/bits)
MIPS R10000    None                          ---        512/2
PowerPC 604    64/FA                         No         512/2
PowerPC 620    256/FA                        No         2048/2
PA-8000        32/FA                         No         256/3
MIPS R12000    32/2                          No         2048/2
MIPS R8000     1024/2                        Yes        1
Pentium        256/4                         Yes        2
M68060         256/4                         Yes        2
Pentium Pro    512/4                         Yes        4-bit self

18

Coupled BTB and BHT

Access BTB and I-cache
• Miss in BTB: PNT (fetch inline)
  • Not a branch: OK, zero penalty (Case 1)
  • Instruction is a branch (next instr killed)
    • Branch not taken (Case 2)
    • Branch taken: enter into BTB (Case 3)
• Hit in BTB
  • Predict Not Taken (using BHT)
    • Branch not taken: update BTB? OK, zero penalty (Case 4)
    • Branch taken: update BTB (Case 5)
  • Predict Taken (using BHT): go to BTA stored in BTB
    • Branch not taken: update BTB (Case 6)
    • Branch taken
      • Wrong BTA: update BTB (Case 7)
      • Correct BTA: OK, zero penalty (Case 8)

19

Decoupled BTB and BHT

Access BTB and I-cache
• Miss in BTB (fetch inline, wait for opcode)
  • Not a branch: OK, zero penalty (Case 1)
  • Instruction is a branch (next instr killed)
    • Predict Not Taken (using BHT)
      • Branch not taken: OK, zero penalty (Case 2)
      • Branch taken: enter into BTB (Case 3)
    • Predict Taken (using BHT)
      • Branch not taken (Case 4)
      • Branch taken: enter into BTB (Case 5)
• Hit in BTB: Predict Taken, go to BTA stored in BTB
  • Branch not taken: remove from BTB (Case 6)
  • Branch taken
    • Wrong BTA: update BTB (Case 7)
    • Correct BTA: OK, zero penalty (Case 8)

20

Reducing Misprediction Penalties

• Need to recover whenever branch prediction is not correct
  • Discard all speculatively executed instructions
  • Resume execution along alternative path (this is the costly step)
• Scenarios where recovery is needed
  • Predict taken, branch is taken, BTA wrong (case 7)
  • Predict taken, branch is not taken (cases 4 and 6)
  • Predict not taken, branch is taken (case 3)
• Preparing for recovery involves working on alternative path
  • On instruction level
    • Two fetch address registers per speculated branch (PPC 603 & 640)
    • Two instruction buffers (IBM 360/91, SuperSPARC, Pentium)
  • On I-cache level
    • For PT, also do next-line prefetching
    • For PNT, also do target-line prefetching

21

Predicting Dynamic BTAs

• Vast majority of dynamic BTAs come from procedure returns (85% for SPEC95)
• Since procedure call-return for the most part follows a stack discipline, a specialized return address buffer operated as a stack (a return address stack, RAS) is appropriate for high prediction accuracy
  • Pushes return address on call
  • Pops return address on return
  • Depth of RAS should be as large as maximum call depth to avoid mispredictions
  • 8-16 elements generally sufficient
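The push and pop discipline can be sketched as a fixed-depth stack. The overflow policy (dropping the oldest entry) and the class and method names are assumptions made for illustration; the slides do not specify them.

```python
class ReturnAddressStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        """Push the address of the instruction that follows the call."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)       # assumed overflow policy: drop the oldest
        self.entries.append(return_address)

    def on_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None
```

When the call depth exceeds the stack depth, the outermost return addresses are lost, so the corresponding returns can no longer be predicted from the stack.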