Computer architecture : branch prediction
TRANSCRIPT
20 Nov 2005 roberto innocente 1
Prediction and speculation :
the role of stochastic models of program behaviour in the performance of modern computers
r. innocente
Speculation
from the Merriam-Webster dictionary :
an assumption of an unusual risk in hopes of obtaining commensurate gains
Speculative execution
A prediction is made of what work is likely to be needed soon. That work is then speculatively executed, in such a way that you can commit it if the prediction was correct, or abort it otherwise.
von Neumann's model : Stored Program Computer
The Control Counter, today called Program Counter (PC) or Instruction Pointer (IP), keeps the address of the next instruction to be executed. The control part fetches this instruction, decodes it and executes it. At the end the PC is updated.
Linear scaling of speed, quadratic scaling of transistors
Let's look at the last scaling step in silicon lithography, from 0.13 µm to 0.09 µm : a 0.70 linear scaling, hence a 0.49 scaling of surface.
Gate delays scale linearly, while the transistors available scale quadratically.
We will get much more in available complexity than in gate speed.
[ Chart : Gate Speed and Transistors versus feature size (0.25, 0.18, 0.13, 0.09, 0.065 µm) ]
von Neumann's Projection/Collapse postulate of QM
A system can be described by any mix of states, but if you observe it you can only find it in one of the eigenstates, and you can only measure an eigenvalue.
( When you look at it, Schrödinger's cat is either dead or alive )
Modern microprocessors
Today µprocessors take advantage of the fact that they need to present an architectural state compliant with the standard von Neumann model only from time to time; for the remaining time they are free to proceed in whatever way they find convenient.
ILP – Instruction Level Parallelism (Fisher 1981)
Obeying the standard semantics when required, try to overlap the execution of multiple instructions as much as possible. (We will see that current microprocessors can have more than 100 instructions in flight.)
Enabling technologies for ILP exploitation
Pipelining
Multiple issue = Superscalar
A microprocessor in 1989 (Intel 386)
CPI = Cycles Per Instruction
Performance = Frequency / CPI
Intel 386 :
feature size : 1 micron, frequency : 33 MHz, CPI = 5/6 (between 5 and 6)
Performance = 33 M / 6 ~ 5.5 Minstructions/s
Pipelining
The work to be done is divided into stages, with a clear signal interface between them. After each stage a latch memorizes the state for the next cycle. This adds some overhead, but the hope is to get 1 result per cycle once the pipe is full.
[ Diagram : 5-stage pipeline — Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W) — with a pipeline latch between stages ]
Limits of pipelining
A latch can add 2 or 3 gate delays. The current work per instruction is around 400 gate delays.
With n stages you get a result every 400/n + 3 gate delays, but you add an overhead of 3n gate delays overall.
Pipeline at work
[ Table : pipeline occupancy, cycle by cycle, for the sequence add r1,r3,r4 ; mul r5,r6,r7 ; bnez loop,r1 ; (4-cycle bubble) ; div r8,r3,r6 ; add r10,r8,r9 ; jmp loop ]
When there is a dependency we say that the pipeline is stalled, or that a bubble is inserted, waiting for the dependency to resolve. Here a control dependency causes a 4-cycle stall.
Instruction dependencies
Data dependency :
add r1,r2,r3 ; r1 <- r2+r3
mul r1,r4,r5 ; r1 <- r4*r5
Solution: register renaming, result forwarding
Structural dependency :
Solution: add functional units
Control dependency :
bne label1,r1,r2
add r1,r2,r3
label1: mul r4,r5,r6
Solution: branch prediction
Multiple issue (Superscalar) Architectures
[ Diagram : two parallel F D X M W pipelines ]
Architectures that are able to process multiple instructions at a time. While it was common to have multiple execution units (like an integer and an FP unit), the first superscalar architectures, e.g. IBM Power and Pentium Pro, appeared only in the '90s. These architectures require a very good branch prediction. Depicted here is a 2-way superscalar.
Superscalar/2
Current architectures are commonly 4- or 8-way superscalar
The design of the last Alpha, canceled in its late phase, was for an 8-way superscalar
Extremely good branch prediction is needed : there can be hundreds of instructions in flight ( 4 ways * 30 stages = 120 )
Superscalar at work
[ Table : 2-way superscalar pipeline occupancy for add r1,r3,r4 ; mul r5,r6,r7 ; bnez loop,r1, followed by stall cycles ]
The wasted slots are now much more than in the pipelined-only case.
Real World Architectures
IBM power5
15 years of x86

year | processor      | feature size (µm) | transistors (K) | cycles/instr | frequency (MHz) | pipe length | FO4 gates per cycle
1979 | 8088           |                   |                 | 12           |                 |             |
1988 | 386dx          | 1                 | 275             | 5            | 33              |             | 80
1991 | 486dx          | 1                 | 1100            |              | 50              |             |
1993 | Pentium 60     | 0.8               | 3100            |              | 60              | 5           |
1995 | Pentium Pro    | 0.6               | 5500            |              | 150             | 10          |
1997 | Pentium II     | 0.35              | 7500            |              | 233             | 10          |
1999 | Pentium III    | 0.25              | 9500            |              | 450             | 10          |
2000 | Pentium 4      | 0.18              | 42000           |              | 1300            | 20          |
2005 | Pentium 4 571  | 0.09              | 130000          |              | 3800            | 30          | 13
Feature size, frequency, complexity
[ Charts : feature size (µm), frequency (MHz) and transistor count (K) for 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571 ]
A microprocessor in 2005 (Intel Pentium 4)
IPC = Instructions Per Cycle
Performance = Frequency * IPC
Intel Pentium 4 :
feature size : 90 nm, frequency : 3 GHz, IPC ~ 2/3 (2 for SPECint, 3 for SPECfp)
Performance = 3 G * 2 = 6 Ginstructions/s
Control xfer instructions
Some instructions, instead of simply incrementing the PC to the next instruction, change it to a different value. We distinguish :
Unconditional branches, or simply jumps
Conditional branches, or simply branches
Subroutine calls
Subroutine returns
Traps, returns from interrupts or exceptions
Assembly – Machine instructions
Only jumps or branches :
j <label>
j @register
beq <label>
bne <label>
bz <label>
bnz <label>
High level Language – Assembly

for(i=1;i<=4;i++) { .. }

    ld r1,1
    ld r2,4
loop: cmp r1,r2
    beq out
    ..
    add r1,r1,1
    jmp loop
out:

if (i) { .. }

    ld r1,i
    bz next
    ..
next:

while (i) { .. }

loop: sub r1,1
    bz out
    ..
    jmp loop
out:
SPEC – Std Perf. Evaluation Corporation benchmarks
Well-known set of benchmarks, continuously updated, recognized as representative of possible workloads
Divided in 2 big sets :
SPECint : integer programs (go, m88ksim, compress, li, ijpeg, perl, vortex)
SPECfp : floating point programs (mathematical simulation prgs)
http://www.spec.org
Branches by type
Average from SPECint95 :
Conditional 72 %
Immediate 16 %
Returns 10 %
Indirect 2 %
Branches by frequency
[ Chart : dynamic instructions, dynamic branches and dynamic conditional branches (millions of instructions on the y axis) for the SPEC95 benchmarks compress, gcc, go, ijpeg, m88ksim, perl, vortex, xlisp ]
Branches by taken rate
Average from SPECint95 :
Always taken 14 %
Taken 95–100 % of the time : 21 %
Taken 50–95 % : 20 %
Taken 5–50 % : 24 %
Taken 0–5 % : 7 %
Never taken 14 %
Occurrences of branches
Occurrences of branches (conditional branches) :
SPECint 95 : 1 out of 5 instructions executed (20 %)
SPECfp 95 : 1 out of 10 instructions executed (10 %)
Basic block is the term used for a sequence of instructions without any control xfer
Note : this is different from, and much more than, the rate of branches in the static program
Mispredictions effects
b = rate of executed instructions that are branches (0.1–0.2)
p = prediction accuracy (currently the best is in the 0.90–0.97 range)
f = instructions “in flight” (in execution, currently over 100)
Oversimplification : a misprediction is recognized only at the very end and forces us to squash all the following f in-flight instructions.
Then every 1/(b*(1-p)) instructions we squash f instructions.
E = 1/(1 + f*b*(1-p))
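As a sanity check, the efficiency formula above can be evaluated directly (a minimal sketch; the parameter values are just the example figures from this slide):

```python
def efficiency(b, p, f):
    """E = 1 / (1 + f*b*(1-p)): fraction of issued work that is useful,
    under the oversimplified squash-everything model above.

    b: fraction of executed instructions that are branches
    p: branch prediction accuracy
    f: number of in-flight instructions squashed per misprediction
    """
    return 1.0 / (1.0 + f * b * (1.0 - p))

# With b = 0.2, p = 0.95 and f = 100 in-flight instructions, a
# misprediction occurs every 1/(0.2*0.05) = 100 instructions and
# squashes 100 more, so only half of the issued work is useful:
print(efficiency(0.2, 0.95, 100))  # ~0.5
```

Note how strongly E depends on p when f is large: the same machine at p = 0.99 keeps about 83 % of its work useful.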
Efficiency versus bp accuracy
[ Chart : efficiency E versus prediction accuracy for 5, 100, 200 and 300 in-flight instructions ]
Branch prediction methods
Do they use information collected from the running programs ?
Static : no
Semi-static : info collected from test samples
Dynamic : yes
Static branch prediction
Always taken (AT), Always not taken (ANT)
Backward taken, forward not taken (BTFNT) : frequently used by current processors, relies on compilers too (Intel Pentium 4)
Complicated rules : for example the bp of the Pentium M looks at the distance between addresses and opcodes
Programmer hints (special opcodes on Pentium, flags on Itanium)
Program reorganization by compilers
Achieves ~ 60/70 % accuracy
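The BTFNT heuristic is simple enough to state in a few lines (an illustrative sketch, not any particular processor's implementation; the function name is mine):

```python
def btfnt_predict(branch_pc, target_pc):
    """Backward Taken, Forward Not Taken: a branch whose target lies
    before the branch itself (a loop back-edge) is predicted taken;
    a forward branch (e.g. one skipping an error handler) is
    predicted not taken."""
    return target_pc < branch_pc

# A loop-closing branch at 0x4010 jumping back to 0x4000 is predicted
# taken:
print(btfnt_predict(0x4010, 0x4000))  # True
```

The heuristic works because loop back-edges are taken on every iteration but the last, while forward branches often guard rarely executed code.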
Semi-static branch prediction
It relies on data collected from previous runs of the program (profiling : Sun Sparc)
Insertion in the code of appropriate hints : predict taken / predict not taken
Achieves accuracy of ~ 65/80 %
Dynamic branch prediction
“Same as last time” and bimodal predictors : achieve accuracy of 70/85 %
2-level / correlation predictors : achieve accuracy of 80/90 %
Combining/Meta predictors
Markov/PPM predictors
Neural predictors
[ Chart : % branch prediction accuracy ranges (from/to), on a 50–95 % scale, for static, semi-static, bimodal, 2-level and combined predictors ]
2bc – Two-bit saturating counter
The best 4-state FSA (Finite State Automaton)
States SNT, NT, T, ST (Strongly Not Taken, Not Taken, Taken, Strongly Taken), encoded 00, 01, 10, 11
Add 1 when the branch is taken, subtract 1 when not taken. Saturate at 0 and 3.
[ Diagram : state transitions among SNT (00), NT (01), T (10), ST (11) on taken (t) / not taken (nt) outcomes ]
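The counter's behaviour can be sketched directly from the state encoding above (states 0–3, predicting taken while in T or ST):

```python
class TwoBitCounter:
    """Saturating counter with states 0 (SNT), 1 (NT), 2 (T), 3 (ST)."""
    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2            # taken iff in T or ST

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)   # saturate at ST
        else:
            self.state = max(self.state - 1, 0)   # saturate at SNT

# Hysteresis: a single not-taken outcome does not flip a Strongly-Taken
# counter, so the one exit of a loop is mispredicted only once per pass:
c = TwoBitCounter(state=3)                # ST
c.update(False)                           # one not-taken outcome -> T
print(c.predict())  # True
```

This hysteresis is exactly why the 2-bit counter beats a 1-bit "same as last time" scheme on loop branches.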
Bimodal predictor (Smith ’85)
An array of 2-bit counters (SNT, NT, T, ST), indexed by the address of the branch instruction.
At every branch, hashing on the instruction address (usually simply using some of the bits of the address), a counter is chosen and a prediction is made. The whole array is initialized to the T or NT state.
When the outcome of the branch is known, the counter is updated.
There is general consensus on using the 2-bit saturated counter.
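A bimodal predictor is then just an array of such counters, indexed by low-order bits of the branch address (a sketch; the table size is an arbitrary choice of mine):

```python
class BimodalPredictor:
    """Array of 2-bit saturating counters hashed by branch address bits."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [2] * (1 << index_bits)   # whole array starts at T

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask                     # hash: low address bits
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

# A branch at 0x4004 that is almost always taken is soon predicted taken:
bp = BimodalPredictor()
for outcome in [True, True, False, True]:
    bp.update(0x4004, outcome)
print(bp.predict(0x4004))  # True
```

Two branches whose addresses share the same low bits alias to the same counter — the interference problem gshare addresses later.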
Branch correlations
Global correlation : the outcome depends on the outcome of previous branches
if (cond1) { .. }
if (cond1 && cond2) { .. }

if (cond1) a=2;
..
if (a==0) { .. }

if (cond1) { .. }
if (cond2) { .. }
if (cond1 && cond2) { .. }

Local correlation : the outcome depends on previous outcomes of the same branch
for(i=0;i<1000;i++) { if (i%4 == 0) a[i]=0; }
for(i=0;i<12;i++) { .. }
Two-level/Correlation predictor (Yeh-Patt ’92, Pan-So-Rahmeh ’92)
Branches are correlated one to the other
We keep a shift register with the most recent branch outcomes : the branch history register (BHR)
We index a bimodal table (Pattern History Table, PHT) with this branch history register
We can keep only one global BHR for all the branches (global 2-level predictor) or a BHR per branch (local 2-level predictor). The same can be done for the PHT.
[ Diagram : the branch history register, updated with the last outcome, indexes the Pattern History Table which gives the prediction ]
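A minimal global two-level predictor (GAg in the taxonomy of the next slide) can be sketched as a history register indexing a table of 2-bit counters; the sizes are arbitrary choices of mine:

```python
class GlobalTwoLevel:
    """Global branch history register (BHR) indexing a Pattern History
    Table (PHT) of 2-bit saturating counters."""
    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0
        self.pht = [2] * (1 << history_bits)   # all counters start at T

    def predict(self):
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask  # shift in outcome

# It learns a strictly alternating pattern, which a bimodal predictor
# can never get right:
p = GlobalTwoLevel()
hits = 0
for o in [True, False] * 40:
    hits += (p.predict() == o)
    p.update(o)
print(hits)  # 76 of 80: perfect once the history has warmed up
```

After warm-up the BHR oscillates between two values, and each one's PHT counter quickly saturates toward the outcome that always follows it.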
Taxonomy of 2-level predictors
Branch history : Global, Shared (aSsociative), Per address — { G, S, P }
Pattern history : global, shared (associative), per address — { g, s, p }
A scheme is named with a letter for the branch history, 'A' for adaptive, and a letter for the pattern history : GAg .. SAs .. PAp
GAs = Global history register, Adaptive, with shared Pattern History Tables (for instance 8 ways)
gshare (McFarling ’93)
Alleviates the problem of destructive PHT interference between branches
The PHT is indexed with the XOR of the BHR and the BIA (branch instruction address)
[ Diagram : branch history register XOR branch address indexes the Pattern History Table, which gives the prediction ]
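The only change gshare makes to the two-level scheme is the index computation, which can be sketched as follows (the 2-bit shift assumes fixed 4-byte instructions — an assumption of mine, not something the slide states):

```python
def gshare_index(branch_pc, bhr, index_bits=12):
    """gshare PHT index = (branch address bits) XOR (global history).

    branch_pc: address of the branch instruction
    bhr: global branch history register value
    """
    mask = (1 << index_bits) - 1
    return ((branch_pc >> 2) ^ bhr) & mask   # >>2 drops alignment bits

# Two branches sharing the same global history no longer collide on
# the same PHT entry, because their addresses differ:
print(gshare_index(0x4000, 0b1011) != gshare_index(0x4010, 0b1011))  # True
```

The XOR spreads (address, history) pairs over the PHT, so two unrelated branches that happen to reach the same history pattern usually train different counters.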
Path correlated prediction
The same branch history may be the result of very different program behaviours. To disentangle such situations we can take some bits of the target address of each of the last n taken branches and use those to address the bimodal PHT.
[ Diagram : a path history register holding target-address (TA) bits of the last n taken branches indexes the Pattern History Table ]
Tournament/Meta predictor (McFarling ’93)
It often happens that one predictor is better for some branches and another for other branches
A bimodal predictor can then be used to drive a mux that chooses between the 2 predictors
When the outcome is known, the meta-predictor is updated only if one of the predictors was right and the other wrong
In this case the states represent the confidence in the 2 predictors
[ Diagram : Predictor1 and Predictor2, both indexed by the branch address, feed a mux driven by the MetaPredictor ; the selected prediction is the hybrid predictor outcome ]
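The chooser itself can be sketched as one more table of 2-bit counters, this time counting which component predictor to trust (an illustrative sketch; the class and parameter names are mine):

```python
class MetaChooser:
    """Per-branch 2-bit confidence counter: state >= 2 selects
    predictor 2, state < 2 selects predictor 1."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # weakly favour predictor 1

    def choose(self, pc):
        return 2 if self.table[pc & self.mask] >= 2 else 1

    def update(self, pc, p1_correct, p2_correct):
        # trained only when exactly one predictor was right
        if p1_correct == p2_correct:
            return
        i = pc & self.mask
        if p2_correct:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

# After predictor 2 is right twice where predictor 1 was wrong,
# the mux switches to predictor 2 for this branch:
m = MetaChooser()
m.update(0x4004, False, True)
m.update(0x4004, False, True)
print(m.choose(0x4004))  # 2
```

When both predictors agree (both right or both wrong) there is nothing to learn about their relative quality, which is why the chooser is left untouched in that case.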
Data compression
It is a similar and well studied problem, for which there exists an algorithm reputed nearly optimal (PPM).
The goal is to represent the data with fewer bits : you use fewer bits for frequent sequences and more bits for the infrequent ones. The net effect is to use fewer bits overall.
It relies on accurately predicting the probabilistic distribution of the data and using a coder tuned to that.
Markov predictor
A Markov predictor of order j bases its prediction on the last j outcomes
It builds the matrix of transition frequencies and makes the prediction according to that
For the outcome sequence 1 0 1 1 0 0 1 1 0 (order 2) :

pattern  next  frequency
00       0     0
00       1     1
01       0     0
01       1     2
10       0     1
10       1     1
11       0     2
11       1     0
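A software model of an order-j Markov predictor (a sketch; it keeps the frequency matrix as a dictionary rather than a hardware table):

```python
from collections import Counter

class MarkovPredictor:
    """Order-j Markov predictor over a binary outcome stream."""
    def __init__(self, j):
        self.j = j
        self.freq = Counter()        # (pattern, next_outcome) -> count
        self.history = ()

    def predict(self):
        """Majority vote for the current pattern; None if never seen."""
        if len(self.history) < self.j:
            return None
        taken = self.freq[(self.history, 1)]
        not_taken = self.freq[(self.history, 0)]
        if taken == not_taken == 0:
            return None
        return 1 if taken >= not_taken else 0

    def update(self, outcome):
        if len(self.history) == self.j:
            self.freq[(self.history, outcome)] += 1
        self.history = (self.history + (outcome,))[-self.j:] if self.j else ()

# Feeding the slide's sequence 1 0 1 1 0 0 1 1 0 into an order-2
# predictor leaves it with history (1, 0), for which next = 1 and
# next = 0 were each seen once; the tie goes to "taken":
m = MarkovPredictor(2)
for o in [1, 0, 1, 1, 0, 0, 1, 1, 0]:
    m.update(o)
print(m.predict())  # 1
```

Running this also reproduces the frequency table on the slide, e.g. pattern 11 was followed by 0 twice and by 1 never.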
PPM – Prediction by Partial Matching (Cleary, Witten 1984)
A PPM predictor of order m is a set of m+1 Markov predictors
If the pattern of the last m bits has been seen, predict with the Markov predictor of order m ; if not found, try the last m-1 bits with the Markov predictor of order m-1, and so on, down to the Markov predictor of order 0.
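Combining the Markov predictors with the fallback chain gives a toy PPM predictor (again a sketch; the Markov class is repeated here so the fragment is self-contained):

```python
from collections import Counter

class MarkovPredictor:
    """Order-j Markov predictor over a binary outcome stream."""
    def __init__(self, j):
        self.j = j
        self.freq = Counter()
        self.history = ()

    def predict(self):
        if len(self.history) < self.j:
            return None
        t, n = self.freq[(self.history, 1)], self.freq[(self.history, 0)]
        if t == n == 0:
            return None              # pattern never seen: no prediction
        return 1 if t >= n else 0

    def update(self, outcome):
        if len(self.history) == self.j:
            self.freq[(self.history, outcome)] += 1
        self.history = (self.history + (outcome,))[-self.j:] if self.j else ()

class PPMPredictor:
    """Order-m PPM: m+1 Markov predictors, highest matching order wins."""
    def __init__(self, m):
        self.chain = [MarkovPredictor(j) for j in range(m, -1, -1)]

    def predict(self):
        for markov in self.chain:    # try order m first, order 0 last
            guess = markov.predict()
            if guess is not None:
                return guess
        return 1                     # arbitrary default: predict taken

    def update(self, outcome):
        for markov in self.chain:
            markov.update(outcome)

# On a repeating 1,1,0 pattern the order-2 model quickly dominates:
p = PPMPredictor(2)
for o in [1, 1, 0] * 4:
    p.update(o)
print(p.predict())  # 1  (the pattern (1, 0) was always followed by 1)
```

The fallback chain is what lets PPM answer even for histories it has never observed at full length, exactly as in the compression setting.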
Neural methods – D. Jimenez 2002
Machine learning has often used neural methods
Most neural networks can't be candidates for hardware prediction at the microarchitecture level : their implementation would require much more than several cycles
The standard method of training, the backpropagation algorithm, is infeasible in a few machine cycles
Perceptron
Introduced by Rosenblatt in 1962 as a model of brain functioning, popularized by M. Minsky
We will consider the simplest : the single-layer perceptron
A vector of n inputs : x[1]..x[n]
Each input has a weight associated with it : w[0]..w[n] (w[0] is the bias weight). This vector characterizes the perceptron.
Bipolar perceptron
The inputs and the outcome t can only be -1 or +1
Then t*x[i] = +1 if they agree, or -1 if they disagree
The output is y = w[0] + Σ w[i]*x[i] ; if the w[i] are integers, y is an integer too, and sign(y) is the prediction
Perceptron training
Simply stated : increase the weights of those inputs that agree with the outcome, and decrease the weights of those that do not
Let t be the outcome and θ a threshold after which we stop training the perceptron. Then the algorithm is :
if ((sign(y) != t) || (|y| < θ)) {
  for (i=0; i<=n; i++) {
    w[i] = w[i] + t * x[i];
  }
}
(here x[0] is the implicit always-1 bias input)
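The same algorithm in runnable form (a sketch of a single predict-and-train step, with the bias handled as an implicit always-+1 input; the function name is mine):

```python
def perceptron_step(w, x, t, theta):
    """One predict-and-train step of the bipolar perceptron.

    w: weights [w0, w1, ..., wn] (w0 is the bias weight), mutated in place
    x: inputs [x1, ..., xn], each -1 or +1 (the branch history)
    t: actual outcome, -1 or +1
    theta: keep training while |y| is below this threshold
    Returns the prediction made before training.
    """
    y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    prediction = 1 if y >= 0 else -1
    if prediction != t or abs(y) < theta:
        w[0] += t                        # bias input is implicitly +1
        for i, xi in enumerate(x, start=1):
            w[i] += t * xi               # reinforce agreeing inputs
    return prediction

# With history (+1, -1) always followed by "taken" (+1), the weights
# converge after a single training step:
w = [0, 0, 0]
perceptron_step(w, [1, -1], 1, theta=2)         # trains: w = [1, 1, -1]
print(perceptron_step(w, [1, -1], 1, theta=2))  # 1  (y = 3 >= theta)
```

Note that training continues even on correct predictions until |y| exceeds θ, which builds up a confidence margin.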
Perceptron limitations
A single perceptron can only learn linearly separable functions of the inputs. The linear equation w[0] + Σ w[i]*x[i] = 0 represents a hyperplane in the n-dimensional space of inputs.
AND, OR, NAND, NOR are linearly separable ; XOR is not.
Of course any boolean function can be learned by a 2-layer network of perceptrons (as any boolean function can be represented by a 2-layer net of ANDs and ORs), but it has been shown that for bp there is not much gain and the delay gets much worse.
Branch prediction with perceptrons
The inputs of the perceptron are the branch history
We keep a table of perceptrons (the weights) that we address by hashing on the branch address
Every time we meet a branch we load the perceptron into a vector register and compute in parallel the dot product between the weights and the branch history (summing one's complements instead of two's complements)
According to the result we predict the branch taken or not taken
The training algorithm is performed and the updated perceptron is written back
The dataflow limit
It's the serialization constraint imposed by data dependencies among instructions
It was always thought to be an insurmountable limit
An instruction that needs data from another instruction has to be executed after it
ADD R1,R2,R3 ; R1 <- R2+R3
ADD R4,R1,R5 ; R4 <- R1+R5
Exceeding the dataflow limit
At the end of the '90s some authors proposed the use of data prediction to overcome the dataflow limit (M. Lipasti, J. Shen : Exceeding the dataflow limit)
This is much more difficult than branch prediction, where you need to predict only a binary value
Value locality
The simulations showed in fact that applications also obey a new locality principle : Value Locality
Value locality can be temporal or spatial
Value prediction/1
It was shown for instance that 20/30 % of the instructions that write a value into a register write the same value as the last time
And 40/50 % write one of the last 4 preceding values
Value prediction/2
What makes these values so predictable ? It seems this is due to the severe penalties real-world programs incur : not only are they designed to manage quite infrequent contingencies like exceptions and error conditions, but they are general by design. This shows up even in code aggressively optimized by modern state-of-the-art compilers.
Value prediction/3
Load Value prediction
Register Value prediction
Speculation taxonomy
Speculative execution
  Control speculation
    Branch outcome
    Branch target
  Data speculation
    Data Location (Address)
    Data Value
      Load
      Register Value
Research areas
Reverse engineering of prediction algorithms implementations
Simulation of new prediction algorithms : using legacy Instruction Sets (IS), or using abstract RISC instruction sets
Hand code optimization and compiler optimization techniques
Reverse engineering
A python or perl script :
produces assembly language kernels (with, for example, a fixed distance between branch instructions)
compiles and runs the kernels, using the hardware counters for mispredictions to detect table sizes, conflicts and so on
Legacy IS/OS simulations
Can be obtained by instrumenting an x86 open source simulator like bochs, which can run windows or linux
You can then run statically precompiled binaries over it
Problem : bochs is not even a complete Pentium II simulator !
Abstract IS simulators
SimpleScalar is an open-source framework for a generic software simulator, over which modules for different prediction algorithms can be implemented
It also offers the possibility to customize the Instruction Set (IS)
Problem : you need the sources and must compile all the special libraries to use this tool
Optimization techniques
Basic block extension
Code duplication
Scheduling techniques
Scheduling
Code scheduling, or reordering of instructions, is used to improve performance or guarantee correctness
Important for dynamically scheduled architectures, essential for statically scheduled architectures
Examples : branch delay slots, memory delays, multicycle operations
Block scheduling, List scheduling, Superblock scheduling, Trace Scheduling
BTA era is here (Billion Transistor Architecture)
The Intel Itanium 2 with 6 MB L3 cache has 0.41 billion transistors, of which around 0.3 billion are for the cache memory
It's not clear what will be the best use of the available silicon :
CMP (Single-Chip MultiProcessors)
Superwide superspeculative superscalar
Simultaneous MultiThreading
Raw Processors
[ Chart : FO4 gates per cycle, pipe length, feature size, transistor count and frequency for 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571 ]