1
André Seznec Caps Team
IRISA/INRIA
Just a new case for geometric history length predictors
André Seznec and Pierre Michaud
IRISA/INRIA/HIPEAC
Why branch prediction?
Processors are deeply pipelined: more than 30 cycles on a Pentium 4
Processors execute a few instructions per cycle: 3-4 on current superscalar processors; more than 100 instructions are in flight in a processor
≈ 10% of instructions are branches (mostly conditional); the effective direction/target is known very late:
• To avoid delays, predict the direction/target
• Backtrack as soon as a misprediction is detected
Why (essentially) predict the direction of conditional branches
Most branches are conditional. PC-relative addresses are known/computed at decode time:
• Correction early in the pipeline (3 cycles lost)
The effective direction is computed at execution time:
• Correction very late in the pipeline (30 cycles lost)
Other branches: Returns: easy with a return stack. Indirect jumps: same penalty as conditional branches on mispredictions
Dynamic branch prediction (J. Smith, 1981)
Predict the same direction as the last time: use a table of 1-bit counters indexed by the instruction address
Table of 2-bit counters: hysteresis enhances predictor behavior
[Figure: 2-bit saturating counter state machine, states 0-3]
Exploiting branch history (Yeh and Patt 1991; Pan et al. 1991)
Autocorrelation: Local history
• the past behavior of this particular branch
Intercorrelation: Global history
• the (recent) past behavior of all branches
Global history predictor (gshare, McFarling 1993)
[Figure: n bits of the PC XORed with an L-bit global history index a PHT of 2^n 2-bit counters to produce the prediction]
The prediction is inserted in the history to predict the next branch: a representation of the recent path
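A minimal gshare sketch, matching the slide's structure (PC XOR global history indexes a table of 2-bit counters). The field widths, the `pc >> 2` alignment shift, and the history update on the predictor side are illustrative assumptions.

```python
# gshare sketch: PC xor global-history indexes a PHT of 2-bit counters.
N = 12                        # log2(number of PHT entries), illustrative
pht = [2] * (1 << N)          # 2-bit counters, weakly taken

def gshare_index(pc, ghist):
    return ((pc >> 2) ^ ghist) & ((1 << N) - 1)

def predict(pc, ghist):
    return pht[gshare_index(pc, ghist)] >= 2

def update(pc, ghist, taken):
    i = gshare_index(pc, ghist)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)
    # the outcome (speculatively, the prediction) is shifted into the history
    return ((ghist << 1) | int(taken)) & ((1 << N) - 1)
```

As the slide notes, the bit shifted into the history at prediction time is the prediction itself; it is corrected on a misprediction.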
Local history predictor
[Figure: m bits of the PC index a table of 2^m L-bit local histories; the selected L-bit history indexes a PHT of 2^n 2-bit counters to produce the prediction]
And of course: hybrid predictors !!
Take a few distinct branch predictors
Select or compute the final prediction
EV8 predictor: derived from 2Bc-gskew (Seznec et al., ISCA 2002)
e-gskew (Michaud et al., 1997)
Perceptron predictor (Jimenez and Lin, 2001)
[Figure: history bits multiplied by signed weights; prediction = Sign(Σ)]
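The perceptron scheme in the figure can be sketched as below: the prediction is the sign of a dot product between signed weights and the history encoded as ±1. The history length and the training threshold `theta` are illustrative assumptions, not values from the talk.

```python
# Perceptron predictor sketch: prediction = sign of (bias + sum of
# weights, each signed by the corresponding history bit).
H = 8  # history length, illustrative

def perceptron_predict(weights, history):
    # weights: H+1 ints (index 0 is the bias); history: H booleans
    s = weights[0] + sum(w if h else -w
                         for w, h in zip(weights[1:], history))
    return s >= 0, s

def perceptron_update(weights, history, taken, s, theta=16):
    # train on a misprediction, or when the sum is below the threshold
    if (s >= 0) != taken or abs(s) <= theta:
        t = 1 if taken else -1
        weights[0] += t
        for i, h in enumerate(history):
            weights[i + 1] += t if h else -t
```

The adder tree computing Σ is exactly the hardware cost criticized on a later slide: one adder per history bit, per prediction.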
Qualities required of a branch predictor
State-of-the-art accuracy: 2bcgskew was state-of-the-art on most applications, but lagged behind neural-inspired predictors on a few
Reasonable implementation cost: neural-inspired predictors are not really realistic (huge number of adders, many accesses to the tables)
In-time response
Speculative history must be managed!?
Global history: append a bit to a single history register; a circular buffer and just a pointer suffice to manage the history speculatively
Local history: table of histories (non-speculatively updated); a speculative history must be maintained per in-flight branch:
• Associative search, etc. ?!?
From my experience with the Alpha EV8 team
Designers hate maintaining speculative local history
The basis: a multiple length global history predictor
[Figure: tables T0, T1, T2, T3, T4 indexed with increasing history lengths L(0), L(1), L(2), L(3), L(4)]
Underlying idea
H and H' are two history vectors equal on L bits, e.g. L < L(2)
Branches (A,H) and (A,H') are biased in opposite directions
Table T2 allows discriminating between (A,H) and (A,H')
Selecting between multiple predictions
Classical solution: use a meta-predictor: "wasting" storage!?! choosing among 5 or 10 predictions??
Neural-inspired predictors: use an adder tree instead of a meta-predictor (Jimenez and Lin, 2001)
Partial matching: use tagged tables and the longest matching history (Chen et al. 1996; Eden and Mudge 1998; Michaud 2005)
Final computation through a sum
[Figure: tables T0-T4, indexed with history lengths L(0)-L(4), each supply a counter; Prediction = Sign(Σ)]
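The sum-based final computation can be sketched as below. Table sizes and the index hash are placeholders (the real predictor uses a more careful hash, discussed in the backup slides); only the structure — one signed counter per table, prediction = sign of the sum — is from the slide.

```python
# O-GEHL final computation sketch: each table contributes a signed
# counter selected with its own history length; prediction = Sign(sum).
LENGTHS = [0, 2, 4, 8, 16]               # geometric history lengths
TABLES = [[0] * 1024 for _ in LENGTHS]   # signed counters, illustrative size

def index(pc, ghist, length, size):
    h = ghist & ((1 << length) - 1) if length else 0
    return (pc ^ h) % size               # stand-in for the real hash

def ogehl_predict(pc, ghist):
    s = sum(t[index(pc, ghist, L, len(t))]
            for t, L in zip(TABLES, LENGTHS))
    return s >= 0, s                     # prediction = sign of the sum
```

Unlike a meta-predictor, no storage is spent on choosing among the components; every table votes in every prediction.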
Final computation through partial match
[Figure: a tagless base predictor plus tagged tables indexed by hash(pc, h[0:L1]), hash(pc, h[0:L2]), hash(pc, h[0:L3]); each tagged table stores (ctr, tag) and compares a hashed tag (=?); the longest matching component provides the prediction]
From "old experience" on 2bcgskew and EV8
Some applications benefit from 100+ bit histories
Some don't!!
GEometric History Length predictor
L(0) = 0 and L(i) = α^(i-1) × L(1) for i ≥ 1
The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128}
What is important: L(i) - L(i-1) is drastically increasing
Spends most of the storage on short histories!!
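The series above is trivial to generate; a one-line sketch makes the key property concrete: the gaps L(i) - L(i-1) grow geometrically, so most tables see short histories while the last ones reach very long ones.

```python
# Geometric history lengths: L(0) = 0, L(i) = round(alpha**(i-1) * L1).
def geometric_lengths(n_tables, l1, alpha):
    return [0] + [int(round(alpha ** i * l1)) for i in range(n_tables - 1)]

print(geometric_lengths(8, 2, 2))  # [0, 2, 4, 8, 16, 32, 64, 128]
```

With 8 tables, half the tables use histories of at most 8 bits, yet L(7) already reaches 128: the storage is concentrated on the short-history components that predict most branches.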
Tree adder: O-GEHL: Optimized GEometric History Length predictor
Partial match: TAGE: TAgged GEometric history length predictor
• Inspired by PPM-like (Michaud, 2004)
O-GEHL update policy
Perceptron-inspired, threshold-based
A perceptron-like threshold did not work
A reasonable fixed threshold = the number of tables
Dynamic update threshold fitting
On an O-GEHL predictor, the best threshold depends on: the application, the predictor size, the counter width
By chance, on most applications, for the best fixed threshold, updates on mispredictions ≈ updates on correct predictions
Monitor the difference and adapt the update threshold
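The monitoring mechanism can be sketched as a single signed counter: bump it one way on misprediction-triggered updates, the other way on correct-prediction updates, and nudge the threshold whenever it saturates. The saturation bound and the exact adjustment step are illustrative assumptions, not values from the talk.

```python
# Dynamic update threshold fitting sketch: keep updates-on-mispredictions
# and updates-on-correct-predictions balanced by adapting theta.
class ThresholdFitter:
    def __init__(self, theta0, bound=64):
        self.theta = theta0
        self.tc = 0            # signed monitoring counter
        self.bound = bound     # saturation bound, illustrative

    def on_update(self, mispredicted):
        self.tc += 1 if mispredicted else -1
        if self.tc >= self.bound:      # too many misprediction updates
            self.theta += 1            # train counters more aggressively
            self.tc = 0
        elif self.tc <= -self.bound:   # too many correct-pred. updates
            self.theta -= 1
            self.tc = 0
```

Because the balance point tracks the application, the same hardware finds a near-best threshold regardless of predictor size or counter width.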
Adaptive history length fitting (inspired by Juan et al. 1998)
(half the applications: best L(7) < 50) ≠ (half the applications: best L(7) > 150)
Let us adapt some history lengths to the behavior of each application
8 tables: T2 flips between L(2) and L(8), T4 between L(4) and L(9), T6 between L(6) and L(10)
Adaptive history length fitting (2)
Intuition: if there is a high degree of conflicts on T7, stick with short histories
Implementation: monitor aliasing on updates to T7 through a tag bit and a counter
Simple is sufficient: flipping from short to long histories and vice versa
Evaluation framework
1st Championship Branch Prediction traces:
20 traces including system activity
Floating-point apps: loop dominated
Integer apps: usual SPECINT
Multimedia apps
Server workload apps: very large footprint
Reference configuration (presented at the 2004 Championship Branch Prediction)
8 tables: 2K entries each, except T1 with 1K entries; 5-bit counters for T0 and T1, 4-bit counters otherwise; 1 Kbit of one-bit tags associated with T7
Storage: 10K + 5K + 6×8K + 1K = 64 Kbits
L(1) = 3 and L(10) = 200: {0,3,5,8,12,19,31,49,75,125,200}
The OGEHL predictor
2nd at CBP: 2.82 misp/KI
Best practice award: the predictor closest to a possible hardware implementation
Does not use exotic features:
• various prime numbers, etc.
• strange initial state
Very short warming intervals: chaining all the simulations: 2.84 misp/KI
A case for the OGEHL predictor (2)
High accuracy
32 Kbits (3,150): 3.41 misp/KI, better than any implementable 128-Kbit predictor before CBP:
• 128 Kbits 2bcgskew (6,6,24,48): 3.55 misp/KI
• 176 Kbits PBNP (43): 3.67 misp/KI
1 Mbit (5,300): 2.27 misp/KI:
• 1 Mbit 2bcgskew (9,9,36,72): 3.19 misp/KI
• 1888 Kbits PBNP (58): 3.23 misp/KI
Robustness of the OGEHL predictor (3)
Robustness to variations in the history length choices: for L(1) in [2,6] and L(10) in [125,300], misp. rate < 2.96 misp/KI
The geometric series is not a bad formula!!
best geometric: L(1)=3, L(10)=223: 2.80 misp/KI
best overall: {0, 2, 4, 9, 12, 18, 31, 54, 114, 145, 266}: 2.78 misp/KI
OGEHL scalability
4 components vs 8 components:
• 64 Kbits: 3.02 vs 2.84 misp/KI
• 256 Kbits: 2.59 vs 2.44 misp/KI
• 1 Mbit: 2.40 vs 2.27 misp/KI
6 components vs 12 components:
• 48 Kbits: 3.02 vs 3.03 misp/KI
• 768 Kbits: 2.35 vs 2.25 misp/KI
4 to 12 components bring high accuracy
Impact of counter width
Robustness to counter width variations:
3-bit counters, 49 Kbits: 3.09 misp/KI
Dynamic update threshold fitting helps a lot
5-bit counters, 79 Kbits: 2.79 misp/KI
4-bit is the best tradeoff
PPM-like predictor: 3.24 misp/KI
but the update policy was poor
TAGE update policy
General principle: minimize the footprint of the prediction
Update only the longest-history matching component, and allocate at most one entry on a misprediction
A tagged table entry: [Tag | Ctr | U]
Ctr: 3-bit prediction counter
U: 2-bit useful counter (was the entry recently useful?)
Tag: partial tag
[Figure: TAGE prediction flow: the tagged tables are searched in parallel (=?); the longest matching component provides the prediction (Pred), the next longest provides the alternate prediction (Altpred); a miss everywhere falls back to the base predictor]
Updating the U counter
If (Altpred ≠ Pred) then:
• Pred correct: U = U + 1
• Pred incorrect: U = U - 1
Graceful aging: periodic reset of 1 bit in all U counters
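The U-counter rule above can be sketched directly: U is only touched when the alternate prediction disagrees with the final one (so it measures when the entry was genuinely useful), and aging periodically clears one bit of every U counter. Representing entries as dicts is an illustrative choice.

```python
# TAGE useful-counter update sketch: U changes only when Altpred != Pred.
def update_u(entry, pred, altpred, taken):
    if altpred != pred:
        if pred == taken:
            entry["u"] = min(3, entry["u"] + 1)   # 2-bit, saturating
        else:
            entry["u"] = max(0, entry["u"] - 1)

def age_u(entries, reset_msb):
    # graceful aging: periodically reset one of the two U bits everywhere
    mask = 0b01 if reset_msb else 0b10
    for e in entries:
        e["u"] &= mask
```

Aging matters because a once-useful entry would otherwise keep U > 0 forever and never be reclaimed by the allocation policy.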
Allocating a new entry on a misprediction
Find a single not-useful entry (U = 0) with a longer history:
privilege the smallest possible history
• to minimize footprint
but not too much
• to avoid ping-pong phenomena
Initialize Ctr as weak and U as zero
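A sketch of this allocation step: among the longer-history components, pick an entry with U = 0, usually the shortest such history but sometimes a longer one so two branches cannot ping-pong over the same entry. The probabilistic skip, the `index_fn` signature, and the partial-tag width are illustrative assumptions.

```python
import random

# TAGE allocation sketch: on a misprediction, allocate at most one entry
# in a component with a longer history than the matching one.
def allocate(tables, provider, index_fn, pc, ghist, taken):
    candidates = [j for j in range(provider + 1, len(tables))
                  if tables[j][index_fn(j, pc, ghist)]["u"] == 0]
    if not candidates:
        return None
    j = candidates[0]                       # privilege the smallest history
    if len(candidates) > 1 and random.random() < 1 / 3:
        j = candidates[1]                   # ...but not always (anti ping-pong)
    e = tables[j][index_fn(j, pc, ghist)]
    e["ctr"] = 4 if taken else 3            # weak, for a 3-bit counter (0..7)
    e["u"] = 0
    e["tag"] = pc & 0xFF                    # illustrative partial tag
    return j
```

Initializing Ctr as weak lets the first few outcomes decide the entry's direction; initializing U to zero keeps the entry evictable until it proves useful.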
TAGE update policy
+ a few other tricks
The update is not on the critical path
Tradeoffs on tag width
Too narrow: lots of (destructive) false matches
Too wide: storage space is lost
Tradeoff: 11 bits for 8 tables, 9 bits for 5 tables
Can be tuned on a per-table basis: (destructive) false matches are better tolerated on shorter histories
TAGE vs OGEHL
8-comp. OGEHL: 64K: 2.84, 128K: 2.54, 256K: 2.43, 512K: 2.33, 1M: 2.27 misp/KI
8-comp. TAGE: 64K: 2.58, 128K: 2.38, 256K: 2.23, 512K: 2.12, 1M: 2.05 misp/KI
5-comp. TAGE: 64K: 2.70, 128K: 2.45, 256K: 2.28, 512K: 2.19, 1M: 2.12 misp/KI
Partial tag matching is more effective than an adder tree
At equal table numbers and equal history lengths, TAGE is better than OGEHL on each of the 20 benchmarks
The accuracy limit (with a very high storage budget) is higher on TAGE: additions lead to averaging and information loss; with tags, no information is lost
Prediction computation time?
3 successive steps: index computation (a few-entry XOR gate), table read, adder tree
May not fit in a single cycle, but can be ahead-pipelined!
Ahead pipelining a global history branch predictor (principle)
Initiate the branch prediction X+1 cycles in advance to provide the prediction in time
Use the information available:
• the X-block-ahead instruction address
• the X-block-ahead history
To ensure accuracy: use intermediate path information
Practice
Ahead OGEHL or TAGE: 8 parallel prediction computations
[Figure: instruction blocks A, B, C, D; the prediction is initiated with the ahead address A and ahead history Ha, then selected using the intermediate path bits b, c, d]
Ahead-pipelined 64-Kbit OGEHL or TAGE
3-block 64-Kbit TAGE: 2.70 misp/KI vs 2.58 misp/KI
3-block 64-Kbit OGEHL: 2.94 misp/KI vs 2.84 misp/KI
Not such a huge accuracy loss
And indirect jumps ?
TAGE principles can be adapted to predict indirect jumps:
“A case for (partially) tagged branch predictors”, JILP Feb. 2006
A final case for the Geometric History Length predictors
Delivers state-of-the-art accuracy
Uses only global information: very long history (200+ bits!!)
Can be ahead-pipelined
Many effective design points: OGEHL or TAGE; number of tables, history lengths
Low prediction computation logic complexity (compared with concurrent predictors)
The End
BACK UP
Improving the global history
Address + conditional branch history: path confusion on short histories
Address + path: direct hashing leads to path confusion
1. Represent all branches in the branch history
2. Also use path history (1 bit per branch, limited to 16 bits)
Hashing 200+ bits for indexing!!
Need to compute 11-bit indexes: full hashing is unrealistic
1. Just regularly pick at most 33 bits from: address + branch history + path history
2. A single 3-entry exclusive-OR stage
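The cheap hash can be sketched as follows: pick three 11-bit slices (at most 33 bits total) from the address and the two histories, and combine them with one 3-entry XOR per index bit. Which 33 bits are picked is an illustrative assumption; the slide only fixes the structure.

```python
# Index hashing sketch: fold address + branch history + path history
# into an 11-bit index with a single stage of 3-entry XOR gates.
def fold_index(pc, ghist, phist, n_bits=11):
    mask = (1 << n_bits) - 1
    a = pc & mask        # 11 bits picked from the address (assumption)
    b = ghist & mask     # 11 bits picked from the branch history
    c = phist & mask     # 11 bits picked from the path history
    return a ^ b ^ c     # one 3-entry XOR per index bit
```

The point is the depth, not the quality: however long the history grows, the index is ready after a single XOR stage, keeping the index computation off the critical path.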
Piecewise linear vs OGEHL (1): accuracy
My own best implementation of piecewise linear: shares counters for the same history rank, but uses only power-of-two numbers of entries in the tables
64 Kbits: H=31, 256 counters per rank, ×32 sums: 3.72 misp/KI vs 2.84 misp/KI
1 Mbit: H=63, 2048 counters per rank, ×32 sums: 2.64 misp/KI vs 2.27 misp/KI
48 Mbits: H=95, 64K counters per rank, ×32 sums: 2.30 misp/KI
Piecewise linear vs OGEHL/TAGE (2): hardware complexity
Computation logic complexity: OGEHL: only a few adders; TAGE: only a few comparators + a priority encoder; PL: hundreds of adders
Number of storage tables: OGEHL, TAGE: 4 to 12; PL: H tables (update time)
Index computation time: + a 3-entry XOR gate for OGEHL/TAGE
Information to checkpoint: kilobits for PL