1
André Seznec Caps Team
IRISA/INRIA
Just a new case for geometric history length predictors
André Seznec and Pierre Michaud
IRISA/INRIA/HIPEAC
Why branch prediction?
Processors are deeply pipelined: more than 30 cycles on a Pentium 4
Processors execute a few instructions per cycle: 3-4 on current superscalar processors; more than 100 instructions are in flight in a processor
≈ 10% of instructions are branches (mostly conditional); the effective direction/target is known very late:
• To avoid delays, predict the direction/target
• Backtrack as soon as a misprediction is detected
Why (essentially) predict the direction of conditional branches
Most branches are conditional. PC-relative addresses are known/computed at decode time:
• Correction early in the pipeline (3 cycles lost)
The effective direction is computed at execution time:
• Correction very late in the pipeline (30 cycles lost)
Other branches: Returns: easy with a return stack. Indirect jumps: same penalty as conditional branches on mispredictions
Dynamic branch prediction (J. Smith, 1981)
Predict the same direction as the last time: use a table of 1-bit counters indexed by the instruction address
Table of 2-bit counters: hysteresis enhances predictor behavior
[Figure: 2-bit saturating counter state machine, states 0-3]
Exploiting branch history (Yeh and Patt 1991; Pan et al. 1991)
Autocorrelation: Local history
• the past behavior of this particular branch
Intercorrelation: Global history
• the (recent) past behavior of all branches
Global history predictor (gshare, McFarling 1993)
[Figure: n bits of the PC XORed with an L-bit global history index a PHT of 2^n 2-bit counters to produce the prediction]
The prediction is inserted in the history to predict the next branch: a representation of the recent path
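A minimal gshare sketch, matching the slide's structure (PC XOR global history indexes a table of 2-bit counters). The field widths, the `pc >> 2` alignment shift, and the history update on the predictor side are illustrative assumptions.

```python
# gshare sketch: PC xor global-history indexes a PHT of 2-bit counters.
N = 12                        # log2(number of PHT entries), illustrative
pht = [2] * (1 << N)          # 2-bit counters, weakly taken

def gshare_index(pc, ghist):
    return ((pc >> 2) ^ ghist) & ((1 << N) - 1)

def predict(pc, ghist):
    return pht[gshare_index(pc, ghist)] >= 2

def update(pc, ghist, taken):
    i = gshare_index(pc, ghist)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)
    # the outcome (speculatively, the prediction) is shifted into the history
    return ((ghist << 1) | int(taken)) & ((1 << N) - 1)
```

As the slide notes, the bit shifted into the history at prediction time is the prediction itself; it is corrected on a misprediction.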
Local history predictor
[Figure: m bits of the PC index a table of 2^m L-bit local histories; the selected L-bit history indexes a PHT of 2^n 2-bit counters to produce the prediction]
And of course: hybrid predictors !!
Take a few distinct branch predictors
Select or compute the final prediction
EV8 predictor: derived from 2Bc-gskew (Seznec et al., ISCA 2002)
e-gskew (Michaud et al., 1997)
Perceptron predictor (Jimenez and Lin, 2001)
[Figure: history bits multiplied by signed weights; prediction = Sign(Σ)]
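The perceptron scheme in the figure can be sketched as below: the prediction is the sign of a dot product between signed weights and the history encoded as ±1. The history length and the training threshold `theta` are illustrative assumptions, not values from the talk.

```python
# Perceptron predictor sketch: prediction = sign of (bias + sum of
# weights, each signed by the corresponding history bit).
H = 8  # history length, illustrative

def perceptron_predict(weights, history):
    # weights: H+1 ints (index 0 is the bias); history: H booleans
    s = weights[0] + sum(w if h else -w
                         for w, h in zip(weights[1:], history))
    return s >= 0, s

def perceptron_update(weights, history, taken, s, theta=16):
    # train on a misprediction, or when the sum is below the threshold
    if (s >= 0) != taken or abs(s) <= theta:
        t = 1 if taken else -1
        weights[0] += t
        for i, h in enumerate(history):
            weights[i + 1] += t if h else -t
```

The adder tree computing Σ is exactly the hardware cost criticized on a later slide: one adder per history bit, per prediction.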
Qualities required of a branch predictor
State-of-the-art accuracy: 2bcgskew was state-of-the-art on most applications, but lagged behind neural-inspired predictors on a few
Reasonable implementation cost: neural-inspired predictors are not really realistic (huge number of adders, many accesses to the tables)
In-time response
Speculative history must be managed!?
Global history: append a bit to a single history register; a circular buffer and just a pointer suffice to manage the history speculatively
Local history: table of histories (non-speculatively updated); a speculative history must be maintained per in-flight branch:
• Associative search, etc. ?!?
From my experience with the Alpha EV8 team
Designers hate maintaining speculative local history
The basis: a multiple length global history predictor
[Figure: tables T0, T1, T2, T3, T4 indexed with increasing history lengths L(0), L(1), L(2), L(3), L(4)]
Underlying idea
H and H' are two history vectors equal on L bits, e.g. L < L(2)
Branches (A,H) and (A,H') are biased in opposite directions
Table T2 allows discriminating between (A,H) and (A,H')
Selecting between multiple predictions
Classical solution: use a meta-predictor: "wasting" storage!?! choosing among 5 or 10 predictions??
Neural-inspired predictors: use an adder tree instead of a meta-predictor (Jimenez and Lin, 2001)
Partial matching: use tagged tables and the longest matching history (Chen et al. 1996; Eden and Mudge 1998; Michaud 2005)
Final computation through a sum
[Figure: tables T0-T4, indexed with history lengths L(0)-L(4), each supply a counter; Prediction = Sign(Σ)]
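The sum-based final computation can be sketched as below. Table sizes and the index hash are placeholders (the real predictor uses a more careful hash, discussed in the backup slides); only the structure — one signed counter per table, prediction = sign of the sum — is from the slide.

```python
# O-GEHL final computation sketch: each table contributes a signed
# counter selected with its own history length; prediction = Sign(sum).
LENGTHS = [0, 2, 4, 8, 16]               # geometric history lengths
TABLES = [[0] * 1024 for _ in LENGTHS]   # signed counters, illustrative size

def index(pc, ghist, length, size):
    h = ghist & ((1 << length) - 1) if length else 0
    return (pc ^ h) % size               # stand-in for the real hash

def ogehl_predict(pc, ghist):
    s = sum(t[index(pc, ghist, L, len(t))]
            for t, L in zip(TABLES, LENGTHS))
    return s >= 0, s                     # prediction = sign of the sum
```

Unlike a meta-predictor, no storage is spent on choosing among the components; every table votes in every prediction.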
Final computation through partial match
[Figure: a tagless base predictor plus tagged tables indexed by hash(pc, h[0:L1]), hash(pc, h[0:L2]), hash(pc, h[0:L3]); each tagged table stores (ctr, tag) and compares a hashed tag (=?); the longest matching component provides the prediction]
From "old experience" on 2bcgskew and EV8
Some applications benefit from 100+ bit histories
Some don't!!
GEometric History Length predictor
L(0) = 0 and L(i) = α^(i-1) × L(1) for i ≥ 1
The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128}
What is important: L(i) - L(i-1) is drastically increasing
Spends most of the storage on short histories!!
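The series above is trivial to generate; a one-line sketch makes the key property concrete: the gaps L(i) - L(i-1) grow geometrically, so most tables see short histories while the last ones reach very long ones.

```python
# Geometric history lengths: L(0) = 0, L(i) = round(alpha**(i-1) * L1).
def geometric_lengths(n_tables, l1, alpha):
    return [0] + [int(round(alpha ** i * l1)) for i in range(n_tables - 1)]

print(geometric_lengths(8, 2, 2))  # [0, 2, 4, 8, 16, 32, 64, 128]
```

With 8 tables, half the tables use histories of at most 8 bits, yet L(7) already reaches 128: the storage is concentrated on the short-history components that predict most branches.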
Tree adder: O-GEHL: Optimized GEometric History Length predictor
Partial match: TAGE: TAgged GEometric history length predictor
• Inspired by PPM-like (Michaud, 2004)
O-GEHL update policy
Perceptron-inspired, threshold-based
A perceptron-like threshold did not work
A reasonable fixed threshold = the number of tables
Dynamic update threshold fitting
On an O-GEHL predictor, the best threshold depends on: the application, the predictor size, the counter width
By chance, on most applications, for the best fixed threshold, updates on mispredictions ≈ updates on correct predictions
Monitor the difference and adapt the update threshold
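The monitoring mechanism can be sketched as a single signed counter: bump it one way on misprediction-triggered updates, the other way on correct-prediction updates, and nudge the threshold whenever it saturates. The saturation bound and the exact adjustment step are illustrative assumptions, not values from the talk.

```python
# Dynamic update threshold fitting sketch: keep updates-on-mispredictions
# and updates-on-correct-predictions balanced by adapting theta.
class ThresholdFitter:
    def __init__(self, theta0, bound=64):
        self.theta = theta0
        self.tc = 0            # signed monitoring counter
        self.bound = bound     # saturation bound, illustrative

    def on_update(self, mispredicted):
        self.tc += 1 if mispredicted else -1
        if self.tc >= self.bound:      # too many misprediction updates
            self.theta += 1            # train counters more aggressively
            self.tc = 0
        elif self.tc <= -self.bound:   # too many correct-pred. updates
            self.theta -= 1
            self.tc = 0
```

Because the balance point tracks the application, the same hardware finds a near-best threshold regardless of predictor size or counter width.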
Adaptive history length fitting (inspired by Juan et al. 1998)
(half the applications: best L(7) < 50) ≠ (half the applications: best L(7) > 150)
Let us adapt some history lengths to the behavior of each application
8 tables: T2 flips between L(2) and L(8), T4 between L(4) and L(9), T6 between L(6) and L(10)
Adaptive history length fitting (2)
Intuition: if there is a high degree of conflicts on T7, stick with short histories
Implementation: monitor aliasing on updates to T7 through a tag bit and a counter
Simple is sufficient: flipping from short to long histories and vice versa
Evaluation framework
1st Championship Branch Prediction traces:
20 traces including system activity
Floating-point apps: loop dominated
Integer apps: usual SPECINT
Multimedia apps
Server workload apps: very large footprint
Reference configuration (presented at the 2004 Championship Branch Prediction)
8 tables: 2K entries each, except T1 with 1K entries; 5-bit counters for T0 and T1, 4-bit counters otherwise; 1 Kbit of one-bit tags associated with T7
Storage: 10K + 5K + 6×8K + 1K = 64 Kbits
L(1) = 3 and L(10) = 200: {0,3,5,8,12,19,31,49,75,125,200}
The OGEHL predictor
2nd at CBP: 2.82 misp/KI
Best practice award: the predictor closest to a possible hardware implementation
Does not use exotic features:
• various prime numbers, etc.
• strange initial state
Very short warming intervals: chaining all the simulations: 2.84 misp/KI
A case for the OGEHL predictor (2)
High accuracy
32 Kbits (3,150): 3.41 misp/KI, better than any implementable 128-Kbit predictor before CBP:
• 128 Kbits 2bcgskew (6,6,24,48): 3.55 misp/KI
• 176 Kbits PBNP (43): 3.67 misp/KI
1 Mbit (5,300): 2.27 misp/KI:
• 1 Mbit 2bcgskew (9,9,36,72): 3.19 misp/KI
• 1888 Kbits PBNP (58): 3.23 misp/KI
Robustness of the OGEHL predictor (3)
Robustness to variations in the history length choices: for L(1) in [2,6] and L(10) in [125,300], misp. rate < 2.96 misp/KI
The geometric series is not a bad formula!!
best geometric: L(1)=3, L(10)=223: 2.80 misp/KI
best overall: {0, 2, 4, 9, 12, 18, 31, 54, 114, 145, 266}: 2.78 misp/KI
OGEHL scalability
4 components vs 8 components:
• 64 Kbits: 3.02 vs 2.84 misp/KI
• 256 Kbits: 2.59 vs 2.44 misp/KI
• 1 Mbit: 2.40 vs 2.27 misp/KI
6 components vs 12 components:
• 48 Kbits: 3.02 vs 3.03 misp/KI
• 768 Kbits: 2.35 vs 2.25 misp/KI
4 to 12 components bring high accuracy
Impact of counter width
Robustness to counter width variations:
3-bit counters, 49 Kbits: 3.09 misp/KI
Dynamic update threshold fitting helps a lot
5-bit counters, 79 Kbits: 2.79 misp/KI
4-bit is the best tradeoff
PPM-like predictor: 3.24 misp/KI
but the update policy was poor
TAGE update policy
General principle: minimize the footprint of the prediction
Update only the longest-history matching component, and allocate at most one entry on a misprediction
A tagged table entry: [Tag | Ctr | U]
Ctr: 3-bit prediction counter
U: 2-bit useful counter (was the entry recently useful?)
Tag: partial tag
[Figure: TAGE prediction flow: the tagged tables are searched in parallel (=?); the longest matching component provides the prediction (Pred), the next longest provides the alternate prediction (Altpred); a miss everywhere falls back to the base predictor]
Updating the U counter
If (Altpred ≠ Pred) then:
• Pred correct: U = U + 1
• Pred incorrect: U = U - 1
Graceful aging: periodic reset of 1 bit in all U counters
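The U-counter rule above can be sketched directly: U is only touched when the alternate prediction disagrees with the final one (so it measures when the entry was genuinely useful), and aging periodically clears one bit of every U counter. Representing entries as dicts is an illustrative choice.

```python
# TAGE useful-counter update sketch: U changes only when Altpred != Pred.
def update_u(entry, pred, altpred, taken):
    if altpred != pred:
        if pred == taken:
            entry["u"] = min(3, entry["u"] + 1)   # 2-bit, saturating
        else:
            entry["u"] = max(0, entry["u"] - 1)

def age_u(entries, reset_msb):
    # graceful aging: periodically reset one of the two U bits everywhere
    mask = 0b01 if reset_msb else 0b10
    for e in entries:
        e["u"] &= mask
```

Aging matters because a once-useful entry would otherwise keep U > 0 forever and never be reclaimed by the allocation policy.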
Allocating a new entry on a misprediction
Find a single not-useful entry (U = 0) with a longer history:
privilege the smallest possible history
• to minimize footprint
but not too much
• to avoid ping-pong phenomena
Initialize Ctr as weak and U as zero
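A sketch of this allocation step: among the longer-history components, pick an entry with U = 0, usually the shortest such history but sometimes a longer one so two branches cannot ping-pong over the same entry. The probabilistic skip, the `index_fn` signature, and the partial-tag width are illustrative assumptions.

```python
import random

# TAGE allocation sketch: on a misprediction, allocate at most one entry
# in a component with a longer history than the matching one.
def allocate(tables, provider, index_fn, pc, ghist, taken):
    candidates = [j for j in range(provider + 1, len(tables))
                  if tables[j][index_fn(j, pc, ghist)]["u"] == 0]
    if not candidates:
        return None
    j = candidates[0]                       # privilege the smallest history
    if len(candidates) > 1 and random.random() < 1 / 3:
        j = candidates[1]                   # ...but not always (anti ping-pong)
    e = tables[j][index_fn(j, pc, ghist)]
    e["ctr"] = 4 if taken else 3            # weak, for a 3-bit counter (0..7)
    e["u"] = 0
    e["tag"] = pc & 0xFF                    # illustrative partial tag
    return j
```

Initializing Ctr as weak lets the first few outcomes decide the entry's direction; initializing U to zero keeps the entry evictable until it proves useful.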
TAGE update policy
+ a few other tricks
The update is not on the critical path
Tradeoffs on tag width
Too narrow: lots of (destructive) false matches
Too wide: storage space is lost
Tradeoff: 11 bits for 8 tables, 9 bits for 5 tables
Can be tuned on a per-table basis: (destructive) false matches are better tolerated on shorter histories
TAGE vs OGEHL
8-comp. OGEHL: 64K: 2.84, 128K: 2.54, 256K: 2.43, 512K: 2.33, 1M: 2.27 misp/KI
8-comp. TAGE: 64K: 2.58, 128K: 2.38, 256K: 2.23, 512K: 2.12, 1M: 2.05 misp/KI
5-comp. TAGE: 64K: 2.70, 128K: 2.45, 256K: 2.28, 512K: 2.19, 1M: 2.12 misp/KI
Partial tag matching is more effective than an adder tree
At equal table numbers and equal history lengths, TAGE is better than OGEHL on each of the 20 benchmarks
The accuracy limit (with a very high storage budget) is higher on TAGE: additions lead to averaging and information loss; with tags, no information is lost
Prediction computation time?
3 successive steps: index computation (a few-entry XOR gate), table read, adder tree
May not fit in a single cycle, but can be ahead-pipelined!
Ahead pipelining a global history branch predictor (principle)
Initiate the branch prediction X+1 cycles in advance to provide the prediction in time
Use the information available:
• the X-block-ahead instruction address
• the X-block-ahead history
To ensure accuracy: use intermediate path information
Practice
Ahead OGEHL or TAGE: 8 parallel prediction computations
[Figure: instruction blocks A, B, C, D; the prediction is initiated with the ahead address A and ahead history Ha, then selected using the intermediate path bits b, c, d]
Ahead-pipelined 64-Kbit OGEHL or TAGE
3-block 64-Kbit TAGE: 2.70 misp/KI vs 2.58 misp/KI
3-block 64-Kbit OGEHL: 2.94 misp/KI vs 2.84 misp/KI
Not such a huge accuracy loss
And indirect jumps ?
TAGE principles can be adapted to predict indirect jumps:
“A case for (partially) tagged branch predictors”, JILP Feb. 2006
A final case for the Geometric History Length predictors
Delivers state-of-the-art accuracy
Uses only global information: very long history (200+ bits!!)
Can be ahead-pipelined
Many effective design points: OGEHL or TAGE; number of tables, history lengths
Low prediction computation logic complexity (compared with concurrent predictors)
The End
BACK UP
Improving the global history
Address + conditional branch history: path confusion on short histories
Address + path: direct hashing leads to path confusion
1. Represent all branches in the branch history
2. Also use path history (1 bit per branch, limited to 16 bits)
Hashing 200+ bits for indexing!!
Need to compute 11-bit indexes: full hashing is unrealistic
1. Just regularly pick at most 33 bits from: address + branch history + path history
2. A single 3-entry exclusive-OR stage
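The cheap hash can be sketched as follows: pick three 11-bit slices (at most 33 bits total) from the address and the two histories, and combine them with one 3-entry XOR per index bit. Which 33 bits are picked is an illustrative assumption; the slide only fixes the structure.

```python
# Index hashing sketch: fold address + branch history + path history
# into an 11-bit index with a single stage of 3-entry XOR gates.
def fold_index(pc, ghist, phist, n_bits=11):
    mask = (1 << n_bits) - 1
    a = pc & mask        # 11 bits picked from the address (assumption)
    b = ghist & mask     # 11 bits picked from the branch history
    c = phist & mask     # 11 bits picked from the path history
    return a ^ b ^ c     # one 3-entry XOR per index bit
```

The point is the depth, not the quality: however long the history grows, the index is ready after a single XOR stage, keeping the index computation off the critical path.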
Piecewise linear vs OGEHL (1): accuracy
My own best implementation of piecewise linear: shares counters for the same history rank, but uses only power-of-two numbers of entries in the tables
64 Kbits: H=31, 256 counters per rank, ×32 sums: 3.72 misp/KI vs 2.84 misp/KI
1 Mbit: H=63, 2048 counters per rank, ×32 sums: 2.64 misp/KI vs 2.27 misp/KI
48 Mbits: H=95, 64K counters per rank, ×32 sums: 2.30 misp/KI
Piecewise linear vs OGEHL/TAGE (2): hardware complexity
Computation logic complexity: OGEHL: only a few adders; TAGE: only a few comparators + a priority encoder; PL: hundreds of adders
Number of storage tables: OGEHL, TAGE: 4 to 12; PL: H tables (update time)
Index computation time: + a 3-entry XOR gate for OGEHL/TAGE
Information to checkpoint: kilobits for PL