Computer architecture : branch prediction
TRANSCRIPT
20 Nov 2005 roberto innocente 1
Prediction and speculation :
the role of stochastic models of program behaviour in the performance of modern computers
r. innocente
Speculation
from the Merriam-Webster dictionary :
an assumption of an unusual risk in hopes of obtaining commensurate gains
Speculative execution
A prediction is made of what work is likely to be needed soon. That work is then speculatively executed, in such a way that you can commit it if the prediction was correct, or abort it otherwise.
von Neumann's model : Stored Program Computer
The Control Counter, today called Program Counter (PC) or Instruction Pointer (IP), keeps the address of the next instruction to be executed. The control part fetches this instruction, decodes it and executes it. At the end the PC is updated.
Linear scaling of speed, quadratic scaling of transistors
Let's look at the last scaling step in silicon lithography, from 0.13 µm to 0.09 µm : a 0.70 linear scaling, hence a 0.49 scaling of surface.
Gate delays scale linearly, while the transistors available scale quadratically.
We will get much more in available complexity than in gate speed.
[ Chart : Gate Speed and Transistors versus feature size (0.25, 0.18, 0.13, 0.09, 0.065 µm) ]
von Neumann's Projection/Collapse postulate of QM
A system can be described by any mix of states, but if you observe it you can only find it in one of the eigenstates, and you can only measure an eigenvalue.
( When you look at it, Schrödinger's cat is either dead or alive )
Modern microprocessors
Today µprocessors take advantage of the fact that they need to present an architectural state compliant with the standard von Neumann model only from time to time; for the remaining time they are free to proceed in whatever way they find convenient.
ILP – Instruction Level Parallelism (Fisher 1981)
Obeying the standard semantics when required, try to overlap the execution of multiple instructions as much as possible. (We will see that current microprocessors can have more than 100 instructions in flight.)
Enabling technologies for ILP exploitation
Pipelining
Multiple issue = Superscalar
A microprocessor in 1989 (Intel 386)
CPI = Cycles Per Instruction
Performance = Frequency / CPI
Intel 386 :
feature size : 1 micron, frequency : 33 MHz, CPI = 5/6 (between 5 and 6)
Performance = 33 M / 6 ~ 5.5 Minstructions/s
Pipelining
The work to be done is divided into stages, with a clear signal interface between them. After each stage a latch memorizes the state for the next cycle. This adds some overhead, but the hope is to get 1 result per cycle once the pipe is full.
[ Diagram : 5-stage pipeline — Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W) — with a pipeline latch between stages ]
Limits of pipelining
A latch can add 2 or 3 gate delays. The current work per instruction is around 400 gate delays.
With n stages you get a result every 400/n + 3 gate delays, but you add an overhead of 3n gate delays overall.
Pipeline at work
[ Table : pipeline occupancy, cycle by cycle, for the sequence add r1,r3,r4 ; mul r5,r6,r7 ; bnez loop,r1 ; (4-cycle bubble) ; div r8,r3,r6 ; add r10,r8,r9 ; jmp loop ]
When there is a dependency we say that the pipeline is stalled, or that a bubble is inserted, waiting for the dependency to resolve. Here a control dependency causes a 4-cycle stall.
Instruction dependencies
Data dependency :
add r1,r2,r3 ; r1 <- r2+r3
mul r1,r4,r5 ; r1 <- r4*r5
Solution: register renaming, result forwarding
Structural dependency :
Solution: add functional units
Control dependency :
bne label1,r1,r2
add r1,r2,r3
label1: mul r4,r5,r6
Solution: branch prediction
Multiple issue (Superscalar) Architectures
[ Diagram : two parallel F D X M W pipelines ]
Architectures that are able to process multiple instructions at a time. While it was common to have multiple execution units (like an integer and an FP unit), the first superscalar architectures, e.g. IBM Power and Pentium Pro, appeared only in the '90s. These architectures require a very good branch prediction. Depicted here is a 2-way superscalar.
Superscalar/2
Current architectures are commonly 4- or 8-way superscalar
The design of the last Alpha, canceled in its late phase, was for an 8-way superscalar
Extremely good branch prediction is needed : there can be hundreds of instructions in flight ( 4 ways * 30 stages = 120 )
Superscalar at work
[ Table : 2-way superscalar pipeline occupancy for add r1,r3,r4 ; mul r5,r6,r7 ; bnez loop,r1, followed by stall cycles ]
The wasted slots are now much more than in the pipelined-only case.
Real World Architectures
IBM power5
15 years of x86

year | processor      | feature size (µm) | transistors (K) | cycles/instr | frequency (MHz) | pipe length | FO4 gates per cycle
1979 | 8088           |                   |                 | 12           |                 |             |
1988 | 386dx          | 1                 | 275             | 5            | 33              |             | 80
1991 | 486dx          | 1                 | 1100            |              | 50              |             |
1993 | Pentium 60     | 0.8               | 3100            |              | 60              | 5           |
1995 | Pentium Pro    | 0.6               | 5500            |              | 150             | 10          |
1997 | Pentium II     | 0.35              | 7500            |              | 233             | 10          |
1999 | Pentium III    | 0.25              | 9500            |              | 450             | 10          |
2000 | Pentium 4      | 0.18              | 42000           |              | 1300            | 20          |
2005 | Pentium 4 571  | 0.09              | 130000          |              | 3800            | 30          | 13
Feature size, frequency, complexity
[ Charts : feature size (µm), frequency (MHz) and transistor count (K) for 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571 ]
A microprocessor in 2005 (Intel Pentium 4)
IPC = Instructions Per Cycle
Performance = Frequency * IPC
Intel Pentium 4 :
feature size : 90 nm, frequency : 3 GHz, IPC ~ 2/3 (2 for SPECint, 3 for SPECfp)
Performance = 3 G * 2 = 6 Ginstructions/s
Control xfer instructions
Some instructions, instead of simply incrementing the PC to the next instruction, change it to a different value. We distinguish :
Unconditional branches, or simply jumps
Conditional branches, or simply branches
Subroutine calls
Subroutine returns
Traps, returns from interrupts or exceptions
Assembly – Machine instructions
Only jumps or branches :
j <label>
j @register
beq <label>
bne <label>
bz <label>
bnz <label>
High level Language – Assembly

for(i=1;i<=4;i++) { .. }

    ld r1,1
    ld r2,4
loop: cmp r1,r2
    beq out
    ..
    add r1,r1,1
    jmp loop
out:

if (i) { .. }

    ld r1,i
    bz next
    ..
next:

while (i) { .. }

loop: sub r1,1
    bz out
    ..
    jmp loop
out:
SPEC – Std Perf. Evaluation Corporation benchmarks
Well-known set of benchmarks, continuously updated, recognized as representative of possible workloads
Divided in 2 big sets :
SPECint : integer programs (go, m88ksim, compress, li, ijpeg, perl, vortex)
SPECfp : floating point programs (mathematical simulation prgs)
http://www.spec.org
Branches by type
Average from SPECint95 :
Conditional 72 %
Immediate 16 %
Returns 10 %
Indirect 2 %
Branches by frequency
[ Chart : dynamic instructions, dynamic branches and dynamic conditional branches (millions of instructions on the y axis) for the SPEC95 benchmarks compress, gcc, go, ijpeg, m88ksim, perl, vortex, xlisp ]
Branches by taken rate
Average from SPECint95 :
Always taken 14 %
Taken 95–100 % of the time : 21 %
Taken 50–95 % : 20 %
Taken 5–50 % : 24 %
Taken 0–5 % : 7 %
Never taken 14 %
Occurrences of branches
Occurrences of branches (conditional branches) :
SPECint 95 : 1 out of 5 instructions executed (20 %)
SPECfp 95 : 1 out of 10 instructions executed (10 %)
Basic block is the term used for a sequence of instructions without any control xfer
Note : this is different from, and much more than, the rate of branches in the static program
Mispredictions effects
b = rate of executed instructions that are branches (0.1–0.2)
p = prediction accuracy (currently the best is in the 0.90–0.97 range)
f = instructions “in flight” (in execution, currently over 100)
Oversimplification : a misprediction is recognized only at the very end and forces us to squash all the following f in-flight instructions.
Then every 1/(b*(1-p)) instructions we squash f instructions.
E = 1/(1 + f*b*(1-p))
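As a sanity check, the efficiency formula above can be evaluated directly (a minimal sketch; the parameter values are just the example figures from this slide):

```python
def efficiency(b, p, f):
    """E = 1 / (1 + f*b*(1-p)): fraction of issued work that is useful,
    under the oversimplified squash-everything model above.

    b: fraction of executed instructions that are branches
    p: branch prediction accuracy
    f: number of in-flight instructions squashed per misprediction
    """
    return 1.0 / (1.0 + f * b * (1.0 - p))

# With b = 0.2, p = 0.95 and f = 100 in-flight instructions, a
# misprediction occurs every 1/(0.2*0.05) = 100 instructions and
# squashes 100 more, so only half of the issued work is useful:
print(efficiency(0.2, 0.95, 100))  # ~0.5
```

Note how strongly E depends on p when f is large: the same machine at p = 0.99 keeps about 83 % of its work useful.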
Efficiency versus bp accuracy
[ Chart : efficiency E versus prediction accuracy for 5, 100, 200 and 300 in-flight instructions ]
Branch prediction methods
Do they use information collected from the running programs ?
Static : no
Semi-static : info collected from test samples
Dynamic : yes
Static branch prediction
Always taken (AT), Always not taken (ANT)
Backward taken, forward not taken (BTFNT) : frequently used by current processors, relies on compilers too (Intel Pentium 4)
Complicated rules : for example the bp of the Pentium M looks at the distance between addresses and opcodes
Programmer hints (special opcodes on Pentium, flags on Itanium)
Program reorganization by compilers
Achieves ~ 60/70 % accuracy
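The BTFNT heuristic is simple enough to state in a few lines (an illustrative sketch, not any particular processor's implementation; the function name is mine):

```python
def btfnt_predict(branch_pc, target_pc):
    """Backward Taken, Forward Not Taken: a branch whose target lies
    before the branch itself (a loop back-edge) is predicted taken;
    a forward branch (e.g. one skipping an error handler) is
    predicted not taken."""
    return target_pc < branch_pc

# A loop-closing branch at 0x4010 jumping back to 0x4000 is predicted
# taken:
print(btfnt_predict(0x4010, 0x4000))  # True
```

The heuristic works because loop back-edges are taken on every iteration but the last, while forward branches often guard rarely executed code.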
Semi-static branch prediction
It relies on data collected from previous runs of the program (profiling : Sun Sparc)
Insertion in the code of appropriate hints : predict taken / predict not taken
Achieves accuracy of ~ 65/80 %
Dynamic branch prediction
“Same as last time” and bimodal predictors : achieve accuracy of 70/85 %
2-level / correlation predictors : achieve accuracy of 80/90 %
Combining/Meta predictors
Markov/PPM predictors
Neural predictors
[ Chart : % branch prediction accuracy ranges (from/to), on a 50–95 % scale, for static, semi-static, bimodal, 2-level and combined predictors ]
2bc – Two-bit saturating counter
The best 4-state FSA (Finite State Automaton)
States SNT, NT, T, ST (Strongly Not Taken, Not Taken, Taken, Strongly Taken), encoded 00, 01, 10, 11
Add 1 when the branch is taken, subtract 1 when not taken. Saturate at 0 and 3.
[ Diagram : state transitions among SNT (00), NT (01), T (10), ST (11) on taken (t) / not taken (nt) outcomes ]
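The counter's behaviour can be sketched directly from the state encoding above (states 0–3, predicting taken while in T or ST):

```python
class TwoBitCounter:
    """Saturating counter with states 0 (SNT), 1 (NT), 2 (T), 3 (ST)."""
    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2            # taken iff in T or ST

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)   # saturate at ST
        else:
            self.state = max(self.state - 1, 0)   # saturate at SNT

# Hysteresis: a single not-taken outcome does not flip a Strongly-Taken
# counter, so the one exit of a loop is mispredicted only once per pass:
c = TwoBitCounter(state=3)                # ST
c.update(False)                           # one not-taken outcome -> T
print(c.predict())  # True
```

This hysteresis is exactly why the 2-bit counter beats a 1-bit "same as last time" scheme on loop branches.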
Bimodal predictor (Smith ’85)
An array of 2-bit counters (SNT, NT, T, ST), indexed by the address of the branch instruction.
At every branch, hashing on the instruction address (usually simply using some of the bits of the address), a counter is chosen and a prediction is made. The whole array is initialized to the T or NT state.
When the outcome of the branch is known, the counter is updated.
There is general consensus on using the 2-bit saturated counter.
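A bimodal predictor is then just an array of such counters, indexed by low-order bits of the branch address (a sketch; the table size is an arbitrary choice of mine):

```python
class BimodalPredictor:
    """Array of 2-bit saturating counters hashed by branch address bits."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [2] * (1 << index_bits)   # whole array starts at T

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask                     # hash: low address bits
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

# A branch at 0x4004 that is almost always taken is soon predicted taken:
bp = BimodalPredictor()
for outcome in [True, True, False, True]:
    bp.update(0x4004, outcome)
print(bp.predict(0x4004))  # True
```

Two branches whose addresses share the same low bits alias to the same counter — the interference problem gshare addresses later.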
Branch correlations
Global correlation : the outcome depends on the outcome of previous branches
if (cond1) { .. }
if (cond1 && cond2) { .. }

if (cond1) a=2;
..
if (a==0) { .. }

if (cond1) { .. }
if (cond2) { .. }
if (cond1 && cond2) { .. }

Local correlation : the outcome depends on previous outcomes of the same branch
for(i=0;i<1000;i++) { if (i%4 == 0) a[i]=0; }
for(i=0;i<12;i++) { .. }
Two-level/Correlation predictor (Yeh-Patt ’92, Pan-So-Rahmeh ’92)
Branches are correlated one to the other
We keep a shift register with the most recent branch outcomes : the branch history register (BHR)
We index a bimodal table (Pattern History Table, PHT) with this branch history register
We can keep only one global BHR for all the branches (global 2-level predictor) or a BHR per branch (local 2-level predictor). The same can be done for the PHT.
[ Diagram : the branch history register, updated with the last outcome, indexes the Pattern History Table which gives the prediction ]
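A minimal global two-level predictor (GAg in the taxonomy of the next slide) can be sketched as a history register indexing a table of 2-bit counters; the sizes are arbitrary choices of mine:

```python
class GlobalTwoLevel:
    """Global branch history register (BHR) indexing a Pattern History
    Table (PHT) of 2-bit saturating counters."""
    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0
        self.pht = [2] * (1 << history_bits)   # all counters start at T

    def predict(self):
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask  # shift in outcome

# It learns a strictly alternating pattern, which a bimodal predictor
# can never get right:
p = GlobalTwoLevel()
hits = 0
for o in [True, False] * 40:
    hits += (p.predict() == o)
    p.update(o)
print(hits)  # 76 of 80: perfect once the history has warmed up
```

After warm-up the BHR oscillates between two values, and each one's PHT counter quickly saturates toward the outcome that always follows it.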
Taxonomy of 2-level predictors
Branch history : Global, Shared (aSsociative), Per address — { G, S, P }
Pattern history : global, shared (associative), per address — { g, s, p }
A scheme is named with a letter for the branch history, 'A' for adaptive, and a letter for the pattern history : GAg .. SAs .. PAp
GAs = Global history register, Adaptive, with shared Pattern History Tables (for instance 8 ways)
gshare (McFarling ’93)
Alleviates the problem of destructive PHT interference between branches
The PHT is indexed with the XOR of the BHR and the BIA (branch instruction address)
[ Diagram : branch history register XOR branch address indexes the Pattern History Table, which gives the prediction ]
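The only change gshare makes to the two-level scheme is the index computation, which can be sketched as follows (the 2-bit shift assumes fixed 4-byte instructions — an assumption of mine, not something the slide states):

```python
def gshare_index(branch_pc, bhr, index_bits=12):
    """gshare PHT index = (branch address bits) XOR (global history).

    branch_pc: address of the branch instruction
    bhr: global branch history register value
    """
    mask = (1 << index_bits) - 1
    return ((branch_pc >> 2) ^ bhr) & mask   # >>2 drops alignment bits

# Two branches sharing the same global history no longer collide on
# the same PHT entry, because their addresses differ:
print(gshare_index(0x4000, 0b1011) != gshare_index(0x4010, 0b1011))  # True
```

The XOR spreads (address, history) pairs over the PHT, so two unrelated branches that happen to reach the same history pattern usually train different counters.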
Path correlated prediction
The same branch history may be the result of very different program behaviours. To disentangle such situations we can take some bits of the target address of each of the last n taken branches and use those to address the bimodal PHT.
[ Diagram : a path history register holding target-address (TA) bits of the last n taken branches indexes the Pattern History Table ]
Tournament/Meta predictor (McFarling ’93)
It often happens that one predictor is better for some branches and another for other branches
A bimodal predictor can then be used to drive a mux that chooses between the 2 predictors
When the outcome is known, the meta-predictor is updated only if one of the predictors was right and the other wrong
In this case the states represent the confidence in the 2 predictors
[ Diagram : Predictor1 and Predictor2, both indexed by the branch address, feed a mux driven by the MetaPredictor ; the selected prediction is the hybrid predictor outcome ]
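The chooser itself can be sketched as one more table of 2-bit counters, this time counting which component predictor to trust (an illustrative sketch; the class and parameter names are mine):

```python
class MetaChooser:
    """Per-branch 2-bit confidence counter: state >= 2 selects
    predictor 2, state < 2 selects predictor 1."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # weakly favour predictor 1

    def choose(self, pc):
        return 2 if self.table[pc & self.mask] >= 2 else 1

    def update(self, pc, p1_correct, p2_correct):
        # trained only when exactly one predictor was right
        if p1_correct == p2_correct:
            return
        i = pc & self.mask
        if p2_correct:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

# After predictor 2 is right twice where predictor 1 was wrong,
# the mux switches to predictor 2 for this branch:
m = MetaChooser()
m.update(0x4004, False, True)
m.update(0x4004, False, True)
print(m.choose(0x4004))  # 2
```

When both predictors agree (both right or both wrong) there is nothing to learn about their relative quality, which is why the chooser is left untouched in that case.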
Data compression
It is a similar and well studied problem, for which there exists an algorithm reputed nearly optimal (PPM).
The goal is to represent the data with fewer bits : you use fewer bits for frequent sequences and more bits for the infrequent ones. The net effect is to use fewer bits overall.
It relies on accurately predicting the probabilistic distribution of the data and using a coder tuned to that.
Markov predictor
A Markov predictor of order j bases its prediction on the last j outcomes
It builds the matrix of transition frequencies and makes the prediction according to that
For the outcome sequence 1 0 1 1 0 0 1 1 0 (order 2) :

pattern  next  frequency
00       0     0
00       1     1
01       0     0
01       1     2
10       0     1
10       1     1
11       0     2
11       1     0
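A software model of an order-j Markov predictor (a sketch; it keeps the frequency matrix as a dictionary rather than a hardware table):

```python
from collections import Counter

class MarkovPredictor:
    """Order-j Markov predictor over a binary outcome stream."""
    def __init__(self, j):
        self.j = j
        self.freq = Counter()        # (pattern, next_outcome) -> count
        self.history = ()

    def predict(self):
        """Majority vote for the current pattern; None if never seen."""
        if len(self.history) < self.j:
            return None
        taken = self.freq[(self.history, 1)]
        not_taken = self.freq[(self.history, 0)]
        if taken == not_taken == 0:
            return None
        return 1 if taken >= not_taken else 0

    def update(self, outcome):
        if len(self.history) == self.j:
            self.freq[(self.history, outcome)] += 1
        self.history = (self.history + (outcome,))[-self.j:] if self.j else ()

# Feeding the slide's sequence 1 0 1 1 0 0 1 1 0 into an order-2
# predictor leaves it with history (1, 0), for which next = 1 and
# next = 0 were each seen once; the tie goes to "taken":
m = MarkovPredictor(2)
for o in [1, 0, 1, 1, 0, 0, 1, 1, 0]:
    m.update(o)
print(m.predict())  # 1
```

Running this also reproduces the frequency table on the slide, e.g. pattern 11 was followed by 0 twice and by 1 never.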
PPM – Prediction by Partial Matching (Cleary, Witten 1984)
A PPM predictor of order m is a set of m+1 Markov predictors
If the pattern of the last m bits has been seen, predict with the Markov predictor of order m ; if not found, try the last m-1 bits with the Markov predictor of order m-1, and so on, down to the Markov predictor of order 0.
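Combining the Markov predictors with the fallback chain gives a toy PPM predictor (again a sketch; the Markov class is repeated here so the fragment is self-contained):

```python
from collections import Counter

class MarkovPredictor:
    """Order-j Markov predictor over a binary outcome stream."""
    def __init__(self, j):
        self.j = j
        self.freq = Counter()
        self.history = ()

    def predict(self):
        if len(self.history) < self.j:
            return None
        t, n = self.freq[(self.history, 1)], self.freq[(self.history, 0)]
        if t == n == 0:
            return None              # pattern never seen: no prediction
        return 1 if t >= n else 0

    def update(self, outcome):
        if len(self.history) == self.j:
            self.freq[(self.history, outcome)] += 1
        self.history = (self.history + (outcome,))[-self.j:] if self.j else ()

class PPMPredictor:
    """Order-m PPM: m+1 Markov predictors, highest matching order wins."""
    def __init__(self, m):
        self.chain = [MarkovPredictor(j) for j in range(m, -1, -1)]

    def predict(self):
        for markov in self.chain:    # try order m first, order 0 last
            guess = markov.predict()
            if guess is not None:
                return guess
        return 1                     # arbitrary default: predict taken

    def update(self, outcome):
        for markov in self.chain:
            markov.update(outcome)

# On a repeating 1,1,0 pattern the order-2 model quickly dominates:
p = PPMPredictor(2)
for o in [1, 1, 0] * 4:
    p.update(o)
print(p.predict())  # 1  (the pattern (1, 0) was always followed by 1)
```

The fallback chain is what lets PPM answer even for histories it has never observed at full length, exactly as in the compression setting.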
Neural methods – D. Jimenez 2002
Machine learning has often used neural methods
Most neural networks can't be candidates for hardware prediction at the microarchitecture level : their implementation would require much more than several cycles
The standard method of training, the backpropagation algorithm, is infeasible in a few machine cycles
Perceptron
Introduced by Rosenblatt in 1962 as a model of brain functioning, popularized by M. Minsky
We will consider the simplest : the single-layer perceptron
A vector of n inputs : x[1]..x[n]
Each input has a weight associated with it : w[0]..w[n] (w[0] is the bias weight). This vector characterizes the perceptron.
Bipolar perceptron
The inputs and the outcome t can only be -1 or +1
Then t*x[i] = +1 if they agree, or -1 if they disagree
The output is y = w[0] + Σ w[i]*x[i] ; if the w[i] are integers, y is an integer too, and sign(y) is the prediction
Perceptron training
Simply stated : increase the weights of those inputs that agree with the outcome, and decrease the weights of those that do not
Let t be the outcome and θ a threshold after which we stop training the perceptron. Then the algorithm is :
if ((sign(y) != t) || (|y| < θ)) {
  for (i=0; i<=n; i++) {
    w[i] = w[i] + t * x[i];
  }
}
(here x[0] is the implicit always-1 bias input)
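The same algorithm in runnable form (a sketch of a single predict-and-train step, with the bias handled as an implicit always-+1 input; the function name is mine):

```python
def perceptron_step(w, x, t, theta):
    """One predict-and-train step of the bipolar perceptron.

    w: weights [w0, w1, ..., wn] (w0 is the bias weight), mutated in place
    x: inputs [x1, ..., xn], each -1 or +1 (the branch history)
    t: actual outcome, -1 or +1
    theta: keep training while |y| is below this threshold
    Returns the prediction made before training.
    """
    y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    prediction = 1 if y >= 0 else -1
    if prediction != t or abs(y) < theta:
        w[0] += t                        # bias input is implicitly +1
        for i, xi in enumerate(x, start=1):
            w[i] += t * xi               # reinforce agreeing inputs
    return prediction

# With history (+1, -1) always followed by "taken" (+1), the weights
# converge after a single training step:
w = [0, 0, 0]
perceptron_step(w, [1, -1], 1, theta=2)         # trains: w = [1, 1, -1]
print(perceptron_step(w, [1, -1], 1, theta=2))  # 1  (y = 3 >= theta)
```

Note that training continues even on correct predictions until |y| exceeds θ, which builds up a confidence margin.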
Perceptron limitations
A single perceptron can only learn linearly separable functions of the inputs. The linear equation w[0] + Σ w[i]*x[i] = 0 represents a hyperplane in the n-dimensional space of inputs.
AND, OR, NAND, NOR are linearly separable ; XOR is not.
Of course any boolean function can be learned by a 2-layer network of perceptrons (as any boolean function can be represented by a 2-layer net of ANDs and ORs), but it has been shown that for bp there is not much gain and the delay gets much worse.
Branch prediction with perceptrons
The inputs of the perceptron are the branch history
We keep a table of perceptrons (the weights) that we address by hashing on the branch address
Every time we meet a branch we load the perceptron into a vector register and compute in parallel the dot product between the weights and the branch history (summing one's complements instead of two's complements)
According to the result we predict the branch taken or not taken
The training algorithm is performed and the updated perceptron is written back
The dataflow limit
It's the serialization constraint imposed by data dependencies among instructions
It was always thought to be an insurmountable limit
An instruction that needs data from another instruction has to be executed after it
ADD R1,R2,R3 ; R1 <- R2+R3
ADD R4,R1,R5 ; R4 <- R1+R5
Exceeding the dataflow limit
At the end of the '90s some authors proposed the use of data prediction to overcome the dataflow limit (M. Lipasti, J. Shen : Exceeding the dataflow limit)
This is much more difficult than branch prediction, where you need to predict only a binary value
Value locality
The simulations showed in fact that applications also obey a new locality principle : Value Locality
Value locality can be temporal or spatial
Value prediction/1
It was shown for instance that 20/30 % of the instructions that write a value into a register write the same value as the last time
And 40/50 % write one of the last 4 preceding values
Value prediction/2
What makes these values so predictable ? It seems this is due to the severe penalties real-world programs incur : not only are they designed to manage quite infrequent contingencies like exceptions and error conditions, but they are general by design. This shows up even in code aggressively optimized by modern state-of-the-art compilers.
Value prediction/3
Load Value prediction
Register Value prediction
Speculation taxonomy
Speculative execution
  Control speculation
    Branch outcome
    Branch target
  Data speculation
    Data Location (Address)
    Data Value
      Load
      Register Value
Research areas
Reverse engineering of prediction algorithms implementations
Simulation of new prediction algorithms : using legacy Instruction Sets (IS), or using abstract RISC instruction sets
Hand code optimization and compiler optimization techniques
Reverse engineering
A python or perl script :
produces assembly language kernels (with, for example, a fixed distance between branch instructions)
compiles and runs the kernels, using the hardware counters for mispredictions to detect table sizes, conflicts and so on
Legacy IS/OS simulations
Can be obtained by instrumenting an x86 open source simulator like bochs, which can run windows or linux
You can then run statically precompiled binaries over it
Problem : bochs is not even a complete Pentium II simulator !
Abstract IS simulators
SimpleScalar is an open-source framework for a generic software simulator, over which modules for different prediction algorithms can be implemented
It also offers the possibility to customize the Instruction Set (IS)
Problem : you need the sources and must compile all the special libraries to use this tool
Optimization techniques
Basic block extension
Code duplication
Scheduling techniques
Scheduling
Code scheduling, or reordering of instructions, is used to improve performance or guarantee correctness
Important for dynamically scheduled architectures, essential for statically scheduled architectures
Examples : branch delay slots, memory delays, multicycle operations
Block scheduling, List scheduling, Superblock scheduling, Trace Scheduling
BTA era is here (Billion Transistor Architecture)
The Intel Itanium 2 with 6 MB L3 cache has 0.41 billion transistors, of which around 0.3 billion are for the cache memory
It's not clear what will be the best use of the available silicon :
CMP (Single-Chip MultiProcessors)
Superwide superspeculative superscalar
Simultaneous MultiThreading
Raw Processors
[ Chart : FO4 gates per cycle, pipe length, feature size, transistor count and frequency for 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571 ]