03 dynamic sched
Post on 04-Apr-2018
-
7/30/2019 03 Dynamic Sched
1/84
Computer Architecture
Instruction Level Parallelism
S. G. Shivaprasad Yadav
8/21/2011 2
Review from Last Time
4 papers: all about where to draw the line between HW and SW
IBM set the foundations for ISAs since the 1960s:
  8-bit byte
  Byte-addressable memory (as opposed to word-addressable memory)
  32-bit words
  Two's complement arithmetic (but not the first processor to use it)
  32-bit (SP) / 64-bit (DP) floating-point format and registers
  Commercial use of microcoded CPUs
  Binary compatibility / computer family
B5000: a very different model: HLL only, stack, segmented VM
The IBM paper made the case for ISAs good for microcoded processors, leading to CISC; the Berkeley paper made the case for ISAs for pipelined + cache microprocessors (VLSI), leading to RISC.
Who won RISC vs. CISC? VAX is dead. Intel 80x86 on the desktop, RISC in embedded, servers both x86 and RISC.
Outline
ILP
Compiler techniques to increase ILP
Loop Unrolling
Static Branch Prediction
Dynamic Branch Prediction
Overcoming Data Hazards with Dynamic Scheduling
(Start) Tomasulo Algorithm
Conclusion
Recall from Pipelining Review
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
Ideal pipeline CPI: a measure of the maximum performance attainable by the implementation
Structural hazards: HW cannot support this combination of instructions
Data hazards: instruction depends on the result of a prior instruction still in the pipeline
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
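As a worked illustration of the additive CPI model above (the stall rates below are made-up numbers for illustration, not measurements from these slides):

```python
# Pipeline CPI = ideal CPI + structural + data-hazard + control stalls,
# all in stall cycles per instruction. Using cycles per 100 instructions
# keeps the arithmetic exact. The stall counts are hypothetical.
ideal_cycles = 100 * 1          # ideal CPI of 1.0
structural_stall_cycles = 0
data_hazard_stall_cycles = 5    # e.g., 5% of instructions stall 1 cycle
control_stall_cycles = 10

total = (ideal_cycles + structural_stall_cycles
         + data_hazard_stall_cycles + control_stall_cycles)
pipeline_cpi = total / 100
print(pipeline_cpi)  # 1.15
```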
Instruction Level Parallelism
Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance
2 approaches to exploit ILP:
1) Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2)
Next 4 lectures are on this topic
Instruction-Level Parallelism (ILP)
Basic Block (BB) ILP is quite small.
BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
For typical MIPS programs, the average dynamic branch frequency is 15% to 25% => 4 to 7 instructions execute between a pair of branches.
Since the instructions in a BB are likely to depend on each other, the amount of overlap we can exploit within a basic block is likely to be less than the average BB size.
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
Simplest: loop-level parallelism, exploiting parallelism among iterations of a loop, e.g., for (i=1; i<=1000; i=i+1) x[i] = x[i] + s;
Loop-Level Parallelism
Exploit loop-level parallelism by unrolling the loop, either:
1. dynamically, via branch prediction, or
2. statically, via loop unrolling by the compiler.
(Another way is vectors, to be covered later.)
Determining instruction dependence is critical to loop-level parallelism.
To exploit ILP we must determine which instructions can be executed in parallel. If 2 instructions are:
parallel: they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards);
dependent: they are not parallel and must be executed in order, although they may often be partially overlapped.
Data Dependence and Hazards
There are three different types of dependences: data dependences, name dependences, and control dependences.
Instr J is data dependent (a.k.a. true dependence) on Instr I if either of the following holds:
1. Instr I produces a result that may be used by Instr J, or
2. Instr J is data dependent on Instr K, and Instr K is data dependent on Instr I.
I: add r1,r2,r3
J: sub r4,r1,r3
Data Dependence and Hazards
Example: consider the following MIPS code sequence that increments a vector of values in memory (starting at 0(R1), and with the last element at 8(R2)) by a scalar in register F2.
Data Dependence and Hazards
Both of the above dependent sequences, as shown by the arrows, have each instruction depending on the previous one.
The arrows show the order that must be preserved for correct execution: each arrow points from an instruction that must precede to the instruction that must follow it.
If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
The dependence implies that there would be a chain of one or more data hazards between the two instructions.
Executing the instructions simultaneously will cause a processor with pipeline interlocks (and a pipeline depth longer than the distance between the instructions in cycles) to detect a hazard and stall, thereby reducing or eliminating the overlap.
Data dependence in the instruction sequence corresponds to data dependence in the source code; the effect of the original data dependence must be preserved.
If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.
ILP and Data Dependencies, Hazards
HW/SW must preserve program order: the order instructions would execute in if executed sequentially, as determined by the original source program.
Dependences are a property of programs. The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
Importance of the data dependencies:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program.
Data Dependencies,Hazards
A dependence can be overcome in two different ways:
1. maintaining the dependence but avoiding a hazard, and
2. eliminating the dependence by transforming the code.
Scheduling the code is the primary method used to avoid a hazard without altering a dependence; such scheduling can be done both by the compiler and by the hardware.
Name Dependence #1: Anti-dependence
Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
Instr J writes an operand before Instr I reads it:
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an anti-dependence by compiler writers. This results from reuse of the name r1.
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence #2: Output Dependence
Instr J writes an operand before Instr I writes it:
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an output dependence by compiler writers. This also results from the reuse of the name r1.
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict. Register renaming resolves name dependences for registers, either by the compiler or by HW.
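Register renaming can be sketched as a pass that gives every write a fresh physical register, which removes the WAR and WAW conflicts while preserving true (RAW) dependences. The tuple encoding and the `rename` helper below are my own illustration, not from the slides:

```python
def rename(instrs, n_arch_regs=32):
    """Rename destination registers so each write gets a fresh physical
    register; reads are mapped through the current rename table.
    Each instruction is (op, dest, src1, src2). Hypothetical sketch."""
    table = {r: r for r in range(n_arch_regs)}  # arch reg -> physical reg
    next_phys = n_arch_regs
    out = []
    for op, dest, s1, s2 in instrs:
        s1p, s2p = table[s1], table[s2]   # map sources first (preserves RAW)
        table[dest] = next_phys           # fresh physical reg for the write
        out.append((op, next_phys, s1p, s2p))
        next_phys += 1
    return out

# The slide's sequence: I: sub r4,r1,r3; J: add r1,r2,r3; K: mul r6,r1,r7
code = [("sub", 4, 1, 3), ("add", 1, 2, 3), ("mul", 6, 1, 7)]
renamed = rename(code)
# After renaming, J no longer writes the register I reads (WAR gone, WAW
# impossible), while K still reads J's renamed result (RAW preserved).
```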
Data Hazards
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence.
Because of the dependence, we must preserve program order, that is, the order in which the instructions would execute if executed sequentially, one at a time, as determined by the original source program.
The goal of both software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program.
Detecting and avoiding hazards ensures that the necessary program order is preserved.
Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. Consider two instructions i and j, with i preceding j in program order.
The possible data hazards are:
RAW (read after write)
WAW (write after write)
WAR (write after read)
Data Hazards
RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value.
This hazard is the most common type and corresponds to a true data dependence.
Program order must be preserved to ensure that j receives the value from i.
WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.
This hazard corresponds to an output dependence.
WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled.
WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value.
This hazard arises from an anti-dependence. WAR hazards cannot occur in most static-issue pipelines (even deeper pipelines or floating-point pipelines) because all reads are early (in ID) and all writes are late (in WB).
A WAR hazard occurs either when some instructions write results early in the pipeline and other instructions read a source late in the pipeline, or when instructions are reordered.
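The three hazard classes can be read directly off the read/write register sets of the two instructions. A minimal sketch (the function name and set encoding are assumptions of mine, not from the slides):

```python
def hazard_types(i_reads, i_writes, j_reads, j_writes):
    """Classify potential data hazards between instruction i and a later
    instruction j, given their read/write register sets. Sketch only."""
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")   # j reads what i writes: true dependence
    if i_writes & j_writes:
        hazards.append("WAW")   # both write the same name: output dependence
    if i_reads & j_writes:
        hazards.append("WAR")   # j writes what i reads: anti-dependence
    return hazards

# I: add r1,r2,r3  /  J: sub r4,r1,r3  -> J reads I's result
print(hazard_types({2, 3}, {1}, {1, 3}, {4}))  # ['RAW']
```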
Control Dependencies
A control dependence determines the ordering of an instruction, i, with respect to a branch instruction, so that instruction i is executed in correct program order and only when it should be.
Every instruction is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order.
One of the simplest examples of a control dependence is the dependence of the statements in the "then" part of an if statement on the branch.
For example, in the code segment:
if p1 {
  S1;
};
if p2 {
  S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependencies
In general, there are two constraints imposed by control dependences:
1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.
2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.
Control Dependence Ignored
Control dependence need not always be preserved:
we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
Instead, 2 properties are critical to program correctness:
1) exception behavior, and
2) data flow
Exception Behavior
Preserving exception behavior: any changes in instruction execution order must not change how exceptions are raised in the program (no new exceptions).
Example:
  DADDU R2,R3,R4
  BEQZ  R2,L1
  LW    R1,0(R2)
L1:
(Assume branches are not delayed.)
Problem with moving LW before BEQZ? If the branch is taken (R2 = 0), the LW never executes in the original program; hoisting it above the branch could raise a memory exception (a load from address 0) that the original program would never cause.
Data Flow
Data flow: the actual flow of data values among instructions that produce results and those that consume them.
Branches make the flow dynamic: they determine which instruction is the supplier of the data.
Example:
  DADDU R1,R2,R3
  BEQZ  R4,L
  DSUBU R1,R5,R6
L: OR   R7,R1,R8
Does OR depend on DADDU or DSUBU? It depends on the branch outcome: if the branch is taken, OR uses the DADDU result; otherwise it uses the DSUBU result. We must preserve data flow on execution.
Outline
ILP
Compiler techniques to increase ILP
Loop Unrolling
Static Branch Prediction
Dynamic Branch Prediction
Overcoming Data Hazards with DynamicScheduling
(Start) Tomasulo Algorithm
Conclusion
Basic Pipeline Scheduling and Loop Unrolling
To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.
We assume the standard five-stage integer pipeline, so that branches have a delay of 1 clock cycle.
We assume that the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards.
Now we look at how the compiler can increase the amount of available ILP by transforming loops.
Software Techniques - Example
Example: this code adds a scalar to a vector:
for (i=1000; i>0; i=i-1)
  x[i] = x[i] + s;
Let's look at the performance of this loop, showing how we can use the parallelism to improve its performance for a MIPS pipeline with the latencies shown below. Ignore the delayed branch in these examples.
Instruction producing result  Instruction using result  Latency in cycles  Stalls in cycles
FP ALU op                     Another FP ALU op         4                  3
FP ALU op                     Store double              3                  2
Load double                   FP ALU op                 1                  1
Load double                   Store double              1                  0
Integer op                    Integer op                1                  0
FP Loop: Where are the Hazards?
First translate into MIPS code. To simplify, assume 8 is the lowest address.
Loop: L.D    F0,0(R1)  ;F0=vector element
      ADD.D  F4,F0,F2  ;add scalar from F2
      S.D    0(R1),F4  ;store result
      DADDUI R1,R1,-8  ;decrement pointer 8B (DW)
      BNEZ   R1,Loop   ;branch if R1!=zero
FP Loop Showing Stalls
9 clock cycles: can we rewrite the code to minimize stalls?
Instruction producing result  Instruction using result  Latency in clock cycles
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1
1 Loop: L.D    F0,0(R1)  ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2  ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4  ;store result
7       DADDUI R1,R1,-8  ;decrement pointer 8B (DW)
8       stall            ;assumes can't forward to branch
9       BNEZ   R1,Loop   ;branch if R1!=zero
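As a check on the 9-cycle count, a sketch of mine (not from the slides) sums one issue cycle per instruction plus the stall cycles between each dependent neighbor, using the latency table:

```python
# Stall cycles between a producing and a consuming instruction class,
# from the slide's latency table (plus the 1-cycle stall the slide
# assumes between an integer op and a dependent branch).
stalls = {
    ("load_double", "fp_alu"): 1,
    ("fp_alu", "store_double"): 2,
    ("int_op", "branch"): 1,
}

# The loop body; each stall-causing dependence is between neighbors here.
loop = ["load_double", "fp_alu", "store_double", "int_op", "branch"]

cycles = len(loop)  # one issue cycle per instruction
for producer, consumer in zip(loop, loop[1:]):
    cycles += stalls.get((producer, consumer), 0)

print(cycles)  # 9, matching the slide
```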
Revised FP Loop Minimizing Stalls
7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead. How to make it faster?
Instruction producing result  Instruction using result  Latency in clock cycles
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1
Swap DADDUI and S.D by changing the address of S.D:
1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4  ;altered offset when we moved DADDUI
7       BNEZ   R1,Loop
Loop Unrolling
A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together.
In this case, we can eliminate the data-use stalls by creating additional independent instructions within the loop body.
If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop.
Thus, we will want to use different registers for each iteration, increasing the required number of registers.
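In source-level terms, unrolling the scalar-add loop four times looks roughly like this sketch (the Python encoding is my own; the slides work at the MIPS level):

```python
# Rolled form: for (i=1000; i>0; i=i-1) x[i] = x[i] + s;  unrolled 4x below.
def add_scalar_unrolled(x, s, n):
    """Add s to x[1..n]; n is assumed to be a multiple of 4, mirroring the
    slide's assumption that the iteration count is a multiple of 4."""
    i = n
    while i > 0:
        # four copies of the body, one decrement/branch per four elements
        x[i]     += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4
    return x

x = add_scalar_unrolled([0.0] * 1001, 2.5, 1000)   # 1-indexed like the slide
assert x[1] == 2.5 and x[1000] == 2.5
```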
Unroll Loop Four Times (straightforward way)
Rewrite the loop to minimize stalls?
1  Loop: L.D    F0,0(R1)
2        stall (1 cycle)
3        ADD.D  F4,F0,F2
4-5      stall (2 cycles)
6        S.D    0(R1),F4     ;drop DSUBUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8    ;drop DSUBUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12  ;drop DSUBUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32   ;alter to 4*8
26       BNEZ   R1,LOOP
27 clock cycles, or 6.75 per iteration.
(Assumes R1 is a multiple of 4; the gaps in the cycle numbering are the 1-cycle load-use and 2-cycle ADD.D-to-S.D stalls.)
Unrolled Loop Detail
We do not usually know the upper bound of the loop.
Suppose it is n, and we would like to unroll the loop to make k copies of the body.
Instead of a single unrolled loop, we generate a pair of consecutive loops:
the 1st executes (n mod k) times and has a body that is the original loop;
the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times.
For large values of n, most of the execution time will be spent in the unrolled loop.
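The two-loop transformation above can be sketched as follows (my illustration; the array layout and helper name are assumptions):

```python
def add_scalar_general(x, s, n, k=4):
    """Unroll by k for unknown n: a (n mod k)-iteration cleanup loop with
    the original body, then an unrolled loop executing n // k times."""
    i = 0
    # 1st loop: n mod k copies of the original body
    for _ in range(n % k):
        i += 1
        x[i] += s
    # 2nd loop: the unrolled body, iterated n // k times
    for _ in range(n // k):
        for u in range(k):      # stands in for k replicated copies of the body
            x[i + 1 + u] += s
        i += k
    return x

x = add_scalar_general([0.0] * 11, 1.0, 10, k=4)
# 10 mod 4 = 2 elements in the cleanup loop, then 2 * 4 in the unrolled loop
assert x[1:] == [1.0] * 10
```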
Unrolled Loop That Minimizes Stalls
1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       S.D    8(R1),F16   ; 8-32 = -24
14       BNEZ   R1,LOOP
14 clock cycles, or 3.5 per iteration.
5 Loop Unrolling Decisions
Loop unrolling requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that loop unrolling is useful by finding that the loop iterations are independent (except for loop maintenance code).
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations.
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
5. Schedule the code, preserving any dependences needed to yield the same result as the original code.
3 Limits to Loop Unrolling
1. Decrease in the amount of overhead amortized with each extra unrolling (Amdahl's Law).
2. Growth in code size: for larger loops, the concern is that unrolling increases the instruction cache miss rate.
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling. If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
Static Branch Prediction
To reorder code around branches, we need to predict branches statically at compile time.
The simplest scheme is to predict a branch as taken. This scheme has an average misprediction rate equal to the untaken branch frequency, which for the SPEC programs is 34%. Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%).
A more accurate scheme predicts branches using profile information collected from earlier runs, modifying the prediction based on the last run.
[Figure: misprediction rates of profile-based prediction on SPEC benchmarks, roughly 4% to 22%; the integer benchmarks (compress, eqntott, espresso, gcc, li) mispredict noticeably more often than the floating-point benchmarks (doduc, ear, hydro2d, mdljdp, su2cor).]
Static Branch Prediction
The key observation that makes this worthwhile is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken.
The same input data were used for the runs and for collecting the profile; other studies have shown that changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.
The effectiveness of any branch prediction scheme depends both on the accuracy of the scheme and on the frequency of conditional branches, which varies in SPEC from 3% to 24%.
The fact that the misprediction rate for the integer programs is higher, and that such programs typically have a higher branch frequency, is a major limitation for static branch prediction.
Dynamic Branch Prediction
Why does prediction work?
The underlying algorithm has regularities.
The data being operated on has regularities.
The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems.
Is dynamic branch prediction better than static branch prediction?
Seems to be: there are a small number of important branches in programs which have dynamic behavior.
Dynamic Branch Prediction
Performance = f(accuracy, cost of misprediction)
The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table (BHT).
A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction.
The memory contains a bit that says whether the branch was recently taken or not.
This scheme is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs.
Problem: in a loop, a 1-bit BHT will cause two mispredictions (the average loop runs 9 iterations before exit):
at the end of the loop, when it exits instead of looping as before, and
the first time through the loop on the next pass, when it predicts exit instead of looping.
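A small simulation of mine (not from the slides) reproduces the two steady-state mispredictions of a 1-bit predictor on a loop branch taken 8 times and then not taken:

```python
def mispredictions_1bit(outcomes, initial=True):
    """Count mispredictions of a 1-bit predictor that always predicts
    the last actual outcome. True = taken."""
    pred, miss = initial, 0
    for actual in outcomes:
        if actual != pred:
            miss += 1
        pred = actual  # 1 bit of state: remember only the last outcome
    return miss

one_pass = [True] * 8 + [False]   # loop branch: taken 8x, then exit
# A later execution of the loop starts with the predictor saying "not
# taken" (left over from the previous exit): 2 mispredictions per pass.
assert mispredictions_1bit(one_pass, initial=False) == 2
```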
Solution: a 2-bit scheme, where we change the prediction only if we get a misprediction twice.
The predictor is a 4-state saturating counter (T = taken, NT = not taken):
  11: Predict Taken
  10: Predict Taken
  01: Predict Not Taken
  00: Predict Not Taken
A taken branch moves the state toward 11; a not-taken branch moves it toward 00. (Red: stop, not taken. Green: go, taken.)
This adds hysteresis to the decision-making process.
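The state machine above can be sketched as a saturating counter; on the same 8-taken-then-exit loop pattern it mispredicts only once per pass in steady state (my sketch, not from the slides):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0..3; predict taken when >= 2
    (states 11 and 10). The prediction flips only after two consecutive
    mispredictions."""
    def __init__(self, state=3):        # start at 11: predict taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
miss = 0
for _ in range(3):                      # three passes over the loop
    for actual in [True] * 8 + [False]:
        miss += (p.predict() != actual)
        p.update(actual)
print(miss)  # 3: one misprediction (the loop exit) per pass
```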
BHT Accuracy
We mispredict because either:
we made the wrong guess for that branch, or
we got the branch history of the wrong branch when indexing the table.
4096-entry table: [figure of per-benchmark misprediction rates not reproduced in this transcript]
Correlated Branch Prediction
Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch.
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.
In general, an (m,n) predictor means: record the last m branches to select between 2^m history tables, each with n-bit counters. Thus, the old 2-bit BHT is a (0,2) predictor.
Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches.
Each entry in the table has 2^m n-bit predictors.
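The (m,n) indexing can be sketched as follows (my illustration, not from the slides; the table size and class name are assumptions):

```python
class CorrelatingPredictor:
    """(m, 2) predictor: each branch-address entry holds 2^m 2-bit
    counters, selected by an m-bit global history shift register."""
    def __init__(self, m=2, entries=1024):
        self.m = m
        self.history = 0                      # m-bit global history
        self.table = [[1] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % len(self.table)][self.history]
        return counter >= 2                   # 2-bit counter: taken if >= 2

    def update(self, pc, taken):
        entry = self.table[pc % len(self.table)]
        c = entry[self.history]
        entry[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # shift the new outcome into the m-bit global history
        self.history = ((self.history << 1) | taken) & ((1 << self.m) - 1)

p = CorrelatingPredictor(m=2)
p.update(pc=64, taken=True)   # trains only the counter selected by history 00
```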
Correlating Branches
(2,2) predictor: the behavior of the 2 most recent branches selects between four predictions of the next branch, updating just the selected prediction.
[Figure: the branch address and the 2-bit global branch history together index a table of 2-bit per-branch predictors to produce the prediction.]
Accuracy of Different Schemes
[Figure: frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three predictors: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT. Rates range from about 0% to 18%; the 1024-entry (2,2) predictor, despite using fewer total bits, generally matches or beats both simple BHTs.]
Tournament Predictors
Multilevel branch predictor
Use an n-bit saturating counter to choose between predictors
Usual choice between global and local predictors
Tournament Predictors
A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:
Global predictor:
  4K entries, indexed by the history of the last 12 branches (2^12 = 4K)
  Each entry is a standard 2-bit predictor
Local predictor:
  Local history table: 1024 10-bit entries recording the last 10 occurrences of each branch, indexed by branch address
  The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
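The selection step can be sketched as a 2-bit chooser that moves toward whichever component predictor was right whenever the two disagree (my sketch, not from the slides):

```python
def choose_and_update(chooser, global_pred, local_pred, actual):
    """One step of a tournament selector. chooser is a 2-bit counter:
    >= 2 means trust the global predictor. Returns (prediction, chooser)."""
    prediction = global_pred if chooser >= 2 else local_pred
    # train the chooser only when the component predictors disagree
    if global_pred != local_pred:
        if global_pred == actual:
            chooser = min(3, chooser + 1)   # move toward the global predictor
        else:
            chooser = max(0, chooser - 1)   # move toward the local predictor
    return prediction, chooser

pred, chooser = choose_and_update(2, True, False, actual=False)
# the chooser trusted the (wrong) global predictor, so it moves toward local
assert (pred, chooser) == (True, 1)
```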
Comparing Predictors (Fig. 2.8)
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch, which is particularly crucial for the integer benchmarks.
A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks.
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
[Figure: branch mispredictions per 1000 instructions on SPEC2000. SPECint2000: 164.gzip 11, 175.vpr 13, 176.gcc 7, 181.mcf 12, 186.crafty 9. SPECfp2000: 168.wupwise 1, 171.swim 0, 172.mgrid 0, 173.applu 0, 177.mesa 5.]
6% misprediction rate per branch for SPECint (19% of INT instructions are branches).
2% misprediction rate per branch for SPECfp (5% of FP instructions are branches).
Branch Target Buffers (BTB)
Branch target calculation is costly and stalls the instruction fetch.
A BTB stores PCs the same way as caches:
The PC of a branch is sent to the BTB.
When a match is found, the corresponding predicted PC is returned.
If the branch was predicted taken, instruction fetch continues at the returned predicted PC.
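The lookup described above can be sketched with the BTB modeled as a tagged map from branch PC to predicted target (my illustration; the dictionary encoding is an assumption, not the hardware structure):

```python
def next_fetch_pc(btb, pc, word=4):
    """Return the PC to fetch next: the predicted target on a BTB hit
    whose entry predicts taken, else the fall-through PC. Sketch only."""
    entry = btb.get(pc)                 # tag match on the full branch PC
    if entry is not None and entry["taken"]:
        return entry["target"]          # redirect fetch with no bubble
    return pc + word                    # miss or predicted not taken

btb = {0x400: {"target": 0x100, "taken": True}}
assert next_fetch_pc(btb, 0x400) == 0x100   # hit, predicted taken
assert next_fetch_pc(btb, 0x404) == 0x408   # miss: sequential fetch
```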
Dynamic Branch Prediction Summary
Prediction is becoming an important part of execution.
Branch History Table: 2 bits for loop accuracy.
Correlation: recently executed branches are correlated with the next branch:
either different branches (GA), or different executions of the same branch (PA).
Tournament predictors take this insight to the next level by using multiple predictors, usually one based on global information and one based on local information, combined with a selector.
In 2006, tournament predictors using about 30K bits were in processors like the Power5 and Pentium 4.
Branch Target Buffer: includes branch address & prediction.
Advantages of Dynamic Scheduling
Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior.
It handles cases where dependences are unknown at compile time:
it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve.
It allows code compiled for one pipeline to run efficiently on a different pipeline.
It simplifies the compiler.
Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture).
HW Schemes: Instruction Parallelism
Key idea: allow instructions behind a stall to proceed:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Here SUBD need not wait behind the DIVD/ADDD dependence. This enables out-of-order execution and allows out-of-order completion (e.g., SUBD).
In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue).
We will distinguish when an instruction begins execution and when it completes execution; between those 2 times, the instruction is in execution.
Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder.
Dynamic Scheduling Step 1
The simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue.
Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
Issue: decode instructions, check for structural hazards.
Read operands: wait until no data hazards, then read operands.
A Dynamic Algorithm: Tomasulo's
Designed for the IBM 360/91 (before caches!), which had long memory latency.
Goal: high performance without special compilers.
The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations. This led Tomasulo to figure out how to get more effective registers: renaming in hardware!
Why study a 1966 computer? The descendants of this design have flourished: Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...
Tomasulo Algorithm
Control & buffers are distributed with the Function Units (FUs). The FU buffers are called reservation stations; they hold pending operands.
Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming. Renaming avoids WAR and WAW hazards.
There are more reservation stations than registers, so the hardware can do optimizations compilers can't.
Results go to the FUs from the RSs, not through registers, over the Common Data Bus, which broadcasts results to all FUs.
RAW hazards are avoided by executing an instruction only when its operands are available.
Loads and stores are treated as FUs with RSs as well.
Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue.
Tomasulo Organization
[Figure: the FP Op Queue issues instructions; load buffers (Load1-Load6) bring data from memory and store buffers send data to memory; reservation stations (Add1-Add3 feeding the FP adders, Mult1-Mult2 feeding the FP multipliers) hold waiting operations; results from the functional units and loads are broadcast on the Common Data Bus (CDB) to the FP registers, the reservation stations, and the store buffers.]
Reservation Station Components
Op: the operation to perform in the unit (e.g., + or -)
Vj, Vk: the values of the source operands
  (store buffers have a V field for the result to be stored)
Qj, Qk: the reservation stations producing the source registers (values to be written)
  Note: Qj,Qk = 0 => ready
  (store buffers only have Qi, for the RS producing the result)
Busy: indicates the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
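The fields above map directly onto a small record (my sketch, not from the slides; the field types are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    """Fields from the slide: Op, Vj/Vk (operand values), Qj/Qk (names of
    the producing stations, None meaning 'value ready'), and Busy."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None    # RS producing Vj, or None if Vj is ready
    qk: Optional[str] = None    # RS producing Vk, or None if Vk is ready

    def ready(self):
        # Qj, Qk empty (here: None) => both operands are available
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation(busy=True, op="ADDD", vj=1.0, qk="Load2")
assert not rs.ready()       # still waiting on Load2 to broadcast Vk
```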
Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction & sends the operands (renaming registers).
2. Execute: operate on the operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.
A normal data bus carries data + destination ("go to" bus).
The Common Data Bus carries data + source ("come from" bus): 64 bits of data + 4 bits of Functional Unit source address. A unit writes if the broadcast matches the Functional Unit it expects (the one producing its result). The CDB does the broadcast.
Example speeds: 3 clocks for FP +,-; 10 for *; 40 clocks for /.
Tomasulo Example (instruction stream)
Instruction status (Issue / Exec Comp / Write Result all blank; nothing has issued yet):
  LD    F6  34+ R2
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Load buffers (3): Load1 No, Load2 No, Load3 No (Busy / Address)
Reservation stations (Time Name Busy Op Vj Vk Qj Qk):
  3 FP adder RSs:      Add1 No, Add2 No, Add3 No
  2 FP multiplier RSs: Mult1 No, Mult2 No
Register result status at clock 0: no FU pending for any register (F0, F2, F4, F6, F8, F10, F12, ..., F30).
(Clock = clock cycle counter; Time = FU countdown.)
Tomasulo Example Cycle 1

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:    Busy   Address
Load1            Yes    34+R2
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status (clock 1):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU                              Load1
Tomasulo Example Cycle 2

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1
LD     F2   45+  R3              2
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:    Busy   Address
Load1            Yes    34+R2
Load2            Yes    45+R3
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status (clock 2):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU              Load2           Load1

Note: Can have multiple loads outstanding
Tomasulo Example Cycle 3

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3
LD     F2   45+  R3              2
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:    Busy   Address
Load1            Yes    34+R2
Load2            Yes    45+R3
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   Yes    MULTD              R(F4)   Load2
       Mult2   No

Register result status (clock 3):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   Load2           Load1

Note: register names are removed (renamed) in the Reservation Stations;
MULTD issued. Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:    Busy   Address
Load1            No
Load2            Yes    45+R3
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    Yes    SUBD       M(A1)                   Load2
       Add2    No
       Add3    No
       Mult1   Yes    MULTD              R(F4)   Load2
       Mult2   No

Register result status (clock 4):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   Load2           M(A1)   Add1

Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
 2     Add1    Yes    SUBD       M(A1)   M(A2)
       Add2    No
       Add3    No
10     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 5):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           M(A1)   Add1    Mult2

Timer starts counting down for Add1, Mult1
Tomasulo Example Cycle 6

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
 1     Add1    Yes    SUBD       M(A1)   M(A2)
       Add2    Yes    ADDD               M(A2)   Add1
       Add3    No
 9     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 6):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           Add2    Add1    Mult2

Issue ADDD here despite the name dependency on F6?
Tomasulo Example Cycle 7

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
 0     Add1    Yes    SUBD       M(A1)   M(A2)
       Add2    Yes    ADDD               M(A2)   Add1
       Add3    No
 8     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 7):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           Add2    Add1    Mult2

Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 9

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
 1     Add2    Yes    ADDD       (M-M)   M(A2)
       Add3    No
 6     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 9):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           Add2    (M-M)   Mult2
Tomasulo Example Cycle 10

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
 0     Add2    Yes    ADDD       (M-M)   M(A2)
       Add3    No
 5     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 10):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           Add2    (M-M)   Mult2

Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
 4     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 11):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           (M-M+M) (M-M)   Mult2

Write result of ADDD here?
All quick instructions complete in this cycle!
Tomasulo Example Cycle 12

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
 3     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 12):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           (M-M+M) (M-M)   Mult2
Tomasulo Example Cycle 13

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
 2     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 13):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           (M-M+M) (M-M)   Mult2
Tomasulo Example Cycle 14

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
 1     Mult1   Yes    MULTD      M(A2)   R(F4)
       Mult2   Yes    DIVD               M(A1)   Mult1

Register result status (clock 14):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      Mult1   M(A2)           (M-M+M) (M-M)   Mult2
Tomasulo Example Cycle 16

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3      15         16
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
40     Mult2   Yes    DIVD       M*F4    M(A1)

Register result status (clock 16):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      M*F4    M(A2)           (M-M+M) (M-M)   Mult2

Just waiting for Mult2 (DIVD) to complete
Faster-than-light computation (skip a couple of cycles)
Tomasulo Example Cycle 55

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3      15         16
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
 1     Mult2   Yes    DIVD       M*F4    M(A1)

Register result status (clock 55):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      M*F4    M(A2)           (M-M+M) (M-M)   Mult2
Tomasulo Example Cycle 56

Instruction status:              Issue  Exec Comp  Write Result
LD     F6   34+  R2              1      3          4
LD     F2   45+  R3              2      4          5
MULTD  F0   F2   F4              3      15         16
SUBD   F8   F6   F2              4      7          8
DIVD   F10  F0   F6              5      56
ADDD   F6   F8   F2              6      10         11

Load buffers:    Busy   Address
Load1            No
Load2            No
Load3            No

Reservation stations:            S1      S2      RS      RS
Time   Name    Busy   Op         Vj      Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
 0     Mult2   Yes    DIVD       M*F4    M(A1)

Register result status (clock 56):
        F0      F2      F4      F6      F8      F10     F12 ... F30
FU      M*F4    M(A2)           (M-M+M) (M-M)   Mult2

Mult2 (DIVD) is completing; what is waiting for it?
Why can Tomasulo overlap iterations of loops?

Register renaming
  Multiple iterations use different physical destinations for their
  registers (dynamic loop unrolling).
Reservation stations
  Permit instruction issue to advance past integer control-flow operations
  Also buffer old values of registers - totally avoiding the WAR stall
Other perspective: Tomasulo builds the dataflow dependency graph on the fly
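The renaming idea above can be sketched in a few lines: each iteration's
write to the same architectural register is assigned a fresh reservation-
station tag, so later iterations never conflict (WAR/WAW) with earlier ones.
Names and the tag scheme below are purely illustrative:

```python
# Dynamic loop unrolling via renaming: F0 gets a fresh tag each iteration.
register_status = {}  # architectural register -> tag of producing RS

next_tag = iter(f"Mult{i}" for i in range(1, 100))

def issue_multd(dest):
    """Issue one MULTD writing `dest`; rename dest to a fresh RS tag."""
    tag = next(next_tag)
    register_status[dest] = tag  # F0 now means "the result of <tag>"
    return tag

tags = [issue_multd("F0") for _ in range(3)]  # three loop iterations
print(tags)             # three distinct physical destinations
print(register_status)  # only the newest rename is architecturally visible
```

Because each iteration writes a distinct tag, all three multiplies can be
in flight at once; only the last rename defines the architectural F0.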
Tomasulo's scheme offers 2 major advantages

1. Distribution of the hazard detection logic
   Distributed reservation stations and the CDB
   If multiple instructions are waiting on a single result, and each
   instruction already has its other operand, then the instructions can be
   released simultaneously by the broadcast on the CDB
   If a centralized register file were used, the units would have to read
   their results from the registers when the register buses are available
2. Elimination of stalls for WAW and WAR hazards
Tomasulo Drawbacks

Complexity
  Delays of the 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 appear in
  CA:AQA 2/e, but not in silicon!
Many associative stores (CDB) at high speed
Performance limited by the Common Data Bus
  Each CDB must go to multiple functional units => high capacitance, high
  wiring density
  Number of functional units that can complete per cycle limited to one!
  Multiple CDBs => more FU logic for parallel associative stores
Non-precise interrupts!
  We will address this later
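The single-CDB completion limit can be illustrated with a tiny arbitration
sketch: even if several functional units finish in the same cycle, only one
may broadcast per cycle, and the rest stall waiting for the bus. Unit names
and the FIFO arbitration policy are illustrative assumptions:

```python
# One broadcast per cycle on a single CDB: finished units queue for the bus.
from collections import deque

ready_to_write = deque(["Add1", "Mult1", "Add2"])  # all finished this cycle
broadcasts = []

cycle = 0
while ready_to_write:
    cycle += 1
    winner = ready_to_write.popleft()  # CDB arbitration picks one unit
    broadcasts.append((cycle, winner))

print(broadcasts)  # three results spread over three cycles on one CDB
```

Adding a second CDB would let two units write per cycle, but each station
would then need comparators against both buses - the "more FU logic for
parallel associative stores" cost noted above.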
And In Conclusion ... #1

Leverage implicit parallelism for performance: Instruction Level Parallelism
  Loop unrolling by the compiler to increase ILP
  Branch prediction to increase ILP
Dynamic HW exploiting ILP
  Works when dependences can't be known at compile time
  Can hide L1 cache misses
  Code for one machine runs well on another
And In Conclusion ... #2

Reservation stations: renaming to a larger set of registers + buffering of
source operands
  Prevents the registers from becoming the bottleneck
  Avoids WAR, WAW hazards
  Allows loop unrolling in HW
  Not limited to basic blocks (integer unit gets ahead, beyond branches)
  Helps with cache misses as well
Lasting contributions
  Dynamic scheduling
  Register renaming
  Load/store disambiguation
360/91 descendants include the Intel Pentium 4, IBM Power 5, AMD
Athlon/Opteron, ...