hy425 lecture 08: limits of ilp - university of...
TRANSCRIPT
RecapLimits of ILP
Case study: Intel NetburstConclusion
HY425 Lecture 08: Limits of ILP
Dimitrios S. Nikolopoulos
University of Crete and FORTH-ICS
October 31, 2011
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 1 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
ILP techniquesHardware
I Dynamic scheduling with scoreboardI Dynamic scheduling with renaming
I Tomasulo, renaming registersI Branch predictionI Multiple issueI Speculation
Software
I Instruction schedulingI Code transformations (topic of next lecture)
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 3 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
What limits ILP
Software and hardware issuesI Limits of parallelism in programs
I Data flow – true data dependenciesI Control flow – control dependenciesI Code generation, scheduling by compiler
I Hardware complexityI Large storage structures – branch prediction, ROB, windowI Complex logic – dependence control, associative searchesI Higher bandwidth – multiple issue, multiple outstanding
instructionsI Long latencies – memory system (caches, DRAM)
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 4 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
What is the upper bound of ILP in programs?
Roofline model of performance analysis
I Study of maximum ILP in programsI Difficult question dependent on hardware and compiler
technologyI Different conclusions with different assumptions
I Optimistic (unrealistic) assumptions of hardwareI Unlimited storage resources for tables, etc.I Perfect branch prediction and speculationI Others . . .
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 6 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
David Wall (DEC, WRL Technical Report, 1993)
Assumptions
I Infinite virtual registers for renaming availableI Branch prediction is perfectI All branch targets are perfectly predicted
I No control dependenciesI All memory address known
I Can move loads before prior to unrelated storesI No dependencies other than true data dependencies
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 7 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
David Wall (DEC, WRL Technical Report, 1993)
Assumptions
I Unlimited instruction window sizeI Unlimited number of functional unitsI All functional units compute in one cycleI Perfect caches
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 8 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
David Wall (DEC, WRL Technical Report, 1993)
Comparison to a realistic processor – Alpha 21264
I Four-way instruction issueI 80 renaming registersI Branch predictor:
I 1024-branch history, 2 × 8K branch patterns
Methodology
I Collect trace of instructions and memory referencesI Schedule each instruction “by hand” as early as possible
I Wait until data dependence resolved
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 9 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
David Wall (DEC, WRL Technical Report, 1993)Theoretical maximum ILP
I gcc, espresso, li integer programsI fpppp, doduc, tomcatv floating point programs
55
63
18
75
119
150
0 20 40 60 80 100 120 140 160
gcc
espresso
li
fpppp
doduc
tomcatv
Instruction issues per cycle
Maximum ILP
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 10 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Limiting the instruction window size
Perfect processor
I Can look arbitrarily far ahead to fetch instructionsI Can rename output registers for all instructions that can
issueI Can determine data dependencies for all instructions
I O(n2) for n instructionsI Provide functional units for all instructions
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 11 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Limiting the instruction window size
Instruction window
I Group of instructions examined for simultaneous executionI WS × IW × RPI comparators needed
I WS: window sizeI IW : issue widthI RPI: registers per instruction to check
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 12 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Limiting the instruction window sizeWindow size 32 –∞
I 32–128 realistic values for modern processors
8
6
9
14
9
14
10
13
11
35
15
34
10
15
12
49
16
45
59
60
55
63
18
75
119
150
0 50 100 150 200
gcc
espresso
li
fpppp
doduc
tomcatv
Instruction issues per cycle
Impact of window size
infinite 2048 512 128 32
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 13 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Realistic branch predictorTournament predictor
I 2-bit correlating, 2-bit non-correlating, 2-bit selectorsI 8192 branches, 2 predictors, 1 selector per branchI Correlating predictor indexed with PC XORed with historyI Non-correlating predictor indexed with PCI Average accuracy 97% in SPEC
Alternatives
I 2-bit predictor with 512 entries, 16-entry return addresstable
I Static predictor using profile of applicationI No prediction
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 14 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Realistic branch predictorImpact of static vs. dynamic prediction
I 2048-instruction window, 64-way issue, 0-cycle mispredicted branchpenalty
2
2
2
29
4
19
6
6
7
45
14
45
6
7
6
46
13
45
9
12
10
48
15
46
35
41
16
61
58
60
0 10 20 30 40 50 60 70
gcc
espresso
li
fpppp
doduc
tomcatv
Instruction issues per cycle
Impact of branch predictor
Perfect Tournament predictor Standard 2-bit profile-based None
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 15 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Realistic branch predictor
Impact on INT versus FP programs
1
5
14
12
14
12
1
16
18
23
18
30
0
3
2
2
4
6
0 5 10 15 20 25 30 35
tomcatv
doduc
fpppp
li
espresso
gcc
percentage of branches mispredicted
Branch midprediction accuracy
tournament 2-bit counter profile-based
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 16 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Number of renaming registers32 –∞ renaming registers
I 2048-instruction window, 64-way issueI 8K-entry tournament predictor
5
5
4
5
4
4
7
5
5
6
5
5
28
11
20
11
10
9
44
15
35
12
13
10
45
16
49
12
15
10
54
29
59
12
15
11
0 10 20 30 40 50 60 70
tomcatv
doduc
fpppp
li
espresso
gcc
Instruction issues per cycle
Impact of renaming registers
Infinite 256 INT + 256 FP 128 INT + 128 FP 64 INT + 64 FP 32 INT + 32 FP None
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 17 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Alias analysis
Alternatives for static and dynamic alias analysis
I Impossible to disambiguate all references at compile timeI Compiler can inspect static data in global segment and
stacks (known locations)I Hard to inspect data in heap (dynamic allocation, pointers)
I Unbounded number of comparisons needed at runtimeI Three options
I Perfect disambiguation of global and stack data (perfectcompiler)
I Inspection (disambiguate based on registers pointing tomemory)
I None (no disambiguation)
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 18 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Impact of alias analysis
Alias alternativesI 2048-instruction window, 64-way issue, 8K-entry tournament predictor.
256 INT, 256 FP renaming registers
4
4
3
3
5
3
5
6
4
4
5
4
45
16
49
9
7
7
45
16
49
12
15
10
0 10 20 30 40 50 60
tomcatv
doduc
fpppp
li
espresso
gcc
Instruction issues per cycle
Impact of memory disambiguation scheme
perfect global/stack perfect inspection none
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 19 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Impact of alias analysis
Implementing memory disambiguation
I Need to know effective addresses of all earlier storesI Otherwise:
I In-order address calculationI Effective address speculation
Disambiguation with speculation
I Load assumes no dependence or uses dependencepredictor
I Stores check for dependence violations upon commitI Undo and restart mechanism used upon mis-speculation
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 20 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Going beyond the limits
Advanced hardware techniques for ILP
I Memory WAW and WAR hazardsI May happen across procedure calls
I Unnecessary dependencies imposed by softwareI E.g. incrementing the loop induction variable
I Predictable data flowI Value prediction
I Prediction of addresses for memory disambiguationI Prediction of values
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 21 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Other considerations for ILP
Clock rate vs. issue width
I 1994 HP PA 7100 @ 99 MHz 2-issue faster than TISuperSPARC 3-issue @ 60 MHz
I Focus on CPI may trade with long cycle time
Amdahl’s Law
I Single improvement may not improve performanceI Resources should scale proportionally
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 22 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Other considerations for ILP (cont.)
Control flow
I Branches more predictable in FP than INT codesI FP programs have simpler control paths than INT
programs
Parallelism beyond basic blocks
I Multi-program and multi-threaded parallelismI Limited ILP in a single program motivates simpler using
many processors to run many programsI Multiple simpler processors an attractive alternative for
servers
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 23 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Other considerations for ILP (cont.)
Clock speed
I Increased wire delays prevent increasing clock speedI Pipeline deepening may:
I Enable higher clock frequencyI Increase stall cycles and demand for ILPI Require multiple memory accesses, branch predictions,
register file accesses per cycleI Challenging at GHz clock rates
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 24 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Other considerations for ILP (cont.)
Power issuesI Increasing complexity increases power consumption
I More transistors switchingI Increasing complexity and frequency will increase power
exponentiallyI More transistors switching at a higher clock rate
I Increasing complexity linearly does not increaseperformance linearly
I 3-issue Intel Pentium barely gets CPI < 1.0I More switching transistors per unit of performance
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 25 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Instruction fetch and decode
CISC translated to RISC
I IA-32 CISC instruction set decoded to microops (uops)I uops scheduled in dynamically scheduled speculative
pipelineI Trace cache:
I Frequently executed instruction sequences, includingnon-adjacent sequences
I Sequences including multiple branchesI Up to 6 uops (three IA-32 instructions) decoded and
translated per cycleI Low miss rate (0.15% for SPEC CPUINT2000)
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 27 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Speculative pipeline
Dynamic scheduling
I Out-of-order execution pipelineI Register renaming for up to 3 uops per cycleI Commit up to 3 uops per cycleI Six dispatch ports to functional units
I Frequently executed instruction sequences, includingnon-adjacent sequences
I Sequences including multiple branches
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 28 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Microarchitecture of Pentium 4
Pentium 4 640
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 29 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Pentium 4 640 implementationMicroarchitecture quantitative characteristics
I Deep pipelines (21+ stages in various Pentium 4 generations)I Minimum 31 cycles from fetch to commit
Feature Size CommentsFront-end branch target 4K entries Predicts the next IA-32 instruction to fetch; used only when thebuffer execution trace cache misses.Execution trace cache 12K uops Trace cache used for uops.Trace cache branch- 2K entries Predicts the next uop.target bufferRegisters for renaming 128 total 128 uops can be in execution with up to 48 loads and 32 stores.Functional units 7 total; 2 simple ALU, The simple ALU units run at twice the clock rate, accepting up
complex ALU, load, store, to two simple ALU uops every clock cycle. This allowsFP move, FP arithmetic execution of two dependent ALU operation in a single clock
cycle.L1 data cache 16KB; 8-way associative; Integer load to use latency is 4 cycles; FP load to use latency is
64-byte blocks 12 cycles; up to 8 outstanding load misses.write-through
L2 cache 2 MB; 8-way associative; 256 bits to L1, providing 108 GB/sec; 18-cycle access time; 64128-byte blocks bits to memory, capable of 6.4 GB/sec. A miss in L2 does notwrite back cause an automatic update of L1.
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 30 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Performance of Pentium 4
Memory latency
I Performance critically dependent on memory systemI Memory (DRAM) latency upward of 100 cycles
I Fastest memories as of 2006, 3.2 GHz clockI 2 or 3 levels of caches common in modern high-end
processors
Branch prediction
I Trace cache and branch predictionI Misprediction rateI Percentage of instructions misspeculated
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 31 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Branch prediction on Pentium 4
Misprediction rate 8× higher in INT vs. FP programs
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 32 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Branch prediction on Pentium 4
Misspeculated instruction rate follows misprediction rate
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 33 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Data cache performance on Pentium 4
Multi-level cache hierarchy
I L2 cache miss penalty approx. 10× L1 cache miss penaltyI Hard to overlap stalls due to L2 cache misses
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 34 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
CPI on Pentium 4Is there any ILP to begin with?
I Translation of IA-32 instructions to uops increases CPI by1.29× (1.29 uops per instruction)
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 35 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
CPI on Pentium 4
Is there any ILP to begin with?
I When does the processor get > 1 uop per cycle?
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 36 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Pentium 4 @ 3.2 GHz vs. AMD Opteron @ 2.6 GHz
Putting it all together
I Can a processor with a lower clock frequency outperform aprocessor with a higher clock frequency?
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 37 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Pentium 4 @ 3.2 GHz vs. AMD Opteron @ 2.6 GHzPutting it all together
I Pentium CPI = 1.27 × AMD CPI
I Pentium clock freq. = 1.23 × AMD clock freq.
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 38 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Pentium 4 @ 3.2 GHz vs. AMD Opteron @ 2.6 GHz
Puttint it all together
I Can a processor with a lower clock frequency outperform aprocessor with a higher clock frequency?
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 39 / 41
RecapLimits of ILP
Case study: Intel NetburstConclusion
Conclusions
How do we compare processors?
I Lower CPI does not necessarily yield faster processorsI Processors with higher clock frequencies are not
necessarily fasterI Instruction-level parallelism faces various limitations
(walls):I Power wall – exponentially increasing complexityI Memory wall – non-overlapped memory latency
I What are we looking at next?I Software support for exploiting ILPI Designing more effective memory systems (caches, DRAM)I Looking in other sources of parallelism (threads, processes)
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 41 / 41