Is There Anything More to Learn about High Performance Processors? J. E. Smith


TRANSCRIPT

Slide 1
Is There Anything More to Learn about High Performance Processors?
J. E. Smith, June 2003. Copyright J. E. Smith, 2003.

Slide 2: Underlying Issues
• Power
• Wire delays
• Many available transistors
• Applications: web, databases, entertainment, office, embedded

Slide 3: The State of the Art
• Multiple instructions per cycle
• Out-of-order issue
• Register renaming
• Deep pipelining
• Branch prediction
• Speculative execution
• Cache memories
• Multi-threading

Slide 4: History Quiz
Superscalar processing was invented by:
a) Intel in 1993
b) RISC designers in the late 80s, early 90s
c) IBM ACS in the late 60s; Tjaden and Flynn, 1970

Slide 5: History Quiz
Out-of-order issue was invented by:
a) Intel in 1993
b) RISC designers in the late 80s, early 90s
c) Thornton/Cray in the CDC 6600, 1963

Slide 6: History Quiz
Register renaming was invented by:
a) Intel in 1995
b) RISC designers in the late 80s, early 90s
c) Tomasulo in the late 60s; also Tjaden and Flynn, 1970
What Keller said in 1975:

Slide 7: History Quiz
Deep pipelining was invented by:
a) Intel in 2001
b) RISC designers in the late 80s, early 90s
c) Seymour Cray in 1976
• 1969: CDC 7600, 12 gates/stage (?)
• 1976: Cray-1, 8 gates/stage
• 1985: Cray-2, 4 gates/stage
• 1991: Cray-3, 6 gates/stage (?)

Slide 8: History Quiz
Branch prediction was invented by:
a) Intel in 1995
b) RISC designers in the late 80s, early 90s
c) Stretch, 1959 (static); Livermore S-1 (?), 1979, or earlier at IBM (?)

Slide 9: History Quiz
Speculative execution was invented by:
a) Intel in 1995
b) RISC designers in the late 80s, early 90s
c) CDC 180/990 (?) in 1983
Slide 10: History Quiz
Cache memories were invented by:
a) Intel in 1985
b) RISC designers in the late 80s, early 90s
c) Maurice Wilkes in 1965

Slide 11: History Quiz
Multi-threading was invented by:
a) Intel in 2001
b) RISC designers in the 80s
c) Seymour Cray in 1964

Slide 12: Summary
• Multiple instructions per cycle -- 1969
• Out-of-order issue -- 1964
• Register renaming -- 1967
• Deep pipelining -- 1975
• Branch prediction -- 1979
• Speculative execution -- 1983
• Cache memories -- 1965
• Multi-threading -- 1964
All were done as part of a development project and immediately put into practice. After introduction, only a few remained in common use.

Slide 13: The 1970s & 80s
• Less complexity: the level of integration wouldn't support it; not because of transistor counts, but because of small replaceable units
• Cray went toward simple issue and deep pipelining
• Microprocessor development first used high complexity, then drove pipelines deeper
• Limits to wide issue
• Limits to deep pipelining

Slide 14: Typical Superscalar Performance
Your basic superscalar processor:
• 4-way issue, 32-entry window
• 16K I-cache and D-cache
• 8K gshare branch predictor
Wide performance range; performance is typically much less than the peak of 4.

Slide 15: Superscalar Processor Performance
Compare a 4-way issue, 32-entry window machine with:
• Ideal I-cache, D-cache, branch predictor
• Non-ideal I-cache, D-cache, branch predictor
Peak performance would be achievable IF it weren't for bad events:
• I-cache misses
• D-cache misses
• Branch mispredictions
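The baseline on Slide 14 uses an 8K-entry gshare branch predictor. A minimal sketch of the standard gshare scheme follows; the 8K table size comes from the slide, while the 2-bit counters, XOR indexing, and the example branch address are the textbook formulation, not details taken from the talk.

```python
# Minimal gshare branch predictor sketch. Table size (8K entries) matches
# Slide 14; everything else is the standard gshare scheme (assumption).
TABLE_SIZE = 8192           # 8K two-bit saturating counters
HIST_BITS = 13              # log2(TABLE_SIZE) bits of global branch history

class Gshare:
    def __init__(self):
        self.table = [2] * TABLE_SIZE   # init counters to "weakly taken"
        self.history = 0                # global taken/not-taken history

    def _index(self, pc):
        # XOR the branch address with global history to spread out aliasing.
        return (pc ^ self.history) % TABLE_SIZE

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # counter >= 2 means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# Usage: an always-taken branch at a hypothetical pc of 0x40. Since the
# counters start weakly taken, every prediction here is already correct.
p = Gshare()
hits = 0
for _ in range(20):
    hits += p.predict(0x40)     # True counts as 1
    p.update(0x40, taken=True)
```

The XOR of program counter and history is the defining feature of gshare: it lets one physical table distinguish different dynamic contexts of the same static branch.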
Slide 16: Performance Model
Consider the profile of dynamic instructions issued per cycle: a background "issue-width" near-peak IPC, punctuated by a never-ending series of transient events. Determine performance with ideal caches and predictors, then account for the bad transient events.
[Figure: IPC over time, with dips for branch mispredicts, I-cache misses, and long D-cache misses]

Slide 17: Backend: Ideal Conditions
Key result (Michaud, Seznec, Jourdan): a square-root relationship between Issue rate and Window size.

Slide 18: Branch Misprediction Penalty
1) Lost opportunity: performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty: the obvious penalty; most people equate this with the whole penalty
3) Window-fill penalty: performance lost due to window startup

Slide 19: Calculate Mispredict Penalty
• 8.5 insts / 4 = 2.1 cp
• 9 insts / 4 = 2.2 cp
• 19.75 insts / 4 = 4.9 cp
• Total penalty = 9.2 cp

Slide 20: Importance of Branch Prediction

Slide 21: Importance of Branch Prediction
• Doubling the issue width means the predictor has to be four times better for a similar performance profile (assumes everything else, I-caches and D-caches, is ideal)
• Research state of the art: about 5 percent mispredicts on average (perceptron predictor), i.e. roughly one misprediction per 100 instructions

Slide 22: Next Generation Branch Prediction
• Classic memory/computation tradeoff
• Conventional branch predictors: heavy on memory, light on computation
• Perceptron predictor: adds heavier computation, but also adds latency to the prediction
• Future predictors should balance memory, computation, and prediction latency
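Slide 19's penalty arithmetic on a 4-wide machine can be checked directly. The three instruction counts are the slide's; truncating each quotient to one decimal is an assumption made here so the components total the slide's 9.2 cycles.

```python
import math

# Slide 19's misprediction penalty on a 4-wide machine: each component is
# a count of lost instruction slots divided by the issue width.
ISSUE_WIDTH = 4
components = {
    "lost opportunity": 8.5,    # soon-to-be-flushed instructions issued
    "pipeline re-fill": 9.0,    # the obvious re-fill penalty
    "window fill":      19.75,  # window startup after the flush
}

def cycles(insts):
    # Truncate to one decimal place, matching the slide's figures (assumption).
    return math.floor(insts / ISSUE_WIDTH * 10) / 10

penalties = {k: cycles(v) for k, v in components.items()}
total = sum(penalties.values())   # 2.1 + 2.2 + 4.9 = 9.2 cycles
```

The point of the breakdown is that the "obvious" re-fill component (2.2 cycles) is only about a quarter of the real 9.2-cycle penalty; the window-fill term dominates.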
Slide 23: Implication of Deeper Pipelines
• Assume 1 misprediction per 96 instructions
• Vary the fetch/decode/rename section of the pipe
• The advantage of wide issue diminishes as the pipe deepens
• This ignores implementation complexity; the graph also ignores longer execution latencies

Slide 24: Deep Pipelining: the Optimality of Eight
• Hrishikesh et al.: 8 FO4s per stage
• Kunkel and me: 8 gates per stage
• Cray-1: 8 (4/5-input) NANDs per stage
We're getting there!

Slide 25: Deep Pipelining
• Consider time per instruction (TPI) versus pipeline depth (Hartstein and Puzak)
• The curve is very flat near the optimum
• Good engineering ≠ good sales

Slide 26: Transistor Radios and High MHz
A lesson from transistor radios:
• Wonderful new technology in the late 50s
• Clearly, the more transistors, the better the radio! An easy way to improve sales: 6 transistors, 8 transistors, 14 transistors, even using transistors as diodes
• Lesson: eventually, people caught on

Slide 27: The Optimality of Eight
8 transistors!

Slide 28: So, Processors are Dead for Research?
Of course not. BUT IPC-oriented research may be on life support.

Slide 29: Consider Car Engine Development
Conclusion: we should be driving cars with 48 cylinders!
• Don't focus (obsess) on one aspect of performance
• And don't focus only on performance: power efficiency, reliability, security, design complexity

Slide 30: Co-Designed VMs
• Move the hardware/software boundary: give the hardware designer some software in concealed memory
• Hardware does what it does best: speed
• Software does what it does best: manage complexity
[Diagram: application program and operating system in visible memory; VMM and data tables in concealed memory; profiling HW and configuration HW beneath]
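Slide 25's flat-optimum claim can be illustrated with a toy TPI model. This is a simplified stand-in for Hartstein and Puzak's analysis, and every constant below is an illustrative assumption, not a figure from the talk.

```python
# Toy model of time per instruction (TPI) vs. pipeline depth, in the spirit
# of Hartstein and Puzak. All constants are illustrative assumptions.
T_LOGIC = 256.0    # total logic depth of the pipeline, in gate delays
T_LATCH = 3.0      # latch overhead added to every stage, in gate delays
BASE_CPI = 1.0     # cycles per instruction with no hazards
HAZARDS = 0.01     # pipeline-refilling events per instruction

def tpi(depth):
    cycle_time = T_LOGIC / depth + T_LATCH   # deeper pipe -> faster clock
    refill = HAZARDS * depth                 # refill penalty grows with depth
    return cycle_time * (BASE_CPI + refill)  # time units per instruction

best = min(range(2, 200), key=tpi)
# Near the optimum the curve is very flat: halving or doubling the depth
# costs only a modest TPI increase, which is exactly Slide 25's point about
# engineering vs. sales.
```

Because the clock-speed term falls as 1/depth while the hazard term rises linearly, the product has a broad, shallow minimum; a marketing-driven deeper pipeline loses little TPI even while its MHz number climbs.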
Slide 31: Co-Designed VMs: Micro-OS
• Manage the processor with micro-OS (VMM) software: manage processor resources in an integrated way, identify program phase changes, save/restore implementation contexts
• A microprocessor-controlled microprocessor:
  • Configurable I-cache size
  • Simultaneous multithreading
  • Variable branch predictor global history
  • Configurable instruction window
  • Configurable D-cache size
  • Variable D-cache prefetch algorithm
  • Configurable reorder buffer and pipeline

Slide 32: Co-Designed VMs: Other Applications
• Binary translation (e.g. Transmeta): enables new ISAs
• Security (Dynamo/RIO)

Slide 33: Speculative Multi-threading
Reasons for skepticism:
• Complex; incompatible with deep pipelining
• The devil will be in the details (researcher: 4 instruction types; designer: 100s of instruction types)
• High power consumption
• Performance advantages tend to be focused on specific programs (benchmarks)
• Better to push ahead with the real thread

Slide 34: The Memory Wall: D-Cache Misses
Divide misses into:
• Short misses: handle like a long-latency functional unit
• Long misses: need special treatment
Things that can reduce performance:
1) Structural hazards: the ROB fills up behind the load and dispatch stalls; the window fills with instructions dependent on the load and issue stops
2) Control dependences: a mispredicted branch depends on the load data; instructions beyond the branch are wasted

Slide 35: Structural and Data Blockages
Experiment:
• Window size 32, issue width 4
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Separate window and ROB, 4K entries each
Simulate a single cache miss and see what happens.

Slide 36: Results
• Issue continues at full speed
• Typical dependent instructions: about 30
• Usually the dependent instructions follow the load closely

Benchmark   Avg. # insts issued after miss   Avg. # insts in window dep. on load
Bzip2       3950                             17.8
Crafty      3747                             20.1
Eon         3923                             22.4
Gap         3293                             31.6
Gcc         3678                             17.2
Mcf         3502                             96.2
Gzip        3853                             11.5
Parser      3648                             32.6
Perl        3519                             30.3
Twolf       3673                             44.7
Vortex      3606                              7.8
Vpr         2371                             34.0

Slide 37: Control Dependences
• Non-ideal branch prediction: how many cache misses lead to a branch mispredict, and when?
• Use an 8K gshare predictor

Slide 38: Results
• Bimodal behavior; for some programs, branch mispredictions are crucial
• In many cases, 30-40% of cache-miss data leads to a mispredicted branch
• This inhibits the ability to overlap data cache misses
• One more reason to worry about branch prediction
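Slide 36's "instructions in the window dependent on the load" can be counted as reachability in the window's register dataflow graph. A minimal sketch follows; the (dest, sources) instruction encoding and the example window are invented for illustration and are not the talk's simulator.

```python
# Count instructions in the issue window that depend, directly or
# transitively, on a missing load, in the spirit of the Slide 35/36
# experiment. Encoding and example window are invented for illustration.

def count_load_dependent(window, load_dest):
    """window: list of (dest_reg, source_regs) in program order after the load."""
    tainted = {load_dest}            # registers carrying the miss result
    dependent = 0
    for dest, srcs in window:
        if any(s in tainted for s in srcs):
            dependent += 1           # consumes the load result (transitively)
            tainted.add(dest)
        elif dest in tainted:
            tainted.discard(dest)    # overwritten by an independent result
    return dependent

# A load writes r1; two of the five younger instructions depend on the miss.
window = [
    ("r2", ["r1"]),    # direct consumer
    ("r3", ["r2"]),    # transitive consumer
    ("r4", ["r5"]),    # independent
    ("r1", ["r5"]),    # overwrites r1 with an independent result
    ("r6", ["r1"]),    # uses the NEW r1, so independent of the miss
]
print(count_load_dependent(window, "r1"))  # prints 2
```

The low counts in the table (typically about 30 dependent instructions out of roughly 3500 issued after the miss) are the slide's point: almost everything in a large window is independent of the missing load, so issue can continue at full speed.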