Transcript
Page 1: Is There Anything More to Learn about High Performance Processors?

J. E. Smith

Page 2: The State of the Art

Multiple instructions per cycle
Out-of-order issue
Register renaming
Deep pipelining
Branch prediction
Speculative execution
Cache memories
Multi-threading

Page 3: History Quiz

Superscalar processing was invented by

a) Intel in 1993

b) RISC designers in the late ’80s, early ’90s

c) IBM ACS in late ’60s; Tjaden and Flynn 1970

Page 4: History Quiz

Out-of-order issue was invented by

a) Intel in 1993

b) RISC designers in the late ’80s, early ’90s

c) Thornton/Cray in the 6600, 1963

Page 5: History Quiz

Register renaming was invented by

a) Intel in 1995

b) RISC designers in the late ’80s, early ’90s

c) Tomasulo in late ’60s; also Tjaden and Flynn 1970

What Keller said in 1975:

Page 6: History Quiz

Deep pipelining was invented by

a) Intel in 2001

b) RISC designers in the late ’80s, early ’90s

c) Seymour Cray in 1976

1969: 7600, 12 gates/stage (?)
1976: Cray-1, 8 gates/stage
1985: Cray-2, 4 gates/stage
1991: Cray-3, 6 gates/stage (?)

Page 7: History Quiz

Branch prediction was invented by

a) Intel in 1995

b) RISC designers in the late ’80s, early ’90s

c) Stretch 1959 (static); Livermore S1 (?) 1979, or earlier at IBM (?)

Page 8: History Quiz

Speculative Execution was invented by

a) Intel in 1995

b) RISC designers in the late ’80s, early ’90s

c) CDC 180/990 (?) in 1983

Page 9: History Quiz

Cache memories were invented by

a) Intel in 1985

b) RISC designers in the late ’80s, early ’90s

c) Maurice Wilkes in 1965

Page 10: History Quiz

Multi-threading was invented by

a) Intel in 2001

b) RISC designers in the ’80s

c) Seymour Cray in 1964

Page 11: Summary

Multiple instructions per cycle -- 1969
Out-of-order issue -- 1964
Register renaming -- 1967
Deep pipelining -- 1975
Branch prediction -- 1979
Speculative execution -- 1983
Cache memories -- 1965
Multi-threading -- 1964

All were done as part of a development project and immediately put into practice.

After introduction, only a few remained in common use

Page 12: The 1970s & 80s – Less Complexity

Level of integration wouldn’t support it
• Not because of transistor counts, but because of small replaceable units
Cray went toward simple issue, deep pipelining
Microprocessor development first used high complexity, then drove pipelines deeper
Limits to Wide Issue
Limits to Deep Pipelining

Page 13: Typical Superscalar Performance

Your basic superscalar processor:
• 4-way issue, 32-entry window
• 16K I-cache and D-cache
• 8K gshare branch predictor

Wide performance range
Performance typically much less than peak (4)

[Chart: IPC for each benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr); y-axis 0 to 4]

Page 14: Superscalar Processor Performance

Compare:
• 4-way issue, 32-entry window
• Ideal I-cache, D-cache, branch predictor
• Non-ideal I-cache, D-cache, branch predictor

Peak performance would be achievable IF it weren’t for “bad” events:
• I-cache misses
• D-cache misses
• Branch mispredictions

[Charts: IPC per benchmark (bzip through vpr) for the ideal and non-ideal configurations; y-axis 0 to 4]

Page 15: Performance Model

Consider profile of dynamic instructions issued per cycle:

Background “issue-width” near-peak IPC
• With never-ending series of transient events
Determine performance with ideal caches & predictors, then account for “bad” transient events

[Sketch: IPC vs. time, with transient dips for branch mispredicts, I-cache misses, and a long D-cache miss]
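The recipe above (performance under ideal conditions, minus a penalty for each transient event) can be turned into a first-order calculation. The sketch below is a minimal Python illustration of that idea only; the event rates, penalty values, and function names are invented for the example and are not the model from the talk.

```python
# First-order sketch of the model: cycles for the ideal machine plus a
# penalty per "bad" transient event. All rates and penalties are illustrative.

def modeled_ipc(num_insts, ideal_ipc, events):
    """events: iterable of (event_count, penalty_cycles_per_event) pairs."""
    ideal_cycles = num_insts / ideal_ipc
    penalty_cycles = sum(count * penalty for count, penalty in events)
    return num_insts / (ideal_cycles + penalty_cycles)

n = 1_000_000
events = [
    (n * 0.010, 12),    # branch mispredictions: per-instruction rate, penalty
    (n * 0.005, 10),    # I-cache misses
    (n * 0.002, 200),   # long D-cache misses
]
print(f"modeled IPC: {modeled_ipc(n, ideal_ipc=2.5, events=events):.2f}")
```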

Page 16: Backend: Ideal Conditions

Key Result (Michaud, Seznec, Jourdan):
• Square-root relationship between issue rate and window size
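Stated as a formula, the result is usually summarized as follows, with k a program-dependent constant (the notation is mine, not the paper’s):

```latex
% Hedged paraphrase: sustainable issue rate grows roughly as the square root of window size W
\mathrm{IR}(W) \;\approx\; k\,\sqrt{W}
```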

Page 17: Branch Misprediction Penalty

1) Lost opportunity
• performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty
• the obvious penalty; most people equate this with the whole penalty
3) Window fill penalty
• performance lost due to window startup

[Sketch: timeline of a misprediction showing the lost-opportunity, pipeline re-fill, and window-fill phases]
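Written out, the total penalty is simply the sum of the three components (my notation for the slide’s terms):

```latex
T_{\text{mispredict}} \;=\; T_{\text{lost opportunity}} \;+\; T_{\text{pipeline re-fill}} \;+\; T_{\text{window fill}}
```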

Page 18: Calculate Mispredict Penalty

[Chart: instructions issued per clock cycle over the 31 cycles around a mispredicted branch; y-axis 0 to 4.5]

8.5 insts/4 = 2.1 cp
9 insts/4 = 2.2 cp

19.75 insts/4 = 4.9 cp

Total Penalty = 9.2 cp
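A quick check of the arithmetic: the lost issue slots of each region, divided by the 4-wide issue rate, give that region’s penalty in cycles (cp), and the rounded terms sum to the slide’s total.

```python
# The slide's arithmetic: lost issue slots in each region, divided by the
# 4-wide issue rate, give that region's penalty in cycles (cp).
issue_width = 4
lost_slots = [8.5, 9.0, 19.75]                     # the three regions in the chart

terms = [round(slots / issue_width, 1) for slots in lost_slots]
for slots, cp in zip(lost_slots, terms):
    print(f"{slots} insts/{issue_width} = {cp} cp")
print(f"Total penalty = {sum(terms):.1f} cp")      # 2.1 + 2.2 + 4.9 = 9.2 cp
```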

Page 19: Importance of Branch Prediction

[Chart: instructions between mispredictions (y-axis, 0 to 1800) needed to spend 10-50 percent of time at 3.5, 7, or 14 issues per cycle, for issue widths 4, 8, and 16]

Page 20: Importance of Branch Prediction

Doubling issue width means the predictor has to be four times better for a similar performance profile

Assumes everything else is ideal
• I-caches & D-caches

Research State of the Art: about 5 percent mispredicts on average (perceptron predictor)
=> one misprediction per 100 instructions
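A sanity check on the last line, assuming roughly one conditional branch per five instructions (the branch frequency is my assumption; the slide does not state it):

```latex
0.05\ \tfrac{\text{mispredicts}}{\text{branch}} \times \tfrac{1}{5}\ \tfrac{\text{branches}}{\text{instruction}}
= 0.01\ \tfrac{\text{mispredicts}}{\text{instruction}}
\approx 1 \text{ misprediction per 100 instructions}
```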

Page 21: Next Generation Branch Prediction

Classic Memory/Computation Tradeoff

Conventional Branch Predictors
• Heavy on memory; light on computation

Perceptron Predictor
• Add heavier computation
• Also adds latency to prediction

Future predictors should balance memory, computation, and prediction latency

[Diagrams: predictor block structures taking PC and global history as inputs -- a conventional, memory-heavy predictor vs. a perceptron-style predictor with a heavier computation stage before the prediction]
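For concreteness, here is a minimal sketch of perceptron-style prediction in the spirit of the Jiménez-Lin predictor referenced above. The table size, history length, and synthetic branch stream are illustrative choices, not parameters from the talk.

```python
import random

# Minimal perceptron branch predictor sketch (illustrative parameters).
# Each branch PC indexes a row of signed weights; the prediction is the sign
# of the dot product of those weights with the global branch history.

HIST_LEN = 16                         # global history bits per prediction
NUM_ROWS = 256                        # perceptron table entries
THETA = int(1.93 * HIST_LEN + 14)     # training threshold from the literature

weights = [[0] * (HIST_LEN + 1) for _ in range(NUM_ROWS)]   # index 0 is the bias
history = [1] * HIST_LEN                                    # +1 taken, -1 not taken

def predict(pc):
    w = weights[pc % NUM_ROWS]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y, y >= 0                  # predict taken if the dot product >= 0

def update(pc, y, taken):
    """Train on a misprediction or when the output is not confident."""
    w = weights[pc % NUM_ROWS]
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= THETA:
        w[0] += t
        for i, hi in enumerate(history, start=1):
            w[i] += t * hi
    history.pop(0)
    history.append(t)                 # shift the real outcome into global history

# Tiny usage example on a synthetic, history-correlated branch stream:
random.seed(0)
correct = 0
for n in range(10_000):
    pc = 0x400 + (n % 7) * 4
    taken = (sum(history[-4:]) > 0) != (n % 13 == 0)    # loosely history-dependent
    y, guess = predict(pc)
    correct += int(guess == taken)
    update(pc, y, taken)
print(f"accuracy on the synthetic stream: {correct / 10_000:.2%}")
```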

Page 22: Implication of Deeper Pipelines

Assume 1 misprediction per 96 instructions
Vary fetch/decode/rename section of pipe

Advantage of wide issue diminishes as pipe deepens

This ignores implementation complexity

Graph also ignores longer execution latencies

[Chart: IPC vs. fetch/decode pipe length (0 to 16) for issue widths 2, 4, and 8; y-axis 0 to 5]
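The trend in the chart can be reproduced with a rough back-of-the-envelope model: one misprediction every 96 instructions, each costing a re-fill equal to the front-end depth, so IPC over a misprediction interval is instructions divided by execution-plus-re-fill cycles. This is my simplification, not the simulation behind the figure.

```python
# Toy model: one misprediction every N instructions, each costing a pipeline
# re-fill equal to the front-end (fetch/decode/rename) depth. Window-fill and
# lost-opportunity effects are ignored; caches and everything else are ideal.

def toy_ipc(issue_width, frontend_depth, insts_per_mispredict=96):
    cycles = insts_per_mispredict / issue_width + frontend_depth
    return insts_per_mispredict / cycles

for depth in (2, 8, 16):
    row = ", ".join(f"issue={w}: {toy_ipc(w, depth):.2f}" for w in (2, 4, 8))
    print(f"front-end depth {depth:2d} -> {row}")
```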

Page 23: Deep Pipelining: the Optimality of Eight

Hrishikesh et al.: 8 FO4s
Kunkel et me: 8 gates
Cray-1: 8 4/5-input NANDs

We’re getting there!

Page 24: Deep Pipelining

Consider time per instruction (TPI) versus pipeline depth (Hartstein and Puzak)

The curve is very flat near the optimum

Good engineering
Good sales
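A hedged sketch of the kind of model Hartstein and Puzak analyze: cycle time shrinks as total logic delay over depth plus a fixed latch overhead, while stall cycles per instruction grow with depth, so their product (time per instruction) has a shallow minimum. Every constant below is invented for illustration.

```python
# Toy time-per-instruction (TPI) model in the spirit of Hartstein & Puzak:
# cycle time shrinks with depth (down to the latch overhead) while hazard
# stalls per instruction grow with depth. All constants are illustrative.

def tpi(depth, logic=80.0, latch=2.0, stall_per_stage=0.05):
    cycle_time = logic / depth + latch        # gate delays per cycle
    cpi = 1.0 + stall_per_stage * depth       # stall cycles grow with depth
    return cycle_time * cpi                   # time per instruction

best = min(range(4, 61), key=tpi)
for depth in (8, 16, 24, best, 36, 48):
    print(f"depth {depth:2d}: TPI = {tpi(depth):.2f}")
print(f"minimum near depth {best}; the curve is quite flat around it")
```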

Page 25: Transistor Radios and High MHz

A lesson from transistor radios…

Wonderful new technology in the late ’50s

Clearly, the more transistors, the better the radio!

=> Easy way to improve sales: 6 transistors, 8 transistors, 14 transistors…

Use transistors as diodes…

Lesson: Eventually, people caught on

Page 26: The Optimality of Eight

8 Transistors!

Page 27: So, Processors are Dead for Research?

Of course not

BUT IPC-oriented research may be on life support

Page 28: Consider Car Engine Development

Conclusion: We should be driving cars with 48 cylinders!

[Chart: number of cylinders in car engines vs. year, 1890 to 1935; y-axis 0 to 18]

Don’t focus (obsess) on one aspect of performance

And don’t focus only on performance:
• Power efficiency
• Reliability
• Security
• Design complexity

Page 29: Co-Designed VMs

Move hardware/software boundary
Give “hardware” designer some software in concealed memory
Hardware does what it does best: speed
Software does what it does best: manage complexity

[Diagram: application program and operating system in visible memory; VMM code and data tables in concealed memory; profiling and configuration hardware underneath]

Page 30: Co-Designed VMs: Micro-OS

Manage processor with micro-OS VMM software
• Manage processor resources in an integrated way
• Identify program phase changes
• Save/restore implementation contexts
• A microprocessor-controlled microprocessor

[Diagram: the pipeline surrounded by configurable resources -- I-cache size, D-cache size, instruction window, reorder buffer, branch predictor global history length, D-cache prefetch algorithm, and simultaneous multithreading]
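To make the bullets concrete, here is a minimal sketch of what such a micro-OS resource-management loop might look like. The hardware interfaces (read_counters, apply_config), the phase test, and the policy thresholds are all hypothetical; nothing here comes from the talk.

```python
# Hypothetical micro-OS resource-management loop for a co-designed VM.
# read_counters() and apply_config() stand in for concealed-memory access to
# the profiling and configuration hardware; they are assumptions, not a real API.

from dataclasses import dataclass

@dataclass
class Config:
    icache_kb: int = 16
    dcache_kb: int = 16
    window_entries: int = 32
    smt_threads: int = 1

def phase_changed(prev, cur, threshold=0.25):
    """Crude phase detector: a large relative change in miss or mispredict rates."""
    keys = ("icache_mpki", "dcache_mpki", "mispredicts_pki")
    return any(abs(cur[k] - prev[k]) > threshold * max(prev[k], 1e-6) for k in keys)

def choose_config(counters):
    """Toy policy: grow whichever structure the counters say is the bottleneck."""
    cfg = Config()
    if counters["dcache_mpki"] > 20:
        cfg.dcache_kb = 64
        cfg.window_entries = 128       # room to overlap long misses
    if counters["icache_mpki"] > 5:
        cfg.icache_kb = 32
    return cfg

def micro_os_loop(read_counters, apply_config, intervals):
    prev = read_counters()
    for _ in range(intervals):
        cur = read_counters()
        if phase_changed(prev, cur):
            apply_config(choose_config(cur))   # save/restore of contexts elided
        prev = cur
```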

Page 31: Co-Designed VMs

Other Applications
• Binary Translation (e.g. Transmeta)
  – Enables new ISAs
• Security (Dynamo/RIO)

[Diagram: a conventional-ISA program passes through the VMM, which translates it with dynamic profiling and feeds instruction fetch, rename & steer, multiple integer units, and multiple D-cache units]

Page 32: Speculative Multi-threading

Reasons for skepticism
• Complex
  – Incompatible w/ deep pipelining
  – The devil will be in the details (researcher: 4 instruction types; designer: 100(s) of instruction types)
• High power consumption
• Performance advantages tend to be focused on specific programs (benchmarks)
• Better to push ahead with the real thread

Page 33: The Memory Wall: D-Cache Misses

Divide into:
• Short misses
  – handle like a long-latency functional unit
• Long misses
  – need special treatment

Things that can reduce performance:
1) Structural hazards
  • ROB fills up behind the load and dispatch stalls
  • Window fills with instructions dependent on the load and issue stops
2) Control dependences
  • Mispredicted branch dependent on load data
  • Instructions beyond the branch wasted

Page 34: Structural and Data Blockages

Experiment:
• Window size 32, issue width 4
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Separate window and ROB, 4K entries each
• Simulate single cache miss and see what happens

Page 35: Results

Issue continues at full speed

Typical dependent instructions: about 30

Usually dependent instructions follow load closely

Benchmark   Avg. # insts issued after miss   Avg. # insts in window dep. on load
Bzip2       3950                             17.8
Crafty      3747                             20.1
Eon         3923                             22.4
Gap         3293                             31.6
Gcc         3678                             17.2
Mcf         3502                             96.2
Gzip        3853                             11.5
Parser      3648                             32.6
Perl        3519                             30.3
Twolf       3673                             44.7
Vortex      3606                             7.8
Vpr         2371                             34.0

Page 36: Control Dependences

Non-ideal branch prediction
• How many cache misses lead to a branch mispredict, and when?
• Use 8K gshare

Page 37: Results

• Bimodal behavior; for some programs, branch mispredictions are crucial
• In many cases, 30-40% of cache-miss data leads to a mispredicted branch
• Inhibits ability to overlap data cache misses
• One more reason to worry about branch prediction

Benchmark   Fract. loads driving mispredict   # insts before mispredict
Bzip2       .01                               33.5
Crafty      .30                               20.3
Eon         .18                               30.6
Gap         .33                               27.0
Gcc         .35                               32.4
Mcf         .01                               27.7
Gzip        .44                               32.4
Parser      .08                               35.9
Perl        .40                               30.2
Twolf       .37                               65.6
Vortex      .16                               41.2
Vpr         .47                               31.3

Page 38: Dealing with the Memory Wall

Don’t speculate about it
Run through it

ROB grows as nD
• issue width is n; miss delay D cycles
• Example: miss delay of 200 cycles on a four-issue machine => ROB of about 800 entries

Window grows as dm
• m outstanding misses; d dependent instructions each
• Example: 6 outstanding misses and 30 dependent instructions each => the window should be enlarged by 180 slots
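Both rules are simple products; plugging in the slide’s numbers (the 30-dependent-instructions figure echoes the earlier results table):

```python
# The slide's sizing rules are simple products.
def rob_entries(issue_width_n, miss_delay_d):
    return issue_width_n * miss_delay_d                    # ROB grows as nD

def extra_window_slots(outstanding_misses_m, dependents_per_miss_d):
    return outstanding_misses_m * dependents_per_miss_d    # window grows as dm

print(rob_entries(4, 200))          # four-issue, 200-cycle miss -> 800 entries
print(extra_window_slots(6, 30))    # 6 misses x 30 dependents   -> 180 slots
```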

Page 39: Future High Performance Processors

Fast clock cycle: 8 gates per stage

Less speculation
• Deciding what to take out is more important than new things to put in

Return to Simplicity
• Leave the baroque era behind

ILP less important

Page 40: Research in the deep pipeline domain

When there are 40 gate levels, we can be sloppy about adding gadgets

When there are 8 gate levels, a gadget requiring even one more level slows clock by 12.5%
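The 12.5% figure is just one extra gate level as a fraction of an 8-level stage; latch and clock-skew overhead, which would soften the number somewhat, are ignored here:

```latex
\frac{\Delta t_{\text{clock}}}{t_{\text{clock}}} \;=\; \frac{1}{8} \;=\; 12.5\%
```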

[Diagram: a pipeline stage of logic between latches, with a "Neat Gadget" adding one more level of logic]

To really evaluate the performance impact of adding a gadget, we need a detailed logic design

Future research should be focused on jettisoning gadgets, not adding them

Page 41: Conclusion: Important Research Areas

Processor simplicity
Power efficiency
Security
Reliability
Reduced design times
Systems (on a chip) balancing threads and on-chip RAM
Many very simple processors on a chip
• Look at architecture of Denelcor HEP…

Page 42: Attack of Killer Game Chips

OR: The most important thing I learned at Cray Research
OR: What happened to SSTs?

• It isn’t enough that we can build them
• It isn’t enough that there are interested customers
• Volume rules!

Researchers have made a supercomputer - which is powerful enough to rival the top systems in the world - out of PlayStation 2 components. A US research centre has clustered 70 Sony PlayStation 2 game consoles into a Linux supercomputer that ranks among the 500 most powerful in the world. According to the New York Times, the National Centre for Supercomputing Applications (NCSA) at the University of Illinois assembled the $50,000 (£30,000) machine out of components bought in retail shops. In all, 100 PlayStation 2 consoles were bought but only 70 have been used for this project.

Page 43: Acknowledgements

Performance Model: Tejas Karkhanis

Funding: NSF, SRC, IBM, Intel

Japanese Transistor Radio: Radiophile.com

