
Page 1: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture

Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture

Professor Alvin R. Lebeck

Compsci 220 / ECE 252

Fall 2004

Page 2: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture

CompSci 220 / ECE 252, © 2004 Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti, Katz

Administrative

• Read Chapter 3, Wulf, Transmeta

• Homework #1 due September 7

– SimpleScalar; read some of the documentation first

– See web page for details

– Questions, contact Shobana ([email protected])

Page 3: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Review: Trends

• Technology trends are one driving force in architectural innovation

• Moore’s Law

• Chip Area Reachable in one clock

• Power Density

Page 4: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


The Danger of Extrapolation

• Dot-com stock value

• Technology Trends

• Power dissipation?

• Cost of new fabs?

• Alternative technologies?

– Carbon Nanotubes

– Optical

Page 5: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Amdahl’s Law

ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new

                = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
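To make the formula concrete, here is a minimal Python sketch (the function name and example numbers are illustrative, not from the slides):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Overall speedup when only part of the execution time is enhanced.
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # Example: 40% of execution time is enhanced and runs 10x faster.
    print(amdahl_speedup(0.4, 10))   # ~1.56 overall, far below 10x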

Page 6: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Review: Performance

CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

CPU time = Cycle Time × Σ (i = 1 to n) CPI_i × I_i

CPI = Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × F_i, where F_i = I_i / Instruction Count

CPI is the “Average Cycles Per Instruction”; F_i is the “Instruction Frequency” of class i

CPU time = (Instruction Count × CPI) / Clock Rate

Invest Resources where time is Spent!
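As a concrete illustration of the weighted CPI sum, here is a small Python sketch; the instruction mix is the base-machine example from the next slide, while the clock rate and instruction count are made-up placeholder values:

    # CPI = sum over instruction classes i of CPI_i * F_i
    mix = {               # class: (frequency F_i, cycles CPI_i)
        "ALU":    (0.50, 1),
        "Load":   (0.20, 2),
        "Store":  (0.10, 2),
        "Branch": (0.20, 2),
    }
    cpi = sum(freq * cycles for freq, cycles in mix.values())
    print(cpi)                      # 1.5 cycles per instruction

    clock_rate  = 1e9               # 1 GHz, placeholder
    instr_count = 1e8               # 100 M dynamic instructions, placeholder
    print(instr_count * cpi / clock_rate)   # CPU time = 0.15 seconds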

Page 7: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Example

Base Machine (Reg/Reg), typical mix:

Op       Freq   Cycles
ALU      50%    1
Load     20%    2
Store    10%    2
Branch   20%    2

Add register/memory operations:

– One source operand in memory

– One source operand in register

– Cycle count of 2

Branch cycle count increases to 3.

What fraction of the loads must be eliminated for this to pay off?

Page 8: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op        Old Freq   Cycles   CPI      New Freq   Cycles   CPI
ALU       .50        1        .50      .5 – X     1        .5 – X
Load      .20        2        .40      .2 – X     2        .4 – 2X
Store     .10        2        .20      .1         2        .2
Branch    .20        2        .40      .2         3        .6
Reg/Mem   –          –        –        X          2        2X
Total     1.00                1.50     1 – X               1.7 – X

CPI_New must be normalized to the new instruction count:

CPI_New = Cycles_New / Instructions_New = (1.7 – X) / (1 – X)

Page 9: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op        Old Freq   Cycles   CPI      New Freq   Cycles   CPI
ALU       .50        1        .50      .5 – X     1        .5 – X
Load      .20        2        .40      .2 – X     2        .4 – 2X
Store     .10        2        .20      .1         2        .2
Branch    .20        2        .40      .2         3        .6
Reg/Mem   –          –        –        X          2        2X
Total     1.00                1.50     1 – X               1.7 – X

CPI_New = (1.7 – X) / (1 – X)

Instr Cnt_Old × CPI_Old × Clock_Old = Instr Cnt_New × CPI_New × Clock_New

1.00 × 1.5 = (1 – X) × (1.7 – X) / (1 – X)

1.5 = 1.7 – X

X = 0.2

ALL loads must be eliminated for this to be a win!
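A quick numerical check of the breakeven point (a sketch; the function and variable names are mine):

    # New execution time per original instruction, in cycles, as a function of X,
    # the fraction of instructions converted to Reg/Mem ops (clock is unchanged).
    def new_cycles(x):
        return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2   # = 1.7 - x

    old_cycles = 1.00 * 1.5
    for x in (0.0, 0.1, 0.2):
        print(x, new_cycles(x))   # matches the old 1.5 cycles only at x = 0.2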

Page 10: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Actually Measuring Performance

• how are execution time & CPI actually measured?

– execution time: time (Unix cmd): wall-clock, CPU, system

– CPI = (CPU time × clock frequency) / # instructions

– more useful? CPI breakdown (compute, memory stall, etc.)

– so we know what the performance problems are (what to fix)

• measuring CPI breakdown

– hardware event counters (PentiumPro, Alpha DCPI)

» calculate CPI using instruction frequencies/event costs

– cycle-level microarchitecture simulator (e.g., SimpleScalar)

» measure exactly what you want

» model microarchitecture faithfully (at least parts of interest)

» method of choice for many architects (yours, too!)
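For instance, turning a CPU-time measurement into a CPI estimate is a one-line calculation (all numbers below are made-up placeholders):

    cpu_time_s   = 12.4      # CPU seconds reported by the Unix `time` command
    clock_hz     = 2.0e9     # processor clock frequency (assumed)
    instructions = 1.8e10    # dynamic instruction count (from a counter or simulator)

    cpi = cpu_time_s * clock_hz / instructions
    print(round(cpi, 2))     # 1.38 with these placeholder numbers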

Page 11: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Benchmarks and Benchmarking

• “program” as unit of work

– millions of them, many different kinds, which to use?

• benchmarks

– standard programs for measuring/comparing performance

– represent programs people care about

– repeatable!!

– benchmarking process

» define workload

» extract benchmarks from workload

» execute benchmarks on candidate machines

» project performance on new machine

» run workload on new machine and compare

» not close enough -> repeat

Page 12: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Benchmarks: Instruction Mixes

• instruction mix: instruction type frequencies

• ignores dependences

• ok for non-pipelined, scalar processor without caches

– the way all processors used to be

– example: Gibson Mix, developed in the 1950s at IBM

– load/store: 31%, branches: 17%

– compare: 4%, shift: 4%, logical: 2%

– fixed add/sub: 6%, float add/sub: 7%

– float mult: 4%, float div: 2%, fixed mul: 1%, fixed div: <1%

– qualitatively, these numbers are still useful today!

Page 13: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Benchmarks: Toys, Kernels, Synthetics

• toy benchmarks: little programs that no one really runs

– e.g., fibonacci, 8 queens

– little value, what real programs do these represent?

– scary fact: used to prove the value of RISC in early 80’s

• kernels: important (frequently executed) pieces of real programs

– e.g., Livermore loops, Linpack (inner product)

– good for focusing on individual features not big picture

– over-emphasize target feature (for better or worse)

• synthetic benchmarks: programs made up for benchmarking

– e.g., Whetstone, Dhrystone

– toy kernels++, which programs do these represent?

Page 14: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Benchmarks: Real Programs

real programs

• only accurate way to characterize performance

• requires considerable work (porting)

Standard Performance Evaluation Corporation (SPEC)

– http://www.spec.org

– collects, standardizes and distributes benchmark suites

– consortium made up of industry leaders

– SPEC CPU (CPU intensive benchmarks)

» SPEC89, SPEC92, SPEC95, SPEC2000

– other benchmark suites

» SPECjvm, SPECmail, SPECweb

Other benchmark suite examples: TPC-C, TPC-H for databases

Page 15: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


SPEC CPU2000

• 12 integer programs (C, C++)

gcc (compiler), perl (interpreter), vortex (database)

bzip2, gzip (replace compress), crafty (chess, replaces go)

eon (rendering), gap (group theoretic enumerations)

twolf, vpr (FPGA place and route)

parser (grammar checker), mcf (network optimization)

• 14 floating point programs (C, FORTRAN)

swim (shallow water model), mgrid (multigrid field solver)

applu (partial diffeq’s), apsi (air pollution simulation)

wupwise (quantum chromodynamics), mesa (OpenGL library)

art (neural network image recognition), equake (wave propagation)

fma3d (crash simulation), sixtrack (accelerator design)

lucas (primality testing), galgel (fluid dynamics), ammp (chemistry)

Page 16: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Benchmarking Pitfalls

• benchmark properties mismatch with features studied

– e.g., using SPEC for large cache studies

• careless scaling

– using only first few million instructions (initialization phase)

– reducing program data size

• choosing performance from wrong application space

– e.g., in a realtime environment, choosing troff

– others: SPECweb, TPC-W (amazon.com)

• using old benchmarks

– “benchmark specials”: benchmark-specific optimizations

• benchmarks must be continuously maintained and updated!

Page 17: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Common Benchmarking Mistakes

• Not validating measurements

• Collecting too much data but doing too little analysis

• Only average behavior represented in test workload

• Loading level (other users) controlled inappropriately

• Caching effects ignored

• Buffer sizes not appropriate

• Inaccuracies due to sampling ignored

• Ignoring monitoring overhead

• Not ensuring same initial conditions

• Not measuring transient (cold start) performance

• Using device utilizations for performance comparisons

Page 18: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Reporting Average Performance

• averages: one of the things architects frequently get wrong

– pay attention now and you won’t get them wrong on exams

• important things about averages (i.e., means)

– ideally proportional to execution time (ultimate metric)

» Arithmetic Mean (AM) for times

» Harmonic Mean (HM) for rates (IPCs)

» Geometric Mean (GM) for ratios (speedups)

– there is no such thing as the average program

– use average when absolutely necessary

Page 19: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


What Does the Mean Mean?

• Arithmetic mean (AM): (weighted arithmetic mean) tracks execution time: Σ(Time_i)/N or Σ(W_i × Time_i)

• Harmonic mean (HM): (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: N / Σ(1/Rate_i) or 1 / Σ(W_i / Rate_i)

– Arithmetic mean cannot be used for rates (e.g., IPC)

– 30 MPH for 1 mile + 90 MPH for 1 mile != avg 60 MPH

• Geometric mean (GM): average speedup of N programs = (∏ speedup_i)^(1/N), the Nth root of the product of the speedups
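A small sketch of the three means (the function names are mine, not standard library calls):

    from math import prod

    def am(xs): return sum(xs) / len(xs)                  # for times
    def hm(xs): return len(xs) / sum(1 / x for x in xs)   # for rates (IPC, MFLOPS)
    def gm(xs): return prod(xs) ** (1 / len(xs))          # for ratios (speedups)

    # 30 MPH for 1 mile, then 90 MPH for 1 mile: equal distances, so the
    # harmonic mean gives the true average speed, not the arithmetic mean.
    print(am([30, 90]))   # 60.0  (not the real average speed)
    print(hm([30, 90]))   # 45.0  (2 miles in 1/30 + 1/90 hours)
    print(gm([2, 8]))     # 4.0   (average of 2x and 8x speedups)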

Page 20: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Geometric Mean Weirdness

• What about averaging ratios (speedups)?

– HM / AM change depending on which machine is the base

             Machine A   Machine B    B/A     A/B
Program 1        1           10        10     0.1
Program 2     1000          100       0.1      10

AM   (10 + 0.1)/2 = 5.05              (0.1 + 10)/2 = 5.05
     “B is 5.05 faster!”              “A is 5.05 faster!”

HM   2/(1/10 + 1/0.1) ≈ 0.198         2/(1/0.1 + 1/10) ≈ 0.198
     “B is 5.05 faster!”              “A is 5.05 faster!”

GM   sqrt(10 × 0.1) = 1               sqrt(0.1 × 10) = 1

• geometric mean of ratios is not proportional to total time!

• if we take total execution time, B is 9.1 times faster

• GM says they are equal
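A short sketch reproducing the table's conclusion, assuming the entries are execution times:

    from math import prod

    time_a = [1, 1000]    # Program 1, Program 2 on machine A
    time_b = [10, 100]    # Program 1, Program 2 on machine B

    speedup_b_over_a = [a / b for a, b in zip(time_a, time_b)]   # [0.1, 10.0]
    print(prod(speedup_b_over_a) ** 0.5)   # 1.0: GM says the machines are equal
    print(sum(time_a) / sum(time_b))       # 9.1: by total time, B is 9.1x faster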

Page 21: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Little’s Law

• Key Relationship between latency and bandwidth:

• Average number in system = arrival rate * mean holding time

• Example:

– How big a wine cellar should we build?

– We drink (and buy) an average of 4 bottles per week

– On average, I want to age my wine 5 years

– bottles in cellar = 4 bottles/week * 52 weeks/year * 5 years

– = 1040 bottles
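The wine-cellar arithmetic as a tiny sketch (the function name is mine):

    def average_in_system(arrival_rate, mean_holding_time):
        # Little's Law: average number in system = arrival rate * mean holding time
        return arrival_rate * mean_holding_time

    # 4 bottles/week * 52 weeks/year = 208 bottles/year, aged 5 years on average
    print(average_in_system(4 * 52, 5))   # 1040 bottles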

Page 22: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


System Balance

each system component produces & consumes data

• make sure data supply and demand is balanced

• X demand >= X supply => computation is “X-bound”

– e.g., memory bound, CPU-bound, I/O-bound

• goal: be bound everywhere at once (why?)

• X can be bandwidth or latency

– X is bandwidth => buy more bandwidth

– X is latency => much tougher problem

Page 23: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Tradeoffs

“Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can’t bribe God” –David Clark (MIT)

well...

• can convert some latency problems to bandwidth problems

• solve those with money

• the famous “bandwidth/latency tradeoff”

• architecture is the art of making tradeoffs

Page 24: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Bursty Behavior

• Q: to sustain 2 IPC... how many instructions should processor be able to

– fetch per cycle?

– execute per cycle?

– complete per cycle?

• A: NOT 2 (more than 2)

– dependences will cause stalls (under-utilization)

– if desired performance is X, peak performance must be > X

• programs don’t always obey “average” behavior

– can’t design processor only to handle average behavior

Page 25: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Cost

• very important to real designs

– startup cost

» one large investment per chip (or family of chips)

» increases with time

– unit cost

» cost to produce individual copies

» decreases with time

– only loose correlation to price and profit

• Moore’s corollary: price of high-performance system is constant

– performance doubles every 18 months

– cost per function (unit cost) halves every 18 months

– assumes startup costs are constant (they aren’t)

Page 26: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Startup and Unit Cost

• startup cost: manufacturing

– fabrication plant, clean rooms, lithography, etc. (~$3B)

– chip testers/debuggers (~$5M a piece, typically ~200)

– few companies can play this game (Intel, IBM, Sun)

– equipment more expensive as devices shrink

• startup cost: research and development

– 300–500 person years, mostly spent in verification

– need more people as designs become more complex

• unit cost: manufacturing

– raw materials, chemicals, process time ($2–5K per wafer)

– decreased by improved technology & experience

Page 27: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Unit Cost and Die Size

• unit cost most strongly influenced by physical size of chip (die)

» semiconductors built on silicon wafers (8”)

» chemical + photolithographic steps create transistor/wire layers

» typical number of metal layers today is 6

– cost per wafer is roughly constant (≈ $5000)

– basic cost per chip proportional to chip area (mm²)

» typical die: 150–200 mm², from 50 mm² (embedded) to 300 mm² (Itanium)

» typical: 300–600 dies per wafer

– yield (% working chips) inversely related to die area and defect density (manufacturing defects per unit area)

» P(working chip) = (1 + (defect density × die area) / α)^(–α), with α ≈ 4

– typical defect density: 0.005 per mm²

– typical yield: (1 + (0.005 × 200) / 4)^(–4) ≈ 40%

– typical cost per chip: $5000 / (500 × 40%) = $25
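The yield and cost arithmetic above as a sketch (the inputs are the slide's example values):

    wafer_cost     = 5000     # dollars per wafer (rough figure from the slide)
    dies_per_wafer = 500
    defect_density = 0.005    # defects per mm^2
    die_area       = 200      # mm^2
    alpha          = 4        # manufacturing-complexity parameter

    yield_frac = (1 + defect_density * die_area / alpha) ** -alpha
    cost_per_good_die = wafer_cost / (dies_per_wafer * yield_frac)
    print(round(yield_frac, 2), round(cost_per_good_die))
    # 0.41 and 24, i.e., roughly the slide's 40% yield and $25 per chip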

Page 28: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Unit Cost -> Price

• if a chip costs $25 to manufacture, why does it cost $500 to buy?

– integrated circuit costs $25

– must still be tested, packaged, and tested again

– testing (time == $): $5 per working chip

– packaging (ceramic + pins): $30

» more expensive for more pins or if chip dissipates a lot of heat

» packaging yield < 100% (but high)

» post-packaging test: another $5

– total for packaged chip: ~$65

– spread startup cost over volume ($100–200 per chip)

» proliferations (i.e., shrinks) are startup free (help profits)

– Intel needs to make a profit...

– ... and so does Dell

Page 29: Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture


Reading Summary: Performance

• H&P Chapter 1

• Next: Instruction Sets