performancecs510 computer architectureslecture 3 - 1 lecture 3 benchmarks and performance metrics...

38
Performance CS510 Computer Architectures Lecture 3 - 1 Lecture 3 Lecture 3 Benchmarks and Benchmarks and Performance Metrics Performance Metrics

Upload: leonard-barber

Post on 01-Jan-2016

223 views

Category:

Documents


0 download

TRANSCRIPT

Performance CS510 Computer Architectures Lecture 3 - 1

Lecture 3Lecture 3

Benchmarks and Benchmarks and Performance MetricsPerformance Metrics

Lecture 3Lecture 3

Benchmarks and Benchmarks and Performance MetricsPerformance Metrics

Performance CS510 Computer Architectures Lecture 3 - 2

Measurement ToolsMeasurement Tools

• Benchmarks, Traces, Mixes• Cost, Delay, Area, Power Estimation• Simulation (many levels)

– ISA, RT, Gate, Circuit• Queuing Theory• Rules of Thumb• Fundamental Laws

Performance CS510 Computer Architectures Lecture 3 - 3

The Bottom Line: The Bottom Line: Performance (and Cost)Performance (and Cost)

• Time to run the task (ExTime)– Execution time, response time, latency

• Tasks per day, hour, week, sec, ns ....(Performance)– Throughput, bandwidth

610 mph

1350 mph

470

132

286,700

178,200

6.5 hours

3.0 hours

Plane

Boeing 747

BAD/Sud Concorde

SpeedTime (DC-Paris) Passengers Throughput

(pmph)

Performance CS510 Computer Architectures Lecture 3 - 4

The Bottom Line: The Bottom Line: Performance (and Cost)Performance (and Cost)

ExTime(Y) Performance(X)

n = =

ExTime(X) Performance(Y)

“X is n times faster than Y” means:

Performance CS510 Computer Architectures Lecture 3 - 5

Performance TerminologyPerformance Terminology

“X is n% faster than Y” means:

100 100 xx (Performance(X) - Performance(Y)) (Performance(X) - Performance(Y)) n =n = Performance(Y)Performance(Y)

ExTime(Y) Performance(X) n = = 1 +

ExTime(X) Performance(Y) 100

Performance CS510 Computer Architectures Lecture 3 - 6

ExampleExample

1510

= 1.51.0

= Performance (X)Performance (Y)

ExTime(Y)ExTime(X)

=

n = 100 (1.5 - 1.0) 1.0

n = 50%

Example: Y takes 15 seconds to complete a task, Example: Y takes 15 seconds to complete a task, X takes 10 seconds. X takes 10 seconds. What % faster is X?What % faster is X?

Performance CS510 Computer Architectures Lecture 3 - 7

Programs to Evaluate Programs to Evaluate Processor PerformanceProcessor Performance

• (Toy) Benchmarks– 10~100-line program– e.g.: sieve, puzzle, quicksort

• Synthetic Benchmarks– Attempt to match average frequencies of real workl

oads– e.g., Whetstone, dhrystone

• Kernels– Time critical excerpts of real programs– e.g., Livermore loops

• Real programs– e.g., gcc, spice

Performance CS510 Computer Architectures Lecture 3 - 8

Benchmarking GamesBenchmarking Games

• Differing configurations used to run the same workload on two systems

• Compiler wired to optimize the workload• Workload arbitrarily picked• Very small benchmarks used• Benchmarks manually translated to optimize

performance

Performance CS510 Computer Architectures Lecture 3 - 9

Common Benchmarking Common Benchmarking MistakesMistakes

• Only average behavior represented in test workload• Ignoring monitoring overhead• Not ensuring same initial conditions• “Benchmark Engineering”

– particular optimization– different compilers or preprocessors– runtime libraries

Performance CS510 Computer Architectures Lecture 3 - 10

SPEC: System Performance SPEC: System Performance Evaluation CooperativeEvaluation Cooperative

• First Round 1989– 10 programs yielding a single number

• Second Round 1992– SpecInt92 (6 integer programs) and SpecFP92 (14 floating

point programs)– VAX-11/780

• Third Round 1995– Single flag setting for all programs; new set of programs

“benchmarks useful for 3 years”– SPARCstation 10 Model 40

Performance CS510 Computer Architectures Lecture 3 - 11

SPEC First RoundSPEC First Round

• One program: 99% of time in single line of code

• New front-end compiler could improve dramatically

Benchmark

SPEC

Per

f

0

100

200

300

400

500

600

700

800

gcc

epre

sso

spice

dodu

c

nasa7 li

eqnt

ott

matrix300

fppp

p

tom

catv

Performance CS510 Computer Architectures Lecture 3 - 12

How to Summarize How to Summarize PerformancePerformance

• Arithmetic Mean (weighted arithmetic mean)

– tracks execution time: (Ti)/n or Wi*Ti

• Harmonic Mean (weighted harmonic mean) of execution rates (e.g., MFLOPS)

– tracks execution time: n/1/Ri or n/Wi/Ri

• Normalized execution time is handy for scaling performance

• But do not take the arithmetic mean of normalized execution time, use the geometric mean (Ri)1/n, where Ri=1/Ti

Performance CS510 Computer Architectures Lecture 3 - 13

Comparing and Summarizing Comparing and Summarizing PerformancePerformance

For program P1, A is 10 times faster than B, For program P2, B is 10 times faster than A, and so on...

The relative performance of computer is unclear with Total Execution Times

Computer A Computer B Computer C

P1(secs) 1 10 20

P2(secs) 1,000 100 20

Total time(secs) 1,001 110 40

Performance CS510 Computer Architectures Lecture 3 - 14

Summary MeasureSummary Measure

Arithmetic Mean n Execution Timei

i=1

1

n n n 1 / Ratei i=1

Ratei = ƒ(1 / Execcution Timei)

Good, if programs are run equally in the workload

Harmonic Mean(When performance is expressed as rates)

Performance CS510 Computer Architectures Lecture 3 - 15

Unequal Job MixUnequal Job Mix

­ Weighted Arithmetic Mean

­ Weighted Harmonic Mean

n Weighti x Execution Timei

i=1

n Weighti / Ratei i=1

Relative Performance

• Normalized Execution Time to a reference machine­ Arithmetic Mean­ Geometric Mean

n Execution Time Ratioi

i=1n

Normalized to the reference machine

• Weighted Execution Time

Performance CS510 Computer Architectures Lecture 3 - 16

Weighted Arithmetic MeanWeighted Arithmetic Mean

W(i)j x Timejj=1

nWAM(i) =

A B C W(1) W(2) W(3)

P1 (secs) 1.00 10.00 20.00 0.50 0.909 0.999

P2(secs) 1,000.00 100.00 20.00 0.50 0.091 0.001

1.0 x 0.5 + 1,000 x 0.5

WAM(1) 500.50 55.00 20.00

WAM(2) 91.91 18.19 20.00

WAM(3) 2.00 10.09 20.00

Performance CS510 Computer Architectures Lecture 3 - 17

Normalized Execution Normalized Execution TimeTime

P1 1.0 10.0 20.0 0.1 1.0 2.0 0.05 0.5 1.0

P2 1.0 0.1 0.02 10.0 1.0 0.2 50.0 5.0 1.0

Normalized to A

Normalized to B Normalized to C

A B C A B C A B C

Geometric Mean = n Execution time ratioiI=1

n

Arithmetic mean 1.0 5.05 10.01 5.05 1.0 1.1 25.03 2.75 1.0

Geometric mean 1.0 1.0 0.63 1.0 1.0 0.63 1.58 1.58 1.0

Total time 1.0 0.11 0.04 9.1 1.0 0.36 25.03 2.75 1.0

A B CP1 1.00 10.00 20.00P2 1,000.00 100.00 20.00

Performance CS510 Computer Architectures Lecture 3 - 18

Disadvantages Disadvantages of Arithmetic Meanof Arithmetic Mean

Performance varies depending on the reference machine

1.0 10.0 20.0 0.1 1.0 2.0 0.05 0.5 1.0 1.0 0.1 0.02 10.0 1.0 0.2 50.0 5.0 1.0

1.0 5.05 10.01 5.05 1.0 1.1 25.03 2.75 1.0

B is 5 times slower than A

A is 5 times slower than B

C is slowest C is fastest

Normalized to A

Normalized to B Normalized to C

A B C A B C A B C

P1

P2

Arithmetic mean

Performance CS510 Computer Architectures Lecture 3 - 19

The Pros and Cons Of The Pros and Cons Of Geometric MeansGeometric Means

• Independent of running times of the individual programs• Independent of the reference machines• Do not predict execution time

– the performance of A and B is the same : only true when P1 ran 100 times for every occurrence of P2

P2 1.0 0.1 0.02 10.0 1.0 0.2 50.0 5.0 1.0

P1 1.0 10.0 20.0 0.1 1.0 2.0 0.05 0.5 1.0

1(P1) x 100 + 1000(P2) x 1= 10(P1) x 100 + 100(P2) x 1

Geometric mean 1.0 1.0 0.63 1.0 1.0 0.63 1.58 1.58 1.0

Normalized to A Normalized to B Normalized to C

A B C A B C A B C

Performance CS510 Computer Architectures Lecture 3 - 20

Performance CS510 Computer Architectures Lecture 3 - 21

Performance CS510 Computer Architectures Lecture 3 - 22

Amdahl's LawAmdahl's Law

Speedup due to enhancement E: ExTime w/o E Performance w/ESpeedup(E) = = ExTime w/ E Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:

ExTime(E) =Speedup(E) =

Performance CS510 Computer Architectures Lecture 3 - 23

Amdahl’s LawAmdahl’s Law

Speedup =ExTime

ExTimeE

=

1

(1 - FractionE) + FractionE

SpeedupE

ExTimeE = ExTime x (1 - FractionE) + SpeedupE

FractionE

1(1 - F) + F/S

=

Performance CS510 Computer Architectures Lecture 3 - 24

Amdahl’s LawAmdahl’s Law

Floating point instructions are improved to run 2 times(100% improvement); but only 10% of actual instructions are FP

Speedup = 1

(1-F) + F/S

= 1.0535.3% improvement

1

(1-0.1) + 0.1/2 0.95=

1=

Performance CS510 Computer Architectures Lecture 3 - 25

Corollary(Amdahl):Corollary(Amdahl): Make the Common Case FastMake the Common Case Fast

• All instructions require an instruction fetch, only a fraction require a data fetch/store

– Optimize instruction access over data access• Programs exhibit locality

Spatial Locality

Reg’sCache

Memory Disk / Tape

Temporal Locality

• Access to small memories is fasterProvide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.

Performance CS510 Computer Architectures Lecture 3 - 26

Locality of AccessLocality of Access

Spatial Locality:There is a high probability that a set of data, whose address differences are small, will be accessed in small time difference.

Temporal Locality:There is a high probability that the recently referenced data will be referenced in near future.

Performance CS510 Computer Architectures Lecture 3 - 27

• The simple case is usually the most frequent and the easiest to optimize!

• Do simple, fast things in hardware(faster) and be sure the rest can be handled correctly in software

Rule of ThumbRule of Thumb

Performance CS510 Computer Architectures Lecture 3 - 28

Metrics of PerformanceMetrics of Performance

Compiler

Programming Language

Application

ISA

DatapathControl

Transistors Wires PinsFunction Units

(millions) of Instructions per second: MIPS(millions) of (FP) operations per second: MFLOP/s

Answers per monthOperations per second

Cycles per second (clock rate)

Megabytes per second

Performance CS510 Computer Architectures Lecture 3 - 29

Aspects of CPU PerformanceAspects of CPU Performance

Seconds Instructions Cycles Seconds

CPU time = = x x Program Program Instruction Cycle

Program X

Compiler X (X)

Inst. Set. X X

Organization X

Technology X X

Inst Count CPI Clock Rate

Performance CS510 Computer Architectures Lecture 3 - 30

Marketing MetricsMarketing MetricsMIPS = Instruction Count / Time x 106 = Clock Rate / CPI x 106

• Machines with different instruction sets ?

• Programs with different instruction mixes ?

– Dynamic frequency of instructions

• Not correlated with performance

MFLOP/s = FP Operations / Time x 106

• Machine dependent

• Often not where time is spent

Normalized:

add,sub,compare, mult 1

divide, sqrt 4

exp, sin, . . . 8

Normalized:

add,sub,compare, mult 1

divide, sqrt 4

exp, sin, . . . 8

Performance CS510 Computer Architectures Lecture 3 - 31

Cycles Per InstructionCycles Per Instruction

CPU time = Cycle Time x CPI x Ii = 1

n

i i

Instruction Frequency

CPI = CPI x F ,where F = I i = 1

n

i i i i

Instruction Count

CPI = (CPU Time x Clock Rate) / Instruction Count= Cycles / Instruction Count

Average cycles per instruction

Invest resources where time is spent !

Performance CS510 Computer Architectures Lecture 3 - 32

Organizational Trade-offsOrganizational Trade-offs

Instruction Mix

Cycle Time

CPIISA

DatapathControl

Transistors Wires PinsFunction Units

Compiler

Programming Language

Application

Performance CS510 Computer Architectures Lecture 3 - 33

Example: Calculating CPIExample: Calculating CPI

Typical Mix

Base Machine (Reg / Reg)

Op Freq CPI(i) CPI (% Time)

ALU 50% 1 .5 (33%)

Load 20% 2 .4 (27%)

Store 10% 2 .2 (13%)

Branch 20% 2 .4 (27%)

1.5

Performance CS510 Computer Architectures Lecture 3 - 34

ExampleExample

Add register / memory operations: R/M– One source operand in memory– One source operand in register– Cycle count of 2

Branch cycle count to increase to 3

What fraction of the loads must be eliminated for this to pay off?

Base Machine (Reg / Reg)

Some of LD instructions can be eliminated by having R/M type ADD instruction [ADD R1, X]

Typical Mix

Op Freqi CPIi

ALU 50% 1Load 20% 2Store 10% 2Branch 20% 2

Performance CS510 Computer Architectures Lecture 3 - 35

Example SolutionExample Solution Exec Time = Instr Cnt x CPI x Clock

Op Freqi CPIi CPIALU .50 1 .5Load .20 2 .4Store .10 2 .2Branch .20 2 .4Total 1.00 1.5

Performance CS510 Computer Architectures Lecture 3 - 36

Example SolutionExample Solution

Exec Time = Instr Cnt x CPI x Clock

CPINEW must be normalized to new instruction frequency

NewFreqi CPIi CPINEW

.5 - X 1 .5 - X

.2 - X 2 .4 - 2X

.1 2 .2

.2 3 .6

X 2 2X

1 - X (1.7 - X)/(1 - X)

OldOp Freqi CPIi CPI

ALU .50 1 .5

Load .20 2 .4

Store .10 2 .2

Branch .20 2 .4

Reg/Mem

1.00 1.5

Performance CS510 Computer Architectures Lecture 3 - 37

Example SolutionExample SolutionExec Time = Instr Cnt x CPI x Clock

All LOADs must be eliminated for this to be a win !

1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X) 1.5 = 1.7 - X 0.2 = X

Instr CntOld x CPIOld x Clock = Instr CntNew x CPINew x Clock

Op Freq Cycles CPIOld Freq Cycles CPINEW

ALU .50 1 .5 .5 - X 1 .5 - XLoad .20 2 .4 .2 - X 2 .4 - 2XStore .10 2 .2 .1 2 .2Branch .20 2 .4 .2 3 .6Reg/Mem X 2 2X

1.00 1.5 1 - X (1.7 - X)/(1 - X)

Old New

Performance CS510 Computer Architectures Lecture 3 - 38

Fallacies and PitfallsFallacies and Pitfalls

• MIPS is an accurate measure for comparing performance among computers– dependent on the instruction set– varies between programs on the same computer– can vary inversely to performance

• MFLOPS is a consistent and useful measure of performance– dependent on the machine and on the program– not applicable outside the floating-point performance– the set of floating-point operations is not consistent

across the machines