18-600 foundations of computer systemsece600/lectures/lecture25.pdf · 10 1p 2p 4p 1p 2p 4p 1p 2p...

50
Lecture 25: “Performance, Power, & Energy of Computers” Recommended References: “Energy per Instruction Trends in Intel® Microprocessors,” by Ed Grochowski, Murali Annavaram, 2006. “Mitigating Amdahl's Law through EPI Throttling,” by M. Annavaram, E. Grochowski, J. Shen. In 32nd ISCA 2005. The “Science” of Computer Systems 18 - 600 Foundations of Computer Systems John P. Shen November 29, 2017 11/29/2017 (© J.P. Shen) 18-600 Lecture #25 1

Upload: others

Post on 21-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Lecture 25:“Performance, Power, & Energy of Computers”

➢ Recommended References:• “Energy per Instruction Trends in Intel® Microprocessors,” by Ed

Grochowski, Murali Annavaram, 2006. • “Mitigating Amdahl's Law through EPI Throttling,” by M. Annavaram,

E. Grochowski, J. Shen. In 32nd ISCA 2005.

The “Science” of Computer Systems

18-600 Foundations of Computer Systems

John P. ShenNovember 29, 2017

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 1

Page 2: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Lecture 25:“Performance, Power, & Energy of Computers”

A. Latency (Single-Thread) Performance (Law #1)

B. Throughput (Multi-Thread) Performance (Law #2)

C. Throughput Performance Scalability (Law #3)

D. Performance Scalability Impediments (Law #4)

E. Performance and Power Scaling (Law #5)

F. Power and Energy Optimizations

18-600 Foundations of Computer Systems

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 2

Page 3: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Flynn’s Taxonomy of Computer Systems [Mike Flynn, 1966]

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 3

SISD – Single Instruction Stream & Single Data Stream. (Sequential uni-processor.)

SIMD – Single Instruction Stream & Multiple Data Streams. (Vector processing, lockstep.)

MISD – Multiple Instruction Streams & Single Data Stream. (Stream data through multiple processing stages.)

MIMD – Multiple Instruction Streams & Multiple Data Streams. (Multi-threads & multi-processors, most general parallelism.)

SPMD – Single Program, Multiple Data Streams. (GPUs, not lockstep.)

MPMD – Multiple Programs, Multiple Data Streams.

Page 4: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Classification of Parallelism• SISD – Traditional sequential program on single-core processor

➢ Sequential code with sequential execution semantics (single PC)

➢ Can support concurrent “multi-processing” through time sharing

➢ Control-flow graph & data-flow graph embedded in sequential code

➢ Achieve Instruction Level Parallelism (ILP) via aggressive control flow speculation and dataflow limit processing

• MIMD – Multi-threads & multi-cores, most general type of parallelism

➢ Can support simultaneous parallel “multi-processing” (multiple PCs)

➢ Simultaneous traversing of multiple control-flow graphs

➢ Can support both “multi-processing” and “multi-threading”

➢ Achieve Thread Level Parallelism (TLP) via program & machine parallelisms

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 4

Page 5: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

A. Latency (Single-Thread) Performance (Law #1)

❖ Time to execute a program: T (latency)

❖ Processor performance: Perf = 1/T

cycle

time

ninstructio

cycles

program

nsinstructioT

CycleTimeCPIPathLengthT

CPIPathLength

Frequency

CycleTimeCPIPathLengthPerfCPU

1

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 5

Page 6: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Landscape of Microprocessor Families

0

0.5

1

0 500 1000 1500 2000 2500 3000 3500

Frequency (MHz)

SP

EC

int2

00

0/M

Hz

Intel-x86

AMD-x86

Power

Itanium

700500

300

100

PIII

P4

Athlon

** Data source www.spec.org

Power4

NWD

900

11001900 SpecINT 2000 1300

1500

Opteron

800 MHz

Extreme

Power 3

Power5

PSC

DTN

1700

Itanium

Source: www.SPEC.org

CPIPathLength

FrequencyePerformanc CPU

Deeper pipelining

Wid

er

pip

elin

e

SPECint Latency Performance of Processors[John DeVale & Bryan Black, 2005]

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 6

Page 7: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Latency vs. Throughput Performance

❖ Reduce Latency of Application▪ Uni-processor, Single Program

▪ Target Single-Thread (ST) Performance

▪ Examples: SPEC, PC and Workstations

❖ Increase Throughput of System▪ Multi-cores/Multi-processors, Many Threads

▪ Target Multi-threaded/Multi-tasking Throughput

▪ Example: Database Transaction Processing

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 7

Page 8: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Ideal Throughput (Multi-Thread) Performance

❖ Time to process a thread: T (latency)

❖ Multi-thread performance: Perf = 1/T

cycle

time

ninstructio

cycles

thread

nsinstructioT

CPIPathLength

Frequencyn

CycleTimeCPIPathLength

nPerfMP

1

CycleTimeCPIPathLengthT

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 8

Page 9: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

B. Throughput (Multi-Thread) Performance (Law #2)

❖ Multi-Core/Multi-Thread Performance:

❖ Can Improve PerfMC by:▪ Increasing: n (no. of CPUs or cores)

▪ Increasing: Frequency (CPU clock frequency)

▪ Decreasing: PL (dynamic instruction count)

▪ Decreasing: CPI (cycles/instruction)

)()( nCPInPL

FrequencynPerfMC

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 9

Page 10: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

A Throughput Performance Scalability Model

❖ Multi-Core/Multi-Thread Speedup:

❖ A Rigorous Scalability Model (my proposal):

)()(

)1()1(

)()()( 11

nCPInPL

CPIPLn

CPIPL

Frequency

nCPInPL

Frequencyn

nSpeedup MC

11)()( CPInPLnnCPInPL yx

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 10

Page 11: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

C. Throughput Performance Scalability (Law #3)

❖ Multi-Core/Multi-Thread Speedup:

❖ Scalability Impedance Functions:

yxyxMCnn

n

CPInPLn

CPIPLnnSpeedup

11

11)(

FunctionImpedancenCPIn

FunctionImpedancenPLny

x

)(

)(

ImpedanceNoyx

SpeedupNoyxnnn yxyx

0.0

0.1)(

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 11

Page 12: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Real World Example: Database Performance (OLTP)

❖ MIMD Database Performance: TPS

❖ Can Improve TPS by:▪ Increasing: n (no. of CPUs)

▪ Increasing: Frequency (CPU clock frequency)

▪ Decreasing: IPX (instructions/transaction) == PL

▪ Decreasing: CPI (cycles/instruction)

)()( nCPInIPX

Frequencyn

Second

nsTransactioTPS

[Law #2]

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 12

[Rich Hankins & Murali Annavaram, 2004]

Page 13: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

4-way Multiprocessor Experimental SetupComponent Intel® Xeon System Intel® Itanium® 2 System

Processors 4-way SMP, 1.6GHz 4-way SMP, 900MHz

Caches 256KB L2, 1MB L3 256KB L2, 3MB L3

Operating System Red Hat® AS 2.1 Red Hat® AS 2.1

Disks 24 data + 2 log 32 data + 1 log

Main Memory 4GB 16GB

Database Oracle® 9ir2 Oracle® 10g

OS Large Page Size 4MB 256MB

SGA 3GB 14GB

[Rich Hankins & Murali Annavaram, 2004]

❖ Based on EMON Events▪ Separate User and OS components for each event▪ Use multiple long runs (20-min warm up, 10-min measurement)▪ Strive for standard deviation <5% for TPS & CPU utilization > 90%▪ Overall user execution time ~70-90%

)()( nCPInIPX

FrequencynTPS

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 13

[Rich Hankins & Murali Annavaram, 2004]

Page 14: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

TPS: Throughput Scaling

0

200

400

600

800

1000

0 250 500 750 1000 1250 1500

Warehouses

TP

S

1P 2P 4P

0

200

400

600

800

1000

0 200 400 600 800 1000 1200

Warehouses

TP

S

1P 2P 4P

Balanced I/O Bound

CPU Bound

Xeon Itanium 2

❖ TPS scales more linearly on Itanium2 ▪ Larger SGA implies slower I/O rate increase

(Xeon I/O rate increases at 7KB/WH, & Itanium 2 at: 6KB/WH)

▪ Bus utilization on Xeon higher than Itanium 2 (45% vs. 39%)

❖ Increasing I/O rate g more processes & context switches▪ Clients increase from 8 to 56 on Xeon & 4 to 64 on Itanium 2

▪ OS time up to 20% on Xeon & only 10% on Itanium 2

)()( nCPInIPX

FrequencynTPS

▪ Performance degrades with increased data set size (due to I/O rate)

▪ Performance improves with increased n (on IPF almost linear increase)

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 14

[Rich Hankins & Murali Annavaram, 2004]

Page 15: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

IPX: Path-Length Scaling )()( nCPInIPX

FrequencynTPS

❖ Growth in IPX (quite linear) attributed to OS IPX increase▪ More I/O, more context switching

❖ User level IPX remains relatively constant in both systems▪ Code path through Oracle relatively constant

❖ Excluding NOPS, IPX at 25 WH similar for both systems!

❖ IPX growth less pronounced on Itanium 2

0.0

0.5

1.0

1.5

2.0

2.5

0 100 200 300 400 500 600 700 800

Warehouses

IPX

(x10

6)

1P 2P 4P

0.0

0.5

1.0

1.5

2.0

2.5

0 200 400 600 800 1000 1200

Warehouses

IPX

(x10

6)

1P 2P 4PXeon Itanium 2

1)( IPXnnIPX x 1)( IPXnnIPX x

10~ nnx 10~ nnx

▪ IPX increases with increased data set size (Xeon) ▪ But IPX(n) doesn’t

increase much with increased n

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 15

[Rich Hankins & Murali Annavaram, 2004]

Page 16: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Path-Length Contributions

ationSynchroniz

Replay

nSpeculatio

CodeOriginal

APP

SwitchContext

FaultPage

OI

OS

nPL

/

)(

)()( nCPInIPX

FrequencynTPS

0.0

0.1

)(

1

x

PL

nn

PL

x

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 16

Page 17: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

CPI: Overhead Scaling )()( nCPInIPX

FrequencynTPS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

0 200 400 600 800 1000 1200

Warehouses

CP

I

1P 2P 4P

0.0

2.0

4.0

6.0

8.0

10.0

0 100 200 300 400 500 600 700 800

Warehouses

CP

I

1P 2P 4PXeon Itanium 2

1)( CPInnCPI y 1)( CPInnCPI y

❖ CPI increases with a “knee” but less sharp on Itanium 2 !

❖ Overall CPI trend strongly determined by user CPI▪ User mode execution time more than 90% on IPF, 80% on Xeon

❖ Xeon CPI grows with P, Itanium 2 CPI does not▪ Growth attributed to higher bus utilization on Xeon

▪ CPI does increase with increased data set size▪ CPI(n) also increases

with increased n (esp. for Xeon)

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 17

[Rich Hankins & Murali Annavaram, 2004]

Page 18: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

CPI Breakdown: Xeon

1

167.0

1

0 CPInIPXn

FrequencynTPS

833.0)( n

nn

nnSpeedup

yxMC

[Law #3]

0

1

2

3

4

5

6

7

8

9

10

1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P

Warehouses

CP

I

Other

L3 Miss

L2 Miss

TC

TLB

Branch

Inst

10 W 100 W 500 W 800 W

Xeon1)( CPInnCPI y )167.0(y

)()( nCPInIPX

FrequencynTPS

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 18

[Rich Hankins & Murali Annavaram, 2004]

Page 19: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

CPI Breakdown: Itanium 2

1

073.0

1

0 CPInIPXn

FrequencynTPS

927.0)( n

nn

nnSpeedup

yxMC

[Law #3]

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P

25 W 100 W 500 W 800 W 1200 W

Warehouses

CP

IExe

FE

L1d

RSE

Flush

Work

Itanium 21)( CPInnCPI y )073.0(y

)()( nCPInIPX

FrequencynTPS

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 19

[Rich Hankins & Murali Annavaram, 2004]

Page 20: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

CPI Contributions

ationSynchroniz

ctInterconneMP

Bus

DRAMMem

Coherence

TLBCacheCache

sourceRe

StallsDep

Execute

CPU

nCPI

/

.

)(

)()( nCPInIPX

FrequencynTPS

0.0

0.1

)(

1

y

CP

In

nC

PI

y

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 20

Page 21: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Scalability Impediments

1p-16p scaling

1-(x+y) R^2

SEMPHY 0.993 0.999

PLSA 0.963 0.999

Rsearch 0.931 0.997

SVM-RFE 0.786 0.970

SNPs 0.685 0.967

GeneNet 0.642 0.983

[Carole Dulong et al., 2005]

Speedup Scalability

0

5

10

15

20

25

30

35

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

Number of Cores (n)

Sp

ee

du

p

n^1.00

n^0.90

n^0.80

n^0.70

(x+y)=0.10

(x+y)=0.20

(x+y)=0.30

11 CPInPLn

FrequencynPerf

yxMC

yxn

nn

nnSpeedup

yxMC

1)(

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 21

Page 22: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Scalability HeadroomSpeedup Scalability Extrapolation

0

5

10

15

20

25

30

35

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

Number of Cores (n)

Sp

eed

up

n̂ 1.00

n̂ 0.92

n̂ 0.65

11 CPInPLn

FrequencynPerf

yxMC

Reduce PL(n) Reduce CPI(n)

?

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 22

Page 23: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Conspiring Forces Against Performance Scaling: Three Forms of Scalability Impediments

❖ Algorithm▪ Limitation of Language and Algorithm

▪ Tyranny of Amdahl’s Law (sequential bottleneck)

❖ Architecture▪ Increase of Path-Length Undermines Scalability

▪ Increase of CPI also Undermines Scalability

❖ Power/Thermal▪ Increased Complexity and Inefficiency of Design

▪ Super-linear Power Scaling Relative to Performance

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 23

Page 24: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

D. Performance Scalability Impediments (Law #4)

111

1

1)(

fn

n

n

ffnSpeedup ALG

yxn

n

n

nn

nnSpeedup

yxyxARCH

1)(

Algorithm Scalability: (Amdahl’s Law)

Architecture Scalability:

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 24

Page 25: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Relative (Additive) Degrees of TyrannySpeedup for z and f == (0.10)

0

5

10

15

20

25

30

35

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

Number of cores (n)

Sp

ee

du

pPerfect Speedup

JPS (architecture)

Amdahl (algorithm)

z

zyxyxMC nn

n

n

n

nn

nnSpeedup

1)(

111

1

1)(

fn

n

n

ffnSpeedup MP

Architecturetyranny

Algorithmtyranny

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 25

Page 26: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Speedup for f=0.01 and (x+y)=0.08

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64

Number of Cores (n)

Sp

ee

du

p

Perfect (Linear) Speedup

Architecture Scaling (w/ x+y=0.08)

Algorithm Scaling (w/ f=0.01)

Actual Speedup (w/ f=0.01 & x+y=0.08)

92.008.01)( nnnSpeedup ARCH

101.01)(

n

nnSpeedup ALG

92.0

101.01)(

n

nnSpeedup Actual

10X SU n > 14

20X SU n > 35

Combined Effect on Actual Speedup (f=0.01, x+y=0.08)

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 26

Page 27: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Combined Effect on Actual Speedup (f=0.02, x+y=0.08)

Speedup for f=0.02 and (x+y)=0.08

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64

Number of Cores (n)

Sp

ee

du

p

Perfect Speedup

Architecture Scaling (w/ x+y=0.08)

Algorithm Scaling (w/ f=0.02)

Actual Speedup (w/ f=0.02 & x+y=0.08)

92.008.01)( nnnSU ARCH

102.01)(

n

nnSU ALG

10X SU n > 16

20X SU n > 52

92.0)()( ALGActual nSUnSU

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 27

Page 28: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Combined Effect on Actual Speedup (f=0.02, x+y=0.10)Speedup for f=0.02 and (x+y)=0.10

(with scalar execution of sequential %)

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64

Number of Cores (n)

Sp

ee

du

p

Perfect (Linear) Speedup

Architecture Scaling (w/ x+y=0.10)

Algorithm Scaling (w/ f=0.02)

Actual Speedup (w/ f=0.02 & x+y=0.10)

90.010.01)( nnnSpeedup ARCH

n

nSpeedup ALG98.0

1

02.0

1)(

10X SU n > 18

20X SU n > 62

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 28

Page 29: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Impact of Latency Performance on MP Performance Speedup for f=0.02 and (x+y)=0.10

(with scalar execution of sequential %)

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64

Number of Cores (n)

Sp

ee

du

p

Perfect (Linear) Speedup

Architecture Scaling (w/ x+y=0.10)

Algorithm Scaling (w/ f=0.02)

Actual Speedup (w/ f=0.02 & x+y=0.10)

n

nSpeedup ALG98.0

1

02.0

1)(

10X SU n > 18

20X SU n > 62

Speedup for f=0.02 and (x+y)=0.10

(with superscalar execution of sequential %)

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64

Number of Cores (n)

Sp

ee

du

p

Perfect (Linear) Speedup

Architecture Scaling (w/ x+y=0.10)

Algorithm Scaling (w/ f=0.02)

Actual Speedup (w/ f=0.02 & x+y=0.10)

90.010.01)( nnnSpeedup ARCH

n

nSpeedup ALG98.0

2

02.0

1)(

10X SU n > 15

20X SU n > 38

90.010.01)( nnnSpeedup ARCH

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 29

Page 30: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

E. Performance and Power Scaling (Law #5)

PathLength

FrequencyIPC

CPIPathLength

FrequencyePerformanc

FrequencyIPCEPIPower

PathLengthePerformancEPIPower

second

cycle

cycle

ninstructio

ninstructio

Joule

second

JouleWatt

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 30

Page 31: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

FrequencyIPCEPIPower

[John DeVale & Bryan Black, 2005]

Power Scaling Relative to Performance Scaling

0 500 1000 1500 2000 2500 3000 3500 40000

0.2

0.4

0.6

0.8

1

0

20

40

60

80

100

120

Power (W)

10

0

30

0

50

0

70

0

90

0

1100

13

00

15

00

17

00

19

00

Pentium

Pentium EE

Pentium M

Frequency (Hz)

Spec2K/MHz

Power (W)

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 31

Page 32: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power vs. Latency Performance

15

20

25

30

0 2

Relative Performance

Rela

tive P

ow

er

Pentium 4 (Psc)

power = perf (1.74)

Pentium 4 (Wmt)

Pentium Pro

Pentiumi486

0

5

10

4 6 8

[Ed Grochowski, 2004]

❖ For comparison▪ Factor out contributions due to

process technology

▪ Keep contributions due to microarchitecture design

▪ Normalize to i486™ processor

❖ Relative to i486™ Pentium® 4 (Wmt) processor is▪ 6x faster (2X IPC at 3X frequency)

▪ 23x higher power

▪ Spending 4 units of power for every 1 unit of scalar performance

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 32

Page 33: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power vs. Latency & Throughput Performances

power = perf (1.74)

30

Rela

tive P

ow

er

10

15

20

25

0 2 4 6 8

Relative Performance

Pentium 4 (Psc)

Pentium 4 (Wmt)

Pentium Pro

Pentium

i486

0

5

Pentium M

power = perf (1.0) ?

power = perf (1.74)

Scalar/LatencyPerformance

ThroughputPerformance

CPU EPII486 7 nj

P5 10 nj

P6 17 nj

P4P-wmt 27 nj

P4P-psc 29 nj

Pentium M 9 nj

[Ed Grochowski, 2005]

Low EPI

❖ Assume a large-scale CMP with potentially many cores

❖ Replication of cores results in (nearly) proportional increases to both throughput performance and power (hopefully).

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 33

Page 34: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power/Performance (EPI) Evolution

IntelMicroprocessors

EPI (nj) 65nm at 1.33v

i486 10

Pentium 14

Pentium Pro 24

Pentium 4 (WMT) 38

Pentium 4 (CDM) 48

Pentium M (Banias) 13

Pentium M (Dothan) 15

Core Duo (Yonah) 11

Core Duo (Merom) 10

❖ Power: single core power (relative to i486 baseline)❖ Performance: SPECint performance (relative to i486 baseline)❖ EPI: average energy spent per instruction (in nano-joules)

0

5

10

15

20

25

30

35

40

45

50

0 2 4 6 8 10Scalar Performance

Pow

er

Power = Performance1.75

i486 Pentium

Pentium Pro

Pentium 4 (Willamette)

Pentium 4 (Cedarmill)

BaniasDothan Yonah

Merom

Pentium M Core Duo

FrequencyIPCEPIPower [Ed Grochowski, 2004]

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 34

Page 35: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power/Performance Scaling

0

20

40

60

80

100

120

140

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Relative Performance

Rela

tive P

ow

er

Perf^1.74

Perf^1.5

Perf^1.25

Perf^1.0

Pentium M

Pentium 4 (Psc)Pentium 4 (Wmt)

Pentium Proi486

?

?

Pentium Neo-Core ?

?

CPU EPI SUi486 7 nj 1

P5 10 nj 2

P6 17 nj 3.5

P4P-wmt 27 nj 6

P4P-psc 29 nj 6.5

Pentium M 9 nj 5.5

Neo-Core 5 nj 6.5

P5: 10 njP-M: 9 nj

Neo: 5 nj

P4P: 27 nj

• The issue is not small vs. big cores, nor in-ordervs. out-of-order cores. The key metric is EPI.

• The ideal core: ultra-low EPI with best possiblesingle-thread or single-core performance.

NP/DSP/GPU EPI

IXP2800 1 nj

TMS320C6713 0.7 nj

GeF7800GTX 0.6 nj

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 35

Page 36: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power and Speedup Scaling

0

10

20

30

40

50

60

70

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

Machine Scaling (n)

Po

wer/

Sp

eed

up

Scalin

g

Power at n 1̂.2

Power at n 1̂.1

Perfect Scaling

Speedup at n 0̂.9

Speedup at n 0̂.8

Power scalingchallenges:• Low EPI cores• Un-core scaling

Speedup scalingchallenges:• Algorithm

• Sequential %• Architecture

• PL scaling• CPI scaling

?

?

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 36

Page 37: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power vs. M C Speedup Scaling

0

20

40

60

80

100

120

140

160

180

200

01.5 3

4.5 67.5 9

10.5 1213.5 15

16.5 1819.5

MC Speedup

Po

we

r

Speedup^1.74

Speedup^1.50

Speedup^1.25

Speedup^1.00

Pentium M

Pentium 4 (Psc)Pentium 4 (Wmt)

Pentium Proi486 Pentium

EPI=29nj

EPI=10nj

EPI=9nj

CPU EPI SU

i486 7 nj 1

P5 10 nj 2

P6 17 nj 3.5

P4P-wmt 27 nj 6

P4P-psc 29 nj 6.5

Pentium M 9 nj 5.5

Neo core? < 5 nj 6.5

Neo Core?

•The scaling goal is not just the number of cores but maximum throughput within fixed power envelope.

•The key issue is not the power scaling of replicated cores, but the un-core power scaling that may push total power scaling towards the square law again.

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 37

Page 38: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Future Directions

❖ Scalability Strategies:

▪ Algorithm – Languages & Specialized Parallelism

▪ Architecture – CPI and Path Length Reduction

▪ Power/Thermal – EPI Reduction & Scalable Un-core

❖ Power/Energy is the new scalability wall

❖ Research Challenges:

▪ Sequential % mitigation for compelling workloads

▪ Ultra-low EPI core with great latency performance

▪ Un-core fabric with near-linear power scaling

❖ Un-core scaling is the new power goblin

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 38

Page 39: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Useful Chart on Power, Voltage, Resistance, Current

http://www.sengpielaudio.com/calculator-ohm.htm

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 39

Formula 1 − Electrical power equation:

Power P = I × V = R × I2 = V2 ⁄ Rwhere power P is in watts, voltage V is in volts and current I is in amperes (DC).If there is AC, look also at the power factor PF = cos φ and

φ = power factor angle(phase angle) between voltage and amperage.

Formula 2 − Mechanical power equation:

Power P = E ⁄ t = W ⁄ twhere power P is in watts, Energy E is in joules, and time tis in seconds. 1 W = 1 J/s.

Electric Energy is E = P × t − measured in watt-hours,

or also in kWh. 1J = 1N×m = 1W×s

Page 40: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Power vs. Energy

▪ Energy: integral of power (area under the curve)

▪ Energy and power driven by different design constraints

▪ Power issues:

▪ Power delivery (supply current @ right voltage)

▪ Thermal (don’t fry the chip)

▪ Reliability effects (chip lifetime)

▪ Energy issues:

▪ Limited energy capacity (battery)

▪ Efficiency (work per unit energy)

▪ Different usage models drive tradeoffs

Time

Pow

er

Energy

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 40

Page 41: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

F. Power and Energy Optimizations

▪ With constant time base, two are “equivalent”

▪ 10% reduction in power => 10% reduction in energy

▪ Once time changes, must treat as separate metrics

▪ E.g. reduce frequency to save power => reduce performance => increase time to completion => consume more energy (perhaps)

▪ Metric: energy-delay product per unit of work

▪ Tries to capture both effects

▪ Others advocate energy-delay2

▪ Best to consider all

▪ Plot performance (time), energy, ed, ed2

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 41

Page 42: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Performance/Power Efficiency Metrics

▪ Power is good metric for deciding on the thermal envelope of the processor

▪ Energy is good metric in battery constrained environments

▪ Task executed at ½ speed but ¼ power means ½ the energy (2T * ¼ P = ½ E)

▪ 2X battery life!

▪ Energy*Delay metric gives higher weight to performance

▪ Same example above, ED ((2T)2 * ¼ P) stays same

▪ Energy*Delay2 gives even more weight to performance

▪ Same example above shows that ½ speed is 2X worse on ED2 metric

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 42

Page 43: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

MIPS/watt Common measure of power efficiency

Equivalent to energy per instruction

Independent of time

Ideal metric for throughput performance

MIPS^2/watt Equivalent to energy • delay

Common metric for comparing logic families

MIPS^3/watt Equivalent to energy • delay^2

Assign increasing weight to time

Appropriate metric for latency performance

Mips

Watt

Instructions

Second

Joules

Second

=Instructions

Joule=

mips/watt

0

0.2

0.4

0.6

0.8

1

1.2

486 p5 p6 pentium 4

Norm

aliz

ed M

etr

ic

mips 2̂/watt

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

486 p5 p6 pentium 4

Nor

mal

ized

Met

ric

mips 3̂/watt

0

2

4

6

8

10

12

486 p5 p6 pentium 4

Nor

mal

ized

Met

ric

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 43

[Ed Grochowski, Intel, 2005]

Page 44: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

iPad3

18-600 Lecture #2511/29/2017 (© J.P. Shen)

Energy Storage Challenge for Mobile Devices

44

iPad2•25Wh = 90kJ•10h use = avg 2.5W

Kindle3•6.5Wh = 23kJ•“30 day use” = 30h =

avg 220mW

Typical smartphone•32g, 13cc, 5.5Wh = 18kJ •+5h charging @ max 1W•20mW static power = 10 days standby•150mW notifications = 1.3 day standby•“typical usage” 5kJ active + 13kJ standby =

1 battery charge

How long between recharges?

Page 45: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Key Areas Description

➢ Context Based Power Management: Offloading of communication to available connectivity, and computation to companion devices or the cloud.

➢ Workload Based Power Management: Manage power consumption based on actual usage and workload scenarios by leveraging heterogeneous cores and smart parallelism to reduce overall power.

➢ Near-Zero-Power Standby Mode: Low-power always-on transflective bistable (TF/BS) displays; eager hibernation with instant resume.

➢ Ultra-Low-Power Always-On Device: Device with minimal standby functionality and seamless quick switch over to companion devices.

➢ Casual Charging: Wireless inductive charging; solar charging for large surface devices.

➢ Anticipatory Preexecution: Speculative cross-device or cloud-based preprocessing and content prefetching.

Active Power Saving

Standby Power Saving

Energy Harvesting

18-600 Lecture #2511/29/2017 (© J.P. Shen)

Key Challenges…how far can we go? [Per Ljung, 2012]

45

Page 46: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

18-600 Lecture #2511/29/2017 (© J.P. Shen)

Active Power Saving1.5h @ 900mW = 1.35Wh 5Wh battery: 30%

ideal 100%

avg app goal <100mW

➢ Context-Based Power Management• Offload Communication (Local ULP radio)

•1m radio instead of 1000m cellular•Dongle, access point, femtocell box

• Offload Computation (Cross device)• Cloud/companion pre-compute/pre-render

➢ Workload-Based Power Management• Power consumption based on usage scenario

•Activity, location, history• Approach zero energy waste

• De-powering unused peripherals & power islands• Heterogeneous cores & SP processor arrays (ULP)

80% time in home/office40mW vs 1W

trade cheap communication for expensive computation

40mW vs 1W

0W vs 160mW (email, notifications)

10’s mW vs 100’s mW cores

25x

25x

46

Page 47: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

18-600 Lecture #2511/29/2017 (© J.P. Shen)

Standby Power Saving

➢ Near-Zero-Power Standby Mode• Zero-power always-on displays

• Transflective, bistable • Zero-power OS idling

• Android, Meego, WP7 use 15mW • iPhone uses 5mW

➢ Ultra-Low-Power Always-On Device• Wearable accessory for notifications & voice• Seamless quick switch over to companion• Week-long battery life on small battery

4mW vs 200mW handset

better firmware 2mW vs 15mWwith hibernation 1mW vs 15mW

TF/BS, TF/LCD, TF/OLED5mW vs OLED 500mW

default: email, skype

100x

15x

50x

goal <10mW, 1kJ

22.5h @ 160mW = 3.6Wh 5Wh battery: 70%

ideal 0%

47

Page 48: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

18-600 Lecture #2511/29/2017 (© J.P. Shen)

Energy Harvesting

Effectively Approach Unlimited Standby

➢ Casual Charging• Wireless inductive chargers• Instant charging for quick fix• Solar for large surface areas• Cross-device energy sharing

➢ Anticipatory Preexecution• Speculative pre-processing and pre-fetching• Cloud based or cross-device based • Context and user behavior model driven

best case charge 3Wfor 1W tablet

25x

1W desk, nightstand, car

40mW vs 1W

1h charge 3h use

5h full charge

48

Page 49: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

18-600 Lecture #2511/29/2017 (© J.P. Shen)

Refactoring the Mobile Form Factors?Multiple Devices, One Seamless Experience•Laptop (avg 10W ➡ avg 2W)

• Exploit large battery & connectivity- Runs offloaded apps from Phone- Gathers data for Wearable+ Display normally off+ 50 WH battery 25 hours nonstop use

•Smartphone (avg 200mW ➡ avg 40mW)+ Normally hibernating in standby+ Offload apps to Laptop+ Longer battery life: 5x battery life+ 5 WH battery 125 hours nonstop use

•Wearable (avg 3mW)+ Always-on display+ Always fresh data feeds- Respond using Laptop or Smartphone+ Reduces avg power of Laptop & Smartphone+ 1.2 WH battery 400 hours nonstop use

49

Page 50: 18-600 Foundations of Computer Systemsece600/lectures/lecture25.pdf · 10 1P 2P 4P 1P 2P 4P 1P 2P 4P 1P 2P 4P Warehouses I Other L3 Miss L2 Miss TC TLB Branch Inst 10 W 100 W 500

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition

Lecture 26:“Future of Computer Systems”

John P. Shen December 4, 2017

11/29/2017 (© J.P. Shen) 18-600 Lecture #25 50

18-600 Foundations of Computer Systems