© 2013 Thomas Wenisch
Power Management from Smartphones to Data Centers
Thomas Wenisch, Morris Wellman Faculty Development Assistant Professor of CSE, University of Michigan
Acknowledgements:
Luiz Barroso, Anuj Chandawalla, Laurel Emurian, Brian Gold, Yixin Luo, Milo Martin, David Meisner, Marios Papaefthymiou, Steven Pelley, Kevin Pipe, Arun Raghavan, Chris Sadler, Lei Shao, Wolf Weber
A Paradigm Shift In Computing

[Figure: log-scale trends from 1985 to 2020 in transistors (100,000's), power (W), performance (GOPS), and efficiency (GOPS/W), with annotations marking limits on heat extraction and limits on the energy-efficiency of operations. Source: T. Mudge, IEEE Computer, April 2001]
A Paradigm Shift In Computing

[Same figure, annotated: the Era of High Performance Computing gives way to the Era of Energy-Efficient Computing c. 2000, as limits on heat extraction and on the energy-efficiency of operations stagnate performance growth. Source: T. Mudge, IEEE Computer, April 2001]
Four decades of Dennard Scaling

• P = C V² f
• Increase in device count
• Lower supply voltages
➜ Constant power/chip

Dennard et al., 1974
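As a quick check of the constant-power claim, here is the standard ideal-scaling arithmetic (a restatement of Dennard's rules, not text from the slides), with scaling factor κ per generation:

```latex
% Ideal Dennard scaling by a factor \kappa: dimensions and voltage shrink by 1/\kappa,
% frequency rises by \kappa, and device count per chip rises by \kappa^2.
\begin{align*}
C' &= C/\kappa, \qquad V' = V/\kappa, \qquad f' = \kappa f, \qquad N' = \kappa^2 N \\
P'_{\mathrm{chip}} &= N'\, C'\, V'^2\, f'
  = \kappa^2 N \cdot \frac{C}{\kappa} \cdot \frac{V^2}{\kappa^2} \cdot \kappa f
  = N\, C\, V^2\, f = P_{\mathrm{chip}}
\end{align*}
```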
Leakage Killed Dennard Scaling

Leakage is:
• Exponential in the inverse of Vth
• Exponential in temperature
• Linear in device count

To switch well, we must keep Vdd/Vth > 3
➜ Vdd can't go down
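For reference, a minimal sketch of the textbook subthreshold-leakage model behind these bullets (standard device-physics form, not taken from the slides; n, k, T, q are the usual subthreshold slope factor and physical constants):

```latex
\begin{align*}
I_{\mathrm{leak}} &\propto e^{-V_{th}\,/\,(n k T / q)}
  && \text{grows exponentially as } V_{th} \text{ falls, and strongly with temperature} \\
P_{\mathrm{leak,chip}} &\approx N \cdot V_{dd} \cdot I_{\mathrm{leak}}
  && \text{grows linearly with device count } N
\end{align*}
```

Keeping Vdd/Vth > 3 while lowering Vdd would force Vth down and leakage up exponentially, which is why Vdd scaling stopped.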
No more free lunch…
• Need system-level approaches to…
  – …turn increasing transistor counts into customer value
  – …without exceeding thermal limits
• Energy efficiency is the new performance
• Today’s talk
– Computational Sprinting
• Improving responsiveness for mobile systems by briefly exceeding thermal limits
– Power mgmt. for Online Data Intensive Services
• Case study of server power management for Google's Web Search
Computational Sprinting
Arun Raghavan*, Yixin Luo+, Anuj Chandawalla+, Marios Papaefthymiou+, Kevin P. Pipe+#,
Thomas F. Wenisch+, Milo M. K. Martin*
University of Pennsylvania, Computer and Information Science*
University of Michigan, Electrical Eng. and Computer Science+
University of Michigan, Mechanical Engineering#
Computational Sprinting and Dark Silicon
• A Problem: “Dark Silicon” a.k.a. “The Utilization Wall”
– Increasing power density; can’t use all transistors all the time
– Cooling constraints limit mobile systems
• One approach: Use few transistors for long durations
– Specialized functional units [Accelerators, GreenDroid]
– Targeted towards sustained compute, e.g. media playback
• Our approach: Use many transistors for short durations
– Computational Sprinting by activating many “dark cores”
– Unsustainable power for short, intense bursts of compute
– Responsiveness for bursty/interactive applications
Is this feasible?
Sprinting Challenges and Opportunities

• Thermal challenges
  – How to extend sprint duration and intensity? Latent heat from phase change material close to the die
• Electrical challenges
  – How to supply peak currents? Ultracapacitor/battery hybrid
  – How to ensure power stability? Ramped activation (~100μs)
• Architectural challenges
  – How to control sprints? Thermal resource management
  – How do applications benefit from sprinting?

6.3x responsiveness for vision workloads on a real Core i7 testbed restricted to a 10W TDP
Power Density Trends for Sustained Compute

How to meet the thermal limit despite a >10x power density increase?

[Figure: power and temperature vs. time; temperature must remain below the thermal limit Tmax]
Option 1: Enhance Cooling?

Mobile devices limited to passive cooling

[Figure: temperature vs. time, bounded by Tmax]
Option 2: Decrease Chip Area?
Reduces cost, but sacrifices benefits from Moore’s law
Option 3: Decrease Active Fraction?

How do we extract application performance from this "dark silicon"?
Design for Responsiveness

• Observation: today, design for sustained performance
• But, consider emerging interactive mobile apps… [Clemons+ DAC'11, Hartl+ ECV'11, Girod+ IEEE Signal Processing'11]
  – Intense compute bursts in response to user input, then idle
  – Humans demand sub-second response times [Doherty+ IBM TR '82, Yan+ DAC'05, Shye+ MICRO'09, Blake+ ISCA'10]
Peak performance during bursts limits what applications can do
Parallel Computational Sprinting: Effect of Thermal Capacitance

[Figure, built up over several slides: power and temperature vs. time; thermal capacitance lets power briefly exceed the sustainable level while temperature stays below Tmax]

State of the art: Turbo Boost 2.0 exceeds sustainable power with DVFS (~25% for 25s)
Hardware Testbed

• Quad-core Core i7 desktop
  – Heat sink removed
  – Fan tuned for 10W TDP
• Power profile
  – Idle: 4.5 W
  – Sustainable (1 core, 1.6GHz): 9.5 W
  – Efficient sprint (4 cores, 1.6GHz): ~20 W
  – Max sprint (4 cores, 3.2GHz): ~50 W
• Temperature profile
  – Idle: ~45°C
  – Max safe: 78°C
• 20g copper heat spreader on package
  – Can absorb ~225J heat over 30°C temperature rise
Models a system capable of 5x max sprint intensity
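A back-of-envelope check of these numbers (my arithmetic, using the textbook specific heat of copper, c ≈ 0.385 J/(g·K), which is not stated on the slide):

```latex
\begin{align*}
Q &\approx m\, c\, \Delta T = 20\,\mathrm{g} \times 0.385\,\mathrm{J/(g\cdot K)} \times 30\,\mathrm{K} \approx 230\,\mathrm{J} \\
t_{\mathrm{max\ sprint}} &\lesssim \frac{Q}{P_{\mathrm{sprint}} - P_{\mathrm{dissipated}}}
  \approx \frac{225\,\mathrm{J}}{50\,\mathrm{W} - 10\,\mathrm{W}} \approx 5.6\,\mathrm{s} \\
t_{\mathrm{efficient\ sprint}} &\lesssim \frac{225\,\mathrm{J}}{20\,\mathrm{W} - 10\,\mathrm{W}} \approx 22\,\mathrm{s}
\end{align*}
```

These are loose upper bounds (the die, not the spreader, hits the temperature limit first), but they are consistent with the measured 3s and 19s sprints on the next slide.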
Power & Temperature Response

[Figure: measured power (W) and die temperature (°C) over time for sustained, sprint-1.6GHz, and sprint-3.2GHz runs]

Max sprint: 3s @ 3.2GHz, 19s @ 1.6GHz
Responsiveness & Energy Impact

[Figure: normalized speedup and normalized energy (split into idle and sprint components) for 3.2GHz and 1.6GHz sprints across six vision workloads: sobel, disparity, segment, kmeans, feature, texture]

Sprint for responsiveness (3.2GHz): 6.3x speedup
Race-to-idle (1.6GHz): 7% energy savings (!!)
Extending Sprint Intensity & Duration: Role of Thermal Capacitance
• Current systems designed for thermal conductivity
– Limited capacitance close to die
• To explicitly design for sprinting, add thermal capacitance near die
– Exploit latent heat from phase change material (PCM)
[Diagram: bare die vs. die with a PCM layer above it]
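For scale, a hedged back-of-envelope using a typical latent heat of fusion for paraffin, L ≈ 200 J/g (an assumed literature value, not from the slides):

```latex
Q_{\mathrm{PCM}} \approx m \cdot L_{\mathrm{fusion}} \approx 1\,\mathrm{g} \times 200\,\mathrm{J/g} = 200\,\mathrm{J}
```

Each gram of wax that melts buffers roughly as much heat as the entire 20g copper spreader stores over its 30°C rise, which is why even a thin PCM layer near the die can meaningfully extend sprint duration.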
PCM Heat Sink Prototype

• Aluminum foam mesh filled with Paraffin wax
  – Relatively form-stable; melting point near 55°C
  – Working on a fully-sealed prototype w/ thermocouples

[Photos: prototype shown in solid and liquid states, instrumented with two thermocouples, an on-die temperature sensor, and a camera]

Still working on this…
Impact of PCM prototype

[Figure (data from last week): die temperature (°C) vs. elapsed time for air (empty), water, and wax fills]

PCM extends max sprint duration by almost 3x
Power Management of
Online Data-Intensive Services
David Meisner†*, Christopher M. Sadler‡, Luiz A. Barroso‡, Wolf-Dietrich Weber‡, Thomas F. Wenisch†
†The University of Michigan *Facebook ‡Google, Inc.
International Symposium on Computer Architecture 2011
Power: A first-class data center constraint

• Annual data center CO2: 17 million households
• 2.5% of US energy, $7.4 billion/yr; installed base grows 11%/yr
(Sources: US EPA 2007; Mankoff et al., IEEE Computer 2008)

[Figure: lifetime cost of a data center, broken down across facility, electricity, and servers. Source: Barroso '10]

Improving energy & capital efficiency is a critical challenge
Peak power determines data center capital costs
Online Data-Intensive Services [ISCA 2011]
• Challenging workload class for power management
– Process TBs of data with O(ms) request latency
– Tail latencies critical (e.g., 95th, 99th-percentile latency)
– Provisioned by data set size and latency, not throughput
– Examples: web search, machine translation, online ads
• Case Study: Google Web Search
– First study on power management for OLDI services
– Goal: Identify which power modes are useful
The need for energy-proportionality

[Figure: serving at ~75%, ~50%, and ~20% of peak QPS]

How to achieve energy-proportionality at each QPS level?
Two-part study of Web Search

• Part 1: Cluster-scale throughput study
  – Web Search on O(1,000) node cluster
  – Measured per-component activity at leaf level
  – Used to derive upper bounds on power savings
  – Determine power modes of interest/non-interest
• Part 2: Single-node latency-constrained study
  – Evaluate power-latency tradeoffs of power modes
  – Can we achieve energy-proportionality with SLA "slack"?
Need coordinated, full-system active low-power modes
Background: Low-power mode taxonomy

• Active Modes – Reduce speed of component; continue to do work (e.g., DVFS)
• Idle Modes – Put component into sleep mode; no useful processing (e.g., ACPI C-states)

Spatial granularity   Active mode                        Idle mode
Disk                  Dual-speed disks                   Spin-down
CPU                   DVFS                               ACPI C-states
Mem                   MemScale [ASPLOS'11]               Self-refresh
System                Balanced scaling (open problem)    PowerNap [ASPLOS'09, TOCS'11]
Cluster               Consolidation & shutdown
Background: Web Search operation

[Diagram, built over three slides: a query (e.g., "Ice Cream", "Vanilla Ice Cream") enters at the root, fans out through an intermediary level to leaf nodes, each holding index and document shards; per-term document lists (e.g., for "Ice" and "Cream") are combined as results flow back up]

What if we turn off a fraction of the cluster?
➜ Cluster-level techniques cause data unavailability
Study #1: Cluster-scale throughput
• Web Search experimental setup
– O(1,000) server system
– Operated at 20%, 50%, 75% of peak QPS
– Traces of CPU util., memory bandwidth, disk util.
• Characterization
– Goal: find periods to use low-power modes
– Understand intensity and time-scale of utilization
– Analyze using activity graphs
CPU utilization

[Activity graph at 50% of max QPS: percent of time (y-axis) vs. time scale from 100μs to 10s (x-axis), with curves for idle and for utilization at or below 10%, 30%, and 50%. It answers: how often can we use a mode, and what is the intensity of activity? E.g., we can use a mode with a 2x slowdown 45% of the time on time scales of 1ms or less]

• Very little idleness at time scales > 10ms
• 1 ms granularity sufficient

➜ CPU active/idle mode opportunity from a bandwidth perspective
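To make the activity graph concrete, here is a small illustrative script for computing one from a utilization trace (my own sketch under an assumed trace format; it is not the paper's tooling). For each time scale T, it reports the fraction of windows of length T whose mean utilization is at or below each threshold.

```python
import random

def activity_graph(trace, sample_ms, time_scales_ms, thresholds=(0.0, 0.1, 0.3, 0.5)):
    """For each time scale T, report the fraction of non-overlapping windows of
    length T whose mean utilization is at or below each threshold (0.0 = idle).
    `trace` is per-sample utilization in [0, 1] at `sample_ms` resolution."""
    result = {}
    for T in time_scales_ms:
        win = max(1, int(T / sample_ms))
        windows = [trace[i:i + win] for i in range(0, len(trace) - win + 1, win)]
        means = [sum(w) / len(w) for w in windows]
        result[T] = {th: sum(m <= th for m in means) / len(means) for th in thresholds}
    return result

if __name__ == "__main__":
    random.seed(0)
    # Toy trace: bursty utilization sampled every 0.1 ms
    trace = [random.choice([0.0, 0.0, 0.9, 1.0]) for _ in range(200_000)]
    for T, fractions in activity_graph(trace, 0.1, [1, 10, 100, 1000]).items():
        print(f"{T} ms windows: " +
              ", ".join(f"<={th:.0%}: {p:.0%}" for th, p in fractions.items()))
```

Short time scales expose many lightly loaded windows; at longer time scales the windows average out, which is exactly the shape the curves above show.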
Memory bandwidth utilization

[Activity graph at 50% of max QPS: percent of time vs. time scale from 100ms to 1000s, with curves for idle and for utilization at or below 10%, 30%, and 50%. No significant idleness, but significant periods of under-utilization]

➜ Insufficient idleness for a memory idle low-power mode
➜ Disk transition times too slow (see paper for details)
Study #2: Leaf node latency

• Goal: Understand latency effect of power modes
• Leaf node testbed
  – Faithfully replicate production queries at leaf node
  – Arrival time distribution critical for accurate modeling
  – Up to 50% error from naïve loadtester
• Validated power-performance model
  – Characterize power-latency tradeoff on real HW
  – Evaluate power modes using Stochastic Queuing Simulation (SQS) [EXERT '10]
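As a toy stand-in for the kind of experiment SQS enables (not the authors' simulator), the sketch below models a single leaf with Poisson arrivals and exponential service, where waking from a full-system idle mode costs a transition delay, and reports the 95th-percentile latency:

```python
import random

def simulate(qps, service_s, transition_s, n=50_000, seed=1):
    """Toy single-server queue: Poisson arrivals, exponential service.
    If the server is idle when a query arrives, it pays `transition_s` to
    wake from a full-system idle mode before starting service. Returns the
    95th-percentile query latency in seconds."""
    random.seed(seed)
    t = 0.0          # arrival clock
    free_at = 0.0    # time the server next becomes free
    latencies = []
    for _ in range(n):
        t += random.expovariate(qps)                  # next arrival
        wake = transition_s if free_at <= t else 0.0  # idle -> pay wake-up cost
        start = max(t, free_at) + wake
        finish = start + random.expovariate(1.0 / service_s)
        latencies.append(finish - t)
        free_at = finish
    latencies.sort()
    return latencies[int(0.95 * len(latencies))]

if __name__ == "__main__":
    base = simulate(qps=500, service_s=0.001, transition_s=0.0)
    for tr in (0.0001, 0.001, 0.01):  # 0.1x, 1x, 10x the mean query time
        p95 = simulate(qps=500, service_s=0.001, transition_s=tr)
        print(f"transition {tr*1e3:.1f} ms: 95th-pct latency {p95/base:.2f}x baseline")
```

Even this toy model shows the qualitative trend of the next slides: transition times much larger than the mean query time inflate tail latency.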
Full-system coordinated idle modes

• Scarce full-system idleness in 16-core systems
• PowerNap [ASPLOS '09] with batching [Elnozahy et al. '03]

[Figure: power (percent of peak) vs. 95th-percentile latency increase (up to 10x) at 20% and 75% of peak QPS, for batching windows of 0.1x and 1x the average query time; longer transition times shift the curves toward higher latency]

➜ Batching + full-system idle modes ineffective
Full-system coordinated active scaling

• We assume CPU and memory scaling with P ∝ f^2.4
• Optimal mix requires coordination of CPU/memory modes
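For intuition about that exponent (my arithmetic, using the slide's scaling model):

```latex
\frac{P(f')}{P(f)} = \left(\frac{f'}{f}\right)^{2.4}
\;\;\Rightarrow\;\; (0.7)^{2.4} \approx 0.43
```

That is, under this model, running CPU and memory at 70% frequency costs roughly 43% of their full-frequency power, in exchange for longer query latency.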
[Figure: power (percent of peak) vs. 95th-percentile latency increase (up to 2x) at 20% and 75% of peak QPS for CPU-only, memory-only, and coordinated full-system scaling]

➜ Single-component active low-power modes insufficient
Comparing power modes

• Allow SLA "slack" – deviation from 95th-percentile latency

[Figure: average diurnal power (percent of peak) under 1x, 2x, and 5x SLA slack for each power mode]

➜ Core-level power modes provide negligible power savings
➜ Only coordinated active modes achieve proportionality
OLDI power management summary
• OLDI workloads challenging for power management
• Cluster-scale study
– Current CPU power modes sufficient
– Massive opportunity for active modes for memory
– Need faster idle and active modes for disk
• Latency-constrained study
– Individual idle/active power modes do not achieve proportionality
– PowerNap + batching provides poor latency-power tradeoffs
Need coordinated, full-system active low-power modes
to achieve energy proportionality for OLDI workloads
Typical data center utilization

Low utilization (≤20%) is endemic:
• Provisioning for peak load
• Performance isolation
• Redundancy

[Figure: fraction of time spent at each CPU utilization level for IT and Web 2.0 workloads. Sources: Barroso & Hölzle, Google 2007; customer traces supplied by HP Labs]

But, historically, vendors optimize & report peak power
Idle periods are short

[Figure: percent of idle time in periods of length ≤ L vs. idle period length L (10ms to 100s), for DNS, Shell, Mail, Web, HPC, and Backup servers. Source: Meisner '09]

Most idle periods are < 1 sec in length
Background: PowerNap [ASPLOS'09, TOCS'11]
Full System Idle Low-Power Mode

• Full-system nap during <1s idle periods
  – OS detects idleness, triggers transition
  – Transparent to user software
  – Exploits deepest component sleep modes

[Figure: power vs. time alternating between work and nap, with transitions in between]

Energy proportional if transition << avg. work
Average power

[Figure: average power (% of max power) vs. utilization for DVFS floors of 100%, 40%, and 20%, and for PowerNap with transition times Tt of 100 ms, 10 ms, and 1 ms]

DVFS saves rapidly, but limited by Pwrcpu
PowerNap becomes energy-proportional as Tt → 0
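A minimal sketch of the kind of analytic model behind these curves (my own simplification; the nap power fraction, the cubic DVFS exponent, and the fixed busy-period length are assumed parameters, not values from the paper):

```python
def powernap_avg_power(util, t_transition, t_busy=0.001, p_active=1.0, p_nap=0.05):
    """Average power (fraction of max) for a server that naps whenever it is idle.
    Each busy period of length t_busy is followed by an idle gap sized so long-run
    utilization equals `util`; every wake-up costs t_transition at full power."""
    t_idle = t_busy * (1.0 - util) / util        # idle gap that yields `util`
    t_nap = max(t_idle - t_transition, 0.0)      # portion of the gap actually napped
    t_full_power = t_busy + (t_idle - t_nap)     # work plus transition overhead
    return (t_full_power * p_active + t_nap * p_nap) / (t_busy + t_idle)

def dvfs_avg_power(util, p_cpu=0.4, f_min=0.2, alpha=3.0):
    """Average power when only the CPU scales: frequency tracks utilization down to
    f_min, CPU power scales as f**alpha (assumed), and the rest of the system stays on."""
    f = max(util, f_min)
    return (1.0 - p_cpu) + p_cpu * f ** alpha

if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5):
        print(f"util={u:.0%}: DVFS={dvfs_avg_power(u):.2f}  "
              f"PowerNap(1ms)={powernap_avg_power(u, 0.001):.2f}  "
              f"PowerNap(100ms)={powernap_avg_power(u, 0.1):.2f}")
```

It reproduces both takeaways: DVFS savings flatten once the non-CPU power floor dominates, while PowerNap's average power approaches util + (1 - util)·p_nap as the transition time shrinks toward zero.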
Response time

[Figure: relative response time vs. utilization for DVFS and for PowerNap with transition times Tt of 100 ms, 10 ms, and 1 ms]

DVFS response time penalty capped by fmin
PowerNap penalty negligible for Tt ≤ 1ms
PowerNap Hardware

[Figure: per-component nap power (W) and transition time (μs) for CPU, DRAM, NIC, SSD, and PSU]

Nap power is ~10W, but the PSU uses an additional 25W
The PSU is also the limiting factor for transition time
PowerNap & multicore scaling

• Request-level parallelism & core scaling thwart PowerNap
  – Full-system idleness vanishes…
  – …even at low utilization
  – Per-core idleness unaligned
• "Parking" (Package C6) saves little
  – Automatic per-core C1E on HLT already very good

[Figure: power savings vs. C1E at 30% utilization as cores per socket grow from 1 to 32, for core parking, socket parking, and PowerNap]

Need to create PowerNap opportunity via scheduling
Background: MemScale [ASPLOS'11]
Active Low-Power Mode for Main Memory
• Goal: Dynamically scale memory frequency to conserve energy
• Hardware mechanism:
– Frequency scaling (DFS) of the channels, DIMMs, DRAM devices
– Voltage & frequency scaling (DVFS) of the memory controller
• Key challenge:
– Conserving significant energy while meeting performance constraints
• Approach:
– Online profiling to estimate performance and bandwidth demand
– Epoch-based modeling and control to meet performance constraints
System energy savings of 18%
with average performance loss of 4%
MemScale frequency and slack management

[Figure: across epochs 1-4, the memory subsystem (MC, bus, DRAM) steps between high and low frequency; each epoch profiles the CPU, compares actual progress against the target, and tracks positive or negative slack]

Calculate slack vs. target; estimate performance/energy via models
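An illustrative epoch loop in the spirit of this control scheme (a sketch only: the frequency steps, the 4% slowdown budget, and the profiling hooks are hypothetical stand-ins, not MemScale's actual interface):

```python
# Illustrative epoch-based controller in the spirit of MemScale's scheme.
FREQS_MHZ = [800, 1066, 1333, 1600]        # assumed memory frequency steps

def pick_frequency(slack_cycles, freq_idx):
    """Raise frequency when behind the performance target (negative slack);
    lower it when positive slack has accumulated, to save energy."""
    if slack_cycles < 0 and freq_idx < len(FREQS_MHZ) - 1:
        return freq_idx + 1
    if slack_cycles > 0 and freq_idx > 0:
        return freq_idx - 1
    return freq_idx

def run_epochs(profile_epoch, apply_frequency, n_epochs, target_slowdown=0.04):
    """Each epoch: profile, compare actual cycles against a slowdown budget
    (e.g., 4%), accumulate slack, then choose next epoch's memory frequency."""
    freq_idx, slack = len(FREQS_MHZ) - 1, 0.0
    for _ in range(n_epochs):
        apply_frequency(FREQS_MHZ[freq_idx])
        baseline_cycles, actual_cycles = profile_epoch()   # e.g., from counters/models
        budget = baseline_cycles * (1.0 + target_slowdown)
        slack += budget - actual_cycles
        freq_idx = pick_frequency(slack, freq_idx)

if __name__ == "__main__":
    import random
    def fake_profile():                      # toy stand-in for hardware counters
        base = 1_000_000
        return base, int(base * random.uniform(0.98, 1.10))
    run_epochs(fake_profile, lambda mhz: print("set", mhz, "MHz"), n_epochs=5)
```

The key design point mirrors the slide: slack earned in light epochs can be spent running memory slower in later epochs while still meeting the overall performance target.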
In Brief: Computational Sprinting [HPCA 2012]

• Many interactive mobile apps are bursty
• Power density trend leading to "dark silicon" (esp. mobile)
  – Today, we design for sustained performance
  – Our goal: design for responsiveness
• Computational Sprinting
  – Intensely, but briefly exceed Thermal Design Power (TDP)
  – Buffer heat via thermal capacitance using phase change material

[Figure: die with PCM; during a sprint, temperature rises to the PCM melting point and stays below Tmax, then the PCM re-solidifies while idle]

10.2x responsiveness via 16-core sprints within 1W TDP