© 2013 Thomas Wenisch
Power Management from Smartphones to Data Centers
Thomas Wenisch, Morris Wellman Faculty Development Assistant Professor of CSE, University of Michigan
Acknowledgements:
Luiz Barroso, Anuj Chandawalla, Laurel Emurian, Brian Gold, Yixin Luo, Milo Martin, David Meisner, Marios Papaefthymiou, Steven Pelley, Kevin Pipe, Arun Raghavan, Chris Sadler, Lei Shao, Wolf Weber
A Paradigm Shift In Computing

[Figure: log-scale trends from 1985 to 2020 in transistors (100,000's), power (W), performance (GOPS), and efficiency (GOPS/W), with annotations marking limits on heat extraction and limits on the energy-efficiency of operations. Source: T. Mudge, IEEE Computer, April 2001]
A Paradigm Shift In Computing

[Same figure, annotated: the Era of High Performance Computing gives way to the Era of Energy-Efficient Computing c. 2000, as limits on heat extraction and on the energy-efficiency of operations stagnate performance growth. Source: T. Mudge, IEEE Computer, April 2001]
Four decades of Dennard Scaling

• P = C V² f
• Increase in device count
• Lower supply voltages
➜ Constant power/chip

Dennard et al., 1974
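As a quick check of the constant-power claim, here is the standard ideal-scaling arithmetic (a restatement of Dennard's rules, not text from the slides), with scaling factor κ per generation:

```latex
% Ideal Dennard scaling by a factor \kappa: dimensions and voltage shrink by 1/\kappa,
% frequency rises by \kappa, and device count per chip rises by \kappa^2.
\begin{align*}
C' &= C/\kappa, \qquad V' = V/\kappa, \qquad f' = \kappa f, \qquad N' = \kappa^2 N \\
P'_{\mathrm{chip}} &= N'\, C'\, V'^2\, f'
  = \kappa^2 N \cdot \frac{C}{\kappa} \cdot \frac{V^2}{\kappa^2} \cdot \kappa f
  = N\, C\, V^2\, f = P_{\mathrm{chip}}
\end{align*}
```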
Leakage Killed Dennard Scaling

Leakage is:
• Exponential in the inverse of Vth
• Exponential in temperature
• Linear in device count

To switch well, we must keep Vdd/Vth > 3
➜ Vdd can't go down
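For reference, a minimal sketch of the textbook subthreshold-leakage model behind these bullets (standard device-physics form, not taken from the slides; n, k, T, q are the usual subthreshold slope factor and physical constants):

```latex
\begin{align*}
I_{\mathrm{leak}} &\propto e^{-V_{th}\,/\,(n k T / q)}
  && \text{grows exponentially as } V_{th} \text{ falls, and strongly with temperature} \\
P_{\mathrm{leak,chip}} &\approx N \cdot V_{dd} \cdot I_{\mathrm{leak}}
  && \text{grows linearly with device count } N
\end{align*}
```

Keeping Vdd/Vth > 3 while lowering Vdd would force Vth down and leakage up exponentially, which is why Vdd scaling stopped.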
No more free lunch…
• Need system-level approaches to…
  – …turn increasing transistor counts into customer value
  – …without exceeding thermal limits
• Energy efficiency is the new performance
• Today’s talk
– Computational Sprinting
• Improving responsiveness for mobile systems by briefly exceeding thermal limits
– Power mgmt. for Online Data Intensive Services
• Case study of server power management for Google's Web Search
Computational Sprinting
Arun Raghavan*, Yixin Luo+, Anuj Chandawalla+, Marios Papaefthymiou+, Kevin P. Pipe+#,
Thomas F. Wenisch+, Milo M. K. Martin*
University of Pennsylvania, Computer and Information Science*
University of Michigan, Electrical Eng. and Computer Science+
University of Michigan, Mechanical Engineering#
Computational Sprinting and Dark Silicon
• A Problem: “Dark Silicon” a.k.a. “The Utilization Wall”
– Increasing power density; can’t use all transistors all the time
– Cooling constraints limit mobile systems
• One approach: Use few transistors for long durations
– Specialized functional units [Accelerators, GreenDroid]
– Targeted towards sustained compute, e.g. media playback
• Our approach: Use many transistors for short durations
– Computational Sprinting by activating many “dark cores”
– Unsustainable power for short, intense bursts of compute
– Responsiveness for bursty/interactive applications
Is this feasible?
Sprinting Challenges and Opportunities

• Thermal challenges
  – How to extend sprint duration and intensity? Latent heat from phase change material close to the die
• Electrical challenges
  – How to supply peak currents? Ultracapacitor/battery hybrid
  – How to ensure power stability? Ramped activation (~100μs)
• Architectural challenges
  – How to control sprints? Thermal resource management
  – How do applications benefit from sprinting?

6.3x responsiveness for vision workloads on a real Core i7 testbed restricted to a 10W TDP
Power Density Trends for Sustained Compute

How to meet the thermal limit despite a >10x power density increase?

[Figure: power and temperature vs. time; temperature must remain below the thermal limit Tmax]
Option 1: Enhance Cooling?

Mobile devices limited to passive cooling

[Figure: temperature vs. time, bounded by Tmax]
Option 2: Decrease Chip Area?
Reduces cost, but sacrifices benefits from Moore’s law
Option 3: Decrease Active Fraction?

How do we extract application performance from this "dark silicon"?
Design for Responsiveness

• Observation: today, design for sustained performance
• But, consider emerging interactive mobile apps… [Clemons+ DAC'11, Hartl+ ECV'11, Girod+ IEEE Signal Processing'11]
  – Intense compute bursts in response to user input, then idle
  – Humans demand sub-second response times [Doherty+ IBM TR '82, Yan+ DAC'05, Shye+ MICRO'09, Blake+ ISCA'10]
Peak performance during bursts limits what applications can do
Parallel Computational Sprinting: Effect of Thermal Capacitance

[Figure, built up over several slides: power and temperature vs. time; thermal capacitance lets power briefly exceed the sustainable level while temperature stays below Tmax]

State of the art: Turbo Boost 2.0 exceeds sustainable power with DVFS (~25% for 25s)
Hardware Testbed

• Quad-core Core i7 desktop
  – Heat sink removed
  – Fan tuned for 10W TDP
• Power profile
  – Idle: 4.5 W
  – Sustainable (1 core, 1.6GHz): 9.5 W
  – Efficient sprint (4 cores, 1.6GHz): ~20 W
  – Max sprint (4 cores, 3.2GHz): ~50 W
• Temperature profile
  – Idle: ~45°C
  – Max safe: 78°C
• 20g copper heat spreader on package
  – Can absorb ~225J heat over 30°C temperature rise
Models a system capable of 5x max sprint intensity
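A back-of-envelope check of these numbers (my arithmetic, using the textbook specific heat of copper, c ≈ 0.385 J/(g·K), which is not stated on the slide):

```latex
\begin{align*}
Q &\approx m\, c\, \Delta T = 20\,\mathrm{g} \times 0.385\,\mathrm{J/(g\cdot K)} \times 30\,\mathrm{K} \approx 230\,\mathrm{J} \\
t_{\mathrm{max\ sprint}} &\lesssim \frac{Q}{P_{\mathrm{sprint}} - P_{\mathrm{dissipated}}}
  \approx \frac{225\,\mathrm{J}}{50\,\mathrm{W} - 10\,\mathrm{W}} \approx 5.6\,\mathrm{s} \\
t_{\mathrm{efficient\ sprint}} &\lesssim \frac{225\,\mathrm{J}}{20\,\mathrm{W} - 10\,\mathrm{W}} \approx 22\,\mathrm{s}
\end{align*}
```

These are loose upper bounds (the die, not the spreader, hits the temperature limit first), but they are consistent with the measured 3s and 19s sprints on the next slide.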
Power & Temperature Response

[Figure: measured power (W) and die temperature (°C) over time for sustained, sprint-1.6GHz, and sprint-3.2GHz runs]

Max sprint: 3s @ 3.2GHz, 19s @ 1.6GHz
Responsiveness & Energy Impact

[Figure: normalized speedup and normalized energy (split into idle and sprint components) for 3.2GHz and 1.6GHz sprints across six vision workloads: sobel, disparity, segment, kmeans, feature, texture]

Sprint for responsiveness (3.2GHz): 6.3x speedup
Race-to-idle (1.6GHz): 7% energy savings (!!)
Extending Sprint Intensity & Duration: Role of Thermal Capacitance
• Current systems designed for thermal conductivity
– Limited capacitance close to die
• To explicitly design for sprinting, add thermal capacitance near die
– Exploit latent heat from phase change material (PCM)
[Diagram: bare die vs. die with a PCM layer above it]
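For scale, a hedged back-of-envelope using a typical latent heat of fusion for paraffin, L ≈ 200 J/g (an assumed literature value, not from the slides):

```latex
Q_{\mathrm{PCM}} \approx m \cdot L_{\mathrm{fusion}} \approx 1\,\mathrm{g} \times 200\,\mathrm{J/g} = 200\,\mathrm{J}
```

Each gram of wax that melts buffers roughly as much heat as the entire 20g copper spreader stores over its 30°C rise, which is why even a thin PCM layer near the die can meaningfully extend sprint duration.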
PCM Heat Sink Prototype

• Aluminum foam mesh filled with Paraffin wax
  – Relatively form-stable; melting point near 55°C
  – Working on a fully-sealed prototype w/ thermocouples

[Photos: prototype shown in solid and liquid states, instrumented with two thermocouples, an on-die temperature sensor, and a camera]

Still working on this…
Impact of PCM prototype

[Figure (data from last week): die temperature (°C) vs. elapsed time for air (empty), water, and wax fills]

PCM extends max sprint duration by almost 3x
Power Management of
Online Data-Intensive Services
David Meisner†*, Christopher M. Sadler‡, Luiz A. Barroso‡, Wolf-Dietrich Weber‡, Thomas F. Wenisch†
†The University of Michigan *Facebook ‡Google, Inc.
International Symposium on Computer Architecture 2011
Power: A first-class data center constraint

• Annual data center CO2: 17 million households
• 2.5% of US energy, $7.4 billion/yr; installed base grows 11%/yr
(Sources: US EPA 2007; Mankoff et al., IEEE Computer 2008)

[Figure: lifetime cost of a data center, broken down across facility, electricity, and servers. Source: Barroso '10]

Improving energy & capital efficiency is a critical challenge
Peak power determines data center capital costs
Online Data-Intensive Services [ISCA 2011]
• Challenging workload class for power management
– Process TBs of data with O(ms) request latency
– Tail latencies critical (e.g., 95th, 99th-percentile latency)
– Provisioned by data set size and latency, not throughput
– Examples: web search, machine translation, online ads
• Case Study: Google Web Search
– First study on power management for OLDI services
– Goal: Identify which power modes are useful
The need for energy-proportionality

[Figure: serving at ~75%, ~50%, and ~20% of peak QPS]

How to achieve energy-proportionality at each QPS level?
Two-part study of Web Search

• Part 1: Cluster-scale throughput study
  – Web Search on O(1,000) node cluster
  – Measured per-component activity at leaf level
  – Used to derive upper bounds on power savings
  – Determine power modes of interest/non-interest
• Part 2: Single-node latency-constrained study
  – Evaluate power-latency tradeoffs of power modes
  – Can we achieve energy-proportionality with SLA "slack"?
Need coordinated, full-system active low-power modes
Background: Low-power mode taxonomy

• Active Modes – Reduce speed of component; continue to do work (e.g., DVFS)
• Idle Modes – Put component into sleep mode; no useful processing (e.g., ACPI C-states)

Spatial granularity   Active mode                        Idle mode
Disk                  Dual-speed disks                   Spin-down
CPU                   DVFS                               ACPI C-states
Mem                   MemScale [ASPLOS'11]               Self-refresh
System                Balanced scaling (open problem)    PowerNap [ASPLOS'09, TOCS'11]
Cluster               Consolidation & shutdown
Background: Web Search operation

[Diagram, built over three slides: a query (e.g., "Ice Cream", "Vanilla Ice Cream") enters at the root, fans out through an intermediary level to leaf nodes, each holding index and document shards; per-term document lists (e.g., for "Ice" and "Cream") are combined as results flow back up]

What if we turn off a fraction of the cluster?
➜ Cluster-level techniques cause data unavailability
Study #1: Cluster-scale throughput
• Web Search experimental setup
– O(1,000) server system
– Operated at 20%, 50%, 75% of peak QPS
– Traces of CPU util., memory bandwidth, disk util.
• Characterization
– Goal: find periods to use low-power modes
– Understand intensity and time-scale of utilization
– Analyze using activity graphs
CPU utilization

[Activity graph at 50% of max QPS: percent of time (y-axis) vs. time scale from 100μs to 10s (x-axis), with curves for idle and for utilization at or below 10%, 30%, and 50%. It answers: how often can we use a mode, and what is the intensity of activity? E.g., we can use a mode with a 2x slowdown 45% of the time on time scales of 1ms or less]

• Very little idleness at time scales > 10ms
• 1 ms granularity sufficient

➜ CPU active/idle mode opportunity from a bandwidth perspective
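To make the activity graph concrete, here is a small illustrative script for computing one from a utilization trace (my own sketch under an assumed trace format; it is not the paper's tooling). For each time scale T, it reports the fraction of windows of length T whose mean utilization is at or below each threshold.

```python
import random

def activity_graph(trace, sample_ms, time_scales_ms, thresholds=(0.0, 0.1, 0.3, 0.5)):
    """For each time scale T, report the fraction of non-overlapping windows of
    length T whose mean utilization is at or below each threshold (0.0 = idle).
    `trace` is per-sample utilization in [0, 1] at `sample_ms` resolution."""
    result = {}
    for T in time_scales_ms:
        win = max(1, int(T / sample_ms))
        windows = [trace[i:i + win] for i in range(0, len(trace) - win + 1, win)]
        means = [sum(w) / len(w) for w in windows]
        result[T] = {th: sum(m <= th for m in means) / len(means) for th in thresholds}
    return result

if __name__ == "__main__":
    random.seed(0)
    # Toy trace: bursty utilization sampled every 0.1 ms
    trace = [random.choice([0.0, 0.0, 0.9, 1.0]) for _ in range(200_000)]
    for T, fractions in activity_graph(trace, 0.1, [1, 10, 100, 1000]).items():
        print(f"{T} ms windows: " +
              ", ".join(f"<={th:.0%}: {p:.0%}" for th, p in fractions.items()))
```

Short time scales expose many lightly loaded windows; at longer time scales the windows average out, which is exactly the shape the curves above show.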
Memory bandwidth utilization

[Activity graph at 50% of max QPS: percent of time vs. time scale from 100ms to 1000s, with curves for idle and for utilization at or below 10%, 30%, and 50%. No significant idleness, but significant periods of under-utilization]

➜ Insufficient idleness for a memory idle low-power mode
➜ Disk transition times too slow (see paper for details)
Study #2: Leaf node latency

• Goal: Understand latency effect of power modes
• Leaf node testbed
  – Faithfully replicate production queries at leaf node
  – Arrival time distribution critical for accurate modeling
  – Up to 50% error from naïve loadtester
• Validated power-performance model
  – Characterize power-latency tradeoff on real HW
  – Evaluate power modes using Stochastic Queuing Simulation (SQS) [EXERT '10]
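As a toy stand-in for the kind of experiment SQS enables (not the authors' simulator), the sketch below models a single leaf with Poisson arrivals and exponential service, where waking from a full-system idle mode costs a transition delay, and reports the 95th-percentile latency:

```python
import random

def simulate(qps, service_s, transition_s, n=50_000, seed=1):
    """Toy single-server queue: Poisson arrivals, exponential service.
    If the server is idle when a query arrives, it pays `transition_s` to
    wake from a full-system idle mode before starting service. Returns the
    95th-percentile query latency in seconds."""
    random.seed(seed)
    t = 0.0          # arrival clock
    free_at = 0.0    # time the server next becomes free
    latencies = []
    for _ in range(n):
        t += random.expovariate(qps)                  # next arrival
        wake = transition_s if free_at <= t else 0.0  # idle -> pay wake-up cost
        start = max(t, free_at) + wake
        finish = start + random.expovariate(1.0 / service_s)
        latencies.append(finish - t)
        free_at = finish
    latencies.sort()
    return latencies[int(0.95 * len(latencies))]

if __name__ == "__main__":
    base = simulate(qps=500, service_s=0.001, transition_s=0.0)
    for tr in (0.0001, 0.001, 0.01):  # 0.1x, 1x, 10x the mean query time
        p95 = simulate(qps=500, service_s=0.001, transition_s=tr)
        print(f"transition {tr*1e3:.1f} ms: 95th-pct latency {p95/base:.2f}x baseline")
```

Even this toy model shows the qualitative trend of the next slides: transition times much larger than the mean query time inflate tail latency.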
Full-system coordinated idle modes

• Scarce full-system idleness in 16-core systems
• PowerNap [ASPLOS '09] with batching [Elnozahy et al. '03]

[Figure: power (percent of peak) vs. 95th-percentile latency increase (up to 10x) at 20% and 75% of peak QPS, for batching windows of 0.1x and 1x the average query time; longer transition times shift the curves toward higher latency]

➜ Batching + full-system idle modes ineffective
Full-system coordinated active scaling

• We assume CPU and memory scaling with P ∝ f^2.4
• Optimal mix requires coordination of CPU/memory modes
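For intuition about that exponent (my arithmetic, using the slide's scaling model):

```latex
\frac{P(f')}{P(f)} = \left(\frac{f'}{f}\right)^{2.4}
\;\;\Rightarrow\;\; (0.7)^{2.4} \approx 0.43
```

That is, under this model, running CPU and memory at 70% frequency costs roughly 43% of their full-frequency power, in exchange for longer query latency.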
[Figure: power (percent of peak) vs. 95th-percentile latency increase (up to 2x) at 20% and 75% of peak QPS for CPU-only, memory-only, and coordinated full-system scaling]

➜ Single-component active low-power modes insufficient
Comparing power modes

• Allow SLA "slack" – deviation from 95th-percentile latency

[Figure: average diurnal power (percent of peak) under 1x, 2x, and 5x SLA slack for each power mode]

➜ Core-level power modes provide negligible power savings
➜ Only coordinated active modes achieve proportionality
OLDI power management summary
• OLDI workloads challenging for power management
• Cluster-scale study
– Current CPU power modes sufficient
– Massive opportunity for active modes for memory
– Need faster idle and active modes for disk
• Latency-constrained study
– Individual idle/active power modes do not achieve proportionality
– PowerNap + batching provides poor latency-power tradeoffs
Need coordinated, full-system active low-power modes
to achieve energy proportionality for OLDI workloads
Typical data center utilization

Low utilization (≤20%) is endemic:
• Provisioning for peak load
• Performance isolation
• Redundancy

[Figure: fraction of time spent at each CPU utilization level for IT and Web 2.0 workloads. Sources: Barroso & Hölzle, Google 2007; customer traces supplied by HP Labs]

But, historically, vendors optimize & report peak power
Idle periods are short

[Figure: percent of idle time in periods of length ≤ L vs. idle period length L (10ms to 100s), for DNS, Shell, Mail, Web, HPC, and Backup servers. Source: Meisner '09]

Most idle periods are < 1 sec in length
Background: PowerNap [ASPLOS'09, TOCS'11]
Full System Idle Low-Power Mode

• Full-system nap during <1s idle periods
  – OS detects idleness, triggers transition
  – Transparent to user software
  – Exploits deepest component sleep modes

[Figure: power vs. time alternating between work and nap, with transitions in between]

Energy proportional if transition << avg. work
Average power

[Figure: average power (% of max power) vs. utilization for DVFS floors of 100%, 40%, and 20%, and for PowerNap with transition times Tt of 100 ms, 10 ms, and 1 ms]

DVFS saves rapidly, but limited by Pwrcpu
PowerNap becomes energy-proportional as Tt → 0
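A minimal sketch of the kind of analytic model behind these curves (my own simplification; the nap power fraction, the cubic DVFS exponent, and the fixed busy-period length are assumed parameters, not values from the paper):

```python
def powernap_avg_power(util, t_transition, t_busy=0.001, p_active=1.0, p_nap=0.05):
    """Average power (fraction of max) for a server that naps whenever it is idle.
    Each busy period of length t_busy is followed by an idle gap sized so long-run
    utilization equals `util`; every wake-up costs t_transition at full power."""
    t_idle = t_busy * (1.0 - util) / util        # idle gap that yields `util`
    t_nap = max(t_idle - t_transition, 0.0)      # portion of the gap actually napped
    t_full_power = t_busy + (t_idle - t_nap)     # work plus transition overhead
    return (t_full_power * p_active + t_nap * p_nap) / (t_busy + t_idle)

def dvfs_avg_power(util, p_cpu=0.4, f_min=0.2, alpha=3.0):
    """Average power when only the CPU scales: frequency tracks utilization down to
    f_min, CPU power scales as f**alpha (assumed), and the rest of the system stays on."""
    f = max(util, f_min)
    return (1.0 - p_cpu) + p_cpu * f ** alpha

if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5):
        print(f"util={u:.0%}: DVFS={dvfs_avg_power(u):.2f}  "
              f"PowerNap(1ms)={powernap_avg_power(u, 0.001):.2f}  "
              f"PowerNap(100ms)={powernap_avg_power(u, 0.1):.2f}")
```

It reproduces both takeaways: DVFS savings flatten once the non-CPU power floor dominates, while PowerNap's average power approaches util + (1 - util)·p_nap as the transition time shrinks toward zero.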
Response time

[Figure: relative response time vs. utilization for DVFS and for PowerNap with transition times Tt of 100 ms, 10 ms, and 1 ms]

DVFS response time penalty capped by fmin
PowerNap penalty negligible for Tt ≤ 1ms
PowerNap Hardware

[Figure: per-component nap power (W) and transition time (μs) for CPU, DRAM, NIC, SSD, and PSU]

Nap power is ~10W, but the PSU uses an additional 25W
The PSU is also the limiting factor for transition time
PowerNap & multicore scaling

• Request-level parallelism & core scaling thwart PowerNap
  – Full-system idleness vanishes…
  – …even at low utilization
  – Per-core idleness unaligned
• "Parking" (Package C6) saves little
  – Automatic per-core C1E on HLT already very good

[Figure: power savings vs. C1E at 30% utilization as cores per socket grow from 1 to 32, for core parking, socket parking, and PowerNap]

Need to create PowerNap opportunity via scheduling
Background: MemScale [ASPLOS'11]
Active Low-Power Mode for Main Memory
• Goal: Dynamically scale memory frequency to conserve energy
• Hardware mechanism:
– Frequency scaling (DFS) of the channels, DIMMs, DRAM devices
– Voltage & frequency scaling (DVFS) of the memory controller
• Key challenge:
– Conserving significant energy while meeting performance constraints
• Approach:
– Online profiling to estimate performance and bandwidth demand
– Epoch-based modeling and control to meet performance constraints
System energy savings of 18%
with average performance loss of 4%
MemScale frequency and slack management

[Figure: across epochs 1-4, the memory subsystem (MC, bus, DRAM) steps between high and low frequency; each epoch profiles the CPU, compares actual progress against the target, and tracks positive or negative slack]

Calculate slack vs. target; estimate performance/energy via models
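An illustrative epoch loop in the spirit of this control scheme (a sketch only: the frequency steps, the 4% slowdown budget, and the profiling hooks are hypothetical stand-ins, not MemScale's actual interface):

```python
# Illustrative epoch-based controller in the spirit of MemScale's scheme.
FREQS_MHZ = [800, 1066, 1333, 1600]        # assumed memory frequency steps

def pick_frequency(slack_cycles, freq_idx):
    """Raise frequency when behind the performance target (negative slack);
    lower it when positive slack has accumulated, to save energy."""
    if slack_cycles < 0 and freq_idx < len(FREQS_MHZ) - 1:
        return freq_idx + 1
    if slack_cycles > 0 and freq_idx > 0:
        return freq_idx - 1
    return freq_idx

def run_epochs(profile_epoch, apply_frequency, n_epochs, target_slowdown=0.04):
    """Each epoch: profile, compare actual cycles against a slowdown budget
    (e.g., 4%), accumulate slack, then choose next epoch's memory frequency."""
    freq_idx, slack = len(FREQS_MHZ) - 1, 0.0
    for _ in range(n_epochs):
        apply_frequency(FREQS_MHZ[freq_idx])
        baseline_cycles, actual_cycles = profile_epoch()   # e.g., from counters/models
        budget = baseline_cycles * (1.0 + target_slowdown)
        slack += budget - actual_cycles
        freq_idx = pick_frequency(slack, freq_idx)

if __name__ == "__main__":
    import random
    def fake_profile():                      # toy stand-in for hardware counters
        base = 1_000_000
        return base, int(base * random.uniform(0.98, 1.10))
    run_epochs(fake_profile, lambda mhz: print("set", mhz, "MHz"), n_epochs=5)
```

The key design point mirrors the slide: slack earned in light epochs can be spent running memory slower in later epochs while still meeting the overall performance target.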
In Brief: Computational Sprinting [HPCA 2012]

• Many interactive mobile apps are bursty
• Power density trend leading to "dark silicon" (esp. mobile)
  – Today, we design for sustained performance
  – Our goal: design for responsiveness
• Computational Sprinting
  – Intensely, but briefly exceed Thermal Design Power (TDP)
  – Buffer heat via thermal capacitance using phase change material

[Figure: die with PCM; during a sprint, temperature rises to the PCM melting point and stays below Tmax, then the PCM re-solidifies while idle]

10.2x responsiveness via 16-core sprints within 1W TDP