

There are no comprehensive, holistic studies of performance, power and thermals on distributed scientific systems and workloads.

Without innovation, future HEC systems will waste performance potential, waste energy, and require extravagant cooling.

Improving Performance, Power, and Thermal Efficiency in High-End Systems

Kirk W. Cameron
Scalable Performance Laboratory
Department of Computer Science and Engineering
Virginia Tech

cameron@cs.vt.edu

Introduction

Performance Efficiency

Power Efficiency

Thermal Efficiency

Problem Statement
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in: 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational cost for power and cooling.

Performance, Power, and Thermal Facts

The gap between peak and achieved performance is growing.
A 5-megawatt supercomputer can consume $4M in energy annually (a rough cost estimate follows below).
In just 2 hours, the Earth Simulator can produce enough heat to heat a home in the Midwest all winter long.
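As a sanity check on the energy figure, assuming an electricity rate of roughly $0.09/kWh (the rate is an assumption, not a number from the poster):

```latex
% Rough annual energy cost of a 5 MW system at an assumed $0.09/kWh:
5\,\mathrm{MW}\times 8760\,\mathrm{h/yr} = 43{,}800\,\mathrm{MWh/yr}
  \approx 4.38\times 10^{7}\,\mathrm{kWh/yr},
\qquad
4.38\times 10^{7}\,\mathrm{kWh/yr}\times \$0.09/\mathrm{kWh} \approx \$3.9\mathrm{M/yr}.
```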

Projections
Commodity components fail at an annual rate of 2-3%.
A petaflop system of ~12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours (a rough estimate follows below).
Life expectancy of an electronic component decreases 50% for every 10°C (18°F) temperature increase.
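The once-per-day failure projection follows from the stated failure rate; treating the 2-3% figure as a per-node annual rate and taking the upper end (an assumption about how the rate is applied):

```latex
% Approximate hardware failure rate of a ~12,000-node system:
12{,}000\ \text{nodes}\times 0.03\ \text{failures/node/yr}
  \approx 360\ \text{failures/yr}\approx 1\ \text{failure/day}.
```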

Our Approach
Observations: Predictive models and techniques are needed to maximize performance of emergent systems. Additional below-peak performance may provide adequate “slack times” for improved power and thermal efficiencies.
Constraint: Performance is the critical constraint. Reduce power and thermals ONLY if it does not reduce performance significantly.

Relevant approaches to the problem
Improving Performance Efficiencies
A myriad of tools and modeling techniques exist to analyze and optimize the performance of parallel scientific applications. In our work we focus on using fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Architecture.

Improving Power Efficiencies
Exploit application “slack times” to operate various components in lower power modes (e.g., dynamic voltage and frequency scaling, or DVFS) to conserve power and energy. Prior to our work, no framework existed for profiling the performance and power of parallel systems and applications.
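For concreteness, a minimal sketch of putting a core into a lower power mode through the Linux cpufreq userspace governor is shown below. The sysfs path and the 600 MHz target are illustrative, and this is not Tempest or PowerPack code:

```c
#include <stdio.h>

/* Minimal sketch: request a fixed CPU frequency through the Linux
 * cpufreq "userspace" governor by writing to sysfs.  Assumes the
 * governor is already set to "userspace" and that the caller has
 * permission to write the file (typically root).  Illustrative only. */
static int set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", khz);        /* value is in kHz */
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example: drop core 0 to 600 MHz during a communication phase. */
    if (set_cpu_khz(0, 600000) != 0)
        perror("set_cpu_khz");
    return 0;
}
```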

Improving Thermal Efficiencies
Exploit application “slack times” to operate various components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, no framework existed for profiling the performance and thermals of parallel systems and applications.

Our Contributions
I. A portable framework to profile, analyze and optimize distributed applications for performance, power, and thermals with minimal performance impact.
II. Performance-power-thermal tradeoff studies and optimizations of scientific workloads on various architectures.

Performance analysis of NAS parallel benchmarks

Distributed Thermal Profiles: A thermal profile of FT (above) reveals thermal patterns corresponding to code phases. Floating point intensive phases run hot while memory bound phases run cooler. Also, significant temperature drops occur in very short periods of time. Thermal behavior of BT (not pictured) shows temperatures synchronize with workload behavior across nodes. We also observe some nodes trend hotter than others. All of this data was obtained using Tempest.

Temperature-Performance tradeoffs

Thermal-performance tradeoffs are studied using Tempest, with DVFS strategies applied to reduce temperature in parallel scientific applications.

Download Tempest
Tempest is available for download from http://sourceforge.net. Related papers can be found at http://scape.cs.vt.edu.

Tempest Software Architecture

Detailed thermal profile of FT (Class C, NP=4)

Thermal optimizations are achieved with minimal performance impact

Thermal regulation: (top & top right) Tempest controller constrains temperature to within a threshold. Since the controller is heuristic, the temperature can exceed the threshold. However, temperature is typically controlled well using DVFS in a node. The weighted importance of thermals, performance and energy can determine the “best” operating point over a number of nodes.
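As an illustration of the kind of heuristic involved, the following sketch polls a temperature sensor and steps the CPU frequency down or up around a threshold. It is a reconstruction of the general idea only; the sensor path, thresholds, and frequency table are assumptions, not Tempest internals:

```c
#include <stdio.h>
#include <unistd.h>

/* Illustrative threshold controller: poll a temperature sensor, step
 * the CPU frequency down when the reading exceeds a threshold, and
 * step it back up once the node cools off.  Paths, thresholds, and
 * the frequency table are placeholders for the sketch. */

static const long freq_khz[] = { 600000, 800000, 1000000, 1200000, 1400000 };
enum { NFREQ = 5 };

static void set_cpu0_khz(long khz)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
    if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
}

static double read_temp_c(void)
{
    long millic = 0;   /* Linux commonly reports millidegrees C here. */
    FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    if (f) { fscanf(f, "%ld", &millic); fclose(f); }
    return millic / 1000.0;
}

int main(void)
{
    int level = NFREQ - 1;                 /* start at the highest frequency */
    const double hot = 65.0, cool = 60.0;  /* example thresholds, deg C */

    for (;;) {
        double t = read_temp_c();
        if (t > hot && level > 0)
            set_cpu0_khz(freq_khz[--level]);     /* too hot: slow down */
        else if (t < cool && level < NFREQ - 1)
            set_cpu0_khz(freq_khz[++level]);     /* cooled: speed back up */
        sleep(1);                                /* 1 s poll interval */
    }
}
```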

CPU Impact on Thermals: (left) For floating point intensive codes (e.g. SP, FT, EP from NAS) CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that significantly reduce CPU heat should impact total system temperature significantly.

[Figure: Tempest software architecture. Sequential or parallel applications written in C, C++ or Fortran are instrumented either by the compiler (GNU compiler with libtempestfunc.so) or by the user (any compiler with libtempestperblk.so). Per-node trace files and sensor stubs feed the Tempest parser, which produces a detailed per-process, per-node thermal functional or block profile. An automated or user-defined controller (libtempestctrl.so, libtempestdvfs.so) drives DVFS and keeps a per-node log of operating frequency and thermals across the nodes of a Tempest-aware cluster.]

Thermal regulation of IS (Class C, NP=4)

Thermal regulation of FT (Class C, NP=4)

Thermal-aware Performance Impact: (right) The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS PB codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C.

[Figure: Average CPU temperature (°F) for various NAS PB codes (BT, CG, EP, FT, IS, LU, MG, SP), plotted on a 95-125°F scale.]

[Figure: The efficiency gap between peak and achieved performance, plotted from 1 GFLOPS to 10^6 GFLOPS over June 1993 - June 2009, annotated with system and reference power draws: TMC CM-5, 5 kW; Fujitsu Numerical Wind Tunnel, 100 kW; Intel ASCI Red, 850 kW; IBM SP ASCI White, 6,000 kW; Earth Simulator, 18,000 kW; IBM ASCI Purple, 20,000 kW (?); IBM Blue Gene/L, ?? kW; compared with a residential air conditioner at 15 kW, a commercial data center at 1,374 kW, a high-speed electric train at 10,000 kW, and a small power plant generating capacity of 300,000 kW.]

Tempest profiling techniques are automatic, accurate, and portable.

8-node Dori

PowerPack II Software
Power profiling API library - synchronized profiling of parallel applications (a hypothetical usage sketch follows this list).
Power control API library - synchronized DVS control within parallel applications.
Multimeter middleware - coordinates data from multiple meter sources.
Power analyzer middleware - sorts/sifts/analyzes/correlates profiling data.
Performance profiler - uses common utilities to poll system performance status.
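A hypothetical use of such a profiling API from within an MPI code might look like the sketch below. The pmeter_* names are invented for illustration and are not the actual PowerPack symbols; the stubs only log timestamps so the example compiles and runs, whereas the real framework would synchronize these markers with the meter-side middleware:

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical power-profiling markers: bracket code regions of
 * interest so power samples can later be correlated with phases.
 * These stubs just print timestamps; they are placeholders, not the
 * real PowerPack API. */
static void pmeter_mark(const char *what, const char *phase)
{
    fprintf(stderr, "[pmeter] %-5s %-12s t=%.6f\n", what, phase, MPI_Wtime());
}
static void pmeter_begin(const char *phase) { pmeter_mark("begin", phase); }
static void pmeter_end(const char *phase)   { pmeter_mark("end",   phase); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    pmeter_begin("fft_compute");
    /* ... computation phase of the application ... */
    pmeter_end("fft_compute");

    pmeter_begin("alltoall");
    /* ... communication phase, e.g. MPI_Alltoall(...) ... */
    pmeter_end("alltoall");

    MPI_Finalize();
    return 0;
}
```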

This work sponsored in part by the Department of Energy Office of Science Early Career Principal Investigator (ECPI) Program under grant number DOE DE-FG02-04ER25608.

[Figure: PowerPack measurement setup. Each node under test has sense resistors inserted in its component power lines; multimeters read the voltages and report over RS232/GBIC to a data collection system attached to the Ethernet switch. Component power is computed as P_Component = (V_S - V_R) * V_R / R.]
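The per-component power computation from the measurement setup can be written directly as a small helper, reading V_R as the voltage drop across the sense resistor (the example values are illustrative only):

```c
#include <stdio.h>

/* Power drawn by a component measured with a series sense resistor.
 * v_supply: supply-side voltage V_S (volts)
 * v_sense:  voltage drop V_R across the sense resistor (volts)
 * r_ohms:   sense-resistor value R (ohms)
 * Current through the resistor is V_R / R and the component sees
 * V_S - V_R, so P = (V_S - V_R) * V_R / R, as in the figure. */
static double component_power(double v_supply, double v_sense, double r_ohms)
{
    return (v_supply - v_sense) * v_sense / r_ohms;
}

int main(void)
{
    /* Illustrative numbers: a 12 V line with a 30 mV drop across a
     * 10 milliohm sense resistor -> roughly 35.9 W. */
    printf("P = %.2f W\n", component_power(12.0, 0.030, 0.010));
    return 0;
}
```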

Distributed Power Profiles: NAS codes exhibit regularity (e.g. FT on 4 nodes - above left) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use (above right). Power consumption can vary with node for a single application, with the number of nodes under a fixed workload, and with varied workload under a fixed number of nodes. Results often correlate with the communication-to-computation (comm/comp) ratio.

[Figure: Normalized energy and delay with CPU MISER for FT.C.8. Bars (0.00-1.20) compare the default cpuspeed daemon (auto), fixed CPU frequency settings (600, 800, 1000, 1200, 1400), and CPU MISER, showing normalized delay and normalized energy for each setting.]

[Figure: Memory MISER online memory vs. memory demand. Gigabytes (0-8) over time (22850-22950 seconds); the upper curve is online memory, the lower curve is actual memory demand.]

Reducing Energy Consumption: (left) CPU Miser uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. Using the default cpuspeed daemon (auto) or any fixed lower frequency, performance loss is common. CPU Miser is able to reduce energy consumption without reducing performance significantly. (above) Memory Miser uses power scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation. Note the top curve shows the amount of online memory and the bottom curve shows actual demand. CPU Miser and Memory Miser are both capable of 30% total system energy savings with less than 1% performance loss.
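In spirit, a CPU MISER-like runtime splits each interval into a frequency-sensitive (on-chip) part and a frequency-insensitive (off-chip) part and picks the lowest frequency whose predicted slowdown stays within a performance-loss bound. The sketch below illustrates that idea with a simplified model; it is not the published CPU MISER algorithm, and the frequency table, workload split, and 5% bound are assumptions:

```c
#include <stdio.h>

/* Simplified DVFS gear selection: model an interval's time as a
 * CPU-bound part that scales with 1/f plus an off-chip part that does
 * not, then choose the lowest frequency whose predicted slowdown stays
 * within the bound delta.  Illustrative model only. */

static const double freq_ghz[] = { 0.6, 0.8, 1.0, 1.2, 1.4 };
enum { NFREQ = 5 };

/* Predicted interval time (seconds) at frequency f. */
static double predict_time(double t_cpu_at_fmax, double t_offchip, double f)
{
    double f_max = freq_ghz[NFREQ - 1];
    return t_cpu_at_fmax * (f_max / f) + t_offchip;
}

/* Lowest gear whose predicted time is within (1 + delta) of full speed. */
static int pick_gear(double t_cpu_at_fmax, double t_offchip, double delta)
{
    double t_full = predict_time(t_cpu_at_fmax, t_offchip, freq_ghz[NFREQ - 1]);
    for (int g = 0; g < NFREQ; g++)
        if (predict_time(t_cpu_at_fmax, t_offchip, freq_ghz[g]) <=
            (1.0 + delta) * t_full)
            return g;
    return NFREQ - 1;
}

int main(void)
{
    /* Memory-bound interval: 2 ms of CPU work, 8 ms of off-chip stalls.
     * With a 5% slowdown budget the runtime can back off from f_max. */
    int g = pick_gear(0.002, 0.008, 0.05);
    printf("selected frequency: %.1f GHz\n", freq_ghz[g]);
    return 0;
}
```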

Time for a single iteration: T_i = T_HPU + T_APU + Offload

Off-loaded time: Offload = O_r + O_s

Total time: T = Σ_i (T_HPU,i + T_APU,i + O_offload,i)

Single APU: T_APU = T_APUp + C_APU, where T_APUp is the APU part that can be parallelized and C_APU is the APU sequential part.

Multiple APUs: T_APU(1,p) = T_APU(1,1)/p + C_APU, where p is the number of APUs, T_APU(1,1) is the offloaded time for 1 APU, and T_APU(1,p) is the offloaded time for p APUs.

T = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p·g

Optimizing Heterogeneous Multicore Systems
We use a variation of the log_nP performance model to predict the cost of various process and data placement configurations at runtime. Using the performance model, we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show that dynamic multicore scheduling using analytical modeling is a viable, accurate technique to improve performance efficiencies. Portions of this work were accomplished in collaboration with the Pearl Laboratory led by Prof. D. Nikolopoulos.

HPU time for one iteration: T_HPU(m,1) = a_m · T_HPU(1,1) + T_CSW + O_col

T(m,p) = T_HPU(m,p) + T_APU(m,p) + O_offload + p·g
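Given these terms, scheduling reduces to evaluating T(m,p) over the feasible (m,p) configurations and picking the minimum. The sketch below does exactly that with placeholder parameter values (in MMGP the parameters are fitted during the sampling phase); it also simplifies by using T_HPU(m,1) and T_APU(1,p) as stand-ins for the general terms:

```c
#include <stdio.h>

/* Model parameters; in MMGP these are fitted at runtime from the
 * sampling phase.  The numeric values below are placeholders. */
struct mmgp_params {
    double t_hpu_11;    /* HPU time with one HPU process T_HPU(1,1)   */
    double a_m[9];      /* HPU scaling factor a_m, indexed by m       */
    double t_csw;       /* context-switch overhead T_CSW              */
    double o_col;       /* collective-communication overhead O_col    */
    double t_apu_11;    /* offloaded APU time with one APU T_APU(1,1) */
    double c_apu;       /* sequential (non-parallelizable) APU part   */
    double o_offload;   /* offload overhead O_offload                 */
    double g;           /* per-APU gap g                              */
};

/* Simplified T(m,p): T_HPU(m,1) + T_APU(1,p) + O_offload + p*g, with
 * T_HPU(m,1) = a_m*T_HPU(1,1) + T_CSW + O_col and
 * T_APU(1,p) = T_APU(1,1)/p + C_APU, following the terms above. */
static double mmgp_time(const struct mmgp_params *q, int m, int p)
{
    double t_hpu = q->a_m[m] * q->t_hpu_11 + q->t_csw + q->o_col;
    double t_apu = q->t_apu_11 / p + q->c_apu;
    return t_hpu + t_apu + q->o_offload + p * q->g;
}

int main(void)
{
    struct mmgp_params q = {
        .t_hpu_11 = 2.0, .a_m = { 0, 1.0, 0.6, 0.45, 0.4 },
        .t_csw = 0.05, .o_col = 0.1,
        .t_apu_11 = 8.0, .c_apu = 0.5, .o_offload = 0.2, .g = 0.02,
    };
    int best_m = 1, best_p = 1;
    double best_t = mmgp_time(&q, 1, 1);

    for (int m = 1; m <= 2; m++)          /* e.g. 2 HPU contexts on Cell */
        for (int p = 1; p <= 8; p++) {    /* e.g. up to 8 SPEs as APUs   */
            double t = mmgp_time(&q, m, p);
            if (t < best_t) { best_t = t; best_m = m; best_p = p; }
        }
    printf("best config: m=%d HPU, p=%d APU, predicted T=%.3f s\n",
           best_m, best_p, best_t);
    return 0;
}
```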

Application: Parallel Bayesian Phylogenetic Inference (PBPI). Dataset: 107 sequences, each 10,000 nucleotides, 20,000 generations. MMGP mean error 3.2%, std. dev. 2.6, max. error 10%.

PBPI executes a sampling phase at the beginning of execution; MMGP parameters are determined during the sampling phase, and execution is restarted after the sampling phase with MMGP.

PBPI with the sampling phase outperforms other configurations by 1% to 4x. Sampling phase overhead is 2.5%.