

There are no comprehensive, holistic studies of performance, power and thermals on distributed scientific systems and workloads.

Without innovation, future HEC systems will waste performance potential, waste energy, and require extravagant cooling.

Improving Performance, Power, and Thermal Efficiency in High-End Systems

Kirk W. Cameron
Scalable Performance Laboratory
Department of Computer Science and Engineering
Virginia Tech

cameron@cs.vt.edu

Introduction

Performance Efficiency

Power Efficiency

Thermal Efficiency

Problem Statement
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in: 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational cost for power and cooling.

Performance, Power, and Thermal Facts

The gap between peak and achieved performance is growing.
A 5-megawatt supercomputer can consume $4M in energy annually (a rough cost estimate follows below).
In just 2 hours, the Earth Simulator can produce enough heat to heat a home in the Midwest all winter long.
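As a sanity check on the energy figure, assuming an electricity rate of roughly $0.09/kWh (the rate is an assumption, not a number from the poster):

```latex
% Rough annual energy cost of a 5 MW system at an assumed $0.09/kWh:
5\,\mathrm{MW}\times 8760\,\mathrm{h/yr} = 43{,}800\,\mathrm{MWh/yr}
  \approx 4.38\times 10^{7}\,\mathrm{kWh/yr},
\qquad
4.38\times 10^{7}\,\mathrm{kWh/yr}\times \$0.09/\mathrm{kWh} \approx \$3.9\mathrm{M/yr}.
```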

Projections
Commodity components fail at an annual rate of 2-3%.
A petaflop system of ~12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours (a rough estimate follows below).
Life expectancy of an electronic component decreases 50% for every 10°C (18°F) temperature increase.
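The once-per-day failure projection follows from the stated failure rate; treating the 2-3% figure as a per-node annual rate and taking the upper end (an assumption about how the rate is applied):

```latex
% Approximate hardware failure rate of a ~12,000-node system:
12{,}000\ \text{nodes}\times 0.03\ \text{failures/node/yr}
  \approx 360\ \text{failures/yr}\approx 1\ \text{failure/day}.
```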

Our Approach
Observations: Predictive models and techniques are needed to maximize performance of emergent systems. Additional below-peak performance may provide adequate “slack times” for improved power and thermal efficiencies.
Constraint: Performance is the critical constraint. Reduce power and thermals ONLY if it does not reduce performance significantly.

Relevant approaches to the problem
Improving Performance Efficiencies
A myriad of tools and modeling techniques exist to analyze and optimize the performance of parallel scientific applications. In our work we focus on using fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Architecture.

Improving Power Efficiencies
Exploit application “slack times” to operate various components in lower power modes (e.g., dynamic voltage and frequency scaling, or DVFS) to conserve power and energy. Prior to our work, no framework existed for profiling the performance and power of parallel systems and applications.
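For concreteness, a minimal sketch of putting a core into a lower power mode through the Linux cpufreq userspace governor is shown below. The sysfs path and the 600 MHz target are illustrative, and this is not Tempest or PowerPack code:

```c
#include <stdio.h>

/* Minimal sketch: request a fixed CPU frequency through the Linux
 * cpufreq "userspace" governor by writing to sysfs.  Assumes the
 * governor is already set to "userspace" and that the caller has
 * permission to write the file (typically root).  Illustrative only. */
static int set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", khz);        /* value is in kHz */
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example: drop core 0 to 600 MHz during a communication phase. */
    if (set_cpu_khz(0, 600000) != 0)
        perror("set_cpu_khz");
    return 0;
}
```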

Improving Thermal Efficiencies
Exploit application “slack times” to operate various components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, no framework existed for profiling the performance and thermals of parallel systems and applications.

Our Contributions
I. A portable framework to profile, analyze and optimize distributed applications for performance, power, and thermals with minimal performance impact.
II. Performance-power-thermal tradeoff studies and optimizations of scientific workloads on various architectures.

Performance analysis of NAS parallel benchmarks

Distributed Thermal Profiles: A thermal profile of FT (above) reveals thermal patterns corresponding to code phases. Floating point intensive phases run hot while memory bound phases run cooler. Also, significant temperature drops occur in very short periods of time. Thermal behavior of BT (not pictured) shows temperatures synchronize with workload behavior across nodes. We also observe some nodes trend hotter than others. All of this data was obtained using Tempest.

Temperature-Performance tradeoffs

Thermal-performance tradeoffs are studied using Tempest, with DVFS strategies applied to reduce temperature in parallel scientific applications.

Download Tempest
Tempest is available for download from http://sourceforge.net. Related papers can be found at http://scape.cs.vt.edu.

Tempest Software Architecture

Detailed thermal profile of FT (Class C, NP=4)

Thermal optimizations are achieved with minimal performance impact

Thermal regulation: (top & top right) Tempest controller constrains temperature to within a threshold. Since the controller is heuristic, the temperature can exceed the threshold. However, temperature is typically controlled well using DVFS in a node. The weighted importance of thermals, performance and energy can determine the “best” operating point over a number of nodes.
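As an illustration of the kind of heuristic involved, the following sketch polls a temperature sensor and steps the CPU frequency down or up around a threshold. It is a reconstruction of the general idea only; the sensor path, thresholds, and frequency table are assumptions, not Tempest internals:

```c
#include <stdio.h>
#include <unistd.h>

/* Illustrative threshold controller: poll a temperature sensor, step
 * the CPU frequency down when the reading exceeds a threshold, and
 * step it back up once the node cools off.  Paths, thresholds, and
 * the frequency table are placeholders for the sketch. */

static const long freq_khz[] = { 600000, 800000, 1000000, 1200000, 1400000 };
enum { NFREQ = 5 };

static void set_cpu0_khz(long khz)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
    if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
}

static double read_temp_c(void)
{
    long millic = 0;   /* Linux commonly reports millidegrees C here. */
    FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    if (f) { fscanf(f, "%ld", &millic); fclose(f); }
    return millic / 1000.0;
}

int main(void)
{
    int level = NFREQ - 1;                 /* start at the highest frequency */
    const double hot = 65.0, cool = 60.0;  /* example thresholds, deg C */

    for (;;) {
        double t = read_temp_c();
        if (t > hot && level > 0)
            set_cpu0_khz(freq_khz[--level]);     /* too hot: slow down */
        else if (t < cool && level < NFREQ - 1)
            set_cpu0_khz(freq_khz[++level]);     /* cooled: speed back up */
        sleep(1);                                /* 1 s poll interval */
    }
}
```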

CPU Impact on Thermals: (left) For floating point intensive codes (e.g. SP, FT, EP from NAS) CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that significantly reduce CPU heat should impact total system temperature significantly.

[Figure: Tempest software architecture. Sequential or parallel applications written in C, C++ or Fortran are instrumented either by the compiler (GNU compiler with libtempestfunc.so) or by the user (any compiler with libtempestperblk.so). Per-node trace files and sensor stubs feed the Tempest parser, which produces a detailed per-process, per-node thermal functional or block profile. An automated or user-defined controller (libtempestctrl.so, libtempestdvfs.so) drives DVFS and keeps a per-node log of operating frequency and thermals across the nodes of a Tempest-aware cluster.]

Thermal regulation of IS (Class C, NP=4)

Thermal regulation of FT (Class C, NP=4)

Thermal-aware Performance Impact: (right) The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS PB codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C.

[Figure: Average CPU temperature (°F) for various NAS PB codes (BT, CG, EP, FT, IS, LU, MG, SP), plotted on a 95-125°F scale.]

[Figure: The efficiency gap between peak and achieved performance, plotted from 1 GFLOPS to 10^6 GFLOPS over June 1993 - June 2009, annotated with system and reference power draws: TMC CM-5, 5 kW; Fujitsu Numerical Wind Tunnel, 100 kW; Intel ASCI Red, 850 kW; IBM SP ASCI White, 6,000 kW; Earth Simulator, 18,000 kW; IBM ASCI Purple, 20,000 kW (?); IBM Blue Gene/L, ?? kW; compared with a residential air conditioner at 15 kW, a commercial data center at 1,374 kW, a high-speed electric train at 10,000 kW, and a small power plant generating capacity of 300,000 kW.]

Tempest profiling techniques are automatic, accurate, and portable.

8-node Dori

PowerPack II Software
Power profiling API library - synchronized profiling of parallel applications (a hypothetical usage sketch follows this list).
Power control API library - synchronized DVS control within parallel applications.
Multimeter middleware - coordinates data from multiple meter sources.
Power analyzer middleware - sorts/sifts/analyzes/correlates profiling data.
Performance profiler - uses common utilities to poll system performance status.
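A hypothetical use of such a profiling API from within an MPI code might look like the sketch below. The pmeter_* names are invented for illustration and are not the actual PowerPack symbols; the stubs only log timestamps so the example compiles and runs, whereas the real framework would synchronize these markers with the meter-side middleware:

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical power-profiling markers: bracket code regions of
 * interest so power samples can later be correlated with phases.
 * These stubs just print timestamps; they are placeholders, not the
 * real PowerPack API. */
static void pmeter_mark(const char *what, const char *phase)
{
    fprintf(stderr, "[pmeter] %-5s %-12s t=%.6f\n", what, phase, MPI_Wtime());
}
static void pmeter_begin(const char *phase) { pmeter_mark("begin", phase); }
static void pmeter_end(const char *phase)   { pmeter_mark("end",   phase); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    pmeter_begin("fft_compute");
    /* ... computation phase of the application ... */
    pmeter_end("fft_compute");

    pmeter_begin("alltoall");
    /* ... communication phase, e.g. MPI_Alltoall(...) ... */
    pmeter_end("alltoall");

    MPI_Finalize();
    return 0;
}
```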

This work sponsored in part by the Department of Energy Office of Science Early Career Principal Investigator (ECPI) Program under grant number DOE DE-FG02-04ER25608.

[Figure: PowerPack measurement setup. Each node under test has sense resistors inserted in its component power lines; multimeters read the voltages and report over RS232/GBIC to a data collection system attached to the Ethernet switch. Component power is computed as P_Component = (V_S - V_R) * V_R / R.]
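The per-component power computation from the measurement setup can be written directly as a small helper, reading V_R as the voltage drop across the sense resistor (the example values are illustrative only):

```c
#include <stdio.h>

/* Power drawn by a component measured with a series sense resistor.
 * v_supply: supply-side voltage V_S (volts)
 * v_sense:  voltage drop V_R across the sense resistor (volts)
 * r_ohms:   sense-resistor value R (ohms)
 * Current through the resistor is V_R / R and the component sees
 * V_S - V_R, so P = (V_S - V_R) * V_R / R, as in the figure. */
static double component_power(double v_supply, double v_sense, double r_ohms)
{
    return (v_supply - v_sense) * v_sense / r_ohms;
}

int main(void)
{
    /* Illustrative numbers: a 12 V line with a 30 mV drop across a
     * 10 milliohm sense resistor -> roughly 35.9 W. */
    printf("P = %.2f W\n", component_power(12.0, 0.030, 0.010));
    return 0;
}
```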

Distributed Power Profiles: NAS codes exhibit regularity (e.g. FT on 4 nodes - above left) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use (above right). Power consumption can vary with node for a single application, with the number of nodes under a fixed workload, and with varied workload under a fixed number of nodes. Results often correlate with the communication-to-computation (comm/comp) ratio.

[Figure: Normalized energy and delay with CPU MISER for FT.C.8. Bars (0.00-1.20) compare the default cpuspeed daemon (auto), fixed CPU frequency settings (600, 800, 1000, 1200, 1400), and CPU MISER, showing normalized delay and normalized energy for each setting.]

[Figure: Memory MISER online memory vs. memory demand. Gigabytes (0-8) over time (22850-22950 seconds); the upper curve is online memory, the lower curve is actual memory demand.]

Reducing Energy Consumption: (left) CPU Miser uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. Using the default cpuspeed daemon (auto) or any fixed lower frequency, performance loss is common. CPU Miser is able to reduce energy consumption without reducing performance significantly. (above) Memory Miser uses power scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation. Note the top curve shows the amount of online memory and the bottom curve shows actual demand. CPU Miser and Memory Miser are both capable of 30% total system energy savings with less than 1% performance loss.
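In spirit, a CPU MISER-like runtime splits each interval into a frequency-sensitive (on-chip) part and a frequency-insensitive (off-chip) part and picks the lowest frequency whose predicted slowdown stays within a performance-loss bound. The sketch below illustrates that idea with a simplified model; it is not the published CPU MISER algorithm, and the frequency table, workload split, and 5% bound are assumptions:

```c
#include <stdio.h>

/* Simplified DVFS gear selection: model an interval's time as a
 * CPU-bound part that scales with 1/f plus an off-chip part that does
 * not, then choose the lowest frequency whose predicted slowdown stays
 * within the bound delta.  Illustrative model only. */

static const double freq_ghz[] = { 0.6, 0.8, 1.0, 1.2, 1.4 };
enum { NFREQ = 5 };

/* Predicted interval time (seconds) at frequency f. */
static double predict_time(double t_cpu_at_fmax, double t_offchip, double f)
{
    double f_max = freq_ghz[NFREQ - 1];
    return t_cpu_at_fmax * (f_max / f) + t_offchip;
}

/* Lowest gear whose predicted time is within (1 + delta) of full speed. */
static int pick_gear(double t_cpu_at_fmax, double t_offchip, double delta)
{
    double t_full = predict_time(t_cpu_at_fmax, t_offchip, freq_ghz[NFREQ - 1]);
    for (int g = 0; g < NFREQ; g++)
        if (predict_time(t_cpu_at_fmax, t_offchip, freq_ghz[g]) <=
            (1.0 + delta) * t_full)
            return g;
    return NFREQ - 1;
}

int main(void)
{
    /* Memory-bound interval: 2 ms of CPU work, 8 ms of off-chip stalls.
     * With a 5% slowdown budget the runtime can back off from f_max. */
    int g = pick_gear(0.002, 0.008, 0.05);
    printf("selected frequency: %.1f GHz\n", freq_ghz[g]);
    return 0;
}
```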

Time for a single iteration: T_i = T_HPU + T_APU + Offload

Off-loaded time: Offload = O_r + O_s

Total time: T = Σ_i (T_HPU,i + T_APU,i + O_offload,i)

Single APU: T_APU = T_APUp + C_APU, where T_APUp is the APU part that can be parallelized and C_APU is the APU sequential part.

Multiple APUs: T_APU(1,p) = T_APU(1,1)/p + C_APU, where p is the number of APUs, T_APU(1,1) is the offloaded time for 1 APU, and T_APU(1,p) is the offloaded time for p APUs.

T = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p·g

Optimizing Heterogeneous Multicore Systems
We use a variation of the log_nP performance model to predict the cost of various process and data placement configurations at runtime. Using the performance model, we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show that dynamic multicore scheduling using analytical modeling is a viable, accurate technique to improve performance efficiencies. Portions of this work were accomplished in collaboration with the Pearl Laboratory led by Prof. D. Nikolopoulos.

HPU time for one iteration: T_HPU(m,1) = a_m · T_HPU(1,1) + T_CSW + O_col

T(m,p) = T_HPU(m,p) + T_APU(m,p) + O_offload + p·g
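Given these terms, scheduling reduces to evaluating T(m,p) over the feasible (m,p) configurations and picking the minimum. The sketch below does exactly that with placeholder parameter values (in MMGP the parameters are fitted during the sampling phase); it also simplifies by using T_HPU(m,1) and T_APU(1,p) as stand-ins for the general terms:

```c
#include <stdio.h>

/* Model parameters; in MMGP these are fitted at runtime from the
 * sampling phase.  The numeric values below are placeholders. */
struct mmgp_params {
    double t_hpu_11;    /* HPU time with one HPU process T_HPU(1,1)   */
    double a_m[9];      /* HPU scaling factor a_m, indexed by m       */
    double t_csw;       /* context-switch overhead T_CSW              */
    double o_col;       /* collective-communication overhead O_col    */
    double t_apu_11;    /* offloaded APU time with one APU T_APU(1,1) */
    double c_apu;       /* sequential (non-parallelizable) APU part   */
    double o_offload;   /* offload overhead O_offload                 */
    double g;           /* per-APU gap g                              */
};

/* Simplified T(m,p): T_HPU(m,1) + T_APU(1,p) + O_offload + p*g, with
 * T_HPU(m,1) = a_m*T_HPU(1,1) + T_CSW + O_col and
 * T_APU(1,p) = T_APU(1,1)/p + C_APU, following the terms above. */
static double mmgp_time(const struct mmgp_params *q, int m, int p)
{
    double t_hpu = q->a_m[m] * q->t_hpu_11 + q->t_csw + q->o_col;
    double t_apu = q->t_apu_11 / p + q->c_apu;
    return t_hpu + t_apu + q->o_offload + p * q->g;
}

int main(void)
{
    struct mmgp_params q = {
        .t_hpu_11 = 2.0, .a_m = { 0, 1.0, 0.6, 0.45, 0.4 },
        .t_csw = 0.05, .o_col = 0.1,
        .t_apu_11 = 8.0, .c_apu = 0.5, .o_offload = 0.2, .g = 0.02,
    };
    int best_m = 1, best_p = 1;
    double best_t = mmgp_time(&q, 1, 1);

    for (int m = 1; m <= 2; m++)          /* e.g. 2 HPU contexts on Cell */
        for (int p = 1; p <= 8; p++) {    /* e.g. up to 8 SPEs as APUs   */
            double t = mmgp_time(&q, m, p);
            if (t < best_t) { best_t = t; best_m = m; best_p = p; }
        }
    printf("best config: m=%d HPU, p=%d APU, predicted T=%.3f s\n",
           best_m, best_p, best_t);
    return 0;
}
```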

Application: Parallel Bayesian Phylogenetic Inference (PBPI). Dataset: 107 sequences, each 10,000 nucleotides, 20,000 generations. MMGP mean error 3.2%, std. dev. 2.6, max. error 10%.

PBPI executes a sampling phase at the beginning of execution; MMGP parameters are determined during the sampling phase, and execution is restarted after the sampling phase with MMGP.

PBPI with the sampling phase outperforms other configurations by 1% to 4x. Sampling phase overhead is 2.5%.