Improving Performance, Power, and Thermal Efficiency in High-End Systems
Kirk W. Cameron, Scalable Performance Laboratory, Department of Computer Science and Engineering, Virginia Tech
cameron@cs.vt.edu

There are no comprehensive, holistic studies of performance, power, and thermals on distributed scientific systems and workloads. Without innovation, future HEC systems will waste performance potential, waste energy, and require extravagant cooling.
Introduction
Performance Efficiency
Power Efficiency
Thermal Efficiency
Problem Statement
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in: 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational costs for power and cooling.
Performance, Power, and Thermal Facts
The gap between peak and achieved performance is growing.
A 5-megawatt supercomputer can consume $4M in energy annually.
In just 2 hours, the Earth Simulator can produce enough heat to heat a home in the Midwest all winter long.
Projections
Commodity components fail at an annual rate of 2-3%. A petaflop system of ~12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours. The life expectancy of an electronic component decreases 50% for every 10°C (18°F) temperature increase.
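As a back-of-the-envelope illustration of that last rule of thumb, here is a minimal sketch in C; the halving-per-10°C model comes from the projection above, while the function name and sample temperatures are ours.

```c
#include <math.h>
#include <stdio.h>

/* Rule of thumb: component life expectancy halves for every 10 C of
 * sustained temperature increase, i.e. factor = 0.5^(dT/10). */
static double lifetime_factor(double delta_t_celsius) {
    return pow(0.5, delta_t_celsius / 10.0);
}

int main(void) {
    /* Running 10 C hotter halves expected life; cooling by 5 C
     * (a typical result from the thermal controller below)
     * recovers about 1.4x. */
    printf("dT = +10 C -> %.2fx expected life\n", lifetime_factor(10.0));
    printf("dT = -5 C  -> %.2fx expected life\n", lifetime_factor(-5.0));
    return 0;
}
```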
Our Approach
Observations: Predictive models and techniques are needed to maximize performance of emergent systems. Additional below-peak performance may provide adequate “slack times” for improved power and thermal efficiencies.
Constraint: Performance is the critical constraint. Reduce power and thermals ONLY if doing so does not reduce performance significantly.
Relevant Approaches to the Problem
Improving Performance Efficiency: A myriad of tools and modeling techniques exist to analyze and optimize the performance of parallel scientific applications. In our work we focus on using fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Engine.
Improving Power Efficiency: Exploit application “slack times” to operate various components in lower power modes (e.g. dynamic voltage and frequency scaling, or DVFS) to conserve power and energy. Prior to our work, no framework existed for profiling the performance and power of parallel systems and applications.
Improving Thermal Efficiency: Exploit application “slack times” to operate various components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, no framework existed for profiling the performance and thermals of parallel systems and applications.
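To make the “slack time” idea concrete, here is a minimal sketch of user-space DVFS on Linux. It assumes the cpufreq userspace governor is active and the process can write to sysfs; the frequencies and phase boundaries are illustrative, not our actual controller.

```c
#include <stdio.h>

/* Write a target frequency (kHz) to the cpufreq userspace governor.
 * Requires the "userspace" governor and write access to sysfs. */
static int set_cpu_khz(int cpu, long khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void) {
    set_cpu_khz(0, 600000);   /* scale down before a memory- or comm-bound phase */
    /* ... slack: communication or memory-bound work ... */
    set_cpu_khz(0, 1400000);  /* scale back up for compute-bound code */
    return 0;
}
```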
Our Contributions
I. A portable framework to profile, analyze, and optimize distributed applications for performance, power, and thermals with minimal performance impact.
II. Performance-power-thermal tradeoff studies and optimizations of scientific workloads on various architectures.
Performance analysis of NAS parallel benchmarks
Distributed Thermal Profiles: A thermal profile of FT (above) reveals thermal patterns corresponding to code phases. Floating point intensive phases run hot while memory bound phases run cooler. Also, significant temperature drops occur in very short periods of time. Thermal behavior of BT (not pictured) shows temperatures synchronize with workload behavior across nodes. We also observe some nodes trend hotter than others. All of this data was obtained using Tempest.
Temperature-Performance Tradeoffs
Thermal-performance tradeoffs are studied using Tempest, and DVFS strategies are applied to reduce temperature in parallel scientific applications.
Download Tempest
Tempest is available for download from http://sourceforge.net. Related papers can be found at http://scape.cs.vt.edu.
Tempest Software Architecture
[Figure: detailed thermal profile of FT (Class C, NP=4)]
Thermal optimizations are achieved with minimal performance impact
Thermal regulation: (top & top right) Tempest controller constrains temperature to within a threshold. Since the controller is heuristic, the temperature can exceed the threshold. However, temperature is typically controlled well using DVFS in a node. The weighted importance of thermals, performance and energy can determine the “best” operating point over a number of nodes.
CPU Impact on Thermals: (left) For floating point intensive codes (e.g. SP, FT, EP from NAS), the CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that substantially reduce CPU heat should therefore have a large impact on total system temperature.
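As a rough sketch of the kind of heuristic, threshold-based controller described above, consider the loop below. The sensor and frequency-stepping helpers are stubbed, and the threshold, hysteresis, and period values are assumptions; this is not Tempest's actual source.

```c
#include <stdio.h>
#include <unistd.h>

#define THRESHOLD_C  60.0  /* user-chosen temperature ceiling (illustrative) */
#define HYSTERESIS_C  3.0  /* avoid oscillating around the threshold */
#define PERIOD_S      1    /* sampling period in seconds */

/* Stubs: a real controller would read a CPU/board sensor and step
 * the cpufreq operating point up or down. */
static double read_cpu_temp_c(void)     { return 58.0; }
static void   step_frequency_down(void) { puts("DVFS down"); }
static void   step_frequency_up(void)   { puts("DVFS up");   }

int main(void) {
    for (int i = 0; i < 10; i++) {       /* would run continuously in practice */
        double t = read_cpu_temp_c();
        if (t > THRESHOLD_C)
            step_frequency_down();       /* shed heat via DVFS */
        else if (t < THRESHOLD_C - HYSTERESIS_C)
            step_frequency_up();         /* reclaim performance when cool */
        sleep(PERIOD_S);
    }
    return 0;
}
```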
[Diagram: Tempest software architecture. Sequential or parallel applications written in C, C++, or Fortran are instrumented either automatically by the compiler (GNU Compiler, linking libtempestfunc.so or libtempestperblk.so) or by hand by the user (any compiler). Instrumented runs produce per-node trace files via per-node sensor stubs; the Tempest Parser turns these into a detailed per-process, per-node thermal functional or block profile. A controller (libtempestctrl.so, with libtempestdvfs.so) applies automated or user-defined DVFS policies and writes a per-node log of operating frequency and thermals.]
[Diagram: a Tempest-aware cluster — the profiler and controller run on every node, node 01 through node mn.]
[Figures: thermal regulation of IS (Class C, NP=4) and of FT (Class C, NP=4)]
Thermal-aware Performance Impact: (right) The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS PB codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C.
[Figure: average CPU temperature (°F) for the NAS parallel suite codes BT, CG, EP, FT, IS, LU, MG, and SP; temperatures range from roughly 95°F to 125°F.]
[Figure: the Efficiency Gap — peak vs. achieved performance from Jun-93 through Jun-09, on a log scale from 1 GFLOP to 10^6 GFLOPS.]
[Figure: supercomputer power consumption in context]
Commercial data center: 1,374 kW
Residential air conditioner: 15 kW
High-speed electric train: 10,000 kW
Small power plant (generating capacity): 300,000 kW
IBM Blue Gene/L: ?? kW
IBM ASCI Purple: 20,000 kW?
IBM SP ASCI White: 6,000 kW
Earth Simulator: 18,000 kW
Intel ASCI Red: 850 kW
Fujitsu Numerical Wind Tunnel: 100 kW
TMC CM-5: 5 kW
Tempest profiling techniques are automatic, accurate, and portable.
[Figure: the 8-node Dori cluster testbed]
PowerPack II Software
Power profiling API library - synchronized profiling of parallel applications.
Power control API library - synchronized DVS control within parallel applications.
Multimeter middleware - coordinates data from multiple meter sources.
Power analyzer middleware - sorts/sifts/analyzes/correlates profiling data.
Performance profiler - uses common utilities to poll system performance status.
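The sketch below shows how a PowerPack-style profiling API brackets application phases so meter data can be correlated with code regions. The pp_* names are hypothetical stand-ins, stubbed here so the example compiles; they are not PowerPack's documented interface.

```c
#include <stdio.h>

/* Hypothetical profiling calls, stubbed: a real library would tag and
 * synchronize meter samples across nodes for the named region. */
static void pp_start(const char *label) { printf("profile start: %s\n", label); }
static void pp_stop(const char *label)  { printf("profile stop:  %s\n", label); }

static void fft_phase(void)       { /* compute-bound work */ }
static void transpose_phase(void) { /* communication-bound work */ }

int main(void) {
    pp_start("fft");        /* power samples tagged with the "fft" phase */
    fft_phase();
    pp_stop("fft");

    pp_start("transpose");  /* separate profile for the communication phase */
    transpose_phase();
    pp_stop("transpose");
    return 0;
}
```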
This work sponsored in part by the Department of Energy Office of Science Early Career Principal Investigator (ECPI) Program under grant number DOE DE-FG02-04ER25608.
Power measurement apparatus: precision sense resistors are placed in series with each component of the node under test, and multimeters sample the voltages; meter data is gathered by a data collection system over an Ethernet switch. Component power is computed as

PComponent = (VS - VR) · VR / R

where VS is the supply voltage, VR is the voltage drop across the sense resistor, and R is its resistance.
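A direct transcription of that formula, noting that VR/R is the current through the component (the sample values are illustrative only):

```c
#include <stdio.h>

/* P = (VS - VR) * VR / R: the current through the sense resistor is
 * VR / R, and the component sees VS - VR across it. */
static double component_power(double v_supply, double v_sense, double r_ohms) {
    return (v_supply - v_sense) * v_sense / r_ohms;
}

int main(void) {
    /* e.g. 12 V rail, 50 mV drop across a 10 milliohm shunt -> 5 A */
    printf("%.2f W\n", component_power(12.0, 0.05, 0.01));  /* ~59.75 W */
    return 0;
}
```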
Distributed Power Profiles: NAS codes exhibit regularity (e.g. FT on 4 nodes - above left) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use (above right). Power consumption can vary by node for a single application, with the number of nodes under a fixed workload, and with varied workloads under a fixed number of nodes. Results often correlate with the communication-to-computation ratio.
[Figure: normalized energy and delay with CPU MISER for FT.C.8 — bars for the cpuspeed daemon (auto), fixed frequencies of 600, 800, 1000, 1200, and 1400 MHz, and CPU MISER; normalized values range 0.00-1.20.]
[Figure: online memory vs. memory demand in gigabytes (0-8) over time in seconds (~22850-22950); the upper curve is memory kept online, the lower curve is actual demand.]
Reducing Energy Consumption: (left) CPU MISER uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. With the default cpuspeed daemon (auto) or any fixed lower frequency, performance loss is common; CPU MISER reduces energy consumption without significantly reducing performance.
(above) Memory MISER uses power-scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation. The top curve shows the amount of online memory and the bottom curve shows actual demand. CPU MISER and Memory MISER are both capable of 30% total system energy savings with less than 1% performance loss.
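A minimal sketch of a Memory MISER-style allocation policy — the DIMM size, headroom, and decision rule below are assumptions for illustration, not the actual implementation:

```c
#include <math.h>
#include <stdio.h>

#define DIMM_GB      2.0  /* capacity per DIMM (illustrative) */
#define TOTAL_DIMMS  4
#define HEADROOM_GB  1.0  /* slack so allocation spikes don't stall */

/* Keep just enough DIMMs online to cover demand plus headroom;
 * the rest can be powered down to save energy. */
static int dimms_needed(double demand_gb) {
    int n = (int)ceil((demand_gb + HEADROOM_GB) / DIMM_GB);
    if (n < 1) n = 1;
    if (n > TOTAL_DIMMS) n = TOTAL_DIMMS;
    return n;
}

int main(void) {
    double demand[] = {1.2, 3.5, 6.8, 2.0};  /* sampled demand, GB */
    for (int i = 0; i < 4; i++)
        printf("demand %.1f GB -> %d of %d DIMMs online\n",
               demand[i], dimms_needed(demand[i]), TOTAL_DIMMS);
    return 0;
}
```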
Optimizing Heterogeneous Multicore Systems
We use a variation of the lognP performance model (MMGP) to predict the cost of various process and data placement configurations at runtime. Using the performance model we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show that dynamic multicore scheduling using analytical modeling is a viable, accurate technique to improve performance efficiency. Portions of this work were accomplished in collaboration with the Pearl Laboratory led by Prof. D. Nikolopoulos.

The model (HPU = host processing unit, e.g. the Cell PPE; APU = accelerator processing unit, e.g. an SPE):
Time for a single iteration: Ti = THPU + TAPU + Ooffload
Off-loaded time: Ooffload = Or + Os
Total time: T = Σi (THPU,i + TAPU,i + Ooffload,i)
Single APU: TAPU = TAPUp + CAPU, where TAPUp is the parallelizable APU part and CAPU the sequential part
Multiple APUs: TAPU(1,p) = TAPU(1,1)/p + CAPU, where p is the number of APUs, TAPU(1,1) the offloaded time on 1 APU, and TAPU(1,p) the offloaded time on p APUs
Full model: T = THPU + TAPU(1,1)/p + CAPU + Ooffload + p·g
HPU time for one iteration with m processes: THPU(m,1) = am · THPU(1,1) + TCSW + Ocol
General form: T(m,p) = THPU(m,p) + TAPU(m,p) + Ooffload + p·g
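The sketch below evaluates the full model to choose the number of APUs p with the lowest predicted time, as a runtime scheduler might. The parameter values are placeholders, not fitted PBPI measurements.

```c
#include <stdio.h>

/* MMGP parameters, as defined in the model above. */
typedef struct {
    double t_hpu1;    /* THPU(1,1): HPU time for one process */
    double t_apu1;    /* TAPU(1,1): offloaded time on 1 APU */
    double c_apu;     /* CAPU: sequential part of APU work */
    double o_offload; /* Ooffload = Or + Os */
    double t_csw;     /* TCSW: context-switch cost */
    double o_col;     /* Ocol (see model above) */
    double g;         /* g: per-APU gap */
} mmgp_params;

/* T(m,p) = am*THPU(1,1) + TCSW + Ocol + TAPU(1,1)/p + CAPU + Ooffload + p*g */
static double mmgp_time(const mmgp_params *mp, double a_m, int p) {
    double t_hpu = a_m * mp->t_hpu1 + mp->t_csw + mp->o_col;
    double t_apu = mp->t_apu1 / p + mp->c_apu;
    return t_hpu + t_apu + mp->o_offload + p * mp->g;
}

int main(void) {
    mmgp_params mp = {1.0, 8.0, 0.1, 0.05, 0.02, 0.01, 0.005}; /* placeholders */
    int best_p = 1;
    double best_t = 1e30;
    for (int p = 1; p <= 8; p++) {  /* e.g. up to 8 SPEs on Cell */
        double t = mmgp_time(&mp, 1.0, p);
        if (t < best_t) { best_t = t; best_p = p; }
        printf("p=%d: predicted T=%.3f\n", p, t);
    }
    printf("best: p=%d (T=%.3f)\n", best_p, best_t);
    return 0;
}
```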
Application: parallel Bayesian phylogenetic inference (PBPI). Dataset: 107 sequences, each 10,000 nucleotides, 20,000 generations. MMGP mean error 3.2%, std. dev. 2.6, max error 10%.
PBPI executes a sampling phase at the beginning of execution; MMGP parameters are determined during the sampling phase, and execution is restarted after the sampling phase with the MMGP-chosen configuration.
PBPI with the sampling phase outperforms other configurations by 1% to 4x. Sampling-phase overhead is 2.5%.