Download - Energy Accounting and Control on HPC clustersperso.ens-lyon.fr/laurent.lefevre/greendayslille/green...Energy Accounting and Control on HPC clusters Yiannis Georgiou R&D Software Engineer

Energy Accounting and Controlon HPC clusters

Yiannis GeorgiouR&D Software Engineer

Objectives

Issues that we wanted to deal with:I Measure power and energy consumption on HPC clustersI Attribute power and energy data to HPC componentsI Calculate the energy consumption of jobs in the systemI Extract power consumption time series of jobsI Control the Energy usage during the job execution

Why?I Measuring Energy will enable us to use it more efficiently:

I Motivate users to better exploit the power of their resourcesI Use the power/energy data as input information for central software that

can take actions

2 / 64

Objectives

Issues that we wanted to deal with:I Measure power and energy consumption on HPC clustersI Attribute power and energy data to HPC componentsI Calculate the energy consumption of jobs in the systemI Extract power consumption time series of jobsI Control the Energy usage during the job execution

Why?I Measuring Energy will enable us to use it more efficiently:

I Motivate users to better exploit the power of their resourcesI Use the power/energy data as input information for central software that

can take actions

3 / 64

Context

High Performance Computing

Infrastructures:I Supercomputers, Clusters,

Grids, Clouds

Applications:I Climate Prediction, Protein

Folding, Crash simulation,High-Energy Physics,Astrophysics, Rendering

System SoftwareI System Software: Operating

System, Runtime system,Resource Management, I/OSystems, Interfacing to ExternalEnvironments

4 / 64

Resource and Job Management

The goal of a Resource and Job Management System (RJMS) is to satisfyusers’ demands for computation and assign user jobs upon thecomputational resources in an efficient manner.

RJMS Importance

Strategic position butcomplex internals:

I Direct andconstantknowledge ofresources andjobs

I Multifacetprocedures withcomplex internalfunctions

5 / 64

RJMS abstraction layers

This assignement involves three principal abstraction layers:I Job Management layer

I the Scheduling layer

I and the Resource Management layer

6 / 64

RJMS and Power Management

I Constant knowledge of the resources and jobsI Take advantage of the strategic position of the RJMS softwareI To treat Energy as a new type of Resources to be used by jobs

7 / 64

RJMS and Power ManagementExisting mechanisms for energy reductions on most of todays RJMS(System side feature):

I Framework for energy saving through unutilized nodesI Administrator configurable actions (hibernate, DVFS, power off,etc)I Automatic “Wake up” when jobs arrive

Energy consumption of trace file execution with 50.32% of system utilization

01

00

02

00

03

00

04

00

05

00

06

00

0

Co

nsu

mp

tio

n [

Wa

tt]

0 2000 5000 8000 11000 14000 17000 20000 23000 26000

Time [s]

Total Energy Consumed:

42.7 KW.h Normal Management

30.6 KW.h Green Management

8 / 64


Issues:I Multiple Reboots: Risks for node crashes or other hardware

components problemsI Most of production HPC clusters have a nearly constant 90% or

higher utilization hence the gain can be trivialI TradeOffs: Jobs Waiting time increases significantly

Energy consumption of trace file execution with 89.62% of system utilization and NAS BT benchmark

010

0020

0030

0040

0050

0060

0070

0080

00

Con

sum

ptio

n [W

att]

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000Time [s]

Energy consumption of trace file execution with 89.62% of system utilization and NAS BT benchmark

Total Energy Consumed:

NORMAL Management 53.617888025 KWatts.hourGREEN Management 47.538609925 KWatts.hour

Total Execution Time:

NORMAL 26557GREEN 26638

0 200 400 600 800

0.0

0.2

0.4

0.6

0.8

1.0

CDF on Wait time with 89.62% of system utilization and NAS BT benchmark

Wait time [s]

Jobs

[%]

0 200 400 600 800

0.0

0.2

0.4

0.6

0.8

1.0

CDF on Wait time with 89.62% of system utilization and NAS BT benchmark

Wait time [s]

Jobs

[%]

GREENNORMAL

9 / 64


New issues to deal with:I To enable energy reductions

even with 100% system utilization

How?I Make energy consumption a user concern:

I Energy Accounting: Turn Energy Consumption to a new jobcharacteristic

I Energy Control: Allow users to control the energy consumptionof their jobs

Basic need:I Ways to monitor and measure energy

consumption

10 / 64








consumption

11 / 64








consumption

12 / 64

Plan

Introduction

Energy Accounting and Control Framework

Framework Overhead Evaluation

Energy and Power Measurements Evaluations

Performance-Energy TradeOffs

Conclusions and Ongoing Works

13 / 64

Plan

Introduction






14 / 64

SLURM scalable and flexible RJMS

I SLURM open-source Resource and Job Management System, sourcesfreely available under the GNU General Public License.https://github.com/SchedMD/slurm/

I Portable: written in C with a GNU autoconf configuration engine.I Modular: Based on a plugin mechanism used to support different kind

of scheduling policies, interconnects, libraries, etcI Robust: highly tolerant of system failures, including failure of the node

executing its control functions.I Scalable: designed to operate in a heterogeneous cluster with up to

tens of millions of processors. It can accept 1000 job submissions persecond and fully execute 500 simple jobs per second (depending uponhardware and system configuration).

15 / 64

SLURM History and Facts

I Developed in LLNL since 2003, passed to SchedMD since 2011I Multiple enterprises and research centers have been contributing to

the project (LANL, Cray, Intel, CEA, HP, BULL, BSC, etc)I Large international community Active mailing lists (support by main

developers)I Contributions (various external software and standards are integrated

upon SLURM)

16 / 64

SLURM History and Facts

I Used in more than 50% of worlds largest supercomputers of Top500list amongst which the following 5 from Top10:

I #1: Tianhe-2 at the National University of Defense TechnologyI #3: Sequoia, an IBM BlueGeneQ at Lawrence Livermore National

LaboratoryI #6: Piz Daint, a Cray XC30 at the Swiss National Supercomputing CenterI #7: Stampede, a Dell cluster at the Texas Advanced Computing CenterI #9: Vulcan, an IBM BlueGeneQ at Lawrence Livermore National

Laboratory

17 / 64

BULL and SLURM

I BULL initially started to work with SLURM in 2005I At least 5 BULL active developers since then:

I Development for new SLURM featuresI Bug Corrections and Support for clientsI All BULL developments are given to community(open-source), no

code is kept proprietary except some small parts related to particularBULL hardware or software

I Integrated into the bullx Extreme computing software stack since 2006I Used as the default RJMS of the bullx stackI Deployed upon the BULL-CEA petaflopic supercomputers Curie,

Tera100, Helios

I Close development collaboration between SchedMD,CEA and BULLI Annual User Group Meeting (User, Admin Tutorials + Technical

presentation for developpers)http://www.schedmd.com/slurmdocs/publications.html

18 / 64

Plan

Introduction






19 / 64

Summary of the energy accounting and control features

I Power and Energy consumption monitoring per node level.I Energy consumption accounting per step/job on SLURM

DataBaseI Power profiling per step/job on the end of jobI Frequency Selection Mechanisms for user control of job energy

consumption

How this takes place:I Dedicated Plugins for Support of in-band collection of energy/power

data (IPMI / RAPL)I Dedicated Plugins for Support of out-of-band collection of

energy/power data (RRD databases )I Power data job profiling with HDF5 file formatI SLURM Internal power-to-energy and energy-to-power calculations

Important Issues:I Overhead: In-band CollectionI Precision: of the measurements and

internal calculationsI Scalability: Out-of band Collection

20 / 64




consumption






21 / 64




consumption






22 / 64

Framework Architecture

23 / 64


24 / 64


25 / 64


26 / 64

In-band collection of power/energy data with IPMI

I IPMI is a message-based, hardware-level interface specification(may operate in-band or out-of-band)

I Communication with the Baseboard Management ControllerBMC which is a specialized microcontroller embedded on themotherboard of a computer

I SLURM support is based on the FreeIPMI APIhttp://www.gnu.org/software/freeipmi/

I FreeIPMI includes a userspace driver that works on mostmotherboards without any required driver.

I No thread interferes with application execution

I The data collected from IPMI are currently instantaneousmeasures in Watts

I SLURM individual polling frequency (>=1sec)I direct usage for power profilingI but internal SLURM calculations for energy reporting per job

Pros - Cons

Advantages:I Complete Node measurements

Disadvantages:I There could be overhead and the read may

be time consumingI Complex process in calculating energy

consumption per step/job basis usingaveraging measurements window

I No finer granularity (neither socket nor corelevel yet)

27 / 64

In-band collection of power/energy data with IPMI

I IPMI is a message-based, hardware-level interface specification(may operate in-band or out-of-band)

I Communication with the Baseboard Management ControllerBMC which is a specialized microcontroller embedded on themotherboard of a computer

I SLURM support is based on the FreeIPMI APIhttp://www.gnu.org/software/freeipmi/

I FreeIPMI includes a userspace driver that works on mostmotherboards without any required driver.

I No thread interferes with application execution

I The data collected from IPMI are currently instantaneousmeasures in Watts

I SLURM individual polling frequency (>=1sec)I direct usage for power profilingI but internal SLURM calculations for energy reporting per job

Pros - Cons

Advantages:I Complete Node measurements

Disadvantages:I There could be overhead and the read may

be time consumingI Complex process in calculating energy

consumption per step/job basis usingaveraging measurements window

I No finer granularity (neither socket nor corelevel yet)

28 / 64

In-band collection of power/energy data with RAPL

I RAPL ( Running Average Power Limit) are particular interfaceson Intel Sandy Bridge processors (and later models)implemented to provide a mechanism for keeping theprocessors in a particular user-specified power envelope.

I Interfaces can estimate current energy usage based on asoftware model driven by hardware performance counters,temperature and leakage models

I Linux supports an ’MSR’ driver and access to the register can bemade through /dev/cpu/*/msr with priviledged readpermissions

I The data collected from RAPL is energy consumption in Joules(since the last boot of the machine)

I SLURM individual polling frequency (>=1sec)I direct usage for energy reporting per jobI but internal SLURM calculations for power reporting

Pros - Cons

Advantages:I No overhead during capturing power/energy data (read

hardware registers)I Simple process in calculating energy consumption per

step/job basisI 2 values are sufficient (start/end of the job), no need of

collecting big number of instant watts (IPMI case)

Disadvantages:I No whole node data, only processor and part of

memory (DRAM) is supported (no motherboard, disk,external GPU, etc)

I Per socket granularity exists (not per core yet)

29 / 64

In-band collection of power/energy data with RAPL

I RAPL ( Running Average Power Limit) are particular interfaceson Intel Sandy Bridge processors (and later models)implemented to provide a mechanism for keeping theprocessors in a particular user-specified power envelope.

I Interfaces can estimate current energy usage based on asoftware model driven by hardware performance counters,temperature and leakage models

I Linux supports an ’MSR’ driver and access to the register can bemade through /dev/cpu/*/msr with priviledged readpermissions

I The data collected from RAPL is energy consumption in Joules(since the last boot of the machine)

I SLURM individual polling frequency (>=1sec)I direct usage for energy reporting per jobI but internal SLURM calculations for power reporting

Pros - Cons

Advantages:I No overhead during capturing power/energy data (read

hardware registers)I Simple process in calculating energy consumption per

step/job basisI 2 values are sufficient (start/end of the job), no need of

collecting big number of instant watts (IPMI case)

Disadvantages:I No whole node data, only processor and part of

memory (DRAM) is supported (no motherboard, disk,external GPU, etc)

I Per socket granularity exists (not per core yet)

30 / 64

Out-of-band collection of power/energy data

I External Sensors Plugins to allow out-of-band monitoring ofcluster sensors

I Possibility to Capture energy usage and temperature ofvarious components (nodes, switches, rack-doors, etc)

I Framework generic but initial Support for RRD databasesthrough rrdtool API (for the collection of energy/temperaturedata)

I Plugin to be used with real wattmeters or out-of-band IPMIcapturing

I Power data captured used for per node power monitoring(scontrol show node) and per job energy accounting (Slurm DB)

I direct usage for energy reporting per jobI but internal SLURM calculations for power reporting

Pros - Cons

Advantages:I No overhead during capturing power/energy

data (out-of-band)I Power/Energy data of various components

may be used (not possible through in-bandmechanisms)

Disadvantages:I Scalability issues

Online Tutorial for more details:

www.schedmd.com/slurmdocs/SUG13/energy sensors.pdf

31 / 64

Out-of-band collection of power/energy data

I External Sensors Plugins to allow out-of-band monitoring ofcluster sensors

I Possibility to Capture energy usage and temperature ofvarious components (nodes, switches, rack-doors, etc)

I Framework generic but initial Support for RRD databasesthrough rrdtool API (for the collection of energy/temperaturedata)

I Plugin to be used with real wattmeters or out-of-band IPMIcapturing

I Power data captured used for per node power monitoring(scontrol show node) and per job energy accounting (Slurm DB)

I direct usage for energy reporting per jobI but internal SLURM calculations for power reporting

Pros - Cons

Advantages:I No overhead during capturing power/energy

data (out-of-band)I Power/Energy data of various components

may be used (not possible through in-bandmechanisms)

Disadvantages:I Scalability issues


www.schedmd.com/slurmdocs/SUG13/energy sensors.pdf

32 / 64

Power profiling

I Job profiling to periodically capture the task’s usage of variousresources like CPU, Memory, Lustre, Infiniband and Power pernode

I Resource Independent polling frequency configurationI Based on hdf5 file format http://www.hdfgroup.org open

source software libraryI versatile data model that can represent very complex data objects

and a wide variety of metadataI portable file format with no limit on the number or size of data

objects stored

I Profiling per node (one hdf5 file per job on each node)I Aggregation on one hdf5 file per job (after job termination)I Slurm built-in tools for extraction of hdf5 profiling data


www.schedmd.com/slurmdocs/SUG13/profile hdf5.pdf

33 / 64

Power profiling

I Job profiling to periodically capture the task’s usage of variousresources like CPU, Memory, Lustre, Infiniband and Power pernode

I Resource Independent polling frequency configurationI Based on hdf5 file format http://www.hdfgroup.org open

source software libraryI versatile data model that can represent very complex data objects

and a wide variety of metadataI portable file format with no limit on the number or size of data

objects stored

I Profiling per node (one hdf5 file per job on each node)I Aggregation on one hdf5 file per job (after job termination)I Slurm built-in tools for extraction of hdf5 profiling data


www.schedmd.com/slurmdocs/SUG13/profile hdf5.pdf

34 / 64

Job Energy Control through CPU Frequency Setting

I Support of kind of DVFS technique through CPU Frequency settingI Static since it may not be changed during the execution (user does not

usually have root access on the computing nodes)I The user may ask either a particular value in kilohertz or use

low/medium/high and the request will match the closest possiblenumerical value

I Implementation based on tasks confinement to those cpus (cgroups orcpusets).

I Implemented through manipulation of the/sys/devices/system/cpu/cpu0/cpufreq/scaling cur freq andgovernors drivers

35 / 64

Issues regarding the FrameworkI Overhead: In-band monitoring on computing nodes. What is the

overhead of the framework?I Precision: Usage of built-in models/interfaces and internal calculations.

How precise are the reported energy and power data?I Scalability: Out-of-band monitoring of computing nodes. What is the

scalability of the mechanism? *Not treated here*

Overhead:I in terms of the following

resources which are shared withapplication a:

I CPUI MemoryI Energy

aNetwork excluded because in most cases atleast 2 network cards are used in HPC clustersand Slurm RPCs pass from the Admin network

Precision:I Compared to External

Wattmeters measurements

36 / 64

Issues regarding the FrameworkI Overhead: In-band monitoring on computing nodes. What is the

overhead of the framework?I Precision: Usage of built-in models/interfaces and internal calculations.

How precise are the reported energy and power data?I Scalability: Out-of-band monitoring of computing nodes. What is the

scalability of the mechanism? *Not treated here*

Overhead:I in terms of the following

resources which are shared withapplication a:

I CPUI MemoryI Energy

aNetwork excluded because in most cases atleast 2 network cards are used in HPC clustersand Slurm RPCs pass from the Admin network

Precision:I Compared to External

Wattmeters measurements

37 / 64

Plan

Introduction






38 / 64

Experimental Platform

HardwareI Grid5000 platform resourcesI Lyon clusters orion and taurusI 17 nodes cluster (DELL PowerEdge R720 with Intel

Xeon E5-2630 2.3GHz 2 sockets per node /6 cores persocket, 32 GB of Memory and 10 Gigabit EthernetNetwork.)

SoftwareI SLURM 2.6.0 with 1 slurmctld and 16 slurmdI Execution of HPL Linpack

I with 80% of memoryI Duration 44min

39 / 64

Experimental Methodology

Execution of the same job with each differentmonitoring mode (with polling frequency = 1sec)

I NO JOBACCTI JOBACCT 0 RAPLI JOBACCT RAPLI JOBACCT RAPL PROFILEI JOBACCT 0 IPMII JOBACCT IPMII JOBACCT IPMI PROFILE

Profile the usage of:I CPU: (cgroups/cpuacct

subsystem)I Memory: ( RSS of ps command)

Overhead in terms of:I Execution TimeI Energy Consumption

40 / 64

Framework CPU Overhead

0

500

1000

1500

2000

2500

3000

3500

4000

NO-JOBACCT

JOBACCT-0-RAPL

JOBACCT-RAPL

JOBACCT-RAPL-PROFILE

JOBACCT-0-IPMI

JOBACCT-IPMI

JOBACCT-IPMI-PROFILE

CP

U-T

ime [m

sec]

SLURM Monitoring Modes

Real CPU-Time of all SLURM processes on first computing node during Linpack executions on 16 nodes with different SLURM monitoring modes

All-SLURM-processes

41 / 64

Framework Memory Overhead

0

5

10

15

20

25

30

35

NO-JOBACCT

JOBACCT-0-RAPL

JOBACCT-RAPL

JOBACCT-RAPL-PROFILE

JOBACCT-0-IPMI

JOBACCT-IPMI

JOBACCT-IPMI-PROFILE

Mem

ory

RS

S U

sage [M

Byte

s]

SLURM Monitoring Modes

Instant RSS Memory of SLURM processes on first computing node during Linpack executions on 16 nodes with different SLURM monitoring modes

All-SLURM-processes

42 / 64

Monitoring Modes Comparison

Monitoring Modes Time (s) Energy (J) Time Overhead Energy OverheadNO JOBACCT 2657 12623276.73 - -

JOBACCT 0 RAPL 2658 12634455.87 0.04% 0.09%JOBACCT RAPL 2658 12645455.87 0.04% 0.18%

JOBACCT RAPL PROFILE 2658 12656320.47 0.04% 0.26%JOBACCT 0 IPMI 2658 12649197.41 0.04% 0.2%JOBACCT IPMI 2659 12674820.52 0.07% 0.41%

JOBACCT IPMI PROFILE 2661 12692382.01 0.15 % 0.54%

43 / 64

Plan

Introduction






44 / 64

Experimental Platform

HardwareI Grid5000 platform resourcesI 17 nodes cluster (DELL PowerEdge R720 with Intel

Xeon E5-2630 2.3GHz 2 sockets per node /6 cores persocket, 32 GB of Memory and 10 Gigabit EthernetNetwork.)

I Intergrated Wattmeters on all computing nodes

SoftwareI SLURM 2.6.0 with 1 slurmctld and 16 slurmdI Execution of HPL Linpack and Stream benchmarks

45 / 64

Experimental Methodology

Interchanging the following parameters:

I Execution of Long running (Linpack) andShort running (Stream) jobs

I Testing all possible CPU FrequenciesI SLURM Monitoring through RAPL or IPMI

with profiling (HDF5) activated on bothcases

Evaluate the:I Precision of job’s overall energy calculation

for IPMI case (with polling frequency =1sec)I Comparison of the SLURM Reported DB value

with the Wattmeters calculated value

I Precision of job’s instant power profiling forIPMI and RAPL case (with polling frequency=1sec)

I Comparison of the SLURM profiling (HDF5) datewith the Wattmeters data

46 / 64

Plan

Introduction



Energy and Power Measurements EvaluationsLong running jobsShort running jobs



47 / 64

Precision of job’s energy calculation with IPMI

Monitoring Modes / 2.301 2.2 2.0 1.8 1.4 1.2CPU-Frequencies

External Wattmeter 12754247.9 12106046.58 12034150.98 12086545.51 12989792.06 13932355.15Postreatment Value

SLURM IPMI 12708696 12116440 11998483 12093060 13107353 14015043Reported Accounting Value

Error Deviation 0.35% 0.08% 0.29% 0.05% 0.89% 0.58%

Table: SLURM IPMI precision in Accounting for Linpack executions

48 / 64

Precision of power profiling with IPMI

Graphs featuring:I first computing node (left)I all computing nodes (right)

0 500 1000 1500 2000 2500

250

300

350

Power consumption of one node measured through

External Wattmeter and SLURM inband IPMI during a Linpack on 16 nodes

Time (sec)

Insta

nt P

ow

er

Consum

ption [W

att]

0 500 1000 1500 2000 2500

250

300

350

Time (sec)

Type of collection − Calculated Energy Consumption

External Wattmeter − 936237.93 Joules

SLURM inband IPMI − 910812 Joules

Power consumption of Linpack execution upon 16 nodes measured through

External Wattmeter and SLURM inband IPMI

4000

4200

4400

4600

4800

5000

Consum

ption [W

att]

0 500 1000 1500 2000 2500

Time [s]




49 / 64

Precision of power profiling with RAPL


0 500 1000 1500 2000 2500

100

150

200

250

300


External Wattmeter and SLURM inband RAPL during a Linpack on 16 nodes

Time (sec)

Insta

nt P

ow

er

Consum

ption [W

att]

0 500 1000 1500 2000 2500

100

150

200

250

300

Time (sec)



SLURM inband RAPL − 518760 Joules

Power consumption of Linpack execution upon 16 nodes measured through

External Wattmeter and SLURM inband RAPL

1000

1400

1800

2200

2600

3000

3400

3800

4200

4600

5000

Consum

ption [W

att]

0 500 1000 1500 2000 2500

Time [s]



SLURM inband RAPL − 8458938 Joules

50 / 64

Plan

Introduction






51 / 64

Precision of job’s energy calculation with IPMI

Monitoring Modes / 2.301 2.2 2.0 1.8 1.4 1.2CPU-Frequencies

External Wattmeter 627555.9 567502.72 540666.46 518834.96 504505.76 529519.07Postreatment Value

SLURM IPMI 544740 500283 482566 467719 460005 497637Reported Accounting Value

Error Deviation 13.19% 11.84% 10.74% 9.85% 8.82% 6.02 %

Table: SLURM IPMI precision in Accounting for stream executions

52 / 64

Precision of power profiling with IPMI


0 50 100 150

100

150

200

250


External Wattmeter and SLURM inband IPMI during stream execution on 16 nodes

Time (sec)

Insta

nt P

ow

er

Consum

ption [W

att]

0 50 100 150

100

150

200

250

Time (sec)




Power consumption of stream execution upon 16 nodes measured through


1400

1600

1800

2000

2200

2400

2600

2800

3000

3200

3400

3600

3800

4000

4200

Insta

nt P

ow

er

Consum

ption [W

att]

0 50 100 150

Time [s]

Type of collection − Total Calculated Energy Consumption



53 / 64

Precision of power profiling with RAPL


0 50 100 150

150

200

250

300


External Wattmeter and SLURM inband RAPL during stream execution on 16 nodes

Time (sec)

Insta

nt P

ow

er

Consum

ption [W

att]

0 50 100 150

150

200

250

300

Time (sec)




Power consumption of stream execution upon 16 nodes measured through


2200

2400

2600

2800

3000

3200

3400

3600

3800

4000

4200

Insta

nt P

ow

er

Consum

ption [W

att]

0 50 100 150

Time [s]

Type of collection − Total Calculated Energy Consumption



54 / 64

Error Deviations Explanations

Setup on 2 different hardwaresI DELL PowerEdge R720 and BULL B710

nodes (12 cores, 32GB Memory each)I with different BMC models

SoftwareI SLURM 2.6.0 with 1 slurmctld and 1 slurmd eachI Execution of simple multi-threaded prime number calculation

Observe and Compare:I Precision of SLURM

RAPL and IPMImonitoring

55 / 64

Error Deviations Explanations

Graphs featuring different BMC models:I DELL node (left)I BULL node (right)

0 20 40 60 80 100

80

100

120

140

160

180

200

Power consumption of one DELL node measured through

SLURM inband IPMI or RAPL during prime number calculations program

Time (sec)

Insta

nt P

ow

er

Consum

ption [W

att]

0 20 40 60 80 100

80

100

120

140

160

180

200

Time (sec)

Type of collection

SLURM inband IPMI

SLURM inband RAPL

0 20 40 60 80

140

160

180

200

220

240

Power consumption of one BULL node measured through

SLURM inband IPMI or RAPL during prime number calculations program

Time (sec)

Insta

nt P

ow

er

Consum

ption [W

att]

0 20 40 60 80

140

160

180

200

220

240

Time (sec)

Type of collection

SLURM inband IPMI

SLURM inband RAPL

56 / 64

Plan

Introduction






57 / 64

Usage of Energy Accounting values

I Profiling Performance and Energy based on CPUFrequencies

I Graphs featuring Linpack( left) and Stream (right)executions

8e+06

9e+06

1e+07

1.1e+07

1.2e+07

1.3e+07

1.4e+07

3000 3500 4000 4500 5000 5500

Tota

l E

nerg

y C

onsum

ption (

Joule

s)

Execution Time (sec)

Energy-Performance Trade-off for Linpack executions upon a 16 nodes cluster with different CPU Frequencies

SLURM inband IPMI

2.301

2.32.2

2.1 2.01.9 1.8 1.7

1.6 1.5

1.4 1.3

1.2

External WattmetersSLURM inband RAPL

2.301

2.32.2

2.12.0 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2

300000

350000

400000

450000

500000

550000

600000

650000

150 160 170 180 190 200

Tota

l E

nerg

y C

onsum

ption (

Joule

s)

Execution Time (sec)

Energy-Performance Trade-off for stream executions upon a 16 nodes cluster with different CPU Frequencies

External Wattmeters2.301

2.3

2.22.1

2.01.9

1.81.7 1.6 1.5 1.4

1.3

1.2

SLURM inband IPMI

2.301

2.32.2

2.12.01.91.8

1.71.6 1.5 1.4

1.3

1.2

SLURM inband RAPL

2.301

2.3

2.22.1

2.01.9

1.81.7

1.61.5 1.4 1.3 1.2

58 / 64

Plan

Introduction






59 / 64

Conclusions

Framework Overhead:I Less than 0.6% of energy consumptionI Less than 0.2% of execution timeI Safe to use on production environment with >= 1sec sampling

frequency

Job’s energy consumption precision (SLURM Energy Accounting):I Good results for long running jobsI Not good results with short running jobs because of BMC issues,

newer BMC models show improved behaviourI Jobs with regular power variations that can cause strong aliasing

effects are not considered and will be treated in following studies

Power profiling precision:I IPMI: Correct average values but poor sensitivity (BMC responsible)I RAPL: Not complete node values but very good sensitivity

60 / 64

Current State

I Framework Design and evaluations appear in proc of [ICDCN-2014] a

I Developments appeared in Slurm-2.6.0 stable version, on July 2013I Today used in production on MeteoFrance supercomputer

( 1080nodes) and TU-Dresden cluster ( 600nodes) (IPMI version withprofiling activated)

I Energy Accounting features in SLURM enable institutions to chargeusers for energy consumption (besides CPU-Time).

I Control of Energy Consumption through the selection of the bestfrequency enables energy reductions with 100% system utilization.

I Workload trace files (SWF) with energy extracted from supercomputersin production to be used as heuristics for research in scheduling

aY. Georgiou et al. Energy Accounting and Control with SLURM Resource and Job Management System in M. Chatterjee et al.: ICDCN 2014, LNCS

8314, pp. 96–118, Springer-Verlag

SWF trace with Consumed Energy field

JobID Submit Wait Elapsed CPUs CPUTime Mem ...ConsumedEnergy3541439 0 271 67 1 59 374756 ...................... 209603541440 5 266 50 1 41 356628 ...................... 12150

61 / 64

Ongoing Works

Power/Energy MonitoringI IPMI and RAPL should report their results in the same time, to deduce

the evolution of the consumption of other parts of nodes (motherboard,network cards, etc)

I Finer-grained monitoring of various node components will help us tobetter characterize the usage effectiveness of nodes.

Power/Energy Measurements Precision

To deal with the worst case of jobs with regular power variations:I Development of new BULL specific BMC firmware to provide internal

energy consumption calculations with higher precision and lessoverhead (Proprietary)

I Technique based on start/get/stop technique for BMC energy calculationI 4Hz internal polling sampling

I Development of Slurm plugin to support this new BULL BMC firmwarethrough ipmi raw data collection with FreeIPMI

I New mechanisms will be deployed on TUDresden on December 2013

62 / 64

Future Works

Energy: a new type of resourceI Treating Energy as a new job characteristic opens new doors to treat

it as a new resource.I Energy fairsharing will keep the fairness of energy distribution amongst

users.

I Explore other techniques for Control of Energy Consumption during jobexecution (RAPL power-capping, GPU frequency setting, Network cardspower off, etc)

I Energy Aware Scheduling: Optimize system utilization with respect toenergy availability

I Power capping techniques reflecting the evolutions of the amount ofavailable power over time

63 / 64

Download - Energy Accounting and Control on HPC clustersperso.ens-lyon.fr/laurent.lefevre/greendayslille/green...Energy Accounting and Control on HPC clusters Yiannis Georgiou R&D Software Engineer

Top Related