energy measurement on intel architecturesprace.it4i.cz/sites/prace.it4i.cz/files/files/phi-02... ·...
TRANSCRIPT
Agenda
The HardwareIntel Turbo Boost and Enhanced SpeedStepIntel RAPL
The ToolsThe Linux way - powercapx86 adaptlikwidPAPIperftiptopIntel Xeon Phi
Interpreting results
The Hardware
Intel Turbo BoostThermal Design Power (TDP) - maximum amount of powerrequired to dissipate by the cooling system
Intel R© Xeon R© Processor E5-2680 v3 - TDP: 120Whttps://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2 50-GHz
Turbo Boost & AVX Base Frequencies
With Haswell and later (also includes KNL):
(Image: Intel)
Turbo Boost MAX 3.0 - available from Broadwell-EPAVX base frequency is set per core - better granularity.
Intel Enhanced SpeedStep
Approximate CPU power consumption:
P = CV 2f
where:C is capacitance of the processor circuitry (fixed)V input voltagef frequency
I Exposed through ACPI since Pentium M - intel pstatemodule
I Set of P-states: voltage/frequency pairs of operating pointsI Switching between states has latencyI OS controlled policies - Linux cpufreq governors
Processor P and C states
C-statesI Idle modes - how deep processor sleepsI C0 - Running ⇒ C6 - Maximal voltage reduction
Transistion between states introduces latency - max. C-state canbe forced
Kernel parameter: intel_idle.max_cstate=0
Verification: $ cat /sys/module/intel_idle/parameters/max_cstate9
P-statesI Operating modes - frequency/voltage pairI Controlled by OS (ACPI)I Skylake introduces autonomy
Intel Speed Shift
cpupower - Linux cpufreq interface
[root@cn11 ˜]# cpupower frequency-infoanalyzing CPU 0:
driver: intel_pstateCPUs which run at the same hardware frequency: 0CPUs which need to have their frequency coordinated by software: 0maximum transition latency: Cannot determine or is not supported.hardware limits: 1.20 GHz - 3.10 GHzavailable cpufreq governors: performance powersavecurrent policy: frequency should be within 1.20 GHz and 3.10 GHz.
The governor "powersave" may decide which speed to usewithin this range.
current CPU frequency: 1.30 GHz (asserted by call to hardware)boost state support:
Supported: noActive: no3000 MHz max turbo 4 active cores3000 MHz max turbo 3 active cores3100 MHz max turbo 2 active cores3100 MHz max turbo 1 active cores
Also available in: /sys/devices/system/cpu ...More information:https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
Running Average Power Limit - RAPL
I Software-based power meterin CPU
I Metrics available throughMSRs
I Used by Intel Turbo BoostI Introduced in Sandy Bridge
microarchitecture
RAPL Domains Hierarchy
I Package - SocketI Power Plane 0 (PP0) - Individual CPU coresI Power Plane 1 (PP1) - Uncore devicesI DRAM - RAM memory
Grain of SaltAlways refer to CPU datasheet, available domains aremodel-specific. For example PP0/1 may not be available on someHaswell CPUs.
RAPL Interface - Machine Specific Registers
RDMSR/WRMSR - Privileged instructions for accessing MSRs
More info: System Programming Guide, Chapter 14.9.
Measurement units and increments
UnitsValues from MSR XXX ENERGY STATUS are not final.Use following formula to obtain correct values:
xvalue = c · 12m
where:c is value obtained from the STATUS registerm value of multiplier provided by MSR RAPL POWER UNIT
Value overflowThe STATUS register is updated in 1ms interval and wraps in 60s,earlier in case of heavy load.
The Tools
Linux Power Capping Framework
I Devices exposed through sysfs hierarchyI Using intel rapl kernel module
[root@cn11 ˜]# ls -R1 /sys/devices/virtual/powercap/intel-rapl/.../sys/devices/virtual/powercap/intel-rapl/intel-rapl:0:constraint_0_max_power_uwconstraint_0_nameconstraint_0_power_limit_uwconstraint_0_time_window_usconstraint_1_max_power_uw...
x86 adapt
I Linux kernel module andlibrary
I Secure access to MSR andPCI registers from userspace
I Useful for building customtools
More info: https://github.com/tud-zih-energy/x86 adapt
Using x86 adapt
Individual MSRs are available as r/w knobs defined in the library.
I API available throughlibx86 adapt.so
I Populates/dev/x86 adapt/*
Available knobs listing:
Item 0: RESET----------------Item 1: Intel_xd_bit_disable----------------Item 2: Intel_PERF_GLOBAL_STATUS----------------Item 3: Intel_RAPL_Pckg_Energy......
x86 adapt - Using C API
#i n c l u d e <s t d i o . h>#i n c l u d e <x86 adapt . h>. . .
// Lookup RAPL cpu i temconst char∗ i tem name = ” Inte l RAPL Pckg Energy ” ;i n t i t e m i d = x 8 6 a d a p t l o o k u p c i n a m e ( devtype , item name ) ;i f ( i t e m i d < 0){
f p r i n t f ( s t d e r r , ” Could not f i n d %s\n” , item name ) ;}
u i n t 6 4 t r e s u l t ;i n t r e t ;i f ( ( r e t = x 8 6 a d a p t g e t s e t t i n g ( fd cpu , i t e m i d ,& r e s u l t ) ) != 8){
f p r i n t f ( s t d e r r , ” Could not read i tem %d f o r cpu/ d i e %d\n” , i t e m i d ,CPU ) ;r e t u r n −1;
}
p r i n t f ( ”CPU: %d | MSR PKG ENERGY STATUS MSR %l l u \n” ,CPU, r e s u l t ) ;. . .
likwid
I Set of CLI toolsI C APII Performance and energy
measurementI OpenMP and MPI supportI Benchmarking
Power related tools:I likwid-topologyI likwid-powermeterI likwid-perfscope
Developed by: Regionales RechenZentrum Erlangen (RRZE)
https://github.com/RRZE-HPC/likwid
likwid-powermeter
Continuous measurement for selected CPU:
[root@cn11 ˜]# likwid-powermeter -c 0 -s 2s--------------------------------------------------------------------------------CPU name: Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHzCPU type: Intel Xeon SandyBridge EN/EP processorCPU clock: 2.40 GHz----------------------------------------------------------------------------------------------------------------------------------------------------------------Runtime: 2.0007 sMeasure for socket 0 on CPU 0Domain PKG:Energy consumed: 32.4096 JoulesPower consumed: 16.1991 WattDomain PP0:Energy consumed: 6.12715 JoulesPower consumed: 3.0625 WattDomain DRAM:Energy consumed: 1.58621 JoulesPower consumed: 0.792828 Watt--------------------------------------------------------------------------------
PAPI - Performance API
I Effort to provide portable API for accessing hardwareperformance counters
I Supports CPUs, network, accelerators, etc.I Many features: custom events, timers, multiplexing, statistics
Developed by: University of Tennessee - Innovative Computing Laboratory
http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:Overview
PAPI - Basic concepts
Components, Counters, EventsI High Level API
I Easy to use ( 10functions)
I Only preset events onthe CPU
I Low Level APII User-defined groups of
EventsI All PAPI componentsI Including native events
RAPL events available only as native!
PAPI - Components
Available components on a Sandy Bridge node(papi component avail):
...Name:net Linux network driver statistics
Native: 80, Preset: 0, Counters: 320
Name:rapl Linux SandyBridge RAPL energy measurementsNative: 14, Preset: 0, Counters: 14
Name:stealtime Stealtime filesystem statisticsNative: 33, Preset: 0, Counters: 33
...
PAPI - RAPL EventsAvailable native RAPL events on a Sandy Bridge node:(papi native avail):
Native Events in Component: raplrapl:::THERMAL_SPEC:PACKAGE0rapl:::THERMAL_SPEC:PACKAGE1rapl:::MINIMUM_POWER:PACKAGE0rapl:::MINIMUM_POWER:PACKAGE1rapl:::MAXIMUM_POWER:PACKAGE0rapl:::MAXIMUM_POWER:PACKAGE1rapl:::MAXIMUM_TIME_WINDOW:PACKAGE0rapl:::MAXIMUM_TIME_WINDOW:PACKAGE1rapl:::PACKAGE_ENERGY:PACKAGE0rapl:::PACKAGE_ENERGY:PACKAGE1rapl:::DRAM_ENERGY:PACKAGE0rapl:::DRAM_ENERGY:PACKAGE1rapl:::PP0_ENERGY:PACKAGE0rapl:::PP0_ENERGY:PACKAGE1
PAPI - Low level API demo
i n t e v e n t c o d e = −1;PAPI event name to code ( ” r a p l : : : PACKAGE ENERGY :PACKAGE0” , &e v e n t c o d e ) ;
P A P I e v e n t i n f o t e i n f o ;P A P I g e t e v e n t i n f o ( event code , &e i n f o ) ;
p r i n t f ( ” Event symbol : %s\n” , e i n f o . symbol ) ;p r i n t f ( ” D e s c r i p t i o n : %s\n\n” , e i n f o . l o n g d e s c r ) ;
i n t e v e n t s e t = PAPI NULL ;
// Crea te empty even t s e ti f ( ( r e t = P A P I c r e a t e e v e n t s e t (& e v e n t s e t ) ) != PAPI OK) {
h a n d l e e r r ( r e t ) ;}
// Add RAPL even t to even t s e ti f ( ( r e t = PAPI add event ( e v e n t s e t , e v e n t c o d e ) ) != PAPI OK) {
h a n d l e e r r ( r e t ) ;}
// S t a r t c o l l e c t i n g e v e n t si f ( ( r e t = P A P I s t a r t ( e v e n t s e t ) ) != PAPI OK) {
h a n d l e e r r ( r e t ) ;}
p r i n t f ( ” Doing some FLOPs . . . \ n” ) ;
PAPI - Low level API demo
. . .FLOPS. . . .// Stop c o l l e c t i n g e v e n t sl ong long v a l u e s [ 1 ] ; // For one even ti f ( ( r e t = PAPI stop ( e v e n t s e t , v a l u e s ) ) != PAPI OK) {
h a n d l e e r r ( r e t ) ;}
p r i n t f ( ” Energy consumed on PKG0 : %l l d %s \n” , v a l u e s [ 0 ] , e i n f o . u n i t s ) ;
perf - Linux profiling tool
I Common profiling tool inLinux
I Measuring, sampling,analysis
I Uses counters exposed bykernel
[root@cn11 ˜]# perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]branch-misses [Hardware event]bus-cycles [Hardware event]cache-misses [Hardware event]cache-references [Hardware event]...power/energy-cores/ [Kernel PMU event]power/energy-pkg/ [Kernel PMU event]power/energy-ram/ [Kernel PMU event]
perf - Measuring energy using RAPL
perf stat -a -e \power/energy-pkg/,\power/energy-ram/,\power/energy-cores/,\cycles [binary-to-measure]
time counts unit events0.087152594 3,24 Joules power/energy-pkg/0.087152594 0,11 Joules power/energy-ram/0.087152594 0,92 Joules power/energy-cores/0.087152594 137 374 362 cycles
perf - Real time monitoring
tiptop - Top for Hardware Performance counters
I Real-time diplay of IPC,cache misses, etc.
I ncurses base top-like
Developed by: Inria http://tiptop.gforge.inria.fr/
Power Measurement on Xeon PhiI Host: PAPI 1 & RAPL (Package, PowerPlane02 & DRAM).I Coprocessor:
I On host: Use micsmc -f to get Total PowerI On coprocessor:
I /sys/class/micras/power (∼50 msec updates), e.g.:> cat /sys/ class / micras / power113000000112000000113000000 # Total instantaneous uWatt22100000016000000 # PCIe power uWatt28000000 # 2x3 connector uWatt69000000 # 2x4 connector uWatt28000000 0 96700032000000 0 100000031000000 0 1501000
I Library libmicmgmt provides an API (see man libmicmgmt)More information here
Attention: Reading power values from KNC is”desctructive”(no idle power can be measured, but should be∼17 Watt)!
1For KNC modules micpower and host micpower can be used2All cores - except for HSW (see here )
micsmc - quite useful utility
Static measurement with MERIC Tool
I Multi-node energymeasurement
I RAPL + x86 adaptI Readex project
$ source meric/intel2017a/set_env$ staticMERICtool/multiNodeStaticMeasureStart.sh --rapl$ ./a.out$ staticMERICtool/multiNodeStaticMeasureStop.sh --rapl
Runtime [s]: 6.59214Overall energy consumption [J]: 1167
The tool needs a special permissions on the cluster.For further info contact Ondrej Vysocky<[email protected]>.
Power vs. wall time tradeoff
I Reduce energy consumption via cpufreq or RAPL powercapping
I Possible to find optimal tradeoffI Highly depends on load type and cluster utilization
I Support infrastructure (DRUPS, storage, monitoring,. . . )consumes a lot of energy
Thank you for your attention.