performance tuning using hardware counter data
DESCRIPTION
Performance Tuning Using Hardware Counter Data. Philip Mucci m [email protected] Shirley Moore [email protected] Nils Smeds [email protected]. SC 2001 November 12, 2001 Denver, Colorado. Outline. Issues in application performance tuning – 30 minutes General design of PAPI – 15 minutes - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/1.jpg)
Performance Tuning Using Hardware Counter Data
Philip [email protected]
Shirley [email protected]
Nils [email protected]
SC 2001November 12, 2001Denver, Colorado
![Page 2: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/2.jpg)
2
Outline
• Issues in application performance tuning – 30 minutes
• General design of PAPI – 15 minutes• PAPI high-level interface – 15 minutes• PAPI low-level interface – 15 minutes• Counter overflow interrupts and
statistical profiling – 30 minutes (advanced)
• Tools that use PAPI – 30 minutes• Code examples – 30 minutes
![Page 3: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/3.jpg)
3
Issues in Application Performance Tuning
![Page 4: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/4.jpg)
4
HPC Architecture
• RISC or super-scalar architecture– Pipelined functional units– Multiple functional units in the CPU– Speculative execution– Several levels of cache memory– Cache lines shared between CPUs
![Page 5: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/5.jpg)
FloatingPointUnit
FPU1
FloatingPointUnit
FPU2
LD/STUnit
LS1
FixedPointUnit
FXU2
LD/STUnit
LS2
FixedPointUnit
FXU1
FixedPointUnit
FXU3
Branch/Dispatch
Memory Mgmt UnitInstruction Cache
IU
BIU Bus Interface Unit: L2 Control, Clock
Memory Mgmt UnitData Cache
DU
L2 Cache1-16 MB 5XX Bus
Branch history table: 2048 entriesBranch target cache: 256 entries
32 KB, 128-way
64 KB, 128-way
32 Bytes 32 Bytes
32 Bytes @ 200 MHz = 6.4 GB/s
16 Bytes @100 MHz = 1.6 GB/s
POWER3 Processing Units (Model 260)
![Page 6: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/6.jpg)
6
L2 C
ache
Bus Controller
L3 C
ache
Sco
rebo
ard,
Pre
dica
te,
NaT
s, E
xcep
tions
Branch & PredicateRegisters 128 Integer Registers 128 FP Registers
BranchUnits
Integerand
MM Units
Dual-PortL1
DataCache
AndDTLB
ALA
T FloatingPointUnits
SIMDFMAC
BranchPrediction
IA-32Decode
AndControl
L1 Instruction CacheAnd
Fetch/Pre-fetch EngineITLB
Register Stack Engine / Re-Mapping
B B B M M I I F F
DecouplingBuffer
8 Bundles
Itanium ™ Processor Block Diagram
![Page 7: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/7.jpg)
7
Hardware Counters
• Small set of registers that count events, which are occurrences of specific signals related to the processor’s function
• Monitoring these events facilitates correlation between the structure of the source/object code and the efficiency of the mapping of that code to the underlying architecture.
![Page 8: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/8.jpg)
8
Pipelined Functional Units
• The circuitry on a chip that performs a given operation is called a functional unit.
• Most integer and floating point units are pipelined – Each stage of a pipelined unit working
simultaneously on different sets of operands
– After initial startup latency, goal is to generate one result every clock cycle
![Page 9: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/9.jpg)
9
Super-scalar Processors
• Processors that have multiple functional units are called super-scalar.
• Examples:– IBM Power 3
• 2 floating point units (multiply-add)• 3 fixed point units• 2 load/store units• 1 branch/dispatch unit
![Page 10: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/10.jpg)
10
Super-scalar Processors (cont.)
– MIPS R12K• 2 floating point units (1 multiply-add, 1
add)• 2 integer units• 2 load/store units
– Alpha EV67• Instruction fetch/issue/retire unit• Integer execution unit (2 IU clusters)• Floating point execution unit (2 FPUs)
![Page 11: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/11.jpg)
11
Super-scalar Processors (cont.)
– Intel Itanium• EPIC (Explicitly Parallel Instruction
Computing) design• 4 integer units• 4 multimedia units• 2 load/store units• 3 branch units• 2 extended precision floating point units• 2 single precision floating point units
![Page 12: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/12.jpg)
12
Out of Order Execution
• CPU dynamically executes instructions as their operands become available, out of order if necessary– Any result generated out of order is
temporary until all previous instructions have successfully completed.
– Queues are used to select which instructions to issue dynamically to the execution units.
– Relevant hardware counter metrics: instructions issued, instructions completed
![Page 13: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/13.jpg)
13
Speculative Execution
• The CPU attempts to predict which way a branch will go and continues executing instructions speculatively along that path.– If the prediction is wrong, instructions
executed down the incorrect path must be canceled.
– On many processors, hardware counters keep counts of branch prediction hits and misses.
![Page 14: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/14.jpg)
14
Instruction Counts and Functional Unit Status
• Relevant hardware counter data– Total cycles– Total instructions– Floating point operations– Load/store instructions– Cycles functional units are idle– Cycles stalled
• waiting for memory access• waiting for resource
– Conditional branch instructions• executed• mispredicted
![Page 15: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/15.jpg)
15
Cache and Memory Hierarchy
• Registers: On-chip circuitry used to hold operands and results of calculations
• L1 (primary) data cache: Small on-chip cache used to hold data about to be operated on
• L2 (secondary) cache: Larger (on- or off-chip) cache used to hold data and instructions retrieved from local memory.
• Some systems have L3 and even L4 caches.
![Page 16: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/16.jpg)
16
Cache and Memory Hierarchy (cont.)
• Local memory: Memory on the same node as the processor
• Remote memory: Memory on another node but accessible over an interconnect network.
• Each level of the memory hierarchy introduces approximately an order of magnitude more latency than the previous level.
![Page 17: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/17.jpg)
17
Cache Structure
• Memory on a node is organized as an array of cache lines which are typically 4 or 8 words long. When a data item is fetched from a higher level cache or from local memory, an entire cache line is fetched.
• Caches can be either – direct mapped or – N-way set associative
• A cache miss occurs when the program refers to a data item that is not present in the cache.
![Page 18: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/18.jpg)
18
Cache Contention
• When two or more CPUs alternately and repeatedly update the same cache line– memory contention
• when two or more CPUs update the same variable
• correcting it involves an algorithm change– false sharing
• when CPUs update distinct variables that occupy the same cache line
• correcting it involves modification of data structure layout
![Page 19: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/19.jpg)
19
Cache Contention (cont.)
• Relevant hardware counter metrics– Cache misses and hit ratios– Cache line invalidations
![Page 20: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/20.jpg)
20
TLB and Virtual Memory
• Memory is divided into pages.• The operating system translates the
virtual page addresses used by a program into physical addresses used by the hardware.– The most recently used addresses are
cached in the translation lookaside buffer (TLB).
– When the program refers to a virtual address that is not in the TLB, a TLB miss occurs.
• Relevant hardware counter metric: TLB misses
![Page 21: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/21.jpg)
21
Memory Latencies
• CPU register: 0 cycles• L1 cache hit: 2-3 cycles• L1 cache miss satisfied by L2 cache hit:
8-12 cycles• L2 cache miss satisfied from main
memory, no TLB miss: 75-250 cycles• TLB miss requiring only reload of the
TLB: ~2000 cycles• TLB miss requiring reload of virtual
page – page fault: hundreds of millions of cycles
![Page 22: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/22.jpg)
22
Steps of Optimization
• Optimize compiler switches • Integrate libraries• Profile• Optimize blocks of code that
dominate execution time by using hardware counter data to determine why the bottlenecks exist
• Always examine correctness at every stage!
![Page 23: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/23.jpg)
23
General Design of PAPI
![Page 24: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/24.jpg)
24
Goals
• Solid foundation for cross platform performance analysis tools
• Free tool developers from re-implementing counter access
• Standardization between vendors, academics and users
• Encourage vendors to provide hardware and OS support for counter access
• Reference implementations for a number of HPC architectures
• Well documented and easy to use
![Page 25: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/25.jpg)
25
Overview of PAPI
• Performance Application Programming Interface
• The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
• Parallel Tools Consortium project http://www.ptools.org/
![Page 26: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/26.jpg)
26
PAPI Counter Interfaces
• PAPI provides three interfaces to the underlying counter hardware: 1. The low level interface manages hardware
events in user defined groups called EventSets.
2. The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.
3. Graphical tools to visualize information.
![Page 27: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/27.jpg)
27
PAPI Implementation
Java Monitor GUI
PAPI Low LevelPAPI High Level
Hardware Performance Counter
Operating System
Kernel Extension
PAPI Machine Dependent SubstrateMachine
SpecificLayer
PortableLayer
![Page 28: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/28.jpg)
28
PAPI Preset Events
• Proposed standard set of events deemed most relevant for application performance tuning
• Defined in papiStdEventDefs.h• Mapped to native events on a
given platform– Run tests/avail to see list of PAPI
preset events available on a platform
![Page 29: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/29.jpg)
29
PAPI Release
• Platforms– Linux/x86, Windows 2000
• Requires patch to Linux kernel, driver for Windows
– Linux/IA-64– Sun Solaris/Ultra 2.8– IBM AIX/Power
• Contact IBM for pmtoolkit– SGI IRIX/MIPS– Compaq Tru64/Alpha Ev6 & Ev67
• Requires OS device driver from Compaq– Cray T3E/Unicos
![Page 30: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/30.jpg)
30
PAPI Release (cont.)
• C and Fortran bindings and Matlab wrappers
• To download software:http://icl.cs.utk.edu/projects/papi/
![Page 31: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/31.jpg)
31
PAPI High-level Interface
![Page 32: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/32.jpg)
32
High-level Interface
• Meant for application programmers wanting coarse-grained measurements
• Not thread safe• Calls the lower level API• Allows only PAPI preset events• Easier to use and less setup
(additional code) than low-level
![Page 33: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/33.jpg)
33
High-level API
• C interfacePAPI_start_countersPAPI_read_countersPAPI_stop_countersPAPI_accum_countersPAPI_num_countersPAPI_flops
• Fortran interfacePAPIF_start_countersPAPIF_read_countersPAPIF_stop_countersPAPIF_accum_countersPAPIF_num_countersPAPIF_flops
![Page 34: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/34.jpg)
34
Setting up the High-level Interface
• Int PAPI_num_counters(void)
– Initializes PAPI (if needed)– Returns number of hardware counters
• int PAPI_start_counters(int *events, int len)
– Initializes PAPI (if needed)– Sets up an event set with the given counters– Starts counting in the event set
• int PAPI_library_init(int version)
– Low-level routine implicitly called by above
![Page 35: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/35.jpg)
35
Controlling the Counters
• PAPI_stop_counters(long_long *vals, int alen)– Stop counters and put counter values in array
• PAPI_accum_counters(long_long *vals, int alen)– Accumulate counters into array and reset
• PAPI_read_counters(long_long *vals, int alen)– Copy counter values into array and reset counters
• PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)– Wallclock time, process time, FP ins since start,– Mflop/s since last call
![Page 36: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/36.jpg)
36
PAPI_flops
• int PAPI_flops(float *real_time, float *proc_time, long_long *flpins, float *mflops)– Only two calls needed, PAPI_flops before and after
the code you want to monitor– real_time is the wall-clocktime between the two calls– proc_time is the “virtual” time or time the process
was actually executing between the two calls (not as fine grained as real_time but better for longer measurements)
– flpins is the total floating point instructions executed between the two calls
– mflops is the Mflop/s rating between the two calls
![Page 37: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/37.jpg)
37
PAPI High-level Example
long long values[NUM_EVENTS]; unsigned int
Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC}; /* Start the counters */ PAPI_start_counters((int*)Events,NUM_EVENTS); /* What we are monitoring? */ do_work(); /* Stop the counters and store the results in values */ retval = PAPI_stop_counters(values,NUM_EVENTS);
![Page 38: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/38.jpg)
38
Return codes
Name Description PAPI_OK No error PAPI_EINVAL Invalid argument PAPI_ENOMEM Insufficient memory PAPI_ESYS A system/C library call failed. Check errno variable PAPI_ESBSTR Substrate returned an error. E.g. unimplemented feature PAPI_ECLOST Access to the counters was lost or interrupted PAPI_EBUG Internal error PAPI_ENOEVNT Hardware event does not exist PAPI_ECNFLCT Hardware event exists, but resources are exhausted PAPI_ENOTRUN Event or envent set is currently counting PAPI_EISRUN Events or event set is currently running PAPI_ENOEVST No event set available PAPI_ENOTPRESET Argument is not a preset PAPI_ENOCNTR Hardware does not support counters PAPI_EMISC Any other error occured
![Page 39: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/39.jpg)
39
PAPI Low-level Interface
![Page 40: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/40.jpg)
40
Low-level Interface
• Increased efficiency and functionality over the high level PAPI interface
• About 40 functions• Obtain information about the
executable and the hardware• Thread-safe• Fully programmable• Callbacks on counter overflow
![Page 41: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/41.jpg)
41
Low-level Functionality
• Library initializationPAPI_library_init, PAPI_thread_init,
PAPI_shutdown• Timing functions
PAPI_get_real_usec, PAPI_get_virt_usecPAPI_get_real_cyc, PAPI_get_virt_cyc
• Inquiry functions• Management functions• Simple lock
PAPI_lock/PAPI_unlock
![Page 42: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/42.jpg)
42
Event sets
• The event set contains key information– What low-level hardware counters to use– Most recently read counter values– The state of the event set (running/not
running)– Option settings (e.g., domain, granularity,
overflow, profiling)• Event sets can overlap if they map to the
same hardware counter set-up.– Allows inclusive/exclusive measurements
![Page 43: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/43.jpg)
43
Event set Operations
• Event set managementPAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset
• Event set controlPAPI_start, PAPI_stop, PAPI_read, PAPI_accum
• Event set inquiryPAPI_query_event, PAPI_list_events,...
![Page 44: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/44.jpg)
44
Simple Example
#include "papi.h“#define NUM_EVENTS 2int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;
long_long values[NUM_EVENTS];/* Initialize the Library */retval = PAPI_library_init(PAPI_VER_CURRENT);/* Allocate space for the new eventset and do setup */retval = PAPI_create_eventset(&EventSet);/* Add Flops and total cycles to the eventset */retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS);/* Start the counters */retval = PAPI_start(EventSet);
do_work(); /* What we want to monitor*/
/*Stop counters and store results in values */retval = PAPI_stop(EventSet,values);
![Page 45: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/45.jpg)
45
Overlapping Counters
retval = PAPI_start(InclEventSet);retval = PAPI_start(OthersEventSet);
...
retval = PAPI_reset(OthersEventSet);do_flops(NUM_FLOPS); /* Function call */retval = PAPI_accum(OthersEventSet,Othersvalues);
...
retval = PAPI_stop(InclEventSet,Inclvalues);printf("Counts: %12lld %12lld\n",
Inclvalues[0], Inclvalues[0]-Othersvalues[0]);
![Page 46: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/46.jpg)
46
Counter Domains
• int PAPI_set_domain(int domain);– PAPI_DOM_USER User context counted– PAPI_DOM_KERNEL Kernel/OS context
counted– PAPI_DOM_OTHER Exception/transient mode– PAPI_DOM_ALL All above contexts counted– PAPI_DOM_MIN The smallest available
context– PAPI_DOM_MAX The largest available context
• All domains not available on all platforms - OS dependent
![Page 47: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/47.jpg)
47
Counter Granularity
• int PAPI_set_granularity(int granul);– PAPI_GRN_THR count each individual thread– PAPI_GRN_PROC count each individual process– PAPI_GRN_PROCG count each process group– PAPI_GRN_SYS count on the current CPU– PAPI_GRN_SYS_CPU count on every CPU's– PAPI_GRN_MIN (=PAPI_GRN_THR)– PAPI_GRN_MAX (=PAPI_GRN_SYS_CPU)
• Requires OS support
![Page 48: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/48.jpg)
48
Using PAPI with Threads
• After PAPI_library_init need to register unique thread identifier function
• For Pthreads retval=PAPI_thread_init(pthread_self, 0);
• OpenMP retval=PAPI_thread_init(omp_get_thread_num, 0);
• Each thread responsible for creation, start, stop and read of its own counters
![Page 49: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/49.jpg)
49
Using PAPI with Multiplexing
• Multiplexing allows simultaneous use of more counters than are supported by the hardware.
• PAPI_multiplex_init() – should be called after PAPI_library_init()
to initialize multiplexing• PAPI_set_multiplex( int *EventSet );
– Used after the eventset is created to turn on multiplexing for that eventset
• Then use PAPI like normal
![Page 50: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/50.jpg)
50
Issues with Multiplexing
• Some platforms support hardware multiplexing, on those that don’t PAPI implements multiplexing in software.
• The more events you multiplex, the more likely the representation is not correct.
![Page 51: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/51.jpg)
51
Multiplex Code Examples
tests/multiplex1.ctests/multiplex1_pthreads.c
From the PAPI source distribution:
![Page 52: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/52.jpg)
52
Native Events
• An event countable by the CPU can be counted even if there is no matching preset PAPI event
• Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition
![Page 53: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/53.jpg)
53
Native Event Examples
tests/native.cftests/native.F
From the PAPI source distribution:
![Page 54: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/54.jpg)
54
Counter Overflow Interrupts and
Statistical Profiling
![Page 55: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/55.jpg)
55
Callbacks on Counter Overflow
• PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold.
• For systems that do not support counter overflow at the OS level, PAPI sets up a high resolution interval timer and installs a timer interrupt handler.
![Page 56: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/56.jpg)
56
PAPI_overflow
• int PAPI_overflow(int EventSet, int EventCode, int threshold, int flags, PAPI_overflow_handler_t handler)
• Sets up an EventSet such that when it is PAPI_start()’d, it begins to register overflows
• The EventSet may contain multiple events, but only one may be an overflow trigger.
![Page 57: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/57.jpg)
57
Overflow Code Examples
tests/overflow.ctests/overflow_pthreads.c
From the PAPI source distribution:
![Page 58: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/58.jpg)
58
Statistical Profiling
• PAPI provides support for execution profiling based on any counter event.
• PAPI_profil() creates a histogram of overflow counts for a specified region of the application code.
![Page 59: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/59.jpg)
59
PAPI_profil
int PAPI_profil(unsigned short *buf, unsigned int bufsiz, unsigned long offset, unsigned scale, int EventSet, int EventCode, int threshold, int flags)
•buf – buffer of bufsiz bytes in which the histogram counts are stored•offset – start address of the region to be profiled•scale – contraction factor that indicates how much smaller the histogram buffer is than the region to be profiled
![Page 60: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/60.jpg)
60
Profiling Code Examples
tests/profile.ctests/sprofile.ctests/profile_pthreads.c
From the PAPI source distribution:
![Page 61: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/61.jpg)
61
Tools that use PAPI
![Page 62: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/62.jpg)
62
Perfometer
• Application is instrumented with PAPI– call perfometer()– call mark_perfometer(Color)
• Application is started. At the call to perfometer, signal handler and timer are set to collect and send the information to a Java applet containing the graphical view.
• Sections of code that are of interest can be designated with specific colors– Using a call to set_perfometer(‘color’)
• Real-time display or trace file
![Page 63: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/63.jpg)
63
Perfometer DisplayMachine info
Process &Real time
Flop/s Rate
Flop/s Min/Max
![Page 64: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/64.jpg)
64
Perfometer Parallel Interface
![Page 65: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/65.jpg)
65
Third-party Tools that use PAPI
• DEEP/PAPI (Pacific Sierra) http://www.psrv.com/deep_papi_top.html
• TAU (Allen Mallony, U of Oregon) http://www.cs.uoregon.edu/research/paracomp/tau/
• SvPablo (Dan Reed, U of Illinois) http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
• Cactus (Ed Seidel, Max Plank/U of Illinois) http://www.aei-potsdam.mpg.de
• Vprof (Curtis Janssen, Sandia Livermore Lab) http://aros.ca.sandia.gov/~cljanss/perf/vprof/
• Cluster Tools (Al Geist, ORNL)• DynaProf (Phil Mucci, UTK)
http://www.cs.utk.edu/~mucci/dynaprof/
![Page 66: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/66.jpg)
66
DEEP/PAPI
![Page 67: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/67.jpg)
67
SvPablo
![Page 68: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/68.jpg)
68
TAU
![Page 69: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/69.jpg)
69
vprof
![Page 70: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/70.jpg)
70
Code Examples
![Page 71: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/71.jpg)
71
Code Examples
• Parallelising a particle particle simulator
• Parallelising a frequency domain MHD simulator
![Page 72: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/72.jpg)
72
Particle - particle simulator• Particles fall in a well• Particle interactions
computed for particles in the neighbourhood only
• Occasionally the neighbourhood list is recomputed
• 1000 particles• Neighbour list length
10000• ~6000-7000 interactions
Neighborhood
![Page 73: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/73.jpg)
73
Algorithm used
• Force vector is sum-updated in a “random” access pattern
• Little cache re-use
• Inhibits SMP parallelization
For each particle iFor each neighbor j
Compute distance ijCompute inter-particle forceUpdate force on particles i & j
Compute accelerations and updated positions
![Page 74: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/74.jpg)
74
Reversed neighborlist
• Introduce force interaction vector
• Introduce a reverse neighbour list
• Inter-particle force written linearly, but read randomly in j-loop
• Force vector updated linearly
For each particle iFor each neighbor j
Compute distance ijCompute inter-particle force
Update force on particles jCompute accelerations and update positions
Update force on particles i
![Page 75: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/75.jpg)
75
1 Naive load balancing2 Neighbour balancing
Final performanceWall clock time per time step
![Page 76: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/76.jpg)
76
Explanation
• User reports serial program 3 times faster– Several contributing factors:
• Compiler optimisations• Compiler inlining• Better cache utilization
• Without the linear traversing of writes no speed-up (not shown in previous graph)
• Scaling problem on the SGI is a cache issue? Whole problem fits nicely into one 8MB L2 cache
![Page 77: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/77.jpg)
77
Frequency domain MHD
• Code makes frequent 3D FFT transformations• Electric and magnetic field double complex
128 bit precision• Array dimensions are (3,N,N,N), N=64 • Array size: 12MB per field• In between calls matrices are set up in loops
M(:,j,k,l)=A(:,j,k,l) × B(:,j,k,l) + C(:,j,k,l)• Parallel FFTs are available• Parallel matrix set up is straight forward
![Page 78: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/78.jpg)
78
Expected behaviour
• Code is expected to be memory bound outside FFTs due to array sizes and number of floating point operations vs. memory accesses
• Going parallel on a bus gives no gain - or does it?
• Speed-up should be obtainable on CC-NUMA
• Code should run well on vector systems with good FFTs and enough memory ports
![Page 79: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/79.jpg)
79
Observed behaviour
Overloaded systemSerial DXML FFTs
![Page 80: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/80.jpg)
80
Obtained speed up vs. streams
• The IBM bus is “switched” - this can explain that speed-up was obtained
• Speed-up on the SGI platform could be better
http://www.cs.virginia.edu/stream/standard/Bandwidth.htmlMachine ncpus COPY SCALE ADD TRIAD
IBM_SP-PWR3smp-High-222 8 2954.1 2821 3889 3872.2 10 6̂ BYTE/SECIBM_SP-PWR3smp-High-222 4 1603.5 1535.9 2218.3 2187.5IBM_SP-PWR3smp-High-222 2 820.7 800.4 1182.1 1165.4IBM_SP-PWR3smp-High-222 1 413 421.2 587.4 614.2
SGI_Origin2000-195 8 1355 1450 1413 1675SGI_Origin2000-195 6 1066 1075 1262 1396SGI_Origin2000-195 4 666 727 792 874SGI_Origin2000-195 2 351 365 392 413SGI_Origin2000-195 1 296 300 315 317
![Page 81: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/81.jpg)
81
PAPI measurements — IBM
• Critical code section
main loop….Call nlin(….)….
end main loopSubroutine nlin
Compute arraysCall FFTsCompute arrays
Call FFTs Compute arraysend subroutine nlin
• Overlapping counters used
Inclusive results SecondsMain loop 169.90FFTs 109.66nlin 131.13
Exclusive results SecondsPAPI main loop: 38.76PAPI FFTs: 109.66PAPI nlin: 21.47
![Page 82: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/82.jpg)
82
Raw results
Exclusive results PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STMPAPI main loop: 234670773 9165559 0 675PAPI FFTs: 159252099 16727100 0 0PAPI nlin: 152516696 19557038 0 0
PAPI_LD_INS PAPI_ST_INSPAPI main loop: 894209461 513179931PAPI FFTs: 6169588863 5057277064PAPI nlin: 661376717 475098610
PAPI_FP_INS PAPI_PRF_DM SecondsPAPI main loop: 334859261 5042052 38.8PAPI FFTs: 7967695530 10715 109.7PAPI nlin: 1097492151 24339739 21.5
![Page 83: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/83.jpg)
83
Deduced results
Exclusive results LD_INS/FP_INS: ST_INS/FP_INS: LD_INS/L1_LDM:PAPI main loop: 2.67 1.53 3.81PAPI FFTs: 0.77 0.63 38.74PAPI nlin: 0.60 0.43 4.34
L1 LD B/W: L1 ST B/W: PREF B/W:PAPI main loop: 184.76 7.22 3.97 MiBi/secPAPI FFTs: 44.32 4.65 0.00PAPI nlin: 216.78 27.80 34.60
PAPI main loop: 8.64 M(FP_INS)/sPAPI FFTs: 72.66PAPI nlin: 51.12
• Still not at memory peak in nlin
• Good cache reuse in FFTs• No cache reuse in main
loop (cache line length is 32 byte)
![Page 84: Performance Tuning Using Hardware Counter Data](https://reader035.vdocuments.site/reader035/viewer/2022062310/56815c94550346895dcaa309/html5/thumbnails/84.jpg)
84
For More Information
• http://icl.cs.utk.edu/projects/papi/– Software and documentation– Reference materials– Papers and presentations– Third-party tools– Mailing lists