IBM System Blue Gene/P - Performance Analysis
TRANSCRIPT
© 2009 IBM Corporation | PSSC Montpellier Deep Computing Team | 2010-06-23
IBM PSSC Montpellier Customer Center
Content
▪ Profiling
– GNU Profiler (Gprof)
– Vprof
▪ Communications Tracing
– MPI Trace Library
▪ Hardware Performance Monitors
– Universal Performance Counters (UPC)
– HPM Library
▪ IBM System Blue Gene/P Specifics
– Personality
– Kernel Interface
▪ Multi-Purpose Toolkits
– HPC Toolkit
▪ Major Open-Source Tools
– SCALASCA
– TAU
Code Profiling
▪ Purpose
– Identify the most time-consuming routines of a binary
• In order to determine where the optimization effort has to take place
▪ Standard Features
– Construct a display of the functions within an application
– Help users identify the most CPU-intensive functions
– Charge execution time to source lines
▪ Methods & Tools
– GNU Profiler
– Vprof
▪ Notes
– Both serial and parallel applications can be profiled
– Based on sampling (with support from both the compiler and the kernel)
GNU Profiler (Gprof) | How-To | Collection
▪ Compile the program with options: -g -pg -qfullpath
– Creates the symbols required for debugging / profiling
▪ Execute the program
– Standard way
▪ Execution generates profiling files in the execution directory
– gmon.out.<MPI Rank>
• Binary files, not human-readable
– The number of files depends on an environment variable
• One profiling file per process, or
• Three profiling files only
– One file each for the slowest / fastest / median process
▪ Two options for interpreting the output files
– GNU Profiler (command-line utility)
– Xprofiler (graphical utility, part of HPC Toolkit)
GNU Profiler (Gprof) | How-To | Visualization
▪ Generates a profiling report
– From the profiling output files
– Standard usage
• gprof <Binary> gmon.out.<MPI Rank> > gprof.out.<MPI Rank>
▪ The report is limited compared to standard Unix/Linux gprof output
– The subroutines and their relative importance
– Number of calls
VProf (Visual Profiler) | Definition
▪ The Visual Profiler, VProf, is a project developed for optimizing program performance and evaluating algorithm efficiency
▪ Provides
– Routines to collect statistical profiling information
– Programs to view execution profiles
• Graphical and command-line
▪ Profile data is used to generate performance summaries sorted by source code line, by file, and by function
▪ Advantages over Gprof
– Recompilation not necessary (linking only)
– Significantly lower performance overhead
– Profile visualisation through cprof is very clear
▪ Development
– Apparently no longer actively maintained (Sandia is the last known owner)
– Integrated into the IBM MPI Trace Library
VProf (Visual Profiler) | How-To
▪ Link with the IBM MPI Trace Library
▪ Set the appropriate environment variable
– VPROF_PROFILE=yes
• Enables profiling
• One profiling file per MPI trace file
▪ Execution produces profiling files
– vmon.out.<MPI Rank>
▪ Analyze the profiling files with cprof
– cprof -e <Binary> <Profiling File>
▪ The final profiling report contains four sections
– File summary
– Function summary
– Line summary
– Source code annotations
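The collection steps above reduce to a one-line environment setting; a minimal sketch, assuming the application is already linked against the trace library. The launcher invocation and binary name are site-specific placeholders, kept as comments.

```shell
# Configuration sketch only - launcher and binary names are placeholders.
export VPROF_PROFILE=yes        # enable VProf sampling (one file per MPI trace file)
# mpirun ... ./a.out            # the run produces vmon.out.<MPI Rank>
# cprof -e ./a.out vmon.out.0   # file / function / line summaries + annotated source
echo "VPROF_PROFILE=$VPROF_PROFILE"
```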
VProf (Visual Profiler) | Cprof Command Line Options
▪ Usage: cprof [options] executable [vmon_file ...]
– -d, --directory dir            Search dir for source files
– -D, --recursive-directory dir  Search dir recursively for source files
– -e, --everything               Show all information
– -a, --annotate file            Annotate file
– -n, --number                   Show number of samples (not %)
– -s, --show thres               Set threshold for showing aggregate data
– -H, --html dir                 Output HTML into directory dir
IBM MPI Trace Library | Principles
▪ MPI Trace Features
– Collects all MPI communications of an application
– Measures time spent in the MPI routines
– Provides a call graph for communication subroutines
▪ Usage
– Link with the library
• /bgp/usermisc/hhhibm11/libraries/libmpitrace/libmpitrace.a
– Execute the program
• Various environment variables can be specified (cf. next slide)
– Analyze the trace files
• Text files, human-readable
IBM MPI Trace Library | Environment Variables
▪ Environment Variables
– Data Collection Settings
• SWAP_BYTES={no* | yes}            Switches output file endianness
• TRACE_DIR=<Directory>             Output directory
• TRACE_BUFFER_SIZE=<Size>          Buffer size
• SAVE_ALL_TASKS={no* | yes}        Save all MPI tasks, or the maximum / median / minimum only
– Communications Profiling
• PROFILE_BY_CALL_SITE={no* | yes}  Provides the call stack for MPI primitives
• TRACE_ALL_EVENTS={no* | yes}
• TRACE_MAX_RANK=<Rank>
• TRACE_SEND_PATTERN={no* | yes}    Builds the point-to-point communication matrix
▪ Output Files
– mpi.profile.<Process ID>.<MPI Task #>
– events.trc
– hpmdata.X_Y_Z.<Process ID>
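A typical data-collection setup using the variables above might look as follows; a sketch only, since the actual values (output directory, launcher) are site specific, and the defaults marked with * above apply to anything left unset.

```shell
# Configuration sketch - directory and launcher are placeholders.
export TRACE_DIR=/tmp/traces      # where mpi.profile.* files are written
export SAVE_ALL_TASKS=yes         # keep every rank, not just min / median / max
export PROFILE_BY_CALL_SITE=yes   # record the call stack for each MPI primitive
export TRACE_SEND_PATTERN=yes     # build the point-to-point communication matrix
# mpirun ... ./a.out              # application linked against libmpitrace.a
echo "trace output goes to $TRACE_DIR"
```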
IBM MPI Trace Library | Sample Output
----------------------------------------------------------------
MPI Routine              #calls     avg. bytes      time(sec)
----------------------------------------------------------------
MPI_Comm_size                 1            0.0          0.000
MPI_Comm_rank                 1            0.0          0.000
MPI_Isend                  5738         2398.6          0.050
MPI_Irecv                  2163         2738.7          0.010
MPI_Waitall                1919            0.0          0.028
MPI_Reduce                    3            8.0          0.000
----------------------------------------------------------------
total communication time = 0.087 seconds.
total elapsed time       = 3.922 seconds.
user cpu time            = 3.890 seconds.
system time              = 0.030 seconds.
maximum memory size      = 30012 KBytes.
----------------------------------------------------------------
Message size distributions:

MPI_Isend     #calls     avg. bytes      time(sec)
                2389            8.0          0.012
                3349         4104.0          0.038

MPI_Irecv     #calls     avg. bytes      time(sec)
                 721            8.0          0.001
                1442         4104.0          0.008

MPI_Reduce    #calls     avg. bytes      time(sec)
                   3            8.0          0.000
Hardware Performance Monitors (HPM) | Definition
▪ Definition
– Extra logic inserted in the processor to count specific events
– Updated every cycle
– Strengths
• Non-intrusive
• Very accurate
• Low overhead
– Weaknesses
• Provides only hard counts
• Specific to each processor
• Access is not well documented
• Lack of standards and documentation on what is counted
▪ Purpose
– Provides comprehensive reports of events that are critical to performance on IBM systems
– Gathers critical hardware performance metrics
• Number of misses at all cache levels
• Number of floating-point instructions executed
• Number of instruction loads that cause TLB misses
– Helps to identify and eliminate performance bottlenecks
Universal Performance Counters (UPC) | Principles
▪ 256 counters, 64 bits each
– Hardware unit on the BG/P chip
▪ 72 counters are in the clock-x1 domain
– PowerPC 450 core: FPU, FP load/store…
– Counters specific to each core
▪ 184 counters are in the clock-x2 domain
– L2, L3, memory, networks
– Counters mostly shared across the node
▪ BG/P counters are tied to hardware resources, either specific to a core or shared across the node
– There is no process- or thread-level context,
• But processes and threads are pinned to specific cores
Universal Performance Counters (UPC) | Principles
▪ The counter mode and trigger method are programmable:
– Mode 0: info on cores 0 and 1 for the clock-x1 counters
• plus a set of 184 counters in the clock-x2 domain
– Mode 1: info on cores 2 and 3 for the clock-x1 counters
• plus a different set of 184 counters in the clock-x2 domain
– Modes 2 and 3: primarily intended for hardware designers
▪ Trigger methods: rising edge, default edge, falling edge, level high, level low
– The counters are basically looking at a voltage that can be "high" or "low" …
• The edge modes count the number of events (for instance, the counter increments on the rising edge of a low-to-high transition)
• The level (high or low) modes count p-clock cycles while the voltage is either high or low
– These modes tell you how many p-clocks went by while waiting on a load request (instead of the number of load requests)
• The first 72 counters are not affected by the trigger, but the memory counters can be different
• It is advised to use "default edge" or "level high"
Universal Performance Counters (UPC) | Counter Access
▪ The BGP_UPC interface definitions and list of events are in:
– /bgsys/drivers/ppcfloor/arch/include/spi/UPC.h
– /bgsys/drivers/ppcfloor/arch/include/spi/UPC_Events.h

// every process on the node calls BGP_UPC_Initialize()
BGP_UPC_Initialize();
// just one rank per node sets the counter config and zeros the counters
if (local_rank == 0) {
    BGP_UPC_Initialize_Counter_Config(counter_mode, counter_trigger);
    BGP_UPC_Zero_Counter_Values();
    BGP_UPC_Start(0);
}
MPI_Barrier(local_comm);   // communicator local to the node
// do work …
MPI_Barrier(local_comm);
if (local_rank == 0) {
    BGP_UPC_Stop();
    BGP_UPC_Read_Counter_Values(&counter_data,
                                sizeof(struct CounterStruct),
                                BGP_UPC_READ_EXCLUSIVE);
    // save the counter values from the counter_data structure …
    BGP_UPC_Start(0);
}

counter_mode = 0, 1, 2, 3 (plus some others … see UPC.h)
counter_trigger = BGP_UPC_CFG_LEVEL_HIGH, BGP_UPC_CFG_EDGE_DEFAULT

struct CounterStruct {
    int32_t  rank;                      // Rank
    int32_t  core;                      // Core
    int32_t  upc_number;                // UPC number
    int32_t  number_processes_per_upc;  // Number of processes per UPC unit
    BGP_UPC_Mode_t mode;                // User mode
    int32_t  number_of_counters;        // Number of counter values returned
    char     location[24];              // Location
    int64_t  elapsed_time;              // Elapsed time
    uint32_t reserved_1;                // Reserved for alignment
    uint32_t reserved_2;                // Reserved for alignment
    int64_t  values[256];               // Counter values
} counter_data;
Universal Performance Counters (UPC) | Usage
▪ The basic operation is BGP_UPC_Read_Counter_Values(&counter_data, …)
– Fills out a structure including 256 counter values (64 bits each)
▪ Caveats
– Reading all of the counters takes a long time … of order 10^4 cycles
• Consequence: in practice, you can only use the counters for coarse-grained measurements
– The BG/P headers (UPC.h) require the GNU compiler (mpicc, powerpc-bgp-linux-gcc) for compilation
• Consequence: it is best to wrap the counter routines in separately compiled source files
– Some counters count events, and other counters count "cycles", but one cycle in the clock-x2 domain = two processor cycles
• Consequence: multiply the value obtained by two to get processor cycles
• Example: counter 80 (mode 0), "BGP_PU0_L2_CYCLES_READ_REQUEST_PENDING", with trigger = level high, is the number of memory-bus cycles during which the L2 unit attached to core 0 is waiting on a read request
– Any process or thread running on the node can (and will) trigger the shared counters in the clock-x2 domain
• Consequence: this needs to be remembered to interpret the data properly
▪ Cf. Bob Walkup's documentation for recommendations
Universal Performance Counters (UPC) | Other Counter Interfaces
▪ All hardware counter interfaces for BG/P are layered on top of BGP_UPC
▪ The BGP_UPC layer is provided, so you can write your own interfaces
▪ PAPI 4.0 has been ported to BG/P. Some information has been posted by Argonne National Laboratory:
– http://trac.mcs.anl.gov/projects/performance/wiki/Machines
▪ HPC Toolkit provides documentation in HPM_ug.pdf; there is no hpmcount or hpmstat for BG/P, just libhpm.a. The environment variable HPM_EVENT_SET is used to set the counter mode (0, 1, 2, 3); the default value is 0. The default trigger method was previously "edge rise" (it can be set by the user).

#include <libhpm.h>
…
hpmInit(rank, program);
hpmTstart(number, label);
do_work();
hpmTstop(number);
hpmTerminate(rank);   // prints counter values etc.
Universal Performance Counters (UPC) | Higher-Level Interfaces
▪ All hardware counter interfaces for BG/P are layered on top of BGP_UPC
▪ You can write your own interfaces
▪ PAPI 4.0 has been ported to BG/P
– Requires application of patch 003
– Provides a common interface to many third-party tools
• Scalasca, TAU
▪ Useful information has been posted by Argonne National Laboratory
– http://trac.mcs.anl.gov/projects/performance/wiki/Machines
HPM | IBM MPI Library | How-To
▪ Principle
– The IBM MPI library provides a very easy-to-use implementation of HPM data extraction
▪ Usage
– Link with the IBM MPI library
– Execute with the following environment variables
• BGP_STATS={0* | 1}
• BGP_STATS_INTERVAL=<Interval (Seconds)>
– Execution produces one HPM file per MPI task
• hpmdata.<Coordinates>.<Job ID>
HPM | HPM Library | How-To
▪ Instrument the code for HPM measurement
– call hpm_init()
• Initializes counters
– call hpm_start('label')
• Starts counting a labeled block
– call hpm_stop('label')
• Stops counting a labeled block
– call hpm_print()
• Prints counter values and labels
▪ Link with the HPM library
▪ Execution produces one HPM data file per MPI task
– hpm_data.<MPI Rank>
HPM Library | How-To
▪ libmpihpm.a uses the MPI profiling interface, starts the BG/P UPC counters in MPI_Init(), stops them in MPI_Finalize(), and produces two counter output files:
• One text summary with min / max / avg counter values
• One binary file with all counter data from every node
▪ The command getcounts can be used to pull out the data for a given node from the aggregate binary file. Nodes are numbered in x, y, z order on the partition that the job ran on.
▪ Environment variables:
• BGP_COUNTER_MODE={0 | 1 | 2 | 3} (default = 0)
• BGP_COUNTER_TRIGGER={edge | high} (default = high)
▪ This utility provides aggregate flops for the whole job, from start to finish, along with MPI statistics, but it cannot be used to measure specific code blocks.
▪ A simple start/stop interface can be called from Fortran, C, or C++ to get counts around specific code blocks, with one output file per node.

Fortran interface:
call hpm_init()          ! one time, to initialize counters
call hpm_start('label')  ! start counting a labeled block
call hpm_stop('label')   ! stop counting a labeled block
call hpm_print()         ! print counter values and labels once at the end

C interface (add extern "C" for C++):
void HPM_Init(void);
void HPM_Start(char * label);
void HPM_Stop(char * label);
void HPM_Print(void);
Automatically Available Performance Counters
▪ Principle
– By providing a hook into the MPI_Init and MPI_Finalize functions, counters are enabled before an application runs, and the results are collected and summarized before the application exits
– Once this feature is enabled, no user intervention is required to collect this performance counter data, but options are provided at run time to change counter modes, counter triggers, and counter data output directories
– It is also possible to disable the collection of performance counter data at run time
▪ How-To
– Source the file
• /bgsys/drivers/ppcfloor/tools/AutoPerfCounters/EnableAutoPerfCounters
Personality | Definition
▪ Double Definition
– Static data given to every Compute Node and I/O Node at boot time by the control system
• Personality data contains information that is specific to the node
– Set of C language structures and functions that allow querying personality data from the node
• Useful to determine, at run time, where the tasks of the application are running
– Might be used to tune certain aspects of the application at run time, such as determining which set of tasks share the same I/O Node and then optimizing the network traffic from the Compute Nodes to that I/O Node
Personality | Usage Elements
▪ Two Include Files
– #include <common/bgp_personality.h>
– #include <common/bgp_personality_inlines.h>
• In directory: /bgsys/drivers/ppcfloor/arch/include
▪ Structure
– _BGP_Personality_t personality;
▪ Query Function
– Kernel_GetPersonality(&personality, sizeof(personality));
Personality | Provided Information
▪ personality.Network_Config.[X|Y|Z]nodes
– Number of X / Y / Z nodes in the torus
▪ personality.Network_Config.[X|Y|Z]coord
– X / Y / Z node coordinates in the torus
▪ Kernel_PhysicalProcessorID()
– Core ID on the Compute Node (0, 1, 2, 3)
▪ BGP_Personality_getLocationString(&personality, location)
– Location string
• Rxx-Mx-Nxx-Jxx
Personality | Example
#include <stdio.h>
#include <mpi.h>
#include <spi/kernel_interface.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>

int main(int argc, char * argv[]) {
    int taskid;
    int memory_size_MBytes;
    int node_config;
    int xcoord, ycoord, zcoord, xsize, ysize, zsize;
    int pset_num, pset_size, pset_rank, procid;
    char location[128];
    _BGP_Personality_t personality;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    Kernel_GetPersonality(&personality, sizeof(personality));

    memory_size_MBytes = personality.DDR_Config.DDRSizeMB;
    printf("Memory size = %d MBytes\n", memory_size_MBytes);

    node_config = personality.Kernel_Config.ProcessConfig;
    if (node_config == _BGP_PERS_PROCESSCONFIG_SMP) printf("SMP mode\n");
    else if (node_config == _BGP_PERS_PROCESSCONFIG_VNM) printf("Virtual node mode\n");
    else if (node_config == _BGP_PERS_PROCESSCONFIG_2x2) printf("Dual mode\n");
    else printf("Unknown mode\n");

    xcoord = personality.Network_Config.Xcoord;
    ycoord = personality.Network_Config.Ycoord;
    zcoord = personality.Network_Config.Zcoord;
    xsize  = personality.Network_Config.Xnodes;
    ysize  = personality.Network_Config.Ynodes;
    zsize  = personality.Network_Config.Znodes;

    pset_num  = personality.Network_Config.PSetNum;
    pset_size = personality.Network_Config.PSetSize;
    pset_rank = personality.Network_Config.RankInPSet;

    BGP_Personality_getLocationString(&personality, location);
    procid = Kernel_PhysicalProcessorID();

    MPI_Finalize();
    return 0;
}
Kernel Interface
▪ Main Include File
– #include <spi/kernel_interface.h>
• In directory: /bgsys/drivers/ppcfloor/arch/include
▪ Query Functions
– Kernel_GetMemorySize
• Retrieves memory information from the kernel
– Kernel_ActiveProcessCount
• Retrieves the number of active processes in the kernel
– …
IBM HPC Toolkit
▪ Toolkit Content
– Hardware (CPU) Performance
• Xprofiler
• HPM Toolkit
– Message-Passing Performance
• MPI Profiler / Tracer
– I/O Performance
• Modular I/O (MIO)
– Performance Visualization
• PeekPerf
▪ Supported Platforms
– AIX: AIX 5.3, AIX 6.1
– Linux on POWER: Red Hat 5.2, SLES 10
– IBM System Blue Gene/P
▪ Support via the Advanced Computing Technology Center in Research (ACTC)
▪ Link
– http://domino.research.ibm.com/comm/research_projects.nsf/pages/hpct.index.html
HPC Toolkit | Xprofiler
▪ Visualizes CPU time profiling data
▪ Compile and link with the -g -pg flags + optimization
▪ Code execution generates a gmon.out file
– MPI applications generate gmon.out.1, …, gmon.out.n
▪ Analyze the gmon.out file with Xprofiler
– Xprofiler a.out gmon.out
▪ Important factors
– On AIX the time-sampling interval is 0.01 sec
– Profiling introduces overhead due to function calls
HPC Toolkit | Xprofiler | GUI | Overview Window
▪ Width of a bar: time including called routines
▪ Height of a bar: time excluding called routines
▪ Call arrows labeled with number of calls
▪ Overview window for easy navigation (View → Overview)
HPC Toolkit | Xprofiler | GUI | Source Code Window
▪ The source code window displays source code with the time profile (in ticks = 0.01 sec)
▪ Access
– Select a function in the main display
• Context Menu
– Select a function in the flat profile
• Code Display
• Show Source Code
HPC Toolkit | Xprofiler | GUI | Disassembler Code
HPC Toolkit | Message-Passing Performance
▪ MP_Profiler Library
– Captures "summary" data for MPI calls
– Source code traceback
– The user MUST call MPI_Finalize() in order to get output files
– No changes to source code
• MUST compile with -g to obtain source line number information
▪ MP_Tracer Library
– Captures "timestamped" data for MPI calls
– Source traceback
MP_Profiler Output with Peekperf
MP_Profiler Message Size Distribution
MPI Function     #Calls    Message Size    #Bytes     Walltime
MPI_Isend        1 (B)     1M ... 4M       1048576    9E-07
MPI_Isend        2 (A)     256K ... 1M      786432    1.7E-06
MPI_Isend        2 (9)     64K ... 256K     196608    1.7E-06
MPI_Isend        2 (8)     16K ... 64K       49152    1.3E-06
MPI_Isend        2 (7)     4K ... 16K        12288    1.3E-06
MPI_Isend        2 (6)     1K ... 4K          3072    1.3E-06
MPI_Isend        2 (5)     257 ... 1K          768    1.3E-06
MPI_Isend        2 (4)     65 ... 256          192    1.3E-06
MPI_Isend        2 (3)     17 ... 64            48    1.3E-06
MPI_Isend        2 (2)     5 ... 16             12    1.4E-06
MPI_Isend        2 (1)     0 ... 4               3    0.000006
MPI_Comm_rank    1 (1)     0 ... 4               0    1E-07
MPI_Comm_size    1 (1)     0 ... 4               0    1E-07
MPI_Barrier      5 (1)     0 ... 4               0    7.8E-06
MPI_Waitall      21 (1)    0 ... 4               0    1.98E-05
MPI_Irecv        1 (B)     1M ... 4M       1048576    0.000517
MPI_Irecv        2 (A)     256K ... 1M      786432    0.00039
MPI_Irecv        2 (9)     64K ... 256K     196608    9.98E-05
MPI_Irecv        2 (8)     16K ... 64K       49152    2.23E-05
MPI_Irecv        2 (7)     4K ... 16K        12288    7.1E-06
MPI_Irecv        2 (6)     1K ... 4K          3072    3.4E-06
MPI_Irecv        2 (5)     257 ... 1K          768    2.6E-06
MPI_Irecv        2 (4)     65 ... 256          192    2.4E-06
MPI_Irecv        2 (3)     17 ... 64            48    1.5E-06
MPI_Irecv        2 (2)     5 ... 16             12    1.4E-06
MPI_Irecv        2 (1)     0 ... 4               3    4.7E-06
MP_Tracer Output with Peekperf
HPM Toolkit | Components
� libhpc
– Library for program (including multi-thread) section instrumentation
– Environment Variables
• HPM_EVENT_SET=[0-3]• HPM_UNIQUE_FILE_NAME={ 0 | 1 }
� Not available on Blue Gene/P
– hpccount
• Starts application and provides– Wall clock time– Hardware performance counter information– Resource utilization statistics
• Not available on Blue Gene/P
– hpcstat
• Provides system wide reports for root• Not available on Blue Gene/P
HPM Toolkit | libhpm
� Insert libhpc library calls in the source code and instrument different sections independently
� Supports Fortran, C, and C++
� Provides for each instrumented section
– Total count & duration (wall clock time)
– Hardware performance counters information
– Derived metrics
� Provides resource usage statistics for the total ex ecution of the instrumented program
� Supports
– MPI, OpenMP, & pthreads
– Multiple instrumentation points
– Nested instrumentation
– Multiple calls to an instrumented point
HPM Toolkit | libhpm | Output | Textual
▪ Summary report for each task: perfhpm<taskID>.<pid>
– libhpm (V 2.6.0) summary
– Total execution time of instrumented code (wall time): 0.143824 seconds
– Instrumented section: 3 - Label: job 1 - process: 1
– file: sanity.c, lines: 33 <--> 70
– Count: 1
– Wall Clock Time: 0.143545 seconds
– BGL_FPU_ARITH_MULT_DIV (multiplications and divisions: fmul, fmuls, fdiv, fdivs (Book E mul, div)): 0
– BGL_FPU_LDST_DBL_ST (…): 23
– …
– BGL_UPC_L3_WRBUF_LINE_ALLOC (write buffer line was allocated): 1702
– …
▪ Peekperf performance file
– hpm<taskID>_<progName>_<pid>.viz
▪ Table performance file
– tb_hpm<taskID>.<pid>
HPM Toolkit | libhpm | Output | Peekperf
Environment Flags
▪ HPM_EVENT_SET
– Selects the event set to be recorded
– Integer (0 – 15)
▪ HPM_NUM_INST_PTS
– Overrides the default of 100 instrumentation sections in the application
– Integer value > 0
▪ HPM_WITH_MEASUREMENT_ERROR
– Deactivates the procedure that removes measurement errors
– True or False (0 or 1)
▪ HPM_OUTPUT_NAME
– Defines an output file name different from the default
– String
▪ HPM_VIZ_OUTPUT
– Indicates whether a ".viz" file (for input to PeekPerf) should be generated
– True or False (0 or 1)
▪ HPM_TABLE_OUTPUT
– Indicates whether a table text file should be generated
– True or False (0 or 1)
Peekperf
▪ Visualization and analysis tool
▪ Offline analysis and viewing capability
▪ Supported platforms
– AIX
– Linux (Power/Intel)
– Windows (Intel)
– Blue Gene
MP_Profiler Visualization Using PeekPerf
MP_Tracer Visualization Using PeekPerf
HPM Visualization Using PeekPerf
Modular I/O Performance Tool (MIO)
▪ I/O Analysis
– Trace module
– Summary of file I/O activity + binary events file
– Low CPU overhead
▪ I/O Performance Enhancement Library
– Prefetch module (optimizes asynchronous prefetch and write-behind)
– System buffer bypass capability
– User-controlled pages (size and number)
▪ Recoverable Error Handling
– Recover module (monitors return values and errno, and reissues failed requests)
▪ Remote Data Server
– Remote module (simple socket protocol for moving data)
▪ Shared object library for AIX
Performance Visualization
[Figure: JFS file I/O performance plot of file position (bytes) vs. time (seconds), showing reads and writes, with vmtune -p20 -P80 -f120 -F128 -r2 -R8]
Scalasca | Definition
▪ Scalasca = SCalable performance Analysis of LArge SCale Applications
– Performance measurement and analysis tool developed by the Innovative Computing Laboratory (ICL) and the Jülich Supercomputing Centre (JSC)
– Scalable trace analysis tool
• SCALASCA analyzes separate local trace files in parallel by replaying the original communication on as many CPUs as were used to execute the target application itself
▪ Link
– http://www.fz-juelich.de/jsc/scalasca/
Scalasca | Usage
▪ Easy Use
– No source code modification
– Wrapper functions for compilation and execution
• Recompilation required
▪ Three-Stage Process
– Instrument
• Prepare application objects and executable for measurement
• scalasca -instrument [options] <compile-or-link-command>
– Analyze
• Run the application under control of the measurement system
• scalasca -analyze [options] <application-launch-command>
– Examine
• Interactively explore the measurement analysis report
• scalasca -examine [options] <experiment-archive|report>
Scalasca | Graphical User Interface
Scalasca | Personal Experience Feedback
▪ Hard to install but easy to use
– Exception: the mpirun command line is a pain in the neck
▪ More useful than standard profiling / MPI tracing?
– Rich but complex GUI
• Requires X11 forwarding or VNC
▪ Probably mandatory for very large numbers of nodes
– Where standard profiling reaches its limits
– But performance analysis is often performed on a smaller number of nodes
TAU
▪ TAU = Tuning and Analysis Utilities
– Program and performance analysis tool framework being developed for the DOE Office of Science, ASC initiatives at LLNL, the ZeptoOS project at ANL, and Los Alamos National Laboratory
– Provides a suite of static and dynamic tools that offer graphical user interaction and interoperation to form an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications
– Link
• http://www.cs.uoregon.edu/research/tau/home.php