A Survey about Performance Counters, Libraries and Tools

Joseph Bryant Manzano Franco

Agenda

Introduction
• W3H: The Why, The What, The When, and The How

Hardware Performance Libraries
• Performance Application Programming Interface (PAPI)
• Performance Counters Library (PCL)

Visualization Tools
• TAU: An example of a data collector
• KOJAK: Semi-automatic instrumentation tool
• VAMPIR: An example of a script language
• PE: The all-levels approach

Introduction

Program Optimization

Algorithm Optimization: Search for the most effective algorithms and data structures

Other ubiquitous optimizations: Consider common architecture features such as cache structures

Architecture Optimizations: Apply architecture-specific characteristics (PIM instructions, atomic loads and stores, massive memory allocations, etc.)

Data Collection and Data Analysis: Identify and solve unexpected problems in the interaction between hardware and software (memory and network bottlenecks, false sharing, poor cache management, etc.)

Introduction: The Why

Data Collection

• High Level Library Functions: Easy to use and available in almost all libraries. Restricted and intrusive. Composed of timing functions and clever data manipulation
• Performance Counters: Easy to use (especially with high level wrappers). Provide a range of measurements and are less intrusive
• Simulation environments: Complete control over the environment, including hardware, memory hierarchies and application code. Development is long for new architectures. Steep learning curve

Data Analysis

• Manual Analysis: Simple, but limited in its use. Prone to human error
• Automatic Statistical Analysis: Organizes the data in a suitable format. Still need to deal with numbers
• Visualization Tools: Graphical representation of data or its properties. Easy to identify trends even in large sets of data

Introduction: The What

Performance Counters

Special registers that are present in a specific architecture

Designed to count architectural events

• An event is defined as an action that the hardware takes
• Predefined
• Examples: cache misses / hits, TLB misses / hits, context switches, cache invalidations, total instructions, etc.

Sun UltraSPARC: Two 32-bit registers called PIC (Performance Instrumentation Counters). User control restricted

Pentium Pro: Two 40-bit registers called PerfCtr0/1. User control available
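
Narrow counters such as the 32-bit PIC registers wrap around quickly, so measurements are normally taken as the difference between two reads rather than as a raw value. A minimal sketch in C, assuming a free-running 32-bit counter that wraps at most once between the two reads; `counter_delta32` is a hypothetical helper name, not part of any library:

```c
#include <stdint.h>

/* Hypothetical helper: number of events between two reads of a
 * free-running 32-bit counter. Unsigned subtraction is performed
 * modulo 2^32, so the result stays correct if the counter wrapped
 * around once between the reads. */
uint32_t counter_delta32(uint32_t before, uint32_t after)
{
    return (uint32_t)(after - before);
}
```

A wider counter, such as the 40-bit Pentium Pro registers, would need the same subtraction followed by a mask to the counter width.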

Introduction: The When

Date | Machine/Author | Method of reading/Document
1966 | Don Widring | Initial Metering Design
~1970 | GE 645 | Multics
1979 | Honeywell 6180 | Yellow Submarine
1983 | Cray X-MP | User Accessible Registers
Late 80s / early 90s | IBM 3090 mainframes, first-generation IBM RS/6000 | Restricted and Confidential
1992 | First Alpha chip (DEC) | Uprofile, kprofile or IPROBE
1993 | Pentium | Not documented and embedded in the MSR

Introduction: The How

Example: UltraSPARC Architecture

Two counters, 32 bits each

Events being counted: number of instructions (pic0) and cache invalidations (pic1)

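[Figure: two CPUs, each with its own cache ($) and its own pic0 and pic1 counters, connected by a bus. The animation steps through the instructions load 0,s1; load 1,s2; inc s2; load 0,s1; load 0,s1; load 1,s2; add s1, s2, s1; store 0,s1 while the pic0 and pic1 values on each CPU advance.]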

Hardware Performance Libraries

Performance Counters: A good idea, but only accessible to hardware experts.

Solution: High Level Wrappers
• Usually written in C and Fortran
• Easy to make thread safe and to integrate into existing code

Examples:
• Performance Application Programming Interface (PAPI)
• Performance Counters Library (PCL)

Performance Application Programming Interface

A set of high level wrapper functions that covers a vast set of architectures and events

Available for Power3, Power4, UltraSPARC II and III, all flavors of Pentium, Itanium, AMD Athlon, etc.

Well documented, stable and reliable programming interface

Goals of the PAPI project:
• To provide a solid foundation for cross platform performance analysis tools
• To present a set of standard definitions for performance metrics on all platforms
• To provide a standardized API among users, vendors, and academics
• To be easy to use, well documented, and freely available
(Excerpt obtained from the PAPI user guide)

PAPI is an effort of the Innovative Computing Laboratory (ICL), which is part of the Department of Computer Science at the University of Tennessee

PAPI

Overhead of PAPI_read() in PAPI 3.0

Platform | Cycles/Call
Altix (Itanium 2, Madison chip) | 1357
IBM Power 4 | 4034
Itanium 2 (libpfm 2.0) | 1606
Pentium 3 (perfctr 2.4.5) | 324
Pentium 4 (perfctr 2.4.5) | 401
SGI R12k | 3681
UltraSPARC II | 2150

Block Diagram

Portable Layer: High Level API, Low Level API

Machine Dependent Layer: Substrate, Kernel Extensions, Operating System, Hardware Performance Counters

PAPI: Terminology

Native Events:
• Defined as countable by a specific CPU
• Machine dependent
• Hexadecimal value and a mask provided by the PAPI libraries

Preset Events:
• Predefined events
• Events (or groups of events) that are considered useful and relatively ubiquitous across architectures
• A PAPI identifier is provided

Event List:
• An array of events (usually consisting of PAPI identifiers)

PAPI: Terminology

High Level API:
• A group of functions
• A single list of events
• Access to native events is prohibited
• Some flexibility and performance are given up in exchange for ease of use

Low Level API:
• Another group of functions
• Multiple event list definitions and a native events interface
• Only one event list can be running at any point in time

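PAPI: Steps

Initialization of the PAPI library

Start the counters

Operate on the counters

Stop the counters

De-allocate any resource that has been allocated

#include <papi.h>
#include <stdio.h>

#define NUM_EVENTS 2

int main(int argc, char **argv)
{
    /* Count total instructions and total cycles */
    int Events[NUM_EVENTS] = { PAPI_TOT_INS, PAPI_TOT_CYC };
    long_long values[NUM_EVENTS], val2[NUM_EVENTS];
    int a = 0;
    int retval;

    /* Initialize the library and check that the versions match */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library initialization failed\n");
        return 1;
    }

    /* Start the counters */
    PAPI_start_counters(Events, NUM_EVENTS);

    /* Each read also resets the counters, so this read discards start-up counts */
    PAPI_read_counters(values, NUM_EVENTS);
    a++;
    /* Coarse measurement: a++ plus the cost of the read itself */
    PAPI_read_counters(values, NUM_EVENTS);
    /* Back-to-back read: measures only the overhead of a read */
    PAPI_read_counters(val2, NUM_EVENTS);

    printf("The value of a is: %i \n", a);
    printf("The Coarse Instructions are: %10lld\n", values[0]);
    printf("The Coarse Cycles are: %10lld\n", values[1]);
    printf("The Overhead Instructions are: %10lld\n", val2[0]);
    printf("The Overhead Cycles are: %10lld\n", val2[1]);
    printf("The Total Instructions are: %10lld\n", values[0] - val2[0]);
    printf("The Total Cycles are: %10lld\n", values[1] - val2[1]);

    /* Stop the counters */
    PAPI_stop_counters(values, NUM_EVENTS);
    return 0;
}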

PAPI: Output

The value of a is: 1
The Coarse Instructions are: 179
The Coarse Cycles are: 641
The Overhead Instructions are: 175
The Overhead Cycles are: 395
The Total Instructions are: 4
The Total Cycles are: 246

Assembly Output of a++

ld [%fp-52],%l0
add %l0,1,%l0
st %l0,[%fp-52]
add %fp,-32,%o0

The first access produces an (L2) cache miss
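
The "Total" lines of the output are obtained by subtracting the measured overhead of a read from the coarse measurement. A minimal sketch of that arithmetic in C; `net_count` is a hypothetical helper, not a PAPI function, and it assumes the overhead of a read is roughly the same from one call to the next:

```c
/* Hypothetical helper: remove the cost of the measurement itself
 * from a coarse counter value. If the overhead estimate is larger
 * than the coarse value (measurement noise), 0 is returned so the
 * result is never negative. */
long long net_count(long long coarse, long long overhead)
{
    return (coarse > overhead) ? coarse - overhead : 0;
}
```

With the values printed by the example, net_count(179, 175) gives the 4 instructions of a++ and net_count(641, 395) gives the 246 cycles.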

PAPI: Extra Features

• Multithread safe and support
• Multiplexing where available
• Overflow control with thresholds
• Statistical profiling and related functions
• Error detection and control features

Performance Counters Library

• Another example of a high level performance counter library
• Events are classified (as in PAPI) as Memory Hierarchy events (caches, TLB, memory, etc.), Instructions (instruction types, instructions completed, etc.), Status of Functional Units, and Rates and Ratios
• It supports the Pentium architectures up to Pentium 4, the AMD Athlon / Duron, the IBM Power series up to Power 3-II, Alpha's 21164 and 21264, SGI's R10000 and R12000 and the UltraSPARC family of processors
• PCL is available for C, C++ and Java
• PCL is an effort of Forschungszentrum Juelich GmbH and the University of Applied Sciences Bonn-Rhein-Sieg in Germany, and it is currently in its second version

PCL

High Level API:
• Similar to the PAPI High Level API, but the functions are different
• Event lists can be created in this API
• Access to predefined events only
• Recommended

Low Level API:
• Gives direct access to the performance counters
• Not recommended

Handle:
• A single datum (usually an integer) that is used to uniquely identify a set of resources
• Used to provide a thread-specific link to the resources (the list of events)

PCL: Steps

Initialization of the PCL library

Start the counters

Operate on the counters

Stop the counters

De-allocate any resource that has been allocated

#include <pcl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int counter_list[2], a = 0;
    int ncounter;
    unsigned int mode;
    PCL_CNT_TYPE i_result_list[2];
    PCL_FP_CNT_TYPE fp_result_list[2];
    PCL_DESCR_TYPE descr;

    /* Initialize the library; descr is the handle */
    PCLinit(&descr);

    /* Count cycles and instructions in user mode */
    ncounter = 2;
    counter_list[0] = PCL_CYCLES;
    counter_list[1] = PCL_INSTR;
    mode = PCL_MODE_USER;

    /* Start the counters */
    PCLstart(descr, counter_list, ncounter, mode);
    a++;
    /* Stop the counters and collect the results */
    PCLstop(descr, i_result_list, fp_result_list, ncounter);

    printf("%f instructions in %f cycles\n",
           (double)i_result_list[1], (double)i_result_list[0]);

    /* Release the resources tied to the handle */
    PCLexit(descr);
    return 0;
}

PCL: Differences with PAPI

• Nested function calls are enabled
• Rates and ratios are function calls in the PAPI libraries
• The Low Level API deals with native code as PAPI's Low Level API does, but its use is not recommended in PCL
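
Whichever library supplies them, rates and ratios are derived from two raw counts, for example the instruction and cycle counts collected in the examples. A minimal sketch in C; `instructions_per_cycle` is a hypothetical helper and is not a PCL or PAPI function:

```c
/* Hypothetical helper: derive a ratio from two raw counter values.
 * Returns 0.0 when no cycles were counted, to avoid dividing by zero. */
double instructions_per_cycle(unsigned long long instructions,
                              unsigned long long cycles)
{
    if (cycles == 0) {
        return 0.0;
    }
    return (double)instructions / (double)cycles;
}
```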

Visualization Tools

After gathering the information with these tools, how can it be presented to the user in the most efficient manner?

Visualization tools provide a good way to present trends in data across extensive data sets

Examples of visualization tools:
• Tuning and Analysis Utilities (TAU)
• Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks (KOJAK)
• VAMPIR / VAMPIRTrace
• Performance Evaluator (PE)

Tuning and Analysis Utilities (TAU)

Program and performance analysis tool framework for high-performance parallel and distributed computing.

A suite of tools for static and dynamic analysis of programs written in C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java.

Instrumentation by functions

The concept of Inclusive and Exclusive

With time:
• Exclusive time: the time spent in the function minus all the time spent in functions that have been instrumented and are called by this function
• Inclusive time: the total time of the function

With a performance counter:
• The same as time, with the properties of that performance counter

Supported extensions in C and FORTRAN: MPI and OpenMP

Hardware counters supported: PAPI and PCL
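
The relation between inclusive and exclusive time can be sketched in C as follows. The `profile_node` structure and the `exclusive_time` function are hypothetical names used for illustration and are not part of TAU; the sketch assumes that every instrumented function called directly by a function reports its own inclusive time.

```c
#include <stddef.h>

/* Hypothetical profile record for one instrumented function. */
struct profile_node {
    double inclusive;                    /* total time, callees included */
    const struct profile_node *children; /* instrumented direct callees */
    size_t num_children;
};

/* Exclusive time = inclusive time minus the inclusive time of every
 * instrumented function called directly by this one. */
double exclusive_time(const struct profile_node *node)
{
    double t = node->inclusive;
    for (size_t i = 0; i < node->num_children; i++) {
        t -= node->children[i].inclusive;
    }
    return t;
}
```

The same subtraction applies when the metric is a hardware counter value instead of time.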

TAU Infrastructure

KOJAK

Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks

A complete infrastructure dedicated to finding performance bottlenecks and application properties

Consists of the following components:
• OpenMP Pragma And Region Instrumentor (OPARI), which redirects OpenMP function calls and directives toward wrappers that contain instrumentation information (POMP), and PMPI
• TAU (function instrumentation)
• Event Processing, Investigating and Logging (EPILOG) runtime library (event-oriented trace creation utility)

KOJAK

• Extensive Performance Tool (EXPERT): a trace file analyzer that searches the traces for low-performing sections and classifies them according to severity; it uses the Event Analysis and Recognition Library (EARL)
• CUBE (KOJAK's trace visualization tool)
• Trace transformations to different formats (to VAMPIR trace format)

KOJAK Infrastructure

KOJAK Snapshots

VAMPIR

A configurable trace visualization tool

Converts trace information into a variety of graphical views:
• Process State Display
• Statistics Display
• Timeline Display
• Communications Statistics

Configured by using:
• Pull-down menus
• Configuration file

The displays can be related to the source code

Zoom in and zoom out (advanced feature)

Defined trace format: VAMPIR-Trace (runtime library enhanced with trace creation calls)

VAMPIR Infrastructure

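[Figure: Source Code is processed by the Guide Compiler into Object Files; the Linker combines them with the Guide Libraries and the VAMPIRTrace Libraries into an Executable; the Executable, together with a Config File, produces a Trace File that is read by VAMPIR.]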

VAMPIR Snapshot

Performance Evaluator

Java-based tool

All-level analysis of a program's behavior:
• Application software level analysis: data / algorithm analysis
• Operating system level analysis: thread context switching, thread scheduling
• Hardware level analysis: memory hierarchy

Uses PMAPI performance counters (IBM proprietary)

Performance Evaluator Infrastructure

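[Figure: components shown are the K42 Infrastructure, the AIX OS and Others as information sources; three numbered files (1 Trace Format File, 2 Map File, 3 Meta File); a Parser / Modifier between the PE Trace Format and the PE2 Trace Format; and the PE2 Visualization Tool.]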

Performance Evaluator: A Run

Get hardware information from the infrastructures (the source has been instrumented and the OS is also collecting information)

Create:
• Trace file(s): trace records of a program, with shorthand versions of events
• Map file: holds static information about functions, threads and other structures
• Meta file(s): properties of a trace, record type definitions and map type definitions

Performance Evaluator: A Run

• Feed the files to the tool
• Visualize the information with graphs
• Examine the whole application behavior from beginning to end
• Complete GUI with the Eclipse Workbench
• Designed to work with several multithreaded packages in C and Java
• OpenMP not supported

Thanks so much for your time

Questions? Comments?
