

Allen D. Malony, Sameer Shende, Alan Morris {malony,sameer,amorris}@cs.uoregon.edu

Department of Computer and Information Science

Performance Research Laboratory

NeuroInformatics Center

University of Oregon

Phase-Based Parallel Performance Profiling


Phase-Based Parallel Performance Profiling, ParCo 2005

Outline of Talk

Motivation
  Models in parallel scientific applications
  Phases and performance mapping
Problem description
  Motivating example
Profiling techniques
  Flat, callpath, phase profiling
Approach and implementation
Applications
Future work and concluding remarks


Motivation

Scientific applications are designed based on models
  Computational: structural, logical, numerical models, …
  Correctness: execution order, data consistency, …
  Performance: expected, factors, parallelism/scalability, …
Computational models form developer’s “mental” model
  How the program is intended to behave and perform
Want to relate performance model to computation model
  View performance data with respect to “mental” model
  Better identify problems and guide tuning decisions
Must link computational abstractions to performance
  Bridge semantic gap between measurements and “mental” model


Performance Mapping

General problem of linking performance to computation
  Performance mapping (Irvin and Miller, ’96; Shende, ’01)
Associate (map) measured performance data
  To higher level, semantic representations
  Those with model significance to the user
What is the difficulty of making the association?
  Depends on performance information
    performance events/state visible from instrumentation
    what performance data can be measured
  How the performance information is used in mapping
  Difficulty in how performance information is presented
    Model-based views (LeBlanc et al., ’90)


Phases and Performance Mapping

Like to support the association between model and data
Concept of “phases” is common in scientific applications
  How developers think about structure, logic, numerics
  How performance can be interpreted (Worley, ’92)
Worthwhile to consider support for phases
  In performance measurement
  Bridge semantic gap in parallel performance mapping?
    tracing has long demonstrated the benefits! (Heath, ’91)
    phase-based analysis and interpretation
Main contribution
  Support for phases in parallel performance profiling


Problem Description

Performance measured as a consequence of events
  Events represent actions that occur during execution
  Events of interest determine performance information
  Events have semantics and context (pragmatics)
Semantics
  Defines what the event represents
  Example: subroutine entry
Context
  Properties of the state in which event occurred
  Example: subroutine’s calling parent
Interrogate context to map event performance data


Motivating Example – Multi-Physics Application

Assembly of physical objects
  Different shapes
  Different materials
Calculate physics
  Heat transfer
  Mechanical stress
  Within / between objects
  Iterate to error tolerance
How is performance attributed?
  Between events (e.g., routines) and execution components
  With respect to computational objects (e.g., data objects)

[Figure: events heat(), stress(), MPIrecv(), MPIsend(), other routines]


Context and Standard Profiling

Flat profiles
  Context is whole program (i.e., program code)
  Performance distribution across (static) program structure
  Cannot differentiate dynamics (e.g., callpath or objects)
Callgraph / callpath profiles
  Identify parent-child calling relationships at execution
  Context is calling (event) parent / calling (event) path
  Extend event semantics to encode context
    create new event with callpath name
    requires dynamic event creation for complex callpaths
    burdens event mechanisms for context identification
    simple performance associations require many events


Context and Phase Profiling

View the program execution as collection of phases
  Transition between phases (sequenced, nested)
    easiest to think of as phase hierarchy (or phase graph)
  Phases are not events
    phase boundaries can mark entry/exit events
Context is the current phase
  How do we know what phase we are in?
  Phases are identified separately from events
    phases are not encoded in event names
    event mechanisms are not overloaded
A phase profile is event performance attributed to phases
  Phase-specific performance profiles (flat or callpath)


Approach (Flat Profile)

Create a profile object for each entry/exit event
  Each profile object has a name
  Static profile object (static event)
    event has a single instance (single name)
  Dynamic profile object (dynamic event)
    event can have multiple instances (created dynamically)
Inclusive and exclusive performance statistics
  Must maintain an event stack (or callstack)
Contexts are generally thought of as code locations
  Dynamic events do allow for dynamic context awareness
    User code can check “state” and create new events
  BUT only see one level of event!
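
The flat-profile bookkeeping on this slide can be sketched in a few lines of Python. This is a toy model, not TAU's implementation: the class name `FlatProfiler`, the explicit `now` timestamps, and the demo event sequence are assumptions made for illustration. It keeps one profile object per event name and uses an event stack to separate inclusive from exclusive time.

```python
import collections


class FlatProfiler:
    """Toy flat profiler: one profile object per event name (hypothetical)."""

    def __init__(self):
        # event name -> accumulated statistics
        self.profiles = collections.defaultdict(
            lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
        # event stack; each entry is [name, start_time, time_spent_in_children]
        self.stack = []

    def enter(self, name, now):
        self.stack.append([name, now, 0.0])

    def exit(self, name, now):
        top, start, child_time = self.stack.pop()
        assert top == name, "entry/exit events must nest properly"
        elapsed = now - start
        profile = self.profiles[name]
        profile["calls"] += 1
        profile["inclusive"] += elapsed
        profile["exclusive"] += elapsed - child_time
        if self.stack:
            # charge this event's elapsed time to its parent's child time
            self.stack[-1][2] += elapsed


def demo_flat():
    # Timestamps are supplied by hand so the example is deterministic.
    prof = FlatProfiler()
    prof.enter("main", 0.0)
    prof.enter("heat", 1.0)
    prof.enter("MPI_Send", 2.0)
    prof.exit("MPI_Send", 3.0)
    prof.exit("heat", 5.0)
    prof.enter("stress", 5.0)
    prof.enter("MPI_Send", 6.0)
    prof.exit("MPI_Send", 8.0)
    prof.exit("stress", 9.0)
    prof.exit("main", 10.0)
    return dict(prof.profiles)
```

In the demo, `MPI_Send` ends up with a single profile entry covering both calls, so the flat profile cannot say how much of that time belongs to `heat` and how much to `stress`. That is the "only see one level of event" limitation noted above.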


Approach (Callpath Profile)

Show event calling (nesting) relationships
Create a profile object for each event calling context
  Each profile object has a name that encodes the callpath
  Static profile object
    callpath has a single instance (single name)
  Dynamic profile object
    callpath can have multiple instances (created dynamically)
Reuse event mechanisms
  Interrogate the event stack to form event names
    “main => f1 => f2 => MPI_Send”
Inclusive and exclusive performance statistics
Callpath length and callgraph depth options
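
A minimal Python sketch of how a callpath name might be formed by interrogating the event stack, including a depth limit. The names `callpath_name` and `CallpathProfiler`, the `depth` parameter, and the hand-written timestamps are hypothetical; TAU's own data structures are not shown on the slide.

```python
def callpath_name(stack, depth=None):
    """Join the innermost `depth` event names on the stack into one name.

    depth=None keeps the full path; otherwise depth must be >= 1.
    """
    if depth is not None and depth < 1:
        raise ValueError("depth must be None or a positive integer")
    names = stack if depth is None else stack[-depth:]
    return " => ".join(names)


class CallpathProfiler:
    """Toy callpath profiler keyed by the encoded callpath name."""

    def __init__(self, depth=None):
        self.depth = depth
        self.stack = []      # names of currently active events
        self.starts = []     # entry timestamps, parallel to self.stack
        self.inclusive = {}  # callpath name -> accumulated inclusive time

    def enter(self, name, now):
        self.stack.append(name)
        self.starts.append(now)

    def exit(self, now):
        # Build the key while the exiting event is still on the stack.
        key = callpath_name(self.stack, self.depth)
        elapsed = now - self.starts.pop()
        self.stack.pop()
        self.inclusive[key] = self.inclusive.get(key, 0.0) + elapsed


def demo_callpath(depth=None):
    prof = CallpathProfiler(depth)
    for name, t in [("main", 0.0), ("f1", 1.0), ("f2", 2.0), ("MPI_Send", 3.0)]:
        prof.enter(name, t)
    for t in [4.0, 5.0, 6.0, 7.0]:
        prof.exit(t)
    return prof.inclusive
```

With the full path, every distinct calling context becomes its own profile entry, which is why simple associations can require many events. Calling `demo_callpath(depth=2)` limits each name to parent and child, in the spirit of a 2-level callgraph profile.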


Approach (Phase Profile)

A phase is an execution abstraction
Two questions
  How to inform the measurement systems about phases?
  How to collect the performance data?
Create a phase object when new phase is created
  Each phase object has a name
  Static and dynamic phase objects
Phase relationships
  Phases may be nested (cannot overlap)
  “Active” phase object follows scoping rules
  Default (top-level) phase is outermost event (e.g., main)
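
The nesting and scoping rules for phases can be illustrated with a small Python sketch. `PhaseTracker` and its methods are hypothetical names, and the context-manager interface is an assumption chosen because it makes overlapping phases impossible to express; it is not the TAU API.

```python
from contextlib import contextmanager


class PhaseTracker:
    """Toy model of the active phase: a stack that follows scoping rules."""

    def __init__(self, default="main"):
        # The default (top-level) phase is the outermost one.
        self.active = [default]

    @property
    def current(self):
        return self.active[-1]

    @contextmanager
    def phase(self, name):
        self.active.append(name)
        try:
            yield
        finally:
            popped = self.active.pop()
            assert popped == name, "phases may nest but must not overlap"


def demo_phases():
    tracker = PhaseTracker()
    seen = [tracker.current]
    with tracker.phase("iterate"):
        seen.append(tracker.current)
        with tracker.phase("heat"):
            seen.append(tracker.current)
        seen.append(tracker.current)
    seen.append(tracker.current)
    return seen
```

In this sketch `demo_phases()` visits the phases `main`, `iterate`, `heat`, `iterate`, `main` in that order: leaving an inner phase restores the enclosing one as the active phase.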


Approach (Phase Profile - API)

Phase creation
  TAU_PHASE_CREATE_STATIC(var, name, type, group)
  TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
  TAU_GLOBAL_PHASE(var, name, type, group)
  TAU_GLOBAL_PHASE_EXTERNAL(var)
  Global phases have global scope (accessible anywhere)
  External declarations for phases defined outside file scope
Phase control
  TAU_PHASE_START(var)
  TAU_PHASE_STOP(var)
  TAU_GLOBAL_PHASE_START(var)
  TAU_GLOBAL_PHASE_STOP(var)
Collects a callgraph profile (depth 2) PER PHASE!
Phases default to standard events (when phase profiling is disabled)


Approach (Phase Profile - Data Collection)

Leverages performance mapping and callpath profiling
Phase entry
  Phase object pushed to measurement (event) callstack
Phase / event entry
  Need to determine (event, phase) tuple
    traverse callstack to find enclosing phase
    construct key for (event, phase) tuple
  Maintain global map
    new keys for new (event, phase) tuples put into global map
    create new profile object for every (event, phase) tuple
    search global map to determine if tuple occurred before
Use mapping support to store performance data on exit
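
The data-collection steps above can be sketched as follows. This is an illustrative Python model under stated assumptions, not TAU's code: `PhaseProfiler`, the `is_phase` flag, the dictionary used as the global map, and the explicit timestamps are all hypothetical. On every entry it walks the callstack to find the enclosing phase, builds the (event, phase) key, and creates a profile object only the first time that key is seen.

```python
class PhaseProfiler:
    """Toy phase profiler: performance data keyed by (event, phase)."""

    def __init__(self, default_phase="main"):
        self.default_phase = default_phase
        self.callstack = []  # entries: (name, is_phase, start_time, key)
        self.profiles = {}   # global map: (event, phase) -> profile object

    def _enclosing_phase(self):
        # Traverse the callstack from the top to find the nearest phase.
        for name, is_phase, _start, _key in reversed(self.callstack):
            if is_phase:
                return name
        return self.default_phase

    def enter(self, name, now, is_phase=False):
        key = (name, self._enclosing_phase())
        if key not in self.profiles:  # has this tuple occurred before?
            self.profiles[key] = {"calls": 0, "inclusive": 0.0}
        # Phases are pushed onto the same callstack as ordinary events.
        self.callstack.append((name, is_phase, now, key))

    def exit(self, now):
        _name, _is_phase, start, key = self.callstack.pop()
        profile = self.profiles[key]
        profile["calls"] += 1
        profile["inclusive"] += now - start


def demo_multiphysics():
    prof = PhaseProfiler()
    prof.enter("iterate", 0.0, is_phase=True)
    prof.enter("heat", 1.0, is_phase=True)
    prof.enter("MPI_Send", 2.0)
    prof.exit(3.0)   # MPI_Send inside heat
    prof.exit(4.0)   # heat
    prof.enter("stress", 4.0, is_phase=True)
    prof.enter("MPI_Send", 5.0)
    prof.exit(7.0)   # MPI_Send inside stress
    prof.exit(8.0)   # stress
    prof.exit(9.0)   # iterate
    return prof.profiles
```

Unlike the flat profile, the two `MPI_Send` calls land in separate entries, `("MPI_Send", "heat")` and `("MPI_Send", "stress")`, without encoding the phase in the event name.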


Multi-Physics Example

[Figure: instrumentation of the multi-physics code. Events: heat(), stress(), MPIrecv(), MPIsend(), other routines. Phases: iterate phase, heat phase, stress phase. Annotation: only two events!]


Implementation

Parallel profiling in the TAU performance system
  Flat profiling
  Callpath and callgraph (2-level callpath) profiling
  Phase profiling
Multiple performance metrics
  Execution time
  Hardware performance counters (using PAPI)
Scalable to tens of thousands of processors
Profile analysis and data management tools
  ParaProf parallel profile analyzer / visualizer
  PerfDMF parallel profile database


Application – NAS Parallel Benchmarks

Phase profiling can provide more refined profile results
  Specific to phase localities
Defining phases is an application-specific issue
  Apply understanding of computational models
Unfortunately, we were not the application developers
  How to decide on phases and phase instrumentation?
  Informed by application documentation and code
Look at NAS parallel benchmark application suite
  Identify benchmarks with phase behavior
  SP, BT, LU (simulated CFD codes) and CG
  Focus on BT


NAS BT – Phase Analysis

Emulates a CFD application
  System of linear equations
  Implicit finite-difference discretization of Navier-Stokes
  Solve three sets of uncoupled systems of equations
    in X, Y, Z directions
  Block tridiagonal with 5x5 blocks
  Square number of processors
Phase analysis
  Highlight performance for each solution direction
  Identified in code by three main functions
    x_solve, y_solve, z_solve
  Static phases


NAS BT – Instrumentation

! X-direction solve as a static phase
call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
call TAU_PHASE_START(xsolvephase)
call x_solve
call TAU_PHASE_STOP(xsolvephase)

! Y-direction solve as a static phase
call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
call TAU_PHASE_START(ysolvephase)
call y_solve
call TAU_PHASE_STOP(ysolvephase)

! Z-direction solve as a static phase
call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
call TAU_PHASE_START(zsolvephase)
call z_solve
call TAU_PHASE_STOP(zsolvephase)


NAS BT – Flat Profile

How is MPI_Wait() distributed relative to solver direction?

Application routine names reflect phase semantics


NAS BT – Phase Profile (Main and X, Y, Z)

Main phase shows nested phases and immediate events


Application – MFIX

Multiphase Flow with Interphase eXchanges (MFIX)
  National Energy Technology Laboratory (NETL)
  Study physical/chemical properties in fluid-solid systems
    hydrodynamics, heat transfer, chemical reactions
  Characteristic of large-scale iterative simulations
    major loop executed as simulation advances in time
Test case
  Models ozone decomposition in a bubbling fluidized bed
  Flat profile
  Iterate phase profile
  Demonstrate dynamic phases


MFIX – Phase Instrumentation (ITERATE)

SUBROUTINE ITERATE(IER, NIT)
  character(11) taucharary
  integer tauiteration / 0 /
  integer profiler(2) / 0, 0 /
  save profiler, tauiteration

  ! Build a per-iteration phase name such as 'ITERATE   0'
  write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
  tauiteration = tauiteration + 1
  call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
  call TAU_PHASE_START(profiler)

  ! WORK

  call TAU_PHASE_STOP(profiler)
END SUBROUTINE ITERATE


MFIX – Phase Profile (MPI_Waitall)

In 51st iteration, time spent in MPI_Waitall was 85.81 secs

Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations

dynamic phases, one per iteration
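
To put the two figures from this slide side by side, the short Python calculation below derives the mean MPI_Waitall time per iteration and compares iteration 51 against it. Only the three input numbers come from the slide; the variable names are illustrative, and the slide does not say whether the times are per process or aggregated, so the comparison is indicative only.

```python
# Values quoted on the slide
total_waitall_secs = 4137.9  # MPI_Waitall time across all iterations
iterations = 92
iter51_secs = 85.81          # MPI_Waitall time in the 51st iteration

# Derived quantities (not stated on the slide)
mean_secs = total_waitall_secs / iterations
ratio_to_mean = iter51_secs / mean_secs
share_of_total = iter51_secs / total_waitall_secs

print(f"mean per iteration: {mean_secs:.2f} s")
print(f"iteration 51 vs mean: {ratio_to_mean:.2f}x")
print(f"iteration 51 share of total: {share_of_total:.1%}")
```

The mean works out to roughly 45 seconds per iteration, so iteration 51 spends about 1.9 times the average in MPI_Waitall. A flat profile would report only the 4137.9 second total; the per-iteration dynamic phases are what expose this variation.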


MFIX Iterate Phase Behavior


Concluding Discussion and Future Work

Phase-based profiling can help to bridge semantic gap
  Between computational models and performance measurements
  Application-specific performance analysis
Implemented phase profiling in TAU
Demonstrated phase profiling
  NAS BT benchmark and MFIX application
  Also used in S3D, Uintah, Flash on large-scale platforms
Requires application-specific knowledge
Might be possible to link to auto phase identification
  Based on memory tracing or application state change
Can this idea be extended to global parallel phases?
Working on better ways to present phase performance


Support Acknowledgements

Department of Energy (DOE)
  Office of Science contracts
  University of Utah ASCI Level 1 sub-contract
  ASC/NNSA Level 3 contract
Department of Defense (DoD)
  HPC Modernization Office (HPCMO)
  Programming Environment and Training (PET)
NSF
Research Centre Juelich
Los Alamos National Laboratory
www.cs.uoregon.edu/research/paracomp/tau
