
Page 1: Understanding applications using the BSC performance tools

www.bsc.es

Understanding applications using the

BSC performance tools

Judit Gimenez ([email protected]), Harald Servat ([email protected])

HPCKP’15 – February 3rd, 2015

Page 2: Understanding applications using the BSC performance tools

Humans are visual creatures

Films or books? PROCESS

– Two hours vs. days (months)

Memorizing a deck of playing cards STORE

– Each card is translated to an image (person, action, location)

Our brain loves pattern recognition IDENTIFY

– What do you see in the pictures?

Page 3: Understanding applications using the BSC performance tools

Our Tools

Since 1991

Based on traces

Open Source – http://www.bsc.es/paraver

Core tools: – Paraver (paramedir) – offline trace analysis

– Dimemas – message passing simulator

– Extrae – instrumentation

Focus – Detail, variability, flexibility

– Behavioral structure vs. syntactic structure

– Intelligence: Performance Analytics

Page 4: Understanding applications using the BSC performance tools


Paraver

Page 5: Understanding applications using the BSC performance tools

Paraver – Performance data browser

Timelines

Raw data

2/3D tables

(Statistics)

Goal = Flexibility

No semantics

Programmable

Comparative analyses

Multiple traces

Synchronize scales

+ trace manipulation

Trace visualization/analysis

Page 6: Understanding applications using the BSC performance tools

Timelines

Each window displays one view

– Piecewise constant function of time

Types of functions

– Categorical

• State, user function, outlined routine

– Logical

• In specific user function, In MPI call, In long MPI call

– Numerical

• IPC, L2 miss ratio, duration of MPI call, duration of computation burst

s(t) = S_i for t_i ≤ t < t_{i+1}, i = 0, …, n

– Categorical views: S_i ∈ ℕ

– Logical views: S_i ∈ {0, 1}

– Numerical views: S_i ∈ ℝ
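The piecewise-constant semantics above can be sketched in code (a minimal illustration with a hypothetical record layout, not the actual Paraver trace format):

```python
import bisect

def timeline_value(times, values, t):
    """Evaluate a piecewise-constant timeline s(t) = S_i for t_i <= t < t_{i+1}.

    times  -- sorted change points t_0 < t_1 < ...
    values -- value S_i that holds from times[i] until times[i+1]
    """
    i = bisect.bisect_right(times, t) - 1
    if i < 0:
        raise ValueError("t precedes the first record")
    return values[i]

# A categorical view: the state of one process over time
times = [0.0, 1.5, 2.0, 3.7]          # change points (seconds)
states = ["Running", "MPI_Send", "Running", "MPI_Recv"]
print(timeline_value(times, states, 1.7))  # -> "MPI_Send"
```

The same evaluation scheme covers logical views (values in {0, 1}) and numerical views (floats such as IPC) without any change to the lookup.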

Page 7: Understanding applications using the BSC performance tools

Tables: Profiles, histograms, correlations

From timelines to tables

[Screenshots: MPI calls profile, Useful Duration timeline, Useful Duration histogram, MPI calls timeline]
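Going from a timeline to a table amounts to bucketing burst records per process; a minimal sketch (hypothetical record format, not paramedir's actual machinery):

```python
from collections import Counter

def histogram_2d(records, bin_width):
    """Build a (process, duration-bin) table from computation-burst records.

    records -- iterable of (process_id, duration) pairs (hypothetical format)
    Returns a Counter mapping (process_id, bin_index) -> burst count,
    i.e. rows = processes, columns = duration bins, cells = counts.
    """
    table = Counter()
    for proc, dur in records:
        table[(proc, int(dur // bin_width))] += 1
    return table

bursts = [(0, 0.12), (0, 0.55), (1, 0.49), (1, 0.51)]
print(histogram_2d(bursts, 0.5))
```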

Page 8: Understanding applications using the BSC performance tools

Analyzing variability through histograms and timelines

[Timelines and histograms: Useful Duration, Instructions, IPC, L2 miss ratio]

Page 9: Understanding applications using the BSC performance tools

By the way: six months later ….

[The same views six months later: Useful Duration, Instructions, IPC, L2 miss ratio — variability analyzed through histograms and timelines]

Page 10: Understanding applications using the BSC performance tools

From tables to timelines

Where in the timeline do the values of certain table columns appear?

E.g., do you want to see the time distribution of a given routine?

Page 11: Understanding applications using the BSC performance tools

Variability … is everywhere

CESM: 16 processes, 2 simulated days

Histogram of useful computation duration shows high variability

How is it distributed?

Dynamic imbalance

– In space and time

– Day and night

– Season?

Page 12: Understanding applications using the BSC performance tools

Data handling/summarization capability

– Filtering • Subset of records in original trace

• By duration, type, value,…

• The filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data are kept)

– Cutting • All records in a given time interval

• Only some processes

– Software counters • Summarized values computed from those in the original trace, emitted as new event types

• #MPI calls, total hardware count, …
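A duration filter of this kind can be sketched as follows (a hypothetical, simplified record layout — Extrae/Paraver traces use their own formats):

```python
def filter_records(records, min_duration):
    """Keep only records at least min_duration long
    (a Paraver-style duration filter over a simplified record layout)."""
    return [r for r in records if r["end"] - r["begin"] >= min_duration]

trace = [
    {"type": "state", "begin": 0.0, "end": 0.4},
    {"type": "state", "begin": 0.4, "end": 0.401},  # tiny burst, dropped
]
print(len(filter_records(trace, 0.01)))  # -> 1
```

Because the output keeps the original record structure, the same configurations (cfgs) still apply to the filtered trace, which is the point made above.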

Trace manipulation example — WRF-NMM, Peninsula 4 km, 128 procs:

– Original trace (MPI, HWC; 570 s): 2.2 GB

– Filtered trace (570 s): 5 MB

– Cut (4.6 s): 36.5 MB

Page 13: Understanding applications using the BSC performance tools


Dimemas

Page 14: Understanding applications using the BSC performance tools

Dimemas: Coarse grain, Trace driven simulation

Simulation: a highly non-linear model

– Linear components

• Point to point communication

• Sequential processor performance

– Global CPU speed

– Per block/subroutine

– Non-linear components

• Synchronization semantics

– Blocking receives

– Rendezvous

– Eager limit

• Resource contention

– CPU

– Communication subsystem

» links (half/full duplex), busses

Point-to-point transfer time: T = L + MessageSize / BW

[Diagram: Dimemas machine model — nodes of CPUs with local memory, connected through links (L, half/full duplex) and busses (B)]
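The linear point-to-point component can be illustrated directly (function name and values are illustrative, not part of Dimemas):

```python
def ptp_time(msg_bytes, latency, bandwidth):
    """Linear point-to-point model: T = L + MessageSize / BW."""
    return latency + msg_bytes / bandwidth

# 1 MB message, L = 25 us, BW = 100 MB/s -> roughly 10 ms,
# dominated by the bandwidth term
t = ptp_time(1e6, 25e-6, 100e6)
print(f"{t * 1e3:.3f} ms")
```

The non-linear behavior (rendezvous, eager limit, contention) comes from how Dimemas schedules these transfers against shared resources, not from this formula.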

Page 15: Understanding applications using the BSC performance tools

Dimemas: Type of analysis

Parametric sweeps

– On abstract architectures

– On application computational regions

What if analysis

– Ideal machine (instantaneous network)

– Estimating impact of ports to MPI+OpenMP/CUDA/…

– Should I use asynchronous communications?

– Are all parts of an app. equally sensitive to network?

MPI sanity check

– Modeling nominal

Detailed feedback on simulation (trace)

[Plot: Impact of BW (L = 8; B = 0) — efficiency (0.00–1.20) vs. bandwidth (1–1024 MB/s) for NMM and ARW at 128, 256 and 512 processes; 4 ms annotation]

Page 16: Understanding applications using the BSC performance tools

[Plot: predicted speedup (0–20) vs. bandwidth (64–16384 MB/s) and CPUratio (1–64)]

Paraver & Dimemas: Ecosystem and Integration

Paraver – Dimemas tandem

– Analysis and prediction

– What-if from selected time window

Analytics

– Clustering to model speeding or

balancing regions

Scalability model

– Separate transfer and serialization

efficiencies

Coupled simulation

– Multiscale simulation

– Combine with detailed network / CPU

simulators

[Plot: GROMACS v4.5 — performance factors (Parallel_Eff, LB, uLB, Transfer) vs. processors (0–300); parallel efficiency 93.67%]

González J., Giménez J., Casas M., Moretó M., Ramirez A., Labarta J., Valero M. Simulating Whole Supercomputer Applications. IEEE Micro. 2011;31:32–45.

Page 17: Understanding applications using the BSC performance tools

Network sensitivity

MPIRE 32 tasks, no network contention

PATC Tools Training, Barcelona, May 12–13, 2014

L = 25µs – BW = 100MB/s L = 1000µs – BW = 100MB/s

L = 25µs – BW = 10MB/s

All windows same scale

Page 18: Understanding applications using the BSC performance tools

Ideal machine

The impossible machine: BW = ∞, L = 0

Actually describes/characterizes intrinsic application behavior

– Load balance problems?

– Dependence problems?

[Timelines: GADGET @ Nehalem cluster, 256 processes — real run vs. ideal network; dominant MPI calls: waitall, sendrecv, alltoall, allgather + sendrecv, allreduce]

Impact on practical machines?

Page 19: Understanding applications using the BSC performance tools

Impact of architectural parameters

Ideal speedup from accelerating ALL the computation bursts by the CPUratio factor

– The more processes, the lower the speedup (higher impact of bandwidth limitations)
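The saturation trend can be reproduced with a toy closed-form model (illustrative only — Dimemas replays the trace with full synchronization semantics rather than using this formula):

```python
def predicted_speedup(t_comp, t_comm, cpu_ratio):
    """What-if: computation bursts run cpu_ratio times faster while the
    network-bound communication time stays fixed (Amdahl-style split)."""
    t_orig = t_comp + t_comm
    t_new = t_comp / cpu_ratio + t_comm
    return t_orig / t_new

# A larger communication share (e.g. from running on more processes)
# caps the achievable speedup for the same CPUratio
for t_comm in (10.0, 40.0):            # seconds spent communicating
    print(predicted_speedup(90.0, t_comm, 16.0))
```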

[Plots: GADGET — predicted speedup (0–140) vs. bandwidth (MB/s) and CPUratio, for 64, 128 and 256 processes]

Page 20: Understanding applications using the BSC performance tools


Models

Page 21: Understanding applications using the BSC performance tools

Scaling model

Directly from real execution metrics:

η = LB × CommEff

eff_i = T_i / T    (T_i = computation time of process i; T = elapsed time)

CommEff = max_i(eff_i)

LB = ( Σ_{i=1..P} eff_i ) / ( P · max_i(eff_i) )

Computation time follows from #instr and IPC (T_i ∝ #instr_i / IPC_i)
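The model translates directly into code (a sketch; names are illustrative):

```python
def efficiency_model(comp_times, elapsed):
    """Multiplicative efficiency model:
    eff_i = T_i / T,  CommEff = max(eff_i),
    LB = sum(eff_i) / (P * max(eff_i)),  eta = LB * CommEff."""
    P = len(comp_times)
    eff = [t / elapsed for t in comp_times]
    comm_eff = max(eff)
    lb = sum(eff) / (P * comm_eff)
    return {"LB": lb, "CommEff": comm_eff, "ParallelEff": lb * comm_eff}

# 4 processes, per-process computation times out of a 10 s run
print(efficiency_model([8.0, 8.0, 6.0, 6.0], 10.0))
```

Note that the product LB × CommEff collapses to the average eff_i, which is why parallel efficiency can be read off real execution metrics alone.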

Page 22: Understanding applications using the BSC performance tools

Communication efficiency = µLB * Transfer

Serializations / dependences (µLB)

Measured using Dimemas

– On an ideal network, Transfer (efficiency) = 1

[Diagram: computation/communication timelines of paired MPI_Send/MPI_Recv — even with LB = 1, dependence chains can serialize communication (µLB < 1)]

Page 23: Understanding applications using the BSC performance tools

Scaling model

Dimemas simulation with ideal target

– Latency =0; BW = ∞

CommEff = µLB × Transfer

µLB = max_i(T_i) / T_ideal    (migrating/local load imbalance, serialization)

Transfer = T_ideal / T
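This decomposition can be computed directly from the real and ideal-network elapsed times (a sketch with illustrative names):

```python
def comm_decomposition(comp_times, elapsed, elapsed_ideal):
    """Split communication efficiency into serialization and transfer:
    Transfer = T_ideal / T,  uLB = max(T_i) / T_ideal,
    so CommEff = uLB * Transfer = max(T_i) / T."""
    transfer = elapsed_ideal / elapsed
    ulb = max(comp_times) / elapsed_ideal
    return {"uLB": ulb, "Transfer": transfer, "CommEff": ulb * transfer}

# T = 10 s on the real machine, T_ideal = 9 s on the L = 0, BW = inf machine
print(comm_decomposition([8.0, 8.0, 6.0, 6.0], 10.0, 9.0))
```

With T_ideal close to T, most of the communication loss is serialization (µLB); with T_ideal much smaller than T, it is transfer cost.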

Page 24: Understanding applications using the BSC performance tools

Why scaling?

[Plot: CG-POP mpi2s1D – 180x120 — speedup (0–20) vs. processors (0–400)]

Good scalability!! Should we be happy?

η = LB × Ser × Trf

[Plot: Parallel eff, LB, µLB and Transfer vs. processors (0–400)]

Efficiency = Parallel eff × instr. eff × IPC eff

[Plot: Parallel eff, instr. eff and IPC eff vs. processors (0–400)]

M. Casas et al, “Automatic analysis of speedup of MPI applications”. ICS 2008.

Page 25: Understanding applications using the BSC performance tools


Clustering

Page 26: Understanding applications using the BSC performance tools

Using Clustering to identify structure

IPC

Completed Instructions

Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
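The burst clustering can be illustrated with a toy density-based pass over (instructions, IPC) points (a minimal DBSCAN-style sketch, not the actual BSC clustering tool):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch over 2-D points such as normalized
    (completed instructions, IPC) per computation burst.
    Returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1                  # i is a core point: start a cluster
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] in (None, -1):
                if labels[j] is None:
                    nj = neighbours(j)
                    if len(nj) >= min_pts:        # j is also a core point
                        seeds.extend(k for k in nj if labels[k] is None)
                labels[j] = cluster
    return labels

# Two dense groups of bursts (normalized instructions, IPC) plus one outlier
bursts = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.09),
          (0.80, 0.90), (0.82, 0.88), (0.79, 0.91), (0.50, 0.50)]
print(dbscan(bursts, eps=0.1, min_pts=3))  # -> [0, 0, 0, 1, 1, 1, -1]
```

Density-based clustering is a natural fit here because the number of computation phases is not known in advance and stray bursts should end up as noise rather than distort a cluster.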

Page 27: Understanding applications using the BSC performance tools

Performance @ serial computation bursts

WRF (128 cores): asynchronous SPMD — balanced #instr, variability in IPC

GROMACS: MPMD structure — different coupled imbalance trends

SPECFEM3D: SPMD — repeated substructure, coupled imbalance

Page 28: Understanding applications using the BSC performance tools


Full per region HWC characterization from a single run

Projecting hardware counters based on clustering

Miss ratios Instruction mix Stalls

González J. et al , “Performance Data Extrapolation in Parallel Codes”. ICPADS '10

Page 29: Understanding applications using the BSC performance tools

Integrating models and analytics

What if …

– … we increase the IPC of Cluster 1? → 19% gain

– … we balance Clusters 1 & 2? → 13% gain

PEPC

Page 30: Understanding applications using the BSC performance tools

Tracking: scalability through clustering

OpenMX (strong scaling from 64 to 512 tasks)

[Cluster scatter plots at 64, 128, 192, 256, 384 and 512 tasks]

Page 31: Understanding applications using the BSC performance tools


Folding

Page 32: Understanding applications using the BSC performance tools


Folding: Detailed metrics evolution

• Performance of a sequential region = 2000 MIPS

Is it good enough?

Is it easy to improve?
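The MIPS figure itself is a one-line computation; the question is what to compare it against (values below are illustrative):

```python
def mips(instructions, seconds):
    """Rate of a computation region in MIPS (millions of instructions per second)."""
    return instructions / seconds / 1e6

# 4e9 instructions in 2 s -> 2000 MIPS. Alone this says little: e.g. a
# 2.5 GHz core sustaining IPC = 1 would deliver 2500 MIPS, so folding's
# instantaneous metrics help decide whether 2000 MIPS is near the ceiling.
print(mips(4.0e9, 2.0))  # -> 2000.0
```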

Page 33: Understanding applications using the BSC performance tools


Folding: Instantaneous CPI stack

MRGENESIS

• Trivial fix (loop interchange)

• Easy to locate? Next step?

• Availability of CPI stack models for production processors? Provided by manufacturers?

Page 34: Understanding applications using the BSC performance tools

“Blind” optimization

From folded samples of a few levels to the timeline structure of “relevant” routines

Recommendation without access to source code

Harald Servat et al. “Bio-inspired call-stack reconstruction for performance analysis”. Submitted, 2015.

Page 35: Understanding applications using the BSC performance tools

Unbalanced MPI application (CG-POP multicore MN3 study)

– Same code, different duration, different performance (regions between MPI calls):

• 17.20 M instructions @ ~1000 MIPS

• 24.92 M instructions @ ~1100 MIPS

• 32.53 M instructions @ ~1200 MIPS

Page 36: Understanding applications using the BSC performance tools


Methodology

Page 37: Understanding applications using the BSC performance tools

Performance analysis tools objective

Help validate hypotheses

Help generate hypotheses

Qualitatively

Quantitatively

Page 38: Understanding applications using the BSC performance tools

First steps

Parallel efficiency – percentage of time invested in computation – identify sources of “inefficiency”:

• Load balance

• Communication/synchronization

Serial efficiency – how far from peak performance? – IPC, correlate with other counters

Scalability – code replication? – Total #instructions

Behavioral structure? Variability?

Paraver Tutorial:

Introduction to Paraver and Dimemas methodology

Page 39: Understanding applications using the BSC performance tools

BSC Tools web site

www.bsc.es/paraver

• downloads

– Sources / Binaries

– Linux / Windows / macOS

• documentation

– Training guides

– Tutorial slides

Getting started

– Start wxparaver

– Open Help → Tutorials and follow the instructions

– Follow training guides

• Paraver introduction (MPI): navigation and basic understanding of Paraver operation