

For each job in the workflow, we perform the following operations:
• Produce one event log for each job step using Score-P
• Generate one JSON profile for each event log within the same allocation, which reduces processing time in the analysis phase
• Query the job scheduling information and create the job log (JSON) using SLURM tools
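The job log step above could be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes the scheduler is queried with `sacct`, and the field selection, the function names, and the sample output are hypothetical.

```python
import json

# Fields requested from sacct; this selection is an assumption of the sketch.
FIELDS = ["JobID", "JobName", "Start", "End", "Elapsed", "State"]


def sacct_command(job_id):
    """Build a sacct call; --parsable2 yields '|'-separated rows."""
    return [
        "sacct",
        "-j", str(job_id),
        "--format=" + ",".join(FIELDS),
        "--parsable2",
        "--noheader",
    ]


def parse_sacct(text):
    """Turn parsable2 output into one dict per job or job step."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        records.append(dict(zip(FIELDS, line.split("|"))))
    return records


# Made-up sample output so the sketch runs without a SLURM installation.
SAMPLE = (
    "4711|equilibration|2020-01-01T10:00:00|2020-01-01T10:02:00|00:02:00|COMPLETED\n"
    "4711.0|gmx_mpi|2020-01-01T10:00:01|2020-01-01T10:00:40|00:00:39|COMPLETED\n"
)

job_log = parse_sacct(SAMPLE)
print(json.dumps(job_log, indent=2))
```

On a real system, the sample text would be replaced by the standard output of `subprocess.run(sacct_command(job_id), capture_output=True, text=True, check=True)`.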

The JSON profiles contain the time spent in:
• Computation
• Inter-process communication (MPI)
• Inter-thread communication (OpenMP, pthreads)
• I/O activities
• GPU kernels

This data allows users to better understand their workflow and how to optimize its end-to-end execution time. We have developed a tool that translates OTF2 traces generated by Score-P into these JSON profiles.
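To make the role of the profiles concrete, the sketch below sums per-step profiles into a per-job summary. The poster does not show the profile schema, so the key names and the flat "category to seconds" layout are assumptions, and the numbers are illustrative rather than measured.

```python
import json

# Hypothetical category keys; the real profile schema is not given in the poster.
CATEGORIES = ["computation", "mpi", "openmp_pthreads", "io", "gpu_kernels"]


def load_profile(text):
    """Parse one step profile, treating missing categories as 0 seconds."""
    raw = json.loads(text)
    return {cat: float(raw.get(cat, 0.0)) for cat in CATEGORIES}


def summarize_job(step_profiles):
    """Sum the per-category times over all steps of one job."""
    total = {cat: 0.0 for cat in CATEGORIES}
    for profile in step_profiles:
        for cat in CATEGORIES:
            total[cat] += profile[cat]
    return total


# Illustrative values only.
step_a = load_profile('{"computation": 10.0, "mpi": 2.5, "io": 0.5}')
step_b = load_profile('{"computation": 4.0, "mpi": 6.0}')

job_summary = summarize_job([step_a, step_b])
print(job_summary)
```

Keeping one small profile per step means a job or workflow summary only needs to read a few JSON files instead of the full event logs.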

Center for Information Services and High Performance Computing
Department of Interdisciplinary Application Development and Coordination
Christian Herold and Bill Williams
Tel. +49 351 - 463 - 38000
Falkenbrunnen Room 014
Chemnitzer Straße 50, 01187 Dresden, Germany

Top-Down Performance Analysis of HPC Workflows

An HPC workflow is a coordinated sequence of interdependent applications. Workflows can be modeled as jobs composed of steps, where each job represents a single submission to the scheduling system and each step executes a single application (see Figure 1). Jobs may depend on each other, so an inefficiency in one step can delay every job that depends on the job containing it and increase the runtime of the whole workflow. Finding the bottleneck of a complex workflow is difficult without tool support, and optimizing the step or application responsible requires details of its runtime behavior. A top-down approach is therefore needed to narrow the performance data from a global view (the whole workflow) to a detailed view (the application level).
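The effect of job dependencies on end-to-end time can be illustrated with a small sketch. The workflow, job names, and runtimes below are hypothetical, and the model ignores queue waiting time: each job starts as soon as all jobs it depends on have finished.

```python
from functools import lru_cache


def finish_times(workflow):
    """Return the earliest finish time of each job in a dependency graph.

    workflow maps a job name to (runtime in seconds, list of prerequisite jobs).
    """
    @lru_cache(maxsize=None)
    def finish(job):
        runtime, deps = workflow[job]
        # A job starts once its slowest prerequisite has finished.
        start = max((finish(dep) for dep in deps), default=0.0)
        return start + runtime

    return {job: finish(job) for job in workflow}


# Hypothetical workflow: B and C both depend on A, D depends on B and C.
WORKFLOW = {
    "A": (30.0, []),
    "B": (50.0, ["A"]),
    "C": (20.0, ["A"]),
    "D": (10.0, ["B", "C"]),
}

end_to_end = max(finish_times(WORKFLOW).values())

# Slowing job B by 10 seconds delays the whole workflow ...
slow_b = max(finish_times(dict(WORKFLOW, B=(60.0, ["A"]))).values())
# ... while slowing job C by 10 seconds does not, because C is not on the longest path.
slow_c = max(finish_times(dict(WORKFLOW, C=(30.0, ["A"]))).values())

print(end_to_end, slow_b, slow_c)
```

Only inefficiencies on the longest dependency chain lengthen the workflow in this model, which is why a workflow-level overview is a useful first step before looking at individual applications.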

GROMACS is an open-source molecular dynamics package, mostly used to simulate biomolecules. We instrumented the "Lysozyme in Water" example from the GROMACS tutorial page with Score-P. We built six jobs in one pipeline, skipping the final analysis step, and profiled the entire workflow, including job information from the scheduler.

A top-down approach provides performance summaries for each level of a single workflow:
1. Present an overview of each job inside a workflow
   Identify inefficient jobs
2. Present an overview of each step inside a job
   Identify the job step causing the inefficiency
3. Analyse a job step in detail with Vampir
   Find the inefficiency in the program
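The first two levels of the list above can be sketched as a simple drill-down over the summaries. The nested layout, the job names, and the values are hypothetical, and "longest total time" is only one possible selection criterion; the third level, detailed analysis in Vampir, is interactive and not modeled here.

```python
# Hypothetical summaries: job -> step number -> category -> seconds.
# Values are illustrative, not measurements from the poster.
WORKFLOW = {
    "minimize": {
        0: {"computation": 8.0, "mpi": 1.0},
    },
    "equilibrate": {
        0: {"computation": 5.0, "mpi": 1.0},
        1: {"computation": 6.0, "mpi": 20.0},
    },
}


def step_time(profile):
    """Total time of one job step over all categories."""
    return sum(profile.values())


def job_time(steps):
    """Total time of one job, simplified to the sum of its step times."""
    return sum(step_time(profile) for profile in steps.values())


def drill_down(workflow):
    """Level 1: pick the longest job. Level 2: pick its longest step."""
    job = max(workflow, key=lambda name: job_time(workflow[name]))
    step = max(workflow[job], key=lambda num: step_time(workflow[job][num]))
    return job, step


candidate_job, candidate_step = drill_down(WORKFLOW)
print(candidate_job, candidate_step)
```

The selected job step is then the one to open in a trace viewer for the detailed analysis.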

Our next step is to implement a visualizer that reads our profiles and job summaries and automatically produces the charts shown here. We also intend to refine our data collection infrastructure as needed.

“Dresdner Elbflorenz bei Dämmerung” by MalteF, used under CC-BY-SA-3.0-DE / Cropped from original

This research was undertaken as part of the NEXTGenIO project, which is funded through the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951.

To record the runtime behavior at the job step level, we used Score-P to instrument the application being executed. Figure 2 depicts the measurement of the workflow for one job, with the required components on the left side and the output on the right side.

Figure 3 depicts the topology of the GROMACS workflow:
• First, GROMACS runs three mostly serial jobs during its setup phase
• Then, three parallel jobs perform the bulk of the simulation work

In Figure 4, we show that the equilibration job in this workflow has the longest runtime and the largest MPI overhead of the six jobs. This makes it a promising candidate for optimization, so we investigate its component steps.

Figure 5 shows that steps 1 and 3 in this job have the longest runtimes. Step 3 is compute-bound and could possibly benefit from more cores; step 1 is MPI-bound and may not be well configured. We look at step 1 in more detail below to identify possible problems.
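A distinction such as "compute-bound" versus "MPI-bound" can be derived from a step profile by looking at the dominant category. The sketch below is one simple way to do that; the 50% threshold, the category keys, and the values are assumptions for illustration and are not taken from the poster.

```python
def classify(profile, threshold=0.5):
    """Label a step by its dominant time category.

    profile maps a category name to seconds. The step is called
    '<category>-bound' if one category takes at least `threshold`
    of the total time, and 'mixed' otherwise.
    """
    total = sum(profile.values())
    if total == 0:
        return "empty"
    category, seconds = max(profile.items(), key=lambda item: item[1])
    if seconds / total >= threshold:
        return category + "-bound"
    return "mixed"


# Illustrative profiles only.
print(classify({"computation": 40.0, "mpi": 5.0}))
print(classify({"computation": 5.0, "mpi": 30.0}))
print(classify({"computation": 1.0, "mpi": 1.0, "io": 1.0}))
```

A label like this only points at where to look; whether more cores or fewer ranks would help still has to be checked in the detailed trace.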

In Figure 6, we see that step 1 has significant MPI startup overhead relative to its amount of computation. It is likely that this job step would benefit from a lower degree of parallelism.

Poster sections: Motivation, Top-Down Approach, Methodology, Evaluation, Future Work, Acknowledgement

Figure 1: An example of a workflow and the provided performance summaries. [Diagram: a workflow with Job A (job steps A.1, A.2) and Job B (job steps B.1, B.2); three numbered levels labeled Workflow Summary, Job Summary, and Detailed Performance Analysis]

Figure 2: Required components for workflow measurement. [Diagram labels: Workflow, Job, Job step, SLURM, Score-P, Query job log, Job log, Event log, Profile, Output]

Figure 3: Job topology of the GROMACS example Lysozyme in Water. [Jobs: Generate protein, Select + Solvate, Add Ions, Minimize energy, Equilibration, MD production]

Figure 4: Workflow Overview by Job. [Bar chart: Time (sec), 0 to 150, by Job Name; categories: Computation, MPI, OpenMP, ISO C I/O, POSIX I/O]

Figure 5: Details of Equilibration Job Steps. [Bar chart: Time (sec), 0 to 50, by Job Step Number, 0 to 4; categories: Computation, MPI, OpenMP, ISO C I/O, POSIX I/O]

Figure 6: Analysis of Equilibration Step 1 with Vampir.