
Performance Analysis of GS2 Plasma Turbulence Code
Sally Bridgwater, Nick Dingle, Numerical Algorithms Group, UK
April 2018
Correspondence address: [email protected], www.nag.co.uk

Introduction
GS2 [1] is an open-source gyrokinetic simulation code used to study turbulence in plasma; one application is fusion experiments. It is a gyrokinetic flux tube initial value and eigenvalue solver, written in Fortran and parallelised with MPI.

Performance analysis was carried out under the Performance Optimisation and Productivity Centre of Excellence (POP), using a methodology that narrows down the underlying causes of inefficiency. After an initial analysis, the developers made changes based on the recommendations. The refactored code was then analysed with two input variants; this comparison is presented below. The performance analysis was performed using the BSC tools Extrae and Paraver [2].

Methodology
POP efficiency metrics give an overview of how well the parallelisation of an application works and how efficiently the hardware is used [3]. The metrics are organised in a hierarchy and give a detailed overview of the performance of an application in a very condensed form. Each metric is a percentage: 100% is ideal performance, 80-85% is the cut-off for good performance, and values towards 0% indicate a serious problem. An ideal network is defined as one with instantaneous data transfer.

The hierarchy is as follows (p is the core count, ref the reference value, and r ranges over the MPI ranks):

Global Efficiency (GE): GE = CE * PE
Parallel Efficiency (PE): PE = LB * SE * TE
• Load Balance (LB): how evenly computation time is distributed across the ranks: LB = AVG_r(comp) / MAX_r(comp)
• Serialisation Efficiency (SE): time lost waiting for communication partners to be available, measured on an ideal network: SE = MAX_r(comp) / runtime_ideal
• Transfer Efficiency (TE): inefficiency due to time spent in data transfer over the network: TE = runtime_ideal / runtime
Computational Efficiency (CE): scaling of total computation time with core count relative to the reference: CE = Tcomp_ref / Tcomp_p
• Instruction Efficiency (INS): scaling of the total number of useful instructions: INS = INS_ref / INS_p
• Useful IPC Efficiency (IPC): scaling of useful instructions per cycle: IPC = IPC_p / IPC_ref
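To make the multiplicative hierarchy concrete, here is a minimal sketch (ours, not part of the POP tooling) that recomputes the derived metrics for the "right" split from the measured leaf values reported in the metrics table in the Analysis section below:

```c
#include <stdio.h>

/* Recompute the derived POP metrics for the "right" split from the
 * measured leaf values (as fractions rather than percentages).
 * Values are taken from the metrics table in the Analysis section. */
int main(void) {
    const double LB = 0.788;  /* Load Balance Efficiency  */
    const double SE = 0.973;  /* Serialisation Efficiency */
    const double TE = 0.450;  /* Transfer Efficiency      */
    const double CE = 0.752;  /* Computational Scaling    */

    const double PE = LB * SE * TE;  /* Parallel Efficiency: ~0.345 */
    const double GE = CE * PE;       /* Global Efficiency:   ~0.259 */

    printf("PE = %.1f%% (reported: 34.5%%)\n", 100.0 * PE);
    printf("GE = %.1f%% (reported: 26.0%%)\n", 100.0 * GE);
    return 0;
}
```

The 0.1% discrepancy in GE is rounding in the published percentages. The same arithmetic for the square split gives PE = 80.2% * 96.6% * 65.4% = 50.7%, matching the table.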

Analysis
A single GS2 timestep for this analysis included the following phases:
1) Nonlinear Advance (N)
2) Linear Advance (L)
3) Field Solver (F)
4) Second Linear Advance (L)
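As a purely illustrative outline of this phase structure (the real GS2 timestep is Fortran and far more involved; every name below is a hypothetical stand-in):

```c
/* Illustrative outline only: the real GS2 code is Fortran and far more
 * involved. All types and function names here are hypothetical. */
typedef struct { int unused; } state;  /* placeholder for fields, distribution fn, ... */

static void nonlinear_advance(state *s) { (void)s; }
static void linear_advance(state *s)    { (void)s; }
static void field_solve(state *s)       { (void)s; }

int main(void) {
    state s = {0};
    const int nstep = 100;           /* nstep from the analysed input */
    for (int step = 0; step < nstep; step++) {
        nonlinear_advance(&s);       /* 1) Nonlinear Advance (N) */
        linear_advance(&s);          /* 2) Linear Advance (L) */
        field_solve(&s);             /* 3) Field Solver (F) */
        linear_advance(&s);          /* 4) Second Linear Advance (L) */
    }
    return 0;
}
```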

The analysis was performed on 2304 MPI ranks on the ARCHER UK Supercomputer.

The main input variables are:
(ntheta, ngauss, negrid, nspec, nx, ny, nstep, field, layout) = (26, 5, 8, 2, 24, 24, 100, gf, yxles)

Two versions of data distribution were analysed, varying how five of the dimensions of the gyrokinetic distribution function are split across the MPI ranks:

Dimension        x  y   l  e  s
Right (default)  3  2  24  8  2
Square           3  8   6  8  2
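Both splits factorise the full rank count: 3*2*24*8*2 = 3*8*6*8*2 = 2304. To illustrate what the table means, the toy function below maps a block coordinate in the five split dimensions to an MPI rank. The row-major ordering is our simplification for illustration; GS2's actual layout logic (e.g. 'yxles') orders and interleaves the dimensions differently.

```c
#include <stdio.h>

/* Toy model: map a block coordinate (x,y,l,e,s) to an MPI rank for a
 * given split, row-major over the five dimensions. GS2's real layouts
 * are more sophisticated; this only illustrates the factorisation. */
static int block_to_rank(const int split[5], const int coord[5]) {
    int rank = 0;
    for (int d = 0; d < 5; d++)
        rank = rank * split[d] + coord[d];  /* mixed-radix index */
    return rank;
}

int main(void) {
    const int right[5]  = {3, 2, 24, 8, 2};  /* right (default): 3*2*24*8*2 = 2304 */
    const int square[5] = {3, 8,  6, 8, 2};  /* square:          3*8*6*8*2  = 2304 */
    const int coord[5]  = {1, 1,  2, 3, 0};  /* an arbitrary block */

    printf("right : rank %d\n", block_to_rank(right,  coord));
    printf("square: rank %d\n", block_to_rank(square, coord));
    return 0;
}
```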

Figure: Timelines of the two versions of GS2 for (a) the right and (b) the square split domains, showing the MPI calls (top) and the duration of computation coloured by gradient (bottom) for one timestep, with the four application phases, on 2304 MPI ranks.


The square split domain is over twice as fast as the default data distribution:

Runtime (ns)   Right: 2.52E+07   Square: 1.11E+07

POP metrics for the two data distributions:

                              Right    Square
Number of cores               2304     2304
Global Efficiency             26.0%    50.7%
Computational Scaling         75.2%    100.0%
Useful Instructions Scaling   85.3%    100.0%
Useful IPC Scaling            96.4%    100.0%
Parallel Efficiency           34.5%    50.7%
Load Balance Efficiency       78.8%    80.2%
Serialisation Efficiency      97.3%    96.6%
Transfer Efficiency           45.0%    65.4%

Issues highlighted by the metrics, in order of importance:
• Transfer Efficiency is low for both distributions but significantly better for the square split; this is where most of the improvement is seen.
• Load Balance is acceptable for both.
• Useful Instructions and IPC scaling are worse for the right split, i.e. the right split does more work and does it more slowly.
• Serialisation is good, i.e. little time is spent waiting for communication partners to become available.

Figure: Speedup of the square split relative to the right split, broken down into time outside MPI and time in MPI_Bcast, MPI_Isend, MPI_Irecv, MPI_Reduce, MPI_Allreduce, MPI_Waitall and MPI_Waitany.

Since Transfer Efficiency is the main inefficiency, we investigated it further by examining the communication pattern between MPI ranks.

Figure: Communication matrix for 120 ranks for (a) the right and (b) the square split domains, coloured by the number of messages sent between partners; light green indicates fewer messages and dark blue more.

Considerably more point-to-point messages are sent with the right split domain than with the square split domain. The pattern is related to the input parameters of the domain decomposition.
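Tools such as Extrae gather this information automatically, but the idea behind a communication matrix can be sketched with a PMPI interposer that counts point-to-point messages per destination. This is a minimal sketch of the technique, not the Extrae implementation:

```c
#include <mpi.h>
#include <stdlib.h>

/* Minimal sketch: each rank accumulates one row of the communication
 * matrix by interposing on MPI_Isend via the standard PMPI interface.
 * Link ahead of the MPI library; Extrae records far more detail. */
static long *msg_count = NULL;

int MPI_Init(int *argc, char ***argv) {
    int ret = PMPI_Init(argc, argv), size;
    PMPI_Comm_size(MPI_COMM_WORLD, &size);
    msg_count = calloc((size_t)size, sizeof *msg_count);
    return ret;
}

int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
              int tag, MPI_Comm comm, MPI_Request *req) {
    if (comm == MPI_COMM_WORLD)   /* only world-rank partners counted here */
        msg_count[dest]++;        /* one more message to this partner */
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}

int MPI_Finalize(void) {
    /* Each rank now holds its row of the matrix; rank 0 could gather
     * all rows with MPI_Gather before shutdown. */
    free(msg_count);
    return PMPI_Finalize();
}
```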

Conclusions
The square split domain was the most efficient split tested, at around twice as fast as the current default option.
• Communication is the key bottleneck, specifically the amount of data transferred and the complexity of the communication patterns.
• Further investigation of the impact of data distribution with different inputs is required to determine an optimal configuration for production runs.
• This work clearly demonstrates that there is large scope for improving the communication in GS2.

Acknowledgements & References
We would like to thank STFC and CCFE for giving us permission to showcase the work they have undertaken with POP. The POP project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 676553. Plasma HEC Consortium: EPSRC grant number EP/L000237/1. Collaborative Computational Project in Plasma Physics: grant number EP/M022463/1.

[1] "Comparison of Initial Value and Eigenvalue Codes for Kinetic Toroidal Plasma Instabilities," M. Kotschenreuther, G. Re-

woldt, and W.M. Tang, Comp. Phys. Comm. 88, 128 (1995).

[2] G. Llort, Servat, H., González, J., Giménez, J., and Labarta, J., “On the Usefulness of Object Tracking Techniques in Perfor-

mance Analysis”, in Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage

and Analysis, New York, NY, USA, 2013.

[3] “Efficiency Metrics in a POP performance audit”, https://pop-coe.eu/sites/default/files/pop_files/metrics.pdf

