TRANSCRIPT
SHERPA PERFORMANCE STUDY — GLIBC
Rui Wang[1], Jahred Adelman[1], Doug Benjamin[2], Dhiman Chakraborty[1], Jeremy Love[2] ([1] Northern Illinois University, [2] Argonne National Laboratory)
IRIS-HEP Topical Meeting, March 1st 2021
MC generator processing flow at ATLAS
• The event generation is invoked by Athena as part of the standard algorithm event loop
• The job configuration is passed from Athena (the ATLAS offline software) to the generator.
• The generated event is created and converted into the HepMC format
Liza Mijovic's talk
Estimated CPU time for MC production
ATLAS Computing and Software Public Results
• Currently the 2017 model does not scale much beyond Run 3
• A factor of two speed-up in MC event generation is needed after the HL-LHC upgrade
CPU time for MC production
~20% of the total CPU resources needed in 2028
• Sherpa, used for SM V+jets event generation, is a major driver of resource usage, as V+jets events are a background to nearly all physics analyses
• Increasing the efficiency and parallelism of the ATLAS implementation of Sherpa is one of the biggest gains to be made
• This study aims to understand the impact of the environment configuration on Sherpa performance and to identify hotspots
Introduction
(Plot from Josh McFayden's talk; avg. ~80 s)
Test setup
• Standalone Sherpa: 2.2.10p4
  • Compiled with OpenLoops, LHAPDF, HepMC and sqlite3 from CVMFS
• Docker image of CentOS7 (RAM: 24 GB)
  • gcc 6.2 from CVMFS
  • GNU glibc libraries (details in backup): glibc-2.17, glibc-2.18, glibc-2.24
  • glibc-2.24_vec: compiled with vectorized math function options (without OpenMP): -ftree-loop-vectorize -ffast-math
  • glibc-2.17 + libimf (Intel math library)
  • Athena: 21.6.39,AthGeneration
• Profiling: Intel VTune Profiler
• Process: ATLAS official W+jets (up to 2 jets) production
Total CPU time
• <= 1k events: generation repeated 10 times
• > 1k events: generation run once
**No error bars on the ratios
Solid lines are the CPU time; dashed lines are the ratios
Total CPU time
• All GNU glibcs are slower than glibc-2.17+libimf (21.6.39,AthGeneration)
• CPU time is reduced when moving to a higher glibc release
• The improvement depends on the number of generated events
The ratio increases as the number of events grows
**No error bars on the ratios
2.17 → 2.24: ~4% decrease
Further improvement from vectorized compilation
CPU time on modules
The math library is the third-largest consumer
libimf takes only half the CPU time that libm uses
libLHAPDF & libOpenLoops are the top two, at ~40% of the total running time
CPU time on modules — excluding the external libraries (LHAPDF & OpenLoops) and the math library
System operations & PHASIC++ follow
• Separate the CPU time into an initialization stage and an event generation stage
• libimf speeds up the initialization stage (~150-250 s) considerably
• The event generation stage (~7500 s per 1k events) does not show a significant difference
CPU Time per stage
(Plots: Initialization and Event Generation stages)
• System
• libm/libimf: math library
• External
• LHAPDF: PDF evaluation
• OpenLoops: evaluation of tree and one-loop matrix elements for any Standard Model process at NLO QCD and NLO EW
• Internal
• COMIX: multi-leg tree-level matrix element generator
• PHASIC++: Monte Carlo phase-space integration
• CSSHOWER++: parton shower
Sherpa
CPU time on modules — initialization stage (~150-250s)
• glibc-2.17+libimf vs glibcs
  • libimf is ~2 times faster than libm, with less time on system operations
  • ~50% more time loading OpenLoops, but half the time loading the CS shower
(ratio to glibc-2.17)
CPU time on modules — initialization stage (~150-250s)
• glibcs
  • Small differences in general
  • ~20% difference in libm
  • COMIX is faster with 2.24; the CS shower is slower with 2.24 when using vectorized compiling
(ratio to glibc-2.17)
CPU time on modules — event generation stage (~7500s)
• glibc-2.17+libimf vs glibcs
  • libimf is ~2 times faster than libm, with less time on system operations
  • ~10% more time on LHAPDF & OpenLoops
• In general, ~5% less CPU time when moving to a higher glibc release
(ratio to glibc-2.17)
Summary
• The math library is a hotspot besides OpenLoops & LHAPDF
  • Athena has adopted the Intel math library
  • glibc-2.17+libimf is ~2 times faster than the GNU glibcs
• Comparing Sherpa run times between glibc releases:
  • glibc-2.24 is ~4% faster than glibc-2.17 for 1k-event generation with normal gcc compilation
  • Vectorized compilation adds a further improvement
• Because the default configuration Athena uses is already well optimized, we are beginning to work on Sherpa with HPCs
Thank you!
Backups
Consistency check
• Single-threaded runs under similar environments
• Good consistency between runs
Comparison between glibc-2.17 & glibc-2.18
Different gcc options
• Sherpa compiled with different gcc optimization options
• For -Ofast, the impact on accuracy needs to be checked
• No visible difference between -O2 and -O3
option | optimization level | execution time | code size | memory usage | compile time
-O0 | optimization for compilation time (default) | + | + | - | -
-O1 or -O | optimization for code size and execution time | - | - | + | +
-O2 | optimization more for code size and execution time | -- | | + | ++
-O3 | optimization more for code size and execution time | --- | | + | +++
-Os | optimization for code size | | -- | | ++
-Ofast | O3 with fast, non-accurate math calculations | --- | | + | +++
https://www.rapidtables.com/code/linux/gcc/gcc-o.html
GNU glibc releases
• glibc-2.17
  • CentOS7 default
• glibc-2.18
  • libm multiprecision code cleanup and performance improvements
• glibc-2.24
  • Bugs resolved in math (e.g. math: incorrect cos result for 1.5174239687223976)
  • Starting from glibc-2.22, gcc can compile with vectorized math functions: cos, cosf, sin, sinf, sincos, sincosf, log, logf, exp, expf, pow, powf
  • glibc-2.22 & glibc-2.23 were skipped due to compilation issues
• glibc-2.24_vec
  • Compiled with vectorized math function options (without OpenMP): -ftree-loop-vectorize -ffast-math
  • Sherpa is compiled without OpenMP
  • However, -ffast-math is not safe and may cause loss of accuracy (e.g. sqrt, rounding, inf, NaN, etc.)
• 21.6.39,AthGeneration
  • Athena default
https://sourceware.org/glibc/wiki/Release
Sherpa config
• ./configure --enable-shared --enable-static --enable-binreloc --enable-analysis \
    --enable-pythia --enable-hepevtsize=10000 \
    --enable-lhapdf=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/MCGenerators/lhapdf/6.2.3/x86_64-slc6-gcc62-opt \
    --enable-hepmc2=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/HepMC/2.06.09/x86_64-slc6-gcc62-opt \
    --enable-openloops=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/MCGenerators/openloops/2.0.0/x86_64-slc6-gcc62-opt \
    --enable-lhole \
    --enable-fastjet=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/fastjet/3.2.0/x86_64-slc6-gcc62-opt \
    --with-sqlite3=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/sqlite/3110100/x86_64-slc6-gcc62-opt \
    CFLAGS="-O2 -g0" CXXFLAGS="-O2 -g0" FFLAGS="-O2 -g0"
W+jets run card
(run){
  % scales, tags for scale variations
  FSF:=1.; RSF:=1.; QSF:=1.;
  SCALES STRICT_METS{FSF*MU_F2}{RSF*MU_R2}{QSF*MU_Q2};

  # me generator settings
  ME_SIGNAL_GENERATOR Comix Amegic LOOPGEN;
  LOOPGEN:=OpenLoops

  # tags for process setup
  NJET:=4; LJET:=2,3,4; QCUT:=20.;

  # EW corrections setup
  OL_PARAMETERS=ew_scheme 2 ew_renorm_scheme 1
  ASSOCIATED_CONTRIBUTIONS_VARIATIONS=EW EW|LO1 EW|LO1|LO2 EW|LO1|LO2|LO3;
  EW_SCHEME=3 GF=1.166397e-5
  METS_BBAR_MODE=5

  # speed and neg weight fraction improvements
  PP_RS_SCALE VAR{sqr(sqrt(H_T2)-PPerp(p[2])-PPerp(p[3])+MPerp(p[2]+p[3]))/4};
  NLO_CSS_PSMODE=1
  BEAM_1=2212 BEAM_2=2212
  MAX_PROPER_LIFETIME=10.0
  HEPMC_TREE_LIKE=1
  PRETTY_PRINT=Off
  OVERWEIGHT_THRESHOLD=10
  PP_HPSMODE=0
  HEPMC_USE_NAMED_WEIGHTS=1
  CSS_REWEIGHT=1
  REWEIGHT_SPLITTING_PDF_SCALES=1
  REWEIGHT_SPLITTING_ALPHAS_SCALES=1
  CSS_REWEIGHT_SCALE_CUTOFF=5.0
  HEPMC_INCLUDE_ME_ONLY_VARIATIONS=1
  SCALE_VARIATIONS=0.25,0.25 # 0.25,1. 1.,0.25 1.,1. 1.,4. 4.,1. 4.,4.
  MASS[6]=172.5 WIDTH[6]=1.32
  MASS[15]=1.777 WIDTH[15]=2.26735e-12
  MASS[23]=91.1876 WIDTH[23]=2.4952
  MASS[24]=80.399 WIDTH[24]=2.085
  EW_SCHEME=0 SIN2THETAW=0.23113
  HDH_WIDTH[6,24,5]=1.32 HDH_WIDTH[-6,-24,-5]=1.32
  HDH_WIDTH[25,5,-5]=2.35e-3 HDH_WIDTH[25,15,-15]=2.57e-4 HDH_WIDTH[25,13,-13]=8.91e-7
  HDH_WIDTH[25,4,-4]=1.18e-4 HDH_WIDTH[25,3,-3]=1.00e-6 HDH_WIDTH[25,21,21]=3.49e-4 HDH_WIDTH[25,22,22]=9.28e-6
  HDH_WIDTH[24,2,-1]=0.7041 HDH_WIDTH[24,4,-3]=0.7041 HDH_WIDTH[24,12,-11]=0.2256 HDH_WIDTH[24,14,-13]=0.2256 HDH_WIDTH[24,16,-15]=0.2256
  HDH_WIDTH[-24,-2,1]=0.7041 HDH_WIDTH[-24,-4,3]=0.7041 HDH_WIDTH[-24,-12,11]=0.2256 HDH_WIDTH[-24,-14,13]=0.2256 HDH_WIDTH[-24,-16,15]=0.2256
  HDH_WIDTH[23,1,-1]=0.3828 HDH_WIDTH[23,2,-2]=0.2980 HDH_WIDTH[23,3,-3]=0.3828 HDH_WIDTH[23,4,-4]=0.2980 HDH_WIDTH[23,5,-5]=0.3828
  HDH_WIDTH[23,11,-11]=0.0840 HDH_WIDTH[23,12,-12]=0.1663 HDH_WIDTH[23,13,-13]=0.0840 HDH_WIDTH[23,14,-14]=0.1663 HDH_WIDTH[23,15,-15]=0.0840 HDH_WIDTH[23,16,-16]=0.1663
  OL_PREFIX=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/MCGenerators/openloops/2.0.0/x86_64-slc6-gcc62-opt
  OL_PARAMETERS=preset=2 write_parameters=1
  PDF_LIBRARY=LHAPDFSherpa USE_PDF_ALPHAS=1
  PDF_SET=NNPDF30_nnlo_as_0118_hessian
  PDF_VARIATIONS=NNPDF30_nnlo_as_0118_hessian[all] NNPDF30_nnlo_as_0117 NNPDF30_nnlo_as_0119 MMHT2014nnlo68cl CT14nnlo PDF4LHC15_nnlo_30_pdfas[all] NNPDF31_nnlo_as_0118_hessian
  SHERPA_LDADD=SherpaFusing
  USERHOOK=Fusing_Fragmentation
  CSS_SCALE_SCHEME=20 CSS_EVOLUTION_SCHEME=30
  FUSING_FRAGMENTATION_STORE_AS_WEIGHT=1
  OL_PARAMETERS=ew_scheme=2 ew_renorm_scheme=1 write_parameters=1
  EW_SCHEME=3 GF=1.166397e-5
  BEAM_ENERGY_1=6500.0 BEAM_ENERGY_2=6500.0
}(run)
(processes){
  Process 93 93 -> 13 -14 93{NJET};
  Order (*,2); CKKW sqr(QCUT/E_CMS);
  Associated_Contributions EW|LO1|LO2|LO3 {LJET};
  Enhance_Observable VAR{log10(max(sqrt(H_T2)-PPerp(p[2])-PPerp(p[3]),MPerp(p[2]+p[3])))}|1|3.3 {3,4,5,6,7}
  NLO_QCD_Mode MC@NLO {LJET};
  ME_Generator Amegic {LJET};
  RS_ME_Generator Comix {LJET};
  Loop_Generator LOOPGEN {LJET};
  Max_N_Quarks 4 {6,7,8};
  Max_Epsilon 0.01 {6,7,8};
  Integration_Error 0.99 {3,4,5,6,7,8};
  End process;

  Process 93 93 -> -13 14 93{NJET};
  Order (*,2); CKKW sqr(QCUT/E_CMS);
  Associated_Contributions EW|LO1|LO2|LO3 {LJET};
  Enhance_Observable VAR{log10(max(sqrt(H_T2)-PPerp(p[2])-PPerp(p[3]),MPerp(p[2]+p[3])))}|1|3.3 {3,4,5,6,7}
  NLO_QCD_Mode MC@NLO {LJET};
  ME_Generator Amegic {LJET};
  RS_ME_Generator Comix {LJET};
  Loop_Generator LOOPGEN {LJET};
  Max_N_Quarks 4 {6,7,8};
  Max_Epsilon 0.01 {6,7,8};
  Integration_Error 0.99 {3,4,5,6,7,8};
  End process;
}(processes)
(selector){
  Mass 13 -14 2.0 E_CMS
  Mass -13 14 2.0 E_CMS
}(selector)
Total CPU time
• <= 1k events: generation repeated 10 times
• > 1k events: generation run once
**No error bars on the ratios
Solid lines are the CPU time; dashed lines are the ratios
• glibc-2.17+libimf takes ~50% of the CPU time of the GNU glibcs
CPU Time — system
(Plots: Initialization and Event Generation stages)
• LHAPDF and OpenLoops run times are stable across the glibcs, and lower than with glibc-2.17+libimf
CPU Time — LHAPDF and OpenLoops
(Plots: Initialization and Event Generation stages)