TRANSCRIPT
SHERPA PERFORMANCE STUDY — GLIBC
Rui Wang[1], Jahred Adelman[1], Doug Benjamin[2], Dhiman Chakraborty[1], Jeremy Love[2] ([1] Northern Illinois University, [2] Argonne National Laboratory)
IRIS-HEP Topical Meeting, March 1st 2021
MC generator processing flow at ATLAS
• The event generation is invoked by Athena as part of the standard algorithm event loop
• The job configuration is passed from Athena (the ATLAS offline software) to the generator.
• The generated event is created and converted into the HepMC format
Liza Mijovic's talk
Estimated CPU time for MC production
ATLAS Computing and Software Public Results
• Currently the 2017 model does not scale much beyond Run 3
• A factor of two speed-up in MC event generation is needed after the HL-LHC upgrade
CPU time for MC production
~20% of the total CPU resources needed in 2028
• Sherpa, used for SM V+jets event generation, is a major driver of resource usage, as V+jets events are a background to nearly all physics analyses
• Increasing the efficiency and parallelism of the ATLAS implementation of Sherpa is one of the biggest gains to be made
• This study aims to understand the impact of the environment configuration on Sherpa performance and to identify hotspots
Introduction
(Plot from Josh McFayden's talk; avg. ~80 s)
Test setup
• Standalone Sherpa: 2.2.10p4
  • Compiled with OpenLoops, LHAPDF, HepMC and sqlite3 from CVMFS
• Docker image of CentOS7 (RAM: 24 GB)
  • gcc 6.2 from CVMFS
  • GNU glibc libraries (details in backup): glibc-2.17, glibc-2.18, glibc-2.24
  • glibc-2.24_vec: compiled with vectorized math function options (without OpenMP): -ftree-loop-vectorize -ffast-math
  • glibc-2.17 + libimf (Intel math library)
  • Athena: 21.6.39,AthGeneration
• Profiling: Intel VTune Profiler
• Process: ATLAS official W+jets (up to 2 jets) production
Total CPU time
• <= 1k events: generation repeated 10 times
• > 1k events: generation run once
**No error bars on the ratios
Solid lines are the CPU time; dashed lines are the ratios
Total CPU time
• All GNU glibcs are slower than glibc-2.17+libimf (21.6.39,AthGeneration)
• CPU time is reduced when moving to a higher glibc release
• The improvement depends on the number of generated events
The ratio increases as the number of events grows
**No error bars on the ratios
2.17 → 2.24: ~4% decrease
Further improvement from vectorized compilation
CPU time on modules
The math library is the third-largest consumer
libimf takes only half the CPU time that libm uses
libLHAPDF & libOpenLoops are the top two, at ~40% of the total running time
CPU time on modules — excluding the external libraries (LHAPDF & OpenLoops) and the math library
System operations & PHASIC++ follow
• Separate the CPU time into an initialization stage and an event generation stage
• libimf speeds up the initialization stage (~150-250 s) considerably
• The event generation stage (~7500 s per 1k events) does not show a significant difference
CPU Time per stage
(Plots: Initialization and Event Generation stages)
• System
• libm/libimf: math library
• External
• LHAPDF: PDF evaluation
• OpenLoops: evaluation of tree and one-loop matrix elements for any Standard Model process at NLO QCD and NLO EW
• Internal
• COMIX: multi-leg tree-level matrix element generator
• PHASIC++: Monte Carlo phase-space integration
• CSSHOWER++: parton shower
Sherpa
CPU time on modules — initialization stage (~150-250s)
• glibc-2.17+libimf vs glibcs
  • libimf is ~2 times faster than libm, with less time on system operations
  • ~50% more time loading OpenLoops, but half the time loading the CS shower
(ratio to glibc-2.17)
CPU time on modules — initialization stage (~150-250s)
• glibcs
  • Small differences in general
  • ~20% difference in libm
  • COMIX is faster with 2.24; the CS shower is slower with 2.24 when using vectorized compiling
(ratio to glibc-2.17)
CPU time on modules — event generation stage (~7500s)
• glibc-2.17+libimf vs glibcs
  • libimf is ~2 times faster than libm, with less time on system operations
  • ~10% more time on LHAPDF & OpenLoops
• In general, ~5% less CPU time when moving to a higher glibc release
(ratio to glibc-2.17)
Summary
• The math library is a hotspot besides OpenLoops & LHAPDF
  • Athena has adopted the Intel math library
  • glibc-2.17+libimf is ~2 times faster than the GNU glibcs
• Comparing Sherpa run times between glibc releases:
  • glibc-2.24 is ~4% faster than glibc-2.17 for 1k-event generation with normal gcc compilation
  • Vectorized compilation adds a further improvement
• Because the default configuration Athena uses is already well optimized, we are beginning to work on Sherpa with HPCs
Thank you!
Backups
Consistency check
• Single-threaded runs under similar environments
• Good consistency between runs
Comparison between glibc-2.17 & glibc-2.18
Different gcc options
• Sherpa compiled with different gcc optimization options
• For -Ofast, the impact on accuracy needs to be checked
• No visible difference between -O2 and -O3
option | optimization level | execution time | code size | memory usage | compile time
-O0 | optimization for compilation time (default) | + | + | - | -
-O1 or -O | optimization for code size and execution time | - | - | + | +
-O2 | optimization more for code size and execution time | -- | | + | ++
-O3 | optimization more for code size and execution time | --- | | + | +++
-Os | optimization for code size | | -- | | ++
-Ofast | O3 with fast, non-accurate math calculations | --- | | + | +++
https://www.rapidtables.com/code/linux/gcc/gcc-o.html
GNU glibc releases
• glibc-2.17
  • CentOS7 default
• glibc-2.18
  • libm multiprecision code cleanup and performance improvements
• glibc-2.24
  • Bugs resolved in math (e.g. math: incorrect cos result for 1.5174239687223976)
  • Starting from glibc-2.22, gcc can compile with vectorized math functions: cos, cosf, sin, sinf, sincos, sincosf, log, logf, exp, expf, pow, powf
  • glibc-2.22 & glibc-2.23 were skipped due to compilation issues
• glibc-2.24_vec
  • Compiled with vectorized math function options (without OpenMP): -ftree-loop-vectorize -ffast-math
  • Sherpa is compiled without OpenMP
  • However, -ffast-math is not safe and may cause loss of accuracy (e.g. sqrt, rounding, inf, NaN, etc.)
• 21.6.39,AthGeneration
  • Athena default
https://sourceware.org/glibc/wiki/Release
Sherpa config
• ./configure --enable-shared --enable-static --enable-binreloc --enable-analysis \
    --enable-pythia --enable-hepevtsize=10000 \
    --enable-lhapdf=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/MCGenerators/lhapdf/6.2.3/x86_64-slc6-gcc62-opt \
    --enable-hepmc2=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/HepMC/2.06.09/x86_64-slc6-gcc62-opt \
    --enable-openloops=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/MCGenerators/openloops/2.0.0/x86_64-slc6-gcc62-opt \
    --enable-lhole \
    --enable-fastjet=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/fastjet/3.2.0/x86_64-slc6-gcc62-opt \
    --with-sqlite3=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/sqlite/3110100/x86_64-slc6-gcc62-opt \
    CFLAGS="-O2 -g0" CXXFLAGS="-O2 -g0" FFLAGS="-O2 -g0"
W+jets run card
(run){
  % scales, tags for scale variations
  FSF:=1.; RSF:=1.; QSF:=1.;
  SCALES STRICT_METS{FSF*MU_F2}{RSF*MU_R2}{QSF*MU_Q2};

  # me generator settings
  ME_SIGNAL_GENERATOR Comix Amegic LOOPGEN;
  LOOPGEN:=OpenLoops

  # tags for process setup
  NJET:=4; LJET:=2,3,4; QCUT:=20.;

  # EW corrections setup
  OL_PARAMETERS=ew_scheme 2 ew_renorm_scheme 1
  ASSOCIATED_CONTRIBUTIONS_VARIATIONS=EW EW|LO1 EW|LO1|LO2 EW|LO1|LO2|LO3;
  EW_SCHEME=3 GF=1.166397e-5
  METS_BBAR_MODE=5

  # speed and neg weight fraction improvements
  PP_RS_SCALE VAR{sqr(sqrt(H_T2)-PPerp(p[2])-PPerp(p[3])+MPerp(p[2]+p[3]))/4};
  NLO_CSS_PSMODE=1
  BEAM_1=2212 BEAM_2=2212
  MAX_PROPER_LIFETIME=10.0
  HEPMC_TREE_LIKE=1
  PRETTY_PRINT=Off
  OVERWEIGHT_THRESHOLD=10
  PP_HPSMODE=0
  HEPMC_USE_NAMED_WEIGHTS=1
  CSS_REWEIGHT=1
  REWEIGHT_SPLITTING_PDF_SCALES=1
  REWEIGHT_SPLITTING_ALPHAS_SCALES=1
  CSS_REWEIGHT_SCALE_CUTOFF=5.0
  HEPMC_INCLUDE_ME_ONLY_VARIATIONS=1
  SCALE_VARIATIONS=0.25,0.25 # 0.25,1. 1.,0.25 1.,1. 1.,4. 4.,1. 4.,4.
  MASS[6]=172.5 WIDTH[6]=1.32
  MASS[15]=1.777 WIDTH[15]=2.26735e-12
  MASS[23]=91.1876 WIDTH[23]=2.4952
  MASS[24]=80.399 WIDTH[24]=2.085
  EW_SCHEME=0 SIN2THETAW=0.23113
  HDH_WIDTH[6,24,5]=1.32 HDH_WIDTH[-6,-24,-5]=1.32
  HDH_WIDTH[25,5,-5]=2.35e-3 HDH_WIDTH[25,15,-15]=2.57e-4 HDH_WIDTH[25,13,-13]=8.91e-7
  HDH_WIDTH[25,4,-4]=1.18e-4 HDH_WIDTH[25,3,-3]=1.00e-6 HDH_WIDTH[25,21,21]=3.49e-4 HDH_WIDTH[25,22,22]=9.28e-6
  HDH_WIDTH[24,2,-1]=0.7041 HDH_WIDTH[24,4,-3]=0.7041 HDH_WIDTH[24,12,-11]=0.2256 HDH_WIDTH[24,14,-13]=0.2256 HDH_WIDTH[24,16,-15]=0.2256
  HDH_WIDTH[-24,-2,1]=0.7041 HDH_WIDTH[-24,-4,3]=0.7041 HDH_WIDTH[-24,-12,11]=0.2256 HDH_WIDTH[-24,-14,13]=0.2256 HDH_WIDTH[-24,-16,15]=0.2256
  HDH_WIDTH[23,1,-1]=0.3828 HDH_WIDTH[23,2,-2]=0.2980 HDH_WIDTH[23,3,-3]=0.3828 HDH_WIDTH[23,4,-4]=0.2980 HDH_WIDTH[23,5,-5]=0.3828
  HDH_WIDTH[23,11,-11]=0.0840 HDH_WIDTH[23,12,-12]=0.1663 HDH_WIDTH[23,13,-13]=0.0840 HDH_WIDTH[23,14,-14]=0.1663 HDH_WIDTH[23,15,-15]=0.0840 HDH_WIDTH[23,16,-16]=0.1663
  OL_PREFIX=/cvmfs/sft.cern.ch/lcg/releases/LCG_88/MCGenerators/openloops/2.0.0/x86_64-slc6-gcc62-opt
  OL_PARAMETERS=preset=2 write_parameters=1
  PDF_LIBRARY=LHAPDFSherpa USE_PDF_ALPHAS=1
  PDF_SET=NNPDF30_nnlo_as_0118_hessian
  PDF_VARIATIONS=NNPDF30_nnlo_as_0118_hessian[all] NNPDF30_nnlo_as_0117 NNPDF30_nnlo_as_0119 MMHT2014nnlo68cl CT14nnlo PDF4LHC15_nnlo_30_pdfas[all] NNPDF31_nnlo_as_0118_hessian
  SHERPA_LDADD=SherpaFusing
  USERHOOK=Fusing_Fragmentation
  CSS_SCALE_SCHEME=20 CSS_EVOLUTION_SCHEME=30
  FUSING_FRAGMENTATION_STORE_AS_WEIGHT=1
  OL_PARAMETERS=ew_scheme=2 ew_renorm_scheme=1 write_parameters=1
  EW_SCHEME=3 GF=1.166397e-5
  BEAM_ENERGY_1=6500.0 BEAM_ENERGY_2=6500.0
}(run)
(processes){
  Process 93 93 -> 13 -14 93{NJET};
  Order (*,2); CKKW sqr(QCUT/E_CMS);
  Associated_Contributions EW|LO1|LO2|LO3 {LJET};
  Enhance_Observable VAR{log10(max(sqrt(H_T2)-PPerp(p[2])-PPerp(p[3]),MPerp(p[2]+p[3])))}|1|3.3 {3,4,5,6,7}
  NLO_QCD_Mode MC@NLO {LJET};
  ME_Generator Amegic {LJET};
  RS_ME_Generator Comix {LJET};
  Loop_Generator LOOPGEN {LJET};
  Max_N_Quarks 4 {6,7,8};
  Max_Epsilon 0.01 {6,7,8};
  Integration_Error 0.99 {3,4,5,6,7,8};
  End process;

  Process 93 93 -> -13 14 93{NJET};
  Order (*,2); CKKW sqr(QCUT/E_CMS);
  Associated_Contributions EW|LO1|LO2|LO3 {LJET};
  Enhance_Observable VAR{log10(max(sqrt(H_T2)-PPerp(p[2])-PPerp(p[3]),MPerp(p[2]+p[3])))}|1|3.3 {3,4,5,6,7}
  NLO_QCD_Mode MC@NLO {LJET};
  ME_Generator Amegic {LJET};
  RS_ME_Generator Comix {LJET};
  Loop_Generator LOOPGEN {LJET};
  Max_N_Quarks 4 {6,7,8};
  Max_Epsilon 0.01 {6,7,8};
  Integration_Error 0.99 {3,4,5,6,7,8};
  End process;
}(processes)
(selector){
  Mass 13 -14 2.0 E_CMS
  Mass -13 14 2.0 E_CMS
}(selector)
Total CPU time
• <= 1k events: generation repeated 10 times
• > 1k events: generation run once
**No error bars on the ratios
Solid lines are the CPU time; dashed lines are the ratios
• glibc-2.17+libimf takes ~50% of the CPU time of the GNU glibcs
CPU Time — system
(Plots: Initialization and Event Generation stages)
• LHAPDF and OpenLoops run times are stable across the glibcs, and lower than with glibc-2.17+libimf
CPU Time — LHAPDF and OpenLoops
(Plots: Initialization and Event Generation stages)