somp: simulating openmp task-based applications with numa
TRANSCRIPT
sOMP: Simulating OpenMP Task-Based Applicationswith NUMA Effects
DAOUDI Idriss1, VIROULEAU Philippe, THIBAULT Samuel,GAUTIER Thierry, AUMAGE Olivier
September 23, 2020
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 1 / 14
Context and objectives
Context:
anticipating applications behavior, studying, and designing algorithms
experiment with various scheduling algorithms in a reproduciblecontext
predictive tools to evaluate applications performances on existing andnon-existing systems
SimGrid → simulation of distributed infrastructures
MPI, tasks, threads...
OpenMP?In order to simulate OpenMP:
1 parallel tasks with data dependency2 parallel loops
Objectives:
predict the performances of task-based applications
take into account Non-Uniform Memory Access (NUMA) effects
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 2 / 14
State of the art
Distributed memory simulators:
SimGrid, Dimemas, BigSim, xSim...
Shared-memory simulators:
for specific architectures: Aversa et al. (hybrid MPI/OpenMP onSMP), Simany (multicore)...full system simulators: SimNUMA...for task-based applications: HLSMN tool (without considering taskdependencies)...
None of these tools takes into account task dependency andNUMA effects!
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 3 / 14
Methodology
1 OpenMP application runtime data collection (tracing)
2 NUMA architectures modeling
3 Building a task-based application simulator with SimGrid
4 Construction of a performance model to take into account memoryaccesses
5 Simulation of the behavior of OpenMP applications using theperformance model
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 4 / 14
Tracing with TiKKi
uses the OMPT API integrated in OpenMP 5.0
captures all the events required→ to build an OpenMP application task graph
generate several output forms of execution traces→ task graph, Gantt chart, sOMP trace format
sequence of parallel regions→ where the events of each task is recorded
one trace file per encountered parallel region
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 5 / 14
Modeling NUMA architectures with SimGrid
exploits the possibilities offered by SimGrid’s S4U API
NUMA node → cores + memory controller + links
exploit links parameters → model contention and concurrency
compromise between precision and simulation cost!→ around 70 times faster than real execution
(a) NUMA machine modeling (b) NUMA machine modeling usingSimGrid components
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 6 / 14
Task-based applications simulator: sOMP
Concept:
exploit traces collected from a sequential execution
execute the application on platforms modeled in the SimGrid senseusing the collected traces
simulate the execution time for a large number of cores
Components:
parser for trace and platform files
scheduler to submit jobs
single worker per simulated core to execute the tasks
performance model to refine the simulations
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 7 / 14
Communications-based performance model
Default task execution model in sOMP:
task execution time obtained via the trace filesadvance SimGrid’s simulated clock
Communications-based model using SimGrids’ communications:
trace file provide the list of memory operations performed by eachtask (R, RW);take into account those memory accesses;memory access → communication to the memory controllerthe size of each communication (in bytes) impact SimGrids’ internalclock→ contention and concurrency on the crossed links;
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 8 / 14
Evaluation
Architectures:
dual socket Intel Xeon Gold 6240 (CascadeLake)→ 2 sockets, 2 NUMA nodes, 36 cores
dual socket AMD EPYC 7452 (AMD Infinity)→ 2 sockets, 16 NUMA nodes, 64 cores
Kastors/Plasma[IWOMP2014]:
benchmark suite to evaluate the implementation of the OpenMP taskdependency paradigm;
several matrix factorization algorithms: Cholesky, QR, LU.
Metric:
precision error (%) = Tnative−TsimulatedTnative
x 100
positive → optimistic simulation, negative → pessimistic simulation
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 9 / 14
Architectures overview with HWLoc
Machine (1024GB total)
Package P#0
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#0
PU P#0
L2 (512KB)
L1 (32KB)
Core P#1
PU P#1
L2 (512KB)
L1 (32KB)
Core P#2
PU P#2
L2 (512KB)
L1 (32KB)
Core P#3
PU P#3
NUMANode P#0 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#4
PU P#4
L2 (512KB)
L1 (32KB)
Core P#5
PU P#5
L2 (512KB)
L1 (32KB)
Core P#6
PU P#6
L2 (512KB)
L1 (32KB)
Core P#7
PU P#7
NUMANode P#1 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#8
PU P#8
L2 (512KB)
L1 (32KB)
Core P#9
PU P#9
L2 (512KB)
L1 (32KB)
Core P#10
PU P#10
L2 (512KB)
L1 (32KB)
Core P#11
PU P#11
NUMANode P#2 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#12
PU P#12
L2 (512KB)
L1 (32KB)
Core P#13
PU P#13
L2 (512KB)
L1 (32KB)
Core P#14
PU P#14
L2 (512KB)
L1 (32KB)
Core P#15
PU P#15
NUMANode P#3 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#16
PU P#16
L2 (512KB)
L1 (32KB)
Core P#17
PU P#17
L2 (512KB)
L1 (32KB)
Core P#18
PU P#18
L2 (512KB)
L1 (32KB)
Core P#19
PU P#19
NUMANode P#4 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#20
PU P#20
L2 (512KB)
L1 (32KB)
Core P#21
PU P#21
L2 (512KB)
L1 (32KB)
Core P#22
PU P#22
L2 (512KB)
L1 (32KB)
Core P#23
PU P#23
NUMANode P#5 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#24
PU P#24
L2 (512KB)
L1 (32KB)
Core P#25
PU P#25
L2 (512KB)
L1 (32KB)
Core P#26
PU P#26
L2 (512KB)
L1 (32KB)
Core P#27
PU P#27
NUMANode P#6 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#28
PU P#28
L2 (512KB)
L1 (32KB)
Core P#29
PU P#29
L2 (512KB)
L1 (32KB)
Core P#30
PU P#30
L2 (512KB)
L1 (32KB)
Core P#31
PU P#31
NUMANode P#7 (64GB)
Package P#1
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#32
PU P#32
L2 (512KB)
L1 (32KB)
Core P#33
PU P#33
L2 (512KB)
L1 (32KB)
Core P#34
PU P#34
L2 (512KB)
L1 (32KB)
Core P#35
PU P#35
NUMANode P#8 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#36
PU P#36
L2 (512KB)
L1 (32KB)
Core P#37
PU P#37
L2 (512KB)
L1 (32KB)
Core P#38
PU P#38
L2 (512KB)
L1 (32KB)
Core P#39
PU P#39
NUMANode P#9 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#40
PU P#40
L2 (512KB)
L1 (32KB)
Core P#41
PU P#41
L2 (512KB)
L1 (32KB)
Core P#42
PU P#42
L2 (512KB)
L1 (32KB)
Core P#43
PU P#43
NUMANode P#10 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#44
PU P#44
L2 (512KB)
L1 (32KB)
Core P#45
PU P#45
L2 (512KB)
L1 (32KB)
Core P#46
PU P#46
L2 (512KB)
L1 (32KB)
Core P#47
PU P#47
NUMANode P#11 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#48
PU P#48
L2 (512KB)
L1 (32KB)
Core P#49
PU P#49
L2 (512KB)
L1 (32KB)
Core P#50
PU P#50
L2 (512KB)
L1 (32KB)
Core P#51
PU P#51
NUMANode P#12 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#52
PU P#52
L2 (512KB)
L1 (32KB)
Core P#53
PU P#53
L2 (512KB)
L1 (32KB)
Core P#54
PU P#54
L2 (512KB)
L1 (32KB)
Core P#55
PU P#55
NUMANode P#13 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#56
PU P#56
L2 (512KB)
L1 (32KB)
Core P#57
PU P#57
L2 (512KB)
L1 (32KB)
Core P#58
PU P#58
L2 (512KB)
L1 (32KB)
Core P#59
PU P#59
NUMANode P#14 (64GB)
L3 (16MB)
L2 (512KB)
L1 (32KB)
Core P#60
PU P#60
L2 (512KB)
L1 (32KB)
Core P#61
PU P#61
L2 (512KB)
L1 (32KB)
Core P#62
PU P#62
L2 (512KB)
L1 (32KB)
Core P#63
PU P#63
NUMANode P#15 (64GB)
AMD EPYC 7452
Machine (192GB total)
Package P#0
L3 (25MB)
L2 (1024KB)
L1 (32KB)
Core P#0
PU P#0
L2 (1024KB)
L1 (32KB)
Core P#1
PU P#1
L2 (1024KB)
L1 (32KB)
Core P#2
PU P#2
L2 (1024KB)
L1 (32KB)
Core P#3
PU P#3
L2 (1024KB)
L1 (32KB)
Core P#4
PU P#4
L2 (1024KB)
L1 (32KB)
Core P#5
PU P#5
L2 (1024KB)
L1 (32KB)
Core P#6
PU P#6
L2 (1024KB)
L1 (32KB)
Core P#7
PU P#7
L2 (1024KB)
L1 (32KB)
Core P#8
PU P#8
L2 (1024KB)
L1 (32KB)
Core P#9
PU P#9
L2 (1024KB)
L1 (32KB)
Core P#10
PU P#10
L2 (1024KB)
L1 (32KB)
Core P#11
PU P#11
L2 (1024KB)
L1 (32KB)
Core P#12
PU P#12
L2 (1024KB)
L1 (32KB)
Core P#13
PU P#13
L2 (1024KB)
L1 (32KB)
Core P#14
PU P#14
L2 (1024KB)
L1 (32KB)
Core P#15
PU P#15
L2 (1024KB)
L1 (32KB)
Core P#16
PU P#16
L2 (1024KB)
L1 (32KB)
Core P#17
PU P#17
NUMANode P#0 (96GB)
Package P#1
L3 (25MB)
L2 (1024KB)
L1 (32KB)
Core P#18
PU P#18
L2 (1024KB)
L1 (32KB)
Core P#19
PU P#19
L2 (1024KB)
L1 (32KB)
Core P#20
PU P#20
L2 (1024KB)
L1 (32KB)
Core P#21
PU P#21
L2 (1024KB)
L1 (32KB)
Core P#22
PU P#22
L2 (1024KB)
L1 (32KB)
Core P#23
PU P#23
L2 (1024KB)
L1 (32KB)
Core P#24
PU P#24
L2 (1024KB)
L1 (32KB)
Core P#25
PU P#25
L2 (1024KB)
L1 (32KB)
Core P#26
PU P#26
L2 (1024KB)
L1 (32KB)
Core P#27
PU P#27
L2 (1024KB)
L1 (32KB)
Core P#28
PU P#28
L2 (1024KB)
L1 (32KB)
Core P#29
PU P#29
L2 (1024KB)
L1 (32KB)
Core P#30
PU P#30
L2 (1024KB)
L1 (32KB)
Core P#31
PU P#31
L2 (1024KB)
L1 (32KB)
Core P#32
PU P#32
L2 (1024KB)
L1 (32KB)
Core P#33
PU P#33
L2 (1024KB)
L1 (32KB)
Core P#34
PU P#34
L2 (1024KB)
L1 (32KB)
Core P#35
PU P#35
NUMANode P#1 (96GB)
Intel Xeon Gold 6240DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 10 / 14
Results: Msize = 16384 x 16384, Bsize = 512 x 512
→ Cholesky algorithm on different architectures
-10
0
10
20
30
40
50
5 10 15 20 25 30 35
Pre
cis
ion e
rror
(%)
Number of cores
Intel Xeon Gold 6240 - Bloc size = 512 x 512
sOMPsOMP + Communications model
(c) on Intel
-10
0
10
20
30
40
50
10 20 30 40 50 60P
recis
ion e
rror
(%)
Number of cores
AMD EPYC 7452 - Bloc size = 512 x 512
sOMPsOMP + Communications model
(d) on AMD
→ less than 10% precision error with the communications-based model→ ideal for memory-bound applications→ similar results on LU algorithm
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 11 / 14
Results: Bsize = 768 x 768, on AMD machine
→ QR algorithm for different matrix sizes
-10
0
10
20
30
40
50
10 20 30 40 50 60
Pre
cis
ion e
rror
(%)
Number of cores
AMD EPYC 7452 - Matrix size = 8192 x 8192
sOMPsOMP + Communications model
-10
0
10
20
30
40
50
10 20 30 40 50 60
Pre
cis
ion e
rror
(%)
Number of cores
AMD EPYC 7452 - Matrix size = 16384 x 16384
sOMPsOMP + Communications model
→ impact of granularity and number of threads→ communications-based model is less efficient: QR is compute-bound
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 12 / 14
Summary
We developed sOMP for reproducible researches!
TiKKi→ obtaining sequential trace of application
sOMP→ simulate the application on modeled NUMA architectures
Communications-based model→ improve simulations by taking into account NUMA effects
Result→ very good precision for some applications!
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 13 / 14
Future work
Model data movements inside a cache→ introduce a level of cache (L3) in the simulator
Take into account various hardware affinity policies
Introduce other scheduling policies
Combine this work with MPI and GPUs simulations to model hybridMPI/OpenMP applications
Sources:
TiKKi: https://gitlab.inria.fr/openmp/tikki/-/wikis/home
sOMP: https://gitlab.inria.fr/idaoudi/omps/-/wikis/home
DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 14 / 14