Enabling GAMESS for Extreme Computing in Chemistry and Materials

Giuseppe Barca3, Colleen Bertoni2, Dmytro Bykov7, Laura Carrington4, Dipayan Datta1,6, Jorge Galvez1,6, Anastasia Guinina1, Taylor Harville1,6, Erik Jensen7, Sarom Leang4, Shirley Moore7, Buu Pham1,6, David Poole1,6, Alistair Rendell3, Tosaporn Sattasathuchana1, David Sherrill5, Masha Sosonkina8, Vaibhav Sundriyal8, Ananta Tiwari4, Bryce Westheimer1,6, Theresa Windus1,6, Peng Xu1,6, Federico Zahariev1, Mark Gordon*1,6

1Ames Laboratory, 2Argonne National Laboratory, 3Australian National University, 4EP Analytics, 5Georgia Institute of Technology, 6Iowa State University, 7Oak Ridge National Laboratory, 8Old Dominion University
How do we solve this problem? With quantum chemistry!

[Poster panels: (H2O)2615 · Communication Overhead · EFMO: GPU Acceleration Paths · RI-MP2 OpenMP GPU Offloading · EFMO Workflow · FRAGMENTATION: Many-Body Expansion · LibAccInt GPU Scaling · Fragment Schemes]
The EFMO algorithm can be split into two stages:
1. STAGE 1 -> computation associated with fragments
2. STAGE 2 -> computation associated with fragment dimers
Legend: * = currently on GPU; remaining kernels are in progress or planned for GPU.

1-electron integrals
• ~N²/2
• N = 10⁴: ~10⁸ 1-electron integrals

2-electron integrals*
• ~N⁴/8
• N = 10⁴: ~10¹⁶ 2-electron integrals
• Cannot be stored in memory
• Recalculated on the fly or stored on disk
• Different algorithms depending on l quantum numbers

Hartree-Fock*
• Iterative solution
• F matrices populated using integrals

MP2*
• Requires a partial 4-label transformation (4-LT)
• Scales ~N⁵

RI-MP2*
• Reduces the number of integrals in the 4-LT
• Scaling can be reduced to ~N⁴

HL: CCSD(T), CR-CC(2,3)
• Requires the full 4-LT
• Scales ~N⁷
• RI can reduce both scaling and memory footprint
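The leading-order costs listed above can be made concrete with a short sketch. This is illustrative arithmetic only (not GAMESS code); the `counts` helper is a hypothetical name, and it simply evaluates the scaling expressions quoted on the poster for N = 10⁴ basis functions.

```python
# Evaluate the poster's leading-order counts/costs for N basis functions.
# These are order-of-magnitude estimates, not exact integral counts.
def counts(n):
    return {
        "1e integrals (~N^2/2)": n**2 / 2,
        "2e integrals (~N^4/8)": n**4 / 8,
        "MP2 partial 4-LT work (~N^5)": float(n**5),
        "RI-MP2 work (~N^4)": float(n**4),
        "CCSD(T) work (~N^7)": float(n**7),
    }

for name, c in counts(10**4).items():
    print(f"{name}: {c:.2e}")
```

The ~10⁸ vs ~10¹⁶ gap between 1- and 2-electron integrals is why the latter cannot be stored in memory and must be recomputed on the fly or staged to disk.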
Required GPU scaling
➢ In both the base and stretch challenge problems, we assign a single fragment dimer to a single Aurora/Frontier GPU
➢ Each of these GPUs is 5x more powerful than one Summit GPU (NVIDIA V100)
➢ To run the dimer calculations efficiently, we need to scale to at least 5 GPUs on Summit, each giving a ~7x speedup over a CPU socket
Base Challenge Problem
➢ 1 MSN ring, fully solvated -> 61,776 total fragment dimers, of which 21,000 are treated at the quantum (RI-MP2) level
➢ All MSN fragment dimers and most MSN-water dimers are treated at the quantum level
■ All the water inside - but not outside - the pore is quantum
■ Ensures accurate chemistry and physics
➢ 20 energy + gradient points to ensure a sufficiently accurate potential energy surface model
➢ The calculation will use 21,000 GPUs, about 75% of Aurora/Frontier, for 7.5 - 8 h
■ Allows for up to 5% communication overhead
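A quick back-of-envelope check of the base-problem resource estimate. The 28,000-GPU machine size is inferred from the stretch problem's "28,000 GPUs = 100% of Aurora/Frontier"; the variable names are illustrative.

```python
# Sanity-check the base challenge problem's machine-fraction claim:
# 21,000 quantum dimers, one GPU each, on a 28,000-GPU machine.
gpus_used = 21_000      # one GPU per quantum (RI-MP2) fragment dimer
machine_gpus = 28_000   # full machine, per the stretch-problem figures
hours = 8.0             # upper end of the 7.5 - 8 h estimate

print(f"machine fraction: {gpus_used / machine_gpus:.0%}")  # 75%
print(f"GPU-hours per energy+gradient sweep: {gpus_used * hours:,.0f}")
```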
STAGE 1
➢ N GDDI groups are formed, each performing HF, MakeEFP, and RI-MP2 for a different fragment
[Workflow diagram: Group 1 ... Group N each perform HF, MakeEFP & RI-MP2 on fragments 1 ... N; GDDI groups are then redistributed; Group 1 ... Group N(N-1)/2 each perform HF & RI-MP2 on dimers 1 ... N(N-1)/2]
Stretch Challenge Problem
➢ 4 MSN rings, fully solvated -> 990,528 total fragment dimers, of which 28,000 are treated at the quantum (RI-MP2) level
➢ All MSN fragment dimers and a sufficient number of MSN-water dimers are treated at the quantum level
■ >80% of the water in the pore is quantum
■ Ensures sufficiently accurate chemistry
➢ 40 energy + gradient points to ensure an extremely accurate potential energy surface model
➢ The calculation will use 28,000 GPUs, that is, 100% of Aurora/Frontier, for 15 - 16 h
■ Allows for up to 5% communication overhead
STAGE 2
➢ MPI ranks in the GDDI groups are redistributed so that each group can deal with a dimer
➢ Each GDDI group performs HF and RI-MP2 for a different fragment dimer
➢ Once the calculations on the dimers are completed, the energies are gathered from the GDDI groups
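The two-stage task decomposition above can be sketched in a few lines. This is a minimal model of the bookkeeping only: "groups" here are plain task lists, not real GDDI/MPI process groups, and in the actual runs only a subset of the dimers is treated at the quantum level.

```python
# Sketch of the EFMO two-stage task layout:
# Stage 1 assigns one group per fragment; after redistribution,
# Stage 2 assigns one group per unique fragment pair (dimer).
from itertools import combinations

def efmo_stages(n_fragments):
    # Stage 1: one task per fragment (HF, MakeEFP, RI-MP2).
    stage1 = [("fragment", i) for i in range(1, n_fragments + 1)]
    # Stage 2: one task per dimer, i.e. N(N-1)/2 pairs (HF, RI-MP2).
    stage2 = [("dimer", pair)
              for pair in combinations(range(1, n_fragments + 1), 2)]
    return stage1, stage2

s1, s2 = efmo_stages(217)   # 217 fragments, as in the (H2O)2615 run
print(len(s1), len(s2))     # 217 fragment tasks, 217*216/2 = 23436 dimer tasks
```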
Configuration                                    Wall time (s)   Speedup
Serial w/ 1 core of P9                           342.697         0.04
OpenMP + ESSL dgemm w/ 42 threads on 2 P9        12.231          1.00
OpenMP offloading + nvblas dgemm on 1 V100       1.734           7.05
OpenMP offloading + cublas dgemm on 1 V100       1.983           6.17
OpenMP offloading + cublasXt dgemm on 1 V100     1.728           7.08
OpenACC offloading + cublas dgemm on 1 V100      1.905           6.42
OpenACC offloading + cublasXt dgemm on 1 V100    1.692           7.23
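The speedup column can be reproduced directly from the wall times, with the 42-thread OpenMP + ESSL run (12.231 s) as the baseline; the short labels below are abbreviations of the table's configurations.

```python
# Recompute the speedup column: speedup = baseline_walltime / walltime.
walltimes = {
    "Serial, 1 P9 core": 342.697,
    "OpenMP + ESSL, 42 threads on 2 P9": 12.231,
    "OpenMP offload + nvblas, 1 V100": 1.734,
    "OpenMP offload + cublas, 1 V100": 1.983,
    "OpenMP offload + cublasXt, 1 V100": 1.728,
    "OpenACC offload + cublas, 1 V100": 1.905,
    "OpenACC offload + cublasXt, 1 V100": 1.692,
}
baseline = walltimes["OpenMP + ESSL, 42 threads on 2 P9"]
for name, t in walltimes.items():
    print(f"{name}: {baseline / t:.2f}x")
```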
➢ (H2O)2615 split into 217 fragments
➢ Basis set: 6-31G(d)//cc-pVDZ-RI
➢ 256-768 KNL nodes, split into 256-768 groups (1 node/group)
➢ Each node creates 1 rank + a team of 64 threads
[Figure: wall time (s) vs. number of KNL compute nodes]

Each GDDI group has three possible paths to accelerate HF, MakeEFP, and RI-MP2.
RI-MP2/FMO OpenMP Accelerator Offloading
Average Communication % of EFMO/RHF
# Atoms   # Frags   # Nodes   % of Total Run-Time
304       6         1         4.22
592       11        2         4.26
1141      22        4         4.54
1738      32        8         4.96
➢ Examine four increasing problem sizes using weak scaling (i.e., the number of atoms per node is approximately constant)
➢ Scale from
➢ 1-32 nodes
➢ 304-1738 atoms
➢ Percent communication remains relatively constant at ~5%
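The weak-scaling claim can be checked against the communication table above. The tuples below simply restate the table's rows; "approximately constant" atoms per node is a rough statement, as the ratio drifts somewhat at the largest size.

```python
# Weak-scaling check: atoms per node and communication overhead per run,
# using the (atoms, nodes, comm %) rows from the table above.
runs = [(304, 1, 4.22), (592, 2, 4.26), (1141, 4, 4.54), (1738, 8, 4.96)]

for atoms, nodes, comm_pct in runs:
    print(f"{nodes:2d} nodes: {atoms / nodes:6.1f} atoms/node, {comm_pct}% comm")

# Every run stays under the 5% communication budget cited on the poster.
assert all(pct < 5.0 for _, _, pct in runs)
```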
LibCChem Generalized Fock GPU Scaling
➢ Run on Summit on a system of 150 H2O molecules
➢ 1950 basis functions
➢ Used from 1 up to 9 GPUs, 1 GPU per MPI rank
➢ The scaling with respect to the number of GPUs is excellent (96.9% parallel efficiency)

LibCChem RI-MP2 GPU Scaling
➢ Run on Summit on a system of 150 H2O molecules
➢ 1950 basis functions
➢ Used from 1 up to 9 GPUs, 1 GPU per MPI rank
➢ The scaling with respect to the number of GPUs is nearly ideal (98.1% parallel efficiency)
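Parallel efficiency figures like the 96.9% and 98.1% above follow the standard definition, efficiency = speedup / GPU count, with speedup measured against the 1-GPU wall time. The wall times in the example below are illustrative placeholders, not the measured Summit data.

```python
# Standard strong-scaling parallel efficiency:
#   efficiency = (t_1 / t_n) / n_gpus
def parallel_efficiency(t1, tn, n_gpus):
    return (t1 / tn) / n_gpus

# Illustrative: a run that is 8.829x faster on 9 GPUs gives 98.1% efficiency.
print(f"{parallel_efficiency(100.0, 100.0 / 8.829, 9):.1%}")
```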
➢ We ran our RI-MP2 on Summit on a system of 150 H2O molecules
➢ 3750 primary and 12,600 auxiliary basis functions
➢ Calculations were performed on 11 nodes
➢ 11 MPI processes, each with 84 threads (42 P9 cores at SMT 2) and a variable number of GPUs
[Figure: MSN pore model, with 20 Å and 80 Å dimensions marked]
➢ Speedup is reported as the ratio of wall time to the wall time of the (H2O)2615 calculation using 16,384 threads