
LAMMPS

Performance Benchmark and Profiling

November 2020

• The following research was performed under the HPC Advisory Council activities

– HPC-AI AC – Iris cluster

– Dell – Zenith cluster

• The following was done to provide best practices

– LAMMPS performance overview over Intel based platforms

– Understanding LAMMPS MPI communication patterns

• More info on LAMMPS

– https://lammps.sandia.gov/

LAMMPS

• Large-scale Atomic/Molecular Massively Parallel Simulator

– Classical molecular dynamics code which can model:

– Atomic, Polymeric, Biological, Metallic, Granular, and coarse-grained systems

• LAMMPS-KOKKOS package contains

– Versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library

• LAMMPS runs efficiently in parallel using message-passing techniques

– Developed at Sandia National Laboratories

– An open-source code, distributed under the GNU General Public License (GPL)

• More information on LAMMPS can be found at the LAMMPS web site:

http://lammps.sandia.gov

Cluster Configuration

• HPC-AI AC Cluster Center – Iris cluster

– Dual Socket Intel Gold 6148 CPU @ 2.40GHz

– ConnectX-6 HDR100 InfiniBand

– Quantum Switch HDR InfiniBand

– Memory: 192GB DDR4 2666MHz RDIMMs per node

• Software

– OS: RHEL 7.8

– MLNX_OFED 4.9

– MPI: HPC-X 2.7.0

– LAMMPS: v10-29-2020

– Compiler: Intel 2020.4.304

• Dell Cluster Center – Zenith cluster

– Dual Socket Intel Gold 6248 CPU @ 2.50GHz

– ConnectX-6 HDR100 InfiniBand

– Quantum Switch HDR InfiniBand

– Memory: 192GB DDR4 2666MHz RDIMMs per node

• Software

– OS: RHEL 7.8

– MLNX_OFED 4.9

– MPI: HPC-X 2.7.0

– LAMMPS: v10-29-2020

– Compiler: Intel 2020.4.304

LAMMPS Inputs

• AF_lennard-jones_2.5

– Problem: https://lammps.sandia.gov/bench/in.lj.txt

– region: box block 0 200 0 200 0 200

– neigh_modify: delay 0 every 20 check no

– Iterations: 1000

• EAM

– Problem: https://github.com/lammps/lammps/blob/master/bench/POTENTIALS/in.eam

– region: box block 0 200 0 200 0 200

– neigh_modify: delay 1 every 5 check yes

– thermo 100

• Tersoff

– Problem: https://lammps.sandia.gov/bench/in.tersoff.txt

– region: box block 0 200 0 200 0 200

• Gay-Berne

– Problem: https://lammps.sandia.gov/bench/in.gb.txt

– region: box block 0 320 0 320 0 320

– set type 1 mass 1.5

– set type 1 shape 1 1.5 2

– neigh_modify: delay 1 every 5 check yes

– thermo 100

• Rhodopsin

– Problem: https://github.com/lammps/lammps/blob/master/bench/in.rhodo

– replicate: 1 1 1

– atom_modify map array

• SNAP

– Problem:

– region: box block 0 5 0 8 0 32
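
To give a feel for what the region settings above mean for problem size, here is a minimal sketch that converts a block of unit cells into an atom count. It assumes the region is given in lattice units and fully filled, and it assumes the lattice styles of the stock benchmark inputs (fcc for Lennard-Jones and EAM, diamond for Tersoff). The slides do not state the lattice styles, so treat them as assumptions; `atom_count` is a hypothetical helper, not a LAMMPS function.

```python
# Atoms per conventional unit cell for a few common lattice styles.
ATOMS_PER_CELL = {"sc": 1, "bcc": 2, "fcc": 4, "diamond": 8}


def atom_count(lattice: str, nx: int, ny: int, nz: int) -> int:
    """Atoms created when a block of nx * ny * nz unit cells is fully filled."""
    return ATOMS_PER_CELL[lattice] * nx * ny * nz


# Assumption: lattice styles taken from the stock bench inputs, not from the slides.
lj_atoms = atom_count("fcc", 200, 200, 200)
tersoff_atoms = atom_count("diamond", 200, 200, 200)

print(f"Lennard-Jones / EAM (fcc, 200^3 cells): {lj_atoms:,} atoms")
print(f"Tersoff (diamond, 200^3 cells): {tersoff_atoms:,} atoms")
```

Under these assumptions, the 200 x 200 x 200 regions correspond to tens of millions of atoms, which is why they are large enough to keep many nodes busy.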

LAMMPS Performance – Scalability

[Chart: LAMMPS scalability by input; higher is better. Scaling labels recovered from the chart: 100%, 100%, 92%, 92%, 100%, 97%]

* Bigger problem size
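
The slide does not say how the scalability percentages were computed. One common definition, assumed here, is parallel efficiency relative to the smallest node count: measured speedup divided by ideal (linear) speedup. The sketch below uses made-up performance numbers purely to show the arithmetic; they are not measurements from this study.

```python
def parallel_efficiency(base_nodes: int, base_perf: float,
                        nodes: int, perf: float) -> float:
    """Measured speedup over the baseline divided by the ideal linear speedup."""
    speedup = perf / base_perf
    ideal = nodes / base_nodes
    return speedup / ideal


# Hypothetical figures (e.g. timesteps per second), for illustration only.
eff = parallel_efficiency(base_nodes=1, base_perf=10.0, nodes=32, perf=294.4)
print(f"Parallel efficiency on 32 nodes: {eff:.0%}")
```

With this definition, 100% means perfectly linear scaling, and values below 100% indicate time lost to communication, load imbalance, or serial work.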

LAMMPS Performance – AVX2/AVX512

[Chart: AVX2 vs. AVX512 performance by input; higher is better]

LAMMPS Performance – CPU

[Chart: Intel Gold 6148 vs. Intel Gold 6248 performance by input; higher is better]

LAMMPS MPI Profiles on 32 nodes

• Lennard-Jones 2.5: 30% MPI

• EAM: 15% MPI

• Gay-Berne: 12% MPI

• Rhodopsin: 14% MPI

• SNAP: 4% MPI

• Tersoff: 6% MPI
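
One way to read the MPI percentages is as a ceiling on what communication tuning alone could gain. The sketch below assumes each percentage is the share of wall-clock time spent in MPI with no overlap with computation (the slides do not state this), so removing all MPI time would speed a run up by at most 1 / (1 - share). `max_speedup_without_mpi` is a hypothetical helper introduced for illustration.

```python
def max_speedup_without_mpi(mpi_fraction: float) -> float:
    """Upper bound on speedup if all MPI time vanished (no compute overlap assumed)."""
    if not 0.0 <= mpi_fraction < 1.0:
        raise ValueError("mpi_fraction must be in [0, 1)")
    return 1.0 / (1.0 - mpi_fraction)


# MPI share of run time on 32 nodes, as reported on the slide.
MPI_SHARE = {
    "Lennard-Jones 2.5": 0.30,
    "EAM": 0.15,
    "Gay-Berne": 0.12,
    "Rhodopsin": 0.14,
    "SNAP": 0.04,
    "Tersoff": 0.06,
}

# List the inputs from most to least communication-bound.
for name, share in sorted(MPI_SHARE.items(), key=lambda kv: kv[1], reverse=True):
    bound = max_speedup_without_mpi(share)
    print(f"{name}: {share:.0%} MPI, at most {bound:.2f}x from communication tuning")
```

Read this way, inputs with a small MPI share such as SNAP and Tersoff are compute-bound, so CPU and vectorization changes matter more for them than interconnect tuning.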


Summary

• LAMMPS scales well when the problem size suits the CPU architecture and cluster size. With InfiniBand, scalability is above 92% for the demonstrated cases

• AVX512 improves performance for five of the six input benchmarks, by up to 2x

• Intel Gold 6248 @ 2.5GHz (40 cores per node) demonstrated up to 38% higher performance than Intel Gold 6148 @ 2.4GHz (40 cores per node)

• MPI profiles show up to 30% communication time, mostly in point-to-point and MPI_Allreduce operations. The Rhodopsin input also shows MPI_Alltoallv

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.

Thank You
