
AACE: Applications

R. Glenn Brook
Director, Application Acceleration Center of Excellence
National Institute for Computational Sciences
[email protected]

Ryan C. Hulguin
Computational Science Associate
National Institute for Computational Sciences
[email protected]

Codes Investigated by AACE on the Intel® Xeon Phi™ Coprocessor

• Science codes ported or optimized through the Beacon Project
  • Chemistry – NWChem (ported)
  • Astrophysics – Enzo (ported and optimized)
  • Magnetospheric Physics – H3D (ported and optimized)
• Other codes of interest
  • Electronic Structures – Elk FP-LAPW (ported)
  • Computational Fluid Dynamics (CFD) – Euler and BGK Boltzmann Solver (ported and optimized)
  • Linear Algebra routines – SGEMM and DGEMM (ported)

Enzo  

• Community code for computational astrophysics and cosmology
• More than 1 million lines of code
• Uses powerful adaptive mesh refinement
• Highly vectorized with a hybrid MPI + OpenMP programming model
• Utilizes HDF5 and HYPRE libraries
• Multiple MPI tasks per coprocessor and many threads per MPI task
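To make the hybrid MPI + OpenMP model above concrete, the following minimal C skeleton (an illustration only, not Enzo source) shows the structure of a code that runs several MPI ranks per coprocessor with an OpenMP thread team inside each rank:

/* Minimal hybrid MPI + OpenMP skeleton: work is first decomposed across
 * MPI ranks, then across the OpenMP threads inside each rank. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request the thread-support level needed when OpenMP threads exist
     * inside MPI ranks (here only the master thread makes MPI calls). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}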

Enzo was ported and optimized for the Intel® Xeon Phi™ Coprocessor by Dr. Robert Harkness, [email protected]

 

Preliminary Scaling Study: Native

• ENZO-C
• 128^3 mesh (non-AMR)
• pure MPI
• native mode

[Chart: Speedup vs. Number of Threads (1–32), Observed vs. Ideal]

Results were generated on the Intel® Knights Ferry software development platform

Hybrid3d (H3D)

• Provides breakthrough kinetic simulations of the Earth's magnetosphere
• Models the complex solar wind–magnetosphere interaction using both electron fluid and kinetic ions
  • Unlike magnetohydrodynamics (MHD), which completely ignores ion kinetic effects
• Contains the following HPC innovations:
  1. multi-zone (asynchronous) algorithm
  2. dynamic load balancing
  3. code adaptation and optimization to a large number of cores

Hybrid3d (H3D) was provided for porting to the Intel® Xeon Phi™ Coprocessor by Dr. Homa Karimabadi, [email protected]

Hybrid3d (H3D) Performance

[Chart: H3D Speedup on the Intel® Xeon Phi™ Coprocessor (codename Knights Corner); Relative Speedup vs. Number of MPI Processes (1–64), Observed vs. Ideal Speedup]

Results were generated on a Pre-Production Intel® Xeon Phi™ coprocessor with B0 HW and Beta SW: 61 cores @ 1.09 GHz and 8 GB of GDDR5 RAM @ 2.75 GHz

Optimizations were provided by Intel senior software engineer Rob Van der Wijngaart.

Elk FP-LAPW
http://elk.sourceforge.net/

A detailed understanding of electronic, magnetic, vibrational, and optical properties is paramount to extracting functionality from advanced materials. Elk is a software package that enables the study of these properties from first principles, employing electronic structure techniques such as density functional theory, Hartree-Fock theory, and Green's function theory.

[Figure: Antiferromagnetic structure of Sr2CuO3]

Elk was ported to the Intel® Xeon Phi™ Coprocessor by W. Scott Thornton, [email protected]

• Fortran 90
• Efficient hybrid MPI + OpenMP parallelization

Elk FP-LAPW Performance

Elk uses master-slave parallelism in which orbitals for different momenta are computed semi-independently. In this test, 27 and 64 different crystal momenta were used; the test case was bulk silicon.
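The master-slave distribution of crystal momenta can be pictured with the schematic C sketch below (Elk itself is Fortran 90; the k-point count and the function compute_orbitals_for_kpoint are illustrative placeholders, not Elk code): a master rank hands out one k-point at a time, and workers request more work as they finish.

/* Schematic master-worker distribution of k-points across MPI ranks. */
#include <stdio.h>
#include <mpi.h>

#define NKPT 27   /* number of crystal momenta in the smaller test case */

static void compute_orbitals_for_kpoint(int k) { (void)k; /* real work goes here */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand out k-points one at a time, then a stop signal (-1). */
        int next = 0, finished = 0;
        MPI_Status st;
        while (finished < size - 1) {
            int request;
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int task = (next < NKPT) ? next++ : -1;
            if (task < 0) finished++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        }
    } else {
        /* Worker: ask for a k-point, compute its orbitals, repeat until told to stop. */
        for (;;) {
            int request = 0, task;
            MPI_Send(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0) break;
            compute_orbitals_for_kpoint(task);
        }
    }

    MPI_Finalize();
    return 0;
}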

Results were generated on a Pre-Production Intel® Xeon Phi™ coprocessor with A0 HW and Beta SW: 52 cores @ 1.00 GHz and 8 GB of GDDR5 RAM @ 2.25 GHz

Computational Fluid Dynamics (CFD)

• 2 CFD solvers were developed in house at NICS
  • 1st solver is based on the Euler equations
  • 2nd solver is based on model Boltzmann equations

[Figure: Unsteady solution of a Sod shock using the Euler equations]
[Figure: Steady-state solution of a Couette flow using the Boltzmann equation with BGK collision approximation]
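For reference, the BGK collision approximation replaces the full Boltzmann collision operator with relaxation toward a local equilibrium distribution on a single timescale; in standard notation (assumed here, not taken from the slides):

\[
  \frac{\partial f}{\partial t} + \mathbf{v}\cdot\nabla_{\mathbf{x}} f
  = \frac{1}{\tau}\left(f^{\mathrm{eq}} - f\right)
\]

where f is the particle distribution function, f^eq the local equilibrium distribution, and τ the relaxation time.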

The above CFD solvers were developed for the Intel® Xeon Phi™ Coprocessor by Ryan C. Hulguin, [email protected]

Impact of Various Optimizations on the Model Boltzmann Equation Solver

• The Model Boltzmann Equation solver was optimized by Intel software engineer Rob Van der Wijngaart
• He took a baseline solver in which all loops but one were vectorized, and applied the following optimizations (sketched in code after this list) to get the most performance out of the Intel® Xeon Phi™ Coprocessor (codename Knights Corner)

• Set I — Loop Vectorization
  • Stack variable pulled out of the loop
  • Class member turned into a regular structure
• Set II — Data Access
  • Arrays linearized using macros
  • Align data for more efficient access
• Set III — Parallel Overhead
  • Reduce the number of parallel sections
• Set IV — Dependency
  • Remove reduction from computational loop by saving value into a private variable
• Set V — Precision
  • Use medium precision for math function calls (-fimf-precision=medium)
• Set VI — Precision
  • Use single precision constants and intrinsics
• Set VII — Compiler Hints
  • Use #pragma SIMD instead of #pragma IVDEP
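The C sketch below (an illustration on an assumed stencil-style kernel, not the actual solver source) shows how several of these patterns look in code: the linearizing macro and aligned allocation from Set II, a single enclosing parallel region for Set III, the thread-private accumulator from Set IV, and the SIMD hint from Set VII; the Set V flag appears in the build line in the comment.

/* Illustrative optimization patterns for a simple stencil-style kernel.
 * Possible build line with the Intel compiler of that period (exact flags
 * may differ by version): icc -O3 -openmp -fimf-precision=medium -mmic sketch.c */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Set II: linearize a 2-D array with a macro so accesses are unit-stride
 * on one flat, contiguous buffer. */
#define IDX(i, j, nj) ((i) * (nj) + (j))

int main(void)
{
    const int ni = 512, nj = 512;
    double *f, *fnew;

    /* Set II: 64-byte alignment to match the coprocessor's 512-bit vectors. */
    posix_memalign((void **)&f,    64, (size_t)ni * nj * sizeof(double));
    posix_memalign((void **)&fnew, 64, (size_t)ni * nj * sizeof(double));
    for (int k = 0; k < ni * nj; k++) { f[k] = 1.0; fnew[k] = 1.0; }

    double residual = 0.0;

    /* Set III: one parallel region around the work instead of many short ones. */
    #pragma omp parallel
    {
        /* Set IV: accumulate into a thread-private variable rather than
         * carrying a reduction inside the innermost computational loop. */
        double my_res = 0.0;

        #pragma omp for
        for (int i = 1; i < ni - 1; i++) {
            /* Set VII: assert the inner loop is safe to vectorize
             * (Intel-specific pragma; ignored by other compilers). */
            #pragma simd
            for (int j = 1; j < nj - 1; j++) {
                double v = 0.25 * (f[IDX(i - 1, j, nj)] + f[IDX(i + 1, j, nj)]
                                 + f[IDX(i, j - 1, nj)] + f[IDX(i, j + 1, nj)]);
                my_res += (v - f[IDX(i, j, nj)]) * (v - f[IDX(i, j, nj)]);
                fnew[IDX(i, j, nj)] = v;
            }
        }

        #pragma omp atomic
        residual += my_res;
    }

    printf("residual = %g\n", residual);
    free(f);
    free(fnew);
    return 0;
}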

Optimization Results from the Model Boltzmann Equation Solver

[Chart: Relative Speedup for the balanced and scatter thread-placement settings, by optimization set (Loop Vectorization, ...)]

Results were generated on a Pre-Production Intel® Xeon Phi™ coprocessor with B0 HW and Beta SW: 61 cores @ 1.09 GHz and 8 GB of GDDR5 RAM @ 2.75 GHz

Model Boltzmann Equation Solver Performance

Results were generated on a Pre-Production Intel® Xeon Phi™ coprocessor with B0 HW and Beta SW: 61 cores @ 1.09 GHz and 8 GB of GDDR5 RAM @ 2.75 GHz

[Chart: Relative Speedup of two 8-core 3.5 GHz Intel® Xeon E5-2670 Processors Versus an Intel® Xeon Phi™ Coprocessor; Relative Speedup vs. Number of OpenMP Threads (1–256). Series: Dual Intel® Xeon E5-2680 - Compiler Hints; Intel Xeon Phi - Precision II - Balanced; Intel Xeon Phi - Compiler Hints - Balanced; Intel Xeon Phi - Precision II - Scatter; Intel Xeon Phi - Compiler Hints - Scatter]

Porting to the Intel® Xeon Phi™ Coprocessor

• No major code rewrites were needed to start running on an Intel® Xeon Phi™ coprocessor
• The previous applications were run in native mode and simply required a recompile using the -mmic flag
• Parallelism is achieved using OpenMP, MPI, or both
• The transition from the Intel® Xeon Phi™ software development platform (codename Knights Ferry) to the Intel® Xeon Phi™ coprocessor (codename Knights Corner) is seamless.
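As a concrete sketch of the native-mode workflow (the file name and exact commands are illustrative assumptions, and runtime library paths on the card are omitted), a small OpenMP test program can be cross-compiled with -mmic and run directly on the coprocessor:

/* hello_mic.c: trivial native-mode check of the OpenMP runtime.
 * Illustrative build and run sequence with the Intel compiler of that period:
 *   icc -mmic -openmp hello_mic.c -o hello_mic
 *   scp hello_mic mic0:~/  then run it on the card, e.g. ssh mic0 ./hello_mic
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}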

Custom SGEMM and DGEMM Routines for the Intel® Xeon Phi™ Coprocessor

• Custom General Matrix-Matrix Multiply routines using single and double precision (SGEMM and DGEMM, respectively) were developed for the Intel® Xeon Phi™ coprocessor.
• Square matrix sizes were used (m = n = k).
• Intel® Xeon Phi™ coprocessor results are run with 240 threads and compared against Intel® Xeon® E5-2670 processors.
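For readers unfamiliar with the operation, a GEMM computes C = alpha*A*B + beta*C. The unoptimized reference below (a sketch only; the custom coprocessor routines are not reproduced here) shows the double-precision case for square, row-major matrices:

/* Reference (naive) DGEMM for square matrices: C = alpha*A*B + beta*C.
 * Row-major storage; m = n = k as in the benchmark above. */
#include <omp.h>

void dgemm_ref(int n, double alpha, const double *A, const double *B,
               double beta, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}

A tuned version would add cache and register blocking plus aligned, vector-friendly data access, which is where custom routines earn their performance.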

The above SGEMM and DGEMM routines were developed for the Intel® Xeon Phi™ Coprocessor by Jonathan Peyton, [email protected]

Custom SGEMM Performance Results

Custom DGEMM Performance Results

Contact Information

R. Glenn Brook, Ph.D.
Director, Application Acceleration Center of Excellence
National Institute for Computational Sciences
[email protected]