Massively Parallel Micromagnetic FEM Calculations with Graphical Processing Units (GPUs)



Attila Kákay*, Elmar Westphal, and Riccardo Hertel*
IFF-Scientific IT-Systems, *IFF-9, Institut für Festkörperforschung, Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
[email protected]

REFERENCES

[1] D.R. Fredkin and T.R. Koehler, J. Appl. Phys. 63, 3385 (1988)
[2] S. Boerm and L. Grasedyck, HLib - A library for H- and H²-matrices, 1999, http://www.hlib.org/
[3] A. Kakay, S. Boerm and R. Hertel, "High resolution large-scale micromagnetic simulations with hierarchical matrices", MMM2008, Austin, GS-04
[4] Numerical Recipes in C, http://www.nr.com/
[5] Sundials suite (CVODE), https://computation.llnl.gov/casc/sundials/main.html
[6] CUDA Programming Guide, http://www.nvidia.com
[7] N. Bell and M. Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA", NVIDIA Technical Report NVR-2008-004, December 2008
[8] Parallel reduction problems (norms and dot-products), http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf



BENCHMARK RESULTS AND COMPARISON OF CALCULATION SPEEDS: 1 GPU vs. 12 CPU cores

TEST SETUP

Simulations have been performed on µMAG standard problem #4 with:
- various discretizations (varying number of discretization nodes)
- different complexity (regular and irregular meshes)

We benchmarked the simulation times on two different systems:
- CPU system: dual hex-core Intel X5670, 2.93 GHz, 12 MB L3 cache
- GPU system: nVIDIA GTX480 in an Intel Core 2 Quad 2.4 GHz based PC

- The speed-up for the solver is calculated from the average duration per solver iteration for the calculation of U1.
- The speed-up for the ODE-integrator is calculated from the overall runtime of the integrator for 100 ps of simulation time.
- All data (except the elements of the dense matrix) are stored and processed in double precision.
- Effective calculation speeds are calculated using the original matrix sizes, disregarding zero padding (see the small example below).
- Since the main part of the calculations is performed by the GPU, multi-GPU systems can perform multiple jobs without significant speed loss. This was confirmed in different runs.
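As an illustration of how such effective rates can be obtained from the original (unpadded) matrix size, the sketch below assumes two floating-point operations per stored non-zero and a simple byte count per non-zero (8-byte value, 4-byte column index, 8-byte vector read, plus the result vector). These counting conventions are assumptions made for the example, not necessarily the exact accounting behind the plots.

```cuda
/* Illustrative only: effective GFLOPS and GBytes/s of one sparse
 * matrix-vector product, computed from the ORIGINAL number of
 * non-zeros (zero padding not counted).  The per-element byte count
 * (8 B value + 4 B column index + 8 B vector read, plus 8 B per row
 * for the result) is an assumption made for this example. */
#include <stdio.h>

static void effective_rates(long nnz, long rows, double seconds)
{
    double flops = 2.0 * (double)nnz;                /* multiply + add */
    double bytes = (8.0 + 4.0 + 8.0) * (double)nnz   /* value, col, x  */
                 + 8.0 * (double)rows;               /* result vector  */
    printf("%.2f GFLOPS, %.2f GBytes/s\n",
           flops / seconds * 1e-9, bytes / seconds * 1e-9);
}

int main(void)
{
    /* made-up numbers, purely for demonstration */
    effective_rates(3500000L, 500000L, 1.4e-3);
    return 0;
}
```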

SAMPLE PROBLEM

Permalloy thin film element, µMAG standard problem #4 (M. Donahue et al., NIST, Maryland, USA).
Applied field: µ0H = 36 mT, φ = 190°.

FEM - BEM FOR MAGNETOSTATIC FIELDS WITH SCALAR POTENTIAL

The magnetostatic field can be calculated as a gradient field of a scalar potential:

$\vec{H}_s = -\nabla U$, with $\Delta U = \begin{cases} -\nabla \cdot \vec{M} & \text{inside the magnetic volume,} \\ 0 & \text{outside of the magnetic volume.} \end{cases}$

The scalar potential can be computed from:

$U(\vec{x}) = \int_V \vec{M}(\vec{y}) \cdot \nabla G(\vec{x},\vec{y}) \, \mathrm{d}^3 y$

The potential is split into two parts, $U = U_1 + U_2$, where $U_1$ is the solution of the inhomogeneous Neumann problem, while $U_2$ satisfies Laplace's equation with Dirichlet boundary conditions.

The calculation of the magnetostatic potential is done in three steps [1]:
1. An iterative solution of a sparse linear system for $U_1$ (linbcg).
2. A dense-matrix or hierarchical-matrix-vector multiplication, $U_2^i = D_{ij} U_1^j$, to obtain the values of $U_2$ on the boundary of the magnetic region.
3. An iterative solution of sparse linear systems for $U_2$ within the magnetic region (linbcg).
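Step 2 is, in its simplest form, a dense matrix-vector product over the boundary nodes. The sketch below is a naive one-thread-per-row CUDA illustration of that step, not the blocked, preprocessed kernel that is actually benchmarked; following the precision note in the test setup above, the dense matrix is kept in single precision while the vectors stay in double precision.

```cuda
/* Naive illustration of step 2: u2_i = sum_j D_ij * u1_j over the
 * boundary nodes.  The dense matrix is single precision, the vectors
 * are double precision.  One thread computes one row. */
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void dense_boundary_mvm(const float *D, const double *u1,
                                   double *u2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double acc = 0.0;
    for (int j = 0; j < n; ++j)        /* row i of D times u1 */
        acc += (double)D[i * n + j] * u1[j];
    u2[i] = acc;
}

int main()
{
    const int n = 1024;                /* demo size */
    std::vector<float>  hD(n * n, 1.0f / n);
    std::vector<double> hu1(n, 1.0), hu2(n, 0.0);
    float *dD; double *du1, *du2;
    cudaMalloc(&dD,  sizeof(float)  * n * n);
    cudaMalloc(&du1, sizeof(double) * n);
    cudaMalloc(&du2, sizeof(double) * n);
    cudaMemcpy(dD,  hD.data(),  sizeof(float)  * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(du1, hu1.data(), sizeof(double) * n,     cudaMemcpyHostToDevice);
    dense_boundary_mvm<<<(n + 255) / 256, 256>>>(dD, du1, du2, n);
    cudaMemcpy(hu2.data(), du2, sizeof(double) * n, cudaMemcpyDeviceToHost);
    printf("u2[0] = %.3f (expected 1.000)\n", hu2[0]);
    cudaFree(dD); cudaFree(du1); cudaFree(du2);
    return 0;
}
```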

TetraMag - FINITE ELEMENT MICROMAGNETIC SIMULATOR

General-purpose finite-element algorithm to simulate the magnetization dynamics and the domains in ferromagnetic nanostructures:
- fully three-dimensional
- high geometric flexibility (hybrid FEM/BEM scheme)
- magnetostatic coupling (e.g., arrays of nanomagnets)
- easily portable (source code written in standard C)
- OpenMP parallelized
- ported to GPU architecture

EQUATION OF MOTION

Landau-Lifshitz-Gilbert equation:

$\frac{\mathrm{d}\vec{m}}{\mathrm{d}t} = -\gamma\, \vec{m} \times \vec{H}_{\mathrm{eff}} + \alpha\, \vec{m} \times \frac{\mathrm{d}\vec{m}}{\mathrm{d}t}$

Effective field:

$\vec{H}_{\mathrm{eff}} = -\frac{1}{\mu_0 M_S} \frac{\partial e}{\partial \vec{m}} = \frac{2A}{\mu_0 M_S}\, \Delta\vec{m} + \vec{H}_{\mathrm{ext}} + \vec{H}_{\mathrm{stray}} - \frac{1}{\mu_0 M_S} \frac{\partial e_K}{\partial \vec{m}}$

e: local energy density (Zeeman, exchange, magnetostatic and anisotropy energy density); $e_K$ denotes the anisotropy contribution.
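For evaluation purposes the implicit Gilbert form above is equivalent to the explicit Landau-Lifshitz form $\frac{\mathrm{d}\vec{m}}{\mathrm{d}t} = -\frac{\gamma}{1+\alpha^2}\left[\vec{m}\times\vec{H}_{\mathrm{eff}} + \alpha\,\vec{m}\times(\vec{m}\times\vec{H}_{\mathrm{eff}})\right]$. A minimal per-node CUDA sketch of evaluating this right-hand side is shown below; it is an illustration only, not the TetraMag implementation, which hands the right-hand side to CVODE (see the time-integration part further below).

```cuda
/* Illustrative per-node evaluation of the explicit Landau-Lifshitz
 * form of the equation of motion; m and H_eff are stored as
 * (x,y,z) triples per node. */
#include <cuda_runtime.h>

__device__ void cross3(const double a[3], const double b[3], double c[3])
{
    c[0] = a[1] * b[2] - a[2] * b[1];
    c[1] = a[2] * b[0] - a[0] * b[2];
    c[2] = a[0] * b[1] - a[1] * b[0];
}

__global__ void llg_rhs(const double *m, const double *heff, double *dmdt,
                        int n_nodes, double gamma, double alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_nodes) return;

    const double *mi = &m[3 * i];
    const double *hi = &heff[3 * i];
    double mxh[3], mxmxh[3];
    cross3(mi, hi, mxh);                /* m x H_eff       */
    cross3(mi, mxh, mxmxh);             /* m x (m x H_eff) */

    double pref = -gamma / (1.0 + alpha * alpha);
    for (int k = 0; k < 3; ++k)         /* dm/dt = pref*(m x H + alpha*m x (m x H)) */
        dmdt[3 * i + k] = pref * (mxh[k] + alpha * mxmxh[k]);
}
```

It would be launched with one thread per node, e.g. llg_rhs<<<(n_nodes + 255) / 256, 256>>>(m, heff, dmdt, n_nodes, gamma, alpha), after the effective field has been assembled.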

ADAPTATION OF THE MICROMAGNETIC SIMULATOR TetraMag TO CUDA


The adaptation of the existing code consists of three major parts:

Part 1: Solver (linbcg)
- Called for the calculation of U1 and U2
- Matrix is preprocessed for coalesced memory access
- Blocks of almost constant length (no need for hybrid schemes [7])
- Reorder elements to optimize read-cycles/caching
- Overall flow control stays on CPU
- Zero-memcpy implementation for passing results
- Selection of fastest kernels based on problem structure

Part 2: Dense-Matrix-Vector Multiplication
- Called for the calculation of U2
- Performed on GPU for up to ~40k boundary nodes
- For more boundary nodes: hierarchical-matrix-vector multiplication on CPU (GPU-conversion: work in progress)
- Preprocessing similar to sparse matrices
- Lines are zero-packed and sorted by length to reduce zero-padding in blocks

Part 3: Time Integration
- Performed by CVODE from the Sundials package [5]
- Includes field calculations as well as vector operations using the "NVector" structure supplied with Sundials
- "NVector" and field calculations were ported to CUDA (see the sketch below)
- Comparisons are done against an OpenMP enhanced version of the originally serial "NVector"
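Porting the Sundials "NVector" structure essentially means supplying CUDA kernels for the vector operations CVODE calls during time integration. The sketch below shows one such operation, the linear sum z = a*x + b*y (the analogue of N_VLinearSum); it illustrates the idea only, not the actual ported code, and a real port has to provide the complete NVector operation set.

```cuda
/* Illustrative CUDA kernel for one NVector operation used by CVODE
 * during time integration: the linear sum z = a*x + b*y.
 * A grid-stride loop lets a fixed launch configuration handle any
 * vector length. */
#include <cuda_runtime.h>

__global__ void nvec_linear_sum(double a, const double *x,
                                double b, const double *y,
                                double *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        z[i] = a * x[i] + b * y[i];
}
```

Keeping the vectors resident in device memory between such calls avoids host-device transfers, in line with the zero-memcpy handling of results mentioned under Part 1.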


Preprocessing the Sparse Matrix:
For the regular mesh, the access to the vector elements during the matrix-vector multiplication follows certain patterns with strong data locality, where many reads can be combined. With the matrix generated from an irregular mesh, the access is seemingly random, a disadvantage especially for the GPU due to its small caches. The preprocessing is illustrated in the figure below, followed by a kernel sketch.


1.: Original Sparse Matrix (6x6 example)
2.: Sparse Matrix transposed
3.: Sparse Matrix divided into warp-size-wide blocks (here warp-size = 4 for demonstrative purposes); short rows are filled up with zeros (zero-padding)
4.: Blocks are then stored sequentially in device memory and referenced. Multiple warp-sized blocks are combined to the thread-blocks of the actual operations. (Note: reordering step not shown here.)

Storage arrays of the padded example matrix:

index        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
value        5  3  7  4  4  1  2  9  8  0  4  0  1  6  0  0  3  2  0  0
column       1  2  3  4  3  5  2  1  6  1  6  1  5  6  1  1  3  2  1  1
block start  0 12 20
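A matrix-vector kernel for a layout like the one illustrated above can be written so that the threads of a warp read consecutive value/column entries (coalesced access). The sketch below assumes a sliced-ELL-style storage: rows are grouped in warp-sized blocks, each block is padded to a common row length, entries are stored lane-major, and block_start holds one offset per block plus a final sentinel. It is a simplified illustration, not the actual TetraMag kernel.

```cuda
/* Sketch of a warp-blocked, zero-padded sparse matrix-vector product
 * (y = A*x).  Rows are grouped in blocks of WARP rows; within a block
 * all rows are padded to the same length and stored lane-major, so
 * consecutive threads load consecutive val/col entries.
 * Assumptions: block_start has n_blocks+1 entries (last one is a
 * sentinel), and padded entries carry value 0 with a valid column. */
#include <cuda_runtime.h>

#define WARP 32

__global__ void spmv_warp_blocked(const double *val, const int *col,
                                  const int *block_start, const double *x,
                                  double *y, int n_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    int blk   = row / WARP;                  /* warp-wide block index   */
    int lane  = row % WARP;                  /* position inside block   */
    int start = block_start[blk];
    int len   = (block_start[blk + 1] - start) / WARP;  /* padded length */

    double acc = 0.0;
    for (int k = 0; k < len; ++k) {
        int idx = start + k * WARP + lane;   /* lane-major -> coalesced */
        acc += val[idx] * x[col[idx]];       /* padded entries add 0    */
    }
    y[row] = acc;
}
```

It would be launched with a thread-block size that is a multiple of WARP, e.g. spmv_warp_blocked<<<(n_rows + 127) / 128, 128>>>(...). In the scheme described above, the elements are additionally reordered to optimize read-cycles/caching; that reordering step is not shown in the figure or in this sketch.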

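Besides the matrix-vector products, the linbcg iterations need norms and dot products, which map to parallel reductions on the GPU [8]. A minimal double-precision sketch is shown below: each thread block accumulates a partial sum in shared memory and writes one value, and the partial sums are then combined on the host or in a second pass. Again, this is an illustration of the standard technique, not the production kernel.

```cuda
/* Block-wise dot product for the solver's norms and dot products [8].
 * Each thread block reduces its contributions in shared memory and
 * writes one partial sum; blockDim.x must be a power of two.
 * Launch: dot_partial<<<blocks, threads, threads*sizeof(double)>>>(...) */
#include <cuda_runtime.h>

__global__ void dot_partial(const double *a, const double *b,
                            double *partial, int n)
{
    extern __shared__ double s[];
    int tid    = threadIdx.x;
    int i      = blockIdx.x * blockDim.x + tid;
    int stride = gridDim.x * blockDim.x;

    double acc = 0.0;
    for (; i < n; i += stride)                 /* grid-stride loop */
        acc += a[i] * b[i];
    s[tid] = acc;
    __syncthreads();

    for (int off = blockDim.x / 2; off > 0; off >>= 1) {
        if (tid < off) s[tid] += s[tid + off]; /* tree reduction   */
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];  /* one value per block */
}
```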

- The computation speed of the GTX480 equals or exceeds that of the 12-core setup already at relatively small problem sizes.
- The initial advantage of the CPU's large caches wears off as the problem size outgrows it.
- With increasing problem size, the larger overhead of the GPU calculations becomes less significant.
- The GPU's speed advantage for regular problems increases for all problem sizes that fit into the available memory.
- For large problems, the solver operates near the theoretical memory bandwidth limit of the GPU.
- Due to caching on the Fermi-based architecture, the memory throughput of the dense-matrix-vector multiplication exceeds the posted bandwidth limit of the card.

The mesh type directly defines the structure of the sparse matrices, and the performance of the calculations on both the CPU and the GPU strongly depends on whether the mesh is regular or irregular.

INTRODUCTION

Micromagnetic simulation is a powerful numerical tool for the study of magnetism in mesoscopic samples.

Complex magnetic structures - multiple length scales
Micromagnetic structures can display a high complexity as a result of interacting length scales (vortices, domain walls, domains). This complexity represents a challenge for micromagnetic simulations.

[Figure: complex micromagnetic structures, panels a-d, scale bar 1 mm]

What we obtain:
- Complex static and dynamic magnetic structures (e.g. vortices, spin waves, ...)
- Solutions without any simplifying assumption on the magnetic structure
- Possibility of direct comparison with experimental results

[Figure: Experiment vs. Simulation, Mx and My contrast of a Permalloy element (scale bars 5 µm and 500 nm). Complex features involving disparate length scales; cross-tie domain wall: series of vortex - antivortex.]

What we do:
- Fully three-dimensional finite element modelling - FEM - (tetrahedral discretization cells)
- Calculate effective fields, energy densities etc. at each discretization point (node)
- Calculate the current-density distributions and Oersted fields in general geometries
- Solve Poisson's equation at each time step
- Integrate the Landau-Lifshitz-Gilbert differential equation
- Solve large sets of coupled integro-differential equations

CONCLUSIONS

- GPU systems are very efficient for finite element micromagnetic simulations. The simulation time can be considerably reduced.
- The gain in computation speed generally increases with increasing numbers of discretization nodes. It exceeds the performance of the fastest available PC-based systems already at small problem sizes.
- The GPU implementation allows us to systematically study the magnetization dynamics in large magnetic samples with reasonable computation time.
- Using more than one GPU per computer without performance penalty significantly increases cost efficiency.
- Bigger and faster device memory as well as bigger caches would be desirable to further utilize the GPU's high calculation throughput.

[Benchmark figures (curves: GPU reg., CPU reg., GPU irreg., CPU irreg.)]
- Effective Computation Speed, Solver (U1): GFLOPS (0-25) and GBytes/s (0-180) vs. degrees of freedom (0-500,000)
- Speed-Up, Solver (U1): speed-up factor (0-7) vs. degrees of freedom (0-500,000)
- Speed-Up, ODE: speed-up factor (0-7) vs. degrees of freedom (0-500,000)
- Eff. Speed, Dense Matrix: GFLOPS (0-40) and GBytes/s (0-240) vs. boundary nodes (5,000-25,000)
- Speed-Up, Dense Matrix: speed-up factor (0-4) vs. boundary nodes (5,000-25,000)