
Introducing ONETEP

Part II - Efficient implementation of a parallel code

Chris-Kriton Skylaris

Peter D. Haynes

Arash A. Mostofi

Mike C. Payne

Electronic Structure Discussion Group

Theory of Condensed Matter, Cavendish Laboratory

Cambridge, 19 May 2004


ONETEP: A density-matrix linear-scaling DFT method

Non-orthogonal Generalised Wannier Functions (NGWFs) and molecular orbitals (MOs)
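As a reminder of the formalism (standard ONETEP background, not taken from this slide), the density matrix is expressed directly in terms of the NGWFs and the density kernel $K$, so molecular orbitals never need to be formed explicitly:

$$\rho(\mathbf{r},\mathbf{r}') \;=\; \sum_{\alpha\beta} \phi_\alpha(\mathbf{r})\, K^{\alpha\beta}\, \phi_\beta^{*}(\mathbf{r}')$$

Both the NGWFs $\phi_\alpha$ and the kernel $K^{\alpha\beta}$ are optimised; linear scaling follows from truncating $K$ and localising the $\phi_\alpha$.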


PSINC basis set (=plane waves) for the NGWFs

[Figure: an NGWF φα localised within a sphere inside the simulation cell, sampled on the psinc grid.]
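For context (a standard property of the psinc basis, not spelled out on the slide), each NGWF is expanded in psinc functions $D_i(\mathbf{r})$ centred on the grid points that lie inside its localisation sphere,

$$\phi_\alpha(\mathbf{r}) \;=\; \sum_{i \,\in\, \text{sphere of }\alpha} c_{i\alpha}\, D_i(\mathbf{r}),$$

and since the psincs are related to plane waves by a unitary transformation, this is equivalent to a plane-wave basis.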


DFT is always computationally demanding – the ONETEP O(N) scheme should take full advantage of parallel computers

ONETEP two-nested-loop CG optimisation scheme: an inner loop optimises the density kernel with the NGWFs fixed, and an outer loop then updates the NGWFs themselves.
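As a toy illustration of that two-nested-loop structure (a minimal Python sketch, not ONETEP code: plain gradient descent on an arbitrary quadratic model stands in for the conjugate-gradient steps, purely to show the control flow of inner "kernel" iterations inside outer "NGWF" iterations):

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = np.eye(4) + 0.1 * (B @ B.T)        # toy positive-definite "metric"

def energy(C, K):
    # arbitrary quadratic model coupling the two sets of variables
    return float(C @ A @ C + K @ A @ K + C @ K)

C = rng.standard_normal(4)             # stands in for NGWF expansion coefficients
K = rng.standard_normal(4)             # stands in for density-kernel degrees of freedom

for outer in range(20):                # outer loop: "NGWF" update
    for inner in range(5):             # inner loop: "density kernel" update at fixed C
        K -= 0.05 * (2.0 * A @ K + C)  # gradient step on K
    C -= 0.05 * (2.0 * A @ C + K)      # gradient step on C
print("toy 'energy' after optimisation:", energy(C, K))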


Parallel implementation using the Message Passing Interface (MPI) paradigm – each processor runs its own copy of the program with its own data

Parallelisation of data: Distribution of atoms (and NGWFs) to processors according to a space-filling curve

[Figure: atom-to-processor assignment without and with the space-filling (SF) curve; with the curve, each processor holds a spatially compact group of atoms.]
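A minimal sketch of the idea (not ONETEP's implementation – a simple Morton/Z-order key stands in for whatever space-filling curve is actually used): atoms are sorted by their position along the curve and split into contiguous chunks, so each processor receives a spatially compact group of atoms.

import numpy as np

def morton_key(ix, iy, iz, bits=10):
    # interleave the bits of the three grid indices to form a Z-order key
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def distribute(positions, cell_length, n_proc, bits=10):
    # quantise fractional coordinates onto a 2^bits grid, sort by Morton key,
    # and hand out contiguous chunks of the sorted curve to the processors
    frac = (np.asarray(positions) / cell_length) % 1.0
    grid = (frac * (1 << bits)).astype(int)
    keys = [morton_key(*g) for g in grid]
    order = np.argsort(keys)
    return np.array_split(order, n_proc)

positions = np.random.rand(20, 3) * 25.0      # toy atomic positions
for proc, atoms in enumerate(distribute(positions, 25.0, 4)):
    print(f"processor {proc}: atoms {atoms.tolist()}")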


Effect on sparsity pattern of Hamiltonian matrix

[Figure: sparsity pattern of the 4650 x 4650 Hamiltonian matrix, without and with the SF curve.]


BRC4-RAD51 complex (3000 atoms):

• Atomic orbital “DZP” basis: 27500 x 27500 matrix – at 1% occupancy, 58 MB

• ONETEP NGWFs: 7600 x 7600 matrix – at 1% occupancy, only 4.4 MB

“Small” sparse matrices! Can scale up to 100 processors (~5000 atoms) without data-parallel matrices!
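These figures are consistent with double-precision (8-byte) matrix elements at 1% occupancy:

$$0.01 \times 27500^2 \times 8~\text{bytes} \approx 58~\text{MB}, \qquad 0.01 \times 7600^2 \times 8~\text{bytes} \approx 4.4~\text{MB}.$$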


PPD & FFTbox representation of functions: φα is stored in PPDs and, when needed, its PPDs are deposited into an FFTbox; afterwards the PPDs are extracted back out of the box.

In the FFTbox:

• QM operators

• Sums, products, interpolation

In the PPD representation:

• Compact storage in 1D arrays

• Fast communication
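As a rough illustration of the deposit/extract operations (a minimal Python sketch with assumed names and shapes such as PPD and FFTBOX – not ONETEP code):

import numpy as np

FFTBOX = (24, 24, 24)
PPD = (6, 6, 6)                     # grid points per PPD (assumed shape)

def deposit(ppd_values, ppd_origins, box_shape=FFTBOX):
    # place each compact 1D PPD array into its block of the dense FFTbox
    box = np.zeros(box_shape)
    for vals, (ox, oy, oz) in zip(ppd_values, ppd_origins):
        box[ox:ox+PPD[0], oy:oy+PPD[1], oz:oz+PPD[2]] = vals.reshape(PPD)
    return box

def extract(box, ppd_origins):
    # pull the PPD blocks back out of the FFTbox into compact 1D arrays
    return [box[ox:ox+PPD[0], oy:oy+PPD[1], oz:oz+PPD[2]].ravel().copy()
            for (ox, oy, oz) in ppd_origins]

origins = [(0, 0, 0), (6, 0, 0), (0, 6, 0)]               # PPDs covering the function
values = [np.random.rand(np.prod(PPD)) for _ in origins]  # compact 1D storage
box = deposit(values, origins)                            # dense FFTbox representation
values_back = extract(box, origins)                       # back to compact PPD arrays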


Quantum-mechanical operators delocalise φβ over the whole volume of the FFTbox. Computationally expensive!

• T

• V local

• V non-local

• Interpolation to 2Gmax PSINC basis (fine grid)

[Figure: φβ in its FFTbox, and Ôφβ after the operator Ô has been applied.]

Calculation of Ôφβ for each φα is linear-scaling but not ideal (large prefactor)!
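To make the FFTbox operations concrete, here is a minimal sketch (not ONETEP code; the grid size and box length are arbitrary illustrative values) of applying the kinetic operator to a function held in an FFTbox by forward FFT, multiplication by k²/2 and inverse FFT:

import numpy as np

n, a = 32, 10.0                          # grid points per side, box length (assumed)
k = 2.0 * np.pi * np.fft.fftfreq(n, d=a / n)
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
k2 = kx**2 + ky**2 + kz**2

def apply_kinetic(phi_box):
    # T phi = -1/2 ∇² phi, evaluated in the plane-wave (psinc) representation
    return np.real(np.fft.ifftn(0.5 * k2 * np.fft.fftn(phi_box)))

phi = np.random.rand(n, n, n)            # φβ deposited into its FFTbox
t_phi = apply_kinetic(phi)               # Tφβ, delocalised over the whole box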

There is a way to reduce the prefactor: Enlarge FFTbox!

The smallest FFT box ensures:

• Hermiticity

• Uniform representation for each operator

• But: φβ needs to be re-centred for every single φα which overlaps with it!

An enlarged FFTbox gives the same advantages, plus:

• φβ does not need to be re-centred – Ôφβ needs to be calculated only once!


Efficient calculation of integrals <φα|Ô|φβ>

[Figure: φα and Ôφβ.] φα is never placed in an FFTbox! Extract only the PPDs that Ôφβ has in common with φα; the integral then reduces to a dot product of (small) 1D arrays (ddot):

<φα|Ô|φβ> = w Σᵢ φα(rᵢ) (Ôφβ)(rᵢ), where the sum runs over the grid points rᵢ of the common PPDs and w is the volume per grid point.


Keep the Ôφβ FFTbox in memory and re-use it for <φβ|Ô|φβ>, <φγ|Ô|φβ>, etc.
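A minimal sketch (assumed array length and grid-point weight; not ONETEP code) of that final reduction – a weighted dot product of two small 1D arrays:

import numpy as np

weight = 0.125                                    # volume per grid point (assumed value)
phi_alpha_ppds = np.random.rand(500)              # φα on the PPDs it shares with φβ (1D)
o_phi_beta_ppds = np.random.rand(500)             # Ôφβ extracted on the same PPDs (1D)
integral = weight * np.dot(phi_alpha_ppds, o_phi_beta_ppds)   # the "ddot"
print("approximate <φα|Ô|φβ>:", integral)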

Efficient calculation of the charge density, n(r)


Only two interpolations per n(r;α), independent of the number of overlapping β!

[Workflow: form Σβ Kαβ φβ in the FFTbox of φα, interpolate it and φα onto the fine grid, multiply them, and deposit the product n(r;α) into the simulation cell.]
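In the standard ONETEP density construction (stated here for context, not taken from the slide), the per-NGWF contribution is

$$n(\mathbf{r};\alpha) = \phi_\alpha(\mathbf{r}) \sum_\beta K^{\alpha\beta}\, \phi_\beta(\mathbf{r}), \qquad n(\mathbf{r}) = \sum_\alpha n(\mathbf{r};\alpha),$$

so only φα and the single row sum Σβ Kαβ φβ need interpolating onto the fine grid – two interpolations per α, however many β overlap.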

Efficient calculation of the NGWF gradient


(the gradient is restricted to the localisation region of φα, which makes it suitable for updating φα in the conjugate gradients optimisation)

Only one application of Ĥ per φα!

[Workflow: apply Ĥ once in the FFTbox, extract only the PPDs belonging to φα, and “shave” (zero) the values that fall outside the localisation region of φα.]
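Schematically (a deliberately simplified form; the full ONETEP gradient also contains overlap- and kernel-dependent terms that are not reproduced here), the quantity built in the FFTbox has the structure

$$\frac{\delta E}{\delta \phi_\alpha^{*}(\mathbf{r})} \;\sim\; \Big[\hat{H} \sum_\beta \phi_\beta(\mathbf{r})\, K^{\beta\alpha}\Big]_{\text{truncated to the region of }\phi_\alpha},$$

which is why a single application of Ĥ per φα suffices: Ĥ acts once on the kernel-weighted sum of NGWFs rather than on each φβ separately.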

Linear-scaling with small prefactor

[Plot: total calculation time (hours, 0-25) versus number of atoms (0-1000) for ONETEP and CASTEP; the CASTEP time grows as ~N³ while the ONETEP time grows as ~N.]


Communication model

Functions on different processors which overlap need to be communicated during the computation.

• Use only point-to-point communication

• Keep functions to be sent in buffers and interleave communication with computation by using non-blocking sends

• Only use PPD representation during communication

• Keep in memory batches of Ôφβ to minimise communication

• Send only if there is an overlap with current batch of functions of receiving processor

• Work with processor-processor blocks of functions/matrices – do Nprocessors supersteps


Communication model simplest case – evaluation of integrals: outline from the viewpoint of processor X

loop over batches:

    initialise my NGWF batch

    apply Ô to my batch

    loop over processors (one superstep per processor – in this superstep I will be sending to Y and receiving from Z):

        send my φθ to Y if Y needs them

        receive φη from Z if it overlaps with any member of my batch

        loop over all my NGWFs φθ in the batch: accumulate the integrals of the received φη with Ôφθ

    store the batch integrals in sparse form
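A minimal mpi4py sketch of the superstep pattern above (hypothetical, not ONETEP code: in superstep s, processor X sends to Y = X + s and receives from Z = X - s, overlapping the non-blocking send with local work; a real implementation would send only the functions that overlap with the receiver's batch, in PPD form):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
my_ngwfs = np.random.rand(4, 500)                 # toy stand-in for my φθ in PPD form

for step in range(1, size):
    dest = (rank + step) % size                   # processor Y for this superstep
    source = (rank - step) % size                 # processor Z for this superstep
    req = comm.Isend(my_ngwfs, dest=dest, tag=step)      # non-blocking send of my φθ
    received = np.empty_like(my_ngwfs)
    comm.Recv(received, source=source, tag=step)         # receive φη from Z
    # ... accumulate <φη|Ô|φθ> contributions here, overlapping with the send ...
    req.Wait()                                    # make sure the send buffer can be reused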


Speedup with increasing number of processors

[Plot: speed-up versus number of processors (0-100) for BRC4 (570 atoms) and BN DWNT (1200 atoms), compared with ideal speed-up.]


True linear-scaling: the number of iterations is small and independent of N.

[Plots: energy gain per atom (Ha) and residual norm (Ha), on logarithmic scales, versus iteration for Gly (10 atoms), Gly50 (353 atoms), Gly200 (1403 atoms) and nanotubes (516 atoms).]

Chemical accuracy, no Basis Set Superposition Error

Basis sets, number of localised functions:

• CASTEP: plane waves, 1292 eV – scaling ~O(N³)

• ONETEP: plane waves, 1292 eV; H: 1 NGWF, O: 4 NGWFs – scaling ~O(N)

• NWCHEM: cc-pVTZ + diffuse; H: 25 contracted Gaussians, O: 55 contracted Gaussians – scaling ~O(N³)

[Plot: energy (kcal/mol, 0-8) versus O--H distance (1-7 Angstrom) for ONETEP, CASTEP and NWCHEM (Gaussian basis).]


The simulation-cell size costs O(0) in the number of atoms – enlarging the cell adds no atoms and hence essentially no cost, so plane-wave calculations can be done in huge simulation cells!

ONETEP calculations of a carbon nanotube tip in a uniform external electric field (C.-K. Skylaris & G. Csányi et al.)

Local potential 50Å x 50Å x 50Å simulation cell

Charge density on local potential iso-surface near Fermi level


Conclusions

• ONETEP – a linear-scaling total energy code – takes full advantage of parallel computers

• Accuracy equivalent to conventional cubic-scaling Plane-wave / Gaussian DFT codes

• Low prefactor – break-even with cubic-scaling codes in the region of a few hundred atoms

• Work in progress: data-parallelisation of the simulation cell – enabling even larger simulation cells

• A whole new level of large-scale first-principles simulations, from condensed matter physics to biology, is now possible – see next week's instalment, which will be brought to you by Peter D. Haynes
