arXiv:1402.7366v2 [cond-mat.mtrl-sci] 6 Mar 2014


The Shirley reduced basis: a reduced order model for plane-wave DFT

Maxwell Hutchinson
The Physics Department, University of Chicago, Chicago IL 60637∗

David Prendergast
The Molecular Foundry, Lawrence Berkeley National Laboratory, Berkeley CA 94720

(Dated: September 25, 2018)

The Shirley reduced basis (SRB) represents the periodic parts of Bloch functions as linear combinations of eigenvectors taken from a coarse sample of the Brillouin zone, orthogonalized and reduced through proper orthogonal decomposition. We describe a novel transformation of the self-consistent density functional theory eigenproblem from a plane-wave basis with ultra-soft pseudopotentials to the SRB that is independent of the k-point. In particular, the number of operations over the space of plane-waves is independent of the number of k-points. The parameter space of the transformation is explored and suitable defaults are proposed. The SRB is shown to converge to the plane-wave solution. For reduced dimensional systems, reductions in computational cost, compared to the plane-wave calculations, exceed 5x. Performance on bulk systems improves by 1.67x in molecular dynamics-like contexts. This robust technique is well-suited to efficient study of systems with stringent requirements on numerical accuracy related to subtle details in the electronic band structure, such as topological insulators, Dirac semi-metals, metal surfaces and nanostructures, and charge transfer at interfaces with any of these systems. The techniques used to achieve a k-independent transformation could be applied to other computationally expensive matrix elements, such as those found in density functional perturbation theory and many-body perturbation theory.

PACS numbers: 71.15.-m, 71.15.Ap, 71.15.Dx, 02.70.-c, 02.60.-x

I. INTRODUCTION

Electronic structure is a cornerstone of modern materials science and condensed matter research efforts. Rapid advancements in experimental techniques have put pressure on the simulation community to provide efficient access to chemically accurate numerical predictions for a diverse range of materials, while high-throughput efforts [1] have embraced standard electronic structure methods, such as density functional theory (DFT) [2], to explore entire databases of materials [3]. For materials science applications, typically focused on condensed phases, surfaces, and extended nanostructures, the use of atomistic models employing periodic boundary conditions is the norm, with plane-wave DFT (PWDFT) providing a robust numerical approach for systems where the number of atoms does not extend to thousands. While PWDFT has been shown to be adequately robust in the treatment of exotic materials, whose electronic properties are defined by subtle details in electronic band structure (e.g., Dirac semi-metals and topological insulators [4]), numerically converged calculations of complex systems often come at prohibitive computational cost. The Shirley reduced basis (SRB) technique is able to substantially reduce numerical effort while retaining the robustness of PWDFT.

PWDFT is a natural numerical representation for electronic structure under periodic boundary conditions, which are commonly employed to model crystalline and amorphous condensed phases, surfaces and their adsorbates, materials interfaces or heterojunctions, and extended nanostructures, such as sheets, ribbons, wires and tubes. PWDFT can also be employed to study finite or molecular systems, but that will not be the focus of this work. Note that finite temperature first-principles molecular dynamics simulations for condensed phases, surfaces, interfaces and extended nanostructures also typically rely on PWDFT, due to the robust description of electronic structure provided by this numerical representation, even for configurations far from equilibrium. In particular, we are interested in systems which are expensive to model due to the necessity to capture subtle details of electronic band structure which define the function of these materials. Typically, this complexity is related to adequately describing the Fermi surface of complex metals or semi-metals, which requires detailed knowledge of electronic band structure with respect to electron wave vectors.

∗ [email protected]

Bloch’s theorem factors the representation of wave functions in periodic systems into a slowly oscillating phase, defined by the electron wave vector k, or k-point, and a strictly periodic function, u(r) [5]:

$$\psi(r) = e^{i k \cdot r}\, u(r), \qquad u(r + a) = u(r), \tag{1}$$

where a is a Bravais lattice vector of the periodic system. In DFT, the Hamiltonian is block-diagonal with respect to the k-point, but the expectation value of the electron density relies on integrating contributions over all values of k in the first Brillouin zone (BZ). Numerical solutions to electronic structure problems discretely sample the BZ with a finite set of k-points, k ∈ K ⊂ BZ. Existing electronic structure methods treat the k-dependent eigenproblems independently. In reality, the eigenproblems and associated eigensolutions at each k-point contain a degree of redundancy. The periodic Bloch states, $|u_{nk}\rangle$, are known to have relatively weak k-dependence in comparison to the dispersion relation, $\varepsilon_{nk}$, which is mainly driven by the quadratic k-dependence of the kinetic energy.

Shirley proposed that a reduced basis composed of Bloch eigenstates over a coarse sample of the BZ be used to represent the eigenproblem throughout the BZ [6]. We denote the initial coarse sample as the q-points. Thus, the nth Bloch state at BZ sample k can be represented as

$$|u_{nk}\rangle = \sum_{mq} |u_{mq}\rangle\, c^{nk}_{mq}. \tag{2}$$

The number of q-points needed to accurately reproduce the solutions at all k-points is small. We refer to this technique, the use of a coarse BZ sample as a basis, as Shirley interpolation. Analogies can be drawn to k · p theory [7, 8], which can effectively interpolate electronic structure in the neighborhood of a specific k-point using a large number of eigenstates from that k-point alone. By contrast, Shirley interpolation aims to provide a single reduced basis which spans the entire BZ by combining details from multiple q-points and a relatively small number of bands.

The basis size can be further reduced by selecting linear combinations of q-point states through a principal component analysis [6]. The principal components with small eigenvalues are removed from the basis, given a specified error tolerance. The use of principal component analysis in reducing the dimensionality of PDEs is often called proper orthogonal decomposition (POD) [9].

The Shirley reduced basis (SRB) is the union of Shirley interpolation and proper orthogonal decomposition. Shirley applied the method to the non-self-consistent evaluation of band structure and associated spectroscopy and achieved speedups in excess of 1000× for dense k-point samples [6].

Prendergast and Louie extended Shirley’s work to ultra-soft pseudopotentials (USPP) through interpolation of the projectors across the BZ [10]. They also recognized that preserving periodicity with respect to k across the first BZ could be achieved using periodic images of the original coarse q-point basis. In particular, the periodic images of the Γ-point were used to successfully enforce the periodic symmetry of the band structure. Such images can be generated merely via mathematical transformation, without additional expensive plane-wave calculations.

Here, we further extend the method to self-consistent calculations, employing a new set of techniques to mitigate the added cost of constructing an electron density [2]. We also remove the interpolation of the atomic projectors required for the non-local potential, in favor of an auxiliary basis approach similar to the SRB for the periodic component of the Bloch eigenfunctions. We implement the SRB in the popular PWDFT code Quantum Espresso [11]. Our implementation is freely available for general use [12].

The SRB approach is shown to be uniformly convergent and numerically efficient, achieving significant speedups for self-consistent calculations, including relaxations. The accurate reproduction of forces and stresses also indicates the possibility of significant speedup of first-principles molecular dynamics simulations, which can prove prohibitively expensive for systems which require k-point sampling to capture metallic or semi-metallic behavior or the possibility of a transition to a metallic state via phase change [13] or charge transfer [14]. One can view such molecular dynamics simulations simplistically as requiring a converged self-consistent-field calculation at each time step for the determination of the next atomic configuration based on the computed forces and stresses.

In Section II, we discuss similar methodologies and find the SRB to be the first combination of Shirley interpolation and proper orthogonal decomposition. In Section III, we introduce the self-consistent extension of the SRB in general terms. In Section IV, we introduce representative physical systems that will be used to demonstrate the accuracy and performance of the SRB, and describe our standards for agreement and convergence between two calculations of the same physical structure. In Section V, we describe the parameters of the method and their effect on the solution. In Section VI, we demonstrate the convergence properties and accuracy of the SRB. In Section VII, we present a performance model for the SRB. In Section VIII, we discuss interesting properties of the SRB. In Section IX, we propose research efforts that would improve the efficiency and accuracy of the SRB. Finally, in Section X, we conclude. Appendix A provides a more thorough performance model for the SRB algorithm.

II. RELATED WORK

The SRB is a reduced order method (ROM) based on Shirley interpolation and proper orthogonal decomposition (POD), which is related to principal component analysis and Galerkin projection in other fields [15]. POD, principal component analysis, and Galerkin projection have been well studied in fields as diverse as fluid dynamics [16], machine learning [17], and image processing [9, 18]. Shirley interpolation appears to have originated with the work of Shirley himself.

ROMs were developed in the fluid dynamics community [16, 19] as a means to reduce the dimensionality of steady-state turbulent flows. Samples of the velocity fluctuations $u - \langle u \rangle$ taken at intervals longer than the correlation time are fed into a singular-value decomposition to identify coherent structures in the flow. The governing equations are then expressed in terms of the coherent structures, creating a model for the time-dependent energy transfer between the structures. The SRB differs from common ROMs in two ways: 1) it models a non-linear system and 2) it samples in Bloch space rather than time. The non-linearity makes a priori error estimates difficult. The Bloch-space sampling provides structure to the operator transformations.

The optimal representation of the polarization propagator in [20] uses the SVD of the product basis of Wannier-function expansions of wave functions to reduce the basis size, akin to POD. The representation does not, however, take advantage of the smoothness of a solution manifold as the SRB does in k-space and fluid dynamics ROMs do in time. In other words, it lacks the interpolation step. The optimal representation is most like the auxiliary basis used in this work, Section III C, for the transformation of projectors and the reduced rank density matrix, neither of which involves k-space interpolation.

The k · p method has been applied to density functional theory calculations by Persson and Ambrosch-Draxl [21]. It relies on analytic k-dependence of the Hamiltonian to express linear relations between the Bloch states at different k-points. That reliance, however, restricts k · p to application to methods with only local potentials, such as FLAPW. The SRB has no such restrictions, and is presented here with the generality of ultra-soft pseudopotentials.

Reduced Bloch Mode Expansion (RBME) [22, 23] is an expansion in Bloch eigenfunctions nearly identical to the Shirley interpolation step, but lacking the POD step. It has wide application beyond electronic structure to band structures of periodic systems and is agnostic to the discretization of the Bloch states. The RBME technique post-dates Shirley’s original work.

The “reduced-basis” method described by Pau [24] makes similar assumptions about the form of the potential in order to produce an eigenproblem that is affine in k. This provides for an a posteriori error estimation. Like RBME, it lacks the POD step.

III. SELF-CONSISTENT ALGORITHM

The methods of Shirley, Prendergast, and Louie for non-self-consistent interpolation are well described in the literature [6, 10]. The most basic idea is to use the span of the periodic components of the Bloch eigenfunctions at a small number of points in the first BZ as a basis for the entire BZ. Furthermore, a POD of these periodic functions allows for the basis size to be reduced. Finally, a k-dependent Hamiltonian can be constructed without an explicit basis transformation through polynomial expansion of the kinetic energy and interpolation of projectors of the non-local potential.

The algorithm presented below is a natural generalization to the self-consistent regime. It differs primarily in the addition of a step to reconstruct a real-space electron density from Bloch eigenfunctions in the reduced basis, and secondarily in more general approaches to the basis transformations and interpolations.

[Figure 1, flow chart. Node equations recoverable from the figure: step 1, $\langle g|H[\rho]|\psi_{mq}\rangle = E_{mq}\langle g|\psi_{mq}\rangle$; step 2, $\sigma^2 \langle u_{mq}|b\rangle = \langle u_{mq}|u_{m'q'}\rangle \langle u_{m'q'}|b\rangle$; step 4, $\langle b|H[\rho](k)|u_{nk}\rangle = E_{nk}\langle b|u_{nk}\rangle$; step 5, $\langle r|\rho|r\rangle = \sum_{nk} f_{nk}[E]\, |\langle r|\psi_{nk}\rangle|^2$.]

FIG. 1: Flow chart of the SRB method. The primary self-consistent plane-wave method is embodied in steps 0, 1, and 5. The SRB adds step 2 to build the reduced basis, step 3 to transform the k-dependent Hamiltonian into the reduced basis, and step 4 to solve the reduced eigenproblem.

A few notes: We use the term expansion to refer to exact polynomial forms and reserve interpolation for methods that include truncation errors. The basis produced by this scheme is referred to as ‘reduced’ [15], as opposed to ‘optimal’ in the previous literature, as it is only optimal within the space spanned by the periodic components of the input Bloch eigenfunctions. For the sake of clarity, we assume the conventional method employs a plane-wave basis with pseudopotentials [25, 26], though the method generalizes to any discretization of the Bloch eigenfunctions, for example, in real-space using analytic or numerical atom-centered orbitals or using projector-augmented waves [27]. We use the unit cube in reciprocal lattice coordinates as our BZ, as opposed to the conventional choice of the origin-centered Wigner-Seitz cell.

The algorithm refines an electron density. Generally, it would be repeated until a convergence criterion is met, signaling self-consistency. The outline of the algorithm can be seen in Figure 1: Start with an electron density represented in real-space. 0) Represent the Hamiltonian as a functional of the electron density using a conventional basis, $\{|g_i\rangle\}$; 1) Compute Bloch eigenfunctions $\langle g|\psi_{nq}\rangle$ using a conventional method over a coarse reciprocal mesh, Q; 2) Create a reduced basis $\{|b_i\rangle\}$ from an optimal subspace of the span of the periodic components of the conventional Bloch eigenfunctions; 3) Represent k-dependent Hamiltonians in the new basis over a fine reciprocal mesh K; 4) Solve the k-dependent Hamiltonians in the new basis, producing the periodic components of the eigenfunctions $\langle b|u_{nk}\rangle$; 5) Construct a real-space electron density from the eigenfunctions in the new basis.
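Steps 1 through 4 can be illustrated on a toy problem. The following is a minimal sketch, not the Quantum Espresso implementation: it assumes a one-dimensional crystal with a cosine potential, performs a single non-self-consistent pass, and projects H(k) explicitly rather than using the expansions of Section III B. The names `N_MAX`, `V0`, `pw_hamiltonian`, `reduced_basis`, and `reduced_bands` are hypothetical, and the POD is carried out through an SVD of the sampled states, which spans the same subspace as diagonalizing their overlap matrix.

```python
import numpy as np

N_MAX = 10   # plane waves g = 2*pi*n with |n| <= N_MAX (toy cutoff, an assumption)
V0 = 0.5     # strength of the toy potential V(x) = 2*V0*cos(2*pi*x)

def pw_hamiltonian(kappa):
    """H(k) = exp(-ikx) H exp(ikx) in plane waves; kappa is k in reciprocal lattice units."""
    g = 2.0 * np.pi * np.arange(-N_MAX, N_MAX + 1)
    kinetic = np.diag((2.0 * np.pi * kappa + g) ** 2)       # convention 2T = |k + g|^2
    coupling = np.eye(g.size, k=1) + np.eye(g.size, k=-1)   # cosine couples neighbouring g
    return kinetic + V0 * coupling

def reduced_basis(q_points, n_bands, rel_tol=1e-10):
    """Steps 1-2: eigenvectors on the coarse mesh, then POD via an SVD of the snapshots."""
    snapshots = np.hstack(
        [np.linalg.eigh(pw_hamiltonian(q))[1][:, :n_bands] for q in q_points]
    )
    left, sing, _ = np.linalg.svd(snapshots, full_matrices=False)
    return left[:, sing > rel_tol * sing[0]]                # orthonormal columns <g|b_i>

def reduced_bands(basis, kappa):
    """Steps 3-4: project H(k) onto the reduced basis and diagonalize the small matrix."""
    h_red = basis.conj().T @ pw_hamiltonian(kappa) @ basis
    return np.linalg.eigvalsh(h_red)

basis = reduced_basis(q_points=[0.0, 0.25, 0.5, 0.75], n_bands=4)
exact = np.linalg.eigvalsh(pw_hamiltonian(0.1))
approx = reduced_bands(basis, 0.1)
print("basis size:", basis.shape[1], "of", 2 * N_MAX + 1, "plane waves")
print("lowest-band error at kappa = 0.1:", approx[0] - exact[0])
```

Because the reduced basis is orthonormal, the reduced eigenvalues bound the plane-wave eigenvalues from above, and they coincide with them at the sampled q-points for the bands that were included.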


A. Constructing the basis

The conventional algorithm produces eigenfunctions which are Bloch functions $|\psi_{mq}\rangle = e^{i q \cdot r} |u_{mq}\rangle$ on a reciprocal mesh Q. Symmetries in the problem can be used to map the Bloch eigenfunctions computed at a single q-point to other points in reciprocal space. In addition, the periodicity of eigensolutions in k-space permits states from the surface of the first BZ, e.g. the reciprocal space origin q = Γ = 0, to be translated by any other reciprocal lattice vector, e.g. to the corners of the unit cube, $0^3 \to \{0, 1\}^3$ in reciprocal lattice coordinates. This procedure is further described in [10], and generally requires little numerical effort, given that such translations can be achieved merely by a reordering of the Fourier coefficients of each state.

Additional symmetries can provide more input q-points without additional plane-wave computation. For example, inversion symmetry is computed by conjugating the plane-wave coefficients:

$$c_{n,q} = c^{*}_{n,-q},$$

which provides two interior k-points for the cost of one. The symmetries can be compounded. For example, a single q-point on the edge can be used to construct 7 other q-points without significant computation assuming inversion symmetry and periodicity:

$$(0, 0, 1/4) \to \{(0, 1, 1/4),\ (1, 0, 1/4),\ (1, 1, 1/4),\ (0, 0, 3/4),\ (0, 1, 3/4),\ (1, 0, 3/4),\ (1, 1, 3/4)\}$$
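The counting in this example can be reproduced with a short enumeration. The sketch below only lists the image q-points in the closed unit cube; it does not perform the conjugation or reordering of Fourier coefficients, and the function name `symmetry_images` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def symmetry_images(q):
    """Images of q (reciprocal lattice coordinates in [0, 1)) inside the closed
    unit cube, generated by inversion q -> -q and reciprocal lattice translations."""
    q = tuple(Fraction(c) for c in q)
    inverted = tuple((-c) % 1 for c in q)   # inversion, folded back into [0, 1)
    images = set()
    for point in (q, inverted):
        # A coordinate equal to 0 lies on the cube surface and may also sit at 1.
        choices = [(c, c + 1) if c == 0 else (c,) for c in point]
        images.update(product(*choices))
    images.discard(q)                        # report only the additional points
    return sorted(images)

images = symmetry_images((0, 0, Fraction(1, 4)))
print(len(images), "additional q-points")
for point in images:
    print(tuple(str(c) for c in point))
```

For an interior point such as (1/4, 1/4, 1/4), the same function returns the single inversion image (3/4, 3/4, 3/4).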

We use POD to pick a subspace of the span of the periodic components of the Bloch eigenfunctions that is optimally representative. First, construct a covariance (or overlap) matrix:

$$C_{i,j} \equiv \langle u_i | u_j \rangle, \tag{3}$$

where i, j are composite indices over m, q, the space of coarsely sampled eigenfunctions. Next, diagonalize the covariance matrix:

$$\sum_{\gamma} \langle u_j | u_\gamma \rangle \langle u_\gamma | b_i \rangle = \sigma_i^2 \langle u_j | b_i \rangle, \tag{4}$$

where the eigenvalues $\sigma_i^2$ are the variances captured by the basis elements $|b_i\rangle$. Select those basis elements, $|b_i\rangle$, with the largest variances as the reduced basis. The size of the basis can be informed by choosing a sufficient number of basis elements to capture a significant fraction of the total variance. For example, the missing variance could be constrained to be below a certain fraction of the total variance:

$$1 - \frac{\sum_{i}^{N_b} \sigma_i^2}{\sum_{j}^{N} \sigma_j^2} < \sigma_b^2, \tag{5}$$

where $N_b$ is the number of basis elements, $N$ is the rank of the overlap matrix $C$, and $\sigma_b^2$ is the maximum tolerated error.

The basis elements are represented with respect to the original periodic components of the Bloch eigenfunctions, so one must generally transform them back to the original basis. In the case of plane-waves, the transformation takes the form:

$$\langle g_j | b_i \rangle = \sum_{\gamma} \langle g_j | u_\gamma \rangle \langle u_\gamma | b_i \rangle. \tag{6}$$
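Equations 3 through 6 translate almost line by line into dense linear algebra. The following is a minimal sketch under stated assumptions: the sampled states are stored as columns of a NumPy array, the function name `pod_basis` and the argument `max_missing_variance` (playing the role of $\sigma_b^2$) are hypothetical, and the final normalization of each basis element to unit length is a convention chosen here for illustration.

```python
import numpy as np

def pod_basis(u_pw, max_missing_variance):
    """Reduce sampled states <g|u_gamma>, stored as columns of u_pw, by POD.

    Returns the plane-wave coefficients <g|b_i> (as columns) and the
    variances sigma_i^2 of the retained basis elements.
    """
    overlap = u_pw.conj().T @ u_pw                             # Eq. 3
    variances, coeffs = np.linalg.eigh(overlap)                # Eq. 4 (ascending order)
    variances = np.clip(variances[::-1], 0.0, None)            # largest variance first
    coeffs = coeffs[:, ::-1]
    missing = 1.0 - np.cumsum(variances) / variances.sum()     # left-hand side of Eq. 5
    below = missing < max_missing_variance
    n_b = int(np.argmax(below)) + 1 if below.any() else variances.size
    n_b = min(n_b, int(np.count_nonzero(variances > 0.0)))     # never keep null directions
    # Eq. 6, followed by normalization to unit length (a convention assumed here).
    b_pw = (u_pw @ coeffs[:, :n_b]) / np.sqrt(variances[:n_b])
    return b_pw, variances[:n_b]

rng = np.random.default_rng(0)
snapshots = rng.standard_normal((100, 24)) + 1j * rng.standard_normal((100, 24))
basis, kept = pod_basis(snapshots, max_missing_variance=0.05)
print(basis.shape[1], "of", snapshots.shape[1], "states kept")
```

Random snapshots compress poorly; the physical sampled states are expected to compress much better because of their weak k-dependence.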

B. k-dependent Hamiltonians

We use the reduced basis to expand a k-dependent Hamiltonian $H(k) = e^{-i k \cdot r} H e^{i k \cdot r}$. In the plane-wave pseudopotential framework, the Hamiltonian has three components: the kinetic energy, the local potential, and the non-local potential [5, 28]:

$$H \equiv |g\rangle T \langle g| + |r\rangle V \langle r| + |\beta\rangle V^{nl} \langle \beta'|. \tag{7}$$

It was previously shown that the kinetic energy could be written as a quadratic expansion in $\vec{k}$:

$$2 \langle g_i | T(\vec{k}) | g_i \rangle = |\vec{k} + \vec{g}_i|^2 = |\vec{k}|^2 + 2 \vec{g}_i \cdot \vec{k} + |\vec{g}_i|^2, \tag{8}$$

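which can be transformed to the reduced basis by means of the matrices:

$$\langle b_i | T^0 | b_j \rangle = \sum_{\gamma} \langle b_i | g_\gamma \rangle\, |\vec{g}_\gamma|^2\, \langle g_\gamma | b_j \rangle \tag{9}$$

$$\langle b_i | \vec{T}^1 | b_j \rangle = \sum_{\gamma} \langle b_i | g_\gamma \rangle\, \vec{g}_\gamma\, \langle g_\gamma | b_j \rangle \tag{10}$$

$$\langle b_i | T^2 | b_j \rangle = \delta_{ij} \quad (T^2 = 1), \tag{11}$$

such that

$$2 \langle b_i | T(\vec{k}) | b_j \rangle = \langle b_i |\, |\vec{k}|^2 T^2 + 2 \vec{k} \cdot \vec{T}^1 + T^0 \,| b_j \rangle, \tag{12}$$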

which exactly isolates the k dependence in the transformation of the kinetic energy.

The local potential is k-independent. Its reduced basis representation is computed by explicit transformation from the plane-wave basis to real-space:

$$\langle b_i | V^{loc} | b_j \rangle = \sum_{\gamma, \gamma', r} \langle b_i | g_\gamma \rangle \langle g_\gamma | r \rangle\, V_{loc}(r)\, \langle r | g_{\gamma'} \rangle \langle g_{\gamma'} | b_j \rangle, \tag{13}$$

where the sums over γ, γ′ can be implemented as Fourier transformations.

The non-local potential is generally expressed as a matrix in the basis of atomic pseudo-wave functions, or just pseudo-functions: $\langle \phi_{a,l'} | V^{nl} | \phi_{a,l} \rangle$, where a runs over atomic sites and l over composite angular momentum quantum numbers [29]. The non-local potential is diagonal for norm-conserving pseudopotentials (NCPP) and block diagonal with respect to the atomic index for USPP. We will occasionally compress the indices (a, l) → α. The space of atomic wave functions is accessed through a dual-space of projectors, β, which couples isolated atomic Hilbert spaces to the overall periodic space comprising multiple atomic sites:

$$\sum_{l,l'} |\beta_{a,l}\rangle \langle \phi_{a,l'}| = I_a, \tag{14}$$

where $I_a$ represents the identity in the space spanned by the pseudo-functions in a finite volume around atomic site a. The matrix elements of the non-local potential do not depend on k, but the projectors $|\beta_{a,l}\rangle$ do. Transforming the projectors is all that must be done to transform the non-local potential to the reduced basis:

$$\langle b_i | \beta_\alpha(k) \rangle = \sum_{\gamma} \langle b_i | g_\gamma \rangle \langle g_\gamma | \beta_\alpha(k) \rangle \tag{15}$$

$$\langle b_i | V^{nl}(k) | b_j \rangle = \sum_{\alpha, \alpha'} \langle b_i | \beta_\alpha(k) \rangle \langle \phi_\alpha | V^{nl} | \phi_{\alpha'} \rangle \langle \beta_{\alpha'}(k) | b_j \rangle. \tag{16}$$

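For ultra-soft pseudo-potentials, the overlap matrix, S, is computed in the same way:

$$\langle b_i | S(k) | b_j \rangle - \delta_{i,j} = \sum_{\alpha, \alpha'} \langle b_i | \beta_\alpha(k) \rangle \langle \phi_\alpha | S | \phi_{\alpha'} \rangle \langle \beta_{\alpha'}(k) | b_j \rangle. \tag{17}$$

Note that when expressed in the basis of pseudo-functions, the non-trivial part of the overlap matrix, $\langle \phi_\alpha | S | \phi_{\alpha'} \rangle$, and the non-local potential, $\langle \phi_\alpha | V^{nl} | \phi_{\alpha'} \rangle$, are often referred to as Q and D, respectively [25].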
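The kinetic energy expansion of Equations 9 through 12 is easy to check numerically: the three k-independent matrices are built once with sums over plane waves, after which each k-point costs only operations over the reduced basis. The sketch below is illustrative only; it assumes random toy g-vectors and a random orthonormal basis in place of a real calculation, and the names `kinetic_reduced` and `kinetic_direct` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pw, n_b = 40, 6
g = rng.integers(-3, 4, size=(n_pw, 3)).astype(float)   # toy g-vectors, not a real cutoff sphere
# Orthonormal reduced basis <g|b_i> from the QR factorization of a random complex matrix.
b, _ = np.linalg.qr(rng.standard_normal((n_pw, n_b)) + 1j * rng.standard_normal((n_pw, n_b)))

# k-independent matrices, each built once with a sum over plane waves.
t0 = b.conj().T @ (np.sum(g ** 2, axis=1)[:, None] * b)   # Eq. 9
t1 = np.einsum("gi,gm,gj->mij", b.conj(), g, b)           # Eq. 10, one matrix per component
t2 = np.eye(n_b)                                          # Eq. 11

def kinetic_reduced(k):
    """2<b_i|T(k)|b_j> assembled from Eq. 12; no sum over plane waves."""
    return np.dot(k, k) * t2 + 2.0 * np.tensordot(k, t1, axes=1) + t0

def kinetic_direct(k):
    """Same matrix by explicit projection of |k + g|^2 (Eq. 8) for comparison."""
    return b.conj().T @ (np.sum((g + k) ** 2, axis=1)[:, None] * b)

k = np.array([0.1, -0.3, 0.25])
print(np.max(np.abs(kinetic_reduced(k) - kinetic_direct(k))))
```

The identity $T^2 = 1$ relies on the reduced basis being orthonormal in the plane-wave representation, which the QR step guarantees in this sketch.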

C. Auxiliary basis for projectors

The projectors can be written as the product of origin-centered, k-dependent atomic projectors, $\beta_{0,l}(k)$, and an atom-dependent structure factor, $\exp[-i(g + k) \cdot \tau_a]$, which we split into k-dependent and g-dependent terms:

$$\langle g_i | \beta_{a,l}(k) \rangle = e^{-i k \cdot \tau_a}\, \langle g_i | S(a) | g_i \rangle\, \langle g_i | \beta_{0,l}(k) \rangle, \tag{18}$$

where l runs over angular momenta, a runs over the atomic centers, and a = 0 corresponds to the origin. The g-dependent part of the structure factor, which is diagonal in the plane-wave basis, can be written as:

$$\langle g | S(a) | g \rangle = e^{-i g \cdot \tau_a}. \tag{19}$$

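If the number of k-points is large, we could consider transforming the structure factor in the SRB before applying it to the origin-centered projectors:

$$\begin{aligned}
\langle b_i | S(a) | g_j \rangle &= \langle b_i | g_j \rangle \langle g_j | S(a) | g_j \rangle \\
\langle b_i | \beta_{a,l}(k) \rangle &= \sum_{\gamma} \langle b_i | S(a) | g_\gamma \rangle \langle g_\gamma | \beta_{0,l}(k) \rangle.
\end{aligned} \tag{20}$$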

The inner product in the plane-wave basis is inefficient: the size of the plane-wave basis is much larger than the rank of $\langle b_i | S(a) | g_j \rangle$, which is the lesser of the size of the SRB and the dimension of the space spanned by the origin-centered projectors $\langle g_\gamma | \beta_{0,l}(k) \rangle$. We can introduce an auxiliary basis, $|x\rangle$, to reduce the effort in the inner product:

$$\begin{aligned}
\langle b_i | S(a) | x_j \rangle &= \sum_{\gamma} \langle b_i | g_\gamma \rangle \langle g_\gamma | S(a) | g_\gamma \rangle \langle g_\gamma | x_j \rangle \\
\langle x_i | \beta_{0,l}(k) \rangle &= \sum_{\gamma} \langle x_i | g_\gamma \rangle \langle g_\gamma | \beta_{0,l}(k) \rangle \\
\langle b_i | \beta_{a,l}(k) \rangle &= \sum_{j} \langle b_i | S(a) | x_j \rangle \langle x_j | \beta_{0,l}(k) \rangle.
\end{aligned} \tag{21}$$

The basis x needs to represent the origin-centered projectors throughout the BZ. This is analogous to the SRB, which must represent the periodic Bloch functions throughout the BZ. Indeed, we use the procedure outlined in Section III A, but with $\langle g_i | \beta_{0,l}(q) \rangle$ as input states. This auxiliary basis need only be produced once, as it is independent of the electron density and atomic positions. Indeed, it could even be pre-computed and packaged with the pseudo-potential. Therefore, we prefer to use the entire K grid as the ‘coarse’ sample Q, which makes the fractional variance a good metric for the accuracy of the basis. In that sense, the auxiliary basis $\{|x_i\rangle\}$ is optimal. This procedure must be performed for each atomic species.
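The saving offered by Equation 21 can be seen in a small numerical experiment. The sketch below is a toy under stated assumptions: the origin-centered projectors are random vectors with an artificial quadratic dependence on a scalar k-coordinate, so that they span a low-dimensional space, and the auxiliary basis is obtained from an SVD of all projectors on the k grid. All array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pw, n_b, n_proj, n_k = 200, 8, 4, 5
b, _ = np.linalg.qr(rng.standard_normal((n_pw, n_b)) + 1j * rng.standard_normal((n_pw, n_b)))
g = rng.standard_normal((n_pw, 3))             # toy g-vectors
tau = np.array([0.1, 0.2, 0.3])                # toy atomic position
phase = np.exp(-1j * g @ tau)                  # diagonal elements <g|S(a)|g>

# Toy projectors <g|beta_{0,l}(k)> with a smooth (quadratic) dependence on k.
f = rng.standard_normal((3, n_pw, n_proj))
kappas = np.linspace(0.0, 1.0, n_k)
beta0 = [f[0] + kap * f[1] + kap ** 2 * f[2] for kap in kappas]

# Auxiliary basis: POD (here via SVD) of the projectors over the whole k grid.
x, sing, _ = np.linalg.svd(np.hstack(beta0), full_matrices=False)
x = x[:, sing > 1e-10 * sing[0]]
n_aux = x.shape[1]

s_bx = b.conj().T @ (phase[:, None] * x)       # <b_i|S(a)|x_j>, built once per atom
max_error = 0.0
for beta in beta0:
    direct = b.conj().T @ (phase[:, None] * beta)   # Eq. 20: inner product over plane waves
    via_aux = s_bx @ (x.conj().T @ beta)            # Eq. 21: inner product over |x>
    max_error = max(max_error, float(np.max(np.abs(direct - via_aux))))

print("auxiliary basis size:", n_aux, "of", n_pw, "plane waves; max error:", max_error)
```

In this toy, the 20 projector columns span only 12 directions, so the per-atom contraction runs over 12 auxiliary functions instead of 200 plane waves with no loss of accuracy.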

D. Diagonalizing the k-dependent Hamiltonian

The Hamiltonian and, in the case of USPP, the overlap matrix are dense. Therefore, it is natural to leverage highly optimized direct solvers, such as those found in the LAPACK library [30]. However, the ratio of the number of basis elements to the number of bands is O(10), which is large enough to justify the use of an iterative solver. Furthermore, as in the conventional case, diagonalization in early self-consistent iterations need not be fully converged. We intend to explore the use of iterative solvers in the future.
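For USPP the reduced eigenproblem is a generalized one, $H c = E S c$. A dense direct solution can be sketched with NumPy alone by reducing it to a standard Hermitian problem through a Cholesky factorization of S; this is an illustration, not necessarily the driver used in the authors' implementation, and `solve_generalized` is a hypothetical name. Libraries that wrap the LAPACK generalized drivers, for example `scipy.linalg.eigh(h, s)`, would normally be preferred in practice.

```python
import numpy as np

def solve_generalized(h, s, n_bands):
    """Lowest n_bands solutions of H c = E S c for Hermitian H and positive-definite S."""
    chol = np.linalg.cholesky(s)                 # S = L L^dagger
    l_inv = np.linalg.inv(chol)                  # explicit inverse, acceptable for a small dense sketch
    energies, y = np.linalg.eigh(l_inv @ h @ l_inv.conj().T)   # standard problem for y = L^dagger c
    c = l_inv.conj().T @ y[:, :n_bands]          # back-transform, normalized so c^dagger S c = 1
    return energies[:n_bands], c

rng = np.random.default_rng(3)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
m = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
h_toy = a + a.conj().T                           # toy Hermitian "Hamiltonian"
s_toy = np.eye(8) + 0.1 * (m @ m.conj().T)       # toy positive-definite "overlap"
e_toy, c_toy = solve_generalized(h_toy, s_toy, n_bands=3)
print(e_toy)
```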

E. Constructing the density matrix

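After diagonalizing the k-dependent Hamiltonian to compute the periodic components of the eigenfunctions $\langle b_i | u_{mk} \rangle$ and the energies $\varepsilon_{mk}$, one must produce a real-space electron density so the process can be repeated. The simplest way of doing this would be to transform the eigenfunctions back to plane-waves and then accumulate the density in the usual way:

$$\begin{aligned}
\langle g_i | u_{nk} \rangle &= \sum_{j} \langle g_i | b_j \rangle \langle b_j | u_{nk} \rangle \\
\langle r | u_{nk} \rangle &= \sum_{j} \langle r | g_j \rangle \langle g_j | u_{nk} \rangle \\
\langle r | \rho | r \rangle &= \sum_{nk} \langle r | u_{nk} \rangle\, f(\varepsilon_{nk})\, \langle u_{nk} | r \rangle,
\end{aligned} \tag{22}$$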

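where f(ε) is some occupation function and the transformation from plane-waves to real-space is generally accomplished with an FFT.

Another method takes advantage of the small size of the SRB to form a density matrix:

$$\langle b_i | \rho | b_j \rangle = \sum_{nk} \langle b_i | u_{nk} \rangle\, f(\varepsilon_{nk})\, \langle u_{nk} | b_j \rangle. \tag{23}$$

Using the Hermitian singular value decomposition $\rho = V \Sigma V^{\dagger}$, one can re-write the density matrix in terms of its rank-1 components:

$$\langle b_i | \rho | b_j \rangle = \sum_{\nu} \langle b_i | v_\nu \rangle\, \sigma_\nu\, \langle v_\nu | b_j \rangle, \tag{24}$$

where $\sigma_\nu$ are the diagonal elements of Σ. The rank-1 components, $\langle b_i | v_\nu \rangle$, can be accumulated in real-space just as we did for the periodic parts of the Bloch eigenfunctions in Equation 22:

$$\begin{aligned}
\langle g_i | v_\nu \rangle &= \sum_{j} \langle g_i | b_j \rangle \langle b_j | v_\nu \rangle \\
\langle r | v_\nu \rangle &= \sum_{j} \langle r | g_j \rangle \langle g_j | v_\nu \rangle \\
\langle r | \rho | r \rangle &= \sum_{\nu} \langle r | v_\nu \rangle\, \sigma_\nu\, \langle v_\nu | r \rangle.
\end{aligned} \tag{25}$$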

This method has the advantage that the number of rank-1 components is bounded above by the size of the SRB. If there are more bands across all k-points than basis elements, $N_b N_k > N_r$, this approach performs fewer transformations. In practice, many of the $\sigma_\nu$ are very small so they can be truncated just as small occupations f(ε) generally are. This can reduce the number of rank-1 components by up to a factor of 4.

In the case of USPP, there is an addition to the electron density:

$$\langle r | \rho | r \rangle = \sum_{nk, \alpha, \alpha'} f(\varepsilon_{nk})\, \langle u_{nk} | \beta_\alpha(k) \rangle \langle \phi_\alpha | S(r) | \phi_{\alpha'} \rangle \langle \beta_{\alpha'}(k) | u_{nk} \rangle, \tag{26}$$

which requires the inner product of the wave functions and the projectors:

$$\langle \beta_i(k) | u_{nk} \rangle = \sum_{j} \langle \beta_i(k) | b_j \rangle \langle b_j | u_{nk} \rangle. \tag{27}$$

Note that the length of the inner product is the reduced basis size, representing a significant reduction in work over the plane-wave counterpart.
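The equivalence of the direct accumulation (Equation 22) and the density-matrix route (Equations 23 through 25) can be verified on random data. The sketch below is a toy under stated assumptions: a one-dimensional FFT grid stands in for the real-space mesh, the reduced coefficients and occupations are random, the USPP augmentation of Equation 26 is omitted, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pw, n_b, n_bands, n_k = 64, 6, 4, 10
basis, _ = np.linalg.qr(rng.standard_normal((n_pw, n_b)) + 1j * rng.standard_normal((n_pw, n_b)))
coeffs = (rng.standard_normal((n_k, n_b, n_bands))
          + 1j * rng.standard_normal((n_k, n_b, n_bands)))   # toy <b_j|u_nk>
occ = rng.uniform(0.0, 1.0, size=(n_k, n_bands))             # toy occupations f(eps_nk)

def to_real_space(c_red):
    """Reduced basis -> plane waves -> real space (1D toy FFT grid)."""
    return np.fft.ifft(basis @ c_red, axis=0)

# Eq. 22: one transformation per band and k-point (n_k * n_bands in total).
rho_direct = sum(
    occ[k, n] * np.abs(to_real_space(coeffs[k, :, n])) ** 2
    for k in range(n_k) for n in range(n_bands)
)

# Eqs. 23-25: accumulate the small density matrix, then transform its eigenvectors.
rho_red = np.einsum("kin,kn,kjn->ij", coeffs, occ, coeffs.conj())   # Eq. 23
sigma, v = np.linalg.eigh(rho_red)                                  # Eq. 24
rho_lowrank = np.sum(sigma * np.abs(to_real_space(v)) ** 2, axis=1) # Eq. 25

print("transformations:", n_k * n_bands, "direct vs", sigma.size, "via the density matrix")
print("max difference:", np.max(np.abs(rho_direct - rho_lowrank)))
```

Here 40 band transformations are replaced by 6, the size of the reduced basis; truncating small $\sigma_\nu$ would reduce the count further at the price of a controlled error.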

F. Basis saving

The self-consistent procedure serves not only to relax the electron density to the ground state, but also to relax the subspace spanned by the SRB to include the true ground state. Therefore, we consider two convergences: the convergence of the electron density and the convergence of the SRB. To fully generalize the method, we consider these two convergences separately. Namely, we can refrain from updating the SRB at every self-consistent iteration, instead ‘saving’ the transformation matrix, kinetic energy, projectors, and factorized overlap matrix from the previous iteration. The only parts of the eigenproblem which must be recomputed are the explicitly density dependent operators: the local and, in the case of USPP, non-local potentials, Equations 13 and 16, respectively. Relaxing the basis more slowly than the electron density can slow the convergence of the electron density, but the avoided cost of computing a new basis and transforming density independent quantities can result in a net performance improvement.

The two convergences need not be terminated by the same error condition. In general, the solution is less sensitive to the basis than it is to the electron density. Thus, the basis can be frozen on a weaker convergence threshold than the electron density while preserving the accuracy of the solution. In fact, if the basis is not frozen near convergence, it can lead to slowly decaying oscillations in the charge density and basis elements, similar to so-called charge sloshing, delaying convergence.

G. Forces and stresses

The electron density produced by the SRB technique, Equation 25, is represented in the same basis as the plane-wave result. Therefore, to compute forces and stresses, we need only address terms that depend on the wave-functions directly [31].

The forces depend on derivative operators nominally expressed in the plane-wave basis:

$$D^\mu(k) \equiv \langle g_i | \partial_{R^\mu} | g_j \rangle = (g_i^\mu + k^\mu)\, \delta_{i,j} , \tag{28}$$

which are used to compute the derivatives of the projection:

$$\partial_{R_a^\mu} \langle \beta_{a,l}(k) | u_{nk} \rangle = \left[ \langle \beta_{0,l}(k) | \, S(a)^\dagger D^\mu(k)^\dagger \right] | u_{nk} \rangle . \tag{29}$$

The matrices S and D are Hermitian and diagonal in the plane-wave basis, so we can drop the adjoints and commute them. We need to compute the quantity:

$$\langle b_i | S(a) D^\mu(k) | \beta_{0,l} \rangle , \tag{30}$$

so we can take inner products with the periodic Bloch states $\langle b_i | u_{nk} \rangle$ in the reduced basis.


This problem is analogous to that of the projectors, Equation 18, but with $D^\mu \beta_{0,l}$ replacing $\beta_{0,l}$. Therefore, we can form an auxiliary basis just as in Equation 21. As before, the auxiliary basis is independent of the electron density and the atomic position, so it can be produced once and stored for the duration of the calculation. For relaxations and molecular dynamics, this allows the cost of producing the basis to be amortized by many force evaluations. For self-consistent calculations with a single force evaluation, it may be advantageous to compute the derivatives of the projection in the plane-wave basis and explicitly transform into the SRB.

The stress has an additional term that depends on the wave-functions, the kinetic stress [32]:

$$\sigma_{kin}^{\mu,\nu} = \sum_{n,k,g} \langle u_{nk} | g \rangle \, (g + k)^\mu (g + k)^\nu \, \langle g | u_{nk} \rangle , \tag{31}$$

which leads to the k-dependent kinetic stress operator:

$$\langle g | \sigma_{kin}^{\mu,\nu}(k) | g \rangle = (g + k)^\mu (g + k)^\nu . \tag{32}$$

Unsurprisingly, this term can be treated as the kinetic energy operator was in Equation 8: the product is expanded and the k-dependence factored, yielding a k-independent transformation:

$$\langle g | \sigma_{kin}^{\mu,\nu}(k) | g \rangle = g^\mu g^\nu + g^\mu k^\nu + g^\nu k^\mu + k^\mu k^\nu . \tag{33}$$

The following nine transformations are required:

$$\langle b_i | g^\mu g^\nu | b_j \rangle = \sum_g \langle b_i | g \rangle \, g^\mu g^\nu \, \langle g | b_j \rangle \tag{34}$$

$$\langle b_i | g^\mu | b_j \rangle = \sum_g \langle b_i | g \rangle \, g^\mu \, \langle g | b_j \rangle , \tag{35}$$

which are then composed as:

$$\langle b_i | \sigma_{kin}^{\mu,\nu}(k) | b_j \rangle = k^\mu k^\nu \delta_{i,j} + \langle b_i | g^\mu g^\nu | b_j \rangle + \langle b_i | g^\mu | b_j \rangle k^\nu + \langle b_i | g^\nu | b_j \rangle k^\mu . \tag{36}$$

Conveniently, the single-g terms were already computed for the kinetic energy in Equation 10. Further, one of the diagonal double-g terms can be computed by subtraction:

$$g^3 g^3 = |g|^2 - g^1 g^1 - g^2 g^2 , \tag{37}$$

with the $|g|^2$ term previously defined as $K_0$, Equation 9. Thus, the total number of additional diagonal transformations is five.

The second derivatives found in the non-local contribution to the stress can be computed in a way similar to Equation 30. There are twelve such terms, but they are independent of the density and the atomic positions, so they can be re-used.
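The k-independent composition of the kinetic stress operator (Equations 33 to 37) can be checked numerically. The sketch below is illustrative only: the plane-wave grid `g` and the orthonormal basis `B` (built here by QR factorization of random data) are hypothetical stand-ins, and the function names are not taken from any implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical plane-wave vectors: all integer triples in [-3, 3]^3.
g = np.array(list(itertools.product(range(-3, 4), repeat=3)), dtype=float)
n_pw, n_basis = g.shape[0], 8

# Orthonormal stand-in for the reduced basis, B[g, i] = <g|b_i>.
B, _ = np.linalg.qr(rng.normal(size=(n_pw, n_basis)) + 1j * rng.normal(size=(n_pw, n_basis)))

# k-independent transformations over the plane waves, done once (Eqs. 34 and 35).
G1 = np.einsum("gi,gm,gj->mij", B.conj(), g, B)         # <b_i|g^mu|b_j>
G2 = np.einsum("gi,gm,gn,gj->mnij", B.conj(), g, g, B)  # <b_i|g^mu g^nu|b_j>

def kinetic_stress(k):
    """Compose Eq. 36 for one k-point using only reduced-basis quantities."""
    eye = np.eye(n_basis)
    return (np.einsum("m,n,ij->mnij", k, k, eye)
            + G2
            + np.einsum("mij,n->mnij", G1, k)
            + np.einsum("nij,m->mnij", G1, k))

def kinetic_stress_direct(k):
    """Reference: transform (g + k)^mu (g + k)^nu over all plane waves for this k."""
    q = g + k
    return np.einsum("gi,gm,gn,gj->mnij", B.conj(), q, q, B)

# Eq. 37: the zz term follows from |g|^2 (the K_0 term) by subtraction.
K0 = np.einsum("gi,g,gj->ij", B.conj(), (g ** 2).sum(axis=1), B)
G2_zz_by_subtraction = K0 - G2[0, 0] - G2[1, 1]
```

The composed and direct operators agree for any `k`, but only the direct version sums over plane waves for every k-point; the $\delta_{i,j}$ term relies on the basis being orthonormal.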

IV. EXAMPLES

Through the rest of this paper, we discuss results from three physical systems: a (3,3) carbon nanotube (CNT), a nickel slab decorated with water and hydroxide, and a bulk gold molecular dynamics snapshot at 2000 K. Each system was chosen to be representative of a broad class of electronic structure problems, as described in the following sections. To demonstrate the generality of the method, we use Perdew-Burke-Ernzerhof GGA (PBE) [33] and Perdew-Zunger LDA (PZ) [34] functionals with Vanderbilt ultra-soft [25], Rappe-Rabe-Kaxiras-Joannopoulos ultra-soft (RRKJ) [35], and Troullier-Martins (TM) [36] norm-conserving pseudopotentials. The raw input files used for these examples can be found in the srb-supplemental repository [37]. A summary of key parameters is given in Table I.

FIG. 2: Four cells of a (3,3) CNT.

A. (3,3) carbon nanotube

The (3,3) carbon nanotube (CNT) is a metallic one-dimensional nanostructure. It has a 12-atom primitive cell arranged around the z-axis. To replicate isolation within periodic boundary conditions, a large vacuum region is added in the x-y plane. We are only concerned with relaxing stress along the tube axis.

Metallic systems with a small number of Fermi level crossings, such as the ‘Dirac points’ in CNTs, require thorough sampling of the BZ. When the BZ is under-sampled, the crossings can be missed, leading to an artificial band gap. In this case, 32 irreducible points are required.

The CNT is representative of a large class of extended one-dimensional metallic and semiconducting molecules (e.g. linear polymers) and nanostructures, including nanotubes, nanowires, and nanoribbons. Accurate descriptions of the electronic structure of these materials are vital to estimating their electronic properties and their structures. In addition, the proximity of metallic or semi-metallic nanostructures to molecules, nanoparticles or extended substrates often results in significant charge transfer across the interface due to electronic coupling and hybridization [38]. Inaccurate descriptions of the Fermi surface can modify the details of electron transfer and lead to poor descriptions of chemisorption or electronic transmittance across such interfaces. Modeling such interactions at finite temperature would also require sufficient k-point sampling to cover the worst possibilities during an entire trajectory, such as switching from metallic to semi-metallic behavior [39]. In the absence of numerical convergence, such simulations would incorrectly describe the system as semiconducting in both cases.

FIG. 3: Nickel slab decorated with water and hydroxide. Cell is repeated 2 × 2 in the plane of the slab.

B. Nickel slab

We provide an example relevant to the study of surface chemistry and catalysis. A four atomic layer Ni slab is chosen to model a semi-infinite reactive metal surface, with sufficient vacuum padding to reduce the influence of electronic coupling between the top and bottom surfaces. We decorate one side of the slab with water molecules and hydroxide moieties, representing a proposed surface coverage of Ni in the presence of water vapor [40]. The particular details of this system are not important within the context of the current study, and we will not try to draw any physical or chemical conclusions. However, this quasi-2D system is representative of a large class of simulations which aim to model surfaces and interfaces and their chemistry or reactivity. Specific details of surface relaxation and reconstruction, charge transfer and reactivity, and electronic coupling via hybridization are all strongly dependent on an accurate description of the electronic structure and require numerically converged sampling of the BZ. The combination of a large number of k-points and the additional requirement to include a large number of plane-waves to describe the vacuum above and below the slab render such common calculations prohibitively expensive. Any gains in efficiency through the use of the SRB would surely be welcome in the surface science and catalysis communities.

FIG. 4: Gold snapshot at 2000 K. Cell contains 32 atoms.

C. Gold snapshot

The gold system is a 32 atom disordered snapshot extracted from an MD simulation at 2000 K [41]. A hard norm-conserving pseudo-potential is used to describe the ionic cores under these extreme conditions. The gold snapshot is representative of the typical accuracy requirements and computational cost per step of first-principles molecular dynamics of metals under extreme conditions. We wish to demonstrate the possibility of realizing some speedup for such systems. In addition, we wish to provide a practicable option for running robust simulations of systems which undergo insulator to metal transitions or pressure induced metallization. Furthermore, we hope to enable additional parallelization paradigms for large scale molecular dynamics simulations through our use of the SRB.


System  Calculation  K   Rv     A   PP  Exc
CNT     SCF/Bands    33  60 a0  12  US  PBE
Ni      SCF/Bands    74  68 a0  17  US  PBE
Au      AIMD         36  0      32  NC  PZ

TABLE I: Key parameters of representative examples. K is the number of irreducible k-points; Rv is the cell dimension in the vacuum direction; A is the number of atoms; PP is the pseudopotential type; and Exc is the exchange-correlation functional.

D. Agreement

In the following sections, we will compare different calculations of the same physical system. Here, we define the standards by which we compare calculations, that is, the conditions that must be met for two calculations to be in ‘agreement’.

We define agreement in terms of three measures: the band structure (ε), the force (F), and the pressure (P). Agreement in band structure means less than 5 meV absolute root mean square error over bands that are below or cross the Fermi level. We zero the band structure at the Fermi level to avoid constant shift errors. Agreement in force means the root mean square error is less than $10^{-3}$ Ry/au or 5% of the root mean square force, whichever is higher. Agreement in pressure means the absolute error is less than 1 kbar or the relative error is less than 5%. The total energy is an interesting metric mathematically, as it converges monotonically with respect to basis completeness due to variational freedom. However, it has no absolute physical meaning in pseudo-potential calculations. Therefore, we report the total energy per atom (E/A), but do not constrain our methods to directly agree with respect to total energy.
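The three agreement criteria can be written down compactly. The following is a minimal sketch rather than the authors' tooling: the function names, the array layout (k-points by bands), and the rule for selecting bands "below or crossing" the Fermi level (minimum over k-points not above zero in the reference calculation) are assumptions made for illustration.

```python
import numpy as np

def bands_agree(eps_ref, eps_test, ef_ref, ef_test, tol_mev=5.0):
    """Band energies in meV, shape (n_k, n_bands); each set is zeroed at its Fermi level."""
    a = np.asarray(eps_ref, dtype=float) - ef_ref
    b = np.asarray(eps_test, dtype=float) - ef_test
    # Assumed selection rule: a band counts if it lies below or crosses E_F somewhere.
    selected = a.min(axis=0) <= 0.0
    rms = np.sqrt(np.mean((a[:, selected] - b[:, selected]) ** 2))
    return bool(rms < tol_mev)

def forces_agree(f_ref, f_test, abs_tol=1e-3, rel_tol=0.05):
    """Forces in Ry/au; the tolerance is the larger of abs_tol and rel_tol * RMS force."""
    f_ref = np.asarray(f_ref, dtype=float)
    f_test = np.asarray(f_test, dtype=float)
    rms_err = np.sqrt(np.mean((f_ref - f_test) ** 2))
    rms_force = np.sqrt(np.mean(f_ref ** 2))
    return bool(rms_err < max(abs_tol, rel_tol * rms_force))

def pressure_agrees(p_ref, p_test, abs_tol=1.0, rel_tol=0.05):
    """Pressure in kbar; passes if either the absolute or the relative criterion holds."""
    err = abs(p_ref - p_test)
    return bool(err < abs_tol or (p_ref != 0.0 and err / abs(p_ref) < rel_tol))
```

Zeroing each band structure at its own Fermi level means a rigid shift between two calculations does not count as a disagreement, as intended by the definition above.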

E. PWDFT convergence

PWDFT has two primary discretizations: the fine k-point mesh and the number of plane waves, which is defined through a kinetic energy cut-off $E_{cut}$. Before comparing to the SRB, we ensure that each structure is converged with respect to the number of k-points and plane-waves. The resulting configurations can be found in Table I.

For each structure, we converge with respect to the k-point mesh such that a doubling of the number of k-points on edge produces a result in agreement, as defined in Section IV D. For example, if the converged mesh had 16 points on edge, then a calculation with 32 points on edge should agree with it.

Convergence with respect to energy cutoff is a more nuanced problem, especially for ultra-soft pseudo-potentials. For demonstration purposes, we simplify the matter by using a 32 (48) Rydberg cutoff for ultra-soft (norm conserving) pseudo-potentials in our presented examples. The validity of the SRB over a broad range of energy cut-offs is demonstrated in Figure 7.

         NQ  Nb    ∆E/A   ∆ε    ∆F        ∆P
Corners  8   1595  31.74  7.27  0.003617  173.26
Edges    20  3458  1.24   0.39  0.000231  1.29
Faces    26  3507  1.03   0.21  0.000092  0.55
Center   27  3509  1.00   0.20  0.000082  0.51

TABLE II: Comparison of basis size and accuracy for Au with various sets of q-points. In all cases, $\sigma_b^2 = 10^{-7}$. Energies are given in meV, forces in Ry/a0, and pressure in kbar.

The CNT and nickel slab structures are vacuum padded normal to the structure to model a reduced dimensional system in full periodic boundary conditions. At fixed atomic positions, the planar (axial) forces and stress of the Ni slab (CNT) are coupled to the amount of vacuum padding. This prevents agreement with respect to the raw parallel pressure and forces at reasonable vacuum separations. Instead, we allow each structure to relax normal to the vacuum direction before computing the agreement metrics. However, for comparison with the SRB result, we use the un-relaxed structure with the vacuum from the relaxed structure. This way, the forces and pressure are non-zero and can be meaningfully compared.

Band structure computations are performed non-self consistently with the same energy cut-off and vacuum padding as the self-consistent calculation. However, in order to converge the band energies to high accuracy, sometimes the plane-wave eigensolver must be changed from Davidson to conjugate gradients. The switch is not uncommon for non-self consistent calculations.

V. PARAMETERIZATION

Thus far, we have tried to present the method as generally as possible. Here, we describe the relevant parameters of the method and their effects on the accuracy and performance of the SRB. In particular, we attempt to provide sensible defaults.

The times reported in this section should be interpreted qualitatively. For quantitatively valid timings, see Section VII.

A. Defining the SRB

There are three parameters that govern the subspace of the full basis in which the reduced basis is generated: the coarse sample of reciprocal space, Q; the truncation threshold, $\sigma_b^2$; and the maximum number of basis elements, $N_{max}$. Our implementation provides for a simultaneous definition of $N_{max}$ and $\sigma_b^2$. As SCF calculations converge, the number of basis elements that satisfy $\sigma_b^2$ tends to decrease, so if both are specified tightly, $N_{max}$ applies early in the calculation and $\sigma_b^2$ later.

It was found that for non-self consistent calculations of sufficiently large unit cells, the Γ point and its mirrors on the unit cube are generally sufficient [10]. For self-consistent calculations, this is not generally the case. We find it best to use points that lie on the boundary of the BZ. For example, in three dimensions the corners, edge centers, and face centers of the unit cube should be used in that order. The relative basis sizes and accuracy for those coarse samples can be seen in Table II. In two dimensions, only the face $q_z = 0$ of the unit cube should be included and the coarse spacing should be halved. In one dimension, only the edge $q_z = q_y = 0$ should be included and the spacing halved again. These coarse BZ samples can be seen in Figure 5.

FIG. 5: Coarse samples of the first Brillouin zone for (a) stick, (b) slab, and (c) bulk geometries. Open points are related to closed points by inversion and translation.

One could consider increasing the number of input coarse eigenstates by artificially increasing the number of bands per coarse sample beyond the number desired in the fine sample. However, the additional high-energy eigenstates would do little to improve the SRB's representation of the occupied low-energy eigenstates. Simply, the high and low energy eigenspaces overlap less than the low energy eigenspaces across the BZ.

This is indicative of a more general trade-off: the more representative the coarse sample Q is of the fine sample K, the fewer basis elements are needed to achieve a fixed error tolerance. For a fixed error tolerance, selecting a more representative Q generally leads to a smaller basis. This is more readily understood by holding the number of basis elements fixed and increasing the density of Q, which will reduce the error. In addition to the aforementioned example of increasing the number of bands, this principle applies to applying symmetry beyond the first BZ.

σ_b^2   Nb    ∆E     ∆ε      ∆F        ∆P
10^-4   2173  99.62  13.640  0.002883  92.17
10^-7   3507  1.03   0.209   0.000092  0.55
10^-14  5485  0.50   0.043   0.000038  0.47
0       5486  0.50   0.043   0.000038  0.47

TABLE III: Comparison of basis size and accuracy for Au with various error tolerances, $\sigma_b^2$. In all cases, the q-points are corners, edges, and faces of the unit cube. Energies are given in meV, forces in Ry/a0, and pressure in kbar.

σ_x^2  % Elem.  ∆E     ∆ε     ∆F        ∆P
10^-2  10.24    51.59  73.75  0.000224  3.74
10^-3  56.94    1.36   0.685  0.000093  0.54
10^-4  84.72    1.08   0.211  0.000092  0.55
10^-6  100.     1.03   0.209  0.000092  0.55
0      100.     1.03   0.209  0.000092  0.55

TABLE IV: Comparison of the number of auxiliary basis elements, as a fraction of the total projector input states, and accuracy for Au with various auxiliary error tolerances. In all cases, the auxiliary q-point grid Q = K. Energies are given in meV, forces in Ry/a0, and pressure in kbar.

The relationship between the basis truncation threshold, $\sigma_b^2$, the basis size, and the accuracy is demonstrated in Table III. To achieve agreement, we recommend $\sigma_b^2 = 10^{-7}$ as a default.
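The interplay between the truncation threshold and the maximum basis size can be sketched with a plain singular value decomposition. This is a hedged illustration, not the implementation: the function `reduced_basis` is hypothetical, and the normalization of the squared singular values by the number of coarse samples is an assumption chosen for the example.

```python
import numpy as np

def reduced_basis(U, n_q, sigma2_b, n_max=None):
    """Build an orthonormal basis from coarse-sample eigenvectors.

    U        : array whose columns are the input eigenvectors from all q-points
    n_q      : number of coarse q-points (used for the assumed normalization)
    sigma2_b : truncation threshold on the normalized squared singular values
    n_max    : optional cap on the number of basis elements
    """
    left, s, _ = np.linalg.svd(U, full_matrices=False)
    weight = s ** 2 / n_q  # assumed normalization convention
    n_keep = int(np.count_nonzero(weight > sigma2_b))
    if n_max is not None:
        n_keep = min(n_keep, n_max)  # whichever criterion is tighter applies
    return left[:, :n_keep], weight

# Toy input: 12 "eigenvectors" that only span a 3-dimensional subspace,
# mimicking the redundancy between neighbouring q-points.
rng = np.random.default_rng(0)
U = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 12))
basis, weight = reduced_basis(U, n_q=4, sigma2_b=1e-10)
capped, _ = reduced_basis(U, n_q=4, sigma2_b=1e-10, n_max=2)
```

Here the threshold recovers the three independent directions, while the cap overrides it when it is the tighter of the two criteria.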

B. Auxiliary basis

The auxiliary basis $|x\rangle$ used in transforming the projectors is controlled by three parameters (Section III C) in the same manner as the SRB. Because the auxiliary basis need be constructed only once, but is used at every iteration, it is advantageous to be thorough in the sample Q. We let Q = K, which gives the variances of the basis elements statistical meaning and produces an ‘optimal’ basis. In this light, the threshold method is preferred over specifying the number of elements. A comparison of the auxiliary basis size and resulting accuracy for various thresholds is provided in Table IV. A threshold $\sigma_x^2 = 10^{-3}$ is a robust choice.

The dimension of the SVD used to form the auxiliary basis scales linearly with the number of k-points. For small systems with a very large number of k-points, this cost can dominate the overall cost of projection. In those cases, a smaller set of q-points, Q ⊂ K, should be used.

The construction of the auxiliary basis for each species is independent of the number of atoms of that species, while the construction of the structure factors occurs once per atom. The overhead associated with constructing the auxiliary basis is amortized by the number of atoms. If the number of atoms of a species is small, the overhead costs are more of a factor. Therefore, it may only be beneficial to use the auxiliary basis when the number of atoms of a species is above some threshold, $N_{x,min}$.

Similarly, for non-self-consistent calculations there is only one projection. Thus, the overhead cost of the SVD is not amortized over many iterations. For non-self-consistent calculations, direct transformation of the projectors is recommended.

The three examples here are self-consistent with a moderate number of k-points. Thus, we expect the auxiliary basis to provide superior performance. It does, as seen in Table V.

System  Direct (s)  Auxiliary basis (s)
CNT     47.98       20.59
Ni      3009.32     2612.15
Au      104.73      77.34

TABLE V: Comparison of the run-time of the direct transformation and auxiliary basis approaches to the projectors. In all cases, $\sigma_x^2 = 0.001$.

C. Density matrix

There are two parameters that describe the building of the real-space density. The first is the threshold on the Fermi weight with which to add a wave function to the density matrix. This threshold is present in most PW DFT codes. The second is the threshold on the singular values of the density matrix, $\sigma_\rho^2$. Because the density matrix is complete, this parameter provides an exact measure for the preservation of the electron density. The relationship between this threshold and accuracy is demonstrated in Table VI. A sensible value is $10^{-4}$.

The density matrix is the size of the Hamiltonian, so the rank-1 decomposition costs no more than a single k-point diagonalization. The performance of this approach is compared to the direct transformation in Table VII.

σ_ρ^2  % Nb  ∆E    ∆ε     ∆F        ∆P
10^-2  23.4  6.42  11.93  0.000564  0.13
10^-3  29.6  8.07  3.79   0.000568  0.03
10^-4  37.0  8.11  2.636  0.000580  0.01
10^-6  48.2  8.11  2.635  0.000582  0.01
0      100.  8.11  2.620  0.000582  0.01
n/a    n/a   8.11  2.600  0.000537  0.01

TABLE VI: Comparison of the number of rank-1 components, as a fraction of the basis size, and accuracy for CNT with various density error tolerances. The last entry uses the direct transformation. Energies are given in meV, forces in Ry/a0, and pressure in kbar.

System  Direct (s)  Density matrix (s)
CNT     511.75      47.68
Ni      2958.32     588.99
Au      3976.52     2223.18

TABLE VII: Comparison of the run time of the direct transformation and density matrix approaches to computing the real-space electron density. In all cases, the density error tolerance $\sigma_\rho^2 = 10^{-4}$.

D. Basis saving

There are two parameters that define the reuse of the SRB during self-consistent calculations: the basis lifetime, $n_L$, and the freeze threshold, $\epsilon_b$. The basis lifetime is the minimum number of SCF iterations for which each basis is reused. The freeze threshold is the energy convergence criterion after which the basis is no longer updated.

As shown in Table VIII, the accuracy is nearly completely independent of the basis lifetime. As the basis lifetime increases, the number of iterations to convergence, $N_I$, can increase. However, the mitigation of overhead costs generally makes up for any additional iterations, lowering overall runtime. Across the full range of systems, $n_L = 3$ is a robust choice.

nL  NI  Time (s)  ∆E      ∆ε     ∆F        ∆P
5   52  1039.51   −3.879  2.308  0.000746  −0.35
4   47  1044.89   −3.879  2.303  0.000745  −0.35
3   56  1306.05   −3.879  2.299  0.000746  −0.34
2   54  1425.08   −3.879  2.300  0.000746  −0.34
1   52  1883.34   −3.879  2.296  0.000747  −0.33

TABLE VIII: Comparison of the number of iterations, run-time, and accuracy for Ni with various basis lifetimes. In all cases, the freeze threshold $\epsilon_b = 10^{-6}$. Energies are given in meV, forces in Ry/a0, and pressure in kbar.

ǫb     NI  Time (s)  ∆E       ∆ε     ∆F        ∆P
10^-2  47  941.19    −1.7085  2.260  0.000851  −0.38
10^-3  47  1052.83   −1.6878  2.199  0.000932  −0.31
10^-4  49  1189.98   −1.6861  2.197  0.000944  −0.20
10^-6  50  1612.63   −1.6847  2.191  0.000947  −0.29
10^-9  50  2163.92   −1.6850  2.206  0.000949  −0.29
0      50  2347.25   −1.6878  2.199  0.000932  −0.31

TABLE IX: Comparison of the number of iterations, run-time, and accuracy for Ni with various freeze thresholds. In all cases, the basis lifetime $n_L = 1$. Energies are given in meV, forces in Ry/a0, and pressure in kbar.

FIG. 6: Singular value spectrum (normalized singular value versus normalized index, for CNT, Ni, and Au) taken from the end of a self-consistent calculation using the coarse samples specified in Figure 5. The vertical axis is the singular value divided by the number of coarse samples, that is, the number of q-points Q. The horizontal axis is the singular value index divided by the number of eigenfunctions taken per q-point, that is, the number of bands B.

As the wave functions and electron density converge, the SRB changes less and less between updates. Once the energy eigenvalues are converged to within $\epsilon_b$, as measured by the standard convergence criterion for self-consistent calculations, the SRB is no longer updated. As demonstrated in Table IX, the accuracy rapidly improves as the threshold is decreased until basis saving is no longer a source of error. $\epsilon_b = 10^{-6}$ is a conservative choice.
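The way the basis lifetime and freeze threshold interact can be made concrete with a small scheduling sketch. It is a simplified model under stated assumptions: `basis_update_schedule` is a hypothetical helper, the convergence estimate is supplied as a precomputed list instead of coming from a real SCF loop, and the freeze is taken to be permanent once triggered.

```python
def basis_update_schedule(energy_changes, n_life, eps_freeze):
    """Return the SCF iterations at which the SRB would be rebuilt.

    energy_changes : convergence estimate observed at the end of each iteration
    n_life         : basis lifetime n_L, iterations for which each basis is reused
    eps_freeze     : freeze threshold; once reached, the basis is never rebuilt
    """
    rebuild_at = []
    age = n_life  # forces a build on the first iteration
    frozen = False
    for iteration, delta in enumerate(energy_changes):
        if not frozen and age >= n_life:
            rebuild_at.append(iteration)  # recompute basis and saved quantities
            age = 0
        # Density-dependent operators are always recomputed here (not modeled).
        age += 1
        if abs(delta) < eps_freeze:
            frozen = True
    return rebuild_at

# Hypothetical convergence history, decreasing by a decade per iteration.
history = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]
every_iteration = basis_update_schedule(history, n_life=1, eps_freeze=0.0)
with_saving = basis_update_schedule(history, n_life=3, eps_freeze=1e-4)
```

With `n_life=1` and a zero threshold the basis is rebuilt at every iteration; with `n_life=3` and `eps_freeze=1e-4` it is rebuilt only at iterations 0 and 3 for this history, which is the overhead reduction weighed against extra iterations in Tables VIII and IX.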

VI. ACCURACY

There is an important inconsistency between the plane-wave basis used to construct the SRB and the plane-wave basis used in a plane-wave calculation. Plane-wave codes typically define their plane-wave bases as the set of g-vectors within a sphere centered around each k-point. This ensures that each k-point has the same well-defined $E_{cut}$, but each k-point therefore uses a different set of g-vectors:

$$G_{PW}(k) = \left\{ g : |k + g|^2 < E_{cut} \right\} \tag{38}$$

The SRB, on the other hand, is k-independent; we typically define the SRB elements with respect to the Γ-point g-vectors. Similar issues arise when the mirroring procedure is applied to the input q-points [10], taking q → q′. In the plane-wave code, the plane-wave basis at q and q′ would be different.

We would like to create a larger basis that includes every k-point basis by simply adding a g-vector in each direction. That is, a 28x34x60 reciprocal space grid would become a 30x36x62 grid. However, in most modern PWDFT codes, including Quantum ESPRESSO, a custom FFT interface is used which imposes an isotropic energy cut-off. Instead, we can increase $E_{cut}$ such that every g-vector at every plane-wave k-point is included in the Γ-point basis. Each SRB k-point would then be represented on a superset of g-vectors from a mix of plane-wave k-points:

$$G_{SRB} = \left\{ g : |g|^2 < E_{cut} + |k_{max}|^2 \right\} \supset \bigcup_{k \in K} G_{PW}(k) \tag{39}$$

If the original PW calculation is not fully converged with respect to the plane-wave cutoff, then the SRB with increased cut-off will converge to a different, but sometimes more accurate, result due to the basis discrepancy. However, the differences induced by basis discrepancy closely track the g-vector truncation error. The errors of the PW and SRB results, computed with respect to a fully g-vector converged result, are nearly identical until the intrinsic accuracy defined by the basis truncation threshold is reached. Beyond that point, the SRB error levels off. The position of this level-off is the true error of the SRB. This effect is demonstrated in Figure 7. By decreasing $\sigma_b^2$, the error can be driven well below the agreement standards defined in Section IV D.

A. Convergence

The SRB spans a subspace of the span of the PW basis. As the SRB grows, by adding samples to the coarse grid or truncating at a lower error threshold, it converges uniformly to the PW result. Furthermore, convergence of variational quantities, such as the total energy, is monotonic.

The basis elements are produced by singular value decomposition, Eq. 4. Singular value spectra of the three model systems can be seen in Figure 6. The steady exponential fall-off of the singular values demonstrates the redundancy of the set of input states $\{|u_{mq}\rangle\}$, for q sampled from Q, and suggests that the solution should converge exponentially with respect to the basis truncation threshold $\sigma_b^2$.

Convergence of numerical examples displays three regimes. First, when the SRB is too small, the non-linearity of the system can cause a high-error plateau region. Subsequently, the error falls off exponentially. Finally, other differences between the cutoffs become apparent, causing the SRB difference to level off at the scale of the PW cutoff convergence.

FIG. 7: Convergence of the total energy, band structure, forces, and pressure with respect to $E_{cut}$ compared to $E_{cut}$ = 128 Ry for the bulk Au example. The four panels show ∆P (kbar), ∆E/N_A (eV), RMS force (Ry/au), and RMS band structure (eV), each versus $E_{cut}$ (Ry). The black lines are for a PW calculation and the colored lines for SRB calculations with a range of basis truncation thresholds, $\sigma_b^2$, from $10^{-3}$ to $10^{-14}$. The error in pressure is signed, so positive errors are shown as dashed lines.

There are other model differences related to the size of the auxiliary basis and the thresholds used for truncating sums in the density matrix. These errors are negligible when using the parameters described in Section V.

VII. PERFORMANCE

A simple empirical performance analysis is to break down the computation time into a setup and a per k-point cost, as seen in Table X. PWDFT spends the overwhelming majority of time in k-dependent terms, with only a few ρ-dependent operations. The SRB, on the other hand, spends a significant amount of time in k-independent ‘overhead’, in the form of the q-point calculations and basis transformations. The overhead is balanced by significantly lower per k-point costs. For the examples we have considered, the number of k-points needed to amortize the overhead, resulting in an overall faster calculation, ranges from 4 to 72; the speedup in the limit of an infinite number of k-points ranges from 1.4 to 22x; and the speedup of real calculations ranges from 0.8 to 5.3x.

It should be noted that the results presented in Table X were optimized for minimal total time. The optimal parameters for these specific systems can be seen in Table XIII. Greater per k-point acceleration could be achieved at the cost of greater overhead, and vice versa.

The SRB computation can be divided into the q-point plane-wave solutions, the construction of the SRB $\{|b\rangle\}$, the k-independent transformation of the Hamiltonian, the construction of the k-dependent Hamiltonian, diagonalization, and the construction of the electron density. The time spent in these sections of the code can be seen in Table XII.

For CNT, the SRB calculation is dominated by the plane-wave parts, which include the q-point solutions and the k-independent transformation of the local potential. Diagonalization, on the other hand, takes less than 1% of the run-time. In this case the SRB can be thought of as a way to reduce the number of FFTs, which dominate the vacuum-padded calculation.

For Ni and Au, the focus shifts from plane-wave calculations to dense linear algebra. Au spends more time building the basis because bulk BZs have more q-points than slab BZs. Ni spends more time constructing the k-dependent Hamiltonian and overlap matrix, S, because it employs ultra-soft pseudo-potentials. In both cases, diagonalization is a significant cost, but no individual cost dominates the calculation. The cost of each SRB step is at least linear in the number of basis elements, so reducing the basis size would reduce the cost of these calculations.

     PW-0    PW-K    PW Total  SRB-0    SRB-K   SRB Total  KS  X∞     Xreal
CNT  196.06  91.13   3204.     470.25   3.99    602.       4   22.84  5.32
Ni   234.95  181.69  13680.    2551.6   52.28   6420.      18  3.47   2.13
Au   10.93   341.36  12300.    6783.55  246.57  15660.     72  1.38   0.78

TABLE X: Breakdown into k-independent and k-dependent costs. PW-K (SRB-K) is the time spent in k-dependent steps of the PW (SRB) calculation divided by the number of k-points. PW-0 (SRB-0) is the time spent in k-independent steps of the PW (SRB) calculation. KS is the number of k-points needed for the SRB to require less time than the PW calculation. X∞ is the ratio of the PW to SRB calculation times in the limit as K → ∞. Xreal is the ratio of the PW to SRB calculation times for the actual calculation.

     K   Q  G      S     αD    αS
CNT  33  5  79214  176   2.54  6.07
Ni   74  7  12431  972   2.02  11.85
Au   36  7  20672  3016  2.39  14.29

TABLE XI: Indicators of the relative performance of the SRB. K is the number of irreducible k-points; Q is the number of irreducible q-points; G is the number of plane waves; S is the number of SRB elements; αD is the size of the subspace used in the Davidson algorithm as a ratio of the number of bands; and αS is the size of the SRB as a ratio of the number of bands.

     PW    {|b〉}  H     H(k)  Diag  ρ
CNT  62.6  4.7    19.9  6.4   0.7   5.8
Ni   16.0  2.6    12.8  17.2  43.8  7.7
Au   12.9  11.2   14.6  5.2   46.8  9.3

TABLE XII: Percent of run-time spent in plane-wave, basis construction, k-independent transformation, k-dependent construction, diagonalization, and density matrix sections, respectively.

     σ_b^2  σ_x^2  σ_ρ^2  nL  ǫb
CNT  10^-6  0      10^-4  1   10^-2
Ni   10^-5  10^-3  10^-4  5   10^-5
Au   10^-6  10^-3  10^-4  4   10^-4

TABLE XIII: Optimal parameters for the example systems, chosen so as to achieve agreement.

We can also reason about the performance of the SRB analytically. A full performance model, based on counting library calls, can be found in Appendix A. The model highlights three parameters as strongly indicative of the performance of the SRB compared to plane-waves: 1) the k-point ratio, Q/K; 2) the band density, B/V; and 3) the subspace inefficiency, αS/αD. The first two parameters are known a priori, while the third is buried deeply in the method of diagonalization. The values of these parameters for the three example systems are found in Table XI. In all cases, lower values favor the SRB.

The k-point ratio, Q/K, characterizes the balance of overhead to per k-point costs. The lower the value, the less significant the overhead of the q-point plane-wave calculations. The k-point ratio captures the importance of the number of k-points on the relative efficiency of the SRB.

The band density provides an a priori estimate of the ratio of the SRB size to the number of plane-waves. Plane-waves cover space uniformly; the number of plane-waves is linearly dependent on the volume of the unit cell. The SRB is computed from the bands at each q-point. Thus, the size of the SRB depends linearly on the number of bands, not the volume. The ratio of the SRB size to the number of plane-waves thus goes as the ratio of the number of bands to the volume, the band density. The band density captures the importance of the contents of the unit cell on the relative efficiency of the SRB.

The subspace inefficiency is defined as the ratio of the SRB size to the size of the iterative subspace that is directly solved in the iterative eigensolver. Because the SRB currently uses a direct solver for diagonalization, this ratio measures the relative difficulty of direct diagonalization in the two methods. The cost of diagonalization is cubic in the dimension, so the relative costs are the cube of the subspace inefficiency. The subspace inefficiency captures the difficulty of spanning the entire BZ with a single basis, compared to computing an iterative subspace for each k-point individually. In that context, it is clear why the subspace inefficiency is greater than one.

Unlike the k-point ratio and band density, the subspace inefficiency cannot be computed a priori. We have found it to depend strongly on the dimensionality of the BZ. The subspace inefficiency is highest for the 3D BZ of bulk systems and lowest for the 1D BZ of nanowires, nanoribbons, and nanorods.

     Default  Full    Relaxed  PW
CNT  949.     602.    544.     3204.
Au   29220.   15660.  7380.    12300.

TABLE XIV: Performance of the SRB based on serial run time (in seconds) using: default settings, settings which produce full agreement, and relaxed settings with reduced accuracy in the agreement criteria, as specified in Section VII A.

The data presented thus far reflect serial performance as a proxy for model complexity. In reality, PWDFT problems are solved in parallel, motivating a comparison of parallel performance. Such a comparison requires consideration of the parallel models employed in the SRB and PW calculations, which are not entirely consistent and are beyond the scope of this writing. However, a qualitative discussion of the benefits of the SRB algorithm for parallel performance is given in Section VIII B.

A. Tuning

The performance results presented thus far come from parameter configurations that satisfy all three agreement conditions from Section IV D. In some cases, these agreement conditions can be relaxed. For example, if the geometry is prescribed, then force and stress agreement for the purpose of relaxing the structure is unnecessary. On the other hand, if the end result is a molecular dynamics trajectory, then a highly accurate band structure may not be required at every time-step. The SRB provides a means for expressing lessened accuracy expectations to yield higher throughput calculations.

Here, we explore two situations with reduced accuracy requirements. The first is AIMD, where the band structure and stress at every time step are not important. We further relax the force constraint to be within 10% of the root mean square magnitude. The second is a band structure calculation of a prescribed CNT, where the forces and stress are not important. The constraint on the root mean square error of the band structure is kept at 5 meV.

The relaxed accuracy constraints lead to improved performance, seen in Table XIV. The resulting SRB performance is 1.67x better than plane-waves for Au and 5.89x better for CNT.
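The quoted speedups follow directly from the run times in Table XIV. As a quick check, the short Python sketch below recomputes them as the ratio of plane-wave to SRB run time; the dictionary layout and the function name are illustrative choices, not part of the SRB code.

```python
# Serial run times in seconds, copied from Table XIV.
RUN_TIMES = {
    "CNT": {"default": 949.0, "full": 602.0, "relaxed": 544.0, "pw": 3204.0},
    "Au": {"default": 29220.0, "full": 15660.0, "relaxed": 7380.0, "pw": 12300.0},
}


def speedup(system, setting):
    """Ratio of plane-wave run time to SRB run time for one SRB setting."""
    times = RUN_TIMES[system]
    return times["pw"] / times[setting]


for system in RUN_TIMES:
    # Prints 5.89 for CNT and 1.67 for Au with the relaxed settings.
    print(system, round(speedup(system, "relaxed"), 2))
```

Calling `speedup("Au", "full")` with the same data gives a ratio below one, which reflects the modest bulk performance under full agreement discussed in Section VIII A.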

VIII. DISCUSSION

A. Where to expect speedup

The performance model is able to capture the major indicators of SRB performance: the ratio of k-points to q-points, the ratio of plane-waves to reduced basis elements, and the ratio of reduced basis elements to bands.

Three dimensional bulk materials in modest cells often require a large number of k-points to resolve the Fermi surface. When disordered, broken symmetries further increase the number of irreducible points. However, bulk materials also have relatively high band densities. Furthermore, the entire volume of the BZ must be covered by the basis, requiring more basis elements per band than in a system with a 1D or 2D BZ. Using direct diagonalization, as the current implementation does, the SRB will provide no more than a modest reduction in the runtime. Using iterative diagonalization, one could reduce the performance model’s dependence on the ratio of basis elements to bands, providing a speedup through more efficient application of the Hamiltonian.

One dimensional wire, tube, or ribbon materials require vacuum padding to imitate isolation in periodic boundary conditions. The padding increases the number of plane-waves without changing the number of occupied bands; that is, it decreases the band density. Also, the BZ is one-dimensional, so relatively few basis elements per band are needed. However, a linear BZ generally requires fewer k-points. One dimensional systems frequently approach the limit given by Equation A6 with Q ≈ 4 and α_S ≈ 2α_D.

Slab geometries are the best of both worlds. The need for vacuum padding decreases the band density compared to bulk systems. The two planar dimensions can require many k-points compared to the one-dimensional case. The k-points lie in a plane, which can be spanned by a smaller number of basis functions than the bulk. The modest performance of the Ni slab considered here can be attributed to the thickness of the slab and the lack of iterative diagonalization.

B. Implications for parallelism

The SRB adds global operations to the SCF process: the construction of the basis and the transformation of k-independent factors of the Hamiltonian. However, these operations contain inner products over the plane-wave space, which is large and readily parallelizable. Only the diagonalization of the covariance matrix, Eq. 4, exhibits poor scaling, and that operation is independent of both G and K.

A significant advantage of the SRB comes in the reduction of the problem size. The reduced, dense eigenvalue problem is significantly smaller than the plane-wave equivalent. For example, a system with 1000 bands and 20 basis elements per band would fit on a 16 GiB node. Generally, the SRB requires less memory and therefore fewer nodes to handle each k-point. This allows for aggressive k-point parallelism, which further improves scalability through memory and network locality.
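The 16 GiB example can be checked with a rough memory estimate. The sketch below assumes that the dense matrices are stored in complex double precision (16 bytes per entry); the text does not state the storage format, so this is an assumption, and the function name is illustrative.

```python
def dense_matrix_gib(n_bands, basis_per_band, bytes_per_entry=16):
    """Memory in GiB of one dense S x S matrix, with S = n_bands * basis_per_band.

    bytes_per_entry=16 assumes complex double precision (an assumption).
    """
    size = n_bands * basis_per_band
    return size * size * bytes_per_entry / 2**30


# 1000 bands with 20 basis elements per band gives S = 20000.
print(round(dense_matrix_gib(1000, 20), 2))
```

Under this assumption one such matrix occupies about 6 GiB, so a Hamiltonian together with a matrix of eigenvectors takes roughly 12 GiB, within the quoted 16 GiB.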

The operations required to build the k-independent parts of the Hamiltonian have memory requirements that scale with the number of plane-waves. The transformation matrix, ⟨b|g⟩, can be particularly large. However, the application of these matrices is through highly parallel matrix-matrix products. Therefore, a scheme which parallelizes the construction of a set of dense Hamiltonians over a set of nodes and then subdivides into maximally local solves parallelizes very well compared to the plane-wave counterpart.

Additionally, the SRB uses significantly fewer FFT operations than the plane-wave counterpart. Parallel FFTs have high communication to computation ratios and are known to limit plane-wave DFT scalability in many cases. Avoiding the majority of them eases this parallel bottleneck.

C. Implications for accelerators

Accelerator systems, including those based on GPUs and coprocessors, can struggle to provide sufficient network bandwidth to keep up with their high floating-point and local memory performance. The aforementioned decrease in problem size, increase in possible locality, and avoidance of parallel FFTs reduce the communication overhead significantly. The dense linear algebra that comprises the majority of the SRB computation is an ideal case for accelerators. Early testing indicates that the potential for the application of accelerators to the SRB exceeds that of traditional PW calculations.

D. Advantages for norm-conserving pseudo-potentials

The SRB presents three advantages for norm-conserving pseudo-potentials. First, the norm-conserving non-local potential is density independent, and therefore static throughout a calculation. When the SRB is saved, only the local potential must be recomputed between SCF iterations. Second, the eigensystem does not need to be generalized, which greatly improves the efficiency of direct solvers compared to subspace methods that still require generalized solves. Third, the usual disadvantage of NCPPs is the requirement of more plane-waves than USPPs. The performance of the SRB is relatively insensitive to the number of plane-waves compared to full plane-wave calculations. We can expect the SRB to reduce the performance gap between NCPPs and USPPs.

E. Comparison to NSCF calculations

There are three primary differences between self-consistent and non-self-consistent calculations in the SRB. The first, and most obvious, is the addition of a step to construct the charge density. The method outlined in Section III E is usually cheaper than the transformation of the local potential, so this should not significantly impact the performance of the method.

The second difference is the reduction in the number of k-points. The density does not depend as strongly on the k-point sampling as the band structure. It is not uncommon for NSCF calculations to exceed thousands of k-points. It is rare for self-consistent calculations to reach one hundred. This shifts the target away from creating the smallest, most expressive basis towards methods that facilitate cheap transformations, reducing the overhead cost. This shift is typified by the basis saving technique.

The third difference is related to the diagonalization. Iterative diagonalization schemes take as input a convergence threshold for the accuracy of the eigenvalues. The first self-consistent iteration specifies a large threshold. Subsequent iterations use successively smaller thresholds until the user-defined accuracy criterion is met. Because each SCF iteration uses the previous iteration’s result as an ‘initial guess’, the diagonalizer has very little work to do at each iteration. In a sense, self-consistent calculations only diagonalize once, but they interleave that diagonalization with self-consistent updates to the Hamiltonian. In NSCF calculations, the same iterative diagonalizer is used but the convergence threshold cannot be relaxed. Therefore, NSCF diagonalization takes much longer than a single SCF iteration.

The SRB uses direct diagonalization of the dense Hamiltonian, which doesn’t benefit much from a reduced convergence threshold. The SRB diagonalization times in the SCF and NSCF cases are identical. Thus, the reduction in diagonalization time compared to plane-waves is much more pronounced for NSCF calculations, even if the number of k-points, plane-waves, and basis elements were identical.

IX. FUTURE WORK

A. Preconditioned iterative diagonalization

We have seen the direct diagonalization step to be a significant bottleneck for higher dimensional systems. The poor performance is rooted in the larger size of the SRB compared to the subspace formed in the Davidson iteration employed by the plane-wave code. Direct diagonalization is cubic in the matrix dimension, so even a doubling of the SRB size compared to the subspace will degrade performance by 8x.
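The 8x figure is simply the cube of the size ratio. A minimal sketch, assuming the cost of a direct eigensolve scales exactly as N^3 with no lower-order terms (the sizes below are hypothetical):

```python
def direct_diag_cost_ratio(srb_size, subspace_size):
    """Relative cost of direct diagonalization, assuming cost proportional to N**3."""
    return (srb_size / subspace_size) ** 3


# Hypothetical sizes: an SRB twice as large as the Davidson subspace.
print(direct_diag_cost_ratio(2000, 1000))  # 8.0
```

The same function shows how quickly the penalty grows: a size ratio of 3 already costs 27x.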

The obvious solution is to use an iterative method, such as Davidson, to diagonalize the SRB Hamiltonian. This would allow the SRB to take advantage of the gradual diagonalization inherent in self-consistent calculations. The iterative subspace needed to diagonalize the SRB Hamiltonian should be no larger than the PWDFT counterpart, and could indeed be smaller.

The challenge, as highlighted by Shirley [6], is to pick a reasonable preconditioner. The SRB Hamiltonian is dense and not always diagonally dominant. The SRB provides a factorization of the Hamiltonian with respect to k. One could consider directly inverting the k-independent part of the Hamiltonian and re-using it as the foundation of a dense preconditioner. Such a preconditioner could be much more accurate than the diagonal preconditioners used in the plane-wave code, enabled by the small dense matrix size.

B. Pruning

The success of the basis saving technique hints at a temporal-like redundancy in the self-consistent procedure. One could take this temporal redundancy a step further and not only reuse a previous basis but reduce it further based on statistics of the previous iteration. For example, on the first iteration, the reduced basis could be produced with little or no truncation. Then, at the end of the iteration, the weight of the states ⟨b_i|u_nk⟩ on each basis element would indicate which elements to remove. This would amount to computing scores

$$ \omega_i = \sum_{nk} f(\epsilon_{nk}) \langle b_i | u_{nk} \rangle \tag{40} $$

and removing some of the elements with the lowest score. Elements could be removed up to a given sum of missing scores or a given fraction of the number of elements. We call this pruning. Pruning could be repeated at the end of each iteration until the basis is frozen, or a new, untruncated basis could be formed.

There are two advantages of pruning over simply producing a new basis. First, pruning takes into consideration the Bloch states at all the k-points, not just the q-points. Therefore, we could expect pruning to better decide which basis elements are being used. Secondly, pruning does not require a new transformation of the density-independent components of the Hamiltonian. Instead, the pruned Hamiltonian is formed by eliminating rows and columns from the SRB Hamiltonian and rows from the set of projectors.
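The pruning procedure can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not an implemented feature: the array layouts (`coeffs[k, i, n]` for ⟨b_i|u_nk⟩, `occupations[k, n]` for f(ε_nk), projectors with one row per basis element) are hypothetical, and the score uses the squared magnitude of the overlap so that it is real and non-negative, which is an assumption about how Eq. 40 would be evaluated in practice.

```python
import numpy as np


def pruning_scores(coeffs, occupations):
    """Occupation-weighted score of each basis element.

    coeffs[k, i, n] holds <b_i|u_nk>; occupations[k, n] holds f(eps_nk).
    Assumption: the squared magnitude of the overlap is used as the weight.
    """
    weights = np.abs(coeffs) ** 2
    return np.einsum("kin,kn->i", weights, occupations)


def select_kept(scores, max_missing_fraction):
    """Indices kept after dropping the lowest-scoring elements.

    Elements are dropped while the summed dropped score stays at or below
    max_missing_fraction of the total score.
    """
    order = np.argsort(scores)                 # ascending score
    cumulative = np.cumsum(scores[order])
    limit = max_missing_fraction * cumulative[-1]
    n_drop = np.searchsorted(cumulative, limit, side="right")
    return np.sort(order[n_drop:])


def prune(hamiltonian, projectors, keep):
    """Eliminate rows and columns of H and rows of the projectors."""
    return hamiltonian[np.ix_(keep, keep)], projectors[keep, :]
```

Selecting by a fraction of the number of elements instead would replace `select_kept` with a fixed `n_drop`; either way no new transformation of the density-independent terms is needed, only slicing.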

C. Dividing the BZ

In the present scheme, the entire first BZ is covered by a single reduced basis. This requirement is the primary contributor to subspace inefficiency. In Davidson, the subspace need only be valid for a single k-point.

The BZ could be divided into regions, each covered by a smaller basis. For example, the first BZ could be split into octants, the first octant covering [0, 1/2]^3 in crystal coordinates. We would expect that at a given error tolerance, the octant reduced bases would be smaller than the basis for the entire BZ. Dividing the BZ in this way would increase the overhead transformation costs, but reduce the per-k-point cost. If the number of k-points were very large, the reduction in per-k-point cost due to the smaller reduced bases could outweigh the added overhead cost.
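As an illustration of the bookkeeping such a division would need, the sketch below assigns k-points given in crystal coordinates to octants. It assumes the first BZ is represented as [-1/2, 1/2)^3 in crystal coordinates and ignores symmetry reduction and the treatment of points on octant boundaries; all function names are hypothetical.

```python
import numpy as np


def fold_to_first_bz(kpts):
    """Map crystal coordinates into [-1/2, 1/2) along each axis."""
    return (np.asarray(kpts, dtype=float) + 0.5) % 1.0 - 0.5


def octant_index(kpts):
    """Label each k-point 0..7 from the signs of its folded coordinates.

    Label 7 corresponds to the octant [0, 1/2)^3.
    """
    non_negative = (fold_to_first_bz(kpts) >= 0.0).astype(int)
    return non_negative @ np.array([1, 2, 4])


def group_by_octant(kpts):
    """Dictionary mapping octant label to the indices of its k-points."""
    labels = octant_index(kpts)
    return {int(o): np.flatnonzero(labels == o) for o in np.unique(labels)}
```

Each group would then receive its own reduced basis, built from q-points inside that octant, and the per-k-point solve would use the basis of the octant containing k.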

D. Exact exchange

The computation of the exact-exchange operator in the reduced basis presents many challenges. The exchange operator can be written as

$$ \langle b_i | K(k) | b_j \rangle = \sum_{nk} f_{nk}(\varepsilon) \frac{\langle b_i | r \rangle \langle r | \psi_{nk} \rangle \langle \psi_{nk} | r' \rangle \langle r' | b_j \rangle}{|r - r'|}, \tag{41} $$

up to constant prefactors. This expression is generally treated by considering a two-particle density:

$$ \rho_{a,b}(r) = \langle \phi_a | r \rangle \langle r | \phi_b \rangle, \tag{42} $$

where φ_a and φ_b are arbitrary complex fields (e.g. the wave function ψ_nk or basis element b_j). Then an analogous two-particle potential is formed through a Poisson solve. Both of these steps are defined in specific bases, real space for the density and reciprocal space for the Poisson solve. As linear operators, the two operators are a (2,2) tensor, which is too large to be directly transformed or stored in the SRB.

Another approach would be to recognize that the sum over occupied states produces a non-diagonal density operator:

$$ \langle r | \rho | r' \rangle = \sum_{nk} f_{nk}(\varepsilon) \langle r | \psi_{nk} \rangle \langle \psi_{nk} | r' \rangle. \tag{43} $$

We would like to perform a rank-1 decomposition, as in Section III E, but the off-diagonal elements of the density do not benefit from cancellation of the k-point phase. That is, the non-diagonal density operator is full rank.

However, it is not the opinion of the authors that the exact exchange operator cannot be expressed in the SRB. Simply, this is an area which requires further study.
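For orientation, the conventional two-step treatment described in this subsection (a pair density formed in real space, followed by a Poisson solve in reciprocal space) can be sketched with NumPy FFTs. This is a generic plane-wave illustration, not an SRB formulation, and it assumes a cubic cell sampled on an n x n x n grid, Hartree atomic units so that v(G) = 4π ρ(G)/|G|², and that the divergent G = 0 term is simply dropped.

```python
import numpy as np


def pair_density(phi_a, phi_b):
    """rho_ab(r) = conj(phi_a(r)) * phi_b(r) on a real-space grid (Eq. 42)."""
    return np.conj(phi_a) * phi_b


def pair_potential(rho, box_length):
    """Poisson solve for rho in reciprocal space on a cubic n x n x n grid.

    Assumptions: Hartree atomic units, v(G) = 4*pi*rho(G)/|G|^2, and the
    G = 0 component is set to zero.
    """
    n = rho.shape[0]
    g = 2.0 * np.pi * np.fft.fftfreq(n, d=box_length / n)
    gx, gy, gz = np.meshgrid(g, g, g, indexing="ij")
    g2 = gx**2 + gy**2 + gz**2
    rho_g = np.fft.fftn(rho)
    v_g = np.zeros_like(rho_g)
    nonzero = g2 > 0.0
    v_g[nonzero] = 4.0 * np.pi * rho_g[nonzero] / g2[nonzero]
    return np.fft.ifftn(v_g)
```

Both steps act on full real-space or reciprocal-space grids for every pair of fields, which is the cost that makes a direct transcription into the SRB unattractive.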

X. CONCLUSIONS

We have shown the SRB to be capable of achieving an arbitrary degree of accuracy compared to the host calculation in the primary basis, in this case plane-waves. More generally, the SRB completes to the primary basis and therefore inherits the accuracy and convergence properties of that basis. For plane-waves, this makes the SRB uniformly convergent.


The current implementation of the SRB is more efficient than plane-wave calculations for reduced dimensional extended systems, such as surfaces, nanorods, nanowires, and nanoribbons. The reduction in runtime depends chiefly on the number of k-points, the band density, and the dimensionality of the BZ. We have demonstrated speedups compared to plane-waves in excess of 5x while maintaining 5 meV RMS accuracy across the band structure, 1 kbar accuracy in pressure, and 10^-3 Ry/a_0 RMS accuracy across forces. Reducing accuracy constraints to be in line with practical expectations for MD trajectories leads to a 1.67x performance improvement for bulk Au.

To enable highly efficient self-consistent calculations, we have described a k-independent transformation of the PWDFT Hamiltonian into the SRB and an analogous transformation of the electron density to real space. The k-independent transformation relies on the novel auxiliary basis approach for the projectors and the explicit density matrix approach for the electron density. These techniques are examples of a broader class of k-independent transformations that could be used to efficiently transform other operators into the SRB. For example, the treatment of electron-phonon interactions in density functional perturbation theory [42] and the treatment of electronic excited states within many-body perturbation theory [43, 44], in some systems, lead to large numbers of matrix elements that need to be computed on a fine k-point grid. Further, the k-independent transformation generalizes to any transformation of the PWDFT Hamiltonian and density matrix to another space, ⟨b|g⟩. It could be considered a baseline method for transformations between plane-waves and bases other than the SRB, and will be efficient as long as the other basis is small compared to plane-waves.

The SRB shifts numerical focus away from FFTs and towards dense linear algebra. For example, the number of FFTs is independent of the number of k-points. Dense linear algebra has a very different computational profile than FFTs. In particular, dense linear algebra is generally floating-point bound, while FFTs are bandwidth bound. With the growing heterogeneity of computer architectures, this difference will have growing implications for the performance of the methods. If general purpose graphical processing units and other modern coprocessors are any indication, floating-point performance will continue to grow more rapidly than bandwidth, favoring dense linear algebra.

We have outlined where there is room for improvement in our current implementation of the SRB. The most pressing issue is iterative diagonalization, which will greatly improve the performance of 2D and 3D systems, as demonstrated by a Ni slab and a finite-temperature Au snapshot here. Additionally, the coarse meshes (q-points) recommended here are independent of the composition of the fine mesh (k-points), which is likely suboptimal. Pruning could provide an efficient means for updating the basis without performing additional plane-wave calculations or transformations. When the number of k-points is large, as is likely in non-self-consistent calculations, dividing the BZ could reduce the subspace inefficiency. At this early stage, with room for development in theory and in implementation, we think the SRB approach holds much promise for improving the efficiency of self-consistent electronic structure calculations.

ACKNOWLEDGMENTS

M. Hutchinson acknowledges support from the Department of Energy Computational Science Graduate Fellowship Program (Grant No. DE-FG02-97ER25308). This work was performed at the Molecular Foundry, supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Appendix A: Performance model

Both PWDFT and SRB calculations spend the majority of their runtime in the FFT, BLAS, and LAPACK libraries. The performance of the two methods can be well characterized by the number and size of those calls.

Consider a self-consistent calculation of an electron density using a traditional plane-wave method. At the heart of the procedure is a matrix-free iterative eigensolver, most commonly conjugate gradient (CG) [45] or Davidson [11, 46]. Both access the Hamiltonian by computing the action on a vector, H|φ⟩. We restrict the rest of this discussion to Davidson, which is considered higher performance and a good representative of related subspace techniques. Davidson uses the action to represent the Hamiltonian and overlap matrix in a dense subspace [47]. In PWDFT, each Hamiltonian evaluation requires 2 fast Fourier transforms (FFT) to access the real-space local potential, and 2 matrix-vector (MV) products to project into and out of the atomic wave function space. The dense Hamiltonian is formed by inner products over plane-waves. The subspace eigenproblem is then solved directly. Candidate solutions are transformed back into the full space and their residuals, (H − εS)|ψ⟩, are used to expand the subspace using a diagonal approximation of the inverse. The construction of the electron density, ρ, requires an additional FFT and MV product per eigenfunction. In sum, the Davidson workload per k-point per self-consistent iteration is:

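$$
\begin{aligned}
W_{PW} ={}& (2\alpha_D + 1) B \left( \mathrm{FFT}[G] + \mathrm{MV}[P, G] \right) \\
&+ \mathrm{MM}[\alpha_D B, \alpha_D B, G] + \mathrm{EVD}[\alpha_D B]
\end{aligned}
\tag{A1}
$$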

where B is the number of bands, G is the number of plane-waves, P is the number of projectors, α_D is the size of the final subspace as a multiple of the number of bands, and EVD[N] denotes an eigenvalue decomposition (eigensolve) of size N which computes all eigenvectors. The assumption is made that the direct diagonalization of the largest subspace dominates the other diagonalizations, which underestimates the subspace diagonalization costs. Note that in Davidson, the matrix-vector (MV) products can be coalesced into a more efficient matrix-matrix (MM) product because the subspace can be formed in parallel. In some codes, the transformation to pseudo-wave function space is performed in real space, in which case the MV product is sparse with a number of non-zero terms that grows linearly with the system size, not quadratically.

When using the SRB, as described in Section III, the workload becomes:

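$$
\begin{aligned}
W_{SRB} ={}& \mathrm{SVD}[QB, G] + \mathrm{MM}[QB, QB, G] + \mathrm{MM}[QB, S, G] \\
&+ 2S\,\mathrm{FFT}[G] + 5\,\mathrm{MM}[S, S, G] \\
&+ K\,\mathrm{MM}[S, P, X] + A\,\mathrm{MM}[S, X, G] \\
&+ K\,\mathrm{MM}[S, S, P] \\
&+ K\,\mathrm{EVD}[S] \\
&+ K\,\mathrm{MM}[S, S, B] + \mathrm{SVD}[S, S] \\
&+ \alpha_\rho S \left( \mathrm{FFT}[G] + \mathrm{MV}[S, G] \right) \\
&+ (2\alpha_D) Q B \left( \mathrm{FFT}[G] + \mathrm{MV}[P, G] \right)
\end{aligned}
\tag{A2}
$$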

where SVD denotes the singular value decomposition, S is the size of the SRB, X is the size of the auxiliary basis, A is the number of atoms, Q is the number of q-points, Q is the number of irreducible q-points, EVD denotes the eigenvalue decomposition, and α_ρ is the fraction of the SRB degrees of freedom needed to represent the density matrix, with a typical value of 1/4.

When a saved basis is used, the workload reduces to:

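$$
\begin{aligned}
W_{SRB'} ={}& S\,\mathrm{FFT}[G] + \mathrm{MM}[S, S, G] \\
&+ K\,\mathrm{EVD}[S] \\
&+ K\,\mathrm{MM}[S, S, B] + \mathrm{SVD}[S, S] \\
&+ \alpha_\rho S \left( \mathrm{FFT}[G] + \mathrm{MV}[S, G] \right)
\end{aligned}
\tag{A3}
$$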

which omits the plane-wave calculation and SVD used to form the basis and the transformation of projectors and kinetic energy, which are density independent, and uses the previously computed Fourier transform of the basis elements.

As the number of k-points increases, the cost of the EVD dominates all other SRB costs. The Davidson solver in the plane-wave algorithm includes a similarly sized EVD per k-point. We can write S = α_S B so that the ratio α_D/α_S characterizes the relative sizes of the direct eigensolves. In this large-K regime, the speedup should be:

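$$ \mathrm{Speedup} = \left( \frac{\alpha_D}{\alpha_S} \right)^3 \left[ 1 + \frac{W_{PW} - \mathrm{EVD}[\alpha_D B]}{\mathrm{EVD}[\alpha_D B]} \right] \tag{A5} $$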

When α_D/α_S < 1/2, the performance of the reduced diagonalization degrades rapidly. In these cases, α_S ≫ 1, so most of the direct solver’s effort is wasted on uninteresting eigenpairs. An iterative eigensolver should be used instead.

As the number of plane-waves per SRB matrix element, G/S, increases, the cost of the G-dependent terms dominates all other costs. The Fourier transforms are the highest order in G, so we only count them. The plane-wave calculation uses 2KBα_D FFTs while the SRB uses 2QBα_D + 2S = 2B(Qα_D + α_S). The speedup should approach

$$ \mathrm{Speedup} = \frac{K \alpha_D}{Q \alpha_D + \alpha_S} \tag{A6} $$

Because G grows linearly with the volume but S grows linearly with the number of bands, B, the ratio S/G ≈ B/V, the band density. The lower the band density, the better one can expect the SRB to perform.
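The FFT-limited estimate of Eq. A6 is easy to evaluate numerically. The sketch below counts FFTs for both methods exactly as in the preceding paragraph and returns their ratio. The values Q = 4 and α_S = 2α_D are those quoted for one dimensional systems in Section VIII A; the choices K = 60 and B = 100 are hypothetical and serve only as an example.

```python
def fft_counts(n_k, n_q, n_bands, alpha_d, alpha_s):
    """FFT counts in the large G/S limit: (plane-wave, SRB)."""
    plane_wave = 2 * n_k * n_bands * alpha_d
    srb = 2 * n_bands * (n_q * alpha_d + alpha_s)
    return plane_wave, srb


def fft_limited_speedup(n_k, n_q, alpha_d, alpha_s):
    """Speedup of Eq. A6: K * alpha_D / (Q * alpha_D + alpha_S)."""
    return n_k * alpha_d / (n_q * alpha_d + alpha_s)


# Q = 4 and alpha_S = 2 * alpha_D as in Section VIII A; K and B are hypothetical.
pw, srb = fft_counts(n_k=60, n_q=4, n_bands=100, alpha_d=1.0, alpha_s=2.0)
print(pw / srb, fft_limited_speedup(60, 4, 1.0, 2.0))  # 10.0 10.0
```

With Q = 4 and α_S = 2α_D the estimate reduces to K/6, so the SRB only pays off in this limit once the fine mesh has more than six k-points.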

[1] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, APL Materials 1, 1 (2013).

[2] W. Kohn and L. Sham, Physical Review 385 (1965).

[3] A. Belsky, M. Hellenbrandt, V. L. Karen, and P. Luksch, Acta Crystallogr. Sect. B 58, 364 (2002).

[4] Z. Wang, Y. Sun, X.-Q. Chen, C. Franchini, G. Xu, H. Weng, X. Dai, and Z. Fang, Physical Review B 85, 195320 (2012).

[5] R. Martin, Electronic structure: basic theory and practical methods (Cambridge Univ Pr, 2004).

[6] E. Shirley, Physical Review B 54, 464 (1996).

[7] J. Luttinger and W. Kohn, Physical Review 173 (1955).

[8] G. Dresselhaus, A. Kip, and C. Kittel, Physical Review 98 (1955).

[9] M. Rathinam and L. R. Petzold, SIAM Journal on Numerical Analysis 41, 1893 (2003).

[10] D. Prendergast and S. G. Louie, Physical Review B 80, 1 (2009).

[11] P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo, A. Dal Corso, S. de Gironcoli, S. Fabris, G. Fratesi, R. Gebauer, U. Gerstmann, C. Gougoussis, A. Kokalj, M. Lazzeri, L. Martin-Samos, N. Marzari, F. Mauri, R. Mazzarello, S. Paolini, A. Pasquarello, L. Paulatto, C. Sbraccia, S. Scandolo, G. Sclauzero, A. P. Seitsonen, A. Smogunov, P. Umari, and R. M. Wentzcovitch, Journal of Physics: Condensed Matter 21, 395502 (2009).

[12] http://www.github.com/maxhutch/qe-srb.

[13] C. Pickard and R. Needs, Physical Review Letters 107, 1 (2011).

[14] K. T. Chan, J. B. Neaton, and M. L. Cohen, Physical Review B 77, 235430 (2008).


[15] D. J. Lucia, P. S. Beran, and W. A. Silva, Progress in Aerospace Sciences 40, 51 (2004).

[16] L. Sirovich, Quarterly of Applied Mathematics XLV, 561 (1987).

[17] C. Ding and X. He, in Proceedings of the twenty-first international conference on Machine learning (2004).

[18] C. Homescu, L. Petzold, and R. Serban, SIAM Journal on Numerical Analysis 43, 1693 (2005).

[19] B. Epureanu, Journal of Fluids and Structures 17, 971 (2003).

[20] P. Umari, G. Stenuit, and S. Baroni, Physical Review B 79, 7 (2009).

[21] C. Persson and C. Ambrosch-Draxl, Computer Physics Communications 177, 280 (2007).

[22] M. I. Hussein and G. M. Hulbert, Finite Elements in Analysis and Design 42, 602 (2006).

[23] M. I. Hussein, Proc. R. Soc. A 465, 2825 (2009).

[24] G. Pau, Physical Review E 76, 046704 (2007).

[25] D. Vanderbilt, Physical Review B 25, 4228 (1990).

[26] P. Blochl, Physical Review B 41, 5414 (1990).

[27] P. Blochl, Physical Review B 50, 17953 (1994).

[28] W. Pickett, Computer Physics Reports (1989).

[29] L. Kleinman and D. Bylander, Physical Review Letters 48, 1425 (1982).

[30] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammerling, and A. McKenney, LAPACK Users’ guide (SIAM, 1999).

[31] K. Laasonen, A. Pasquarello, R. Car, C. Lee, and D. Vanderbilt, Physical Review B 47, 142 (1993).

[32] O. Nielsen and R. Martin, Physical Review Letters 50, 697 (1983).

[33] J. Perdew, K. Burke, and M. Ernzerhof, Physical Review Letters 77, 3865 (1996).

[34] J. Perdew and A. Zunger, Physical Review B 23 (1981).

[35] A. Rappe, K. Rabe, E. Kaxiras, and J. Joannopoulos, Physical Review B 41, 1227 (1990).

[36] N. Troullier and J. Martins, Physical Review B 43 (1991).

[37] https://github.com/maxhutch/srb-supplemental.

[38] D. Krepel and O. Hod, Surface Science 605, 1633 (2011).

[39] P. Ganesh and M. Widom, Journal of Non-Crystalline Solids 357, 442 (2011).

[40] X.-Z. Li, M. I. J. Probert, A. Alavi, and A. Michaelides, Physical Review Letters 104, 066102 (2010).

[41] Y. Ping, D. Hanson, I. Koslow, T. Ogitsu, D. Prendergast, E. Schwegler, G. Collins, and A. Ng, Physics of Plasmas 15, 056303 (2008).

[42] F. Giustino, M. Cohen, and S. Louie, Physical Review B 76, 165108 (2007).

[43] M. Hybertsen and S. Louie, Physical Review B 34 (1986).

[44] M. Rohlfing and S. Louie, Physical Review B 62, 4927 (2000).

[45] M. Payne, M. Teter, and D. Allan, Reviews of Modern Physics 64, 1045 (1992).

[46] G. Kresse, Computational Materials Science 6, 15 (1996).

[47] E. Davidson, Journal of Computational Physics , 87 (1975).