

Efficient computation of sparse matrix functions for large scale electronic structure calculations: The CheSS library

Stephan Mohr,1 William Dawson,2 Michael Wagner,1 Damien Caliste,3,4 Takahito Nakajima,2 and Luigi Genovese3,4

1 Barcelona Supercomputing Center (BSC)∗
2 RIKEN Advanced Institute for Computational Science, Kobe, Japan, 650-0002
3 Univ. Grenoble Alpes, INAC-MEM, L_Sim, F-38000 Grenoble, France
4 CEA, INAC-MEM, L_Sim, F-38000 Grenoble, France

(Dated: October 10, 2017)

We present CheSS, the "Chebyshev Sparse Solvers" library, which has been designed to solve typical problems arising in large scale electronic structure calculations using localized basis sets. The library is based on a flexible and efficient expansion in terms of Chebyshev polynomials, and presently features the calculation of the density matrix, the calculation of matrix powers for arbitrary powers, and the extraction of eigenvalues in a selected interval. CheSS is able to exploit the sparsity of the matrices and scales linearly with respect to the number of non-zero entries, making it well suited for large scale calculations. The approach is particularly adapted for setups leading to small spectral widths of the involved matrices, and outperforms alternative methods in this regime. By coupling CheSS to the DFT code BigDFT we show that such a favorable setup is indeed possible in practice. In addition, the approach based on Chebyshev polynomials can be massively parallelized, and CheSS exhibits excellent scaling up to thousands of cores even for relatively small matrix sizes.

I. INTRODUCTION

Sparse matrices are abundant in many branches of science, be it due to the characteristics of the employed basis set (e.g. finite elements, wavelets, Gaussians, etc.) or due to intrinsic localization properties of the system under investigation. Ideally, an operator acting on such matrices should exploit this sparsity as much as possible in order to reach a high efficiency. Due to the great variety of specific needs there is no simple and unique approach to perform this general task, and various solutions have been conceived to satisfy the respective demands [1].

The CheSS library, which we present in this paper, has its origins in electronic structure calculations, in particular Density Functional Theory (DFT) [2,3], and is consequently capable of performing the specific matrix operations required in this context. In principle, these can all be solved straightforwardly using highly optimized linear algebra libraries such as LAPACK [4] / ScaLAPACK [5], which is indeed the fastest solution for small to medium size matrices. However, this approach gets increasingly expensive for large systems, since it uses the matrices in their dense form, which exhibits an inherent cubic scaling. With the recent widespread availability of DFT codes that are able to tackle systems of many thousands of atoms, such challenging large scale calculations are becoming more and more abundant [6], and the above mentioned cubic scaling is clearly a serious limitation.

An overview of popular electronic structure codes, in particular focusing on large scale calculations, can be found in Ref. 6. In general these large scale DFT codes work with localized basis sets, e.g. BigDFT [7–9], SIESTA [10,11], Quickstep [12], ONETEP [13–16] or Conquest [17–19], and the matrices expressed in these bases consequently exhibit a natural sparsity, i.e. only those matrix elements where the basis functions overlap are different from zero. Hence it is advantageous to use — beyond a certain crossover point — sparse algorithms instead of the basic dense approaches. Since exploiting the sparsity is also the key towards linear scaling algorithms — i.e. methods where the computational demand only increases linearly with respect to the system size — efficient sparse solvers are crucial in this domain.

The central task within DFT is the calculation of the density matrix F̂, and many different approximate ways of calculating it with a more favorable scaling than the default cubic one have been derived. The so-called Fermi Operator Expansion (FOE) [20,21], which gave the inspiration for the creation of CheSS, calculates the density matrix as a direct expansion of the Hamiltonian matrix in terms of Chebyshev polynomials. Another method, which is similar in spirit, writes the density matrix as a rational expansion [22,23], exploiting Cauchy's integral theorem in the complex plane. This method has the advantage that — unlike FOE — it only has to cope with the occupied states, which makes it advantageous for large basis sets containing many high energetic virtual states. One particular implementation is the PEXSI package [24], which we will use later on for a comparison with CheSS. Other popular approaches to calculate the density matrix are the density-matrix minimization approach [25], which calculates the density matrix by minimizing a target function with respect to F̂, and the divide-and-conquer method [26], which is based on a partitioning of the density matrix into small subblocks. An overview of further popular methods — in particular focusing on linear scaling approaches — can be found in Refs. 23 and 27.

Several studies have compared the various methods [28–33], but it is very hard — if not impossible — to determine the ultimate method that performs best under all circumstances. Rather, all of these methods have their "niche" in which they are particularly efficient, and the task of choosing the appropriate approach is therefore strongly influenced by the specific application. In particular, the choice depends on the physical properties of the system (e.g. insulator versus metal), the chosen formalism (e.g. all-electron versus pseudopotential), and on the basis set that is used. These factors influence the properties of the matrices that shall be processed in various ways; we will focus in the following on the spectral width. For some of the aforementioned methods this property has only little influence, whereas for others it is crucial.

The other main operation available within CheSS, namely the calculation of a matrix power, where the power can have any — in particular also non-integer — value, is a more general problem that also appears outside of electronic structure calculations. Among the various powers, calculating the inverse is presumably the most important one, and various approaches exist to do this efficiently for sparse matrices, some of them tailored to special cases. For banded matrices, for instance, Ran and Huang developed an algorithm [34] that is about twice as fast as the standard method based on the LU decomposition. To determine the diagonal entries of the inverse, Tang and Saad developed a probing method [35] for situations where the inverse exhibits decay properties. Another algorithm for the calculation of the diagonal entries is the one by Lin et al. [36], which calculates the diagonal elements of the inverse by hierarchical decomposition of the computational domain. To get an approximation of the diagonal entries of a matrix, Bekas et al. proposed a stochastic estimator [37]. The FIND algorithm by Li et al. [38], which follows the idea of nested dissection [39], was likewise designed to calculate the exact diagonal entries of Green's functions, but can be extended to calculate any subset of entries of the inverse of a sparse matrix. In the same way, the Selected Inversion developed by Lin et al. [40,41], which is based on an LDL^T factorization, allows selected elements of the inverse to be calculated exactly. As for the calculation of the density matrix, the choice of the method for the calculation of matrix powers depends as well on the specific application, and therefore there is no universal best option. However, we would like to point out again the importance of the spectral width, which might or might not favor a particular approach.

In this paper we present with CheSS the implementation of a general approach that can efficiently evaluate a matrix function f(M) using an expansion in Chebyshev polynomials. Since CheSS was first used for electronic structure calculations, the functions implemented so far are those needed in this context; more details will be given later. However, there are no restrictions to go beyond this, as the function f can in principle be chosen arbitrarily, with the only restriction that it must be well representable by Chebyshev polynomials over the entire eigenvalue spectrum of M. CheSS has been particularly designed for matrices with a small eigenvalue spectrum, and in this regime it is able to outperform other comparable approaches. We will show later that such a favorable regime can indeed be reached within the context of DFT calculations.

In addition CheSS exploits the sparsity of the involved matrices and hence only calculates those elements that are non-zero. Obviously, this requires that the solution f(M) can be reasonably well represented within the predefined sparsity pattern; however, since the sparsity pattern is defined by the underlying physical or mathematical problem, we leave the responsibility of defining this pattern well to the code interfacing with CheSS. If the cost of calculating one matrix element can be considered as constant (which is the case for a high degree of sparsity), we consequently obtain an approach that scales linearly with respect to the number of non-zero elements. Hence, CheSS is an ideal library for linear scaling calculations, which are crucial for the treatment of large systems.

In summary, we present with CheSS a flexible and powerful framework to compute the sparse matrix functions required in the context of large scale electronic structure calculations. If the above mentioned requirement — namely a small eigenvalue spectrum of the matrices — is fulfilled, CheSS can yield considerable performance boosts compared to other similar approaches. Hence, this library represents an interesting tool for any code working with localized basis functions that aims to perform such large scale calculations.

The remainder of the paper is structured as follows: In Sec. II we first show the basic theory behind CheSS, starting with the applicability of CheSS for electronic structure calculations (Sec. II A), detailing the basic algorithm (Sec. II B) and available operations (Sec. II C), giving a brief discussion about sparsity (Sec. II D), and finishing with a presentation of our format to store sparse matrices (Sec. II E). In Sec. III we then give various performance numbers for CheSS, showing the accuracy (Sec. III A), the scaling with matrix properties (Sec. III B), the parallel scaling (Sec. III C), and a comparison with other methods (Sec. III D). Finally we conclude and summarize our work in Sec. IV and give an outlook on future research.

II. MOTIVATION AND THEORY

A. Applicability of CheSS for electronic structure calculations

As discussed in the introduction, there is a variety of methods to solve the typical problems arising in electronic structure theory and DFT in particular. Consequently most approaches exhibit their best performance in a particular regime, determined by the properties of the matrices which are at the input of the problem. In this section we briefly want to discuss the motivation for the creation of CheSS, i.e. present the conditions under which it works best.


FIG. 1. The systems used for the analysis of the matrices produced by BigDFT: (a) solvated DNA (15613 atoms), (b) bulk pentacene (6876 atoms), (c) perovskite (PbI3CNH6)64 (768 atoms), (d) Si-wire (706 atoms), (e) water (1800 atoms). Their data are shown in Tab. I.

                                 S                                 H
system          #atoms   sparsity   εmin  εmax  κ       sparsity   εmin    εmax   λ      ∆HL
DNA             15613    99.57%     0.72  1.65  2.29    98.46%     -29.58  19.67  49.25  2.76
bulk pentacene  6876     98.96%     0.78  1.77  2.26    97.11%     -21.83  20.47  42.30  1.03
perovskite      768      90.34%     0.70  1.50  2.15    76.47%     -20.41  26.85  47.25  2.19
Si nanowire     706      93.24%     0.72  1.54  2.16    81.61%     -16.03  25.50  41.54  2.29
water           1800     96.71%     0.83  1.30  1.57    90.06%     -26.55  11.71  38.26  9.95

TABLE I. Sparsity, smallest and largest eigenvalue εmin and εmax, and condition number κ or spectral width λ, respectively, for the overlap matrix S and the Hamiltonian matrix H for typical runs with BigDFT. The eigenvalues shown for the overlap matrix are those of the standard eigenvalue problem Sc_i = ε_i c_i, whereas in the case of the Hamiltonian matrix we report those of the generalized eigenvalue problem Hc_i = ε_i S c_i. For the Hamiltonian matrix we additionally show the HOMO-LUMO gap ∆HL. For the latter matrix all values are given in eV.

As will be explained in more detail in Sec. II B, CheSS is based on a polynomial approximation, and hence the performance is a function of the polynomial degree. The latter depends, first, on the specific function that has to be approximated and, second, on the interval over which the function shall be represented by the polynomial. The first criterion is, of course, rather general and depends on the specific application; however, for electronic structure calculations a characteristic function is the Fermi function, which has the property of becoming less smooth (and hence harder to approximate) for systems with small band gaps at low electronic temperature. Given this we can already conclude that CheSS works best for systems exhibiting a decent gap between the Highest Occupied Molecular Orbital (HOMO) and the Lowest Unoccupied Molecular Orbital (LUMO), or for calculations with finite electronic temperature — more details will be given in Sec. III B. For the second point, namely the interval of the polynomial approximation, the situation is simpler: in general, the larger the interval, the more polynomials will be required, since — as explained in more detail later — the necessary rescaling of the overall domain to the unit interval [−1, 1] results in higher polynomial degrees for the Chebyshev decomposition. In our case, the interval of the approximation corresponds to the eigenvalue spectrum of the matrices — the smaller this spectrum, the better CheSS will perform. Again we will investigate this point in more detail in Sec. III B.

For DFT calculations the typical matrices that will be processed are the overlap matrix S_{αβ} = ⟨φ_α|φ_β⟩ and the Hamiltonian matrix H_{αβ} = ⟨φ_α|H|φ_β⟩, where {φ_α} is the basis set used and H the Hamiltonian operator. The spectral width of S — which we will from now on measure with the condition number κ = εmax/εmin, defined as the ratio between the largest and smallest eigenvalue, since the matrix is always positive definite — depends solely on the basis set. Obviously, small and (quasi-)orthogonal basis sets are better suited than large non-orthogonal ones. With respect to H, the spectrum depends as well on the basis; for instance, large basis sets include more high energetic states than small ones. However, there is also a dependence on the physical model that is used, for instance whether the low energetic core states are absorbed into a pseudopotential or not. Since H has in general both negative and positive eigenvalues, the condition number is no longer a good measure, and we therefore rather consider the total spectral width λ = εmax − εmin.


To see whether optimal conditions for CheSS — i.e. a small spectral width for both H and S — can be satisfied in practice, we investigated the properties of these two matrices for some typical calculations with BigDFT. This code [8,9] uses a minimal set of quasi-orthogonal and in-situ optimized basis functions. The first property makes it possible to keep the condition number of the overlap matrix small, whereas the second property — together with the fact that BigDFT uses pseudopotentials to handle the core electrons [42,43] — allows operation with a Hamiltonian exhibiting a small spectrum.

As an illustration, we show in Tab. I the detailed values of the eigenvalue spectrum for typical runs with BigDFT, using systems that were already used in other publications [9,44,45] and that are visualized in Fig. 1. As mentioned, CheSS is mainly designed for systems exhibiting a finite HOMO-LUMO gap, and therefore we only chose examples belonging to this class of systems. As can be seen, the condition number is of the order of 2, independently of the system. The same also holds for the spectral width of the Hamiltonian matrix, which is of the order of 40-50 eV. These low values are a direct consequence of the particular features of the BigDFT setup mentioned above. For other popular basis sets, the condition numbers are usually considerably higher, even when the basis sets were specifically designed to exhibit low values for κ, as for instance in the case of atomic orbitals (about two orders of magnitude larger) [46] or Gaussians (at least one order of magnitude larger) [47]. Thus, the fact that such low values can be reached within a DFT code illustrates the need for an algorithm that can exploit this feature and thus lead to very efficient calculations, and indeed CheSS is used with great success together with BigDFT. Moreover a low condition number of the overlap matrix is also crucial in the context of linear scaling algorithms, since it can be shown [48] that a localized and well-conditioned overlap matrix leads to an inverse with similar decay properties and finally also to an equally localized density matrix.

B. Algorithm

1. Expansion in Chebyshev polynomials

The basic ansatz of the algorithm behind CheSS is to approximate the matrix function f(M) as a polynomial in M. However, such a polynomial expansion can become unstable for large degrees, which is known as Runge's phenomenon. A way to circumvent this is to use Chebyshev polynomials, which are known to minimize this issue. This way the polynomial approximation becomes

f(M) \approx p(M) = \frac{c_0}{2} I + \sum_{i=1}^{n_{pl}} c_i T_i(\tilde{M}) ,   (1)

where I is the identity matrix and the T_i(\tilde{M}) are the Chebyshev matrix polynomials of order i. Since these polynomials are only defined in the interval [−1, 1], the matrix M has to be scaled and shifted such that its eigenvalue spectrum lies within this range. If εmin and εmax are the smallest and largest eigenvalues of M, the modified matrix \tilde{M} that enters Eq. (1) is given by

\tilde{M} = \sigma (M - \tau I) ,   (2)

with

\sigma = \frac{2}{\varepsilon_{max} - \varepsilon_{min}} , \qquad \tau = \frac{\varepsilon_{min} + \varepsilon_{max}}{2} .   (3)

Obviously, the eigenvalue spectrum of M is not known beforehand. However it is relatively easy to determine approximate lower and upper bounds — denoted as \tilde{ε}min and \tilde{ε}max — for the eigenvalue spectrum on the fly, as will be shown later in Sec. II B 2.

The determination of the Chebyshev matrix polynomials and the expansion coefficients is straightforward [49]. The polynomials can be calculated from the recursion relations

T_0(\tilde{M}) = I ,
T_1(\tilde{M}) = \tilde{M} ,
T_{j+1}(\tilde{M}) = 2 \tilde{M} T_j(\tilde{M}) - T_{j-1}(\tilde{M}) ,   (4)

and the expansion coefficients are given by

c_j = \frac{2}{n_{pl}} \sum_{k=0}^{n_{pl}-1} f\left[ \frac{1}{\sigma} \cos\left( \frac{\pi (k + \tfrac{1}{2})}{n_{pl}} \right) + \tau \right] \cos\left( \frac{\pi j (k + \tfrac{1}{2})}{n_{pl}} \right) ,   (5)

where f(x) is the function that shall be applied to the matrix M.
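
To make this concrete, the following sketch computes the scaling parameters of Eq. (3) and the coefficients of Eq. (5) for a scalar function f. It is a minimal NumPy illustration under our own naming conventions, not the interface of the CheSS library itself.

    import numpy as np

    def chebyshev_coefficients(f, eps_min, eps_max, npl):
        """Coefficients c_j of Eq. (5) for a scalar function f, assuming
        the eigenvalue spectrum lies within [eps_min, eps_max]."""
        sigma = 2.0 / (eps_max - eps_min)           # Eq. (3)
        tau = 0.5 * (eps_min + eps_max)             # Eq. (3)
        k = np.arange(npl)
        nodes = np.cos(np.pi * (k + 0.5) / npl)     # Chebyshev nodes in [-1, 1]
        fvals = f(nodes / sigma + tau)              # f on the unscaled spectrum
        j = np.arange(npl)[:, None]
        c = (2.0 / npl) * (np.cos(np.pi * j * (k + 0.5) / npl) @ fvals)
        return sigma, tau, c                        # c[0] enters Eq. (1) with weight 1/2

For f(x) = 1/x on a spectrum such as [0.1, 1.1], a degree of a few tens already suffices, consistent with the automatically determined degree of 60 reported for a condition number of 11 in Sec. III.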

From Eq. (4) it follows that the individual columns of the Chebyshev matrix polynomials fulfill as well a recursion relation and can be calculated independently of each other, i.e. we only need to apply local matrix-vector multiplications. Eventually this also implies that each column of the matrix p(M) can be calculated independently, which makes the algorithm highly efficient for large parallel computing architectures. This specific need for sparse matrix-vector multiplications is in contrast to the more abundant case of parallel sparse matrix-matrix multiplications, for which various methods and libraries exist [50–53]. In fact, there also exists the possibility to calculate the Chebyshev matrix polynomials using such matrix-matrix multiplications. In this case, it is possible to reduce the required multiplications from n_pl − 1 to about 2√(n_pl) [30]. On the other hand, apart from losing the strict independence of the columns and thus complicating the parallelization, this method has the additional drawback that some of the multiplications have to be repeated if the expansion coefficients change. This is in contrast to our approach, where the individual columns can easily be summed up with different coefficients without the need of redoing any multiplications; this feature is particularly important for the calculation of the density kernel, as will be shown in more detail in Sec. II C 1.
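
The column-wise evaluation can then be sketched as follows; all names are illustrative, and the per-column Chebyshev vectors are stored here only to make the cheap re-summation with a different coefficient set explicit.

    import numpy as np
    from scipy.sparse import identity  # M is assumed to be a SciPy CSR matrix

    def scaled_matrix(M, sigma, tau):
        """Mt of Eq. (2); it stays as sparse as M itself."""
        return sigma * (M - tau * identity(M.shape[0], format='csr'))

    def chebyshev_column_vectors(Mt, e_k, npl):
        """T_j(Mt) e_k for j = 0..npl-1: the recursion of Eq. (4) applied
        to the single column selected by the unit vector e_k; only sparse
        matrix-vector products with the fixed matrix Mt are needed."""
        vecs = [e_k, Mt @ e_k]
        for _ in range(2, npl):
            vecs.append(2.0 * (Mt @ vecs[-1]) - vecs[-2])
        return np.asarray(vecs)

    def sum_column(vecs, c):
        """Column of p(M) as in Eq. (1); re-summing with a different c
        needs no further matrix-vector multiplications."""
        return 0.5 * c[0] * vecs[0] + c[1:] @ vecs[1:]

In a parallel setting each task handles its own subset of columns, which is exactly the independence exploited by CheSS.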

2. Determination of the eigenvalue bounds and polynomial degree

In order to get the estimates \tilde{ε}min and \tilde{ε}max for the eigenvalue spectrum on the fly, we can use the same approach as outlined in Sec. II B 1. Analogous to Eq. (1) we construct a penalty matrix polynomial W_p, where the expansion coefficients c_i^p are again given by Eq. (5), but with the function f(x) this time being an exponential:

f_p(x) = e^{\alpha (x - \tilde{\varepsilon}_{min})} - e^{-\alpha (x - \tilde{\varepsilon}_{max})} .   (6)

If Tr(W_p) is below a given numerical threshold, then all eigenvalues of M lie within the interval [\tilde{ε}min, \tilde{ε}max]. Otherwise, the trace will strongly deviate from zero, with the sign indicating which bound has to be adjusted. The larger the value of α, the more accurately the eigenvalue bounds can be determined, but on the other hand a higher degree of the Chebyshev expansion will be required to represent this step-like function well. Since the calculation of the penalty matrix W_p uses the same Chebyshev polynomials as the original expansion of Eq. (1) and only requires the calculation of a new set of expansion coefficients — which is computationally very cheap — this check of the eigenvalue bounds comes at virtually no extra cost and can therefore easily be done on the fly. Nevertheless a good initial approximation of the eigenvalue spectrum is of course beneficial, as it avoids the recalculation of the polynomials in case the eigenvalue bounds have to be adjusted. This is usually the case, since in a typical DFT setup CheSS is used within an iterative loop — the so-called SCF cycle — where the bounds only change little. For the very first step, where no guess from a previous iteration is available, it is usually enough to start with typical default values for a given setup; in case no such guess is available for some reason, it would still be possible to resort to other approaches, such as a few steps of a Lanczos method [54], to get a reasonable starting value.
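
A sketch of this check, reusing chebyshev_coefficients and sum_column from the sketches above; the default α follows the value α = −200 quoted in Sec. III A, while the tolerance is an assumption of ours.

    import numpy as np

    def penalty_trace(all_column_vecs, eps_min, eps_max, npl, alpha=-200.0):
        """Tr(W_p) for the penalty function of Eq. (6): diagonal element k
        of W_p is entry k of its column k, so only a re-summation of the
        stored Chebyshev vectors with new coefficients is needed."""
        f_p = lambda x: np.exp(alpha * (x - eps_min)) - np.exp(-alpha * (x - eps_max))
        _, _, cp = chebyshev_coefficients(f_p, eps_min, eps_max, npl)
        return sum(sum_column(vecs, cp)[k]
                   for k, vecs in enumerate(all_column_vecs))

    # Illustrative use: |trace| below a small tolerance means the bounds are
    # valid; otherwise its sign indicates which bound has to be moved.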

In order to automatically determine an optimal value for the polynomial degree n_pl, we calculate a Chebyshev expansion p_{m_pl}(x) of the one-dimensional function f(x) — which is computationally very cheap — for various degrees m_pl, and then define the polynomial degree n_pl as the minimal degree that guarantees that the polynomial approximation does not deviate from the function f by more than a given threshold λ:

n_{pl} = \min\left\{ m_{pl} \;\middle|\; \left| p_{m_{pl}}(x) - f(x) \right|_{max} < \lambda \right\} .   (7)

Apart from the obvious dependence on the function f that shall be represented, the minimal polynomial degree is also strongly related to the spectral width that must be covered. In general n_pl is smaller the narrower the spectral width is. However, there can also be situations where it is advantageous to artificially spread the spectrum, as we briefly illustrate for the important case of the inverse. Assuming that we have a matrix with a suitable condition number, but eigenvalues close to zero — for instance a spectrum ranging from 10^{-2} to 1 — a quite high degree will be required to accurately reproduce the divergence of the function x^{-1} close to 0. This problem can be alleviated by rescaling the matrix, which results in a larger spectral width, but yields a function that is easier to represent. CheSS automatically detects such situations and rescales the matrix such that its spectrum lies further away from the problematic regions.
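
Equation (7) amounts to a cheap one-dimensional scan; the following sketch uses NumPy's scalar Chebyshev evaluation, with grid size, starting degree and growth factor being arbitrary choices of ours.

    import numpy as np
    from numpy.polynomial.chebyshev import chebval

    def auto_degree(f, eps_min, eps_max, thr=1e-9, start=32, max_npl=20000):
        """Smallest degree whose scalar Chebyshev fit of f deviates by
        less than thr over the spectrum (Eq. (7)); only one-dimensional
        evaluations are involved, so the scan is cheap."""
        x = np.linspace(-1.0, 1.0, 2001)             # test grid in [-1, 1]
        npl = start
        while npl <= max_npl:
            sigma, tau, c = chebyshev_coefficients(f, eps_min, eps_max, npl)
            c = c.copy()
            c[0] *= 0.5                              # weight of c_0 in Eq. (1)
            if np.max(np.abs(chebval(x, c) - f(x / sigma + tau))) < thr:
                return npl
            npl *= 2                                 # assumed doubling scan
        raise RuntimeError("no degree below max_npl reached the threshold")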

C. Available operations

1. Fermi Operator Expansion

A quantum mechanical system can be completely characterized by the density matrix operator F̂, as the measure of any observable Ô is given by Tr(F̂ Ô). Within DFT, it can be defined in terms of the eigenfunctions ψ_i(r) — which are solutions of the single particle Schrödinger equation Ĥψ_i(r) = ε_i ψ_i(r) — as

F(r, r') = \sum_i f_i \psi_i(r) \psi_i(r') ,   (8)

with f_i ≡ f(ε_i) being the occupation of state i, determined by the Fermi function

f(\varepsilon) = \frac{1}{1 + e^{\beta(\varepsilon - \mu)}} ,   (9)

where µ is the Fermi energy and β the inverse electronic temperature. After choosing a specific basis {φ_α(r)} — i.e. ψ_i(r) = Σ_α c_{iα} φ_α(r) — the Schrödinger equation becomes Hc_i = ε_i S c_i, and Eq. (8) corresponds to the calculation of the density kernel matrix K,

K_{\alpha\beta} = \sum_i f_i c_{i\alpha} c_{i\beta} .   (10)

The drawback of this straightforward approach based on a diagonalization is that it scales cubically with respect to the size of the matrices and must therefore be avoided for the calculation of large systems.

The Fermi Operator Expansion (FOE) [20,21] that is implemented in CheSS aims at calculating the density kernel directly from the Hamiltonian matrix (i.e. without a diagonalization), making the ansatz K = f(H). In practice we can replace the Fermi function of Eq. (9) by any other function as long as it fulfills the essential feature of assigning an occupation of 1 to the occupied states and 0 to the empty states. In our implementation we chose the complementary error function, since it decays rapidly from 1 to 0 around the Fermi energy:

f(\varepsilon) = \frac{1}{2} \left[ 1 - \mathrm{erf}\big( \beta (\varepsilon - \mu) \big) \right] .   (11)

With these definitions the density kernel can then be calculated as outlined in Sec. II B, with the subtlety that the input matrix M has to be replaced by M' = S^{-1/2} M S^{-1/2}, and the output matrix has to be postprocessed as S^{-1/2} p(M') S^{-1/2}. The calculation of S^{-1/2} is as well done with CheSS, as will be explained in Sec. II C 2.

Since the resulting density kernel must fulfill the condition Tr(KS) = N, where N is the total number of electrons of the system, the parameter µ has to be adjusted until this condition is satisfied. To do so we simply have to reevaluate Eq. (1) with a different set of coefficients, without the need of recalculating the Chebyshev polynomials, and this operation is consequently rather cheap. The choice of β is more delicate: it has to be chosen such that the error function decays from 1 to 0 within the range between the highest occupied state (i.e. the first state with an energy smaller than µ) and the lowest unoccupied state (i.e. the first state with an energy larger than µ). Since this value is not known beforehand, we have to determine β on the fly: after calculating K with a first guess for β we calculate a second kernel K' with a slightly larger decay length β' > β. Then we compare the energies calculated with these two kernels: if the difference between E = Tr(KH) and E' = Tr(K'H) is below a given threshold, the decay length was sufficient; otherwise the density kernel has to be recalculated with a smaller value for β. Usually this means that n_pl must be increased, and thus a new set of polynomials must be calculated. The second kernel K', on the other hand, can be evaluated cheaply as only the set of expansion coefficients in Eq. (1) changes.
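
The adjustment of µ can be sketched as a bisection on Tr(KS); the callable trace_KS below is an assumed stand-in for re-summing the stored Chebyshev columns with the coefficients belonging to a given µ, which is the cheap operation described above.

    import numpy as np
    from scipy.special import erf

    def occupation(eps, mu, beta):
        """Occupation function of Eq. (11)."""
        return 0.5 * (1.0 - erf(beta * (eps - mu)))

    def find_mu(trace_KS, n_elec, mu_lo, mu_hi, tol=1e-10):
        """Bisect mu until Tr(KS) = N; Tr(KS) grows monotonically with
        mu, so bisection is safe. The bracketing interval and tolerance
        are illustrative choices."""
        while mu_hi - mu_lo > tol:
            mu = 0.5 * (mu_lo + mu_hi)
            if trace_KS(mu) < n_elec:
                mu_lo = mu        # too few electrons: raise the Fermi level
            else:
                mu_hi = mu
        return 0.5 * (mu_lo + mu_hi)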

2. Matrix powers

The calculation of matrix powers, i.e. M^a, can be done exactly along the same lines as the calculation of the density kernel described in Sec. II C 1, with the only difference that the function that enters the calculation of the Chebyshev expansion coefficients of Eq. (5) is now given by f(x) = x^a. Our approach allows the calculation of any — also non-integer — power a, as long as the function f can be well represented by Chebyshev polynomials throughout the entire eigenvalue spectrum. Depending on the power a this might lead to some restrictions; for the inverse, for instance, this means that the matrix should be positive definite, since otherwise the divergence at x = 0 will lead to problems. For typical applications within electronic structure codes, the matrix M for which powers have to be calculated is the overlap matrix, and hence the above requirement is always fulfilled.
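
A matrix power is thus just another coefficient set; the following dense-column sketch reuses the helpers introduced above, with eps_min and eps_max assumed to be valid bounds on the spectrum of M.

    import numpy as np

    def matrix_power(M, a, eps_min, eps_max):
        """f(M) with f(x) = x**a (Sec. II C 2); for negative or fractional
        a the matrix must be positive definite, i.e. eps_min > 0. Dense
        columns for brevity; the library itself works within a sparsity
        pattern throughout."""
        f = lambda x: x ** a
        npl = auto_degree(f, eps_min, eps_max)
        sigma, tau, c = chebyshev_coefficients(f, eps_min, eps_max, npl)
        Mt = scaled_matrix(M, sigma, tau)
        cols = [sum_column(chebyshev_column_vectors(Mt, e_k, npl), c)
                for e_k in np.eye(M.shape[0])]   # one independent task per column
        return np.array(cols).T

Calling matrix_power(S, -0.5, ...) yields the S^{-1/2} needed for the density kernel in Sec. II C 1.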

3. Extraction of selected eigenvalues

The FOE formalism described in Sec. II C 1 can also be used to extract selected eigenvalues from a matrix. To this end one has to recall that for a system with total occupation Tr(KS) = N the Fermi energy will be chosen such that the error function of Eq. (11) decays from 1 to 0 between the Nth and (N+1)th eigenvalue. Therefore, in order to extract the Nth eigenvalue, we simply perform an FOE calculation using a total occupation of Tr(KS) = N − 1/2. This will lead to an error function that decays in such a way that it only half populates the Nth eigenstate, which is the case if the Fermi energy coincides with the Nth eigenvalue. The accuracy of the calculated eigenvalue is related to the value of β; we will discuss this in more detail in Sec. III A.
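
In code, the extraction reduces to reusing the bisection sketched in Sec. II C 1 with a half-integer target occupation (n and the bracketing bounds are illustrative):

    # Estimate the n-th eigenvalue (Sec. II C 3): target an occupation of
    # n - 1/2, so that the returned Fermi level half-populates state n and
    # therefore coincides with eigenvalue n.
    eig_n = find_mu(trace_KS, n_elec=n - 0.5, mu_lo=eps_min, mu_hi=eps_max)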

Even though this approach to extract eigenvalues is originally related to the calculation of the density kernel, it is generally applicable to any matrix as long as the error function of Eq. (11) is well representable by Chebyshev polynomials over the entire eigenvalue spectrum. Moreover, the Chebyshev polynomials — whose calculation is the most expensive part of this approach — can be reused for each eigenvalue, since only the value of µ has to be varied.

Nevertheless we should see this eigenvalue calculation more as a "side product" of the usage of CheSS than as an individual feature. In other words, once the Chebyshev polynomials are determined — for instance to calculate the density kernel — a rough estimate of some eigenvalues (e.g. the HOMO-LUMO gap) can be calculated using the very same polynomials at hardly any additional cost. If, on the other hand, the accurate (and not just approximate) calculation of the eigenvalues and associated eigenvectors is a central point of an algorithm, then one should most likely resort to more appropriate interior eigenvalue solvers, such as for instance the shift-and-invert Lanczos method [55], the Sakurai-Sugiura method [56] or the FEAST algorithm [57]. An example of such an approach is shown in Ref. 58, where the Sakurai-Sugiura method [56] is used to compute hundreds of interior eigenstates for a Hamiltonian matrix stemming from a large-scale DFT calculation with the Conquest code [17–19].

D. Sparsity and Truncation

a. Truncation to a fixed sparsity pattern CheSS is designed for large sparse matrices, meaning that most of the matrix entries are zero and consequently not stored. However, applying an operation to a sparse matrix — in our case the matrix vector multiplications to build up the Chebyshev polynomials — does in general not preserve this sparsity, and repeated application of this operation can eventually lead to a dense matrix. To avoid this, the result after applying the operation is again mapped onto a sparsity pattern, i.e. certain entries are forced to be zero.


FIG. 2. Heat map of the matrix elements for a 6000 × 6000 sparse matrix M and its inverse M^{-1}: (a) original matrix M; (b) exact calculation of M^{-1} without sparsity constraints; (c) sparse calculation of M^{-1} using CheSS within the sparsity pattern; (d) difference between (b) and (c). The sparsity patterns of the matrices originate from a calculation of a small water droplet with BigDFT, and contain 738,316 non-zero elements (97.95% sparsity) for M and — due to the buffer regions — 4,157,558 non-zero elements (88.45% sparsity) for M^{-1}. We filled the matrix with random numbers, in such a way that the matrix is symmetric and positive definite, and with eigenvalues in the range from 0.1 to 1.1, corresponding to a condition number of 11. Values that are zero due to the sparsity pattern are marked in gray. For the sake of better visualization the coloring schemes end at 10^{-9} and 10^{-20}, respectively, even though there are still much smaller values.

There are various ways to enforce such a sparsity, for instance simple methods such as setting all elements below a given threshold value to zero or using information about the distance of the basis functions corresponding to a given matrix entry, as well as more sophisticated approaches that allow the error introduced by the truncation to be controlled [59]. In our case, we work with a fixed sparsity pattern for the sake of simplicity. In order to keep the error introduced by the truncation small, the sparsity pattern of the matrix after the applied operation has to be slightly larger than the original one, meaning that there must be some "buffers" into which the matrix can extend during the applied operation. The size of these buffers is related to the properties of the matrix and the specific operation, and hence the correct setup depends on the particular application.
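
The truncation itself is a pure masking operation; a minimal dense sketch, with pattern an assumed boolean mask:

    import numpy as np

    def truncate_to_pattern(A, pattern):
        """Force all entries outside the fixed sparsity pattern to zero
        (Sec. II D a); applied whenever an operation could let the matrix
        grow beyond the pattern (including its buffer regions)."""
        return np.where(pattern, A, 0.0)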

As a small illustration we show in Fig. 2 the behavior for the calculation of the inverse. In Fig. 2a we show the values of the original matrix, with the entries that are strictly zero due to the sparsity pattern marked in gray. In Fig. 2b we show the inverse, calculated exactly and without any sparsity constraint; as can be seen, this matrix is less sparse than the original one, but nevertheless far from being dense. Consequently, it is reasonable to again map its values onto a sparsity pattern. However, due to the larger extent, this sparsity pattern has to contain the aforementioned buffer regions. This is illustrated in Fig. 2c, where we show the inverse calculated by CheSS within the predefined enlarged sparsity pattern. Indeed we see that this pattern has been chosen reasonably and is able to absorb the spread of the inverse compared to the original matrix. This becomes even more visible in Fig. 2d, where we show — using a different scale for the colors — the absolute difference between the exact solution and the one calculated by CheSS. We show this difference for the entire matrix, i.e. both inside and outside of the sparsity pattern; as can be seen the difference is very small throughout the entire matrix — the maximal difference is only 1.2 × 10^{-7} — showing firstly that the buffer region has been chosen reasonably and secondly that the inverse within the sparsity pattern has been accurately calculated by CheSS.

b. Effects on the error definitions The fact that the resulting matrix is mapped back onto a sparsity pattern also has an impact on the definition of the "exact solution". There are two ways to define the exact sparse solution of an arbitrary operation f(M):

1. By defining the exact solution as the one that we obtain by calculating f(M) without any constraints and then cropping the result to the desired sparsity pattern. The drawback of this definition is that it in general violates the essential identity f^{-1}(f(M)) = M.

2. By calculating the solution directly within the sparsity pattern, symbolized as \tilde{f}(M), and then defining the exact solution as the one which fulfills f^{-1}(\tilde{f}(M)) = M.

CheSS calculates the solution by construction within the predefined sparsity pattern, and the exact solution is therefore defined in the second way. More quantitatively, the error according to this second definition is given by

w_{f^{-1}(f)} = \frac{1}{|\tilde{f}(M)|} \sqrt{ \sum_{(\alpha\beta) \in \tilde{f}(M)} \left( f^{-1}(\tilde{f}(M))_{\alpha\beta} - M_{\alpha\beta} \right)^2 } ,   (12)

where \sum_{(\alpha\beta) \in \tilde{f}(M)} indicates a summation over all elements within the predefined sparsity pattern of \tilde{f}(M) and |\tilde{f}(M)| denotes the number of elements within this pattern. Nevertheless we will also report errors according to the first definition, in order to assess the effect of the sparsity pattern and the resulting truncation, and define


this error as

w_{f_{sparse}} = \frac{1}{|\tilde{f}(M)|} \sqrt{ \sum_{(\alpha\beta) \in \tilde{f}(M)} \left( \tilde{f}(M)_{\alpha\beta} - f(M)_{\alpha\beta} \right)^2 } .   (13)
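
Both error measures share the same structure; a small sketch, with approx and reference dense arrays and pattern the boolean mask of the sparsity pattern of \tilde{f}(M):

    import numpy as np

    def w_error(approx, reference, pattern):
        """RMS-style deviation of Eqs. (12) and (13): the root of the
        summed squared differences over the elements inside the pattern,
        divided by the number of such elements."""
        diff = (approx - reference)[pattern]
        return np.sqrt(np.sum(diff ** 2)) / np.count_nonzero(pattern)

For Eq. (12) one passes approx = f^{-1}(\tilde{f}(M)) and reference = M; for Eq. (13), approx = \tilde{f}(M) and reference = f(M).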

c. Fixed versus variable sparsity Related to the topic of sparsity and truncation, we finally also want to point out an important advantage of our approach to calculate the density kernel compared to another popular class of methods, namely density matrix purification [59–67]. Both methods, i.e. FOE and purification, are based on a series of repeated matrix multiplications; however, whereas the FOE method directly expands the density matrix as a polynomial of the Hamiltonian, purification recursively applies low order polynomials. As a result of this recursive procedure, these methods require fewer matrix multiplications than the FOE method.

However, the cost of the multiplications performed in recursive procedures and in the FOE method is not equal. This is because of matrix fill-in. In Fig. 3, we show the fill-in over the course of a representative purification calculation. This calculation used the fourth-order trace resetting method (TRS4) of Niklasson [64], and was performed on a system representing a DNA fragment in solution. The sparsity is here defined in a variable way, namely by the magnitude of the matrix entries, i.e. all elements below a given threshold are set to zero. Indeed we see that the sparsity decreases considerably, from more than 98% to about 93%. This fill-in is even greater for poor starting guesses, systems with smaller HOMO-LUMO gaps, or calculations requiring greater accuracy.

During a density matrix purification calculation, the performance is determined by the cost of multiplying two matrices that have as many nonzero elements as the final density matrix. By contrast, CheSS always multiplies the intermediate density matrix by the sparser Hamiltonian matrix. This makes it possible to determine the (constant) cost of the matrix multiplications beforehand, whereas for the purification schemes it may explode unexpectedly. Finally, the fact that in FOE we always apply the same matrix also permits a much easier parallelization — each column of the final matrix can be calculated independently — compared to the purification approach. This, combined with the aforementioned strict sparsity, allows CheSS to perform accurate calculations with great efficiency, as will be demonstrated in Sec. III.

E. Storage format of the sparse matrices

CheSS is designed to work with large sparse matrices and consequently only stores the non-zero entries within a single one-dimensional array. To describe the corresponding sparsity pattern, it uses a special format that we denote as the Segment Storage Format (SSF).


FIG. 3. Fill-in of the intermediate matrices, represented by their sparsity, in the course of a purification calculation using the fourth-order trace resetting method (TRS4) [64]. As a test system we used a DNA fragment in solution (17947 atoms), giving rise to a matrix of size 36460 × 36460.

The basic idea of the SSF format is to group together consecutive non-zero entries as segments. Assuming that we have n_seg such segments, the SSF format then requires two descriptor arrays, denoted by keyg and keyv, of dimension (2, 2, n_seg) and (n_seg), respectively. keyg indicates the start and end of each segment in "dense coordinates", i.e. each entry has the form (c_s, c_e; r_s, r_e), with c_s and c_e denoting the starting and ending column, respectively, and r_s and r_e denoting the starting and ending row, respectively. keyv indicates at which entry within the array of non-zero entries a given segment starts; this array actually contains redundant information that can be reconstructed at any time from keyg and is only used to accelerate the handling of the sparse matrices. An illustration of this storage format for a simple 5 × 5 matrix is shown in Fig. 4.

The advantage of this format is that it allows the sparsity pattern to be described in a very compact form. For a matrix containing n_seg segments, it requires 5 × n_seg descriptor elements, or actually only 4 × n_seg if the redundant keyv descriptors are omitted. The CCS/CRS format, which is a standard format to store sparse matrices, requires for the same description n_col + n_nz elements, with n_col being the number of columns/rows and n_nz being the number of non-zero entries. Assuming that n_nz ≫ n_col and thus neglecting the contribution coming from the n_col elements, we consequently see that our format is more compact as soon as the average segment size is larger than 4, which is likely to be the case for many applications. This does not only reduce the memory footprint during a calculation, but also speeds up I/O operations and cuts down the required disk space for storage.
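
The construction of the descriptors can be sketched as follows for the row-storage variant of Fig. 4; indices are 0-based here, whereas the library's own conventions (e.g. 1-based Fortran indexing) may differ.

    import numpy as np

    def to_ssf(A, tol=0.0):
        """Build SSF-style descriptors for a dense matrix: segments are
        runs of consecutive non-zeros within one row. keyg stores, per
        segment, (col_start, col_end, row_start, row_end); keyv stores the
        index of the segment's first value inside the flat values array."""
        values, keyg, keyv = [], [], []
        for r, row in enumerate(np.asarray(A)):
            c, n = 0, len(row)
            while c < n:
                if abs(row[c]) > tol:
                    start = c
                    while c < n and abs(row[c]) > tol:
                        c += 1
                    keyg.append((start, c - 1, r, r))   # segment spans one row
                    keyv.append(len(values))
                    values.extend(row[start:c])
                else:
                    c += 1
        return np.array(values), np.array(keyg), np.array(keyv)

With n_seg segments this needs 5 × n_seg descriptor integers (4 × n_seg without keyv), which is the count entering the comparison with CCS/CRS above.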

Even though CheSS is designed for large scale applications, the focus on electronic structure methods — which are computationally very expensive — nevertheless limits the matrix sizes that are typically handled by the library. Consequently it is in most cases not necessary to resort to complicated parallel distribution schemes for the sparse matrices. Nevertheless we have implemented such a distribution scheme to make CheSS also usable in extreme situations. For technical details we refer to the appendix of Ref. 9.


FIG. 4. Schematic illustration of the descriptors keyg and keyv used by the SSF format to store a sparse matrix. Consecutive non-zero entries are grouped together in segments, which in this toy example are however restricted to a single row each. The values of each segment are simply stored in a consecutive one-dimensional array. The example shows the format for a row storage implementation, but the same concept is also applicable to a column storage setup.


III. PERFORMANCE

In the following we will present various benchmarks in order to evaluate the accuracy and performance of CheSS. The sparsity patterns of all matrices used for these tests come from calculations of small water droplets with the BigDFT [8,9] code. The buffer regions mentioned in Sec. II D are based on simple geometrical criteria; nevertheless this does not decrease the validity of the following tests, as the sparsity pattern is always something that depends on the specific application and is therefore determined by the code interfacing with CheSS. Moreover, we took — unless otherwise stated — from the BigDFT runs only the sparsity pattern, but not the content of the matrices; rather, they were filled with random numbers in order to get the desired properties, like for instance the spectral width.

For the case of the matrix powers, we focus on the important case of the inverse; however this is not a restriction, as other powers can be calculated exactly along the same lines. For the extraction of selected eigenvalues we do not show many performance data, as most of them would be redundant with the data shown for the calculation of the density kernel.

FIG. 5. Mean error for the calculation of the inverse as a function of the condition number, according to Eq. (12) (w_{f^{-1}(f)}) and Eq. (13) (w_{f_{sparse}}), however considering the relative error instead of the absolute error. In order to avoid divisions by zero, only values larger than 10^{-12} were considered.

A. Accuracy

In this section we want to assess the accuracy of CheSS for each of the available operations. In all cases we took as example a matrix of dimension 6000 × 6000, with a degree of sparsity of 97.95% for S, 92.97% for H, and 88.45% for S^{-1} and K.

a. Inverse In Fig. 5 we show the errors for the calculation of the inverse, according to Eqs. (12) and (13), as a function of the condition number. However, in order to capture also differences between small numbers, we actually report the relative error, i.e. we replace in Eq. (12) \left( f^{-1}(\tilde{f}(M))_{\alpha\beta} - M_{\alpha\beta} \right)^2 by \left( \frac{f^{-1}(\tilde{f}(M))_{\alpha\beta} - M_{\alpha\beta}}{M_{\alpha\beta}} \right)^2 and in Eq. (13) \left( \tilde{f}(M)_{\alpha\beta} - f(M)_{\alpha\beta} \right)^2 by \left( \frac{\tilde{f}(M)_{\alpha\beta} - f(M)_{\alpha\beta}}{f(M)_{\alpha\beta}} \right)^2. The polynomial degree was determined automatically, as described in Sec. II B 2, with a value of α = −200 for the penalty function of Eq. (6); this leads to values of n_pl ranging from 60 to 4670, depending on the condition number. As can be seen, the error according to Eq. (12) is essentially zero, confirming the accuracy of the Chebyshev fit. The error according to Eq. (13) is slightly larger, but nevertheless remains small for all values of the condition number, indicating that the buffer regions have been chosen sufficiently large.

b. Density kernel To assess the accuracy of the density kernel computation we compare the energy calculated by CheSS using the FOE method (E_FOE = Tr(K_FOE H)) and the one determined by a reference calculation using LAPACK (E_LAPACK = Tr(K_LAPACK H)). The polynomial degree was again determined automatically, leading to values between 270 and 1080. In Fig. 6 we show the difference between these two values as a function of the spectral width and the HOMO-LUMO gap. As can be seen, the error shows only little variation with respect to both quantities and is always of the order of 0.01%.


FIG. 6. Difference between the energies calculated by CheSS using FOE and a reference LAPACK calculation, respectively, as a function of the HOMO-LUMO gap and for various spectral widths (50.0, 100.0 and 150.0 eV). The larger error for the smaller spectral widths can be explained by the eigenvalue spectrum being denser in that case, thus increasing the error introduced by the finite temperature smearing used by FOE.


c. Selected eigenvalues For the assessment of the accuracy of the selected eigenvalues, we work directly — i.e. without modifying its values — with a matrix coming from a calculation with BigDFT, in order to have a realistic setup. The matrix has a spectral width of 41.36 eV, which means that the 6000 eigenvalues only exhibit narrow separations among each other, of the order of some meV. The accuracy with which the eigenvalues can be calculated depends strongly on the decay length of the error function that is used; for this test we took a value of β = 27.2 meV, leading to a polynomial degree of 4120. Nevertheless we see from the results in Fig. 7 that this is enough to determine the eigenvalues quite accurately; more precisely, the mean difference between the exact result and the one calculated by CheSS is only 2.6 ± 2.3 meV.

If a higher accuracy than the one obtained is required, a smaller value for β has to be chosen. This would, however, dramatically increase the cost due to the higher polynomial degree that is required. Our method is thus more suited to getting a rough estimate of an (arbitrary) eigenvalue than to calculating its exact value. In the context of electronic structure calculations, an example of such a situation is the calculation of the HOMO-LUMO gap, where the intrinsic error of the theory used (e.g. DFT) is usually much larger than that of the numerical method used to calculate the eigenvalues. Additionally, in this context the polynomials of the density matrix expansion can be reused, and an estimate of the HOMO-LUMO gap can thus be obtained on the fly at hardly any extra cost, comparable to the method in Ref. 68.

FIG. 7. Errors of the calculated eigenvalues for a sparse matrix of dimension 6000 × 6000, with characteristics as explained in the text.

B. Scaling with matrix properties

In this section we want to assess the performance of CheSS with respect to certain specific properties of the matrices, in order to determine under which circumstances it offers the biggest benefits. For the tests we again used the same set of matrices as in Sec. III A.

1. Scaling with matrix size and sparsity

First we want to investigate how efficiently CheSS can exploit the sparsity of the matrices. Since the sparsity enters via the matrix vector multiplications for the construction of the Chebyshev polynomials — which are the same for all operations — we only present data for the inverse. In Fig. 8 we show the runtime as a function of the non-zero entries of the matrix, for various matrix sizes. Moreover we distinguish between the runtime with respect to the number of non-zero entries in the original matrix (Fig. 8a) and in the inverse (Fig. 8b). For each matrix size we generated several matrices with different "degrees of sparsity", i.e. different numbers of non-zero entries. All matrices were prepared to have a condition number of about 11, leading to an automatically determined polynomial degree of 60, and the runs were performed in parallel using 1280 cores.

As can be seen from Fig. 8a, the runtime scales linearly with respect to the number of non-zero elements in the original matrix and hardly depends on the matrix size. This is not surprising, as the non-zero elements directly determine the cost of the matrix vector multiplications, and it demonstrates the good exploitation of the sparsity by CheSS.

In Fig. 8b we see, however, a non-linear behavior. This can be explained by the fact that the cost of calculating each of the non-zero elements of the inverse depends again on the number of non-zero elements, yielding in total this quadratic behavior. In addition we see here also a dependence on the total matrix size, which is due to the buffer regions. The number of non-zero entries in the inverse, |M^{-1}|, is related to the number of non-zero entries in the original matrix, |M|, via |M^{-1}| = |M| + c, where c depends linearly on the matrix size. Therefore, in order to reach a given value of |M^{-1}|, |M| must be larger the smaller the matrix is, which explains the higher cost for the smaller matrices.


FIG. 8. Runtime for the calculation of the inverse as a function of the number of non-zero entries of the original matrix (panel a) and of the inverse (panel b), for various matrix sizes (6000 to 36000) and "degrees of sparsity". In panel (a) the runtime is shown as a function of the number of non-zero elements in the original matrix M; the dashed lines are linear fits. In panel (b) the runtime is shown as a function of the number of non-zero elements in the inverse M^{-1}; the dashed lines are quadratic fits. In (a) we see a linear scaling with respect to the number of non-zero entries and hardly any dependence on the total matrix size, whereas in (b) there is a quadratic scaling and a clear dependence on the total matrix size. The reasons for this different behavior are discussed in the text. All runs were performed in parallel using 1280 cores (80 MPI tasks spanning 16 OpenMP threads each) on MareNostrum 3.


2. Scaling with spectral properties

As mentioned earlier, the characteristics of the eigenvalue spectrum of the matrices are an important aspect for the performance of CheSS, and we therefore want to investigate this in more detail.

a. Condition number for the inverse For the calculation of the inverse we restrict ourselves to the case of positive definite matrices, which means that the spectral width can well be characterized by the condition number κ. We prepared a set of matrices with condition numbers ranging from 6 to 1423, and additionally conducted calculations for two setups. In the first one we used a default guess of [0.5, 1.0] for the eigenvalue bounds, whereas in the second setup we used already well adjusted values.

In Fig. 9 we show the results of this benchmark, with the runs being performed in parallel using 480 cores. As expected there is a very strong increase of the runtime for larger condition numbers, due to the higher polynomial degree that is required. However, we also see that a lot can be gained by choosing a good input guess for the eigenvalue bounds — something which is often possible in practical applications. In this case the polynomial degree remains more or less the same, but the expensive search for the correct eigenvalue bounds can be saved. Moreover, looking again at the values of κ in Tab. I, we see that the basis set employed by BigDFT indeed enables CheSS to operate in the optimal range of small condition numbers.

FIG. 9. Runtime and polynomial degree $n_{pl}$ for the calculation of the inverse as a function of the condition number. “bounds default” means that the runs were started with default values for the upper and lower eigenvalue bounds, whereas “bounds adjusted” means that well adjusted values were used. All runs were performed in parallel using 480 cores (60 MPI tasks spanning 8 OpenMP threads each) on MareNostrum 3.


b. Spectral width and HOMO-LUMO gap for the density kernel For the determination of the density kernel the essential characteristic is no longer the condition number, but rather the total spectral width. Additionally we have a dependence on the parameter β, which determines how fast the error function used to assign the occupation numbers decays between the highest occupied and the lowest unoccupied state, and which is therefore directly related to the HOMO-LUMO gap. The smaller β is, the more the error function resembles a step function, which is difficult to represent using polynomials. As a consequence, the polynomial degree becomes very large for both large spectral widths and small HOMO-LUMO gaps.
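As an illustration of this behavior, one can take an error-function occupation profile and measure the Chebyshev degree needed over a fixed spectral window as β shrinks. The sketch below uses our own notation and a simple doubling search; the exact functional form and parameter conventions inside CheSS may differ:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev
from scipy.special import erfc

def occupation(eps, mu=0.0, beta=1.0):
    """Error-function occupation: ~1 far below mu, ~0 far above, with
    the transition around mu sharpening as beta decreases."""
    return 0.5 * erfc((eps - mu) / beta)

def required_degree(beta, width=100.0, tol=1e-6, max_degree=65536):
    """Rough Chebyshev degree needed to represent the occupation over
    a spectral window of the given width (in eV) to within tol."""
    lo, hi = -width / 2.0, width / 2.0
    x = np.linspace(lo, hi, 4000)
    deg = 16
    while deg <= max_degree:
        p = Chebyshev.interpolate(occupation, deg, domain=[lo, hi],
                                  args=(0.0, beta))
        if np.max(np.abs(p(x) - occupation(x, 0.0, beta))) < tol:
            return deg
        deg *= 2
    return max_degree

# A sharper step (smaller beta, i.e. smaller gap) needs many more terms;
# the same happens when the window width grows at fixed beta.
for beta in (2.0, 1.0, 0.5, 0.1):
    print(f"beta = {beta:4.1f} eV: degree <= {required_degree(beta)}")
```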

In Fig. 10 we show the runtime for a density kernel calculation as a function of the HOMO-LUMO gap and for various spectral widths, with the runs being performed in parallel using 480 cores. Following the considerations of the previous test for the inverse, we used already well adjusted values for the eigenvalue bounds and β. We see our expectation confirmed: calculations with small gaps and large spectral widths are considerably more expensive. Whereas the value of the gap is imposed by the system under investigation, the spectral width depends on the specific computational setup — and in particular also the basis set — that is used. In order to keep it small, it is advisable to use a minimal basis set of optimized functions, which — among other advantages44,45 — has the benefit that it only contains few virtual (and therefore high-energy) states. These conditions are fulfilled by the basis set used by BigDFT, and indeed — as shown in Tab. I — this leads to small values for the spectral width, allowing CheSS to operate in an optimal range.

FIG. 10. Runtime and polynomial degree $n_{pl}$ for the density kernel calculation as a function of the HOMO-LUMO gap, for various spectral widths ($\varepsilon_{\max} - \varepsilon_{\min}$ = 50.0, 100.0 and 150.0 eV). The runs were started with already well adjusted values for both the upper and lower eigenvalue bounds and β, and performed in parallel using 480 cores (60 MPI tasks spanning 8 OpenMP threads each) on MareNostrum 3.

C. Parallel scaling

The most expensive part of the CheSS algorithm is the set of matrix vector multiplications of Eq. (4) used to construct the Chebyshev matrix polynomials. However, since this operation is independent for each vector, it can be parallelized in a straightforward way. To account for possible non-homogeneities of the sparsity pattern, we have implemented a mechanism that automatically assigns the vectors to the parallel resources so as to optimize the load balancing. CheSS exhibits a two-level hybrid parallelization: on a coarser level the workload is parallelized using MPI (i.e. distributed memory parallelization), whereas on a finer level an additional parallelization using OpenMP (i.e. shared memory parallelization) is used. In this way it is possible to obtain a very efficient exploitation of parallel resources.
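The load balancing described above can be pictured as a weighted assignment problem: each vector carries a cost proportional to the work of its matrix-vector products, and the vectors are distributed so that every MPI task receives about the same total cost. A minimal greedy sketch (our own simplification, not the actual CheSS mechanism):

```python
import heapq

def assign_vectors(cost_per_vector, n_tasks):
    """Greedy load balancing: hand out vectors, heaviest first, always
    to the currently least loaded MPI task."""
    heap = [(0, task, []) for task in range(n_tasks)]  # (load, id, vectors)
    heapq.heapify(heap)
    for vec in sorted(range(len(cost_per_vector)),
                      key=lambda i: cost_per_vector[i], reverse=True):
        load, task, vecs = heapq.heappop(heap)
        vecs.append(vec)
        heapq.heappush(heap, (load + cost_per_vector[vec], task, vecs))
    return {task: vecs for _, task, vecs in heap}

# An inhomogeneous pattern: a few dense columns among many sparse ones.
costs = [500] * 10 + [50] * 90
for task, vecs in sorted(assign_vectors(costs, n_tasks=4).items()):
    print(task, sum(costs[v] for v in vecs))  # loads come out nearly equal
```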

a. Scaling for small to medium size matrices In Fig. 11 we show the parallel scaling for the calculation of the inverse; for the sake of simplicity we again focus on this case, as the results for the other operations would be very similar. We took matrices of three different sizes (12000 × 12000, 24000 × 24000 and 36000 × 36000), with the number of non-zero entries chosen to be approximately proportional to the matrix size, and varied the number of cores from — depending on the matrix size due to memory limitations — 80, 160 and 320, respectively, up to 2560. The condition number for all matrices was set to 11, which led to an (automatically determined) polynomial degree of 60.

In Fig. 11a we show the speedup with respect to the minimal number of cores. The curves are very similar for all matrices, meaning that a good exploitation of parallel resources — and consequently a good speedup — can already be obtained for small systems. Indeed it is possible to bring down the calculation time to only a few seconds if enough computational resources are available, as can be seen from Fig. 11b. By fitting the data for the smallest matrix — which shows the worst scaling — to Amdahl's law69, we get an overall (i.e. also including communications) parallel fraction of 98.9%, which gives a maximal theoretical speedup of about 90. However, it must be stressed that this is the speedup with respect to 80 cores, and hence the overall maximal speedup is considerably higher.
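For reference, Amdahl's law predicts a speedup $S(N) = 1/\left((1-p) + p/N\right)$ on $N$ cores for a parallel fraction $p$; the quoted maximum follows directly from the fitted value, as this small check shows:

```python
def amdahl_speedup(n_cores, p=0.989):
    """Amdahl's law: speedup on n_cores for a parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n_cores)

# The asymptotic limit 1/(1 - p) reproduces the quoted maximum of ~90.
print(f"{amdahl_speedup(float('inf')):.1f}")   # -> 90.9
```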

FIG. 11. Parallel scaling for the calculation of the inverse using CheSS. Fig. 11a shows the speedup with respect to the minimal number of cores that was possible due to memory restrictions (80, 160 and 320, respectively); to ease the comparison, the curves for the 24000 matrix and the 36000 matrix start at 2.0 and 4.0, respectively. Fig. 11b shows the absolute runtimes. All runs were performed with 16 OpenMP threads, and only the number of MPI tasks was varied. The benchmarks were done on MareNostrum 3.

b. Extreme scaling behavior In the previous paragraph we have demonstrated the efficient exploitation of the typically used parallel resources for small to medium size systems. Now we also want to show the extreme scaling behavior of CheSS, i.e. how the library behaves when going to tens of thousands of cores. For this purpose we chose a slightly larger matrix, namely of size 96000 × 96000, again stemming from a calculation of a water droplet with BigDFT. In Fig. 12 we show the scaling that we obtain for the calculation of the inverse of this matrix, going from 1536 cores up to 16384 cores. Considering that the chosen matrix is still not extremely big, CheSS scales reasonably well also for very large numbers of cores, demonstrating its capability to perform efficient calculations under extremely parallel conditions.

FIG. 12. Extreme scaling behavior of CheSS for the calculation of the inverse, going from 1536 cores up to 16384 cores. The runs were performed using 8 OpenMP threads, only varying the number of MPI tasks, and the speedup is shown with respect to the first data point. The benchmarks were done on the K computer.

D. Comparison with other methods

Finally we want to compare the performance of CheSS with two other methods that can perform the same operations. On the one hand we benchmark it against (Sca)LAPACK, which is presumably the most efficient way to perform general purpose linear algebra operations for dense matrices. On the other hand we compare it with PEXSI, which can exploit the sparsity of the matrices and is an established package for large scale DFT calculations, as demonstrated for instance by its coupling with the SIESTA code70.

We tested five sets of matrices, ranging from 6000 × 6000 to 30000 × 30000, with the number of non-zero elements proportional to the matrix size. Moreover we performed the comparison for various values of the spectral width, in order to assess this important dependence. Following the conclusions of Sec. III B 2 we started the CheSS runs with well adjusted guesses for the eigenvalue bounds, thus simulating the conditions in a real application. We performed all runs in parallel, using 160 MPI tasks each spanning 12 OpenMP threads, i.e. using in total 1920 cores. We note that such a high number of threads does not seem to be optimal for PEXSI, due to its only moderate OpenMP speedup. This is in contrast to LAPACK and CheSS, which can exploit this wide shared memory parallelism in a very efficient way. However, such a setting may well be imposed by the application using the library — i.e. the electronic structure code, in our case BigDFT — for instance due to memory restrictions, or by the usage of many-core systems — becoming more and more abundant — that are designed for shared memory parallelization; overall our setup thus corresponds to realistic situations. In other terms, we are not just comparing the various solvers, but rather the solvers within a given specific (but nevertheless realistic) setup. Nevertheless, for completeness we will also show some results using exclusively an MPI parallelization, in order to see the effects on both CheSS and PEXSI.

a. Inverse Due to the general character of (Sca)LAPACK, we have to implement the required functionality on our own. To calculate matrix powers $M^a$, we first diagonalize the matrix $M$ as $D = U^T M U$, with a diagonal matrix $D$ and a unitary matrix $U$. Then we can easily apply the desired power to the diagonal elements $D_{ii}$ in order to get $D^a$, and finally we obtain the desired result as $M^a = U D^a U^T$. The diagonalization was done using the PDSYEVD routine, which is based on a parallel divide-and-conquer algorithm. For the calculation of the inverse there exist more specific routines within (Sca)LAPACK; we nevertheless use the aforementioned approach, as it is the most general one and allows — as does CheSS — the calculation of any desired power. With respect to PEXSI, we can invert a matrix using the Selected Inversion algorithm, which this package also contains and uses for the pole expansion within the density kernel calculation.
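In serial form, this reference procedure amounts to a dense symmetric eigendecomposition. A compact sketch using NumPy's LAPACK bindings (standing in for the ScaLAPACK routine PDSYEVD used in the actual benchmarks) could look as follows:

```python
import numpy as np

def dense_matrix_power(m, a):
    """Dense reference for M^a of a symmetric positive definite matrix:
    diagonalize M = U D U^T, raise the eigenvalues to the power a, and
    transform back as M^a = U D^a U^T. Works for arbitrary real a."""
    d, u = np.linalg.eigh(m)        # symmetric eigensolver (LAPACK)
    return (u * d**a) @ u.T         # broadcasting applies D^a column-wise

# Example: the inverse (a = -1) of a well-conditioned SPD test matrix.
rng = np.random.default_rng(0)
b = rng.standard_normal((500, 500))
m = b @ b.T + 500.0 * np.eye(500)
residual = np.max(np.abs(dense_matrix_power(m, -1.0) @ m - np.eye(500)))
print(f"max deviation from identity: {residual:.2e}")
```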

In Fig. 13 we show the timings that we obtain for the calculation of the inverse as a function of the condition number κ. As can be seen, CheSS is indeed the most efficient method for matrices with small values of κ. The only method that is competitive is the Selected Inversion, which does not show any dependence on the condition number and will therefore be faster for large values. The crossover between CheSS and the Selected Inversion depends on the matrix size — thanks to the linear scaling property of CheSS, the larger the matrices are, the higher it is. Nevertheless we see that for all matrices used in this test CheSS is the most efficient method, with the crossover with respect to the condition number being located at about 150. Following the discussion in Sec. II A, this is a value that is easily reachable in practical applications. Last but not least, we note that there is no case where LAPACK or ScaLAPACK are the fastest methods, demonstrating the need to exploit the sparsity of the matrices.

FIG. 13. Comparison of the runtimes for the matrix inversion using CheSS, the Selected Inversion from PEXSI, ScaLAPACK and LAPACK, for matrices of size 6000 to 30000 (the sparsity of $S$ ranges from 97.95% to 99.56%, that of $S^{-1}$ from 88.45% to 97.28%) and as a function of the condition number κ. All runs were performed in parallel, using 1920 cores (160 MPI tasks spanning 12 OpenMP threads each). The CheSS runs were started with well adjusted bounds for the eigenvalue spectrum, and the polynomial degree ranged from 60 to 260. For LAPACK, no results for matrices larger than 18000 are shown due to their long runtime. The benchmarks were done on MareNostrum 4.

b. Density kernel calculation Since CheSS is mainly designed for systems with a decent HOMO-LUMO gap, we focus on such systems; more specifically, we always set the gap to a value of 1 eV. In Fig. 14 we show the runtimes that we obtain for CheSS and PEXSI as a function of the spectral width. Moreover we show results for both a hybrid setup (160 MPI tasks spanning 12 OpenMP threads each) and an MPI-only setup (1920 MPI tasks). We do not show a comparison with (Sca)LAPACK, since the tests for the inverse have already demonstrated the need to use methods that take into account the sparsity of the matrices.

When using the hybrid setup, CheSS is in all cases the most efficient approach, even though the runtime slightly increases as a function of the spectral width. PEXSI does not exhibit such a dependence, but the difference to CheSS is so large that — according to the spectral widths presented in Sec. II A — it is easily possible to always operate in the regime where CheSS is the most efficient approach. When using the MPI-only setup, the runtimes of CheSS systematically worsen, whereas those of PEXSI improve. This leads to an inversion of the ranking for the smallest matrix, but for all the other ones CheSS remains the fastest method. Finally it is important to mention that the MPI-only setup does not allow calculations beyond matrix sizes of 18000 due to memory limitations. This clearly demonstrates the need to perform calculations using a hybrid distributed memory / shared memory approach — a regime in which CheSS is clearly superior.

FIG. 14. Comparison of the runtimes for the density kernel calculation using CheSS and PEXSI, for various matrices (sizes 6000 to 30000; the panel titles additionally list the sparsities of $S$, $H$ and $K$) and as a function of the spectral width. All matrices had a HOMO-LUMO gap of 1 eV. The runs were performed in parallel using 1920 cores, once with a hybrid setup (160 MPI tasks spanning 12 OpenMP threads each) and once with an MPI-only setup. The CheSS runs were started with a good input guess for the eigenvalue bounds, with the polynomial degree ranging from 270 to 790. For PEXSI, the number of poles used for the expansion was set to 40, following Ref. 24. The benchmarks were done on MareNostrum 4.

IV. CONCLUSIONS AND OUTLOOK

We presented CheSS, the “Chebyshev Sparse Solvers” library, which implements the flexible and efficient computation of matrix functions using an expansion in Chebyshev polynomials. The library was developed in the context of electronic structure calculations — in particular DFT — with a localized basis set, but can also be extended and applied to other problems. More specifically, CheSS can calculate the density matrix, any — in particular also non-integer — power of a matrix, and selected eigenvalues. CheSS is capable of efficiently exploiting the sparsity of the matrices, scaling linearly with the number of non-zero elements, and is consequently well suited for large scale applications requiring a linear scaling approach.

The performance of CheSS for a specific problem depends on how well the matrix function can be approximated by Chebyshev polynomials. This depends, obviously, on the function itself, but also on the spectral width of the matrices. Whereas the first dependence is imposed by the specific application, the second one is also related to the physical model and basis set that is employed. CheSS has been designed for matrices exhibiting a small eigenvalue spectrum, since this reduces the number of polynomials required to represent the function to be calculated. We used the library together with the DFT code BigDFT, which uses a minimal set of quasi-orthogonal, in-situ optimized basis functions, leading to the required small eigenvalue spectra of the matrices. We showed that in such a favorable setup CheSS is able to clearly outperform other comparable approaches, and hence can considerably boost large scale DFT calculations.

Finally, the algorithm on which CheSS is built can be parallelized in a very efficient way, allowing the library to scale up to thousands of cores. In addition, the parallelism can already be well exploited for relatively small matrices, and consequently good speedups and low runtimes can be obtained for such systems. The initial performance of CheSS was evaluated within a performance audit by the Performance Optimization and Productivity centre of excellence (POP), which helped to understand performance issues and provided recommendations and performance improvements71. We continue to cooperate with POP to further analyze and optimize the parallel efficiency and scalability of the library.


CheSS is being used intensively within the BigDFT code8,9 and is about to be coupled with the SIESTA code10,11. Moreover, it should be possible for any code working with localized basis functions to use this library and hence accelerate large scale calculations.

V. ACKNOWLEDGMENTS

We gratefully acknowledge the support of the MaX (SM) and POP (MW) projects, which have received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No. 676598 and 676553, respectively. This work was also supported by the Energy oriented Centre of Excellence (EoCoE), grant agreement number 676629, funded within the Horizon 2020 framework of the European Union, as well as by the Next-Generation Supercomputer project (the K computer project) and FLAGSHIP2020 within priority study 5 (Development of new fundamental technologies for high-efficiency energy creation, conversion/storage and use) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We (LG, DC, WD, TN) gratefully acknowledge the joint CEA-RIKEN collaboration action.

* [email protected]

1 N. Higham, Functions of Matrices (Society for Industrial and Applied Mathematics, 2008), http://epubs.siam.org/doi/pdf/10.1137/1.9780898717778.

2 P. Hohenberg and W. Kohn, “Inhomogeneous electron gas,” Phys. Rev. 136, B864–B871 (1964).

3 W. Kohn and L. J. Sham, “Self-consistent equations including exchange and correlation effects,” Phys. Rev. 140, A1133–A1138 (1965).

4 E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999).

5 L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997).

6 Laura E. Ratcliff, Stephan Mohr, Georg Huhs, Thierry Deutsch, Michel Masella, and Luigi Genovese, “Challenges in large scale quantum mechanical calculations,” Wiley Interdiscip. Rev.-Comput. Mol. Sci. 7, e1290 (2017).

7 Luigi Genovese, Alexey Neelov, Stefan Goedecker, Thierry Deutsch, Seyed Alireza Ghasemi, Alexander Willand, Damien Caliste, Oded Zilberberg, Mark Rayson, Anders Bergman, and Reinhold Schneider, “Daubechies wavelets as a basis set for density functional pseudopotential calculations,” J. Chem. Phys. 129, 014109 (2008).

8 Stephan Mohr, Laura E. Ratcliff, Paul Boulanger, Luigi Genovese, Damien Caliste, Thierry Deutsch, and Stefan Goedecker, “Daubechies wavelets for linear scaling density functional theory,” J. Chem. Phys. 140, 204110 (2014).

9 Stephan Mohr, Laura E. Ratcliff, Luigi Genovese, Damien Caliste, Paul Boulanger, Stefan Goedecker, and Thierry Deutsch, “Accurate and efficient linear scaling dft calculations with universal applicability,” Phys. Chem. Chem. Phys. (2015).

10 Jose M. Soler, Emilio Artacho, Julian D. Gale, Alberto García, Javier Junquera, Pablo Ordejón, and Daniel Sánchez-Portal, “The siesta method for ab initio order-N materials simulation,” J. Phys.: Condens. Matter 14, 2745 (2002).

11 Emilio Artacho, E. Anglada, O. Diéguez, J. D. Gale, A. García, J. Junquera, R. M. Martin, P. Ordejón, J. M. Pruneda, D. Sánchez-Portal, and J. M. Soler, “The siesta method; developments and applicability,” J. Phys.: Condens. Matter 20, 064208 (2008).

12 Joost VandeVondele, Matthias Krack, Fawzi Mohamed, Michele Parrinello, Thomas Chassaing, and Jürg Hutter, “Quickstep: Fast and accurate density functional calculations using a mixed Gaussian and plane waves approach,” Comput. Phys. Commun. 167, 103–128 (2005).

13 Chris-Kriton Skylaris, Peter D. Haynes, Arash A. Mostofi, and Mike C. Payne, “Introducing onetep: Linear-scaling density functional simulations on parallel computers,” J. Chem. Phys. 122, 084119 (2005), http://dx.doi.org/10.1063/1.1839852.

14 Peter D. Haynes, Chris-Kriton Skylaris, Arash A. Mostofi, and Mike C. Payne, “Onetep: linear-scaling density-functional theory with local orbitals and plane waves,” Phys. Status Solidi B 243, 2489–2499 (2006).

15 A. A. Mostofi, P. D. Haynes, C. K. Skylaris, and M. C. Payne, “Onetep: linear-scaling density-functional theory with plane-waves,” Mol. Simul. 33, 551–555 (2007), http://dx.doi.org/10.1080/08927020600932801.

16 Chris-Kriton Skylaris, Peter D. Haynes, Arash A. Mostofi, and Mike C. Payne, “Recent progress in linear-scaling density functional calculations with plane waves and pseudopotentials: the onetep code,” J. Phys.: Condens. Matter 20, 064209 (2008).

17 D. R. Bowler, I. J. Bush, and M. J. Gillan, “Practical methods for ab initio calculations on thousands of atoms,” Int. J. Quantum Chem. 77, 831–842 (2000).

18 D. R. Bowler, R. Choudhury, M. J. Gillan, and T. Miyazaki, “Recent progress with large-scale ab initio calculations: the conquest code,” Phys. Status Solidi B 243, 989–1000 (2006).

19 D. R. Bowler and T. Miyazaki, “Calculations for millions of atoms with density functional theory: linear scaling shows its potential,” J. Phys.: Condens. Matter 22, 074207 (2010).

20 S. Goedecker and L. Colombo, “Efficient linear scaling algorithm for tight-binding molecular dynamics,” Phys. Rev. Lett. 73, 122–125 (1994).

21 S. Goedecker and M. Teter, “Tight-binding electronic-structure calculations and tight-binding molecular dynamics with localized orbitals,” Phys. Rev. B 51, 9455–9464 (1995).

22 S. Goedecker, “Low complexity algorithms for electronic structure calculations,” J. Comput. Phys. 118, 261–268 (1995).

23 Stefan Goedecker, “Linear scaling electronic structure methods,” Rev. Mod. Phys. 71, 1085–1123 (1999).

24 Lin Lin, Mohan Chen, Chao Yang, and Lixin He, “Accelerating atomic orbital-based electronic structure calculation via pole expansion and selected inversion,” J. Phys.: Condens. Matter 25, 295501 (2013).

25 X.-P. Li, R. W. Nunes, and David Vanderbilt, “Density-matrix electronic-structure method with linear system-size scaling,” Phys. Rev. B 47, 10891–10894 (1993).

26 Weitao Yang and TaiSung Lee, “A density-matrix divide-and-conquer approach for electronic structure calculations of large molecules,” J. Chem. Phys. 103, 5674–5678 (1995), http://dx.doi.org/10.1063/1.470549.

27 D. R. Bowler and T. Miyazaki, “O(N) methods in electronic structure calculations,” Rep. Prog. Phys. 75, 036503 (2012).

28 Kevin R. Bates, Andrew D. Daniels, and Gustavo E. Scuseria, “Comparison of conjugate gradient density matrix search and chebyshev expansion methods for avoiding diagonalization in large-scale electronic structure calculations,” J. Chem. Phys. 109, 3308–3312 (1998), http://dx.doi.org/10.1063/1.476927.

29 Andrew D. Daniels and Gustavo E. Scuseria, “What is the best alternative to diagonalization of the hamiltonian in large scale semiempirical calculations?” J. Chem. Phys. 110, 1321–1328 (1999), http://dx.doi.org/10.1063/1.478008.

30 WanZhen Liang, Chandra Saravanan, Yihan Shao, Roi Baer, Alexis T. Bell, and Martin Head-Gordon, “Improved fermi operator expansion methods for fast electronic structure calculations,” J. Chem. Phys. 119, 4117–4125 (2003), http://dx.doi.org/10.1063/1.1590632.

31 Daniel K. Jordan and David A. Mazziotti, “Comparison of two genres for linear scaling in density functional theory: Purification and density matrix minimization methods,” J. Chem. Phys. 122, 084114 (2005), http://dx.doi.org/10.1063/1.1853378.

32 P. D. Haynes, C.-K. Skylaris, A. A. Mostofi, and M. C. Payne, “Density kernel optimization in the onetep code,” J. Phys.: Condens. Matter 20, 294207 (2008).

33 Elias Rudberg and Emanuel H. Rubensson, “Assessment of density matrix methods for linear scaling electronic structure calculations,” J. Phys.: Condens. Matter 23, 075502 (2011).

34 Rui-Sheng Ran and Ting-Zhu Huang, “An inversion algorithm for a banded matrix,” Comput. Math. Appl. 58, 1699–1710 (2009).

35 Jok M. Tang and Yousef Saad, “A probing method for computing the diagonal of a matrix inverse,” Numer. Linear Algebra Appl. 19, 485–501 (2012).

36 Lin Lin, Jianfeng Lu, Lexing Ying, Roberto Car, and Weinan E, “Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems,” Commun. Math. Sci. 7, 755–777 (2009).

37 C. Bekas, E. Kokiopoulou, and Y. Saad, “An estimator for the diagonal of a matrix,” Appl. Numer. Math. 57, 1214–1229 (2007), Numerical Algorithms, Parallelism and Applications (2).

38 S. Li, S. Ahmed, G. Klimeck, and E. Darve, “Computing entries of the inverse of a sparse matrix using the FIND algorithm,” J. Comput. Phys. 227, 9408–9427 (2008).

39 Alan George, “Nested dissection of a regular finite element mesh,” SIAM J. Numer. Anal. 10, 345–363 (1973), http://dx.doi.org/10.1137/0710032.

40 Lin Lin, Chao Yang, Jianfeng Lu, Lexing Ying, and Weinan E, “A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2d electronic structure calculations,” SIAM J. Sci. Comput. 33, 1329–1351 (2011), http://dx.doi.org/10.1137/09077432X.

41 Lin Lin, Chao Yang, Juan C. Meza, Jianfeng Lu, Lexing Ying, and Weinan E, “Selinv—an algorithm for selected inversion of a sparse symmetric matrix,” ACM Trans. Math. Softw. 37, 40:1–40:19 (2011).

42 C. Hartwigsen, S. Goedecker, and J. Hutter, “Relativistic separable dual-space gaussian pseudopotentials from H to Rn,” Phys. Rev. B 58, 3641–3662 (1998).

43 Alex Willand, Yaroslav O. Kvashnin, Luigi Genovese, Alvaro Vazquez-Mayagoitia, Arpan Krishna Deb, Ali Sadeghi, Thierry Deutsch, and Stefan Goedecker, “Norm-conserving pseudopotentials with chemical accuracy compared to all-electron calculations,” J. Chem. Phys. 138, 104109 (2013), http://dx.doi.org/10.1063/1.4793260.

44 Stephan Mohr, Michel Masella, Laura E. Ratcliff, and Luigi Genovese, “Complexity reduction in large quantum systems: Fragment identification and population analysis via a local optimized minimal basis,” J. Chem. Theory Comput. (accepted).

45 Stephan Mohr, Michel Masella, Laura E. Ratcliff, and Luigi Genovese, “Complexity reduction in large quantum systems: Reliable electrostatic embedding for multiscale approaches via optimized minimal basis functions,” J. Chem. Phys. (submitted).

46 Gerd Berghold, Michele Parrinello, and Jürg Hutter, “Polarized atomic orbitals for linear scaling methods,” J. Chem. Phys. 116, 1800–1810 (2002), http://dx.doi.org/10.1063/1.1431270.

47 Joost VandeVondele and Jürg Hutter, “Gaussian basis sets for accurate calculations on molecular systems in gas and condensed phases,” J. Chem. Phys. 127, 114105 (2007), http://dx.doi.org/10.1063/1.2770708.

48 P. E. Maslen, C. Ochsenfeld, C. A. White, M. S. Lee, and M. Head-Gordon, “Locality and sparsity of ab initio one-particle density matrices and localized orbitals,” J. Phys. Chem. A 102, 2215–2222 (1998), http://dx.doi.org/10.1021/jp972919j.

49 William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd ed. (Cambridge University Press, New York, NY, USA, 2007).

50 Emanuel H. Rubensson and Elias Rudberg, “Locality-aware parallel block-sparse matrix-matrix multiplication using the chunks and tasks programming model,” Parallel Comput. 57, 87–106 (2016).

51 Valery Weber, Teodoro Laino, Alexander Pozdneev, Irina Fedulova, and Alessandro Curioni, “Semiempirical molecular dynamics (semd) i: Midpoint-based parallel sparse matrix–matrix multiplication algorithm for matrices with decay,” J. Chem. Theory Comput. 11, 3145–3152 (2015), pMID: 26575751, http://dx.doi.org/10.1021/acs.jctc.5b00382.

52 Urban Borštnik, Joost VandeVondele, Valery Weber, and Jürg Hutter, “Sparse matrix multiplication: The distributed block-compressed sparse row library,” Parallel Comput. 40, 47–58 (2014).

53 Nicolas Bock and Matt Challacombe, “An optimized sparse approximate matrix multiply for matrices with decay,” SIAM J. Sci. Comput. 35, C72–C98 (2013), https://doi.org/10.1137/120870761.

54 Yunkai Zhou, James R. Chelikowsky, and Yousef Saad, “Chebyshev-filtered subspace iteration method free of sparse diagonalization for solving the kohn–sham equation,” J. Comput. Phys. 274, 770–782 (2014).

55 Thomas Ericsson and Axel Ruhe, “The spectral transformation Lanczos method for the numerical solution of large sparse generalized symmetric eigenvalue problems,” Math. Comput. 35, 1251 (1980).

56 Tetsuya Sakurai and Hiroshi Sugiura, “A projection method for generalized eigenvalue problems using numerical integration,” J. Comput. Appl. Math. 159, 119–128 (2003).

57 Eric Polizzi, “Density-matrix-based algorithm for solving eigenvalue problems,” Phys. Rev. B 79, 115112 (2009).

58 Ayako Nakata, Yasunori Futamura, Tetsuya Sakurai, David R. Bowler, and Tsuyoshi Miyazaki, “Efficient calculation of electronic structure using o(n) density functional theory,” J. Chem. Theory Comput. (2017), pMID: 28714682, http://dx.doi.org/10.1021/acs.jctc.7b00385.

59 Emanuel H. Rubensson and Paweł Sałek, “Systematic sparse matrix error control for linear scaling electronic structure calculations,” J. Comput. Chem. 26, 1628–1637 (2005).

60 R. McWeeny, “The density matrix in self-consistent field theory i. iterative construction of the density matrix,” Proc. R. Soc. A 235, 496–509 (1956), http://rspa.royalsocietypublishing.org/content/235/1203/496.full.pdf.

61 R. McWeeny, “Some recent advances in density matrix theory,” Rev. Mod. Phys. 32, 335–369 (1960).

62 Adam H. R. Palser and David E. Manolopoulos, “Canonical purification of the density matrix in electronic-structure theory,” Phys. Rev. B 58, 12704–12711 (1998).

63 A. Holas, “Transforms for idempotency purification of density matrices in linear-scaling electronic-structure calculations,” Chem. Phys. Lett. 340, 552–558 (2001).

64 Anders M. N. Niklasson, “Expansion algorithm for the density matrix,” Phys. Rev. B 66, 155115 (2002).

65 Anders M. N. Niklasson, C. J. Tymczak, and Matt Challacombe, “Trace resetting density matrix purification in o(n) self-consistent-field theory,” J. Chem. Phys. 118, 8611–8620 (2003), http://dx.doi.org/10.1063/1.1559913.

66 David A. Mazziotti, “Towards idempotent reduced density matrices via particle-hole duality: Mcweeny's purification and beyond,” Phys. Rev. E 68, 066701 (2003).

67 H. J. Xiang, W. Z. Liang, Jinlong Yang, J. G. Hou, and Qingshi Zhu, “Spin-unrestricted linear-scaling electronic structure theory and its application to magnetic carbon-doped boron nitride nanotubes,” J. Chem. Phys. 123, 124105 (2005), http://dx.doi.org/10.1063/1.2034448.

68 Emanuel H. Rubensson and Anders M. N. Niklasson, “Interior eigenvalues from density matrix expansions in quantum mechanical molecular dynamics,” SIAM J. Sci. Comput. 36, B147–B170 (2014), https://doi.org/10.1137/130911585.

69 G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” IEEE Solid-State Circuits Society Newsletter 12, 19–20 (2007).

70 Lin Lin, Alberto García, Georg Huhs, and Chao Yang, “Siesta-pexsi: massively parallel method for efficient and accurate ab initio materials simulation without matrix diagonalization,” J. Phys.: Condens. Matter 26, 305503 (2014).

71 Michael Wagner, Claudia Rosas, Judit Gimenez, and Jesus Labarta, “CheSS/SIESTA performance assessment report (POP AR 32),” (2016).