
    ITERATIVE SOLUTION OF GENERALIZED EIGENVALUE

    PROBLEMS FROM OPTOELECTRONICS WITH TRILINOS

CHRISTOF VÖMEL†, RATKO G. VEPREK‡, URBAN WEBER§, AND PETER ARBENZ†

Abstract. In this paper, we study the iterative solution of generalized Hermitian eigenvalue problems arising from the finite-element discretization of k·p models of optoelectronic nano systems. We are interested in computing the eigenvalues close to the band-gap, which determine the electronic and optical properties of a given system.

Our work is based on the Trilinos project, which provides an object-oriented software framework of integrated algorithms for the solution of large-scale physics problems. Trilinos enables users to combine state-of-the-art eigensolvers with efficient preconditioners, sparse solvers, and partitioning methods. Our study illustrates these possibilities and evaluates various algorithms for their suitability in the context of our physical problem setting.

    AMS subject classifications.   65F15, 65Y15.

Key words. Generalized eigenvalue problem, electronic structure, k·p method, Trilinos, Anasazi, Davidson, Jacobi-Davidson, LOBPCG, Krylov-Schur.

1. Introduction. Ongoing advances in semiconductor technology today permit the fabrication of novel optoelectronic devices on the nano-scale level. With pronounced quantum-mechanical effects occurring in these devices, there is a need for a correct physical description by complex quantum-mechanical models. Furthermore, in order to determine, design, and optimize the optical and electronic properties of a device on the computer, scientists and engineers need efficient numerical methods at their disposal. As a consequence, much recent research in computational science has been dedicated to advancing numerical algorithms for this purpose; see for example the recent review [49].

As one instance, we study here how to solve a large generalized Hermitian eigenvalue problem resulting from a finite-element discretization of one underlying quantum-mechanical model, the k·p envelope function method [12]. Of particular interest to us are those eigenvalues which are close to the so-called band-gap. The band-gap represents the border between the highest occupied and the lowest unoccupied electron states. Its width has a strong effect on the electronic and optical properties of the physical system.

Our focus in this study is on exploiting the capabilities provided by the Trilinos [28, 59] project. Trilinos facilitates the use and interoperability of multiple algorithmic concepts - eigensolvers, preconditioners, linear solvers, partitioning methods - and thus offers a powerful environment for exploring and evaluating a variety of approaches. It is a central part of the Advanced CompuTational Software (ACTS) Collection [24], a set of software tools developed and funded by the US Department of Energy.

Trilinos itself, with the Anasazi package [10], provides three eigensolvers:

• an implementation [7] of the original Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method [36],

†Institute of Computational Science, ETH Zürich, CAB, Universitätstrasse 6, 8092 Zürich, Switzerland, {cvoemel,arbenz}@inf.ethz.ch
‡Integrated Systems Laboratory, Department of Information Technology and Electrical Engineering, ETH Zürich, Switzerland, [email protected]
§Formerly at Institute of Computational Science, ETH Zürich, [email protected]


• a block version [7] of Davidson's [16] generalized method [15, 44, 45], and
• a Krylov-Schur method [57, 58].

In addition, we include in our study the implementation [26] of the Jacobi-Davidson algorithm [29, 52, 53], which is also based on Trilinos. One part of this paper is dedicated to exhibiting how these methods interact with and benefit from other independent software packages that can be integrated with Trilinos, including LAPACK [6, 37], ScaLAPACK [13, 14, 22, 23], Umfpack [17, 18, 19, 20], Mumps [1, 4, 5], AMD [2, 3], Metis [31, 32], and ParMetis [33, 34].

A second major part of this paper gives a comparison of the aforementioned eigensolvers - in combination with selected preconditioners and iterative and/or direct sparse linear solvers - on a set of challenging physics problems. Specifically, we study the computation of electron states close to the band-gap of InAs/GaAs quantum wires of various sizes and with models of varying complexity.

The rest of this paper is organized as follows. Section 2 presents some background information on the numerical formulation of the physical model. Section 3 contains a description of the eigensolvers, their realization in Trilinos, and their interplay with other software packages. Experimental results are given in Section 4. Section 5 concludes and summarizes.

Fig. 2.1. Probability density of the fourth valence sub-band state in an InAs square quantum wire with an area of 6 nm², embedded within GaAs. The state was obtained by solving the k·p envelope equations for a k·p 8 × 8 model on a tensorial mesh with 17161 vertices using linear finite elements.

Fig. 2.2. Iso-probability surface plot for |Ψ|² = 0.004 of the fourth valence sub-band state in a cubic InAs quantum dot with a volume of 5 nm³, embedded within GaAs. The state was obtained using the k·p 8 × 8 model. The mesh consists of 40474 vertices and 257767 tetrahedral elements.

2. The k·p method. The optically interactive regions in an optoelectronic device usually consist of semiconductor heterostructures of a few atomic layers. These heterostructures lead to the formation of confined electron states. If the confinement is in one direction, the system is denoted a quantum well, while confinement in two dimensions is referred to as a quantum wire, see the example in Figure 2.1. Confinement in all three directions leads to the formation of an artificial atom denoted a quantum dot, see also Figure 2.2. The charge carrier in a semiconductor nanostructure can be described by its Bloch wavefunction [8]

\Psi_{nk}(x, z) = u_{nk}(x, z)\, e^{i k \cdot x} F_{nk}(z),   (2.1)

  • 8/16/2019 Iterative Solution of Generalized Eigenvalue

    3/15

    Eigenvalue problems from optoelectronics with Trilinos   3

where x is the coordinate in the free direction and z stands for the confined direction(s). u_{nk}(x, z) is the fast-oscillating, lattice-periodic part with

u_{nk}(r) = u_{nk}(r + T),   (2.2)

where T is a lattice vector that describes the translation invariance of the crystal lattice. The lattice-periodic part is modulated by a slowly varying plane wave e^{i k \cdot x} along the translationally invariant direction(s) and by a slowly varying envelope F_{nk}(z) in the confined direction.

The wavefunction \Psi_{nk} is the solution of the time-independent Schrödinger equation of the nanostructure

\hat{H} \Psi_{nk}(x, z) = E_{nk} \Psi_{nk}(x, z)   (2.3)

and is indexed by two quantum numbers: an integer n describing the sub-band index and the continuous wavenumber k. The solution of (2.3) is highly non-trivial as, in principle, it requires the inclusion of an immense number of atoms and electrons.

The k·p envelope function method [30, 41] simplifies the problem by expanding the wavefunction in a specific set of known bulk lattice-periodic wavefunctions u_{m0}(x, z):

\Psi_{nk}(x, z) = \sum_m u_{m0}(x, z)\, e^{i k \cdot x} F_{nk,m}(z).   (2.4)

As a result, one obtains a coupled system of partial differential equations for the envelopes F_{nk,m}(z), the k·p envelope equations:

\Big[ -\sum_{i,j} \partial_i H^{(2)}_{ij}(z; k)\, \partial_j + \sum_i \big( H^{(1)}_{i;L}(z; k)\, \partial_i + \partial_i H^{(1)}_{i;R}(z; k) \big) + H^{(0)}(z; k) \Big] F_{nk}(z) = E_{nk} F_{nk}(z).   (2.5)

In principle, the number of basis functions u_{m0}(x, z) is infinite. In calculations, one reduces the basis to a very small set of, say, 4, 6, or 8 relevant eigenfunctions from the energy bands at k = 0. The less important ones are included using Löwdin's perturbation theory [40]. This truncates (2.5) to a finite set of equations, which can be solved using finite elements.

By left-multiplying (2.5) with a test function W^*, integrating over the domain \Omega, and applying a zero Dirichlet boundary condition for test and wave functions, one obtains the weak form of (2.5):

\int_\Omega \sum_{ij} \partial_i W^* H^{(2)}_{ij}\, \partial_j F_{nk}\, dx + \int_\Omega \sum_i \big( W^* H^{(1)}_{i;L}\, \partial_i F_{nk} - \partial_i W^* H^{(1)}_{i;R} F_{nk} \big)\, dx + \int_\Omega W^* H^{(0)} F_{nk}\, dx = E_{nk} \int_\Omega W^* F_{nk}\, dx.   (2.6)

One can express F_{nk} in terms of the finite-element basis functions \{N_j\}_{j=1}^{M} and coefficients c_{nk,j}:

F_{nk} = \sum_{j=1}^{M} c_{nk,j} N_j,   (2.7)


    and insert (2.7) into (2.6) to obtain

\sum_{j=1}^{M} a(N_i, N_j)\, c_{nk,j} = E_{nk} \sum_{j=1}^{M} m(N_i, N_j)\, c_{nk,j} \quad \forall N_i.   (2.8)

Here, a(·,·) and m(·,·) are the bilinear forms on the left- and right-hand sides of (2.6). The evaluation of the integrals for the basis functions then yields a generalized Hermitian eigenvalue problem whose solution is discussed in Section 3.
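To make the assembly step concrete, the following sketch builds a generalized eigenproblem A c = E M c with linear finite elements for a toy one-dimensional operator, -d²/dz² on [0, 1] with zero Dirichlet boundaries; the 1D Laplacian stands in for the k·p bilinear forms a(·,·) and m(·,·), and the mesh size is an illustrative assumption.

```python
import numpy as np
from scipy.sparse import diags
from scipy.linalg import eigh

# Linear FEM on a uniform mesh: n interior nodes, mesh width h.
n, h = 99, 1.0 / 100
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h      # stiffness a(N_i, N_j)
M = diags([1.0, 4.0, 1.0], [-1, 0, 1], shape=(n, n)) * (h / 6)  # consistent mass m(N_i, N_j)

# Generalized Hermitian eigenproblem A c = E M c (dense solve for this toy size).
E, C = eigh(A.toarray(), M.toarray())
print(E[:3])                               # approximations to (k*pi)^2, k = 1, 2, 3
```

The computed lowest eigenvalues approach (kπ)² as the mesh is refined, mirroring how the discretized k·p equations approximate the continuous spectrum.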

    3. Eigensolvers.

3.1. Approximation from a subspace. Our generalized eigenvalue problem arising from the finite-element discretization of the physical model can be stated as

A x = \lambda M x,   (3.1)

where A and M are Hermitian and, in addition, M is positive definite. As a consequence, the pencil (A, M) has real eigenvalues and a complete set of M-orthogonal, in general complex, eigenvectors.

    Each eigenvector is a stationary point of the Rayleigh quotient  ρ

\rho(y) := \frac{y^* A y}{y^* M y}, \quad y \neq 0.   (3.2)

    Convergence of an eigenpair approximation can be measured by the residual

r(y) := A y - \rho(y) M y, \quad r(y) \perp y,   (3.3)

    as there is guaranteed to be an eigenvalue  λ  satisfying

|\lambda - \rho(y)| \le \frac{\|r(y)\|_{M^{-1}}}{\|M y\|_{M^{-1}}},   (3.4)

see [46, Theorem 15.9.1]. Furthermore, the residual is collinear with the gradient of the Rayleigh quotient, \nabla\rho. Thus, it is used - typically together with a preconditioner - in optimization-based eigensolvers such as LOBPCG.
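A minimal sketch of the Rayleigh quotient (3.2) and residual (3.3) for a small random Hermitian pencil (A, M); the matrices are illustrative, not taken from the k·p discretization.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (B + B.conj().T) / 2                    # Hermitian A
C = rng.standard_normal((n, n))
M = C @ C.T + n * np.eye(n)                 # symmetric positive definite M

def rayleigh_quotient(y):
    # rho(y) = (y* A y) / (y* M y); real for a Hermitian pencil.
    return np.real((y.conj() @ A @ y) / (y.conj() @ M @ y))

def residual(y):
    # r(y) = A y - rho(y) M y, orthogonal to y by construction.
    return A @ y - rayleigh_quotient(y) * (M @ y)

lam = eigh(A, M, eigvals_only=True)         # exact eigenvalues of the pencil
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)
rho = rayleigh_quotient(y)
print(min(abs(lam - rho)))                  # gap to the nearest eigenvalue
```

For any nonzero y, rho lies between the extreme eigenvalues of the pencil, and the residual norm controls the gap to the nearest eigenvalue as in (3.4).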

One general framework for computing approximations to the exact eigenpairs from a given subspace is the so-called Rayleigh-Ritz approach [9, 46]. Let v_1, ..., v_j denote an M-orthonormal basis of an approximation subspace, that is, V^* M V = I, where V denotes the matrix with columns v_1, ..., v_j. The Galerkin condition

V^*(A V s - \theta M V s) = 0   (3.5)

yields a projected eigenproblem with j solutions (\theta_i, s_i). The pairs (\theta_i, u_i := V s_i) are called Ritz values and Ritz vectors. The Ritz values \theta_i are real, and the Ritz vectors u_i can be chosen to be M-orthogonal. Each of the eigensolvers in our study has its own way of constructing the approximation subspace; this is explained in Section 3.2.

3.2. Algorithms. This section gives a short overview of the eigensolvers used in our study. More information on the theory is available in [7, 56]; implementation aspects can be found in [10] and [26].


3.2.1. LOBPCG. The Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method [36] is an enhanced block variant of the Preconditioned Conjugate Gradient (PCG) method [7, 35]. Methods of this kind can be used to compute extrema of the Rayleigh quotient (3.2) from a subspace built from preconditioned vectors derived from \nabla\rho.

In its non-blocked form, LOBPCG is called LOPCG and works with a three-dimensional approximation space. Depending on whether the left- or rightmost eigenpairs are approximated, an iterate is constructed as

y_{i+1} = \arg\min/\max \{\, \rho(y) : y \in \operatorname{span}(y_{i-1}, y_i, P\, r(y_i)) \,\};   (3.6)

note the preconditioned residual direction P r(y_i). In the block algorithm LOBPCG, the minimizing set of Ritz vectors is chosen from a subspace built from the block-vector generalization of the right-hand side of (3.6), see [36].
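The same LOBPCG idea can be tried with SciPy's implementation, standing in for the Anasazi one used in the paper; the pencil below is a toy example and no preconditioner is supplied.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

n = 100
# Well-conditioned SPD stiffness-like A and diagonal SPD "mass" matrix M.
A = diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsc()
M = diags(np.linspace(1.0, 2.0, n)).tocsc()

X = np.random.default_rng(2).standard_normal((n, 3))   # initial block of 3 vectors
# B is the right-hand-side matrix of the pencil; largest=False targets the
# leftmost eigenvalues, i.e. the "min" branch of (3.6).
w, V = lobpcg(A, X, B=M, largest=False, tol=1e-9, maxiter=500)
print(np.sort(w))
```

The block size (here 3) determines how many eigenpairs are iterated on simultaneously; in practice a preconditioner (the M keyword of `lobpcg`) is what makes the method competitive.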

3.2.2. Davidson. Extending Davidson's original eigenvalue method [16], the single-vector Generalized Davidson (GD) algorithm [15, 44, 45] successively augments the projection subspace by directions z of the form

P z = -r(y_i).   (3.7)

The preconditioner P is built as a (usually easily invertible) approximation to the operator (A - \sigma M). Davidson originally proposed using the diagonal [16]; GD allows more general preconditioners. The shift \sigma targets the desired eigenvalues and is commonly fixed.

In the block version available in Trilinos, one simultaneously solves (3.7) with a block residual of a block of orthogonal vectors as right-hand side, see also [10].
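A single-vector Generalized Davidson iteration with the diagonal preconditioner of Davidson's original proposal might look as follows; the matrix, shift, and iteration count are illustrative assumptions (a nearly diagonal matrix, the classic setting where this preconditioner works well).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n = 100
A = np.diag(np.arange(1.0, n + 1)) + 1e-2 * rng.standard_normal((n, n))
A = (A + A.T) / 2                            # nearly diagonal symmetric matrix
M = np.eye(n)
sigma = 0.0                                  # fixed shift targeting the smallest eigenvalue
P = np.diag(A) - sigma * np.diag(M)          # diagonal of (A - sigma*M)

V = rng.standard_normal((n, 1))
V /= np.linalg.norm(V)
for _ in range(40):
    theta, S = eigh(V.T @ A @ V, V.T @ M @ V)    # Rayleigh-Ritz on current subspace
    y = V @ S[:, 0]                              # current Ritz vector
    r = A @ y - theta[0] * (M @ y)
    if np.linalg.norm(r) < 1e-8:
        break
    z = -r / P                                   # solve P z = -r (elementwise, P diagonal)
    z -= V @ (V.T @ z)                           # orthogonalize against the basis
    z -= V @ (V.T @ z)                           # second pass for numerical stability
    V = np.column_stack([V, z / np.linalg.norm(z)])
print(theta[0])
```

Each sweep expands the subspace by one preconditioned residual direction, so the Rayleigh-Ritz problem grows by one dimension per iteration until restart (not shown here).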

3.2.3. Jacobi-Davidson. The Jacobi-Davidson algorithm [29, 52, 53] uses a modification of the preconditioner (3.7) chosen in the Generalized Davidson algorithm.

The new direction z is required to lie in the subspace that is M-orthogonal to y_i, the previous best approximation. This yields the correction equation

(I - M y_i y_i^*)\, P\, (I - y_i y_i^* M)\, z = -r(y_i), \qquad z \perp_M y_i.   (3.8)

In practice, it is solved only approximately with an inner solver; the search subspace is then expanded by the correction z. The choice of the inner solver depends on which eigenvalues are sought. The Conjugate Gradient (CG) method can be appropriate for extremal eigenvalues; our algorithm uses the Quasi-Minimal Residual (QMR) solver. These linear solvers and other alternatives are described in [48], which also contains a complete list of references.
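The projected operator in (3.8) can be applied matrix-free and handed to a QMR inner solver, as sketched below; P, M, and y are toy quantities, and P = A - \rho M is one plausible choice of operator approximation (the exact shifted operator rather than a cheap preconditioner).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, qmr

rng = np.random.default_rng(4)
n = 50
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                             # symmetric A
M = np.diag(np.linspace(1.0, 2.0, n))         # SPD diagonal M
y = rng.standard_normal(n)
y /= np.sqrt(y @ M @ y)                       # normalize so that y^T M y = 1
rho = (y @ A @ y) / (y @ M @ y)
r = A @ y - rho * (M @ y)                     # residual, orthogonal to y
P = A - rho * M                               # shifted operator (illustrative choice)

def apply(v):
    # (I - M y y^T) P (I - y y^T M) v, the projected operator of (3.8).
    v = v - y * (y @ (M @ v))                 # right projector
    v = P @ v
    return v - (M @ y) * (y @ v)              # left projector

# The projected operator is symmetric, so rmatvec can reuse matvec.
op = LinearOperator((n, n), matvec=apply, rmatvec=apply, dtype=np.float64)
z, info = qmr(op, -r, maxiter=200)            # inexact inner solve
z = z - y * (y @ (M @ z))                     # enforce z perp_M y explicitly
```

The search subspace would then be expanded by z; solving (3.8) only roughly (a few inner iterations) is usually enough for outer convergence.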

3.2.4. Krylov-Schur. In addition to LOBPCG and Block-Davidson, Trilinos also provides a block implementation of the Krylov-Schur algorithm [57, 58]. In contrast to the other algorithms, Trilinos requires the generalized eigenvalue problem to be transformed into a standard one in order to apply this algorithm. Another distinguishing feature is the absence of preconditioning.

The Krylov-Schur algorithm, like the implicitly restarted Arnoldi (IRA) algorithm [38, 39, 54], computes eigenpair approximations via Rayleigh-Ritz from a Krylov subspace \mathcal{K}_i(Z, v) := \operatorname{span}\{v, Zv, \ldots, Z^{i-1}v\}. Here, the operator Z depends on which eigenvalues are sought. For example, Z = A^{-1}M is used to compute the eigenvalues of smallest magnitude of (A, M). Whereas the IRA algorithm employs a QR algorithm with perfect shifts for restarting, the Krylov-Schur algorithm uses a reordering of the Schur form of the Rayleigh-Ritz matrix to separate wanted from unwanted spectral information in the approximation subspace. It is argued that this strategy facilitates the deflation of converged Ritz vectors [57, 58]. However, the Arnoldi-Hessenberg form of the Rayleigh-Ritz matrix in IRA is not preserved but altered to a rank-one perturbed Schur-Hessenberg structure; for details see [57, 58].

[Figure 3.1: block diagram omitted; only the caption is reproduced.]

Fig. 3.1. Interaction of Trilinos eigensolvers with third-party software packages; BLAS and LAPACK are not shown, as all eigensolvers rely on them directly via the Epetra [27] base class from Trilinos. Anasazi's Krylov-Schur algorithm invokes a Shift-Invert operator which relies, via Amesos, on a sparse direct linear solver, either Umfpack or Mumps. Umfpack uses a symmetric AMD ordering of the sparse matrix; Mumps uses Metis as ordering package and, during the numerical factorization, invokes ScaLAPACK on the dense Schur complement arising at the very end of the sparse factorization. Anasazi's LOBPCG and Davidson, as well as Jacobi-Davidson, depend on a preconditioner which can be obtained from either Ifpack or ML (which uses ParMetis to compute aggregates). Note that the illustration covers only those connections that are exploited in this paper. Trilinos allows further interactions, for example the use of Umfpack and Mumps via Amesos as coarse grid solvers for ML, or within Schwarz-type preconditioners from Ifpack.

3.3. Operator implementation and interaction with other software. This section highlights central aspects of how to implement the major operators (Shift-Invert and preconditioners) in the Trilinos framework. We also shed light on the important role of other software packages.

A first overview of the interaction between Trilinos and other software is given in Figure 3.1. Section 3.3.1 gives details about the implementation of the Shift-Invert operator. Sections 3.3.2 and 3.3.3 describe the construction of the ILU and ML preconditioners.

3.3.1. Shift-Invert Operator with Umfpack and Mumps. In this section, we discuss the implementation of a Shift-Invert operator for the Block-Krylov-Schur algorithm using the sparse direct solvers Umfpack [17, 18, 19, 20] and Mumps [1, 4, 5]. Trilinos has the capability to use these solvers via its Amesos package [51].

For nonsingular A, the generalized eigenvalue problem (3.1) can be transformed into a standard problem for A^{-1}M. In the Krylov-Schur method, the operator A^{-1}M is only needed in the form of a matrix-vector product. A^{-1} is not explicitly computed; rather, it is available via the triangular factors of a sparse direct factorization. Such a factorization can be quite expensive in terms of memory; its advantage is that it can be reused multiple times once it has been computed.
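The factor-once, apply-many pattern can be sketched with SciPy's SuperLU standing in for Umfpack/Mumps; M is taken as the identity here (an illustrative simplification) so that A^{-1}M stays symmetric and a standard Hermitian eigensolver applies.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import splu, LinearOperator, eigsh

n = 500
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsc()  # SPD 1D Laplacian
M = identity(n, format="csc")

lu = splu(A)                                  # symbolic + numerical factorization, done once
# Every operator application is just a forward/backward solve with the factors.
Z = LinearOperator((n, n), matvec=lambda v: lu.solve(M @ v), dtype=np.float64)

# Largest-magnitude eigenvalues of Z = A^{-1} M are the smallest of (A, M).
mu, V = eigsh(Z, k=3, which="LM")
lam = np.sort(1.0 / mu)                       # recover the smallest eigenvalues of (A, M)
print(lam)
```

As in the paper's setup, the expensive step is the one-time factorization; the repeated triangular solves inside the Krylov iteration are comparatively cheap.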

Both Umfpack and Mumps are based on a three-step process. In the first phase, the symbolic factorization, the sparsity pattern is analyzed. Permutations and scaling are employed to reduce fill-in and improve the stability of the subsequent second phase, the numerical factorization. Umfpack automatically detects the symmetry of the matrix and invokes an Approximate Minimum Degree ordering [2, 3]. Mumps uses a Nested-Dissection ordering from Metis [31, 32] (except for the two smallest systems, where a built-in Approximate Minimum Fill, AMF [47], is used).

In the numerical factorization, one uses threshold pivoting to balance the need for stability against the resulting fill-in. In addition, Mumps utilizes the LU factorization from ScaLAPACK [13, 14, 22, 23] on the dense Schur complement arising at the very end of the sparse factorization.

Once the triangular factors of the matrix have been obtained, one can easily solve linear systems for multiple right-hand sides via forward and backward solves. This constitutes the third phase and is repeatedly invoked in the Shift-Invert Krylov-Schur algorithm.

3.3.2. Incomplete Factorization Preconditioning with Ifpack. Ifpack [50] provides a large selection of algebraic preconditioners including relaxation schemes, incomplete factorizations, and domain decomposition methods. Background information on these methods as well as a long list of references are contained in [48].

For this study, several variants of incomplete LU factorization were evaluated for their suitability inside the Davidson, Jacobi-Davidson, and LOBPCG eigensolvers. We did not consider incomplete Cholesky, as we wanted to use the preconditioner for both definite and indefinite eigenproblems.

We investigated incomplete LU with threshold-based element dropping, ILUT [48, Section 10.4]. However, dropping entries of small magnitude was ultimately ruled out, as it seemed to lead to an increase in the number of iterations in the eigensolvers and proved to be more time-consuming on the whole.

We then considered the ILU(p) class, where p denotes the level of fill allowed. ILU(0) ('Zero Fill-In ILU' [48, Section 10.3.2]) only allows fill in those elements of L and U which correspond to nonzero entries of A. More generally, ILU(p) allows 'p-th order fill-in': each entry is assigned a level, and fill-in is allowed only for entries whose level is less than or equal to p; for details see [48, Section 10.3.3]. First-order fill-in (p = 1) proved to be a good trade-off between the complexity of the preconditioner and its effectiveness in the eigensolver, and became our method of choice for the experiments in Section 4.
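The pattern of building an ILU preconditioner once and applying it inside an iterative method can be sketched with SciPy; note that SciPy's `spilu` is threshold-based (closer to the ILUT variant ruled out above than to the level-of-fill ILU(1) chosen in the paper), so this is only an illustration of the workflow.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spilu, LinearOperator, cg

n = 400
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsc()  # SPD test matrix
b = np.ones(n)

# Incomplete LU factorization, computed once; drop_tol/fill_factor control
# how much of the exact factorization is retained.
ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
P = LinearOperator((n, n), matvec=ilu.solve, dtype=np.float64)

# Apply the factorization as a preconditioner inside an iterative solver.
x, info = cg(A, b, M=P, maxiter=200)
print(info)                                   # 0 on convergence
```

Inside the eigensolvers of Sections 3.2.1-3.2.3, the same preconditioner would be applied to residual vectors rather than inside a linear solve for a fixed right-hand side.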

3.3.3. Multilevel Preconditioning with ML. The ML package [25] from Trilinos provides multilevel preconditioners for definite linear systems. It is a powerful tool that offers a lot of flexibility in selecting the individual components - smoothers, multigrid cycle, coarse grid solvers, etc. - of the algorithm. There is a vast amount of research on the best settings of multilevel preconditioners for a given problem, see for example [11, 42, 43].

In our experiments in this paper, we use ML with most of the standard settings (automatic nullspace detection, Gauss-Seidel as smoother, KLU as dense linear solver on the coarse grid) for smoothed aggregation [25, Table 6]. However, ParMetis [33, 34] was enabled for aggregation to make use of its powerful graph partitioning algorithms.

3.4. Real formulation. Trilinos currently does not support matrices with complex entries. This will change once Tpetra, the templated version of Trilinos's fundamental Epetra package, becomes publicly available. Since our application matrices coming from the k·p method are complex, we exploit the fact that a complex Hermitian matrix may be replaced by a real symmetric one of twice its size, see [46, Section 1.4.2] and also [21].

If A = A_r + iA_i is a Hermitian matrix, then its real part A_r is symmetric and its imaginary part A_i is skew-symmetric, A_r = A_r^T, A_i = -A_i^T. For an eigenpair (\lambda_r, x_r + ix_i), one has

\begin{pmatrix} A_r & A_i^T \\ A_i & A_r \end{pmatrix} \begin{pmatrix} x_r & -x_i \\ x_i & x_r \end{pmatrix} = \lambda_r \begin{pmatrix} x_r & -x_i \\ x_i & x_r \end{pmatrix}.
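The embedding can be checked numerically; the sketch below builds the doubled real symmetric matrix for a random complex Hermitian A (the mass matrix M is omitted for brevity) and compares spectra.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (B + B.conj().T) / 2                     # complex Hermitian test matrix
Ar, Ai = A.real, A.imag                      # Ar symmetric, Ai skew-symmetric

# Real symmetric embedding of twice the size, as in Section 3.4.
Areal = np.block([[Ar, Ai.T],
                  [Ai, Ar]])

lam = np.linalg.eigvalsh(A)                  # n real eigenvalues of A
lam2 = np.linalg.eigvalsh(Areal)             # the same values, each doubled
print(lam)
print(lam2)
```

Each eigenvalue of A appears twice in the embedding (the eigenvectors (x_r, x_i) and (-x_i, x_r) share the same eigenvalue), which is the doubling the eigensolver has to cope with in practice.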


The Shift-Invert Krylov-Schur algorithm proves to be the overall most reliable algorithm in our study. It has been successfully applied to the definite 4- and 6-band as well as the indefinite 8-band systems. Moreover, using Mumps, it is the fastest algorithm when one only looks at the time spent in the eigensolver and excludes the time to construct the Shift-Invert operator.

Interestingly, there is a noticeable performance difference between Mumps and Umfpack. For example, Umfpack needs 291.57 seconds for the 8-band 2D big Wire, versus 101.44 seconds for Mumps. As the number of eigensolver steps (i.e., operator applications, 'OpVecs') is almost always the same for both Umfpack and Mumps, this performance difference can mostly be attributed to applying A^{-1} via forward and backward solves with the triangular factors. Umfpack seems to incur a major penalty due to data structure reorganization in the Trilinos interface each time the Shift-Invert operator is applied.

In terms of computational cost, the sparse direct factorization needed to construct the Shift-Invert operator for the Krylov-Schur method is far more costly than either the ILU or the ML preconditioner needed for the other methods. For example, compare the Umfpack factorization time of 182.46 seconds for the 6-band 2D big Wire with the corresponding time of 5.16 seconds for ML and of 22.03 seconds for the Ifpack ILU in the Jacobi-Davidson algorithm.

Of the three other methods considered, Jacobi-Davidson is the only solver in which Ifpack ILU has been successfully applied to the indefinite 8-band problems. It seems clear that, by design, LOBPCG is only suitable for computing extremal eigenvalues, but this finding is certainly somewhat disappointing for the Davidson algorithm.

Coming back to the Jacobi-Davidson algorithm, we see that in terms of total cost it can be preferable to the Shift-Invert Krylov-Schur algorithm. This is particularly true when memory becomes a concern, as storage of the factors of the sparse factorization grows rapidly. Nevertheless, when memory is not a concern and more eigenpairs have to be computed, the cost of the sparse direct factorization, which is computed only once at the beginning of the algorithm, can be mitigated. For this reason, we still consider the Shift-Invert Krylov-Schur superior. Another consideration supporting this assessment is the algorithms' scaling to multiple processors, to which we will come shortly.

Comparing the Jacobi-Davidson, Davidson, and LOBPCG algorithms, it is clear that the Davidson algorithm is the least efficient, regardless of the choice of the preconditioner. It always requires the largest number of matrix-vector (MV) products and exhibits the slowest convergence.

The choice of the preconditioner does influence the convergence behavior, but there seems to be no clear winner in terms of reducing the number of eigensolver iterations. However, applying the ML preconditioner is clearly more expensive than applying the Ifpack ILU preconditioner. For example, Jacobi-Davidson takes 148.79 seconds with Ifpack ILU for the 6-band 2D big Wire versus 752.29 seconds with the ML preconditioner.

Leaving aside the reliability issue for the indefinite 8-band systems, LOBPCG with Ifpack ILU is faster than Davidson and Jacobi-Davidson. The performance difference is most pronounced and clearly visible for the 6-band 2D big Wire. For this reason, we consider LOBPCG the first choice for definite problems when a sparse direct factorization is not available or possible.

To look at scaling behavior, Tables 4.5 and 4.6 give the results on two and four processors. Only the results for the two larger physical systems are shown. Shift-Invert Krylov-Schur with the sequential Umfpack is omitted.


In order to measure parallel performance, define the speedup S_p = t_1/t_p for a given number of processors p, with sequential execution time t_1 and parallel execution time t_p on p processors.

Using the speedup metric, Shift-Invert Krylov-Schur achieves a speedup between 1.55 and 1.64 on two processors in five out of the six test cases. On four processors, the speedup is around two for the bigger problems, but 3.34 on the 8-band 2D big Wire problem. This shows reasonable scalability of the application of the Shift-Invert operator, and thus of the whole Krylov-Schur algorithm. The same is true for the computation of the sparse direct factorization on which the operator is based.

The computation of the Ifpack ILU preconditioner scales, but its quality diminishes with an increasing number of processors. The Ifpack manual [50, Section 4] remarks that 'it tends to become less robust and require more iterations as the number of processors used increases.' This can be seen happening here in practice. The number of inner iterations in Jacobi-Davidson nearly doubles on two processors, so that there is no noticeable time change. However, there is a remarkable gain on four processors for the 2D big Wire: on this system there is a speedup of around 1.3, and of even 2.52 for the 8-band indefinite problem. The reasons for the quite erratic performance of the ILU are not understood at this point.

The increase in the number of iterations in the Davidson algorithm with Ifpack ILU is so dramatic (up to 12,500 MV products for the 6-band 2D big Wire) that there is actually a performance loss when more than one processor is used. This, together with its sequential performance, makes the Davidson algorithm the worst among all considered.

LOBPCG with Ifpack behaves similarly to Jacobi-Davidson, with no visible gains on two processors but speedups of 1.26 and 1.63 for the 4- and 6-band 2D big Wire systems on four processors. Its performance, together with the smallest amount of memory needed among all competitors, makes LOBPCG the algorithm of choice for definite systems.

As on one processor, ML preconditioning is consistently slower than preconditioning with Ifpack's ILU for Jacobi-Davidson, Davidson, and LOBPCG. ML does not lead to an increase in iterations when more than one processor is used. Thus, the quality of ML appears to be scalable. Nevertheless, the cost of applying the preconditioner remains large, and ML seems non-competitive with Ifpack's ILU.

Summarizing, we arrive at the following conclusions. When a sparse direct factorization is possible, Krylov-Schur with the Shift-Invert operator seems the best option because of its reliability, high performance, and scalability. Otherwise, Jacobi-Davidson with the Ifpack preconditioner should be used for the indefinite 8-band problems. For 4- and 6-band problems, Jacobi-Davidson is a good choice, but LOBPCG with Ifpack seems an even better option. The Davidson algorithm behaves poorly on our examples; it is possible that its performance could be improved by a different restarting strategy, like the one considered in [55], to obtain results similar to those shown in [60].

5. Summary, conclusions, and future work. In this paper, we studied the computation of eigenvalues close to the band-gap of k·p models of optoelectronic nano systems. These are, depending on the k·p model, located either in the interior or at one end of the spectrum.

We illustrated how our generalized eigenvalue problems can be solved within Trilinos by combining state-of-the-art eigensolvers with efficient preconditioners, sparse solvers, and partitioning methods.


Our evaluation revealed the Krylov-Schur algorithm with the Shift-Invert operator to be the overall most reliable. It is also the most efficient algorithm when memory is not an issue and the cost of the sparse direct factorization can be mitigated by computing multiple eigenvalues. Otherwise, the Ifpack preconditioner turned out to be more economical than ML despite scaling issues. Jacobi-Davidson emerged as the winner for the indefinite 8-band systems, while LOBPCG seems preferable for the definite 4- and 6-band systems.

Since the Shift-Invert spectral transformation works so well, one may wonder whether other spectral transformations could be successfully employed, for example to make LOBPCG cope with the indefinite 8-band problems. In this regard, we considered the 'folded' pencil (AM^{-1}A, M) and the 'harmonic' pencil (AM^{-1}A, A), where in both cases the operator AM^{-1}A is definite.

Because of its nature, AM^{-1}A seems predestined for a matrix-free approach, and we explored the use of the matrix-free multilevel preconditioner offered by the ML package. What complicates its application is the presence of M^{-1}: it causes complete fill-in in the incidence graph of the operator on which matrix-free aggregation is based. Substituting M^{-1} by a simpler matrix, e.g. a diagonal one obtained by lumping M, turned out to be ineffective and led to prohibitively slow convergence. At this point, more research is needed to turn this option into a viable alternative.

    Acknowledgment.  We thank Cyril Flaig, Marzio Sala, and Heidi Thornquist for giving us advice and letting us benefit from their experience with Trilinos. We are also grateful to the developers of all packages used in this paper for making their software available for public use.

    REFERENCES

    [1] P. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet. Hybrid scheduling for the parallel solution of linear systems. Parallel Computing, 32(2):136–156, 2006.
    [2] P. A. Amestoy, T. A. Davis, and I. S. Duff. Algorithm 837: AMD, an approximate minimum degree ordering algorithm. ACM Trans. Math. Software, 30(3):381–388, 2004.
    [3] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl., 17:886–905, 1996.
    [4] P. R. Amestoy, I. S. Duff, and J. Y. L'Excellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Computer Methods in Appl. Mech. Eng., 184:501–520, 2000.
    [5] P. R. Amestoy, I. S. Duff, J. Y. L'Excellent, and J. Koster. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matrix Anal. Appl., 23(1):15–41, 2001.
    [6] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 3rd edition, 1999.
    [7] P. Arbenz, U. L. Hetmaniuk, R. B. Lehoucq, and R. S. Tuminaro. A comparison of eigensolvers for large-scale 3D modal analysis using AMG-preconditioned iterative methods. Int. J. Numer. Methods Eng., 64(2):204–236, 2005.
    [8] N. W. Ashcroft and N. D. Mermin. Solid State Physics. Saunders College, Philadelphia, 1st edition, 1976.
    [9] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia, 2000.
    [10] C. G. Baker, U. L. Hetmaniuk, R. B. Lehoucq, and H. K. Thornquist. Anasazi software for the numerical solution of large-scale eigenvalue problems. Technical Report SAND2007-0350J, Sandia National Laboratories, Albuquerque, NM, USA, 2007.
    [11] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial. SIAM, Philadelphia, 2nd edition, 2000.
    [12] M. G. Burt. Fundamentals of envelope function theory for electronic states and photonic modes in nanostructures. J. Phys.: Condens. Matter, 11:R53–R83, 1999.

    [13] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Computer Physics Communications, 97:1–15, 1996. (Also LAPACK Working Note #95.)

    [14] J. Choi, J. J. Dongarra, S. Ostrouchov, A. Petitet, and D. Walker. A proposal for a set of parallel basic linear algebra subprograms. Technical Report UT-CS-95-292, University of Tennessee, Knoxville, TN, USA, 1995. (LAPACK Working Note #100.)
    [15] M. Crouzeix, B. Philippe, and M. Sadkane. The Davidson method. SIAM J. Sci. Comput., 15(1):62–76, 1994.
    [16] E. R. Davidson. The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comp. Phys., 17:87–94, 1975.
    [17] T. A. Davis. A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Trans. Math. Software, 30(2):165–195, 2004.
    [18] T. A. Davis. Algorithm 832: UMFPACK, an unsymmetric-pattern multifrontal method. ACM Trans. Math. Software, 30(2):196–199, 2004.
    [19] T. A. Davis and I. S. Duff. An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM J. Matrix Anal. Appl., 18(1):140–158, 1997.
    [20] T. A. Davis and I. S. Duff. A combined unifrontal/multifrontal method for unsymmetric sparse matrices. ACM Trans. Math. Software, 25(1):1–19, 1999.
    [21] D. Day and M. A. Heroux. Solving complex-valued linear systems via equivalent real formulations. SIAM J. Sci. Comput., 23(2):480–498, 2001.
    [22] J. Dongarra and R. C. Whaley. A user's guide to the BLACS v1.1. Technical Report UT-CS-95-281, University of Tennessee, Knoxville, TN, USA, 1995. (LAPACK Working Note #94.)
    [23] J. J. Dongarra and R. A. van de Geijn. Two dimensional basic linear algebra communication subprograms. Technical Report UT-CS-91-138, University of Tennessee, Knoxville, TN, USA, 1991. (LAPACK Working Note #37.)
    [24] L. A. Drummond and O. A. Marques. An overview of the Advanced CompuTational Software (ACTS) Collection. ACM Trans. Math. Software, 31(3):282–301, 2005.
    [25] M. W. Gee, C. M. Siefert, J. J. Hu, R. S. Tuminaro, and M. G. Sala. ML 5.0 smoothed aggregation user's guide. Technical Report SAND2006-2649, Sandia National Laboratories, 2006.
    [26] R. Geus. The Jacobi-Davidson algorithm for solving large sparse symmetric eigenvalue problems with application to the design of accelerator cavities. Dissertation ETH No. 14734, ETH Zurich, Switzerland, 2002.
    [27] M. Heroux. Epetra performance optimization guide. Technical Report SAND2005-1668, Sandia National Laboratories, 2005.
    [28] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley. An overview of the Trilinos project. ACM Trans. Math. Software, 31(3):397–423, 2005.
    [29] M. E. Hochstenbach and Y. Notay. The Jacobi-Davidson method. GAMM Mitteilungen, 29:368–382, 2006.
    [30] E. O. Kane. Energy band theory. In W. Paul, editor, Handbook on Semiconductors, Vol. 1, pages 193–217. North-Holland, Amsterdam, 1982.
    [31] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, 1998.
    [32] G. Karypis and V. Kumar. MeTis: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0. University of Minnesota, 1998.
    [33] G. Karypis and V. Kumar. Parallel multilevel series k-way partitioning scheme for irregular graphs. SIAM Review, 41(2):278–300, 1999.
    [34] G. Karypis, K. Schloegel, and V. Kumar. ParMeTis: Parallel Graph Partitioning and Sparse Matrix Ordering Library, Version 3.1. University of Minnesota, 2003.
    [35] A. V. Knyazev. Preconditioned eigensolvers - an oxymoron? Electronic Transactions on Numerical Analysis, 7:104–123, 1998.
    [36] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput., 23(2):517–541, 2001.
    [37] LAPACK 3.1. http://www.netlib.org/lapack/lapack-3.1.0.changes, 2006.
    [38] R. B. Lehoucq and D. C. Sorensen. Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM J. Matrix Anal. Appl., 17(4):789–821, 1996.
    [39] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998.
