
A PARALLEL QZ ALGORITHM FOR DISTRIBUTED MEMORY HPC SYSTEMS

BJÖRN ADLERBORN†, BO KÅGSTRÖM†, AND DANIEL KRESSNER‡

Abstract. Appearing frequently in applications, generalized eigenvalue problems represent one of the core problems in numerical linear algebra. The QZ algorithm by Moler and Stewart is the most widely used algorithm for addressing such problems. Despite its importance, little attention has been paid to the parallelization of the QZ algorithm. The purpose of this work is to fill this gap. We propose a parallelization of the QZ algorithm that incorporates all modern ingredients of dense eigensolvers, such as multishift and aggressive early deflation techniques. To deal with (possibly many) infinite eigenvalues, a new parallel deflation strategy is developed. Numerical experiments for several random and application examples demonstrate the effectiveness of our algorithm on two different distributed memory HPC systems.

Key words. generalized eigenvalue problem, nonsymmetric QZ algorithm, multishift, bulge chasing, infinite eigenvalues, parallel algorithms, level 3 performance, aggressive early deflation.

AMS subject classifications. 65F15, 15A18

1. Introduction. This paper is concerned with the numerical solution of generalized eigenvalue problems, which consist of computing the eigenvalues and associated quantities of a matrix pair (A, B) for general complex or real n × n matrices A, B.

The QZ algorithm proposed in 1973 by Moler and Stewart [32] is the most widely used algorithm for addressing generalized eigenvalue problems with dense matrices. Since then, it has undergone several modifications [14, 35]. In particular, significant speedups on serial machines have been obtained in [23] by extending multishift and aggressive early deflation techniques [9, 10]. In this work, we propose a parallelization of the QZ algorithm that incorporates these techniques as well. Our developments build on preliminary work presented in [1, 2] and extend recent work [19, 20] on parallelizing the QR algorithm for standard eigenvalue problems. In our parallelization, we also cover aspects that are unique to the QZ algorithm, such as the occurrence of possibly many infinite eigenvalues.

Apart from the QZ algorithm, other approaches for solving generalized eigenvalue problems have been considered for parallelization. This includes usage of a synchronous linear processor array [8], nonsymmetric Jacobi algorithms [11, 13] as well as spectral divide-and-conquer algorithms [5, 21, 31, 33].

1.1. Generalized Schur decomposition. Throughout this work, we assume that the pair (A, B) is regular, that is, det(A − λB) does not vanish identically in λ. Otherwise, (A, B) needs to be preprocessed to deflate the corresponding singular part in its Kronecker canonical form, either by exploiting underlying structure or by applying the GUPTRI algorithm [15].

In the following, we restrict our discussion to matrices with real entries: A, B ∈ R^{n×n}; the complex case is treated in an analogous way. The goal of the QZ algorithm consists of computing a generalized Schur decomposition

Q^T A Z = S,    Q^T B Z = T,    (1)

†Department of Computing Science and HPC2N, Umeå University, SE-90187 Umeå, Sweden ([email protected], [email protected])

‡SB–MATHICSE–ANCHP, EPF Lausanne, Station 8, CH-1015 Lausanne, Switzerland ([email protected])


where Q, Z ∈ R^{n×n} are orthogonal and the pair (S, T) is in (real) generalized Schur form. This means that T is upper triangular and S is quasi-upper triangular with diagonal blocks of size 1 × 1 or 2 × 2. A 1 × 1 block s_{jj} corresponds to the real eigenvalue λ = s_{jj}/t_{jj} of (A, B). In fact, the LAPACK [3] implementation DHGEQZ of the QZ algorithm does not even form this ratio but directly returns the pair (α, β) = (s_{jj}, t_{jj}). This convention has the advantage that it covers infinite eigenvalues in a seamless manner, by letting β = 0, and it is used in our parallel implementation as well. A 2 × 2 diagonal block in S corresponds to a complex conjugate pair of eigenvalues, which can be computed from the eigenvalues of

\left( \begin{bmatrix} s_{jj} & s_{j,j+1} \\ s_{j+1,j} & s_{j+1,j+1} \end{bmatrix}, \begin{bmatrix} t_{jj} & t_{j,j+1} \\ 0 & t_{j+1,j+1} \end{bmatrix} \right).

This is performed by the LAPACK routine DLAGV2, which also normalizes T such that t_{j,j+1} = 0 and t_{jj} ≥ t_{j+1,j+1} > 0, using a procedure described in [34].
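To make the (α, β) convention concrete, the following minimal sketch shows how LAPACK-style output arrays ALPHAR, ALPHAI, and BETA (as returned, e.g., by DHGEQZ) can be interpreted; the routine name and the plain printing are our own illustrative choices.

      SUBROUTINE PRINT_EIGENVALUES( N, ALPHAR, ALPHAI, BETA )
*     Illustrative sketch only: interpret (alpha, beta) pairs as
*     eigenvalues.  A pair with BETA(J) = 0 encodes an infinite
*     eigenvalue; otherwise lambda = (ALPHAR(J)+i*ALPHAI(J))/BETA(J).
      INTEGER            N, J
      DOUBLE PRECISION   ALPHAR( * ), ALPHAI( * ), BETA( * )
      DO J = 1, N
         IF( BETA( J ).EQ.0.0D0 ) THEN
            WRITE(*,*) 'Eigenvalue', J, ': infinite'
         ELSE
            WRITE(*,*) 'Eigenvalue', J, ':',
     $                 ALPHAR( J ) / BETA( J ), ALPHAI( J ) / BETA( J )
         END IF
      END DO
      END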

1.2. Structure of the QZ algorithm. The QZ algorithm proceeds by firstcomputing a decomposition of the form

Q^T A Z = H,    Q^T B Z = T,    (2)

where Q, Z ∈ R^{n×n} are orthogonal and T is again upper triangular, but H is only in upper Hessenberg form, that is, h_{ij} = 0 for i ≥ j + 2. Two different types of algorithms have been proposed to reduce (A, B) to such a Hessenberg-triangular form. After an initial reduction of B to triangular form, the original algorithm by Moler and Stewart [32] uses Givens rotations to zero out each entry below the subdiagonal of A. Its memory access pattern lets this algorithm perform rather poorly on standard computing architectures. To address this, a blocked two-stage approach has been proposed by Dackland and Kågström [14]. The first stage reduces (A, B) to block Hessenberg-triangular form only, which can be achieved by means of blocked Householder reflectors. The second stage reduces (A, B) further to Hessenberg-triangular form by applying sweeps of Givens rotations. While the first stage can be parallelized quite well, the complex data dependencies make the parallelization of the second stage a more difficult task. Recently, significant progress has been made in this direction on shared-memory architectures, in the context of reducing a single matrix to Hessenberg form [26]. In the numerical experiments of this paper, we make use of a preliminary parallel implementation of the two-stage algorithm described in [1]. However, it has been demonstrated in [24] that a serial blocked implementation of the rotation-based algorithm by Moler and Stewart outperforms the two-stage approach. We therefore plan to incorporate a parallel implementation of this algorithm as well.

Prior to any reduction, an optional preprocessing step called balancing can be used. The balancing described in [36] and implemented in the LAPACK routine DGGBAL consists of permuting (A, B) to detect isolated eigenvalues and applying a diagonal scaling transformation to remedy bad scaling. The scaling part is by default turned off in most implementations, such as the LAPACK driver routine DGGEV and the Matlab command eig(A,B). We will therefore not consider it any further.

The computation of eigenvectors or, more generally, deflating subspaces of (A, B) requires postprocessing the matrix pair (S, T) in the generalized Schur decomposition (1). More specifically, if the (right) deflating subspace associated with a set of eigenvalues is desired, then these eigenvalues need to be reordered to the top left corner of (S, T). Such a reordering algorithm has been proposed in [25] and its parallelization is discussed in [18].


In this paper, we focus on parallelizing the iterative part of the QZ algorithm which reduces a pair (H, T) in Hessenberg-triangular form to generalized Schur form (S, T). This iterative part consists of three ingredients: bulge chasing, (aggressive early) deflation, and deflation of infinite eigenvalues. In the following we will refer to three different algorithms, with the abbreviations below, all of which use the above three ingredients (see also Appendix A):

• PDHGEQZ: This contribution.
• PDHGEQZ1: Modified version of the parallel QZ algorithm described in Adlerborn et al. [2].
• KKQZ: Serial QZ algorithm proposed by Kågström and Kressner [23].

2. Parallel algorithms. In what follows, we assume that the reader is familiar with the basics of the implicit shifted QZ algorithm, see [30, 38] for introductions.

2.1. Data layout. We follow the convention of ScaLAPACK [7] for the distributed data storage of matrices. Suppose that P = P_r · P_c parallel processors are arranged in a P_r × P_c rectangular grid. The entries of a matrix are then distributed over the grid using a 2-dimensional block-cyclic mapping with block size n_b in both row and column dimensions. In principle, ScaLAPACK allows for different block sizes for the row and column dimensions but for simplicity we assume that identical block sizes are used.
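As a small illustration of this mapping (a sketch only; the ScaLAPACK TOOLS function INDXG2P provides equivalent functionality), the coordinates of the process owning a global entry (i, j) can be computed as follows, assuming 1-based global indices and that the first block is held by process (0, 0):

      SUBROUTINE OWNER( I, J, NB, PR, PC, MYROW, MYCOL )
*     Sketch of the 2-dimensional block-cyclic mapping: global entry
*     (I,J), 1-based, block size NB in both dimensions, process grid
*     of size PR x PC, first block owned by process (0,0).
      INTEGER            I, J, NB, PR, PC, MYROW, MYCOL
      MYROW = MOD( ( I - 1 ) / NB, PR )
      MYCOL = MOD( ( J - 1 ) / NB, PC )
      END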

2.2. Multishift QZ iterations.

2.2.1. Chasing one bulge. Consider a Hessenberg-triangular pair (H, T) and two shifts σ1, σ2, such that either σ1, σ2 ∈ R or σ1 = \bar{σ}_2. For the moment, we will assume that T is invertible. Then the first step of the classic implicit double shift QZ iteration consists of computing the first column of the shift polynomial:

v = (H T^{-1} - σ1 I)(H T^{-1} - σ2 I) e_1 = \begin{bmatrix} X & X & X & 0 & \cdots & 0 \end{bmatrix}^T,

where e_1 denotes the first unit vector and the symbol X denotes arbitrary, typically nonzero, entries. Now an orthogonal transformation Q_0 is constructed such that Q_0^T v is a multiple of e_1. This can be easily achieved by a 3 × 3 Householder reflector. Applying Q_0 from the left to H and T affects the first three rows and creates fill-in below the (sub-)diagonal:

H ← Q_0^T H =
\begin{bmatrix}
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T ← Q_0^T T =
\begin{bmatrix}
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & 0 & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.

Here, X is used to denote entries that are affected by the current transformation. To annihilate the two new entries in the first column of T, we use a trick introduced by Watkins and Elsner [39] and shown to be numerically backward stable in [23]. Let Z_0 be a Householder reflector that maps T^{-1} e_1 to a multiple of the first unit vector.


Then applying Z_0 from the right affects the first three columns and results in the nonzero pattern

H ← H Z_0 =
\begin{bmatrix}
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T ← T Z_0 =
\begin{bmatrix}
X & X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & 0 & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.

The region (H(2:4, 1:3), T(2:4, 1:3)) is called the bulge pair. This encodes the information contained in the shifts σ1, σ2, in a way made concrete in [37]. By an analogous procedure, the trailing two entries in the first column of H are annihilated by a Householder transformation from the left and, subsequently, the subdiagonal entries in the second column of T are annihilated by a Householder transformation from the right. In effect, the bulge pair is moved one step towards the bottom right corner:

H ← Q_1^T H Z_1 =
\begin{bmatrix}
X & X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T ← Q_1^T T Z_1 =
\begin{bmatrix}
X & X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.    (3)

2.2.2. Chasing several bulges in parallel. In principle, the described process of bulge chasing can be continued until the bulge pair disappears at the bottom right corner, which would complete one double shift QZ iteration. However, the key to efficient serial and parallel implementations of the QZ algorithm is to chase several bulges at the same time. For example, after the bulge pair in (3) has been chased two steps further, we can immediately introduce another bulge pair belonging to another two shifts, without disturbing the existing bulge pair. We then have a chain of two tightly packed bulges¹, which can be chased simultaneously. In practice, we will work with chains containing significantly more than two bulges to attain good node performance.

Another key to creating potential for good parallel performance is to delay updates as much as possible. We chase bulges within a window (i.e., a principal submatrix), similar to the technique described in [19]. Only after the bulges have been chased to the bottom of the window do we update the remaining parts of the matrix pair (H, T) outside the window, to the right and above, using level-3 BLAS. After the off-diagonal update, the window is placed at a new position so that the bulges can be moved further down the diagonal of (H, T). As in [19] we use several windows, up to min(P_r, P_c), each containing a chain of bulges, and deal with them in parallel.

Figure 1 illustrates the described technique for three active windows, each containing a bulge chain consisting of three bulges. Each window is owned by a single process, leading to intra-block chases. Windows that overlap process borders lead to

¹In fact, as recently shown in [28] for the QR algorithm, it is possible and beneficial to pack the bulges even closer.


Fig. 1. Three bulge chains, where each chain consists of three bulges (black boxes). The black 3 × 3 blocks in the H-part are dense, while the associated blocks in the T-part have zero elements in the last row and first column, respectively. Only parts of the matrices are displayed. The solid red lines represent block/process borders.

inter-block chases. In the algorithm, we alternate between intra-block and inter-block chases, see [19] for more details.

Generally, we use the undeflatable eigenvalues returned by the aggressive early deflation described in § 2.3 as shifts for the QZ iteration. Exceptionally, it may happen that there are not sufficiently many undeflatable eigenvalues available. In such cases, we apply the QZ algorithm (PDHGEQZ or PDHGEQZ1) to a sub-problem for obtaining additional shifts.

Each bulge consumes two shifts and occupies a 3 × 3 principal submatrix. Assuming that each window is of size n_b × n_b, we will pack at most n_b/6 bulges in a window to be able to move all bulges across the process border during a single inter-block chase. The number of active windows is then given by

min( ⌈3 n_shifts / n_b⌉, P_r, P_c ),

where n_shifts is the total number of shifts to be used.

2.3. Aggressive early deflation (AED). Classically, the convergence of the QZ algorithm is monitored by inspecting the subdiagonal entries of H. A subdiagonal entry h_{k+1,k} is declared negligible if it satisfies

|h_{k+1,k}| ≤ u (|h_{k,k}| + |h_{k+1,k+1}|),    (4)

where u denotes the unit roundoff (≈ 10^{-16} in double precision arithmetic). Negligible subdiagonal entries can be safely set to zero, deflating the generalized eigenvalue problem into smaller subproblems. Aggressive early deflation (AED) is a technique proposed by Braman, Byers, and Mathias [10] for the QR algorithm that goes beyond (4) and significantly speeds up convergence.
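A minimal serial sketch of the classical criterion (4) is given below; the routine name and the in-place zeroing are our own illustrative choices, and the parallel code applies the same test to locally owned blocks.

      SUBROUTINE DEFLATE_SUBDIAG( N, H, LDH, U )
*     Scan the subdiagonal of the Hessenberg matrix H and set entries
*     satisfying the classical deflation criterion (4) to zero.
      INTEGER            N, LDH, K
      DOUBLE PRECISION   H( LDH, * ), U
      DO K = 1, N - 1
         IF( ABS( H( K+1, K ) ).LE.
     $       U*( ABS( H( K, K ) ) + ABS( H( K+1, K+1 ) ) ) ) THEN
            H( K+1, K ) = 0.0D0
         END IF
      END DO
      END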

2.3.1. Basic algorithm. In the following, we give a brief overview of AED for the QZ algorithm as proposed in [23]. First, the Hessenberg-triangular pair is


partitioned as

(H, T) = \left( \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ 0 & H_{32} & H_{33} \end{bmatrix}, \begin{bmatrix} T_{11} & T_{12} & T_{13} \\ 0 & T_{22} & T_{23} \\ 0 & 0 & T_{33} \end{bmatrix} \right),

such that H_{32} ∈ R^{n_AED × 1} and H_{33}, T_{33} ∈ R^{n_AED × n_AED}, where n_AED ≪ n denotes the size of the so-called AED window (H_{33}, T_{33}).

By applying the QZ algorithm, the AED window is reduced to (real) generalized Schur form:

(H_{33}, T_{33}) ← (Q_{33}^T H_{33} Z_{33}, Q_{33}^T T_{33} Z_{33}).

Performing the corresponding orthogonal transformation of (H, T) yields

(H, T) ← \left( \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ 0 & s & H_{33} \end{bmatrix}, \begin{bmatrix} T_{11} & T_{12} & T_{13} \\ 0 & T_{22} & T_{23} \\ 0 & 0 & T_{33} \end{bmatrix} \right),    (5)

where s = Q_{33}^T H_{32} is the so-called spike that contains the newly introduced nonzero entries below the subdiagonal of H. If the last entry of the spike satisfies

|s_{n_AED}| ≤ u ‖H‖_F,    (6)

it can be safely set to zero. This deflates a real eigenvalue of the matrix pair (5), provided that the diagonal block at the bottom right corner of H_{33} is 1 × 1. If this diagonal block is 2 × 2, the corresponding complex conjugate pair of eigenvalues can only be deflated if the last two entries of the spike both satisfy (6).

If deflation was successful, the described process is repeated for the remaining nonzero entries of the spike. Otherwise, the generalized Schur form (H_{33}, T_{33}) is reordered to move the undeflatable eigenvalue to the top left corner. In turn, an untested eigenvalue is moved to the bottom right corner. After all eigenvalues of (H_{33}, T_{33}) have been checked for convergence by this procedure, the last step of AED consists of turning the remaining undeflated part of the pair (H, T) back to Hessenberg-triangular form.
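The spike test (6) can be summarized by the following sketch (our own illustration, with hypothetical argument names); it only counts immediately deflatable eigenvalues, whereas the actual algorithm reorders undeflatable eigenvalues to the top left corner and continues.

      SUBROUTINE SPIKE_TEST( NAED, S33, LDS, SPIKE, TOL, NDEFL )
*     Illustrative sketch of the spike test (6): S33 holds the
*     quasi-triangular part of the AED window (already in generalized
*     Schur form), SPIKE the spike vector, and TOL = u*norm(H,'F').
*     Starting from the bottom, count how many eigenvalues can be
*     deflated right away.
      INTEGER            NAED, LDS, NDEFL, K
      DOUBLE PRECISION   S33( LDS, * ), SPIKE( * ), TOL
      NDEFL = 0
      K = NAED
      DO WHILE( K.GT.0 )
         IF( K.GT.1 .AND. S33( K, K-1 ).NE.0.0D0 ) THEN
*           2-by-2 block: both spike entries must be negligible.
            IF( ABS( SPIKE( K ) ).LE.TOL .AND.
     $          ABS( SPIKE( K-1 ) ).LE.TOL ) THEN
               NDEFL = NDEFL + 2
               K = K - 2
            ELSE
               RETURN
            END IF
         ELSE
*           1-by-1 block: a single spike entry is tested.
            IF( ABS( SPIKE( K ) ).LE.TOL ) THEN
               NDEFL = NDEFL + 1
               K = K - 1
            ELSE
               RETURN
            END IF
         END IF
      END DO
      END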

2.3.2. Parallel implementation. As already demonstrated in [20], a careful implementation of AED is vital to achieving good overall parallel performance. From the discussion above, we identify three computational tasks:

1. Reducing an n_AED × n_AED Hessenberg-triangular matrix pair to generalized Schur form.
2. Reordering the eigenvalues of an n_AED × n_AED matrix pair in generalized Schur form.
3. Reducing a general matrix pair of size at most n_AED × n_AED to Hessenberg-triangular form.

Only for very small values of n_AED, say n_AED ≤ 201, does it make sense to perform these tasks serially and call the corresponding LAPACK routines. For larger values of n_AED, parallel algorithms are used for all three tasks.

To perform Task 1, we call our, now modified, parallel implementation PDHGEQZ1 [2] of the QZ algorithm with (serially performed) AED for moderately sized problems, say n_AED ≤ 6000, and recursively call the parallel implementation PDHGEQZ described in this paper for larger problems.


To perform Task 2, we make use of ideas described in [18] for reordering eigenvalues in parallel. To create potential for parallelism and good node performance, we work with groups of undeflatable eigenvalues instead of moving them individually. Such a group is reordered to the top left corner by an algorithm quite similar to the parallel bulge chasing discussed in § 2.2.2. We refer to [19, Sec. 2.3] for more details.

To perform Task 3, we call the parallel implementation of Hessenberg-triangular reduction described in [1].

After all three tasks have been performed, it remains to apply the corresponding orthogonal transformations to the off-diagonal parts of (H, T), above and to the right of the AED window. These updates are performed by calls to the PBLAS routine PDGEMM.

2.3.3. Avoiding communication via data redistribution. Since n_AED ≪ n, the computational intensity of AED is comparably small and the parallel distribution of the matrices on a grid of P_r × P_c processors may lead to significant communication overhead. This has been observed for the parallel QR algorithm [20] and holds for the parallel QZ algorithm as well.

A simple cure to this phenomenon is to redistribute data before performing AED, to limit the amount of participating processors and to keep the communication overhead under control. More specifically, the n_AED × n_AED AED window is first redistributed across a smaller P_AED × P_AED grid, then Tasks 1 to 3 discussed above are performed, and finally the AED window is distributed back to the P_r × P_c grid. The choice of P_AED depends on n_AED, see § 3.2.

To get an impression of the benefits from this technique: Without redistribution, the total time spent on AED is 2095 seconds when solving a 32 000 × 32 000 random generalized eigenvalue problem on an 8 × 8 grid of Akka (see § 3.1). When using P_AED = 4, which means redistributing the AED window to a 4 × 4 grid, the total time for AED reduces to 1283 seconds.
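The redistribution step can be expressed with the ScaLAPACK redistribution routine PDGEMR2D. The following sketch is our own illustration; the descriptors, the smaller BLACS context behind DESCHAED, and the starting index IAED of the AED window are assumed to have been set up elsewhere.

      SUBROUTINE REDIST_AED_WINDOW( NAED, H, IAED, DESCH,
     $                              HAED, DESCHAED, ICTXT_ALL )
*     Illustration only: copy the NAED x NAED AED window, starting at
*     global position (IAED, IAED) of H on the full Pr x Pc grid, to
*     HAED, which is distributed over the smaller PAED x PAED grid.
*     ICTXT_ALL is a BLACS context spanning all involved processes.
      INTEGER            NAED, IAED, ICTXT_ALL
      INTEGER            DESCH( 9 ), DESCHAED( 9 )
      DOUBLE PRECISION   H( * ), HAED( * )
      CALL PDGEMR2D( NAED, NAED, H, IAED, IAED, DESCH,
     $               HAED, 1, 1, DESCHAED, ICTXT_ALL )
      END

After the AED window has been processed on the smaller grid, the same routine is called with the roles of the two descriptors interchanged to copy the result back.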

2.4. Deflation of infinite eigenvalues. If B is (nearly) singular, it can be expected that one or several diagonal entries of the triangular matrix T in the Hessenberg-triangular form are (nearly) zero. Following the LAPACK implementation of the QZ algorithm, we consider a diagonal entry t_{ii} negligible if

|t_{ii}| ≤ u ‖T‖_F    (7)

holds. Setting such an entry to zero and reordering it to the bottom right or top left corner allows us to deflate an infinite eigenvalue. In the absence of roundoff error, this reordering is automatically effected by QZ iterations [37]. However, to avoid unnecessary iterations and the loss of infinite eigenvalues due to roundoff error, it is important to take care of infinite eigenvalues separately.

2.4.1. Basic algorithm. In the following, we briefly sketch the mechanism for deflating infinite eigenvalues proposed in [32]. Suppose that H is unreduced, i.e., h_{k+1,k} ≠ 0 for all subdiagonal elements, and that the third diagonal entry of T is zero:

H =
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T =
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.


Applying a Givens rotation to columns 2 and 3 makes the second diagonal entry zero as well, but introduces an additional nonzero entry in H:

H ←
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T ←
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.

This nonzero entry can be annihilated by applying a Givens rotation to rows 3 and 4:

H ←
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T ←
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.

By an analogous procedure, the zero diagonal entries of T can be moved one position further upwards. A Givens rotation can then be applied to rows 1 and 2 in order to annihilate the first subdiagonal entry of H, finally yielding

H =
\begin{bmatrix}
X & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
T =
\begin{bmatrix}
0 & X & X & X & X & X & \cdots \\
0 & X & X & X & X & X & \cdots \\
0 & 0 & X & X & X & X & \cdots \\
0 & 0 & 0 & X & X & X & \cdots \\
0 & 0 & 0 & 0 & X & X & \cdots \\
0 & 0 & 0 & 0 & 0 & X & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.

Thus, an infinite eigenvalue can be deflated at the top left corner. An analogous procedure can be used to deflate infinite eigenvalues at the bottom right corner. This reduces the required operations if the zero diagonal entry of T is closer to that corner.

2.4.2. Parallel implementation. A number of applications lead to matrix pencils with a substantial fraction of infinite eigenvalues, see § 3 for examples. In this case, applying the above procedure to deflate each infinite eigenvalue individually is clearly not very efficient in a parallel environment.

Instead, we aim at deflating several infinite eigenvalues simultaneously. To do this in a systematic manner and attain good node performance, we proceed similarly as in the parallel multishift QZ iterations and the parallel eigenvalue reordering algorithm discussed in § 2.2.2 and § 2.3.2, respectively. Up to min(P_r, P_c) computational windows are placed on the diagonal of (H, T), such that each window is owned by a single diagonal process. Within each window, the negligible diagonal entries of T are identified according to (7), zeroed, and they are all moved either to the top left corner or to the bottom right corner of T, depending on which corner is nearest. Only then do we perform the corresponding updates outside the windows, by accumulating the rotations into orthogonal matrices and performing matrix-matrix multiplications.

In the next step, all computational windows are shifted upwards or downwards. In effect, the zero diagonal entries of T are located in the bottom or top parts of the windows. Moreover, as illustrated in Figure 2, the windows now overlap process borders. To move the zero diagonal entries in the first computational window across the process border, we proceed as follows:


Fig. 2. Crossborder layout of three windows during the parallel deflation of infinite eigenvalues. Computational windows and process borders indicated by black and red solid lines, respectively. Darker-gray areas indicate regions that need to be updated after the windows have been processed.

1. The two processes on the diagonal (P00 and P11) exchange their parts of the window and receive the missing off-diagonal parts from the other two processes (P01 and P10). In effect, two identical copies of the computational window are created.
2. Each of the two diagonal processes (P00 and P11) identifies and zeroes negligible diagonal entries of T within the window, and moves all of them either to the top left or to the bottom right corner of the window, depending on which corner of T is nearest.
3. The two off-diagonal processes (P01 and P10) receive the updated off-diagonal parts of the window from one of the on-diagonal processes (P00 or P11). The accumulated orthogonal transformation matrices generated in Step 2 are broadcast to the blocks on both sides of the process borders (to P00, P01 and to P02, P03, P11, P12, P13).

For updating the parts of the matrix outside the window, neighboring processes holding cross-border regions exchange their data in parallel and compute the updates in parallel.

To achieve good parallel performance, the above procedure needs to be applied to several computational windows simultaneously. However, some care needs to be applied in order to avoid intersecting scopes of the diagonal processes in Steps 1 and 2. Following an idea proposed in [19], this can be achieved as follows. We number the windows from bottom to top, starting with index 0. First all even-numbered windows are treated and only then all odd-numbered windows are treated in parallel.

When the zero diagonal entries of T have been moved across the process borders, the next step again consists of shifting all computational windows upwards or downwards such that they are owned by diagonal processes. This allows us to repeat the whole procedure, until all zero diagonal entries of T arrive at one of the corners and admit deflation. The parallel procedure of chasing zeros along the diagonal of T and deflating infinite eigenvalues is illustrated and described in some more detail in Figures 3–7.


Fig. 3. Example where 8 zeros have been identified on the diagonal of T, indicated by filled red squares. The grid is of size 5 × 5 and process borders are indicated by red solid lines.

Fig. 4. Intra-block chase where all diagonal blocks are processed in parallel. The 4 active diagonal blocks, within which the zero diagonal entries of T are moved up or down, are marked in green. The old positions of the zero diagonal entries are marked by white squares, while the new positions or (so far) not moved zeros are marked by red squares. Three and two deflations are performed at the top left and bottom right diagonal blocks, respectively, leading to the zero subdiagonal entries of H marked by red squares on the subdiagonal of H. The chase is followed by horizontal and vertical broadcasts of accumulated transformations, so that off-diagonal processors can update their parts of H and T in parallel. The area affected by the off-diagonal updates is marked in dark grey.

3. Computational Experiments.

3.1. Computing environment. The algorithms described in § 2 have been implemented in a Fortran 90 routine PDHGEQZ. Following the ScaLAPACK style, we use BLACS [7] for communication and call ScaLAPACK / PBLAS or LAPACK / BLAS routines whenever appropriate.

We have used the two HPC2N computer systems Akka and Abisko, see Table 1, for our computational experiments.


Fig. 5. Inter-block chase, phase one, where all even-numbered diagonal blocks, indexed by 0 . . . nblock − 1 from bottom to top, are processed in parallel. In this case there is only one active diagonal block. See Figure 4 for an explanation of the color marking.

Fig. 6. Inter-block chase, phase two, where all odd-numbered diagonal blocks are processed in parallel. See Figure 4 for an explanation of the color marking.

For all our experiments on Akka, we used the Fortran 90 compiler mpif90 version 4.0.13 from the PathScale EKOPath(tm) Compiler Suite with the optimization flag -O3. For all our experiments on Abisko, we used the Intel compiler mpiifort version 13.1.2 with the optimization flag -O3.

On Abisko, each floating point unit (FPU) is shared between two cores. To attain near-optimal performance, we only utilize 50% of the cores within a compute node. For example, a grid P_r × P_c = 10 × 10 is allocated on 5 compute nodes, using up to 24 cores on each node. Although each core on Akka has its own FPU, we do not utilize more than 50% of the cores on Akka either, for memory capacity reasons. Each compute node on Akka has 8 cores, so a 10 × 10 grid is allocated on 25 compute nodes, using 4 cores on each node.


Fig. 7. Intra-block and inter-block chases are repeated until all zeros on the diagonal of T have been moved to the top left or bottom right corners. A total of 8 deflations can be made. The remaining problem size is decreased accordingly, indicated by the black square on H and T.

Table 1
Akka and Abisko at the High Performance Computing Center North (HPC2N)

Akka    64-bit low power Intel Xeon Linux cluster
        672 dual socket quadcore L5420 2.5GHz nodes
        256KB dedicated L1 cache, 12MB shared L2 cache, 16GB RAM per node
        Cisco Infiniband and Gigabit Ethernet, 10 GB/sec bandwidth
        OpenMPI 1.4.4, GOTO BLAS 1.13
        LAPACK 3.4.0, ScaLAPACK/BLACS/PBLAS 2.0.1

Abisko  64-bit AMD Opteron Linux Cluster
        322 nodes with a total of 15456 CPU cores
        Each node is equipped with 4 AMD Opteron 6238 (Interlagos) 12 core 2.6 GHz processors
        10 'fat' nodes with 512 GB RAM each, as well as 312 'thin' nodes with 128 GB RAM each
        40 Gb/s Mellanox Infiniband
        Intel-MPI 13.1.2, ACML package 5.3.1 (includes LAPACK 3.4.0)
        ScaLAPACK/BLACS/PBLAS 2.0.2

3.2. Selection of nb and nAED. Our implementation PDHGEQZ of the parallel QZ algorithm is controlled by a number of parameters, in particular the AED window size n_AED and the number of shifts n_shifts. For choosing these parameters, we follow the strategy proposed in [19, 20] for the parallel QR implementation. The block size n_b, which determines the data distribution block size, is set to n_b = 130 when n > 2000. If n ≤ 2000, an n_b value of 60 is near optimal. This is the same for both Akka and Abisko.

As explained in § 2.3.3, the n_AED × n_AED AED window is redistributed to a P_AED × P_AED grid before performing AED. The redistribution itself requires some computation and communication, but the cost is small compared to the whole AED process, see [20]. The local size of the AED window is set to a new value n_local during the data redistribution. On Akka and Abisko, we choose n_local = n_b ⌈384/n_b⌉, implying that each process involved in the AED should own at least a 384 × 384 block of the whole


AED window. The value 384 is taken from the QR implementation, see [20]. Then P_AED is chosen as the smallest value that satisfies

n_AED ≤ (1 + P_AED) · n_local.    (8)

Before the redistribution, we also adjust n_b to a value better suited for the problem size, i.e., 60 or 130 depending on n_AED. When n_AED ≤ 6 000, we use PDHGEQZ1 to compute the generalized Schur decomposition of the adjusted AED window, otherwise PDHGEQZ is called recursively.
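The parameter choice above can be summarized by the following small sketch (our own illustration of the stated rules; the integer ceiling division is written out explicitly):

      SUBROUTINE CHOOSE_AED_GRID( NAED, NB, NLOCAL, PAED )
*     Illustrative sketch of the rules in Section 3.2: each process on
*     the AED grid should own at least a 384 x 384 block, and PAED is
*     the smallest value satisfying NAED <= (1 + PAED)*NLOCAL, cf. (8).
      INTEGER            NAED, NB, NLOCAL, PAED
*     NLOCAL = NB * ceil(384/NB)
      NLOCAL = NB*( ( 384 + NB - 1 ) / NB )
      PAED = 1
      DO WHILE( NAED.GT.( 1 + PAED )*NLOCAL )
         PAED = PAED + 1
      END DO
      END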

3.3. Accuracy. All experiments have been performed in double precision arithmetic (ε_mach ≈ 2.2 × 10^{-16}). For each run of the parallel QZ algorithm, we have verified its backward stability by measuring

R_r = \max\left\{ \frac{\|Q^T A Z - S\|_F}{\|A\|_F}, \frac{\|Q^T B Z - T\|_F}{\|B\|_F} \right\},

where (A, B) is the original matrix pair before reduction to generalized Schur form (S, T). The numerical orthogonality of the transformations has been tested by measuring

R_o = \max\left\{ \frac{\|Q^T Q - I_n\|_F}{\varepsilon_{\mathrm{mach}}\, n}, \frac{\|Z^T Z - I_n\|_F}{\varepsilon_{\mathrm{mach}}\, n} \right\}.

For all experiments reported in this paper, we have observed R_r ∈ [10^{-15}, 10^{-14}] and R_o ∈ [0.5, 2.5].

To verify that the matrix pair (S, T) returned by PDHGEQZ is indeed in real generalized Schur form, we have checked that the subdiagonal of S does not have two subsequent nonzero entries and that the eigenvalues of two-by-two blocks correspond to a complex conjugate pair of eigenvalues. All considered matrix pairs passed this test.
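For reference, one term of R_r can be evaluated serially with two calls to DGEMM followed by DLANGE, as in the following sketch (our own illustration with hypothetical argument names; in the parallel tests the analogous PBLAS/ScaLAPACK routines operate on the distributed matrices):

      DOUBLE PRECISION FUNCTION RESID( N, A, LDA, Q, LDQ, Z, LDZ,
     $                                 S, LDS, W1, W2 )
*     Illustrative serial evaluation of one term of R_r:
*     RESID = || Q**T * A * Z - S ||_F / || A ||_F.
*     W1 and W2 are N x N work arrays provided by the caller.
      INTEGER            N, LDA, LDQ, LDZ, LDS, I, J
      DOUBLE PRECISION   A( LDA, * ), Q( LDQ, * ), Z( LDZ, * )
      DOUBLE PRECISION   S( LDS, * ), W1( N, * ), W2( N, * )
      DOUBLE PRECISION   DLANGE, DUMMY( 1 )
      EXTERNAL           DLANGE, DGEMM
*     W1 = Q**T * A
      CALL DGEMM( 'T', 'N', N, N, N, 1.0D0, Q, LDQ, A, LDA,
     $            0.0D0, W1, N )
*     W2 = W1 * Z
      CALL DGEMM( 'N', 'N', N, N, N, 1.0D0, W1, N, Z, LDZ,
     $            0.0D0, W2, N )
*     W2 = W2 - S
      DO J = 1, N
         DO I = 1, N
            W2( I, J ) = W2( I, J ) - S( I, J )
         END DO
      END DO
      RESID = DLANGE( 'F', N, N, W2, N, DUMMY ) /
     $        DLANGE( 'F', N, N, A, LDA, DUMMY )
      END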

3.4. Performance for random problems. We consider four different models of n × n pseudo-random matrix pairs (H, T) in Hessenberg-triangular form.

Hessrand1 For j = 1, . . . , n − 2, the subdiagonal entry h_{j+1,j} is the square root of a chi-squared distributed random variable with n − j degrees of freedom (∼ χ(n − j)); a small sketch for drawing such entries is given after this list. For the diagonal entries of T, we choose t_{11} ∼ χ(n) and t_{jj} ∼ χ(j − 1) for j = 2, . . . , n. All other nonzero entries of H and T are normally distributed with mean zero and variance one (∼ N(0, 1)). This corresponds to the distribution obtained when reducing a full matrix pair (A, B) with entries ∼ N(0, 1) to Hessenberg-triangular form; see [9] for a similar model. Such a matrix pair (H, T) has reasonably well-conditioned eigenvalues and, in turn, the QZ algorithm obeys a fairly predictable convergence behaviour.

Hessrand2 In this model, the nonzero entries of the Hessenberg matrix H and the triangular matrix T are all chosen from a uniform distribution in [0, 1]. The eigenvalues of such matrix pairs are notoriously ill-conditioned and lead to a more erratic convergence behaviour of the QZ algorithm.

Hessrand3 The Hessenberg matrix H is chosen as in the Hessrand2 model, but the upper triangular matrix T is chosen as in the Hessrand1 model. The eigenvalues tend to be less ill-conditioned compared to Hessrand2, but the potential ill-conditioning of T causes some eigenvalues to be identified as infinite eigenvalues by the QZ algorithm.


Fig. 8. Serial execution times of PDHGEQZ, DHGEQZ, and KKQZ for Hessrand1 matrix pairs on Akka (left figure) and Abisko (right).

Infrand This model is identical to Hessrand1, with the notable exception that we set each diagonal element of T to zero with probability 0.5. This yields a substantial number of infinite eigenvalues; around 1/3 of the eigenvalues are infinite.
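As referenced in the Hessrand1 description, the following is a minimal sketch (our own helper, not part of PDHGEQZ) for drawing a χ(k)-distributed value as the 2-norm of k independent N(0, 1) samples generated with the Box-Muller transform:

      DOUBLE PRECISION FUNCTION CHIRND( K )
*     Illustrative sketch: draw a chi-distributed random number with K
*     degrees of freedom as the square root of a sum of K squared
*     N(0,1) samples (Box-Muller transform).
      INTEGER            K, I
      DOUBLE PRECISION   U1, U2, G, SSQ, PI
      PARAMETER          ( PI = 3.141592653589793D0 )
      SSQ = 0.0D0
      DO I = 1, K
         CALL RANDOM_NUMBER( U1 )
         CALL RANDOM_NUMBER( U2 )
*        Guard against U1 = 0 before taking the logarithm.
         IF( U1.EQ.0.0D0 ) U1 = 0.5D0
         G = SQRT( -2.0D0*LOG( U1 ) )*COS( 2.0D0*PI*U2 )
         SSQ = SSQ + G*G
      END DO
      CHIRND = SQRT( SSQ )
      END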

3.4.1. Serial execution times. To verify that our parallel implementation PDHGEQZ also performs well on a single core, we have compared it with the serial implementations DHGEQZ (LAPACK version 3.4.0) and KKQZ (a prototype implementation of the serial multishift QZ algorithm with AED [23]). Figure 8 shows the timings obtained on a single core for Hessrand1 matrix pairs. The serial performance of PDHGEQZ turns out to be quite good. It is even faster than KKQZ, although this is mainly due to the fact that KKQZ represents a rather preliminary implementation with little performance tuning. We expect the final version of KKQZ to be at least on par with PDHGEQZ. Similar results have been obtained for Hessrand2, Hessrand3, and Infrand.

3.4.2. Parallel execution time. Tables 2–4 show the parallel execution times obtained on Akka and Abisko for random matrix pairs of size n = 4 000–32 000.

The figures for the Hessrand1 model in Table 2 indicate a parallel performance comparable to the one obtained for the parallel QR algorithm [20] on the same architecture. Analogous to the parallel QR algorithm, the execution time T(·, ·) of the parallel QZ algorithm, as a function of n and p, satisfies

2 T(n, p) ≤ T(2n, 4p) < 4 T(n, p),    (9)

provided that the local load per core is fixed to n/√p = 4000. This indicates that PDHGEQZ scales rather well, see also § 3.4.3.

As expected, the results for the Hessrand2 model in Table 3 are somewhat erratic. Compared to the Hessrand1 model, significantly smaller execution times are obtained for larger n, due to the greater effectiveness of AED.

Finally, Table 4 reveals good parallel performance also for matrix pairs with a substantial fraction (roughly 1/3) of infinite eigenvalues. Interestingly, for n ≥ 8 000, the execution times for the Infrand model are often larger compared to the Hessrand1 model, especially for a smaller number of processes. This effect is due to the success of AED already in the very beginning of the QZ algorithm: Deflating a converged finite eigenvalue in the bottom right corner with AED requires significantly fewer operations than deflating an infinite eigenvalue located far away from the top left and bottom right corners.


Table 2
Parallel execution time (in seconds) of PDHGEQZ for the Hessrand1 model. Numbers within parentheses are the execution time ratios of PDHGEQZ over PDHSEQR [20] for similar random problems.

                              Akka                                       Abisko
Pr × Pc    n = 4k    n = 8k    n = 16k    n = 32k      n = 4k    n = 8k    n = 16k    n = 32k
 1 × 1    114(1.0)                                     73(1.5)
 2 × 2     76(1.4)  403(1.4)                           37(1.8)  226(2.5)
 4 × 4     35(1.0)  181(1.1)   598(0.8)                23(1.8)  134(2.1)   418(1.5)
 6 × 6     28(0.7)  127(1.1)   432(1.0)  2188(1.0)     20(1.8)   85(2.2)   316(1.9)  1218(1.4)
 8 × 8     25(0.7)   91(0.9)   367(1.1)  1754(1.2)     19(1.6)   72(1.9)   251(1.9)  1051(1.5)
10 × 10    24(0.7)   88(0.7)   298(0.9)  1486(0.9)     18(2.0)   66(1.7)   208(1.9)   919(1.8)

Table 3
Parallel execution time (in seconds) of PDHGEQZ for the Hessrand2 model.

                      Akka                               Abisko
Pr × Pc   n = 4k  n = 8k  n = 16k  n = 32k    n = 4k  n = 8k  n = 16k  n = 32k
 1 × 1      143                                  46
 2 × 2       79      82                          48      55
 4 × 4       45      29      102                 33      22       71
 6 × 6       43      31       90      335        21      21       66      199
 8 × 8       28      26       87      303        22      21       58      193
10 × 10      37      26       83      291        29      22       63      187

3.4.3. Parallel execution time: n = 100 000 benchmark. To test the performance for larger matrix pairs, we performed a few runs for n = 100 000 for the Hessrand1 and Hessrand3 models. The obtained execution times (in seconds) on Akka and Abisko are shown in the first row of Table 5 and Table 6, respectively. Moreover, the following runtime statistics for PDHGEQZ are shown:

#AED       number of times AED has been performed
#sweeps    number of multishift QZ sweeps
#shifts/n  average number of shifts needed to deflate a finite eigenvalue
%AED       percentage of execution time spent on AED
#redist    number of times a redistribution of data has been performed

On Akka and for Hessrand1, see Table 5, we have included a comparison with the parallel QR algorithm applied to the fullrand model [20]. For the timings, the ratios PDHGEQZ/PDHSEQR are listed, while absolute values are reported for the other statistics. The number of redistributions is available only for the QZ algorithm. Although the QZ algorithm is expected to perform about 3.5 times as many flops as the QR algorithm [17], the execution time ratios vary between 0.8 and 2.1. This is explained by a more effective AED within PDHGEQZ for this problem type. The execution times on Abisko are substantially lower than on Akka, for both Hessrand1 and Hessrand3. Even though the per core computing power is roughly the same on both machines, the network used on Abisko is faster than on Akka, and this, in conjunction with using Intel-MPI instead of OpenMPI, gives faster execution times on Abisko for these random problems.

It turns out that only a few multishift QZ sweeps are performed for Hessrand1. For Hessrand3, the ill-conditioning makes AED even more effective and no multishift QZ sweep needs to be performed.

3.5. Impact of infinite eigenvalues on performance and scalability. The purpose of this experiment is to investigate the impact of a varying fraction of infinite eigenvalues on the performance, in addition to the observations already made in Table 4.


Table 4
Parallel execution time (in seconds) of PDHGEQZ for the Infrand model.

                      Akka                               Abisko
Pr × Pc   n = 4k  n = 8k  n = 16k  n = 32k    n = 4k  n = 8k  n = 16k  n = 32k
 1 × 1      117                                  67
 2 × 2       64     534                          34     259
 4 × 4       30     194     1153                 18     110      691
 6 × 6       21     126      627     3748        16      73      380     2024
 8 × 8       19      81      441     3235        15      58      310     1636
10 × 10      18      74      342     2322        16      55      238     1272

Table 5
Execution time and statistics for PDHGEQZ on Akka for n = 100 000. Numbers within parentheses provide a comparison to PDHSEQR for a similar random problem.

            Pr × Pc = 16 × 16        Pr × Pc = 24 × 24        Pr × Pc = 32 × 32        Pr × Pc = 40 × 40
            Hessrand1   Hessrand3    Hessrand1   Hessrand3    Hessrand1   Hessrand3    Hessrand1   Hessrand3
Time        22093(1.6)  2725         13165(1.6)  1614         9326(1.4)   1450         7028(0.8)   2539
#AED        26(35)      17           27(31)      17           26(27)      17           25(23)      17
#sweeps     2(5)        0            2(6)        0            4(13)       0            9(12)       0
#shifts/n   0.08(0.20)  0            0.07(0.23)  0            0.11(0.35)  0            0.15(0.49)  0
%AED        83%(48%)    100%         81%(43%)    100%         78%(39%)    100%         73%(54%)    100%
#redist     1(-)        1            26(-)       17           25(-)       17           24(-)       17


We consider n × n matrix pairs (A, B) of the form

A = Q \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix} Z^T, \qquad B = Q \begin{bmatrix} B_{11} & 0 \\ 0 & 0 \end{bmatrix} Z^T,

where A_{11}, B_{11} ∈ R^{(n−m)×(n−m)} and A_{22} ∈ R^{m×m} are full matrices with entries randomly chosen from a uniform distribution in [0, 1]. The orthogonal transformation matrices Q and Z are randomly chosen as well. Such a matrix pair will have m infinite eigenvalues of index 1 (corresponding to Jordan blocks of size 1 × 1).

Table 7 shows the execution time on Akka of PDHGEQZ applied to (A, B), after reduction to Hessenberg-triangular form. In all tests, the induced m infinite eigenvalues have been correctly identified by the parallel QZ algorithm. The fact that the execution times decrease as the number of infinite eigenvalues increases confirms the good performance of our parallel algorithm for deflating infinite eigenvalues. Regardless of how many infinite eigenvalues we have, equation (9) still holds, except when considering 40% infinite eigenvalues and comparing execution times on a 2 × 2 grid for n = 4000 with a grid of size 4 × 4 for n = 8000.

Similar observations have been made on Abisko.

3.6. Performance for benchmark examples. In the following, we present performance results for PDHGEQZ run on a number of different benchmarks, see Tables 9–16. The benchmarks are described in Table 8; most of them come from real world problems, while a few are constructed examples. The benchmarks are stored in files using the Matrix Market format, see [4], and to be able to process these benchmark files effectively, specialized routines were developed that parse the files so that every process in the grid stores its specific part of the globally distributed matrix pair (A, B). A and B are treated as dense in all benchmarks, although some of the samples are very sparse, and are reduced to Hessenberg-triangular form before PDHGEQZ is called, except for the BBMSN benchmark, which is already stored in Hessenberg-triangular form.


Table 6
Execution time and statistics for PDHGEQZ on Abisko for n = 100 000.

            Pr × Pc = 16 × 16        Pr × Pc = 24 × 24        Pr × Pc = 32 × 32        Pr × Pc = 40 × 40
            Hessrand1   Hessrand3    Hessrand1   Hessrand3    Hessrand1   Hessrand3    Hessrand1   Hessrand3
Time        10259       1053         8031        1463         6092        1367         5680        2163
#AED        32          17           26          17           24          17           28          17
#sweeps     1           0            2           0            4           0            9           0
#shifts/n   0.04        0            0.07        0            0.11        0            0.16        0
%AED        79%         100%         74%         100%         70%         100%         66%         100%
#redist     2           1            25          17           23          17           25          17

Table 7
Execution time for PDHGEQZ on Akka for a varying fraction of infinite eigenvalues.

Pr × Pc      n      10% inf   20% inf   30% inf   40% inf
 1 × 1     4 000       88        69        56        50
 2 × 2     4 000       38        33        26        20
 4 × 4     4 000       31        28        21        16
 6 × 6     4 000       24        19        16        13
 8 × 8     4 000       20        17        14        10
 2 × 2     8 000      235       185       153       149
 4 × 4     8 000      140       116       100        81
 6 × 6     8 000      101        85        67        58
 8 × 8     8 000       82        70        59        46


Table 8
Description of benchmark problems

Name                    Size n     # ∞     Description                                    Ref
BBMSN                   scalable      0    academic example                               [23]
BCSST25                   15 439      0    Columbia Center skyscraper                     [4]
MHD4800                    4 800      0    Alfven spectra in Magnetohydrodynamics         [4]
xingo6u, Bz1              20 738  17769    Brazilian Interconnect Power System            [22]
xingo3012, Bz2            20 944  17910    Brazilian Interconnect Power System            [22]
bips07 1998, Bz3          15 066  13046    Brazilian Interconnect Power System            [16]
bips07 2476, Bz4          16 861  14336    Brazilian Interconnect Power System            [16]
bips07 3078, Bz5          21 128  18017    Brazilian Interconnect Power System            [16]
railtrack2, RT[1..5]    scalable      -    Palindromic quadratic eigenvalue problem       [6]
convective                11 730      0    2D convective thermal flow problems            [29]
gyro                      17 361      0    butterfly gyroscope                            [29]
steel1                     5 177      0    heat transfer in steel profile                 [29]
steel2                    20 209      0    heat transfer in steel profile                 [29]
supersonic                11 730    407    supersonic engine inlet                        [29]
t2dal                      4 257      0    2D micropyros thruster, linear FE              [29]
t2dah                     11 445      0    2D micropyros thruster, quadratic FE           [29]
t3dl                      20 360      0    3D micropyros thruster, linear FE              [29]
mna2                       9 223   1146    modified nodal analysis                        [12]
mna3                       4 863    416    modified nodal analysis                        [12]
mna5                      10 913    123    modified nodal analysis                        [12]

The first column in Table 8 shows the benchmark names and, in some cases, acronyms used in the result tables. Column two holds the problem sizes n for the benchmarks. For the benchmarks marked scalable, the problem sizes considered are


marked in the result tables. Column three shows the average number of infinite eigenvalues identified. Columns 4 and 5 give a brief description of and reference to each benchmark.

Table 9
Execution time, in seconds, on Akka and Abisko for BBMSN.

    n      Pr × Pc   Akka   Abisko
 5000       1 × 1      6       7
10000       2 × 2     23      18
20000       4 × 4     53      32

The BBMSN benchmark, see Table 9 for execution times on Akka and Abisko, is constructed in such a way that the AED should be very successful. It turns out that the scaling is not optimal, but equation (9) still holds. Compared with the Hessrand1 model in Table 2 we observe drastically reduced run-times, explained by the fact that no QZ sweeps are needed to compute the generalized Schur form.

Table 10
Execution time, in seconds, on Akka and Abisko for Matrix Market examples.

                 Akka                    Abisko
Pr × Pc   MHD4800   BCSST25      MHD4800   BCSST25
 1 × 1       304                    121
 2 × 2        82                     35
 4 × 4        26      1195           19       840
 6 × 6        23       701           16       550
 8 × 8        24       602           16       536
10 × 10       23       492           16       518

BCSST25 and MHD4800 are two benchmarks with nonsingular B, i.e., only finite eigenvalues; see Table 10. For MHD4800 we observe good scaling, using a grid size up to 4 × 4, but the problem is too small to be solved effectively on larger grid sizes. BCSST25 is a somewhat larger problem, and PDHGEQZ therefore shows better scaling.

Table 11
Execution time, in seconds, on Akka and Abisko for Brazilian interconnect benchmarks.

                       Akka                             Abisko
Pr × Pc   Bz1   Bz2   Bz3   Bz4   Bz5      Bz1   Bz2   Bz3   Bz4   Bz5
 4 × 4    228   333                        164   232
 6 × 6    117   165   306   339   559       96   126   220   226   234
 8 × 8    123   125   186   192   199       64    99   143   146   160
10 × 10    78   102   155   157   152       61    74   113   123   120

The Brazilian interconnect power system benchmarks have a large ratio of infinite eigenvalues; see Table 11 for execution times. We observe good overall scaling properties. The initial Hessenberg-triangular reduction moves tiny or zero elements in B up to the upper left corner, making the deflation of infinite eigenvalues particularly cheap. Table 12 reports the number of identified infinite eigenvalues for two different benchmarks and five different grid sizes. These problems have quite unbalanced A and B, leading to greater impact from rounding errors, and hence, the number of identified infinite eigenvalues will vary, even for the same grid and problem sizes if the same problem is executed more than once.

Results for the scalable railtrack benchmark are reported in Table 13 for five different problem sizes and six different grid sizes, both for Akka and Abisko.


Table 12
Number of deflated infinite eigenvalues for Bz3 and Bz4 on Akka.

Pr × Pc    #∞ Bz3    #∞ Bz4
 4 × 4      13004     14305
 6 × 6      13028     14263
 8 × 8      13041     14333
10 × 10     13046     14336

These problems have a high fraction of infinite eigenvalues, which is reflected in the measured execution times. On average, PDHGEQZ identified 1409, 2813, 4212, 6963, and 8320 infinite eigenvalues for RT1, RT2, RT3, RT4, and RT5, respectively.

Table 13
Execution time, in seconds, on Akka and Abisko for the railtrack benchmark. n is 5640, 8640, 11280, 16920, 19740 for RT1, RT2, RT3, RT4, and RT5, respectively.

                       Akka                             Abisko
Pr × Pc   RT1   RT2   RT3   RT4   RT5      RT1   RT2   RT3   RT4   RT5
 1 × 1     98                               69
 2 × 2     43    70                         22    36
 4 × 4     19    29    30    60             12    19    24    48
 6 × 6     15    20    25    45    58       10    15    17    38    58
 8 × 8     13    18    23    44    67       10    12    16    33    47
10 × 10    12    18    20    43    63        9    12    15    30    45

All benchmarks related to the Oberwolfach test suite, except one (supersonic), have a nonsingular B part. Execution times on Akka are displayed in Table 14; corresponding results on Abisko are found in Table 15. Most of these problems are rather small, and therefore PDHGEQZ only reveals good scaling properties for smaller grid sizes.

Table 14
Execution time, in seconds, on Akka for Oberwolfach benchmarks.

Pr × Pc   t2dal   steel1   convective   t2dah   supersonic   gyro   steel2   t3dl
 1 × 1      243      606
 2 × 2      108      297
 4 × 4       76      197          883     735          661
 6 × 6       73      190          616     343          328   1052     2164   2066
 8 × 8       68      182          523     304          309    854     1847   1921
10 × 10      71      175          437     357          300    732     1509   1450

In Table 16, execution times are reported for the modified nodal analysis benchmark group, both for Akka and Abisko. These problems have a moderate fraction of infinite eigenvalues. Some of the problems are too small to be solved effectively on the larger grid sizes.

4. Conclusions. We have presented a new parallel algorithm and implementation PDHGEQZ of the multishift QZ algorithm with aggressive early deflation for reducing a matrix pair to generalized Schur form. To the best of our knowledge, this is the only parallel implementation capable of handling infinite eigenvalues. Our extensive computational experiments demonstrate robust numerical results and scalable parallel performance, comparable to what has been achieved for a recent parallel implementation of the QR algorithm [20].


Table 15
Execution time, in seconds, on Abisko for Oberwolfach benchmarks.

Pr × Pc   t2dal   steel1   convective   t2dah   supersonic   gyro   steel2   t3dl
 1 × 1      117      249
 2 × 2       42      135
 4 × 4       24       70          353     315          305
 6 × 6       20       56          193     179          163    737     1567   1348
 8 × 8       18       50          175     152          147    621     1293   1140
10 × 10      17       47          176     158          151    345     1161   1013

Table 16
Execution time, in seconds, on Akka and Abisko for modified nodal analysis benchmarks.

                  Akka                     Abisko
Pr × Pc    mna3    mna2    mna5     mna3    mna2    mna5
 1 × 1      113                       67
 2 × 2       61                       30
 4 × 4       25     203     984       16     143     629
 6 × 6       21     130     614       14      91     360
 8 × 8       17     109     560       13      78     402
10 × 10      16     106     351       13     101     363

There is certainly room to improve the tuning of the parameters used in PDHGEQZ and the interplay of the different components, including trade-offs between algorithm variants used in the implementation that targets distributed memory architectures. However, from the experience gained in the experiments, we do not expect this to lead to any further dramatic performance improvements. To boost the performance much more, the coarse-grained parallelism has to be combined with a fine-grained and strongly scalable parallel QZ algorithm that effectively manages the shared memory hierarchies of multicore nodes. Such initiatives for the parallel QR algorithm have recently started; see [27]. A critical part not addressed in this paper is the initial reduction to Hessenberg-triangular form. If this part is not handled properly, it will dominate the overall execution time. A parallel implementation of the Hessenberg-triangular reduction, based on ideas from [24], is currently in preparation.

Acknowledgments. The authors are grateful to Robert Granat, Lars Karlsson, and Meiyue Shao for helpful discussions on parallel QR and QZ algorithms. The work was supported by the Swedish Research Council (VR) under grant A0581501, and by eSSENCE, a strategic collaborative e-Science programme funded by the Swedish Government via VR. We thank the High Performance Computing Center North (HPC2N) for providing computational resources and valuable support during test and performance runs.

Appendix A. Implementation aspects.

Our software is written in Fortran 90. It is a ScaLAPACK-style implementation, using BLACS for communication and ScaLAPACK/LAPACK/PBLAS/BLAS routines where appropriate.
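As a minimal illustration of this setup (our own sketch, not an excerpt from PDHGEQZ), the following program initializes a square Pr × Pc BLACS process grid of the kind used throughout the experiments; the program name and the assumption of a perfect-square process count are made only for the example.

      PROGRAM GRIDSETUP
*     Illustrative sketch only: create a square Pr x Pc BLACS process
*     grid, assuming the job is launched with a perfect-square number
*     of MPI processes (e.g. 4, 16, 36, 64, 100).
      IMPLICIT NONE
      INTEGER            IAM, NPROCS, ICTXT, NPROW, NPCOL, MYROW, MYCOL
*
*     Query the process id and the total number of processes.
      CALL BLACS_PINFO( IAM, NPROCS )
*     Choose a square grid.
      NPROW = INT( SQRT( DBLE( NPROCS ) ) )
      NPCOL = NPROW
*     Obtain the default system context and create a row-major grid.
      CALL BLACS_GET( 0, 0, ICTXT )
      CALL BLACS_GRIDINIT( ICTXT, 'Row-major', NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*
*     ... create descriptors, distribute (A, B), Q, Z, and call
*     PDHGEQZ here ...
*
*     Release the grid and shut down the BLACS.
      CALL BLACS_GRIDEXIT( ICTXT )
      CALL BLACS_EXIT( 0 )
      END

The grid shape Pr × Pc chosen here then enters the ScaLAPACK array descriptors that are passed to PDHGEQZ.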

Figure 9 shows an overview of the main routines related to our parallel QZ software. Directed arrows indicate that one routine calls other routines. For example, PDHGEQZ1 is called by PDHGEQZ and PDHGEQZ3, and calls four routines: PDHGEQZ2, PDHGEQZ4, PDHGEQZ5, and PDHGEQZ7. Calls to LAPACK, ScaLAPACK, BLACS, etc. are not shown in Figure 9.


The main entry routine is PDHGEQZ (see Figure 10 for an interface description), but the call might be passed directly on to PDHGEQZ1 if the problem is small enough, i.e., n ≤ 6000. PDHGEQZ3 is responsible for performing the parallel AED. For the QZ sweeps, both PDHGEQZ and PDHGEQZ1 call PDHGEQZ5 and provide, as parameters, the shifts, the number of computational windows to chase, and the number of shifts to use within each window. PDHGEQZ5 then sets up and moves the windows until they are chased off the diagonal of (H, T).

Fig. 9. Software hierarchy for PDHGEQZ.

      RECURSIVE SUBROUTINE PDHGEQZ( JOB, COMPQ, COMPZ,
     $                   N, ILO, IHI, A, DESCA, B, DESCB,
     $                   ALPHAR, ALPHAI, BETA, Q, DESCQ, Z, DESCZ,
     $                   WORK, LWORK, IWORK, LIWORK, INFO, RLVL )
*     ..
*     .. Scalar Arguments ..
*     ..
      CHARACTER          COMPQ, COMPZ, JOB
      INTEGER            IHI, ILO, INFO, LWORK, N
      INTEGER            LIWORK
      INTEGER            RLVL
*     ..
*     .. Array Arguments ..
*     ..
      DOUBLE PRECISION   A( * ), B( * ), Q( * ), Z( * )
      DOUBLE PRECISION   ALPHAI( * ), ALPHAR( * ), BETA( * )
      DOUBLE PRECISION   WORK( * )
      INTEGER            IWORK( * )
      INTEGER            DESCA( 9 ), DESCB( 9 )
      INTEGER            DESCQ( 9 ), DESCZ( 9 )

Fig. 10. Interface for PDHGEQZ.

The interface for PDHGEQZ is similar to that of the existing (serial) LAPACK routine DHGEQZ; see Figure 10.


The main differences between PDHGEQZ and DHGEQZ are the use of descriptors, instead of leading dimensions, to define the partitioning of the globally distributed matrices across the Pr × Pc process grid, and the fact that PDHGEQZ requires an integer workspace. RLVL indicates the level of recursion at which the current execution is running and should initially be set to 0. We do not allow for more than two levels, that is, PDHGEQZ can call itself, but not more than once. We follow the LAPACK/ScaLAPACK convention for workspace queries, allowing −1 in LWORK or LIWORK; the optimal workspace is then returned in WORK(1) and IWORK(1).
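To make the calling sequence concrete, the following wrapper (a hypothetical sketch of our own, not part of the library) performs the workspace query described above and then calls PDHGEQZ on an already distributed Hessenberg-triangular pair (H, T). The character arguments are assumed to follow the DHGEQZ conventions ('S' to compute the Schur form, 'V' to update Q and Z), and the name EXAMPLEQZ as well as the assumption that Q and Z already hold the transformations from the Hessenberg-triangular reduction are ours.

      SUBROUTINE EXAMPLEQZ( N, H, DESCH, T, DESCT, Q, DESCQ, Z, DESCZ,
     $                      ALPHAR, ALPHAI, BETA, INFO )
*     Hypothetical calling example; (H, T) is assumed to be a
*     distributed Hessenberg-triangular pair and Q, Z the orthogonal
*     factors accumulated so far (or identity matrices).
      IMPLICIT NONE
      INTEGER            N, INFO
      INTEGER            DESCH( 9 ), DESCT( 9 ), DESCQ( 9 ), DESCZ( 9 )
      DOUBLE PRECISION   H( * ), T( * ), Q( * ), Z( * )
      DOUBLE PRECISION   ALPHAR( * ), ALPHAI( * ), BETA( * )
*     ..
*     .. Local Scalars and Arrays ..
      INTEGER            LWORK, LIWORK, RLVL
      DOUBLE PRECISION   WQUERY( 1 )
      INTEGER            IWQUERY( 1 )
      DOUBLE PRECISION, ALLOCATABLE :: WORK( : )
      INTEGER, ALLOCATABLE :: IWORK( : )
*     ..
*     RLVL must be 0 on the initial call; the recursion depth is
*     limited to two levels inside PDHGEQZ.
      RLVL = 0
*     Workspace query: LWORK = LIWORK = -1 returns the required
*     sizes in WQUERY(1) and IWQUERY(1).
      CALL PDHGEQZ( 'S', 'V', 'V', N, 1, N, H, DESCH, T, DESCT,
     $              ALPHAR, ALPHAI, BETA, Q, DESCQ, Z, DESCZ,
     $              WQUERY, -1, IWQUERY, -1, INFO, RLVL )
      LWORK  = INT( WQUERY( 1 ) )
      LIWORK = IWQUERY( 1 )
      ALLOCATE( WORK( LWORK ), IWORK( LIWORK ) )
*     Reduce (H, T) to generalized Schur form.  Following the DHGEQZ
*     conventions, the eigenvalues are returned as
*     (ALPHAR(j) + i*ALPHAI(j)) / BETA(j), with BETA(j) = 0 flagging
*     an infinite eigenvalue.
      CALL PDHGEQZ( 'S', 'V', 'V', N, 1, N, H, DESCH, T, DESCT,
     $              ALPHAR, ALPHAI, BETA, Q, DESCQ, Z, DESCZ,
     $              WORK, LWORK, IWORK, LIWORK, INFO, RLVL )
      DEALLOCATE( WORK, IWORK )
      END

Under these assumptions, the Schur factors overwrite H and T on exit, while Q and Z accumulate the applied transformations.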

REFERENCES

[1] B. Adlerborn, K. Dackland, and B. Kågström, Parallel and blocked algorithms for reduction of a regular matrix pair to Hessenberg-triangular and generalized Schur forms, in PARA 2002, J. Fagerholm et al., ed., LNCS 2367, Springer-Verlag, 2002, pp. 319–328.

[2] B. Adlerborn, D. Kressner, and B. Kågström, Parallel variants of the multishift QZ algorithm with advanced deflation techniques, in PARA 2006, LNCS 4699, 2006, pp. 117–126.

[3] E. Anderson, Z. Bai, C. H. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, PA, third ed., 1999.

[4] Z. Bai, D. Day, J. W. Demmel, and J. J. Dongarra, A test matrix collection for non-Hermitian eigenvalue problems (release 1.0), Technical Report CS-97-355, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, Mar. 1997. Also available online from http://math.nist.gov/MatrixMarket.

[5] Z. Bai, J. W. Demmel, and M. Gu, An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems, Numer. Math., 76 (1997), pp. 279–308.

[6] T. Betcke, N. J. Higham, V. Mehrmann, C. Schröder, and F. Tisseur, NLEVP: a collection of nonlinear eigenvalue problems, MIMS EPrint, 2011.

[7] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. W. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, 1997.

[8] D. Boley, Solving the generalized eigenvalue problem on a synchronous linear processor array, Parallel Comput., 3 (1986), pp. 153–166.

[9] K. Braman, R. Byers, and R. Mathias, The multishift QR algorithm. I. Maintaining well-focused shifts and level 3 performance, SIAM J. Matrix Anal. Appl., 23 (2002), pp. 929–947.

[10] K. Braman, R. Byers, and R. Mathias, The multishift QR algorithm. II. Aggressive early deflation, SIAM J. Matrix Anal. Appl., 23 (2002), pp. 948–973.

[11] A. Bunse-Gerstner and H. Faßbender, On the generalized Schur decomposition of a matrix pencil for parallel computation, SIAM J. Sci. Statist. Comput., 12 (1991), pp. 911–939.

[12] Y. Chahlaoui and P. Van Dooren, A collection of benchmark examples for model reduction of linear time invariant dynamical systems, SLICOT Working Note 2002-2, 2002. Available online from http://www.slicot.org.

[13] J.-P. Charlier and P. Van Dooren, A Jacobi-like algorithm for computing the generalized Schur form of a regular pencil, J. Comput. Appl. Math., 27 (1989), pp. 17–36. Reprinted in Parallel Algorithms for Numerical Linear Algebra, North-Holland, Amsterdam, 1990, pp. 17–36.

[14] K. Dackland and B. Kågström, Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form, ACM Trans. Math. Software, 25 (1999), pp. 425–454.

[15] J. W. Demmel and B. Kågström, The generalized Schur decomposition of an arbitrary pencil A − λB: robust software with error bounds and applications. II. Software and applications, ACM Trans. Math. Software, 19 (1993), pp. 175–201.

[16] F. D. Freitas, J. Rommes, and N. Martins, Gramian-based reduction method applied to large sparse power system descriptor models, IEEE Trans. Power Systems, 23 (2008), pp. 1258–1270.

[17] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, third ed., 1996.

[18] R. Granat, B. Kågström, and D. Kressner, Parallel eigenvalue reordering in real Schur forms, Concurrency and Computation: Practice and Experience, 21 (2009), pp. 1225–1250.

[19] R. Granat, B. Kågström, and D. Kressner, A novel parallel QR algorithm for hybrid distributed memory HPC systems, SIAM J. Sci. Comput., 32 (2010), pp. 2345–2378.

[20] R. Granat, B. Kågström, D. Kressner, and M. Shao, Parallel library software for the multishift QR algorithm with aggressive early deflation, technical report, 2013. Available from http://anchp.epfl.ch.

[21] S. Huss-Lederman, E. S. Quintana-Ortí, X. Sun, and Y.-J. Y. Wu, Parallel spectral division using the matrix sign function for the generalized eigenproblem, International Journal of High Speed Computing, 11 (2000), pp. 1–14.

[22] N. Martins, J. Rommes, and F. D. Freitas, Computing rightmost eigenvalues for small-signal stability assessment of large-scale power systems, IEEE Trans. Power Systems, 25 (2010), pp. 929–938.

[23] B. Kågström and D. Kressner, Multishift variants of the QZ algorithm with aggressive early deflation, SIAM J. Matrix Anal. Appl., 29 (2006), pp. 199–227. Also appeared as LAPACK Working Note 173.

[24] B. Kågström, D. Kressner, E. S. Quintana-Ortí, and G. Quintana-Ortí, Blocked algorithms for the reduction to Hessenberg-triangular form revisited, BIT, 48 (2008), pp. 563–584.

[25] B. Kågström and P. Poromaa, Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: theory, algorithms and software, Numer. Algorithms, 12 (1996), pp. 369–407.

[26] L. Karlsson and B. Kågström, Efficient reduction from block Hessenberg form to Hessenberg form using shared memory, in Proceedings of the 10th International Conference on Applied Parallel and Scientific Computing - Volume 2, Springer-Verlag, 2012, pp. 258–268.

[27] L. Karlsson, B. Kågström, and E. Wadbro, Fine-grained bulge-chasing kernels for strongly scalable parallel QR algorithms, Parallel Comput., (2013).

[28] L. Karlsson, D. Kressner, and B. Lang, Optimally packed chains of bulges in multishift QR algorithms, ACM Trans. Math. Software, (2014). To appear.

[29] J. G. Korvink and E. B. Rudnyi, Oberwolfach benchmark collection, in Dimension Reduction of Large-Scale Systems, P. Benner, V. Mehrmann, and D. C. Sorensen, eds., vol. 45 of Lecture Notes in Computational Science and Engineering, Springer, Heidelberg, 2005, pp. 311–316.

[30] D. Kressner, Numerical Methods for General and Structured Eigenvalue Problems, vol. 46 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2005.

[31] A. N. Malyshev, Parallel algorithm for solving some spectral problems of linear algebra, Linear Algebra Appl., 188/189 (1993), pp. 489–520.

[32] C. B. Moler and G. W. Stewart, An algorithm for generalized matrix eigenvalue problems, SIAM J. Numer. Anal., 10 (1973), pp. 241–256.

[33] T. Sakurai and H. Tadano, CIRR: a Rayleigh-Ritz type method with contour integral for generalized eigenvalue problems, Hokkaido Math. J., 36 (2007), pp. 745–757.

[34] C. F. Van Loan, Generalized Singular Values with Algorithms and Applications, PhD thesis, The University of Michigan, 1973.

[35] R. C. Ward, The combination shift QZ algorithm, SIAM J. Numer. Anal., 12 (1975), pp. 835–853.

[36] R. C. Ward, Balancing the generalized eigenvalue problem, SIAM J. Sci. Statist. Comput., 2 (1981), pp. 141–152.

[37] D. S. Watkins, Performance of the QZ algorithm in the presence of infinite eigenvalues, SIAM J. Matrix Anal. Appl., 22 (2000), pp. 364–375.

[38] D. S. Watkins, The matrix eigenvalue problem, SIAM, Philadelphia, PA, 2007.

[39] D. S. Watkins and L. Elsner, Theory of decomposition and bulge-chasing algorithms for the generalized eigenvalue problem, SIAM J. Matrix Anal. Appl., 15 (1994), pp. 943–967.