
A Fast and Robust Mixed-Precision Solver for the Solution of Sparse Symmetric Linear Systems

J. D. HOGG and J. A. SCOTT

Rutherford Appleton Laboratory

On many current and emerging computing architectures, single-precision calculations are at least twice as fast as double-precision calculations. In addition, the use of single precision may reduce pressure on memory bandwidth. The penalty for using single precision for the solution of linear systems is a potential loss of accuracy in the computed solutions. For sparse linear systems, the use of mixed precision in which double-precision iterative methods are preconditioned by a single-precision factorization can enable the recovery of high-precision solutions more quickly and use less memory than a sparse direct solver run using double-precision arithmetic.

In this article, we consider the use of single precision within direct solvers for sparse symmetric linear systems, exploiting both the reduction in memory requirements and the performance gains. We develop a practical algorithm to apply a mixed-precision approach and suggest parameters and techniques to minimize the number of solves required by the iterative recovery process. These experiments provide the basis for our new code HSL MA79—a fast, robust, mixed-precision sparse symmetric solver that is included in the mathematical software library HSL.

Numerical results for a wide range of problems from practical applications are presented.

Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra—Error analysis; linear systems (direct and iterative methods); sparse, structured and very large systems (direct and iterative methods); G.4 [Mathematical Software]: Efficiency

General Terms: Algorithms, Experimentation

Additional Key Words and Phrases: Gaussian elimination, sparse symmetric linear systems, iterative refinement, FGMRES, multifrontal method, mixed precision, Fortran 95

ACM Reference Format: Hogg, J. D. and Scott, J. A. 2010. A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems. ACM Trans. Math. Softw. 37, 2, Article 17 (April 2010), 24 pages. DOI = 10.1145/1731022.1731027 http://doi.acm.org/10.1145/1731022.1731027

Work supported by EPSRC grants EP/F006535/1 and EP/E053551/1.
Authors’ addresses: Computational Science and Engineering Department, Rutherford Appleton Laboratory, Chilton, OX11 0QX, U.K.; email: {jonathan.hogg, jennifer.scott}@stfc.ac.uk.
© 2010 ACM 0098-3500/2010/04-ART17 $10.00 DOI 10.1145/1731022.1731027


1. INTRODUCTION

A common task in scientific software packages is solving linear systems

Ax = b, (1)

where A is a large sparse symmetric matrix and b is the known right-hand side. There are two common approaches to the solution of such systems:

—Direct methods. These are generally variants of Gaussian elimination, involving a factorization PAP^T → LDL^T of the system matrix A, where L is a unit lower triangular matrix, D is a block diagonal matrix (with 1 × 1 and 2 × 2 blocks), and P is a permutation matrix. The solution process is completed by performing forward and then backward substitutions (i.e., by first solving a lower triangular system and then an upper triangular system). Direct methods are popular because, when properly implemented, they are generally robust and achieve a high level of accuracy, making them suitable for use as general-purpose black-box solvers for a wide range of problems. The main limitation of direct methods is that the memory required normally increases rapidly with problem size.

—Iterative methods. These involve some iterative scheme and are often based on using Krylov subspaces of A. In general, their performance is dependent upon the availability of an appropriate preconditioner. For large-scale problems, a carefully chosen and tuned preconditioned iterative method will often run significantly faster than a direct solver and will require far less memory; indeed, for very large problems, an iterative method is often the only available choice. Unfortunately, for many of the “tough” systems that arise from practical applications, the difficulties involved in finding and computing a good preconditioner can make iterative methods infeasible.

In this article, we are concerned with using a direct method to obtain an approximate solution to (1) and then applying an iterative method to refine the solution to improve its accuracy. In other words, we use the direct factorization as a preconditioner for an iterative solver. All our results are obtained using multifrontal solvers, but much of our work will apply equally to left- and right-looking supernodal solvers. Modern direct solvers for sparse matrices make extensive use of Level 3 BLAS kernels [Dongarra et al. 1990] in their aim to achieve close to dense performance on modern cache-based architectures (see, e.g., Gould et al. [2007] for an overview of currently available sparse symmetric direct solvers). But direct solvers also perform sparse scatter operations (these involve taking the entries of a dense vector and putting, or scattering, them into a sparse vector). While the speed of the BLAS operations is limited largely by latency and the flop rate, the scatter operations are limited more by memory bandwidth and latency. With the recent emergence of multicore processors and with future chips likely to have ever larger numbers of cores, the data transfer rate and memory latency are expected to become an ever tighter constraint [Graham et al. 2004].
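The scatter operation just described is simple but memory-bound. The following minimal Fortran 95 sketch is our own illustration of it (the names, sizes, and index list are made up; this is not code from MA57 or HSL MA77):

program scatter_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: dense(3)     ! dense update vector (e.g., a frontal column)
  real(dp) :: sparse(8)    ! destination vector with a sparse pattern
  integer  :: map(3)       ! rows of sparse that receive the updates
  integer  :: i

  dense  = (/ 1.0_dp, 2.0_dp, 3.0_dp /)
  map    = (/ 2, 5, 7 /)
  sparse = 0.0_dp

  ! The scatter: each write uses indirect addressing through map, so the
  ! loop is bound by memory bandwidth and latency rather than the flop rate.
  do i = 1, size(map)
     sparse(map(i)) = sparse(map(i)) + dense(i)
  end do

  print *, sparse          ! entries 2, 5, and 7 now hold 1.0, 2.0, 3.0
end program scatter_demo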

Computing and storing the matrix factor L in single precision (by which we mean working with 32-bit floating-point numbers) rather than in double precision not only offers significant reductions in data movement but gives storage savings. In the dense case, L is the same size as the original matrix so that, assuming the original matrix A is retained (it can be overwritten by L if it is no longer needed), holding L in single precision reduces the total memory by 25%. For a sparse direct solver, the savings in its memory requirements are closer to 50% since the storage is dominated by that required for L which, while sparse, generally contains many more entries than A. As a result, using single precision allows the sparse solver to be used to solve larger problems (or those with denser factors). Single precision is potentially particularly advantageous for an out-of-core sparse direct solver (i.e., one that stores the factors and possibly some of the work arrays in files) because the amount of disk access is also approximately halved.

In addition to the advantages of memory savings and reduced data movement, single-precision arithmetic is currently more highly optimized (and hence faster) than double-precision computation on a number of architectures, such as commodity Intel chips, Cell processors, and general-purpose computing on graphics cards (GPGPU). Buttari et al. [2007a] reported differences as great as a factor of 10 in speed. Thus it is highly advantageous to carry out as much computation as possible on these chips using single-precision arithmetic.

In general, direct solvers use double-precision arithmetic, although some packages (including those from the HSL mathematical software library [HSL 2007]) also offer a single-precision version. The accuracy required when solving system (1) is application dependent. If the required accuracy is less than about 10^-5 (which may be all that is appropriate if, e.g., the problem data is not known to high accuracy), single precision may often be used. However, users frequently request greater accuracy or, if the problem is very ill conditioned, higher precision may be necessary to obtain a solution that is accurate to at least a modest number of significant figures. In such cases, the double-precision version of the solver has traditionally been used. Motivated by the advantages of using single precision on modern architectures, recent studies [Arioli and Duff 2008; Buttari et al. 2007b, 2008] have shown that it may be possible to use a matrix factorization computed using the single-precision version of a direct solver as a preconditioner for a simple iterative method that is used to regain higher precision.

Our aim is to design and develop a library-quality mixed-precision sparse solver for the solution of symmetric (possibly indefinite) linear systems and to demonstrate its performance on problems from practical applications. Our mixed-precision solver, which is included within HSL, is called HSL MA79. HSL MA79 is a serial code that is built upon the HSL multifrontal solvers MA57 [Duff 2004] and HSL MA77 [Reid and Scott 2008, 2009b]. HSL MA77 is an out-of-core solver that is specifically designed for solving large problems. By employing it, HSL MA79 can also be used as an out-of-core solver, enabling it to solve much larger problems than would otherwise be possible. Working out-of-core can add a significant overhead to the time required to factorize and then solve a linear system. In our numerical experiments, we consider whether the savings achieved by performing the factorization in single precision and storing the single-precision factor data in files can offset the time required by the additional solves (each of which involves reading the factor data from disk) that are needed by the iterative procedure used to recover double-precision accuracy. In addition to providing users with a mixed-precision solver that is efficient (in terms of both memory requirements and computation times), portable, and easy to use, our main contribution in this article is to explore how to combine the direct solvers with iterative refinement and with FGMRES [Saad 1994] and, in particular, how to choose and tune the parameters involved to ensure HSL MA79 is robust and suitable for use both as a black-box linear solver and as a tool for experimenting with mixed-precision solution techniques.

This article is organized as follows. We end this introduction by describing our test environment. Section 2 describes the algorithms used in our mixed-precision solver and, using numerical experiments, establishes default values for the parameters involved, extending both the work of Buttari et al. [2008] and the preliminary MATLAB experimental results of Arioli and Duff [2008]. Section 3 describes our Fortran 95 implementation of the mixed-precision solver HSL MA79. Numerical results for HSL MA79 are given in Section 4 and our conclusions are presented in Section 5. The availability of the codes discussed in this article is covered in Section 6.

1.1 Test Environment

All reported experiments in this article are performed on a Dell Precision T5400 with two Intel E5420 quad core processors running at 2.5 GHz backed by 8 GB of RAM. In all our tests, we use the Goto BLAS [Goto and van de Geijn 2008] and the gfortran-4.3 compiler with -O1 optimization. Although our test machine has multiple cores, we used only sequential BLAS, as preliminary experiments with the threaded BLAS seemed to show negligible performance increases (in some cases they gave decreases) in a sparse solver context. All timings were elapsed times in seconds. We worked with two test sets; all but three of the problems were drawn from the University of Florida Sparse Matrix Collection [Davis 2007] and all are symmetric with either real or integer valued entries.

—Test Set 1. Small to medium matrices with n ≥ 1000 and at most 10^7 entries in the upper (or lower) triangular part. This set comprises 330 problems.

—Test Set 2. Medium to large matrices with n ≥ 10000. This set comprises 232 problems.

We note that 170 problems belong to both sets. The problems were held as two test sets because it was more practical to perform a lot of tests on the smaller test problems. Furthermore, MA57 was not able to solve the largest problems in Test Set 2 (because of insufficient memory); for these, the out-of-core facilities offered by HSL MA77 were needed. In all our experiments, we used threshold partial pivoting with the threshold parameter set to u = 0.01 (thus all the test problems were treated as indefinite, even though some are known to be positive definite). Furthermore, we scaled the test problems using the HSL package MC77 (the ∞-norm scaling was used) [Hogg and Scott 2008; Ruiz 2001]. (In our tests, the direct solvers were unable to solve problems GHS indef/boyd1 and GHS indef/aug3d with this choice of scaling so, in these instances, we scaled using MC64 [Duff 2004; Duff and Koster 2001].)

We end this section by summarizing in Table I the IEEE floating-point standard [IEEE 2008]. Of particular interest are the epsilon values, which effectively limit the precision in direct solvers.

Table I. Summary of IEEE 754-2008 Standard, Comparing Single and Double Precision

                               Single                       Double
Sign + exponent + mantissa     32 = 1 + 8 + 23              64 = 1 + 11 + 52
Epsilon (min ε : 1 + ε ≠ 1)    2^-24 ≈ 6.0 × 10^-8          2^-53 ≈ 1.1 × 10^-16
Underflow threshold            2^-126 ≈ 1.2 × 10^-38        2^-1022 ≈ 2.3 × 10^-308
Overflow threshold             (2 − 2^-23) × 2^127          (2 − 2^-52) × 2^1023
                               ≈ 3.4 × 10^38                ≈ 1.8 × 10^308

[Figure 1 appears here: the performance in MFLOPS of SGEMM and DGEMM plotted against the matrix order m = n = k.]

Fig. 1. The performance of single- (SGEMM) and double- (DGEMM) precision matrix-matrix multiplication for a range of sizes of square matrices. These experiments used the Goto BLAS.
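The Table I values can be checked directly using Fortran's numeric inquiry intrinsics; the short program below is our own illustration. Note that the epsilon intrinsic returns the spacing 2^-23 (respectively 2^-52), which is twice the "min ε : 1 + ε ≠ 1" value quoted in the table.

program ieee_limits
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  ! epsilon() gives 2**(-23) and 2**(-52); tiny() and huge() give the
  ! underflow and overflow thresholds of Table I.
  print *, 'single: eps =', epsilon(1.0_sp), ' tiny =', tiny(1.0_sp), &
           ' huge =', huge(1.0_sp)
  print *, 'double: eps =', epsilon(1.0_dp), ' tiny =', tiny(1.0_dp), &
           ' huge =', huge(1.0_dp)
end program ieee_limits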

2. ALGORITHM

As we have already observed, single-precision operations can be performed much faster than double-precision ones, speeding up core operations (this was explained in detail in Buttari et al. [2007b]). This is illustrated in Figure 1. Here we show the performance of the Level 3 BLAS kernel for matrix-matrix multiplication in single precision (SGEMM) and in double precision (DGEMM) for square matrices of order up to 1000. Since GEMM is used extensively within modern sparse direct solvers (including MA57 and HSL MA77), this demonstrates the potential advantage of performing the factorization using single precision. Figure 2 shows how this translates into a performance gain for the factorization phase of MA57 (here the test set comprises the subset of Test Set 1 problems that take at least 0.01 s to factorize in double precision on our test machine). While we do not see gains of quite the factor of 2 that was achieved for SGEMM, we do see worthwhile improvements on the larger problems, that is, those taking longer than about 1 s to factorize. Here GEMM operations dominate the factorization time, whereas, on the smaller problems, integer operations as well as bookkeeping are more dominant and for such problems there may be little reward in terms of computation time in pursuing a mixed-precision approach.

[Figure 2 appears here: time(double)/time(single) plotted against time(double) on a logarithmic scale.]

Fig. 2. The ratios of the times required by the factorization phase of the sparse direct solver MA57 when run in single and double precision (the problems are a subset of Test Set 1).

In our early runs of MA57, floating-point underflows caused single precision to underperform double. This was because of the way modern Intel CPUs handle such events. We were able to reduce this problem by setting a processor flag to flush denormals to zero, avoiding a cascade of slow operations, although there is still a potential speed drop if a significant number occur (indeed, a similar situation can occur in double precision but is less common because of the much larger exponent range). There was also one test matrix (not shown in Figure 2) for which the ratio of the single-precision to the double-precision factorization time was well in excess of 2. This was an indefinite problem and although in both cases the initial pivot sequence was the same, during the factorization this sequence was modified to maintain numerical stability more in the double- than in the single-precision case, resulting in a higher flop count and a denser factor.


Algorithm 1. Mixed-precision solver.

Input: Requested accuracy γ
Set prec = single
do
    Factorize PAP^T as LDL^T using precision prec
    Solve Ax = b and compute β
    if β ≤ γ then exit
    Perform iterative refinement (Algorithm 2)
    if β ≤ γ then exit
    Perform FGMRES (Algorithm 3)
    if β ≤ γ then exit
    if prec = single then
        Set prec = double
    else
        Set error flag and exit
    end if
end do

Our aim was to perform a single-precision factorization and then, if necessary, use double-precision postprocessing to recover a solution to the desired precision. For maximum efficiency, we wanted to try the cheapest algorithm first and, only if this failed, did we want to resort to applying more computationally expensive alternatives. In the worst case, we would fall back to performing a double-precision factorization.

Setting r = b − Ax, we define the norm of the scaled residual (the backward error) to be

    β = ‖r‖∞ / (‖A‖∞‖x‖∞ + ‖b‖∞).    (2)

We require that the computed solution x satisfies β ≤ γ, where γ is a parameter chosen by the user. If β ≤ γ is satisfied, we say that the requested accuracy has been achieved. This provides a stopping criterion in our algorithms. We note that, in the case where we scale A prior to the factorization, the unscaled matrix is retained for any iterative method used to refine the solution and (2) is used with the unscaled A for checking the accuracy (see also Hogg and Scott [2008]).
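For concreteness, the following Fortran 95 sketch evaluates (2); the routine is ours and, for clarity, takes a dense A (HSL MA79 of course works with the sparse matrix). The ∞-norm of A is its maximum absolute row sum.

module backward_error
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! beta = ||r||_inf / (||A||_inf ||x||_inf + ||b||_inf), equation (2)
  function beta(a, x, b, r) result(be)
    real(dp), intent(in) :: a(:,:), x(:), b(:), r(:)
    real(dp) :: be
    be = maxval(abs(r)) / &
         (maxval(sum(abs(a), dim=2))*maxval(abs(x)) + maxval(abs(b)))
  end function beta
end module backward_error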

Algorithm 1 summarizes the basic mixed-precision approach. The factorization is performed in single precision; then, if the scaled residual β exceeds γ, mixed-precision iterative refinement (Algorithm 2) is performed. If the requested accuracy has not yet been achieved, mixed-precision FGMRES (Algorithm 3) is used and, finally, if β is still too large, a switch is made to double precision and the computation restarted. In the next two sections, we discuss the iterative refinement and FGMRES steps.

2.1 Iterative Refinement

Iterative refinement is a simple first-order method used to improve a computed solution of (1); it is outlined in Algorithm 2. Here βk is the norm of the scaled residual (2) on the kth iteration and i maxitr is the maximum number of iterations. The system Ax = b and the correction equation Ayk+1 = rk are solved using the computed single-precision factors of A. There are a number of ways this could be done depending on what data type conversions we perform. In Algorithm 2, “solve Ax = b using single precision” means that we transform the input right-hand side vector b from double to single precision, use the single-precision factorization of A to solve for x in single precision, and then transform x from single to double precision. This allows us to use an unmodified single-precision version of the solution phase of our direct solver. The transformations between single- and double-precision vectors are done by taking copies that are transitory and exist only for the duration of the solve; all vectors are otherwise kept in double precision.

Algorithm 2. Mixed-precision iterative refinement.

Input: Single-precision factorization of A, double-precision right-hand side b, requested accuracy γ, minimum reduction δ, and i maxitr
Solve Ax1 = b (using single precision)
Compute r1 = b − Ax1 and β1 (using double precision)
Set k = 1
do while (βk > γ and k < i maxitr)
    Solve Ayk+1 = rk (using single precision)
    Set xk+1 = xk + yk+1 (using double precision)
    Compute rk+1 = b − Axk+1 and βk+1 (using double precision)
    if βk+1 > δβk or ‖rk+1‖∞ ≥ 2‖rk‖∞ then Set error flag and exit (stagnation)
    Set k = k + 1
end do
x = xk

Algorithm 3. Mixed-precision FGMRES right-preconditioned by a direct solver with adaptive restarting. Norms here are 2-norms.

Input: Single-precision factorization of A, double-precision right-hand side b, γ, δ, f maxitr, restart, max restart
Solve Ax = b
Compute r = b − Ax and β
Initialise j = 0; βold = β; xold = x
do while (β > γ and j < f maxitr)
    βold = β
    Initialise v1 = r/‖r‖, y0 = 0, k = 0
    do while (‖ ‖r‖e1 − Hk yk ‖ ≥ γ(‖A‖‖x‖ + ‖b‖) and k < restart)
        k = k + 1 (Increment restart counter)
        j = j + 1 (Increment iteration counter)
        Solve Azk = vk (using the computed factors) and compute w = Azk
        Orthogonalize w against v1, . . . , vk to obtain a new w. Set vk+1 = w/‖w‖
        Form Hk, a trapezoidal basis for the Krylov subspace spanned by v1, . . . , vk
            (Full details of this step may be found in Saad [2003])
        yk = arg min_y ‖ ‖r‖e1 − Hk y ‖ (Minimize residual over the Krylov subspace)
    end do
    Set Zk = [z1 · · · zk]
    Compute x = x + Zk yk, r = b − Ax and compute new β
    if β ≥ δβold
        restart = 2 × restart
        if restart > max restart then Set error flag and exit
    end if
    if β > βold
        x = xold
    else
        xold = x; βold = β
    end if
end do

Skeel [1980] proved that, to reduce the scaled residual to a given precision, it is sufficient to compute the residual and the correction in that precision. However, since we wished to obtain residuals with double-precision accuracy using factors computed in single precision, we performed the forward and back substitutions (which we refer to as the solves throughout the rest of this article) in single precision and computed the residuals and corrected solution xk+1 in double precision. This mixed-precision version of iterative refinement was also used by Buttari et al. [2007b] and Buttari et al. [2008], while Demmel et al. [2009] used a similar scheme for overdetermined least-squares problems, further developing bounds on the forward error.
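To make the scheme concrete, here is a small, self-contained Fortran 95 sketch of mixed-precision iterative refinement for a dense symmetric positive definite matrix, standing in for the sparse factorization with the LAPACK Cholesky routines SPOTRF/SPOTRS. The 3 × 3 example, the names, and the simplified stagnation test (the δ check only) are ours.

program mixed_ir
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  integer, parameter :: n = 3, maxitr = 10
  real(dp), parameter :: gamma = 5.0e-15_dp, delta = 0.3_dp
  real(dp) :: a(n,n), b(n), x(n), r(n), anorm, beta_k, beta_new
  real(sp) :: af(n,n), rhs(n)
  integer  :: info, k

  a = reshape((/ 4.0_dp, 1.0_dp, 2.0_dp, &
                 1.0_dp, 5.0_dp, 3.0_dp, &
                 2.0_dp, 3.0_dp, 6.0_dp /), (/ n, n /))
  b = (/ 1.0_dp, 2.0_dp, 3.0_dp /)
  anorm = maxval(sum(abs(a), dim=2))

  af = real(a, sp)                       ! single-precision copy of A
  call spotrf('L', n, af, n, info)       ! single-precision Cholesky
  if (info /= 0) stop 'factorization failed'

  rhs = real(b, sp)                      ! solve Ax1 = b in single precision
  call spotrs('L', n, 1, af, n, rhs, n, info)
  x = real(rhs, dp)

  beta_k = huge(1.0_dp)
  do k = 1, maxitr
     r = b - matmul(a, x)                ! residual in double precision
     beta_new = maxval(abs(r)) / (anorm*maxval(abs(x)) + maxval(abs(b)))
     if (beta_new <= gamma) exit         ! requested accuracy achieved
     if (beta_new > delta*beta_k) exit   ! stagnation detected
     beta_k = beta_new
     rhs = real(r, sp)                   ! correction solve in single precision
     call spotrs('L', n, 1, af, n, rhs, n, info)
     x = x + real(rhs, dp)               ! update in double precision
  end do
  print '(a, es10.2)', ' final backward error: ', beta_new
end program mixed_ir

Compiled against any LAPACK library (for example, gfortran mixed_ir.f90 -llapack), this recovers a backward error at the double-precision level from a single-precision factorization.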

Iterative refinement generally decreases the residual significantly for a number of iterations before stagnating (i.e., reaching a point after which little further accuracy is achieved), although for some problems (including the test problems HB/bcsstm27, Cylshell/s3rmq4m1, and GHS psdef/s3dkq4m2 that were considered in Arioli and Duff [2008]), a large number of iterations are needed before any substantial reduction in the residual is achieved. To detect stagnation (and thus avoid performing unnecessary solves), we employed a minimum improvement parameter δ. A large δ allows the iterative refinement to continue until the maximum number of iterations has been performed. This increases the likelihood of convergence at the expense of carrying out additional iterations for problems that have stagnated before reaching the requested accuracy γ. The number of additional iterations can be reduced or eliminated by choosing a small δ. The condition ‖rk+1‖∞ ≥ 2‖rk‖∞ detects numerical issues where xk+1 explodes but the residual βk+1 remains small due to scaling by xk+1. For different values of δ, Table II reports the number of problems belonging to Test Set 1 that achieved the requested accuracy when factorized using MA57 in single precision and then corrected using mixed-precision iterative refinement. We see that values in the range [0.05, 0.5] have a similar success rate of just under 80%. We chose as our default δ = 0.3 as this provides a good compromise between the number of problems that converged (265) and minimizes the wasted iterations on the remainder—62% of the problems that failed with this δ used the same number of solves as for δ = 0.001, and only 4 of the 65 required more than two additional iterations before stagnation was recognized.

Table II. Number of Problems in Test Set 1 for Which Iterative Refinement Achieved the Requested Accuracy (γ = 5 × 10−15) Using a Range of Values of the Improvement Parameter δ (i maxitr = 10). (Results for our default setting, δ = 0.3, are marked *.)

δ           0.001   0.01   0.05   0.07   0.08   0.1   0.2   0.3*   0.4   0.5    ∞
Converged     194    227    252    256    258   259   264    265   265   265   268
Failed        136    103     78     74     72    71    66     65    65    65    62


We remark that the package MA57 includes an option (which we did not use) to perform iterative refinement (using the same precision as the factorization) and, by default, it uses δ = 0.5 (see also Demmel et al. [2006] and Higham [1997]).

Our implementation of iterative refinement also offers an option to terminate once a chosen number i maxitr of iterations has been performed. An upper limit on the maximum number of iterations can be established by considering the following example. Assume the initial scaled residual is β = 10−7 and the default improvement parameter δ = 0.3 is used. If stagnation has not occurred, after 13 iterations the scaled residual must be less than 1.6 × 10−14 and, after 14 iterations, less than 4.8 × 10−15. Based on our experiments, we set the default value to i maxitr = 10 (note that for most of our test examples we found that either the requested accuracy was achieved or stagnation was recognized before this limit was reached).
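The numbers in this example follow from the geometric bound implied by the stagnation test: if every iteration reduces the scaled residual by at least the factor δ, then, written out,

\[
  \beta_k \le \delta^{k}\,\beta_0 ,
  \qquad\text{so}\qquad
  \beta_{13} \le (0.3)^{13} \times 10^{-7} \approx 1.6\times 10^{-14},
  \qquad
  \beta_{14} \le (0.3)^{14} \times 10^{-7} \approx 4.8\times 10^{-15}.
\]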

2.2 Preconditioned FGMRES

For our examination of FGMRES, we used the 62 problems from Test Set 1 that failed to achieve the requested accuracy using iterative refinement with any δ. We call this the Reduced Test Set 1.

In Algorithm 1, FGMRES [Saad 1994] refers to a right-preconditioned variant of FGMRES. Arioli et al. [2007] have shown that, in cases where iterative refinement fails, FGMRES may succeed and is more robust than either iterative refinement or GMRES. Arioli and Duff [2008] proved that a mixed-precision computation, where the matrix factorization is computed in single precision and the FGMRES iteration in double precision, gives a backward stable algorithm. We note that their result is given for the restart counter k in Algorithm 3 sufficiently large and so is not of practical use in our case because we will only use a small number of iterations.

Our variant of FGMRES, shown as Algorithm 3, is essentially that given in Arioli and Duff [2008] but additionally uses an adaptive restart parameter. Here f maxitr is the maximum number of iterations and e1 denotes the first column of the identity matrix. Algorithm 3 uses double precision throughout except for the solution of the systems involving A; for these systems, we have the ability to perform the forward and backward substitutions in either single or double precision, as we detail below. Our adaptive restarting strategy relies on a similar concept to the minimum improvement parameter in iterative refinement—we expect to reduce the backward error in the outer iterations and, if the reduction is too small, we increase the restart parameter (up to a specified maximum max restart). If there is no reduction in the backward error β (although in exact arithmetic the convergence of FGMRES is nonincreasing, in finite-precision arithmetic the backward error may increase), we restore the solution from the previous outer iteration before restarting. In our experiments, we compared adaptive restarting with using a fixed restart parameter. We found that a small initial value for the adaptive restart parameter (typically less than or equal to 4) reduced the number of iterations required to obtain the requested accuracy and enabled us to solve some problems that failed to converge with a fixed restart; the strategy had little effect for larger initial values. All FGMRES results in this article are hence based on this adaptive restarting algorithm.

An alternative to the “solve using single precision” described in Section 2.1 is to convert the computed factors from single to double precision and then use the double-precision version of the solution phase of our direct solver. The additional memory needed for this can be limited by writing a special-purpose single-to-double solve routine that takes the double-precision right-hand side b together with the single-precision factorization of A and returns the computed solution x in double precision. To avoid holding the factorization in both single and double precision, the routine converts the single-precision entries of L to double precision only as they are required and discards them afterwards. For our multifrontal solvers, the additional double-precision storage needed is at most that required for the largest frontal matrix. We refer to this technique as solve using double precision.
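A dense sketch of this idea is given below; it is our own illustration, not the HSL routine. The factor is held in single precision and each column is promoted to double precision only as it is used, so a full double-precision copy of L is never formed.

subroutine forward_sub_sd(n, lsingle, x)
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  integer, intent(in)     :: n
  real(sp), intent(in)    :: lsingle(n,n)  ! unit lower triangular factor L
  real(dp), intent(inout) :: x(n)          ! right-hand side in, solution out
  real(dp) :: col(n)                       ! one promoted column at a time
  integer  :: j

  ! Forward substitution Lx = b with L unit lower triangular: the update
  ! arithmetic is all in double precision; only the storage of L is single.
  do j = 1, n - 1
     col(j+1:n) = real(lsingle(j+1:n, j), dp)   ! promote column j on the fly
     x(j+1:n) = x(j+1:n) - col(j+1:n)*x(j)      ! double-precision update
  end do
end subroutine forward_sub_sd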

Performing the solves using single precision has obvious speed advantages per iteration, but, if solving using double precision results in a lower iteration count, it may be faster overall. Our experiments using a range of values for the adaptive restart parameter showed that, in all cases, using double precision reduces the total number of solves required. Table III attempts to capture to what extent the double-precision approach is better. Shown here are

—the number of problems that fail to converge using either single- or double-precision solves;

—the number of problems that converge only using double-precision solves;

—the arithmetic mean of the number of extra solves needed when solving using single precision, that is,

    (1/|P|) Σ_{i∈P} ( Solves_i(single) − Solves_i(double) ),    (3)

where Solves_i(double) (Solves_i(single)) is the number of solves used for problem i when using double- (single-) precision solves, and P is the set of problems on which both single- and double-precision solves converge to the requested accuracy;

—the geometric mean of the ratio of the number of solves using single precision to the number using double precision, that is,

    ( Π_{i∈P} Solves_i(single) / Solves_i(double) )^(1/|P|).    (4)

Table III. Comparison of the Performance of FGMRES Using Single- and Double-Precision Solves Following a Single-Precision Factorization for a Range of Restart Parameters on Reduced Test Set 1 (δ = 0.3, f maxitr = 128, max restart = 128)

restart =                                                        1     2     4     8    16
Problems failed for both single- and double-precision solves    23    23    23    23    23
Problems solved using double- but not single-precision solves    5     5     5     5     5
Average difference in number of solves (see (3))              14.9  15.6  16.2  12.9  12.9
Average ratio of number of solves (see (4))                    2.8   2.5   3.1   2.1   2.1


A possible explanation of the good performance of the solves using double precision is that the norms used in FGMRES are 2-norms rather than the ∞-norms used elsewhere in this article, and are prone to larger error accumulation as a result, which higher precision may help to counter.

We comment that a constant number of failing problems regardless of the restart value is what we expect from our adaptive restarting procedure.

The reduction in the number of solves when using double precision has to be set against the increased time per solve compared to single precision. On our test computer, the time for a solve in double precision is approximately twice that in single precision; hence if the ratio of the number of solves in single precision to those in double is more than 2 (as is the case in the last line of Table III) then double precision is faster, while also allowing us to solve more problems. As a result, in the remainder of this article we use double-precision solves for FGMRES. Note that iterative refinement will still use single-precision solves as similar experiments showed there to be no significant benefit in using double precision. In particular, although more iterations were carried out, iterative refinement still eventually stagnates on the problems belonging to the Reduced Test Set 1.

In Table IV, we report the number of solves performed within FGMRES for a range of values for the adaptive restart parameter on the 39 problems in the Reduced Test Set 1 for which FGMRES was successful. The results indicate that restart = 4, 8, or 16 was generally the best choice; restart = 4 was taken to be the default. We observe that for some problems a higher restart parameter resulted in a larger number of iterations. This was because the termination conditions were only tested when the algorithm was restarted; higher values of restart tested for termination less frequently.

Table IV. Number of Solves Performed within FGMRES for Reduced Test Set 1 Using a Range of Values for restart Following Unsuccessful Iterative Refinement. (Results are shown for δ = 0.3, f maxitr = 64, and max restart = 64. The 23 problems that failed for all values of restart are not shown.)

                                   restart
Problem                      1    2    4    8   16
Boeing/bcsstk35             49   48   64   48   48
Boeing/bcsstk38              3    4    4    8    8
Boeing/crystk01             15   14   12    8    8
Boeing/crystk02              3    6    4    8    8
Boeing/crystk03              3    6    4    8    8
Boeing/msc01050              7    6    4    8    8
Cunningham/qa8fk             3    6    4    8    8
Cylshell/s3rmq4m1           11   10    8    8    8
Cylshell/s3rmt3m1           17   12    4    8    8
DNVS/ship 001               48   48   48   64   64
GHS indef/cont-201          25   24   20   16   16
GHS indef/cvxqp3            23   22   32   24   24
GHS indef/ncvxbqp1          11    8    8    8    8
GHS indef/ncvxqp5           10   10    8    8    8
GHS indef/sparsine          14   14   12   16   16
GHS indef/stokes128          1    2    4    8    8
GHS psdef/oilpan            11   10    8    8    8
GHS psdef/s3dkq4m2          40   16   16    8    8
GHS psdef/s3dkt3m2          17   16   16   16   16
GHS psdef/vanbody           31   30   28   24   24
Gset/G33                     2    2    4    8    8
HB/bcsstm27                 13   12    8    8    8
Koutsovasilis/F2             3    6    4    8    8
ND/nd3k                      2    2    4    8    8
Oberwolfach/gyro            10    8    8    8    8
Oberwolfach/gyro k          10    8    8    8    8
Oberwolfach/t2dah            7    6    4    8    8
Oberwolfach/t2dah a          7    6    4    8    8
Oberwolfach/t2dal            7    6    4    8    8
Oberwolfach/t2dal a          7    6    4    8    8
Oberwolfach/t2dal bci        7    6    4    8    8
Oberwolfach/t3dh             3    6    4    8    8
Oberwolfach/t3dh a           3    6    4    8    8
Oberwolfach/t3dl             3    6    4    8    8
Oberwolfach/t3dl a           3    6    4    8    8
Pajek/Reuters911            15   12   12    8    8
Schenk IBMNA/c-56           31   30   28   16   16
Schenk IBMNA/c-62           31   30   28   24   24
Simon/olafu                 16   12   16    8    8

3. IMPLEMENTING THE MIXED-PRECISION STRATEGY

In this section, we discuss the design and development of our new mixed-precision solver HSL MA79. The code is written in Fortran 95 and, at its heart, uses the HSL direct solvers MA57 [Duff 2004] and HSL MA77 [Reid and Scott 2008, 2009b]. We start by giving a brief overview of these solvers, highlighting some of their key features that are important for HSL MA79.

3.1 MA57

MA57 is designed to solve sparse symmetric linear systems (1); the system matrix may be either positive definite or indefinite. The multifrontal method is used [Duff and Reid 1983]. A detailed discussion of the design of MA57 and its performance is given by Duff [2004]. Relevant work on the pivoting and scaling strategies available within MA57 is given by Duff and Pralet [2005, 2007].

In common with other HSL solvers, MA57 is available in both double- and single-precision versions. It offers a range of options, including solving for multiple right-hand sides, computing partial solutions, error analysis, a matrix modification facility, and a stop and restart facility. Although the default settings for the control parameters should work well in general, there are several parameters available to enable the user to tune the code for his or her problem class or computer architecture.

Like many modern symmetric direct solvers, MA57 has three distinct phases: an analyse phase that works only with the sparsity pattern to set up data structures for the factorization, the numerical factorization phase that uses these data structures to compute the matrix factor, and a solve phase that may be called any number of times after the factorization is complete to solve repeatedly for different right-hand sides.


The efficiency of a direct method, in terms of both the storage needed and the work performed, is dependent on the order in which the elimination operations are performed, that is, the order in which the pivots are selected. For symmetric matrices that are positive definite, the pivotal sequence chosen using the sparsity pattern of A alone can be used during the factorization without modification. For symmetric indefinite problems, a tentative pivot sequence is chosen based upon the sparsity pattern (treating zeros on the diagonal as entries) and this is modified if necessary (possibly to include 2 × 2 pivots) during the factorization to maintain numerical stability. MA57 offers the user a number of ordering options, including variants of the minimum degree algorithm [Amestoy et al. 1996, 2004] and multilevel nested dissection through an interface to the well-known MeTiS package [Karypis and Kumar 1999]. Based on the study by Duff and Scott [2005], by default, MA57 chooses between MeTiS and approximate minimum degree (avoiding problems with dense rows [Amestoy et al. 2007; Dollar and Scott 2009]).

3.2 HSL MA77

HSL MA77 [Reid and Scott 2008, 2009b] is also a multifrontal solver designed to solve positive definite and indefinite sparse symmetric systems. It too has separate analyse, factorize, and solve phases. A fundamental difference between MA57 and HSL MA77 is that the latter is an out-of-core solver, that is, it is designed to allow the matrix data, the computed factors, and some of the intermediate work arrays to be held in files. The advantage of this is that it enables much larger problems to be solved.

Storing data in files potentially adds a significant overhead to the time required to factor and then solve the linear system. To minimize this overhead, Reid and Scott [2009a] have written a set of Fortran subroutines that manage a virtual memory system so that actual input/output occurs only when really necessary. These routines are available within HSL as the package HSL OF01 and are used by HSL MA77 to efficiently handle all input and output. Results reported by Reid and Scott [2009b] illustrated the effectiveness of HSL OF01 in minimizing the out-of-core overhead; they also presented results for problems that were too large for MA57 to solve on their test machine. On problems that MA57 is able to solve, the performance of HSL MA77 is favorable: on some problems it is faster than MA57, while on others the converse is true. The main exception is the solve time. The factor L has to be read from file once for the forward substitution and then again (in reverse order) for the back substitution. This is independent of the number of right-hand sides. Thus, for a single right-hand side (or small number of right-hand sides), the solve phase of HSL MA77 can be significantly more expensive than the corresponding phase of MA57, although for a large number of right-hand sides there is a smaller relative difference.

As with MA57, HSL MA77 offers a range of options. These include allowing the files to be replaced by arrays (so that, if there is sufficient space, the data is all stored in main memory). The user can specify the initial sizes of these arrays and an overall limit on their total size. If an array is found to be too small, the code will continue using a combination of files and arrays. Another important option allows the user to specify whether he or she is running the code on a 32-bit or 64-bit architecture. On a 64-bit machine, it is possible to run problems with much larger frontal matrices (on a 32-bit machine, the maximum front size is limited to approximately 16,000).

We remark that HSL MA77 requires the user to input the pivot order. This can be computed by calling MeTiS or the HSL package MC47 that implements an approximate minimum degree algorithm; alternatively, the analyse phase of MA57 may be used to select the pivot sequence. Each of these alternatives computes a pivot order of 1 × 1 pivots; 2 × 2 pivots may be selected during the factorization. In some cases, it may be advantageous to specify a tentative pivot sequence that includes 2 × 2 pivots; HSL MA77 allows the user to do this. A pivot order that contains 2 × 2 pivots may be found using the package HSL MC68, and this remains an area of active research.

3.3 Design of HSL MA79

HSL MA79 has been designed to provide a robust and efficient implementation of Algorithm 1 for both positive-definite and indefinite systems, using existing HSL packages as its main building blocks. In particular, it uses the direct solvers MA57 and HSL MA77 and the implementation of FGMRES offered by MI15, together with the scaling packages MC30, MC64, and MC77. HSL MA79 includes a range of options but it is not our intention to incorporate all possibilities available within the direct solvers MA57 and HSL MA77. Instead, we have designed a general-purpose package that is straightforward to use and, by restricting the number of parameters that have to be set, we do not require the user to have a detailed knowledge and understanding of all the different components of the algorithms used.

The following procedures are available to the user.

(1) MA79 factor solve accepts the matrix A, the right-hand sides b, and the requested accuracy. Based on the matrix, it selects whether to use MA57 or HSL MA77; by default, single precision is selected as the initial precision. The code then implements Algorithm 1. The matrix factorization is retained for further solves.

(2) MA79 refactor solve uses the information returned from a previous call to MA79 factor solve to reduce the time required to factorize and solve Ax = b. The sparsity pattern of A must be unchanged since the call to MA79 factor solve; only the numerical values of the entries of A and b may have changed. By default, the precision for the factorization is chosen based on that used by MA79 factor solve. The matrix factorization is retained for further solves.

(3) MA79 solve uses the computed factors generated by MA79 factor solve (or MA79 refactor solve) to solve further systems Ax = b. Multiple calls to MA79 solve may follow a call to MA79 factor solve (or MA79 refactor solve).

(4) MA79 finalize should be called after all other calls are complete for a problem. It deallocates the components of the derived data types and discards the matrix factors.

We now discuss the first three of the preceding procedures in more detail (the finalize routine needs no further explanation).

3.3.1 MA79 factor solve. On the call to MA79 factor solve, the user must supply the entries in the lower triangular part of A in compressed sparse column (CSC) format and may optionally supply a pivot order. If no pivot order is supplied, one is computed using the analyse phase of MA57. Statistics on the forthcoming factorization (such as the maximum frontsize, the number of flops, and the number of entries in the factor) are also computed. These are exact for the factorization phase of MA57 if the problem is positive definite; otherwise, they are lower bounds for MA57 and estimates of lower bounds for HSL MA77. By default, the statistics are used to choose the direct solver. MA57 is selected unless one or more of the following holds.

(1) It is not possible to allocate the arrays required by MA57. (We allow the user to specify a maximum amount of memory, and, if the predicted memory usage for MA57 exceeds this, we use HSL MA77.)

(2) The matrix is positive definite with a maximum frontsize greater than 1500.

(3) The matrix is not positive definite and the user-supplied pivot sequence includes 2 × 2 pivots.

(4) The user has chosen HSL MA77.

The choice in list item (2) was made on the basis of numerical experimentation (see Reid and Scott [2009b]). The main motivation for selecting MA57 as the default solver is that HSL MA77 is primarily designed as an out-of-core solver and this incurs an overhead (which may be significant if the problem is not very large). Furthermore, the process of refactorizing in double precision is also more expensive for HSL MA77 because it is necessary to reload the matrix data and repeat its analysis phase (this can be avoided for MA57).

At the start of the factorization with MA57, HSL MA79 allocates the required arrays based on the analyse statistics. If larger work arrays are later needed because of delayed pivots, HSL MA79 attempts to use the stop and restart facility offered by MA57 but, if there is insufficient memory to allocate sufficiently large arrays, HSL MA79 switches to HSL MA77. This may add a significant extra cost as MA77 analyse must be called and the factorization completely restarted.

By default, HSL MA79 works in mixed precision following Algorithm 1 and its development through Section 2. The user may, however, choose to perform the whole computation in double precision. In this case, HSL MA79 provides a convenient interface to MA57 and HSL MA77 (albeit without the full flexibility and options offered by each of these packages individually). This facilitates comparisons between mixed and double precision. Working in double precision throughout may be advisable for very ill-conditioned systems or for very large problems for which repeated calls to the solve routine MA79 solve are expected.


HSL MA79 includes a number of scaling options, provided by the HSL packages MC30 (Curtis and Reid’s [1972] method minimizing the sum of logarithms of the entries), MC64 (symmetrized scaling based on maximum matching by Duff and Koster [Duff and Koster 2001; Duff and Pralet 2005]), and MC77 using the 1- or ∞-norms (iterative process of simultaneous norm equilibration [Ruiz 2001]). The default is the ∞-norm equilibration scaling from MC77 because recent tests [Hogg and Scott 2008] on a large number of problems from practical applications have shown that, in general, this provides a good fast scaling.

HSL MA79 offers complete control of the parameters in Algorithms 2 and 3, in addition to the ability to disable any particular method of recovering precision in Algorithm 1 (e.g., the user may specify that the use of iterative refinement is to be skipped). It also supports tuning of the major parameters affecting the performance of the factorization phase, such as the block size used by the dense linear algebra kernels that lie at the heart of the multifrontal algorithm.

By default, the requested accuracy is achieved when β given by (2) is less than a user-prescribed value γ. However, HSL MA79 also allows the user to specify, using an optional subroutine argument on the call to MA79 factor solve, an alternative measure of accuracy. If present, it must compute a function β = f(A, x, b, r) that is compared to γ when testing for termination. For example, it may be used to test for the requested accuracy in the 2-norm or using a component-wise approach.

An important feature of MA79 factor solve is that it returns detailed information on the solution process. This includes which solver was used and the precision, together with details of the matrix factorization (the number of entries in the factor, the maximum frontsize, the number of 2 × 2 pivots chosen, the numbers of negative and zero pivots) and information on the refinement (the number of steps of iterative refinement, the number of FGMRES iterations performed, and the total number of calls to the solution phase of MA57 or HSL MA77). In addition, the full information type or array from the factorization code (MA57B or MA77 factor) is returned to the user.

We note that the user can pass any number of right-hand sides b to MA79 factor solve. In particular, the user can set the number of right-hand sides to zero. In this case, the routine will only perform the matrix factorization in the requested (or default) precision.

3.3.2 MA79 refactor solve. We envisage that a user may want to factorize and solve a series of problems with the same sparsity pattern as the original matrix A but different numerical values. In this case, HSL MA79 takes advantage both of the pivot sequence used within MA79 factor solve and of the experience gained on the initial factorization.

On a call to MA79 refactor solve, the user must input the values of the entries in both the lower and upper triangular parts of the new matrix in CSC format, with the entries in each column in order of increasing row index. This format (which is the format the original matrix is returned to the user in on exit from MA79 factor solve) is required so that HSL MA79 can avoid, before the factorization begins, taking and manipulating additional copies of the matrix (for large problems, this avoidance is important). Having the matrix in this form also has the side benefit of allowing a more efficient matrix-vector product.
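With both triangles held in CSC format, the product y = Ax needs no symmetric indirection. The sketch below is ours (with illustrative array names rather than the HSL MA79 interface): ptr(j) points to the start of column j in the row and val arrays.

subroutine csc_matvec(n, ptr, row, val, x, y)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)   :: n, ptr(n+1), row(*)
  real(dp), intent(in)  :: val(*), x(n)
  real(dp), intent(out) :: y(n)
  integer :: j, k

  y = 0.0_dp
  do j = 1, n
     do k = ptr(j), ptr(j+1) - 1
        y(row(k)) = y(row(k)) + val(k)*x(j)   ! scatter column j into y
     end do
  end do
end subroutine csc_matvec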

3.3.3 MA79 solve. After a call to MA79 factor solve, MA79 solve may be called to solve for additional right-hand sides. If MA57 has been used (or HSL MA77 was run in-core), the cost of each additional solve is generally small but, if HSL MA77 was run with the factors held on disk, the solve time can be significant (see Reid and Scott [2009b]). If the solve is performed at the same time as the factorization, the entries of L are used to perform the forward substitution as they are generated, cutting the amount of data that must be read (and hence the time) for the solve approximately in half.

3.4 Errors and Warnings

HSL MA79 issues two levels of errors: fatal errors that cause the computation to terminate immediately and warnings that are intended to alert the user to what could be a problem but which will not prevent the computation from continuing. In either case, a flag is set (with a negative value for an error and a positive value for a warning) and a message is optionally printed (the user controls the level of diagnostic printing). Examples of fatal errors include a user-supplied pivot order that is not a permutation and calls to routines within the HSL MA79 package that are out of sequence.

A warning is issued if the user-supplied matrix data contains out-of-range indices (these are ignored) or duplicated indices (the corresponding matrix entries are summed). A warning is also issued if the matrix is found to be singular or if MA57 was requested but HSL MA77 is used because of insufficient main memory. Note that more than one warning may be issued. At the end of the computation, a warning is given if the requested accuracy was not obtained after all allowable fall-back options were attempted. In particular, if the factorization has been performed in single precision, the requested accuracy may not be achieved on a call to MA79 solve without resorting to a double-precision factorization. In this case, because this cannot occur within MA79 solve, the user should call MA79 refactor solve, explicitly specifying that the factorization is to be performed in double precision.

Full details of the errors and warnings and of the levels of diagnostic printing are given in the user documentation for HSL MA79.

4. NUMERICAL RESULTS

In this section, we present results obtained using Version 1.1.0 of HSL MA79. This uses Version 3.2.0 of MA57 and Version 4.0.0 of HSL MA77. Our experiments were performed on the machine and test examples described in Section 1. The requested accuracy was β < 5 × 10^-15 and, unless stated otherwise, we used the default settings for all the HSL MA79 parameters (in particular, the parameters chosen in Section 2 are used to control the solution recovery).
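
Here β is the scaled residual of the computed solution x^; we take it to be the usual normwise backward error (the definition we assume from earlier in the article):

    β = ||b - A x^||_inf / ( ||A||_inf ||x^||_inf + ||b||_inf ),

so requesting β < 5 × 10^-15 asks for a solution that is backward stable to close to the double-precision unit roundoff.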

Figures 3 and 4 compare the performance of mixed precision and double precision for Test Sets 1 and 2, respectively, with MA57 selected as the direct solver within HSL MA79 for Test Set 1 and HSL MA77 for Test Set 2.


[Figure 3: scatter plot of the ratio time(double)/time(mixed) (vertical axis, from 0 to 2.5) against time(double) (horizontal axis, logarithmic, from 0.001 to 10000), with separate markers for problems where the requested accuracy was achieved with mixed precision (305 problems) and those where fall-back to double precision was required (25 problems).]

Fig. 3. Ratio of times to solve Equation (1) with HSL MA79 in mixed-precision and double-precision modes on Test Set 1 using MA57 as the solver, with accuracy γ = 5 × 10^-15.

[Figure 4: scatter plot of the ratio time(double)/time(mixed) (vertical axis, from 0 to 2.5) against time(double) (horizontal axis, logarithmic, from 0.01 to 100000), with separate markers for problems where the requested accuracy was achieved with mixed precision (202 problems) and those where fall-back to double precision was required (30 problems).]

Fig. 4. Ratio of times to solve Equation (1) with HSL MA79 in mixed-precision and double-precision modes on Test Set 2 using HSL MA77 as the solver, with accuracy γ = 5 × 10^-15.


Table V. Number of Problems That Exited at Each Stage of Algorithm 1 Implemented as HSL MA79

                                       Test Set 1 (MA57)   Test Set 2 (HSL MA77)
After iterative refinement                265    81%           157    68%
After FGMRES                               40    12%            45    19%
After fall-back to double precision        24     7%            28    12%
Failed                                      1    <1%             2     1%

If the requested accuracy was only achieved by falling back on a double-precision factorization, the mixed-precision time includes the double-precision factorization time. From Figure 3, we see that, if the time taken by HSL MA79 in double precision is less than about 1 s, there is generally little or no advantage in terms of computational time to using mixed precision (in fact, for a number of problems, running in double precision is almost twice as fast as using mixed precision). However, for the larger problems within Test Set 1, mixed precision outperformed double precision by more than 50%. For the problems in Test Set 2 with the out-of-core solver HSL MA77, on our test machine mixed precision is only recommended if the double-precision time is greater than about 10 s. For problems that run more rapidly than this, the savings from the single-precision factorization are not large enough to offset the cost of the additional solves (which, in this case, involve reading data from disk). Of course, if the user is prepared to accept a less accurate solution (i.e., the requested accuracy γ is chosen to be greater than 5 × 10^-15), this will affect the balance between the mixed-precision time (which will decrease, as fewer refinement steps will be needed) and the double-precision time (which, in many instances, will be unchanged).

In Table V, we report the number and percentage of problems in each test set for which the requested accuracy was achieved after iterative refinement and after iterative refinement followed by FGMRES. We also report the number of problems that had to fall back to a double-precision factorization to achieve the requested accuracy and the number that failed to achieve this even in double precision. There were just two such problems: GHS indef/boyd1 and GHS indef/blockqp1 (these had final scaled residuals of 6.2 × 10^-14 and 2.9 × 10^-14, respectively).

As expected, when mixed precision failed to reach the requested accuracy, HSL MA79 spent longer establishing this fact than if double precision had been used from the start. Thus it is essential for a potential user to experiment to see whether the mixed-precision approach will be advantageous for his or her application and computing environment.

It is of interest to consider not only the total time taken to solve the system (1), but also the times for each phase of the solution process in mixed precision and in double precision. Table VI reports timings for the various phases for a subset of problems of different sizes from Test Set 2. The problems are ordered by the total time required to solve (1) using double precision. The time for the analyze phase (which here includes the time to scale the matrix) is independent of the precision.


Table VI. Times to Solve Equation (1) for a Subset of Problems from Test Set 2 with HSL MA79 Using HSL MA77 as Solver. (m denotes mixed precision and d denotes double precision. The numbers in parentheses are iteration counts. - indicates iterative refinement (or FGMRES) was not required.)

                      Total         Analyze  Factorize     Iterative refinement  FGMRES          β
Problem               m      d               m      d      m         d           m         d    m         d
HB/bcsstk17           0.30   0.25   0.10    0.11   0.14   0.06 (5)  -           -         -    1.17e-15  1.24e-16
Boeing/crystm03       1.08   1.09   0.64    0.33   0.44   0.09 (3)  -           -         -    3.78e-16  4.80e-16
Boeing/bcsstm39       3.86   3.97   3.66    0.11   0.30   0.09 (2)  -           -         -    2.31e-15  1.28e-16
Rothberg/cfd1         6.22   8.41   2.98    2.48   5.43   0.69 (3)  -           -         -    1.41e-16  2.32e-16
Rothberg/cfd2         11.8   14.6   4.97    5.27   9.62   1.39 (3)  -           -         -    1.69e-16  2.74e-16
INPRO/msdoor          28.7   20.2   5.39    9.64   11.8   2.70 (3)  -           10.1 (8)  -    2.96e-16  2.64e-16
ND/nd6k               29.7   38.8   5.33    20.3   33.1   3.77 (8)  -           -         -    3.67e-15  1.68e-15
GHS psdef/apache2     61.8   66.1   19.8    29.1   46.4   12.5 (6)  -           -         -    1.79e-16  1.43e-15
Koutsovasilis/F1      63.5   79.1   16.9    36.1   60.1   9.10 (4)  -           -         -    1.75e-15  2.16e-16
Lin/Lin               57.4   79.9   10.8    39.3   66.4   7.26 (5)  2.69 (1)    -         -    3.20e-16  2.03e-16
ND/nd12k              104    149    14.5    78.0   134    11.5 (8)  -           -         -    1.00e-15  2.07e-15
PARSEC/Ga3As3H12      348    537    21.6    302    511    24.3 (8)  5.86 (1)    -         -    3.27e-15  3.37e-16
ND/nd24k              372    570    36.5    297    532    36.5 (9)  -           -         -    1.02e-15  2.94e-15
GHS indef/sparsine    409    540    18.4    314    517    5.28 (2)  4.92 (1)    70.5 (16) -    4.20e-16  3.51e-16
PARSEC/Si34H36        783    1287   38.7    706    1236   37.7 (6)  12.1 (1)    -         -    4.91e-16  2.71e-16
GHS psdef/audikw1     810    1329   65.7    620    1262   120 (7)   -           -         -    4.67e-15  5.84e-17
PARSEC/Ga10As10H30    1187   2046   51.6    1082   1976   53.3 (6)  18.7 (1)    -         -    2.96e-16  1.96e-16
PARSEC/Si87H76        6058   9310   153     4703   8815   1202 (7)  344 (1)     -         -    6.64e-16  1.95e-16


5. CONCLUSIONS AND FUTURE DIRECTIONS

In this article, we have explored a mixed-precision strategy that is capable of outperforming a traditional double-precision approach for solving large sparse symmetric linear systems. Building on the recent work of Arioli and Duff [2008] and Buttari et al. [2008], we have designed and developed a practical and robust sparse mixed-precision solver; the new package HSL MA79 is available within the HSL Library. Numerical experiments on a large number of problems have shown that, in about 90% of our test cases, it was possible to use a mixed-precision approach to get accuracy of 5 × 10^-15; in the remaining cases, it was necessary to resort to computing a double-precision factorization (or to accept a less accurate solution). HSL MA79 is designed to allow an automatic fall-back to double precision and is tuned to minimize the work performed before this happens. However, although we have demonstrated robustness, our experience is that, in terms of computational time, the advantage of using mixed precision is limited to large problems (how large will depend on the direct solver used within HSL MA79, on the computing platform, and also on the requested accuracy); the user is advised that experimentation with his or her problems will be necessary to decide whether or not to use mixed precision.

Future work on HSL MA79 will focus on more efficiently recovering double-precision accuracy in the case of multiple right-hand sides; this will lead to the replacement of the MI15 implementation of FGMRES with a specially modified variant of FGMRES, which may require a different adaptive restarting strategy. Our current implementation of HSL MA79 requires that the original matrix A reside in memory throughout the factorization. This is not a requirement for the out-of-core solver HSL MA77, and we would like to remove it from HSL MA79 by offering an interface that allows A to be supplied by the user in a file. We would additionally like to collect more large problems on which to test and refine our code.

Throughout this article, we have considered factorizing A in single precision combined with recovery using double precision. However, this does not have to be the case. Provided the condition number of the matrix is less than the reciprocal of the requested accuracy γ, the theory [Arioli and Duff 2008] supports recovery to arbitrary precision; in this case, the refinement must be carried out in extended precision. It is also possible to perform a factorization in double precision and then recover to higher precision. This will be the subject of a separate study.
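
This observation can be made concrete with the standard contraction heuristic for iterative refinement (our paraphrase; see, e.g., Higham [1997]). With a factorization computed at unit roundoff u_f and residuals formed at precision u_r, each refinement step behaves roughly as

    ||x_{k+1} - x|| <~ c κ(A) u_f ||x_k - x|| + O(u_r),

for a modest constant c: the factorization precision u_f (through κ(A) u_f < 1) governs whether the iteration converges, while the residual precision u_r governs the accuracy that can ultimately be attained. Hence recovering more than double-precision accuracy requires the residuals, and thus the refinement, to be carried out in extended precision.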

6. CODE AVAILABILITY

All the codes discussed in this article have been developed for inclusion in the mathematical software library HSL. All use of HSL requires a license. Individual HSL packages (together with their dependencies and accompanying documentation) are available without charge to individual academic users for their personal (noncommercial) research and for teaching; licenses for other uses involve a fee. Details of the packages and how to obtain a license, plus conditions of use, are available online.2

2 www.cse.stfc.ac.uk/nag/hsl.


ACKNOWLEDGMENTS

We are grateful to our colleagues John Reid, Iain Duff, and Mario Arioli for useful discussions relating to this work. We would also like to thank three anonymous referees for their constructive and helpful comments.

REFERENCES

AMESTOY, P., DAVIS, T., AND DUFF, I. 1996. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. 17, 886-905.
AMESTOY, P., DAVIS, T., AND DUFF, I. 2004. Algorithm 837: AMD, an approximate minimum degree ordering algorithm. ACM Trans. Math. Softw. 30, 3, 381-388.
AMESTOY, P., DOLLAR, H., REID, J., AND SCOTT, J. 2007. An approximate minimum degree algorithm for matrices with dense rows. Tech. rep. RAL-TR-2007-020, Rutherford Appleton Laboratory, Chilton, U.K.
ARIOLI, M. AND DUFF, I. 2008. FGMRES to obtain backward stability in mixed precision. Tech. rep. RAL-TR-2008-006, Rutherford Appleton Laboratory, Chilton, U.K.
ARIOLI, M., DUFF, I., GRATTON, S., AND PRALET, S. 2007. A note on GMRES preconditioned by a perturbed LDL^T decomposition with static pivoting. SIAM J. Sci. Comput. 29, 5, 2024-2044.
BUTTARI, A., DONGARRA, J., AND KURZAK, J. 2007a. Limitations of the PlayStation 3 for high performance cluster computing. Tech. rep. CS-07-594, University of Tennessee Computer Science Department, Knoxville, TN.
BUTTARI, A., DONGARRA, J., KURZAK, J., LUSZCZEK, P., AND TOMOV, S. 2008. Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy. ACM Trans. Math. Softw. 34, 4.
BUTTARI, A., DONGARRA, J., LANGOU, J., LANGOU, J., LUSZCZEK, P., AND KURZAK, J. 2007b. Mixed precision iterative refinement techniques for the solution of dense linear systems. Int. J. High Perform. Comput. Appl. 21, 4, 457-466.
CURTIS, A. AND REID, J. 1972. On the automatic scaling of matrices for Gaussian elimination. IMA J. Appl. Math. 10, 1, 118-124.
DAVIS, T. 2007. The University of Florida sparse matrix collection. Tech. rep., University of Florida, Gainesville, FL. http://www.cise.ufl.edu/~davis/techreports/matrices.pdf.
DEMMEL, J., HIDA, Y., KAHAN, W., LI, X. S., MUKHERJEE, S., AND RIEDY, E. J. 2006. Error bounds from extra precise iterative refinement. ACM Trans. Math. Softw. 32, 2, 325-351.
DEMMEL, J., HIDA, Y., LI, X. S., AND RIEDY, E. J. 2009. Extra-precise iterative refinement for overdetermined least squares problems. ACM Trans. Math. Softw. 35, 4. (Also LAPACK Working Note 188.)
DOLLAR, H. AND SCOTT, J. 2009. A note on fast approximate minimum degree orderings for matrices with some dense rows. Numer. Lin. Alg. Appl. 17, 1, 43-45.
DONGARRA, J., DU CROZ, J., HAMMARLING, S., AND DUFF, I. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1, 1-17.
DUFF, I. 2004. MA57—a new code for the solution of sparse symmetric definite and indefinite systems. ACM Trans. Math. Softw. 30, 118-154.
DUFF, I. AND KOSTER, J. 2001. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl. 22, 4, 973-996.
DUFF, I. AND PRALET, S. 2005. Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J. Matrix Anal. Appl. 27, 313-340.
DUFF, I. AND PRALET, S. 2007. Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems. SIAM J. Matrix Anal. Appl. 29, 1007-1024.
DUFF, I. AND REID, J. 1983. The multifrontal solution of indefinite sparse symmetric linear systems. ACM Trans. Math. Softw. 9, 302-325.
DUFF, I. AND SCOTT, J. 2005. Towards an automatic ordering for a symmetric sparse direct solver. Tech. rep. RAL-TR-2006-001, Rutherford Appleton Laboratory, Chilton, U.K.
GOTO, K. AND VAN DE GEIJN, R. 2008. High performance implementation of the level-3 BLAS. ACM Trans. Math. Softw. 35, 1.
GOULD, N., SCOTT, J., AND HU, Y. 2007. A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations. ACM Trans. Math. Softw. 33, 2, 10:1-10:32.
GRAHAM, S. L., SNIR, M., AND PATTERSON, C. A. 2004. Getting Up to Speed: The Future of Supercomputing. National Academies Press, Washington, DC.
HIGHAM, N. 1997. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal. 17, 495-509.
HOGG, J. D. AND SCOTT, J. A. 2008. The effects of scalings on the performance of a sparse symmetric indefinite solver. Tech. rep. RAL-TR-2008-007, Rutherford Appleton Laboratory, Chilton, U.K.
HSL. 2007. A collection of Fortran codes for large-scale scientific computation. http://www.cse.stfc.ac.uk/nag/hsl.
IEEE. 2008. IEEE standard for floating-point arithmetic. IEEE Standard 754-2008, IEEE Standards Committee, Piscataway, NJ, 1-58.
KARYPIS, G. AND KUMAR, V. 1999. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 359-392.
REID, J. AND SCOTT, J. 2008. An efficient out-of-core sparse symmetric indefinite direct solver. Tech. rep. RAL-TR-2008-024, Rutherford Appleton Laboratory, Chilton, U.K.
REID, J. AND SCOTT, J. 2009a. Algorithm 891: A Fortran virtual memory system. ACM Trans. Math. Softw. 36, 1.
REID, J. AND SCOTT, J. 2009b. An out-of-core sparse Cholesky solver. ACM Trans. Math. Softw. 36, 2.
RUIZ, D. 2001. A scaling algorithm to equilibrate both rows and columns norms in matrices. Tech. rep. RAL-TR-2001-034, Rutherford Appleton Laboratory, Chilton, U.K.
SAAD, Y. 1994. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Statist. Comput. 14, 461-469.
SAAD, Y. 2003. Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM, Philadelphia, PA, 273-275.
SKEEL, R. 1980. Iterative refinement implies numerical stability for Gaussian elimination. Math. Computat. 35, 817-832.

Received September 2008; revised March 2009, July 2009; accepted October 2009
