

Algorithms for Solving Numerical Linear Algebra Problems on Supercomputers

T.J. DEKKER, W. HOFFMANN and P.P.M. DE RIJK
Department of Computer Systems, University of Amsterdam,
Kruislaan 409, 1098 SJ Amsterdam, The Netherlands.

In this paper some numerical algorithms are considered in relation to computations on vector and parallel processors. Moreover, some results of experiments on a vector computer are reported.

The algorithms considered are Gaussian elimination and Gauss-Jordan elimination for solving full linear systems, and Hestenes' one-sided Jacobi iteration to calculate the Singular Value Decomposition of a matrix.

The advent of parallel and vector computers has led to several new designs and implementations of numerical algorithms, especially in the area of solving large problems in linear algebra. In this paper attention is focussed on algorithms for solving full linear systems and algorithms for solving full, possibly rank-deficient, linear least-squares problems. A short synopsis of the theory, with an emphasis on vectorisation and parallelisation aspects, and some results of experiments on vector computers are given.

The most commonly used algorithm for solving full linear systems is Gaussian elimination. This algorithm is fast and reliable. The solution obtained always has a small residual and is, in fact, the exact solution of a slightly perturbed problem [28]. Consequently, the solution obtained by Gaussian elimination has a small (relative) error, provided the system is not too ill-conditioned. The amount of work for solving an n-by-n system is of the order of 2n³/3 floating-point operations. The algorithm can very well be adapted to execution on a vector processor [8,15] or a parallel computer [9].

Another related algorithm is Gauss-Jordan elimination. This algorithm requires 1.5 times more work for large systems (the amount of work being of the order of n³ floating-point operations). It has, nevertheless, received renewed interest because of its promising properties with respect to vectorisation and parallelisation. The Gauss-Jordan algorithm is stable in the sense that each solution obtained has an absolute error which is strictly comparable with that corresponding to Gaussian elimination. The residual corresponding to a Gauss-Jordan solution, however, can be much greater for an ill-conditioned system than that corresponding to Gaussian elimination [24]. More recently, the Gauss-Jordan algorithm has been rehabilitated by showing that the residual will, in most practical situations, be not larger than the residual corresponding to the Gaussian elimination solution, provided the pivoting is done in a proper way [5,6]. These two algorithms are treated below in more detail and some experimental results are given.

For solving linear least-squares problems, it is important to know if the matrix is of full rank or (approximately) equal to a matrix of deficient rank. Algorithms for full-rank least-squares problems make use of QR factorisation. When, however, the matrix is or may be (nearly) rank-deficient, the most reliable methods are those using the Singular Value Decomposition (SVD). This decomposition reduces a real matrix A, by means of orthogonal transformations, to a diagonal matrix.

Algorithms to calculate the SVD which are suitable for execution on vector or parallel computers are one-sided Jacobi [14,26] and two-sided Jacobi [20,4]. These algorithms are akin to Jacobi's iterative method for solving the Hermitean (or real symmetric) eigenvalue problem.

The most commonly used algorithm reduces the matrix, by means of orthogonal transformations, to a bi-diagonal matrix (i.e. only the main diagonal and an adjacent codiagonal are non-zero) and then solves the SVD problem for this matrix iteratively [13]. Parallel variants of this algorithm have been developed [18].

In the sequel, we treat the one-sided Jacobi method in more detail and give some experimental results and comparisons with a library routine implementing the Golub-Reinsch algorithm mentioned above.

All experiments reported in this paper have been carried out on the CDC Cyber 205 vector computer (having one vector pipe at that time) of the Academic Computer Centre Amsterdam (SARA), using the Fortran 200 extension of Fortran 77 for this machine.

2. Gaussian Elimination

For a given matrix A of order n and a given right-hand side vector b, we want to find a solution vector x solving the linear system

Ax = b.

Gaussian elimination consists of three components, namely LU factorisation, forward substitution and back substitution. The calculations in each component can be performed in different orders and can correspondingly use different basic vector or matrix operations. The efficiency of these operations, and hence also of the entire algorithm, may vary considerably for various vector and parallel computer architectures. A survey of the possible arrangements of Gaussian elimination in relation to performance on a vector computer is given in [15]. We now describe the three components of Gaussian elimination.

(i) LU factorisation reduces matrix A to an upper triangular matrix, U, by means of n−1 successive elimination steps. Starting from A^(1) = A, the k-th elimination step, k = 1, ..., n−1, transforms A^(k) into A^(k+1) such that the elements of the k-th column of A^(k+1) below the main diagonal become zero. Thus, after n−1 steps the upper triangular matrix U = A^(n) is obtained.

The k-th elimination step proceeds as follows. Firstly, an element of sufficiently large magnitude is selected in the lower right (n−k+1)-th order submatrix of A^(k). This element is called the k-th pivot and denoted by δ_k. Let the selected pivot element be δ_k = A^(k)_{p,q} for some p ≥ k and q ≥ k. Then the k-th and the p-th rows of A^(k) are interchanged (if p > k) and similarly the q-th and k-th columns of A^(k) (if q > k), yielding the matrix Ã^(k), say. The pivoting, i.e. the pivot selection and corresponding interchanges, is mostly needed for numerical stability. Strategies for the pivot selection are discussed below.

Subsequently, on matrix Ã^(k) the elimination is performed, introducing the required zeroes in the k-th column. This can be formulated as premultiplying Ã^(k) by the matrix

M_k = I − δ_k^{-1} m_k e_k^T,    (2.1)

where e_k is the k-th unit vector and m_k is the column vector whose first k elements are zero and whose remaining elements are given by

(m_k)_i = Ã^(k)_{i,k},    i = k+1, ..., n.

In other words, A^(k+1) is obtained from Ã^(k) by means of a rank-one modification

A^(k+1) = M_k Ã^(k) = Ã^(k) − δ_k^{-1} m_k Ã^(k)_{k,·},    (2.2)

where Ã^(k)_{k,·} denotes the k-th row of Ã^(k).

The total effect of this component can be summarised as follows. For a given matrix A, the LU factorisation finds permutation matrices P and Q, a (unit) lower-triangular matrix L and an upper-triangular matrix U, such that

PAQ = LU.

Here P and Q are permutation matrices corresponding to the interchanging of rows and columns, respectively, and matrix L is a (unit) lower triangular matrix containing the elimination factors (here "unit" means that all diagonal elements are one), as follows:

L_{i,j} = δ_j^{-1} (m̃_j)_i   for i > j,
L_{i,j} = 1   for i = j,
L_{i,j} = 0   for i < j,

where m̃_j denotes the vector obtained from m_j by properly interchanging its elements according to the row interchanges in the subsequent elimination steps.
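For illustration, the LU factorisation component in its rank-one-update form can be sketched as follows in Python/NumPy. This is a didactic sketch, not the authors' Fortran 200 code: it uses partial pivoting with row interchanges only (so Q = I), and the routine name is ours.

```python
import numpy as np

def lu_partial_pivoting(A):
    """LU factorisation PA = LU via successive rank-one modifications (2.2).

    Partial pivoting with row interchanges only, so Q = I in the notation
    of the text.  Returns the row order piv and the factors L, U.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    piv = np.arange(n)                      # records the row permutation P
    for k in range(n - 1):
        # pivot selection: largest magnitude in column k on or below the diagonal
        p = k + int(np.argmax(np.abs(A[k:, k])))
        if A[p, k] == 0.0:
            raise ZeroDivisionError("matrix is singular to working precision")
        if p != k:
            A[[k, p], :] = A[[p, k], :]     # row interchange
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]               # elimination factors delta_k^{-1} m_k
        # rank-one modification (2.2), i.e. column AXPY operations
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return piv, L, U

if __name__ == "__main__":
    M = np.random.default_rng(0).standard_normal((5, 5))
    piv, L, U = lu_partial_pivoting(M)
    print(np.allclose(M[piv], L @ U))       # PA = LU up to rounding
```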

(ii) Forward substitution transforms the right-hand side vector b in correspondence with the transformation of A, namely the eliminations and the row interchanges, performed in the LU factorisation. It is executed in n−1 elimination steps which, in the form of column operations, are akin to the elimination steps of the LU factorisation.

Starting from b^(1) = b, the k-th step, k = 1, ..., n−1, interchanges the k-th and the p-th element of b^(k), corresponding to the row interchanges of the k-th step in the LU factorisation, yielding the vector b̃^(k), say; subsequently, b̃^(k) is transformed into b^(k+1) as follows:

b^(k+1) = M_k b̃^(k) = b̃^(k) − δ_k^{-1} b̃^(k)_k m_k,    (2.3)

which is a vector minus scalar times vector operation.

Thus, after n−1 steps the solution y = b^(n) of the forward substitution system is obtained.

The total effect of this component can be summarised as follows: forward substitution finds vector y solving

Ly = Pb.
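Continuing the sketch above (same assumptions: Python/NumPy, row pivoting only), forward substitution in column-oriented, AXPY form as in formula (2.3) can be written as:

```python
import numpy as np

def forward_substitution_columns(L, b, piv=None):
    """Solve L y = P b for unit lower-triangular L by column operations.

    Each step is a 'vector minus scalar times vector' (AXPY) operation,
    as in formula (2.3) of the text.  `piv` is the row permutation from
    the LU factorisation (identity if omitted).
    """
    n = L.shape[0]
    y = np.array(b, dtype=float)
    if piv is not None:
        y = y[piv]                      # apply P to the right-hand side
    for k in range(n - 1):
        # subtract y[k] times the k-th column of L below the diagonal
        y[k+1:] -= y[k] * L[k+1:, k]
    return y
```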

(iii) Back substitution solves the linear system resulting from the first two components, which is equivalent to the given system, i.e. it solves the upper triangular linear system

Uz = y

and obtains the solution vector x by interchanging the elements of z, to compensate for the column interchanges in the LU factorisation, according to

x = Qz.

The solution of the triangular system can be obtained by means of column operations as follows. Starting from y^(n) = y and z_n = δ_n^{-1} y_n, the k-th step, k = n−1, ..., 2, 1, calculates y^(k) and z_k according to

y^(k) = y^(k+1) − z_{k+1} Ū_{·,k+1},    (2.4)
z_k = δ_k^{-1} y^(k)_k,    (2.5)

where Ū is obtained from U by replacing the diagonal elements by zero, and Ū_{·,k+1} denotes the (k+1)-th column of Ū.
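A column-oriented back substitution following (2.4)-(2.5) completes this illustrative solver; again a sketch, with the diagonal of U playing the role of the pivots δ_k:

```python
import numpy as np

def back_substitution_columns(U, y):
    """Solve U z = y for upper-triangular U by column operations (2.4)-(2.5)."""
    n = U.shape[0]
    y = np.array(y, dtype=float)
    z = np.zeros(n)
    for k in range(n - 1, -1, -1):
        z[k] = y[k] / U[k, k]            # formula (2.5)
        # formula (2.4): subtract z_k times the strict upper part of column k
        y[:k] -= z[k] * U[:k, k]
    return z
```

Together with the two sketches above, x = back_substitution_columns(U, forward_substitution_columns(L, b, piv)) solves Ax = b when only row interchanges are used.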

Pivoting Strategies

If no pivoting is performed, then P = Q = I (i.e. the identity matrix) and the three components of the algorithm reduce to:

A = LU,    Ly = b,    Ux = y.

Mostly, partial pivoting is carried out with either only row interchanges (then Q = I) or only column interchanges (then P = I). For large systems, however, numerical stability can only be guaranteed with complete pivoting, which involves both row and column interchanges [28]. For partial pivoting with row interchanges, the k-th pivot is selected as an element of largest magnitude in the k-th column of A^(k) on or below the main diagonal (hence, p ≥ k and q = k). For partial pivoting with column interchanges, the k-th pivot is selected similarly in the k-th row of A^(k) (hence, p = k and q ≥ k). For complete pivoting, the k-th pivot is selected as an element of largest magnitude in the lower right (n−k+1)-th order submatrix of A^(k) (hence, p ≥ k and q ≥ k).

Complete pivoting requires more work than partial pivoting. It is possible, however, to perform a mixed pivoting strategy combining partial and complete pivoting as follows. Partial pivoting is performed as long as a certain upper bound on the possible pivot growth, calculated in each elimination step, remains smaller than a certain threshold. When this threshold is exceeded, complete pivoting is used in the remaining elimination steps. (This check on the pivot growth is called monitoring of the pivot growth.) A careful choice of the threshold parameter ensures that the algorithm is practically as economical as Gaussian elimination with partial pivoting and as reliable as Gaussian elimination with complete pivoting [3,15].
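The following sketch illustrates the idea of mixed pivoting: partial pivoting with column interchanges combined with a cheap running bound on the element growth, and a switch to complete pivoting once that bound exceeds a threshold. The particular bound maintained here (adding the pivot-column maximum at each partial-pivoting step, which is valid because the multipliers taken from the pivot row then have modulus at most 1) is a simplification of our own, not necessarily the monitor used in [3,15]; the threshold value is likewise only an example.

```python
import numpy as np

def lu_mixed_pivoting(A, threshold=16.0):
    """LU factorisation with mixed pivoting (illustrative sketch only).

    Partial pivoting with column interchanges is used while a cheap upper
    bound on the element growth stays below `threshold` times the largest
    initial element; afterwards the remaining steps use complete pivoting.
    Returns L, U and permutations p, q such that A[p][:, q] = L U.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    p, q = np.arange(n), np.arange(n)
    a0 = np.abs(A).max()                    # largest initial element
    bound = a0                              # bound on the current element size
    complete = False
    for k in range(n - 1):
        complete = complete or bound > threshold * a0
        if complete:
            # complete pivoting: largest element of the trailing submatrix
            i, j = np.unravel_index(np.argmax(np.abs(A[k:, k:])), (n - k, n - k))
            pr, pc = k + int(i), k + int(j)
        else:
            # partial pivoting with column interchanges: largest element in row k
            pr, pc = k, k + int(np.argmax(np.abs(A[k, k:])))
        if pr != k:
            A[[k, pr], :] = A[[pr, k], :]; p[[k, pr]] = p[[pr, k]]
        if pc != k:
            A[:, [k, pc]] = A[:, [pc, k]]; q[[k, pc]] = q[[pc, k]]
        if not complete:
            # multipliers taken from the pivot row have modulus <= 1, so the
            # elements can grow per step by at most the pivot-column maximum
            bound += np.abs(A[k:, k]).max()
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A), p, q

if __name__ == "__main__":
    M = np.random.default_rng(1).standard_normal((6, 6))
    L, U, p, q = lu_mixed_pivoting(M, threshold=1.0)   # force an early switch
    print(np.allclose(M[p][:, q], L @ U))
```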

Vectorisation and Parallelisation Aspects

The most important part of the elimination process, in terms of the amount of work to be performed, is the rank-one modification (formula (2.2)). It lends itself well to efficient execution on vector and parallel processors. The calculation can be done by rows or by columns, which can, in any order, be processed in parallel. Each row or column operation is a vector minus scalar times vector ("AXPY") operation [21], which can be executed as a vector operation under certain conditions, depending on the vector machine architecture and the way the matrix is stored. For instance, on a CDC Cyber 205 in Fortran, column vector operations can be efficiently executed, because matrix columns are stored contiguously in memory.

The rank-one modification can also be done block-wise, which may be attractive for large matrices to avoid page faults and extra communication with secondary storage. For these and other reasons, the rank-one modification is included in a proposed extended set of Basic Linear Algebra Subprograms [10,11].

Partial pivoting requires both a column and a row operation, one for selecting a pivot, the other for performing an interchange. For instance, on a Cyber 205 in Fortran, column interchanges are efficient, and the pivot selection in a row can be performed using a gather operation followed by an operation on the gathered vector.

The vector operations in forward and back substitution can also be performed efficiently on a vector computer. For a system with multiple right-hand sides, the modifications of the right-hand sides in the forward and back substitutions again have the form of rank-one modifications and can be executed in parallel.

Numerical Experiments

We here summarize some results of experiments reported elsewhere [15]. Some vectorised variants of Gaussian elimination were compared. The best, i.e. most efficient, variant turned out to be routine CCRPCF, using column vector operations and partial pivoting with column interchanges. Moreover, routine CCRMCF, using the same algorithm but extended with mixed pivoting, worked quite satisfactorily, requiring only little extra time for monitoring the pivot growth.

These routines were also compared with some other Gaussian elimination routines available on the Cyber 205, namely from the well-known libraries LINPACK [7] and NAG [23] and from QQLIB, a library in Fortran 200 provided by Control Data [25]. The Gaussian elimination routine QQGEL from QQLIB is well vectorised and (much) faster than the corresponding routines from LINPACK and NAG.

Table 1 gives some CP-times for solving a system of linear equations of order n for various values of n, using the routines CCRPCF, CCRMCF and QQGEL.

Table 1
CP time in seconds for various orders n
n = 25, n = 50, n = 100, n = 200, n = 400

From these figures, one obtains the number of megaflops (i.e. millions of floating-point operations per second) for these routines according to the formula

10^{-6} (2n³/3 + 2n²) / CP time.

The largest number of megaflops is achieved by CCRPCF, which yields 46.9 megaflops for n = 200 and 64.1 megaflops for n = 400. This is a quite satisfactory performance on a one-pipe Cyber 205, for which the maximal megaflop rate, using linked triads, is 100. For more details and results of these experiments see [15].

The routines CCRPCF and CCRMCF have been included in the NUMVEC FORTRAN library [17].
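As a quick consistency check on the megaflop formula above (a back-of-the-envelope computation, not a value taken from Table 1): for n = 400 the operation count is 2·400³/3 + 2·400² ≈ 4.30 × 10⁷ flops, so the quoted rate of 64.1 megaflops for CCRPCF corresponds to a CP time of roughly 4.30 × 10⁷ / (64.1 × 10⁶) ≈ 0.67 seconds.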

3. Gauss-Jordan Algorithm

The algorithm of Gauss-Jordan transforms matrix A by means of elementary transformations into a diagonal matrix, and performs similar transformations to the right-hand side vector b in order to find the solution vector x.

The transformation of A is achieved in n successive elimination steps. Starting from A^(1) = A, the k-th elimination step, k = 1, ..., n, transforms A^(k) into A^(k+1) such that the off-diagonal elements in the k-th column, not only below but also above the main diagonal, become zero. Thus, after n steps the diagonal matrix D = A^(n+1) is obtained.

The k-th elimination step can be formulated as follows. Firstly, a pivot δ_k = A^(k)_{p,q} (say) is selected according to some pivoting strategy, as in Gaussian elimination. Then rows and/or columns of A^(k) are interchanged, such that in the resulting matrix Ã^(k) the pivot is the element δ_k = Ã^(k)_{k,k}.

Subsequently, on matrix Ã^(k) the elimination is performed, introducing the required zeroes in the k-th column. This is achieved by premultiplying Ã^(k) by the matrix

T_k = I − δ_k^{-1} g_k e_k^T,    (3.1)

where e_k is the k-th unit vector and g_k is the column vector given by

g_k = Ã^(k) e_k − δ_k e_k,

i.e. the vector obtained from the k-th column of Ã^(k) by replacing its diagonal element by zero. In other words, A^(k+1) is obtained from Ã^(k) by means of a rank-one modification

A^(k+1) = T_k Ã^(k) = Ã^(k) − δ_k^{-1} g_k Ã^(k)_{k,·},    (3.2)

where Ã^(k)_{k,·} is the k-th row of Ã^(k).

Summarising, Gauss-Jordan elimination transforms a given matrix A into a diagonal matrix D according to

T̃_n ⋯ T̃_1 P A Q = D,

where P and Q are permutation matrices corresponding to the interchanging of rows and columns, respectively, and the matrices T̃_k, k = 1, ..., n, are elementary elimination matrices obtained from T_k by properly interchanging the elements of g_k according to the row interchanges in the subsequent elimination steps. (Hence, when P = I, then T̃_k = T_k for all k.)

The corresponding transformation of the right-hand side vector b proceeds as follows. Starting from b^(1) = b, the k-th elimination step, k = 1, ..., n, transforms b^(k) into b^(k+1) as follows. Firstly, two elements of b^(k) are interchanged, corresponding to the interchanging of rows of matrix A^(k), yielding the vector b̃^(k), say. Subsequently, b^(k+1) is obtained from b̃^(k) by means of a vector minus scalar times vector operation

b^(k+1) = T_k b̃^(k) = b̃^(k) − δ_k^{-1} b̃^(k)_k g_k.    (3.3)

Summarising, the Gauss-Jordan transformation of the right-hand side vector b yields a vector y satisfying

y = T̃_n ⋯ T̃_1 P b.

Thus, the given linear system is transformed into the equivalent system

D Q^T x = y,

which is easily solved by calculating

z = D^{-1} y    (3.4)

and interchanging the elements of z to obtain x according to

x = Qz.    (3.5)
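As an illustration of formulas (3.1)-(3.5), the following Python/NumPy sketch solves Ax = b by Gauss-Jordan elimination with partial pivoting by column interchanges (the variant favoured in [5,6]); it is not the NUMVEC routine GJPCF, and the routine name is ours.

```python
import numpy as np

def gauss_jordan_solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (illustrative sketch).

    Partial pivoting with column interchanges (P = I): at step k the pivot
    is the largest-magnitude element of row k in columns k..n-1.  Each step
    zeroes the whole k-th column except the diagonal, as in (3.2)-(3.3).
    """
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = A.shape[0]
    q = np.arange(n)                          # column permutation Q
    for k in range(n):
        j = k + int(np.argmax(np.abs(A[k, k:])))
        if A[k, j] == 0.0:
            raise ZeroDivisionError("matrix is singular to working precision")
        if j != k:
            A[:, [k, j]] = A[:, [j, k]]
            q[[k, j]] = q[[j, k]]
        delta = A[k, k]
        g = A[:, k].copy()                    # g_k: column k with zero diagonal
        g[k] = 0.0
        A -= np.outer(g / delta, A[k, :])     # rank-one modification (3.2)
        b -= (b[k] / delta) * g               # update (3.3) of the right-hand side
    z = b / np.diag(A)                        # z = D^{-1} y, formula (3.4)
    x = np.empty(n)
    x[q] = z                                  # x = Q z, formula (3.5)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((6, 6))
    b = rng.standard_normal(6)
    x = gauss_jordan_solve(A, b)
    print(np.linalg.norm(A @ x - b))          # small residual expected
```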

Pivoting Strategies

Mostly, partial pivoting is performed, which, as explained above, can use either row interchanges or column interchanges. The numerical behaviour of the Gauss-Jordan algorithm is quite different for these two strategies, in contrast to the behaviour of Gaussian elimination. The accuracy of the calculated solution is in all cases of the same order of magnitude. The difference manifests itself, however, in the size of the residual, r = b − Ax, of a calculated approximate solution x.

Gauss-Jordan using partial pivoting with row interchanges often yields a much larger residual corresponding to the calculated solution than Gaussian elimination does [24]. On the other hand, Gauss-Jordan using partial pivoting with column interchanges yields a residual which is mostly not larger than the residual obtained by Gaussian elimination of the same system [5].

Gauss-Jordan can also be performed with complete pivoting or with mixed pivoting, in order to obtain a more reliable algorithm for very large systems.

Vectorisation and Parallelisation Aspects

The most important part of the Gauss-Jordan algorithm is the rank-one modification (formula (3.2)), which can be performed efficiently on vector and parallel processors in a similar way as for Gaussian elimination. Moreover, the vector operations are more efficient (on the average), because the elimination steps operate on entire columns (one element in each column excepted), whereas the elimination steps in Gaussian elimination operate on columns of lengths decreasing from n to 1. Although Gauss-Jordan requires about 50% more floating-point operations than Gaussian elimination, namely of the order of n³ operations versus 2n³/3 operations, these algorithms require the same number of vector operations, namely ½n² vector minus scalar times vector operations.

Moreover, the rank-one modification can be performed blockwise to avoid page faults for large n, in the same way as for Gaussian elimination.

Gauss-Jordan elimination is particularly suitable for calculating the inverse of a matrix. Matrix inversion, by means of Gauss-Jordan or Gaussian elimination, requires of the order of 2n³ floating-point operations. Gauss-Jordan matrix inversion can, however, be arranged such that only n² vector operations are needed. This cannot be achieved using Gaussian elimination.
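To illustrate why Gauss-Jordan inversion maps well onto vector operations, here is a sketch of the classical in-place Gauss-Jordan inversion in Python/NumPy. Each of the n steps is one rank-one modification of the full n-column array, i.e. n vector operations of full length, which is where the n² vector operations mentioned above come from. Pivoting is omitted for brevity (a robust routine would pivot as discussed earlier), and this is our own illustration rather than the routine INVGJ or the routine of [19].

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Replace A by its inverse using in-place Gauss-Jordan elimination.

    Illustrative sketch without pivoting: at step k the k-th column becomes
    free and is reused to build up the k-th column of the inverse, so every
    step is a single rank-one modification of the whole n-by-n array.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        pivinv = 1.0 / A[k, k]
        A[k, k] = 1.0
        A[k, :] *= pivinv                  # scale the pivot row
        col = A[:, k].copy()
        col[k] = 0.0                       # the pivot row itself is kept
        A[:, k] -= col                     # clear the old multipliers in column k
        A -= np.outer(col, A[k, :])        # rank-one update of all other rows
    return A

if __name__ == "__main__":
    M = np.random.default_rng(3).standard_normal((5, 5))
    print(np.linalg.norm(gauss_jordan_inverse(M) @ M - np.eye(5)))  # small residual expected
```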

Numerical Experiments

The results of experiments on numerical stability and timing of this algorithm and a similar algorithm for matrix inversion have been published elsewhere [5,6].

Table 2 gives some timing results for routines GJPCF, solving a linear system, and INVGJ, calculating the inverse of a matrix. These routines are both implementations of our Gauss-Jordan algorithm using partial pivoting with column interchanges. For comparison, the table also contains the corresponding timing results for routine CCRPCF and a Gauss-Jordan matrix inversion routine from [19].

Table 2
CP time in seconds for various orders n

These results show that the Gauss-Jordan routine GJPCF is rather efficient on a vector computer. Although it requires more work, it is competitive with Gaussian elimination for order n up to nearly 50. From these figures one obtains the number of megaflops according to the formulas

10^{-6} (n³ + 2n²) / CP time   for GJPCF,
10^{-6} × 2n³ / CP time   for INVGJ and Johnson's routine [19].

The number of megaflops for n = 200 is 58.0 for GJPCF and 58.8 for INVGJ. For more details and results of these experiments see [5,6].

The routines GJPCF and INVGJ have been included in the NUMVEC FORTRAN library [16].

4. Singular Value Decomposition by One-sided Jacobi Iteration

Let A be a real m × n matrix and, for simplicity, let m ≥ n. Then there exist an m × n real orthogonal matrix U (i.e. U^T U = I), an n-th order real orthogonal matrix V and an n-th order diagonal matrix Σ, having nonnegative diagonal elements σ_i = Σ_{i,i}, i = 1, ..., n, such that

A = U Σ V^T    (4.1)

and

σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_n ≥ 0.

This decomposition of A is called the (standard) Singular Value Decomposition (SVD) of A. The diagonal elements of Σ are called the singular values of A and the columns of U and V the (left and right) singular vectors of A.

Let r denote the rank of A. Then σ_r > 0 and (when r < n) σ_{r+1} = ⋯ = σ_n = 0. Thus, we can reformulate (4.1) as follows:

A = U_r Σ_r V_r^T,    (4.2)

where U_r is the matrix of the first r columns of U, and V_r similarly of V, and Σ_r is the r-th order upper left submatrix of Σ, i.e.

Σ_r = diag(σ_1, ..., σ_r).

This form of the decomposition is called the reduced SVD.

Singular Value Decomposition is a fundamental tool for solving several problems in numerical linear algebra, such as the calculation of:
- the rank of a matrix,
- the (minimum-norm) solution of linear least-squares problems,
- the pseudo-inverse of a matrix,
- the solution of homogeneous linear systems,
- a low-rank approximation of a matrix.

For instance, the pseudo-inverse of matrix A is given by

A^† = V_r Σ_r^{-1} U_r^T,

and the least-squares problem to find the vector x minimizing the norm of the residual b − Ax has the solution

x = A^† b = V_r (Σ_r^{-1} (U_r^T b)),

where the brackets indicate the preferred (i.e. most efficient) order of calculation; if r < n, then this value of x is the solution of minimal Euclidean vector norm in the space of all solutions of the given least-squares problem.
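As a concrete illustration (in Python/NumPy, using any SVD routine rather than the one-sided Jacobi code discussed next), the minimum-norm least-squares solution can be formed exactly in the bracketed order above; the relative rank threshold chosen here is a common default, not the paper's.

```python
import numpy as np

def min_norm_lstsq(A, b, rtol=None):
    """Minimum-norm least-squares solution x = A^+ b via the (reduced) SVD.

    Singular values below a threshold are treated as zero, so the numerical
    rank r is determined as in the text.  rtol defaults to a conventional
    choice, max(m, n) * machine epsilon (an assumption, not the paper's).
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
    if rtol is None:
        rtol = max(m, n) * np.finfo(float).eps
    r = int(np.sum(s > rtol * s[0]))                   # numerical rank
    # x = V_r ( Sigma_r^{-1} ( U_r^T b ) ), evaluated in the bracketed order
    c = U[:, :r].T @ b
    c /= s[:r]
    return Vt[:r, :].T @ c

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 5))
    A[:, 4] = A[:, 0] + A[:, 1]        # make A rank deficient (rank 4)
    b = rng.standard_normal(8)
    x = min_norm_lstsq(A, b)
    print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```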

Important applications are, for instance, filtering techniques for signal processing and data reduction techniques in digital image processing.


One-Sided Jacobi Iteration

The reduced SVD of A can be calculated by performing a sequence of plane rotations as follows:

X^(0) = A,
X^(k+1) = X^(k) R_k,    k = 0, 1, ..., K−1,    (4.3)

where R_k = R_k(p, q, θ) is a rotation in the (p, q)-coordinate plane and the rotation angle θ is chosen such that the p-th and the q-th columns of X^(k) become mutually orthogonal. The iteration is continued until for a certain K a matrix X = X^(K) is obtained whose columns are all approximately mutually orthogonal, i.e. X^T X is approximately equal to a diagonal matrix.

During the iteration process the columns of X^(k) may be interchanged according to a certain pivoting strategy, to be explained below, in order to (ensure and) accelerate convergence and to obtain a final matrix whose columns are in order of non-increasing Euclidean vector length. In view of this, we let each matrix R_k in (4.3) denote a rotation combined, where needed, with a permutation matrix corresponding to the interchange of two columns of X^(k) at the k-th step. Thus, we have

X = A V,    (4.4)

where V = R_0 ⋯ R_{K−1}.

From matrix X the reduced SVD of A is obtained as follows. Let x_j denote the j-th column of X and u_j that of U. The rank r of A is determined numerically as the smallest integer such that the Euclidean vector length ‖x_j‖ is smaller than a certain threshold for all j > r. Then the nonzero singular values σ_j and the corresponding left singular vectors u_j, for j = 1, ..., r, are

σ_j = ‖x_j‖,    u_j = x_j / σ_j.
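The following Python/NumPy sketch implements plain one-sided Jacobi iteration in the sense of (4.3)-(4.4): cyclic column pairs, ordinary (unscaled) rotations, accumulation of V, and a simple fixed orthogonality threshold. It is only an illustration of the method; the scaled rotations, the pivoting strategy and the threshold schedule described below, as used in routine PJSVDT, are not reproduced.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-14, max_sweeps=30):
    """Reduced SVD of A (m >= n) by one-sided Jacobi iteration (sketch).

    Returns U (m x n), the singular values in non-increasing order, and
    V (n x n) with A ~ U diag(sigma) V^T.
    """
    X = np.array(A, dtype=float)        # X^(0) = A; rotated in place
    m, n = X.shape
    V = np.eye(n)                       # accumulates the rotations R_k
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = X[:, p] @ X[:, p]
                beta = X[:, q] @ X[:, q]
                gamma = X[:, p] @ X[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue            # columns already (nearly) orthogonal
                rotated = True
                # rotation angle that orthogonalises columns p and q
                zeta = (beta - alpha) / (2.0 * gamma)
                t = np.copysign(1.0, zeta) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                for M in (X, V):        # X^(k+1) = X^(k) R_k and V^(k+1) = V^(k) R_k
                    xp, xq = M[:, p].copy(), M[:, q].copy()
                    M[:, p] = c * xp - s * xq
                    M[:, q] = s * xp + c * xq
        if not rotated:
            break
    sigma = np.linalg.norm(X, axis=0)   # column lengths are the singular values
    order = np.argsort(sigma)[::-1]     # non-increasing order
    sigma = sigma[order]
    X, V = X[:, order], V[:, order]
    U = X / np.where(sigma > 0, sigma, 1.0)   # u_j = x_j / sigma_j
    return U, sigma, V

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    A = rng.standard_normal((9, 4))
    U, s, V = one_sided_jacobi_svd(A)
    print(np.allclose(A, U @ np.diag(s) @ V.T))
    print(np.allclose(s, np.linalg.svd(A, compute_uv=False)))
```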

Scaled Jacobi Rotations

A plane rotation in its classical form requires 4m multiplications and 2m additions. It can, however, be reformulated, in the same way as the modified Givens rotations [12,1], such that only 2m multiplications and 2m additions are needed. To achieve this, X^(k) is written in the form

X^(k) = Y^(k) D^(k),

where D^(k) is an n-th order diagonal matrix of scale factors for the columns of Y^(k). These scale factors are chosen such that the updated p-th and q-th columns can be obtained by means of two vector plus scalar times vector operations.

One has to check regularly, after a certain number of steps (depending on the floating-point exponent range), that no overflow occurs in the elements of D^(k). Overflow will rather seldom happen, though, as the growth factor for these elements is between 1 and 2 in each step.

Another modification of the plane rotation makes it possible that the p-th and q-th columns of Y^(k+1) can overwrite the corresponding columns of Y^(k) without copying one of these vectors into auxiliary storage. This possibility depends, however, on the fact that the rotation angle θ for these Jacobi rotations can always be chosen such that |θ| < ¼π [26].

Pivoting Strategy and Stopping Criterion

The iteration process consists of consecutive sweeps which basically proceed as follows. In each sweep all possible subscript pairs (p, q) are selected once to perform the corresponding plane rotation. Thus, the number of subscript pairs and plane rotations in each sweep is ½n(n−1). For a sequential algorithm, a good pivoting strategy is as follows.

Each sweep is subdivided into n−1 consecutive subsweeps. In the p-th subsweep, p = 1, ..., n−1, two columns are interchanged, if needed, such that the Euclidean vector length of the p-th column is at least as large as that of the j-th column for all j > p; subsequently, rotations in the planes (p, q) are performed for q = p+1, ..., n.

Using this strategy, all possible pairs of columns are treated, convergence is guaranteed and, since the pivoting strategy tends to treat the larger columns first, the convergence rate may be improved.

In practice, only the non-negligible columns are treated, i.e. the columns of Euclidean vector length larger than the threshold mentioned above. Moreover, not all possible rotations are carried out. Rotations are omitted for those pairs of columns for which the cosine of the angle between them has a modulus not larger than a certain threshold τ, according to the formula

|(x_p^(k))^T x_q^(k)| ≤ τ ‖x_p^(k)‖ ‖x_q^(k)‖.    (4.5)

This threshold is kept constant during one sweep, but is gradually decreased for subsequent sweeps, at a rate corresponding to the quadratic convergence of Jacobi iteration, and, from a certain sweep onward, is kept at a certain minimal value, μ, depending on the machine precision. The iteration stops when, for all pairs of non-negligible columns, inequality (4.5) holds where τ has its minimal value μ.

Thus, the stopping criterion is determined by the two threshold parameters τ and μ, and a formula to obtain subsequent values of τ for a certain number of initial sweeps.

Vectorisation and Parallelisation Aspects

As stated before, the vector plus scalar times vector operations that perform the scaled Jacobi rotations can be efficiently executed on a vector computer.

Another optimization can be obtained when the right singular vectors (matrix V) are also required. Then the plane rotations, with column interchanges, must be accumulated according to

V^(0) = I,
V^(k+1) = V^(k) R_k,    k = 0, 1, ..., K−1,    (4.6)

yielding the final matrix V = V^(K). This formula is completely analogous to (4.3) for calculating X^(k+1) from X^(k). It can, therefore, be modified as described above to obtain scaled Jacobi rotations updating the matrix of right singular vectors, where obviously the same scaling matrix D^(k) is used. Hence, storing matrices X^(k) and V^(k) such that each column of X^(k) and the corresponding column of V^(k) are stored contiguously in memory, the calculation of the p-th columns of X^(k+1) and V^(k+1) can be executed by one vector operation, and similarly the q-th columns of these two matrices. Thus, the start-up time needed for these vector operations is reduced by a factor of two.

In a parallel environment, a sweep of Jacobi rotations can be performed as follows. The columns of X^(k), and similarly those of V^(k), are subdivided into entier(½n) disjoint pairs, which can be treated in parallel. Thus, generating all possible different subdivisions into disjoint pairs in a cyclic order, a complete sweep can be performed in n subsequent parallel steps [2].
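A standard round-robin construction generates the required subdivisions of the column indices into disjoint pairs; the sketch below is one common way to do this in Python, not necessarily the exact ordering of [2].

```python
from itertools import combinations

def round_robin_pairings(n):
    """All-pairs orderings: each round is a set of disjoint (p, q) pairs.

    Classic round-robin construction: for even n this yields n-1 rounds of
    n/2 pairs, for odd n it yields n rounds (one column idles per round),
    and every pair (p, q) occurs exactly once over the whole sweep.
    """
    cols = list(range(n)) + ([None] if n % 2 else [])   # None is a 'bye'
    m = len(cols)
    rounds = []
    for _ in range(m - 1):
        pairs = []
        for i in range(m // 2):
            a, b = cols[i], cols[m - 1 - i]
            if a is not None and b is not None:
                pairs.append((min(a, b), max(a, b)))
        rounds.append(pairs)
        # rotate all positions except the first
        cols = [cols[0]] + [cols[-1]] + cols[1:-1]
    return rounds

if __name__ == "__main__":
    rounds = round_robin_pairings(7)
    covered = sorted(p for r in rounds for p in r)
    print(len(rounds), covered == sorted(combinations(range(7), 2)))
```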

Numerical Experiments

The results of experiments on the numerical properties and timing of some variants of this algorithm with different pivoting strategies and stopping criteria have been published in [26]. We here summarize the results of some timing experiments.

Routine PJSVDT, implementing one-sided Jacobi with pivoting and stopping criterion as sketched above, was called with its threshold parameters chosen in terms of the machine precision (≈ 10^{-14}). Moreover, this routine was compared with two implementations of the algorithm of [13], namely routine SSVDC from LINPACK [7] and a similar routine from NAG [23].

The test matrices used were, among others, some special square matrices T_n of various orders n [13] and some rectangular m × n matrices S_n, m = 2n, whose elements were randomly chosen in the interval (−1, 1).

Tables 3 and 4 give some timing results of routines PJSVDT and SSVDC for some of the test matrices mentioned. The CP-times for the corresponding routine from NAG, not given here, were (much) larger.

These results indicate that one-sided Jacobi is at least competitive with the Golub-Reinsch algorithm.

Table 3
CP time in seconds for square n-th order matrices T_n for various n
SSVDC: 1.246  5.317  13.779  27.901

Table 4
CP time in seconds for rectangular random matrices of various dimensions


An optimized version of PJSVDT and a routine to calculate the (minimum-norm) solution of a linear least-squares system have been included in the NUMVEC FORTRAN library [27].

References

[1] J.L. Barlow and I.C.F. Ipsen, Scaled Givens rotations for the solution of linear least-squares problems, SIAM J. Sci. Statist. Comput. 8 (1987) 716-733.

[2] R.P. Brent and F.T. Luk, The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays, SIAM J. Sci. Statist. Comput. 6 (1985) 69-84.

[3] P.A. Businger, Monitoring the numerical stability of Gaussian elimination, Numer. Math. 16 (1971) 360-361.

[4] J.-P. Charlier and P. van Dooren, On Kogbetliantz's SVD algorithm in the presence of clusters, Linear Algebra Appl. 95 (1987) 135-160.

[5] T.J. Dekker and W. Hoffmann, Rehabilitation of the Gauss-Jordan algorithm, Report 86-28, Dept. of Mathematics, University of Amsterdam, 1986.

[6] T.J. Dekker and W. Hoffmann, Numerical improvement of the Gauss-Jordan algorithm, in: A.H.P. van der Burgh and M.M. Mattheij, Eds., Proc. ICIAM 87, Contributions from the Netherlands, Paris-La Villette, 1987, pp. 143-150.

[7] J.J. Dongarra, C.B. Moler, J.R. Bunch and G.W. Stewart, LINPACK User's Guide (SIAM, Philadelphia, PA, 1979).

[8] J.J. Dongarra, F.G. Gustavson and A. Karp, Implementing linear algebra algorithms for dense matrices on a vector pipe-line machine, SIAM Rev. 26 (1984) 91-112.

[9] J.J. Dongarra and D.C. Sorensen, Linear algebra on high-performance computers, in: M. Feilmeier et al., Eds., Parallel Computing 85, Proc. 2nd Internat. Conf., Universität Berlin (North-Holland, Amsterdam, 1986) 3-32.

[10] J. Dongarra, J. Du Croz, S. Hammarling and R.J. Hanson, An extended set of basic linear algebra subprograms, ACM TOMS 14 (1988) 1-17.

[11] J.J. Dongarra, J. Du Croz, S. Hammarling and R.J. Hanson, Algorithm 656: an extended set of basic linear algebra subprograms: model implementation and test programs, ACM TOMS 14 (1988) 18-32.

[12] M. Gentleman, Least squares computations by Givens transformations without square roots, J. Inst. Math. Appl. 12 (1973) 329-336.

[13] G.H. Golub and C. Reinsch, Singular value decomposition and least-squares solutions, Numer. Math. 14 (1970) 403-420.

[14] M.R. Hestenes, Inversion of matrices by biorthogonaliza- tion and related results, SIAM J. Appl. Math. 6 (1958) 51-90.

[15] W. Hoffmann, Solving linear systems on a vector com- puter, J. Comput. Appl. Math. 18 (1987) 353-367.

[16] W. Hoffmann, Chapter Simultaneous linear equations, update #1, Report NM-R8712, in: NUMVEC Fortran Library manual, Centre for Mathematics and Computer Science, Amsterdam, 1987.

[17] W. Hoffmann and W. Lioen, Chapter Simultaneous linear equations, Report NM-R8614, in: NUMVEC Fortran Library manual, Centre for Mathematics and Computer Science, Amsterdam, 1986.

[18] E.R. Jessup and D.C. Sorensen, A parallel algorithm for computing the Singular Value Decomposition of a matrix, Techn. Mem. No. 102, Math. & Comput. Sci. Division, Argonne Nat. Lab., Argonne, IL, 1987.

[19] Ch.H.J. Johnson, Matrix arithmetic on the Cyber 205, Supercomputer 8/9 (1985) 28-42.

[20] E. Kogbetliantz, Solution of linear equations by diagonali- zation of coefficient matrices, Quart. Appl. Math. 13 (1955) 123-132.

[21] C. Lawson, R. Hanson, D. Kincaid and F. Krogh, Basic Linear Algebra Subprograms for Fortran usage, ACM TOMS 5 (1979) 308-323.

[22] F.T. Luk, A rotation method for computing the QR decomposition, SIAM J. Sci. Statist. Comput. 7 (1986) 441-451.

[23] NAG, NAG FORTRAN Library manual, Mark 11, Numerical Algorithms Group Ltd, Oxford, 1984.

[24] G. Peters and J.H. Wilkinson, On the stability of Gauss-Jordan elimination with pivoting, Comm. ACM 18 (1975) 20-24.

[25] QQLIB, A library of utility routines and math. algorithms on the Cyber 200, Cyber 200 Support, Roseville, Minnesota.

[26] P.P.M. de Rijk, A one-sided Jacobi algorithm for computing the singular value decomposition on a vector computer, Report 86-21, Department of Mathematics, University of Amsterdam, 1986; also: SIAM J. Sci. Statist. Comput., to appear.

[27] P.P.M. de Rijk, Chapter Simultaneous linear equations, Routines SVDTJP and LSQMNS, Report NM-R8719, in: NUMVEC Fortran Library manual, Centre for Mathematics and Computer Science, Amsterdam, 1987.

[28] J.H. Wilkinson, Error analysis of direct methods of matrix inversion, J. ACM 8 (1961) 281-330.