
Computers & Structures Vol. 21, No. 5, pp. 1025-1034, 1985. Printed in Great Britain.

0045-7949/85 $3.00 + .00 © 1985 Pergamon Press Ltd.

VARIATION IN EFFICIENCY OF PARALLEL ALGORITHMS

AKIKO HAYASHI, ROBERT J. MELOSH and SENOL UTKU, Department of Civil Engineering, Duke University, Durham, NC 27705, U.S.A.

and

MOKTAR SALAMA, Jet Propulsion Laboratory, Pasadena, CA, U.S.A.

(Received 13 February 1984)

Abstract-This paper presents an investigation of the variation in efficiency of a parallel algorithm as a function of the number of processors. A parallel algorithm was developed which predicts the behavior of a typical linear engineering system by solving a set of linear equilibrium equations. Experimental results for the algorithm were obtained by simulating the parallel process on a uniprocessor. The results showed the parallel method to have certain advantages over certain sequential methods, suggesting that this parallel algorithm can be competitive against some sequential algorithms.

1. INTRODUCTION

The objective of this study is to investigate some iterative parallel-processor linear equation solving algorithms with respect to efficiency for analyses of typical linear engineering systems.

Parallel processing is an alternative to the sequential processing of conventional computers. The idea of parallel processing is to use a number of processing machines working simultaneously on a problem to solve it faster than just one machine can.

The basic equations used in predicting the linear behavior of a typical engineering system are a set of n linear equations:

Ku = p, (1.1)

where K = an n x n positive definite, sparsely populated, symmetric matrix; u = an n x 1 vector of unknown responses; p = an n x 1 vector of prescribed constants; and bold-face type characters denote vector or matrix variables.

Two basic groups of procedures exist to solve these equations: direct methods and iterative methods. In direct methods, the solution is obtained with a predetermined number of calculations. The factorization methods (Gauss elimination, Crout's method and Cholesky decomposition) are among the more popular of these methods. The iterative methods require an initial approximation. The solution is then obtained by a sequence of converging approximations with an undetermined but finite number of calculations.

This study examines a hybrid method in which iteration is used to solve the problem, but a direct method is used on the local processor level.

Timing is the basis on which the efficiency of the processing is determined.

Three factors influence the computation time in parallel processing. One is the time per sweep, i.e. the time for each processor to compute its new approximation. This time is dependent on the algorithm and the number of solutions it must develop. The second factor is the waiting time. This is the time a machine must stand idle while it is waiting for the other processors to finish, or waiting for the data transfer between processors. The first component of the idle time depends on how well synchronized the algorithm is, and the transfer time is dependent on the machine capabilities. The third time factor depends on the rate of convergence of the solution estimates. Both the problem being solved (i.e. the properties of K) and the algorithm affect the rate of convergence.

Previous work [1, 2, 6] examines the suitability of different algorithms to parallel processing. Experimental results show time improvements achieved by some of these methods over sequential methods [1, 2, 6].

The current paper defines the algorithm used in this study and the results of experiments. The next section details the hybrid solution algorithm and discusses its suitability to parallel processing. Section 3 describes the testing setup for the experiments. Section 4 presents the experimental results. The last section reviews the conclusions of this study.

2. THE PARALLEL ALGORITHM

This section describes the hybrid iterative algorithm and its implementation in the parallel processing mode. It discusses the process, its computer implementation, and the conditions for convergence.

To implement the procedure in the parallel processing mode, the coefficient matrix is divided among the processors. Consider a coefficient matrix K with R rows and R columns. The coefficient matrix is divided among P processors, where processor i holds N_i (i = 1, 2, . . . , P) rows of the matrix.

Each processor decomposes the square matrix of its principal diagonal partition and solves for the unknowns associated with it. The coefficients of the nondiagonal partition matrices are multiplied by the corresponding approximations u, summed with the right-hand side vector of prescribed constants, and used for obtaining the partition unknowns.

Suppose the coefficient matrix is divided into three parts:

        | K11  K12  K13 |
    K = | K21  K22  K23 | ,                                  (2.1)
        | K31  K32  K33 |

where u^T = (u1  u2  u3) and subscripts denote the vector partitions.

The equations for computing the new approximations of the solution from a previous one are

    K11 u1^(i+1) = p1 - K12 u2^(i) - K13 u3^(i),
    K22 u2^(i+1) = p2 - K21 u1^(i) - K23 u3^(i),             (2.2)
    K33 u3^(i+1) = p3 - K31 u1^(i) - K32 u2^(i),

where u^(i) = the ith approximation to the solution u and K11, K22, K33 = the principal diagonal partition matrices. The approximation u^(i+1) is then used to update values on the right-hand side, and the process is repeated until the approximations achieve the desired accuracy.

The calculations which must be repeated on every iteration are the ones which compute the new vector of constants on the right-hand side, and the forward and backward pass which yields the new approximations. Each processor only solves for the unknowns associated with its principal diagonal elements. The remaining unknowns are retrieved from the other processors and used in the right-hand side computation. This defines a "hybrid block Jacobi" iteration.

To minimize calculations, decomposition is only required during the first iteration. The decomposition process evaluates the exact solution on the local processor level, but since each processor has only part of the problem, it cannot find the exact solution on the global level.
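To make the data flow concrete, the following is a minimal sketch of the hybrid block Jacobi iteration described above, simulating the P processors with a loop. It is an assumed Python/NumPy illustration (the function name, the use of SciPy's Cholesky routines, and the fixed iteration count are ours), not the authors' original implementation.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def hybrid_block_jacobi(K, p, P, n_iter=20):
        """Simulate P processors, each holding a block of rows of K."""
        n = K.shape[0]
        blocks = np.array_split(np.arange(n), P)          # rows assigned to each processor
        # first iteration only: factor each principal diagonal partition (Cholesky)
        factors = [cho_factor(K[np.ix_(b, b)]) for b in blocks]
        u = np.zeros(n)
        for _ in range(n_iter):
            u_new = np.empty(n)
            for b, f in zip(blocks, factors):
                others = np.setdiff1d(np.arange(n), b)
                rhs = p[b] - K[np.ix_(b, others)] @ u[others]   # update right-hand side
                u_new[b] = cho_solve(f, rhs)                    # forward and backward pass only
            u = u_new                                           # every block used the previous sweep
        return u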

The Cholesky decomposition algorithm is used as part of the algorithm. It is a direct factorization method for solving a system of linear equations with a real, symmetric positive definite matrix.

The implementation of the Cholesky decomposition algorithm in this study exploits the sparseness and bandedness of the coefficient matrix to make the algorithm more efficient in terms of storage usage and the number of calculations performed. Because of symmetry, the computer stores only the upper half-band portion of the coefficient matrix. Also, the algorithm overwrites the original values of the coefficient matrix with the decomposition. When the decomposition is complete, the forward and backward passes yield the unknowns.
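The overwrite-in-place decomposition and the forward and backward passes can be sketched as follows. This dense upper-triangular version is an assumed illustration (function names are ours) and omits the half-band storage scheme used in the paper.

    import numpy as np

    def cholesky_overwrite_upper(K):
        """Overwrite the upper triangle of K with U such that K = U^T U
        (dense sketch; the paper's version also exploits the half-band structure)."""
        n = K.shape[0]
        for i in range(n):
            K[i, i] = np.sqrt(K[i, i] - K[:i, i] @ K[:i, i])
            for j in range(i + 1, n):
                K[i, j] = (K[i, j] - K[:i, i] @ K[:i, j]) / K[i, i]
        return K

    def forward_backward(U, p):
        """Forward pass (solve U^T y = p) followed by backward pass (solve U x = y)."""
        n = U.shape[0]
        y = np.empty(n)
        for i in range(n):                        # forward substitution
            y[i] = (p[i] - U[:i, i] @ y[:i]) / U[i, i]
        x = np.empty(n)
        for i in range(n - 1, -1, -1):            # backward substitution
            x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x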

The basic form of the system of linear equations for the iteration is

    x^(k) = A x^(k-1) + F,                                   (2.3)

where k = 1, 2, 3, . . . . There are a number of iterative methods that fall into the general form of eqn (2.3). Each method differs from the others slightly, but the integrity of the basic form remains.

Proof. Substituting the previous expression of the vector x^(k-1) into the next, the sequence takes the form

    x^(k) = A^k x^(0) + (I + A + A^2 + ... + A^(k-1)) F.     (2.4)

If the series (I + A + ... + A^(k-1)) converges, the iteration also converges. When the modulus (spectral radius) of A is less than one, A^k approaches 0 as k approaches infinity, i.e. (I + A + ... + A^(k-1)) F has the limit (I - A)^(-1) F [4].

Hybrid block Jacobi is a modified form of the iteration process. The convergence of the problem Ku = p is confirmed by first transforming it to the form u = Au + b. If the modulus of the scaled matrix A is less than unity, the problem converges.
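As an assumed illustration of this convergence test (the function name and the dense formation of A are ours, not from the paper), one can form the iteration matrix for the block splitting K = D + R, where D holds the principal diagonal partitions, and check its spectral radius numerically:

    import numpy as np

    def block_jacobi_converges(K, block_sizes):
        """Form A = -D^{-1} R for the block splitting K = D + R and test rho(A) < 1."""
        D = np.zeros_like(K)
        start = 0
        for size in block_sizes:              # copy each principal diagonal partition
            D[start:start + size, start:start + size] = K[start:start + size, start:start + size]
            start += size
        R = K - D                             # off-diagonal partitions
        A = -np.linalg.solve(D, R)            # iteration matrix A = -D^{-1} R
        rho = max(abs(np.linalg.eigvals(A)))  # spectral radius ("modulus" of A)
        return rho < 1.0, rho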

We consider one more parallel algorithm, the Jacobi algorithm. No modifications are necessary to implement it in the parallel computer mode. The set of equations describing it is

    x1 = (f1 - k12 x2 - k13 x3 - ... - k1n xn) / k11,
    x2 = (f2 - k21 x1 - k23 x3 - ... - k2n xn) / k22,
    . . .                                                    (2.5)
    xn = (fn - kn1 x1 - kn2 x2 - ... - k(n,n-1) x(n-1)) / knn.

This set of equations was divided equally among the processors. This method is referred to as the "partition Jacobi" algorithm in this study. The one modification to the process supporting eqn (2.3) is that the x_i variable being solved for is completely eliminated from the right-hand side of its equation, and the equation is divided through by its diagonal coefficient k_ii.
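A one-function sketch of a partition Jacobi sweep, assuming NumPy (illustrative only; the name is ours):

    import numpy as np

    def partition_jacobi_sweep(K, f, x):
        """One sweep of eqn (2.5): every unknown is updated from the previous
        approximation only, so rows can be split among processors arbitrarily."""
        d = np.diag(K)                      # k_11, ..., k_nn
        return (f - K @ x + d * x) / d      # x_i = (f_i - sum_{j != i} k_ij x_j) / k_ii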


Note that this algorithm is not the main focus of the study. Its implementation provides data for comparisons with the hybrid block Jacobi algorithm.

3. EXPERIMENTAL PROCEDURES

The algorithms were implemented on the IBM personal computer, with special logic to simulate the parallel computation. The additional simulation code reads and writes data to and from disk to simulate the parallel flow of data. The disks serve as the common memory for all the processors.

The number of digits of accuracy of each iteration is calculated as

    digit accuracy = -log10( || u_est - u_exact ||_E / || u_exact ||_E ),

where u_est = the approximation to u_exact, u_exact = the "exact" solution obtained by direct solution of the equations, and || . ||_E denotes the Euclidean norm. This is a measure of convergence. The convergence rate is in units of the number of digits of accuracy gained per iteration.
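For reference, the accuracy measure can be computed as below, assuming the Euclidean norm and mirroring the variable names used in the text (the function name is ours):

    import numpy as np

    def digits_of_accuracy(u_est, u_exact):
        """Digits of accuracy of an approximation, using the Euclidean norm."""
        return -np.log10(np.linalg.norm(u_est - u_exact) / np.linalg.norm(u_exact))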

Convergence of the parallel process is not affected by the simulation, because the flow of data is identical to that of the parallel computation. The simulation time, however, differs greatly from the parallel processing time, since in the simulation all the processors are run sequentially. The simulation computation time is the time it takes for all processors to complete their computations, while the parallel computation time is the time it takes for the slowest processor to finish. So the time of only one processor is recorded, and that time is taken as representative of all the other processors.

The built-in timer in the computer records the computation time of each processor. The disk reading time is not included in the time measure. In this study, the computation times of the two processes were compared. If parallel computation cannot compete against sequential processing in computation time alone, it will not be able to compete once the added time of data transfer is included.

The coefficient matrices used in running the experiments were structural stiffness matrices. They were chosen because, in the properly supported state (i.e. with all rigid body movements prevented), they possess all the properties required by the algorithm being tested: they are real, symmetric, positive definite, and banded.

The examples used in the study were the stiffness matrices of planar trusses. For these specific examples, u became the vector of nodal displacements, and p became the vector of prescribed loadings. All trusses used in the experiments were of one basic configuration. An 8-node truss, a 14-node truss, and a 26-node truss were used in the experiments. They are shown in Fig. 1.

Fig. 1. An 8-node, 14-node, and 26-node truss.

The LARS (lattice analysis and redesign system) program was used in this study. It is a finite element program developed at Duke University which analyzes lattice element structures.

Each processor solves the same number of equations in all the tests. This imposes more uniformity on the performance of the processors and on their results. It adds validity to the assumption made concerning parallel timing: that the computation of one processor is representative of all the others. Moreover, it is easier to analyze and compare results between the simulations when the numbers of equations in them are equivalent.

To satisfy this requirement, trusses corresponding to 12, 24, and 48 equations were used in the tests. These numbers were chosen because they have many integer factors, which allows more data to be collected per problem. The results of the experiments are functions of the number of processors or, equivalently, of the number of equations each processor solves. With more data available, the trend between the variables becomes more evident. The three trusses chosen correspond to these numbers of equations when the nodes with prescribed deflections are not counted as part of the equation set.

Three combinations of the two algorithms were created, and each was used to solve the truss problems. The performance of each was compared against the others in both the parallel processing and uniprocessing modes. The three combinations were: hybrid block Jacobi, partition Jacobi, and combination Jacobi. Combination Jacobi uses both hybrid block and partition Jacobi. It first uses the hybrid block Jacobi algorithm to solve its set of equations and then follows with a partition Jacobi sweep in each iteration.
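A minimal sketch of one combination Jacobi cycle under the same assumptions as the earlier sketches (a hybrid block Jacobi update followed by one partition Jacobi sweep); the structure is our reading of the description above, not the authors' code, and the factored diagonal partitions are assumed to come from scipy.linalg.cho_factor as before.

    import numpy as np
    from scipy.linalg import cho_solve

    def combination_jacobi_cycle(K, p, u, blocks, factors):
        """One cycle: hybrid block Jacobi update, then one partition Jacobi sweep."""
        n = K.shape[0]
        u_block = np.empty(n)
        for b, f in zip(blocks, factors):             # block step with factored partitions
            others = np.setdiff1d(np.arange(n), b)
            u_block[b] = cho_solve(f, p[b] - K[np.ix_(b, others)] @ u[others])
        d = np.diag(K)                                # partition Jacobi sweep on the result
        return (p - K @ u_block + d * u_block) / d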

4. SIMULATION RESULTS

This section discusses the results of the simulation. Each set of results represents a run with a different number of processors. Comparisons made within the set of runs of each truss disclose the effects the number of processors has on convergence rate and time.

The focus of the results is on the 26-node truss. It is the largest problem tested in the study, so it is not as susceptible to fluctuations in the results as the smaller trusses were. Furthermore, the timing resolution was adequate to record a regular progression of times for each set of results. Results from the smaller trusses suggest that the conclusions drawn from the 26-node truss apply to them also.

The timing results for the 26-node problem are all based on precompiled BASIC code run times. In comparing the overall method efficiencies of partition and hybrid block Jacobi, the interpreted-code times were used, because the resolution of the timer was not adequate to reflect the changes in time for the precompiled version. Thus, two separate runs were made: one to obtain the interpreter time and the second to obtain the convergence rate. Twenty cycles were used to obtain the convergence rate to ensure the elimination of flutter in the results. As for timing, only one cycle was required to obtain the timing data for partition Jacobi. Two cycles were necessary for hybrid block Jacobi, since the first cycle time included decomposition, while the second one did not.

Figures 2(a) through 2(e) are plots of the number of cycles vs the digits of accuracy for the 8-node truss, the 14-node truss, and the 26-node truss. The slopes of the curves measure the convergence rates.

The curves suggest that as the number of processors increases, the convergence rate decreases. This trend is consistently broken at one number of processors for every truss problem tested in the study. For the 8-node truss, the pattern breaks when the curve corresponding to the 4-processor case goes above the 3-processor curve. For the 14-node truss, the 8-processor curve goes above the 6-processor curve, and in the 26-node truss, the 16-processor curve jumps above the 12-processor curve. At these jumps, the convergence rate increases as the number of processors increases.

A pattern exists for this unexpected behavior. In all three problems, the jumps occurred at the same location relative to the size of each truss: always between the second-to-last processor case and its preceding case. This occurrence is thought to be related to the topological connectivity of the structure and the bandwidth; in effect, it is a characteristic of the coefficient matrix.

Fig. 2(a). Combination Jacobi, 8-node truss.

Fig. 2(b). Block Jacobi, 8-node truss.

Fig. 2(c). Combination Jacobi, 14-node truss.

Fig. 2(d). Combination Jacobi, 26-node truss.

Fig. 2(f). The plot of the number of equations against the convergence rate (combination Jacobi and block Jacobi data points).

The convergence rate and the iteration times corresponding to the three methods used for solving the 26-node truss are given in Table 1. The convergence rate of partition Jacobi remains constant with respect to the number of processors. It remains constant because its computations are performed totally in parallel, so the process is unaffected by the separation of the problem. The convergence rate of hybrid block and combination Jacobi, on the other hand, declines as the number of processors increases. This results because each processor solves the unknowns with an increasingly smaller portion of the problem as the coefficient matrix is divided. One processor solves the equations in one cycle; this is the Cholesky decomposition method. When the number of processors is equal to the number of equations, the hybrid block Jacobi method reduces to the partition Jacobi method.

The convergence rate of hybrid block and combination Jacobi declines quite rapidly in the beginning and then levels off. The rapid decline in the beginning reflects the fact that the reduction in the number of equations per processor is initially great. It levels off when the change in the number of equations per processor becomes small. The plot of the number of equations against the convergence rate in Fig. 2(f) illustrates this behavior.

Block Jacobi's (the shortened form of hybrid block) initial convergence rate is five times greater than partition Jacobi's convergence rate, and combination Jacobi's convergence rate is almost ten times greater. Combination Jacobi's initial convergence rate therefore exceeds the simple sum of the two convergence rates. This suggests that improved approximations may have beneficial effects on the convergence of block Jacobi. Partition Jacobi improves the block Jacobi results on each cycle, which implies that the right-hand side of eqn (2.2) becomes a closer approximation. This seems to improve the results of each partition, and thus the convergence rate of the entire process. However, as the number of processors increases this effect appears to lose its influence, and the convergence rate becomes the sum of those of the two methods, block and partition Jacobi.

Table 2 displays the process times of each of the methods. The processing time is the time it takes for the process to reach a solution of the desired accuracy. The processing time is calculated using the convergence rate and the cycle times.


Table 1. Simulation results for the 26-node truss†

Processors   Partition time‡   Block convergence   Block time§   Combination convergence   Combination time‖
     1             68                  -                -                   -                     -
     2             33               .0027           170 / 46             .0049                170 / 46
     3             22               .0020            76 / 31             .0035                 76 / 31
     4             17               .0016            48 / 24             .0026                 48 / 24
     6             12               .0010            27 / 17             .0018                 27 / 17
     8              9               .0011            19 / 14             .0017                 19 / 14
    12              6               .0006            12 / 10             .0010                 12 / 10
    16              5               .0007            10 / 8              .0015                 10 / 8
    24              3               .0005             7 / 6              .0009                  7 / 6
    48              2               .0005             5 / 4              .0009                  5 / 4

† Convergence rate is measured in digits of accuracy per iteration; time is in seconds.
‡ The convergence rate for partition Jacobi is .0005 for all cases.
§ The first time listed includes the decomposition process; the second is for the forward and backward pass only.
‖ The first time includes decomposition; the second does not.


Because block and combination Jacobi retain a greater cycle time, partition Jacobi eventually becomes the most efficient method among the three. This occurs between processors eight and twelve, as shown in Table 2. Block Jacobi eventually reduces to the partition Jacobi method but still retains a greater cycle time, so its efficiency continues to decline in comparison to partition Jacobi. Combination Jacobi also continues to decline in efficiency against partition Jacobi due to its much greater cycle time.

The interest in computing the time for a second, higher accuracy of four digits was to determine whether the initial decomposition in the first iteration had any significant effects that would appear later in the comparison between partition and block Jacobi. No obvious effects were evident. Also, the processing times in Table 3 were compared to the time it took to solve the problem with only one processor in the hybrid block Jacobi method, which in effect is the Cholesky method. A solution with four significant digits was assumed to be comparable to the direct method solutions. In terms of wall clock time, the cost of parallel processing is almost 40 times greater for even the lowest processing time case, and 80 times greater for the highest processing time case of combination Jacobi.

Table 4 displays the total computer resource time for each method. The computer resource time is calculated by multiplying the process time by the number of processors; this provides a measure of the computer cost. The computer resource time increases with an increasing number of processors for hybrid block and combination Jacobi. For partition Jacobi, the computer resource time remains constant at specified accuracies. Again, the reason is that the convergence rate is independent of the number of processors. The slight increase in the resource time seen in the data is a result of the loss in resolution of the timer in measuring the time for a small number of equations.
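As a hypothetical reconstruction of how these timing measures appear to be related (the function names and the linear cycle model are assumptions, not taken from the paper), the processing and resource times can be estimated from the per-cycle data of Table 1:

    def processing_time(target_digits, conv_rate, first_cycle_s, later_cycle_s):
        """Seconds to reach target_digits at conv_rate digits of accuracy per cycle."""
        cycles = target_digits / conv_rate
        return first_cycle_s + (cycles - 1.0) * later_cycle_s

    def resource_time(target_digits, conv_rate, first_cycle_s, later_cycle_s, processors):
        """Computer resource time: processing time multiplied by the number of processors."""
        return processors * processing_time(target_digits, conv_rate,
                                            first_cycle_s, later_cycle_s)

    # Example with the 2-processor block Jacobi entries of Table 1
    # (.0027 digits/cycle, 170 s first cycle, 46 s per later cycle):
    print(resource_time(0.01, 0.0027, 170.0, 46.0, 2))   # about 588 s, cf. Table 4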

Hybrid block and combination Jacobi initially require smaller computer resource time than partition Jacobi. But once again, because hybrid block and combination Jacobi have increasing computer resource times, partition Jacobi's resource time eventually becomes the smallest, thus making it the most efficient method of the three.

Table 4. Comparison of computer resource times† (calculated for .01 digits of accuracy)

Processors   Partition   Block   Combination
     1          1360       -          -
     2          1320      588        564
     3          1320      600        597
     4          1360      792        720
     6          1440     1080        852
     8          1440     1056       1120
    12          1440     2028       1944
    16          1600     1824       1424
    24          1440     2904       2424
    48          1600     3888       3264

† Time is in seconds.

This occurs at the twelfth processor, which was the same location as that of the processing time. The patterns here are the same as those of the processing times, and the reasons are also identical.

5. CONCLUSIONS

Investigation of the variations in efficiency of parallel algorithms is the objective of this study. Measures of the efficiency have been based on computer experiments on these algorithms.

In regard to convergence, block and combination Jacobi show better convergence rates than partition Jacobi. Combination Jacobi has a convergence rate somewhat better than that of hybrid block Jacobi. Block and combination Jacobi reduce to partition Jacobi at the maximum number of processors, when each processor solves one equation.

The wall clock time decreases as the number of processors increases for all the algorithms. Table 2 shows this trend for the lower numbers of processors. Combination Jacobi has the least wall clock time, followed by block, and then partition Jacobi. For the higher numbers of processors, the partition Jacobi method becomes the best.

In contrast to the decrease in wall clock time, the computer resource time increases as the number of processors increases for all the parallel algorithms. Thus, the usage of machines becomes increasingly inefficient as more processors are used. Inefficiency is initially the greatest in partition Jacobi, followed by block Jacobi, and then combination Jacobi. Hybrid block and combination Jacobi eventually become more inefficient than partition Jacobi, due once again to the effect of their decreasing convergence rate.

Comparisons between the parallel computations and the uniprocessor Cholesky computation show that all the parallel algorithms are much more costly than the sequential one. Comparisons were made on the basis of four-digit accuracy in Table 3. The best time in the parallel processing mode was 40 times greater than the time of Cholesky. In the comparison between the parallel methods and uniprocessor partition Jacobi, all the parallel methods were faster.

The parallel algorithms compete with the iterative uniprocessor Jacobi, but are unable to do so with the direct method. However, for huge problems and a limited memory space, parallel computing algorithms such as the one developed here may be more advantageous than the uniprocessor direct or iterative methods. Also, since the factorization methods experience a rapid increase in computation time as the number of equations grows, the parallel methods may become competitive with them if many processors are used.

This investigation shows that the hybrid method developed is more effective in solving problems in the parallel mode than partition Jacobi. Results indicate it holds promise of being competitive with some sequential algorithms.

Acknowledgements-We wish to acknowledge the Jet Propulsion Laboratory's support of this research.

REFERENCES

1. Gerard M. Baudet, Asynchronous iterative methods for multiprocessors. J. Assoc. Comput. Mach. 25, 226-244 (1978).
2. Victor Conrad and Yehuda Wallach, Iterative solution of linear equations on a parallel processor system. IEEE Trans. Comput. 26, 838-847 (1977).
3. Mohamed El-Essaw, Library research work, Civil Engineering Dept., Duke University, April 28, 1982.
4. V. N. Faddeeva, Computational Methods of Linear Algebra. Dover Publications, New York (1957).
5. Roger W. Hockney and C. R. Jesshope, Parallel Computers: Architecture, Programming and Algorithms. Adam Hilger, Bristol (1981).
6. M. Salama, S. Utku and R. J. Melosh, Parallel solution of finite element equations. Proc. 8th Conf. Electronic Computation, ASCE, Houston, Texas, 1983, pp. 526-539.
7. Senol Utku, Vector Spaces, Matrices as Linear Maps and the Algebraic Eigenvalue Problems. Civil Engineering and Computer Science Dept., Duke University (1980).