
SSE based parallel solution for railway network simulation
(Source: wseas.us/e-library/conferences/digest2003/papers/459-153.doc)

Performance Analysis of PC Based SIMD Parallel MechanismY.F. Fung1, M.F. Ercan2, W.L. Cheung1, T.K. Ho1, C.Y. Chung1, G. Singh1

1Department of Electrical Engineering,The Hong Kong Polytechnic University, Hong Kong

2School of Electrical and Electronic Engineering,Singapore Polytechnic, Singapore

Abstract:- The Streaming SIMD Extension (SSE) is a special feature available in the Intel Pentium III and Pentium 4 classes of microprocessors. As its name implies, SSE enables the execution of SIMD (Single Instruction Multiple Data) operations on 32-bit floating-point data, so the performance of floating-point algorithms can be improved. This article presents the adoption of SSE to obtain better computational performance in solving linear system equations, together with approaches to optimize the performance of the algorithm. By exploiting the special architectural features of the latest processors on PC platforms, a significant speed-up of computations can be achieved.

Key-Words: - SIMD parallel mechanism, LU decomposition, performance optimization

1 Introduction

Due to advances in microprocessor fabrication technology, the microprocessors used in personal computers are becoming more powerful, and the personal computer is now a major platform for developing, as well as implementing, software systems. The performance of a microprocessor is enhanced by various features, one of which is the Single Instruction Multiple Data (SIMD) parallel mechanism. In the Intel family of microprocessors, the SIMD mechanism is called SSE (Streaming SIMD Extension) [1]; in AMD microprocessors, it is the 3DNow! technology [2]. Since AMD microprocessors are compatible with the Intel family, we concentrate only on the SSE feature.

The SIMD feature of the Intel x86 microprocessors was first introduced in 1997 [3]. The MMX (multimedia extension) was tailored to improving multimedia applications by packing eight 8-bit or four 16-bit integer data elements into 64-bit registers, so that several calculations can be performed simultaneously under the SIMD execution model. In 1999, Intel introduced the Pentium III microprocessor, and the SSE feature was included in the Pentium III as well as the Pentium 4 processors that followed. The working mechanism of SSE is similar to MMX; however, the special registers (the SSE registers) are 128 bits wide, and the data that can be manipulated in parallel include 32-bit floating-point values, so the scope of application extends beyond multimedia. In [4], a study of the SSE in implementing a neural network was discussed.

In this paper, we discuss the basic features of the SSE mechanism and examine how they can be applied in implementing a solution for a set of linear equations. The drawbacks of SSE, as well as techniques to optimize our algorithm, are also discussed.

2 The SSE mechanism

The SSE mechanism is supported by the SSE registers, which are 128 bits wide and can store floating-point values, as well as integers and characters [5]. There are eight SSE registers in the Pentium III processor. The SSE registers can be directly addressed and easily utilized through a suitable programming tool, for example the Intel compiler [5].

The SSE architecture provides the flexibility of operating on various data types; for example, the registers can hold eight 16-bit integers or four 32-bit floating-point values. SIMD operations, such as add and multiply, can be performed between two registers, yielding a significant speed-up. Theoretically, a maximum speedup ratio of four can be achieved when floating-point values are used. These operations can be invoked directly using assembly code embedded in standard C/C++ programs, or through special data types provided by compilers supporting SSE. For instance, the Intel compiler provides a data type, F32vec4, which represents a 128-bit storage containing four 32-bit floating-point values. These data types are defined as C++ classes and can therefore be used in a C/C++ program directly. In addition, functions are provided to load/unload data into the new data structure. Once data are stored in the SSE registers, they can be operated upon in parallel, i.e. a set of four floating-point values can be manipulated in a single operation.
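As a minimal illustration of this packing and parallel operation, the following sketch uses the portable xmmintrin.h intrinsics rather than the Intel F32vec4 class; the helper name is our own, not from the original program.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (Pentium III and later)

// Multiply four pairs of floats with a single SIMD instruction.
// Hypothetical helper for illustration only.
void simd_mul4(const float* a, const float* b, float* r) {
    __m128 va = _mm_loadu_ps(a);     // pack a[0..3] into an SSE register
    __m128 vb = _mm_loadu_ps(b);     // pack b[0..3]
    __m128 vr = _mm_mul_ps(va, vb);  // four multiplications in one instruction
    _mm_storeu_ps(r, vr);            // unpack the four products
}
```

With the Intel class library, the same operation is written simply as an expression between two F32vec4 objects.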

The Streaming SIMD Extension thus provides a powerful computational tool for performance improvement. A detailed performance comparison of SIMD extensions with other architectural solutions, such as VLIW (Very Long Instruction Word) and superscalar computers, in multimedia and signal-processing applications was reported in [6]. It has been demonstrated that significant speed-up can be achieved with SIMD extensions, despite the compiler dependency and the need for improved compiler technology to take full advantage of parallel programming. An especially useful capability of SSE, in particular for our application, is its support for floating-point values; this feature has tremendously widened its application areas.

5. Experimental results

Results obtained from solving the equation Ax = b by LU decomposition and forward and backward substitution are discussed below. The processing time required for different dimensions of the matrix A is given in Table 1. Two cases are compared: (1) the traditional approach without SSE, and (2) the solution obtained with SSE. All experiments are based on a Pentium III system with a 550 MHz operating frequency.

Size of Matrix A    100     200     300     400     500
Traditional (ms)    8.5     52.25   170.25  513     1222
SSE (ms)            4.25    28.75   109.75  338     867.8
Speedup             1.94    1.817   1.551   1.51    1.408

Table 1. Processing time (ms) and speedup ratio for determining the solution of Ax = b

As shown in Table 1, the speedup ratios obtained do not match the theoretical value of four. There are two major reasons for this. First, the current design of the algorithm is not optimized; second, efficiency is reduced by the overhead of packing and unpacking data.

5.1. Performance optimization

Techniques that can be applied to optimize the performance of parallel algorithms are discussed in [7].

For the parallel LU decomposition algorithm, there are two possible solutions. Referring to Figure 2, the computation in the algorithm repeatedly determines the value of A2. Data are read from memory continuously; therefore, prefetching data into the cache memory should improve the data access time. Prefetching is supported by SSE, and users can choose which cache level (Level 1 or Level 2) hosts the data. The Level 1 cache is closest to the processor, while the Level 2 cache is farther away.

In the LU decomposition algorithm, elements of the matrix are processed row-wise, so a logical approach to optimizing cache usage is to prefetch the elements of a row before it is processed. However, the number of bytes fetched and stored in the cache depends on the kind of processor being used; a minimum of 32 bytes will be fetched.
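Row prefetching can be sketched as follows (a sketch under our own assumptions: the function name and the prefetch distance are illustrative choices, and the hint constants come from the standard xmmintrin.h intrinsics).

```cpp
#include <xmmintrin.h>

// Hint the processor to pull the next row into the cache while the
// current row is being processed.  Illustrative sketch only.
void process_rows(float* a, int n) {
    for (int i = 0; i < n; i++) {
        if (i + 1 < n) {
            // _MM_HINT_T0 targets the cache closest to the processor;
            // _MM_HINT_T1 targets the next level.  At least 32 bytes
            // are fetched per hint.
            _mm_prefetch((const char*)&a[(i + 1) * n], _MM_HINT_T0);
        }
        for (int j = 0; j < n; j++)
            a[i * n + j] *= 2.0f;   // placeholder row computation
    }
}
```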

Another approach to optimizing the algorithm is loop unrolling [7]. Loop unrolling benefits performance in two ways. First, it reduces the incidence of branch misprediction by removing a conditional jump. Second, it increases the pool of instructions available for re-ordering and scheduling by the processor. In the LU decomposition algorithm, as shown in Figure 2, there are three loops, which we label (outer), (middle), and (inner). The outer loop determines which diagonal element is applied in the following computation. The middle loop computes the multiplier and selects the row of the matrix to operate on. The inner loop controls the processing of elements within a row. Since most of the computations are performed in the inner loop, it is unrolled two times. Because there are three loops, different combinations of loop unrolling can be tested; the speedup ratios obtained by loop unrolling and data prefetching are given in Table 2. The outer and middle loops are also unrolled twice during our tests, and data prefetching is applied in all cases except the unoptimized case, which is the last case shown in Table 2.
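Unrolling the inner loop twice can be sketched as below (our own illustration of the general technique, not the paper's exact code; the function and parameter names are hypothetical). Each iteration now performs two SSE updates, i.e. eight floats, so the loop branch is tested half as often.

```cpp
#include <xmmintrin.h>

// Row update a_i := a_i - c * a_k with the inner loop unrolled twice.
void row_update_unrolled(float* ai, const float* ak, float c, int n) {
    __m128 vc = _mm_set1_ps(c);              // broadcast the multiplier
    int j = 0;
    for (; j + 8 <= n; j += 8) {
        __m128 a1 = _mm_loadu_ps(&ak[j]);    // first SSE update
        __m128 b1 = _mm_loadu_ps(&ai[j]);
        _mm_storeu_ps(&ai[j], _mm_sub_ps(b1, _mm_mul_ps(a1, vc)));
        __m128 a2 = _mm_loadu_ps(&ak[j + 4]); // second, unrolled copy
        __m128 b2 = _mm_loadu_ps(&ai[j + 4]);
        _mm_storeu_ps(&ai[j + 4], _mm_sub_ps(b2, _mm_mul_ps(a2, vc)));
    }
    for (; j < n; j++)                       // scalar remainder loop
        ai[j] -= ak[j] * c;
}
```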

Size of Matrix                                   100    200    300    400    500
Unoptimized                                      1.94   1.82   1.55   1.52   1.41
Unrolled loop (inner)                            2.11   2.09   1.80   1.64   1.59
Unrolled loops (inner) and (middle)              2.28   1.97   1.62   1.58   1.60
Unrolled loops (inner), (middle), and (outer)    2.6    2.08   1.81   1.61   1.56
Unrolled loops, without prefetch                 2.53   1.95   1.89   1.65   1.59

Table 2. Speedup ratios obtained by loop unrolling and data prefetch

As shown in Table 2, the performance of the optimized algorithm improves significantly for matrix sizes of 100 and 200. For the case where the size of the matrix is 100, the speedup ratio obtained when all three loops are unrolled is 2.6, almost 34% faster than the unoptimized case. The effect of data prefetch without loop unrolling is listed in the last row of Table 2; data prefetch alone does not produce a significant enhancement. This may be due to the fact that the cache memory is also utilized in the unoptimized algorithm: the matrix data are still in the cache because they were recently accessed during the initialization stage.

After optimization, better speedup ratios can be obtained for different matrix sizes; however, the theoretical ratio of four cannot be reached, due to the overhead induced by the extra steps required to convert data from standard floating-point values to the 128-bit F32vec4 format and vice versa. Two packing operations and one unpacking operation are performed in the inner loop, and one packing operation is included in the middle loop. The number of pack/unpack operations executed in the inner loop can be approximated by equation (8).

(8)

The number of packing operations performed in the middle loop is equal to

(9)

where the quantity appearing in both (8) and (9) is the size of the vector x in the equation Ax = b.

Based on equations (8) and (9), the number of packing and unpacking operations increases with the size of the matrix A, implying that the overhead becomes more significant for larger matrices. This is consistent with our experimental results presented in Table 2. From our empirical study, it takes about two clock cycles to carry out a pack or unpack operation.

6. Application

In the above sections, we have described how SSE can be applied in the implementation of an LU decomposition algorithm, and we have discussed how to optimize its performance by loop unrolling and data prefetching. In this section, we apply the SSE-based LU decomposition algorithm to determine solutions for a set of linear equations created by a railway system simulator [8], which simulates the operation of one of the DC electrical railway networks in Hong Kong. The data are stored in a file as a sequence of admittance matrices and current vectors, one per simulated time-step. In our test data, the size of the admittance matrix is 37x37 and the number of time-steps is 100. The timing results obtained from the SSE and the traditional solutions are given in Table 3.

           Original   SSE      Speedup Ratio
Case 1     431 ms     367 ms   1.17
Case 2     91 ms      27 ms    3.37

Table 3. Timing results obtained from the railway network simulator

As described in the above paragraph, the data generated by the simulator are stored in a file; therefore, operations such as opening the file and reading data from it must be performed. The file operations induce severe overhead, as depicted in the first row of Table 3 (Case 1). The results for Case 2 were obtained by removing the overhead caused by the file operations from the total processing time. The speedup ratio for Case 2 is close to the theoretical value of four, which is reasonable as the problem size is small. Certainly, once the SSE algorithm is incorporated into the simulator, the overhead caused by the file operations will be eliminated and the efficiency of the simulator can be greatly improved.


7. Conclusions

In this paper, we have presented the basic operations required to utilize the SSE features of the Pentium III processors. To examine the effectiveness of SSE, the railway network simulation problem was introduced and solved by applying SSE functions. According to our results, a speedup ratio of around 3 can be obtained for the railway network simulation if the overhead induced by file operations can be minimized, which will be the case when the SSE algorithm is embedded in the railway network simulator. The results are satisfactory because only minor modifications of the original program are needed to utilize the SSE features. Most importantly, no additional hardware is required for this performance enhancement. Therefore, SSE is a cost-effective solution for improving the performance of computation-intensive problems, and the railway network simulation problem is one of the ideal applications. Currently, only the DC railway system has been studied; we are also planning to apply the SSE approach to solving AC systems, which involve complex-number arithmetic. As there is no natural match between the SSE data types and complex numbers, determining an SSE-based solution for the AC system will be a challenging task.

However, there is one form of overhead induced by the SSE mechanism itself: the packing of data into, and unpacking of data from, the 128-bit format. The number of packing and unpacking operations is proportional to N², where N is the size of the matrix in our test algorithm, so the gain in performance is reduced when the problem involves a large matrix, such as 400x400.

In addition, we have tested two methods to optimize our algorithm: loop unrolling and data prefetch. Loop unrolling can improve performance significantly, but it increases the size of the program. Data prefetch is a technique to utilize the available cache memory, but its effect is difficult to isolate because the cache is also used implicitly in the other algorithms. In our future study, we will examine the performance of SSE algorithms using assembly code instead of the high-level programming techniques applied in this paper; moreover, software tools such as the Performance Counter Library [x4] will be applied to study the effect of cache prefetches in depth.

In this paper, we have concentrated on the LU decomposition algorithm and its application in electrical railway simulation. However, the SSE features discussed can certainly be applied to other problems, making SSE a valuable tool for developing PC-based application software.

References:
[1] Conte G., Tommesani S., Zanichelli F., "The long and winding road to high-performance image processing with MMX/SSE", Proc. of the Fifth IEEE International Workshop on Computer Architectures for Machine Perception, pp. 302-310, 2000.
[2] AMD Athlon Processor Model 4 Data Sheet, available at www.amd.com
[3] The Complete Guide to MMX Technology, Intel Corporation, McGraw-Hill, 1997.
[4] Strey A. and Bange M., "Performance Analysis of Intel's MMX and SSE: A Case Study", LNCS, Vol. 2150, pp. 142-147, 2001.
[4] Wang K.C.P. and Zhang X., "Experimentation with a host-based parallel algorithm for image processing", Proc. of Second Int. Conf. on Traffic and Transportation Studies, pp. 736-742, 2000.
[5] Intel C/C++ Compiler Class Libraries for SIMD Operations User's Guide, Intel, 2000.
[5] Mellitt B., Goodman C.J., and Arthurton R.I.M., "Simulator for Studying Operational and Power-Supply Conditions in Rapid-Transit Railways", Proc. IEE, Vol. 125, No. 4, pp. 298-303, 1978.
[6] Talla D., John L.K., Lapinskii V., and Evans B.L., "Evaluating Signal Processing and Multimedia Applications on SIMD, VLIW and Superscalar Architectures", Proc. of Int. Conference on Computer Design, pp. 163-172, 2000.
[7] 32-bit Floating Point Real & Complex 16-Tap FIR Filter Implemented Using Streaming SIMD Extensions, Intel Corporation, 1999.
[8] Ho T.K., Mao B.H., Yang Z.X., and Yuan Z.Z., "A General-purpose Simulator for Train Operations", Proc. of International Conf. on Traffic and Transportation Studies, pp. 830-839, 1998.
[Geo81] George A. and Liu J.W.H., Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, 1981.
[Fung02] Fung Y.F., Ercan M.F., Ho T.K., and Cheung W.L., "A parallel solution to linear systems", Microprocessors and Microsystems, Vol. 26, pp. 39-44, 2002.
[X1] Bhargava R., John L.K., Evans B.L., and Radhakrishnan R., "Evaluating MMX technology using DSP and multimedia applications", Proc. of 31st ACM/IEEE International Symposium on Microarchitecture, pp. 37-46, 1998.
[x4] Berrendorf R. and Mohr B., "PCL - The Performance Counter Library", Research Centre Juelich (Germany), available from http://www.fz-juelich.de/zam/PCL

Acknowledgment: This work is supported by The Hong Kong Polytechnic University and the Singapore Polytechnic.


data management and system modeling [Good98]. Numerous simulation packages have been developed and found successful applications. However, most of these simulators were designed to suit particular railway systems or specific studies.

Therefore, we have aimed to speed up this operation by means of the SSE features provided in standard PCs today. In the following, we present the application which motivated this study. Section 3 describes the use of SSE to solve linear systems with real, complex and sparse matrices. In Section 4, we present the results of our empirical studies.

The parallel computing features of Pentium CPUs are explored and evaluated in the computation-intensive problem of large-scale matrix solution, which is the most time-consuming step of the simulator.

2.2 Computational Demand and Parallel Computing

IBM-compatible PCs are the most commonly used platforms for railway whole-system simulators. For a simple DC railway line with two separate tracks and 20 trains on each, the coefficient matrix is only around 50x50. The simulation is still much faster than real-time with Pentium processors, and the power network calculation has not caused many problems in most off-line studies, even with a single processor. However, dealing with a large, complicated and busy network (a larger and less sparse coefficient matrix) and/or an AC supply system (complex numbers in the matrix equation) presents a formidable challenge to the simulator, particularly in increasingly demanding real-time applications.

Parallel computing is an obvious direction to deal with the computation demand. Multiple-processor hardware is a convenient way out, but it usually requires substantial modifications within the simulator and increases the system cost. Partitioning the electrical network into several smaller, more manageable sub-circuits by node splitting is another possible approach. This technique involves tearing a given network into a number of independent parts and putting the solutions of the divided parts together to form the solution of the original problem [Roh88]. Equal workload assignment to the processors and a suitable processor architecture are other critical considerations in maximizing the additional computational power.

This article focuses on exploiting the SIMD computing capabilities of the latest processors so that execution of the application, a railway power network simulator, can be sped up on a single-processor system with minimal alterations to the source code.

3. Parallel Computing

A common method for parallel computing within a single processor is to pipeline the calculation steps and hence maximize its computational capability. The Intel Pentium III and 4 are the most popular processors in today's computer world. Although not primarily intended for parallel computing, they are equipped with such potential: their architectural design includes a few high-capacity registers, and the available compilers allow high-level manipulation of these registers.

3.1 Streaming SIMD Extension

3.2 Parallel LU Decomposition Based on SSE

The matrix equation resulting from an electrical railway system is of the form:

Ax = b    (1)

Here, A is a symmetric sparse matrix of order n representing the admittance linking the nodes, vector b represents the current produced by the sources, and x is the unknown solution vector defining the voltage attained at each node.

There are various methods for solving equation (1) on a computer. A commonly used procedure is LU decomposition [Geo81], where matrix A is factored into lower and upper triangular matrices, L and U, such that

A = LU    (2)

and this is followed by forward/backward substitution of the form

Ly = b    (3)

and

Ux = y    (4)

With both L and U obtained, forward substitution identifies the intermediate vector y in (3), and then the vector x is determined by backward substitution using (4). In LU decomposition, elements of the matrix are processed along the diagonal on a row-by-row basis. Data stored in a row of the matrix can be processed in groups of four because they are independent of each other, so data parallelism can be applied, with the SSE registers, to shorten the computation time. In our earlier study, we demonstrated an SSE-based parallel implementation for linear system equations [Fung02]. Here, we briefly explain the steps involved.

As mentioned earlier, compiler technology that can discover the available parallelism in a program and produce efficient code by fully exploiting the SSE registers is not yet available. Despite the presence of certain features to make programming easier, it remains the user's role to exploit the available parallelism during program development. In this section, we present how the available parallelism in the LU decomposition algorithm is exploited. The calculation involved in LU decomposition is illustrated by the pseudo codes in Figure 3.

Figure 3. Pseudo codes for calculating LU decomposition.

The a(i,j)'s represent elements in the matrix, and they are processed along the diagonal, in a row-by-row manner. Data stored in a row of the matrix map naturally onto the F32vec4 data format, so four elements in a row can be evaluated in a single step. The term a(i,k)/a(k,k) is a constant while the elements in row i are being processed. It can, therefore, be stored in an F32vec4 type with the command _mm_load_ps1, which loads a single 32-bit floating-point value and copies it into all four elements of the 128-bit storage. The pseudo codes given in Figure 4 illustrate the steps involved in the implementation of these operations using SSE functions.
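The row-by-row elimination described above can be rendered as the compilable sketch below. This is our own reconstruction with the raw SSE intrinsics (the paper uses the F32vec4 class instead), without pivoting, and with a scalar remainder loop for row lengths that are not multiples of four.

```cpp
#include <xmmintrin.h>

// In-place LU elimination of an n x n row-major matrix (no pivoting),
// inner loop vectorized four elements at a time.  Reconstruction for
// illustration; not the paper's exact code.
void lu_sse(float* a, int n) {
    for (int k = 0; k < n - 1; k++) {            // outer loop
        for (int i = k + 1; i < n; i++) {        // middle loop
            float x = a[i * n + k] / a[k * n + k];
            __m128 C = _mm_set1_ps(x);           // broadcast the multiplier
            a[i * n + k] = x;                    // keep the L factor in place
            int j = k + 1;
            for (; j + 4 <= n; j += 4) {         // inner loop, 4 at a time
                __m128 A1 = _mm_loadu_ps(&a[k * n + j]);
                __m128 A2 = _mm_loadu_ps(&a[i * n + j]);
                A2 = _mm_sub_ps(A2, _mm_mul_ps(A1, C));
                _mm_storeu_ps(&a[i * n + j], A2);
            }
            for (; j < n; j++)                   // scalar remainder
                a[i * n + j] -= a[k * n + j] * x;
        }
    }
}
```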

In forward substitution, the operations can be represented by:

y_i = ( b_i - SUM_{j<i} L_{i,j} * y_j ) / L_{i,i}    (5)

where y_i represents an element of the intermediate vector in equation (4), b_i represents an element of the [b] vector, and L_{i,j} represents the elements of the [L] matrix. SSE operations are also applicable in computing the summation: four elements of L_{i,j} and of y_j can be stored in two different F32vec4 variables and multiplied in a single operation.

    F32vec4 C, A1, A2;                 /* 128-bit values */
    float x;
    for (k = 0; k < n-1; k++)          /* the outer loop */
      for (i = k+1; i < n; i++) {
        x = a(i,k) / a(k,k);           /* the middle loop */
        C = _mm_load_ps1(&x);
        for (j = k+1; j < n; j += 4) { /* the inner loop */
          /* store four values from a(k,j) to a(k,j+3) into A1 */
          /* store four values from a(i,j) to a(i,j+3) into A2 */
          A2 = A2 - (A1 * C);
        }
      }

Figure 4. Pseudo codes for LU decomposition with SSE functions

Operations in the backward substitution phase are represented by


For k = 0 to n-2 Do
  For i = k+1 to n-1 Do
    For j = k+1 to n-1
    end for
  end for
end for


x_i = ( y_i - SUM_{j>i} U_{i,j} * x_j ) / U_{i,i}    (6)

where U_{i,j} represents the elements of the upper triangular matrix [U] and m is the size of the vector [x]. Similar to forward substitution, the multiplications U_{i,j} * x_j can be executed by SSE functions, with four elements of U_{i,j} and of x_j being operated on at the same instant.

3.3 Complex numbers

When an AC power supply is adopted in a railway system, the power network matrix will contain complex numbers. To solve equations with complex numbers in our algorithm, two SSE registers are used to store four complex numbers: one register holds the real components, and the other stores the imaginary parts. In complex arithmetic, multiplying two values requires four multiplications and two add/subtract operations. With SSE, the results of multiplying four complex pairs can be obtained with two packing/unpacking operations, four SIMD multiplications, and two SIMD add/subtract operations, whereas the sequential method requires a total of sixteen multiplications and eight add/subtract operations. In the case of multiplying four pairs of real numbers, the sequential method needs four multiplications, while SSE needs two packing/unpacking operations and one multiplication. Since the packing/unpacking operation is the major source of overhead in the SSE mechanism, the advantage of using SSE is best reflected in the complex-number algorithm, and the speedup obtained is much better than in the real-number cases. These results are discussed in Section 4.
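The separate real/imaginary register layout described above can be sketched as follows (a minimal sketch; the function name and the pointer-based interface are our own choices). The identity (a+bi)(c+di) = (ac - bd) + (ad + bc)i gives exactly the four SIMD multiplications and two add/subtracts mentioned in the text.

```cpp
#include <xmmintrin.h>

// Multiply four complex pairs at once.  Real parts and imaginary parts
// are kept in separate SSE registers:
//   re = ar*br - ai*bi,   im = ar*bi + ai*br
void cmul4(const float* ar, const float* ai,
           const float* br, const float* bi,
           float* cr, float* ci) {
    __m128 var = _mm_loadu_ps(ar), vai = _mm_loadu_ps(ai);
    __m128 vbr = _mm_loadu_ps(br), vbi = _mm_loadu_ps(bi);
    _mm_storeu_ps(cr, _mm_sub_ps(_mm_mul_ps(var, vbr),
                                 _mm_mul_ps(vai, vbi)));   // real parts
    _mm_storeu_ps(ci, _mm_add_ps(_mm_mul_ps(var, vbi),
                                 _mm_mul_ps(vai, vbr)));   // imaginary parts
}
```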

3.4 Sparse matrix

In many engineering problems, such as the railway network problem discussed in Section 3.3, the matrix representing the system is sparse. We have therefore included the application of SSE to the sparse matrix problem. When dealing with a sparse matrix with SSE, we must derive a method to handle the non-zero elements effectively. Since an SSE register processes four values per operation, the algorithm will not be efficient if most of the elements processed in an operation are zero. To optimize our algorithm, we process only the non-zero elements by means of a packing process. The steps of the process are listed in Figure 5.

Step 1: Scan a row of the matrix
Step 2: Store the co-ordinates of the non-zero elements into an array
Step 3: Extract non-zero elements from the matrix according to the co-ordinate array created in Step 2
Step 4: Carry out the computation
Step 5: Scan the next row and repeat Steps 2 to 4

Figure 5. Steps for packing data of a sparse matrix
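The scan-and-extract steps of Figure 5 can be sketched in plain C++ as follows (the function name and container choices are our own; Step 4 is only indicated by a comment).

```cpp
#include <vector>

// Steps 1-3 of Figure 5: scan one row, record the columns of the
// non-zero elements, and gather them into a dense buffer so that
// groups of four can later be fed to the SSE registers.
void pack_row(const float* row, int n,
              std::vector<int>& cols, std::vector<float>& packed) {
    cols.clear();
    packed.clear();
    for (int j = 0; j < n; j++) {
        if (row[j] != 0.0f) {
            cols.push_back(j);        // Step 2: record the co-ordinate
            packed.push_back(row[j]); // Step 3: extract the element
        }
    }
    // Step 4 (not shown): process `packed` four values at a time.
}
```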

4. Experimental Study

In this section, we focus on the speed-up that can be achieved in solving Ax = b, based on the LU decomposition method, using the SSE registers. As this is the most time-consuming computation of the simulator, a worthwhile speed-up will contribute to the overall performance of the simulator.

4.1 Computation Speed-up

A study of the computation speed-up for different sizes of the coefficient matrix is presented in Figure 6. Two cases are compared: the traditional approach (without SSE) and the solution obtained with SSE. All experiments are conducted on a Pentium IV system.


Figure 6. Processing time and speedup ratio for the solution of Ax = b with real numbers.

As shown in Figure 6, a speed-up is obtained for matrices of up to 2000x2000 dimensions. The speed-up factor of four that one would expect theoretically was only approached for the 300x300 matrix size. There are two drawbacks of the SSE-based parallel LU decomposition algorithm. First, the current design of the algorithm is not optimized; second, data packing and unpacking take a substantial amount of CPU time. Packing and unpacking involve operations to convert data from standard floating-point values to the 128-bit F32vec4 format and vice versa. Two packing operations and one unpacking operation are executed in the inner loop, and one packing operation is included in the middle loop, as shown in Figure 4. The number of pack/unpack operations executed in the inner loop can be approximately given by:

(7)

The number of packing operations performed in the middle loop is equal to

(8)

where N is the size of the vector x in the equation Ax = b. Referring to equations (7) and (8), the number of packing and unpacking operations escalates as the matrix size increases, which implies that the induced overhead becomes more significant for larger matrices. This is consistent with our experimental results in Figure 6: the speedup ratio drops as the size of the matrix increases. Comparing these results with our earlier experiments on a Pentium III processor (see Figure 7), we observe a better speed-up ratio and CPU timing with the Pentium IV processor.

The performance results when solving equations with complex numbers are presented in Figure 8. The advantage of using SSE is best reflected in the complex-number algorithm, and the speedup obtained is much better than in the real-number cases, as depicted in Figure 8. However, a comparison with the Pentium III results shows that better speed-up is achieved for smaller matrices (see Figure 9). The main reason is the shorter computation time obtained with the Pentium IV, which magnifies the relative cost of the data-packing overhead for smaller matrices.

Figure 7. A comparison of speed-ups for two versions of Pentium processors.


Figure 8. Processing time and speedup ratio for the solution of Ax = b with complex numbers.

Figure 9. A comparison of speed-ups for two versions of Pentium processors.

The results presented in the above figures are all based on dense matrices. The SSE features were also employed to solve the sparse matrix problem, and the results are given in Table 1. The sparse matrices used in our tests are obtained from the software package MatPower [Zim]. MatPower is a package of MatLab m-files for solving power flow and optimal power flow problems. The matrix generated by MatPower represents the complexity of a power system network; for example, an A matrix of size 300x300 represents a power system network with 300 buses. The A matrix generated by MatPower is sparse, and the percentage of non-zero elements is included in Table 1. Furthermore, we have studied the effect of matrix size and of the percentage of non-zero elements on the performance; the results are presented in Table 2 and Figure 10. We observe a good performance improvement for smaller matrices and a steady speed-up for large matrices. The speed-up decreases moderately as the proportion of non-zero elements increases.

Table 1 Results for solving the Sparse Complex Number Matrices (in msec.)

Size of Matrix A   Non-SSE    SSE      Speedup
60x60                 1.07     0.25     4.28
118x118               3.13     2.67     1.17
300x300              66.13    29.73     2.22

Table 2 Results for Solving the Sparse Complex Number Matrices (in msec.)

Size of       5% non-zero                     15% non-zero
Matrix        Non-SSE    SSE      SSE-opt     Non-SSE    SSE
100               110      10         10          160      10
300               431      50         50          571      80
400              1042     341        321         1322     401
500              2053     651        561         2383     771
800              8923    2573       2424         9523    2724
1000            18016    4917       4737        18937    5147
1500            62680   16454      16244        64032   16944


Figure 10. Speed-up of sparse matrix calculations.

4.2 Performance Optimization

In the previous section, we presented and compared results obtained by applying the SSE mechanism to the LU decomposition algorithm. Due to the overhead of packing and unpacking data, the theoretical speedup cannot be obtained. However, techniques can still be applied to optimize the algorithm; optimization techniques for parallel algorithms are discussed in [Int99]. For the parallel LU decomposition algorithm, there are two possible approaches. Referring to Figure 4, the computation involved in the algorithm repeatedly determines the value of A2. Data are read from memory continuously, and therefore prefetching data into the cache memory should improve the data access time. Prefetching is supported by SSE, and users can choose which cache level (Level 1 or Level 2) holds the data. The Level 1 cache is the closest to the processor, whilst the Level 2 cache is farther away.

In the LU decomposition algorithm, elements of the matrix are processed row-wise, and a logical approach to optimizing cache usage is to prefetch the elements of a row before it is processed. However, the number of bytes fetched and stored in the cache depends on the exact processor model; a minimum of 32 bytes is fetched.

Another approach to optimizing the algorithm is loop unrolling. Loop unrolling benefits performance in two ways. First, it reduces the incidence of branch mis-prediction by removing conditional jumps. Second, it increases the pool of instructions available for re-ordering and scheduling by the processor. In the LU decomposition algorithm, as shown in Figure 4, there are three loops, which we label (outer), (middle), and (inner). The outer loop determines which diagonal element is applied in the subsequent computation. The middle loop computes the multiplier and selects the row of the matrix to be operated on. The inner loop controls the processing of the elements within a row. Since most of the computations are performed in the inner loop, it is unrolled twice. As three nested loops are involved, different combinations of loop unrolling have been studied. Table 3 gives the speedup ratios obtained by loop unrolling and data prefetching. The outer and middle loops are also unrolled twice in these tests, and data prefetching is applied in all cases except the unoptimized case and the last case shown in Table 3.

Table 3 Speedup ratios obtained by loop unrolling and data prefetch

Size of Matrix                             100    200    300    400    500
Unoptimized                                1.94   1.82   1.55   1.52   1.41
Unrolled loops … and …                     2.11   2.09   1.80   1.64   1.59
Unrolled loops … and …                     2.28   1.97   1.62   1.58   1.60
Unrolled loops …, … and …                  2.6    2.08   1.81   1.61   1.56
Unrolled loops …, … and … (no prefetch)    2.53   1.95   1.89   1.65   1.59

The results show that the performance of the optimized algorithm improves significantly for matrix sizes of 100 and 200. For a matrix of size 100, the speedup ratio reaches 2.6 when all three loops are unrolled, which is almost 34% faster than the unoptimized case. The effect of unrolling the loops without prefetching is listed in the last row of Table 3. It can be concluded that, for LU decomposition, data prefetching alone does not produce a significant improvement. This is perhaps because the cache is already utilized implicitly by the unoptimized algorithm: the matrix data are still cached, having been accessed recently in the initialization stage.


After optimization, better speedup ratios were obtained for the different matrix sizes; however, the theoretical ratio of four cannot be reached, owing to unavoidable sequential fragments in the algorithm and the overheads involved in employing the SSE registers.

4.3 Application in Railway Simulation

In the previous sections, we described how SSE can be applied in the implementation of an LU decomposition algorithm, and how its performance can be optimized by loop unrolling and data prefetching. We now highlight the performance improvement obtained by integrating the SSE-based LU decomposition algorithm into the railway network simulator.

In this practical application of the parallel algorithm, the input to the simulator is a set of linear equations describing the electrical circuit of the railway system. Integration of the network solution algorithm into the simulator is usually realized in either a modular or an embedded form. In the former, the simulator and the solution algorithm are functionally and physically independent: they communicate only through external files, which store the data of the matrix equation at each time step. The latter embeds the solution algorithm as a function within the simulator, with data exchanged via internal variables.

The timing results, based on a 733 MHz Pentium III machine, obtained from the simulator [Ding03] are given in Table 4. The different matrix sizes represent different numbers of trains running on the network: 5, 10, 15, and 20 trains correspond to matrix sizes of 102x102, 132x132, 162x162, and 192x192, respectively. Because an AC system is studied in the simulator, complex numbers are used in the calculation. The timings include the simulator functions that generate and save the data at each step; these functions are not SSE-based and can be regarded as data I/O overhead. Referring to Table 4, the speedup ratio obtained is close to 2 in most cases, and when the number of trains is 20, applying SSE reduces the computing time by more than ten minutes (from 1312 to 645 seconds). This result demonstrates the benefit of SSE in solving one type of engineering problem. Our current research direction is to study how SSE can be combined with other forms of parallelism, one example being a dual-CPU machine.

Table 4 Timing results (in seconds) obtained from railway network simulator

Matrix size   Non-SSE    SSE     Speedup
102x102          210     135      1.56
132x132          436     167      2.61
162x162          764     402      1.90
192x192         1312     645      2.03

5. Conclusions

In this paper, we present an LU decomposition algorithm and its application in electrical railway simulation. We have demonstrated a simple and cost-effective approach to enhancing the computation capability of a PC, utilizing the SIMD extensions of the Pentium processors to speed up the LU decomposition computations involved in the simulator. According to our results, a speedup ratio of around 3 can be obtained for the railway network simulation. The results are satisfactory and, more importantly, the improvement is achieved with only minor modifications to the original program. Furthermore, no additional hardware support is required for this performance enhancement. SSE is therefore a cost-effective means of improving the performance of computationally intensive problems. Another significant contribution of this study is that it shows a successful application of SSE in an area other than the multimedia applications for which it was primarily targeted. We have also implemented the complex-number algorithms for both dense and sparse matrices; the speedups obtained in both cases are better than in the real-number cases, for the reason elaborated in Section 3.3.

The drawback of the SSE mechanism is the overhead caused by the operations that pack data into, and unpack data from, the 128-bit format. The number of packing and unpacking operations is proportional to N² (where N is the size of the matrix). The gain in performance is reduced when the problem involves matrices larger than 400x400 in the real-number cases. We have tested two methods, loop unrolling and data prefetching, to optimize the real-number algorithm. Loop unrolling improves the performance significantly, but it increases the size and complexity of the program. Data prefetching utilizes the available cache memory, but its effect is difficult to isolate or quantify because the cache is also used implicitly by the other algorithms. In further studies, we will examine the performance of SSE algorithms written in assembly code instead of the high-level programming techniques applied in this paper. Moreover, software tools such as the Performance Counter Library [Ber] will be applied to study the effect of cache prefetching in depth. Our study demonstrates that the SSE features embedded in the Pentium series of processors are a valuable tool for developing PC-based application software.

Acknowledgement
This work is supported by The Hong Kong Polytechnic University under grant number A-PD59.

References
[Ber] Berrendorf, R. and Mohr, B., "PCL – The Performance Counter Library", Research Centre Juelich (Germany), available from http://www.fz-juelich.de/zam/PCL.
[Bhar98] Bhargava, R., John, L.K., Evans, B.L. and Radhakrishnan, R., "Evaluating MMX Technology Using DSP and Multimedia Applications", Proc. of 31st ACM/IEEE International Symposium on Microarchitecture, pp. 37-46, 1998.
[Cont00] Conte, G., Tommesani, S. and Zanichelli, F., "The Long and Winding Road to High-performance Image Processing with MMX/SSE", Proc. of the Fifth IEEE International Workshop on Computer Architectures for Machine Perception, pp. 302-310, 2000.
[Ding03] Ding, Y., Ho, T.K., Fung, Y.F., Liu, H.D. and Mao, B.H., "Parallel Algorithm of Train Movement Simulation on Electrified Railway", Journal of System Simulation, 2003 (accepted).
[Fung02] Fung, Y.F., Ercan, M.F., Ho, T.K. and Cheung, W.L., "A Parallel Solution to Linear Systems", Microprocessors and Microsystems, Vol. 26, pp. 39-44, 2002.
[Geo81] George, A. and Liu, J.W.H., "Computer Solution of Large Sparse Positive Definite Systems", Prentice-Hall, 1981.
[Good87] Goodman, C.J., Mellitt, B. and Rambukwella, N.B., "CAE for the Electrical Design of Urban Rail Transit Systems", COMPRAIL'87, pp. 173-193, 1987.
[Good98] Goodman, C.J., Siu, L.K. and Ho, T.K., "A Review of Simulation Models for Railway Systems", Proc. of Int. Conf. on Developments in Mass Transit Systems, pp. 80-85, 1998.
[Ho02] Ho, T.K., Mao, B.H., Yuan, Z.Z., Liu, H.D. and Fung, Y.F., "Computer Simulation and Modeling in Railway Applications", Computer Physics Communications, Vol. 143(1), pp. 1-10, 2002.
[Int99] 32-bit Floating Point Real & Complex 16-Tap FIR Filter Implemented Using Streaming SIMD Extensions, Intel Corporation, 1999.
[Int00] Intel C/C++ Compiler Class Libraries for SIMD Operations User's Guide, Intel, 2000.
[Me78] Mellitt, B., Goodman, C.J. and Arthurton, R.I.M., "Simulator for Studying Operational and Power-Supply Conditions in Rapid-Transit Railways", Proc. IEE, Vol. 125(4), pp. 298-303, 1978.
[MMX] The Complete Guide to MMX Technology, Intel Corporation, McGraw-Hill, 1997.
[Roh] Rohrer, R.A., "Circuit Partitioning Simplified", IEEE Trans. on CAS, Vol. 35, No. 1, pp. 2-5, 1988.
[Strey01] Strey, A. and Bange, M., "Performance Analysis of Intel's MMX and SSE: A Case Study", LNCS, Vol. 2150, pp. 142-147, 2001.
[Tal00] Talla, D., John, L.K., Lapinskii, V. and Evans, B.L., "Evaluating Signal Processing and Multimedia Applications on SIMD, VLIW and Superscalar Architectures", Proc. of Int. Conference on Computer Design, pp. 163-172, 2000.
[Wang00] Wang, K.C.P. and Zhang, X., "Experimentation with a Host-based Parallel Algorithm for Image Processing", Proc. of Second Int. Conf. on Traffic and Transportation Studies, pp. 736-742, 2000.
[Zim] MatPower, www.pserc.cornell.edu/matpower/matpower.html.
