[ieee comput. soc fourth international conference on high-performance computing - bangalore, india...

Parallelization of Finite Volume Computations for Heat Transfer Application Using Unstructured Mesh Partitioning Algorithms

Chaman Singh Verma
Center for Development of Advanced Computing, Pune University Campus, Ganeshkhind, Pune, INDIA
csv@cdac.ernet.in

V.C.V. Rao
Department of Computer Science, University of Minnesota, Minneapolis, MN - 55455
vcvrao@cs.umn.edu

Abstract

An effective parallelization of Finite Volume Computations for Heat Transfer application using unstructured triangular meshes and mesh partitioning algorithms is presented. The mesh partitioning software (METIS) is employed to create nearly load-balanced subdomains and to minimize the interprocessor communications. Efficient data structures are developed to handle the neighbouring element information at the interfaces of all subdomains, and a simple strategy to overlap computations with communications has been implemented to improve the performance of the program. The explicit time integration method is used, and results for a rectangular domain and an L-shaped domain are presented. The code (PHEAT2D) is written in Fortran90 and uses MPI for message passing. The algorithm is tested on a distributed memory MIMD machine, PARAM OPENFRAME.

1 Introduction

Advances in algorithmic design have contributed as much to computer performance as architectural advances. This is particularly true in the context of parallel computing, where efficient parallel algorithmic design is essential for optimum performance ([1], [2]). Several attempts have been made to modify the algorithms for the solution of partial differential equations (PDEs) on parallel computers, particularly on MIMD machines ([3], [4], [5]), to reduce the computational time. However, in order to take advantage of these parallel machines, all aspects of a simulation, i.e. mesh generation, mesh partitioning, mesh refinement features, data structures and efficient linear equation solution techniques, must be included and made to run efficiently ([3], [7], [8]). In the present work, a parallel computing strategy is implemented for the PDEs arising in Heat Transfer. The finite volume method (FVM) with an explicit time marching scheme has been used to solve the governing equations on unstructured meshes. METIS, a well-known unstructured graph partitioning software which uses multilevel graph partitioning algorithms, has been used for partitioning the unstructured triangular mesh ([6]). The computations in each subdomain are performed concurrently on each processor, and the interface elements are handled separately with the help of virtual elements. Detailed experiments have been carried out for two different domains and the results are presented.

2 Description of Application Problem

The two-dimensional transient heat conduction problem has been considered for illustrating the parallelisation strategy and analyzing the performance. The governing equation for the temperature, $T(x, y, t)$, in a two-dimensional region $\Omega$ with boundary $\partial\Omega$ is given by

$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q = \frac{1}{\alpha}\frac{\partial T}{\partial t}, \qquad t \in (t_0, t_f], \quad (x, y) \in \Omega \tag{1}$$

with boundary and initial conditions

$$T = T_1 \ \text{on } \partial\Omega_1; \qquad \frac{\partial T}{\partial n} = f(T) \ \text{on } \partial\Omega_2; \qquad T = T_0 \ \text{at } t = t_0 \tag{2}$$

Here, $\partial\Omega_i$ ($i = 1, 2$) is a part of the boundary $\partial\Omega$, $f(T)$ depends on the nature of the boundary conditions such as convection, heat flux or radiation, $Q$ is a heat source term, $\alpha$ is the thermal diffusivity, $t_0$ and $t_f$ are the starting and final times, and $T_0$ represents the initial temperature at time $t = t_0$.

1094-7256/97 $10.00 © 1997 IEEE

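2.1 Finite Volume Solution procedure

Let $\Delta$ be the discretized triangular mesh of the domain $\Omega$. The aim is to calculate the temperature, $T$, at each nodal point of the triangular mesh $\Delta$ at $t = t_{n+1}$ ($= t_n + \Delta t$) with a given time step ($\Delta t$) and the initial condition at the instant of time $t = t_n$. The general form of Eq. 1 can be written as

$$\frac{\partial E}{\partial t} = \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} + S \tag{3}$$

with appropriate boundary and initial conditions. Here $E$ is the temperature, $F$ and $G$ are the gradient terms in the $x$ and $y$ directions, and $S$ is the source term.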
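To make the correspondence between the heat conduction equation and the general form of Eq. 3 concrete, the following is one consistent identification of terms. It assumes that the governing equation has the standard form $\nabla^2 T + Q = \frac{1}{\alpha}\,\partial T/\partial t$ with a constant thermal diffusivity $\alpha$; the explicit expressions are a reconstruction under that assumption and are not quoted from the text.

$$E = T, \qquad F = \alpha\,\frac{\partial T}{\partial x}, \qquad G = \alpha\,\frac{\partial T}{\partial y}, \qquad S = \alpha\,Q$$

$$\frac{\partial E}{\partial t} = \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} + S \;\Longrightarrow\; \frac{\partial T}{\partial t} = \alpha\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}\right) + \alpha\,Q$$

Dividing the last relation by $\alpha$ recovers the heat conduction equation, so with this choice $F$ and $G$ are scaled temperature gradients, in line with their description as gradient values on the element sides.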

Let $\Delta_x$ be an element with the vertices $A$, $B$ and $C$ numbered in the anticlockwise direction, and let $x$ be the control point, which is the centroid of the element $\Delta_x$. Also, let $E_x^n$ be the temperature at the control point $x$ of the element $\Delta_x$ at time $t_n$. Now, the temperature $E_x^{n+1}$ at $t = t_{n+1}$ ($= t_n + \Delta t$) at the control point $x$ of each element $\Delta_x$ of the mesh $\Delta$ is obtained by FVM in two steps. First, integrating Eq. 3 over an element $\Delta_x$ and using Green's theorem, the following relation is obtained:

$$\iint_{\Delta_x} \frac{\partial E}{\partial t}\, dx\, dy = \oint_{\partial\Delta_x} (F\, dy - G\, dx) + \iint_{\Delta_x} S\, dx\, dy \tag{5}$$

Secondly, writing the discrete form of the above Eq. 5, we have

$$\left(\frac{\partial E}{\partial t}\right)_x A_x = (F_{AB}\,\Delta y_{AB} + F_{BC}\,\Delta y_{BC} + F_{CA}\,\Delta y_{CA}) - (G_{AB}\,\Delta x_{AB} + G_{BC}\,\Delta x_{BC} + G_{CA}\,\Delta x_{CA}) + S_x A_x \tag{6}$$

where $\Delta x_{AB} = x_B - x_A$, $\Delta y_{AB} = y_B - y_A$, $A_x$ is the area of element $\Delta_x$, and $(F_{AB}, G_{AB})$, $(F_{BC}, G_{BC})$, $(F_{CA}, G_{CA})$ are the gradient values at the mid-points of the sides $AB$, $BC$ and $CA$ of the element $\Delta_x$ respectively. The temperature $E_x^{n+1}$ at the control point $x$ of $\Delta_x$ for $t = t_{n+1}$ is calculated explicitly as follows:

$$E_x^{n+1} = E_x^n + \frac{\Delta t}{A_x}\, R_x^n \tag{7}$$

where $R_x^n$ denotes the right-hand side of Eq. 6 evaluated at time $t_n$.

The temperature, $E_{node}^{n+1}$, at each node of $\Delta_x$ can be approximated by averaging the temperature values of all surrounding nodes of the polygon region of $\Omega$, and the time marching analysis is performed. In the present study, the rectangular domain and the L-shaped domain (see Fig. 1(a) and Fig. 1(b)) have been considered with Dirichlet boundary conditions. The transient computations have been performed for 100 time iterations with uniform time steps.
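The two-step procedure for a single triangle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the gradient values at the side mid-points are taken as given inputs (how they are evaluated is not described here), and the sign of the source term follows the general form $\partial E/\partial t = \partial F/\partial x + \partial G/\partial y + S$.

```python
def element_residual(verts, F, G, S):
    """Discrete flux balance for one triangle (hypothetical helper).

    verts: [(xA, yA), (xB, yB), (xC, yC)] in anticlockwise order.
    F, G:  gradient values at the mid-points of sides AB, BC, CA.
    S:     source term at the control point.
    Returns (residual, area).
    """
    flux = 0.0
    area = 0.0
    for k in range(3):
        (x0, y0), (x1, y1) = verts[k], verts[(k + 1) % 3]
        dx, dy = x1 - x0, y1 - y0
        flux += F[k] * dy - G[k] * dx          # F*dy - G*dx along each side
        area += 0.5 * (x0 * y1 - x1 * y0)      # shoelace formula
    return flux + S * area, area


def explicit_update(E_n, verts, F, G, S, dt):
    """Forward (explicit) time step for the control-point value."""
    residual, area = element_residual(verts, F, G, S)
    return E_n + dt * residual / area


# Check with T = x^2 / 2, whose Laplacian is 1: F = x at each mid-point, G = 0.
verts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
F = [0.5, 0.5, 0.0]
G = [0.0, 0.0, 0.0]
residual, area = element_residual(verts, F, G, 0.0)
```

For this test triangle the residual divided by the area equals 1, which matches the exact Laplacian of the chosen field, so one step of size 0.1 raises the control-point value by 0.1.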

3 Serial PHEAT code

The PHEAT2D code is a general-purpose finite volume code for initial boundary value problems using unstructured meshes. Delaunay triangulation and transfinite mapping methods have been used for the mesh generation. For the calculations, the preprocessor output and the boundary/initial condition information are supplied to the analysis. The computation involves the calculation of the unknown variable at the centre of each triangle, which needs the information of all neighbouring triangles, and the calculation of the unknown variables at each grid point for every time step. The serial code is optimized using the SUN f90 compiler on an UltraSparc workstation. The code can be used for heat transfer and fluid flow analysis, and it has adaptive mesh refinement capabilities. The code is written in Fortran90, which has dynamic memory allocation features.
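The second part of each time step, moving values from triangle centres to grid points, can be illustrated as follows. This is a minimal Python sketch, not the PHEAT2D routine: the function name is hypothetical and area weighting is an assumption, since the weights used by the code are not specified.

```python
def nodal_values(elements, areas, E_elem, n_nodes):
    """Area-weighted average of element values at each node.

    elements: list of (n0, n1, n2) node indices per triangle.
    areas:    triangle areas, used here as the (assumed) weights.
    E_elem:   value at the control point of each triangle.
    """
    num = [0.0] * n_nodes
    den = [0.0] * n_nodes
    for tri, area, value in zip(elements, areas, E_elem):
        for node in tri:
            num[node] += area * value
            den[node] += area
    # Nodes that belong to no triangle keep the value 0.0.
    return [n / d if d > 0.0 else 0.0 for n, d in zip(num, den)]


# Two triangles sharing the side between nodes 1 and 2.
T_node = nodal_values([(0, 1, 2), (1, 3, 2)], [1.0, 3.0], [10.0, 20.0], 4)
```

Nodes 0 and 3 each touch a single triangle and take its value, while the shared nodes 1 and 2 receive (1.0 * 10 + 3.0 * 20) / 4 = 17.5.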

4 Parallelisation strategy

The parallelisation strategy described here considers all aspects of the simulation, such as mesh partitioning, I/O, distribution of mesh data, asynchronous exchange of mesh data and the numerical algorithm. The most important feature of the strategy is the application of a mesh partitioning algorithm (METIS) to create nearly load-balanced subdomains, which are assigned to different processors of the parallel machine. In the explicit FVM computations, communication between processors takes place in every time step only in the computation of the temperature at the control points of elements on the interface of a subdomain and at the nodal points on the interface of subdomains. An efficient data structure is necessary for the computation, communication and memory access performance of the program and for overlapping communication with computation. For this, the elements in each subdomain are categorised into four different groups: boundary triangles, interior triangles, interface triangles and virtual elements. These groups are explained in Section 4.1.
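The two goals of the partitioning step, balanced subdomain sizes and few cut connections between neighbouring elements, can be quantified for any given partition vector. The Python sketch below does not call METIS; it only evaluates a partition that is already available, and the function name and the imbalance definition (largest subdomain size divided by the average size) are assumptions made for illustration.

```python
def partition_quality(adjacency, part, nparts):
    """Subdomain sizes, edge cut and load imbalance of a partition.

    adjacency: adjacency[e] lists the elements sharing a side with e.
    part:      part[e] is the subdomain (processor) owning element e.
    """
    sizes = [0] * nparts
    for p in part:
        sizes[p] += 1
    edge_cut = 0
    for e, nbrs in enumerate(adjacency):
        for n in nbrs:
            # Count each neighbouring pair once (n > e).
            if n > e and part[n] != part[e]:
                edge_cut += 1
    imbalance = max(sizes) * nparts / len(part)
    return sizes, edge_cut, imbalance


# Four elements in a row, split into two subdomains in three ways.
chain = [[1], [0, 2], [1, 3], [2]]
good = partition_quality(chain, [0, 0, 1, 1], 2)
striped = partition_quality(chain, [0, 1, 0, 1], 2)
skewed = partition_quality(chain, [0, 0, 0, 1], 2)
```

The contiguous split is balanced with a single cut connection, the striped split is balanced but cuts every connection, and the skewed split has few cuts but an imbalance of 1.5. A partitioner aims for the first kind of result.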


4.1 Categorization of elements

A boundary triangle is defined as a triangle with at least two vertices on the boundary of the domain. Interior elements have all the necessary data and their neighbouring elements' information available on the local processor. Only the interface elements require interprocessor communication for the calculation of the temperature at their control points. Virtual elements are elements in the domain on which no computations are performed, but whose presence is required for the calculation of temperature values at the interface elements. The interface temperature values sent by one processor reside on the virtual elements of the adjacent processor. This division is based on the fact that only a few elements in a subdomain demand interprocessor communication, while a large number of elements have all the necessary information available on the local processor. The mesh data includes the element connectivity, each element and its neighbouring elements, each node and its neighbouring elements, the interface elements and the neighbouring processor element information. A lookup table is prepared which maps the local renumbering of nodes to the global renumbering, and it ensures that the data received on the neighbouring processor is correct.
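The four groups can be derived from the element connectivity, the neighbour lists and the partition vector. The following Python sketch is one possible reading of the definitions above, with hypothetical names; in particular, letting the interface test take precedence over the boundary test is an assumption, since an element may satisfy both.

```python
def categorize(elements, neighbours, part, boundary_nodes, me):
    """Group the elements seen by processor `me`.

    elements:       (n0, n1, n2) node indices per triangle.
    neighbours:     neighbours[e] lists adjacent elements, -1 for no neighbour.
    part:           part[e] is the processor owning element e.
    boundary_nodes: set of node indices lying on the domain boundary.
    """
    cats = {"boundary": [], "interior": [], "interface": [], "virtual": set()}
    for e, tri in enumerate(elements):
        if part[e] != me:
            continue
        remote = [n for n in neighbours[e] if n >= 0 and part[n] != me]
        if remote:
            # Needs data owned by another processor.
            cats["interface"].append(e)
            cats["virtual"].update(remote)   # copies held, never computed
        elif sum(v in boundary_nodes for v in tri) >= 2:
            cats["boundary"].append(e)
        else:
            cats["interior"].append(e)
    cats["virtual"] = sorted(cats["virtual"])
    return cats


# A strip of four triangles on nodes 0..5, split between two processors.
elements = [(0, 1, 3), (1, 4, 3), (1, 2, 4), (2, 5, 4)]
neighbours = [[-1, 1, -1], [2, -1, 0], [-1, 3, 1], [-1, -1, 2]]
part = [0, 0, 1, 1]
on_boundary = {0, 1, 2, 3, 4, 5}
c0 = categorize(elements, neighbours, part, on_boundary, 0)
c1 = categorize(elements, neighbours, part, on_boundary, 1)
```

On processor 0, element 1 is an interface element and element 2 appears as its virtual element; on processor 1 the roles are mirrored.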

4.2 Parallel PHEAT code

First, the triangular mesh $\Delta$ is generated in the domain $\Omega$, and then the graph partitioning algorithm is applied on the master processor (iproc = 0), which gives Nproc subdomains $\Omega^{(i)}$, $i = 0, 1, \ldots, \text{Nproc} - 1$, with triangular meshes $\Delta^{(i)}$, $i = 0, 1, \ldots, \text{Nproc} - 1$. The mesh partitioning algorithm has been used to create subdomains which allow the overall computational load to be distributed as evenly as possible among the processors. The mesh partitioning is done in such a way that nodes and edges can be shared among multiple processors, but each element lies entirely within a particular processor. Within a subdomain, each node is assigned a unique integer identification, and a node on the interface is assigned multiple integer identifications. After the necessary mesh data (connectivity, neighbours, coordinates, etc.) of $\Delta^{(k)}$ of the subdomain has been sent to the $k$th processor asynchronously, the FVM computations are performed concurrently on each processor. At each time step, each subdomain sends information such as the interface temperature values. This information is received by the neighbouring processor and is used for the calculation of temperature values at the interface elements of the subdomain which is mapped onto the neighbouring processor. These temperature values are placed on the virtual elements at the receiving processor. Before unpacking the message from the other processors, the computations for the internal elements and the elements near the boundary can be performed. At the end of this computation, it is expected that the message has arrived on the processor, so the computations at the interface elements can be performed. In this way, the temperature at each control point, $x$, is obtained. Thereafter, the temperature at each nodal point is computed by considering the polygon surrounding the node, using a weighted averaging of all nodal values on the polygon region, which requires interprocessor communication.

4.3 Parallel algorithm

The important steps involved in the calculation of the temperature, $E_x^{n+1}$, at each control point, $x$, and the temperature at each node of the triangulation, $\Delta^{(i)}$ ($0 \le i \le \text{Nprocs} - 1$), of the subdomain, $\Omega^{(i)}$, on all processors are explained below:

Program Parallel-FVM
For iproc = 0, Nprocs - 1

  Step 1  if (iproc == 0) then
            • Generate triangular mesh $\Delta$, execute METIS to get Nproc subdomains,
            • distribute the initial temperature, $E_{node}^{n}$, for all nodes and,
            • distribute the mesh data $\Delta^{(i)}$ to the $i$th processor asynchronously
          endif

  Step 2  The processor iproc receives the mesh data of the triangulation, $\Delta^{(iproc)}$, and the temperature $E_{node}^{n}$ for all nodes of $\Delta^{(iproc)}$ of the subdomain, $\Omega^{(iproc)}$.

  Step 3  For itime = 1, Maximum-time

    1. Set $t_{n+1} = t_n + \Delta t$

    (a) Obtain the temperature $E_x^{n+1}$ at time $t = t_{n+1}$

    2. Start sending the temperature, $E_x^{n}$, of the interface and virtual elements at $t = t_n$ to the nearest neighbouring processors asynchronously.

    3. Calculate the temperature, $E_x^{n+1}$, at the control point of each element, $\Delta_x$, of the mesh $\Delta^{(iproc)}$ of $\Omega^{(iproc)}$ from Eq. 7 for all internal elements.

    4. The processor iproc will receive the temperature, $E_x^{n}$, of the interface and virtual elements from the nearest neighbouring processors asynchronously.

    5. Calculate the temperature, $E_x^{n+1}$, of $\Delta_x$ of the mesh $\Delta^{(iproc)}$ from Eq. 7 for the elements near the interface of $\Delta^{(iproc)}$.

    (b) Obtain the temperature, $E_{node}^{n+1}$, at time $t = t_{n+1}$

    6. Calculate the temperature, $E_{node}^{n+1}$, for all nodal points of internal elements of the mesh $\Delta^{(iproc)}$ of the subdomain, $\Omega^{(iproc)}$.

    7. The processor iproc will send asynchronously and receive all the information of temperature $E_x^{n+1}$, computed for the control point of $\Delta_x$ of all interface elements of $\Delta^{(iproc)}$, from the nearest processors.

    8. Calculate the temperature, $E_{node}^{n+1}$, for all nodal points on interface elements of $\Delta^{(iproc)}$ of the subdomain, $\Omega^{(iproc)}$.

  enddo TimeLoop
enddo Processor-Loop
stop
End program Parallel-FVM
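The ordering of the steps inside the time loop (post the sends, update internal elements, receive, update interface elements) can be imitated in a single Python process. The sketch below is a simulation under stated assumptions, not an MPI program: a dictionary stands in for the messages, all names are hypothetical, and a simple neighbour-difference update replaces the finite volume update of Eq. 7. In a real code the mailbox would be replaced by non-blocking MPI calls such as MPI_Isend and MPI_Irecv completed by MPI_Wait.

```python
def serial_step(E, nbrs, dt):
    """Reference update on the whole mesh (stand-in for Eq. 7)."""
    return [E[i] + dt * sum(E[j] - E[i] for j in nbrs[i]) for i in range(len(E))]


def parallel_step(E, nbrs, part, nparts, dt):
    """Same update, organised per subdomain in the order of Step 3."""
    # "Send": each processor posts the values of its interface elements.
    mailbox = {p: {} for p in range(nparts)}
    for i, owner in enumerate(part):
        for j in nbrs[i]:
            if part[j] != owner:
                mailbox[part[j]][i] = E[i]   # will sit on a virtual element

    E_new = list(E)
    for p in range(nparts):
        local = [i for i in range(len(E)) if part[i] == p]
        interface = [i for i in local if any(part[j] != p for j in nbrs[i])]
        internal = [i for i in local if i not in interface]

        # Internal elements need only local data, so they are updated
        # while the messages are (notionally) still in flight.
        for i in internal:
            E_new[i] = E[i] + dt * sum(E[j] - E[i] for j in nbrs[i])

        # "Receive": unpack the neighbour values into the virtual elements.
        virtual = mailbox[p]

        # Interface elements combine local and virtual-element values.
        for i in interface:
            total = 0.0
            for j in nbrs[i]:
                Ej = E[j] if part[j] == p else virtual[j]
                total += Ej - E[i]
            E_new[i] = E[i] + dt * total
    return E_new


# Six elements in a row, three per processor.
nbrs = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
E0 = [1.0, 0.0, 0.0, 0.0, 0.0, 2.0]
E_ser = serial_step(E0, nbrs, 0.1)
E_par = parallel_step(E0, nbrs, [0, 0, 0, 1, 1, 1], 2, 0.1)
```

The partitioned update reproduces the serial result because every interface element finds the value of its off-processor neighbour on a virtual element, which is the purpose of the exchange in steps 2 and 4.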

5 Results and discussion

The computing platform is the PARAM OPENFRAME computing system, which is a cluster of eight workstations. Each node is composed of an UltraSparc 2 processor with a clock rate of 200 MHz and 128 Mbytes of memory. The processors are connected over a Fast Ethernet with a peak bandwidth of 100 Mbit/s. MPI has been used for message passing. The serial and parallel codes are run on the same machine. The timings reported here are wall clock times, which include CPU, system, waiting and I/O time.

Table 1 shows the performance of the PHEAT2D code on two, four and eight processors for the rectangular domain for different meshes. Figs. 2a-d show the mesh partitioning for the class A size problem. The rectangular grid is partitioned among two, four, six and eight processors, with one subdomain being assigned to each processor. It is observed that the mesh partitioning software, METIS, creates subdomains which allow the overall computational load to be distributed as evenly as possible among the processors. Also, the number of interface elements and virtual elements varies from 15 % to 20 % with respect to the interior elements. On the coarse mesh, the speedup is very low due to the small amount of computation, but as the mesh size is increased the speedup increases substantially, especially on 4 and 8 processors. The performance is further improved by asynchronous communication at the initial time interval and by overlapping communication with computation in each domain at the interface. This trend shows that satisfactory speedup can be achieved on 8 processors, and the efficiency is about 55-60 % on large size of the problems, as shown in Table 1. The speedup and efficiency decrease for more than 8 p[...] for the Class D problem, which may be due to the bandwidth of the interconnecting network.
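The speedup and efficiency figures follow the usual definitions, speedup = T(1) / T(p) and efficiency = speedup / p. A short Python check, using hypothetical function names and one pair of timings from the tabulated results (249.5 s on one processor, 52.0 s on eight), is given below as an illustration.

```python
def speedup(t_serial, t_parallel):
    """Ratio of one-processor time to p-processor time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, nproc):
    """Speedup per processor, as a fraction between 0 and 1."""
    return speedup(t_serial, t_parallel) / nproc


s8 = speedup(249.5, 52.0)
e8 = efficiency(249.5, 52.0, 8)
```

Rounded to the precision used in the tables, these values are a speedup of 4.80 and an efficiency of 60.0 %.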

Figs. 3a-d show the partition of the grid into two, four, six and eight subdomains, with each subdomain being assigned [...] L-shaped domain. As expected, [...] eight equal subdomains may create [...] faces, which inflates the number of [...]. Consequently, the interprocessor communication [...] achieved performance of the code PHEAT2D [...]. The results show a similar trend to that of the rectangular domain. It is observed [...] are equally distributed on all processors [...] interprocessor communication [...] increased up to 60-65 % as the computation [...] from the coarse mesh to the final mesh on 8 processors. It can be concluded that the explicit FVM computations show good performance on parallel machines for large meshes.

6 Conclusions

An efficient parallelisation strategy has been proposed for solving Heat Transfer applications using unstructured triangular meshes and graph partitioning algorithms by the finite volume method. The graph partitioning algorithm METIS is found to be very successful in load balancing and minimization of the interprocessor communication. The parallelisation strategy shows satisfactory performance up to 8 processors for large mesh sizes. An efficient data structure scheme for storing the mesh data and a mapping scheme for elements information onto neighbouring processors are used. Also, the computations and communications have been overlapped in the algorithm to obtain a performance improvement. The software is written in Fortran 90 and MPI, and it can be used for finite volume computations for both explicit and implicit calculations.

7 Acknowledgments

We thank Prof. George Karypis and Prof. Vipin Kumar, Department of Computer Science, University of Minnesota, Minneapolis, for providing the METIS software to us.

References

[1] McBryan O., 'New architectures: performance highlights and new algorithms', Parallel Computing, Vol. 7, pp. 477-499 (1988)

[2] Farhat C., Lanteri S. and Fezoui L., 'Mixed finite volume/finite element massively parallel computations: Euler flow, unstructured grids and upwind approximations', Proc. ICASE Workshop on Unstructured Scientific Computations on Scalable Multiprocessors, (1990)

[3] Farhat C. and Lesoinne M., 'Automatic Partitioning of Unstructured Meshes for the Parallel Solution of Problems in Computational Mechanics', Int. J. Numer. Meth. Engg., Vol. 36, pp. 745-764 (1993)

[4] Lohner R., Camberos J. and Merriam M., 'Parallel unstructured grid generation', Comp. Meth. Appl. Mech. Engg., Vol. 95, pp. 343-357 (1992)

[5] Lewis R. W., Zheng Y. and Usmani A. S., 'Aspects of adaptive mesh generation based on domain decomposition and Delaunay triangulation', Finite Elements in Analysis and Design, Vol. 20, pp. 47-70 (1995)

[6] Karypis G. and Kumar V., 'METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0', Department of Computer Science, University of Minnesota, August (1995)

[7] Karypis G. and Kumar V., 'Analysis of multilevel graph partitioning', Technical Report TR 95-037, Department of Computer Science, University of Minnesota, (1995)

[8] Kumfert G. and Pothen A., 'A multilevel nested bisection algorithm', Unpublished work (1995)

Figure 1: Finite Element mesh on rectangular and L shape domain


Table 1: Performance results for rectangular domain

Problem type   Elements   Nodes   Edges   Nproc   Time (seconds)   Speedup   Efficiency
class A        1800       -       -       1       9.2              1.00      100.0 %
                                          2       5.4              1.70      85.2 %
                                          4       4.0              2.30      58.5 %
                                          8       4.3              2.14      26.7 %
class B        7442       -       -       1       48.6             1.00      100.0 %
                                          2       26.6             1.83      91.4 %
                                          4       17.2             2.82      71.6 %
                                          8       12.5             3.89      49.6 %
class C        14450      7396    21845   1       115.0            1.00      100.0 %
                                          2       62.5             1.84      92.0 %
                                          4       38.0             3.03      75.7 %
                                          8       26.1             4.41      55.1 %
class D        28800      14641   43440   1       249.5            1.00      100.0 %
                                          2       136.0            1.83      91.7 %
                                          4       82.0             3.04      76.1 %
                                          8       52.0             4.80      60.0 %

Table 2: Performance results for L-shaped domain

Problem type   Elements   Nodes   Edges   Nproc   Time (seconds)   Speedup   Efficiency
class A        -          -       -       1       12.1             1.00      100.0 %
                                          2       6.9              1.75      87.7 %
                                          4       5.0              2.42      60.5 %
                                          8       4.6              2.63      32.9 %
class B        5400       2821    8220    1       27.6             1.00      100.0 %
                                          2       15.0             1.84      92.0 %
                                          4       9.8              2.82      70.4 %
                                          8       7.4              3.73      46.6 %
class C        -          -       32640   1       196.0            1.00      100.0 %
                                          2       105.0            1.87      93.3 %
                                          4       61.2             3.21      80.3 %
                                          8       44.8             4.38      54.7 %
class D        48600      24661   73260   1       685.0            1.00      100.0 %
                                          2       380.0            1.80      90.2 %
                                          4       205.0            3.34      83.6 %
                                          8       127.0            5.40      67.4 %


Figure 2: Finite Element mesh on Rectangular domain (a-d)


Figure 3: Finite Element mesh on L-shape domain (a-d)
