

[IEEE 2009 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON) - Chonburi, Thailand, 6-9 May 2009]

On Performance Study of The Global Arrays Toolkit on Homogeneous Grid Computing Environments: Multi-level Topology-Aware and Multi-level Parallelism

Sirod Sirisup and Suriya U-ruekolan
Large-Scale Simulation Research Laboratory
National Electronics and Computer Technology Center (NECTEC)
112 Thailand Science Park, Pahon Yothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand
Tel: 662-564-6900 Ext. 2276, Fax: 662-564-6772
Email: [email protected]

Abstract-The Global Arrays toolkit is a library that allows programmers to write parallel programs that use large arrays distributed across processing nodes through the Aggregate Remote Memory Copy Interface (ARMCI). In this study, we have furthered our investigation of the performance of a parallel application implemented with the Global Arrays toolkit on homogeneous Grid computing environments. Two types of homogeneous Grid computing environments have been examined in this study. In order to fully take advantage of the Global Arrays toolkit, topology-aware and multi-level parallelism techniques have also been implemented in the evaluating application. We have found that the current implementation of the evaluating application outperforms both the MPICH-G2 and the typical Global Arrays implementations in all studied cases.

I. INTRODUCTION

Grid computing involves heterogeneous collections of computers that may reside in different administrative domains, run different software, be subject to different access control policies, and be connected by networks with widely varying performance characteristics [1]. Nonetheless, the main drawbacks of Grid computing are high latency and the requirement for high-speed interconnection. Most large-scale problems in computational science and engineering, e.g. computational fluid dynamics, computational chemistry and bioinformatics, have been implemented in the form of parallel applications. The algorithms in these applications typically require a significant amount of interprocess communication. Because of the aforementioned drawbacks, applications run on a Grid computing environment may be degraded. A way to resolve this problem is to reduce the network utilization of interprocess communication in the parallel application, so that efficient network utilization on a Grid computing environment can be accomplished.

In past research [2], we investigated the performance of the Global Arrays toolkit [3] on cluster and Grid computing environments. We found that in a cluster environment, the performance-evaluation application implemented with the toolkit performs better than the MPICH implementation when the message size is large enough. However, for low-bandwidth


and high-latency circumstances such as the current Grid computing environment, degradation of the Global Arrays implementation compared with the MPICH-G2 implementation is observed. Thus, a way to improve the efficiency of the application is to employ multi-level topology-aware [4] and multi-level parallelism [5] techniques. In essence, we employ Global Arrays operations within each cluster and MPICH-G2 operations for intercluster calls in order to reduce communication cost. However, in this study we focus our attention on the cases of homogeneous Grid environments, because the efficiency of the new implementation must be compared against that of the typical Global Arrays implementation.

II. RELATED TECHNOLOGY

For the performance study of the GA toolkit together with multi-level parallelism and multi-level topology-aware techniques on homogeneous Grid computing environments, we focus on three main technologies as follows:

A. Global Arrays and ARMCI

The Global Arrays (GA) toolkit [3],[6] was designed to simplify the programming methodology on distributed memory systems. The most innovative idea of GA is that it provides an asynchronous one-sided, shared-memory programming environment for distributed memory systems. GA includes the ARMCI (Aggregate Remote Memory Copy Interface) library, which provides one-sided communication capabilities for distributed array libraries and compiler run-time systems. ARMCI offers a simpler and lower-level model of one-sided communication than MPI-2 [7],[8]. GA reduces the effort required to write parallel programs for clusters since programmers can assume a virtual shared memory. Part of the user's task is to explicitly define the physical data locality for the virtual shared memory and the appropriate access patterns of the parallel algorithm.
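As a concrete illustration of this programming model, the following minimal sketch (ours, not from the paper; the array extents and MA memory limits are assumed values) creates a distributed two-dimensional array and queries the locally owned block:

#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);       /* GA runs on top of MPI here */
    GA_Initialize();
    /* MA supplies GA's local buffer memory; the limits are illustrative. */
    MA_init(C_DBL, 1000000, 1000000);

    int dims[2]  = {1024, 1024};  /* global extents (assumed) */
    int chunk[2] = {-1, -1};      /* let GA choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "u", chunk);

    double zero = 0.0;
    GA_Fill(g_a, &zero);          /* collective initialization */

    /* Each process can ask which block of the virtual shared memory
       it owns, and may then access any part of it one-sidedly. */
    int lo[2], hi[2];
    NGA_Distribution(g_a, GA_Nodeid(), lo, hi);
    printf("process %d owns rows %d..%d, cols %d..%d\n",
           GA_Nodeid(), lo[0], hi[0], lo[1], hi[1]);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}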


B. Grid Computing

High-performance "computational grids" involve heterogeneous collections of computers that may reside in different administrative domains, run different software, be subject to different access control policies, and be connected by networks with widely varying performance characteristics [1]. By using communication mechanisms designed for a Grid environment, such as MPICH-G2 with the services provided by the Globus toolkit, an application can run across clusters spread over campus-area, metropolitan-area, and wide-area networks; that is, a meta-computing environment is ready [9].

C. Multi-Level Topology-Aware

The multi-level topology-aware approach minimizes messaging across the slowest links at each level by clustering the processes at the wide-area level into site groups and then, within each site group, clustering processes at the local-area level into machine groups. One benefit of using a multi-level topology-aware tree to implement a collective operation is that we are free to select different subtree topologies at each level. This technique thus allows a developer to create a multi-level parallelism application based on the proper parallel tool in each machine group. MPICH-G2 addresses this issue within the standard MPI framework by using the MPI communicator construct to deliver topology information to an application. It associates attributes with MPI communicators to communicate this topology information, which is expressed within each process in terms of the topology depths and colors used by MPICH-G2 to represent network topology in a computational Grid [4].

III. EXPERIMENTAL SETTING

In this section, we provide a brief description of the experimental setting used in the current investigation. The Grid computing environment used here is composed of three clusters connected through the ThaiSarn network: two three-host clusters with 6 processors each, and one two-host cluster with 4 processors. The experimental setting comprises two types of network topology, giving two homogeneous Grid environments. In the first environment, the three clusters reside separately on isolated sites bridged by a 10/100 Mbps network switch, as shown in Figure 1.

Fig. 1. Experimental setting for the first Grid environment

In the second environment, two clusters residing on the same site are bridged by a gigabit network switch, and this pair is bridged by a 10/100 Mbps network switch to a cluster on a remote site, as shown in Figure 2.

Fig. 2. Experimental setting for the second Grid environment

The specification of each individual node and its interconnect network communication are shown in Table I.

TABLE I. CLUSTER NODE SPECIFICATION (processor, processors per node, and memory of each node)

Communication on the Grid computing environment is handled by the MPICH-G2 library. MPICH-G2 uses Globus I/O with SSL for securing messages, and this does incur some overhead. However, Globus I/O does not fully utilize the security facility: in the default setting of the current distribution of Globus (Globus 4.0.5), the security layer does not encrypt any messages but only protects their integrity by using message digests. In the current study, the Globus toolkit (4.0.5) together with MPICH-G2 is used in the Grid computing environment.

IV. EVALUATING APPLICATION AND IMPLEMENTATION

In order to evaluate the performance of the Global Arrays (GA) toolkit with multi-level parallelism and multi-level topology-aware techniques, the structure of the parallel algorithm of the evaluating application is taken to be the same as in [2]. Precisely, the application essentially solves the following governing PDE:

    u_xx + u_yy = f(x, y),   0 <= x <= a,   0 <= y <= b      (1)

The finite difference (FD) approximation discretizes the computational domain into a set of discrete mesh points (x_i, y_j), evenly spaced at distance h. With a zero source term, the finite difference equation representing equation (1) reduces to:

    u_{i,j} = ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} ) / 4      (2)


where u_{i,j} represents the approximation of u(x_i, y_j). The solution process starts with initial estimates for all u_{i,j} values; the iterative process is then performed until the values converge. We refer to this process as the Jacobi method. The convergence of the Jacobi method is guaranteed because of the diagonally dominant structure. Suppose that we are working with an n x n mesh and p processes; with the row-wise block-stripped decomposition, each of the p processes manages its own mesh of size n/p x n, see Figure 3.

Fig. 3. A row-wise block-stripped decomposition of an n x n mesh with p processes.

Fig. 4. A row-wise block-stripped decomposition of an (n/number of sites) x n mesh with p processes.
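To make equation (2) concrete, one serial Jacobi sweep over the interior mesh points can be sketched as follows (our illustration; array names are assumed):

/* One Jacobi sweep for equation (2): every interior point becomes
   the average of its four neighbors; u_old holds the previous iterate. */
void jacobi_sweep(int n, double u_old[n][n], double u_new[n][n])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            u_new[i][j] = 0.25 * (u_old[i-1][j] + u_old[i+1][j] +
                                  u_old[i][j-1] + u_old[i][j+1]);
}

In the parallel variants below, the sweep itself is unchanged; the implementations differ only in how the ghost rows needed at the block boundaries are obtained.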

The detail of each implementation is described as follows:

1) MPICH-G2 implementation: For the MPICH-G2 implementation, in each iteration each interior process must send and receive 2n values to and from its immediate neighbors in order to update the ghost points around its block. The implementation of the algorithm with the MPI library is quite straightforward: via the MPI_Send and MPI_Recv functions.
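A minimal sketch of this exchange (our reconstruction; it assumes each process stores its n/p owned rows contiguously with one ghost row above and one below):

#include <mpi.h>

/* Exchange boundary rows with the immediate neighbors. Row 0 and row
   local_rows+1 of block are ghost rows; rows 1..local_rows are owned.
   Pairing the sends and receives as below avoids deadlock; a single
   MPI_Sendrecv per direction would work equally well. */
void exchange_ghosts(double *block, int local_rows, int n,
                     int rank, int nprocs, MPI_Comm comm)
{
    MPI_Status st;
    if (rank > 0) {                       /* neighbor above */
        MPI_Send(&block[1 * n], n, MPI_DOUBLE, rank - 1, 0, comm);
        MPI_Recv(&block[0 * n], n, MPI_DOUBLE, rank - 1, 0, comm, &st);
    }
    if (rank < nprocs - 1) {              /* neighbor below */
        MPI_Recv(&block[(local_rows + 1) * n], n, MPI_DOUBLE,
                 rank + 1, 0, comm, &st);
        MPI_Send(&block[local_rows * n],       n, MPI_DOUBLE,
                 rank + 1, 0, comm);
    }
}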

2) Global Arrays implementation: For the Global Arrays implementation, in order to retrieve and update the ghost and actual data in each interior process's block, we employ the one-sided communication operations NGA_Get and NGA_Put.
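A sketch of the corresponding GA calls per iteration (our reconstruction; the row-block bounds and buffer names are assumptions):

#include "ga.h"

/* Publish this process's updated rows and fetch the neighbor rows
   one-sidedly; g_a is the n x n global array, and rows
   first_row..last_row are owned by this process. */
void jacobi_halo_ga(int g_a, int first_row, int last_row, int n,
                    double *new_rows, double *ghost_above, double *ghost_below)
{
    int lo[2], hi[2], ld[1] = {n};

    lo[0] = first_row; hi[0] = last_row;   /* write owned rows back */
    lo[1] = 0;         hi[1] = n - 1;
    NGA_Put(g_a, lo, hi, new_rows, ld);
    GA_Sync();    /* make all puts visible before reading neighbors */

    if (first_row > 0) {                   /* one-sided read above */
        lo[0] = hi[0] = first_row - 1;
        lo[1] = 0; hi[1] = n - 1;
        NGA_Get(g_a, lo, hi, ghost_above, ld);
    }
    if (last_row < n - 1) {                /* one-sided read below */
        lo[0] = hi[0] = last_row + 1;
        lo[1] = 0; hi[1] = n - 1;
        NGA_Get(g_a, lo, hi, ghost_below, ld);
    }
}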

3) Global Arrays with multi-level topology-aware and multi-level parallelism implementation: For this implementation, the computational grid topology discovery is accomplished through the topology depths and colors reported by the MPI_Attr_get function via the MPICHX_TOPOLOGY_DEPTHS and MPICHX_TOPOLOGY_COLORS attributes. In this stage, a mesh of size (n/number of sites) x n is handled by each site, so that each of a site's p processes manages a block of size (n/number of sites)/p x n, see Figure 4.
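A sketch of this discovery step (our reconstruction based on the MPICH-G2 documentation; using the wide-area level 0 color for site grouping is an assumption of this illustration):

#include <mpi.h>
/* MPICHX_TOPOLOGY_DEPTHS and MPICHX_TOPOLOGY_COLORS are MPICH-G2
   extension attributes declared by its mpi.h. */

/* Split MPI_COMM_WORLD into one communicator per site using the
   wide-area (level 0) color of each process. */
MPI_Comm make_site_comm(void)
{
    int me, flag;
    int *depths;     /* depths[i]: topology depth of process i      */
    int **colors;    /* colors[i][l]: color of process i at level l */

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_DEPTHS, &depths, &flag);
    MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_COLORS, &colors, &flag);

    MPI_Comm site_comm;
    MPI_Comm_split(MPI_COMM_WORLD, colors[me][0], me, &site_comm);
    return site_comm;
}

Processes that share the wide-area color reside at the same site, so site_comm collects exactly the processes of one cluster.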

In each site, Global Arrays is employed to handle the mesh by creating Global Arrays blocks that are handled exclusively by the processes in that site. This exclusive handling is made possible by the Global Arrays process-group APIs. The processes in each site thus use the one-sided communication operations NGA_Get and NGA_Put within their own process group. In order to send and receive 2n values to and from its immediate neighbor sites, a predetermined process (the process with id = 0 in each site) is designated to do this task through the MPI_Send and MPI_Recv functions. After the information has been exchanged, the predetermined process transfers the ghost data to the proper process in its site via the Global Arrays block. This algorithm forms a multi-level parallelism because it is composed of inter-site MPICH-G2 calls and intra-site Global Arrays calls.
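The group construction itself might look as follows; this is a sketch under the assumptions above (names are illustrative, and we assume GA_Pgroup_create takes MPI_COMM_WORLD ranks, per the GA process-group API):

#include <stdlib.h>
#include <mpi.h>
#include "ga.h"

/* Build a GA process group from one site's processes and create the
   site's sub-mesh inside it; only site rank 0 later exchanges the
   2n boundary values with neighboring sites over MPI. */
void setup_site_mesh(MPI_Comm site_comm, int n, int nsites)
{
    int site_rank, site_size;
    MPI_Comm_rank(site_comm, &site_rank);
    MPI_Comm_size(site_comm, &site_size);

    /* Translate site-communicator ranks to MPI_COMM_WORLD ranks. */
    MPI_Group world_grp, site_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Comm_group(site_comm, &site_grp);
    int *site_ranks = malloc(site_size * sizeof(int));
    int *world_list = malloc(site_size * sizeof(int));
    for (int i = 0; i < site_size; i++) site_ranks[i] = i;
    MPI_Group_translate_ranks(site_grp, site_size, site_ranks,
                              world_grp, world_list);

    int pgroup = GA_Pgroup_create(world_list, site_size);
    GA_Pgroup_set_default(pgroup);     /* GA calls are now site-local */

    int dims[2]  = {n / nsites, n};    /* this site's share of the mesh */
    int chunk[2] = {-1, -1};
    int g_a = NGA_Create(C_DBL, 2, dims, "site_mesh", chunk);

    /* Intra-site: each process updates its block with NGA_Get/NGA_Put.
       Inter-site: only site_rank == 0 exchanges boundary rows with the
       neighbor sites' rank-0 processes, then NGA_Puts the received
       rows into g_a for the owning processes to read. */

    GA_Destroy(g_a);
    free(site_ranks);
    free(world_list);
}

With this structure, the slow wide-area links carry only one pair of boundary-row messages per neighboring site per iteration, while all other traffic stays inside the site groups.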

V. RESULTS AND DISCUSSION

We have implemented the evaluating application in three variants: Global Arrays (GA), MPICH-G2, and Global Arrays with multi-level topology-aware and multi-level parallelism techniques. The solving algorithm of each implementation is identical, i.e., the Jacobi iteration. All implementations were executed on different grid resolutions: 216 x 216, 432 x 432 and 864 x 864 meshes. The numbers of processes used in the current study are 3, 6, 9, 12 and 16, formed by taking one process from each site, and so on.


Fig. 5. Execution time of all implementations of the evaluating application on the first Grid computing environment

The execution times of all cases for all implementations on the first Grid computing environment are shown in Figure 5. From the figure, we can clearly see that the multi-level topology-aware and multi-level parallelism implementation greatly outperforms the other two implementations. We also see that the Global Arrays implementation performs poorly compared to the MPICH-G2 implementation, as usual, because there is a significant amount of network communication between a processor p and processors other than the immediate neighbors


of the processor p. The details of this behavior have been studied in [2]. It is noted here that in the case of three processors, the MPICH-G2 and the multi-level topology-aware and multi-level parallelism implementations are identical. The overall speedups of the multi-level topology-aware and multi-level parallelism implementation are 6.02, 4.77 and 4.01 for n = 216, 432, 864, respectively. The overall speedup is calculated by performing a linear regression on the execution times.

Fig. 6. Execution time of all implementations of the evaluating application on the second Grid computing environment

Figure 6 shows the execution times of all cases for all implementations on the second Grid computing environment. Here, the performance of the MPICH-G2 implementation is greatly improved compared to the previous environment. However, the performance of the multi-level topology-aware and multi-level parallelism implementation is almost the same. It is possible that a network bandwidth of 10/100 Mbps is sufficient for transferring data between sites. The overall speedups of the multi-level topology-aware and multi-level parallelism implementation are 4.39, 4.14 and 1.55 for n = 216, 432, 864, respectively. However, we observe a large degradation of performance in the typical Global Arrays implementation case. This may result from the unbalanced network bandwidth; a further investigation of this issue is still needed.



VI. CONCLUSIONS

In this paper, we investigated the performance of the Global Arrays toolkit with multi-level topology-aware and multi-level parallelism techniques on homogeneous Grid computing environments for a typical parallel application. For low-bandwidth, high-latency circumstances such as the current Grid computing environment, the multi-level topology-aware and multi-level parallelism implementation outperforms the MPICH-G2 and typical Global Arrays implementations in all studied cases. However, in order to efficiently employ the Global Arrays toolkit with multi-level topology-aware and multi-level parallelism techniques on a homogeneous Grid computing environment, the developer must consider two issues: the sensitivity of the algorithm and the initial investment in re-coding the application.

To effectively utilize the Global Arrays implementation with multi-level topology-aware and multi-level parallelism, the developer must make sure that the algorithm of the application is stable, because explicit control of the Global Arrays (group) operations cannot easily be exercised. The implementation can be simply ported to the homogeneous Grid environment, but the initial investment in re-coding the application is quite considerable compared to the MPICH-G2 implementation.

REFERENCES

[1] I. Foster, C. Kesselman and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of Supercomputer Applications, pp. 200-222, 2001.

[2] S. Sirisup and S. U-ruekolan, "On Performance Study of the Global Arrays Toolkit on Cluster and Grid Computing Environments," 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2008), pp. 141-144, 2008.

[3] J. Nieplocha, R. J. Harrison and R. J. Littlefield, "Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers," Journal of Supercomputing, 10:169-189, 1997.

[4] N. T. Karonis, B. R. de Supinski, I. Foster, W. Gropp, E. Lusk and J. Bresnahan, "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance," Proceedings of the International Parallel Processing Symposium (IPPS), pp. 377-384, 2000.

[5] C. Xavier, R. Sachetto, V. Vieira, R. W. Dos Santos and W. Meira Jr., "Multi-level Parallelism in the Computational Modeling of the Heart," Proceedings - Symposium on Computer Architecture and High Performance Computing, art. no. 4384036, pp. 3-10, 2007.

[6] L. Huang, B. Chapman and R. Kendall, "OpenMP for Clusters," The Fifth European Workshop on OpenMP (EWOMP'03), Aachen, Germany, 2003.

[7] R. Thakur, W. Gropp and B. Toonen, "Optimizing the Synchronization Operations in Message Passing Interface One-Sided Communication," International Journal of High Performance Computing Applications, Vol. 19, No. 2, pp. 119-128, 2005.

[8] J. Nieplocha and B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-Time Systems," Proceedings of the IPPS/SPDP'99 Workshops, Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, pp. 533-546, Springer, Heidelberg, 1999.

[9] I. Foster and N. T. Karonis, "A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems," SC01, IEEE, 2001.
