
A parallel application programming and processing environment proposal for grid computing

Augusto Mendes Gomes Júnior
Engineering and Technology School, Anhembi Morumbi University, São Paulo, Brazil
[email protected]

Liria Matsumoto Sato
Department of Computer and Digital Systems Engineering, University of Sao Paulo, São Paulo, Brazil
[email protected]

Francisco Isidro Massetto
Department of Computer Science, Federal University of ABC, Santo Andre, Brazil
[email protected]

Abstract: The execution of parallel applications using grid computing requires an environment that enables them to be executed, managed, scheduled and monitored. The execution environment must provide a processing model, consisting of programming and execution models, with the objective of appropriately exploiting grid computing characteristics. This paper proposes a parallel processing model based on shared variables for grid computing, consisting of an execution model that is appropriate for the grid and the CPAR parallel language programming model. The environment is designed to execute parallel applications in grid computing, where all the characteristics present in grid computing are transparent to users. The results show that this environment is an efficient solution for the execution of parallel applications.

Keywords: Distributed systems. Grid computing. High performance computing. Parallel programming language.

I. INTRODUCTION

The Parallel Programming paradigm [1] generally makes use of shared variables as an inter-process communication mechanism on the same machine. The use of shared variables makes the programming model more transparent and closer to the traditional, sequential programming model: the programmer does not need to write explicit communication and concurrency mechanisms.

In order to make the use of those shared variables feasible in a distributed environment, such as a cluster of computers, shared memory must be simulated through Distributed Shared Memory (DSM) techniques. There are several DSM systems with a programming language attached to them, such as TreadMarks [2] and Score Omni-OpenMP [3].

Computational Grids have a geographically distributed, hierarchical domain organization, in which each domain might contain mono- or multiprocessor nodes or even clusters of computers. However, in common cluster implementations the internal nodes have private IP addresses, which makes communication with the external environment more difficult.

In order to use a parallel programming language on a Computational Grid, it must be adapted to Grid Computing requirements. Moreover, middleware that enables execution, transparent communication, application scheduling and monitoring is also required.

Considering these issues, the main goal of this paper is to present a parallel processing model based on shared variables for Computational Grids. This model uses the CPAR [4] parallel programming model, extended to Grid Computing environments.

The main objective of this model is to achieve performance, while minimizing message exchange, by applying a scheduling mechanism that privileges nodes inside the same domain, thus reducing the distance between communicating nodes, and adequately using the hierarchical Grid structure.

The CPAR language was chosen because it provides multiple levels of parallelism, organized hierarchically, which fits the Grid architecture. Its use of shared variables makes communication among processes transparent, avoiding the need for users to write explicit message-exchange instructions.

The execution environment implemented in this paper uses the proposed processing model. This environment is part of CPAR-Grid, which consists of an execution environment for CPAR parallel applications and a compiler responsible for compiling the user application and generating the file to be submitted and executed on the processing nodes.

The next sections describe the execution environment for computational grids, as well as the programming model, followed by its implementation, testing, results and, finally, the conclusions.

II. RELATED WORK

Currently, implementation techniques for grid computing are the focus of several research projects [5]. However, the development of parallel applications for Grid Computing is a more difficult area of study, mainly because a specific infrastructure is required to support the execution of a parallel application in this environment.

Most available middleware for Grid Computing does not provide an abstraction at the programming-language level for parallel application development. Developers must consider several infrastructure details while writing code, such as resource distribution over the network, active resource discovery and explicit communication among processes. In this study we have explored features of the following programming environments for Grids: ABACUS [6], SAGA [7], SATIN [5] and MPICH-G2 [8].

The ABACUS environment is an Object-Oriented Java implementation which provides a service-based abstraction level. A service is similar to an object and can be shared with other objects, since it is mapped through a virtual address.

The SAGA environment is also an Object-Oriented implementation; however, it is implemented in C++ and includes bindings with other languages, such as Java, Perl, C and Python. In SAGA, shared information is available only through files, which can be duplicated by other processes and are handled by common operations (read and write). Methods, objects and variable sharing are not supported. SAGA also provides checkpoint operations and application recovery in a fault-tolerant Grid environment.

SATIN is also a Java Object-Oriented implementation. Shared objects are supported and their manipulation is based on method invocation.

Methods can share objects, and an internal mechanism ensures information consistency. SATIN is based on the divide-and-conquer method, with decentralized management. Additionally, it also offers fault-tolerance mechanisms.

MPICH-G2 is not a programming environment, but rather a programming interface for grid computing based on message passing. Application developers can use the widely adopted MPI interface for inter-process communication; however, one infrastructure requirement applies: all communicating nodes must have public IP addresses.

One difference among the aforementioned environments is with regards to the sharing mechanism. Except for MPICH-G2, each one has a different way of handling shared information. ABACUS and SATIN share services and objects, respectively. SAGA binds with other programming languages and shares files. The environment implemented in this paper uses shared variables for information exchange among processes.

Considering internal cluster nodes (with private IP addresses), only SATIN and our environment can execute parallel applications in Grid environments using internal cluster nodes, without requiring a public address. This feature is very important because it enables the use of the entire grid infrastructure, optimizing the hierarchical grid organization.

III. ENVIRONMENT

The CPAR [4] programming language was chosen for this project. Using CPAR, we developed an execution model which takes into account grid infrastructure features, making all grid infrastructure transparent to the user. Accordingly, users only need to focus on parallel application development. All distribution, communication, execution and application management are the responsibility of the environment.

The entire environment has its own processing model, consisting of a programming model and an execution model.

A. Programming Model

In order to create an execution environment which allows parallel applications to be executed in computational grids, the programming model must offer adequate processing resources for this purpose. The main goal is to use the available Grid resources efficiently, thus minimizing communication costs and the latency due to the physical distance between different sites.

The environment uses the CPAR programming model [4], which offers multiple parallelism levels in its building elements, along with a hierarchical processing structure.

CPAR makes use of shared variables for inter-process communication and it facilitates application development because communication and consistency issues are transparent to the programmer. When a process updates a variable, all processes that share this variable can access the updated value.

The building elements available in CPAR are as follows: macroblocks, parallel blocks, macrotasks and microtasks.

The main building elements for a parallel application in CPAR are macroblocks and macrotasks. A macroblock is a part of the code that is independent from the rest of the parallel application and can therefore be executed on a site different from the one that created it, mainly because communication between a macroblock and the rest of the application is rare. Accordingly, macroblocks allow processing to be distributed among different grid domains, thus minimizing communication between domains.

A macrotask may depend on the scope in which it was instantiated and must be executed geographically close to the node that initiated it, so the scheduler must choose nodes in the same domain to execute macrotasks. A macrotask usually shares variables with the scope that created it; allocating its nodes in the same domain therefore reduces communication latency during its execution.

Macrotasks

This building element generates a task to be executed in parallel on the selected nodes. A macrotask consists of n processes, where one is the master executor and the remaining processes are parallel executors. All processes are responsible for executing the parallel portion of the task; the sequential portion is executed by the master executor only. All processes wait at a barrier before the parallel execution of forall and parbegin instructions. When all processes reach the barrier, the execution starts. Another barrier, also managed by the master executor, finishes the execution.
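To make the barrier behavior concrete, the sketch below shows one master executor and three parallel executors synchronizing before and after the parallel portion. This is an illustrative C sketch using POSIX threads, not CPAR syntax; in the actual environment the executors are processes distributed over grid nodes.

/* Illustrative sketch of the macrotask barrier pattern described above,
 * using POSIX threads in a single process. */
#include <pthread.h>
#include <stdio.h>

#define NPROC 4   /* one master executor plus three parallel executors */

static pthread_barrier_t start_barrier, end_barrier;

static void *executor(void *arg)
{
    long rank = (long)arg;

    pthread_barrier_wait(&start_barrier);   /* everyone waits before the parallel part starts */
    printf("executor %ld: parallel portion\n", rank);
    pthread_barrier_wait(&end_barrier);     /* barrier that finishes the parallel execution */

    if (rank == 0)                          /* only the master executor runs the sequential part */
        printf("master executor: sequential portion\n");
    return NULL;
}

int main(void)
{
    pthread_t t[NPROC];

    pthread_barrier_init(&start_barrier, NULL, NPROC);
    pthread_barrier_init(&end_barrier, NULL, NPROC);
    for (long r = 0; r < NPROC; r++)
        pthread_create(&t[r], NULL, executor, (void *)r);
    for (int r = 0; r < NPROC; r++)
        pthread_join(t[r], NULL);
    pthread_barrier_destroy(&start_barrier);
    pthread_barrier_destroy(&end_barrier);
    return 0;
}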

Microtasks

A microtask is a building element inside the macrotask. This means that it is only launched inside a macrotask scope. A microtask might implement homogeneous parallelism (using forall instruction) or heterogeneous parallelism (using parbegin instruction).

In homogeneous parallelism, loop iterations are divided among the processes that execute the macrotask. The ideal situation is for all processes to be located in close physical proximity. For example, if a loop has 100 iterations and 4 processes execute this parallel loop, each one will execute 25 iterations.
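The iteration split can be sketched as follows; this is a minimal C illustration of block partitioning, and the function name and exact division policy are ours, not necessarily those used by CPAR.

/* Illustrative sketch: dividing the iterations of a parallel loop among the
 * processes that execute a macrotask. */
#include <stdio.h>

/* Compute the half-open range [start, end) of iterations for one process. */
static void my_iterations(int total, int nprocs, int rank, int *start, int *end)
{
    int base = total / nprocs;      /* iterations every process receives     */
    int rest = total % nprocs;      /* leftover iterations, spread one each  */
    *start = rank * base + (rank < rest ? rank : rest);
    *end   = *start + base + (rank < rest ? 1 : 0);
}

int main(void)
{
    /* The example above: 100 iterations over 4 processes, 25 each. */
    for (int rank = 0; rank < 4; rank++) {
        int s, e;
        my_iterations(100, 4, rank, &s, &e);
        printf("process %d executes iterations %d..%d\n", rank, s, e - 1);
    }
    return 0;
}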

In heterogeneous parallelism, each parallel section between parbegin-also, also-also or also-parend instructions is executed by a different process of the macrotask. If there are four processes, each one will execute a different snippet.

Parallel Blocks

The parallel block is composed of related parallel sections. A parallel block can be used in macroblocks, in the main function and inside another parallel block. All parallel sections (cobegin-also, also-also, also-coend) are executed by threads of the same process, and thus by the same node on the Grid.

These threads control the execution flow of the application, providing synchronization among processing elements when necessary. This type of solution centralizes control in one node, using shared memory. If the execution node is an SMP or multicore node, all threads are executed simultaneously; otherwise they are executed concurrently.

For each parallel block, there is an independent execution line and a barrier at the end for synchronization. When the parallel block finishes its execution, the waiting process can continue with sequential execution.

Macroblocks

Whenever they are created, macroblocks go to an execution queue and the scheduler defines which process will perform execution. The ideal situation is to choose an idle domain for this purpose.

Using macroblocks, the application is divided into snippets with little interdependency among them. Macrotasks created within macroblocks should preferentially be executed in the same domain as the macroblock that created them.

When a macroblock call occurs, the calling process continues executing normally and there is no synchronization between them. In this model, we implemented a Wait_Block(name) call, which synchronizes the calling process with the called macroblock: when the process executes Wait_Block, it waits until the macroblock finishes.
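The launch-and-wait pattern provided by Wait_Block can be pictured as below. This is an illustrative C sketch, not CPAR syntax: a thread stands in for the macroblock and the join stands in for the Wait_Block call.

/* Illustrative sketch of the Wait_Block pattern: the caller launches a
 * macroblock asynchronously, continues executing, and later waits for it. */
#include <pthread.h>
#include <stdio.h>

static void *macroblock_body(void *arg)
{
    (void)arg;
    printf("macroblock: independent snippet, possibly in another domain\n");
    return NULL;
}

int main(void)
{
    pthread_t block;

    pthread_create(&block, NULL, macroblock_body, NULL); /* macroblock call: no synchronization yet */
    printf("caller: continues executing normally\n");

    pthread_join(block, NULL);  /* Wait_Block(name): wait for the end of the macroblock */
    printf("caller: macroblock finished\n");
    return 0;
}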

Synchronization mechanisms

The synchronization mechanisms are used to guarantee that data shared by different processes is accessed exclusively: while a process is accessing the data, other processes cannot access it. The main CPAR synchronization resources are semaphores and events.

A semaphore is an abstract data type consisting of a counter and a queue that stores task descriptors. The semaphore guarantees exclusive access to a shared resource. Semaphores are used atomically through the lock and unlock operations. When the counter is at zero, all processes that attempt to acquire the semaphore are blocked until the resource is released and the counter incremented. A semaphore can be used in any part of the user program.
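A minimal sketch of such a semaphore is shown below, assuming a mutex and condition variable as the underlying blocking mechanism; the names are ours and the actual CPAR implementation may differ.

/* Illustrative sketch of the semaphore described above: a counter plus a
 * queue of blocked processes (here, threads waiting on a condition variable). */
#include <pthread.h>

typedef struct {
    int             counter;
    pthread_mutex_t mtx;
    pthread_cond_t  queue;      /* blocked processes wait here */
} cpar_sem;

void cpar_sem_init(cpar_sem *s, int value)
{
    s->counter = value;
    pthread_mutex_init(&s->mtx, NULL);
    pthread_cond_init(&s->queue, NULL);
}

void cpar_sem_lock(cpar_sem *s)            /* atomic "lock" operation */
{
    pthread_mutex_lock(&s->mtx);
    while (s->counter == 0)                /* resource busy: block until released */
        pthread_cond_wait(&s->queue, &s->mtx);
    s->counter--;
    pthread_mutex_unlock(&s->mtx);
}

void cpar_sem_unlock(cpar_sem *s)          /* atomic "unlock" operation */
{
    pthread_mutex_lock(&s->mtx);
    s->counter++;                          /* release the resource ...         */
    pthread_cond_signal(&s->queue);        /* ... and wake one blocked process */
    pthread_mutex_unlock(&s->mtx);
}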

An event is an abstract data type that signals to other processes the occurrence of a condition they were awaiting. An event can be used in any part of the user program.

Shared Variables

The CPAR language uses shared variables for inter-process communication. These variables store data used simultaneously by several processes. To avoid data inconsistency, CPAR implements a synchronization mechanism that prevents simultaneous write access.

All shared variables have a hierarchical scope. This means that CPAR allows a variable to be declared for use by the processes inside a macrotask, within a macroblock scope, or in the scope of the entire application.

Variable updates are propagated according to the scope. If a macroblock has a shared variable, all macrotasks created by it will share the same variable.

A global shared variable should be declared only as a last resort, because it causes the broadcast of updated values to all processes in the application.

The consistency model for shared-variable updating keeps a copy of each variable on every process. The consistency model used in this work is release consistency [2]. With this model, when a value changes, all processes within the execution scope, as well as those that have already used any variable of this scope, receive a message with the new value. However, to minimize messages while maintaining consistency, the messages carrying shared-variable updates are sent at the end of a critical section, controlled by synchronization mechanisms, or at the end of a microtask.

B. Execution model

To execute a parallel application over a computational Grid environment, a specific model is necessary to support the execution. Considering the grid infrastructure, we defined the following entities to manage application execution:

• Master coordinator: responsible for managing application execution;

• Executor: responsible for macrotask and macroblock execution;

• Local coordinator: responsible for managing local execution;

• Sender: controls the data flow from executor to coordinators;

• Scheduling mechanism: schedules all macrotasks and macroblocks;

• Execution queue: data structure where coordinators store macroblocks and macrotasks to be executed on each node;

• Shared variables manager: responsible for storing shared variables.

Figure 1 depicts the execution model adopted for a parallel application executed on a Grid infrastructure.

Figure 1. Execution model


In this model, one node is the master coordinator, whose main role is to control the execution of the main block. The remaining nodes are slave nodes, executing tasks and blocks. The execution queue is kept on the slave nodes, since they are responsible for execution.

The scheduler runs on the master node, which also keeps all updated information. To manage shared variables, every node has a specific entity running on it.

IV. IMPLEMENTATION

The execution environment has coordinator, sender, executor, queue, shared-variables table and scheduler entities. The coordinator and the sender are processes, and the executor consists of threads created by the coordinator. The number of executors is equal to the number of processing cores in each node. The other entities are data structures maintained by the coordinator. The following sections explain the implementation of the main features of the environment.

A. Buffer

Buffers are used in the environment to prevent a system overload caused by update messages. If a message were sent to the other nodes on every shared-variable update, a great deal of time would be spent updating the information.

With buffers, the updates are sent only when these values need to be provided to other nodes. Each buffer is a memory area shared between the executor and the sender of each node.

When an executor updates a shared variable of the system, the value is written to the copy of the variable stored in local memory, and the update is also recorded in the buffer. At an appropriate moment, the sender packages the buffer and sends it to all nodes that have already used this variable.

The coordinators receive the update message and update the corresponding memory location, calculated from the information contained in the message and in the shared-variables table.

The buffer structure strategy defined in G. S. Craveiro's thesis [9] is used in this paper. Buffers are vectors with four dimensions, detailed as follows:

• 1st Dimension: this dimension specifies the variable scope, which can be local to any macrotask / macroblock or global.

• 2nd Dimension: this dimension specifies whether the update should be central, only for the master coordinator or the macrotask / macroblock manager, or total, for all nodes that are within the variable scope. If it is total, the writing of a shared variable must be propagated to all nodes that have been assigned to execute the macrotask / macroblock. If it is central, only one node will receive the processing results. The objectives of this strategy are to increase performance and decrease message traffic.

• 3rd Dimension: this dimension specifies the variable type (integer, character, and double). Each data type has a specific buffer because of its distinctive size.

• 4th Dimension: this dimension identifies the buffer used to store the data. This dimension is necessary because there is a buffer rotation. When the sender is packaging and sending a buffer, the executor does not need to stop its execution and wait for the release of the buffer. At this moment, the executor uses the other buffer.

Considering the buffer dimensions, each update written to the buffer includes the following items: the variable handler (which identifies the variable in the system), scope, distribution type, data type, the information needed to calculate the variable address (displacement) and the current value of the variable.
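As an illustration, one buffered update entry could be represented as below; the field names and widths are ours, not the actual memory layout used by the environment.

/* Illustrative sketch of one buffered update entry, following the fields
 * listed above. */
#include <stddef.h>

enum var_scope { SCOPE_LOCAL, SCOPE_GLOBAL };            /* 1st dimension */
enum dist_type { DIST_CENTRAL, DIST_TOTAL };             /* 2nd dimension */
enum data_type { TYPE_INT, TYPE_CHAR, TYPE_DOUBLE };     /* 3rd dimension */

struct var_update {
    int            handler;        /* identifies the variable in the system        */
    enum var_scope scope;          /* local to a macrotask / macroblock, or global */
    enum dist_type distribution;   /* central (manager only) or total (all nodes)  */
    enum data_type type;           /* integer, character or double                 */
    size_t         displacement;   /* used to compute the variable address         */
    double         value;          /* current value (double shown for brevity)     */
};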

The buffer is accessed by both the executor and the sender of a node, so an access-control mechanism is necessary. The access-control strategy is similar to the producer-consumer problem, in which the executor acts as the producer and the sender as the consumer.
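A minimal sketch of this producer-consumer control, including the buffer rotation of the 4th dimension, is shown below. It reuses the var_update entry format sketched above and assumes a mutex and condition variable for mutual exclusion; in the real environment the buffers live in a memory area shared by the executor and sender processes.

/* Illustrative sketch of the executor/sender access control with buffer
 * rotation: the executor (producer) appends updates to the active buffer,
 * while the sender (consumer) swaps buffers and packages the full one, so
 * the executor never waits for a send to finish. */
#include <pthread.h>

#define BUF_CAP 256

struct buffer { struct var_update entry[BUF_CAP]; int count; };

static struct buffer   bufs[2];         /* 4th dimension: two rotating buffers      */
static int             active = 0;      /* buffer the executor is currently filling */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

/* Executor side: record one shared-variable update (producer). */
void buffer_put(const struct var_update *u)
{
    pthread_mutex_lock(&mtx);
    if (bufs[active].count < BUF_CAP)   /* overflow handling (forcing an early send) omitted */
        bufs[active].entry[bufs[active].count++] = *u;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&mtx);
}

/* Sender side: take the full buffer and hand the other one to the executor
 * (consumer). The returned buffer is packaged and sent while the executor
 * keeps writing into the newly activated buffer. */
struct buffer *buffer_rotate(void)
{
    pthread_mutex_lock(&mtx);
    while (bufs[active].count == 0)     /* nothing to send yet */
        pthread_cond_wait(&nonempty, &mtx);
    struct buffer *full = &bufs[active];
    active = 1 - active;                /* rotate buffers                          */
    bufs[active].count = 0;             /* this buffer was already sent previously */
    pthread_mutex_unlock(&mtx);
    return full;
}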

B. Shared variable update

Shared variables can be used in the main function of the program and in the scope of macrotasks, microtasks, parallel blocks and macroblocks. In order to reduce the number of messages exchanged to update shared variables while maintaining consistency, messages are sent only in the following situations:

• At the end of a critical section defined by the semaphore lock and unlock commands.

• At the end of a microtask (forall or parbegin).

The strategy for updating shared variables used in the environment updates a variable only in the processes that have already used it. Processes that have never used the variable do not receive updates for it. However, to maintain environment consistency, whenever a process accesses a shared variable for the first time, it must communicate with the macrotask / macroblock manager or with the master coordinator in order to receive the current value of the variable.

This strategy reduces message exchange because only the nodes that have used the variable receive the update. The notification of the processes that have already used a shared variable is carried out by the macrotask / macroblock manager if the variable is local, or by the master coordinator if the variable is global.

On the first access to a shared variable, the executor informs the sender (step 1 in Figure 2) that this is the first access and asks it to contact the macrotask / macroblock manager or the master coordinator. The sender sends a message requesting the current value of the variable (step 2 in Figure 2). The master coordinator or the macrotask / macroblock manager sends the current value, along with the list of all nodes that have used this variable, to the coordinator of the requesting node (step 3 in Figure 2). After that, it also sends a message to the other coordinators that use this variable, indicating the identifier of the node that has just started using it (step 4 in Figure 2).

In Figure 2, node 1 is the task manager and the task is executed by nodes 1, 2, 3 and 4. The shared variable is local to this task, and nodes 1, 2 and 4 have already used it. Node 3 is accessing the variable for the first time.
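The message types involved in this first-access protocol can be summarized as follows; the names are ours and the actual wire format is not described in the paper.

/* Illustrative sketch: message types exchanged in the first-access protocol
 * of Figure 2. */
enum first_access_msg {
    MSG_FIRST_ACCESS_REQUEST,  /* steps 1-2: the executor asks its sender, which asks the
                                  macrotask / macroblock manager or the master coordinator */
    MSG_CURRENT_VALUE_REPLY,   /* step 3: the manager returns the current value plus the
                                  list of nodes that already use the variable */
    MSG_NEW_READER_NOTIFY      /* step 4: the manager tells the other coordinators about
                                  the node that has just started using the variable */
};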

Figure 3 shows the update operation for shared variables. The executor notifies the sender of its node (step 1 in Figure 3) that the shared area is full or that a synchronization point has been reached. The sender packages and sends the message to the corresponding coordinator node(s) (step 2 in Figure 3). The coordinator of the receiving node receives the message and updates the data in its table of shared variables (step 3 in Figure 3).

Figure 2. First access to a shared variable


Figure 3. Shared variables updating

There is a table that indicates all the nodes that have already used a shared variable and its scope. The sender uses this table to find out the nodes that require the updated message.

C. Scheduling strategy

The environment scheduler is called whenever a macroblock or a macrotask is to be executed. It is responsible for choosing the macroblock or macrotask execution nodes. The correct scheduling of macrotasks and macroblocks is essential for obtaining good performance. The scheduler has one algorithm for choosing macroblock execution nodes and another for macrotasks.

Macroblock scheduling

The data dependence between a macroblock and the rest of the application is minor.

When a macroblock execution call occurs, the process that makes the call becomes responsible for its submission and management. This is because a macroblock can be called from the main function or from another macroblock's scope. If it is called by the main function, the master coordinator manages its execution; otherwise, the slave coordinator of the calling node manages it.

The strategy used to schedule macroblocks is to search for idle nodes, or nodes with low processing loads, in domains different from that of the calling node. This is because a macroblock is an independent piece of code: if it executes in an idle domain, it can use the nodes of that domain to execute its code, which reduces the number of messages exchanged between domains. Figure 4 shows the flowchart of the macroblock scheduling strategy.

This flowchart shows that, if there are idle domains, the scheduler chooses the idle domain with the greatest number of free processors. If there are no idle domains, the scheduler checks all domains to find the one with the greatest number of free processors and chooses it. Within the chosen domain, the node chosen to process the macroblock is the one with the greatest number of free processors.

Figure 4. Flowchart of the macroblocks scheduling strategy
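The selection rule of Figure 4 can be sketched as follows; the data structures and names are ours, and bookkeeping such as load measurement is omitted.

/* Illustrative sketch of the macroblock scheduling rule in Figure 4:
 * prefer an idle domain with the most free processors; otherwise the domain
 * with the most free processors; inside the chosen domain, pick the node
 * with the most free processors. */
struct grid_node   { int id; int free_cores; };
struct grid_domain { int id; int idle; int free_procs; struct grid_node *nodes; int n_nodes; };

static int pick_macroblock_domain(const struct grid_domain *d, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)                     /* first pass: idle domains only */
        if (d[i].idle && (best < 0 || d[i].free_procs > d[best].free_procs))
            best = i;
    if (best >= 0)
        return best;
    for (int i = 0; i < n; i++)                     /* no idle domain: most free processors */
        if (best < 0 || d[i].free_procs > d[best].free_procs)
            best = i;
    return best;
}

static int pick_macroblock_node(const struct grid_domain *dom)
{
    int best = 0;
    for (int i = 1; i < dom->n_nodes; i++)          /* node with the most free processors */
        if (dom->nodes[i].free_cores > dom->nodes[best].free_cores)
            best = i;
    return best;
}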

Macrotask scheduling

A macrotask represents a sequential and/or parallel series of instructions that usually has some data dependency on the scope that created it, which can be the main block or a macroblock scope. Therefore, the macrotask scheduling strategy must differ from the one used for macroblock scheduling.

In the adopted strategy, the node that made the call is responsible for its submission and management. If the macrotask is called in the main function, the master node is responsible; otherwise, the responsibility lies with the slave node that executes the macroblock that called the macrotask.

The strategy adopted to schedule macrotasks, as shown in the flowchart in Figure 5, is to search for idle nodes or nodes with low processing loads located in the same domain as the node that made the macrotask call. Within this constraint, priority is given to multiprocessor nodes. This strategy minimizes message exchange, because within a node the memory is shared.

This flowchart shows that the first allocation choice is the nodes in the same domain as the node that made the macrotask call. If the allocation cannot be completed with these nodes, the scheduler allocates all possible instances in this domain. After that, the scheduler checks whether there is any domain that can allocate the remaining instances. If there is, it checks for the domain that has the fewest free processors and makes the allocation in this domain. This strategy is executed in order to maintain the availability of the domains with the most free processors. These are good candidates to execute the macroblocks that are instantiated during the execution of the application.

If there is no domain that can allocate all remaining macrotask instances, the scheduler checks the domain that can allocate the most instances and executes the allocation. The scheduler repeats this process until all instances have been allocated, checking for domains where it can allocate the remaining instances.
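The rule of Figure 5 can be sketched as follows; the domain bookkeeping and names are ours. In this sketch, alloc[i] receives how many of the requested instances are placed in domain i.

/* Illustrative sketch of the macrotask scheduling rule in Figure 5. */
static void schedule_macrotask(int *free_procs, int n_domains,
                               int caller_domain, int instances, int *alloc)
{
    for (int i = 0; i < n_domains; i++)
        alloc[i] = 0;

    /* 1. Allocate as many instances as possible in the caller's domain. */
    int take = instances < free_procs[caller_domain] ? instances
                                                     : free_procs[caller_domain];
    alloc[caller_domain] += take;
    free_procs[caller_domain] -= take;
    instances -= take;

    while (instances > 0) {
        /* 2. Prefer the domain with the FEWEST free processors that can still
         *    hold all remaining instances, keeping large domains available for
         *    macroblocks instantiated later. */
        int best = -1;
        for (int i = 0; i < n_domains; i++)
            if (free_procs[i] >= instances &&
                (best < 0 || free_procs[i] < free_procs[best]))
                best = i;

        /* 3. Otherwise take the domain that can allocate the most instances
         *    and repeat until everything is placed. */
        if (best < 0)
            for (int i = 0; i < n_domains; i++)
                if (best < 0 || free_procs[i] > free_procs[best])
                    best = i;

        if (best < 0 || free_procs[best] == 0)      /* no capacity left anywhere */
            break;

        take = instances < free_procs[best] ? instances : free_procs[best];
        alloc[best] += take;
        free_procs[best] -= take;
        instances -= take;
    }
}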

V. TESTS AND RESULTS

Tests were carried out to analyze the execution performance of CPAR parallel applications in the environment. Performance is related to the following features of the environment: the shared-variable update strategy, the scheduling strategy, the reduction in the number of messages exchanged and the distance between communicating nodes. The next sections detail the environment infrastructure, the tests carried out and their results.

Figure 5. Flowchart of the macrotask scheduling strategy

A. Environment infrastructure

In order to enable integration, communication, execution control and the management of the various environment domains, an infrastructure that makes everything transparent to the user is necessary. The infrastructure architecture used in this work is presented in Figure 6.

Figure 6. Environment infrastructure.

The infrastructure is structured over several levels, with specific features:

• Jobs: parallel applications to be submitted to execution in the environment.

• MPI Implementation: MPI implementation used for inter-process communication. The processes can be allocated to processors on nodes with public or private addresses [10].

• Message Service: responsible for sending messages between processes in different domains [11].

• GCSE and LIMA: These tools were developed in N.C. Paula’s PhD thesis [12]. They submit and control the execution of MPI applications in the grid computing environment. They also monitor all system nodes (availability, occupation and characteristics of the nodes).

• Globus: Globus is an integration infrastructure for grid computing that offers, among other functionalities, resources for node allocation and authentication between the various system components. All system nodes with public addresses must have Globus installed [13].

• Condor: responsible for control of scheduling and execution in each local environment [14].

This infrastructure is important to initialize the execution. Once the execution starts, the infrastructure is used only for communication between the processing nodes. For this communication, we use MPI.

B. Tests carried out and analysis of the results

The algorithm that calculates the average of the sum of multiplied matrices was used to validate the following: the update mechanism of shared variables according to their scope (global, macroblock, macrotask); the impact of communication between processing nodes; and the scheduling strategy for macrotasks and macroblocks. This algorithm was implemented in three different versions to show the impact of the strategies implemented in the environment.

Equation (1) shows the formula used for calculation. All matrices used are square and of the double type.

average = ( (AxB + CxD) + (ExF + GxH) ) / 2 (1)

The sequential version executes all processing sequentially, using only one processing core.
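For reference, a minimal sequential C sketch of Equation (1) is shown below; the matrix size N and the matrix contents are placeholders.

/* Illustrative sequential sketch of Equation (1):
 * average = ((AxB + CxD) + (ExF + GxH)) / 2, for N x N matrices of doubles. */
#define N 4   /* placeholder size; the tests used 1000 to 2000 */

static void matmul(const double *x, const double *y, double *r)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += x[i * N + k] * y[k * N + j];
            r[i * N + j] = s;
        }
}

static void matadd(const double *x, const double *y, double *r)
{
    for (int i = 0; i < N * N; i++)
        r[i] = x[i] + y[i];
}

/* m[0..7] point to matrices A..H; avg receives the result. */
static void average_of_sums(const double *m[8], double *avg)
{
    double ab[N * N], cd[N * N], ef[N * N], gh[N * N], sum1[N * N], sum2[N * N];

    matmul(m[0], m[1], ab);  matmul(m[2], m[3], cd);  matadd(ab, cd, sum1); /* first macroblock  */
    matmul(m[4], m[5], ef);  matmul(m[6], m[7], gh);  matadd(ef, gh, sum2); /* second macroblock */

    for (int i = 0; i < N * N; i++)
        avg[i] = (sum1[i] + sum2[i]) / 2.0;                                 /* main block        */
}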

The complete version creates two macroblocks: the first macroblock multiplies matrices AxB and CxD and then sums the resulting matrices; the second macroblock does the same, but using matrices E, F, G and H. When the two macroblocks finish, the main block calculates the average of the matrices that hold the sums computed in the macroblocks. The multiplications and sums of the matrices are carried out in macrotasks. Matrices A, B, C, D, E, F, G and H are shared variables in the scope of the macrotasks that use them. The matrices that store the products AxB, CxD, ExF and GxH are shared variables in the scope of the macroblocks that use them.

The macrotask-only version does not use macroblocks; all macrotasks are created in the scope of the main block. First, four macrotasks are created, each one multiplying two matrices. After the four macrotasks finish, the main block creates two macrotasks that sum the multiplied matrices. Finally, the main block calculates the average of the matrices that hold the sums. In this version, matrices A, B, C, D, E, F, G and H are shared variables in the scope of the macrotasks that use them. The other shared variables, necessary for the multiplications, additions and average calculation, are global.

A grid computing environment was set up to carry out the tests. The environment consists of three domains. Each domain has two processing nodes, and each node has two quad-core 2.0 GHz processors, 8 GB of memory and 210 GB of disk. The operating system is SUSE 9 (64-bit) and the environment uses the grid computing infrastructure described in this paper. Figure 7 shows a graph with the average time, in milliseconds, for communication between the domains.

Figure 7. Communication time between domains

To analyze the scheduling strategy for macrotasks and macroblocks and the updating of shared variables in the environment, we used the complete version and the macrotask-only version. Table 1 shows the number of internal (within the same domain) and external (between different domains) messages sent by both versions of the algorithm. All macrotasks execute with six instances.

TABLE I. NUMBER OF SENT MESSAGES

Matrix size | Complete version     | Macrotask-only version
            | Internal | External  | Internal | External
1000        | 464      | 303       | 288      | 1062
1200        | 578      | 419       | 301      | 1504
1500        | 794      | 635       | 340      | 2332
2000        | 1262     | 1103      | 420      | 4126

Table 1 shows that the total number of messages sent in the macrotask-only version is greater than in the complete version. Moreover, in the macrotask-only version the overwhelming majority of the messages are external, while the complete version sends more internal messages than external ones. This occurs because the complete version keeps the shared variables that store the multiplied matrices in the scope of the macroblock. This results in fewer external messages being sent, because the macrotasks that perform the multiplications are in the same domain as the macroblocks that called them.

In the macrotask-only version, the shared variables that store the multiplied matrices are global. The macrotasks executing in domains other than that of the main block must send external messages to update these variables.

Since sending an external message takes longer than sending an internal one, the processing time of the macrotask-only version is greater than that of the complete version. Figure 8 shows a graph with the execution time, in seconds, of the two versions.

Figure 8. Execution time of the macrotask-only version and complete version

The execution times in Figure 8 show that the complete version is faster than the other version. This shows that proper use of the building elements of the CPAR language leads to better use of the hierarchical structure of grid computing by applications. The use of macroblocks distributes the application among domains and reduces communication between them.

The use of macroblocks also improves the scheduler's decisions, allowing it to choose the best nodes to execute macrotasks and macroblocks, and showing that the scheduling strategy used can exploit the hierarchical structure of computational grids. Accordingly, it is necessary to use the construction elements of the CPAR language correctly.

In order to analyze the performance of the environment, we compared the execution times of the sequential version and the complete version. Figure 9 shows the speedup of the complete version relative to the sequential version (sequential execution time divided by parallel execution time). In the complete version, each macrotask was created using two, four and six processing cores for its execution. The number of processing cores of a macrotask indicates how many cores execute it in parallel; for example, if a macrotask is executed with six processing cores, the microtasks in the macrotask scope are executed in parallel by six processing cores.

In the tests, increasing the number of processing cores that execute a macrotask produced better results, and the speedup also grows as the size of the matrices increases. This occurs because the amount of data to be processed is larger, so the processing nodes spend more time computing and less time sending messages. The test with six processing cores yielded the best results, as expected, since it uses the greatest number of cores.

Figure 9. Speedup of the complete version

VI. CONCLUSION

The processing model proposed in this paper uses the construction elements of the CPAR parallel language to express the pieces of code that are executed in parallel. These elements can be executed hierarchically.

The CPAR language uses shared variables for communication between processes on the same node. The execution model uses distributed shared memory to store shared variables in grid computing. In order to maintain the consistency of these variables, the execution model updates them by means of a release consistency mechanism.

In order to decrease the number of messages sent, when a shared variable is updated in a node, the update strategy makes this node send update messages only to the nodes that have already used the variable.

The environment implements the processing model for grid computing proposed in this paper. The results showed that the environment can properly use the hierarchical structure of grid computing and allocate the components of the CPAR language according to their characteristics.

REFERENCES

[1] MATTSON, T. G., SANDERS, B. A., MASSINGILL, B. L., A Pattern Language for Parallel Programming, Addison Wesley Software Patterns Series. 2004.

[2] KELEHER, P., Lazy Release Consistency for Distributed Shared Memory, Ph.D. Thesis, Department of Computer Science, Rice University, December 1994.

[3] CONSORTIUM, P. C., PC cluster consortium, 2010, http://www.pccluster.org

[4] SATO, L. M., Ambientes de programação para sistemas paralelos e distribuídos, Livre Docência Thesis, Polytechnical School, University of Sao Paulo, 1995.

[5] NIEUWPOORT, R. V. van, et al., Satin: a High-Level and Efficient Grid Programming Model, ACM Transactions on Programming Languages and Systems, Vol. 10, 2010, pp. 1-40.

[6] WANG, X. et al., Abacus: A Service-Oriented Programming Language for Grid Applications, Proceedings of the 2005 IEEE International Conference on Services Computing, Vol. 1, 2005, p. 225-232.

[7] GOODALE, T. et al., SAGA: A Simple API for Grid Applications, High-level application programming on the Grid, Computational Methods in Science and Technology, 2006, pp. 7-20.

[8] KARONIS, N., TOONEN, B., FOSTER, I., MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface, Journal of Parallel and Distributed Computing, 2003.

[9] CRAVEIRO, G. S., Um ambiente de execução para suporte à programação paralela com variáveis compartilhadas em sistemas distribuídos heterogêneos, Ph.D. Thesis, Polytechnical School, University of Sao Paulo, 2003.

[10] GOMES JR, A. M. et al., Implementação da Interface MPI e de sua Infraestrutura para Grades Computacionais., 1a Escola Regional de Alto Desempenho de São Paulo (ERAD-SP), Sao Paulo, 2010.

[11] MASSETTO, F. I. et al., A Message Forward Tool for integration of Clusters of Clusters based on MPI Architecture, The 2nd Russia-Taiwan Symposium on Methods and Tools of Parallel Programming Multicomputers, Russia, May 2010.

[12] PAULA, N. C., Um ambiente de monitoramento de recursos e escalonamento cooperativo de aplicações paralelas em grades computacionais, Ph.D. Thesis, Polytechnical School, University of Sao Paulo, 2009.

[13] FOSTER, I., KESSELMAN, C., Globus: A Meta Computing Infrastructure Toolkit, International Journal of Supercomputer Applications, 11(2), 1997, p. 115-128.

[14] CONDOR, The Condor Project, http://www.cs.wisc.edu/condor/
