Analytical Modelling in Parallel and Distributed Computing

Peter Hanuliak and Michal Hanuliak


  • Peter Hanuliak and Michal Hanuliak

    Analytical Modelling in Parallel and Distributed Computing

    For a full listing of Chartridge Books Oxford's titles, please contact us: Chartridge Books Oxford, 5 & 6 Steadys Lane, Stanton Harcourt, Witney, Oxford, OX29 5RL, United Kingdom. Tel: +44 (0) 1865 882191. Email: [email protected] Website: www.chartridgebooksoxford.com



    ISBN 978-1-909287-90-7


  • Analytical Modelling in Parallel and Distributed Computing

  • Analytical Modelling in Parallel and Distributed Computing

    Peter Hanuliak and Michal Hanuliak

  • Chartridge Books Oxford, 5 & 6 Steadys Lane, Stanton Harcourt, Witney, Oxford OX29 5RL, UK. Tel: +44 (0) 1865 882191. Email: [email protected] Website: www.chartridgebooksoxford.com

    First published in 2014 by Chartridge Books Oxford

    ISBN print: 978-1-909287-90-7. ISBN ebook: 978-1-909287-91-4.

    © Peter Hanuliak and Michal Hanuliak 2014.

    The right of Peter Hanuliak and Michal Hanuliak to be identified as the authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

    British Library Cataloguing-in-Publication Data: a catalogue record for this book is available from the British Library.

    All rights reserved. No part of this publication may be reproduced, stored in or introduced into a retrieval system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording or otherwise) without the prior written permission of the publishers. This publication may not be lent, resold, hired out or otherwise disposed of by way of trade in any form of binding or cover other than that in which it is published without the prior consent of the publishers. Any person who does any unauthorised act in relation to this publication may be liable to criminal prosecution and civil claims for damages. Permissions may be sought directly from the publishers, at the above address.

    Chartridge Books Oxford is an imprint of Biohealthcare Publishing (Oxford) Ltd.

    The use in this publication of trade names, trademarks service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. The publishers are not associated with any product or vendor mentioned in this publication. The authors, editors, contributors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologise to any copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint. Any screenshots in this publication are the copyright of the website owner(s), unless indicated otherwise.

    Limit of Liability/Disclaimer of Warranty

    The publishers, author(s), editor(s) and contributor(s) make no representations or warranties with respect to the accuracy or completeness of the contents of this publication and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This publication is sold with the understanding that the publishers are not rendering legal, accounting or other professional services. If professional assistance is required, the services of a competent professional person should be sought. No responsibility is assumed by the publishers, author(s), editor(s) or contributor(s) for any loss of profit or any other commercial damages, injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. The fact that an organisation or website is referred to in this publication as a citation and/or potential source of further information does not mean that the publishers nor the author(s), editor(s) and contributor(s) endorses the information the organisation or website may provide or recommendations it may make. Further, readers should be aware that internet websites listed in this work may have changed or disappeared between when this publication was written and when it is read.

    Typeset by Domex, India. Printed in the UK and USA.

  • Contents

    Preview xiii
    Acknowledgements xv

    Part I. Parallel Computing 1

    Introduction 3
    Developing periods in parallel computing 3

    1 Modelling of Parallel Computers and Algorithms 7
    Model construction 7

    2 Parallel Computers 11
    Classification 12
    Architectures of parallel computers 14
    Symmetrical multiprocessor system 14
    Network of workstations 16
    Grid systems 19
    Conventional HPC environment versus Grid environments 21
    Integration of parallel computers 22
    Metacomputing 22
    Modelling of parallel computers including communication networks 23

    3 Parallel Algorithms 29
    Introduction 29
    Parallel processes 31
    Classification of PAs 33
    Parallel algorithms with shared memory 34
    Parallel algorithms with distributed memory 35
    Developing parallel algorithms 35
    Decomposition strategies 37
    Natural parallel decomposition 38
    Domain decomposition 39
    Functional decomposition 39
    Mapping 44
    Inter process communication 45
    Inter process communication in shared memory 45
    Inter process communication in distributed memory 46
    Performance tuning 46

    4 Parallel Program Developing Standards 49
    Parallel programming languages 49
    Open MP standard 50
    Open MP threads 53
    Problem decomposition 54
    MPI API standard 55
    MPI parallel algorithms 57
    Task Groups 58
    Communicators 59
    The order of tasks 59
    Collective MPI commands 60
    Synchronisation mechanisms 61
    Conditional synchronisation 61
    Rendezvous 61
    Synchronisation command barriers 61
    MPI collective communication mechanisms 62
    Data scattering collective communication commands 63
    Java 70

    5 Parallel Computing Models 73
    SPMD model of parallel computation 74
    Fixed (atomic) network 74
    PRAM model 75
    Fixed communication model GRAM 76
    Flexible models 76
    Flexible GRAM model 77
    The BSP model 77
    Computational model MPMD functionality of all system resources 80
    Load of communication network 81

    6 The Role of Performance 83
    Performance evaluation methods 85
    Analytic techniques 85
    Asymptotic (order) analysis 86
    Application of queuing theory systems 90
    Kendall classification 91
    The simulation method 94
    Experimental measurement 95

    Part II. Theoretical aspects of PA 97

    7 Performance Modelling of Parallel Algorithms 99
    Speed up 99
    Efficiency 100
    Isoefficiency 100
    Complex performance evaluation 102
    Conclusion and perspectives 103

    8 Modelling in Parallel Algorithms 105
    Latencies of PA 105

    Part III. Applied Parallel Algorithms 113

    9 Numerical Integration 115
    Decomposition model 116
    Mapping of parallel processes 117
    Performance optimisation 120
    Chosen illustration results 123

    10 Synchronous Matrix Multiplication 127
    The systolic matrix multiplier 127
    Instruction systolic array matrix multiplier 129
    ISA matrix multiplier 131
    Dataflow matrix multiplication 132
    Wave front matrix multiplier 133
    The asynchronous matrix multiplication 134
    Decomposition strategies 134
    Domain decomposition methods for matrix multiplication 134
    Comparison of used decomposition models 137

    11 Discrete Fourier Transform 139
    The Fourier series 139
    The discrete Fourier transform 140
    The discrete fast Fourier transform 141
    Two-dimensional DFFTs 145
    Analysed examples 147
    One element per processor 147
    Multiple elements per processor 147
    Multiple elements per processor with routing 149
    Multiple elements per processor in computer networks 150
    Chosen illustration results 150

    12 The Triangle Problem 157
    The manager/worker strategy 157
    Combinatorial problems 157
    Sequential algorithms 159
    Parallel algorithms 160
    Performance optimisation 161

    13 The System of Linear Equations 165
    Methods of solving SLR 166
    Cramer's rule 166
    Gaussian elimination method 167
    Sequential algorithm GEM 167
    Decomposition matrix strategies 168
    Evaluation of GEM 168
    The evaluation of matrix parallel algorithms 170
    Common features of MPA 171
    Decomposition models MPA 172
    Domain decomposition 172
    Iteration methods 173
    Parallel iterative algorithms SLR 173
    Convergence of iterative methods 176

    14 Partial Differential Equations 181
    The Laplace differential equation 183
    Local communication 185
    Algorithms of the Jacobi iterative method 187
    Optimisation of parallel Jacobi algorithms 189
    Complexity of the sequential algorithm 193
    Gauss-Seidel iterative sequential algorithms 194
    Matrix decomposition models 195
    Jacobi iterative parallel algorithms 197
    Parallel algorithms with a shared memory 197
    Parallel iterative algorithms with a distributed memory 198
    The red-black successive over-relaxation method 203
    Complex analytical performance modelling of IPA 207
    Basic matrix decomposition models 207
    Matrix decomposition into strips 209
    Matrix decomposition into blocks 210
    Optimisation of the decomposition method selection 211
    Parallel computational complexity 213
    Complex analytical performance modelling 214
    Isoefficiency functions 216
    Canonical matrix decomposition models 217
    Optimisation of isoefficiency functions 219
    Conclusions of isoefficiency functions 221
    Chosen results 221

    Part IV. The Experimental Part 229

    15 Performance Measurement of PAs 231
    Direct performance measurement methodology for MPAs 231
    Performance measurement of PAs 232
    Performance measurement of PAs with a shared memory 232
    Performance measurement of MPAs with a distributed memory 232
    Performance measurement of PAs in NOW and Grid 233
    Measurement delays 235
    Measurements on parallel computers 237
    Measurement of SMPs 237
    Measurements on parallel computers in the world 237
    Measurement of NOW 239
    Measurement of the performance verification criteria of PAs 240
    Isoefficiency functions of PAs 242

    16 Measuring Technical Parameters 245
    Specifications of measurements 245
    Technical parameters for the average time of computer operations 245
    Computing complexity of GEM 246
    Process for deriving technical parameter tc 250
    Applied uses of technical parameter tc 254
    Verification of the accuracy for approximation relations 254
    Simulated performance comparisons of parallel computers 256
    Communication complexity and communication technical parameters 258
    Classic parallel computers 258
    The NOW parallel computer 261
    Evaluation of collective communication mechanisms 262
    The Broadcast collective communication mechanism type 263

    17 Conclusions 265

    Appendix 1 267
    Basic PVM routines 267
    Preliminaries 267
    Point-to-point message passing 268
    Group routines 270

    Appendix 2 273
    Basic MPI routines 273
    Preliminaries 273
    Point-to-point message passing 274
    Group routines 276

    Appendix 3 279
    Basic Pthread routines 279
    Thread management 279
    Thread synchronisation 281
    Condition variables 282

    References 285

  • Preview

    The current trends in High Performance Computing (HPC) are to use networks of workstations (NOW, SMP) or a network of NOW networks (Grid) as a cheaper alternative to the traditionally-used, massive parallel multiprocessors or supercomputers. Individual workstations could be single PCs (personal computers) used as parallel computers based on modern symmetric multicore or multiprocessor systems (SMPs) implemented inside the workstation.

    With the availability of powerful personal computers, workstations and networking devices, the latest trend in parallel computing is to connect a number of individual workstations (PCs, PC SMPs) to solve computation-intensive tasks in parallel, using typical clusters such as NOW, SMP and Grid. In this sense it is no longer accurate to consider traditionally evolved parallel computing and distributed computing as two separate research disciplines.

    To exploit the parallel processing capability of this kind of cluster, the application program must be made parallel. Choosing an effective way of doing this (the parallelisation strategy) is one of the most important steps in developing an effective parallel algorithm (optimisation). For behaviour analysis we have to take into account all the overheads that have an influence on the performance of parallel algorithms (architecture, computation, communication etc.). In this book we discuss this kind of complex performance evaluation of various typical parallel algorithms (shared memory, distributed memory) and their practical implementations. As real application examples we demonstrate the various influences during the process of modelling and performance evaluation and the consequences of their distributed parallel implementations.


    Keywords: Parallel computers, parallel algorithm, performance modelling, NOW, analytical model, decomposition, inter-process communication (IPC), OpenMP, MPI, complex performance modelling, effectiveness, speed up, isoefficiency, Grid.

  • Acknowledgements

    This work was carried out within the project named Modelling, optimisation and prediction of parallel computers and parallel algorithms, at the University of Zilina, the Slovak Republic. The authors gratefully acknowledge the universal help of project supervisor Prof. Ing. Ivan Hanuliak, PhD.

  • Part 1: Parallel Computing

  • Introduction

    The performance of today's computers (sequential and parallel) depends to a large degree on parallel principles embedded at various levels of the hardware and the supporting software. At the level of the internal architecture of the CPU (Central Processor Unit) of a PC, these are implementations of scalar pipelined execution or multiple-pipeline (superscalar, superpipelined) execution, together with larger caches used redundantly at various levels in the form of shared and local caches (L1, L2, L3). At the level of the motherboard, multiple cores and processors are used to build multicore or multiprocessor systems such as the SMP (Symmetrical Multiprocessor System) as a powerful computation node, which is itself an SMP parallel computer.

    Developing periods in parallel computing

    During the first period of parallel computing, between 1975 and 1995, scientific supercomputers dominated; these were specially designed for High Performance Computing (HPC) and mostly used a computing model based on data parallelism. Those systems were far ahead of standard computers in terms of both performance and price. General purpose processors on a single chip, which had been invented in the early 1970s, were only mature enough to hit the HPC market by the end of the 1980s, and it was not until the end of the 1990s that connected standard workstations or even personal computers (PCs) had become competitive, at least in terms of theoretical peak performance.

    The increase in processor performance was achieved through the massive use of various parallel principles in all kinds of produced processors. Parallel principles were applied both within single PCs and workstations (scalar and superscalar pipelines, Symmetrical Multiprocessor Systems SMPs, the POWER PC) and in connected Networks of Workstations (NOW). The experience gained with the implementation of parallel principles and the extension of computer networks has led to the use of connected computers for parallel solutions. This trend can be characterised as the downsizing of supercomputers such as the Cray/SGI T3E and other massive parallel systems (with more than 100 processors) to cheaper and more universal parallel systems in the form of a Network of Workstations (NOW). We can refer to this as the second period. Its strong growth since 1980 has been driven by the simultaneous influence of three basic factors [22, 41]:

    High performance processors (Pentium and higher, Power PC, RISC etc.).

    High speed interconnecting networks (100M and Gigabit Ethernet, Myrinet, Infiniband).

    Standard tools for the development of parallel algorithms (OpenMP, Java, PVM, MPI).

    This gradual change from specialised supercomputers (Cray/SGI T3E etc.) and other massive parallel computers (with more than 100 processors) [66] to more affordable yet powerful PCs based on multiprocessors or multicores, connected in NOW networks, has been characterised as downsizing. This trend began in 1980 (with the introduction of personal computers) and was driven by the simultaneous influence of the three basic factors mentioned above. Development is currently moving towards the building of widespread, connected NOW networks with high computation and memory capacities (Grid). Conceptually, Grid belongs to the class of metacomputers.

    A metacomputer can be understood as a massive computer network of computing nodes built on the principle of the common use of existing processors, memories and other resources, with the objective of creating the illusion of one huge, powerful supercomputer. Such higher integrated forms of NOW (Grid modules), named Grid systems or metacomputers, can be defined as the third period in the development of parallel computers.

    The development trends have thus been moving toward building widespread connected networks with high computation and memory capacities (Grid), whose components may also include supercomputers and their innovative successors. Conceptually, Grid belongs to the class of metacomputers, where a metacomputer can be understood as a massive computer network (with high-speed data transmission) of computing nodes built on the principle of the common use of existing processors, memories and other resources, with the objective of creating the illusion of one huge, powerful supercomputer [96].

    There has been an increasing interest in the use of networks of distributed workstations (clusters) connected by high-speed networks for solving large computation-intensive problems. This trend is mainly driven by the cost effectiveness of such systems, as compared to parallel computers with their massive numbers of tightly coupled processors and memories. Parallel computing on a cluster of powerful workstations (NOW, SMP, Grid), connected by high-speed networks has given rise to a range of hardware and network-related issues on any given platform.

    The Network of workstations (NOW) has become a widely-accepted form of high-performance parallel computing. As in conventional multiprocessors, parallel programs running on this kind of platform are often written in an SPMD form (Single program Multiple data) to exploit data parallelism, or in an improved SPMD form to also take into account the potential of the functional parallelism of a given application. Each workstation in a NOW is treated similarly to a processing element in a multiprocessor system. However, workstations are far more powerful and flexible than processing elements in conventional multiprocessors.

    The dominant trend, also in the field of High Performance Computing (HPC), is towards networked, powerful workstation SMPs known as NOW (Network of Workstations) and their higher, massively integrated forms named Grid or metacomputers. The effective usage of these dominant PCs in forms such as NOW and Grid requires fundamentally new forms and methodical strategies for the proposal, development, modelling and optimisation of PAs (shared memory, distributed memory), commonly referred to as effective PAs.

    Distributed computing using a cluster of powerful workstations (NOW, SMP, Grid) was reborn as a kind of lazy parallelism: a cluster of computers can team up to solve many problems at once, rather than one problem at a higher speed. To get the most out of a distributed parallel system, designers and software developers must understand the interaction between the hardware and software parts of the system. The use of a computer network based on personal computers is in principle less effective than the typical massive parallel architectures used around the world, because of higher communication overheads; nevertheless, a network of ever more powerful workstations built from personal computers (PCs, PC SMPs) is the way of the future, offering very cheap, flexible and promising parallel computers. We can see this trend in the dynamic growth of parallel architectures based on networks of workstations as a cheaper and more flexible alternative to conventional multiprocessors and supercomputers. The principles of these conventional parallel computers are currently effectively implemented in modern symmetric multiprocessor systems (SMPs) based on identical processors [1] (multiprocessors, multicores). The unification of both approaches (NOW and SMP) has opened up new possibilities for massive HPC computing in the future.

  • 1 Modelling of Parallel Computers and Algorithms

    Generally a model is the abstraction of a system (Fig. 1.1.). The functionality of the model represents the level of the abstraction applied. That means, if we know all there is about a system, and we are willing to pay for the complexity of building a true model, the role of abstraction is nearly zero. In practical cases we wish to abstract the view we take of a system to simplify the complexity of the real system. We wish to build a model that focuses on some basic elements of our interest, and to leave the rest of the real system as only an interface with no details beyond proper input and output. A real system is the concrete applied process or system that we are going to model [85]. In our case they should be applied Parallel Algorithms (PA) or concrete parallel computers (SMP, NOW, Grid etc.).

    The basic conclusion is that a model represents the modeller's subjective insight into the modelled real system. This personal view defines what is important, what the purposes are, the details, the boundaries and so on. The modeller must therefore understand the system in order to guarantee the useful features of the created model.

    Model construction

    Modelling is a highly creative process, which incorporates the following basic prerequisites:


    A strong aptitude for abstract thought.
    Brainstorming (creativity).
    Alternating behaviour and strategy.
    A logical, hierarchical approach to differentiating between primary and secondary facts.

    In general, the development of a model in any scientific area includes the following steps [51, 77]:

    Define the problem to be studied, as well as the criteria for analysis.
    Define and/or refine the model of the system. This includes the development of abstractions into mathematical, logical or procedural relationships.
    Collect input data for the model. Define the outside world and what must be fed into or taken from the model to simulate that world.
    Select a modelling tool and prepare and augment the model for tool implementation.
    Verify that the tool implementation is an accurate reflection of the model.
    Validate that the tool implementation provides the desired accuracy or correspondence with the real-world system being modelled.
    Experiment with the model to obtain performance measurements.
    Analyse the tool results.
    Use the findings to derive designs and improvements for the real-world system.

    Figure 1.1 The modelling process (real system, abstraction, model).

    A corresponding flow diagram of model development is represented in Fig. 1.2:

    Figure 1.2 Flow diagram of model development.



    Figure 1.3 Applied computer modelling.

    As a practical illustration we have chosen the applied modelling of the classical sequential von Neumann computer shown below in Fig. 1.3:


  • 2 Parallel Computers

    Basic technical components of parallel computers are illustrated in Fig. 2.1 as follows:

    Modules of processors, cores, or a mixture of them.
    Modules of computers (sequential, parallel).
    Memory modules.
    Input/output (I/O) modules.

    These modules are connected through internal high-speed communication networks (within a concrete module) and external high-speed communication networks (among the computing modules used) [8, 99].

    Figure 2.1 Basic building modules of parallel computers.



    Classification

    It is very difficult to classify all existing parallel systems, but from the point of view of the programmer-developer we divide them into the following two groups:

    Synchronous parallel architectures. These are used for performing the same or a very similar process (an independent part of a program) on different sets of data (data parallelism) in the active computing nodes of a parallel system. They typically operate under central control, which means under global clock synchronisation (vector and array systems etc.), or under a distributed local control mechanism (systolic systems etc.). This group consists mainly of parallel computers (centralised supercomputers) with some form of shared memory. Shared memory defines the typical system features and in some cases can considerably reduce the development effort for some parallel algorithms. To this group belong the currently dominant parallel computers based on multiple cores, processors, or a mixture of them (Symmetrical Multiprocessors SMP), and most of the realised massive parallel computers (classic supercomputers) [60, 84]. One practical example of this kind of synchronous parallel computer is illustrated in Fig. 2.2. The basic common characteristics are as follows:

    Shared memory (or at least a part of the memory).
    The use of shared memory for communication.
    Supported developing standards: OpenMP, OpenMP threads, Java.

    Figure 2.2 A typical example of a synchronous parallel computer.



    Asynchronous parallel computers. These are composed of a number of fully independent computing nodes (processors, cores or computers) connected through some communication network. To this group belong mainly various forms of computer networks (clusters), a network of powerful workstations (NOW) or a more integrated network of NOW networks (Grid). Any cooperation and control are performed through inter-process communication (IPC) mechanisms via remote or local communication channels. Typical examples of asynchronous parallel computers are illustrated in Fig. 2.3. According to the latest trends, asynchronous parallel computers based on PCs (single, SMP) are the dominant parallel computers. Their basic common characteristics are as follows [10, 63]:

    No shared memory (distributed memory). A computing node may have some form of local memory, where this memory is used only by the connected computing node.
    Cooperation and control of parallel processes using only asynchronous message communication (see the sketch below).
    Supported developing standards: MPI (Message Passing Interface), PVM (Parallel Virtual Machine), Java.
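    As a minimal sketch of this message-based cooperation, and assuming the standard MPI C interface named above (the value exchanged and its tag are purely illustrative), two parallel processes on independent computing nodes can cooperate as follows:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal illustration of asynchronous cooperation through message passing:
   process 0 sends one integer to process 1, which prints it. */
int main(int argc, char *argv[])
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime             */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identity of this parallel process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of parallel processes      */

    if (size >= 2) {
        if (rank == 0) {
            value = 42;                     /* illustrative data item            */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process %d received %d\n", rank, value);
        }
    }

    MPI_Finalize();
    return 0;
}
```

    Each process runs in its own address space on its own computing node; all cooperation passes through the communication network, which is exactly the distributed-memory behaviour described above.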

    Figure 2.3 One example of an asynchronous parallel computer.



    The classification of all existing parallel computer architectures is illustrated below in Fig. 2.4.

    Figure 2.4 Classification of parallel computers.

    Architectures of parallel computers

    Symmetrical multiprocessor systems

    A symmetrical multiprocessor system (SMP) uses multiple identical processors or cores on the motherboard in order to increase the overall performance of the system. Its typical common characteristics are as follows [41, 43]:

    Each processor or core (computing node) of the multiprocessor system can access the main memory (shared memory).

    I/O channels or I/O devices are allocated to individual computing nodes according to their demands.

    An integrated operation system coordinates the cooperation of the entire multiprocessor resources (hardware, software etc.).

    The concept of such a multiprocessor system is illustrated in Fig. 2.5.



    Figure 2.5 Typical characteristics of multiprocessor systems (hardware: processors or cores, shared or multiport memory, shared I/O devices and channels; software: a single integrated operating system with system reconfiguration abilities).

    Figure 2.6 The architecture of an eight-processor Intel multiprocessor.

    A typical example of an eight-processor system (Intel Xeon) is illustrated below in Fig. 2.6.



    A basic abstract model of a parallel computer with a shared memory is illustrated in Fig. 2.7.

    Figure 2.7 A basic abstract model of a multiprocessor.

    Network of workstations

    There has been increasing interest in the use of networks of workstations (NOW) connected via high-speed networks for solving large computation-intensive problems [39]. The principal architecture of a NOW is illustrated in Fig. 2.8 below. This trend is mainly driven by the cost effectiveness of such systems compared to massive multiprocessor systems with tightly-coupled processors and memories (supercomputers). Parallel computing on a cluster of workstations connected by high-speed networks has given rise to a range of hardware and network-related issues on any given platform. Load balancing, inter-processor communication (IPC) and transport protocols for these machines are being widely studied [76, 80]. With the availability of cheap personal computers, workstations and networking devices, the recent trend has been to connect a number of these workstations to solve computation-intensive tasks in parallel using clusters. The network of workstations (NOW) has become a widely-accepted form of high performance computing (HPC). Each workstation in a NOW is treated similarly to a processing element in a multiprocessor system. However, workstations are far more powerful and flexible than processing elements in conventional multiprocessors (supercomputers). To exploit the parallel processing capability of a NOW, an application algorithm must be parallelised. One way to do this for an application problem is to build its own decomposition strategy. This is one of the most important steps in developing effective parallel algorithms.

    Figure 2.8 The architecture of a NOW.

    One typical example of a network of workstations used for solving large computation-intensive problems is illustrated in Fig. 2.9 below. The individual workstations are mainly extremely powerful personal workstations based on a multiprocessor or multicore platform. Parallel computing on a cluster of workstations connected by high-speed networks has given rise to a range of hardware and network-related issues on any given platform.

    A practical example of a NOW module is represented below in Fig. 2.10. It also represents our outgoing architecture in terms of the laboratory parallel computer. On such a modular parallel computer we have been able to study basic problems in parallel and distributed computing such as load balancing, inter-processor communication (IPC), modelling and optimisation of parallel algorithms (effective PA) etc. [34]. The coupled computing nodes PC1, PC2 ... PCi (workstations) can be single extremely powerful personal computers (PCs) or SMP parallel computers. In this way, parallel computing using networks of conventional PC workstations (single, multiprocessor, multicore) and Internet computing all suggest the advantages of unifying parallel and distributed computing. Parallel computing and distributed computing have traditionally evolved as two separate research disciplines.

    Figure 2.9 Typical architecture of a NOW.

    Figure 2.10 A practical example of a NOW.

    Parallel computing has addressed problems of communication-intensive computation on highly-coupled processors [29], while distributed computing has been concerned with the coordination, availability, timeliness and so on of more loosely-coupled computations [62].

    A basic abstract model of a parallel computer with a distributed memory (NOW) is illustrated in Fig. 2.11.

    Figure 2.11 An abstract model of a NOW.

    Grid systems

    Grid technologies have attracted a great deal of attention recently, and numerous infrastructure and software projects have been undertaken to realise various versions of Grids. In general, Grids represent a new way of managing and organising computer networks, and mainly of their deeper resource sharing [5] (Fig. 2.12). Grid systems are expected to operate on a wide range of resources, such as processors (CPUs), storage, data modules, network components and software (typical resources), as well as atypical resources like graphical and audio input/output devices, sensors and so on. All these resources typically exist within nodes that are geographically distributed and span multiple administrative domains. The virtual machine constitutes a set of resources taken from a resource pool [54]. It is obvious that existing HPC parallel computers (supercomputers etc.) could also be members of these Grid systems.

    Figure 2.12 Architecture of a Grid node.

    Conceptually, Grids derive from the structure of virtual parallel computers based on computer networks; in general they represent a new way of managing and organising resources as a network of NOW networks. This term defines a massive computational Grid with the following basic characteristics [28, 75]:

    A wide area network of integrated free computing resources. It is a massive number of inter-connected networks, connected through high-speed networks, where the entire massive system is controlled by a network operating system that creates the illusion of a powerful computer system (a virtual supercomputer).

    It provides the function of metacomputing, that is, a computing environment which gives individual applications the functionality of all system resources.



    The system combines distributed parallel computation with remote computing from user workstations.

    Conventional HPC environment versus Grid environments

    In Grids, the virtual pool of resources is dynamic and diverse, since the resources can be added or withdrawn at any time at their owners' discretion, and their performance or load can change frequently over time. The typical number of resources in the pool is of the order of several thousand or even more. An application in a conventional parallel environment (HPC computing) typically assumes a pool of computational nodes from a subset of which a virtual concurrent machine is formed. The pool consists of PCs, workstations and possibly supercomputers, provided that the user has access (a valid login name and password) to all of them. This virtual pool of nodes can be considered static for a typical user, and in practice it varies from 10 to 100 nodes. In Table 2.1 we summarise the analysed differences between conventional distributed systems and Grid systems. We can also generally say that:

    Table 2.1 Basic comparison of HPC and Grid computing.

    1. Conventional HPC: a virtual pool of computational nodes. Grid: a virtual pool of resources.
    2. Conventional HPC: a user has access (credentials) to all the nodes in the pool. Grid: a user has access to the pool but not to the individual nodes.
    3. Conventional HPC: access to a node means access to all resources on the node. Grid: access to a resource may be restricted.
    4. Conventional HPC: the user is aware of the applications and features of the nodes. Grid: a user has little or no knowledge about each resource.
    5. Conventional HPC: nodes belong to a single trust domain. Grid: resources span multiple trust domains.
    6. Conventional HPC: elements in the pool number from 10 to 100 and are more or less static. Grid: elements in the pool number >>100 and are dynamic.


    HPC environments are optimised to provide maximal performance.

    Grids are optimised to provide maximal resource capacities.

    Integration of parallel computers

    With the availability of cheap personal computers, workstations and networking devices, the recent trend has been to connect a number of such workstations to solve computation-intensive tasks in parallel in various integrated forms of clusters based on computer networks. We have illustrated in Fig. 2.13 a typical integrated complex consisting of NOW network modules. It is clear that any classical parallel computer (a massive multiprocessor, supercomputer etc.) in the world could be a member of such a NOW [90].

    With the aim of attaining connectivity to any of the existing integrated parallel computers in Europe (supercomputers, NOW, Grid), we can use the European classical massive parallel systems by means of scientific visits by project participants to the HPC centres in the EU (EPCC Edinburgh in the UK, BSC Barcelona in Spain, CINECA Bologna in Italy, GENCI Paris in France, SARA Amsterdam in the Netherlands, HLRS Stuttgart in Germany and CSC Helsinki in Finland) [100].

    Metacomputing

    This term defines massive parallel computers (supercomputer, SMP, Grid) with the following basic characteristics [94, 96]:

    A wide area network of integrated free computing resources. This is a massive number of inter-connected networks, connected through high-speed networks, where the entire massive system is controlled by a network operating system which creates the illusion of a powerful computer system (a virtual supercomputer).


    It provides the function of metacomputing, that is, a computing environment which provides the functionality of all the system's resources to individual applications.

    A system which combines distributed parallel computation with remote computing from user workstations.

    The best example of an existing metacomputer is the Internet, as a massive international network of computer networks (Internet module). Fig. 2.14 illustrates the Internet as a virtual parallel computer from the viewpoint of an average Internet user.


    Figure 2.13 Integration of NOW networks.

    Another viewpoint, of the Internet as a network of connected individual computer networks, is illustrated in Fig. 2.15 below. The typical networking switches are bridges, routers, gateways and so on, which we denote with the common term network processors [35].

    Modelling of parallel computers including communication networks

    Communication demands (parallel processes, IPC data) in parallel computers arrive randomly at a source node and follow a specific route through the communication network towards their destination node. The data lengths of the communicated IPC data units (for example, in words) are considered to be random variables following distributions according to Jackson's theorem [15, 33]. These data units are then sent independently through the communication network nodes towards the destination node. At each node, a queue of incoming data units is served on a first-come first-served (FCFS) basis.

    Fig. 2.16 below illustrates a generalisation of any parallel computer, including its communication network, as follows:

    Figure 2.14 The Internet as a virtual parallel computer.



    Computing nodes ui (i = 1, 2, ..., U) of any parallel computer are modelled as graph nodes.
    Network communication channels are modelled as graph edges rij (i ≠ j), representing communication intensities (relation probabilities).

    Figure 2.15 The Internet as a network of connected networks.



    The other parameters of this abstract model are defined as follows:

    λ1, λ2, ..., λU represent the total intensities of the external input data streams to the individual network computing nodes (the summary input stream from the other connected computing nodes to the given i-th computing node); each is assumed to be a Poisson input stream with an intensity of λi demands per unit time.
    rij are the relation probabilities from node i to the neighbouring connected nodes j.
    The total external output streams of data units from the used nodes (the total output stream from a given node to the connected computing nodes) are defined analogously.

    The created abstract model, according to Fig. 2.16 below, belongs in queuing theory to the class of open queuing systems (open queuing networks).

    Figure 2.16 Model of a parallel computer including its communication network.

    Formally, we can adjust the abstract model by adding two virtual nodes, node 0 and node U+1, according to Fig. 2.17, where:

    Virtual node 0 represents the sum λ1 + λ2 + ... + λU of the individual total external input intensities to the computing nodes ui.
    Virtual node U+1 represents the sum of the individual total output intensities from the computing nodes ui.

    Figure 2.17 Adjusted abstract model.
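    Under the Jackson assumptions stated above, such a model leads to the standard traffic (flow-balance) equations. The following LaTeX fragment is a minimal sketch, where the total arrival intensity Λj of node j is an auxiliary symbol introduced here for illustration rather than notation taken from the book:

```latex
% Traffic (flow-balance) equations of the open queueing network:
%   \lambda_j  ... external Poisson input intensity of computing node j
%   r_{ij}     ... relation (routing) probability from node i to node j
%   \Lambda_j  ... total arrival intensity at node j (auxiliary symbol)
\Lambda_j = \lambda_j + \sum_{i=1}^{U} \Lambda_i \, r_{ij},
\qquad j = 1, 2, \ldots, U
% In equilibrium, the total flow entering through virtual node 0 equals
% the total flow leaving through virtual node U+1:
\sum_{i=1}^{U} \lambda_i = \sum_{i=1}^{U} \Lambda_i \, r_{i,\,U+1}
```

    Solving this linear system for Λ1, ..., ΛU gives the load offered to each computing node, which is the starting point for the FCFS queueing analysis of the individual nodes.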


  • 3 Parallel Algorithms

    Introduction

    In recent years there has been increased interest in scientific research into effective parallel algorithms. This trend towards parallel algorithms also supports the current trends in programming technology towards the development of modular applied algorithms based on object-oriented programming (OOP). OOP algorithms are themselves a result of abstract thinking towards parallel solutions for existing complex problems.

    Since the beginning of applied computing, users and programmers have demanded ever more powerful computers and more efficient applied algorithms. Among the more effective long-standing technologies is the implementation of parallel principles both in computers and in applied parallel algorithms. In this sense the term parallel programming can relate to every program which contains more than one parallel process [11, 53], where each process represents a single independent sequential part of a program.

    One basic attribute of parallel algorithms is that they achieve faster solutions in comparison with the quickest sequential solution. The role of the programmer is to develop parallel algorithms (PAs) for the given parallel computer and the given application task (complex application problem). Fig. 3.1 below demonstrates how to derive parallel algorithms from existing sequential algorithms.


    In general we suppose that potentially effective parallel algorithms, according to the defined algorithm classification (Fig. 3.2), should be in group P, the class of polynomial algorithms.

    The other acronyms used in Fig. 3.2 are as follows [39]:

    NP: the general non-polynomial group of all algorithms.
    NC (Nick's class): a group of effective polynomial algorithms.
    PC (polynomial complete): a group of polynomial algorithms with a high degree of complexity.
    NPC (non-polynomial complete): this group consists of non-polynomial algorithms with a high solving complexity. An effective way of solving any NPC algorithm would make it possible to solve other NPC algorithms effectively.

    Figure 3.1 The methodology of deriving parallel algorithms.



    Parallel processes

    To derive a PA we have to create the conditions for potential parallel activities by dividing the input problem algorithm into its constituent parts (decomposition strategy), as illustrated in Fig. 3.3. These individual parts could be as follows:

    Heavy-weight parallel processes.
    Light-weight parallel processes, known as threads.

    In general we can define a standard process as a developed algorithm, or as its independent parts. In detail, the process does not represent only some part of a compiled program because the register status of the processor (process context) also belongs to its characterisation. An illustration of this kind of standard process is shown in Fig. 3.4 on page 32.

    Figure 3.2 Algorithm classification.


    Figure 3.3 Illustration of parallel processes.



    Figure 3.4 Illustration of a standard process.

    Every standard process therefore has its own system stack, which contains its local data, and, in the case of a process interruption, the current register status of the processor. It is obvious that we may have multiple standard processes at the same time which share some program parts, but their processing contexts (process local data) are different. The tools needed to manage processes (initialisation, abort, synchronisation, communication etc.) are provided as services within the kernels of multi-tasking operating systems. An illustration of such a multi-process state is shown in Fig. 3.5.

    Figure 3.5 A parallel algorithm based on multiple parallel processes.

    However, the concept of generating standard processes with individual address spaces is very time-consuming. For example, in the UNIX operating system a new process is generated with the operation fork(), which makes a system call in order to create a new process with its own address space. In detail this means memory allocation, copying the data segments and descriptors from the original process, and creating the stack of the new process. We therefore call this concept a heavy-weight process. It is obvious that the heavy-weight approach does not support the effectiveness of applied parallel processing or the necessary scalability of parallel algorithms. In relation to this, it became necessary to develop another, less time-consuming concept of process generation, named the light-weight process. This lighter concept of generating new processes, under the name of threads, was implemented in various operating systems, supporting thread libraries and parallel development environments. The basic difference between a standard process and a thread is that we can generate additional new threads within a standard process which use the same address space, including the descriptor declarations of the originating standard process.
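    As a minimal sketch, assuming a POSIX system (the shared_counter variable and thread_work function are illustrative only), the following C fragment contrasts the two concepts: fork() creates a heavy-weight process with its own copied address space, while pthread_create() starts a light-weight thread sharing the address space of its parent process:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_counter = 0;      /* shared by threads, only copied into a forked process */

static void *thread_work(void *arg)
{
    shared_counter++;        /* the thread updates the shared address space */
    return NULL;
}

int main(void)
{
    /* Heavy-weight process: the child works on a private copy of shared_counter. */
    pid_t pid = fork();
    if (pid == 0) {
        shared_counter++;    /* changes only the child's copy */
        exit(0);
    }
    waitpid(pid, NULL, 0);

    /* Light-weight process (thread): shares the parent's address space. */
    pthread_t tid;
    pthread_create(&tid, NULL, thread_work, NULL);
    pthread_join(tid, NULL);

    /* Prints 1: only the thread's increment is visible in this process. */
    printf("shared_counter = %d\n", shared_counter);
    return 0;
}
```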

    Classification of PAs

    In principle, parallel algorithms are divided into the two following basic classes:

    Parallel algorithms with shared memory (PAsm). In this case, parallel processes can communicate through shared variables using an existing shared memory. In order to control parallel processes, typical synchronisation tools are used, such as busy waiting, semaphores and monitors, to guarantee the exclusive use of shared resources by a single parallel process [60, 66]. These algorithms are developed for parallel computers with a dominant shared memory, such as current symmetrical multiprocessor or multicore systems on a motherboard (SMP).

    Parallel algorithms with distributed memory (PAdm). Distributed parallel algorithms handle the synchronisation and cooperation of parallel processes only via network communication. The term distributed (asynchronous) parallel algorithm means that the individual parallel processes are performed on independent computing nodes (processors, single or parallel computers) of a parallel computer with a distributed memory [68, 73]. These algorithms are developed for parallel computers with a distributed memory, such as current NOW systems and their higher integrated forms known as Grid systems.

    Mixed PAs. Very promising parallel algorithms which use the advantages of the dominant parallel computers based on NOW modules, as follows:

    The use of parallel processes with a shared memory within individual workstations (SMP PCs).
    The use of other parallel processes based on distributed memory across a NOW module.

    The main difference between these groups is in the form of inter-process communication (IPC) among the parallel processes. Generally we can say that IPC in a parallel system with a shared memory has more communication possibilities than in distributed systems.

    Parallel algorithms with a shared memory

    A typical activity graph of parallel algorithms with shared memory PAsm is shown in Fig. 3.6. In order to control decomposed parallel processes there is a necessary syncronisation mechanism, which is as follows:

    Semaphors. Monitors.


Busy waiting. Path expressions. Critical regions (CR). Conditional critical regions (CCR).

    Figure 3.6 A typical activity graph of PAsm.


    Parallel algorithms with a distributed memory

Parallel algorithms with distributed memory (PAdm) consist of parallel processes which are carried out on the asynchronous computing nodes of a given parallel computer. Therefore, the only means available for all the required cooperation of the parallel processes is inter-process communication (IPC). The principal structure of the parallel processes of a PAdm is illustrated in Fig. 3.7.

    Developing parallel algorithms

To exploit the parallel processing capability, the application program must be made parallel. Choosing the most effective way of doing this for a particular application problem (the decomposition strategy) is the most important step in developing an effective parallel algorithm [45].


Figure 3.7 Illustration of the activity graph of PAdm.

The development of the parallel network algorithm, according to Fig. 3.8, includes the following activities:

Decomposition: the division of the application into a set of parallel processes.

Mapping: the way in which processes and data are distributed among the computing nodes.

Inter-process communication: the way of communication and synchronisation among the individual processes.

Tuning: alteration of the working application to improve performance (performance optimisation).


    Figure 3.8 Development steps in parallel algorithms.



    Decomposition strategies

The development of a sequential algorithm implicitly supposes the existence of an algorithm for the given problem; only later, during practical programming, are suitable data structures defined and used. In contrast to this classic development method, the design of a parallel algorithm should include, from the very beginning, a potential decomposition strategy, including the distribution of the input data to the decomposed parallel processes. The selection of a suitable decomposition strategy has a cardinal influence on the further development of the parallel algorithm.

The decomposition strategy defines a potential division of a given complex problem into its constituent parts (parallel processes) in such a way that they can be performed in parallel on the computing nodes of a given parallel computer. The existence of some kind of decomposition method is a critical assumption for a possible parallel algorithm. The potential degree of decomposition of a given complex problem is crucial for the effectiveness of the parallel algorithm [62, 70]. Until now, the developed parallel algorithms and the corresponding decomposition strategies have been related mainly to the available synchronous parallel computers based on classic massively parallel computers (supercomputers and their innovations). The development of parallel algorithms for the currently dominant parallel computers NOW and Grid requires at least modified decomposition strategies incorporating the following priorities:

An emphasis on functional parallelism for complex problems. Minimised inter-process communication (IPC).

The most important step is to choose the best decomposition method for any given application problem. To do this it is necessary to understand the concrete application problem, its data domains, the algorithms used and the flow of control in the given application. When designing a parallel program, the description of the high-level algorithm must include, in addition to the design of a sequential program, the method you intend to use to break the application down into processes (decomposition strategy) and to distribute the data to the different computing nodes (mapping). The chosen decomposition method drives the rest of the program's development. This is true both when developing new applications and when porting serial code. The decomposition method tells us how to structure the code and data and defines the communication topology [69, 88].

Problem parallelisation is a very creative process which determines the potential degree of parallelism. It is a way of dividing a complex problem into its constituent parts (parallel processes) in such a way that it is possible to perform the PA in parallel. The way of decomposition depends strongly on the algorithm used and on its data structures, and it has a significant influence on the performance and its communication consequences. Until now, the developed decomposition models and strategies have been oriented mainly towards the supercomputers and their innovated types (classic parallel computers) in use around the world. On the other hand, parallel algorithms are nowadays executed on the dominant parallel computers (SMP, NOW, Grid), which demands modified decomposition models and strategies that minimise the inter-process communication intensity (NOW, Grid) and the waiting latency T(s, p)wait caused by shared resources that are not used at their full capacity. The principal decomposition strategies are:

    Natural parallel decomposition. Domain decomposition. Control decomposition:

    manager/workers; functional.

    A divide-and-conquer strategy for the decomposition of complex problems.

    Object oriented programming (OOP).

    Natural parallel decomposition

Natural parallel decomposition allows the simple creation of parallel processes which normally need only a small number of inter-process communications (IPC) for their cooperation. Moreover, in the parallel computation the order of the individual partial solutions is normally not important, so no synchronisation of the performed parallel processes is necessary during the parallel computation. Thanks to these attributes, naturally parallel algorithms achieve a practically ideal p-fold speed-up using the p computing nodes of a parallel computer (linear speed-up), with minimal additional effort needed to develop the parallel algorithm. Typical examples are parallel numerical integration algorithms [31]. In chapter 9 we will analyse an applied parallel algorithm based on the natural decomposition model (the computation of π).
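As a minimal sketch of such a naturally parallel computation (a common textbook formulation, not necessarily the exact algorithm analysed in chapter 9), π can be approximated by numerical integration of 4/(1 + x^2) over [0, 1]; with OpenMP, each thread integrates its own strips independently and only the final reduction of the partial sums requires cooperation:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 10000000;            /* number of integration strips */
        const double h = 1.0 / (double)n;
        double sum = 0.0;

        /* Each thread integrates an independent subset of strips; the only
           cooperation needed is the final reduction of the partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * h;       /* midpoint rule */
            sum += 4.0 / (1.0 + x * x);
        }

        printf("pi ~ %.12f\n", h * sum);
        return 0;
    }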

    Domain decomposition

One typical characteristic of many complex problems is some regularity in their sequential algorithms or in their data structures (computational or data modularity). The existence of such computational or data modules then represents the domain of the computation or of the data. A decomposition strategy based on this domain is used in a substantial part of these complex problems to generate the parallel processes. The domain is mostly characterised by a massive, discrete or static data structure. A typical example of a computational domain is an iterative computation, and a typical example of a data domain is a matrix [25, 93].
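The following sketch (illustrative only; it assumes MPI and, for simplicity, that the number of matrix rows N is divisible by the number of processes) shows a row-block domain decomposition: the root process owns the data domain, scatters contiguous row blocks, each process computes purely locally on its sub-domain, and the partial results are gathered back:

    #include <stdio.h>
    #include <mpi.h>

    #define N 8          /* matrix dimension; assumed divisible by the number of processes */

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int rows = N / p;                 /* contiguous row block per process */
        double a[N][N], local[N][N];      /* local[] needs only 'rows' rows   */
        double row_sum[N], local_sum[N];

        if (rank == 0)                    /* the root initialises the whole data domain */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    a[i][j] = (double)(i + j);

        /* Distribute the data domain: every process receives its own row block. */
        MPI_Scatter(a, rows * N, MPI_DOUBLE, local, rows * N, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        /* Purely local computation on the owned sub-domain (here: row sums). */
        for (int i = 0; i < rows; i++) {
            local_sum[i] = 0.0;
            for (int j = 0; j < N; j++)
                local_sum[i] += local[i][j];
        }

        /* Collect the partial results back on the root process. */
        MPI_Gather(local_sum, rows, MPI_DOUBLE, row_sum, rows, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            for (int i = 0; i < N; i++)
                printf("sum of row %d = %f\n", i, row_sum[i]);

        MPI_Finalize();
        return 0;
    }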

    Functional decomposition

Functional decomposition strategies concentrate on finding parallelism in the division of the sequential computation stream in order to create independent parallel processes. In comparison to domain decomposition, we are concerned here with creating potential alternative control streams of the concrete complex problem. In this way, functional decomposition strives to create as many parallel streams (threads) as possible. An illustration of functional decomposition is shown in Fig. 3.9.


The most widely used functional strategies are:

Control decomposition. Manager/workers (server/clients).

Typical parallel algorithms of this kind are complex optimisation problems, which are connected with the successive searching of massive data structures.

    Control decomposition

Control decomposition, as an alternative to functional decomposition, looks at a given complex problem as a sequence of individual activities (operations, computing steps, control activities etc.), from which we are able to derive multiple control processes. For example, we can consider searching a game tree which responds to game moves, where the branch factor changes from node to node.

    Figure 3.9 An illustration of functional decomposition.


Any static allocation of such a tree is either not possible, or it causes an unbalanced load.

In this decomposition method an irregular structure thus controls the decomposition, which is often encountered in complex problems in artificial intelligence and similar non-numerical applications. Secondly, it is very natural to look at a given complex problem as a collection of modules which represent the necessary functional parts of the algorithm.

    Decomposition strategy known as manager/workers

Another alternative to functional decomposition is the strategy called manager/workers. In this case one parallel process is used for control (the manager). The manager process sequentially and continuously generates the necessary parallel processes (the workers) and assigns them to the controlled computing nodes for execution. An illustration of the manager/workers decomposition method is shown in Fig. 3.10.


    Figure 3.10 The manager/worker parallel structure.


The manager process controls the computation sequence according to the successive completion of the parallel processes carried out by the individual workers. This decomposition strategy is suitable mainly in cases where the given problem does not contain static data or a known fixed number of computations. In these cases it becomes necessary to concentrate on the controlling aspects of the individual parts of the complex problem. After this analysis has been carried out, the needed communication sequence emerges in order to achieve the required time ordering of the created parallel processes. The degree of division of the given complex problem should correspond to the number of computing nodes of the parallel computer, to the parallel computer architecture and to the known performance of the computing nodes. One of the most important elements of the previous steps is the allocation of the parallel processes. It is more effective to allocate a parallel process to the first free computing node (a worker) than to use a fixed sequential order of allocation. In chapter 12 we will analyse applied parallel algorithms (complex combinatorial problems) based on the manager/worker decomposition.
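A simplified manager/workers skeleton in C with MPI might look as follows; the workload (squaring an integer), the task count and the message tags are illustrative assumptions only. The manager seeds every worker and then always hands the next task to the first worker that returns a result, which implements the allocation to the first free computing node mentioned above:

    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS   20
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv)
    {
        int rank, p;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (rank == 0) {                              /* the manager process */
            int task = 0, result, active = 0;
            MPI_Status st;

            /* Seed every worker with its first task (or stop it immediately). */
            for (int w = 1; w < p; w++) {
                if (task < NTASKS) {
                    MPI_Send(&task, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                    task++;
                    active++;
                } else {
                    MPI_Send(&task, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }

            /* Hand every remaining task to the first worker that becomes free. */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                printf("worker %d returned %d\n", st.MPI_SOURCE, result);
                if (task < NTASKS) {
                    MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    task++;
                } else {
                    MPI_Send(&task, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                                      /* a worker process */
            int task, result;
            MPI_Status st;

            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                result = task * task;                 /* the (hypothetical) useful work */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }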

    The divide-and-conquer strategy

The divide-and-conquer strategy decomposes a complex problem into sub-tasks of the same size, and it iteratively keeps repeating this process to obtain ever smaller parts of the given complex problem. In this sense, this decomposition model iteratively applies a problem partitioning technique, as we can see in Fig. 3.11. Divide-and-conquer is sometimes known as recursive partitioning [41]. Typically the complex problem size is an integer power of 2, and the divide-and-conquer strategy halves the complex problem into two equal parts at each iterative step.

In chapter 11 we will show an example of applying the divide-and-conquer strategy to analyse the Discrete Fourier Transform (DFT).
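The following sketch is not the DFT algorithm of chapter 11, but a generic illustration of the divide-and-conquer principle (assuming a compiler with OpenMP task support): the interval is halved at every step until the sub-problems become trivial, the halves are solved as parallel tasks, and the partial solutions are synthesised on the way back:

    #include <stdio.h>
    #include <omp.h>

    #define N 8                    /* problem size: an integer power of 2 */

    /* Divide-and-conquer: halve the interval [lo, hi) until it is trivial,
       solve the halves in parallel (OpenMP tasks) and combine the results. */
    static double dc_sum(const double *x, int lo, int hi)
    {
        if (hi - lo == 1)
            return x[lo];                     /* conquer: trivial sub-problem */

        int mid = lo + (hi - lo) / 2;         /* divide into two equal parts  */
        double left, right;

        #pragma omp task shared(left) firstprivate(x, lo, mid)
        left = dc_sum(x, lo, mid);

        right = dc_sum(x, mid, hi);

        #pragma omp taskwait                  /* synthesis of the partial solutions */
        return left + right;
    }

    int main(void)
    {
        double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double s;

        #pragma omp parallel
        #pragma omp single                    /* one thread starts the recursion */
        s = dc_sum(x, 0, N);

        printf("sum = %f\n", s);
        return 0;
    }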


    The decomposition of big problems

In order to decompose big problems, it is in many cases necessary to use more than one decomposition strategy. This is mainly due to the hierarchical structure of a concrete big problem. The hierarchical character of big problems means that we look at such a problem as a set of various hierarchical levels, whereby it may be useful to apply a different decomposition strategy at every level. This approach is known as multilayer decomposition.

The effective use of multilayer decomposition contributes to the exploitation of a new generation of common parallel computers based on the implementation of more than a thousand computing nodes (processors, cores). Secondly, the unifying trends of high performance computing (HPC) based on massively parallel computers (SMP modules, supercomputers) and of distributed computing (NOW, Grid) are opening up new horizons for programmers.

Examples of typical big problems are weather forecasting, fluid flow, structural analysis, nanotechnologies, high-energy physics, artificial intelligence, symbolic processing, the knowledge economy and so on. The multilayer decomposition model makes it possible to decompose a big problem first into simpler modules, and then, in a second phase, to apply a suitable decomposition strategy to each of these decomposed modules.

    Figure 3.11 An illustration of the divide-and-conquer strategy (n=8).


    Object oriented decomposition

Object oriented decomposition is an integral part of object oriented programming (OOP) and presents a modern method of parallel program development. OOP, besides increasing the demand for abstract thinking on the part of the programmer, involves the decomposition of complex problems into independent parallel modules known as objects [93]. In this way, the object oriented approach looks at a complex problem as a collection of abstract data structures (objects) which also have their functions (methods) built into them, offering another form of parallel processing. OOP thus creates a bridge between sequential computers (the von Neumann concept) and modern parallel computers based on SMP, NOW and Grid. The structure of an object is illustrated in Fig. 3.12.

    Mapping

This step allocates the already created parallel processes to the computing nodes of a parallel computer for their parallel execution. The goal is that every computing node should perform its allocated parallel processes (one or more) with an at least approximately equal input load (load balancing), based on the realistic assumption of equally powerful computing nodes. Fulfilling this condition contributes to an optimal parallel solution latency.
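A minimal sketch of such a balanced static mapping is shown below; the helper function processes_on_node() is hypothetical and simply distributes m parallel processes over p equally powerful nodes so that the per-node counts differ by at most one:

    #include <stdio.h>

    /* Map m parallel processes onto p equally powerful computing nodes so that
       the per-node counts differ by at most one (simple static load balancing). */
    static int processes_on_node(int m, int p, int node)
    {
        return m / p + (node < m % p ? 1 : 0);
    }

    int main(void)
    {
        int m = 10;                 /* decomposed parallel processes */
        int p = 4;                  /* computing nodes of the parallel computer */

        for (int node = 0; node < p; node++)
            printf("node %d executes %d parallel processes\n",
                   node, processes_on_node(m, p, node));
        return 0;
    }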

    Figure 3.12 Object structure.



    Inter process communication

In general we can say that the dominant elements of parallel algorithms are their sequential parts (parallel processes) and the inter-process communication (IPC) taking place between these parallel processes.

Inter process communication in shared memory

Inter-process communication (IPC) for parallel algorithms with a shared memory (PAsm) is supported by the following development standards:

    OpenMP. OpenMP threads. Pthreads. Java threads. Other.

The concrete communication mechanisms exploit the existing shared memory: one parallel process stores the data to be communicated at a specific memory location with its own address, and another parallel process can then read the stored data (shared variables). This looks very simple, but it is necessary to guarantee that only one parallel process at a time can use the addressed memory location. The necessary control mechanisms are known as synchronisation tools. Typical synchronisation tools are:

    Busy waiting. Semaphore. Conditional critical regions (CCR). Monitors. Path expressions.


These synchronisation tools are also used in modern multi-user operating systems (UNIX, Windows etc.).
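As a minimal illustration (assuming a POSIX system with Pthreads and POSIX semaphores), the following sketch protects a shared variable with a semaphore so that only one parallel process at a time enters the critical region; the thread count and iteration count are arbitrary choices:

    #include <stdio.h>
    #include <pthread.h>
    #include <semaphore.h>

    #define NTHREADS 4
    #define NSTEPS   100000

    static long shared_sum = 0;      /* shared variable in the common address space */
    static sem_t mutex;              /* counting semaphore used as a binary lock    */

    static void *worker(void *arg)
    {
        for (int i = 0; i < NSTEPS; i++) {
            sem_wait(&mutex);        /* P operation: enter the critical region      */
            shared_sum++;
            sem_post(&mutex);        /* V operation: leave the critical region      */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        sem_init(&mutex, 0, 1);      /* initial value 1 gives mutual exclusion      */
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        sem_destroy(&mutex);

        /* Without the semaphore the result would be unpredictable (a race condition). */
        printf("shared_sum = %ld (expected %d)\n", shared_sum, NTHREADS * NSTEPS);
        return 0;
    }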

    Inter process communication in distributed memory

Inter-process communication (IPC) for parallel algorithms with a distributed memory (PAdm) is supported by the following development standards:

MPI (Message Passing Interface): point-to-point (PTP) communication commands (send commands, receive commands) and collective communication commands (data distribution commands, data gathering commands).

    PVM (Parallel virtual machine). Java (Network communication support). Other.

To build the necessary synchronisation tools in MPI we have available only the existing network communication between the connected computing nodes. A typical MPI network communication is shown in Fig. 3.13. Based on the existing communication links, MPI provides the synchronisation command BARRIER (MPI_Barrier).
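A minimal PTP example in C (illustrative only, to be run with at least two MPI processes, e.g. mpirun -np 2) combines a send command, a receive command and the BARRIER synchronisation:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0 && size > 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send command    */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receive command */
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* all processes synchronise at this point */
        MPI_Finalize();
        return 0;
    }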

    Performance tuning

After verifying a developed parallel algorithm on a concrete parallel system, the next step is performance modelling and optimisation (an effective PA). This step consists of analysing the previous steps in such a way as to minimise the whole latency of the parallel computation T(s, p). The optimisation of T(s, p) for a given parallel algorithm depends mainly on the following factors:


Figure 3.13 Illustration of an MPI network communication.

    The allocation of a balanced input load to the computing nodes in use in a parallel computer (load balancing) [7].

    The minimisation of the accompanying overheads (parallelisation, inter process communication IPC, control of PA) [91].

To carry out load balancing we obviously need to use equally powerful computing nodes of the parallel computer, which simplifies the load allocation for any developed PA. In the dominant asynchronous parallel computers (NOW, Grid) it is also necessary to reduce (optimise) the number of inter-process communications (IPC, the communication load), for example through the use of an alternative decomposition model.


4 Parallel Program Developing Standards

    Parallel programming languages

The supporting parallel development standards should provide suitable tools and services for the various existing forms of parallel processes. Existing parallel programming languages are divided, in a similar way to parallel computers, into the following two basic groups:

Synchronous programming languages (SPL). Asynchronous programming languages (APL).

Synchronous programming languages assume a concrete shared address space (shared memory). Based on this shared memory, synchronisation tools are implemented, such as monitors, semaphores, critical sections, conditional critical sections and so on [66]. Typical parallel computers for the application of SPL are SMP multicore and multiprocessor parallel computers.

On the other hand, asynchronous programming languages correspond to a parallel application programming interface (API) which has only a distributed memory [13, 86]. In this case the only tool for parallel process cooperation is inter-process communication (IPC). Typical parallel computers for the application of such an API are asynchronous parallel computers based on computer networks (NOW, Grid). The support of application parallel interfaces (API) means the development of standard tools for both groups of parallel algorithms (PAsm, PAdm).

OpenMP standard

In the development of parallel computers and of applied parallel computing, parallel architectures with a shared memory currently play a very important role. A condition for their effective application deployment was the standardisation of their development environment (API). After the experience gained with the parallel extensions of High Performance Fortran (HPF), this standard became OpenMP, the standard for existing parallel computers with a shared memory.

OpenMP is an API for the programming languages C/C++ and Fortran which supports SMP parallel computers (multiprocessors, multicores, supercomputers) with a shared memory under different operating systems (Unix, MS Windows, etc.). The parallelisation methodology in OpenMP is based on the use of so-called compiler directives, library functions and shared variables in order to specify parallelism in the existing shared memory. The compiler directives indicate the parts of a compiled program which are to be executed in parallel, together with additional auxiliary functions for parallelising the relevant parts. The advantage of this approach is the relatively simple, transparent transition between a sequential API and an OpenMP API, because sequential compilers treat the compiler directives as remarks (comments) and ignore them. OpenMP is thus composed of a set of compiler directives, library functions and environment variables that affect the implementation of parallel algorithms. The basic structure of an OpenMP API is illustrated in Fig. 4.1.
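A minimal illustration of this methodology is the following C fragment: compiled with an OpenMP compiler (e.g. gcc -fopenmp) the loop iterations are shared among the threads of the team, while a compiler without OpenMP support ignores the directive and produces a correct sequential program. The array size is an arbitrary choice:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    #define N 1000

    int main(void)
    {
        double a[N], b[N];

        for (int i = 0; i < N; i++)
            b[i] = (double)i;

        /* Compiler directive: the iterations are shared among the threads of the
           team; a compiler without OpenMP support simply ignores this line. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

    #ifdef _OPENMP
        printf("computed with up to %d threads\n", omp_get_max_threads());
    #else
        printf("computed sequentially\n");
    #endif
        return 0;
    }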


The original demands on OpenMP and its properties are as follows [64, 82]:

    Portability. Scalability. Efficiency. High level. Support of data parallelism. Simple to use. Functional robustness.

    The basic functional properties are as follows:

    An API for parallel computers with a shared memory (shared address space).

Portability of program code. Support for Fortran and C/C++. Placed between High Performance Fortran (HPF) and MPI.

    Figure 4.1 The Structure of an OpenMP.

The structure in Fig. 4.1 consists of an application layer (the complex problem and the programmer), a program layer (compiler directives, the OpenMP library and other APIs), a system layer (the OpenMP runtime library and the operating system support for shared memory) and a technical layer (the computing nodes 1 to p connected to the shared memory).


    From the HPF comes simplicity of use through compiler directives, and from the MPI comes effectiveness and functionality.

Standardised since 1997.

The purpose of the OpenMP standard was to provide a unified model for developing parallel algorithms with a shared memory (PAsm) which would be portable between shared memory parallel architectures from different producers. In relation to the other existing parallel API standard (MPI), OpenMP lies between HPF and MPI: HPF offers easy-to-use compiler commands, while the MPI standard provides high functionality. Some basic modules of OpenMP are illustrated in Fig. 4.2.

    Figure 4.2 OpenMP modules.

When analysing the development of parallel algorithms and computers and their future direction, it is implicitly assumed that there will be further innovations of the OpenMP standard and of its possible alternatives for more massive multiprocessor and multicore parallel architectures with a shared memory (massive SMP architectures), which could eliminate its lower scalability in applied use. The individual OpenMP modules are illustrated in Fig. 4.3. Further details of OpenMP can be found in [64] or in the specialised OpenMP manuals.

The OpenMP modules in Fig. 4.2 cover the control of parallel execution (the parallel construct), the decomposition and allocation of work to threads (do/parallel do, section), data management in the shared memory, the synchronisation and control of threads (critical, atomic, barrier), and the runtime functions and environment variables (omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE).


    OpenMP threads

Classic operating systems generate processes with separate address spaces to guarantee safety and to protect individual processes in multi-user and multi-tasking environments. Technical support for this kind of protection began with the Intel 80286 processor and its new protected mode. When a classic process is created, it generates a so-called basic (main) thread. Any additional thread can then be created from an already existing thread by a command of the following typical shape:

thread_create(function, argument1, ..., argumentn);

This way of creating threads is very similar to a procedure call. The thread command creates a new program branch in order to perform the given function with the given arguments. For each new branch a separate stack is created, while the branches jointly use the remaining address space (code segment, data segments, process descriptors) of the original parent process. The OpenMP threading API implements multi-threading parallelism, in which the main thread of the calculation is divided into a specified number of controlled subordinate threads. These threads are then executed in parallel, whereby all of them share the resources of the basic thread. Each thread has an identification number, which can be obtained by using the function omp_get_thread_num().

    Figure 4.3 Basic modules of OpenMP.

The basic OpenMP language extensions in Fig. 4.3 comprise parallel control structures for the flow control of a parallel program (parallel), work sharing for the distribution of work to threads (do/parallel do, section), the data environment with global and local variable scopes, the synchronisation and control of thread execution (critical, atomic, barrier), and the runtime functions and environment variables (omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE).


The identification number is an integer, and the main thread has the number 0. After the parallel part is finished, the created threads join together with the main thread, which carries on executing the parallel algorithm. The executive (runtime) environment allocates threads to processors based on the input load of the parallel computer and other system factors. The number of threads can be set through environment variables or with OpenMP runtime functions.
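The fork/join behaviour described above can be sketched as follows (illustrative only; the requested team size of four is arbitrary):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);                 /* requested size of the thread team */

        #pragma omp parallel                    /* fork: subordinate threads created */
        {
            int id = omp_get_thread_num();      /* identification number, main = 0   */
            int n  = omp_get_num_threads();     /* actual number of threads          */
            printf("thread %d of %d is running\n", id, n);
        }                                       /* join: only the main thread continues */

        printf("main thread continues alone\n");
        return 0;
    }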

    Problem decomposition

Fig. 4.4 illustrates problem decomposition as a process of multithreaded parallelisation in OpenMP. The decomposition of a problem into individual threads can be expressed through the following commands (a combined sketch follows the list):

    omp for or omp do. These commands decompose a given program part into a defined number of threads.

The sections command allocates separate code blocks to different threads.

    A single command defines a single functional block of code that is executed by only one thread.

    A master command is similar to a single command, but the functional block of code is performed by only the main thread.
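The combined sketch below (illustrative only) places the listed work-sharing commands in one parallel region: a loop divided by omp for, two sections executed by different threads, a single block executed by one arbitrary thread, and a master block executed only by the main thread:

    #include <stdio.h>
    #include <omp.h>

    #define N 8

    int main(void)
    {
        int a[N];

        #pragma omp parallel
        {
            /* omp for: the loop iterations are divided among the threads. */
            #pragma omp for
            for (int i = 0; i < N; i++)
                a[i] = i * i;

            /* sections: each block is executed by a different thread. */
            #pragma omp sections
            {
                #pragma omp section
                printf("section 1 done by thread %d\n", omp_get_thread_num());
                #pragma omp section
                printf("section 2 done by thread %d\n", omp_get_thread_num());
            }

            /* single: executed by exactly one (arbitrary) thread. */
            #pragma omp single
            printf("single block done by thread %d\n", omp_get_thread_num());

            /* master: executed only by the main thread (number 0). */
            #pragma omp master
            printf("master block done by thread %d\n", omp_get_thread_num());
        }
        return 0;
    }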

    Figure 4.4 An illustration of multithread parallelisation in OpenMP.


A thread is thus a simplified version of the standard process, with its own instruction register and stack for the independent execution of parts of a parallel program. Due to the different mechanisms of thread generation and support, applied parallel algorithms were not portable between the existing operating systems, or even between innovated versions of the same operating system. A move towards standardisation was made in the mid-nineteen-nineties by extending the C language library with thread support. The credit for these extensions belongs to the POSIX (Portable Operating System Interface) group, so the extended library was named Pthreads (POSIX threads). This set of library routines is currently available in various versions of the UNIX operating system. Pthreads represent a low-level programming model for parallel computers with a shared memory, and they were not aimed at high performance computing (HPC). The reasons for this include the lack of support for the widely used Fortran, and even the C/C++ interface was problematic for scientific parallel algorithms, because its orientation was towards supporting task parallelism, with minimal support for data parallelism.

The Pthreads library contains a set of commands for managing and synchronising threads. Appendix 3 lists a set of basic commands which is sufficient for creating threads and for their later joining and synchronisation.
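A minimal example with the basic Pthreads routines might look as follows (a sketch only; the function name branch and the integer argument are illustrative). It mirrors the generic thread_create(function, argument) pattern shown earlier: every created thread runs the given function with its own argument in the shared address space and is later joined with the main thread:

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4

    /* The function executed by each new program branch (thread). */
    static void *branch(void *arg)
    {
        int id = *(int *)arg;                    /* the argument passed at creation */
        printf("thread %d: running in the shared address space\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int id[NTHREADS];

        /* Equivalent of thread_create(function, argument): one call per thread. */
        for (int i = 0; i < NTHREADS; i++) {
            id[i] = i;
            pthread_create(&tid[i], NULL, branch, &id[i]);
        }

        /* Later joining of the threads with the creating (main) thread. */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        printf("all threads joined\n");
        return 0;
    }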

    MPI API standard

MPI (Message Passing Interface) [39, 82] is the standard for the development of distributed parallel algorithms (PAdm) with message communication for asynchronous parallel computers with a distributed memory (NOW, Grid etc.). The basics of its standardisation were defined in 1993 and 1994 by an international group of dedicated professionals and developers known as the MPI Forum (approximately 40 organisations from the US and Europe), which built on the experience gained with earlier message passing APIs such as PVM (Parallel Virtual Machine) from Oak Ridge National Laboratory, as well as PARMACS, CHIMP from EPCC and so on. The aims of the MPI API were to provide a standard message communication library for creating portable parallel programs (source code portability of PP between various parallel computers) and effective (optimised) parallel programs based on message communication (distributed memory). MPI is inherently not a language; it is a library of program services that can be used by programs written in C/C++ and Fortran.

MPI provides a rich collection of communication routines for point-to-point (PTP) transmission and collective operations for data exchange, global computations, synchronisation and the joining of partial results. MPI also defines a number of equally important requirements such as derived data types and the necessary specifications of communication services. Currently there are several implementations of MPI and its versions for networks of workstations, groups of personal computers (clusters), multiprocessors with a distributed memory and virtual shared memory (VSM) systems. After standardising the MPI 1 version in 1994, the MPI Forum began working on adding further requested services to the previous MPI standard, including dynamic processes, support for the parallel decomposition strategy known as manager/worker (client/server), collective communications, parallel I/O operations and non-blocking collective communication functions (an innovation of MPI-1 in 1995). Further development of the MPI standard continued with MPI 2, which was standardised in 1997.
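As a small illustration of these collective operations (a sketch only; the summation is just a stand-in workload), the root process distributes a global parameter with MPI_Bcast, every process computes a partial result, and MPI_Reduce joins the partial results into the global one:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, n = 0;
        double partial, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            n = 100;                              /* some global problem parameter */

        /* Collective data distribution: the root broadcasts n to all processes. */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Every process computes a partial result on its share of the work. */
        partial = 0.0;
        for (int i = rank; i < n; i += size)
            partial += (double)i;

        /* Global computation: joining of the partial results on the root. */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 0..%d = %.0f\n", n - 1, total);

        MPI_Finalize();
        return 0;
    }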

    Generally, existing MPI standards are still considered as being low-level because most of the activities for the development of distributed parallel