Advances in Parallel Computing
Parallel processing is ubiquitous today, with applications ranging from mobile devices, such as
laptops, smartphones and in-car systems, to Internet of Things (IoT) frameworks and
high-performance, large-scale parallel systems. The continuing expansion of the application
domain of parallel computing, together with the development and introduction of new technologies
and methodologies, is covered in the Advances in Parallel Computing book series. The series
publishes research and development results on all aspects of parallel computing. Topics include
one or more of the following:
• Parallel Computing systems for High Performance Computing (HPC) and High Throughput
Computing (HTC), including Vector and Graphics (GPU) processors, clusters, heterogeneous
systems, Grids, Clouds, Service Oriented Architectures (SOA), Internet of Things (IoT), etc.
• High Performance Networking (HPN)
• Performance Measurement
• Energy Saving (Green Computing) technologies
• System Software and Middleware for parallel systems
• Parallel Software Engineering
• Parallel Software Development Methodologies, Methods and Tools
• Parallel Algorithm design
• Application Software for all application fields, including scientific and engineering
applications, data science, social and medical applications, etc.
• Neuromorphic computing
• Brain Inspired Computing (BIC)
• AI and (Deep) Learning, including Artificial Neural Networks (ANN)
• Quantum Computing
Series Editor:
Professor Dr. Gerhard R. Joubert
Volume 36
Recently published in this series
Vol. 35. F. Xhafa and A.K. Sangaiah (Eds.), Advances in Edge Computing: Massive Parallel
Processing and Applications
Vol. 34. L. Grandinetti, G.R. Joubert, K. Michielsen, S.L. Mirtaheri, M. Taufer and R. Yokota
(Eds.), Future Trends of HPC in a Disruptive Scenario
Vol. 33. L. Grandinetti, S.L. Mirtaheri, R. Shahbazian, T. Sterling and V. Voevodin (Eds.), Big
Data and HPC: Ecosystem and Convergence
Volumes 1–14 published by Elsevier Science.
ISSN 0927-5452 (print)
ISSN 1879-808X (online)
Parallel Computing: Technology
Trends
Edited by
Ian Foster, Argonne National Laboratory and University of Chicago, Chicago, USA
Gerhard R. Joubert, Technical University Clausthal, Clausthal-Zellerfeld, Germany
Luděk Kučera, Charles University, Prague, Czech Republic
Wolfgang E. Nagel, Technical University Dresden, Dresden, Germany
and
Frans Peters, formerly Philips Research, Eindhoven, Netherlands
Amsterdam • Berlin • Washington, DC
© 2020 The authors and IOS Press.
This book is published online with Open Access and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
ISBN 978-1-64368-070-5 (print)
ISBN 978-1-64368-071-2 (online)
Library of Congress Control Number: 2020934256
doi: 10.3233/APC36
Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]
For book sales in the USA and Canada:
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel.: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.
PRINTED IN THE NETHERLANDS
Conference Organisation
Conference Committee
Gerhard Joubert, Germany (Conference Chair)
Ian Foster, USA
Luděk Kučera, Czech Republic
Thomas Lippert, Germany
Wolfgang Nagel, Germany
Frans Peters, Netherlands
Program Committee
Ian Foster, USA
Wolfgang Nagel, Germany
Symposium Committee
Gerhard Joubert, Germany
Thomas Lippert, Germany
Luděk Kučera, Czech Republic
Organising & Exhibition Committee
Luděk Kučera, Czech Republic
Finance Committee
Frans Peters, Netherlands (Finance Chair)
ParCo2019 Sponsors
EPI (European Processor Initiative)
hoComputer & Intel
Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany
The University of Chicago, USA
Charles University, Czech Republic
Technical University Clausthal, Germany
Program Committee
Ian Foster, USA (Program Committee Chair)
Wolfgang Nagel, Germany (Program Committee Chair)
David Abramson, Australia
Marco Aldinucci, Italy
Christian Bischof, Germany
Jens Breitbart, Germany
Kris Bubendorfer, New Zealand
Andrea Clematis, Italy
Umit Catalyurek, USA
Sudheer Chunduri, USA
Massimo Coppola, Italy
Luisa D’Amore, Italy
Pasqua D’Ambra, Italy
Erik D’Hollander, Belgium
Bjorn De Sutter, Belgium
Ewa Deelman, USA
Frédéric Desprez, France
Didier El Baz, France
Ian Foster, USA
Geoffrey Fox, USA
Franz Franchetti, USA
Basilio B. Fraguela, Spain
Karl Fürlinger, Germany
Edgar Gabriel, USA
Efstratios Gallopoulos, Greece
José Daniel Garcia Sanchez, Spain
Michael Gerndt, Germany
William Gropp, USA
Georg Hager, Germany
Kevin Hammond, UK
Lei Huang, USA
Emmanuel Jeannot, France
Odej Kao, Germany
Wolfgang Karl, Germany
Carl Kesselman, USA
Christoph Kessler, Sweden
Harald Köstler, Germany
Dieter Kranzlmüller, Germany
Herbert Kuchen, Germany
Alexey Lastovetsky, Ireland
Jin-Fu Li, Taiwan
Jay Lofstead, USA
Ignacio Martín Llorente, Spain
Allen D. Malony, USA
Simon McIntosh-Smith, UK
Bernd Mohr, Germany
Wolfgang E. Nagel, Germany
Kengo Nakajima, Japan
Bogdan Nicolae, USA
Dimitrios Nikolopoulos, UK
Manish Parashar, USA
Christian Pérez, France
Serge Petiton, France
Oscar Plata, Spain
Sabri Pllana, Sweden
Enrique S. Quintana-Ortí, Spain
Carl Raicu, USA
J. (Ram) Ramanujam, USA
Matei Ripeanu, Canada
Dirk Roose, Belgium
Peter Sanders, Germany
Henk Sips, Netherlands
Domenico Talia, Italy
Michela Taufer, USA
Valerie Taylor, USA
Doug Thain, USA
George Thiruvathukal, USA
Massimo Torquati, Italy
Denis Trystram, France
Sudharshan Vazhkudai, USA
Jose Luis Vazquez-Poletti, Spain
Jon Weissman, USA
Mini-Symposia
Tools and Infrastructure for Reproducibility in Data-intensive Applications
Organisers
Sandro Fiore, USA
Ian Foster, USA
Carl Kesselman, USA
ParaFPGA 2019: Parallel Computing with FPGAs
Organisers
Erik D’Hollander, Belgium
Abdellah Touhafi, Belgium
Program Committee:
Frank Hannig, Germany
Yun Liang, China
Tsutomu Maruyama, Japan
Dionisios Pnevmatikatos, Greece
Viktor Prasanna, USA
Dirk Stroobandt, Belgium
Wim Vanderbauwhede, UK
Sotirios G. Ziavras, USA
Energy-efficient Computing on Parallel Architectures (ECO-PAR)
Organisers
Enrico Calore, Italy
Nikela Papadopoulou, Greece
Sebastiano Fabio Schifano, Italy
Vladimir Stegailov, Russia
ELPA – A Parallel Dense Eigensolver for Symmetric Matrices with Applications
in Computational Chemistry
Organisers
Thomas Huckle, Germany
Bruno Lang, Germany
Contents
Preface v
Ian Foster, Gerhard Joubert, Luděk Kučera, Wolfgang Nagel and Frans Peters
Conference Organisation vii
Opening
Four Decades of Cluster Computing 3
Gerhard Joubert and Anthony Maeder
Invited Talks
Will We Ever Have a Quantum Computer? 11
M.I. Dyakonov
Empowering Parallel Computing with Field Programmable Gate Arrays 16
Erik H. D’Hollander
Main Track
Deep Learning Applications
First Experiences on Applying Deep Learning Techniques to Prostate Cancer
Detection 35
Eduardo José Gómez-Hernández and José Manuel García
Deep Generative Model Driven Protein Folding Simulations 45
Heng Ma, Debsindhu Bhowmik, Hyungro Lee, Matteo Turilli, Michael Young,
Shantenu Jha and Arvind Ramanathan
Economics
A Scalable Approach to Econometric Inference 59
Philip Nadler, Rossella Arcucci and Yi-Ke Guo
Cloud vs On-Premise HPC: A Model for Comprehensive Cost Assessment 69
Marco Ferretti and Luigi Santangelo
GPU Computing Methods
GPU Architecture for Wavelet-Based Video Coding Acceleration 83
Carlos de Cea-Dominguez, Juan C. Moure, Joan Bartrina-Rapesta
and Francesc Aulí-Llinàs
GPGPU Computing for Microscopic Pedestrian Simulation 93
Benedikt Zönnchen and Gerta Köster
High Performance Eigenvalue Solver for Hubbard Model: Tuning Strategies for
LOBPCG Method on CUDA GPU 105
Susumu Yamada, Masahiko Machida and Toshiyuki Imamura
Parallel Smoothers in Multigrid Method for Heterogeneous CPU-GPU Environment 114
Neha Iyer and Sashikumaar Ganesan
Load Balancing Methods
Progressive Load Balancing in Distributed Memory. Mitigating Performance and
Progress Variability in Iterative Asynchronous Algorithms 127
Justs Zarins and Michèle Weiland
Learning-Based Load Balancing for Massively Parallel Simulations of Hot
Fusion Plasmas 137
Theresa Pollinger and Dirk Pflüger
Load-Balancing for Large-Scale Soot Particle Agglomeration Simulations 147
Steffen Hirschmann, Andreas Kronenburg, Colin W. Glass and Dirk Pflüger
On the Autotuning of Task-Based Numerical Libraries for Heterogeneous
Architectures 157
Emmanuel Agullo, Jesús Cámara, Javier Cuenca and Domingo Giménez
Parallel Algorithms
Batched 3D-Distributed FFT Kernels Towards Practical DNS Codes 169
Toshiyuki Imamura, Masaaki Aoki and Mitsuo Yokokawa
On Superlinear Speedups of a Parallel NFA Induction Algorithm 179
Tomasz Jastrząb
A Domain Decomposition Reduced Order Model with Data Assimilation
(DD-RODA) 189
Rossella Arcucci, César Quilodrán Casas, Dunhui Xiao, Laetitia Mottet, Fangxin Fang, Pin Wu, Christopher Pain and Yi-Ke Guo
Predicting Performance of Classical and Modified BiCGStab Iterative Methods 199
Boris Krasnopolsky
Parallel Applications
Gadget3 on GPUs with OpenACC 209
Antonio Ragagnin, Klaus Dolag, Mathias Wagner, Claudio Gheller,
Conradin Roffler, David Goz, David Hubber and Alexander Arth
Exploring High Bandwidth Memory for PET Image Reconstruction 219
Dai Yang, Tilman Küstner, Rami Al-Rihawi and Martin Schulz
Parallel Architecture
The Architecture of Heterogeneous Petascale HPC RIVR 231
Miran Ulbin and Zoran Ren
Design of an FPGA-Based Matrix Multiplier with Task Parallelism 241
Yiyu Tan, Toshiyuki Imamura and Daichi Mukunoki
Application Performance of Physical System Simulations 251
Vladimir Getov, Peter M. Kogge and Thomas M. Conte
Parallel Methods
A Hybrid MPI+Threads Approach to Particle Group Finding Using Union-Find 263
James S. Willis, Matthieu Schaller, Pedro Gonnet and John C. Helly
Parallel Performance
Improving the Scalability of the ABCD Solver with a Combination of New Load
Balancing and Communication Minimization Techniques 277
Iain Duff, Philippe Leleux, Daniel Ruiz and F. Sukru Torun
Characterization of Power Usage and Performance in Data-Intensive Applications
Using MapReduce over MPI 287
Joshua Davis, Tao Gao, Sunita Chandrasekaran, Heike Jagode, Anthony Danalis, Jack Dongarra, Pavan Balaji and Michela Taufer
Feedback-Driven Performance and Precision Tuning for Automatic Fixed Point
Exploitation 299
Daniele Cattaneo, Michele Chiari, Stefano Cherubin, Antonio Di Bello
and Giovanni Agosta
Parallel Programming
A GPU-CUDA Framework for Solving a Two-Dimensional Inverse Anomalous
Diffusion Problem 311
P. de Luca, A. Galletti, H.R. Ghehsareh, L. Marcellino and M. Raei
Parallelization Strategies for GPU-Based Ant Colony Optimization Applied
to TSP 321
Breno Augusto de Melo Menezes, Luis Filipe de Araujo Pessoa, Herbert Kuchen and Fernando Buarque De Lima Neto
DBCSR: A Blocked Sparse Tensor Algebra Library 331
Ilia Sivkov, Patrick Seewald, Alfio Lazzaro and Jürg Hutter
Acceleration of Hydro Poro-Elastic Damage Simulation in a Shared-Memory
Environment 341
Harel Levin, Gal Oren, Eyal Shalev and Vladimir Lyakhovsky
BERTHA and PyBERTHA: State of the Art for Full Four-Component
Dirac-Kohn-Sham Calculations 354
Loriano Storchi, Matteo de Santis and Leonardo Belpassi
Prediction-Based Partitions Evaluation Algorithm for Resource Allocation 364
Anna Pupykina and Giovanni Agosta
Unified Generation of DG-Kernels for Different HPC Frameworks 376
Jan Hönig, Marcel Koch, Ulrich Rüde, Christian Engwer
and Harald Köstler
Invasive Computing for Power Corridor Management 386
Jophin John, Santiago Narvaez and Michael Gerndt
Enforcing Reference Capability in FastFlow with Rust 396
Luca Rinaldi, Massimo Torquati and Marco Danelutto
Performance
AITuning: Machine Learning-Based Tuning Tool for Run-Time Communication
Libraries 409
Alessandro Fanfarillo and Davide del Vento
Towards Benchmarking the Asynchronous Progress of Non-Blocking MPI
Operations 419
Alexey V. Medvedev
Power Management
Acceleration of Interactive Multiple Precision Arithmetic Toolbox MuPAT
Using FMA, SIMD, and OpenMP 431
Hotaka Yagi, Emiko Ishiwata and Hidehiko Hasegawa
Dynamic Runtime and Energy Optimization for Power-Capped HPC
Applications 441
Bo Wang, Christian Terboven and Matthias Müller
Programming Paradigms
Paradigm Shift in Program Structure of Particle-in-Cell Simulations 455
Takayuki Umeda
Backus FP Revisited: A Parallel Perspective on Modern Multicores 465
Alessandro di Giorgio and Marco Danelutto
Multi-Variant User Functions for Platform-Aware Skeleton Programming 475
August Ernstsson and Christoph Kessler
Scalability Analysis
POETS: Distributed Event-Based Computing – Scaling Behaviour 487
Andrew Brown, Mark Vousden, Alex Rast, Graeme Bragg, David Thomas,
Jonny Beaumont, Matthew Naylor and Andrey Mokhov
Towards High-End Scalability on Biologically-Inspired Computational Models 497
Dario Dematties, George K. Thiruvathukal, Silvio Rizzi,
Alejandro Wainselboim and B. Silvano Zanutto
Scientific Visualization
GraphiX: A Fast Human-Computer Interaction Symmetric Multiprocessing
Parallel Scientific Visualization Tool 509
Re’em Harel and Gal Oren
When Parallel Performance Measurement and Analysis Meets In Situ Analytics
and Visualization 521
Allen D. Malony, Matt Larsen, Kevin Huck, Chad Wood, Sudhanshu Sane and Hank Childs
Stream Processing
Seamless Parallelism Management for Video Stream Processing on Multi-Cores 533
Adriano Vogel, Dalvan Griebler, Luiz Gustavo Fernandes
and Marco Danelutto
High-Level Stream Parallelism Abstractions with SPar Targeting GPUs 543
Dinei A. Rockenbach, Dalvan Griebler, Marco Danelutto
and Luiz G. Fernandes
Mini-Symposia
Energy-Efficient Computing on Parallel Architectures (ECOPAR)
Energy-Efficiency Evaluation of FPGAs for Floating-Point Intensive Workloads 555
Enrico Calore and Sebastiano Fabio Schifano
GPU Acceleration of Four-Site Water Models in LAMMPS 565
Vsevolod Nikolskiy and Vladimir Stegailov
Energy Consumption of MD Calculations on Hybrid and CPU-Only
Supercomputers with Air and Immersion Cooling 574
Ekaterina Dlinnova, Sergey Biryukov and Vladimir Stegailov
Direct N-Body Application on Low-Power and Energy-Efficient Parallel
Architectures 583
David Goz, Georgios Ieronymakis, Vassilis Papaefstathiou,
Nikolaos Dimou, Sara Bertocco, Antonio Ragagnin, Luca Tornatore,
Giuliano Taffoni and Igor Coretti
Performance and Energy Efficiency of CUDA and OpenCL for GPU Computing
Using Python 593
Håvard H. Holm, André R. Brodtkorb and Martin L. Sætra
Computational Performances and Energy Efficiency Assessment for a Lattice
Boltzmann Method on Intel KNL 605
Ivan Girotto, Sebastiano Fabio Schifano, Enrico Calore, Gianluca di Staso
and Federico Toschi
Performance, Power Consumption and Thermal Behavioral Evaluation
of the DGX-2 Platform 614
Matej Spetko, Lubomir Riha and Branislav Jansik
On the Performance and Energy Efficiency of Sparse Matrix-Vector
Multiplication on FPGAs 624
Panagiotis Mpakos, Nikela Papadopoulou, Chloe Alverti, Georgios Goumas and Nectarios Koziris
Evaluation of DVFS and Uncore Frequency Tuning Under Power Capping
on Intel Broadwell Architecture 634
Lubomir Riha, Ondrej Vysocky and Andrea Bartolini
ELPA – A Parallel Dense Eigensolver for Symmetric Matrices with Applications
in Computational Chemistry
ELPA: A Parallel Solver for the Generalized Eigenvalue Problem 647
Hans-Joachim Bungartz, Christian Carbogno, Martin Galgon,
Thomas Huckle, Simone Köcher, Hagen-Henrik Kowalski, Pavel Kus,
Bruno Lang, Hermann Lederer, Valeriy Manin, Andreas Marek,
Karsten Reuter, Michael Rippl, Matthias Scheffler and Christoph Scheurer
ParaFPGA 2019: Parallel Computing with FPGAs
Parallel Totally Induced Edge Sampling on FPGAs 671
Akshit Goel, Sanmukh R. Kuppannagari, Yang Yang, Ajitesh Srivastava
and Viktor K. Prasanna
An Implementation of Non-Local Means Algorithm on FPGA 681
Hayato Koizumi and Tsutomu Maruyama
Accelerating Binarized Convolutional Neural Networks with Dynamic Partial
Reconfiguration on Disaggregated FPGAs 691
Panagiotis Skrimponis, Emmanouil Pissadakis, Nikolaos Alachiotis
and Dionisios Pnevmatikatos
Porting a Lattice Boltzmann Simulation to FPGAs Using OmpSs 701
Enrico Calore and Sebastiano Fabio Schifano
A Processor Architecture for Executing Global Cellular Automata as Software 711
Christian Ristig and Christian Siemers
Crossbar Implementation with Partial Reconfiguration for Stream Switching
Applications on an FPGA 721
Yuichi Kawamata, Tomohiro Kida, Yuichiro Shibata and Kentaro Sano
Tools and Infrastructure for Reproducibility in Data-Intensive Applications
Cryptographic Methods with a Pli Cacheté. Towards the Computational
Assurance of Integrity 733
Thatcher L. Collins
Replicating Machine Learning Experiments in Materials Science 743
Line Pouchard, Yuewei Lin and Hubertus Van Dam
Documenting Computing Environments for Reproducible Experiments 756
Jason Chuah, Madeline Deeds, Tanu Malik, Youngdon Choi
and Jonathan L. Goodall
Toward Enabling Reproducibility for Data-Intensive Research Using the Whole Tale Platform 766
Kyle Chard, Niall Gaffney, Mihael Hategan, Kacper Kowalik,
Bertram Ludäscher, Timothy McPhillips, Jarek Nabrzyski,
Victoria Stodden, Ian Taylor, Thomas Thelen, Matthew J. Turk
and Craig Willis
Subject Index 779
Author Index 783
Four Decades of Cluster Computing
Gerhard JOUBERT a,1, Anthony MAEDER b
a Clausthal University of Technology, Germany
b Flinders University, Adelaide, Australia
1 Lange-Feld-Str. 45, Hanover, Germany. E-mail: [email protected]
Abstract.
During the latter half of the 1970s high performance computers (HPC) were constructed using specially designed and manufactured hardware. The preferred architectures were vector or array processors, as these allowed for high speed processing of a large class of scientific/engineering applications. Due to the high cost of the development and construction of such HPC systems, the number of available installations was limited. Researchers often had to apply for compute time on such systems and wait for weeks before being allowed access. Cheaper and more accessible HPC systems were thus in great demand. The concept of constructing high performance parallel computers with distributed Multiple Instruction Multiple Data (MIMD) architectures using standard off-the-shelf hardware promised the construction of affordable supercomputers. Considerable scepticism existed at the time about whether MIMD systems could offer significant increases in processing speeds. The reasons for this were Amdahl's Law, coupled with the overheads resulting from slow communication between nodes and the complex scheduling and synchronisation of parallel tasks. In order to investigate the potential of MIMD systems constructed with existing off-the-shelf hardware, a first simple two-processor system was constructed that finally became operational in 1979. In this paper aspects of this system and some of the results achieved are reviewed.
Keywords. MIMD parallel computer, cluster computer, parallel algorithms, speed-up, gain factor.
1. Introduction
During the 1960s and 1970s the solution of increasingly complex scientific problems
resulted in a demand for more powerful computers. The available sequential processors
proved unable to meet these demands. Attempts in the late 1960s to optimise the
execution of sequential program code by analysing program execution patterns
resulted in improved execution strategies [1, 2]. These attempts to increase the
processing speeds of sequential SISD (Single Instruction Single Data) computers had
limited effect and did not offer the compute power needed for processing compute
intensive problems. A typical problem at the time was to be able to compute a 24 hour
weather forecast in less than 24 hours.
A next step was to speed up the execution of compute intensive sections of a
program through specially designed hardware. An often occurring operation in
scientific computations is the processing of vectors and matrices. Such operations
can be executed in parallel by SIMD (Single Instruction Multiple Data) processors.
It was thus a natural approach in the 1970s to develop vector and array processors
as the supercomputers of the day. Examples are the ICL DAP (Distributed Array
Processor), ILLIAC, CRAY, etc.
The problem was that the development of such specially designed and built
machines was expensive. The use of such supercomputers by researchers as well as
software developers was limited due to the high cost of purchasing and running
these systems. In addition the programming of applications software often had to
resort to machine level instructions in order to utilise the particular hardware
characteristics of the available machine.
The development of integrated circuits during the early 1970s, which enabled the
large scale production of processors at ever lower cost, opened up the possibility of
using such components to construct MIMD parallel computers at low cost. The
concept proposed in an unpublished talk in 1976 [3] was that the future of high
performance computing at acceptable cost lay in using standard COTS (Components
Off The Shelf) to construct low-cost parallel computers. The architecture of such
systems could be adapted by using standard as well as special compute nodes,
different storage architectures and various interconnection networks.
The concept of developing such systems was, however, deemed unattractive during
the late 1970s, mainly due to two aspects. The first was Amdahl's Law [4], which
implies that speedup is limited because typically only part of a program can be
parallelised; the second was that the synchronisation and communication
requirements would create an overhead which made parallel systems highly
inefficient. A further aspect that hampered the acceptance of MIMD systems was
Grosch's Law [5], which stated that computer performance increases as the square
of the cost, i.e. if a computer costs twice as much one could expect it to be four
times more powerful. This does not apply to MIMD systems, as the addition of
nodes results in a linear increase in compute power. Moore's Law [6] maintained in
1965 that the number of components per integrated circuit doubled every year; this
was revised in 1975 to a doubling every two years. This resulted in an estimated
doubling of computer chip performance about every 18 months due to design
improvements. It was an open question to what extent these developments could
offset the inherent disadvantages of MIMD systems.
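For reference, the scepticism grounded in Amdahl's Law can be stated compactly in modern notation (the paper itself gives no formula): if a fraction p of a program's work can be parallelised across N nodes, the achievable speedup is bounded by

```latex
S(N) = \frac{1}{(1 - p) + p/N},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
```

For example, with p = 0.9 even an unlimited number of nodes yields at most a tenfold speedup, which explains the doubts about systems built from many slow, loosely coupled nodes.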
In 1977 Prof. Tsutomu Hoshino and Prof. Kawai started a project in Japan to
construct a parallel computer using standard components. Their aim was to develop
a parallel system architecture that could be used to solve particular problems. The
system was later called the PAX computer [7]. This approach was different from
that described in the following sections, where the general applicability of MIMD
systems to solve compute intensive problems was the main objective.
2. A Simple MIMD Parallel Computer
In 1976/77 a project was started at the University of Natal, South Africa to
investigate the possibilities of achieving higher compute performance by connecting
standard available mini-computers [8]. The final development stage was reached in
1979 when the system was upgraded so that both nodes had identical hardware. The
parallel system was later named the CSUN (Computer System of the University of
Natal) [8].
The project involved three aspects, viz. hardware and architecture, network and
software.
2.1 Hardware and Architecture
The available hardware consisted of two standard HP1000 mini-computers. The
processors were identical, but the memory sizes differed initially. The architecture
decided on was a master-slave configuration with distributed memories. No
commonly accessible memory was available. The HP1000 offered a
microprogramming capability, which allowed for special functions to be executed
at high speed.
Fig. 1: The cluster system, admired by Chris Handley 2
2 Later: University of Otago, New Zealand
2.2 Network
The connection of the two nodes had to offer high communication speeds. This was
realised by using a high-speed connection available for HP1000 mini-computers for
logging high volumes of data collected by scientific instruments. The cable was
adapted by HP to supply a computer interface at both ends, allowing the
interconnection of the two nodes via interface cards installed in each machine.
These interfaces were user configurable by means of adjustable switch settings for
timing or logistic characteristics, allowing a computer-to-computer mode. The
maximum transmission speed was one million 16 bit words per second, i.e. about
2 MB/s.
2.3 Software
The Real Time Operating System (RTOS), HP-RTE, available for the HP1000
offered the basic platform for running and managing the nodes. The system had to
be enhanced by additional software modules to achieve control of the overall
parallel computer system. A monitor was developed to create an interface for users
to input and run programs. Programs and data were provided on punched cards or
tape.
A critical component was the communication between the two nodes. For this,
drivers were developed that also allowed for the synchronisation of tasks. With the
master-slave organisation of the system the slave always had to be under the
control of the master. In an interrupt-driven environment this is easily
accomplished. The communication available between the two nodes, however, did
not allow specific interrupt signals to be transmitted between the two machines.
Thus data controlled transmission, i.e. sending all messages with header
information, was used. Both sender and receiver had to wait for acknowledgement
from the counterpart before message transmission could begin. This caused an
additional overhead for the synchronisation of tasks.
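The original HP1000 drivers are not reproduced in the paper, so the following is only a minimal sketch of such data-controlled transmission. The link object and its blocking byte-oriented send/recv primitives are hypothetical stand-ins for the interface-card driver; the header layout and ACK word are likewise invented for illustration.

```python
# Sketch of data-controlled transmission with explicit acknowledgement.
# 'link' is a hypothetical object exposing blocking send(bytes) and
# recv(n) -> bytes primitives over the node-to-node interface.
import struct

ACK = 0x0001  # invented acknowledgement word

def send_message(link, msg_type: int, payload: bytes) -> None:
    # Announce the message with a header carrying its type and length ...
    link.send(struct.pack(">HH", msg_type, len(payload)))
    # ... and wait for the counterpart's acknowledgement before sending
    # the body, as described above (this wait is the synchronisation cost).
    if struct.unpack(">H", link.recv(2))[0] != ACK:
        raise IOError("receiver did not acknowledge header")
    link.send(payload)

def recv_message(link) -> tuple[int, bytes]:
    msg_type, length = struct.unpack(">HH", link.recv(4))
    # Acknowledge the header so the sender may begin transmission.
    link.send(struct.pack(">H", ACK))
    return msg_type, link.recv(length)
```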
The master node was responsible for all controlling activities. It prepared tasks for
execution by the slave and downloaded these, together with the data needed, to the
slave, which then started executing them. The master in the meantime prepared its
own tasks and executed these in parallel, exchanging intermediate results with the
slave. The master also executed any serial tasks as required. The later upgrade of
the system to two equally equipped nodes simplified task scheduling.
Such a setup is of course very sensitive to the volume and frequency of data
transmission. Programmers must therefore take this into account when selecting an
algorithm for solving a particular problem.
No programming tools for developing parallel software were available at the time.
The standard programming language for scientific applications was FORTRAN. A
pre-compiler was developed that processed directives inserted by programmers in
the FORTRAN program code and automatically created tasks that could be
executed in parallel; this information was then used to schedule the parallel
execution of the tasks.
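The directive syntax of that pre-compiler is not given in the paper, so the toy pass below is purely hypothetical: it scans FORTRAN source for invented C$TASK / C$END comment directives and collects each delimited region as a task the scheduler could run in parallel.

```python
# Hypothetical illustration of a directive-scanning pre-compiler pass.
# The C$TASK / C$END directive spelling is invented for this sketch.
def extract_tasks(fortran_source: str) -> list[list[str]]:
    tasks, current = [], None
    for line in fortran_source.splitlines():
        head = line.strip().upper()
        if head.startswith("C$TASK"):      # directive opens a parallel task
            current = []
        elif head.startswith("C$END"):     # directive closes the task
            tasks.append(current)
            current = None
        elif current is not None:          # statement belongs to open task
            current.append(line)
    return tasks

src = """C$TASK
      CALL SOLVE(A, N)
C$END
C$TASK
      CALL INTEG(F, M)
C$END"""
print(len(extract_tasks(src)))  # -> 2 independently schedulable tasks
```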
3. Applications
The aim of the project was to show that at least some algorithms could be executed
in less time by a cluster constructed with standard components. The two-node
cluster was a starting point that could easily be expanded by adding more, not
necessarily identical, nodes.
The physical limitations of the available nodes as well as the architecture of the
cluster limited the classes of problems that could be executed efficiently. Thus a
comparatively low volume of interprocessor data transfers, as well as few
synchronisation points relative to the amount of computational work, was an
advantage.
Problems implemented on the cluster were, for example:
• Partial Differential Equations: One-dimensional heat equation solved by explicit
and implicit difference methods [9] (the explicit scheme is sketched after this list)
• Solution of tridiagonal linear systems [10]
• Numerical integration [11].
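For context, the explicit (forward-time, centred-space) scheme referred to in the first item can be written, in standard notation not reproduced in the paper, as

```latex
u_i^{n+1} = u_i^n + r\left(u_{i+1}^n - 2u_i^n + u_{i-1}^n\right),
\qquad r = \frac{\alpha\,\Delta t}{\Delta x^2} \le \tfrac{1}{2},
```

where every grid point is updated independently from the previous time level, which is exactly the low-communication, low-synchronisation pattern the cluster favoured.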
4. Gain Factor
Several methods for assessing parallel computer performance are available, such as
speedup, cost, etc. These metrics proved insufficient, especially in view of Amdahl’s
Law [4], for comparing the overall time needed to solve a problem on a sequential
processor with that on the MIMD system described above.
The measurement needed was a comparison of overall sequential compute time, Ts,
and overall parallel compute time, Tp. A further aspect was that the optimal sequential
and parallel algorithms may differ substantially. Thus, in the comparisons, the
optimal algorithm for each processing mode (sequential or parallel) was used.
A large number of aspects influence the value of Tp, such as the organisation and
speed of the processors (these need not be identical, thus potentially resulting in a
heterogeneous system), interprocessor communication speed, communications
software design, construction of algorithms, etc. In practice time measurements can
be made to obtain values for Ts and Tp for particular algorithms. This gives a Gain
Factor:
G = (Ts − Tp) / Ts
If 0 < G ≤ 1 parallel processing offers an advantage over sequential processing.
The upper limit, G = 1, is obtained when Tp, the overall time used to solve a
problem with the parallel machine, is zero. When G ≤ 0 parallel computation offers
no advantage. Note that G applies equally well to the performance measurement of
heterogeneous systems; it includes communication and administration overheads
and covers the limitations expressed in Amdahl's Law.
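As a quick worked illustration of the formula (the paper reports only the resulting G values, so the timings below are hypothetical):

```python
def gain_factor(t_seq: float, t_par: float) -> float:
    """Gain factor G = (Ts - Tp) / Ts from measured wall-clock times."""
    return (t_seq - t_par) / t_seq

# Hypothetical measurements: best sequential algorithm 100 s,
# best parallel algorithm on the two-node cluster 55 s.
print(gain_factor(100.0, 55.0))   # 0.45 -> parallel execution pays off
print(gain_factor(100.0, 120.0))  # -0.2 -> overheads dominate (G <= 0)
```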
Results obtained for a number of test cases using the two node cluster are [12]:
• Solution of tridiagonal linear systems, 120×120: G = 0.42
• One-dimensional diffusion equation, 30,000 time steps: G = 0.481
• Numerical integration, 30,000 steps: G = 0.497.
With a two node cluster Tp ≥ Ts/2, so the value of G ≤ 0.5.
These results showed that, at least in some cases, parallel processing using an
MIMD system with distributed memories may offer significant advantages.
5. Conclusions
The results obtained with the simple two-node MIMD parallel system showed that
clusters constructed with standard components can be used to speed up the
execution of parallel algorithms for solving certain classes of problems. These
results prompted further research on the effects of more nodes, different connection
networks and suitable algorithms.
This work resulted in the start of the international Parallel Computing (ParCo)
conference series, with the first conference held in 1983 in West-Berlin. The aim of
these events was to stimulate research and development of all types of parallel
systems, as it was clear from the outset that no single architecture is suitable for
solving all problems.
It took more than a decade for the idea of using standard components to construct
HPC systems to be adopted by industry on a comprehensive scale. It was also only
gradually realised that the flexibility of cluster systems allowed for the processing
of a wide range of compute intensive and/or large scale problems. The resulting
advent of cheaper parallel systems built with commodity hardware led to many
specially designed HPC systems becoming less competitive due to their high price
tags and limited application spectrum. The resulting major crisis in the
supercomputing industry during the late 1980s and early 1990s led to the demise of
many companies supplying specially designed hardware aimed at particular
problem classes.
Exascale computing is presently the next step in HPC and will require extreme
parallelism, employing many thousands or millions of nodes, to achieve its goals.
With the end of Moore's Law approaching, new technologies may emerge to drive
the future development of HPC beyond exascale.
References
[1] Anderson, D. W., Sparacio, F. J., Tomasulo, R. M.: The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling (1967). See: http://home.eng.iastate.edu/~zzhang/courses/cpre585-f04/reading/ibm67-anderson-360.pdf
[2] Schneck, Paul B.: The IBM 360-91. In: Supercomputer Architecture, The Kluwer International Series in Engineering and Computer Science (Parallel Processing and Fifth Generation Computing), Springer, Boston, MA, Vol. 31, 53-98 (1987)
[3] Joubert, G.: Invited Talk, Helmut Schmidt University, Hamburg, January 1976
[4] Amdahl, Gene M.: Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS Conference Proceedings (30): 483–485. doi:10.1145/1465482.1465560 (1967)
[5] Grosch, H.R.J.: High Speed Arithmetic: The Digital Computer as a Research Tool, Journal of the Optical Society of America. 43 (4): 306–310 (1953). doi:10.1364/JOSA.43.000306
[6] Moore, Gordon: Cramming More Components onto Integrated Circuits, Electronics Magazine. 38 (8): 114–117 (1965)
[7] Hoshino, Tsutomu: PAX Computer, Reading, Massachusetts, etc.: Addison Wesley Publishing Company (1989)
[8] Proposed by U. Schendel, Free University of Berlin (1979)
[9] Joubert, G. R., Maeder, A. J.: An MIMD Parallel Computer System, Computer Physics Communications, Amsterdam: North Holland Publishing Company, Vol. 26, 253-257 (1982)
[10] Joubert, Gerhard, Maeder, Anthony: Solution of Differential Equations with a Simple Parallel Computer, International Series on Numerical Mathematics (ISNM), Birkhäuser: Basel, Vol. 68, 137-144 (1982)
[11] Joubert, G. R., Cloete, E.: The Solution of Tridiagonal Linear Systems with an MIMD Parallel Computer, ZAMM Zeitschrift für Angewandte Mathematik und Mechanik, Vol. 65, 4, 383-385 (1985)
[12] Joubert, G. R., Maeder, A. J., Cloete, E.: Performance Measurements of Parallel Numerical Algorithms on a Simple MIMD Computer, Proceedings of the Seventh South African Symposium on Numerical Mathematics, Computer Science Department, University of Natal, Durban, ISBN 0 86980 264 X, 25-36 (1981)