the feast indices-realistic evaluation of modern software
TRANSCRIPT
An lntemabimal Journal
computers & mathematics
PERGAMON Computers and Mathematics with Applications 41 (2001) 1431-1464 www.elsevier.nl/locate/camwa
The FEAST Indices-Realistic Evaluation of Modern Software Components and
Processor Technologies
S. TUREK AND CH. BECKER Institute for Applied Mathematics, LS III, University of Dortmund
Vogelpothsweg 87, D-44227 Dortmund, Germany
tureQfeatflow.de
A. RUNGE Institut fiir Angewandte Mathematik, Universitgt Heidelberg
Im Neuenheimer Feld 294, 69120 Heidelberg, Germany
(Received and accepted March 1999)
Abstract-we examine the computational efficiency of linear algebra components in iterative solvers for grid-oriented simulations of PDEs. While the standard sparse matrix-vector (MV) tech- niques show significant losses of performance, especially on modern processors, our sparse banded components have the potential to exploit today’s high computing power. We explain the major concepts of the FEAST software which contains such highly tuned numerical linear algebra basic components (Sparse Banded Blss) up to complete multigrid solvers, all being optimized with respect to the actual hardware platform. Based on algorithmic and computational studies, we present the FEAST indices which are indicators for the true performance of many modern processors, depend- ing on the underlying FEM space, the problem size and the implementation style. These indices allow a new rating of the various hardware platforms with regard to different mathematical solution strategies, for academic and realistic numerical problems and ranging from ‘low cost’ PCs up to supercomputers. @ 2001 Elsevier Science Ltd. All rights reserved.
Keywords-High performance computing, Benchmarking, Numerical linear algebra, PDEs.
1. MOTIVATION
One current trend in the software development for PDEs, and here especially for finite element
(FEM) approaches, goes clearly towards very sophisticated object-oriented techniques and adap- tive methods in any sense. On the other hand, the employed data and solver structures, and
particularly the ‘matrix structures’ (for an overview, see [1,2]), are mostly chosen in a somewhat
old-fashioned way-as ‘globally defined’ types-which neglect the very specific performance fa-
cilities of modern hardware platforms. As a result, the observed computational efficiency is often
far from the expected peak rates of (potentially available) one GFLOP/s nowadays, and the ‘real
life’ gap is even further increasing if one extrapolates current hardware developments (see also
the papers of Rude, for instance, [3]).
0898-1221/01/$ - see front matter @ 2001 Elsevier Science Ltd. All rights reserved. Typeset by A@-TEX PII: SO898-1221(01)00108-0
1432 S. TUREK et al.
Today, high performance calculations in this field seem to be reachable only by explicitly ex-
ploiting ‘caching in’ and ‘pipelining’ in combination with sequentially stored arrays (see [4]).
While the realization of such techniques tends to be easier for finite difference approaches, it is
far from being obvious how to perform similar approaches for much more complex finite element
codes. These discrepancies, between numerics and software concepts and the available hardware,
often lead to unreasonable calculation times for ‘real world’ problems, e.g., (nonstationary) CFD
calculations in 3D, as can be easily seen from recent benchmark comparisons for commercial as
well as research codes (for instance, see [5,6]). H ence, strategies for massive efficiency enhance-
ment are necessary, not only from the mathematical (algorithms, discretizations) but also from
the software side of view. To realize some of these aims, our new FEM package FEAST (Finite
Element Analysis & Solution Tools) is under development (see [7,8]).
In this paper, we concentrate on the aspect of testing !inear algebra routines for the stiffness
matrices resulting from FEM discretizations. These are the most important components in the
corresponding multigrid framework which is mainly responsible for the total efficiency of the
complete simulation. First of all, we demonstrate the performance of typical software tools based
on standard sparse MV techniques, before we explain our machine-dependent Sparse Banded
Bias approaches [9] for achieving strong performance improvements. Consequently, these tech-
niques are the basis for our recent evaluation of software components and modern processor tech-
nologies which are collected in several types of FEAST indices. These explain, for instance, why
many codes run faster on ‘low cost’ PCs than on single processors of supercomputers. However,
they also demonstrate the (hidden) potential of supercomputing power on modern (workstation)
processors as employed by DEC, IBM, HP, SUN, and SGI.
2. TYPICAL EXAMPLES FOR PERFORMANCE LOSSES
One of the main components in iterative solvers-Krylov-space methods or multigrid-are
matrix-vector (MV) applications. They are needed for defect calculations, smoothing, step-
length control, etc., and often they consume SO-SO% of CPU time. Hereby, sparse MV con-
cepts are the standard techniques in FEM codes (and others), also well known as the ‘compact
storage’ technique: depending on the programming language, the matrix entries plus index ar-
rays/lists/pointers are stored as long arrays or heaps, containing the ‘nonzero elements’ only.
For an overview on applied techniques, see for instance SPARSKIT [2] and the literature cited
therein. While this sparse approach can be applied for general meshes and arbitrary numberings
of the unknowns, no explicit advantage of (possible) highly structured parts can be exploited.
Consequently, a massive loss of performance with respect to the possible peak rates may be
expected since-at least for large problems with more than 100,000 unknowns-no ‘caching in’
and ‘pipelining’ can be exploited such that the higher cost of memory access will dominate the
resulting MFLOP/s rates.
To demonstrate this failure, we start with examples from our FEATFLOW code [6] which seems
to be one of the most efficient simulation tools for incompressible flow on general domains (see the
results in [5]). We apply FEATFLOW to the following “2D flow around a car” configuration, and
we measure the resulting MFLOP/ s rates for the MV multiplication inside of the multigrid solver
for the momentum equation (see [6] for mathematical and algorithmic details). Here, we use the
typical (for FEM approaches) ‘two-level’ (TL) numbering (old vertices preserve their numbers if
the mesh is refined), a version of the bandwidth-minimizing Cuthill-McKee (CM) algorithm and
an arbitrary ‘stochastic’ numbering.
All numbering strategies have in common that, based on the standard sparse ‘compact storage
rowwise’ (CSR) technique (see [2]), the cost for arithmetic operations, for storage and for the
number of memory accesses are identical. However, the resulting timings in FORTRAN77 on
a SUN ENTERPRISE E450 (about 250 MFLOP/s peak performance) can be very different, as
Table 1 shows.
FEAST Indices-Realistic Evaluation 1433
Figure 1. Coarse mesh for ‘flow around a car’
Table 1. MFLOP/s rates of sparse MV multiplication for different numberings.
Computer #Unknowns TL CM Stochastic
13,688 20 22 19
SUN E450 54,256 15 17 13
(- 250 MFLOP/s) 216,032 14 16 6
(CSR) 862,144 15 16 4
These and numerous similar results (see also Section 6) can be concluded by the following
statements which are quite representative for many other numerical simulation tools.
1. Different numbering strategies can lead to identical numerical results and work (w.r.t.
arithmetic operations and memory access), but at the same to huge differences in elapsed
CPU time.
2. Sparse MV techniques are slow (with respect to possible peak rates) and depend massively
on the problem size and the amount and kind (?) of memory access.
3. In contrast to the mathematical theory, most multigrid implementations will not show a
realistic run-time behaviour which is directly proportional to the mesh level.
To shed more light into this ‘strange’ behaviour, we examine more carefully the computer
performance for second-order scalar PDEs on tensorproduct meshes with trilinear FEM spaces
(M matrices with bandwidth 27). Again, we apply different numbering strategies in the sparse
CSR format; however, the numerical work and the results of the MV operations are identical for
all cases.
Table 2. MFLOP/s rates of sparse MV multiplication on tensorproduct meshes.
Computer #Unknowns Rowwise Two-Level Stochastic
IBM RS6000/597 173 86 81 81
(160 MHz) 333 81 55 16
‘PBSC’ 653 81 14 8
IBM RS6000/590 173 42 42 42
(66 MHz) 333 41 39 27
‘POWER2’ 653 41 17 7
DEC/21164 173 54 31 29
(433 MHz) 333 51 16 10
‘CRAY T3E’ 653 49 13 8
INTEL PENTIUM II 173 30 28 28
(400 MHz) 333 30 26 24
‘ALDI PC’ 653 30 23 19
1434 S. TUREK et al.
Based on such studies,’ we can characterize more precisely the
haviour.
computational run-time be-
l The MFLOP/s rates are far away from the announced peak performance.
l They depend significantly on the size of the problem and the l&d of memory access.
l ‘Old’ processors (IBM 590) can be even faster (!) than ‘new’ ones (IBM 597).
l ‘Supermarket’ PCs can be significantly faster than processors in ‘supercomputers’.
Table 3, which now in contrast is based on the highly structured MV techniques in FEAST,
shows that the same application can (!) be performed much faster: we additionally exploit vet-
torization facilities and data locality. Additionally, we can even further differ between the case
of variable matrix entries (‘var’) and constant bands (‘const’) as typical for Poisson-like PDEs.
Table 3. MFLOP/s rates of Sparse Banded Blss [9] MV multiplication on tensor- product meshes.
Computer #Unknowns Var Const Computer Var Const
IBM RS6000/597 173 188 480 IBM RS6000/590 102 195
(160 MHz) 333 172 393 (66 MHz) 94 175
‘PPSC’ 653 176 390 ‘POWERZ’ 94 176
DEC/21164 173 103 404 INTEL PENTIUM II 51 180
(433 MHz) 333 101 313 (400 MHz) 51 137
‘CRAY T3E’ 653 101 268 ‘ALDI PC’ 48 124
Such differences in performance are caused by the fact that modern processors are already
sensible supercomputers with respect to ‘caching in’ and ‘pipelining’ which has been explicitly
exploited by the Sparse Banded Blas-based MV techniques. However, the following examples
show that also the kind of arithmetic work has to be considered (division-free numerical linear
algebra) and that a precise knowledge of the cache architectures is necessary.
Table 4. MFLOP/s rates for tridiagonal preconditioners in ‘classical’ formulation, compared with the Sparse Banded Bias versions from FEAST for ‘variable’ and ‘con- stant’ matrix entries. The left table shows the results obtained from traditional im- plementations with division, while the right table gives the rates without division. It is obvious that divisions in an algorithm can significantly degradate the performance if compared with pure addition/multiplication-based algorithms.
To make this point clear: the use of tensorproduct meshes does not (!!!) automatically guar-
antee higher performance; all proposed optimization strategies with respect to data structures,
algorithmic redesign, programming language, cache architectures, and memory management must
be considered.
Table 5 shows the expected development of the (actual) processor technology and demonstrates
that such ‘machine-oriented’ algorithmic and implementation techniques will be absolutely nec-
essary: the explicit handling of data locality, internal parallelism and vectorization, in contrast
to minimizing the memory access at the same time, will be the key techniques for the next years.
‘ALDI is a supermarket in Germany which is well known for its cheap prices but also high-quality goods. Based originally on household and food stuffs only, ALDI has offered a typical PENTIUM 11/400MHz computer which has been sold out due to its cheap price and particularly ALDI’s good reputation: about 250,000 pieces in two weeks.
FEAST Indices-Realistic Evaluation 1435
160
140
120
100
60
60
40
20
0’ 0
I
20000 d I I I I I 1
40000 60000 60000 100000 120000 POS(Y,N) for N=4225
20 1 I I I I I I I I I I 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500
POS(Y,N) for N=1050625
Figure 2. MFLOP/s rates of DAXPY operations based on machine-optimized perfor- mance libraries. We only vary the difference POS in the relative position to each other of both vectors to be added. The results for the SUN E450 (top, 1 MByte LZ-cache) are representative for processors with l-fold associative cache architectures. The actual performance of such ‘simple’ numerical linear algebra tasks of BLASl-type depends massively on the relative position POS and varies between 160 and 3 (!!!) MFLOP/s although both small vectors fit completely into cache. The bottom figure shows similar losses of performance for the CRAY T3E, here for long vectors. These examples demonstrate that such failures can be only avoided by a hardware-specific and user-defined memory management.
Table 5. Excerpts from the ‘1997 Semiconductor Roadmap’ [lo].
Year of First Shipment
Local clock (MHz)
Transistors/chip
1997 1999 2001 2003 2006 2009 2012
750 1250 1500 2100 3500 6000 10K
11M 21M 40M 76M 200M 520M 1.48
Then, it might get really possible to exploit more adequately the performance of future pro-
cessors which tend to be even more powerful-as single processors with up to 1 TFLOP/s-than the complete CRAY T3E today. On the other hand, neglecting these features may not only
lead to nonsignificant performance acceleration, but even performance degradation is possible as
the previous tables have shown (compare the ‘old’ IBM 590 with the ‘new’ IBM 597). At the
moment, most of today’s sparse implementations (see also Section 6) would favorize the use of
‘cheap’ INTEL-based computers only. This development might lead to an economic (for certain companies) and much more to a scientific disaster as the performance rates will show.
1436 S. TUREK et al.
3. DESCRIPTION OF THE FEAST PROJECT
FEAST is based on the following concepts (see [7,8] for details) which shall enable the combina-
tion of such highly tuned linear algebra tools with very sophisticated FEM simulation strategies:
l consequent application of (recursive) ‘divide and conquer’ strategies,
l hierarchical data and solver structures, but also hierarchical (!) ‘matrix structures’,
l frequent use of machine-optimized low-level numerical linear algebra routines.
The result shall be a fiexible FEM package for many ‘real life’ problems with special emphasis
on:
l (closer to) peak performance on modern processors,
l typical multigrid behaviour (with respect to efficiency and robustness),
l parallelization and vectorization directly included on ‘low level’,
l open for different adaptivity concepts and a posterior? error control.
In contrast to many other approaches which aim to develop software for research or education
topics, our approach is clearly designed for high performance applications with industrial back-
ground, especially in CFD. Consequently, our main emphasis lies on the aspects ‘efficiency’ and
‘robustness’ and less on topics as ‘easy implementable’ or ‘most modern programming language’.
Therefore, FORTRAN77/90 is used such that (for us absolutely necessary) the transparent access
to the data structures is possible. Further, this makes it possible to adopt many reliable parts
of the predecessor packages FEAT2D, FEATSD, and FEATFLOW. One of the most important
principles in FEAST is the consequent application of divide and conquer strategies. The solution
of a ‘global’ problem is recursively split into smaller independent subproblems on ‘patches’ as
part of the complete set of unknowns. There are two major aims in this splitting procedure which
can be performed by hand or via self-adaptive strategies:
l find and exploit locally structured parts,
l find and hide locally anisotropic parts.
While on such ‘anisotropic’ parts the usual sparse techniques will be applied, we try to exploit
the much higher performance on the other-highly structured-patches. Consequently, the in-
tention is to minimize the number of ‘sparse areas’ and to apply preferably all numerical linear
algebra tasks on such ‘structured patches’. Then, the major three tasks for realizing such a
simulation tool are:
l the design of the ‘skeleton’ for the recursive splitting into local/global levels,
l the implementation of the typical FEM facilities on the ‘low-level’ patches,
l the development of ‘reference element solvers’ on the ‘low-level’ patches.
The corresponding data, solver, and matrix structures are described in the papers [7,8] and
particularly in [ll]. The aim of this work is a (careful) description of the third task: optimization
of the ‘reference element solvers’ and the corresponding numerical linear algebra tools on the ‘low-
level’ patches. In our context, these are quadrilaterals (2D), respectively, hexaeders (3D) which
are discretized with (logically equivalent) tensorproduct meshes. That means, we can (!) apply
linewise or rowwise numbering, but the local mesh size may vary arbitrarily. This optimization
procedure is split into two tasks:
l the manual task of algorithmic design and corresponding implementation w.r.t. optimal
MFLOP/s rates,
l the numerical task to derive ‘optimal’ multigrid convergence with respect to efficiency and
robustness.
While the numerical task, the optimization of multigrid convergence rates for different FEM
spaces, problem sizes, (PDE) problem types and mesh topologies (but on tensorproduct meshes),
is currently examined in [12], we show recent results of the ‘MFLOP/s optimization’ in this paper
FEAST Indices--Realistic Evaluation 1437
(see also [9] for the technical details of the Sparse Banded Blas. To be precise, we provide the
reader with:
1. ‘optimal’ MFLOP/ s rates for different numerical linear algebra tasks on generalized ten-
sorproduct meshes (for bilinear and trilinear FEM),
2. evaluation of many modern processors with respect to the ‘real’ performance of such
different basic tasks up to complete multigrid algorithms.
4. DESCRIPTION OF THE NUMERICAL LINEAR ALGEBRA COMPONENTS
All following ratings and the test software (ptest . f) are part of the internet and can be
downloaded from our homepage. In this paper, we publish the first version of more or less
complete results for (almost) all modern hardware platforms. Since everybody has access to the
data, everybody is invited to check new processors or different software environments to look for
improved FEAST indices. Our hope is that this (permanent) process of testing and rating of new
hardware components gets into an automatic loop if sufficiently many ‘test persons’ participate.
On the one hand, some ‘pressure’ on the vendors is generated, for instance, to provide the users
with optimal compiler options. On the other hand, there is a fair ‘competition’ possible between
the various hardware configurations. So, the user and also ‘client’ of such products has the chance
to compare the different test environments with regard to his own applications and with respect
to the true cost/performance relation. To provide this kind of information is definitely one of the
aims of the subsequent FEAST indices.
We will not explain the underlying tests in all details, and we will not show the results of all
comparisons: these are part of the diploma thesis of [13], respectively, see also [9] and our home-
page. Therefore, in short terms only, the following linear algebra tasks are the basic components
of the FEAST indices.
4.1. DAXPY-Like Applications
Beside the (standard) linear combination DAXPY, we also apply the (variable) linear combination
DAXPYV
y(i) = cr(i)z(i) + y(i)
which is important in banded MV multiplications on vector computers. Additionally, we check
the (indexed) variant DAXPYI
y(i) = Q(i)Z(j(i)) + y(i).
Here, the scaling factor cr is a vector acting on the components of the vector z which depend
on the index i via an index vector j(i). We apply two tests with different index vectors j(i)
which simulate moderate and stochastic jumps in the numberings. These tests are quite good
representations for the complete sparse MV applications.
The arithmetic work count for all DAXPY-like variants is defined as 2 x N, with N the number
of vector components, such that the corresponding MFLOP/s rates are determined via
2xN
CPUTIME x 10G’
4.2. Variants of Mv Multiplications
Assuming a (generalized) tensorproduct mesh with A4 vertices in each space dimension, the
resulting number of unknowns is M2 (2D), respectively, 1M3 (3D) f i we consider vertex-oriented
discretizations: here conforming bilinear, respectively, trilinear FEM. Assuming the typical 9-
point, respectively, 27-point stencil for the corresponding matrices, the resulting storage cost and
hence, the measure for the MFLOP/s rates are identical for all following techniques, independent
1438 S. TUREK et al.
of the kind of MV multiplication. The test program ptest .f examines eight different basic
implementations (plus various blocking techniques, see [9] for the technical details) which all
are designed to exploit potentially ‘caching in’ and ‘vectorization’ facilities. The MV-V results
correspond to the case of arbitrary matrix entries, while MV-C represents the case of constant
band entries as typical for Poisson-like problems on (in each ‘local direction’) equidistant meshes.
Additionally, we perform measurements for a corresponding sparse MV application in the CSR
format as described above. We examine three variants of numberings: linewise numbering but
nevertheless indexed access (SPARSE), ‘two-level’ numbering (FEAT) as typical for semiadaptive
FEM simulations without local adaptivity, and finally ‘stochastic’ numbering (ADAP) of the un-
knowns being representative for fully adaptive approaches. To make this point clear: all MV
applications are performed for the same matrix. We only vary the storage and access techniques,
hereby exploiting the tensorproduct structure or not, taking into account the case of constant
entries or not. However, in all cases we define the work count to measure the MFLOP/s rates as
18 x N (in 2D),
54 x N
CPUTIME x lo6 respectively,
CPUTIME x lo6 (in 3D)
Moreover, we test the performance of matrix-vector multiplications with tridiagonal matrices
which are basic tools for certain preconditioners as the ‘linewise GS’ schemes below. Again, we
check in ptest . f the MFLOP/ s rates for variable and constant entries, and they are determined
via 6xN
CPU TIME x 106.
4.3. Tridigonal-Based Preconditioners
Assuming tensorproduct meshes, the application of ‘inverse’ tridiagonal matrices as precon-
ditioners (TRIS) can be easily performed and provides rather good convergence properties with
respect to mesh anisotropies (see [12]). We apply the ‘division-free’ variants from the previous
section, with various blocking strategies, again for variable and constant matrix entries. The
MFLOP/s rates are defined as 5xN
CPU TIME x 10” ’
Additionally, this tridiagonal preconditioner can be combined with the described tridiagonal
MV multiplications to work as ‘linewise Gau&Seidel’ (TRIGS) or ‘linewise ADI’ preconditioners
(see [9]). Taking into account the convergence studies in [12], these schemes are our actual
favorites as multigrid smoothers with regard to numerical and computational efficiency. The
MFLOP/s rates are defined as
11 x N (in 2D),
29 x N
CPUTIME x lo6 respectively,
CPUTIME x lo6 (in 3D).
4.4. Smoothers in Multigrid
Based on the previously described DAXPY-like operations, the MV multiplications and the pro-
posed tridiagonal preconditioners, we can determine the MFLOP/s rates of the corresponding
smoothing operators which all are written and implemented in the following general notation:
Here, A and C are matrices in RNxN, with C being the preconditioner, and x1, xlfl, b are
N-dimensional vectors. The parameter w is an arbitrary relaxation factor, while the indices 1
and 1+ 1 are the usual counters in iterative procedures. Our candidates for the preconditioner C are:
l C = diag(A) corresponds to Jacobi iteration (x DAXPYV),
.
l
FEAST Indices--Realistic Evaluation 1439
C = ‘IRIS(A) corresponds to tridiagonal preconditioners, respectively, linewise variants of
Jacobi, C = TRIGS(A) corresponds to the ‘lower+tridiagonal’ preconditioner, respectively, line- wise Gaul%Seidel.
There are several reasons why we explicitly use and optimize this form of the basic iteration
which
1.
2.
3.
is in contrast to many ‘red-black’ or other ‘multicolouring’ approaches.
This general form allows the independent splitting into the three tasks MV multiplication, preconditioning, and linear combination, which all have been optimized with respect to ‘caching in’ and ‘pipelining’. The explicit use of the complete defect Ad - b is advantageous in certain techniques for implementing complicated or ‘moving’ boundary conditions (see [6]). All components in standard multigrid, i.e., smoothing, defect calculation, step-length
control, grid transfer, are included in this basic iteration.
As an example, the MFLOP/ s rates for ‘linewise GS’ smoothing S-TRIGS(N) which consists of
DAXPY, MV, and TRIGS (taking into account the case of variable (V) or constant (C) entries) are calculated separately on each mesh level, corresponding to problem size N. They read in 3D
MFLOPs-TRIGS(N) := (54 + 29 + 2) x N
54xN 29xN 2xN MFLOPMV(N) + MFLOPTRIGS(N) + MFLOPDAXPY(N) >
x 106
In an analogous way, the corresponding MFLOP/ s rates for Jacobi or tridiagonal smoothing are estimated, based individually on the previously calculated rates in 2D and 3D with respect to the underlying numerical linear algebra components.
4.5. Complete Multigrid Cycle
Finally, we can (recursively) determine the MFLOP/ s rates for a standard multigrid cycle, consisting of m total smoothing steps (including pre- and postsmoothing steps; here: m = 2), defect calculation, grid transfer and coarse grid approximation. As an example, the actual rates for ‘line GS’ smoothing on level 1 with N(1) unknowns read for the corresponding multigrid variant M-TRIGS(N(1) > in 3D
MFLOPM-TRIGS(N(~)) :=
1.5 x (85m + 54 + 18) x N(1) 85mx N(1) 54xN(Z) 9x2xN(1) (85m+54+18) x N(I)
MFLOPS-TRIGS(N(~)) + MFLOPMV(N(~) + MFLOPDAXPY(N(I)) + o.5 ' MFLOPMVI-TRIGS(N(I-I)) x 10s
The factor 1.5 is achieved from the recursion through the applied W-cycle and can be analo- gously determined for other cycles and the 2D case (see [13] for the details).
4.6. Measurements of Classical Sparse Applications
In 3D, we have directly included a sparse MV multiplication for the same matrix as before, which results from a trilinear FEM discretization of a scalar Poisson-like problem and which leads to a (maximum) matrix stencil of 27. Then, the total number of vertices, respectively, unknowns is defined as N := M3. All matrix entries are sequentially stored in a long array (we perform FORTRAN77), and as usual for the CSR format (see [2]) we employ two additional index arrays for accessing the load vector. We examine three variants of numberings: linewise numbering but nevertheless indexed access (SPARSE), a simulated ‘two-level’ numbering (FEAT) as typical for semiadaptive FEM simulations without local adaptivity (which allows maximum differences in the numbering of 0(M2)), and finally a ‘stochastic’ numbering (ADAP) of the unknowns (allowing
1440 S. TUREK et al.
jumps in the numbering of two neighboured vertices of order O(M3)) which is representative for fully adaptive approaches. Then, in all cases we define the MFLOP/s rates as before:
54 x N
CPU TIME x 10” ’
Unfortunately, we have missed this direct inclusion of sparse MV application in 2D (now we cannot change anymore) such that we have to simulate the analogous behaviour for the sparse
(CSR) MV multiplication via the indexed DAXPY routines DAXPYV and DAXPYI. As in the 3D case, we allow for the index vector that jumps of order O(M) (remember: N := M2 in 2D) or
O(M2) can occur which compare well with the FEAT, respectively, the ADAP results from the 3D case. These 2D tests correspond to our older ‘Elch Tests’ which have been described in [14].
5. THE FEAST INDICES
Before we explain more in detail all auxiliary indices and how to compute them, we present already the final result: the actual version of the ‘total’ FEAST index. They show our ‘total’ evaluation of most of the available processors and allow different specific rankings with respect to the explained implementation techniques and applied data structures for standard conforming FEM. Most of the notation in the following tables and figures should be self-explainable; if not
so, take a look at [13].
q 21264 WSMAP
a PWR3 (200)
121294WS(500)
PWR2bz.97 (160)
PWFw397 (160)
ORIQIN 2ooo
FEAT- Index
ULTRA30 (339)
HP C240 (233)
PWw590 (69)
n CRAY T3E (433)
w ORIGIN 2UO
n ULTRA90 (300)
rn21164WS(4cq
q 21134 PC (533)
n 21154WS(500)
q HPPA3WO
q ULTRA10 (333)
ULTRA E450 (230)
n PWReFm
n ULTRA1 (140)
n P6ntim II (253)
n R-9(75) q PentiumPro (200)
n HP735
Figure 3. The ‘total’ FEAST index: the graphical and table-based presentation of the specific 3D and 2D values is given in the Appendix. The performance in 2D is smaller than in 3D, due to the more compact matrices (see the Conclusions). Keep in mind that these are averaged MFLOP/s rates only, which may significantly differ from the results for specific problem sizes.
The first term determines the kind of processor, followed by some additional denotations: PWR for IBM POWER2 or POWER3 processors, model 590 or whatever; 21264 or 21164 denotes the ALPHA chip in DECs, with or without KAP preprocessor, being a workstation model (WS), a PC (under LINUX, but using the executable code from the workstation), or a PC under LINUX with the EGCS F77 compiler; ULTRA stands for SUN computers; ORIGIN, respectively, RBOOO and RlOOOO, denote SGIs; PI1 and PPro denote Pentium II and Pentium Pro architecture; PPC/F50 denotes the PowerPC version F50 by IBM. In brackets, further details about the actual clock rate are given. More information about the precise definition of the tested computers can be obtained from [13], or look at our homepage.
In the following sections, we will discuss only some of the ‘auxiliary’ indices in detail, while a much more complete description of the results in 3D and in 2D is given in the Appendix. In most cases, the 2D results allow the same qualitative and even quantitative conclusions. Only
FEAST Indices--Realistic Evaluation 1441
in such cases that significant differences between 2D and 3D results occur, we explicitly state
the corresponding results. Additionally, keep in mind that the indices, or better the ranking
of computers, will change since new processors or different configurations (compiler, operating
system, etc.) will be added. Therefore, check out the given internet address. This webpage
contains also a full-color presentation of the results which may be helpful in better viewing the
indices. All calculations for the evaluation of the included MFLOP/s rates are performed for several
levels 1 of (global) mesh refinement. We set for the number N(1) of unknowns
N(1) = 652, N(2) = 12g2, N(3) = 2572, N(4) = 5132, N(5) = 10252, (in 2D),
N(1) = 173, N(2) = 333, N(3) = 653, (in 30).
Higher refinement levels in 2D are somewhat unrealistic. Furthermore, the storage costs are
already far beyond typical cache sizes such that the effect of further increasing the problem size
can be easily extrapolated. In 3D, one may state that level 1 = 3 with N(3) = 653 = 274,625
unknowns does not seem to correspond to a ‘fine’ mesh width, but one has to take into account
that on this level 3, the storage of the stiffness matrix (bandwidth 27) plus index arrays for
the sparse techniques consumes already (almost) 128 MByte of RAM. This example shows that
we discuss not only differences with respect to computing efficiency but also with regard to
storage cost. Based on the MFLOP/s rates for the basic and composite numerical linear algebra
components from the previous section, we can define the following (auxiliary) indices which are
part of the shown FEAST indices.
5.1. The Sparse FEAT and ADAP Indices
These indices are measures for the computational efficiency with respect to classical sparse
techniques, and they are based on the previously defined MFLOP/ s rates for different numberings
in the sparse (CSR) MV multiplication. As typical for all following indices, the rates from the
different levels 1 are weighted by special factors c(l). These are for the FEAT index in 2D and in
3D:
c(5) = SO%, c(4) = 25%, c(3) = lo%, c(2) = 4%, ~(1) = I%,
respectively, c(3) = SO%, c(2) = 16%, ~(1) = 4%,
and for the ADAP index:
c(5) = 40%, c(4) = 30%, c(3) = 15%, c(2) = IO%, c(l) = 5%
respectively, c(3) = 75%, C(2) = 20%, C(l) = 5%.
While the definition of both indices in 3D is straightforward, we have to do some modifications
in the 2D case as explained before. To be precise, the resulting total FEAT index in 2D is
calculated as an average between the DAXPYV and DAXPYI values (with ‘moderate’ jumps of order
O(M)), and in 3D between the SPARSE and FEAT values. Analogously, the ADAP index in 2D
is the average between both DAXPYI values (with the two kinds of allowed jumps in numberings)
and in 3D between the FEAT and ADAP values. All these values are summed up with the special weighting factors accordingly to each mesh level.
The corresponding results of the FEAT and ADAP indices can be found in the Appendix. The
computations demonstrate the dependence on the problem size and the kind of indexed access due
the described numbering strategies: the resulting MFLOP/s rates are quite slow compared with
the following results for performing the Sparse Banded B&-like MV multiplication with the same (!!!) matrix. Beside DEC’s and IBM’s top models, the SGIs and particularly the PENTIUMs
1442 S. TUREK et al.
lead to (relatively) good results. However, we compare 20 MFLOP/s as best values with less
than 10 MFLOP/s for many other processors. In contrast, POWER2 models (IBM) and older
DEC’s (21164, CRAY T3E) deteriorate on the finest level 3 to about 10 MFLOP/s only. This
shows impressively that some of our ‘preferred’ machines may have, more or less, problems with
the described sparse techniques since the underlying cache architectures lead to massive cache
misses and hence, to performance losses. Further increasing of the number of unknowns can lead
to even still slower performance results.
5.2. The DAXPY Index
For calculating the DAXPY index, the DAXPY rates are averaged over all mesh levels. The
corresponding weights for each mesh level are differently defined, in 2D as well as in 3D:
c(5) = 70%, c(4) = 20%, c(3) = S%, c(2) = 3%, c(l) = I%,
respectively, c(3) = 82%, c(2) = 15%, c(l) = 3%.
5.3. The FEAST-mv-v and FEAST-mv-c Indices
These MFLOP/s rates are based on our optimized Sparse Banded Blas software [9] and have
to be compared with the previous sparse FEAT and ADAP indices. FEAST-mv-v denotes the
results with variable matrix entries, while FEAST-mv-c measures the even higher MFLOP/s
rates for the (special) case of constant band entries (see the explanations in the previous section).
Like the DAXPY index, they are also part of the subsequent FEAST-v and FEAST-c indices
(see later). The corresponding weighting factors c(i) are identical as for the described DAXPY
index.
Figure 4 shows the corresponding 3D results for the different matrix-vector multiplications
(MV-C, MV-V, FEAT, ADAP) on the finest mesh level 3: performance differences of almost
a factor of 50 get visible for certain computers as the IBM PWR2s. While modern worksta-
tion processors show a huge potential of supercomputing power for ‘structured’ data, they lose
for ‘unstructured’ data in combination with sparse MV techniques, particularly compared with
PENTIUMs. The complete presentation of these indices is part of the Appendix.
5.4. The FEAST-mglgs-v and FEAST-mglgs-c Indices
These MFLOP/s rates estimate the resulting performance of one complete multigrid sweep with
one pre- and one postsmoothing step. The examined variants in ptest .f are FEAST-mgjac-v
and FEAST-m&c-c for Jacobi smoothing, FEAST-mgtri-v and FEAST-mgtri-c for tridiagonal
smoothing, and our preferred combination FEAST-mglgs-v and FEAST-mglgs-c with the de-
scribed ‘line GauBSeidel (TRIGS) smoother: they are the most essential part of the subsequent
FEAST-v and FEAST-c Indices. The corresponding weighting factors c(i) are again identical as
for both previous indices.
Figure 5 shows the corresponding results for complete Sparse Banded Bias multigrid algorithms
which are optimized with respect to numerical efficiency and robustness (see [12]). They are
recently our ‘best’ multigrid work horses on such tensorproduct meshes. In the case of ‘constant’
bands in the matrices (see the results in the Appendix), the results further improve significantly.
5.5. The ‘Auxiliary’ FEAST-v and FEAST-c Indices
These indices measure the resulting performance of our Sparse Banded Bias tools if they are
purely applied on meshes which consist of macros (M quadrilaterals/hexaeders) only and which
all are discretized via the described generalized tensorproduct meshes. For recent examples and
FEAST Indices--Realistic Evaluation 1443
II 21264 WYKAP
q 21264 ws (500)
n PWR3 (200)
~ H PWR2/590 (66)
q ORIGIN 200
n 21164wS(400)
n 21164pC(533)
,3121v34wsp0)
IHPPA6OW
21164 Linux (533)
n ULTRAI (140)
rnpwRneo
n Pm(ium II (266)
n ~6000 (75)
Pd!umPro (200)
HP 735
21264 WS/KAP
I21264 ws @Jo)
PWR2/397 (160)
/I. “I ULTRA60 (360)
n ORIGIN 200
n 2116.4 ws (400) q 21164 PC (533)
IHPPA~wo ULTRA60 (300) ~ ULTRA10 (333) ~ Rloooo (195)
_, ULTRA E450 (250)
ULTRA1 (200)
21164 Linux (533)
PenthI II (400)
n umbv (140)
n ~~~560
n pMtkm II (266)
q RBOOO (75) PsntlumPro (200)
HP 735
/ q 21264 WS/KAP
~ q 21264 ws (500)
r..J ULTRA60 (36.0)
1 i:j HP C240 (236)
FEAT - Level 3 3D
q pwR2/590(86)
n ORIGIN 200
n 21164ws(400) ,m2i154Pc(533) /~PllMWS(500)
IHPPA6000
ULTRA60 (300)
ULTRA10 (333)
Rloooo (195)
ULTRA E450 (250)
21164 Llnux (533)
q Pentturn II (400)
q ULTRAI (140)
n ~~~560
n psntium II (266)
q ~6000 (75)
PentiumPro (200)
HP 735
I@( 21264 WS/KAP
E# ORIGIN 2000
i ! ULTRA60 (360)
HP C240 (236)
CAAY T3E (433)
n PWR~/~SO (66)
li ORIGIN 200
~h21164 WS (400)
lm21164PC(533)
!1121164WS(500)
q HP PA6000
ULTRA60 (300)
ULTRA10 (333)
FM0000 (196)
E-2 ULTRA E450 (250)
ULTRA1 (200)
21164 Linux (533)
Psnthnm II (400)
q ULTRA1 (140)
n PWRI560
!i Pentium II (266)
q ~6000 (75)
n PentiumPro (200)
HP 735
Figure 4. The figures show the MFLOP/s rates for the discussed MV multiplications (MV-C, MV-V, FEAT, ADAP) on mesh level 3 in 3D. There are huge differences possible of up to a factor of 50 if d@erent MV techniques are applied to the same matrix.
1444 S. TUREK et al.
n 21284 WS (500) q ULTRA0 (300)
_-- _~
I w PWR2K.90 (66) n PWwM)O
YGLGSN - Level 2 3D
, ~---__ g 21264 WSIKAP
q 21254 ws (500)
n PWRJ (200)
PwR2/397 (160)
/rnpWRV5so(66) q ORIGIN 200
n 2llS4WS(400)
I~21164Pc(533)
n HPPABWO
ULTRA60 (300)
q ULTRA10 (333)
RIO000 (185)
13 ULTRA E450 (2501
ULTRA1 (200)
21154UnuX(533)
I Pentium II (400)
n ULTRA1 (140)
n PWMSO
q Pennurn II (265)
IRsooO(75) I PentiumPro (200)
EHP735
q 21264 WSMAP
* 21264 ws (500)
I PWRI (200)
~121164w5(400)
/R2llfMPC(533)
I !! ?!!!!5L%_
ULTRA1 (200)
21154Linux(533) i I %mJm II wo) W ULTRA1 (140)
)
n wwsao ;
q PenuMl ” (2w ~ III-cm / n PenuumPro (200)
&HP735
Figure 5. FEAST-mglgs-v index in 3D. The first row shows the FEAST-mglgs-v index, followed by the specific results for the M-TRIGS routines on the different mesh levels. The computations show that not only MV multiplications can be efficiently applied, but also complete multigrid algorithms with a very robust smoother inside which works very efficiently for regular as well ss very anisotropic meshes (see [12]). Again, the results are measured MFLOP/s rates.
FEAST Indices--Realistic Evaluation 1445
ii iii.34 WSKAP
-______ ~-~~ mHPPA6006
/m21264ws(500) •~LTRA~OPO~ n PWR3 (200) ULTRA10 (333)
~- 0 PWR2/397 (160) RlOoOa (195)
ULTRA E450 (250)
ULTRA1 (200)
21164 Llnux (533)
Pentium II (400)
q Pentium II (266)
n 21164WS(400) q R6aOO(75)
PentiumPro @Xl)
FEAST/C-total 3D
200
175
150
125
100
75
50
25
0
FEAST/V-total 3D
II 21264 WSKAP
) I21264 ws (500)
w PWR3 (200)
~ i 1 ULTRA66 (360)
;;; HP C24U (236)
q ORIGIN 200
q 21164WS(400)
q 21164 PC (533)
q 21164 WS (500)
q HP PA6006
ULTRA60 (300)
ULTRA10 (333)
A10000 (195)
ULTRA E450 (250)
k$ ULTRA1 (200)
21164 Linux (533)
H Pentium II (400)
I ULTRA1 (140)
n PWFwo
q Pentium II (266)
R6000 (75)
PentiumPro (200)
HP 735
Figure 6. FEAST-v and FEAST-c indices in 3D.
actual test implementations of the FEAST package, see [11,15]. Both indices are averaged val-
ues, that means 50% FEAST-mglgs-v, 25% FEAST-mgtri-v, 10% FEAST-mgjac-v, 10% FEAST-
mv-v, and 5% DAXPY, respectively, the analogous values for the ‘C-versions. Again, each mesh
level is differently weighted via
c(5) = 70%, c(4) = 20%, c(3) = G%, c(2) = 3%, c(l) = l%,
respectively, c(3) = 82%, c(2) = 15%, c( 1) = 3%.
Figure 6 shows the corresponding 3D results which are ‘averaged estimates’ for the processors if
purely applied to the highly structured patches on the low-level parts of FEAST. In the Appendix,
we give a direct comparison between the corresponding 2D and 3D results which demonstrates the
potentially higher MFLOP/s rates in the 3D applications due to the ‘wider’ matrices. Since the
typical bandwidth is 27 instead of 9 in 2D only, the applied cache strategies seem to work better.
Moreover, they also give an impression of the difference in the resulting computing performance
if such tensorproduct strategies-FEAST-c and FEAST-v-can be applied in comparison to the
standard sparse techniques (see the FEAT and ADAP indices). However, these are only averaged
MFLOP/s rates while the actual difference with respect to a special problem size may be even
much greater as the previous examples have shown.
5.6. The ‘Total’ FEAST Indices
The final FEAST indices can be collected which are defined as follows. The corresponding
‘total’ FEAST index-as average of both 2D and 3D FEAST indices-had been shown in the
beginning of this section.
FEAST := 60% FEAST-v + 20% FEAST-c + 15% FEAT + 5% ADAP.
1446 S. TUFCEK et al.
6. (EXCERPTS FROM) THE FEATFLOW BENCHMARK
Before we come to the conclusions from the previous FEAST indices, we still have to discuss
the following major question in this context of benchmarking.
How realistic is our evaluation of the processors via the FEAST indices, if we compare
with results from realistic applications, particularly computed with ‘production codes’?
The FEATFLOW Benchmark is such set of test calculations which are very similar to the
mentioned CFD Benchmark ‘flow around a cylinder’ [5]. The interesting aspects are the total
CPU cost for each computer type, and how they are distributed to the separate tasks in a
CFD code, i.e., mesh generation, assembling of matrices and right-hand sides, solving linear
subproblems or postprocessing steps.
The precise configurations are described at our homepage; here we only provide the reader
with the most important details. The complete description with all mathematical details of
discretization and solver can be found in [6]: we apply a coupled multigrid approach with the
so-called Vanka smoother in a direct steady solver (CCSD), and the discrete projection scheme
as nonstationary scheme (PP2D). All calculations are performed with the nonconforming rotated
multilinear finite elements. Additionally, we apply stabilization techniques of upwind (UPW)
or streamline diflusion (SD) type for the convective term, and we use SOR as smoother in the
multigrid solver for the scalar subproblems if applying the projection scheme. The following
abbreviations for the elapsed time in seconds are used, with ‘total’ denoting the complete elapsed
CPU time:
PRO N mesh generation, initialization phase and postprocessing,
LC N modifications of matrices (in sparse storage technique) or right-hand sides,
MAT N CPU time for matrix generations,
C-MG N elapsed time for multigrid in the CCSD test with the Vanka smoother,
U-MG N CPU times for solving velocity, respectively, pressure subproblems via multigrid,
P-MG if the discrete projection scheme PP2D is applied.
Exemplarily, we show the results of some selected computers for the direct stationary approach
CC3D and for the nonstationary scheme PP2D. All tests require about 128 Mbyte of RAM. The
complete results can be found at our homepage in the internet. All ‘FEATFLOW users’ are
invited to participate in this special computer benchmark which is part of the freely-available
FEATFLOW software so that the run of this set of tests can be easily performed by everybody.
Table 6. (Selected) results for test case CC3D-SD.
This Vanka smoother inside of the coupled multigrid solver CCSD (for velocity and pressure simultaneously) is a memory-access intensive process which explains why the tested IBMs, HPs,
and DECs are only slightly faster than the Pentium II PC. The rates are quite in good agreement
with the previous sparse FEAT and ADAP indices and show that for such kind of ‘index-oriented’
FEAST Indices---Realistic Evaluation 1447
techniques, PENTIUM II processors lead to very satisfying results, particularly regarding the
cost/performance relation.
Table 7. (Selected) results for test case PPPD-UPW.
In contrast, PP2D is based on operator-splitting ideas which require the numerous solution
of scalar PDE problems for the velocity, respectively, the pressure components. Since these
kinds of tasks require much more arithmetic operations in comparison to memory accesses, the
processors of type IBM, HP, DEC, and SUN are significantly faster than the Pentium II processor.
However, since this code is implemented without the optimized Sparse Banded Blas MV tools,
we are still far away from the potentially available performance rates from the previous FEAST
indices. These studies indicate how the future generation of CFD solvers has to look like, such
that very sophisticated and powerful numerical methods can be combined with the available high
performance rates on recent processors. For a mathematical and algorithmic discussion of such
concepts, look at [G,16]. Having exclusively such simulation results with sparse techniques, there
would not be any need for other processors than INTELs.
7. CONCLUSIONS FROM THE FEAST INDICES
2D vs. 3D Indices
The results from the 2D tests compared with the analogous measurements in 3D are very
similar: the IBM processors and the new DEC 21264 are the best, with respect to the total ranking
and in most cases with regard to the specific auxiliary indices, too. However, the corresponding
MFLOP/s rates are in 3D somewhat higher, at least for processors with ‘large’ (leve12) caches.
The explanation might be that the applied matrices are less ‘sparse’, with 27 instead of nine
bands, such that cache-blocking strategies can be better optimized. On the other hand, computers
‘without’ large second level (L2) cache, as for instance the IBM PWR2s and probably pure vector
computers, do not improve.
As a consequence, these IBM workstations based on the PWR2 and the P2SC processors
might be favorable for codes which have not yet incorporated such cache-based optimization
strategies. On the other hand, also for certain nonconforming FEM approximations or cell-
centered discretizations which generally lead to even more sparse matrices (with 5- or 7-point
stencils only), these kinds of processors may be the ‘winners’. In contrast, for higher-order
discretizations as bi- or triquadratic (FEM) discretizations, the cache-oriented processors will
even more improve.
‘Optimality’ of the Indices
The following results have been directly provided by the vendors such that we assume that ‘optimal’ compiler options and hardware components have been utilized:
l DEC 21264 and DEC 21164 workstation with 400 MHz,
1448 S. TUREK et al.
l IBM POWER3 and P2SC, IBM F50 (PowerPC),
l SGI ORIGIN 2000 with 250 MHz,
l SUN ULTRA 60 with 300 MHz and 360 MHz,
l HP V2250.
All other processors have been tested by us. So, everybody is invited for improving the indices
of these models (and also the ‘vendor-tested’ versions). In particular, the use of other FORTRAN
compilers (F90, EGCS, GNU-FORTRAN, etc.) or different programming languages appears to
be interesting and, in fact, such tests have been already performed by us (see [13]). For instance,
compare in 3D the MFLOP/ s rates of DEC-PC and DEC-LINUX which are identical computers.
While on the DEC-PC variant, the executable code has been compiled on a workstation environ-
ment with the original DEC compiler and then copied to the PC-target computer, in the case of
the DEC-LINUX configuration, the executable code has been directly generated via the EGCS
F77 compiler and then executed under LINUX. The differences are very significant!
Further improvements do not have to be to restricted to examining the hardware components;
it will be also interesting to modify the test software ptest . f. So far, we have implemented in
the Sparse Banded Blas several versions with different blocking strategies, and we have tested
some specified blocking parameters: however, some modifications of these routines are allowed
and even desired, as long as they are applicable to such special g-point, respectively, 27-point
matrices which arise from conforming FEM discretizations on generalized tensorproduct meshes.
Additionally, the specific adaption with respect to certain machine-dependent values as the pre-
cise cache sizes or number of registers has not been applied up to now. Hence, there is still
a large potential to gain further improvements for such low-level numerical linear algebra rou-
tines. Look also at the ‘data locality’ research project of Riide et al., which is well described at
http://wwwbode.informatik.tu-muenchen.de/Par/arch/cache/index.html and which con-
tains a lot of further literature and activities in this field.
FEAST Indices vs. Application- and Problem-Specific Results
The shown FEAST indices-total or auxiliary-are only averaged values. For many applica-
tions, it might appear to be much more realistic if the corresponding results for specific tasks and
particularly for different problem sizes are explicitly considered. Then, the observed differences
between the processors may be much larger than the FEAST indices do indicate. For instance,
the DEC 21164 processor shows excellent performance for ‘small’ problems, with dramatic losses
of performance if the problem size increases. So, it might be much more favourable to work on
‘many’ small patches (with 500 MFLOP/ s instead of ‘few’ larger patches, with 50 MFLOP/s )
only. Therefore, watch out for your own specific problem configuration! The previous FEAT-
FLOW Benchmark has shown that the corresponding results are significantly dependent on the
numerical method and solution algorithm, and that the differences in the resulting performance
may be much weaker. Nevertheless, keep in mind that those FEATFLOW results have been ob-
tained via classical sparse techniques only, in contrast to the architecture-optimized results of the
FEAST indices. Additionally, if one performs similar tests with recent object-oriented languages,
as for instance C++, you should compare with those optimized FEAST results which really ex-
ploit the FORTRAN facilities. So, there is a good chance for a true comparison of FORTRAN77
with C++ or similar.
IBM POWER2 AND POWERS. While the tested POWER2 and especially P2SC variants show an excellent behaviour for the sparse banded techniques, they demonstrate dramatic losses
of performance for the sparse applications. As explained before, they seem to be quite robust
with respect to ‘cache-optimized’ problems, while they may ‘lose’ against processors with larger
cache-architectures if the matrices get less sparse. Therefore, they belong to our favorites for
applications in FEAST-and also FEATFLOW-which are mainly based on the described gen-
eralized tensorproduct meshes on the lowest level.
FEAST Indices--Realistic Evaluation 1449
In contrast, the new POWER3 processor appears to be a very robust and efficient ‘work horse’
which can be applied for classical sparse as well as for our optimized highly structured numerical
linear algebra components. This computer is, beside the DEC 21264, the clear winner at the
moment.
IBM POWERPC F50. This ‘low cost’ model can be viewed as rival of PENTIUM II PCs, with
respect to price and performance. Therefore, look at our ratings of the PENTIUM world.
DEC 21164 AND 21264, CRAY T3E. The older 21164 processor shows huge performance
problems as soon as the problem size increases: up to now, none of our cache-optimization
strategies seems to work fine, in contrast to (almost) all other platforms. Similar problems are
valid for the CRAY T3E (Stuttgart) which is based on a variant of the 21164 processor. It might
be necessary that further optimizations are performed directly by CRAY.
On the other hand, the new 21264 processor improves in a significant way: having the same
clock rate of 500 MHz, the speed-up is almost a factor of 3. Using the KAP preprocessor, the
performance can be further increased; however, it is not clear to us if this additional speed-
up can be achieved in more complex numerical simulations, as for instance, in FEATFLOW.
Nevertheless, with or without this KAP preprocessor, the DEC 21264 chip is the clear winner
beside the IBM POWER3 processor.
HP, SGI, and SUN. The performance of the ‘top models’ of these vendors-and probably the
prices, too-is very similar. The processors seem to be quite robust with respect to sparse MV
applications while they cannot achieve the same high MFLOP/ s rates for highly structured data
as IBM and DEC. On the other hand, it might be expected that the prices are correspondingly
cheaper, too. Therefore, these models are typical ‘mid-range work horses’ for most numerical
simulations, especially during code development and code testing. However, it is not clear if they
will provide the computing power which seems to be necessary for many ‘real life’ problems. At
the moment, we are not sure about the ‘real’ quality of the HP V2250; we can only hope that
the corresponding 3D test will be performed to guarantee a better rating of this model.
INTEL PENTIUM II. One of our problems with this processor is that we could not get into
contact with INTEL to let them perform their own optimizations, maybe with special compiler
options. Additionally, we could only apply the (freely available) EGCS F77 compiler. At the
moment, this processor type seems to be quite ‘insensible’, regarding cache optimization strategies
as well as with respect to performance degradation through sparse MV techniques. Based on the
cheapest price, this hardware selection is the clear winner if highly unstructured data and sparse
MV techniques are applied, for instance, in fully adaptive FEM simulations. However, this is
more due to the huge losses of the other processors than based on the own high performance: we
compare 20 MFLOP/s with less than ten MFLOP/s. On the other hand, we have not figured
out so far ‘good’ optimization strategies for highly structured data such that our measured
performance values are often slower by a factor of 5 and even more. Nevertheless, we claim
that for most available software tools which are based on sparse techniques (or which do not use
FORTRAN or C.), the choice of PENTIUM II processors may be preferable.
Actually, the specific choice of optimal (?) hardware can be determined via following ‘thumb
rules’.
l For fully unstructured data and corresponding sparse MV techniques, all hardware plat-
forms show slow performance only. The implementation of corresponding tools is quite
easy and straightforward, especially if available software tools as, for instance, SPARSE
BLAS [l], SPARSKIT [2], NISTBLAS [17], or similar are employed. So, the time and
work for the ‘code development’ may be quite fast, but it seems to be impossible to obtain
high performance rates since the resulting efficiency is mainly due to the cost of memory
access and less due to the possible performance of the processors. Therefore, we recom-
mend the use of PENTIUM PCs since they are by far the cheapest ones. Additionally,
1450 S. TUREK et al.
the choice of the programming language might appear to be quite unimportant since no
special facilities can be exploited, not even by FORTRAN. Nevertheless, our experience is
that even for such applications FORTRAN77 tools can be significantly faster than C++
codes.
l On the other hand, if one spends more time and work in software concepts and correspond-
ingly in special numerical approaches which are able to exploit ‘caching in’ and ‘pipelining’,
the use of modern workstation processors and corresponding optimized FORTRAN com-
pilers is advisable. At the moment, only these hardware/software combinations seem to
be able to really exploit a higher percentage of today’s high performance rates. However,
as the FEAST project shows, the design and development of such simulation tools can
be much harder, but the final gain in CPU time and, hence, the gain of validity for the
simulation data will be great.
Based on the actual FEAST indices, we clearly prefer both IBM’s and DEC’s top
models. The difference in single processor performance appears to be a factor of ten in
the maximum, such that one might propose to take ten processors of a slower type to gain
similar performance. However, keep in mind that parallelization is always a hard job, with
regard to numerical design, implementation tasks and also stability of the hardware. If one
really would be successful to apply a complete numerical simulation with several hundreds
of MFLOP/s on one single processor with 4 GByte RAM, it will be very hard to beat this
performance on a parallel system, especially if very ‘strong’ numerical components are
used which are often specifically adapted for sequential runs.
l The performance of future processors will change, but looking at the ‘historical’ develop-
ment, it seems that certain processor characteristics remain preserved: the IBM processors
have been and still are very efficient for highly structured data, while SUN, SGI, or HP
are the typical ‘robust work horses’ for all kinds of numerical simulations. So, we believe
that the specific characteristics of the different processor families will remain valid in some
sense for the near future.
Therefore, we end with the following final remarks.
1. Do not base your rating of computers exclusively on the clock rates or other technical
details.
2. Make your own tests or look at performance ratings which are representative for your
specific applications and needs.
3. Always keep in mind that there are numerous hidden traps to lose computing performance.
In fact, it is and it will be even more and more difficult to achieve a significant percentage
of the growing peak rates on modern processors.
8. ACKNOWLEDGEMENTS
The recent state of the FEAST indices is due to the successful work of many people. Let me
mention the FEAST group and especially Alex Runge who is the major person to evaluate and
particularly to prepare the specific results in text-based and graphical ways.
Additionally, we were very pleased about the spontaneous participation of the processor ven-
dors and the help of many other persons: M. Zahn from Rechenzentrum der Universitat Augs-
burg, M. Claus from Rechenzentrum der TU Chemnitz, A. Meyer from Universitat Chemnitz,
U. Rude from Universitat Augsburg, B. Mollenhoff from Rechenzentrum der Universitat Heidel-
berg, R. Wehrse from Universitat Heidelberg, J. Porta, H. Holthoff and C. Pospiech from IBM,
W. Hahn from HP, R. Lange from SGI, U. Graf from SUN, J. Pareti from DEC/COMPAQ, and
many more. In advance, let us already thank all future test persons.
Without their help, the more or less complete data bank with respect to today’s available
computer systems would not have been possible. On the other hand, also the next generation
of processors, computer memory and compilers has to be continuously tested. Therefore, if
FEAST Indices--Realistic Evaluation 1451
sufficiently many ‘test persons’ actively participate in this processor benchmarking, there is a good
chance to create a quite simple but nevertheless realistic rating of modern hardware platforms,
particularly in addition to the well-known LINPACK or SPEC benchmarks. So, please feel free
to contribute ‘new’ indices to this new data bank, which together with the required software can
be found under the FEATFLOW homepage http: //www . iwr . uni-heidelberg . de/-f eatf low.
REFERENCES
1. BLAS Technical Forum, http: //www.netlib. org/cgi-bin/checkout/blast/blast .pl. 2. Y. Saad,SPARSKIT,http://uww.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html. 3. U. Riide, Technological trends and their impact on the future of supercomputers, In High Performance Sci-
entific and Engineering Computing, (Edited by H.-J. Bungartz, F. Durst and C. Zenger), LNCSE, Springer- Verlag, (1999).
4. H. Hellwagner, U. Riide, L. Stals and C. Wei5, Data locality optimixations to improve the efficiency of multigrid methods, In Proc. 14 th GAMM Seminar ‘Concepts of Numerical Software’, Kiel, January 1998, NNFM, Vieweg, (1998).
5. M. Schlfer and S. Turek, Benchmark computations of laminar flow around cylinder, In Flow Simulation with High-Performance Computers II, Volume 52 of Notes on Numerical Fluid Mechanics, (Edited by E.H. Hirschel), pp. 547-566, Vieweg, (1996).
6. S. Turek, Efficient solvers for incompressible flow problems: An algorithmic approach in view of computational aspects, LNCSE 6, (1999).
7. S. Turek et al., Some basic concepts of FEAST, In Proc. 14 th GAMM Seminar ‘Concepts of Numerical
Software’, Kiel, January 1998, NNFM, Vieweg, (1998). 8. S. Turek et al., On the realistic performance of components in iterative solvers, In High Performance Scienti$c
and Engineering Computing, (Edited by H.-J. Bungartz, F. Durst and C. Zenger), LNCSE, (1999). 9. S. Turek et al., Proposal for Sparse Banded BLAS techniques, (available via internet) (to appear).
10. The national technology roadmap for semiconductors, 1997 edition, http: //www. sematech. erg/public /roadmap/index.htm.
11. C. Becker, FEAST-The realization of finite element software for high-performance applications, Ph.D. Thesis
(to appear). 12. M. Altieri, Robuste und effiziente Mehrgitter-Verfahren auf verallgemeinerten Tensorprodukt-Gittern, Di-
ploma Thesis (to appear). 13. A. Runge, Zur realistischen Leistung von Software-Komponenten und modernen Prozessoren zur numerischen
Simulation von partiellen Differentialgleichungen, Diploma Thesis (to appear). 14. S. Turek, Konsequenzen eines numerischen ‘Elch Tests’ fiir Computersimulationen, Technical Report SFB
359, University of Heidelberg 46, (1998). 15. S. Kilian, Efficient parallel iterative solvers of ScaRC-type and their application to the incompressible Navier-
Stokes equations, Ph.D. Thesis (to appear). 16. S. Turek et al., Trends in processor technology and their impact on Numerics for PDE’s, (available via
Internet) (to appear). 17. The NIST Sparse BLAS, http: //math.nist .gov/spblas. 18. S. Kilian and S. Turek, An example for parallel ScaRC and its application to the incompressible Navier-Stokes
equations, In Proc. ENUMATH-97, Heidelberg, October 1997, World Science, (1998).
1452 S. TUREK et al.
APPENDIX
SOME FEAST INDICES
Table 8. The ‘total’ FEAST index in 3D and its (major) auxiliary indices. The figure gives a graphical representation of the ‘total’ FEAST index in 3D whereat the values correspond to averaged MFLOP/ s rates. Additionally, the table shows the results of the (major) auxiliary indices which have been explained in Section 5; their graphical presentation and a direct comparison to the 2D values is given later. Keep in mind that these are averaged MFLOP/s rates only, which may significantly differ from the results for specific problem sizes.
Computer FEAST FEAST-v FEAST-c FEAT ADAP
DEC-WS-21264-500MHz-KAP 240 220 476 66 47
DEC-WS-21264-500MHz 203 185 387 80 54
IBM-PWR3-200MHz 199 172 410 77 43
IBM-PWR2SC-397-16OMHz 161 153 297 58 21
IBM-PWR2SC597-160MHz 159 154 289 52 19
SGI-ORIGIN2000-250MHz 127 106 275 43 3G
SUN-ULTRA60-360MHz 116 90 276 34 26
HP-C240-240MHz 113 90 261 39 26
CRAY-T3E(STUTTG.) 89 78 185 31 11
IBM-PWR2-590-66MHz 86 87 140 31 17
DEC-WS-21164-400MHz 84 68 187 32 22
SGI-ORIGIN200-180MHz 84 71 174 34 27
SUN-ULTRA60-300MHz 79 69 162 28 21
DEC-PC-21164-533MHz 79 62 181 31 17
DEC-WS-21164-500MHz 78 63 186 18 14
HP-PA8000 75 58 179 25 15
SUN-ULTRA21-333MHz 65 51 148 28 20
SGI-IOK-IP25-195MHz 62 47 150 18 13
SUN-ULTRA450-250MHz 59 45 138 21 14
DEC-PC-21164-533MHz-EGCS 57 48 121 21 14
SUN-IJLTRAl-200MHz 56 44 129 23 15
PENTIUMII-400MHz 52 45 102 26 21
SUN-ULTRAl-140MHz 39 33 83 16 8
IBM-PWR-580 39 39 63 14 7
SGI-RSOOO-75MHz 35 25 92 11 8
PENTIUMII-266MHz 35 30 69 16 14
PENTIUM-PRO-200 23 18 51 10 7
HP-735 19 15 45 8 5
FEAST Indices-Realistic Evaluation 1453
Table 9. The ‘total’ FEAST index in 2D and its (major) auxiliary indices. The figure
gives a graphical representation of the ‘total’ FEAST inf!iex in 2D, while the table shows the results of the (major) auxiliary indices; a direct comparison with the 3D values is given later. One remarkable result is that the performance in 2D is smaller than in 3D, due to the more compact matrices (see the Conclusions).
w PPcF5o (333)
Computer FEAST FEAST-v FEAST-c FEAT ADAP
DEC-WS-21264-500MHz-RAP 165 153 313 55 52
IBM-PWR3-2OOMHz 152 140 262 85 62
IBM-PWR2SC-597-160MHz 143 142 233 63 38
IBM-PWR2SC-397-160MHz 142 141 233 64 38
DEC-WS-21264-5OOMHz 132 121 248 53 49
HP-V2250 118 108 229 38 34
HP-C240-240MHz 89 82 166 32 30
SUN-ULT~60-36OMHz 89 78 171 41 38
IBM-PWR2-590-66MHz 85 88 128 36 25
SGI-ORIGIN2000-250MHz 84 70 175 33 32
CRAY-TJE(STUTTG.) 71 58 147 36 20
SUN-ULTRA60-3OOMHz 68 61 122 35 32
HP-PA8000 61 57 116 22 17
SGI-ORIGIN200-180MHz 61 54 123 24 22
DEC-PC-21164-533MHz 59 48 129 21 19
DEC-WS-21164-500MHz 58 49 122 20 20
DEC-WS-21164-400MHz 56 49 111 20 21
SUN-ULTRA21-333MHz 50 4;2 98 23 24
SUN-ULTRA450-250MHz 49 42 98 19 20
SUN-ULTRAl-200MHz 48 43 93 19 17
PENTIUMII-4OOMHz 45 41 80 24 21
SGI-RlOOOO-IP25-195MHz 41 35 a7 14 14
PWRPC-F50-333MHz 36 31 74 12 10
IBM-PWR-580 32 31 53 15 9
SUN”ULTRAl-14OMHz 32 29 60 13 10
PENTIUMII-266MHz 29 27 51 15 13
SGI-R800~75MHz 28 23 56 13 15
PENTIUM-PRO-200 17 16 33 6 6
SGI-RlOQOO-150MHz 16 12 41 5 7
HP-735 16 14 32 5 5
1454 S. TUREK et al.
B21264WslKAP n nPPAsw2 - --- H21284 WE (5a-J) mlJLrruw(s2sJ
l.nmAE4ra(2E4
_-----. ~_
PEAT- L#Jvals20
Figure 7. FEAT index in 3D. The first row shows the FEAT index, followed by the specific results for the FEAT MV routines (see Section 4) on different mesh levels. The computations demonstrate the dependence on the problem size and the kind of indexed access via the described numbering strategies (compare with the following ADAP index.). The main results are that the resulting MFLOP/s rates are quite slow compared with the following results for performing the Sparse Banded Blas-like MV multiplication with the same (!!!) matrix. Beside DEC’s and IBM’s top models, the SGIs aud particularly the PENTIUMs lead to (relatively) good results. In contrast, POWER2 models (IBM) and older DECs (21164, CRAY T3E) deteriorate on the finest level 3 to about ten MFLOP/s only. The actual ordering of the processors is due to the total FEAST index and shows impressively that some of our ‘preferred’ machines may have, more or less, problems with the described spume techniques due to massive cache misses. Further increasing of the problem size can lead to even still slower performance results (compare Level 3 vs. Level 1).
FEAST Indices--Realistic Evaluation 1455
/D21254 WSKAP
~~212%WS(5oo)
n PWRB (200) 3 PwFui397 (160)
/ n PWR7f5e7 (leo)
1L.i ULTRMY) (3eo)
1 E HP C240 (235)
n HPPMWO
I ULTRA60 wo)
q ULTRAIO (333)
Rlwoo (1%)
ULTRA E45o (25o)
VLTRAI (2a)
211&(uM(533)
n Pnntium II (4oo)
q ULTRAI (14)
iiimoomi n PNlnumPm wol q HP735
I ACAP. LawI 3 3D
21264 WSKAP
21284 ws (500)
PWFw397 (150)
PWR2l597 (160,
‘13 HP C240 (236)
CRAY T3E (133)
lrn PWR2l580 (88)
IIORlGlN 200
/m2ii84Ws(400)
lm21184PC(533)
I ~2”64.%E!!~~_
q HP PA80oo
“LTRABO (300)
ULTRA10 (333)
RlOOOO (IES,
I3 ULTRA E&Jo (250)
UB ULTRA, 1200,
21184 Lh”, (533)
w pe”tIYln II (.OO)
I ULTM.1 (140)
I PWR/S80
q P*ntium II (266)
n RlOW (76)
n P~nemlPro (200,
llHP735
‘I
~~~~~.. ~. ~~ a HP PABoo
LILTRAM (300)
ULTRA10 (333)
z Rloow (125)
;: ! ULTRA E45o (2xI
ULTRA1 (2w)
21154 unux (523)
R Pemium Ii (400) q ULTRCII (14o)
n iwwsm n pati II (236)
m-(75) n pmtkmpm (2~)
llHP735
,w 21264 WSMAP
‘rn 21264 ws (5~)
n PWFW (2oo)
AGAP-LHdl3D
n ~ww59o (66)
q ORIGIN 2M)
~m2twwswq ~m21b34pc632) tm2w.4wsgiw
m psntium II (400)
n ULTRAI (140)
n pwwseo q penth II (265)
n ~6000 (75)
n etiunpro (2oo)
(IHP735
Figure 8. ADAP index in 3D. The first row shows the ADAP index, followed by the specific results for the ADAP MV routines (see Section 4) on different mesh lev- els. Due to the even higher jumps in the numbering of neighboured vertices, the MFLOP/s rates further decrease and are of maximum size 25 (!!!) on level 3. Even both DEC’s and IBM’s top models show about 20 MFLOP/s only, while again the SGIs and particularly the PENTIUMs lead to the (relatively) best results. However, we compare 20 Ml?LOP/s as best values with less than 10 MFLOP/s for many other processors. Additionally to the previous FEAT index, the dependence on the mesh level, respectively, the problem size, gets much more clear and varies up to a factor of ten between Level 1 and Level 3.
1456 S. TUREK et al.
‘W PWRS (200)
i m PWiW307 po)
n ORIGIN ZOO
n 2llMws(400)
n 2lill4Pc(5ss)
n 2llMWSI5Ofl~
WHPPMOOO
q ULTRA60 (wo)
n ULTRA10 (395)
RIO000 (10s)
ULTRb E4w (2sc
ULTRA1 (200)
q 21164 Lhux (533)
w P*ntbm II (400)
n ULTRA* (140)
n Pww.52a
n PmthJm II (266)
n RWOO (75)
n PhlvnPm (2w)
IHP795
1 I21264 WSKAP
I m 21264 ws (5W)
!MPWR3(2w)
~~21184WS(5w)
HHPPAWOII
ULTRA60 (300)
ULTRA10 (333)
R1OOOo (185)
ULTRA E450 (250)
ULTRA1 (200)
21164 LhJr (533)
n POnllwn II (400)
n ULTRA1 (140)
mPwR/wO
I psrmum II (266)
mm(n)
l Pa-diumPm (2W)
IliP
I MVN- Lwd33D
H 212S4 ws (243)
I PWRI (ZW)
/k9HPC240(238)
CRAY T3E (433)
n ~R2690 (es)
n ORIGIN 200
n 21164WS(400)
n 211luPc(533)
n 21164 ws (500~
ULTRAw(3Oq ~ ULTRAlO(333) ~ RlWOO (195)
ULTRA E4w (250)
ULTRA1 (200)
21164Lhux(533)
n Psntlum II (400) I
q ULTRA1 (140)
n m-
n Pennurn II (266)
I MVN-L.d23D
400 it------
w 21264 ws (5W)
FZ I PWRB (200)
‘/
i
Figure 9. FEAST-mv-v index in 3D. The first row shows the FEAST-mv-v index, followed by the specific results for the MV-V routines on the different mesh levels. The computations demonstrate the much weaker dependence on the problem size, at least for some of the processors (it is obvious that for problems which fit com- pletely into the cache, the rates are often better). The main results are that the resulting MFLOP/s rates are essentially improved through the applied ‘caching in’ and ‘pipelining’ strategies, especially if compared with the previous sparse indices. Nevertheless, we expect in future even higher improvements if more machine-specific optimizations for this task are applied.
FEAST Indices-Realistic Evaluation
~ q ORIQIN 200
n 21164WS(4lw)
w21164PC(533)
n HPPAtlOW
ULTRA&o (300)
H ULTRA10 (333)
RlOom(ls5) ~ ULTRA E4w (250) / ULTRAl(200) ; 21164 Ltnlcf (533)
I Pantturn II (400)
~
1 n ULTRA1 (140)
w-
q Pentturn II (266)
q R6ccO(75) ~ I PenttumPm (2001
HP PA6006
ULTWUU) (300)
ULTRA10 (333)
RlOOOo (165)
ULTRA E450 (250)
I Pentlum II (266)
MV/C Led 2 3D
21264 WS/KAP HP PA600,
ULTRA66 (3W)
ULTRA10 (363)
RlWW (195)
ULTRA E450 (250)
ULTRA1 (200)
21164Lbwx(533)
Pentium II (400)
ULTRA1 (140)
PWw580
Psntlum It (266)
RBMM (75)
PentiumPro (200)
HP 736
Figure 10. FEAST-mv-c index in 3D. The first row shows the FEAST-mv-c index, fol- lowed by the specific results for the MV-C routines on the different mesh levels. These studies show the importance of the main strategy in FEAST: ‘detect locally struc- tured parts and exploit the correspondingly regular data’. More than 700 MFLOP/s are realistic which have to be compared with the previous sparse indices.
.i i. __ ULTRA66 (360)
g1 HP c24a (238)
‘m PWR2/560 (66)
~10~1~1~200
‘~21164W8(400)
q HP PA6000
ULTRAW (300)
ULTRA10 (333)
RIOMM (165)
. ..A ULTRA E450 (250)
ULTRA1 (200)
21164Llnuf(533)
I Pantium II (400)
H ULTRA1 (140)
w PWR/560
q Pentium II (266)
I RWOO (75) PentlumPro (200)
HP 735
1457
1458 S. TUREK et al.
; 175
(50
125
100
75
50
25
0
MVN-total2D
-- ----~ 1~21254 w&lap
1 n PWRS ,200)
/ n PwR2697,160)
em Pww227 ,t5q
I I HP C240,236)
12HMWS,500) I
q 21154W5,uxl) I
ULTRAlO(522) ~
ULTRA E450(25O) /
ULTRAl,2cq
prmknnli(4al) I
n R1aaoo(125) ~
n PPCF50,222) I
n uLTRAt(140)
n Fw!a%KI :
w PHltium II (22.5)
m-F%
q PdunPm (200) :
n RtoGoo(154) ~
HP 725
21164ws,500)
2,124 ws (@xl)
ULTRAtO ,223)
_, ULTRA E&C! ,250)
ULTRAt ,200)
Pentlum II ,400)
I RIOWI ,125)
II PPCIFSO ,223)
uLTRAl(t40)
PwFv550
Pentlum II (255)
n Rsoao(7J)
I PmtbmPm (200)
R10000,150)
HP 735
275T. -- ~~._ I
,l2t2&(wsKAP 21154 ws ,500)
) I PWFn3 (2W) 2?154Ws ,400)
ULTRAlO(233)
ULTRA E&O (250)
” ULTFCAM ,360) I PPc!F50,333)
q ULTRAl(140)
MVN-Le~l32D
q CRAY T2E (US) n PenNun II (w
-. ~~ULTRA60,300) wF!8000,75)
1 I ORIGIN 200
/uHPPA8000
q PenthPrn ,200)
RlOOw,wJ)
/121154PC,533) HP735 I_..
I
MVN-LwdlPD
Figure 11. FEAST-mv-v index in 2D. The first row shows the FEAST-mv-v index, followed by the specific results for the MV-V routines on the different mesh levels. In contrast to the previous 3D case, the ‘older’ POWER2 architecture by IBM is superior which is equipped with much smaller La-caches. Due to the more compact matrices (g-point instead of 27-point matrices in 3D), the processors with larger cache sizes cannot gain the same improvements as in the 3D case. The quality of the ‘really old’ IBM 590 with 66 MHz clock rate only is surprisingly excellent if compared with many new processors.
FEAST Indices--Realistic Evaluation
!~ULTRA50(200) n R5OOOW5l q ORIGIN 200 q PentiullPm (200)
n HPPA.9000
112y84PC(w
21264 WWKAP
PWAJ (200)
PwR2/597 (150)
PwR2/397 (160) ULTRA E450 (250)
21264 ws (5M))
HP V2250 (236)
i:::; HP c240 (236)
II ULTRA1 (140)
1 PWW5Ba (65) q pwR550
II CRAY T3E (423) q Petivn II (268)
p ULTRA50 (300)
q ORIGIN 200 rnR5_O ~ q PenaMPm(200)
n HPPA5000 RlOca (150)
,W21154PC(533) HP735
SW
7Oc
eJm
5Lul
400
300
200
1M
,
21254 W?JKAP
PWRI (2Oq
~~PWR2!597(160)
PwRz397 (160)
21284 ws (500)
HP V2250 (238)
i .: HP C240 (235)
~ q Pww5BO (5s) I CRAY T3E (433)
I ULTRA50 (303) I ORIGIN 2xX
lHPPA.5000 W21154PC(533)
21164 ws (500) 2llMWS(400)
ULTPAlO (333)
ii ULTRA E450 (W
q PPc/Fso (333) I ULTRA1 (140)
q PwRKml w Pullilml II (266)
n W(75) q PennMPm (200)
RlOOfXI (150)
HP735
21254 WSIKAP 21164w2.1503,
3 PwR3 (200)
‘W Pww597 (150)
I 1 HP c240 (236)
II PWRZBO (66) fi CRAY T3E (433)
/w ULTRA50 (3BO)
m ORIGIN 2W
WHPPAWXI
i !w5 y (_e_
i..j ULTRA E450 (250)
ULTRA1 (200) Penlium II (400)
RIO000 (195)
q PPWSO (333) w ULTRA1 (140)
q Pww550 n Pmth II (2s)
n R3000(75) q PmliunPm (200)
RlWOO (150)
HP733
1459
Figure 12. FEAST-mv-c index in 2D. The first row shows the FEAST-mv-c index, followed by the specific results for the MV-C routines on the different mesh levels. Again, these rates-up to almost 900 MFLOP/s-show the absolut importance of exploiting such structured patches if available. However, they also show how hard (or even impossible at the moment) it is for certain computer architectures to achieve these high rates for large problem sizes.
1460 S. TUREK et al.
225
200
175
155
125 ____
155
75
50
25
0 MGLGSN- total 30
/mm34 wsiw imzi264ws(500) n WFW 1200) m Pwlw307 wJo) q wmm7 1160)
n ~wn2/555 (55) n ORIQIN 255 n 21164ws(400) n 21154pc(593) n 21164ws(500)
n wpCgOOa q ULTRA60 (300)
I ULTRA10 (333)
RloOOo (135)
ULTRA E4So (250)
ULTRA1 (200)
q 2w34hxp33) n pantlum II (400)
n ULTRA1 (14o)
n twwsao n pdun II (265)
n RaMlo(73 n PmtlumPm 1200)
q w735 .___
n HP PA8000
ULTRAEO (300)
ULTRA10 (333)
RIOooo (195) ~ ULTRA E45o (250)
ULTRA1 (200) 1 21164 Lhux (533) ~ n Pentlum ti (400)
w ULTRA1 (140)
1
n PWR~M) n Pentium ii (266)
n MOW (75)
~
n PentlumPro (200)
q HP735 !
I MGLGSN - Level 3 3D
250 ~. ._~__~____~_~-. __~_
ULTRA50(300) ~ ULTRA10 (333) I
q 2w4Linux(533)
n pnthnn II (2.35)
MGLQSiV - Low12 3D
n 21254 WS~KAP
n 21254 ws (500) n PWW v.oo) q PwPz/337 (15o)
n pww69o (es) n ORIGIN MO
n 21154ws(400) n 21154pc(533)
im21164ws(5~)
-mHPPALWo _~~
I ULTRA60 (300)
n ULTRA10 (333)
RlOOoo (195)
m ULTRA E4w (250)
ULTRA1 (200)
n 21 I 54 hx (625) n btlun II (400)
n ULTRAI (140)
n ~~560 n ~tih II (253)
m-000(75) n PmtiumPro (200)
q HP735
Figure 13. FEAST-mglgs-v index in 3D. The first row shows the FEAST-mglgs-v index, followed by the specific results for the M-TRIGS routines on different mesh levels. The computations show that not only MV multiplications can be efficiently applied, but also complete multigrid algorithms with a very robust smoother inside which works very efficiently for regular as well as anisotropic meshes (see [12]).
1461 FEAST Indices-Realistic Evaluation
MGLGS/Gtotal3D
‘I 21264~WsKAP
II 21264 ws (500)
n PWR3 (200)
~ @ ORIGIN 2wO
I : ULTRA60 (360)
q PWFw5oO (66)
‘I ORIGIN 200
,M21164WS(400)
!121164PC(533)
!JB21164WS(500)
q HPPAB000
ULTRA60 (300)
ULTRA10 (333)
RICO00 (195)
T._i ULTRA E&O W30) 21 ULTRA1 (200)
¤~Rneo n Pentium II (266)
m-(75)
PmtiumPro (200)
HP 735
MGLGWC . Level 3 3D
21264 WSKAP
i ULTRA60 (360)
q PWRZZ590 (66)
I ORIGIN 200
‘II 21164 ws (400)
i w 21164 PC (533)
jU21164 WS (500)
~ n HP PA6000
ULTRA60 (300)
ULTRA10 (333)
RlOWO (196)
ULTRA E450 (250)
E ULTRA1 (2W)
21164 Linux (533)
Pentturn II (400)
q ULTRA1 (140)
n PWR/560
m Pentium II (266)
n A6000 (75)
n PentiumPro (200)
HP 735
MGLGWC - Level 2 3D
MGLGWC - Level 13D
: # PWRB (200)
PWRZ397 (160)
PWR2/597 (160)
ORIGIN 2ooO
: ULTRA60 (360)
q PWR2/5SO (66)
I ORIGIN 2W
q 21164ws(400)
q 21164Pc(533)
‘~21164ws(500)
## HP PA6000
ULTRA60 (300)
ULTRA10 (333)
g RIO000 (195)
/ i ULTRA E450 (250)
ULTRA1 (200)
21164 Unux (533)
Pentlum II (400)
# ULTRA1 (140)
n PWR/560
n Pentium II (266)
n R8000 (75)
Figure 14. FEAST-mglgs-c index in 3D. The first row shows the FEAST-mglgs-c index, followed by the specific results for the M-TRIGS routines on the different mesh levels. Again, the demand for detecting locally structured patches is demonstrated since complex multigrid solvers with a very high numerical complexity can be per- formed with a correspondingly excellent computational efficiency, too.
21264 WSKAP HP PA6000
21264 ws (500) ULTRA60 (300)
PWR3 (200) ULTRA10 (333)
PWR2/397 (160) RIO000 (195)
PWR2/597 (160) i::: ULTRA E450 (250)
ORIGIN 2000 ULTRA1 (200)
i ULTRAW (360) 21164 Linux (533)
-., HP C240 (236) I Pentium II (400)
CRAY T3E (433) ## ULTRA1 (140)
q PWR2/500 (66) q PWw5w
q ORIGIN 200 I Pentium II (266)
i121164WS(400) IRSOOO(75)
!a21164Pc(533) PentlumPro (200)
~121164ws(500) HP 735
1462 S. TUREK et al.
I-
~- q 21254 WSKAP
l PwR3 cw I PwR26O7 (180)
~ q PwR2BO (O5)
1 m GRAY T3E (433)
‘m IJLTRMO (3OO)
n ORIQIN 200
~HPPk30Ofl
q 21154Pc(533) .~_~
I21154 ws (5OO)
n 21154ws(400)
ULTRA10 (333)
ULTRA E450 (250)
ULTRA1 (2OO)
Pentbum II (400)
wRlOWO(l25)
n PpclFso (333) n ULTRA1 (140)
n Pww55O
n Pentlum II (2%)
WR2JJOO(7J) q PentiumPm (200)
RlOOOO(l50)
HP735
~-- ~IIPlP54WS/KAP
/ q PwR3 (200)
-. m CRAY T3E (433)
-~~ m ORIGIN 200
t i~21154 PC (533)
aHPPA5OiM
MQLQSN - Level 3 2D
n CRAY T3E (433)
1 m ULTRA50 (300)
‘I ORIQIN 200
q HPPA5000
n 21154 PC (533)
21154WS(5W) --
21154WS(40O)
ULTRA10 (333)
ULTRA IS50 (250)
ULTRA1 (200)
PenthJm II (400)
RIO000 (195)
w PPcF50 (333)
w ULTFIAI (140)
n Pww550
n PeathIm II (286)
n RBOOO (75) q PenUumPro (200)
RlOOOO (150)
HP735
250
MQLQSN - Level 1 2D
/I 21254 WSMAP
‘E PWR3 (ZW)
q ORIGIN 2WO
~ n PwRasw (55)
! I CRAY T3E (433)
n ULTRA50 (3OO)
n ORIQIN 200
n HP PA5OW
!rn21154PC6331
21164 ws (500)
21164ws(400)
ULTRA10 (333)
ULTRA E450 (250)
ULTRA1 (200)
POntltml II (400)
n RIOOW (125)
n PPCMO (333) n ULTRA1 (MO)
n PwW650
n Psntlum II (288)
I R5000 (75) I PurlkmPrn (200)
RIOOOO (150)
HP735
21184 ws (500)
112llMWS(4O0)
ULTRA1 0 (333)
ill ULTRA E45O (250)
a ULTRA1 (200)
0 Pentlum II (400)
n RlOOOO (125)
n ppumo (333) n ULTRA1 (140)
n PwRf55O n PeMum II (258)
n R8ooo (75) I PmtiumPr0 (2W)
RlOOOO (IW)
HP 735
Figure 15. FEAST-mglgs-v index in 2D. The first row shows the FEAST-mglgs-v index, followed by the specific results for the M-TRIGS routines on different mesh levels. The 2D results are somewhat slower as the 3D case, as already indicated by the MFLOP/s rates for MV-V multiplication.
FEAST Indices--Realistic Evaluation 1463
GRAY T3E (433) l Pmllum II (2ea)
~~(~) n=(75t q PenulmlPm (2uo)
II Rl@xo (150)
350
300
250
200
150
100
50
0
l 212s4ws/K4P
n PwR3 (ZOO)
I PWR2697 (160)
u PWR2/3@7 (180)
m 21264 ws (EM)
HP V2250 (236)
HP c240 (236)
ULTR#J (38o)
ORlOiN 2ooo
I PWIw590 (66)
n CRAY T3E (433)
m ULTRA@ @LlO)
I ORIGIN 2W
mHPPA3OW
,1211&1pc(533)
-_-_l_....__ q 2llwws@xl)
a 21164 ws (4Oo)
ULTRA10 (333)
ULTRA Ells0 (2~9)
ULTRA1 (2v3)
Pmtium II (4Oo)
II RlOLMO (125)
8 ~~ w3) a ULTRA1 (145)
n PWWMU)
II P9rruum II (266)
n RaoGu 175) II PontiumPr0 (ZOO)
RIO000 (1M)
HP 735
‘~21264WstKAP m PWRI (200) n PWR2697 (160)
,Iy PWR2M7 (160)
/II 21234 ws {soo)
~~~RIOINXXX)
n PWR2690 (66)
n CMY TIE (433) II ULTPA60 (300)
m ORIGIN 200
s1211&4 WS (500)
mnle4ws(4w)
/
ULTRA10 (333)
ULTRA E4!%l(250)
ULTRA1 (200) 1 Phiurn II (400)
RlOOW(l95) j IPPOF50(333) I
i uLTRA’ wJ1 m Pww680 ; II PenNurn II (266) I w NOW (75) I POLO (200) j
IRloooo(lsof HP 735 ,
Figure 16. FEAST-mglgs-c index in 2D. The first row shows the FEAST-mglgs-c index, followed by the specific results for the M-TRIGS routines on different mesh levels. The 2D resufts are somewhat slower as the 3D case, as already indicated by the MFLOP/s rates for MV-C multiplication.
1464 S. TUREK et al.
__-.._____._ -
.._____
1 I I
I’ 1 _~ - -._..._
__~ ____
.-_-._ ~
Figure 17. Some indices (2D vs. 3D). The reader should compare the absolute MFLOP/s results which are somewhat higher in 3D for the Sparse Banded Bias appli- cations in FEAST.