the feast indices-realistic evaluation of modern software

An lntemabimal Journal

computers & mathematics

PERGAMON Computers and Mathematics with Applications 41 (2001) 1431-1464 www.elsevier.nl/locate/camwa

The FEAST Indices-Realistic Evaluation of Modern Software Components and

Processor Technologies

S. TUREK AND CH. BECKER Institute for Applied Mathematics, LS III, University of Dortmund

Vogelpothsweg 87, D-44227 Dortmund, Germany

tureQfeatflow.de

A. RUNGE Institut fiir Angewandte Mathematik, Universitgt Heidelberg

Im Neuenheimer Feld 294, 69120 Heidelberg, Germany

(Received and accepted March 1999)

Abstract-we examine the computational efficiency of linear algebra components in iterative solvers for grid-oriented simulations of PDEs. While the standard sparse matrix-vector (MV) techniques show significant losses of performance, especially on modern processors, our sparse banded components have the potential to exploit today’s high computing power. We explain the major concepts of the FEAST software which contains such highly tuned numerical linear algebra basic components (Sparse Banded Blss) up to complete multigrid solvers, all being optimized with respect to the actual hardware platform. Based on algorithmic and computational studies, we present the FEAST indices which are indicators for the true performance of many modern processors, depending on the underlying FEM space, the problem size and the implementation style. These indices allow a new rating of the various hardware platforms with regard to different mathematical solution strategies, for academic and realistic numerical problems and ranging from ‘low cost’ PCs up to supercomputers. @ 2001 Elsevier Science Ltd. All rights reserved.

Keywords-High performance computing, Benchmarking, Numerical linear algebra, PDEs.

1. MOTIVATION

One current trend in the software development for PDEs, and here especially for finite element

(FEM) approaches, goes clearly towards very sophisticated object-oriented techniques and adaptive methods in any sense. On the other hand, the employed data and solver structures, and

particularly the ‘matrix structures’ (for an overview, see [1,2]), are mostly chosen in a somewhat

old-fashioned way-as ‘globally defined’ types-which neglect the very specific performance fa-

cilities of modern hardware platforms. As a result, the observed computational efficiency is often

far from the expected peak rates of (potentially available) one GFLOP/s nowadays, and the ‘real

life’ gap is even further increasing if one extrapolates current hardware developments (see also

the papers of Rude, for instance, [3]).

0898-1221/01/$ - see front matter @ 2001 Elsevier Science Ltd. All rights reserved. Typeset by A@-TEX PII: SO898-1221(01)00108-0

1432 S. TUREK et al.

Today, high performance calculations in this field seem to be reachable only by explicitly ex-

ploiting ‘caching in’ and ‘pipelining’ in combination with sequentially stored arrays (see [4]).

While the realization of such techniques tends to be easier for finite difference approaches, it is

far from being obvious how to perform similar approaches for much more complex finite element

codes. These discrepancies, between numerics and software concepts and the available hardware,

often lead to unreasonable calculation times for ‘real world’ problems, e.g., (nonstationary) CFD

calculations in 3D, as can be easily seen from recent benchmark comparisons for commercial as

well as research codes (for instance, see [5,6]). H ence, strategies for massive efficiency enhance-

ment are necessary, not only from the mathematical (algorithms, discretizations) but also from

the software side of view. To realize some of these aims, our new FEM package FEAST (Finite

Element Analysis & Solution Tools) is under development (see [7,8]).

In this paper, we concentrate on the aspect of testing !inear algebra routines for the stiffness

matrices resulting from FEM discretizations. These are the most important components in the

corresponding multigrid framework which is mainly responsible for the total efficiency of the

complete simulation. First of all, we demonstrate the performance of typical software tools based

on standard sparse MV techniques, before we explain our machine-dependent Sparse Banded

Bias approaches [9] for achieving strong performance improvements. Consequently, these tech-

niques are the basis for our recent evaluation of software components and modern processor tech-

nologies which are collected in several types of FEAST indices. These explain, for instance, why

many codes run faster on ‘low cost’ PCs than on single processors of supercomputers. However,

they also demonstrate the (hidden) potential of supercomputing power on modern (workstation)

processors as employed by DEC, IBM, HP, SUN, and SGI.

2. TYPICAL EXAMPLES FOR PERFORMANCE LOSSES

One of the main components in iterative solvers-Krylov-space methods or multigrid-are

matrix-vector (MV) applications. They are needed for defect calculations, smoothing, step-

length control, etc., and often they consume SO-SO% of CPU time. Hereby, sparse MV con-

cepts are the standard techniques in FEM codes (and others), also well known as the ‘compact

storage’ technique: depending on the programming language, the matrix entries plus index ar-

rays/lists/pointers are stored as long arrays or heaps, containing the ‘nonzero elements’ only.

For an overview on applied techniques, see for instance SPARSKIT [2] and the literature cited

therein. While this sparse approach can be applied for general meshes and arbitrary numberings

of the unknowns, no explicit advantage of (possible) highly structured parts can be exploited.

Consequently, a massive loss of performance with respect to the possible peak rates may be

expected since-at least for large problems with more than 100,000 unknowns-no ‘caching in’

and ‘pipelining’ can be exploited such that the higher cost of memory access will dominate the

resulting MFLOP/s rates.

To demonstrate this failure, we start with examples from our FEATFLOW code [6] which seems

to be one of the most efficient simulation tools for incompressible flow on general domains (see the

results in [5]). We apply FEATFLOW to the following “2D flow around a car” configuration, and

we measure the resulting MFLOP/ s rates for the MV multiplication inside of the multigrid solver

for the momentum equation (see [6] for mathematical and algorithmic details). Here, we use the

typical (for FEM approaches) ‘two-level’ (TL) numbering (old vertices preserve their numbers if

the mesh is refined), a version of the bandwidth-minimizing Cuthill-McKee (CM) algorithm and

an arbitrary ‘stochastic’ numbering.

All numbering strategies have in common that, based on the standard sparse ‘compact storage

rowwise’ (CSR) technique (see [2]), the cost for arithmetic operations, for storage and for the

number of memory accesses are identical. However, the resulting timings in FORTRAN77 on

a SUN ENTERPRISE E450 (about 250 MFLOP/s peak performance) can be very different, as

Table 1 shows.

FEAST Indices-Realistic Evaluation 1433

Figure 1. Coarse mesh for ‘flow around a car’

Table 1. MFLOP/s rates of sparse MV multiplication for different numberings.

Computer #Unknowns TL CM Stochastic

13,688 20 22 19

SUN E450 54,256 15 17 13

(- 250 MFLOP/s) 216,032 14 16 6

(CSR) 862,144 15 16 4

These and numerous similar results (see also Section 6) can be concluded by the following

statements which are quite representative for many other numerical simulation tools.

1. Different numbering strategies can lead to identical numerical results and work (w.r.t.

arithmetic operations and memory access), but at the same to huge differences in elapsed

CPU time.

2. Sparse MV techniques are slow (with respect to possible peak rates) and depend massively

on the problem size and the amount and kind (?) of memory access.

3. In contrast to the mathematical theory, most multigrid implementations will not show a

realistic run-time behaviour which is directly proportional to the mesh level.

To shed more light into this ‘strange’ behaviour, we examine more carefully the computer

performance for second-order scalar PDEs on tensorproduct meshes with trilinear FEM spaces

(M matrices with bandwidth 27). Again, we apply different numbering strategies in the sparse

CSR format; however, the numerical work and the results of the MV operations are identical for

all cases.

Table 2. MFLOP/s rates of sparse MV multiplication on tensorproduct meshes.

Computer #Unknowns Rowwise Two-Level Stochastic

IBM RS6000/597 173 86 81 81

(160 MHz) 333 81 55 16

‘PBSC’ 653 81 14 8

IBM RS6000/590 173 42 42 42

(66 MHz) 333 41 39 27

‘POWER2’ 653 41 17 7

DEC/21164 173 54 31 29

(433 MHz) 333 51 16 10

‘CRAY T3E’ 653 49 13 8

INTEL PENTIUM II 173 30 28 28

(400 MHz) 333 30 26 24

‘ALDI PC’ 653 30 23 19


Based on such studies,’ we can characterize more precisely the

haviour.

computational run-time be-

l The MFLOP/s rates are far away from the announced peak performance.

l They depend significantly on the size of the problem and the l&d of memory access.

l ‘Old’ processors (IBM 590) can be even faster (!) than ‘new’ ones (IBM 597).

l ‘Supermarket’ PCs can be significantly faster than processors in ‘supercomputers’.

Table 3, which now in contrast is based on the highly structured MV techniques in FEAST,

shows that the same application can (!) be performed much faster: we additionally exploit vet-

torization facilities and data locality. Additionally, we can even further differ between the case

of variable matrix entries (‘var’) and constant bands (‘const’) as typical for Poisson-like PDEs.

Table 3. MFLOP/s rates of Sparse Banded Blss [9] MV multiplication on tensorproduct meshes.

Computer #Unknowns Var Const Computer Var Const

IBM RS6000/597 173 188 480 IBM RS6000/590 102 195

(160 MHz) 333 172 393 (66 MHz) 94 175

‘PPSC’ 653 176 390 ‘POWERZ’ 94 176

DEC/21164 173 103 404 INTEL PENTIUM II 51 180

(433 MHz) 333 101 313 (400 MHz) 51 137

‘CRAY T3E’ 653 101 268 ‘ALDI PC’ 48 124

Such differences in performance are caused by the fact that modern processors are already

sensible supercomputers with respect to ‘caching in’ and ‘pipelining’ which has been explicitly

exploited by the Sparse Banded Blas-based MV techniques. However, the following examples

show that also the kind of arithmetic work has to be considered (division-free numerical linear

algebra) and that a precise knowledge of the cache architectures is necessary.

Table 4. MFLOP/s rates for tridiagonal preconditioners in ‘classical’ formulation, compared with the Sparse Banded Bias versions from FEAST for ‘variable’ and ‘constant’ matrix entries. The left table shows the results obtained from traditional implementations with division, while the right table gives the rates without division. It is obvious that divisions in an algorithm can significantly degradate the performance if compared with pure addition/multiplication-based algorithms.

To make this point clear: the use of tensorproduct meshes does not (!!!) automatically guar-

antee higher performance; all proposed optimization strategies with respect to data structures,

algorithmic redesign, programming language, cache architectures, and memory management must

be considered.

Table 5 shows the expected development of the (actual) processor technology and demonstrates

that such ‘machine-oriented’ algorithmic and implementation techniques will be absolutely nec-

essary: the explicit handling of data locality, internal parallelism and vectorization, in contrast

to minimizing the memory access at the same time, will be the key techniques for the next years.

‘ALDI is a supermarket in Germany which is well known for its cheap prices but also high-quality goods. Based originally on household and food stuffs only, ALDI has offered a typical PENTIUM 11/400MHz computer which has been sold out due to its cheap price and particularly ALDI’s good reputation: about 250,000 pieces in two weeks.


160

140

120

100

60

60

40

20

0’ 0

I

20000 d I I I I I 1

40000 60000 60000 100000 120000 POS(Y,N) for N=4225

20 1 I I I I I I I I I I 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500

POS(Y,N) for N=1050625

Figure 2. MFLOP/s rates of DAXPY operations based on machine-optimized performance libraries. We only vary the difference POS in the relative position to each other of both vectors to be added. The results for the SUN E450 (top, 1 MByte LZ-cache) are representative for processors with l-fold associative cache architectures. The actual performance of such ‘simple’ numerical linear algebra tasks of BLASl-type depends massively on the relative position POS and varies between 160 and 3 (!!!) MFLOP/s although both small vectors fit completely into cache. The bottom figure shows similar losses of performance for the CRAY T3E, here for long vectors. These examples demonstrate that such failures can be only avoided by a hardware-specific and user-defined memory management.

Table 5. Excerpts from the ‘1997 Semiconductor Roadmap’ [lo].

Year of First Shipment

Local clock (MHz)

Transistors/chip

1997 1999 2001 2003 2006 2009 2012

750 1250 1500 2100 3500 6000 10K

11M 21M 40M 76M 200M 520M 1.48

Then, it might get really possible to exploit more adequately the performance of future pro-

cessors which tend to be even more powerful-as single processors with up to 1 TFLOP/s-than the complete CRAY T3E today. On the other hand, neglecting these features may not only

lead to nonsignificant performance acceleration, but even performance degradation is possible as

the previous tables have shown (compare the ‘old’ IBM 590 with the ‘new’ IBM 597). At the

moment, most of today’s sparse implementations (see also Section 6) would favorize the use of

‘cheap’ INTEL-based computers only. This development might lead to an economic (for certain companies) and much more to a scientific disaster as the performance rates will show.


3. DESCRIPTION OF THE FEAST PROJECT

FEAST is based on the following concepts (see [7,8] for details) which shall enable the combina-

tion of such highly tuned linear algebra tools with very sophisticated FEM simulation strategies:

l consequent application of (recursive) ‘divide and conquer’ strategies,

l hierarchical data and solver structures, but also hierarchical (!) ‘matrix structures’,

l frequent use of machine-optimized low-level numerical linear algebra routines.

The result shall be a fiexible FEM package for many ‘real life’ problems with special emphasis

on:

l (closer to) peak performance on modern processors,

l typical multigrid behaviour (with respect to efficiency and robustness),

l parallelization and vectorization directly included on ‘low level’,

l open for different adaptivity concepts and a posterior? error control.

In contrast to many other approaches which aim to develop software for research or education

topics, our approach is clearly designed for high performance applications with industrial back-

ground, especially in CFD. Consequently, our main emphasis lies on the aspects ‘efficiency’ and

‘robustness’ and less on topics as ‘easy implementable’ or ‘most modern programming language’.

Therefore, FORTRAN77/90 is used such that (for us absolutely necessary) the transparent access

to the data structures is possible. Further, this makes it possible to adopt many reliable parts

of the predecessor packages FEAT2D, FEATSD, and FEATFLOW. One of the most important

principles in FEAST is the consequent application of divide and conquer strategies. The solution

of a ‘global’ problem is recursively split into smaller independent subproblems on ‘patches’ as

part of the complete set of unknowns. There are two major aims in this splitting procedure which

can be performed by hand or via self-adaptive strategies:

l find and exploit locally structured parts,

l find and hide locally anisotropic parts.

While on such ‘anisotropic’ parts the usual sparse techniques will be applied, we try to exploit

the much higher performance on the other-highly structured-patches. Consequently, the in-

tention is to minimize the number of ‘sparse areas’ and to apply preferably all numerical linear

algebra tasks on such ‘structured patches’. Then, the major three tasks for realizing such a

simulation tool are:

l the design of the ‘skeleton’ for the recursive splitting into local/global levels,

l the implementation of the typical FEM facilities on the ‘low-level’ patches,

l the development of ‘reference element solvers’ on the ‘low-level’ patches.

The corresponding data, solver, and matrix structures are described in the papers [7,8] and

particularly in [ll]. The aim of this work is a (careful) description of the third task: optimization

of the ‘reference element solvers’ and the corresponding numerical linear algebra tools on the ‘low-

level’ patches. In our context, these are quadrilaterals (2D), respectively, hexaeders (3D) which

are discretized with (logically equivalent) tensorproduct meshes. That means, we can (!) apply

linewise or rowwise numbering, but the local mesh size may vary arbitrarily. This optimization

procedure is split into two tasks:

l the manual task of algorithmic design and corresponding implementation w.r.t. optimal

MFLOP/s rates,

l the numerical task to derive ‘optimal’ multigrid convergence with respect to efficiency and

robustness.

While the numerical task, the optimization of multigrid convergence rates for different FEM

spaces, problem sizes, (PDE) problem types and mesh topologies (but on tensorproduct meshes),

is currently examined in [12], we show recent results of the ‘MFLOP/s optimization’ in this paper

FEAST Indices--Realistic Evaluation 1437

(see also [9] for the technical details of the Sparse Banded Blas. To be precise, we provide the

reader with:

1. ‘optimal’ MFLOP/ s rates for different numerical linear algebra tasks on generalized ten-

sorproduct meshes (for bilinear and trilinear FEM),

2. evaluation of many modern processors with respect to the ‘real’ performance of such

different basic tasks up to complete multigrid algorithms.

4. DESCRIPTION OF THE NUMERICAL LINEAR ALGEBRA COMPONENTS

All following ratings and the test software (ptest . f) are part of the internet and can be

downloaded from our homepage. In this paper, we publish the first version of more or less

complete results for (almost) all modern hardware platforms. Since everybody has access to the

data, everybody is invited to check new processors or different software environments to look for

improved FEAST indices. Our hope is that this (permanent) process of testing and rating of new

hardware components gets into an automatic loop if sufficiently many ‘test persons’ participate.

On the one hand, some ‘pressure’ on the vendors is generated, for instance, to provide the users

with optimal compiler options. On the other hand, there is a fair ‘competition’ possible between

the various hardware configurations. So, the user and also ‘client’ of such products has the chance

to compare the different test environments with regard to his own applications and with respect

to the true cost/performance relation. To provide this kind of information is definitely one of the

aims of the subsequent FEAST indices.

We will not explain the underlying tests in all details, and we will not show the results of all

comparisons: these are part of the diploma thesis of [13], respectively, see also [9] and our home-

page. Therefore, in short terms only, the following linear algebra tasks are the basic components

of the FEAST indices.

4.1. DAXPY-Like Applications

Beside the (standard) linear combination DAXPY, we also apply the (variable) linear combination

DAXPYV

y(i) = cr(i)z(i) + y(i)

which is important in banded MV multiplications on vector computers. Additionally, we check

the (indexed) variant DAXPYI

y(i) = Q(i)Z(j(i)) + y(i).

Here, the scaling factor cr is a vector acting on the components of the vector z which depend

on the index i via an index vector j(i). We apply two tests with different index vectors j(i)

which simulate moderate and stochastic jumps in the numberings. These tests are quite good

representations for the complete sparse MV applications.

The arithmetic work count for all DAXPY-like variants is defined as 2 x N, with N the number

of vector components, such that the corresponding MFLOP/s rates are determined via

2xN

CPUTIME x 10G’

4.2. Variants of Mv Multiplications

Assuming a (generalized) tensorproduct mesh with A4 vertices in each space dimension, the

resulting number of unknowns is M2 (2D), respectively, 1M3 (3D) f i we consider vertex-oriented

discretizations: here conforming bilinear, respectively, trilinear FEM. Assuming the typical 9-

point, respectively, 27-point stencil for the corresponding matrices, the resulting storage cost and

hence, the measure for the MFLOP/s rates are identical for all following techniques, independent


of the kind of MV multiplication. The test program ptest .f examines eight different basic

implementations (plus various blocking techniques, see [9] for the technical details) which all

are designed to exploit potentially ‘caching in’ and ‘vectorization’ facilities. The MV-V results

correspond to the case of arbitrary matrix entries, while MV-C represents the case of constant

band entries as typical for Poisson-like problems on (in each ‘local direction’) equidistant meshes.

Additionally, we perform measurements for a corresponding sparse MV application in the CSR

format as described above. We examine three variants of numberings: linewise numbering but

nevertheless indexed access (SPARSE), ‘two-level’ numbering (FEAT) as typical for semiadaptive

FEM simulations without local adaptivity, and finally ‘stochastic’ numbering (ADAP) of the un-

knowns being representative for fully adaptive approaches. To make this point clear: all MV

applications are performed for the same matrix. We only vary the storage and access techniques,

hereby exploiting the tensorproduct structure or not, taking into account the case of constant

entries or not. However, in all cases we define the work count to measure the MFLOP/s rates as

18 x N (in 2D),

54 x N

CPUTIME x lo6 respectively,

CPUTIME x lo6 (in 3D)

Moreover, we test the performance of matrix-vector multiplications with tridiagonal matrices

which are basic tools for certain preconditioners as the ‘linewise GS’ schemes below. Again, we

check in ptest . f the MFLOP/ s rates for variable and constant entries, and they are determined

via 6xN

CPU TIME x 106.

4.3. Tridigonal-Based Preconditioners

Assuming tensorproduct meshes, the application of ‘inverse’ tridiagonal matrices as precon-

ditioners (TRIS) can be easily performed and provides rather good convergence properties with

respect to mesh anisotropies (see [12]). We apply the ‘division-free’ variants from the previous

section, with various blocking strategies, again for variable and constant matrix entries. The

MFLOP/s rates are defined as 5xN

CPU TIME x 10” ’

Additionally, this tridiagonal preconditioner can be combined with the described tridiagonal

MV multiplications to work as ‘linewise Gau&Seidel’ (TRIGS) or ‘linewise ADI’ preconditioners

(see [9]). Taking into account the convergence studies in [12], these schemes are our actual

favorites as multigrid smoothers with regard to numerical and computational efficiency. The

MFLOP/s rates are defined as

11 x N (in 2D),

29 x N

CPUTIME x lo6 respectively,

CPUTIME x lo6 (in 3D).

4.4. Smoothers in Multigrid

Based on the previously described DAXPY-like operations, the MV multiplications and the pro-

posed tridiagonal preconditioners, we can determine the MFLOP/s rates of the corresponding

smoothing operators which all are written and implemented in the following general notation:

Here, A and C are matrices in RNxN, with C being the preconditioner, and x1, xlfl, b are

N-dimensional vectors. The parameter w is an arbitrary relaxation factor, while the indices 1

and 1+ 1 are the usual counters in iterative procedures. Our candidates for the preconditioner C are:

l C = diag(A) corresponds to Jacobi iteration (x DAXPYV),

.

l


C = ‘IRIS(A) corresponds to tridiagonal preconditioners, respectively, linewise variants of

Jacobi, C = TRIGS(A) corresponds to the ‘lower+tridiagonal’ preconditioner, respectively, linewise Gaul%Seidel.

There are several reasons why we explicitly use and optimize this form of the basic iteration

which

1.

2.

3.

is in contrast to many ‘red-black’ or other ‘multicolouring’ approaches.

This general form allows the independent splitting into the three tasks MV multiplication, preconditioning, and linear combination, which all have been optimized with respect to ‘caching in’ and ‘pipelining’. The explicit use of the complete defect Ad - b is advantageous in certain techniques for implementing complicated or ‘moving’ boundary conditions (see [6]). All components in standard multigrid, i.e., smoothing, defect calculation, step-length

control, grid transfer, are included in this basic iteration.

As an example, the MFLOP/ s rates for ‘linewise GS’ smoothing S-TRIGS(N) which consists of

DAXPY, MV, and TRIGS (taking into account the case of variable (V) or constant (C) entries) are calculated separately on each mesh level, corresponding to problem size N. They read in 3D

MFLOPs-TRIGS(N) := (54 + 29 + 2) x N

54xN 29xN 2xN MFLOPMV(N) + MFLOPTRIGS(N) + MFLOPDAXPY(N) >

x 106

In an analogous way, the corresponding MFLOP/ s rates for Jacobi or tridiagonal smoothing are estimated, based individually on the previously calculated rates in 2D and 3D with respect to the underlying numerical linear algebra components.

4.5. Complete Multigrid Cycle

Finally, we can (recursively) determine the MFLOP/ s rates for a standard multigrid cycle, consisting of m total smoothing steps (including pre- and postsmoothing steps; here: m = 2), defect calculation, grid transfer and coarse grid approximation. As an example, the actual rates for ‘line GS’ smoothing on level 1 with N(1) unknowns read for the corresponding multigrid variant M-TRIGS(N(1) > in 3D

MFLOPM-TRIGS(N(~)) :=

1.5 x (85m + 54 + 18) x N(1) 85mx N(1) 54xN(Z) 9x2xN(1) (85m+54+18) x N(I)

MFLOPS-TRIGS(N(~)) + MFLOPMV(N(~) + MFLOPDAXPY(N(I)) + o.5 ' MFLOPMVI-TRIGS(N(I-I)) x 10s

The factor 1.5 is achieved from the recursion through the applied W-cycle and can be analogously determined for other cycles and the 2D case (see [13] for the details).

4.6. Measurements of Classical Sparse Applications

In 3D, we have directly included a sparse MV multiplication for the same matrix as before, which results from a trilinear FEM discretization of a scalar Poisson-like problem and which leads to a (maximum) matrix stencil of 27. Then, the total number of vertices, respectively, unknowns is defined as N := M3. All matrix entries are sequentially stored in a long array (we perform FORTRAN77), and as usual for the CSR format (see [2]) we employ two additional index arrays for accessing the load vector. We examine three variants of numberings: linewise numbering but nevertheless indexed access (SPARSE), a simulated ‘two-level’ numbering (FEAT) as typical for semiadaptive FEM simulations without local adaptivity (which allows maximum differences in the numbering of 0(M2)), and finally a ‘stochastic’ numbering (ADAP) of the unknowns (allowing


jumps in the numbering of two neighboured vertices of order O(M3)) which is representative for fully adaptive approaches. Then, in all cases we define the MFLOP/s rates as before:

54 x N

CPU TIME x 10” ’

Unfortunately, we have missed this direct inclusion of sparse MV application in 2D (now we cannot change anymore) such that we have to simulate the analogous behaviour for the sparse

(CSR) MV multiplication via the indexed DAXPY routines DAXPYV and DAXPYI. As in the 3D case, we allow for the index vector that jumps of order O(M) (remember: N := M2 in 2D) or

O(M2) can occur which compare well with the FEAT, respectively, the ADAP results from the 3D case. These 2D tests correspond to our older ‘Elch Tests’ which have been described in [14].

5. THE FEAST INDICES

Before we explain more in detail all auxiliary indices and how to compute them, we present already the final result: the actual version of the ‘total’ FEAST index. They show our ‘total’ evaluation of most of the available processors and allow different specific rankings with respect to the explained implementation techniques and applied data structures for standard conforming FEM. Most of the notation in the following tables and figures should be self-explainable; if not

so, take a look at [13].

q 21264 WSMAP

a PWR3 (200)

121294WS(500)

PWR2bz.97 (160)

PWFw397 (160)

ORIQIN 2ooo

FEAT- Index

ULTRA30 (339)

HP C240 (233)

PWw590 (69)

n CRAY T3E (433)

w ORIGIN 2UO

n ULTRA90 (300)

rn21164WS(4cq

q 21134 PC (533)

n 21154WS(500)

q HPPA3WO

q ULTRA10 (333)

ULTRA E450 (230)

n PWReFm

n ULTRA1 (140)

n P6ntim II (253)

n R-9(75) q PentiumPro (200)

n HP735

Figure 3. The ‘total’ FEAST index: the graphical and table-based presentation of the specific 3D and 2D values is given in the Appendix. The performance in 2D is smaller than in 3D, due to the more compact matrices (see the Conclusions). Keep in mind that these are averaged MFLOP/s rates only, which may significantly differ from the results for specific problem sizes.

The first term determines the kind of processor, followed by some additional denotations: PWR for IBM POWER2 or POWER3 processors, model 590 or whatever; 21264 or 21164 denotes the ALPHA chip in DECs, with or without KAP preprocessor, being a workstation model (WS), a PC (under LINUX, but using the executable code from the workstation), or a PC under LINUX with the EGCS F77 compiler; ULTRA stands for SUN computers; ORIGIN, respectively, RBOOO and RlOOOO, denote SGIs; PI1 and PPro denote Pentium II and Pentium Pro architecture; PPC/F50 denotes the PowerPC version F50 by IBM. In brackets, further details about the actual clock rate are given. More information about the precise definition of the tested computers can be obtained from [13], or look at our homepage.

In the following sections, we will discuss only some of the ‘auxiliary’ indices in detail, while a much more complete description of the results in 3D and in 2D is given in the Appendix. In most cases, the 2D results allow the same qualitative and even quantitative conclusions. Only


in such cases that significant differences between 2D and 3D results occur, we explicitly state

the corresponding results. Additionally, keep in mind that the indices, or better the ranking

of computers, will change since new processors or different configurations (compiler, operating

system, etc.) will be added. Therefore, check out the given internet address. This webpage

contains also a full-color presentation of the results which may be helpful in better viewing the

indices. All calculations for the evaluation of the included MFLOP/s rates are performed for several

levels 1 of (global) mesh refinement. We set for the number N(1) of unknowns

N(1) = 652, N(2) = 12g2, N(3) = 2572, N(4) = 5132, N(5) = 10252, (in 2D),

N(1) = 173, N(2) = 333, N(3) = 653, (in 30).

Higher refinement levels in 2D are somewhat unrealistic. Furthermore, the storage costs are

already far beyond typical cache sizes such that the effect of further increasing the problem size

can be easily extrapolated. In 3D, one may state that level 1 = 3 with N(3) = 653 = 274,625

unknowns does not seem to correspond to a ‘fine’ mesh width, but one has to take into account

that on this level 3, the storage of the stiffness matrix (bandwidth 27) plus index arrays for

the sparse techniques consumes already (almost) 128 MByte of RAM. This example shows that

we discuss not only differences with respect to computing efficiency but also with regard to

storage cost. Based on the MFLOP/s rates for the basic and composite numerical linear algebra

components from the previous section, we can define the following (auxiliary) indices which are

part of the shown FEAST indices.

5.1. The Sparse FEAT and ADAP Indices

These indices are measures for the computational efficiency with respect to classical sparse

techniques, and they are based on the previously defined MFLOP/ s rates for different numberings

in the sparse (CSR) MV multiplication. As typical for all following indices, the rates from the

different levels 1 are weighted by special factors c(l). These are for the FEAT index in 2D and in

3D:

c(5) = SO%, c(4) = 25%, c(3) = lo%, c(2) = 4%, ~(1) = I%,

respectively, c(3) = SO%, c(2) = 16%, ~(1) = 4%,

and for the ADAP index:

c(5) = 40%, c(4) = 30%, c(3) = 15%, c(2) = IO%, c(l) = 5%

respectively, c(3) = 75%, C(2) = 20%, C(l) = 5%.

While the definition of both indices in 3D is straightforward, we have to do some modifications

in the 2D case as explained before. To be precise, the resulting total FEAT index in 2D is

calculated as an average between the DAXPYV and DAXPYI values (with ‘moderate’ jumps of order

O(M)), and in 3D between the SPARSE and FEAT values. Analogously, the ADAP index in 2D

is the average between both DAXPYI values (with the two kinds of allowed jumps in numberings)

and in 3D between the FEAT and ADAP values. All these values are summed up with the special weighting factors accordingly to each mesh level.

The corresponding results of the FEAT and ADAP indices can be found in the Appendix. The

computations demonstrate the dependence on the problem size and the kind of indexed access due

the described numbering strategies: the resulting MFLOP/s rates are quite slow compared with

the following results for performing the Sparse Banded B&-like MV multiplication with the same (!!!) matrix. Beside DEC’s and IBM’s top models, the SGIs and particularly the PENTIUMs


lead to (relatively) good results. However, we compare 20 MFLOP/s as best values with less

than 10 MFLOP/s for many other processors. In contrast, POWER2 models (IBM) and older

DEC’s (21164, CRAY T3E) deteriorate on the finest level 3 to about 10 MFLOP/s only. This

shows impressively that some of our ‘preferred’ machines may have, more or less, problems with

the described sparse techniques since the underlying cache architectures lead to massive cache

misses and hence, to performance losses. Further increasing of the number of unknowns can lead

to even still slower performance results.

5.2. The DAXPY Index

For calculating the DAXPY index, the DAXPY rates are averaged over all mesh levels. The

corresponding weights for each mesh level are differently defined, in 2D as well as in 3D:

c(5) = 70%, c(4) = 20%, c(3) = S%, c(2) = 3%, c(l) = I%,

respectively, c(3) = 82%, c(2) = 15%, c(l) = 3%.

5.3. The FEAST-mv-v and FEAST-mv-c Indices

These MFLOP/s rates are based on our optimized Sparse Banded Blas software [9] and have

to be compared with the previous sparse FEAT and ADAP indices. FEAST-mv-v denotes the

results with variable matrix entries, while FEAST-mv-c measures the even higher MFLOP/s

rates for the (special) case of constant band entries (see the explanations in the previous section).

Like the DAXPY index, they are also part of the subsequent FEAST-v and FEAST-c indices

(see later). The corresponding weighting factors c(i) are identical as for the described DAXPY

index.

Figure 4 shows the corresponding 3D results for the different matrix-vector multiplications

(MV-C, MV-V, FEAT, ADAP) on the finest mesh level 3: performance differences of almost

a factor of 50 get visible for certain computers as the IBM PWR2s. While modern worksta-

tion processors show a huge potential of supercomputing power for ‘structured’ data, they lose

for ‘unstructured’ data in combination with sparse MV techniques, particularly compared with

PENTIUMs. The complete presentation of these indices is part of the Appendix.

5.4. The FEAST-mglgs-v and FEAST-mglgs-c Indices

These MFLOP/s rates estimate the resulting performance of one complete multigrid sweep with

one pre- and one postsmoothing step. The examined variants in ptest .f are FEAST-mgjac-v

and FEAST-m&c-c for Jacobi smoothing, FEAST-mgtri-v and FEAST-mgtri-c for tridiagonal

smoothing, and our preferred combination FEAST-mglgs-v and FEAST-mglgs-c with the de-

scribed ‘line GauBSeidel (TRIGS) smoother: they are the most essential part of the subsequent

FEAST-v and FEAST-c Indices. The corresponding weighting factors c(i) are again identical as

for both previous indices.

Figure 5 shows the corresponding results for complete Sparse Banded Bias multigrid algorithms

which are optimized with respect to numerical efficiency and robustness (see [12]). They are

recently our ‘best’ multigrid work horses on such tensorproduct meshes. In the case of ‘constant’

bands in the matrices (see the results in the Appendix), the results further improve significantly.

5.5. The ‘Auxiliary’ FEAST-v and FEAST-c Indices

These indices measure the resulting performance of our Sparse Banded Bias tools if they are

purely applied on meshes which consist of macros (M quadrilaterals/hexaeders) only and which

all are discretized via the described generalized tensorproduct meshes. For recent examples and


II 21264 WYKAP

q 21264 ws (500)

n PWR3 (200)

~ H PWR2/590 (66)

q ORIGIN 200

n 21164wS(400)

n 21164pC(533)

,3121v34wsp0)

IHPPA6OW

21164 Linux (533)

n ULTRAI (140)

rnpwRneo

n Pm(ium II (266)

n ~6000 (75)

Pd!umPro (200)

HP 735

21264 WS/KAP

I21264 ws @Jo)

PWR2/397 (160)

/I. “I ULTRA60 (360)

n ORIGIN 200

n 2116.4 ws (400) q 21164 PC (533)

IHPPA~wo ULTRA60 (300) ~ ULTRA10 (333) ~ Rloooo (195)

_, ULTRA E450 (250)

ULTRA1 (200)

21164 Linux (533)

PenthI II (400)

n umbv (140)

n ~~~560

n pMtkm II (266)

q RBOOO (75) PsntlumPro (200)

HP 735

/ q 21264 WS/KAP

~ q 21264 ws (500)

r..J ULTRA60 (36.0)

1 i:j HP C240 (236)

FEAT - Level 3 3D

q pwR2/590(86)

n ORIGIN 200

n 21164ws(400) ,m2i154Pc(533) /~PllMWS(500)

IHPPA6000

ULTRA60 (300)

ULTRA10 (333)

Rloooo (195)

ULTRA E450 (250)

21164 Llnux (533)

q Pentturn II (400)

q ULTRAI (140)

n ~~~560

n psntium II (266)

q ~6000 (75)

PentiumPro (200)

HP 735

I@( 21264 WS/KAP

E# ORIGIN 2000

i ! ULTRA60 (360)

HP C240 (236)

CAAY T3E (433)

n PWR~/~SO (66)

li ORIGIN 200

~h21164 WS (400)

lm21164PC(533)

!1121164WS(500)

q HP PA6000

ULTRA60 (300)

ULTRA10 (333)

FM0000 (196)

E-2 ULTRA E450 (250)

ULTRA1 (200)

21164 Linux (533)

Psnthnm II (400)

q ULTRA1 (140)

n PWRI560

!i Pentium II (266)

q ~6000 (75)

n PentiumPro (200)

HP 735

Figure 4. The figures show the MFLOP/s rates for the discussed MV multiplications (MV-C, MV-V, FEAT, ADAP) on mesh level 3 in 3D. There are huge differences possible of up to a factor of 50 if d@erent MV techniques are applied to the same matrix.


n 21284 WS (500) q ULTRA0 (300)

_-- _~

I w PWR2K.90 (66) n PWwM)O

YGLGSN - Level 2 3D

, ~---__ g 21264 WSIKAP

q 21254 ws (500)

n PWRJ (200)

PwR2/397 (160)

/rnpWRV5so(66) q ORIGIN 200

n 2llS4WS(400)

I~21164Pc(533)

n HPPABWO

ULTRA60 (300)

q ULTRA10 (333)

RIO000 (185)

13 ULTRA E450 (2501

ULTRA1 (200)

21154UnuX(533)

I Pentium II (400)

n ULTRA1 (140)

n PWMSO

q Pennurn II (265)

IRsooO(75) I PentiumPro (200)

EHP735

q 21264 WSMAP

* 21264 ws (500)

I PWRI (200)

~121164w5(400)

/R2llfMPC(533)

I !! ?!!!!5L%_

ULTRA1 (200)

21154Linux(533) i I %mJm II wo) W ULTRA1 (140)

)

n wwsao ;

q PenuMl ” (2w ~ III-cm / n PenuumPro (200)

&HP735

Figure 5. FEAST-mglgs-v index in 3D. The first row shows the FEAST-mglgs-v index, followed by the specific results for the M-TRIGS routines on the different mesh levels. The computations show that not only MV multiplications can be efficiently applied, but also complete multigrid algorithms with a very robust smoother inside which works very efficiently for regular as well ss very anisotropic meshes (see [12]). Again, the results are measured MFLOP/s rates.


ii iii.34 WSKAP

-______ ~-~~ mHPPA6006

/m21264ws(500) •~LTRA~OPO~ n PWR3 (200) ULTRA10 (333)

~- 0 PWR2/397 (160) RlOoOa (195)

ULTRA E450 (250)

ULTRA1 (200)

21164 Llnux (533)

Pentium II (400)

q Pentium II (266)

n 21164WS(400) q R6aOO(75)

PentiumPro @Xl)

FEAST/C-total 3D

200

175

150

125

100

75

50

25

0

FEAST/V-total 3D

II 21264 WSKAP

) I21264 ws (500)

w PWR3 (200)

~ i 1 ULTRA66 (360)

;;; HP C24U (236)

q ORIGIN 200

q 21164WS(400)

q 21164 PC (533)

q 21164 WS (500)

q HP PA6006

ULTRA60 (300)

ULTRA10 (333)

A10000 (195)

ULTRA E450 (250)

k$ ULTRA1 (200)

21164 Linux (533)

H Pentium II (400)

I ULTRA1 (140)

n PWFwo

q Pentium II (266)

R6000 (75)

PentiumPro (200)

HP 735

Figure 6. FEAST-v and FEAST-c indices in 3D.

actual test implementations of the FEAST package, see [11,15]. Both indices are averaged val-

ues, that means 50% FEAST-mglgs-v, 25% FEAST-mgtri-v, 10% FEAST-mgjac-v, 10% FEAST-

mv-v, and 5% DAXPY, respectively, the analogous values for the ‘C-versions. Again, each mesh

level is differently weighted via

c(5) = 70%, c(4) = 20%, c(3) = G%, c(2) = 3%, c(l) = l%,

respectively, c(3) = 82%, c(2) = 15%, c( 1) = 3%.

Figure 6 shows the corresponding 3D results which are ‘averaged estimates’ for the processors if

purely applied to the highly structured patches on the low-level parts of FEAST. In the Appendix,

we give a direct comparison between the corresponding 2D and 3D results which demonstrates the

potentially higher MFLOP/s rates in the 3D applications due to the ‘wider’ matrices. Since the

typical bandwidth is 27 instead of 9 in 2D only, the applied cache strategies seem to work better.

Moreover, they also give an impression of the difference in the resulting computing performance

if such tensorproduct strategies-FEAST-c and FEAST-v-can be applied in comparison to the

standard sparse techniques (see the FEAT and ADAP indices). However, these are only averaged

MFLOP/s rates while the actual difference with respect to a special problem size may be even

much greater as the previous examples have shown.

5.6. The ‘Total’ FEAST Indices

The final FEAST indices can be collected which are defined as follows. The corresponding

‘total’ FEAST index-as average of both 2D and 3D FEAST indices-had been shown in the

beginning of this section.

FEAST := 60% FEAST-v + 20% FEAST-c + 15% FEAT + 5% ADAP.

1446 S. TUFCEK et al.

6. (EXCERPTS FROM) THE FEATFLOW BENCHMARK

Before we come to the conclusions from the previous FEAST indices, we still have to discuss

the following major question in this context of benchmarking.

How realistic is our evaluation of the processors via the FEAST indices, if we compare

with results from realistic applications, particularly computed with ‘production codes’?

The FEATFLOW Benchmark is such set of test calculations which are very similar to the

mentioned CFD Benchmark ‘flow around a cylinder’ [5]. The interesting aspects are the total

CPU cost for each computer type, and how they are distributed to the separate tasks in a

CFD code, i.e., mesh generation, assembling of matrices and right-hand sides, solving linear

subproblems or postprocessing steps.

The precise configurations are described at our homepage; here we only provide the reader

with the most important details. The complete description with all mathematical details of

discretization and solver can be found in [6]: we apply a coupled multigrid approach with the

so-called Vanka smoother in a direct steady solver (CCSD), and the discrete projection scheme

as nonstationary scheme (PP2D). All calculations are performed with the nonconforming rotated

multilinear finite elements. Additionally, we apply stabilization techniques of upwind (UPW)

or streamline diflusion (SD) type for the convective term, and we use SOR as smoother in the

multigrid solver for the scalar subproblems if applying the projection scheme. The following

abbreviations for the elapsed time in seconds are used, with ‘total’ denoting the complete elapsed

CPU time:

PRO N mesh generation, initialization phase and postprocessing,

LC N modifications of matrices (in sparse storage technique) or right-hand sides,

MAT N CPU time for matrix generations,

C-MG N elapsed time for multigrid in the CCSD test with the Vanka smoother,

U-MG N CPU times for solving velocity, respectively, pressure subproblems via multigrid,

P-MG if the discrete projection scheme PP2D is applied.

Exemplarily, we show the results of some selected computers for the direct stationary approach

CC3D and for the nonstationary scheme PP2D. All tests require about 128 Mbyte of RAM. The

complete results can be found at our homepage in the internet. All ‘FEATFLOW users’ are

invited to participate in this special computer benchmark which is part of the freely-available

FEATFLOW software so that the run of this set of tests can be easily performed by everybody.

Table 6. (Selected) results for test case CC3D-SD.

This Vanka smoother inside of the coupled multigrid solver CCSD (for velocity and pressure simultaneously) is a memory-access intensive process which explains why the tested IBMs, HPs,

and DECs are only slightly faster than the Pentium II PC. The rates are quite in good agreement

with the previous sparse FEAT and ADAP indices and show that for such kind of ‘index-oriented’

FEAST Indices---Realistic Evaluation 1447

techniques, PENTIUM II processors lead to very satisfying results, particularly regarding the

cost/performance relation.

Table 7. (Selected) results for test case PPPD-UPW.

In contrast, PP2D is based on operator-splitting ideas which require the numerous solution

of scalar PDE problems for the velocity, respectively, the pressure components. Since these

kinds of tasks require much more arithmetic operations in comparison to memory accesses, the

processors of type IBM, HP, DEC, and SUN are significantly faster than the Pentium II processor.

However, since this code is implemented without the optimized Sparse Banded Blas MV tools,

we are still far away from the potentially available performance rates from the previous FEAST

indices. These studies indicate how the future generation of CFD solvers has to look like, such

that very sophisticated and powerful numerical methods can be combined with the available high

performance rates on recent processors. For a mathematical and algorithmic discussion of such

concepts, look at [G,16]. Having exclusively such simulation results with sparse techniques, there

would not be any need for other processors than INTELs.

7. CONCLUSIONS FROM THE FEAST INDICES

2D vs. 3D Indices

The results from the 2D tests compared with the analogous measurements in 3D are very

similar: the IBM processors and the new DEC 21264 are the best, with respect to the total ranking

and in most cases with regard to the specific auxiliary indices, too. However, the corresponding

MFLOP/s rates are in 3D somewhat higher, at least for processors with ‘large’ (leve12) caches.

The explanation might be that the applied matrices are less ‘sparse’, with 27 instead of nine

bands, such that cache-blocking strategies can be better optimized. On the other hand, computers

‘without’ large second level (L2) cache, as for instance the IBM PWR2s and probably pure vector

computers, do not improve.

As a consequence, these IBM workstations based on the PWR2 and the P2SC processors

might be favorable for codes which have not yet incorporated such cache-based optimization

strategies. On the other hand, also for certain nonconforming FEM approximations or cell-

centered discretizations which generally lead to even more sparse matrices (with 5- or 7-point

stencils only), these kinds of processors may be the ‘winners’. In contrast, for higher-order

discretizations as bi- or triquadratic (FEM) discretizations, the cache-oriented processors will

even more improve.

‘Optimality’ of the Indices

The following results have been directly provided by the vendors such that we assume that ‘optimal’ compiler options and hardware components have been utilized:

l DEC 21264 and DEC 21164 workstation with 400 MHz,


l IBM POWER3 and P2SC, IBM F50 (PowerPC),

l SGI ORIGIN 2000 with 250 MHz,

l SUN ULTRA 60 with 300 MHz and 360 MHz,

l HP V2250.

All other processors have been tested by us. So, everybody is invited for improving the indices

of these models (and also the ‘vendor-tested’ versions). In particular, the use of other FORTRAN

compilers (F90, EGCS, GNU-FORTRAN, etc.) or different programming languages appears to

be interesting and, in fact, such tests have been already performed by us (see [13]). For instance,

compare in 3D the MFLOP/ s rates of DEC-PC and DEC-LINUX which are identical computers.

While on the DEC-PC variant, the executable code has been compiled on a workstation environ-

ment with the original DEC compiler and then copied to the PC-target computer, in the case of

the DEC-LINUX configuration, the executable code has been directly generated via the EGCS

F77 compiler and then executed under LINUX. The differences are very significant!

Further improvements do not have to be to restricted to examining the hardware components;

it will be also interesting to modify the test software ptest . f. So far, we have implemented in

the Sparse Banded Blas several versions with different blocking strategies, and we have tested

some specified blocking parameters: however, some modifications of these routines are allowed

and even desired, as long as they are applicable to such special g-point, respectively, 27-point

matrices which arise from conforming FEM discretizations on generalized tensorproduct meshes.

Additionally, the specific adaption with respect to certain machine-dependent values as the pre-

cise cache sizes or number of registers has not been applied up to now. Hence, there is still

a large potential to gain further improvements for such low-level numerical linear algebra rou-

tines. Look also at the ‘data locality’ research project of Riide et al., which is well described at

http://wwwbode.informatik.tu-muenchen.de/Par/arch/cache/index.html and which con-

tains a lot of further literature and activities in this field.

FEAST Indices vs. Application- and Problem-Specific Results

The shown FEAST indices-total or auxiliary-are only averaged values. For many applica-

tions, it might appear to be much more realistic if the corresponding results for specific tasks and

particularly for different problem sizes are explicitly considered. Then, the observed differences

between the processors may be much larger than the FEAST indices do indicate. For instance,

the DEC 21164 processor shows excellent performance for ‘small’ problems, with dramatic losses

of performance if the problem size increases. So, it might be much more favourable to work on

‘many’ small patches (with 500 MFLOP/ s instead of ‘few’ larger patches, with 50 MFLOP/s )

only. Therefore, watch out for your own specific problem configuration! The previous FEAT-

FLOW Benchmark has shown that the corresponding results are significantly dependent on the

numerical method and solution algorithm, and that the differences in the resulting performance

may be much weaker. Nevertheless, keep in mind that those FEATFLOW results have been ob-

tained via classical sparse techniques only, in contrast to the architecture-optimized results of the

FEAST indices. Additionally, if one performs similar tests with recent object-oriented languages,

as for instance C++, you should compare with those optimized FEAST results which really ex-

ploit the FORTRAN facilities. So, there is a good chance for a true comparison of FORTRAN77

with C++ or similar.

IBM POWER2 AND POWERS. While the tested POWER2 and especially P2SC variants show an excellent behaviour for the sparse banded techniques, they demonstrate dramatic losses

of performance for the sparse applications. As explained before, they seem to be quite robust

with respect to ‘cache-optimized’ problems, while they may ‘lose’ against processors with larger

cache-architectures if the matrices get less sparse. Therefore, they belong to our favorites for

applications in FEAST-and also FEATFLOW-which are mainly based on the described gen-

eralized tensorproduct meshes on the lowest level.


In contrast, the new POWER3 processor appears to be a very robust and efficient ‘work horse’

which can be applied for classical sparse as well as for our optimized highly structured numerical

linear algebra components. This computer is, beside the DEC 21264, the clear winner at the

moment.

IBM POWERPC F50. This ‘low cost’ model can be viewed as rival of PENTIUM II PCs, with

respect to price and performance. Therefore, look at our ratings of the PENTIUM world.

DEC 21164 AND 21264, CRAY T3E. The older 21164 processor shows huge performance

problems as soon as the problem size increases: up to now, none of our cache-optimization

strategies seems to work fine, in contrast to (almost) all other platforms. Similar problems are

valid for the CRAY T3E (Stuttgart) which is based on a variant of the 21164 processor. It might

be necessary that further optimizations are performed directly by CRAY.

On the other hand, the new 21264 processor improves in a significant way: having the same

clock rate of 500 MHz, the speed-up is almost a factor of 3. Using the KAP preprocessor, the

performance can be further increased; however, it is not clear to us if this additional speed-

up can be achieved in more complex numerical simulations, as for instance, in FEATFLOW.

Nevertheless, with or without this KAP preprocessor, the DEC 21264 chip is the clear winner

beside the IBM POWER3 processor.

HP, SGI, and SUN. The performance of the ‘top models’ of these vendors-and probably the

prices, too-is very similar. The processors seem to be quite robust with respect to sparse MV

applications while they cannot achieve the same high MFLOP/ s rates for highly structured data

as IBM and DEC. On the other hand, it might be expected that the prices are correspondingly

cheaper, too. Therefore, these models are typical ‘mid-range work horses’ for most numerical

simulations, especially during code development and code testing. However, it is not clear if they

will provide the computing power which seems to be necessary for many ‘real life’ problems. At

the moment, we are not sure about the ‘real’ quality of the HP V2250; we can only hope that

the corresponding 3D test will be performed to guarantee a better rating of this model.

INTEL PENTIUM II. One of our problems with this processor is that we could not get into

contact with INTEL to let them perform their own optimizations, maybe with special compiler

options. Additionally, we could only apply the (freely available) EGCS F77 compiler. At the

moment, this processor type seems to be quite ‘insensible’, regarding cache optimization strategies

as well as with respect to performance degradation through sparse MV techniques. Based on the

cheapest price, this hardware selection is the clear winner if highly unstructured data and sparse

MV techniques are applied, for instance, in fully adaptive FEM simulations. However, this is

more due to the huge losses of the other processors than based on the own high performance: we

compare 20 MFLOP/s with less than ten MFLOP/s. On the other hand, we have not figured

out so far ‘good’ optimization strategies for highly structured data such that our measured

performance values are often slower by a factor of 5 and even more. Nevertheless, we claim

that for most available software tools which are based on sparse techniques (or which do not use

FORTRAN or C.), the choice of PENTIUM II processors may be preferable.

Actually, the specific choice of optimal (?) hardware can be determined via following ‘thumb

rules’.

l For fully unstructured data and corresponding sparse MV techniques, all hardware plat-

forms show slow performance only. The implementation of corresponding tools is quite

easy and straightforward, especially if available software tools as, for instance, SPARSE

BLAS [l], SPARSKIT [2], NISTBLAS [17], or similar are employed. So, the time and

work for the ‘code development’ may be quite fast, but it seems to be impossible to obtain

high performance rates since the resulting efficiency is mainly due to the cost of memory

access and less due to the possible performance of the processors. Therefore, we recom-

mend the use of PENTIUM PCs since they are by far the cheapest ones. Additionally,


the choice of the programming language might appear to be quite unimportant since no

special facilities can be exploited, not even by FORTRAN. Nevertheless, our experience is

that even for such applications FORTRAN77 tools can be significantly faster than C++

codes.

l On the other hand, if one spends more time and work in software concepts and correspond-

ingly in special numerical approaches which are able to exploit ‘caching in’ and ‘pipelining’,

the use of modern workstation processors and corresponding optimized FORTRAN com-

pilers is advisable. At the moment, only these hardware/software combinations seem to

be able to really exploit a higher percentage of today’s high performance rates. However,

as the FEAST project shows, the design and development of such simulation tools can

be much harder, but the final gain in CPU time and, hence, the gain of validity for the

simulation data will be great.

Based on the actual FEAST indices, we clearly prefer both IBM’s and DEC’s top

models. The difference in single processor performance appears to be a factor of ten in

the maximum, such that one might propose to take ten processors of a slower type to gain

similar performance. However, keep in mind that parallelization is always a hard job, with

regard to numerical design, implementation tasks and also stability of the hardware. If one

really would be successful to apply a complete numerical simulation with several hundreds

of MFLOP/s on one single processor with 4 GByte RAM, it will be very hard to beat this

performance on a parallel system, especially if very ‘strong’ numerical components are

used which are often specifically adapted for sequential runs.

l The performance of future processors will change, but looking at the ‘historical’ develop-

ment, it seems that certain processor characteristics remain preserved: the IBM processors

have been and still are very efficient for highly structured data, while SUN, SGI, or HP

are the typical ‘robust work horses’ for all kinds of numerical simulations. So, we believe

that the specific characteristics of the different processor families will remain valid in some

sense for the near future.

Therefore, we end with the following final remarks.

1. Do not base your rating of computers exclusively on the clock rates or other technical

details.

2. Make your own tests or look at performance ratings which are representative for your

specific applications and needs.

3. Always keep in mind that there are numerous hidden traps to lose computing performance.

In fact, it is and it will be even more and more difficult to achieve a significant percentage

of the growing peak rates on modern processors.

8. ACKNOWLEDGEMENTS

The recent state of the FEAST indices is due to the successful work of many people. Let me

mention the FEAST group and especially Alex Runge who is the major person to evaluate and

particularly to prepare the specific results in text-based and graphical ways.

Additionally, we were very pleased about the spontaneous participation of the processor ven-

dors and the help of many other persons: M. Zahn from Rechenzentrum der Universitat Augs-

burg, M. Claus from Rechenzentrum der TU Chemnitz, A. Meyer from Universitat Chemnitz,

U. Rude from Universitat Augsburg, B. Mollenhoff from Rechenzentrum der Universitat Heidel-

berg, R. Wehrse from Universitat Heidelberg, J. Porta, H. Holthoff and C. Pospiech from IBM,

W. Hahn from HP, R. Lange from SGI, U. Graf from SUN, J. Pareti from DEC/COMPAQ, and

many more. In advance, let us already thank all future test persons.

Without their help, the more or less complete data bank with respect to today’s available

computer systems would not have been possible. On the other hand, also the next generation

of processors, computer memory and compilers has to be continuously tested. Therefore, if


sufficiently many ‘test persons’ actively participate in this processor benchmarking, there is a good

chance to create a quite simple but nevertheless realistic rating of modern hardware platforms,

particularly in addition to the well-known LINPACK or SPEC benchmarks. So, please feel free

to contribute ‘new’ indices to this new data bank, which together with the required software can

be found under the FEATFLOW homepage http: //www . iwr . uni-heidelberg . de/-f eatf low.

REFERENCES

1. BLAS Technical Forum, http: //www.netlib. org/cgi-bin/checkout/blast/blast .pl. 2. Y. Saad,SPARSKIT,http://uww.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html. 3. U. Riide, Technological trends and their impact on the future of supercomputers, In High Performance Sci-

entific and Engineering Computing, (Edited by H.-J. Bungartz, F. Durst and C. Zenger), LNCSE, Springer- Verlag, (1999).

4. H. Hellwagner, U. Riide, L. Stals and C. Wei5, Data locality optimixations to improve the efficiency of multigrid methods, In Proc. 14 th GAMM Seminar ‘Concepts of Numerical Software’, Kiel, January 1998, NNFM, Vieweg, (1998).

5. M. Schlfer and S. Turek, Benchmark computations of laminar flow around cylinder, In Flow Simulation with High-Performance Computers II, Volume 52 of Notes on Numerical Fluid Mechanics, (Edited by E.H. Hirschel), pp. 547-566, Vieweg, (1996).

6. S. Turek, Efficient solvers for incompressible flow problems: An algorithmic approach in view of computational aspects, LNCSE 6, (1999).

7. S. Turek et al., Some basic concepts of FEAST, In Proc. 14 th GAMM Seminar ‘Concepts of Numerical

Software’, Kiel, January 1998, NNFM, Vieweg, (1998). 8. S. Turek et al., On the realistic performance of components in iterative solvers, In High Performance Scienti$c

and Engineering Computing, (Edited by H.-J. Bungartz, F. Durst and C. Zenger), LNCSE, (1999). 9. S. Turek et al., Proposal for Sparse Banded BLAS techniques, (available via internet) (to appear).

10. The national technology roadmap for semiconductors, 1997 edition, http: //www. sematech. erg/public /roadmap/index.htm.

11. C. Becker, FEAST-The realization of finite element software for high-performance applications, Ph.D. Thesis

(to appear). 12. M. Altieri, Robuste und effiziente Mehrgitter-Verfahren auf verallgemeinerten Tensorprodukt-Gittern, Di-

ploma Thesis (to appear). 13. A. Runge, Zur realistischen Leistung von Software-Komponenten und modernen Prozessoren zur numerischen

Simulation von partiellen Differentialgleichungen, Diploma Thesis (to appear). 14. S. Turek, Konsequenzen eines numerischen ‘Elch Tests’ fiir Computersimulationen, Technical Report SFB

359, University of Heidelberg 46, (1998). 15. S. Kilian, Efficient parallel iterative solvers of ScaRC-type and their application to the incompressible Navier-

Stokes equations, Ph.D. Thesis (to appear). 16. S. Turek et al., Trends in processor technology and their impact on Numerics for PDE’s, (available via

Internet) (to appear). 17. The NIST Sparse BLAS, http: //math.nist .gov/spblas. 18. S. Kilian and S. Turek, An example for parallel ScaRC and its application to the incompressible Navier-Stokes

equations, In Proc. ENUMATH-97, Heidelberg, October 1997, World Science, (1998).


APPENDIX

SOME FEAST INDICES

Table 8. The ‘total’ FEAST index in 3D and its (major) auxiliary indices. The figure gives a graphical representation of the ‘total’ FEAST index in 3D whereat the values correspond to averaged MFLOP/ s rates. Additionally, the table shows the results of the (major) auxiliary indices which have been explained in Section 5; their graphical presentation and a direct comparison to the 2D values is given later. Keep in mind that these are averaged MFLOP/s rates only, which may significantly differ from the results for specific problem sizes.

Computer FEAST FEAST-v FEAST-c FEAT ADAP

DEC-WS-21264-500MHz-KAP 240 220 476 66 47

DEC-WS-21264-500MHz 203 185 387 80 54

IBM-PWR3-200MHz 199 172 410 77 43

IBM-PWR2SC-397-16OMHz 161 153 297 58 21

IBM-PWR2SC597-160MHz 159 154 289 52 19

SGI-ORIGIN2000-250MHz 127 106 275 43 3G

SUN-ULTRA60-360MHz 116 90 276 34 26

HP-C240-240MHz 113 90 261 39 26

CRAY-T3E(STUTTG.) 89 78 185 31 11

IBM-PWR2-590-66MHz 86 87 140 31 17

DEC-WS-21164-400MHz 84 68 187 32 22

SGI-ORIGIN200-180MHz 84 71 174 34 27

SUN-ULTRA60-300MHz 79 69 162 28 21

DEC-PC-21164-533MHz 79 62 181 31 17

DEC-WS-21164-500MHz 78 63 186 18 14

HP-PA8000 75 58 179 25 15

SUN-ULTRA21-333MHz 65 51 148 28 20

SGI-IOK-IP25-195MHz 62 47 150 18 13

SUN-ULTRA450-250MHz 59 45 138 21 14

DEC-PC-21164-533MHz-EGCS 57 48 121 21 14

SUN-IJLTRAl-200MHz 56 44 129 23 15

PENTIUMII-400MHz 52 45 102 26 21

SUN-ULTRAl-140MHz 39 33 83 16 8

IBM-PWR-580 39 39 63 14 7

SGI-RSOOO-75MHz 35 25 92 11 8

PENTIUMII-266MHz 35 30 69 16 14

PENTIUM-PRO-200 23 18 51 10 7

HP-735 19 15 45 8 5


Table 9. The ‘total’ FEAST index in 2D and its (major) auxiliary indices. The figure

gives a graphical representation of the ‘total’ FEAST inf!iex in 2D, while the table shows the results of the (major) auxiliary indices; a direct comparison with the 3D values is given later. One remarkable result is that the performance in 2D is smaller than in 3D, due to the more compact matrices (see the Conclusions).

w PPcF5o (333)

Computer FEAST FEAST-v FEAST-c FEAT ADAP

DEC-WS-21264-500MHz-RAP 165 153 313 55 52

IBM-PWR3-2OOMHz 152 140 262 85 62

IBM-PWR2SC-597-160MHz 143 142 233 63 38

IBM-PWR2SC-397-160MHz 142 141 233 64 38

DEC-WS-21264-5OOMHz 132 121 248 53 49

HP-V2250 118 108 229 38 34

HP-C240-240MHz 89 82 166 32 30

SUN-ULT~60-36OMHz 89 78 171 41 38

IBM-PWR2-590-66MHz 85 88 128 36 25

SGI-ORIGIN2000-250MHz 84 70 175 33 32

CRAY-TJE(STUTTG.) 71 58 147 36 20

SUN-ULTRA60-3OOMHz 68 61 122 35 32

HP-PA8000 61 57 116 22 17

SGI-ORIGIN200-180MHz 61 54 123 24 22

DEC-PC-21164-533MHz 59 48 129 21 19

DEC-WS-21164-500MHz 58 49 122 20 20

DEC-WS-21164-400MHz 56 49 111 20 21

SUN-ULTRA21-333MHz 50 4;2 98 23 24

SUN-ULTRA450-250MHz 49 42 98 19 20

SUN-ULTRAl-200MHz 48 43 93 19 17

PENTIUMII-4OOMHz 45 41 80 24 21

SGI-RlOOOO-IP25-195MHz 41 35 a7 14 14

PWRPC-F50-333MHz 36 31 74 12 10

IBM-PWR-580 32 31 53 15 9

SUN”ULTRAl-14OMHz 32 29 60 13 10

PENTIUMII-266MHz 29 27 51 15 13

SGI-R800~75MHz 28 23 56 13 15

PENTIUM-PRO-200 17 16 33 6 6

SGI-RlOQOO-150MHz 16 12 41 5 7

HP-735 16 14 32 5 5


B21264WslKAP n nPPAsw2 - --- H21284 WE (5a-J) mlJLrruw(s2sJ

l.nmAE4ra(2E4

_-----. ~_

PEAT- L#Jvals20

Figure 7. FEAT index in 3D. The first row shows the FEAT index, followed by the specific results for the FEAT MV routines (see Section 4) on different mesh levels. The computations demonstrate the dependence on the problem size and the kind of indexed access via the described numbering strategies (compare with the following ADAP index.). The main results are that the resulting MFLOP/s rates are quite slow compared with the following results for performing the Sparse Banded Blas-like MV multiplication with the same (!!!) matrix. Beside DEC’s and IBM’s top models, the SGIs aud particularly the PENTIUMs lead to (relatively) good results. In contrast, POWER2 models (IBM) and older DECs (21164, CRAY T3E) deteriorate on the finest level 3 to about ten MFLOP/s only. The actual ordering of the processors is due to the total FEAST index and shows impressively that some of our ‘preferred’ machines may have, more or less, problems with the described spume techniques due to massive cache misses. Further increasing of the problem size can lead to even still slower performance results (compare Level 3 vs. Level 1).


/D21254 WSKAP

~~212%WS(5oo)

n PWRB (200) 3 PwFui397 (160)

/ n PWR7f5e7 (leo)

1L.i ULTRMY) (3eo)

1 E HP C240 (235)

n HPPMWO

I ULTRA60 wo)

q ULTRAIO (333)

Rlwoo (1%)

ULTRA E45o (25o)

VLTRAI (2a)

211&(uM(533)

n Pnntium II (4oo)

q ULTRAI (14)

iiimoomi n PNlnumPm wol q HP735

I ACAP. LawI 3 3D

21264 WSKAP

21284 ws (500)

PWFw397 (150)

PWR2l597 (160,

‘13 HP C240 (236)

CRAY T3E (133)

lrn PWR2l580 (88)

IIORlGlN 200

/m2ii84Ws(400)

lm21184PC(533)

I ~2”64.%E!!~~_

q HP PA80oo

“LTRABO (300)

ULTRA10 (333)

RlOOOO (IES,

I3 ULTRA E&Jo (250)

UB ULTRA, 1200,

21184 Lh”, (533)

w pe”tIYln II (.OO)

I ULTM.1 (140)

I PWR/S80

q P*ntium II (266)

n RlOW (76)

n P~nemlPro (200,

llHP735

‘I

~~~~~.. ~. ~~ a HP PABoo

LILTRAM (300)

ULTRA10 (333)

z Rloow (125)

;: ! ULTRA E45o (2xI

ULTRA1 (2w)

21154 unux (523)

R Pemium Ii (400) q ULTRCII (14o)

n iwwsm n pati II (236)

m-(75) n pmtkmpm (2~)

llHP735

,w 21264 WSMAP

‘rn 21264 ws (5~)

n PWFW (2oo)

AGAP-LHdl3D

n ~ww59o (66)

q ORIGIN 2M)

~m2twwswq ~m21b34pc632) tm2w.4wsgiw

m psntium II (400)

n ULTRAI (140)

n pwwseo q penth II (265)

n ~6000 (75)

n etiunpro (2oo)

(IHP735

Figure 8. ADAP index in 3D. The first row shows the ADAP index, followed by the specific results for the ADAP MV routines (see Section 4) on different mesh levels. Due to the even higher jumps in the numbering of neighboured vertices, the MFLOP/s rates further decrease and are of maximum size 25 (!!!) on level 3. Even both DEC’s and IBM’s top models show about 20 MFLOP/s only, while again the SGIs and particularly the PENTIUMs lead to the (relatively) best results. However, we compare 20 Ml?LOP/s as best values with less than 10 MFLOP/s for many other processors. Additionally to the previous FEAT index, the dependence on the mesh level, respectively, the problem size, gets much more clear and varies up to a factor of ten between Level 1 and Level 3.


‘W PWRS (200)

i m PWiW307 po)

n ORIGIN ZOO

n 2llMws(400)

n 2lill4Pc(5ss)

n 2llMWSI5Ofl~

WHPPMOOO

q ULTRA60 (wo)

n ULTRA10 (395)

RIO000 (10s)

ULTRb E4w (2sc

ULTRA1 (200)

q 21164 Lhux (533)

w P*ntbm II (400)

n ULTRA* (140)

n Pww.52a

n PmthJm II (266)

n RWOO (75)

n PhlvnPm (2w)

IHP795

1 I21264 WSKAP

I m 21264 ws (5W)

!MPWR3(2w)

~~21184WS(5w)

HHPPAWOII

ULTRA60 (300)

ULTRA10 (333)

R1OOOo (185)

ULTRA E450 (250)

ULTRA1 (200)

21164 LhJr (533)

n POnllwn II (400)

n ULTRA1 (140)

mPwR/wO

I psrmum II (266)

mm(n)

l Pa-diumPm (2W)

IliP

I MVN- Lwd33D

H 212S4 ws (243)

I PWRI (ZW)

/k9HPC240(238)

CRAY T3E (433)

n ~R2690 (es)

n ORIGIN 200

n 21164WS(400)

n 211luPc(533)

n 21164 ws (500~

ULTRAw(3Oq ~ ULTRAlO(333) ~ RlWOO (195)

ULTRA E4w (250)

ULTRA1 (200)

21164Lhux(533)

n Psntlum II (400) I

q ULTRA1 (140)

n m-

n Pennurn II (266)

I MVN-L.d23D

400 it------

w 21264 ws (5W)

FZ I PWRB (200)

‘/

i

Figure 9. FEAST-mv-v index in 3D. The first row shows the FEAST-mv-v index, followed by the specific results for the MV-V routines on the different mesh levels. The computations demonstrate the much weaker dependence on the problem size, at least for some of the processors (it is obvious that for problems which fit completely into the cache, the rates are often better). The main results are that the resulting MFLOP/s rates are essentially improved through the applied ‘caching in’ and ‘pipelining’ strategies, especially if compared with the previous sparse indices. Nevertheless, we expect in future even higher improvements if more machine-specific optimizations for this task are applied.

FEAST Indices-Realistic Evaluation

~ q ORIQIN 200

n 21164WS(4lw)

w21164PC(533)

n HPPAtlOW

ULTRA&o (300)

H ULTRA10 (333)

RlOom(ls5) ~ ULTRA E4w (250) / ULTRAl(200) ; 21164 Ltnlcf (533)

I Pantturn II (400)

~

1 n ULTRA1 (140)

w-

q Pentturn II (266)

q R6ccO(75) ~ I PenttumPm (2001

HP PA6006

ULTWUU) (300)

ULTRA10 (333)

RlOOOo (165)

ULTRA E450 (250)

I Pentlum II (266)

MV/C Led 2 3D

21264 WS/KAP HP PA600,

ULTRA66 (3W)

ULTRA10 (363)

RlWW (195)

ULTRA E450 (250)

ULTRA1 (200)

21164Lbwx(533)

Pentium II (400)

ULTRA1 (140)

PWw580

Psntlum It (266)

RBMM (75)

PentiumPro (200)

HP 736

Figure 10. FEAST-mv-c index in 3D. The first row shows the FEAST-mv-c index, followed by the specific results for the MV-C routines on the different mesh levels. These studies show the importance of the main strategy in FEAST: ‘detect locally structured parts and exploit the correspondingly regular data’. More than 700 MFLOP/s are realistic which have to be compared with the previous sparse indices.

.i i. __ ULTRA66 (360)

g1 HP c24a (238)

‘m PWR2/560 (66)

~10~1~1~200

‘~21164W8(400)

q HP PA6000

ULTRAW (300)

ULTRA10 (333)

RIOMM (165)

. ..A ULTRA E450 (250)

ULTRA1 (200)

21164Llnuf(533)

I Pantium II (400)

H ULTRA1 (140)

w PWR/560

q Pentium II (266)

I RWOO (75) PentlumPro (200)

HP 735

1457


; 175

(50

125

100

75

50

25

0

MVN-total2D

-- ----~ 1~21254 w&lap

1 n PWRS ,200)

/ n PwR2697,160)

em Pww227 ,t5q

I I HP C240,236)

12HMWS,500) I

q 21154W5,uxl) I

ULTRAlO(522) ~

ULTRA E450(25O) /

ULTRAl,2cq

prmknnli(4al) I

n R1aaoo(125) ~

n PPCF50,222) I

n uLTRAt(140)

n Fw!a%KI :

w PHltium II (22.5)

m-F%

q PdunPm (200) :

n RtoGoo(154) ~

HP 725

21164ws,500)

2,124 ws (@xl)

ULTRAtO ,223)

_, ULTRA E&C! ,250)

ULTRAt ,200)

Pentlum II ,400)

I RIOWI ,125)

II PPCIFSO ,223)

uLTRAl(t40)

PwFv550

Pentlum II (255)

n Rsoao(7J)

I PmtbmPm (200)

R10000,150)

HP 735

275T. -- ~~._ I

,l2t2&(wsKAP 21154 ws ,500)

) I PWFn3 (2W) 2?154Ws ,400)

ULTRAlO(233)

ULTRA E&O (250)

” ULTFCAM ,360) I PPc!F50,333)

q ULTRAl(140)

MVN-Le~l32D

q CRAY T2E (US) n PenNun II (w

-. ~~ULTRA60,300) wF!8000,75)

1 I ORIGIN 200

/uHPPA8000

q PenthPrn ,200)

RlOOw,wJ)

/121154PC,533) HP735 I_..

I

MVN-LwdlPD

Figure 11. FEAST-mv-v index in 2D. The first row shows the FEAST-mv-v index, followed by the specific results for the MV-V routines on the different mesh levels. In contrast to the previous 3D case, the ‘older’ POWER2 architecture by IBM is superior which is equipped with much smaller La-caches. Due to the more compact matrices (g-point instead of 27-point matrices in 3D), the processors with larger cache sizes cannot gain the same improvements as in the 3D case. The quality of the ‘really old’ IBM 590 with 66 MHz clock rate only is surprisingly excellent if compared with many new processors.

FEAST Indices--Realistic Evaluation

!~ULTRA50(200) n R5OOOW5l q ORIGIN 200 q PentiullPm (200)

n HPPA.9000

112y84PC(w

21264 WWKAP

PWAJ (200)

PwR2/597 (150)

PwR2/397 (160) ULTRA E450 (250)

21264 ws (5M))

HP V2250 (236)

i:::; HP c240 (236)

II ULTRA1 (140)

1 PWW5Ba (65) q pwR550

II CRAY T3E (423) q Petivn II (268)

p ULTRA50 (300)

q ORIGIN 200 rnR5_O ~ q PenaMPm(200)

n HPPA5000 RlOca (150)

,W21154PC(533) HP735

SW

7Oc

eJm

5Lul

400

300

200

1M

,

21254 W?JKAP

PWRI (2Oq

~~PWR2!597(160)

PwRz397 (160)

21284 ws (500)

HP V2250 (238)

i .: HP C240 (235)

~ q Pww5BO (5s) I CRAY T3E (433)

I ULTRA50 (303) I ORIGIN 2xX

lHPPA.5000 W21154PC(533)

21164 ws (500) 2llMWS(400)

ULTPAlO (333)

ii ULTRA E450 (W

q PPc/Fso (333) I ULTRA1 (140)

q PwRKml w Pullilml II (266)

n W(75) q PennMPm (200)

RlOOfXI (150)

HP735

21254 WSIKAP 21164w2.1503,

3 PwR3 (200)

‘W Pww597 (150)

I 1 HP c240 (236)

II PWRZBO (66) fi CRAY T3E (433)

/w ULTRA50 (3BO)

m ORIGIN 2W

WHPPAWXI

i !w5 y (_e_

i..j ULTRA E450 (250)

ULTRA1 (200) Penlium II (400)

RIO000 (195)

q PPWSO (333) w ULTRA1 (140)

q Pww550 n Pmth II (2s)

n R3000(75) q PmliunPm (200)

RlWOO (150)

HP733

1459

Figure 12. FEAST-mv-c index in 2D. The first row shows the FEAST-mv-c index, followed by the specific results for the MV-C routines on the different mesh levels. Again, these rates-up to almost 900 MFLOP/s-show the absolut importance of exploiting such structured patches if available. However, they also show how hard (or even impossible at the moment) it is for certain computer architectures to achieve these high rates for large problem sizes.


225

200

175

155

125 ____

155

75

50

25

0 MGLGSN- total 30

/mm34 wsiw imzi264ws(500) n WFW 1200) m Pwlw307 wJo) q wmm7 1160)

n ~wn2/555 (55) n ORIQIN 255 n 21164ws(400) n 21154pc(593) n 21164ws(500)

n wpCgOOa q ULTRA60 (300)

I ULTRA10 (333)

RloOOo (135)

ULTRA E4So (250)

ULTRA1 (200)

q 2w34hxp33) n pantlum II (400)

n ULTRA1 (14o)

n twwsao n pdun II (265)

n RaMlo(73 n PmtlumPm 1200)

q w735 .___

n HP PA8000

ULTRAEO (300)

ULTRA10 (333)

RIOooo (195) ~ ULTRA E45o (250)

ULTRA1 (200) 1 21164 Lhux (533) ~ n Pentlum ti (400)

w ULTRA1 (140)

1

n PWR~M) n Pentium ii (266)

n MOW (75)

~

n PentlumPro (200)

q HP735 !

I MGLGSN - Level 3 3D

250 ~. ._~__~____~_~-. __~_

ULTRA50(300) ~ ULTRA10 (333) I

q 2w4Linux(533)

n pnthnn II (2.35)

MGLQSiV - Low12 3D

n 21254 WS~KAP

n 21254 ws (500) n PWW v.oo) q PwPz/337 (15o)

n pww69o (es) n ORIGIN MO

n 21154ws(400) n 21154pc(533)

im21164ws(5~)

-mHPPALWo _~~

I ULTRA60 (300)

n ULTRA10 (333)

RlOOoo (195)

m ULTRA E4w (250)

ULTRA1 (200)

n 21 I 54 hx (625) n btlun II (400)

n ULTRAI (140)

n ~~560 n ~tih II (253)

m-000(75) n PmtiumPro (200)

q HP735

Figure 13. FEAST-mglgs-v index in 3D. The first row shows the FEAST-mglgs-v index, followed by the specific results for the M-TRIGS routines on different mesh levels. The computations show that not only MV multiplications can be efficiently applied, but also complete multigrid algorithms with a very robust smoother inside which works very efficiently for regular as well as anisotropic meshes (see [12]).

1461 FEAST Indices-Realistic Evaluation

MGLGS/Gtotal3D

‘I 21264~WsKAP

II 21264 ws (500)

n PWR3 (200)

~ @ ORIGIN 2wO

I : ULTRA60 (360)

q PWFw5oO (66)

‘I ORIGIN 200

,M21164WS(400)

!121164PC(533)

!JB21164WS(500)

q HPPAB000

ULTRA60 (300)

ULTRA10 (333)

RICO00 (195)

T._i ULTRA E&O W30) 21 ULTRA1 (200)

¤~Rneo n Pentium II (266)

m-(75)

PmtiumPro (200)

HP 735

MGLGWC . Level 3 3D

21264 WSKAP

i ULTRA60 (360)

q PWRZZ590 (66)

I ORIGIN 200

‘II 21164 ws (400)

i w 21164 PC (533)

jU21164 WS (500)

~ n HP PA6000

ULTRA60 (300)

ULTRA10 (333)

RlOWO (196)

ULTRA E450 (250)

E ULTRA1 (2W)

21164 Linux (533)

Pentturn II (400)

q ULTRA1 (140)

n PWR/560

m Pentium II (266)

n A6000 (75)

n PentiumPro (200)

HP 735

MGLGWC - Level 2 3D

MGLGWC - Level 13D

: # PWRB (200)

PWRZ397 (160)

PWR2/597 (160)

ORIGIN 2ooO

: ULTRA60 (360)

q PWR2/5SO (66)

I ORIGIN 2W

q 21164ws(400)

q 21164Pc(533)

‘~21164ws(500)

## HP PA6000

ULTRA60 (300)

ULTRA10 (333)

g RIO000 (195)

/ i ULTRA E450 (250)

ULTRA1 (200)

21164 Unux (533)

Pentlum II (400)

# ULTRA1 (140)

n PWR/560

n Pentium II (266)

n R8000 (75)

Figure 14. FEAST-mglgs-c index in 3D. The first row shows the FEAST-mglgs-c index, followed by the specific results for the M-TRIGS routines on the different mesh levels. Again, the demand for detecting locally structured patches is demonstrated since complex multigrid solvers with a very high numerical complexity can be performed with a correspondingly excellent computational efficiency, too.

21264 WSKAP HP PA6000

21264 ws (500) ULTRA60 (300)

PWR3 (200) ULTRA10 (333)

PWR2/397 (160) RIO000 (195)

PWR2/597 (160) i::: ULTRA E450 (250)

ORIGIN 2000 ULTRA1 (200)

i ULTRAW (360) 21164 Linux (533)

-., HP C240 (236) I Pentium II (400)

CRAY T3E (433) ## ULTRA1 (140)

q PWR2/500 (66) q PWw5w

q ORIGIN 200 I Pentium II (266)

i121164WS(400) IRSOOO(75)

!a21164Pc(533) PentlumPro (200)

~121164ws(500) HP 735


I-

~- q 21254 WSKAP

l PwR3 cw I PwR26O7 (180)

~ q PwR2BO (O5)

1 m GRAY T3E (433)

‘m IJLTRMO (3OO)

n ORIQIN 200

~HPPk30Ofl

q 21154Pc(533) .~_~

I21154 ws (5OO)

n 21154ws(400)

ULTRA10 (333)

ULTRA E450 (250)

ULTRA1 (2OO)

Pentbum II (400)

wRlOWO(l25)

n PpclFso (333) n ULTRA1 (140)

n Pww55O

n Pentlum II (2%)

WR2JJOO(7J) q PentiumPm (200)

RlOOOO(l50)

HP735

~-- ~IIPlP54WS/KAP

/ q PwR3 (200)

-. m CRAY T3E (433)

-~~ m ORIGIN 200

t i~21154 PC (533)

aHPPA5OiM

MQLQSN - Level 3 2D

n CRAY T3E (433)

1 m ULTRA50 (300)

‘I ORIQIN 200

q HPPA5000

n 21154 PC (533)

21154WS(5W) --

21154WS(40O)

ULTRA10 (333)

ULTRA IS50 (250)

ULTRA1 (200)

PenthJm II (400)

RIO000 (195)

w PPcF50 (333)

w ULTFIAI (140)

n Pww550

n PeathIm II (286)

n RBOOO (75) q PenUumPro (200)

RlOOOO (150)

HP735

250

MQLQSN - Level 1 2D

/I 21254 WSMAP

‘E PWR3 (ZW)

q ORIGIN 2WO

~ n PwRasw (55)

! I CRAY T3E (433)

n ULTRA50 (3OO)

n ORIQIN 200

n HP PA5OW

!rn21154PC6331

21164 ws (500)

21164ws(400)

ULTRA10 (333)

ULTRA E450 (250)

ULTRA1 (200)

POntltml II (400)

n RIOOW (125)

n PPCMO (333) n ULTRA1 (MO)

n PwW650

n Psntlum II (288)

I R5000 (75) I PurlkmPrn (200)

RIOOOO (150)

HP735

21184 ws (500)

112llMWS(4O0)

ULTRA1 0 (333)

ill ULTRA E45O (250)

a ULTRA1 (200)

0 Pentlum II (400)

n RlOOOO (125)

n ppumo (333) n ULTRA1 (140)

n PwRf55O n PeMum II (258)

n R8ooo (75) I PmtiumPr0 (2W)

RlOOOO (IW)

HP 735

Figure 15. FEAST-mglgs-v index in 2D. The first row shows the FEAST-mglgs-v index, followed by the specific results for the M-TRIGS routines on different mesh levels. The 2D results are somewhat slower as the 3D case, as already indicated by the MFLOP/s rates for MV-V multiplication.


GRAY T3E (433) l Pmllum II (2ea)

~~(~) n=(75t q PenulmlPm (2uo)

II Rl@xo (150)

350

300

250

200

150

100

50

0

l 212s4ws/K4P

n PwR3 (ZOO)

I PWR2697 (160)

u PWR2/3@7 (180)

m 21264 ws (EM)

HP V2250 (236)

HP c240 (236)

ULTR#J (38o)

ORlOiN 2ooo

I PWIw590 (66)

n CRAY T3E (433)

m ULTRA@ @LlO)

I ORIGIN 2W

mHPPA3OW

,1211&1pc(533)

-_-_l_....__ q 2llwws@xl)

a 21164 ws (4Oo)

ULTRA10 (333)

ULTRA Ells0 (2~9)

ULTRA1 (2v3)

Pmtium II (4Oo)

II RlOLMO (125)

8 ~~ w3) a ULTRA1 (145)

n PWWMU)

II P9rruum II (266)

n RaoGu 175) II PontiumPr0 (ZOO)

RIO000 (1M)

HP 735

‘~21264WstKAP m PWRI (200) n PWR2697 (160)

,Iy PWR2M7 (160)

/II 21234 ws {soo)

~~~RIOINXXX)

n PWR2690 (66)

n CMY TIE (433) II ULTPA60 (300)

m ORIGIN 200

s1211&4 WS (500)

mnle4ws(4w)

/

ULTRA10 (333)

ULTRA E4!%l(250)

ULTRA1 (200) 1 Phiurn II (400)

RlOOW(l95) j IPPOF50(333) I

i uLTRA’ wJ1 m Pww680 ; II PenNurn II (266) I w NOW (75) I POLO (200) j

IRloooo(lsof HP 735 ,

Figure 16. FEAST-mglgs-c index in 2D. The first row shows the FEAST-mglgs-c index, followed by the specific results for the M-TRIGS routines on different mesh levels. The 2D resufts are somewhat slower as the 3D case, as already indicated by the MFLOP/s rates for MV-C multiplication.


__-.._____._ -

.._____

1 I I

I’ 1 _~ - -._..._

__~ ____

.-_-._ ~

Figure 17. Some indices (2D vs. 3D). The reader should compare the absolute MFLOP/s results which are somewhat higher in 3D for the Sparse Banded Bias applications in FEAST.

the feast indices-realistic evaluation of modern software

Documents