
Page 1: POP 1.4.3 Performance - 1 Degree Global Problem

POP 1.4.3 Performance - 1 Degree Global Problem

The POP code is a well-known ocean circulation model developed at LANL. It is the ocean model component of the Community Climate System Model (CCSM) from NCAR. The chart below shows the current performance on the Altix and other platforms for a “1 degree” resolution global ocean circulation problem.

Note: Virtually no changes to the original code have been made for the Altix runs; a total of about 100 lines of code have been modified. Most of the changes are in the boundary routine used in the CG solver. At this point a number of code modifications have been identified that will significantly improve on this performance. In contrast, the vector version has been under development for about 2 years in Japan, and lately at Cray.

[Chart: POP 1.4.3 - Performance on 1 Degree “X1” Problem. Performance (yrs/day, 0-100) versus CPU count (0-256, Cray X1 plotted as SSP count) for Cray X1, Altix 1.5 GHz, and Origin 600 MHz. NOTE: X1 data re-plotted from Pat Worley charts in the X1 Early Performance Evaluation.]
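
For context on where that boundary routine sits, the sketch below is a minimal, self-contained Fortran toy: a conjugate-gradient solve for a diagonally dominant 2-D 5-point operator, in which a halo/boundary update must precede every operator application. The program, routine, and array names are illustrative assumptions, not POP 1.4.3 interfaces.

  ! Schematic only - not POP 1.4.3 source.  A toy conjugate-gradient solve
  ! for a diagonally dominant 2-D 5-point operator, with an explicit
  ! boundary (halo) update before each operator application.  In POP, the
  ! analogous boundary routine is where most of the Altix changes landed.
  program toy_cg
    implicit none
    integer, parameter :: n = 64, maxit = 500
    real(kind=8), dimension(0:n+1,0:n+1) :: x, r, p, q
    real(kind=8) :: rho, rho_new, alpha, beta
    integer :: iter

    x = 0.0d0
    r = 0.0d0
    q = 0.0d0
    r(1:n,1:n) = 1.0d0                          ! toy right-hand side
    p = r
    rho = sum(r(1:n,1:n)**2)

    do iter = 1, maxit
       call update_boundary(p)                  ! refresh halo ring (periodic here)
       q(1:n,1:n) = 5.0d0*p(1:n,1:n) - p(0:n-1,1:n) - p(2:n+1,1:n) &
                  - p(1:n,0:n-1) - p(1:n,2:n+1)
       alpha = rho / sum(p(1:n,1:n)*q(1:n,1:n))
       x(1:n,1:n) = x(1:n,1:n) + alpha*p(1:n,1:n)
       r(1:n,1:n) = r(1:n,1:n) - alpha*q(1:n,1:n)
       rho_new = sum(r(1:n,1:n)**2)
       if (sqrt(rho_new) < 1.0d-8) exit
       beta = rho_new / rho
       p(1:n,1:n) = r(1:n,1:n) + beta*p(1:n,1:n)
       rho = rho_new
    end do
    print *, 'CG iterations:', min(iter, maxit), '  final residual:', sqrt(rho_new)

  contains

    subroutine update_boundary(a)               ! stand-in for POP's boundary routine:
      real(kind=8), intent(inout) :: a(0:n+1,0:n+1)   ! periodic halo update
      a(0,:)   = a(n,:)
      a(n+1,:) = a(1,:)
      a(:,0)   = a(:,n)
      a(:,n+1) = a(:,1)
    end subroutine update_boundary

  end program toy_cg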

Page 2: POP 1.4.3 Performance - 1 Degree Global Problem

POP 1.4.3 - 0.1 Degree North Atlantic Problem

The second POP scenario was a run of the 0.1 degree North Atlantic simulation as defined by LANL last year. The grid for this problem is 992x1280x40 (~51M points). As stated before, no significant code changes were made. The results are presented below. Note that this simulation contains about 10x more points than the 1 degree problem above and requires about 6x more time steps per day. Thus, the “work” is about 60x more, yet the run performance is only about 17x slower on 256 CPUs. The turnover in both the 1.0 and 0.1 degree problems is due to two effects: 1) scaling in the barotropic computation, and 2) extra useless work engendered by the extensive use of F90 array syntax notation in the code.

[Chart: POP 1.4.3 - Performance on 0.1 Degree “NA” Problem. Performance (yrs/day, 0-5) versus CPU count (0-256) for Altix 1.5 GHz and Origin 600 MHz.]

NOTE: POP graphics courtesy of Bob Malone LANL
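
As a small illustration of the second effect, the self-contained Fortran sketch below (invented array and mask names, not POP source) contrasts a whole-array F90 statement, which computes every point of a block including land and halo points, with an explicit loop that restricts the work to the wet interior.

  ! Illustrative only - not POP source.  Whole-array F90 syntax operates on
  ! every point of a block (land and halo points included); an explicit loop
  ! can confine the work to the wet interior points.
  program array_syntax_cost
    implicit none
    integer, parameter :: nx = 192, ny = 128, nghost = 2
    real(kind=8), dimension(nx,ny) :: tracer, rhs, mask
    integer :: i, j

    call random_number(tracer)
    call random_number(rhs)
    mask = 1.0d0                               ! toy land/ocean mask: 1 = ocean, 0 = land
    mask(:, 1:nghost) = 0.0d0                  ! pretend the halo rows are "land"
    mask(:, ny-nghost+1:ny) = 0.0d0

    ! Style 1: array syntax - every one of the nx*ny points is computed,
    ! then masked, even where the result is discarded.
    tracer = tracer + mask * rhs

    ! Style 2: explicit loops over the interior only - halo rows/columns are
    ! skipped and land points can be skipped outright.
    do j = 1 + nghost, ny - nghost
       do i = 1 + nghost, nx - nghost
          if (mask(i,j) /= 0.0d0) tracer(i,j) = tracer(i,j) + rhs(i,j)
       end do
    end do

    print *, 'sum(tracer) =', sum(tracer)
  end program array_syntax_cost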

Page 3: POP 1.4.3 Performance - 1 Degree Global Problem

Compute time for 1000 year simulation

CCSM was used last year by NCAR to conduct a 1000-year global simulation using T42 resolution for the atmosphere and 1 degree resolution for the ocean. The simulation required 200 days of compute time to complete. The Altix code at this point has been partially optimized using MLP for all inter-model communications. Some sub-models have been optimized further. About 4 man-months have been devoted to the project.

[Chart: CCSM 2.0 Code Performance - 1000 year simulation. Compute time (0-400 days): MPI SGI O3K - 318 days; MPI IBM Pw3 - 200 days; MLP O3K 0.6 GHz - 73 days (256 CPUs); MLP Altix 1.5 GHz - 53 days (192 CPUs).]

Page 4: POP 1.4.3 Performance - 1 Degree Global Problem

Performance Results for Applications in the Aerosciences

ARC3D, OVERFLOW, CART3D

Page 5: POP 1.4.3 Performance - 1 Degree Global Problem

OVERFLOW-MLP - 35M Point “Airplane” Problem


The OVERFLOW “Airplane” problem has become a benchmarking standard at NAS. It has been one of the primary benchmarks used in evaluating the scaling performance of candidate HPC platforms for the past 6 years. This is very appropriate, as more than 50% of all cycles burned at NAS are from OVERFLOW runs and/or codes with very similar performance characteristics.

The “Airplane” problem is a high-fidelity steady-state computation of a full aircraft configured for landing. The problem consists of 160 3-D blocks varying in size from 11K points to 1.6M points. The total point count is 35M. Load balancing is critical for this problem, and SSI architectures are particularly well suited to it. The chart below shows the size distribution of the 160 blocks in this problem.

[Chart: Percentage of Total Point Count by Block - percentage of total points (0-5%) versus block number (0-160) for the 160 grid blocks.]
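
As an illustration of why that size distribution matters, the sketch below is a minimal Fortran toy of a greedy "largest block first" assignment of 160 blocks to CPU groups. It is an assumed, generic scheme written for this example, not OVERFLOW-MLP's actual load balancer.

  ! Illustrative only - a generic greedy "largest block first" load balancer,
  ! not OVERFLOW-MLP's actual scheme.  Blocks of widely varying size are
  ! handed, largest first, to whichever CPU group currently has the least work.
  program greedy_balance
    implicit none
    integer, parameter :: nblocks = 160, ngroups = 16
    integer :: points(nblocks), owner(nblocks), load(ngroups), order(nblocks)
    integer :: b, g, gmin, i, j, tmp
    real :: rnd

    do b = 1, nblocks                          ! toy block sizes in the 11K..1.6M range
       call random_number(rnd)
       points(b) = 11000 + int(rnd * (1600000 - 11000))
       order(b) = b
    end do

    do i = 1, nblocks - 1                      ! sort block indices by size, descending
       do j = 1, nblocks - i
          if (points(order(j)) < points(order(j+1))) then
             tmp = order(j); order(j) = order(j+1); order(j+1) = tmp
          end if
       end do
    end do

    load = 0
    do i = 1, nblocks                          ! assign each block to the lightest group
       b = order(i)
       gmin = 1
       do g = 2, ngroups
          if (load(g) < load(gmin)) gmin = g
       end do
       owner(b) = gmin
       load(gmin) = load(gmin) + points(b)
    end do

    print *, 'heaviest group:', maxval(load), ' points   lightest group:', minval(load), ' points'
  end program greedy_balance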

Page 6: POP 1.4.3 Performance - 1 Degree Global Problem

OVERFLOW-MLP Performance

[Chart: 35 Million Point “Airplane” Problem - GFLOP/s (0-125) versus CPU count (0-256) for Altix 1.5 GHz and O3K 600 MHz.]

The chart above displays the performance of OVERFLOW-MLP on the Altix and Origin systems. OVERFLOW-MLP is a hybrid multi-level parallel code using OpenMP for loop-level parallelism and MLP (a faster alternative to MPI) for the coarse-grained parallelism. NOTE: This code is 99% VECTOR per Cray. This performance translates into a problem run time of 0.9 seconds per step on the 256p Altix.
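
To make the loop-level half of that hybrid scheme concrete, the minimal OpenMP Fortran sketch below threads the outer plane loop of a toy stencil update. It is not OVERFLOW source, and the coarse-grained MLP layer (forked processes over groups of blocks) is deliberately not reproduced here.

  ! Illustrative only - not OVERFLOW-MLP source.  OpenMP threads split the
  ! outer (k-plane) loop of a toy stencil update; in the real code, MLP forks
  ! one process per block group and each process threads loops like this one.
  program openmp_loop_level
    use omp_lib
    implicit none
    integer, parameter :: nx = 200, ny = 200, nz = 200
    real(kind=8), allocatable :: q(:,:,:), rhs(:,:,:)
    integer :: i, j, k

    allocate(q(nx,ny,nz), rhs(nx,ny,nz))
    call random_number(q)
    call random_number(rhs)

    !$omp parallel do private(i, j) schedule(static)
    do k = 2, nz - 1
       do j = 2, ny - 1
          do i = 2, nx - 1
             q(i,j,k) = q(i,j,k) + 0.1d0 * (rhs(i+1,j,k) + rhs(i-1,j,k) &
                      + rhs(i,j+1,k) + rhs(i,j-1,k) - 4.0d0*rhs(i,j,k))
          end do
       end do
    end do
    !$omp end parallel do

    print *, 'threads available:', omp_get_max_threads(), '   q(2,2,2) =', q(2,2,2)
  end program openmp_loop_level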

Page 7: POP 1.4.3 Performance - 1 Degree Global Problem

The ARC3D Code - OpenMP Test

The chart below presents the results of executions of the ARC3D code on O3K and Altix systems for differing CPU counts. ARC3D was a production CFD code at NAS for many years. It is a pure OpenMP parallel code. Its solution techniques for a single grid block are very similar to those of numerous production CFD codes in use today at NAS (OVERFLOW, CFL3D, TLNS3D, INS3D). It is an excellent test for revealing how a new system will perform at the single CFD block level. Its response is applicable to earth science ocean and climate models as well.

The test below is for a 194x194x194 dimensioned grid. It shows excellent performance on the Altix relative to the O3K for 1-64 CPUs with almost a 3x win at all CPU counts.

[Chart: ARC3D Performance Relative to O3K 600 MHz - speedup relative to the O3K (0-6) for Altix 1.5 GHz and O3K 0.6 GHz at 1, 32, and 64 CPUs.]

Page 8: POP 1.4.3 Performance - 1 Degree Global Problem

The CART3D Code - OpenMP Test

The CART3D code was the NASA “Software of the Year” winner for 2003. It is routinely used for a number of CFD problems within the agency. Its most recent application was to assist in the foam impact analysis done for the STS-107 accident investigation.

The chart to the right presents the results of executing the OpenMP-based CART3D production CFD code on various problems across differing CPU counts on the NAS Altix and O3K systems. As can be seen, the scaling to 500 CPUs on the weeks-old 512-CPU Altix system is excellent.

Page 9: POP 1.4.3 Performance - 1 Degree Global Problem

NASA HSP Compute Server Suite


Page 10: POP 1.4.3 Performance - 1 Degree Global Problem

The charts below present the relative performance (O3K 600 MHz = 1) across 4 platforms for the NAS HSP3 Compute Server Suite. This selection of codes was used historically as a benchmark suite for the HSP3 procurement (C90) at NAS.

[Chart 1: NAS HSP3 Compute Server Suite Performance - relative performance (0-10) by code: DHB1, DHB2, LRL1, LRL4, MR1, RAF1, RJB01, RJB04, RJB05, RM2, TB1.]

[Chart 2: HSP3 Compute Server Suite Performance - relative performance (0-10) by code: RJB06, KG2, LRL10, LRL6, LRL7, LRL8, RAF2, RAF3, S01, TP1.]

[Chart 3: relative performance (0-10) by code: DHB3, KG1, LRL2, LRL3, LRL5, LRL9, RS1, TB1, TB1, TB2.]

Page 11: POP 1.4.3 Performance - 1 Degree Global Problem

The NAS Parallel Benchmarks (NPBs) V2.1

The chart below presents the results of several executions of the NAS Parallel Benchmarks (NPBs 2.1) on Origin 3000 and Altix 1.3/1.5 GHz Systems. The NPBs are a collection of codes and code segments used throughout industry to comparatively rate the performance of alternative HPC systems.

[Chart: NPB Performance Relative to O3K 600 MHz - ratio to O3K (0-6) for O3K 600 MHz, Altix 1.3 GHz, and Altix 1.5 GHz on BT B (9 CPU), LU B (1 CPU), LU B (16 CPU), MG B (1 CPU), MG B (8 CPU), SP B (9 CPU), and BT C (9 CPU).]

Page 12: POP 1.4.3 Performance - 1 Degree Global Problem

Summary and Observations

The NASA - SGI 512p Altix SSI effort is already highly successful. A few items remain, but the system is very usable and stable.

The Altix system routinely provides 3-5x the performance of current NAS systems. Smaller jobs (1-64 CPUs) tend to show the larger percentage wins.

The 512-CPU system is well along the way to becoming a solid production system for NASA needs. It is running a >50% workload 24/7; the batch system is up and running, with jobs managed by PBS Pro; and system uptime is already measured in weeks.

Page 13: POP 1.4.3 Performance - 1 Degree Global Problem

So what got accelerated by NASA Ames?

Production CFD codes executing at 100x the C90 numbers of just a few years ago.

Earth science codes executing 2-4x faster than last year’s best efforts, and 50x faster than a few years ago.

New expanded shared-memory architectures: the first 256-, 512-, and 1024-CPU Origin systems, and the first 256- and 512-CPU quasi-production Altix systems.

Where is the future at NAS?

Expanded Altix to 4096 CPUs?