
Page 1: Benchmarks for Parallel Systems

Benchmarks for Parallel Systems

Sources/Credits:
• “Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN, 37996, Computer Science Technical Report Number CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
• Top500 (courtesy: Jack Dongarra): http://www.top500.org
• LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
• “The LINPACK Benchmark: Past, Present, and Future”, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
• NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

Page 2: Benchmarks for Parallel Systems

LINPACK (Dongarra: 1979)

• Solves a dense system of linear equations
• Initially published as part of the user’s guide for the LINPACK package
• LINPACK – 1979
• Three variants: N=100 benchmark, N=1000 benchmark, Highly Parallel Computing benchmark

Page 3: Benchmarks for Parallel Systems

LINPACK benchmark
• Implemented on top of BLAS1
• 2 main operations – DGEFA (Gaussian elimination, O(n^3)) and DGESL (solve Ax = b, O(n^2))
• Major operation (97% of the time) – DAXPY: y = y + α·x
• DAXPY performs about n^3/3 + n^2 multiply-add steps in total; at 2 flops each, that gives 2n^3/3 + 2n^2 flops (approx.)
• 64-bit floating point arithmetic
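The operation count above can be checked with a small sketch. It is not the LINPACK Fortran source: it is a plain-Python imitation of the DGEFA/DGESL structure, with the matrix stored as a list of columns and every update routed through a `daxpy` helper that counts multiply-adds. The helper names, the `counter` argument, and the test matrix (random entries, right-hand side chosen so that x is all ones) are assumptions made for illustration.

```python
import random


def daxpy(alpha, x, y, lo, hi, counter):
    """y[lo:hi] += alpha * x[lo:hi], in place; one multiply-add per element."""
    for i in range(lo, hi):
        y[i] += alpha * x[i]
    counter[0] += hi - lo


def dgefa(a, counter):
    """LU-factor `a` (a list of columns) in place with partial pivoting."""
    n = len(a)
    ipvt = []
    for k in range(n - 1):
        col_k = a[k]
        p = max(range(k, n), key=lambda i: abs(col_k[i]))  # pivot row
        ipvt.append(p)
        if col_k[p] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        col_k[p], col_k[k] = col_k[k], col_k[p]
        for i in range(k + 1, n):            # store negated multipliers
            col_k[i] = -col_k[i] / col_k[k]
        for j in range(k + 1, n):            # update each trailing column
            col_j = a[j]
            col_j[p], col_j[k] = col_j[k], col_j[p]
            daxpy(col_j[k], col_k, col_j, k + 1, n, counter)
    return ipvt


def dgesl(a, ipvt, b, counter):
    """Solve Ax = b in place using the factors produced by dgefa."""
    n = len(a)
    for k in range(n - 1):                   # forward elimination: L y = b
        p = ipvt[k]
        b[p], b[k] = b[k], b[p]
        daxpy(b[k], a[k], b, k + 1, n, counter)
    for k in range(n - 1, -1, -1):           # back substitution: U x = y
        b[k] /= a[k][k]
        daxpy(-b[k], a[k], b, 0, k, counter)
    return b


def solve_and_count(n, seed=0):
    """Return (solution, number of multiply-adds) for a random n x n system."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]  # columns
    b = [sum(a[j][i] for j in range(n)) for i in range(n)]  # exact x is all ones
    counter = [0]
    ipvt = dgefa(a, counter)
    x = dgesl(a, ipvt, b, counter)
    return x, counter[0]


if __name__ == "__main__":
    n = 100
    x, count = solve_and_count(n)
    print("multiply-adds counted:", count)
    print("n^3/3 + n^2 estimate :", n**3 / 3 + n**2)
```

The exact count for this loop structure is (n-1)n(2n-1)/6 for the factorization plus n(n-1) for the two triangular solves, which approaches n^3/3 + n^2 as n grows.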

Page 4: Benchmarks for Parallel Systems

LINPACK
• N=100: 100x100 system of equations. No change to the source code is allowed; only compiler optimization may be used. The user supplies a timing routine called SECOND
• N=1000: 1000x1000 system. The user can implement any code, as long as it delivers the required accuracy: Towards Peak Performance (TPP). The driver program always uses 2n^3/3 + 2n^2 as the operation count
• “Highly Parallel Computing” benchmark: any software may be used and the matrix size can be chosen. Used in Top500
• All variants are based on 64-bit floating point arithmetic

Page 5: Benchmarks for Parallel Systems

LINPACK
• 100x100 – inner loop optimization
• 1000x1000 – three-loop/whole program optimization
• Scalable parallel program – largest problem that can fit in memory
• Template of Linpack code
- Generate
- Solve
- Check
- Time
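The Generate / Solve / Check / Time template can be sketched as a small driver. This is an illustrative outline in plain Python, not the benchmark's own driver: the function names, the random test matrix, and the row-oriented stand-in solver are all assumptions. The one detail taken from the slides is that the reported rate uses the nominal 2n^3/3 + 2n^2 operation count regardless of what the solver actually executed.

```python
import random
import sys
import time


def generate(n, seed=0):
    """Generate: random dense matrix (list of rows) and a matching right-hand side."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]              # exact solution is all ones
    return a, b


def solve(a, b):
    """Solve: Gaussian elimination with partial pivoting on copies of a and b."""
    n = len(a)
    a = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    for k in range(n - 1, -1, -1):
        x[k] = (x[k] - sum(a[k][j] * x[j] for j in range(k + 1, n))) / a[k][k]
    return x


def check(a, b, x):
    """Check: normalized residual ||Ax - b|| / (||A|| ||x|| n eps), infinity norms."""
    n = len(a)
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return resid / (norm_a * norm_x * n * sys.float_info.epsilon)


def run(n, seed=0):
    """Time: wrap only the solve step and report a rate from the nominal count."""
    a, b = generate(n, seed)
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = max(time.perf_counter() - start, 1e-12)
    flops = 2 * n**3 / 3 + 2 * n**2          # nominal count used for reporting
    return {
        "n": n,
        "seconds": elapsed,
        "mflops": flops / elapsed / 1e6,
        "normalized_residual": check(a, b, x),
    }


if __name__ == "__main__":
    print(run(100))
```

Only the solve step sits inside the timed region; generation and checking are excluded, which is why the template lists them as separate stages.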

Page 6: Benchmarks for Parallel Systems

HPL (Implementation of HPLinpack Benchmark)

Page 7: Benchmarks for Parallel Systems

HPL Algorithm

• 2-D block-cyclic data distribution
• Right-looking LU
• Panel factorization: various options
- Crout, left-looking or right-looking recursive variants based on matrix multiply
- Number of sub-panels
- Recursive stopping criteria
- Pivot search and broadcast by binary exchange
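The 2-D block-cyclic distribution named in the first bullet can be illustrated with the usual index arithmetic: the matrix is cut into nb x nb blocks, and blocks are dealt out round-robin over a P x Q process grid in both dimensions. The sketch below is a generic formulation of that mapping, not code taken from HPL; the function and parameter names are hypothetical.

```python
from collections import Counter


def owner(i, j, nb, p_rows, q_cols):
    """Process-grid coordinates (p, q) that own global matrix entry (i, j)."""
    return (i // nb) % p_rows, (j // nb) % q_cols


def local_index(g, nb, nprocs):
    """Index of global row (or column) g within its owner's local storage."""
    return (g // (nb * nprocs)) * nb + g % nb


def entries_per_process(n, nb, p_rows, q_cols):
    """Count how many entries of an n x n matrix land on each process."""
    counts = Counter()
    for i in range(n):
        for j in range(n):
            counts[owner(i, j, nb, p_rows, q_cols)] += 1
    return counts


if __name__ == "__main__":
    # A 12 x 12 matrix, 2 x 2 blocks, on a 2 x 3 process grid.
    for proc, count in sorted(entries_per_process(12, 2, 2, 3).items()):
        print(proc, count)
```

Because blocks are assigned cyclically, every process keeps a share of the trailing matrix as the factorization proceeds, which is the load-balancing reason for preferring this layout over a plain block distribution.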

Page 8: Benchmarks for Parallel Systems

HPL algorithm
• Panel broadcast
• Update of trailing matrix:
- look-ahead pipeline
• Validity check: the scaled residual should be O(1)

Page 9: Benchmarks for Parallel Systems

Top500 (www.top500.org)

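• Started in 1993
• Published twice a year – June and November
• Top500 gives Nmax, Rmax, N1/2, Rpeak
- Rmax: maximal LINPACK performance achieved
- Nmax: problem size at which Rmax was achieved
- N1/2: problem size at which half of Rmax is achieved
- Rpeak: theoretical peak performance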

Page 11: Benchmarks for Parallel Systems

India and Top 500

Rank | Site | Country | System | Vendor | Processors | Rmax (GFlop/s) | Rpeak (GFlop/s)
111 | Geoscience (B) | India | BladeCenter HS20 Cluster, Xeon EM64T 3.4 GHz, Gig-Ethernet | IBM | 1024 | 3755 | 6963.2
204 | Semiconductor Company (L) | India | eServer, Opteron 2.6 GHz, GigEthernet | IBM | 1024 | 2791 | 5324.8
231 | Semiconductor Company (K) | India | xSeries x336 Cluster, Xeon EM64T 3.6 GHz, Gig-Ethernet | IBM | 730 | 2676.88 | 5256
293 | Institute of Genomics and Integrative Biology | India | Cluster Platform 3000 DL140G2, Xeon 3.6 GHz, Infiniband | Hewlett-Packard | 576 | 2156 | 4147.2
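The Rpeak figures listed above can be reproduced with simple arithmetic: processors x clock rate x floating point operations per cycle. The value of 2 flops per cycle is an assumption, chosen because it matches all four listed entries; the function names are hypothetical. The same sketch computes efficiency as Rmax / Rpeak.

```python
def rpeak_gflops(processors, clock_ghz, flops_per_cycle=2):
    """Theoretical peak in GFlop/s (flops_per_cycle=2 is an assumption)."""
    return processors * clock_ghz * flops_per_cycle


def efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved on the LINPACK run."""
    return rmax / rpeak


# (rank, processors, clock in GHz, listed Rmax, listed Rpeak), copied from the table
SYSTEMS = [
    (111, 1024, 3.4, 3755.0, 6963.2),
    (204, 1024, 2.6, 2791.0, 5324.8),
    (231, 730, 3.6, 2676.88, 5256.0),
    (293, 576, 3.6, 2156.0, 4147.2),
]

if __name__ == "__main__":
    for rank, procs, ghz, rmax, rpeak in SYSTEMS:
        computed = rpeak_gflops(procs, ghz)
        print(rank, round(computed, 1), rpeak, f"{efficiency(rmax, rpeak):.1%}")
```

All four systems reach roughly half of their theoretical peak on LINPACK.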

Page 14: Benchmarks for Parallel Systems

NAS Parallel Benchmarks - NPB
• Also for evaluation of supercomputers
• A set of 8 programs from CFD
• 5 kernels, 3 pseudo applications
• NPB 1 – original benchmarks
• NPB 2 – NAS’s MPI implementation. NPB 2.4 Class D has more work and more I/O
• NPB 3 – based on OpenMP, HPF, Java
• GridNPB3 – for computational grids
• NPB 3 multi-zone – for hybrid parallelism

Page 15: Benchmarks for Parallel Systems

NPB 1.0 (March 1994)

• Defines class A and class B versions
• “Paper and pencil” algorithmic specifications
• Generic benchmarks as compared to MPI-based LinPack
• General rules for implementations – Fortran90 or C, 64-bit arithmetic etc.
• Sample implementations provided

Page 16: Benchmarks for Parallel Systems

Kernel Benchmarks
• EP – embarrassingly parallel
• MG – multigrid. Regular communication
• CG – conjugate gradient. Irregular long distance communication
• FT – a 3-D PDE using FFT. Rigorous test of long distance communication
• IS – large integer sort
• Detailed rules regarding
- brief statement of the problem
- algorithm to be used
- validation of results
- where to insert timing calls
- method for generating random numbers
- submission of results
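As an illustration of why EP is called embarrassingly parallel, and of why the rules fix the random number method, here is a small sketch in the spirit of the EP kernel: uniform deviates from a linear congruential generator are turned into Gaussian pairs by the polar method and tallied by magnitude. The constants (multiplier 5^13, modulus 2^46, seed 271828183) are assumed to match the NPB specification and should be checked against it; the function names and the chunking scheme are hypothetical. Because the generator can jump ahead by k steps with one modular exponentiation, each worker can produce its own slice of the sequence with no communication.

```python
import math

A = 5 ** 13        # LCG multiplier (assumed to match the NPB specification)
M = 2 ** 46        # LCG modulus (assumed to match the NPB specification)
SEED = 271828183   # default seed (assumed to match the NPB specification)


def lcg_stream(state, count):
    """Yield `count` uniform deviates in (0, 1) from x <- A * x mod M."""
    for _ in range(count):
        state = (A * state) % M
        yield state / M


def skip_ahead(state, k):
    """Generator state after k steps, so a worker can jump to its own chunk."""
    return (pow(A, k, M) * state) % M


def ep_tally(state, pairs, bins=10):
    """Tally Gaussian pairs by int(max(|X|, |Y|)) using the polar method."""
    counts = [0] * bins
    stream = lcg_stream(state, 2 * pairs)
    for u, v in zip(stream, stream):         # consume the deviates two at a time
        x, y = 2.0 * u - 1.0, 2.0 * v - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:                   # accept points inside the unit circle
            f = math.sqrt(-2.0 * math.log(t) / t)
            k = int(max(abs(x * f), abs(y * f)))
            counts[min(k, bins - 1)] += 1
    return counts


def ep_tally_chunked(state, pairs, workers):
    """Same tally, computed as independent chunks that could run in parallel."""
    chunk = pairs // workers
    total = [0] * 10
    for w in range(workers):
        start = skip_ahead(state, 2 * w * chunk)   # two deviates per pair
        for i, c in enumerate(ep_tally(start, chunk)):
            total[i] += c
    return total


if __name__ == "__main__":
    print(ep_tally(SEED, 10000))
    print(ep_tally_chunked(SEED, 10000, 4))
```

The chunked version runs the workers one after another for simplicity; the only step that would need communication in a real parallel run is the final sum of the per-worker tallies.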

Page 17: Benchmarks for Parallel Systems

Pseudo applications / Synthetic CFDs

Benchmark 1 – perform a few iterations of the approximate factorization algorithm (BT)

Benchmark 2 – perform a few iterations of the diagonal form of the approximate factorization algorithm (SP)

Benchmark 3 – perform a few iterations of SSOR (LU)

Page 18: Benchmarks for Parallel Systems

Class A and Class B

[Table: sample code, Class A and Class B]

Page 19: Benchmarks for Parallel Systems

NPB 2.0 (1995)
• MPI and Fortran 77 implementations
• 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
• Class C – bigger size
• Benchmark rules – 0%, 5%, >5% change in source code

Page 20: Benchmarks for Parallel Systems

NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)

• EP and IS added
• FT rewritten
• NPB 2.4 – class D and rationale for class D sizes
• 2.4 I/O – a new benchmark problem based on BT (BTIO) to test the output capabilities
• An MPI implementation of the same (MPI-IO) – different options, such as using collective buffering or not
