
1

Presentation at the 4th PMEO-PDS Workshop

Benchmark Measurements of Current UPC Platforms

Zhang Zhang and Steve Seidel

Michigan Technological University

Denver, Colorado 3/22/2005

2

Presentation Outline

• Background
– Unified Parallel C, implementations and users.
– Previous UPC performance studies.
• Experiments
– Available UPC platforms
– Benchmarks
• Performance measurements
• Conclusions

3

UPC Overview

• UPC is an extension of C for partitioned shared memory parallel programming.
– A special case of the shared memory programming model.
– Similar languages: Co-Array Fortran, Titanium.
– UPC homepage: http://www.upc.gwu.edu
• Platforms supported:
– Cray X1, Cray T3E, SGI Origin, HP AlphaServer, HP-UX, Linux clusters, IBM SP.
• UPC compilers:
– Open source: MuPC, Berkeley UPC, Intrepid UPC
– Commercial: HP UPC, Cray UPC
• Users:
– LBNL, IDA, AHPCRC, …
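
To make "partitioned shared memory" concrete, the following is a minimal sketch in plain C (not UPC itself) of how a blocked UPC shared array maps elements to threads. The function names `owner_thread` and `is_local` are hypothetical helpers introduced for illustration; the mapping they compute is the standard layout rule for a declaration of the form `shared [block] T a[n]`, where blocks of consecutive elements are dealt round-robin to threads.

```c
#include <assert.h>
#include <stddef.h>

/* Thread with affinity to element i of a UPC array declared as
 *     shared [block] T a[n];
 * Blocks of `block` consecutive elements are assigned to threads
 * round-robin. block == 1 is the default (cyclic) layout.
 * Hypothetical helper, not part of any UPC runtime. */
size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Nonzero when the reference is a "local shared" access: the element
 * has affinity to the calling thread, so no communication is needed. */
int is_local(size_t i, size_t block, size_t threads, size_t mythread)
{
    return owner_thread(i, block, threads) == mythread;
}
```

With 4 threads and a block size of 2, elements 0 and 1 belong to thread 0, elements 2 and 3 to thread 1, and so on, wrapping back to thread 0 at element 8. References for which `is_local` is false are the remote shared references measured later in the talk.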

4

Related UPC Performance Studies

• Performance benchmark suites
– UPC_Bench (GWU)
  • Synthetic microbenchmark based on the STREAM benchmark.
  • Application benchmarks: Sobel edge detection, matrix multiplication, N-Queens problem
– UPC NAS Parallel Benchmarks (GWU)
• Performance monitoring
– Performance analysis for HP UPC compiler (GWU)
– Performance of Berkeley UPC on HP AlphaServer (Berkeley)
– Performance of Intrepid UPC on SGI Origin (GWU)

5

Benchmarking UPC Systems

• Extended shared memory bandwidth microbenchmarks to cover various reference patterns:
– Scalar references: 11 access patterns
– Block memory operations: 9 access patterns
• Benchmarked six combinations of available UPC compilers and platforms using both the UPC STREAM (MTU code) and the UPC NAS Parallel Benchmarks (GWU code).
– Compilers: MuPC, HP UPC, Berkeley UPC and Intrepid UPC
– Platforms: Myrinet Linux cluster, HP AlphaServer SC, and T3E

• The first comparison of performance for currently available UPC implementations.

• The first report on MuPC performance.

6

Benchmarks

• Synthetic benchmarks:
– The STREAM microbenchmark was rewritten in UPC to cover a wider variety of shared memory access patterns:
  • Local shared read / write
  • Unit stride shared read / write / copy
  • Random shared read / write / copy
  • Stride-n shared read / write / copy
  • Block transfers with variations of source and sink affinities.
• NAS Parallel Benchmark Suite v2.4
– The UPC version was developed at GWU.
– Five kernels: CG, EP, FT, IS and MG.
– Two variations: naive version and hand-tuned version.
– Input size: Class A workload.
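
The access patterns listed above differ mainly in the order in which array elements are visited. The sketch below, in plain C, generates the index sequences for unit-stride, stride-n, and random walks and converts a timing into STREAM-style MB/s. It is an illustration only, not the MTU benchmark code: the function names, the wrap-around behavior, and the small linear congruential generator are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill idx[0..count-1] with a stride-s walk over an array of n elements,
 * wrapping around at the end. s == 1 gives the unit-stride pattern.
 * (Hypothetical helper; wrap-around is an assumption of this sketch.) */
void stride_indices(size_t *idx, size_t count, size_t n, size_t s)
{
    assert(n > 0);
    for (size_t k = 0; k < count; k++)
        idx[k] = (k * s) % n;
}

/* Fill idx with a reproducible pseudo-random walk over n elements,
 * using a small linear congruential generator seeded by `seed`. */
void random_indices(size_t *idx, size_t count, size_t n, unsigned long seed)
{
    unsigned long state = seed;
    assert(n > 0);
    for (size_t k = 0; k < count; k++) {
        state = state * 1103515245UL + 12345UL;  /* unsigned wrap is defined */
        idx[k] = (size_t)((state >> 16) % n);
    }
}

/* Bandwidth in MB/s (10^6 bytes per second, as STREAM reports it) for
 * `count` references of `elem_size` bytes completed in `seconds`. */
double bandwidth_mb_per_s(size_t count, size_t elem_size, double seconds)
{
    assert(seconds > 0.0);
    return (double)count * (double)elem_size / seconds / 1.0e6;
}
```

A benchmark loop would walk a shared array in the order given by `idx`, time the loop, and pass the reference count and element size to `bandwidth_mb_per_s`. Precomputing the indices keeps index generation out of the timed region.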

7

Local Shared References

• Intrepid UPC: performance is poor on local shared accesses.
• HP UPC: cache state has significant effects on local shared accesses.

8

Remote Shared References

• HP UPC and MuPC: caches help unit stride remote shared accesses.
• Intrepid UPC performs best for remote shared accesses.

9

Block Memory Operations

• HP UPC: performance is poor on certain string functions.
• Intrepid UPC: low performance on all categories.

10

NPB – CG

• The only case that scales well: Berkeley UPC + optimized code.

11

NPB – EP

12

NPB – FT

• HP, Berkeley and MuPC: performance is comparable.

13

NPB – IS

• HP, Berkeley and MuPC: performance is comparable.

14

NPB – MG

• MG performance is very inconsistent.

15

Conclusions

• STREAM benchmarking:
– UPC language overhead reduces performance of local shared references.
– Remote reference caching helps stride-1 accesses.
– Copying between two locations with the same affinity to a remote thread needs optimization.
• NPB benchmarking:
– Some implementations failed on some benchmarks. More stable and reliable implementations are needed.
– Hand-tuning techniques (e.g. prefetching) are critical to performance.
– Berkeley UPC is the best at handling unstructured, fine-grained references.
– MuPC experience shows that it will be more rewarding to optimize remote shared references than to improve network interconnects.
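
One way to see why remote reference caching favors stride-1 accesses is to count how many remote fetches a simple software cache would issue under different strides. The sketch below is a toy model in plain C and does not describe the actual cache design of MuPC or HP UPC: the direct-mapped organization, the 64-line capacity, and the name `remote_fetches` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINES 64  /* assumed capacity of the toy cache, in lines */

/* Count remote fetches (cache misses) when a thread reads `count`
 * remote elements at indices 0, stride, 2*stride, ... through a
 * direct-mapped cache whose lines each hold `line_elems` elements.
 * Hypothetical model, not taken from any UPC runtime. */
size_t remote_fetches(size_t count, size_t stride, size_t line_elems)
{
    size_t tags[CACHE_LINES];
    int valid[CACHE_LINES] = {0};
    size_t misses = 0;

    assert(line_elems > 0);
    for (size_t k = 0; k < count; k++) {
        size_t line = (k * stride) / line_elems;  /* remote line holding the element */
        size_t slot = line % CACHE_LINES;         /* direct-mapped placement */
        if (!valid[slot] || tags[slot] != line) {
            valid[slot] = 1;                      /* fetch the whole line once */
            tags[slot] = line;
            misses++;
        }
    }
    return misses;
}
```

In this model, 1024 stride-1 reads with 16 elements per line need 64 fetches, because each fetched line serves the next 15 reads. With a stride of 16, every read lands on a new line and all 1024 reads go to the network, which is consistent with the observation that caching helps stride-1 accesses in particular.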

16

Thank you!

For more information:

http://www.upc.mtu.edu
