Presentation at the 4th PMEO-PDS Workshop
Benchmark Measurements of Current UPC Platforms
Zhang Zhang and Steve Seidel
Michigan Technological University
Denver, Colorado 3/22/2005
Presentation Outline
• Background
  – Unified Parallel C, implementations and users.
  – Previous UPC performance studies.
• Experiments
  – Available UPC platforms
  – Benchmarks
• Performance measurements
• Conclusions
UPC Overview
• UPC is an extension of C for partitioned shared memory parallel programming.
  – A special case of the shared memory programming model.
  – Similar languages: Co-Array Fortran, Titanium.
  – UPC homepage: http://www.upc.gwu.edu
• Platforms supported:
  – Cray X1, Cray T3E, SGI Origin, HP AlphaServer, HP-UX, Linux clusters, IBM SP.
• UPC compilers:
  – Open source: MuPC, Berkeley UPC, Intrepid UPC
  – Commercial: HP UPC, Cray UPC
• Users:
  – LBNL, IDA, AHPCRC, …
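To make "partitioned shared memory" concrete: every element of a UPC shared array has affinity to exactly one thread, and blocks of the array are dealt out to threads round-robin. The sketch below is plain C, not UPC, and only mirrors that layout rule for an array declared as `shared [block] T a[n]`. The function names `owner_thread` and `is_local` are hypothetical and are not part of UPC or of the benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that element i has affinity to, for an array laid out like the
 * UPC declaration `shared [block] T a[n]` run with `threads` threads.
 * Blocks of `block` elements are assigned round-robin from thread 0.
 * (Hypothetical helper for illustration; plain C, not UPC.) */
size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Nonzero when a reference to element i by thread `me` would be a
 * "local shared" reference, zero when it would be a remote one. */
int is_local(size_t i, size_t block, size_t threads, size_t me)
{
    return owner_thread(i, block, threads) == me;
}
```

With the default block size of 1 and 4 threads, `owner_thread(5, 1, 4)` is 1, so thread 1 sees element 5 as local and every other thread sees it as remote.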
Related UPC Performance Studies
• Performance benchmark suites
  – UPC_Bench (GWU)
    • Synthetic microbenchmark based on the STREAM benchmark.
    • Application benchmarks: Sobel edge detection, matrix multiplication, N-Queens problem
  – UPC NAS Parallel Benchmarks (GWU)
• Performance monitoring
  – Performance analysis for HP UPC compiler (GWU)
  – Performance of Berkeley UPC on HP AlphaServer (Berkeley)
  – Performance of Intrepid UPC on SGI Origin (GWU)
Benchmarking UPC Systems
• Extended shared memory bandwidth microbenchmarks to cover various reference patterns:
  – Scalar references: 11 access patterns
  – Block memory operations: 9 access patterns
• Benchmarked six combinations of available UPC compilers and platforms using both the UPC STREAM (MTU code) and the UPC NAS Parallel Benchmarks (GWU code).
  – Compilers: MuPC, HP UPC, Berkeley UPC and Intrepid UPC
  – Platforms: Myrinet Linux cluster, HP AlphaServer SC, and T3E
• The first comparison of performance for currently available UPC implementations.
• The first report on MuPC performance.
Benchmarks
• Synthetic benchmarks:
  – The STREAM microbenchmark was rewritten in UPC to cover a wider variety of shared memory access patterns:
    • Local shared read / write
    • Unit stride shared read / write / copy
    • Random shared read / write / copy
    • Stride-n shared read / write / copy
    • Block transfers with variations of source and sink affinities.
• NAS Parallel Benchmark Suite v2.4
  – The UPC version was developed at GWU.
  – Five kernels: CG, EP, FT, IS and MG.
  – Two variations: naïve version and hand-tuned version.
  – Input size: Class A workload.
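The scalar access patterns listed above differ only in the order in which array elements are touched. The following is a minimal plain C sketch of that idea, not the MTU benchmark code: it builds unit-stride, stride-n and random index sequences, runs a read kernel over them, and converts a measured time into MB/s. All function names are hypothetical, the random generator is an ad hoc linear congruential generator chosen only for repeatability, and 1 MB is assumed to mean 10^6 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Fill idx[0..count-1] with a strided walk over an n-element array,
 * wrapping around at the end. stride == 1 gives the unit-stride pattern. */
void fill_stride(size_t *idx, size_t count, size_t n, size_t stride)
{
    assert(n > 0);
    for (size_t k = 0; k < count; k++)
        idx[k] = (k * stride) % n;
}

/* Fill idx[0..count-1] with pseudo-random positions in [0, n).
 * The generator is a simple LCG used only so runs are repeatable. */
void fill_random(size_t *idx, size_t count, size_t n, unsigned long seed)
{
    assert(n > 0);
    unsigned long state = seed;
    for (size_t k = 0; k < count; k++) {
        state = state * 1103515245UL + 12345UL; /* wraps by design */
        idx[k] = (size_t)((state >> 16) % n);
    }
}

/* Read kernel: touch a[] in the order given by idx[] and return the sum,
 * so the compiler cannot discard the loads. */
double read_kernel(const double *a, const size_t *idx, size_t count)
{
    double sum = 0.0;
    for (size_t k = 0; k < count; k++)
        sum += a[idx[k]];
    return sum;
}

/* Bandwidth in MB/s for `count` references of `elem_bytes` bytes each,
 * assuming 1 MB = 10^6 bytes. */
double bandwidth_mb_per_s(size_t count, size_t elem_bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)count * (double)elem_bytes / (seconds * 1.0e6);
}
```

In a real UPC run the array would be a shared array and the time would come from a wall-clock timer around the kernel. Only the index sequence changes from one pattern to the next, which is what lets a single kernel cover several patterns.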
Local Shared References
• Intrepid UPC: performance is poor on local shared accesses.
• HP UPC: cache state has significant effects on local shared accesses.
Remote Shared References
• HP UPC and MuPC: caches help unit stride remote shared accesses.
• Intrepid UPC does the best for remote shared accesses.
Block Memory Operations
• HP UPC: performance is poor on certain string functions.
• Intrepid UPC: low performance on all categories.
NPB – CG
• The only case that scales well: Berkeley UPC + optimized code.
NPB – EP
NPB – FT
• HP, Berkeley and MuPC: performance is comparable.
NPB – IS
• HP, Berkeley and MuPC: performance is comparable.
NPB – MG
• MG performance is very inconsistent.
Conclusions
• STREAM benchmarking:
  – UPC language overhead reduces performance of local shared references.
  – Remote reference caching helps stride-1 accesses.
  – Copying between two locations with the same affinity to a remote thread needs optimization.
• NPB benchmarking:
  – Some implementations failed for some benchmarks. More stable and reliable implementations are needed.
  – Hand-tuning techniques (e.g. prefetching) are critical to performance.
  – Berkeley UPC is the best at handling unstructured, fine-grained references.
  – MuPC experience shows that it will be more rewarding to optimize remote shared references than to improve network interconnects.
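One way to see why remote reference caching favors stride-1 accesses is a toy model: if each remote fetch brings in a whole line of neighboring elements, consecutive references reuse that line, while references one line apart never do. The plain C sketch below counts simulated remote fetches for a fixed-stride read sequence using a single cached line. It is an illustration under those assumptions only; it does not describe how the MuPC or HP UPC caches are actually organized, and `remote_fetches` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Count simulated remote fetches for `count` reads at positions
 * 0, stride, 2*stride, ... when a miss brings in one line of `line`
 * consecutive elements and only one line is kept at a time.
 * (Toy model for illustration, not a real UPC runtime cache.) */
size_t remote_fetches(size_t count, size_t stride, size_t line)
{
    assert(line > 0);
    size_t fetches = 0;
    size_t cached_line = 0;
    int have_line = 0; /* nothing cached before the first read */

    for (size_t k = 0; k < count; k++) {
        size_t wanted = (k * stride) / line; /* line holding this element */
        if (!have_line || wanted != cached_line) {
            cached_line = wanted; /* miss: fetch the line from the remote thread */
            have_line = 1;
            fetches++;
        }
    }
    return fetches;
}
```

In this model, 1024 stride-1 reads with an 8-element line cost 128 fetches, while 1024 reads at stride 8 cost 1024 fetches, one per reference.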
Thank you!
For more information:
http://www.upc.mtu.edu