An Evaluation of Global Address Space Languages: Co-Array Fortran
and Unified Parallel C
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University)
Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University)
Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)
GAS Languages
• Global address space programming model
– one-sided communication (GET/PUT): simpler than message passing
• Programmer has control over performance-critical factors that HPF & OpenMP compilers must get right
– data distribution and locality control (lacking in OpenMP)
– computation partitioning
– communication placement
• Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization
Questions
• Can GAS languages match the performance of hand-tuned message passing programs?
• What are the obstacles to obtaining performance with GAS languages?
• What should be done to ameliorate them?
– by language modifications or extensions
– by compilers
– by run-time systems
• How easy is it to develop high performance programs in GAS languages?
Approach
Evaluate CAF and UPC using NAS Parallel Benchmarks
• Compare performance to that of MPI versions
– use hardware performance counters to pinpoint differences
• Determine optimization techniques common for both languages as well as language specific optimizations
– language features
– program implementation strategies
– compiler optimizations
– runtime optimizations
• Assess programmability of the CAF and UPC variants
Outline
• Questions and approach
• CAF & UPC
– Features
– Compilers
– Performance considerations
• Experimental evaluation
• Conclusions
CAF & UPC Common Features
• SPMD programming model
• Both private and shared data
• Language-level one-sided shared-memory communication
• Synchronization intrinsic functions (barrier, fence)
• Pointers and dynamic allocation
CAF & UPC Differences I
• Multidimensional arrays
– CAF: multidimensional arrays, procedure argument reshaping
– UPC: linearization, typically using macros
• Local accesses to shared data
– CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
– UPC: shared array reference using MYTHREAD or a C pointer
CAF and UPC Differences II
• Scalar/element-wise remote accesses
– CAF: multidimensional subscripts + bracket syntax
a(1,1) = a(1,M)[this_image()-1]
– UPC: shared (“flat”) array access with linearized subscripts
a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
• Bulk and strided remote accesses
– CAF: natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)
– UPC: library functions (and temporary storage to hold a copy)
Bulk Communication
CAF:
integer a(N,M)[*]
a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
UPC:
shared int *a;
upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
[Figure: an N x M array distributed across images P1 ... PN; two columns copied from the neighboring image]
CAF & UPC Differences III
• Synchronization
– CAF: team synchronization
– UPC: split-phase barrier, locks
• UPC: worksharing construct upc_forall
• UPC: richer set of pointer types
Outline
• Questions and approach
• CAF & UPC
– Features
– Compilers
– Performance considerations
• Experimental evaluation
• Conclusions
CAF Compilers
• Rice Co-Array Fortran Compiler (cafc)
– Multi-platform compiler
– Implements core of the language
• core sufficient for non-trivial codes
• currently lacks support for derived type and dynamic co-arrays
– Source-to-source translator
• translates CAF into Fortran 90 and communication code
• uses ARMCI or GASNet as communication substrate
• can generate load/store for remote data accesses on SMPs
– Performance comparable to that of hand-tuned MPI codes
– Open source
• Vendor compilers: Cray
UPC Compilers
• Berkeley UPC Compiler
– Multi-platform compiler
– Implements full UPC 1.1 specification
– Source-to-source translator
• converts UPC into ANSI C and calls to UPC runtime library & GASNet
• tailors code to a specific architecture: cluster or SMP
– Open source
• Intrepid UPC compiler
– Based on GCC compiler
– Works on SGI Origin, Cray T3E and Linux SMP
• Other vendor compilers: Cray, HP
Outline
• Motivation and Goals
• CAF & UPC
– Features
– Compilers
– Performance considerations
• Experimental evaluation
• Conclusions
Scalar Performance
• Generate code amenable to back-end compiler optimizations
– quality of back-end compilers matters
• poor reduction recognition in the Intel C compiler
• Local access to shared data
– CAF: use F90 pointers and procedure arguments
– UPC: use C pointers instead of UPC shared pointers
• Alias and dependence analysis
– Fortran vs. C language semantics
• multidimensional arrays in Fortran
• procedure argument reshaping
– Convey lack of aliasing for (non-aliased) shared variables
• CAF: use procedure splitting so co-arrays are referenced as arguments
• UPC: use the C99 restrict keyword for C pointers used to access shared data
Communication
• Communication vectorization is essential for high performance on cluster architectures for both languages
– CAF
• use F90 array sections (compiler translates to appropriate library calls)
– UPC
• use library functions for contiguous transfers
• use UPC extensions for strided transfer in Berkeley UPC compiler
• Increase efficiency of strided transfers by packing/unpacking data at the language level
Synchronization
• Barrier-based synchronization
– Can lead to over-synchronized code
• Use point-to-point synchronization
– CAF: proposed language extension (sync_notify, sync_wait)
– UPC: language-level implementation
Outline
• Questions and approach
• CAF & UPC
• Experimental evaluation
• Conclusions
Platforms and Benchmarks
• Platforms
– Itanium2+Myrinet 2000 (900 MHz Itanium2)
– Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
– SGI Altix 3000 (1.5 GHz Itanium2)
– SGI Origin 2000 (R10000)
• Codes
– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
– MG, CG, SP, BT
– CAF and UPC versions were derived from Fortran77+MPI versions
MG class A (256³) on Itanium2+Myrinet2000
[Chart: benchmark performance; higher is better]
• Intel compiler: restrict yields a 2.3x performance improvement
• UPC strided communication: 28% faster than multiple transfers
• UPC point-to-point synchronization: 49% faster than barriers
• CAF point-to-point synchronization: 35% faster than barriers
MG class C (512³) on SGI Altix 3000
[Chart: benchmark performance; higher is better]
• Intel C compiler: scalar performance issues
• Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts
MG class B (256³) on SGI Origin 2000
[Chart: benchmark performance; higher is better]
CG class C (150000) on SGI Altix 3000
[Chart: benchmark performance; higher is better]
• Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
• point-to-point synchronization: 19% faster than barriers
CG class B (75000) on SGI Origin 2000
[Chart: benchmark performance; higher is better]
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than SGI C/Fortran!
SP class C (162³) on Itanium2+Myrinet2000
[Chart: benchmark performance; higher is better]
• restrict yields an 18% performance improvement
SP class C (162³) on Alpha+Quadrics
[Chart: benchmark performance; higher is better]
BT class C (162³) on Itanium2+Myrinet2000
[Chart: benchmark performance; higher is better]
• UPC: use of restrict boosts performance by 43%
• CAF: procedure splitting improves performance by 42-60%
• UPC: communication packing is 32% faster
• CAF: communication packing is 7% faster
BT class B (102³) on SGI Altix 3000
[Chart: benchmark performance; higher is better]
• use of restrict improves performance by 30%
Conclusions
• Matching MPI performance required using bulk communication
– library-based primitives are cumbersome in UPC
– communicating multi-dimensional array sections is natural in CAF
– lack of efficient run-time support for strided communication is a problem
• With CAF, can achieve performance comparable to MPI
• With UPC, matching MPI performance can be difficult
– CG: able to match MPI on all platforms
– SP, BT, MG: substantial gap remains
Why the Gap?
• Communication layer is not the problem
– CAF with ARMCI or GASNet yields equivalent performance
• Scalar optimization of the compiled code is the key!
– SP+BT: SGI Fortran: unroll-and-jam, software pipelining (SWP)
– MG: SGI Fortran: loop alignment, fusion
– CG: Intel Fortran: optimized sum reduction
• Linearized subscripts for multidimensional arrays hurt!
– measured 30% performance gap with Intel Fortran
Programming for Performance
• In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult
• To make codes efficient across the full range of architectures, we need
– better language support for synchronization
• point-to-point synchronization is an important common case!
– better CAF & UPC compiler support
• communication vectorization
• synchronization strength reduction
– better compiler optimization of loops with complex dependence patterns
– better run-time library support
• efficient communication of strided array sections