
Page 1: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


An Evaluation of Global Address Space Languages: Co-Array Fortran

and Unified Parallel C

Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University)

Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University)

Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)

Page 2: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


GAS Languages

• Global address space programming model

– one-sided communication (GET/PUT)

• Programmer has control over performance-critical factors

– data distribution and locality control

– computation partitioning

– communication placement

• Data movement and synchronization as language primitives

– amenable to compiler-based communication optimization

Slide annotations: "HPF & OpenMP compilers must get this right"; "simpler than message passing"; "lacking in OpenMP"

Page 3: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Questions

• Can GAS languages match the performance of hand-tuned message passing programs?

• What are the obstacles to obtaining performance with GAS languages?

• What should be done to ameliorate them?

– by language modifications or extensions

– by compilers

– by run-time systems

• How easy is it to develop high performance programs in GAS languages?

Page 4: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Approach

Evaluate CAF and UPC using NAS Parallel Benchmarks

• Compare performance to that of MPI versions

– use hardware performance counters to pinpoint differences

• Determine optimization techniques common for both languages as well as language specific optimizations

– language features

– program implementation strategies

– compiler optimizations

– runtime optimizations

• Assess programmability of the CAF and UPC variants

Page 5: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Outline

• Questions and approach

• CAF & UPC

– Features

– Compilers

– Performance considerations

• Experimental evaluation

• Conclusions

Page 6: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CAF & UPC Common Features

• SPMD programming model

• Both private and shared data

• Language-level one-sided shared-memory communication

• Synchronization intrinsic functions (barrier, fence)

• Pointers and dynamic allocation

Page 7: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CAF & UPC Differences I

• Multidimensional arrays

– CAF: multidimensional arrays, procedure argument reshaping

– UPC: linearization, typically using macros

• Local accesses to shared data

– CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)

– UPC: shared array reference using MYTHREAD or a C pointer

Page 8: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CAF and UPC Differences II

• Scalar/element-wise remote accesses

– CAF: multidimensional subscripts + bracket syntax

a(1,1) = a(1,M)[this_image()-1]

– UPC: shared (“flat”) array access with linearized subscripts

a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]

• Bulk and strided remote accesses

– CAF: use natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)

– UPC: use library functions (and temporary storage to hold a copy)

Page 9: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Bulk Communication

CAF:

integer a(N,M)[*]

a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]

UPC:

shared int *a;

upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));

[Figure: an N×M array block on each of images P1 … PN]

Page 10: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CAF & UPC Differences III

• Synchronization

– CAF: team synchronization

– UPC: split-phase barrier, locks

• UPC: worksharing construct upc_forall

• UPC: richer set of pointer types

Page 11: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Outline

• Questions and approach

• CAF & UPC

– Features

– Compilers

– Performance considerations

• Experimental evaluation

• Conclusions

Page 12: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CAF Compilers

• Rice Co-Array Fortran Compiler (cafc)

– Multi-platform compiler

– Implements core of the language

• core sufficient for non-trivial codes

• currently lacks support for derived type and dynamic co-arrays

– Source-to-source translator

• translates CAF into Fortran 90 and communication code

• uses ARMCI or GASNet as communication substrate

• can generate load/store for remote data accesses on SMPs

– Performance comparable to that of hand-tuned MPI codes

– Open source

• Vendor compilers: Cray

Page 13: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


UPC Compilers

• Berkeley UPC Compiler

– Multi-platform compiler

– Implements full UPC 1.1 specification

– Source-to-source translator

• converts UPC into ANSI C and calls to UPC runtime library & GASNet

• tailors code to a specific architecture: cluster or SMP

– Open source

• Intrepid UPC compiler

– Based on GCC compiler

– Works on SGI Origin, Cray T3E and Linux SMP

• Other vendor compilers: Cray, HP

Page 14: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Outline

• Questions and approach

• CAF & UPC

– Features

– Compilers

– Performance considerations

• Experimental evaluation

• Conclusions

Page 15: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Scalar Performance

• Generate code amenable to back-end compiler optimizations

– Quality of back-end compilers

• poor reduction recognition in the Intel C compiler

• Local access to shared data

– CAF: use F90 pointers and procedure arguments

– UPC: use C pointers instead of UPC shared pointers

• Alias and dependence analysis

– Fortran vs. C language semantics

• multidimensional arrays in Fortran

• procedure argument reshaping

– Convey lack of aliasing for (non-aliased) shared variables

• CAF: use procedure splitting so co-arrays are referenced as arguments

• UPC: use the C99 restrict keyword for C pointers used to access shared data

Page 16: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Communication

• Communication vectorization is essential for high performance on cluster architectures for both languages

– CAF

• use F90 array sections (compiler translates to appropriate library calls)

– UPC

• use library functions for contiguous transfers

• use UPC extensions for strided transfer in Berkeley UPC compiler

• Increase efficiency of strided transfers by packing/unpacking data at the language level

Page 17: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Synchronization

• Barrier-based synchronization

– Can lead to over-synchronized code

• Use point-to-point synchronization

– CAF: proposed language extension (sync_notify, sync_wait)

– UPC: language-level implementation

Page 18: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Outline

• Questions and approach

• CAF & UPC

• Experimental evaluation

• Conclusions

Page 19: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Platforms and Benchmarks

• Platforms

– Itanium2+Myrinet 2000 (900 MHz Itanium2)

– Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)

– SGI Altix 3000 (1.5 GHz Itanium2)

– SGI Origin 2000 (R10000)

• Codes

– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

– MG, CG, SP, BT

– CAF and UPC versions were derived from Fortran77+MPI versions

Page 20: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


MG class A (256³) on Itanium2+Myrinet 2000

Performance chart (higher is better):

• Intel compiler: restrict yields a 2.3x performance improvement

• UPC strided communication: 28% faster than multiple transfers

• UPC point-to-point synchronization: 49% faster than barriers

• CAF point-to-point synchronization: 35% faster than barriers

Page 21: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


MG class C (512³) on SGI Altix 3000

Performance chart (higher is better):

• Intel C compiler: scalar performance

• Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts

Page 22: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


MG class B (256³) on SGI Origin 2000

Performance chart (higher is better).

Page 23: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CG class C (150000) on SGI Altix 3000

Performance chart (higher is better):

• Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!

• Point-to-point synchronization: 19% faster than barriers

Page 24: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


CG class B (75000) on SGI Origin 2000

Performance chart (higher is better):

• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with SGI C/Fortran!

Page 25: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


SP class C (162³) on Itanium2+Myrinet 2000

Performance chart (higher is better):

• restrict yields an 18% performance improvement

Page 26: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


SP class C (162³) on Alpha+Quadrics

Performance chart (higher is better).

Page 27: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


BT class C (162³) on Itanium2+Myrinet 2000

Performance chart (higher is better):

• UPC: use of restrict boosts performance by 43%

• CAF: procedure splitting improves performance by 42-60%

• UPC: communication packing is 32% faster

• CAF: communication packing is 7% faster

Page 28: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


BT class B (102³) on SGI Altix 3000

Performance chart (higher is better):

• Use of restrict improves performance by 30%

Page 29: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Conclusions

• Matching MPI performance required using bulk communication

– library-based primitives are cumbersome in UPC

– communicating multi-dimensional array sections is natural in CAF

– lack of efficient run-time support for strided communication is a problem

• With CAF, can achieve performance comparable to MPI

• With UPC, matching MPI performance can be difficult

– CG: able to match MPI on all platforms

– SP, BT, MG: substantial gap remains

Page 30: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Why the Gap?

• Communication layer is not the problem

– CAF with ARMCI or GASNet yields equivalent performance

• Scalar code optimization of scientific code is the key!

– SP+BT: SGI Fortran: unroll-and-jam, software pipelining (SWP)

– MG: SGI Fortran: loop alignment, fusion

– CG: Intel Fortran: optimized sum reduction

• Linearized subscripts for multidimensional arrays hurt!

– measured 30% performance gap with Intel Fortran

Page 31: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Programming for Performance

• In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult

• To make codes efficient across the full range of architectures, we need

– better language support for synchronization

• point-to-point synchronization is an important common case!

– better CAF & UPC compiler support

• communication vectorization

• synchronization strength reduction

– better compiler optimization of loops with complex dependence patterns

– better run-time library support

• efficient communication of strided array sections