Towards Optimized UPC Implementations
Tarek A. El-Ghazawi, The George Washington University
Agenda
Background
UPC Language Overview
Productivity
Performance Issues
Automatic Optimizations
Conclusions
Parallel Programming Models

What is a programming model?
An abstract machine that outlines the programmer's view of data and execution
Where architecture and applications meet
A non-binding contract between the programmer and the compiler/system

Good programming models should:
Allow efficient mapping on different architectures
Keep programming easy

Benefits:
Applications gain independence from the architecture
Architectures gain independence from the applications
Programming Models

Message Passing (e.g. MPI) · Shared Memory (e.g. OpenMP) · DSM/PGAS (e.g. UPC)
[Diagram: how processes/threads map onto the address space(s) in each of the three models]
Programming Paradigms Expressivity

                        LOCALITY
PARALLELISM   Implicit                          Explicit
Implicit      Sequential (e.g. C,               Data Parallel (e.g. HPF, C*)
              Fortran, Java)
Explicit      Shared Memory (e.g. OpenMP)       Distributed Shared Memory/PGAS
                                                (e.g. UPC, CAF, and Titanium)
What is UPC?

Unified Parallel C:
An explicit parallel extension of ISO C
A distributed shared memory/PGAS parallel programming language
Why not message passing?

Performance:
High penalty for short transactions
Cost of calls
Two-sided communication
Excessive buffering

Ease of use:
Explicit data transfers
Domain decomposition does not maintain the original global application view
More code and conceptual difficulty
Why DSM/PGAS?

Performance:
No call overhead
Efficient short transfers
Locality exploitation

Ease of use:
Implicit transfers
Consistent global application view
Less code and conceptual difficulty
Why DSM/PGAS: New Opportunities for Compiler Optimizations

[Figure: an image partitioned into row bands across Thread 0 through Thread 3, with ghost zones at the band boundaries for the Sobel operator]

The DSM programming model exposes sequential remote accesses at compile time, an opportunity for compiler-directed prefetching (sketched below).
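As a concrete illustration (a sketch, not from the slides: the image name, sizes, and a static THREADS environment are assumed), the boundary-row reads of the Sobel operator can be turned into one bulk prefetch per neighbor:

    #include <upc_relaxed.h>

    #define ROWS 512
    #define COLS 512
    /* one contiguous band of ROWS/THREADS rows per thread
       (assumes ROWS is a multiple of THREADS, static THREADS env) */
    shared [ROWS*COLS/THREADS] unsigned char img[ROWS][COLS];

    unsigned char ghost_above[COLS];  /* private copy of the ghost row */

    void prefetch_ghost(void)
    {
        int first = MYTHREAD * (ROWS / THREADS);  /* first row of my band */
        if (first > 0)
            /* one bulk transfer of the neighbor's boundary row instead of
               COLS element-by-element remote reads */
            upc_memget(ghost_above, &img[first - 1][0], COLS);
    }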
History

Initial technical report from IDA, in collaboration with LLNL and UCB, in May 1999
UPC consortium of government, academia, and HPC vendors, coordinated by GWU, IDA, and DoD
Current participants: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …
Status

Specification v1.0 completed in February 2001, v1.1.1 in October 2003; v1.2 will add collectives and UPC-IO
Benchmarking suites: STREAM, GUPS, RandomAccess, NPB suite, Splash-2, and others
Testing suite v1.0, v1.1
Short courses and tutorials in the US and abroad
Research exhibits at SC 2000-2004
UPC web site: upc.gwu.edu
UPC book by mid-2005 from John Wiley and Sons
Manual(s)
Hardware Platforms

UPC implementations are available for:
SGI O2000/3000: Intrepid (32- and 64-bit GCC), UCB (32-bit GCC)
Cray T3D/E
Cray X-1
HP AlphaServer SC, Superdome
UPC Berkeley Compiler: Myrinet, Quadrics, and InfiniBand clusters
Beowulf reference implementation (MPI-based, MTU)
New ongoing efforts by IBM and Sun
UPC Execution Model

A number of threads working independently in a SPMD fashion
MYTHREAD specifies the thread index (0..THREADS-1)
The number of threads is specified at compile time or run time
Process and data synchronization when needed:
Barriers and split-phase barriers
Locks and arrays of locks
Fence
Memory consistency control
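A minimal SPMD sketch (illustrative, not from the slides):

    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* All THREADS threads execute main() independently (SPMD). */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;   /* no thread proceeds until all have arrived */

        if (MYTHREAD == 0)
            printf("all threads passed the barrier\n");
        return 0;
    }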
UPC Memory Model

Shared space with thread affinity, plus private spaces
A pointer-to-shared can reference all locations in the shared space
A private pointer may reference only addresses in its private space or in its own portion of the shared space
Static and dynamic memory allocation is supported for both shared and private memory
[Figure: one shared space spanning Thread 0 .. Thread THREADS-1, with each thread owning the partition that has affinity to it, plus per-thread private spaces Private 0 .. Private THREADS-1]
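A short sketch of the model (illustrative declarations; variable names are assumed):

    #include <upc_relaxed.h>

    shared int flag;                /* one shared scalar, affinity to thread 0  */
    shared [2] int data[2*THREADS]; /* blocked: data[0..1] on thread 0,
                                       data[2..3] on thread 1, and so on        */

    int main(void)
    {
        int mine = -1;              /* private: each thread holds its own copy  */

        data[2*MYTHREAD] = MYTHREAD;  /* write to an element with local affinity */
        if (MYTHREAD == 0)
            flag = 1;                 /* thread 0 updates the shared scalar      */
        upc_barrier;
        mine = data[0];               /* any thread may read any shared location */
        (void)mine;
        return 0;
    }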
UPC Pointers

How to declare them?

    int *p1;                /* private pointer pointing locally */
    shared int *p2;         /* private pointer pointing into the shared space */
    int *shared p3;         /* shared pointer pointing locally */
    shared int *shared p4;  /* shared pointer pointing into the shared space */

You may find many using "shared pointer" to mean a pointer pointing to a shared object, i.e. equivalent to p2, but it could be p4 as well.
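Since a p2-style pointer with local affinity can be cast to an ordinary C pointer, local shared data can be accessed without translation overhead; this is the "privatization" idea revisited later in the talk (a sketch; the array name is illustrative):

    #include <upc_relaxed.h>

    #define B 4
    shared [B] int a[B*THREADS];   /* one block of B elements per thread */

    void fill_local_block(void)
    {
        /* a[B*MYTHREAD] has affinity to this thread, so casting its
           address to a private int* is legal UPC and skips the
           shared-address translation on every subsequent access. */
        int *p = (int *)&a[B*MYTHREAD];
        for (int i = 0; i < B; i++)
            p[i] = i;
    }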
UPC Pointers

[Figure: P1 and P2 live in each thread's private space, P1 pointing into that private space and P2 into the shared space; P3 and P4 live in the shared space, P3 pointing locally and P4 into the shared space]
Synchronization - Barriers

No implicit synchronization among the threads
UPC provides the following synchronization mechanisms (a sketch follows):
Barriers
Locks
Memory consistency control
Fence
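A sketch combining these mechanisms (illustrative; the accumulator is hypothetical):

    #include <upc_relaxed.h>

    shared int total;              /* zero-initialized, affinity to thread 0 */
    upc_lock_t *lk;                /* private pointer to one shared lock     */

    int main(void)
    {
        lk = upc_all_lock_alloc(); /* collective: all threads get the same lock */

        upc_lock(lk);              /* mutual exclusion around the update        */
        total += MYTHREAD;
        upc_unlock(lk);

        upc_barrier;               /* total is complete past this point         */

        upc_notify;                /* split-phase barrier: signal arrival ...   */
        /* ... independent local work can overlap here ... */
        upc_wait;                  /* ... then wait for everyone else           */
        return 0;
    }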
Memory Consistency Models

Has to do with the ordering of shared operations, and with when a change to a shared object by one thread becomes visible to others
Consistency can be strict or relaxed
Under the relaxed consistency model, shared operations can be reordered by the compiler/runtime system
The strict consistency model enforces sequential ordering of shared operations (no operation on shared data can begin before the previous ones are done, and changes become visible immediately)
Memory Consistency Models

The user specifies the memory model through:
Declarations
Pragmas, for a particular statement or sequence of statements
Use of barriers and global operations
The programmer is responsible for using the correct consistency model; a sketch follows.
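A sketch of these mechanisms (illustrative producer/consumer; variable names are assumed):

    #include <upc_relaxed.h>   /* file-level default: relaxed consistency  */

    shared int data;           /* ordinary (relaxed) shared variable       */
    strict shared int flag;    /* declaration: accesses to flag are strict */

    int main(void)
    {
        if (MYTHREAD == 0) {
            data = 99;         /* relaxed write ...                        */
            flag = 1;          /* ... but the strict write to flag cannot
                                  be reordered before it                   */
        } else if (MYTHREAD == 1) {
            while (flag == 0)  /* strict reads: poll until thread 0 signals */
                ;
            /* data == 99 is guaranteed to be visible here */
        }

        upc_barrier;
        if (MYTHREAD == 0) {
    #pragma upc strict         /* pragma: strict consistency for this block */
            data = data + 1;
        }
        return 0;
    }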
UPC and Productivity

Metrics:
Lines of 'useful' code: indicates development time as well as maintenance cost
Number of 'useful' characters: an alternative way to measure development and maintenance effort
Conceptual complexity: function level, keyword usage, number of tokens, maximum loop depth, …
Manual Effort – NPB Example

              SEQ   | UPC   | SEQ   | MPI   | UPC Effort (%) | MPI Effort (%)
NPB-CG #line  665   | 710   | 506   | 1046  | 6.77           | 106.72
       #char  16145 | 17200 | 16485 | 37501 | 6.53           | 127.49
NPB-EP #line  127   | 183   | 130   | 181   | 44.09          | 36.23
       #char  2868  | 4117  | 4741  | 6567  | 43.55          | 38.52
NPB-FT #line  575   | 1018  | 665   | 1278  | 77.04          | 92.18
       #char  13090 | 21672 | 22188 | 44348 | 65.56          | 99.87
NPB-IS #line  353   | 528   | 353   | 627   | 49.58          | 77.62
       #char  7273  | 13114 | 7273  | 13324 | 80.31          | 83.20
NPB-MG #line  610   | 866   | 885   | 1613  | 41.97          | 82.26
       #char  14830 | 21990 | 27129 | 50497 | 48.28          | 86.14

UPC effort (%) = (#UPC - #SEQ) / #SEQ x 100;  MPI effort (%) = (#MPI - #SEQ) / #SEQ x 100
Manual Effort – More Examples

                 SEQ  | MPI  | SEQ  | UPC  | MPI Effort (%) | UPC Effort (%)
GUPS      #line  41   | 98   | 41   | 47   | 139.02         | 14.63
          #char  1063 | 2979 | 1063 | 1251 | 180.02         | 17.68
Histogram #line  12   | 30   | 12   | 20   | 150.00         | 66.67
          #char  188  | 705  | 188  | 376  | 275.00         | 100.00
N-Queens  #line  86   | 166  | 86   | 139  | 93.02          | 61.63
          #char  1555 | 3332 | 1555 | 2516 | 124.28         | 61.80

(Effort computed as on the previous slide.)
Conceptual Complexity - HIST

HISTOGRAM - UPC (overall score: 22)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           5           | 4           | 0     | 3                 | 0         | 12
#Function calls                       0           | 0           | 0     | 4                 | 0         | 4
#References to THREADS and MYTHREAD   2           | 1           | 0     | 0                 | 0         | 3
#UPC constructs & UPC types           0           | 2           | 0     | 1                 | 0         | 3
Notes: work distribution: 2 if, 1 for; data distribution: 2 shared declarations; synchronization & consistency: 1 lock declaration, 1 lock/unlock, 2 barriers

HISTOGRAM - MPI (overall score: 47)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           5           | 0           | 15    | 0                 | 6         | 26
#Function calls                       0           | 0           | 2     | 2                 | 4         | 8
#References to myrank and nprocs      3           | 0           | 2     | 0                 | 2         | 5
#MPI types                            0           | 0           | 6     | 0                 | 2         | 8
Notes: work distribution: 2 if, 1 for; communication: 1 Scatter, 1 Reduce; synchronization & consistency: implicit with collectives; misc: 1 Init/Finalize, 2 Comm
Conceptual Complexity - GUPS

GUPS - UPC (overall score: 43)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           21          | 6           | 0     | 0                 | 0         | 27
#Function calls                       0           | 4           | 0     | 2                 | 0         | 6
#References to THREADS and MYTHREAD   3           | 4           | 0     | 0                 | 0         | 7
#UPC constructs & UPC types           3           | 0           | 0     | 0                 | 0         | 3
Notes: work distribution: 3 forall, 2 for, 3 if; data distribution: 5 shared, 2 all_alloc, 2 free; synchronization & consistency: 2 barriers

GUPS - MPI (overall score: 136)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           18          | 17          | 38    | 1                 | 6         | 80
#Function calls                       0           | 7           | 6     | 3                 | 6         | 22
#References to myrank and nprocs      3           | 5           | 13    | 1                 | 4         | 26
#MPI types                            0           | 6           | 2     | 0                 | 0         | 8
Notes: work distribution: 5 for, 3 if; data distribution: 2 mem alloc, 2 mem free, 3 window; communication: 2 one-sided, 4 collective; synchronization & consistency: implicit with collectives and WinFence, 1 barrier; misc: Init, Finalize, comm_rank, comm_size, 2 Wtime (6 error handles)
UPC Optimization Issues

Particular challenges:
Cost of address translation
Avoiding address translation
Special opportunities:
Locality-driven, compiler-directed prefetching
Aggregation
General:
Low-level optimized libraries, e.g. collectives
Backend optimizations
Overlapping remote accesses and synchronization with other work
Showing Potential Optimizations Through Emulated Hand-Tunings

Different hand-tuning levels (see the sketch below):
Unoptimized UPC code, referred to as UPC.O0
Privatized UPC code, referred to as UPC.O1
Prefetched UPC code: a hand-optimized variant using block get/put to mimic the effect of prefetching, referred to as UPC.O2
Fully hand-tuned UPC code: a hand-optimized variant integrating privatization, aggregation of remote accesses, as well as prefetching, referred to as UPC.O3

T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372
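A sketch of what these levels look like for a simple shared-array update (illustrative; array and function names are assumed, along with a static THREADS environment):

    #include <upc_relaxed.h>

    #define N 1024                    /* assume N is a multiple of THREADS */
    #define BLK (N/THREADS)
    shared [BLK] double a[N];         /* one contiguous block per thread   */

    /* UPC.O0: every access pays the shared-address translation cost */
    void scale_O0(double s)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] *= s;
    }

    /* UPC.O1 (privatization): local block accessed via a private pointer */
    void scale_O1(double s)
    {
        double *p = (double *)&a[MYTHREAD * BLK];
        for (int i = 0; i < BLK; i++)
            p[i] *= s;
    }

    /* UPC.O2 (prefetching): mimic prefetch with bulk block get/put,
       e.g. scale_remote_O2((shared [] double *)&a[tgt*BLK], s); */
    void scale_remote_O2(shared [] double *src, double s)
    {
        double buf[BLK];
        upc_memget(buf, src, sizeof buf);   /* one bulk get */
        for (int i = 0; i < BLK; i++)
            buf[i] *= s;
        upc_memput(src, buf, sizeof buf);   /* one bulk put */
    }

UPC.O3 combines these with aggregation of remote accesses.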
Address Translation Cost and Local Space Privatization - Cluster

STREAM benchmark (results gathered on a Myrinet cluster):

MB/s         Put     | Get     | Scale   | Sum
CC           N/A     | N/A     | 1565.04 | 5409.3
UPC Private  N/A     | N/A     | 1687.63 | 1776.81
UPC Local    1196.51 | 1082.89 | 54.22   | 82.7
UPC Remote   241.43  | 237.51  | 0.09    | 0.16

MB/s         Copy (arr) | Copy (ptr) | Memcpy  | Memset
CC           1340.99    | 1488.02    | 1223.86 | 2401.26
UPC Private  1383.57    | 433.45     | 1252.47 | 2352.71
UPC Local    47.2       | 90.67      | 1202.8  | 2398.9
UPC Remote   0.09       | 0.20       | 1197.22 | 2360.59
Address Translation and Local Space Privatization - DSM Architecture

STREAM benchmark (MB/s); Memory copy, Block Get, and Block Put are bulk operations, the rest are element-by-element:

                                     Memory copy | Block Get | Block Put | Array Set | Array Copy | Sum | Scale
GCC                                  127         | N/A       | N/A       | 175       | 106        | 223 | 108
UPC Private                          127         | N/A       | N/A       | 173       | 106        | 215 | 107
UPC Local Shared                     139         | 140       | 136       | 26        | 14         | 31  | 13
UPC Remote Shared (within SMP node)  130         | 129       | 136       | 26        | 13         | 30  | 13
UPC Remote Shared (beyond SMP node)  112         | 117       | 136       | 24        | 12         | 28  | 12
Aggregation and Overlapping of Remote Shared Memory Accesses

[Charts: UPC N-Queens execution time (1-16 threads) and UPC Sobel Edge execution time (1-32 processes, log scale), each comparing UPC NO OPT vs. UPC FULL OPT, on SGI O2000]

The benefit of hand-optimization is highly application-dependent:
N-Queens performs no better, mainly because it is an embarrassingly parallel program
The Sobel edge detector gains an order of magnitude from hand-optimization and scales almost perfectly linearly
Impact of Hand-Optimizations on NPB.CG

[Chart: NPB CG (Class A) computation time vs. number of processors (1-32) on SGI Origin 2000, comparing UPC.O0, UPC.O1, UPC.O3, and GCC]
Shared Address Translation Overhead

Address translation overhead is quite significant: more than 70% of the work for a local-shared memory access, demonstrating the real need for optimization.

[Figure: a private memory access costs only the actual access (Z); a local-shared memory access adds address-calculation overhead (Y) and UPC put/get function-call overhead (X). The accompanying bar chart labels segments of 144 ns, 247 ns, and 123 ns (legend order: memory access time, address calculation, address function call) for one local-shared access.]

Quantification of the address translation overheads present in local-shared memory accesses (SGI Origin 2000, GCC-UPC).
Shared Address Translation Overheads for Sobel Edge Detection

[Chart: execution time (sec) for UPC.O0 vs. UPC.O3 at 1, 2, 4, 8, and 16 processors, broken into processing + memory access, address function call, and address calculation]

UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notation from T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001.
Reducing Address Translation Overheads via Translation Look-Aside Buffers

F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005

Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
Two alternative methods are proposed to create and use MMTBs:
FT: a basic method using direct addressing
RT: an advanced method using indexed addressing
Prototyped as a compiler-enabled optimization; no modifications to actual UPC codes are needed
[Figure: shared int array[8] distributed round-robin across 4 threads, so array[0] and array[4] have affinity to TH0, array[1] and array[5] to TH1, array[2] and array[6] to TH2, array[3] and array[7] to TH3. Each thread stores its own copy of the FT look-up table, mapping every index [0..7] to the virtual address of that element: FT[0] = 0x57FF8040, FT[1] = 0x5FFF8040, FT[2] = 0x67FF8040, FT[3] = 0x6FFF8040, FT[4] = 0x57FF8048, FT[5] = 0x5FFF8048, FT[6] = 0x67FF8048, FT[7] = 0x6FFF8048.]
Different Strategies – Full-Table (FT)

Pros: direct mapping, no address calculation
Cons: large memory requirement; can lead to competition over caches and main memory

Consider shared [B] int array[8];
To initialize FT:  for all i in [0,7]: FT[i] = _get_vaddr(&array[i])
To access array[]: for all i in [0,7]: array[i] = _get_value_at(FT[i])
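Rendered as C (a sketch: _get_vaddr and _get_value_at are the slide's notation for assumed runtime helpers that resolve, and dereference, a translated address):

    #define B 2                     /* any blocksize */
    shared [B] int array[8];

    void *FT[8];                    /* private full table, one copy per thread */

    /* assumed runtime helpers, per the slide's notation */
    extern void *_get_vaddr(shared void *p);
    extern int   _get_value_at(void *vaddr);

    void ft_init(void)              /* one-time translation of every element */
    {
        for (int i = 0; i < 8; i++)
            FT[i] = _get_vaddr(&array[i]);
    }

    int ft_get(int i)               /* direct mapping: no address calculation */
    {
        return _get_value_at(FT[i]);
    }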
Different Strategies – Reduced-Table (RT): Infinite Blocksize

RT strategy:
Only one table entry in this case
The address calculation step is simple in this case

[Figure: with BLOCKSIZE = infinite, array[0..3] all reside on THREAD0; each thread's RT holds the single entry RT[0]. Only the address of the first element needs to be saved, since all array data is contiguous.]

Consider shared [] int array[4];
To initialize RT:  RT[0] = _get_vaddr(&array[0])
To access array[]: for all i in [0,3]: array[i] = _get_value_at(RT[0] + i)
Different Strategies – Reduced-Table (RT): Default Blocksize

RT strategy:
Less memory required than FT; the MMTB buffer has THREADS entries
The address calculation step is somewhat costly, but much cheaper than in current implementations

[Figure: with BLOCKSIZE = 1, shared [1] int array[16] is laid out round-robin: array[0], array[4], array[8], array[12] on THREAD0; array[1], array[5], array[9], array[13] on THREAD1; and so on. Each thread's RT holds one entry per thread, RT[0..3], the address of the first element on that thread; the rest of each thread's data is contiguous.]

Consider shared [1] int array[16];
To initialize RT:  for all i in [0,THREADS-1]: RT[i] = _get_vaddr(&array[i])
To access array[]: for all i in [0,15]: array[i] = _get_value_at(RT[i mod THREADS] + i/THREADS)
Different Strategies – Reduced-Table (RT): Arbitrary Blocksize

RT strategy:
Less memory required than for FT, but more than in the previous cases
The address calculation step is more costly than in the previous cases

[Figure: with blocksize 2, shared [2] int array[16] is laid out block by block: array[0..1] and array[8..9] on THREAD0, array[2..3] and array[10..11] on THREAD1, array[4..5] and array[12..13] on THREAD2, array[6..7] and array[14..15] on THREAD3. Each thread's RT holds one entry per block, RT[0..7], the address of the first element of that block; data within a block is contiguous.]

Consider shared [2] int array[16];
To initialize RT:  for all i in [0,7]: RT[i] = _get_vaddr(&array[i*blocksize(array)])
To access array[]: for all i in [0,15]: array[i] = _get_value_at(RT[i / blocksize(array)] + i mod blocksize(array))
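The same sketch for the arbitrary-blocksize RT (again using the slide's assumed helpers; offsets are in elements, as on the slide):

    #define BS 2                     /* blocksize */
    shared [BS] int array[16];

    void *RT[16/BS];                 /* one entry per block, per thread */

    extern void *_get_vaddr(shared void *p);
    extern int   _get_value_at(void *vaddr);

    void rt_init(void)
    {
        for (int i = 0; i < 16/BS; i++)
            RT[i] = _get_vaddr(&array[i * BS]);  /* first element of block i */
    }

    int rt_get(int i)
    {
        /* one table lookup plus a divide and a modulo: the "up to 3"
           arithmetic operations in the later comparison table */
        int *base = (int *)RT[i / BS];
        return _get_value_at(base + (i % BS));
    }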
Performance Impact of the MMTB – Sobel Edge

FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
The RT strategy is slower than FT because the address calculation (arbitrary-blocksize case) becomes more complex
FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)

[Charts: Sobel Edge (N=2048) execution time vs. 1-16 threads, comparing O0, O0.FT, O0.RT, O3, and MPI, shown both with and without O0 to expose the scale. Performance of Sobel edge detection using the new MMTB strategies.]
Performance Impact of the MMTB – Matrix Multiplication

FT strategy: an increase in L1 data cache misses due to the large table size
RT strategy: L1 misses stay low, but an increase in the number of loads and stores is observed, reflecting the extra computation (arbitrary blocksize used)

[Charts: matrix multiplication (N=256) execution time vs. 1-16 threads for UPC.O0, UPC.O0.FT, UPC.O0.RT, UPC.O3, and MPI; plus per-configuration hardware profiling at 1, 2, 4, 8, and 16 threads: computation time, L1 and L2 data cache misses, TLB misses, graduated loads and stores, and decoded branches. Performance and hardware profiling of matrix multiplication using the new MMTB strategies.]
Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel

The number of loads and stores can increase with arithmetic operators.

Comparison among the optimizations of storage, memory access, and computation requirements, for a shared array of N elements with blocksize B (E: element size in bytes, P: pointer size in bytes):

            Storage per shared array | # memory accesses per shared access | # arithmetic operations per shared access
UPC.O0      E*N                      | more than 25                        | more than 5
UPC.O0.FT   THREADS*P*N + E*N        | 1                                   | 0
UPC.O0.RT   THREADS*P*(N/B) + E*N    | 1                                   | up to 3
UPC Work-Sharing Construct Optimizations

By thread/index number (upc_forall integer):

    upc_forall(i=0; i<N; i++; i)
        loop body;

By the address of a shared variable (upc_forall address):

    upc_forall(i=0; i<N; i++; &shared_var[i])
        loop body;

By thread/index number (for optimized):

    for(i=MYTHREAD; i<N; i+=THREADS)
        loop body;

By thread/index number (for integer):

    for(i=0; i<N; i++) {
        if(MYTHREAD == i%THREADS)
            loop body;
    }

By the address of a shared variable (for address):

    for(i=0; i<N; i++) {
        if(upc_threadof(&shared_var[i]) == MYTHREAD)
            loop body;
    }
Performance of Equivalent upc_forall and for Loops

[Chart: time (sec) vs. 1-16 processors for the five variants: upc_forall address, upc_forall integer, for address, for integer, and for optimized]
Performance Limitations Imposed by Sequential C Compilers -- STREAM

(memcpy, memset, and Struct cp are bulk operations; the rest are element-by-element.)

NUMA (MB/s):
    memcpy | memset | Struct cp | Copy (arr) | Copy (ptr) | Set    | Sum    | Scale | Add   | Triad
F   291.21 | 163.90 | N/A       | 291.59     | N/A        | 159.68 | 135.37 | 246.3 | 235.1 | 303.82
C   231.20 | 214.62 | 158.86    | 120.57     | 152.77     | 147.70 | 298.38 | 133.4 | 13.86 | 20.71

Vector (MB/s):
    memcpy | memset | Struct cp | Copy (arr) | Copy (ptr) | Set    | Sum    | Scale | Add   | Triad
F   14423  | 11051  | N/A       | 14407      | N/A        | 11015  | 17837  | 14423 | 10715 | 16053
C   18850  | 5307   | 7882      | 7972       | 7969       | 10576  | 18260  | 7865  | 3874  | 5824
Loopmark – SET/ADD Operations

(The Vector STREAM table from the previous slide is repeated here for reference.)

Let us compare the loopmarks for each Fortran / C operation.
Loopmark – SET/ADD Operations

Fortran:

    MEMSET (bulk set)
    146. 1             t = mysecond(tflag)
    147. 1  V M--<><>  a(1:n) = 1.0d0
    148. 1             t = mysecond(tflag) - t
    149. 1             times(2,k) = t

    SET
    158. 1             arrsum = 2.0d0
    159. 1             t = mysecond(tflag)
    160. 1  MV------<  DO i = 1,n
    161. 1  MV           c(i) = arrsum
    162. 1  MV           arrsum = arrsum + 1
    163. 1  MV------>  END DO
    164. 1             t = mysecond(tflag) - t
    165. 1             times(4,k) = t

    ADD
    180. 1             t = mysecond(tflag)
    181. 1  V M--<><>  c(1:n) = a(1:n) + b(1:n)
    182. 1             t = mysecond(tflag) - t
    183. 1             times(7,k) = t

C:

    MEMSET (bulk set)
    163. 1             times[1][k] = mysecond_();
    164. 1             memset(a, 1, NDIM*sizeof(elem_t));
    165. 1             times[1][k] = mysecond_() - times[1][k];

    SET
    217. 1             set = 2;
    220. 1             times[5][k] = mysecond_();
    222. 1  MV--<      for (i=0; i<NDIM; i++)
    223. 1  MV         {
    224. 1  MV           c[i] = (set++);
    225. 1  MV-->      }
    227. 1             times[5][k] = mysecond_() - times[5][k];

    ADD
    283. 1             times[10][k] = mysecond_();
    285. 1  Vp--<      for (j=0; j<NDIM; j++)
    286. 1  Vp         {
    287. 1  Vp           c[j] = a[j] + b[j];
    288. 1  Vp-->      }
    290. 1             times[10][k] = mysecond_() - times[10][k];

Legend: V: vectorized - M: multistreamed - p: conditional, partial and/or computed
UPC vs CAF Using the NPB Workloads

In general, UPC is slower than CAF, mainly due to:
Point-to-point vs. barrier synchronization: better scalability requires proper collective operations; program writers can already do point-to-point synchronization with current constructs
Scalar performance of source-to-source translated code:
Alias analysis (C pointers): highlights the need to use restrict explicitly to help several compiler backends (see the sketch below)
Lack of support for multi-dimensional arrays in C: can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
Need for exhaustive C compiler analysis: a failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than for CAF, and a failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
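For the alias-analysis point, the fix is mechanical once data has been privatized (a sketch; names are illustrative):

    /* restrict promises the backend that x and y never alias, enabling
       software pipelining of the inner loop without per-iteration checks */
    void axpy(int n, double * restrict x, const double * restrict y, double a)
    {
        for (int i = 0; i < n; i++)
            x[i] += a * y[i];
    }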
Conclusions

UPC is a locality-aware parallel programming language
With proper optimizations, UPC can outperform MPI on random short accesses and can otherwise perform as well as MPI
UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
UPC compiler optimizations are still lagging, in spite of the substantial progress that has been made
For future architectures, UPC has a unique opportunity for very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
Conclusions

In general, four types of optimizations:
Optimizations that exploit the locality consciousness and other unique features of UPC
Optimizations that keep the overhead of UPC low
Optimizations that exploit architectural features
Standard optimizations applicable to all systems' compilers
Conclusions

Optimizations are possible at three levels:
A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations
C backend compilers that compete with Fortran
A strong run-time system that can work effectively with the operating system
Selected Publications

T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming, John Wiley & Sons, New York, June 2005. ISBN 0-471-22048-5.
T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study", Journal of Future Generation Computer Systems, North-Holland (accepted).
Selected Publications

T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372.
T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study", Supercomputing 2002 (SC2002), Baltimore, November 2002.
F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005.
Additional papers in CUG and PPoPP.