Towards Optimized UPC Implementations
Tarek A. El-Ghazawi, The George Washington University
Agenda
Background
UPC Language Overview
Productivity
Performance Issues
Automatic Optimizations
Conclusions
Parallel Programming Models

What is a programming model?
An abstract machine that outlines the programmer's view of data and execution
Where architecture and applications meet
A non-binding contract between the programmer and the compiler/system

Good programming models should:
Allow efficient mapping on different architectures
Keep programming easy

Benefits:
Applications gain independence from the architecture
Architectures gain independence from the applications
Programming Models

Message Passing (e.g. MPI) · Shared Memory (e.g. OpenMP) · DSM/PGAS (e.g. UPC)
[Diagram: how processes/threads map onto the address space(s) in each of the three models]
Programming Paradigms Expressivity

                        LOCALITY
PARALLELISM   Implicit                          Explicit
Implicit      Sequential (e.g. C,               Data Parallel (e.g. HPF, C*)
              Fortran, Java)
Explicit      Shared Memory (e.g. OpenMP)       Distributed Shared Memory/PGAS
                                                (e.g. UPC, CAF, and Titanium)
What is UPC?

Unified Parallel C:
An explicit parallel extension of ISO C
A distributed shared memory/PGAS parallel programming language
Why not message passing?

Performance:
High penalty for short transactions
Cost of calls
Two-sided communication
Excessive buffering

Ease of use:
Explicit data transfers
Domain decomposition does not maintain the original global application view
More code and conceptual difficulty
Why DSM/PGAS?

Performance:
No call overhead
Efficient short transfers
Locality exploitation

Ease of use:
Implicit transfers
Consistent global application view
Less code and conceptual difficulty
Why DSM/PGAS: New Opportunities for Compiler Optimizations

[Figure: an image partitioned into row bands across Thread 0 through Thread 3, with ghost zones at the band boundaries for the Sobel operator]

The DSM programming model exposes sequential remote accesses at compile time, an opportunity for compiler-directed prefetching (sketched below).
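As a concrete illustration (a sketch, not from the slides: the image name, sizes, and a static THREADS environment are assumed), the boundary-row reads of the Sobel operator can be turned into one bulk prefetch per neighbor:

    #include <upc_relaxed.h>

    #define ROWS 512
    #define COLS 512
    /* one contiguous band of ROWS/THREADS rows per thread
       (assumes ROWS is a multiple of THREADS, static THREADS env) */
    shared [ROWS*COLS/THREADS] unsigned char img[ROWS][COLS];

    unsigned char ghost_above[COLS];  /* private copy of the ghost row */

    void prefetch_ghost(void)
    {
        int first = MYTHREAD * (ROWS / THREADS);  /* first row of my band */
        if (first > 0)
            /* one bulk transfer of the neighbor's boundary row instead of
               COLS element-by-element remote reads */
            upc_memget(ghost_above, &img[first - 1][0], COLS);
    }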
History

Initial technical report from IDA, in collaboration with LLNL and UCB, in May 1999
UPC consortium of government, academia, and HPC vendors, coordinated by GWU, IDA, and DoD
Current participants: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …
Status

Specification v1.0 completed in February 2001, v1.1.1 in October 2003; v1.2 will add collectives and UPC-IO
Benchmarking suites: STREAM, GUPS, RandomAccess, NPB suite, Splash-2, and others
Testing suite v1.0, v1.1
Short courses and tutorials in the US and abroad
Research exhibits at SC 2000-2004
UPC web site: upc.gwu.edu
UPC book by mid-2005 from John Wiley and Sons
Manual(s)
Hardware Platforms

UPC implementations are available for:
SGI O2000/3000: Intrepid (32- and 64-bit GCC), UCB (32-bit GCC)
Cray T3D/E
Cray X-1
HP AlphaServer SC, Superdome
UPC Berkeley Compiler: Myrinet, Quadrics, and InfiniBand clusters
Beowulf reference implementation (MPI-based, MTU)
New ongoing efforts by IBM and Sun
UPC Execution Model

A number of threads working independently in a SPMD fashion
MYTHREAD specifies the thread index (0..THREADS-1)
The number of threads is specified at compile time or run time
Process and data synchronization when needed:
Barriers and split-phase barriers
Locks and arrays of locks
Fence
Memory consistency control
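A minimal SPMD sketch (illustrative, not from the slides):

    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* All THREADS threads execute main() independently (SPMD). */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;   /* no thread proceeds until all have arrived */

        if (MYTHREAD == 0)
            printf("all threads passed the barrier\n");
        return 0;
    }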
UPC Memory Model

Shared space with thread affinity, plus private spaces
A pointer-to-shared can reference all locations in the shared space
A private pointer may reference only addresses in its private space or in its own portion of the shared space
Static and dynamic memory allocation is supported for both shared and private memory
[Figure: one shared space spanning Thread 0 .. Thread THREADS-1, with each thread owning the partition that has affinity to it, plus per-thread private spaces Private 0 .. Private THREADS-1]
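A short sketch of the model (illustrative declarations; variable names are assumed):

    #include <upc_relaxed.h>

    shared int flag;                /* one shared scalar, affinity to thread 0  */
    shared [2] int data[2*THREADS]; /* blocked: data[0..1] on thread 0,
                                       data[2..3] on thread 1, and so on        */

    int main(void)
    {
        int mine = -1;              /* private: each thread holds its own copy  */

        data[2*MYTHREAD] = MYTHREAD;  /* write to an element with local affinity */
        if (MYTHREAD == 0)
            flag = 1;                 /* thread 0 updates the shared scalar      */
        upc_barrier;
        mine = data[0];               /* any thread may read any shared location */
        (void)mine;
        return 0;
    }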
UPC Pointers

How to declare them?

    int *p1;                /* private pointer pointing locally */
    shared int *p2;         /* private pointer pointing into the shared space */
    int *shared p3;         /* shared pointer pointing locally */
    shared int *shared p4;  /* shared pointer pointing into the shared space */

You may find many using "shared pointer" to mean a pointer pointing to a shared object, i.e. equivalent to p2, but it could be p4 as well.
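Since a p2-style pointer with local affinity can be cast to an ordinary C pointer, local shared data can be accessed without translation overhead; this is the "privatization" idea revisited later in the talk (a sketch; the array name is illustrative):

    #include <upc_relaxed.h>

    #define B 4
    shared [B] int a[B*THREADS];   /* one block of B elements per thread */

    void fill_local_block(void)
    {
        /* a[B*MYTHREAD] has affinity to this thread, so casting its
           address to a private int* is legal UPC and skips the
           shared-address translation on every subsequent access. */
        int *p = (int *)&a[B*MYTHREAD];
        for (int i = 0; i < B; i++)
            p[i] = i;
    }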
UPC Pointers

[Figure: P1 and P2 live in each thread's private space, P1 pointing into that private space and P2 into the shared space; P3 and P4 live in the shared space, P3 pointing locally and P4 into the shared space]
Synchronization - Barriers

No implicit synchronization among the threads
UPC provides the following synchronization mechanisms (a sketch follows):
Barriers
Locks
Memory consistency control
Fence
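A sketch combining these mechanisms (illustrative; the accumulator is hypothetical):

    #include <upc_relaxed.h>

    shared int total;              /* zero-initialized, affinity to thread 0 */
    upc_lock_t *lk;                /* private pointer to one shared lock     */

    int main(void)
    {
        lk = upc_all_lock_alloc(); /* collective: all threads get the same lock */

        upc_lock(lk);              /* mutual exclusion around the update        */
        total += MYTHREAD;
        upc_unlock(lk);

        upc_barrier;               /* total is complete past this point         */

        upc_notify;                /* split-phase barrier: signal arrival ...   */
        /* ... independent local work can overlap here ... */
        upc_wait;                  /* ... then wait for everyone else           */
        return 0;
    }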
Memory Consistency Models

Has to do with the ordering of shared operations, and with when a change to a shared object by one thread becomes visible to others
Consistency can be strict or relaxed
Under the relaxed consistency model, shared operations can be reordered by the compiler/runtime system
The strict consistency model enforces sequential ordering of shared operations (no operation on shared data can begin before the previous ones are done, and changes become visible immediately)
Memory Consistency Models

The user specifies the memory model through:
Declarations
Pragmas, for a particular statement or sequence of statements
Use of barriers and global operations
The programmer is responsible for using the correct consistency model; a sketch follows.
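A sketch of these mechanisms (illustrative producer/consumer; variable names are assumed):

    #include <upc_relaxed.h>   /* file-level default: relaxed consistency  */

    shared int data;           /* ordinary (relaxed) shared variable       */
    strict shared int flag;    /* declaration: accesses to flag are strict */

    int main(void)
    {
        if (MYTHREAD == 0) {
            data = 99;         /* relaxed write ...                        */
            flag = 1;          /* ... but the strict write to flag cannot
                                  be reordered before it                   */
        } else if (MYTHREAD == 1) {
            while (flag == 0)  /* strict reads: poll until thread 0 signals */
                ;
            /* data == 99 is guaranteed to be visible here */
        }

        upc_barrier;
        if (MYTHREAD == 0) {
    #pragma upc strict         /* pragma: strict consistency for this block */
            data = data + 1;
        }
        return 0;
    }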
UPC and Productivity

Metrics:
Lines of 'useful' code: indicates development time as well as maintenance cost
Number of 'useful' characters: an alternative way to measure development and maintenance effort
Conceptual complexity: function level, keyword usage, number of tokens, maximum loop depth, …
Manual Effort – NPB Example

              SEQ   | UPC   | SEQ   | MPI   | UPC Effort (%) | MPI Effort (%)
NPB-CG #line  665   | 710   | 506   | 1046  | 6.77           | 106.72
       #char  16145 | 17200 | 16485 | 37501 | 6.53           | 127.49
NPB-EP #line  127   | 183   | 130   | 181   | 44.09          | 36.23
       #char  2868  | 4117  | 4741  | 6567  | 43.55          | 38.52
NPB-FT #line  575   | 1018  | 665   | 1278  | 77.04          | 92.18
       #char  13090 | 21672 | 22188 | 44348 | 65.56          | 99.87
NPB-IS #line  353   | 528   | 353   | 627   | 49.58          | 77.62
       #char  7273  | 13114 | 7273  | 13324 | 80.31          | 83.20
NPB-MG #line  610   | 866   | 885   | 1613  | 41.97          | 82.26
       #char  14830 | 21990 | 27129 | 50497 | 48.28          | 86.14

UPC effort (%) = (#UPC - #SEQ) / #SEQ x 100;  MPI effort (%) = (#MPI - #SEQ) / #SEQ x 100
Manual Effort – More Examples

                 SEQ  | MPI  | SEQ  | UPC  | MPI Effort (%) | UPC Effort (%)
GUPS      #line  41   | 98   | 41   | 47   | 139.02         | 14.63
          #char  1063 | 2979 | 1063 | 1251 | 180.02         | 17.68
Histogram #line  12   | 30   | 12   | 20   | 150.00         | 66.67
          #char  188  | 705  | 188  | 376  | 275.00         | 100.00
N-Queens  #line  86   | 166  | 86   | 139  | 93.02          | 61.63
          #char  1555 | 3332 | 1555 | 2516 | 124.28         | 61.80

(Effort computed as on the previous slide.)
Conceptual Complexity - HIST

HISTOGRAM - UPC (overall score: 22)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           5           | 4           | 0     | 3                 | 0         | 12
#Function calls                       0           | 0           | 0     | 4                 | 0         | 4
#References to THREADS and MYTHREAD   2           | 1           | 0     | 0                 | 0         | 3
#UPC constructs & UPC types           0           | 2           | 0     | 1                 | 0         | 3
Notes: work distribution: 2 if, 1 for; data distribution: 2 shared declarations; synchronization & consistency: 1 lock declaration, 1 lock/unlock, 2 barriers

HISTOGRAM - MPI (overall score: 47)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           5           | 0           | 15    | 0                 | 6         | 26
#Function calls                       0           | 0           | 2     | 2                 | 4         | 8
#References to myrank and nprocs      3           | 0           | 2     | 0                 | 2         | 5
#MPI types                            0           | 0           | 6     | 0                 | 2         | 8
Notes: work distribution: 2 if, 1 for; communication: 1 Scatter, 1 Reduce; synchronization & consistency: implicit with collectives; misc: 1 Init/Finalize, 2 Comm
Conceptual Complexity - GUPS

GUPS - UPC (overall score: 43)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           21          | 6           | 0     | 0                 | 0         | 27
#Function calls                       0           | 4           | 0     | 2                 | 0         | 6
#References to THREADS and MYTHREAD   3           | 4           | 0     | 0                 | 0         | 7
#UPC constructs & UPC types           3           | 0           | 0     | 0                 | 0         | 3
Notes: work distribution: 3 forall, 2 for, 3 if; data distribution: 5 shared, 2 all_alloc, 2 free; synchronization & consistency: 2 barriers

GUPS - MPI (overall score: 136)
                                      Work Distr. | Data Distr. | Comm. | Synch. & Consist. | Misc. Ops | Sum
#Parameters                           18          | 17          | 38    | 1                 | 6         | 80
#Function calls                       0           | 7           | 6     | 3                 | 6         | 22
#References to myrank and nprocs      3           | 5           | 13    | 1                 | 4         | 26
#MPI types                            0           | 6           | 2     | 0                 | 0         | 8
Notes: work distribution: 5 for, 3 if; data distribution: 2 mem alloc, 2 mem free, 3 window; communication: 2 one-sided, 4 collective; synchronization & consistency: implicit with collectives and WinFence, 1 barrier; misc: Init, Finalize, comm_rank, comm_size, 2 Wtime (6 error handles)
UPC Optimization Issues

Particular challenges:
Cost of address translation
Avoiding address translation
Special opportunities:
Locality-driven, compiler-directed prefetching
Aggregation
General:
Low-level optimized libraries, e.g. collectives
Backend optimizations
Overlapping remote accesses and synchronization with other work
Showing Potential Optimizations Through Emulated Hand-Tunings

Different hand-tuning levels (see the sketch below):
Unoptimized UPC code, referred to as UPC.O0
Privatized UPC code, referred to as UPC.O1
Prefetched UPC code: a hand-optimized variant using block get/put to mimic the effect of prefetching, referred to as UPC.O2
Fully hand-tuned UPC code: a hand-optimized variant integrating privatization, aggregation of remote accesses, as well as prefetching, referred to as UPC.O3

T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372
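A sketch of what these levels look like for a simple shared-array update (illustrative; array and function names are assumed, along with a static THREADS environment):

    #include <upc_relaxed.h>

    #define N 1024                    /* assume N is a multiple of THREADS */
    #define BLK (N/THREADS)
    shared [BLK] double a[N];         /* one contiguous block per thread   */

    /* UPC.O0: every access pays the shared-address translation cost */
    void scale_O0(double s)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] *= s;
    }

    /* UPC.O1 (privatization): local block accessed via a private pointer */
    void scale_O1(double s)
    {
        double *p = (double *)&a[MYTHREAD * BLK];
        for (int i = 0; i < BLK; i++)
            p[i] *= s;
    }

    /* UPC.O2 (prefetching): mimic prefetch with bulk block get/put,
       e.g. scale_remote_O2((shared [] double *)&a[tgt*BLK], s); */
    void scale_remote_O2(shared [] double *src, double s)
    {
        double buf[BLK];
        upc_memget(buf, src, sizeof buf);   /* one bulk get */
        for (int i = 0; i < BLK; i++)
            buf[i] *= s;
        upc_memput(src, buf, sizeof buf);   /* one bulk put */
    }

UPC.O3 combines these with aggregation of remote accesses.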
Address Translation Cost and Local Space Privatization - Cluster

STREAM benchmark (results gathered on a Myrinet cluster):

MB/s         Put     | Get     | Scale   | Sum
CC           N/A     | N/A     | 1565.04 | 5409.3
UPC Private  N/A     | N/A     | 1687.63 | 1776.81
UPC Local    1196.51 | 1082.89 | 54.22   | 82.7
UPC Remote   241.43  | 237.51  | 0.09    | 0.16

MB/s         Copy (arr) | Copy (ptr) | Memcpy  | Memset
CC           1340.99    | 1488.02    | 1223.86 | 2401.26
UPC Private  1383.57    | 433.45     | 1252.47 | 2352.71
UPC Local    47.2       | 90.67      | 1202.8  | 2398.9
UPC Remote   0.09       | 0.20       | 1197.22 | 2360.59
Address Translation and Local Space Privatization - DSM Architecture

STREAM benchmark (MB/s); Memory copy, Block Get, and Block Put are bulk operations, the rest are element-by-element:

                                     Memory copy | Block Get | Block Put | Array Set | Array Copy | Sum | Scale
GCC                                  127         | N/A       | N/A       | 175       | 106        | 223 | 108
UPC Private                          127         | N/A       | N/A       | 173       | 106        | 215 | 107
UPC Local Shared                     139         | 140       | 136       | 26        | 14         | 31  | 13
UPC Remote Shared (within SMP node)  130         | 129       | 136       | 26        | 13         | 30  | 13
UPC Remote Shared (beyond SMP node)  112         | 117       | 136       | 24        | 12         | 28  | 12
Aggregation and Overlapping of Remote Shared Memory Accesses

[Charts: UPC N-Queens execution time (1-16 threads) and UPC Sobel Edge execution time (1-32 processes, log scale), each comparing UPC NO OPT vs. UPC FULL OPT, on SGI O2000]

The benefit of hand-optimization is highly application-dependent:
N-Queens performs no better, mainly because it is an embarrassingly parallel program
The Sobel edge detector gains an order of magnitude from hand-optimization and scales almost perfectly linearly
Impact of Hand-Optimizations on NPB.CG

[Chart: NPB CG (Class A) computation time vs. number of processors (1-32) on SGI Origin 2000, comparing UPC.O0, UPC.O1, UPC.O3, and GCC]
Shared Address Translation Overhead

Address translation overhead is quite significant: more than 70% of the work for a local-shared memory access, demonstrating the real need for optimization.

[Figure: a private memory access costs only the actual access (Z); a local-shared memory access adds address-calculation overhead (Y) and UPC put/get function-call overhead (X). The accompanying bar chart labels segments of 144 ns, 247 ns, and 123 ns (legend order: memory access time, address calculation, address function call) for one local-shared access.]

Quantification of the address translation overheads present in local-shared memory accesses (SGI Origin 2000, GCC-UPC).
Shared Address Translation Overheads for Sobel Edge Detection

[Chart: execution time (sec) for UPC.O0 vs. UPC.O3 at 1, 2, 4, 8, and 16 processors, broken into processing + memory access, address function call, and address calculation]

UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notation from T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001.
Reducing Address Translation Overheads via Translation Look-Aside Buffers

F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005

Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
Two alternative methods are proposed to create and use MMTBs:
FT: a basic method using direct addressing
RT: an advanced method using indexed addressing
Prototyped as a compiler-enabled optimization; no modifications to actual UPC codes are needed
[Figure: shared int array[8] distributed round-robin across 4 threads, so array[0] and array[4] have affinity to TH0, array[1] and array[5] to TH1, array[2] and array[6] to TH2, array[3] and array[7] to TH3. Each thread stores its own copy of the FT look-up table, mapping every index [0..7] to the virtual address of that element: FT[0] = 0x57FF8040, FT[1] = 0x5FFF8040, FT[2] = 0x67FF8040, FT[3] = 0x6FFF8040, FT[4] = 0x57FF8048, FT[5] = 0x5FFF8048, FT[6] = 0x67FF8048, FT[7] = 0x6FFF8048.]
Different Strategies – Full-Table (FT)

Pros: direct mapping, no address calculation
Cons: large memory requirement; can lead to competition over caches and main memory

Consider shared [B] int array[8];
To initialize FT:  for all i in [0,7]: FT[i] = _get_vaddr(&array[i])
To access array[]: for all i in [0,7]: array[i] = _get_value_at(FT[i])
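Rendered as C (a sketch: _get_vaddr and _get_value_at are the slide's notation for assumed runtime helpers that resolve, and dereference, a translated address):

    #define B 2                     /* any blocksize */
    shared [B] int array[8];

    void *FT[8];                    /* private full table, one copy per thread */

    /* assumed runtime helpers, per the slide's notation */
    extern void *_get_vaddr(shared void *p);
    extern int   _get_value_at(void *vaddr);

    void ft_init(void)              /* one-time translation of every element */
    {
        for (int i = 0; i < 8; i++)
            FT[i] = _get_vaddr(&array[i]);
    }

    int ft_get(int i)               /* direct mapping: no address calculation */
    {
        return _get_value_at(FT[i]);
    }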
Different Strategies – Reduced-Table (RT): Infinite Blocksize

RT strategy:
Only one table entry in this case
The address calculation step is simple in this case

[Figure: with BLOCKSIZE = infinite, array[0..3] all reside on THREAD0; each thread's RT holds the single entry RT[0]. Only the address of the first element needs to be saved, since all array data is contiguous.]

Consider shared [] int array[4];
To initialize RT:  RT[0] = _get_vaddr(&array[0])
To access array[]: for all i in [0,3]: array[i] = _get_value_at(RT[0] + i)
Different Strategies – Reduced-Table (RT): Default Blocksize

RT strategy:
Less memory required than FT; the MMTB buffer has THREADS entries
The address calculation step is somewhat costly, but much cheaper than in current implementations

[Figure: with BLOCKSIZE = 1, shared [1] int array[16] is laid out round-robin: array[0], array[4], array[8], array[12] on THREAD0; array[1], array[5], array[9], array[13] on THREAD1; and so on. Each thread's RT holds one entry per thread, RT[0..3], the address of the first element on that thread; the rest of each thread's data is contiguous.]

Consider shared [1] int array[16];
To initialize RT:  for all i in [0,THREADS-1]: RT[i] = _get_vaddr(&array[i])
To access array[]: for all i in [0,15]: array[i] = _get_value_at(RT[i mod THREADS] + i/THREADS)
Different Strategies – Reduced-Table (RT): Arbitrary Blocksize

RT strategy:
Less memory required than for FT, but more than in the previous cases
The address calculation step is more costly than in the previous cases

[Figure: with blocksize 2, shared [2] int array[16] is laid out block by block: array[0..1] and array[8..9] on THREAD0, array[2..3] and array[10..11] on THREAD1, array[4..5] and array[12..13] on THREAD2, array[6..7] and array[14..15] on THREAD3. Each thread's RT holds one entry per block, RT[0..7], the address of the first element of that block; data within a block is contiguous.]

Consider shared [2] int array[16];
To initialize RT:  for all i in [0,7]: RT[i] = _get_vaddr(&array[i*blocksize(array)])
To access array[]: for all i in [0,15]: array[i] = _get_value_at(RT[i / blocksize(array)] + i mod blocksize(array))
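The same sketch for the arbitrary-blocksize RT (again using the slide's assumed helpers; offsets are in elements, as on the slide):

    #define BS 2                     /* blocksize */
    shared [BS] int array[16];

    void *RT[16/BS];                 /* one entry per block, per thread */

    extern void *_get_vaddr(shared void *p);
    extern int   _get_value_at(void *vaddr);

    void rt_init(void)
    {
        for (int i = 0; i < 16/BS; i++)
            RT[i] = _get_vaddr(&array[i * BS]);  /* first element of block i */
    }

    int rt_get(int i)
    {
        /* one table lookup plus a divide and a modulo: the "up to 3"
           arithmetic operations in the later comparison table */
        int *base = (int *)RT[i / BS];
        return _get_value_at(base + (i % BS));
    }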
Performance Impact of the MMTB – Sobel Edge

FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
The RT strategy is slower than FT because the address calculation (arbitrary-blocksize case) becomes more complex
FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)

[Charts: Sobel Edge (N=2048) execution time vs. 1-16 threads, comparing O0, O0.FT, O0.RT, O3, and MPI, shown both with and without O0 to expose the scale. Performance of Sobel edge detection using the new MMTB strategies.]
Performance Impact of the MMTB – Matrix Multiplication

FT strategy: an increase in L1 data cache misses due to the large table size
RT strategy: L1 misses stay low, but an increase in the number of loads and stores is observed, reflecting the extra computation (arbitrary blocksize used)

[Charts: matrix multiplication (N=256) execution time vs. 1-16 threads for UPC.O0, UPC.O0.FT, UPC.O0.RT, UPC.O3, and MPI; plus per-configuration hardware profiling at 1, 2, 4, 8, and 16 threads: computation time, L1 and L2 data cache misses, TLB misses, graduated loads and stores, and decoded branches. Performance and hardware profiling of matrix multiplication using the new MMTB strategies.]
Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel

The number of loads and stores can increase with arithmetic operators.

Comparison among the optimizations of storage, memory access, and computation requirements, for a shared array of N elements with blocksize B (E: element size in bytes, P: pointer size in bytes):

            Storage per shared array | # memory accesses per shared access | # arithmetic operations per shared access
UPC.O0      E*N                      | more than 25                        | more than 5
UPC.O0.FT   THREADS*P*N + E*N        | 1                                   | 0
UPC.O0.RT   THREADS*P*(N/B) + E*N    | 1                                   | up to 3
UPC Work-Sharing Construct Optimizations

By thread/index number (upc_forall integer):

    upc_forall(i=0; i<N; i++; i)
        loop body;

By the address of a shared variable (upc_forall address):

    upc_forall(i=0; i<N; i++; &shared_var[i])
        loop body;

By thread/index number (for optimized):

    for(i=MYTHREAD; i<N; i+=THREADS)
        loop body;

By thread/index number (for integer):

    for(i=0; i<N; i++) {
        if(MYTHREAD == i%THREADS)
            loop body;
    }

By the address of a shared variable (for address):

    for(i=0; i<N; i++) {
        if(upc_threadof(&shared_var[i]) == MYTHREAD)
            loop body;
    }
Performance of Equivalent upc_forall and for Loops

[Chart: time (sec) vs. 1-16 processors for the five variants: upc_forall address, upc_forall integer, for address, for integer, and for optimized]
Performance Limitations Imposed by Sequential C Compilers -- STREAM

(memcpy, memset, and Struct cp are bulk operations; the rest are element-by-element.)

NUMA (MB/s):
    memcpy | memset | Struct cp | Copy (arr) | Copy (ptr) | Set    | Sum    | Scale | Add   | Triad
F   291.21 | 163.90 | N/A       | 291.59     | N/A        | 159.68 | 135.37 | 246.3 | 235.1 | 303.82
C   231.20 | 214.62 | 158.86    | 120.57     | 152.77     | 147.70 | 298.38 | 133.4 | 13.86 | 20.71

Vector (MB/s):
    memcpy | memset | Struct cp | Copy (arr) | Copy (ptr) | Set    | Sum    | Scale | Add   | Triad
F   14423  | 11051  | N/A       | 14407      | N/A        | 11015  | 17837  | 14423 | 10715 | 16053
C   18850  | 5307   | 7882      | 7972       | 7969       | 10576  | 18260  | 7865  | 3874  | 5824
Loopmark – SET/ADD Operations

(The Vector STREAM table from the previous slide is repeated here for reference.)

Let us compare the loopmarks for each Fortran / C operation.
Loopmark – SET/ADD Operations

Fortran:

    MEMSET (bulk set)
    146. 1             t = mysecond(tflag)
    147. 1  V M--<><>  a(1:n) = 1.0d0
    148. 1             t = mysecond(tflag) - t
    149. 1             times(2,k) = t

    SET
    158. 1             arrsum = 2.0d0
    159. 1             t = mysecond(tflag)
    160. 1  MV------<  DO i = 1,n
    161. 1  MV           c(i) = arrsum
    162. 1  MV           arrsum = arrsum + 1
    163. 1  MV------>  END DO
    164. 1             t = mysecond(tflag) - t
    165. 1             times(4,k) = t

    ADD
    180. 1             t = mysecond(tflag)
    181. 1  V M--<><>  c(1:n) = a(1:n) + b(1:n)
    182. 1             t = mysecond(tflag) - t
    183. 1             times(7,k) = t

C:

    MEMSET (bulk set)
    163. 1             times[1][k] = mysecond_();
    164. 1             memset(a, 1, NDIM*sizeof(elem_t));
    165. 1             times[1][k] = mysecond_() - times[1][k];

    SET
    217. 1             set = 2;
    220. 1             times[5][k] = mysecond_();
    222. 1  MV--<      for (i=0; i<NDIM; i++)
    223. 1  MV         {
    224. 1  MV           c[i] = (set++);
    225. 1  MV-->      }
    227. 1             times[5][k] = mysecond_() - times[5][k];

    ADD
    283. 1             times[10][k] = mysecond_();
    285. 1  Vp--<      for (j=0; j<NDIM; j++)
    286. 1  Vp         {
    287. 1  Vp           c[j] = a[j] + b[j];
    288. 1  Vp-->      }
    290. 1             times[10][k] = mysecond_() - times[10][k];

Legend: V: vectorized - M: multistreamed - p: conditional, partial and/or computed
UPC vs CAF Using the NPB Workloads

In general, UPC is slower than CAF, mainly due to:
Point-to-point vs. barrier synchronization: better scalability requires proper collective operations; program writers can already do point-to-point synchronization with current constructs
Scalar performance of source-to-source translated code:
Alias analysis (C pointers): highlights the need to use restrict explicitly to help several compiler backends (see the sketch below)
Lack of support for multi-dimensional arrays in C: can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
Need for exhaustive C compiler analysis: a failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than for CAF, and a failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
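For the alias-analysis point, the fix is mechanical once data has been privatized (a sketch; names are illustrative):

    /* restrict promises the backend that x and y never alias, enabling
       software pipelining of the inner loop without per-iteration checks */
    void axpy(int n, double * restrict x, const double * restrict y, double a)
    {
        for (int i = 0; i < n; i++)
            x[i] += a * y[i];
    }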
Conclusions

UPC is a locality-aware parallel programming language
With proper optimizations, UPC can outperform MPI on random short accesses and can otherwise perform as well as MPI
UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
UPC compiler optimizations are still lagging, in spite of the substantial progress that has been made
For future architectures, UPC has a unique opportunity for very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
Conclusions

In general, four types of optimizations:
Optimizations that exploit the locality consciousness and other unique features of UPC
Optimizations that keep the overhead of UPC low
Optimizations that exploit architectural features
Standard optimizations applicable to all systems' compilers
Conclusions

Optimizations are possible at three levels:
A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations
C backend compilers that compete with Fortran
A strong run-time system that can work effectively with the operating system
Selected Publications

T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming, John Wiley & Sons, New York, June 2005. ISBN 0-471-22048-5.
T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study", Journal of Future Generation Computer Systems, North-Holland (accepted).
Selected Publications

T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372.
T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study", Supercomputing 2002 (SC2002), Baltimore, November 2002.
F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005.
Additional papers in CUG and PPoPP.