The Performance and Productivity Benefits of Global Address Space Languages
TRANSCRIPT
Unified Parallel C
• UPC: a GAS extension to ISO C99
  • Shared types & pointers
• Libraries for parallel ops
  • Synchronization
  • Collective data movement
  • File I/O
• Many open source & vendor implementations
  • Berkeley UPC, MuPC
  • HP UPC, Cray UPC
• See UPC Booth #137
• NAS Benchmarks written from scratch
  • FT, CG, MG, LU - all NAS-compliant
  • Use some Berkeley UPC extensions
  • All competitive with best MPI versions, even on cluster networks
Christian Bell, Dan Bonachea, Kaushik Datta, Rajesh Nishtala, Paul Hargrove, Parry Husbands, Kathy Yelick
http://upc.lbl.gov
http://titanium.cs.berkeley.edu
http://gasnet.cs.berkeley.edu
• Global Address Space languages (GAS) support:
  • Global pointers and distributed arrays
  • User-controlled layout of data across nodes
  • Communicate using implicit reads & writes of remote memory
• Languages: UPC, Titanium, Co-Array Fortran
  • Productivity benefits of shared-memory programming
  • Competitive performance on distributed memory
• Use Single Program Multiple Data (SPMD) control
  • Fixed number of compute threads
  • Global synchronization, barriers, collectives
• Exploit fast one-sided communication
  • Individual accesses and bulk copies
  • Berkeley implementations use GASNet
Global Address Space Languages
LU Decomposition
Fourier Transform
Conjugate Gradient
Performance
Productivity
GASNet Portability
• Native network hardware support:
  • Quadrics QsNet I/II (Elan3/Elan4)
  • Cray X1 - Cray shmem
  • SGI Altix - SGI shmem
  • Dolphin - SCI
  • InfiniBand - Mellanox VAPI
  • Myricom Myrinet - GM-1 and GM-2
  • IBM Colony and Federation - LAPI
• Portable network support:
  • Ethernet - UDP: works with any TCP/IP stack
  • MPI 1: portable implementation for other HPC systems
• Berkeley UPC, Titanium & GASNet highly portable:
  • Runtimes and generated code all ANSI C
  • New platform ports in 2-3 days
  • New network hardware support in 2-3 weeks
• CPUs: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, T3E, X-1, SX-6, …
• OSes: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS-Windows/Cygwin, Mac OS X, Unicos, SuperUX, Catamount, BlueGene, …
[Figure: NAS Parallel Benchmark Code Size: productive lines of code (0-900), broken into declarations, communication, and computation, for Fortran+MPI vs. Titanium versions of CG, FT, and MG]
Applications
[Figure: Mach-10 shock hitting a solid surface at an oblique angle]
• Immersed Boundary Method Simulation
  • Human Heart
  • Cochlea
• Adaptive Mesh Refinement (AMR)
  • AMR Poisson Multigrid Solvers
  • 3D AMR Gas Dynamics Hyperbolic Solver
  • AMR with Line Relaxation (low aspect ratio)
• Bioinformatics: Microarray oligonucleotide selection
• Finite Element Benchmarks
• Tree-structured n-body kernels
• Dense Linear Algebra: LU, MatMul
final Point<2> NORTH = [ 0, 1], SOUTH = [ 0,-1],
               EAST  = [ 1, 0], WEST  = [-1, 0];

foreach (p in gridA.domain())
  gridB[p] = S0 * gridA[p] +
             S1 * ( gridA[p+NORTH] + gridA[p+SOUTH] +
                    gridA[p+EAST]  + gridA[p+WEST] );
Titanium
• Titanium: GAS dialect of Java
• Compiled to native executable:
  • No JVM or JIT
  • Highly portable & high performance
• High productivity language
  • All the productivity features of Java
  • Automatic memory management
  • Object-oriented programming, etc.
• Built-in support for scientific computing:
  • n-D points, rectangles, domains
  • n-D domain calculus operators
  • Flexible & efficient multi-D arrays
  • High-performance templates
  • User-defined immutable (value) classes
  • Explicitly unordered n-D loop iteration
  • Operator overloading
  • Efficient cross-language support
• Allows for elegant and concise programs
High Performance
[Figure: Partitioned global address space: threads Thread0, Thread1, …, Threadn share a global address space holding shared objects (X, Z, and a distributed array A[0], A[1], …, A[n]); each thread also has private data (p, y) visible only to itself]
• Global Address Space logically partitioned
  • Shared data accessible from any thread
  • Private data local to each thread
• Pointers are either local or global (possibly remote)
2-d Multigrid stencil code in Titanium
[Figure: AMR Code Size: lines of code (10K-50K) for C++/Fortran/MPI vs. Titanium, broken into elliptic PDE solver*, AMR operations, and AMR data structures]
[Figure: MG Class D Speedup on Opteron/InfiniBand: speedup (0-140) vs. processors (0-140) for Fortran w/MPI and Titanium, relative to the best 32-processor time of 171.3 sec]
* C++ code has a more general grid generator
[Figure: FT Class C Speedup on G5/InfiniBand: speedup (0-80) vs. processors (0-70) for Titanium (non-blocking), Titanium (blocking), and Fortran w/MPI, relative to the best 16-processor time of 37.9 sec]
FFTW is used for all serial 1-D FFTs.
AMR Solver          Titanium AMR    Chombo (C++/F/MPI)
Serial (small)      47.6 sec        57.5 sec
Parallel (large)    121.9 sec       113.3 sec
• High ratio of computation to communication
  • Scales very well to large machines: Top500
• Non-trivial dependence patterns
  • HPL code uses two-sided MPI messaging, and is very difficult to tune
• UPC implementation of LU uses:
  • One-sided communication in the GAS model
  • Lightweight multithreading atop SPMD
  • Memory-constrained lookahead to manage the amount of concurrency at each processor
  • Highly adaptable to problem/machine sizes
  • Natural latency tolerance
• Performance is comparable to HPL / MPI
• UPC code less than half the size
[Figure: Implementation stack: GAS language program code feeds a source-to-source translator; the translator-generated C code runs atop the language runtime system, the GASNet communication system, and the network hardware. Successive layers are network-independent, platform-independent, compiler-independent, and language-independent.]
• Solves Ax = b for x, where A is a sparse 2-D array
• Computation dominated by Sparse Matrix-Vector Multiplications (SPMV)
• Key communication for 2-D (NAS) decomposition:
  • Team sum-reduce-to-all (vector & scalar)
  • Point-to-point exchange
• Need explicit synchronization in the one-sided model
• Uses tree-based team reduction library
  • Tunable based on network characteristics
• Overlap the vector reductions on the rows of the matrix with the computation of the SPMV
• Outperforms the MPI implementation by up to 10%
[Figure: CG performance: MFlops per thread (0-100) for MPI/Fortran, MPI/C, MPI/C w/overlap, UPC, and UPC w/overlap on Opteron/InfiniBand (class D, 256 procs), Alpha/Elan3 (class D, 512 procs), and x86/Myrinet (class D, 64 procs)]
[Figure: Opteron/InfiniBand flood bandwidth comparison: bandwidth (MB/s, 0-900) vs. transfer size (1 B to 10 MB) for GASNet put_nb and MPI (MVAPICH 0.9.5) Isend/Irecv. Measured on jacquard.nersc.gov: dual 2.2 GHz Opteron CPUs, 4 GB memory, Mellanox InfiniHost PCI-X IB-4X HCAs.]
[Figure: Bandwidth ratio (1.00-2.50) vs. message size (10 B to 10 MB; MB = 2^20 bytes)]
[Figure: Flood bandwidth for 4 KB messages: percent of hardware peak bandwidth (0-100%) for GASNet vs. MPI on Alpha/Elan3, Itanium2/Elan4, x86/Myrinet, Power5/InfiniBand, and Opteron/InfiniBand, with absolute rates annotated (152-815 MB/s)]
[Figure: Flood bandwidth for 512 KB messages: percent of hardware peak bandwidth (0-100%) for GASNet vs. MPI on the same five platforms, with absolute rates annotated (225-859 MB/s)]
[Figure: Itanium2/Elan4 roundtrip latency comparison: round-trip latency (0-12 µs) vs. transfer size (1-1000 bytes) for MPI Isend/Irecv ping + 0-byte ack, MPI send/recv ping + 0-byte ack, GASNet put_nb + sync, and GASNet get_nb + sync]
• Communication micro-benchmarks show GASNet consistently matches or outperforms MPI
• One-sided, lightweight communication semantics
  • Eliminates matching & rendezvous overheads
  • Better match to hardware capabilities
  • Leverages widely-available support for remote direct memory access
• Active Messages support provides extensibility
Datta, Bonachea and Yelick, LCPC 2005
Wen and Colella, IPDPS 2005
• Titanium reduces code size and development time
  • Communication more concise than MPI, less error-prone
  • Language features and libraries capture useful paradigms
[Figure: LU performance comparison: percent of peak performance (10%-90%) for MPI/HPL vs. UPC/LU on Cray X1 (64 procs), Cray X1 (128/124 procs), SGI Altix (32 procs), and 2.2 GHz Opteron (64 procs)]
• 3-D FFT with 1-D partitioning across processors
  • Computation is 1-D FFTs using the FFTW library
  • Communication is an all-to-all transpose
  • Traditionally a bandwidth-limited problem
• Optimization approach in GAS languages:
  • Aggressively overlap the transpose with the 2nd FFT
  • Send more, smaller messages to maximize overlap
    • Per-target slabs or individual 1-D pencils
  • Use low-overhead one-sided communication
• Consistently outperforms MPI-based implementations
  • Improvement of up to 2x, even at large scale
Fine-grained communication overlap is less effective in MPI due to increased messaging overheads and higher per-message costs.