

The Performance and Productivity Benefits of Global Address Space Languages

Christian Bell, Dan Bonachea, Kaushik Datta, Rajesh Nishtala, Paul Hargrove, Parry Husbands, Kathy Yelick

http://upc.lbl.gov    http://titanium.cs.berkeley.edu    http://gasnet.cs.berkeley.edu

Unified Parallel C
• UPC: a GAS extension to ISO C99
  • shared types & pointers (see the sketch after this list)
• Libraries for parallel ops
  • Synchronization
  • Collective data movement
  • File I/O
• Many open source & vendor implementations
  • Berkeley UPC, MuPC
  • HP UPC, Cray UPC
• See UPC Booth #137
• NAS Benchmarks written from scratch
  • FT, CG, MG, LU - all NAS-compliant
  • Use some Berkeley UPC extensions
  • All competitive with best MPI versions, even on cluster networks

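A minimal UPC sketch (not part of the poster) of the "shared types & pointers" and parallel-library bullets above; the array sizes are arbitrary, and the broadcast call is written against the standard UPC collectives interface (verify the exact prototype in upc_collective.h).

    #include <upc_relaxed.h>      /* UPC with relaxed memory consistency */
    #include <upc_collective.h>   /* standard UPC collectives library */
    #include <stdio.h>

    #define N 8
    shared [N] int src[N];            /* a single block of N ints, so it lives on thread 0 */
    shared [N] int dst[N*THREADS];    /* one block of N ints with affinity to each thread */

    int main(void) {
        int i;
        if (MYTHREAD == 0)                      /* thread 0 fills the source block */
            for (i = 0; i < N; i++) src[i] = i * i;
        upc_barrier;                            /* synchronization (library bullet) */

        /* Collective data movement (library bullet): every thread gets a copy of src */
        upc_all_broadcast(dst, src, N * sizeof(int),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

        /* Pointer-to-shared: points at this thread's block of the shared array */
        shared [N] int *p = &dst[MYTHREAD * N];
        printf("thread %d received %d .. %d\n", MYTHREAD, p[0], p[N-1]);
        return 0;
    }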


Global Address Space Languages
• Global Address Space languages (GAS) support:
  • Global pointers and distributed arrays
  • User-controlled layout of data across nodes
  • Communication via implicit reads & writes of remote memory (sketched below)
• Languages: UPC, Titanium, Co-Array Fortran
  • Productivity benefits of shared-memory programming
  • Competitive performance on distributed memory
• Use Single Program Multiple Data (SPMD) control
  • Fixed number of compute threads
  • Global synchronization, barriers, collectives
• Exploit fast one-sided communication
  • Individual accesses and bulk copies
  • Berkeley implementations use GASNet

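A minimal UPC sketch (not from the poster) of the distributed-array, owner-computes, and bulk-copy points above; the array name, block size, and neighbor pattern are purely illustrative.

    #include <upc_relaxed.h>
    #include <stdio.h>

    #define B 256
    shared [B] double grid[B*THREADS];   /* user-controlled layout: one block of B elements per thread */

    int main(void) {
        double mine[B];                  /* private buffer for a bulk copy */
        int i, right = (MYTHREAD + 1) % THREADS;

        /* SPMD: every thread runs this; upc_forall assigns each iteration to the
           thread that owns grid[i] (owner computes) */
        upc_forall (i = 0; i < B*THREADS; i++; &grid[i])
            grid[i] = MYTHREAD;

        upc_barrier;

        /* Individual access: an ordinary-looking read that is an implicit remote read */
        double sample = grid[right * B];

        /* Bulk copy: one-sided get of the neighbor's whole block into private memory */
        upc_memget(mine, &grid[right * B], B * sizeof(double));

        printf("thread %d: neighbor's first element = %g (bulk copy saw %g)\n",
               MYTHREAD, sample, mine[0]);
        return 0;
    }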


GASNet Portability
• Native network hardware support:
  • Quadrics QsNet I/II (Elan3/Elan4)
  • Cray X1 - Cray shmem
  • SGI Altix - SGI shmem
  • Dolphin - SCI
  • InfiniBand - Mellanox VAPI
  • Myricom Myrinet - GM-1 and GM-2
  • IBM Colony and Federation - LAPI
• Portable network support:
  • Ethernet - UDP: works with any TCP/IP stack
  • MPI 1: portable implementation for other HPC systems
• Berkeley UPC, Titanium & GASNet are highly portable:
  • Runtimes and generated code are all ANSI C
  • New platform ports in 2-3 days
  • New network hardware support in 2-3 weeks
• CPUs: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, T3E, X-1, SX-6, …
• OSes: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS-Windows/Cygwin, Mac OS X, Unicos, SuperUX, Catamount, BlueGene, …


Productivity

[Chart: NAS Parallel Benchmark code size. Productive lines of code (declarations, communication, computation) for Fortran+MPI vs. Titanium versions of CG, FT, and MG, on a 0-900 line scale]

Applications

[Figure: Mach-10 shock hitting a solid surface at an oblique angle]

• Immersed Boundary Method Simulation
  • Human Heart
  • Cochlea
• Adaptive Mesh Refinement (AMR)
  • AMR Poisson Multigrid Solvers
  • 3D AMR Gas Dynamics Hyperbolic Solver
  • AMR with Line Relaxation (low aspect ratio)
• Bioinformatics: microarray oligonucleotide selection
• Finite Element Benchmarks
• Tree-structured n-body kernels
• Dense Linear Algebra: LU, MatMul


2-D multigrid stencil code in Titanium:

    final Point<2> NORTH = [ 0,  1], SOUTH = [ 0, -1],
                   EAST  = [ 1,  0], WEST  = [-1,  0];

    foreach (p in gridA.domain())
        gridB[p] = S0 * gridA[p] +
                   S1 * ( gridA[p+NORTH] + gridA[p+SOUTH] +
                          gridA[p+EAST]  + gridA[p+WEST] );

Titanium
• Titanium: GAS dialect of Java
• Compiled to native executable:
  • No JVM or JIT
  • Highly portable & high performance
• High productivity language
  • All the productivity features of Java
  • Automatic memory management
  • Object-oriented programming, etc.
• Built-in support for scientific computing:
  • n-D points, rectangles, domains
  • n-D domain calculus operators
  • Flexible & efficient multi-D arrays
• High-performance templates
• User-defined immutable (value) classes
• Explicitly unordered n-D loop iteration
• Operator overloading
• Efficient cross-language support
• Allows for elegant and concise programs


[Diagram: the partitioned global address space. Threads 0 through n each hold private data (p, y) and have affinity to a slice of the shared space containing X, Z, and the distributed array elements A[0], A[1], …, A[n]]

• Global Address Space is logically partitioned
  • Shared data accessible from any thread
  • Private data local to each thread
• Pointers are either local or global (possibly remote); see the sketch below

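A minimal UPC sketch of the local/global pointer distinction described above; the variable names and neighbor pattern are illustrative, not from the poster.

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int X[THREADS];    /* one shared element with affinity to each thread */

    int main(void) {
        int z;                                           /* private: one copy per thread */
        int *lp = &z;                                    /* local pointer into private memory */
        shared int *gp = &X[(MYTHREAD + 1) % THREADS];   /* global pointer, possibly remote */

        *lp = MYTHREAD;          /* ordinary local store */
        *gp = MYTHREAD;          /* implicit one-sided store into another thread's partition */
        upc_barrier;             /* make the remote stores visible */

        /* A global pointer whose target has affinity to this thread can be cast
           down to a cheap private pointer: */
        int *mine = (int *)&X[MYTHREAD];
        printf("thread %d received %d from its left neighbor\n", MYTHREAD, *mine);
        return 0;
    }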


[Chart: AMR code size. Lines of code (10K to 50K scale) for C++/Fortran/MPI vs. Titanium, broken down into AMR data structures, AMR operations, and the elliptic PDE solver (* the C++ code has a more general grid generator)]

[Chart: MG class D speedup on Opteron/InfiniBand, up to 140 processors, relative to the best 32-processor time of 171.3 sec; Fortran w/MPI vs. Titanium]


[Chart: FT class C speedup on G5/InfiniBand, up to 70 processors, relative to the best 16-processor time of 37.9 sec; Titanium (non-blocking), Titanium (blocking), and Fortran w/MPI. FFTW is used for all serial 1-D FFTs]

AMR solver running time:
                      Titanium AMR    Chombo (C++/F/MPI)
  Serial (small)      47.6 sec        57.5 sec
  Parallel (large)    121.9 sec       113.3 sec

Performance

LU Decomposition
• High ratio of computation to communication
  • scales very well to large machines: Top500
• Non-trivial dependence patterns
• HPL code uses two-sided MPI messaging and is very difficult to tune
• UPC implementation of LU uses:
  • one-sided communication in the GAS model
  • lightweight multithreading atop SPMD
  • memory-constrained lookahead to manage the amount of concurrency at each processor (see the sketch after this list)
  • highly adaptable to problem/machine sizes
  • natural latency tolerance
• Performance is comparable to HPL / MPI
• UPC code is less than half the size

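A sketch of the memory-constrained lookahead idea above, not the poster's LU code: at most LOOKAHEAD one-sided panel fetches are in flight while the current panel is processed. Panel sizes, the data layout, and the Berkeley UPC non-blocking copy calls (bupc_memget_async / bupc_waitsync and their header) are assumptions for illustration only.

    #include <upc_relaxed.h>
    #include <bupc_extensions.h>    /* Berkeley UPC async memcpy extensions (assumed header name) */

    #define NPANELS   64            /* hypothetical number of panels */
    #define PANEL_SZ  (128*128)     /* hypothetical doubles per panel */
    #define LOOKAHEAD 4             /* memory-constrained depth: at most 4 panels in flight */

    /* Panels dealt round-robin across threads (assumes the static-THREADS environment) */
    shared [PANEL_SZ] double A[NPANELS*PANEL_SZ];

    double panel_buf[LOOKAHEAD][PANEL_SZ];   /* private staging buffers, one per lookahead slot */
    bupc_handle_t h[LOOKAHEAD];

    void process_panel(double *panel) {      /* stand-in for the real factor/update step */
        panel[0] += 1.0;
    }

    void lu_like_loop(void) {
        int k;
        /* Prime the pipeline: start LOOKAHEAD one-sided gets */
        for (k = 0; k < LOOKAHEAD && k < NPANELS; k++)
            h[k] = bupc_memget_async(panel_buf[k], &A[k*PANEL_SZ],
                                     PANEL_SZ * sizeof(double));
        for (k = 0; k < NPANELS; k++) {
            int slot = k % LOOKAHEAD;
            bupc_waitsync(h[slot]);              /* panel k has arrived */
            process_panel(panel_buf[slot]);      /* compute while the other gets are in flight */
            if (k + LOOKAHEAD < NPANELS)         /* slot is free again: prefetch a future panel */
                h[slot] = bupc_memget_async(panel_buf[slot],
                                            &A[(k+LOOKAHEAD)*PANEL_SZ],
                                            PANEL_SZ * sizeof(double));
        }
    }

    int main(void) {
        int i;
        upc_forall (i = 0; i < NPANELS*PANEL_SZ; i++; &A[i]) A[i] = 0.0;
        upc_barrier;
        lu_like_loop();
        upc_barrier;
        return 0;
    }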

[Diagram: the GAS software stack. GAS language program code feeds a source-to-source translator; the translator-generated C code runs over the language runtime system, which runs over the GASNet communication system and the network hardware. The layers are annotated as network-independent, platform-independent, compiler-independent, and language-independent]

Conjugate Gradient
• Solves Ax = b for x, where A is a sparse 2-D array
• Computation dominated by sparse matrix-vector multiplications (SPMV)
• Key communication for the 2-D (NAS) decomposition:
  • team sum-reduce-to-all (vector & scalar)
  • point-to-point exchange
  • Need explicit synchronization in the one-sided model
• Uses a tree-based team reduction library
  • Tunable based on network characteristics
• Overlap the vector reductions on the rows of the matrix with the computation of the SPMV (see the sketch after this list)
• Outperforms the MPI implementation by up to 10%

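A sketch of the overlap idea above, not the poster's CG code: the thread ships the finished half of its SPMV result to a reduction partner with a non-blocking one-sided put and keeps computing. The partner, layout, and the Berkeley UPC async-put call (bupc_memput_async / bupc_waitsync) are assumptions; the tree-based team reduction library itself is not shown.

    #include <upc_relaxed.h>
    #include <bupc_extensions.h>    /* Berkeley UPC async memcpy extensions (assumed header name) */

    #define NLOC 1024               /* hypothetical rows owned by this thread */

    double y[NLOC];                                /* private SPMV result for the local rows */
    shared [NLOC] double partial[THREADS*NLOC];    /* landing zone: block t has affinity to thread t */

    void spmv_rows(double *out, int lo, int hi) {  /* stand-in for the real sparse mat-vec */
        for (int i = lo; i < hi; i++) out[i] = i;
    }

    void spmv_with_overlap(void) {
        int partner = (MYTHREAD + 1) % THREADS;    /* hypothetical reduction partner */
        bupc_handle_t h;

        spmv_rows(y, 0, NLOC/2);                   /* first half of the local rows */

        /* One-sided put of the finished half; communication overlaps the rest of the SPMV */
        h = bupc_memput_async(&partial[partner*NLOC], y, (NLOC/2) * sizeof(double));

        spmv_rows(y, NLOC/2, NLOC);                /* second half overlaps the put */

        bupc_waitsync(h);                          /* the async put is complete */
        upc_memput(&partial[partner*NLOC + NLOC/2],      /* ship the second half */
                   y + NLOC/2, (NLOC - NLOC/2) * sizeof(double));

        upc_barrier;   /* explicit synchronization required by the one-sided model;
                          the partner can now fold partial[] into its reduction */
    }

    int main(void) { spmv_with_overlap(); return 0; }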

[Chart: NAS CG performance. MFlops per thread for MPI/Fortran, MPI/C, MPI/C w/ overlap, UPC, and UPC w/ overlap on Opteron/InfiniBand (class D, 256 procs), Alpha/Elan3 (class D, 512 procs), and x86/Myrinet (class D, 64 procs)]

High Performance

[Chart: Opteron/InfiniBand flood bandwidth vs. transfer size (1 B to 10 MB), GASNet put_nb vs. MPI (MVAPICH 0.9.5) Isend/Irecv, measured on jacquard.nersc.gov (dual 2.2 GHz Opteron, 4 GB memory, Mellanox InfiniHost PCI-X IB-4X HCAs); an inset plots the bandwidth ratio (1.00 to 2.50) across message sizes]

[Chart: flood bandwidth for 4 KB messages as a percent of hardware peak, GASNet vs. MPI, on Alpha/Elan3, Itanium2/Elan4, x86/Myrinet, Power5/InfiniBand, and Opteron/InfiniBand; bars are labeled 231, 815, 223, 679, 714, 177, 495, 152, 252, and 426 MB/s]

[Chart: flood bandwidth for 512 KB messages as a percent of hardware peak on the same five systems; bars are labeled 256, 859, 229, 796, 799, 246, 851, 225, 671, and 659 MB/s]

[Chart: Itanium2/Elan4 round-trip latency (0 to 12 usec) vs. transfer size (1 to 1000 bytes) for MPI Isend/Irecv ping + 0-byte ack, MPI send/recv ping + 0-byte ack, GASNet put_nb + sync, and GASNet get_nb + sync]

• Communication micro-benchmarks show GASNet consistently matches or outperforms MPI
• One-sided, lightweight communication semantics:
  • eliminates matching & rendezvous overheads
  • better match to hardware capabilities
  • leverages widely available support for remote direct memory access
• Active Messages support provides extensibility


Datta, Bonachea and Yelick, LCPC 2005
Wen and Colella, IPDPS 2005


• Titanium reduces code size and development time
• Communication is more concise than MPI and less error-prone
• Language features and libraries capture useful paradigms


[Chart: LU performance comparison. Percent of machine peak (10% to 90% scale) for MPI/HPL vs. UPC/LU on Cray X1 (64 procs), Cray X1 (128/124 procs), SGI Altix (32 procs), and a 2.2 GHz Opteron cluster (64 procs)]

Fourier Transform
• 3-D FFT with 1-D partitioning across processors
  • Computation is 1-D FFTs using the FFTW library
  • Communication is an all-to-all transpose
  • Traditionally a bandwidth-limited problem
• Optimization approach in GAS languages (see the sketch after this list):
  • Aggressively overlap the transpose with the 2nd FFT
  • Send more, smaller messages to maximize overlap
    • Per-target slabs or individual 1-D pencils
  • Use low-overhead one-sided communication
• Consistently outperforms MPI-based implementations
  • Improvement of up to 2x, even at large scale

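A sketch of the slab-overlap idea above, not the poster's FT code: as each slab of the first FFT pass finishes, a non-blocking one-sided put ships it toward its target thread while the next slab is computed. The slab size, layout, FFT stub, and the Berkeley UPC async-put call (bupc_memput_async / bupc_waitsync) are illustrative assumptions; the real code uses FFTW for the serial FFTs.

    #include <upc_relaxed.h>
    #include <bupc_extensions.h>    /* Berkeley UPC async memcpy extensions (assumed header name) */

    #define SLAB 4096               /* hypothetical doubles per slab, one slab per target thread */

    /* Assumes the static-THREADS environment.  transposed[src][dst*SLAB + k] has
       affinity to thread dst (block size SLAB, blocks dealt round-robin). */
    double plane[THREADS][SLAB];                          /* private: rows after the 1st FFT pass */
    shared [SLAB] double transposed[THREADS][THREADS*SLAB];

    void fft_slab(double *slab) {                         /* stand-in for the serial FFTs (FFTW) */
        for (int i = 0; i < SLAB; i++) slab[i] += 1.0;
    }

    void transpose_with_overlap(void) {
        bupc_handle_t h[THREADS];
        int t;

        for (t = 0; t < THREADS; t++) {
            fft_slab(plane[t]);                           /* finish the slab destined for thread t ... */
            h[t] = bupc_memput_async(                     /* ... ship it, then move on to the next slab */
                       &transposed[MYTHREAD][t*SLAB], plane[t], SLAB * sizeof(double));
        }
        for (t = 0; t < THREADS; t++)
            bupc_waitsync(h[t]);                          /* all puts complete */
        upc_barrier;       /* everyone's slabs are in place: the 2nd FFT pass can start */
    }

    int main(void) { transpose_with_overlap(); return 0; }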


Fine-grained communication overlap is less effective in MPI due to increased messaging overheads and higher per-message costs.
