Non-uniformly Communicating Non-contiguous Data:
A Case Study with PETSc and MPI
P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur and W. Gropp
Mathematics and Computer Science, Argonne National Laboratory
Numerical Libraries in HEC
• Developing parallel applications is a complex task
  – Discretizing physical equations to numerical forms
  – Representing the domain of interest as data points
• Libraries allow developers to abstract low-level details
  – E.g., numerical analysis, communication, I/O
• Numerical libraries (e.g., PETSc, ScaLAPACK, PESSL)
  – Parallel data layout and processing
  – Tools for distributed data layout (matrix, vector)
  – Tools for data processing (SLES, SNES)
Overview of PETSc
• Portable, Extensible Toolkit for Scientific Computing
• Software tools for solving PDEs
  – Suite of routines to create vectors, matrices and distributed arrays
  – Sequential/parallel data layout
  – Linear and nonlinear numerical solvers
• Widely used in nanosimulations, molecular dynamics, etc.
• Uses MPI for communication
[Diagram: PETSc software layers, by increasing level of abstraction. Bottom: BLAS, LAPACK, MPI. Above: Matrices, Vectors, Index Sets. Then: KSP (Krylov subspace methods), PC (preconditioners), Draw. Then: SNES (nonlinear equation solvers), SLES (linear equation solvers), TS (time stepping). Top: PDE solvers and application codes.]
Handling Parallel Data Layouts in PETSc
• Grid layout exposed to the application
  – Structured or unstructured (1D, 2D, 3D)
  – Internally managed as a single vector of data elements
  – Representation often suited to optimize its operations
• Impact on communication:
  – Data representation and communication pattern might not be ideal for MPI communication operations
  – Non-uniformity and non-contiguity in communication are the primary culprits
Presentation Layout
• Introduction
• Impact of PETSc Data Layout and Processing on MPI
• MPI Enhancements and Optimizations
• Experimental Evaluation
• Concluding Remarks and Future Work
Data Layout and Processing in PETSc
• Grid layouts: data is divided among processes (a ghost-exchange sketch follows the figure below)
  – Ghost data points are shared
• Non-contiguous data communication
  – Second dimension of the grid
• Non-uniform communication
  – Structure of the grid
  – Stencil type used
  – Sides larger than corners
[Figure: two processes (Proc 0, Proc 1) sharing a grid across a process boundary, marking local data points and ghost data points, for a box-type stencil and a star-type stencil.]
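To make this concrete, below is a minimal sketch of the ghost-point exchange described above, written against PETSc's current DMDA interface (this deck predates it and used the older DA names); the grid size, boundary types, and star stencil are illustrative choices, not taken from the deck.

    /* Minimal sketch: 2-D structured grid with a star-type stencil and
     * width-1 ghost points; DMGlobalToLocal fills the ghost points and
     * is where the non-uniform, non-contiguous communication happens. */
    #include <petscdmda.h>

    int main(int argc, char **argv)
    {
        DM  da;
        Vec global, local;

        PetscInitialize(&argc, &argv, NULL, NULL);
        DMDACreate2d(PETSC_COMM_WORLD,
                     DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                     DMDA_STENCIL_STAR,           /* sides only, no corners */
                     128, 128,                    /* global grid size */
                     PETSC_DECIDE, PETSC_DECIDE,  /* process layout */
                     1, 1,                        /* dof, stencil width */
                     NULL, NULL, &da);
        DMSetUp(da);
        DMCreateGlobalVector(da, &global);
        DMCreateLocalVector(da, &local);

        /* Fill ghost points from neighboring processes. */
        DMGlobalToLocalBegin(da, global, INSERT_VALUES, local);
        DMGlobalToLocalEnd(da, global, INSERT_VALUES, local);

        VecDestroy(&local);
        VecDestroy(&global);
        DMDestroy(&da);
        PetscFinalize();
        return 0;
    }

With DMDA_STENCIL_BOX the corner ghost points are exchanged as well, which is exactly the sides-versus-corners non-uniformity the slide points out.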
Non-contiguous Communication in MPI
• MPI derived datatypes
  – Application describes the non-contiguous data layout to MPI (see the sketch below)
  – Data is either packed into contiguous buffers and pipelined (sparse layouts) or sent individually (dense layouts)
• Good for simple algorithms, but very restrictive
  – Looking ahead at upcoming content to pre-decide which algorithm to use
  – Multiple parses on the datatype lose the context!
[Figure: a non-contiguous data layout is packed into a packing buffer; the context is saved, the data is sent, and the context is saved again.]
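For illustration, a minimal sketch of describing the non-contiguous second dimension of a grid to MPI with a derived datatype; the grid size N and the two-rank send/receive are illustrative, not from the deck.

    #include <mpi.h>

    #define N 1024   /* illustrative grid dimension */

    int main(int argc, char **argv)
    {
        static double grid[N][N];
        MPI_Datatype column;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One double from each of the N rows, N doubles apart:
         * the non-contiguous second dimension of the grid. */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(&grid[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&grid[0][0], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }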
Issues with Lost Datatype Context
• Rollback of the context is not possible
  – Datatypes can be recursive
• Duplication of the context is not possible
  – Context information might be large
  – When datatype elements are small, the context can be larger than the datatype itself
• Searching for the context is possible, but very expensive
  – Search time grows quadratically with datatype size
  – This is the currently used mechanism!
Non-uniform Collective Communication
• Collective communication algorithms are optimized for "uniform" communication
• Case studies (a usage sketch follows the figure):
  – Allgatherv uses a ring algorithm
    • Causes idleness if data volumes are very different
  – Alltoallw sends data to nodes in round-robin manner
    • MPI processing is sequential
[Figure: ring exchange among processes 0-6 with one large message and several small messages.]
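For reference, a minimal sketch of the kind of non-uniform Allgatherv call this case study refers to; the skewed per-rank counts are illustrative.

    /* Minimal sketch: each rank contributes a different amount of data,
     * the "non-uniform" pattern discussed above. Rank i sends i+1
     * doubles, a deliberately skewed distribution. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mycount = rank + 1;
        double *sendbuf = malloc(mycount * sizeof(double));
        for (int i = 0; i < mycount; i++)
            sendbuf[i] = rank;

        /* Every rank must know every other rank's count and offset. */
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        int total = 0;
        for (int i = 0; i < size; i++) {
            counts[i] = i + 1;
            displs[i] = total;
            total += counts[i];
        }
        double *recvbuf = malloc(total * sizeof(double));

        MPI_Allgatherv(sendbuf, mycount, MPI_DOUBLE,
                       recvbuf, counts, displs, MPI_DOUBLE,
                       MPI_COMM_WORLD);

        free(sendbuf); free(counts); free(displs); free(recvbuf);
        MPI_Finalize();
        return 0;
    }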
Presentation Layout
• Introduction
• Impact of PETSc Data Layout and Processing on MPI
• MPI Enhancements and Optimizations
• Experimental Evaluation
• Concluding Remarks and Future Work
Dual-context Approach for Non-contiguous Communication
• Previous approaches are inefficient in complex designs
  – E.g., if a look-ahead is performed to understand the structure of the upcoming data, the saved context is lost
• The dual-context approach retains the data context (see the sketch below)
  – Look-aheads are performed using a separate context
  – Completely eliminates the search time
[Figure: a non-contiguous data layout is packed into a packing buffer; one saved context drives the data sends, a second one the look-ahead.]
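A language-level sketch of the dual-context idea (this is not MPICH2's actual internal API; the names and structures here are invented for illustration): two independent cursors over the same non-contiguous layout, so a look-ahead never disturbs the packing position.

    /* Hypothetical structures illustrating the dual-context approach. */
    #include <stddef.h>

    typedef struct {
        size_t segment;    /* which contiguous segment we are in */
        size_t offset;     /* byte offset within that segment    */
    } Cursor;

    typedef struct {
        Cursor pack;       /* context used to copy data out        */
        Cursor lookahead;  /* context used to inspect what is next */
    } DualContext;

    /* Peek at the size of the next segment without touching ctx->pack,
     * so the packing context is never lost and never needs re-searching. */
    size_t peek_next_segment(DualContext *ctx, const size_t *seg_sizes,
                             size_t nsegs)
    {
        if (ctx->lookahead.segment >= nsegs)
            return 0;
        return seg_sizes[ctx->lookahead.segment++];
    }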
Non-uniform Communication: Allgatherv
• A single point of distribution is the primary bottleneck
• Identify whether a small fraction of the messages are very large (see the sketch below)
  – Floyd and Rivest algorithm
  – Linear-time detection of outliers
• Binomial algorithms
  – Recursive doubling or dissemination
  – Logarithmic time
[Figure: binomial-tree distribution of one large message among several small messages.]
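A minimal sketch of the linear-time outlier check; the deck names Floyd and Rivest's selection algorithm, for which a plain quickselect (same expected linear time) is substituted here, and the top-5%/10x thresholds are illustrative.

    /* Return the k-th smallest element (0-based) of v[0..n-1];
     * the array is partially reordered in place. */
    static int quickselect(int *v, int n, int k)
    {
        int lo = 0, hi = n - 1;
        while (lo < hi) {
            int pivot = v[(lo + hi) / 2];
            int i = lo, j = hi;
            while (i <= j) {                  /* Hoare partition */
                while (v[i] < pivot) i++;
                while (v[j] > pivot) j--;
                if (i <= j) {
                    int t = v[i]; v[i] = v[j]; v[j] = t;
                    i++; j--;
                }
            }
            if (k <= j)      hi = j;
            else if (k >= i) lo = i;
            else             return v[k];     /* k landed on the pivot run */
        }
        return v[k];
    }

    /* True if the messages near the top dwarf the median message size. */
    int has_large_outliers(int *sizes, int n)
    {
        int median = quickselect(sizes, n, n / 2);
        int near_top = quickselect(sizes, n, n - 1 - n / 20);
        return near_top > 10 * median;
    }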
Non-uniform Communication: Alltoallw
• Distributing the messages to be sent out into bins (based on message size) allows differential treatment of nodes
• Send out small messages first (see the sketch below)
  – Nodes waiting for small messages wait less
  – The relative increase in time for nodes waiting for larger messages is much smaller
  – No skew for zero-byte data, with less synchronization
• Most helpful for non-contiguous messages
  – MPI processing (e.g., packing) is sequential for non-contiguous messages
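A minimal sketch of the "small messages first" policy, expressed with nonblocking point-to-point sends rather than MPICH2's internal Alltoallw code (which the deck does not show); the per-peer buffers are assumed to exist and the matching receives are omitted.

    #include <mpi.h>
    #include <stdlib.h>

    /* Send one message per peer, smallest first, so that peers waiting
     * for small messages are released as early as possible. */
    void small_messages_first(char **bufs, const int *sizes,
                              int npeers, MPI_Comm comm)
    {
        int *order = malloc(npeers * sizeof(int));
        MPI_Request *reqs = malloc(npeers * sizeof(MPI_Request));

        for (int i = 0; i < npeers; i++) order[i] = i;

        /* Insertion sort of peer indices by ascending message size. */
        for (int i = 1; i < npeers; i++) {
            int key = order[i], j = i - 1;
            while (j >= 0 && sizes[order[j]] > sizes[key]) {
                order[j + 1] = order[j];
                j--;
            }
            order[j + 1] = key;
        }

        for (int i = 0; i < npeers; i++) {
            int p = order[i];
            MPI_Isend(bufs[p], sizes[p], MPI_BYTE, p, 0, comm, &reqs[i]);
        }
        MPI_Waitall(npeers, reqs, MPI_STATUSES_IGNORE);

        free(order);
        free(reqs);
    }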
Presentation Layout
• Introduction
• Impact of PETSc Data Layout and Processing on MPI
• MPI Enhancements and Optimizations
• Experimental Evaluation
• Concluding Remarks and Future Work
Experimental Testbed
• 64-node cluster
  – 32 nodes with dual Intel EM64T 3.6 GHz processors
    • 2 MB L2 cache, 2 GB DDR2 400 MHz SDRAM
    • Intel E7520 (Lindenhurst) chipset
  – 32 nodes with dual Opteron 2.8 GHz processors
    • 1 MB L2 cache, 4 GB DDR 400 MHz SDRAM
    • NVidia 2200/2050 chipset
• RedHat AS4 with kernel.org kernel 2.6.16
• InfiniBand DDR (16 Gbps) network:
  – MT25208 adapters connected through a 144-port switch
• MVAPICH2-0.9.6 MPI implementation
Non-contiguous Communication Evaluation
[Chart: non-contiguous communication latency (us, 0-900,000) vs. grid size (64-1024) for MVAPICH2-0.9.6 and MVAPICH2-New.]
[Chart: timing breakup at grid size 1024, percentage of time spent in Search, Pack, and Communicate for MVAPICH2-0.9.6 and MVAPICH2-New.]
Search time can dominate performance if the working context is lost!
Allgatherv Evaluation
[Chart: Allgatherv latency (us, 0-1800) vs. message size (1 byte-32 KB) for MVAPICH2-0.9.6 and MVAPICH2-New.]
[Chart: Allgatherv latency (us, 0-1800) vs. system size (2-64 processes) for MVAPICH2-0.9.6 and MVAPICH2-New.]
Alltoallw Evaluation
[Chart: Alltoallw latency (us, 0-1800) vs. number of processes (2-128) for MVAPICH2-0.9.6 and MVAPICH2-New.]
Our algorithm reduces the skew introduced by the Alltoallw operation by sending out smaller messages first, allowing the corresponding applications to progress.
PETSc Vector Scatter
[Chart: PETSc VecScatter latency (us, 0-35,000,000) vs. number of processes (2-128) for MVAPICH2-0.9.5, MVAPICH2-New, and Hand-tuned.]
[Chart: relative improvement (%, -20 to 100) of MVAPICH2-New over MVAPICH2-0.9.5 and over Hand-tuned, vs. number of processes (2-128).]
3-D Laplacian Multigrid Solver
[Chart: application execution time (s, 0-90) vs. number of processors (4-128) for MVAPICH2-0.9.6, MVAPICH2-New, and Hand-tuned.]
[Chart: performance improvement (%, -20 to 100) of MVAPICH2-New over MVAPICH2-0.9.6 and over Hand-tuned, vs. number of processors (4-128).]
Presentation Layout
• Introduction
• Impact of PETSc Data Layout and Processing on MPI
• MPI Enhancements and Optimizations
• Experimental Evaluation
• Concluding Remarks and Future Work
Concluding Remarks and Future Work
• Non-uniform and non-contiguous communication is inherent in several libraries and applications
• Current algorithms deal with non-uniform communication in the same way as uniform communication
• Demonstrated that more sophisticated algorithms can give close to 10x improvements in performance
• Designs are part of MPICH2-1.0.5 and 1.0.6
  – To be picked up by MPICH2 derivatives in later releases
• Future work:
  – Skew tolerance in non-uniform communication
  – Other libraries and applications
Thank You
Group web page: http://www.mcs.anl.gov/radix
Home page: http://www.mcs.anl.gov/~balaji
Email: [email protected]
Backup Slides
Noncontiguous Communication in PETSc
[Figure: a strided memory layout of doubles at byte offsets 0, 8, 16, ..., 192, ..., 384: blocks of three contiguous doubles (contiguous, count = 3), described as a vector (count = 8, stride = 8), being packed into a copy buffer.]
• Data might not always be contiguously laid out in memory
  – E.g., the second dimension of a structured grid
• Communication is performed by packing data
• Pipelining copy and communication is important for performance (see the sketch below)
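A minimal sketch of that pipelining for the layout in the figure (8 blocks of 3 doubles, successive blocks 8 doubles apart): double buffering lets one block be packed while the previous one is in flight. The function name and one-block-per-chunk granularity are illustrative.

    #include <mpi.h>

    #define NBLOCKS  8
    #define BLOCKLEN 3
    #define STRIDE   8

    void pipelined_send(const double *src, int peer, MPI_Comm comm)
    {
        double chunk[2][BLOCKLEN];              /* two packing buffers */
        MPI_Request req = MPI_REQUEST_NULL;

        for (int b = 0; b < NBLOCKS; b++) {
            double *buf = chunk[b % 2];
            for (int i = 0; i < BLOCKLEN; i++)  /* pack one block */
                buf[i] = src[b * STRIDE + i];

            /* Complete the previous block's send (overlapped with the
             * packing above), then send the freshly packed block. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Isend(buf, BLOCKLEN, MPI_DOUBLE, peer, b, comm, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }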
Hand-tuning vs. Automated Optimization
• Non-uniformity and non-contiguity in data communication are inherent in several applications
  – Communicating unequal amounts of data to the different peer processes
  – Communicating data from non-contiguous memory locations
• Previous research has primarily focused on uniform and contiguous data communication
• Accordingly, applications and libraries have tried hand-tuning to convert communication formats
  – Manually packing non-contiguous data
  – Re-implementing collective operations in the application
Non-contiguous Communication in MPI
• MPI derived datatypes
  – Common approach for non-contiguous communication
  – Application describes the non-contiguous data layout to MPI
  – Data is either packed into contiguous memory (sparse layouts) or sent as independent segments (dense layouts)
• Pipelining of packing and communication improves performance, but requires context information!
[Figure: a non-contiguous data layout is packed into a packing buffer, with the context saved between sends.]
Issues with Non-contiguous Communication
• The current approach is simple and works as long as there is a single parse on the non-contiguous data
• More intelligent algorithms might suffer:
  – E.g., looking up upcoming datatype content to pre-decide which algorithm to use
  – Multiple parses on the datatype lose the context!
  – Searching for the lost context every time requires quadratically increasing time with datatype size
• PETSc non-contiguous communication suffers from such high search times
MPI-level Evaluation
[Chart: non-contiguous communication time (us, 0-1,000,000) vs. grid size (64-1024) for MVAPICH2-0.9.6 and MVAPICH2-New.]
[Chart: Allgatherv time (us, 0-2000) vs. message size (1-16384 bytes) for MVAPICH2-0.9.6 and MVAPICH2-New.]
[Chart: Allgatherv time (us, 0-2000) vs. number of processors (2-64) for MVAPICH2-0.9.6 and MVAPICH2-New.]
[Chart: Alltoallw time (us, 0-2000) vs. number of processors (2-128) for MVAPICH2-0.9.6 and MVAPICH2-New.]
Experimental Results
• MPI-level micro-benchmarks
  – Non-contiguous data communication time
  – Non-uniform collective communication
    • Allgatherv operation
    • Alltoallw operation
• PETSc vector scatter benchmark
  – Performs communication only
• 3-D Laplacian multigrid solver application
  – Partial differential equation solver
  – Utilizes PETSc numerical solver operations