-
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
LLNL-PRES-529376
SIAM Conference on Parallel Processing ◆ February 15, 2012
Topology aware resource allocation and mapping challenges at exascale
Abhinav Bhatele and Laxmikant V. Kale
-
IBM Blue Gene
-
LLNL-PRES-529376 Abhinav Bhatele @ SIAM Conference on Parallel Processing
Qbox: FPMD Simulations
• Two-dimensional process grid
• Collective communication over rows and columns
• MPI_Bcast (14.5%) and MPI_Allreduce (3.7%) on 8K nodes
3
performance of MPI_Bcast, since this dominates communication costs in Qbox (see Table 1). Figure 2 shows the communication pattern of a single broadcast on a 4x4 plane of BG/L nodes using three eight-node communicators. Broadcasts over a compact rectangle of nodes (left panel), which use the torus network's broadcast functionality, have the most balanced packet counts as well as the lowest maximum count. When we split the nodes across multiple lines, resulting in disjoint sets of nodes (middle panel), the broadcast requires significantly more packets with less balanced link utilization. The node mappings in Figures 1(c) and 1(d), on the other hand, lead to more balanced link utilization (as illustrated in Figure 2, right panel) and hence to higher overall performance.
Figure 1. Illustration of different node mappings (a)-(d) for a 64k-node partition. Each color represents the nodes belonging to one 512-node column of the process grid.
This analysis of Qbox communication led to several node mapping optimizations. In particular, our initial mappings did not optimize the placement of tasks within communicators. We have refined the bipartite mapping shown in Figure 1c to map tasks with a variant of Z ordering within a plane. This modification effectively isolates the subtree of a binomial software tree broadcast within subplanes of the torus. In addition, we observed that substantial time was spent in
MPI_Type_commit calls in the BLACS library. The types being created were in many cases just contiguous native types, which allowed us to hand-optimize BLACS to eliminate calls in these cases. We are investigating other possible optimizations, including multi-level collective operations [13] that could substantially improve the performance of the middle configuration in Figure 2.
Francois Gygi et al. Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). ACM, New York, NY, USA
Performance with the four mappings in Figure 1: (a) 39.5 TFlop/s, (b) 38.2 TFlop/s, (c) 64.0 TFlop/s, (d) 64.7 TFlop/s
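The packet-balance argument above can be illustrated with a toy metric. This is an illustrative sketch (not Qbox's or BG/L's mapping code): it compares the average pairwise torus hop distance, a rough proxy for link load, for an eight-node communicator placed on a compact 2x2x2 block versus the same eight nodes strung out along a single torus line. The torus size and node sets are made-up examples.

```python
from itertools import combinations, product

def torus_hops(a, b, dims):
    """Hop count between two nodes on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def avg_pairwise_hops(nodes, dims):
    """Average pairwise hop distance -- a rough proxy for link load."""
    pairs = list(combinations(nodes, 2))
    return sum(torus_hops(a, b, dims) for a, b in pairs) / len(pairs)

dims = (8, 8, 8)  # toy torus, stand-in for a BG/L partition

# Eight-node communicator on a compact 2x2x2 sub-block...
compact = list(product(range(2), repeat=3))
# ...versus the same eight nodes strung out along a single torus line.
spread = [(x, 0, 0) for x in range(8)]

print(avg_pairwise_hops(compact, dims))  # 12/7, about 1.71 hops
print(avg_pairwise_hops(spread, dims))   # 16/7, about 2.29 hops
```

The compact block keeps every pair within a few links, which is why the torus broadcast hardware can balance packets over it; the stretched placement forces longer, more contended routes.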
-
OpenAtom: ab-initio MD
• Transposes, multicasts and reductions
• Significant contention on the network
4
Figure 2. Mapping of different chare arrays to the 3D torus of the machine
(state-wise communication). The number of planes in GSpace is different from that in RealSpace. GSpace also interacts with the PairCalculator arrays. Each plane of GSpace, G(∗, p), interacts with the corresponding plane, P(∗, ∗, p), of the PairCalculators (plane-wise communication) through multicasts and reductions. So, GSpace interacts state-wise with RealSpace and plane-wise with PairCalculators. If all planes of GSpace are placed together, then the transpose operation is favored, but if all states of GSpace are placed together, the multicasts/reductions are favored. To strike a balance between the two extremes, a hybrid map is built, where a subset of planes and states of these three arrays are placed on one processor.

Mapping GSpace and RealSpace Arrays: Initially, the GSpace array is placed on the torus and other objects are mapped relative to GSpace's mapping. The 3D torus is divided into rectangular boxes (referred to as "prisms") such that the number of prisms is equal to the number of planes in GSpace. The longest dimension of each prism is chosen to be the same as one dimension of the torus. Each prism is used for all states of one plane of GSpace. Within each prism for a specific plane, the states in G(∗, p) are laid out in increasing order along the long axis of the prism. Once GSpace is mapped, the RealSpace objects are placed. Prisms perpendicular to the GSpace prisms are created by including the processors holding all planes for a particular state of GSpace, G(s, ∗); the corresponding states of RealSpace, R(s, ∗), are mapped onto these prisms. Figure 2 shows the GSpace objects (on the right) and the RealSpace objects (in the foreground) being mapped along the long dimension of the torus (box in the center).

Mapping of Density Arrays: RhoR objects communicate with RealSpace plane-wise and hence Rρ(p, ∗) have to be placed close to R(∗, p). To achieve this, we start with the centroid of the prism used by R(∗, p) and place RhoR objects in proximity to it. RhoG objects, Gρ(p), are mapped near RhoR objects, Rρ(p, ∗), but not on the same processors as RhoR, to maximize overlap. The density computation is inherently smaller and hence occupies the center of the torus.
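The prism construction described above can be sketched in a few lines. This is a toy model, not OpenAtom's actual mapping code: the torus dimensions are made up, and the strip-wise tiling of the Y-Z face into prisms is an assumption (any tiling into equal rectangles would do).

```python
def build_gspace_map(torus_dims, n_planes, n_states):
    """Toy prism mapping: split the torus into n_planes prisms, each
    spanning the full longest (X) dimension, and lay the states of a
    plane out in increasing order along that long axis."""
    X, Y, Z = torus_dims  # assume X is the longest torus dimension
    assert (Y * Z) % n_planes == 0
    cells_per_prism = (Y * Z) // n_planes
    placement = {}
    for p in range(n_planes):
        # this prism's share of the Y-Z face (tiled as strips, an assumption)
        face = [(yz // Z, yz % Z)
                for yz in range(p * cells_per_prism, (p + 1) * cells_per_prism)]
        # processors of prism p: full X extent times its face cells
        prism = [(x, y, z) for x in range(X) for (y, z) in face]
        for s in range(n_states):
            # states of G(*, p) in increasing order along the long axis
            placement[(s, p)] = prism[(s * len(prism)) // n_states]
    return placement

# 8x4x4 toy torus, 16 GSpace planes, 8 states per plane
pm = build_gspace_map((8, 4, 4), n_planes=16, n_states=8)
# all states of one plane share a prism, spread along the X axis;
# RealSpace would then be mapped onto the perpendicular prisms
```

The same placement dictionary is what the perpendicular RealSpace prisms would be derived from: for a fixed state s, the processors holding (s, p) across all planes p form the prism for R(s, ∗).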
[Plot: OpenAtom time per step (s) vs. number of cores (1024, 2048, 4096, 8192), Default Mapping vs. Topology Mapping]
Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kale, 2011. Optimizing communication for Charm++ applications by reducing network contention. Concurr. Comput. : Pract. Exper. 23, 2, pp. 211-222, February 2011.
-
ddcMD: PPPM algorithm
• Optimize communication within the short-range and long-range subcommunicators, and also between the two
• Reduction in comm. time:
  • short-range: 3 times
  • long-range (FFT): 2 times
  • between the two: 50%
• Overall reduction: 36% on 32K nodes (64 x 32 x 16)
5
Figure 2: Task layout on the BlueGene/P torus. Cyan arrows indicate communication from MD tasks to collector tasks, blue arrows indicate communication from collector tasks to FFT tasks.

is the reduction of the contributions to ρ at each point from these multiple tasks to a single sum.

To perform the reduction we nominate a subset of the particle tasks as "collector" tasks (see Fig. 2). Each mesh point is uniquely assigned to a collector task that is responsible for gathering all contributions to ρ for that mesh point and performing the sum. The number and arrangement of the collector tasks is a tunable parameter and all communication is local.
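The unique mesh-point-to-collector ownership can be sketched as follows. The nearest-collector rule used here is an assumption for illustration — the text only requires that each mesh point have exactly one owner and that communication be local.

```python
def owning_collector(mesh_point, collectors):
    """Assign a mesh point to the nearest collector task (ties broken
    deterministically by coordinate), so every contribution to rho at
    that point is reduced by exactly one owner."""
    mx, my, mz = mesh_point

    def dist2(c):
        return (c[0] - mx) ** 2 + (c[1] - my) ** 2 + (c[2] - mz) ** 2

    # min over (distance, coordinates) makes the assignment unique
    return min(collectors, key=lambda c: (dist2(c), c))

# toy 2D-ish arrangement of four collector tasks
collectors = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (4, 4, 0)]
assert owning_collector((1, 1, 0), collectors) == (0, 0, 0)
assert owning_collector((5, 4, 0), collectors) == (4, 4, 0)
```

Because the rule is a pure function of the mesh point, every particle task can compute the owner of each point it populates without any coordination, which is what keeps Stage 1 communication local.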
In Stage 2 of the communication, each collector task sends mesh information to the appropriate mesh task using MPI_Isend. We think of this stage as a gather operation since the mesh data is being gathered to the mesh tasks. This communication is long-range, but can be efficiently organized.

Once the long-range portion of the potential has been calculated, we perform the communication stages again but in reverse order. The mesh tasks scatter mesh data back to the collector tasks, which in turn send values to the neighboring particle tasks. To maximize the overlap of communication and computation, the MPI_Irecvs on the collector tasks are posted before we start the pair calculation. This allows data to begin moving from the mesh tasks as soon as it is available.

In benchmark calculations on Dawn using 144,384 processors (9,216 mesh tasks and 135,168 particle tasks, of which 22,400 are collector tasks), we observe that when work is properly balanced between the particle and mesh tasks there is a pronounced asymmetry in the communication times between the particle and mesh tasks. Although the mesh tasks spend roughly 15% of total runtime waiting for data to arrive, the particle tasks spend under 2% of the total runtime sending it. Asynchronous communication allows the particle tasks to continue with other work while the communication proceeds. Hence, we have successfully accomplished the goal of minimizing communication time on the particle tasks by overlapping the communication of mesh data with the explicit pair computation.

This success demonstrates the effectiveness of the direct memory access (DMA) engine that was added to the BG/P design as an improvement over BG/L [23]. The DMA is directly coupled to the L3 (shared) cache on each node and is responsible for sending and receiving data to and from the torus network. The CPU is thus relieved of these tasks and is free to continue on to other computations. From comparisons with benchmark simulations performed on BG/L, it is clear that there is a significant benefit from the DMA.

The two-stage approach just described has at least two advantages over a single-stage method in which each particle task simply sends all of the mesh points it populates to the appropriate mesh tasks and the reduction of partial sums is performed on the mesh tasks. The first advantage is a reduction of communication bandwidth from the particle to the mesh tasks. Although the number of mesh points to which a particle task contributes varies with ng, for our typical problems of interest it is roughly 2-5 times the number that lie strictly within its computational domain. Hence a single-stage solution would require 2-5 times the network bandwidth to complete communication in the same time. In the two-stage approach the mesh points are gathered and reduced locally, so a larger number of torus links can be active, increasing the aggregate bandwidth available to communicate mesh points. A second advantage is that the number of collector tasks can be tuned to optimize total communication cost. Changing the number of collector tasks allows trade-offs between the number of messages sent to each mesh task in Stage 2 (with corresponding changes in message size) and the bandwidth available for the reduction in Stage 1.
4.3 Layout

For a 3D torus network, the assignment of MPI tasks onto compute nodes at specific torus coordinates can significantly impact parallel efficiency at full machine scale. It is necessary to optimize communication both within the short-range and long-range subcommunicators, as well as between the two. For the latter, we focused on splitting the torus into separate sections for the particle and mesh tasks such that communication between the "collector" tasks described in section 4.2 and the mesh tasks takes place along a single torus dimension to reduce contention and avoid bottlenecks, as illustrated in Figure 2. The tasks are then ordered within each subcommunicator to provide nearest-neighbor communication for spatially adjacent particle tasks and reduce transpose communication times for the mesh tasks.

For a system of 1.2 billion particles on a 64 × 32 × 16 Blue Gene/P partition (32,768 nodes, 131,072 tasks), we see a significant decrease in communication times using a custom task map constructed as described above compared with the default (TXYZ) layout. The total run time decreased by 36% when the custom mapping was used, with the greatest improvement being seen in the intra-particle task communication times, which decreased by a factor of 3, likely due to the poor correspondence of the default mapping to the simulation box shape. Communication within the 3D FFT was decreased by over a factor of 2, and communication between the particle and mesh tasks was more than 50% faster. These results highlight the need to carefully understand and optimize the communication patterns on large torus networks.
4.4 FFT Implementation

As described in Section 2.1, the long-range interaction term in Eqn. 3 involves the use of a 3D Fourier transform between the real-space (real-valued) density and the k-space (complex-valued) density. To obtain optimal 3D FFT performance in the massively-parallel regime, a custom real-to-complex 3D Fast Fourier Transform implementation (bigFFT) was developed using a 2D decomposition.
D. F. Richards et al. Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, New York, NY, USA
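The 2D (pencil) decomposition behind such an FFT can be emulated sequentially. This is an illustrative sketch, not the actual bigFFT code: a pr x pc grid of "tasks" each always holds a full line along one axis, and the transposes between 1D-FFT phases are modeled by reassembling and re-splitting the global array.

```python
import numpy as np

def fft3d_2d_decomp(a, pr, pc):
    """Emulate a 3D FFT with a 2D pencil decomposition over a pr x pc
    task grid: 1D FFTs along the locally-complete axis alternate with
    'transposes' that reshuffle which axis each task holds in full."""
    nx, ny, nz = a.shape
    assert nx % pr == 0 and ny % pr == 0 and ny % pc == 0 and nz % pc == 0

    def scatter(g, axes):
        """Split g into pr x pc blocks along the two axes in `axes`."""
        ra, ca = axes
        out = {}
        for i in range(pr):
            for j in range(pc):
                s = [slice(None)] * 3
                s[ra] = slice(i * g.shape[ra] // pr, (i + 1) * g.shape[ra] // pr)
                s[ca] = slice(j * g.shape[ca] // pc, (j + 1) * g.shape[ca] // pc)
                out[(i, j)] = g[tuple(s)].copy()
        return out

    def gather(blocks, axes, shape):
        """Reassemble the global array (models the all-to-all transpose)."""
        ra, ca = axes
        g = np.empty(shape, dtype=complex)
        for (i, j), blk in blocks.items():
            s = [slice(None)] * 3
            s[ra] = slice(i * shape[ra] // pr, (i + 1) * shape[ra] // pr)
            s[ca] = slice(j * shape[ca] // pc, (j + 1) * shape[ca] // pc)
            g[tuple(s)] = blk
        return g

    g = a.astype(complex)
    # Phase 1: z-pencils (split over x, y), 1D FFTs along z
    blocks = {k: np.fft.fft(v, axis=2) for k, v in scatter(g, (0, 1)).items()}
    # Transpose 1, then y-pencils (split over x, z), FFTs along y
    g = gather(blocks, (0, 1), a.shape)
    blocks = {k: np.fft.fft(v, axis=1) for k, v in scatter(g, (0, 2)).items()}
    # Transpose 2, then x-pencils (split over y, z), FFTs along x
    g = gather(blocks, (0, 2), a.shape)
    blocks = {k: np.fft.fft(v, axis=0) for k, v in scatter(g, (1, 2)).items()}
    return gather(blocks, (1, 2), a.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))
assert np.allclose(fft3d_2d_decomp(x, 2, 4), np.fft.fftn(x))
```

Compared with a 1D (slab) decomposition, the pencil scheme scales to pr*pc tasks rather than at most nx, which is why it suits the massively-parallel regime the excerpt describes.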
-
Other applications
• Blue Matter - Blake G. Fitch et al. Blue matter: approaching the limits of concurrency for classical molecular dynamics. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). ACM, New York, NY, USA
• NAMD - Abhinav Bhatele et al. Dynamic topology aware load balancing algorithms for molecular dynamics applications. In Proceedings of the 23rd international conference on Supercomputing (ICS '09). ACM, New York, NY, USA
• NAS BT, CG, MG - Brian E. Smith et al. Performance Effects of Node Mappings on the IBM Blue Gene/L Machine. In Euro-Par, pages 1005–1013, 2005
• SAGE, UMT2000 - G. Bhanot et al. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2/3):489–500, 2005
• GTC Particle-in-cell - Leonid Oliker et al. Scientific Application Performance on Candidate PetaScale Platforms. In Proceedings of IEEE Parallel and Distributed Processing Symposium (IPDPS), March 2007
6
-
Cray XT/XE
-
pF3D
• Slowest communicators are those that are most spread out on the network
• Runs on Cielo - Cray XE6 at LANL
8
Fig. 4: Message passing during 2D FFTs for a single xy-slab. (a) Communication patterns for xyzt in the x-direction, color coded by communicator; only links in a single direction are used. (b) Communication for block-mapped data, color coded by communicator; the communicating nodes are distributed within the 4x4x4 block and use links in multiple directions. (c) xyzt in the y-direction uses vertical links. (d) Y-communication for the block-mapped scheme uses links in multiple directions.
Fig. 5: Message passing for y-communication on Cielo. (a) The locations on the Cielo torus are shown for the five fastest y-communicators. (b) The locations on the Cielo torus are shown for the five slowest y-communicators. The slowest communicators are all "long and skinny" and are thus subject to contention for bandwidth along the links in the z-direction.
The Cielo message passing performance doesn't show a strong trend as a function of the number of processes. The variability among the three runs with 1024 processes appears to be larger than the differences due to the number of processes. All runs used a 16x16xN decomposition so that X messages are all passed on node. All Y messages go off node and are significantly slower than the X messages.
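Why X messages stay on node follows from the rank-to-node arithmetic. The sketch below assumes the default x-fastest rank ordering and 16 cores per Cielo node; both are consistent with the text, but the exact ordering is an assumption, not taken from the pF3D source.

```python
CORES_PER_NODE = 16  # Cielo XE6 nodes in these runs (assumed)

def domain_of(rank, nx=16, ny=16):
    """x-fastest rank -> (ix, iy, iz) for an nx x ny x N decomposition."""
    return (rank % nx, (rank // nx) % ny, rank // (nx * ny))

def same_node(r1, r2):
    """Ranks are packed onto nodes in blocks of CORES_PER_NODE."""
    return r1 // CORES_PER_NODE == r2 // CORES_PER_NODE

# Each row of 16 x-domains fills exactly one 16-core node, so +x
# neighbors share a node while +y neighbors never do.
for rank in range(16 * 16 * 4):
    ix, iy, iz = domain_of(rank)
    if ix + 1 < 16:                          # +x neighbor is rank + 1
        assert same_node(rank, rank + 1)     # X messages stay on node
    if iy + 1 < 16:                          # +y neighbor is rank + 16
        assert not same_node(rank, rank + 16)  # Y messages go off node
```

With a 16x16xN decomposition the 16-domain x-rows tile the 16-core nodes exactly, which is the mechanism behind the on-node/off-node split the text reports.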
The message passing rate during the 32K process Cielo run varied between 80 and 100 MB/s for different batch jobs. The variability appears to be due to the placement of the communicators on the interconnect. The run used a 16x16x128 decomposition. The message passing rate for Y messages is probably about 1 GB/s per node. That is significantly less than the roughly 5 GB/s rate that MPI benchmarks achieve.
We investigated the effect of varying the MPI eager limit. It made a noticeable difference in smaller runs but does not appear to help for large runs. The likely explanation is that a large run is statistically very likely to have a communicator that is slow, and that masks any effects of varying the MPI parameter. We plan to investigate custom mappings between the physical location of domains and their placement on the interconnect in future work.
Steven H. Langer et al. Cielo Full-System Simulations of Multi-Beam Laser-Plasma Interaction in NIF Experiments, In Proceedings of Cray User Group, 2011.
-
OpenAtom: ab-initio MD
9
Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kale, 2011. Optimizing communication for Charm++ applications by reducing network contention. Concurr. Comput. : Pract. Exper. 23, 2, pp. 211-222, February 2011.
[Plot: OpenAtom time per step (s) vs. number of cores (512, 1024, 2048), Default Mapping vs. Topology Mapping]
• Job schedulers on Cray machines are typically not topology aware
• Performance Benefit at 2048 cores: 40% (XT3), 45% (BG/P), 41% (BG/L)
-
Other work
• M. Muller and Michael Resch. PE mapping and the congestion problem in the T3E. In Proceedings of the Fourth European Cray-SGI MPP Workshop, Garching, Germany, 1998
• Thierry Cornu and Michel Pahud. Contention in the Cray T3D Communication Network. In Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing, Volume II, pages 689–696, London, UK, 1996. Springer-Verlag
• Eduardo Huedo, Manuel Prieto, Ignacio Martin Llorente, and Francisco Tirado. Impact of PE Mapping on Cray T3E Message-Passing Performance. In Euro-Par '00: Proceedings of the 6th International Euro-Par Conference on Parallel Processing, pages 199–207, London, UK, 2000. Springer-Verlag
• Deborah Weisser, Nick Nystrom, Chad Vizino, Shawn T. Brown, and John Urbanic. Optimizing Job Placement on the Cray XT3. 48th Cray User Group Proceedings, 2006
10
-
Infiniband clusters
-
Optimizing collective communication
12
D. K. Panda, K. Schulz, B. Barth and A. Majumdar, Topology-Aware MPI Communication and Scheduling for Petascale Systems, Poster, NSF STCI, https://confluence.pegasus.isi.edu/download/attachments/5242944/topology-aware-poster.pdf
Topology-Aware MPI Communication and Scheduling for Petascale Systems
PIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center), and A. Majumdar (San Diego Supercomputer Center)
[Poster sections: Motivation; Vision and Problem Statement; Framework and Approach; Network Topology of TACC Ranger; Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger; Application Level Performance Impact: A Case Study with MPCUGLES; Topology Aware MPI_Gather Design; Topology Aware MPI_Scatter Design; Conclusions and Continuing Work]
[Figure panels: Job allocation for the entire system; jobs using 16-4800 cores; jobs using 4800-16000 cores; jobs using 16000-64000 cores]
Table 1: Data Collected from TACC Ranger System
Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information
Research Challenge:
Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?
• On the left, we compare performance of MPCUGLES on the Normal Batch Queue as compared to runs conducted on an exclusive queue
• We observe that performance may be impacted by up to 15%
• On the right, we compare performance of MPCUGLES on the Normal Batch Queue, but with special randomization of hostfiles
• We observe that there is greater variance in performance, with up to 16% difference between best-case and worst-case runs
[Framework diagram: a Dynamic State & Topology Management Framework with a Unified Abstraction Layer exposes a Topology Information Interface (Topology Graph, Network Status Graph), built on an Enhanced Subnet Management Layer (System Topology Discovery, Traffic Monitoring) over the high performance interconnect and Ethernet network management. It feeds Topology Aware Task Mapping and Topology Aware Communication (Collectives, Point to Point, Integrated Evaluation) as well as a Job Scheduler with Topology-Aware Scheduling. MPI applications (Turbulence Prediction, Earthquake Modeling, Flow Modeling, Kinetic Simulation) supply Application Hints and Profiling Information, with Performance Feedback returned.]
[Figure: Ranger network topology — Racks 1 through 82, each connected to InfiniBand (Magnum) Switches 1 and 2]
• Ranger's compute nodes are a blade-based configuration: 12 blades in a chassis, 4 chassis in a rack, 82 racks in all, for a total of 4096 compute nodes.
• Each chassis embeds a NEM (network express module) combining a 24-port leaf switch with 12 dual-rail SDR HCAs, one per node.
• Each NEM is connected to the core Magnum switch(es) with 4 12X connectors, 2 to each Magnum.
[Plot: MPCUGLES performance normalized to the exclusive queue vs. run # — Batch Queue with Normal Ordering on 192 cores]
[Plot: MPCUGLES performance normalized to the exclusive queue vs. run # — Batch Queue with Random Ordering on 192 cores]
Research Questions:
(1) What are the topology aware communication and scheduling requirements of petascale applications?
(2) How to design a network topology and state management framework with static and dynamic network information?
(3) How to design topology-aware point-to-point and collective communication schemes?
(4) How to design topology-aware task mapping and scheduling schemes?
(5) How to design a flexible topology information interface?
• On the left, we compare performance of the default "Binomial Tree" algorithm under quiet and busy conditions
• We observe that the algorithm is impacted by background traffic
• On the right, we compare performance of the proposed topology-aware algorithm under quiet and busy conditions
• The proposed algorithm outperforms the default "Binomial Tree" algorithm under both quiet and busy conditions
• 23% performance improvement under quiet network conditions and 10% under busy conditions
• The graphs present an analysis of the jobs run on the TACC Ranger system in September '09
• There were a total of 19,441 multi-node jobs, most of which used 16-4800 cores
• We observe that for the majority of jobs, the average inter-node distance is significantly more than the best possible
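The average inter-node distance metric can be sketched with a toy hop model of the Ranger hierarchy. The hop weights (2 within a leaf switch, 6 across the spine) are made-up stand-ins, not the poster's actual values; the point is the gap between a contiguous and a scattered allocation.

```python
BLADES_PER_CHASSIS = 12  # Ranger: 12 nodes share a leaf switch (NEM)

def hops(n1, n2):
    """Toy switch-hop model: 2 hops if two nodes share a chassis/leaf
    switch, 6 if the message must cross a spine (Magnum) switch."""
    if n1 == n2:
        return 0
    same_leaf = n1 // BLADES_PER_CHASSIS == n2 // BLADES_PER_CHASSIS
    return 2 if same_leaf else 6

def avg_internode_distance(nodes):
    """Average pairwise hop count over a job's node allocation."""
    pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)

contiguous = list(range(12))              # one full chassis: best possible
scattered = list(range(0, 12 * 12, 12))   # one node from each of 12 chassis

print(avg_internode_distance(contiguous))  # 2.0
print(avg_internode_distance(scattered))   # 6.0
```

A scheduler that reports this metric per job is what makes the "significantly more than the best possible" comparison in the bullet above quantifiable.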
• Modern high-end computing (HEC) systems enable scientists to tackle grand challenge problems
• Design and deployment of such ultra-scale HEC systems is being fueled by the increasing use of multi-core/many-core architectures and commodity networking technologies like InfiniBand.
• As a recent example, the TACC Ranger system was deployed with a total of 62,976 cores using a fat-tree InfiniBand interconnect to provide a peak performance of 579 TFlops.
• Most current petascale applications are written using the Message Passing Interface (MPI) programming model.
• By necessity, large-scale systems that support MPI are built using hierarchical topologies (multiple levels involving intra-socket, intra-node, intra-blade, intra-rack, and multi-stages within a high-speed switch).
• Current-generation MPI libraries and schedulers do not take these various levels into account when optimizing communication
• Consequently, this leads to non-optimal performance and scalability for many applications.
Process Location   Number of Hops                MPI Latency (us)
Intra-Chassis      0 Hops in Leaf Switch         1.57
Intra-Rack         1 Hop in Leaf Switch          2.04
Inter-Chassis      3 Hops Across Spine Switch    2.45
Inter-Rack         5 Hops Across Spine Switch    2.85
Results of current work:
• We have observed a major impact on end applications if schedulers and communication libraries are not topology aware
• We have proposed topology-aware collective communication algorithms for MPI_Gather and MPI_Scatter
• Our proposed algorithms outperform the default implementation under both quiet and busy network conditions
Continuing work:
• Work towards a topology-aware scheduling scheme
• Adapt more collective algorithms dynamically according to topology, interfacing with schedulers
• Gather more data from real-world application runs
• Integrated solutions will be available in future versions of MVAPICH/MVAPICH2 software
Publications:
• K. Kandalla, H. Subramoni and D. K. Panda, "Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather", Communication Architecture for Clusters (CAC) Workshop, in conjunction with IPDPS 2010
Additional Personnel:
• Hari Subramoni, Krishna Kandalla, Sayantan Sur, Karen Tomko (OSU)
• Mahidhar Tatineni, Yifeng Cui, Dmitry Pekurovsky (SDSC)
• We conduct experiments with the MVAPICH2 stack, which is one of the most popular MPI implementations over InfiniBand, currently used by more than 1,050 organizations worldwide (http://www.mvapich.cse.ohio-state.edu)
• On the left, we compare performance of the default "Binomial Tree" algorithm under quiet and busy conditions
• We observe that the default algorithm is very sensitive to background traffic, with degradation up to 21% for large messages
• On the right, we compare performance of the proposed topology-aware algorithm
• The proposed algorithm outperforms the default "Binomial Tree" algorithm under both quiet and busy conditions
• Over 50% performance improvement even when the network was busy
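The idea behind these topology-aware collectives can be sketched as a two-level gather: ranks first gather to a leader on their own leaf switch, and only one leader per switch crosses the spine to the root. This is an illustrative sequential model, not the MVAPICH2 implementation; the switch-of-rank function and group sizes are made-up examples.

```python
from collections import defaultdict

def topo_aware_gather(data, switch_of, root=0):
    """Two-level gather: each leaf switch gathers to a local leader,
    then only the leaders send across the spine to the root. Returns
    the gathered values in rank order plus the spine-message count."""
    groups = defaultdict(list)
    for rank in sorted(data):
        groups[switch_of(rank)].append(rank)
    spine_msgs = 0
    result = {}
    for sw, ranks in groups.items():
        leader_chunk = {r: data[r] for r in ranks}  # intra-switch gather
        if sw != switch_of(root):
            spine_msgs += 1  # one leader-to-root message crosses the spine
        result.update(leader_chunk)
    return [result[r] for r in sorted(result)], spine_msgs

# 48 ranks, 12 per leaf switch (one Ranger chassis each)
data = {r: r * r for r in range(48)}
gathered, spine = topo_aware_gather(data, switch_of=lambda r: r // 12)
assert gathered == [r * r for r in range(48)]
assert spine == 3  # vs. 36 spine crossings for a flat gather to rank 0
```

Concentrating the spine traffic into one aggregated message per leaf switch is what makes the algorithm less sensitive to background traffic than a topology-oblivious binomial tree.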
Topology-Aware MPI Communication and Scheduling for Petascale SystemsPIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center)
and A. Majumdar (San Diego Supercomputer Center)
Motivation Vision and Problem Statement Framework and Approach
Network Topology of TACC Ranger Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger
Application Level Performance Impact: A Case Study with MPCUGLES
Topology Aware MPI_Gather Design
Topology Aware MPI_Scatter Design Conclusions and Continuing Work
Job allocation for the entire system Jobs using 16-4800 cores
Jobs using 4800-16000 cores Jobs using 16000-64000 cores
Table 1: Data Collected from TACC Ranger System
Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information
Research Challenge:
Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?
• On the left, we compare performance of MPCUGLES on the Normal Batch Queue as compared to runs conducted on an exclusive queue
• We observe that performance may be impacted by up to 15%• On the right, we compare performance of MPCUGLES on the Normal Batch
Queue, but with special randomization of hostfiles
• We observe that there is greater variance in performance with up to 16% difference between best case and worst case runs
Topology AwareTask Mapping
Topology Aware Communication
Collectives Point To PointIntegratedEvaluation
Topology Information Interface
Topology Graph Network Status Graph
Dynamic State & Topology Management Framework
Unified Abstraction Layer
EthernetNetwork Management System Topology
DiscoveryTraffic
Monitoring
Enhanced Subnet Management Layer
High Performance Interconnect
Job Scheduler
Topology-Aware Scheduling
Performance Feedback
Dependency
Legend:
Profiling Information
TurbulencePrediction
EarthquakeModeling
MPI Applications
FlowModeling
KineticSimulation
Application Hints
Rack 1 Rack 2 Rack 82
InfiniBand Switch (Magnum) 1 InfiniBand Switch (Magnum) 2
• Ranger's compute nodes are a blade-based configuration• 12 blades in a chassis, 4 chassis in a rack, 82 racks in all for a total of 4096
compute nodes.
• Each chassis embeds a NEM (network express module) combining a 24-port leaf switch with 12 dual-rail SDR HCA's, one per node.
• Each NEM is connected to the core Magnum switch(es) with 4, 12X connectors, 2 to each Magnum.
80 1 2 3 4 5 6 7
1.3
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
Run #
Perf
orm
ance
Nor
mal
ized
to E
xclu
sive
Que
ue
Batch Queue with Normal Ordering on 192 cores
60 1 2 3 4 5
1.3
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
Run #
Perf
orm
ance
Nor
mal
ized
to E
xclu
sive
Que
ue
Batch Queue with Random Ordering on 192 cores
Research Questions:
(1) What are the topology aware communication and scheduling requirements of petascale applications?
(2) How to design a network topology and state management framework with static and dynamic network information?
(3) How to design topology-aware point-to-point and collective communication schemes?
(4) How to design topology-aware task mapping and scheduling schemes?
(5) How to design a flexible topology information interface?
• On the left, we compare performance of the default “Binomial Tree” algorithm under quiet and busy conditions
• We observe that the algorithm is impacted by background traffic• On the right, we compare performance of proposed topology-aware algorithm
under quiet and busy conditions
• Proposed algorithm out-performs default “Binomial Tree” algorithm under both quiet and busy conditions
• 23% performance improvement under quiet network conditions and 10% under busy conditions
• The graphs present the analysis of the jobs run on TACC ranger system in September '09 • There were a total of 19,441multi-node jobs, most of which used 16-4800 cores• We observe that for the majority of jobs, average inter-node distance is significantly more
than the best possible
• Modern high-end computing (HEC) systems enable scientists to tackle grand challenge problems
• Design and deployment of such ultra-scale HEC systems is being fueled by the increasing use of multi-core/many-core architectures and commodity networking technologies like InfiniBand
• As a recent example, the TACC Ranger system was deployed with a total of 62,976 cores using a fat-tree InfiniBand interconnect to provide a peak performance of 579 TFlops
• Most current petascale applications are written using the Message Passing Interface (MPI) programming model
• By necessity, large-scale systems that support MPI are built using hierarchical topologies (multiple levels involving intra-socket, intra-node, intra-blade, intra-rack, and multiple stages within a high-speed switch)
• Current generation MPI libraries and schedulers do not take these various levels into account when optimizing communication
• Consequently, this leads to non-optimal performance and scalability for many applications
Process Location    Number of Hops                MPI Latency (us)
Intra-Chassis       0 Hops in Leaf Switch         1.57
Intra-Rack          1 Hop in Leaf Switch          2.04
Inter-Chassis       3 Hops Across Spine Switch    2.45
Inter-Rack          5 Hops Across Spine Switch    2.85
Results of current work:
• We have observed a major impact on end applications if schedulers and communication libraries are not topology aware
• We have proposed topology-aware collective communication algorithms for MPI_Gather and MPI_Scatter
• Our proposed algorithms outperform the default implementation under both quiet and busy network conditions
Continuing work:
• Work towards a topology-aware scheduling scheme, interfacing with schedulers
• Adapt more collective algorithms dynamically according to topology
• Gather more data from real-world application runs
• Integrated solutions will be available in future versions of MVAPICH/MVAPICH2 software
Publications:
• K. Kandalla, H. Subramoni and D. K. Panda, “Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather”, Communication Architecture for Clusters (CAC) Workshop, in conjunction with IPDPS 2010
Additional Personnel:
• Hari Subramoni, Krishna Kandalla, Sayantan Sur, Karen Tomko (OSU)
• Mahidhar Tatineni, Yifeng Cui, Dmitry Pekurovsky (SDSC)
• We conduct experiments with the MVAPICH2 stack, one of the most popular MPI implementations over InfiniBand, currently used by more than 1,050 organizations worldwide (http://www.mvapich.cse.ohio-state.edu)
• On the left, we compare performance of the default “Binomial Tree” algorithm under quiet and busy conditions
• We observe that the default algorithm is very sensitive to background traffic, with degradation of up to 21% for large messages
• On the right, we compare performance of the proposed topology-aware algorithm
• The proposed algorithm outperforms the default “Binomial Tree” algorithm under both quiet and busy conditions
• Over 50% performance improvement even when the network was busy
Topology-Aware MPI Communication and Scheduling for Petascale Systems
PIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center) and A. Majumdar (San Diego Supercomputer Center)
Poster sections: Motivation; Vision and Problem Statement; Framework and Approach; Network Topology of TACC Ranger; Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger; Application Level Performance Impact: A Case Study with MPCUGLES; Topology Aware MPI_Gather Design; Topology Aware MPI_Scatter Design; Conclusions and Continuing Work
[Charts: Job allocation for the entire system; Jobs using 16-4800 cores; Jobs using 4800-16000 cores; Jobs using 16000-64000 cores]
Table 1: Data Collected from TACC Ranger System
Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information
Research Challenge:
Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?
• On the left, we compare performance of MPCUGLES on the normal batch queue against runs conducted on an exclusive queue
• We observe that performance may be impacted by up to 15%
• On the right, we compare performance of MPCUGLES on the normal batch queue, but with special randomization of hostfiles
• We observe that there is greater variance in performance, with up to a 16% difference between best-case and worst-case runs
[Framework diagram: MPI applications (turbulence prediction, earthquake modeling, flow modeling, kinetic simulation) pass application hints to topology-aware communication (collectives, point-to-point, integrated evaluation) and topology-aware task mapping, built on a topology information interface (topology graph, network status graph). A dynamic state and topology management framework with a unified abstraction layer sits over the Ethernet network management system (topology discovery, traffic monitoring) and an enhanced subnet management layer on the high performance interconnect. The job scheduler performs topology-aware scheduling, driven by performance feedback and profiling information.]
[Diagram: Racks 1 through 82, each connected to InfiniBand Magnum switches 1 and 2]
• Ranger's compute nodes are a blade-based configuration
• 12 blades in a chassis, 4 chassis in a rack, 82 racks in all, for a total of 3,936 compute nodes
• Each chassis embeds a NEM (network express module) combining a 24-port leaf switch with 12 dual-rail SDR HCAs, one per node
• Each NEM is connected to the core Magnum switch(es) with four 12X connectors, two to each Magnum
https://confluence.pegasus.isi.edu/download/attachments/5242944/topology-aware-poster.pdf
-
Mapping is important
• As evidenced by several applications on Blue Gene
• Also on Cray machines, even though:
  • Link bandwidth: 3.8 GB/s (XT3), 0.425 (BG/P), 0.175 (BG/L)
  • Bytes per flop: 8.77 (XT3), 0.375 (BG/P and BG/L)
• Even more important now that:
  • Bytes per flop that the network can handle is reducing: 8.77 (XT3), 1.36 (XT4), 0.23 (XT5), 0.23-0.46 (XE6)★
★ Based on BigBen (XT3, PSC), Jaguar (XT4/XT5, ORNL) and Hopper (XE6, NERSC)
-
Wormhole Routing
• Ni et al. 1993; Oh et al. 1997: equation for modeling message latencies
• Supercomputers were relatively small then, so it was safe to assume message latencies were independent of distance
...sharing network resources. For common networks with asymptotically inadequate link bandwidth, chances of contention increase as messages travel farther and farther. Network congestion on a link slows down all messages passing through that link. Delays in message delivery can affect overall application performance. Thus, it becomes necessary to consider the topology of the machine while mapping parallel applications to job partitions.

This dissertation will demonstrate that it is not wise to assume that message latencies are independent of the distance a message travels. This assumption has been supported all these years by the advantages of virtual cut-through and wormhole routing, which suggest that message latency is independent of distance in the absence of blocking [5-12]. When virtual cut-through or wormhole routing is deployed, message latency is modeled by the equation

    (L_f / B) * D + L / B    (1.1)

where L_f is the length of each flit, B is the link bandwidth, D is the number of links (hops) traversed, and L is the length of the message. In the absence of blocking, for sufficiently large messages (where L_f * D is much smaller than L), the second term dominates and latency is effectively independent of distance.
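Equation 1.1 can be made concrete with a small sketch; the flit size and bandwidth below are illustrative values, not figures from the talk:

```python
def wormhole_latency(msg_bytes, hops, flit_bytes=32, bandwidth=425e6):
    """Wormhole/cut-through model: (Lf/B) * D + L/B.
    First term: per-hop pipelining cost of the head flit.
    Second term: serialization time of the whole message."""
    return (flit_bytes / bandwidth) * hops + msg_bytes / bandwidth

# For a 1 MB message, 2 hops vs 16 hops differ by well under 1%:
# the L/B term dominates, so latency looks distance-independent.
big_near = wormhole_latency(1_000_000, hops=2)
big_far = wormhole_latency(1_000_000, hops=16)

# For a 64-byte message the hop term dominates and distance matters.
small_near = wormhole_latency(64, hops=2)
small_far = wormhole_latency(64, hops=16)
```

This is exactly the point of the slide: the model hides distance only for large messages on an unloaded network; contention reintroduces distance sensitivity.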
-
Topology friendly supercomputers?
• Topology aware resource allocation
• Ability to query the software stack for network topology information:
  • At job allocation time / pre-launch
  • During runtime
• Ability to change the mapping:
  • Fixed shape partitions: mapping can be specified a priori (at job submission time)
  • At job allocation time: specify the new mapping before the job is launched
• Topology aware MPI implementations
  • Optimized collectives
-
Survey of machines (top500.org)

Machine                Network / topology           Resource Manager      Topology friendly?
K computer             Tofu / 6D torus
Tianhe-1A              Proprietary fat-tree         SLURM
Jaguar (XT5)           SeaStar2+ / 3D torus         Moab/TORQUE/ALPS
Nebulae                InfiniBand
Tsubame 2.0            InfiniBand                   N1 Grid Engine/PBS
Cielo (XE6)            Gemini / 3D torus            Moab/TORQUE/ALPS
Pleiades (SGI Altix)   InfiniBand / 11D hypercube   PBS
Tera 100               InfiniBand                   SLURM
RoadRunner             InfiniBand
Jugene (BG/P)          3D torus                     LoadLeveler
-
Overall system utilization?
-
Resource Management in HPC
• Manage incoming jobs in the queue (assign priorities): job/batch scheduler
• Launch/monitor jobs on the compute nodes: job launcher
• Manage and allocate resources: workload/resource manager
Typically all referred to together as a resource manager or job scheduler
-
Resource manager requirements
• Resource allocator needs to be topology aware:
  • Contiguous partitions on a 3D mesh/torus
  • Nodes on the same switch for fat-trees
• Job launcher should be able to launch MPI processes on specific nodes
  • On BG/P systems, mpirun can take a mapfile as input
  • On Cray systems, aprun can take a list of node ids
• Within-node mapping
  • Most job launchers support this through CPU/task affinity options
-
Mapping algorithms
• Steps involved in mapping:
  • Collect processor topology information
  • Collect application topology information
  • Run the mapping algorithm
  • Communicate the new mapping to processes
• Approaches:
  • Centralized
  • Completely distributed
  • Distributed with global view
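The centralized approach above can be sketched with a minimal greedy mapper; the function name, the degree-first ordering, and the Manhattan-distance heuristic are illustrative assumptions, not the talk's algorithm:

```python
def centralized_map(comm_graph, mesh_dims):
    """Greedy centralized mapping sketch: place tasks one at a time,
    putting each task as close (Manhattan distance on a 2D mesh) as
    possible to its already-placed communication partners.
    comm_graph: dict task -> set of neighbor tasks.
    mesh_dims: (X, Y) processor mesh, one task per processor."""
    X, Y = mesh_dims
    free = {(x, y) for x in range(X) for y in range(Y)}
    placement = {}
    # Place high-degree (heavily communicating) tasks first.
    for task in sorted(comm_graph, key=lambda t: -len(comm_graph[t])):
        placed = [placement[n] for n in comm_graph[task] if n in placement]
        def cost(p):
            return sum(abs(p[0] - q[0]) + abs(p[1] - q[1]) for q in placed)
        best = min(free, key=cost)
        placement[task] = best
        free.remove(best)
    return placement

# Map a 4-task ring onto a 2x2 mesh.
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
mapping = centralized_map(ring, (2, 2))
```

A centralized mapper like this needs the full communication graph and mesh on one process, which is precisely what stops scaling at exascale, motivating the distributed schemes that follow.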
-
Mapping Challenges
• At exascale, we cannot store O(p) information on one processor
• For asymmetric topologies, how to query/store the topology information in a scalable fashion (MPI topology routines)
• How to collect/store the communication graph in a scalable fashion (MPI graph routines)
• Centralized algorithms: O(n log n) or O(n) might be too slow
• Distributed algorithms: convergence is slow
-
Scalable mapping algorithms
• Scalable algorithms: distributed algorithms with global view
• Can they be as good as centralized algorithms?
-
Completely Distributed Mapping
• Do the mapping in parallel
  • With some sense of global load distribution
  • Using parallel prefix (discussed in literature)
• Map n objects communicating in a 1D ring pattern to a linear array of p processors
• Map n objects communicating in a 2D stencil pattern to a 2D mesh of p processors
-
1D ring to a linear array
• Object i communicates with i-1 and i+1
• We want to make cuts in the 1D ring based on the loads of the objects
• Each processor will then have only two external communication arcs
-
1D ring to a linear array
• Perform a parallel prefix sum between objects and send the total load to all objects
• Each object then decides which processor it should be on

[Figure 12.1: Prefix sum in parallel to obtain partial sums of the loads of all objects up to a certain object. Loads 5 9 1 2 3 1 6 4 become prefix sums 5 14 15 17 20 21 27 31 over three steps.]

Problem: load balancing a 1D array of v objects which communicate in a ring pattern onto a 1D linear array of p processors.

Solution: We want to map these objects onto processors while considering the load of each object and the communication patterns among the objects. To optimize communication, we want to place objects next to each other on the same processor as much as possible and cross processor boundaries only to ensure load balance. We assume that object IDs denote nearness in terms of who communicates with whom. Hence the problem reduces to finding contiguous groups of objects in the 1D array such that the load on all processors is nearly the same.

We arrange the objects virtually by their IDs and perform a prefix sum in parallel between them based on the object loads. At the conclusion of the prefix sum, every object knows the sum of the loads of all objects that appear before it (Figure 12.1). Then the last object broadcasts the sum of the loads of all objects so that every object knows the global load of the system. Each object i can calculate its destination processor d_i from the total load of all objects L_v, the prefix sum of loads up to it L_i, its own load l_i, and the total number of processors p:

    d_i = floor( p * (L_i - l_i / 2) / L_v )    (12.1)

So every object can decide its destination processor in parallel through a parallel prefix operation and then migrate to the respective processor. A second scenario is a two-dimensional array of objects in which each object communicates with its two immediate neighbors in its row and column; we wish to map that group onto a 2D mesh of processors (next slide).

Complexity analysis: In the centralized scheme, one processor stores all the information, so memory requirements are proportional to the number of objects (or VPs), v. In the distributed case, each processor stores information only about its own objects, v/p. In the centralized case, each processor sends its load information to one processor, leading to p messages of size v/p each. In the parallel prefix, on the other hand, there are log v phases with v constant-size messages exchanged in each phase. These comparisons are summarized in Table 12.1. If we assume the fastest centralized load balancing algorithm runs in linear time, the decision-making complexity is O(v).

Legend: L_i = prefix sum; L_v = sum of all loads; p = number of processors; l_i = load of the object; d_i = destination processor.
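Equation 12.1 can be checked with a serial stand-in for the parallel prefix, using the loads from Figure 12.1 (the function name is illustrative):

```python
from itertools import accumulate
from math import floor

def ring_destinations(loads, p):
    """Each object i computes d_i = floor(p * (L_i - l_i/2) / L_v),
    where L_i is the prefix sum of loads up to and including object i
    and L_v is the total load (Equation 12.1). Computed serially here;
    in the distributed scheme the L_i come from a parallel prefix."""
    prefix = list(accumulate(loads))   # L_i for each object
    total = prefix[-1]                 # L_v
    return [floor(p * (L - l / 2) / total) for L, l in zip(prefix, loads)]

# Loads from Figure 12.1, mapped onto 4 processors.
dests = ring_destinations([5, 9, 1, 2, 3, 1, 6, 4], p=4)
# Contiguous groups: [5], [9, 1], [2, 3, 1], [6, 4]
```

Note the cuts fall on contiguous groups of objects, so each processor keeps at most two external ring arcs while the per-processor loads (5, 10, 6, 10) stay close to the ideal of 7.75.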
-
2D stencil to 2D mesh
• Linearize using a space filling curve
• Perform a 1D parallel prefix to obtain the linear index of the processor
• Decode the (x, y) coordinates of the processor using the same space filling curve

Solution 2: Another solution is to use a parallel prefix in 1D as we did in the previous section. To do this, we have to linearize the objects in some fashion. Space filling curves can be used to map the 2D object grid to a 1D line [87, 88]. Space filling curves preserve the neighborhood properties of the objects in 2D. Figure 12.2 shows the linearization of an object grid of dimensions 32 x 32 and a processor grid of dimensions 8 x 8.

[Figure 12.2: Hilbert order linearization of an object grid of dimensions 32 x 32 and a processor mesh of dimensions 8 x 8]

Having linearized the object grid, we can perform a parallel prefix on the 1D array of objects and obtain a destination processor for each object. This processor number is the linearized index of each processor if we create a space filling curve for the 2D processor mesh. Hence, from this linearized index, we can obtain the x and y coordinates of the processor by decoding with the same logic used for generating space filling curves. The hope is that nearby objects in the 2D object graph end up close to one another on the 2D processor mesh.
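The linearize-then-prefix idea can be sketched with a Z-order (Morton) curve, a simpler space filling curve than the Hilbert order used in Figure 12.2; the code and names below are illustrative, not the thesis implementation:

```python
def morton_encode(x, y, bits=5):
    """Interleave the bits of (x, y) to get a Z-order linear index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return z

def morton_decode(z, bits=5):
    """Inverse: recover mesh coordinates (x, y) from a linear index."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y

# Linearize a 32x32 object grid (5 bits per dimension); after the 1D
# prefix assigns each object a linear processor index, decode it back
# to (x, y) on the processor mesh with the same curve.
order = sorted(((ox, oy) for ox in range(32) for oy in range(32)),
               key=lambda o: morton_encode(*o))
px, py = morton_decode(13, bits=3)   # a linear index on an 8x8 mesh
```

A Hilbert curve preserves 2D locality better than Z-order (no long diagonal jumps), but both admit the same encode/prefix/decode pipeline.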
-
Running Time
• Benefits from distributed mapping:
  • Memory usage per processor: O(v/p) compared to O(v)
  • No communication bottleneck at a single processor
  • Faster decision time: O(log p)
• Time (in seconds) for load balancing on a Cray XT5:

  # cores      4096    16384
  1 million    1.81     6.77
  1 million    3.29     3.49
  4 million   12.96     9.59
  4 million   12.65     5.68
-
Results: 1 million objects on 4k cores

[Charts over five trials, comparing Random, Default, and Topology mappings. Left: hops per byte on a log scale (0.1 to 100), showing the reduction in hop-bytes. Right: ratio of maximum to average load (0 to 1.4).]

5 Hop-bytes as an Evaluation Metric

The volume of inter-processor communication can be characterized by the hop-bytes metric, the weighted sum of message sizes where the weights are the number of hops (links) traveled by the respective messages:

    HB = sum over i = 1 to n of ( d_i * b_i )    (5.1)

where d_i is the number of links traversed by message i, b_i is the message size in bytes for message i, n is the number of messages, and the summation is over all messages sent.

Hop-bytes is an indication of the average communication load on each link in the network. This assumes that the application generates nearly uniform traffic over all links in the partition. The metric does not give an indication of hot-spots generated on specific links, but it is easily derived and correlates well with actual application performance.

In VLSI circuit design and early parallel computing work, emphasis was placed on another metric, maximum dilation, defined as

    d(e) = max { d_i | e_i in E }    (5.2)

where d_i is the dilation of edge e_i. This metric aims at minimizing the longest wire length in a circuit. We claim that reducing the largest number of links traveled by any message is not as critical as reducing the average hops across all messages.
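Equation 5.1 is straightforward to compute from a task placement; here is a sketch on a 2D mesh using Manhattan distance for the hop counts (the placement and message data are made up for illustration):

```python
def hop_bytes(messages, placement):
    """HB = sum of d_i * b_i over all messages (Equation 5.1), where
    d_i is the number of links traversed (Manhattan distance on a 2D
    mesh here) and b_i the message size in bytes."""
    total = 0
    for src, dst, nbytes in messages:
        (sx, sy), (tx, ty) = placement[src], placement[dst]
        hops = abs(sx - tx) + abs(sy - ty)
        total += hops * nbytes
    return total

# Task 0 exchanges 1 KB with a neighbor and 4 KB across the mesh.
placement = {0: (0, 0), 1: (0, 1), 2: (3, 3)}
msgs = [(0, 1, 1024), (0, 2, 4096)]
hb = hop_bytes(msgs, placement)   # 1*1024 + 6*4096 = 25600
```

Dividing this total by the number of links in the partition gives the average per-link load that the metric approximates.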
-
Summary
• Topology awareness required at various levels:
  • Resource managers (allocator, job launcher): topology aware scheduling
  • MPI implementations (optimized collectives)
  • Software support for querying network topology
  • Distributed graph representations for communication structure
• Scalable distributed algorithms
  • With global view of load and partial view of communication structure
-
Questions?

Abhinav Bhatele, Automating Topology Aware Mapping for Supercomputers, PhD Thesis, Department of Computer Science, University of Illinois.
http://charm.cs.uiuc.edu/research/topology