Scalable and Topology-Aware Load Balancers in Charm++
Amit Sharma, Parallel Programming Lab, UIUC
Outline
- Dynamic load-balancing framework in Charm++
- Load balancing on large machines
- Scalable load balancer
- Topology-aware load balancers
Dynamic Load-Balancing Framework in Charm++
The load-balancing task in Charm++:
- Given a collection of migratable objects and a set of computers connected in a certain topology,
- find a mapping of objects to processors such that
  - each processor has almost the same amount of computation, and
  - communication between processors is minimized.
- The mapping of chares to processors is dynamic.
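The balance objective above can be illustrated with a minimal greedy sketch (not Charm++'s actual GreedyLB implementation) that considers computation only: objects are assigned, heaviest first, to the currently least-loaded processor.

```python
import heapq

def greedy_map(object_loads, num_procs):
    """Assign objects (heaviest first) to the least-loaded processor.
    object_loads: dict mapping object id -> measured computation time."""
    heap = [(0.0, p) for p in range(num_procs)]   # (current load, processor)
    heapq.heapify(heap)
    mapping = {}
    for obj in sorted(object_loads, key=object_loads.get, reverse=True):
        load, p = heapq.heappop(heap)             # lightest processor so far
        mapping[obj] = p
        heapq.heappush(heap, (load + object_loads[obj], p))
    return mapping
```

For example, greedy_map({'a': 4.0, 'b': 3.0, 'c': 2.0, 'd': 1.0}, 2) splits the load exactly 5.0/5.0 across the two processors.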
Load-Balancing Approaches
Two major approaches:
- No predictability of load patterns: fully dynamic. Early work on state-space search, branch-and-bound, etc.; seed load balancers.
- With certain predictability (CSE, molecular dynamics simulation): measurement-based load-balancing strategy.
Principle of Persistence
Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior:
- abrupt and large, but infrequent, changes (e.g. AMR);
- slow and small changes (e.g. particle migration).
This is a parallel analog of the principle of locality: a heuristic that holds for most CSE applications.
Measurement-Based Load Balancing
- Based on the principle of persistence.
- Runtime instrumentation (the LB database) records communication volume and computation time.
- Measurement-based load balancers use the database periodically to make new decisions.
- Many alternative strategies can use the database:
  - centralized vs. distributed
  - greedy improvements vs. complete reassignments
  - taking communication into account
  - taking dependencies into account (more complex)
  - topology-aware
Load Balancer Strategies
Centralized:
- Object load data are sent to processor 0.
- The data are integrated into a complete object graph.
- Migration decisions are broadcast from processor 0.
- Requires a global barrier.
Distributed:
- Load balancing among neighboring processors.
- Builds a partial object graph.
- Migration decisions are sent only to neighbors.
- No global barrier.
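The distributed approach can be sketched as diffusion: each processor compares its load only with its neighbors and shifts a fraction of the difference along each link. This is an illustrative sketch of neighborhood balancing in general, not the Charm++ neighbor-balancer code.

```python
def diffusion_step(loads, edges, alpha=0.25):
    """One round of neighborhood balancing: along every link (p, q) the
    heavier endpoint sends alpha * (load difference) to the lighter one.
    No processor needs global load information."""
    delta = [0.0] * len(loads)
    for p, q in edges:
        flow = alpha * (loads[p] - loads[q])
        delta[p] -= flow
        delta[q] += flow
    return [l + d for l, d in zip(loads, delta)]
```

Repeated over a few rounds on a ring of four processors, an initial load vector [8, 0, 0, 0] flattens toward [2, 2, 2, 2] while the total load is conserved.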
Load Balancing on Very Large Machines: New Challenges
Existing load-balancing strategies don't scale on extremely large machines. Consider an application with 1M objects on 64K processors. Limiting factors and issues:
- the decision-making algorithm;
- the difficulty of achieving well-informed load-balancing decisions;
- resource limitations.
Limitations of Centralized Strategies
- Effective on small numbers of processors; easy to achieve good load balance.
- Limitations (inherently not scalable):
  - the central node becomes a memory/communication bottleneck;
  - decision-making algorithms tend to be very slow.
- We demonstrate these limitations using the simulator we developed.
Memory Overhead (simulation results with lb_test)
[Figure: LB memory usage on the central node (MB, axis 0-500) vs. number of objects (128K, 256K, 512K, 1M), for 32K and 64K processors.]
The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on Lemieux, 64 processors.
Load Balancing Execution Time
[Figure: execution time (seconds, axis 0-400) of the GreedyLB, GreedyCommLB, and RefineLB load-balancing algorithms vs. number of objects (128K, 256K, 512K, 1M), on a 64K-processor simulation.]
Why Hierarchical LB?
- Centralized load balancer: communication bottleneck on processor 0; memory constraint.
- Fully distributed load balancer: neighborhood balancing, but without global load information.
- Hierarchical distributed load balancer: divide processors into groups and apply different strategies at each level; scalable to a large number of processors.
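With fixed-size groups (64 groups of 1024 processors on a 64K-processor machine, as in the example tree that follows), the group structure reduces to simple arithmetic. The helper names below are hypothetical, for illustration only:

```python
def group_leader(proc, group_size=1024):
    """Leader (lowest-ranked processor) of the group containing `proc`,
    for a two-level hierarchy with fixed-size groups."""
    return (proc // group_size) * group_size

def group_members(leader, group_size=1024):
    """Processor ranks managed by a given group leader."""
    return range(leader, leader + group_size)
```

For instance, group_leader(63490) is 63488, matching the leaders 0, 1024, ..., 63488, 64512 in the example tree.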
A Hybrid Load-Balancing Strategy
- Divide processors into independent sets of groups; the groups are organized in hierarchies (decentralized).
- Each group has a leader (its central node) which performs centralized load balancing.
- A particular hybrid strategy that works well.
Gengbin Zheng, PhD Thesis, 2005
Hierarchical Tree (an example)
[Figure: a 64K-processor hierarchical tree. At Level 0, the processors are divided into 64 groups of 1024 (0...1023, 1024...2047, ..., 63488...64511, 64512...65535). At Level 1, each group is represented by its leader (processors 0, 1024, ..., 63488, 64512). Level 2 is the root.]
- Apply different strategies at each level.
Our HybridLB Scheme
[Figure: the same 64K-processor tree. Load data, as an object communication graph (OCG), flows up from processors to their group leaders. Refinement-based load balancing runs at the upper level; greedy-based load balancing runs within each group. Tokens and objects then migrate accordingly.]
Simulation Study: Memory Usage
[Figure: memory usage (MB, axis 0-500) of CentralLB vs. HybridLB for 256K, 512K, and 1M objects, in an lb_test simulation of 64K processors.]
Simulation of the lb_test benchmark with the performance simulator.
Total Load Balancing Time
[Figure: time (seconds, axis 0-450) of GreedyCommLB vs. HybridLB(GreedyCommLB) for 256K, 512K, and 1M objects, in an lb_test simulation of 64K processors.]
Load Balancing Quality
[Figure: maximum predicted load (seconds, axis 0-0.12) under GreedyCommLB vs. HybridLB for 256K, 512K, and 1M objects, in an lb_test simulation of 64K processors.]
Topology-Aware Mapping of Tasks
Problem: map tasks to processors connected in a topology, such that:
- the compute load on the processors is balanced, and
- communicating chares (objects) are placed on nearby processors.
Mapping Model
- Task graph: G_t = (V_t, E_t), a weighted graph with undirected edges. Nodes are chares, with w(v_a) the computation of v_a; edges are communication, with c_ab the bytes exchanged between v_a and v_b.
- Topology graph: G_p = (V_p, E_p). Nodes are processors; edges are direct network links. Examples: 3D torus, 2D mesh, hypercube.
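The distance d(p, q) used below is the shortest hop count between two processors in the topology graph. A sketch for the torus and mesh cases (coordinates as tuples; the function names are ours, for illustration):

```python
def torus_distance(a, b, dims):
    """Hop distance between processor coordinates a and b on a torus with
    side lengths `dims`; each dimension may wrap around."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def mesh_distance(a, b):
    """Hop (Manhattan) distance on a mesh: no wraparound links."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

On an 8x8x8 torus, (0,0,0) and (7,0,0) are one hop apart via the wraparound link; on a mesh they are seven hops apart.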
Model (Cont.)
A task mapping assigns tasks to processors: P : V_t -> V_p.
Hop-bytes measure communication cost: the cost imposed on the network is higher when more links are used, so inter-processor communication is weighted by distance on the network:

    HB = sum over e_ab in E_t of c_ab * d(P(v_a), P(v_b))
Metric
Minimize hop-bytes, or equivalently hops-per-byte: the average number of hops traveled by a byte under a task mapping.
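Both quantities are straightforward to compute from a task graph and a mapping; a minimal sketch, with c_ab stored in an edge-keyed dict and any distance function plugged in:

```python
def hop_bytes(comm, mapping, dist):
    """HB = sum over edges (a, b) of c_ab * d(P(v_a), P(v_b))."""
    return sum(c * dist(mapping[a], mapping[b])
               for (a, b), c in comm.items())

def hops_per_byte(comm, mapping, dist):
    """Average hops traveled by a byte: HB / total bytes sent."""
    return hop_bytes(comm, mapping, dist) / sum(comm.values())
```

With Manhattan distance on a 2D mesh, 100 bytes between adjacent processors plus 50 bytes across two hops give HB = 100*1 + 50*2 = 200 and hops-per-byte = 200/150, about 1.33.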
TopoLB: Topology-Aware LB (Overview)
- First coalesce the task graph to n nodes (n = number of processors). MetisLB is used because it reduces inter-group communication; GreedyLB, GreedyCommLB, etc. can also be used.
- Repeat n times: pick a task t and a processor p, and place t on p (P(t) <- p).
Tarun Agarwal, MS Thesis, 2005
Picking t, p
- t is the task whose placement in this iteration is most critical.
- p is the processor where t costs least.
The cost of placing t on p is approximated as:

    cost(t, p) = sum over m in Assigned Tasks of c_tm * d(P(m), p)
               + sum over m in Unassigned Tasks of c_tm * d(p^, p)

where p^ stands for the estimated (as-yet-unknown) future placement of m. Note that

    HB(t) = sum over m in V_t of c_tm * d(P(m), P(t))
    Hop-bytes = sum over t in V_t of HB(t)
Picking t, p (Cont.)
The criticality of placing t in this iteration: by how much will the cost of placing t increase in the future? The future cost assumes t will be placed on some random free processor in a later iteration:

    future_cost(t) = (1 / |Free Procs|) * sum over p_i in Free Procs of cost(t, p_i)
    criticality(t) = future_cost(t) - cost(t, p_best)
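The selection rule can be sketched as follows. This toy version works on an already-coalesced graph (one task per processor), drops the unassigned-task term of cost(t, p) for brevity, and breaks ties in sorted order; the real TopoLB is more involved.

```python
def topolb_place(comm, dist, procs):
    """comm[t][m] = bytes between tasks t and m; dist(p, q) = network hops.
    Each iteration places the most 'critical' task on its cheapest free
    processor, where criticality = average future cost - best cost now."""
    placement, free = {}, set(procs)
    tasks = set(comm)
    while tasks:
        best = None
        for t in sorted(tasks):
            def cost(p):
                # communication of t with already-placed tasks only
                return sum(c * dist(placement[m], p)
                           for m, c in comm[t].items() if m in placement)
            costs = {p: cost(p) for p in sorted(free)}
            p_best = min(costs, key=costs.get)
            future = sum(costs.values()) / len(costs)
            crit = future - costs[p_best]
            if best is None or crit > best[0]:
                best = (crit, t, p_best)
        _, t, p = best
        placement[t], tasks, free = p, tasks - {t}, free - {p}
    return placement
```

On a four-task chain (10 bytes per edge) and four processors in a line, the sketch recovers the contiguous mapping, so every communicating pair sits one hop apart.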
Putting it together
TopoCentLB: A Faster Topology-Aware LB
- Coalesce the task graph to n nodes (n = number of processors).
- Picking task t and processor p:
  - t is the task with the maximum total communication with already-assigned tasks;
  - p is the processor where t costs least:

    cost(t, p) = sum over m in Assigned Tasks of c_tm * d(P(m), p)
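TopoCentLB's simpler rule can be sketched directly: seed with the task that communicates most in total, then repeatedly place the task with the heaviest communication to already-placed tasks on the free processor that minimizes cost(t, p). An illustrative sketch with sorted tie-breaking, not the production code:

```python
def topocentlb_place(comm, dist, procs):
    """comm[t][m] = bytes between tasks t and m; dist(p, q) = network hops."""
    free = sorted(procs)
    tasks = sorted(comm)
    # seed: the task with the largest total communication volume
    t0 = max(tasks, key=lambda t: sum(comm[t].values()))
    placement = {t0: free.pop(0)}
    tasks.remove(t0)
    while tasks:
        # task with maximum communication to already-assigned tasks
        t = max(tasks, key=lambda u: sum(c for m, c in comm[u].items()
                                         if m in placement))
        # processor minimizing cost(t, p) = sum c_tm * d(P(m), p)
        p = min(free, key=lambda q: sum(c * dist(placement[m], q)
                                        for m, c in comm[t].items()
                                        if m in placement))
        placement[t] = p
        free.remove(p)
        tasks.remove(t)
    return placement
```

Because it never looks ahead, a single pass over the same four-task chain can leave one pair two hops apart where TopoLB's criticality term would not; that is exactly the speed-for-quality trade-off the next slide describes.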
TopoCentLB (Cont.)
- Differences from TopoLB: no notion of criticality; it considers only the mapping so far and does not look into the future.
- Running complexity:
  - TopoLB: depends on the criticality function, from O(p|E_t|) up to O(p^3).
  - TopoCentLB: O(p|E_t|), with a smaller constant than TopoLB.
Results
Comparison of TopoLB, TopoCentLB, and random placement:
- Charm++ LB simulation mode: a 2D-Jacobi-like benchmark and LeanMD; reduction in hop-bytes.
- BlueGene/L: a 2D-Jacobi-like benchmark; reduction in running time.
Simulation Results
2D-mesh pattern on a 3D-torus topology (same size).
Simulation Results
LeanMD on a 3D torus.
Experimental Results: BlueGene
2D-mesh pattern on a 3D torus (message size: 100 KB).
Experimental Results: BlueGene
2D-mesh pattern on a 3D mesh (message size: 100 KB).
Conclusions
- Scalable load balancers will be needed for large machines like BG/L; hybrid load balancers take a distributed approach that keeps communication localized within a neighborhood.
- Efficient topology-aware task-mapping strategies that reduce hop-bytes also lead to lower network latencies and better tolerance of contention and bandwidth constraints.