Hierarchical Load Balancing for Large Scale Supercomputers
Gengbin Zheng
Charm++ Workshop 2010
Parallel Programming Lab, UIUC
Outline
- Dynamic load balancing framework in Charm++
- Motivations
- Hierarchical load balancing strategy
Charm++ Dynamic Load-Balancing Framework
- One of the most popular reasons to use Charm++/AMPI
- Fully automatic
- Adaptive
- Application independent
- Modular and extendable
Principle of Persistence
- Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
- This holds in spite of dynamic behavior:
  - Abrupt and large, but infrequent, changes (e.g. AMR)
  - Slow and small changes (e.g. particle migration)
- Parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation (the LB database) records communication volume and computation time
- Measurement-based load balancers use the database periodically to make new decisions
- Many alternative strategies can use the database:
  - Centralized vs. distributed
  - Greedy vs. refinement
  - Taking communication into account
  - Taking dependencies into account (more complex)
  - Topology-aware
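The instrumentation idea above can be sketched in a few lines. This is a much simplified stand-in for Charm++'s instrumented load database, for illustration only; the `LBDatabase` type and its members here are assumptions, not the real Charm++ API:

```cpp
#include <map>

// Sketch of a measurement-based LB database: the runtime records
// per-object CPU time each step, and by the principle of persistence
// the recent measurements serve as the prediction of future load.
struct LBDatabase {
    std::map<int, double> objLoad;  // object id -> accumulated CPU time (s)

    void record(int obj, double seconds) { objLoad[obj] += seconds; }

    // Predicted future load of an object: its measured past load.
    double predictedLoad(int obj) const {
        auto it = objLoad.find(obj);
        return it == objLoad.end() ? 0.0 : it->second;
    }
};
```

A strategy then reads only these predictions to compute a new mapping; it never needs to understand the application itself, which is what makes the framework application independent.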
Load Balancer Strategies
Centralized:
- Object load data are sent to processor 0
- Integrated into a complete object graph
- Migration decisions are broadcast from processor 0
- Requires a global barrier
Distributed:
- Load balancing among neighboring processors
- Builds partial object graphs
- Migration decisions are sent to neighbors
- No global barrier
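A centralized strategy in the spirit of Charm++'s GreedyLB can be sketched as follows (an illustration of the idea, not the actual implementation): sort objects by measured load, heaviest first, and assign each to the currently least-loaded processor, tracked with a min-heap:

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// Greedy centralized mapping sketch: heaviest object goes to the
// currently least-loaded processor. Returns assignment[obj] = proc.
std::vector<int> greedyMap(const std::vector<double>& objLoads, int nprocs) {
    std::vector<std::pair<double, int>> objs;  // (load, object id)
    for (int i = 0; i < (int)objLoads.size(); ++i)
        objs.push_back({objLoads[i], i});
    std::sort(objs.rbegin(), objs.rend());     // heaviest first

    using P = std::pair<double, int>;          // (processor load, proc id)
    std::priority_queue<P, std::vector<P>, std::greater<P>> procs;
    for (int p = 0; p < nprocs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(objLoads.size());
    for (auto& [load, obj] : objs) {
        auto [pload, p] = procs.top();
        procs.pop();
        assignment[obj] = p;                   // heaviest -> least loaded
        procs.push({pload + load, p});
    }
    return assignment;
}
```

Note that the decision is made in one place over all objects, which is exactly what makes this class of strategies a bottleneck at scale, as the next slides show.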
Limitations of Centralized Strategies
Now consider an application with 1M objects on 64K processors. Centralized strategies are inherently not scalable:
- The central node becomes a memory/communication bottleneck
- Decision-making algorithms tend to be very slow
We demonstrate these limitations using the simulator we developed.
Memory Overhead (simulation results with lb_test)
[Figure: LB memory usage on the central node (MB), 0 to 500, vs. number of objects (128K, 256K, 512K, 1M), for 32K and 64K cores]
The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on Lemieux, 64 processors.
Load Balancing Execution Time
[Figure: execution time (in seconds), 0 to 400, of GreedyLB, GreedyCommLB, and RefineLB vs. number of objects (128K, 256K, 512K, 1M)]
Execution time of load balancing algorithms on a 64K-processor simulation.
Limitations of Distributed Strategies
- Each processor periodically exchanges load information and migrates objects among neighboring processors
- Performance improves slowly
- Lack of global information
- Difficult to converge quickly to as good a solution as a centralized strategy
Result with NAMD on 256 processors.
A Hybrid Load Balancing Strategy
- Divide processors into independent groups; groups are organized in hierarchies (decentralized)
- Aggressive load balancing within sub-groups, combined with refinement-based cross-group load balancing
- Each group has a leader (the central node) that performs centralized load balancing, reusing the existing centralized load balancers
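The refinement side of the hybrid strategy can be sketched as follows (an assumption-laden simplification of the refinement idea behind Charm++'s RefineLB, not its real code): instead of remapping everything, repeatedly move the lightest object off the most loaded processor until the peak load is within a tolerance of the average, so only a few objects migrate:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Refinement-style rebalancing sketch. procs[p] holds the loads of the
// objects currently on processor p; returns the adjusted placement.
std::vector<std::vector<double>> refine(std::vector<std::vector<double>> procs,
                                        double tolerance) {
    auto load = [](const std::vector<double>& objs) {
        return std::accumulate(objs.begin(), objs.end(), 0.0);
    };
    double total = 0.0;
    for (const auto& p : procs) total += load(p);
    double target = (1.0 + tolerance) * total / procs.size();

    for (;;) {
        int hi = 0, lo = 0;  // most and least loaded processors
        for (int i = 1; i < (int)procs.size(); ++i) {
            if (load(procs[i]) > load(procs[hi])) hi = i;
            if (load(procs[i]) < load(procs[lo])) lo = i;
        }
        if (procs[hi].empty() || load(procs[hi]) <= target) break;
        auto lightest = std::min_element(procs[hi].begin(), procs[hi].end());
        double w = *lightest;
        if (load(procs[lo]) + w > target) break;  // move would not help
        procs[hi].erase(lightest);
        procs[lo].push_back(w);                   // migrate the object
    }
    return procs;
}
```

In the hybrid scheme this cheap pass runs across groups, while the aggressive (greedy) pass runs only inside each group where the problem size is small.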
Hierarchical Tree (an example)
[Figure: a 64K-processor hierarchical tree. Level 0 is the root; level 1 has 64 group leaders (processors 0, 1024, ..., 63488, 64512); level 2 groups 1024 processors under each leader (0...1023, 1024...2047, ..., 63488...64511, 64512...65535)]
- Apply different strategies at each level
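The example tree above is easy to construct programmatically. In this sketch the rule that the first processor of each group acts as its leader is an assumption for illustration (the slide's leader numbers 0, 1024, ..., 64512 are consistent with it):

```cpp
#include <vector>

// Two-level hierarchy sketch: 65536 processors in level-2 groups of
// 1024, one level-1 leader per group, and a single level-0 root.
struct Hierarchy {
    int root;
    int groupSize;
    std::vector<int> leaders;  // level-1 group leaders
};

Hierarchy buildTree(int nprocs, int groupSize) {
    Hierarchy h{0, groupSize, {}};
    for (int p = 0; p < nprocs; p += groupSize)
        h.leaders.push_back(p);  // 0, 1024, 2048, ..., 64512
    return h;
}

// The level-1 leader that a given processor reports to.
int leaderOf(const Hierarchy& h, int proc) {
    return (proc / h.groupSize) * h.groupSize;
}
```

With 65536 processors and a group size of 1024 this yields the 64 level-1 leaders shown on the slide.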
Issues
- Load data reduction: semi-centralized load balancing scheme
- Reducing data movement: token-based local balancing
- Topology-aware tree construction
Token-based HybridLB Scheme
[Figure: the 64K-processor tree, with load data (OCG) flowing up to the group leaders; refinement-based load balancing across groups at the root, greedy-based load balancing within each group at the leaders; tokens and objects flow back down]
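The token mechanism can be sketched as follows (an assumption: a simplified view of the slide's scheme, with hypothetical names). The group leader decides a new mapping from load data alone and sends out lightweight tokens naming the objects to move; the object data itself later travels directly between the owning processors, so the leader never has to handle it:

```cpp
#include <vector>

// A token stands in for a migrating object: just ids, no object data.
struct Token {
    int objectId;
    int fromProc;
    int toProc;
};

// Leader side: turn the old and new mappings into one token per moved
// object; unmoved objects generate no traffic at all.
std::vector<Token> makeTokens(const std::vector<int>& oldMap,
                              const std::vector<int>& newMap) {
    std::vector<Token> tokens;
    for (int obj = 0; obj < (int)oldMap.size(); ++obj)
        if (oldMap[obj] != newMap[obj])
            tokens.push_back({obj, oldMap[obj], newMap[obj]});
    return tokens;
}
```

Since a token is a few integers while an object may carry megabytes of state, routing decisions through tokens is what keeps the leaders from becoming data-movement bottlenecks.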
Performance Study with Synthetic Benchmark
[Figure: peak memory usage (MB), CentralizedLB vs. Hierarchical, at 4096, 8192, and 16384 cores]
lb_test benchmark on the Ranger cluster (1M objects).
Load Balancing Time (lb_test)
[Figure: LB time (s), CentralizedLB vs. Hierarchical, at 4096, 8192, and 16384 cores]
lb_test benchmark on the Ranger cluster.
Performance (lb_test)
[Figure: lb_test step time with no LB, CentralizedLB, and Hierarchical, at 4096, 8192, and 16384 cores]
lb_test benchmark on the Ranger cluster.
NAMD Hierarchical LB
- NAMD implements its own specialized load balancing strategies, based on the Charm++ load balancing framework
- Extended NAMD's comprehensive and refinement-based solutions to work on subsets of processors
NAMD LB Time
[Figure: NAMD time/step with the Comprehensive and Refinement strategies, at 256, 512, 1024, and 2048 cores]
NAMD LB Time (Comprehensive)
[Figure: load balancing time (s), log scale from 0.1 to 1000, Comprehensive vs. Hierarchical, at 512, 1024, 2048, 4096, and 8192 cores]
NAMD LB Time (Refinement)
[Figure: load balancing time (s), log scale from 0.1 to 1000, Refinement vs. Hierarchical, at 512, 1024, 2048, 4096, and 8192 cores]
NAMD Performance
[Figure: NAMD time/step, Centralized vs. Hierarchical, at 512, 1024, 2048, 4096, and 8192 cores]
Conclusions
- Scalable load balancers are needed for large machines like BG/P
- Avoid memory and communication bottlenecks
- Achieve results similar to the more expensive centralized load balancers
- Take processor topology into account