TRANSCRIPT
Work Stealing and Persistence-based Load Balancers for Iterative
Overdecomposed Applications
Jonathan Lifflander, UIUC
Sriram Krishnamoorthy, PNNL
Laxmikant Kale, UIUC
HPDC 2012
Iterative Applications
Applications repeatedly executing the same computation
Static or slowly evolving execution characteristics
Execution characteristics preclude static balancing:
Application characteristics (comm. pattern, sparsity, …)
Execution environment (topology, asymmetry, …)
Challenge: Load-balancing such applications
Overdecomposition
Expose greater levels of concurrency than supported by the hardware
Middleware (runtime) dynamically maps the concurrent tasks to hardware resources
Abstraction supports continuous optimization and adaptation
Improvements to load balancing:
New metrics (power, energy, graceful degradation, …)
New features: fault tolerance, power/energy awareness
Problem Statement
Scalable load balancers for iterative overdecomposed applications
We consider two alternatives:
Persistence-based load balancing
Work stealing
How do these algorithms behave at scale?
How do they compare?
Related Work
Overdecomposition is a widely used approach
Inspector-executor approaches employ start-time load balancers
Prior hierarchical load balancers typically do not consider locality
Scalability of work stealing not well understood – largest prior demonstration was on 8192 cores
No comparative evaluation of the two schemes
TASCEL: Task Scheduling Library
Runtime library for task-parallel programs
Manages task collections for execution on distributed memory machines
Compatible with native MPI programs
Phase-based switch between SPMD and non-SPMD modes of execution
TASCEL Execution
Task: basic unit of migratable execution
Typical workflow (sketched after this list):
Create a task collection
Seed it with one or more tasks
Process tasks in the collection until termination detection
Processing of task collections manages concurrency, faults, …
Trade-offs exposed through implementation specializations
Dynamic load balancing schemes
Fault tolerance protocols
…
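As a rough illustration of this workflow, here is a minimal single-process C++ mock. The TaskCollection type and its addTask()/process() methods are stand-ins chosen for this sketch, not TASCEL's actual API; with one process, termination detection degenerates to an empty queue.

// Minimal single-process mock of the workflow above (illustrative only).
#include <cstdio>
#include <deque>
#include <functional>

template <typename T>
class TaskCollection {
  std::deque<T> tasks_;
 public:
  void addTask(const T& t) { tasks_.push_back(t); }  // seed or spawn a task

  // Process tasks until none remain; tasks may add further tasks while
  // executing. "Termination detection" is trivial for a single process.
  void process(const std::function<void(TaskCollection&, const T&)>& body) {
    while (!tasks_.empty()) {
      T t = tasks_.front();
      tasks_.pop_front();
      body(*this, t);
    }
  }
};

int main() {
  TaskCollection<int> tc;                      // 1. create a task collection
  tc.addTask(0);                               // 2. seed it with one task
  tc.process([](TaskCollection<int>& c, const int& n) {
    std::printf("executing task %d\n", n);
    if (n < 3) c.addTask(n + 1);               // tasks may spawn more tasks
  });                                          // 3. process until termination
  return 0;
}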
Load Balancers
Greedy localized hierarchical persistence-based load balancing
Retentive work stealing
Greedy Localized Hierarchical Persistence-based LB
Intuition: Satisfy local imbalance first
[Figure: processors 0–5 organized into a hierarchy of nested groups]
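A minimal sketch of the greedy core of such a scheme, under assumptions the slides do not spell out: each task carries the load measured in the previous iteration (the persistence principle), and tasks within one group are assigned greedily to the least-loaded processor. The hierarchical variant applies this within small local groups first and escalates only the residual imbalance to the parent group.

// Greedy persistence-based assignment within one processor group
// (illustrative sketch; the hierarchy and TASCEL specifics are omitted).
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Task {
  int id;       // assumed to be 0..ntasks-1 in this sketch
  double load;  // execution time measured in the previous iteration
};

std::vector<int> greedy_assign(std::vector<Task> tasks, int nprocs) {
  // Heaviest tasks first: the classic greedy (LPT-style) heuristic.
  std::sort(tasks.begin(), tasks.end(),
            [](const Task& a, const Task& b) { return a.load > b.load; });

  // Min-heap of (accumulated load, processor id).
  using Proc = std::pair<double, int>;
  std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> heap;
  for (int p = 0; p < nprocs; ++p) heap.push({0.0, p});

  std::vector<int> owner(tasks.size());
  for (const Task& t : tasks) {
    Proc least = heap.top();   // currently least-loaded processor
    heap.pop();
    owner[t.id] = least.second;
    least.first += t.load;
    heap.push(least);
  }
  return owner;
}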
Retentive Work Stealing
[Figure: split task queue over a buffer of locally executed tasks, with markers head, split, and stail]
addTask(): add a task to the local region
getTask(): remove a task from the local region
releaseToShared(): move tasks from the local region to the shared portion
acquireFromShared(): move tasks from the shared portion back to the local region
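A minimal sketch of this split queue, assuming half-at-a-time release/acquire policies (the slides do not specify how many tasks move per call). The shared region is [stail, split) and the local region is [split, head); capacity checks are omitted.

// Split task queue (illustrative sketch, not TASCEL's implementation).
// The owner works on [split, head) without locking; thieves see only
// the shared region [stail, split).
#include <atomic>
#include <vector>

struct Task { int id; };

class SplitQueue {
  std::vector<Task> buf_;
  int head_ = 0;               // owner-private: next free slot
  std::atomic<int> split_{0};  // boundary between shared and local regions
  std::atomic<int> stail_{0};  // first task available to be stolen
 public:
  explicit SplitQueue(int capacity) : buf_(capacity) {}

  // addTask(): add a task to the local region (owner only).
  void addTask(const Task& t) { buf_[head_++] = t; }

  // getTask(): remove a task from the local region (owner only).
  bool getTask(Task* out) {
    if (head_ == split_.load()) return false;  // local region is empty
    *out = buf_[--head_];
    return true;
  }

  // releaseToShared(): expose half of the local region to thieves.
  void releaseToShared() {
    int s = split_.load();
    split_.store(s + (head_ - s) / 2);
  }

  // acquireFromShared(): reclaim half of the shared region for local use.
  bool acquireFromShared() {
    int s = split_.load(), t = stail_.load();
    if (s == t) return false;                  // nothing left to reclaim
    split_.store(s - (s - t + 1) / 2);
    return true;
  }
  // The thief-side transfer (stail/itail/ctail) is described next.
};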
Retentive Work Stealing
[Figure: split task queue with steal-side markers stail, itail, and ctail]
stail: beginning of tasks available to be stolen
itail: number of tasks that have finished transfer
ctail: past this marker it is safe to reuse the buffer
1. Mark tasks stolen at stail and begin transfer
2. Atomically increment itail on completion of transfer
3. Worker updates ctail when stail == itail
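A hedged sketch of these three markers using C++ atomics; the actual active-message transfer, the bounds check against split, and retries are omitted.

// Steal-completion protocol (illustrative sketch, not TASCEL's code).
#include <atomic>

struct StealMarkers {
  std::atomic<int> stail{0};  // beginning of tasks available to be stolen
  std::atomic<int> itail{0};  // index up to which transfers have finished
  int ctail = 0;              // owner-private: buffer reusable before here
};

// Step 1 (thief): mark ntasks stolen at stail and begin the transfer.
// Returns the start index of the reserved run.
int begin_steal(StealMarkers& m, int ntasks) {
  return m.stail.fetch_add(ntasks);
}

// Step 2 (thief): atomically advance itail when the transfer completes.
void complete_steal(StealMarkers& m, int ntasks) {
  m.itail.fetch_add(ntasks);
}

// Step 3 (owner): once stail == itail, every begun steal has finished,
// so the buffer behind that point is safe to reuse.
void try_advance_ctail(StealMarkers& m) {
  int s = m.stail.load();
  if (m.itail.load() == s) m.ctail = s;
}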
Retentive Work Stealing
Intuition: Stealing indicates poor initial balance
[Figure: seeded local queues for Proc 1 … Proc n versus the tasks each processor actually executed]
Retentive Work Stealing
Active message based work stealing optimized for distributed memory
Exploit persistence across work stealing iterations
Each work stealing phase:
Track the tasks executed by this worker in this iteration
Seed the next iteration with the tasks this worker executed (see the sketch below)
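A minimal sketch of that retention step (names are illustrative; the queue could be the SplitQueue sketched earlier):

// Retentive seeding across work stealing phases (illustrative sketch).
// A worker records the tasks it actually executed; the next phase is
// seeded with that set, so each iteration starts from the balance the
// previous one converged to.
#include <vector>

struct Task { int id; };

class RetentiveWorker {
  std::vector<Task> executed_;  // tasks this worker ran this iteration
 public:
  void record(const Task& t) { executed_.push_back(t); }

  // Seed the next iteration's queue with this worker's executed tasks.
  template <typename Queue>
  void reseed(Queue& q) {
    for (const Task& t : executed_) q.addTask(t);
    executed_.clear();          // start tracking the next iteration afresh
  }
};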
Experimental Setup
Multi-threaded MPI; one core per node for active messages
“Flat” execution: each core is an independent worker

System              Nodes   Cores per node   Memory per node   Max cores in queue
Hopper (Cray XE6)    6384   24               32 GB             146,400
Intrepid (BG/P)     40960    4                4 GB             163,840
Titan (Cray XK6)    18688   16               32 GB             298,592
Hartree-Fock Benchmark
Basis for several electronic structure theories
Two-electron contribution
Schwarz screening: data-dependent sparsity screening at runtime
Tasks vary in size from milliseconds to seconds

                 HF-Be512 (20)   HF-Be512 (40)
Total tasks      2.2x10^10       1.4x10^9
Non-null tasks   9.1x10^6        8.6x10^5
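For context on why so few tasks are non-null: Schwarz screening bounds each two-electron integral by |(ij|kl)| <= sqrt((ij|ij) * (kl|kl)), so a task is executed only when that bound exceeds a runtime threshold. A small illustrative test (the names s_ij, s_kl, and threshold are this sketch's, not the benchmark's):

// Schwarz screening test (illustrative sketch).
#include <cmath>

// s_ij = (ij|ij) and s_kl = (kl|kl) are precomputed diagonal integrals;
// the task survives only if the Schwarz bound exceeds the threshold.
bool survives_screening(double s_ij, double s_kl, double threshold) {
  return std::sqrt(s_ij * s_kl) >= threshold;
}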
Hopper: Performance
Persistence-based load balancing “converges” faster
Retentive stealing also improves efficiency
Stealing effective even with limited parallelism
[Figure: efficiency vs. core count and avg. tasks per core, for persistence-based load balancing and retentive stealing]
Intrepid: Performance
Much worse performance for the first iteration
Converges to a better efficiency than on Hopper
[Figure: efficiency vs. core count and avg. tasks per core, for persistence-based load balancing and retentive stealing]
Titan: Performance
Behavior similar to that on Intrepid
[Figure: efficiency vs. core count and avg. tasks per core, for persistence-based load balancing and retentive stealing]
Intrepid: Num. Steals
Retentive stealing stabilizes stealing costs
Similar trends on all systems
[Figure: attempted and successful steals vs. core count]
Utilization
HF-Be256 on 9,600 cores of Hopper
Initial stealing has high costs during ramp-down
Retentive stealing does a better job of reducing this cost
[Figure: utilization (%) over time for Steal (13.6 s), StealRet-final (12.6 s), and PLB (12.2 s)]
Summary of Insights
Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan
Retentive stealing and persistence-based load balancing perform comparably
Retentive stealing incrementally improves balance
Number of steals does not grow substantially with scale
Greedy hierarchical persistence-based load balancer achieves load balance quality comparable to a centralized scheme (details in the paper)