TRANSCRIPT
Work Stealing and Persistence-based Load Balancers for Iterative
Overdecomposed Applications
Jonathan Lifflander, UIUC
Sriram Krishnamoorthy, PNNL
Laxmikant Kale, UIUC
HPDC 2012
Iterative Applications
Applications repeatedly executing the same computation
Static or slowly evolving execution characteristics
Execution characteristics preclude static balancing:
Application characteristics (comm. pattern, sparsity, …)
Execution environment (topology, asymmetry, …)
Challenge: Load-balancing such applications
Overdecomposition
Expose greater levels of concurrency than supported by the hardware
Middleware (runtime) dynamically maps the concurrent tasks to hardware resources
Abstraction supports continuous optimization and adaptation
Improvements to load balancing:
New metrics (power, energy, graceful degradation, …)
New features: fault tolerance, power/energy awareness
Problem Statement
Scalable load balancers for iterative overdecomposed applications
We consider two alternatives:
Persistence-based load balancing
Work stealing
How do these algorithms behave at scale?
How do they compare?
Related Work
Overdecomposition is a widely used approach
Inspector-executor approaches employ start-time load balancers
Prior hierarchical load balancers typically do not consider locality
Scalability of work stealing not well understood – largest prior demonstration was on 8192 cores
No comparative evaluation of the two schemes
TASCEL: Task Scheduling Library
Runtime library for task-parallel programs
Manages task collections for execution on distributed memory machines
Compatible with native MPI programs
Phase-based switch between SPMD and non-SPMD modes of execution
TASCEL Execution
Task: basic unit of migratable execution
Typical workflow (sketched after this list):
Create a task collection
Seed it with one or more tasks
Process tasks in the collection until termination detection
Processing of task collections manages concurrency, faults, …
Trade-offs exposed through implementation specializations
Dynamic load balancing schemes
Fault tolerance protocols
…
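As a rough illustration of this workflow, here is a minimal single-process C++ mock. The TaskCollection type and its addTask()/process() methods are stand-ins chosen for this sketch, not TASCEL's actual API; with one process, termination detection degenerates to an empty queue.

// Minimal single-process mock of the workflow above (illustrative only).
#include <cstdio>
#include <deque>
#include <functional>

template <typename T>
class TaskCollection {
  std::deque<T> tasks_;
 public:
  void addTask(const T& t) { tasks_.push_back(t); }  // seed or spawn a task

  // Process tasks until none remain; tasks may add further tasks while
  // executing. "Termination detection" is trivial for a single process.
  void process(const std::function<void(TaskCollection&, const T&)>& body) {
    while (!tasks_.empty()) {
      T t = tasks_.front();
      tasks_.pop_front();
      body(*this, t);
    }
  }
};

int main() {
  TaskCollection<int> tc;                      // 1. create a task collection
  tc.addTask(0);                               // 2. seed it with one task
  tc.process([](TaskCollection<int>& c, const int& n) {
    std::printf("executing task %d\n", n);
    if (n < 3) c.addTask(n + 1);               // tasks may spawn more tasks
  });                                          // 3. process until termination
  return 0;
}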
Load Balancers
Greedy localized hierarchical persistence-based load balancing
Retentive work stealing
Greedy Localized Hierarchical Persistence-based LB
Intuition: Satisfy local imbalance first
[Figure: processors 0–5 organized into a hierarchy of nested groups]
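A minimal sketch of the greedy core of such a scheme, under assumptions the slides do not spell out: each task carries the load measured in the previous iteration (the persistence principle), and tasks within one group are assigned greedily to the least-loaded processor. The hierarchical variant applies this within small local groups first and escalates only the residual imbalance to the parent group.

// Greedy persistence-based assignment within one processor group
// (illustrative sketch; the hierarchy and TASCEL specifics are omitted).
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Task {
  int id;       // assumed to be 0..ntasks-1 in this sketch
  double load;  // execution time measured in the previous iteration
};

std::vector<int> greedy_assign(std::vector<Task> tasks, int nprocs) {
  // Heaviest tasks first: the classic greedy (LPT-style) heuristic.
  std::sort(tasks.begin(), tasks.end(),
            [](const Task& a, const Task& b) { return a.load > b.load; });

  // Min-heap of (accumulated load, processor id).
  using Proc = std::pair<double, int>;
  std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> heap;
  for (int p = 0; p < nprocs; ++p) heap.push({0.0, p});

  std::vector<int> owner(tasks.size());
  for (const Task& t : tasks) {
    Proc least = heap.top();   // currently least-loaded processor
    heap.pop();
    owner[t.id] = least.second;
    least.first += t.load;
    heap.push(least);
  }
  return owner;
}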
Retentive Work Stealing
[Figure: split task queue over a buffer of locally executed tasks, with markers head, split, and stail]
addTask(): add a task to the local region
getTask(): remove a task from the local region
releaseToShared(): move tasks from the local region to the shared portion
acquireFromShared(): move tasks from the shared portion back to the local region
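A minimal sketch of this split queue, assuming half-at-a-time release/acquire policies (the slides do not specify how many tasks move per call). The shared region is [stail, split) and the local region is [split, head); capacity checks are omitted.

// Split task queue (illustrative sketch, not TASCEL's implementation).
// The owner works on [split, head) without locking; thieves see only
// the shared region [stail, split).
#include <atomic>
#include <vector>

struct Task { int id; };

class SplitQueue {
  std::vector<Task> buf_;
  int head_ = 0;               // owner-private: next free slot
  std::atomic<int> split_{0};  // boundary between shared and local regions
  std::atomic<int> stail_{0};  // first task available to be stolen
 public:
  explicit SplitQueue(int capacity) : buf_(capacity) {}

  // addTask(): add a task to the local region (owner only).
  void addTask(const Task& t) { buf_[head_++] = t; }

  // getTask(): remove a task from the local region (owner only).
  bool getTask(Task* out) {
    if (head_ == split_.load()) return false;  // local region is empty
    *out = buf_[--head_];
    return true;
  }

  // releaseToShared(): expose half of the local region to thieves.
  void releaseToShared() {
    int s = split_.load();
    split_.store(s + (head_ - s) / 2);
  }

  // acquireFromShared(): reclaim half of the shared region for local use.
  bool acquireFromShared() {
    int s = split_.load(), t = stail_.load();
    if (s == t) return false;                  // nothing left to reclaim
    split_.store(s - (s - t + 1) / 2);
    return true;
  }
  // The thief-side transfer (stail/itail/ctail) is described next.
};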
Retentive Work Stealing
[Figure: split task queue with steal-side markers stail, itail, and ctail]
stail: beginning of tasks available to be stolen
itail: number of tasks that have finished transfer
ctail: past this marker it is safe to reuse the buffer
1. Mark tasks stolen at stail and begin transfer
2. Atomically increment itail on completion of transfer
3. Worker updates ctail when stail == itail
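A hedged sketch of these three markers using C++ atomics; the actual active-message transfer, the bounds check against split, and retries are omitted.

// Steal-completion protocol (illustrative sketch, not TASCEL's code).
#include <atomic>

struct StealMarkers {
  std::atomic<int> stail{0};  // beginning of tasks available to be stolen
  std::atomic<int> itail{0};  // index up to which transfers have finished
  int ctail = 0;              // owner-private: buffer reusable before here
};

// Step 1 (thief): mark ntasks stolen at stail and begin the transfer.
// Returns the start index of the reserved run.
int begin_steal(StealMarkers& m, int ntasks) {
  return m.stail.fetch_add(ntasks);
}

// Step 2 (thief): atomically advance itail when the transfer completes.
void complete_steal(StealMarkers& m, int ntasks) {
  m.itail.fetch_add(ntasks);
}

// Step 3 (owner): once stail == itail, every begun steal has finished,
// so the buffer behind that point is safe to reuse.
void try_advance_ctail(StealMarkers& m) {
  int s = m.stail.load();
  if (m.itail.load() == s) m.ctail = s;
}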
Retentive Work Stealing
Intuition: Stealing indicates poor initial balance
[Figure: seeded local queues for Proc 1 … Proc n versus the tasks each processor actually executed]
Retentive Work Stealing
Active message based work stealing optimized for distributed memory
Exploit persistence across work stealing iterations
Each work stealing phase:
Track the tasks executed by this worker in this iteration
Seed the next iteration with the tasks this worker executed (see the sketch below)
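A minimal sketch of that retention step (names are illustrative; the queue could be the SplitQueue sketched earlier):

// Retentive seeding across work stealing phases (illustrative sketch).
// A worker records the tasks it actually executed; the next phase is
// seeded with that set, so each iteration starts from the balance the
// previous one converged to.
#include <vector>

struct Task { int id; };

class RetentiveWorker {
  std::vector<Task> executed_;  // tasks this worker ran this iteration
 public:
  void record(const Task& t) { executed_.push_back(t); }

  // Seed the next iteration's queue with this worker's executed tasks.
  template <typename Queue>
  void reseed(Queue& q) {
    for (const Task& t : executed_) q.addTask(t);
    executed_.clear();          // start tracking the next iteration afresh
  }
};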
Experimental Setup
Multi-threaded MPI; one core per node for active messages
“Flat” execution: each core is an independent worker

System              Nodes   Cores per node   Memory per node   Max cores in queue
Hopper (Cray XE6)    6384   24               32 GB             146,400
Intrepid (BG/P)     40960    4                4 GB             163,840
Titan (Cray XK6)    18688   16               32 GB             298,592
Hartree-Fock Benchmark
Basis for several electronic structure theories
Two-electron contribution
Schwarz screening: data-dependent sparsity screening at runtime
Tasks vary in size from milliseconds to seconds

                 HF-Be512 (20)   HF-Be512 (40)
Total tasks      2.2x10^10       1.4x10^9
Non-null tasks   9.1x10^6        8.6x10^5
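For context on why so few tasks are non-null: Schwarz screening bounds each two-electron integral by |(ij|kl)| <= sqrt((ij|ij) * (kl|kl)), so a task is executed only when that bound exceeds a runtime threshold. A small illustrative test (the names s_ij, s_kl, and threshold are this sketch's, not the benchmark's):

// Schwarz screening test (illustrative sketch).
#include <cmath>

// s_ij = (ij|ij) and s_kl = (kl|kl) are precomputed diagonal integrals;
// the task survives only if the Schwarz bound exceeds the threshold.
bool survives_screening(double s_ij, double s_kl, double threshold) {
  return std::sqrt(s_ij * s_kl) >= threshold;
}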
Hopper: Performance
Persistence-based load balancing “converges” faster
Retentive stealing also improves efficiency
Stealing effective even with limited parallelism
[Figure: efficiency vs. core count and avg. tasks per core, for persistence-based load balancing and retentive stealing]
Intrepid: Performance
Much worse performance for the first iteration
Converges to a better efficiency than on Hopper
[Figure: efficiency vs. core count and avg. tasks per core, for persistence-based load balancing and retentive stealing]
Titan: Performance
Behavior similar to that on Intrepid
[Figure: efficiency vs. core count and avg. tasks per core, for persistence-based load balancing and retentive stealing]
Intrepid: Num. Steals
Retentive stealing stabilizes stealing costs
Similar trends on all systems
[Figure: attempted and successful steals vs. core count]
Utilization
HF-Be256 on 9,600 cores of Hopper
Initial stealing has high costs during ramp-down
Retentive stealing does a better job of reducing this cost
[Figure: utilization (%) over time for Steal (13.6 s), StealRet-final (12.6 s), and PLB (12.2 s)]
Summary of Insights
Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan
Retentive stealing and persistence-based load balancing perform comparably
Retentive stealing incrementally improves balance
Number of steals does not grow substantially with scale
Greedy hierarchical persistence-based load balancer achieves load balance quality comparable to a centralized scheme (details in the paper)