optimized distributed work-stealingvivkumar.github.io/papers/slides_ia3_2016.pdf · vivek kumar1,...
Post on 25-Aug-2020
1 Views
Preview:
TRANSCRIPT
Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2
1 Rice University 2 Lawrence Berkeley National Laboratory
Optimized Distributed Work-Stealing
Cores/Socket System Share in Top500
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
1 (45%)
2 (55%)
Multicore Nodes in Supercomputers
Graph plotted using the data obtained from https://www.top500.org/statistics/list/
November 2006 June 2016
• Productivity – Several existing APIs for scientific computing – Hard to parallelize complex irregular computations
using existing APIs • Ideal candidate for runtime based global load-balancing
• Performance on multicore nodes – Using a process per core (e.g., MPI everywhere)
on a node not scalable – Hybrid programming using thread pool per node
• How to design a high performance implementation of global load-balancing
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Problem Statement
3
Productivity and Performance Challenge
4
✔ Library-based API in a PGAS library to express irregular computations
C++11 lambda function based API that provides serial elision
✔ Novel implementation of distributed work-stealing That introduces a new victim selection policy that avoid all
inter-node failed steals
✔ Detailed performance study That demonstrates the benefit using scaling irregular applications up to
12k cores of Edison supercomputer
✔ Results That shows that our approach delivers performance benefits up to 7%
Contributions
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Load Balancing using Work-Stealing
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Motivating Analysis
5
w1
Work-stealing in a thread pool
pop tasks
push tasks
w2 w3 w4
steal tasks deque
tail
head
Core Core Core Core
• Thread pool (intra-node) based implementations perform stealing using low overhead CAS operations
Distributed Work-Stealing
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Motivating Analysis
6
Intra-node steals
Inter-node steals
• Inter-node steals are much costlier than intra-node steals
Failed Steal Attempts
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Motivating Analysis
7
• Thief fails to steal a task from victim
Intra-node failed steals
Inter-node failed steals
Inter-node failed steals are more
costly than intra-node steals
Chances to fail with
same victim multiple times
Our Approach
• Use HabaneroUPC++ PGAS library for multicore cluster [Kumar et. al., PGAS 2014] – Several asynchronous tasking APIs
• Provide a programming model to express irregular computation
• Implement a high performance distributed work-stealing runtime that completely removes all inter-node failed steal attempts
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016 8
Nod
e_A
Nod
e_X
One process with a thread pool at each node
HabaneroUPC++ Programming Model
• C++11 lambda-function based API • Provides serial elision and improves productivity
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Implementation
9
asyncAny ( [=] {
irregular_computation();
}); //distributed work-stealing
Distributed Work-Stealing Runtime
• Two different implementations in HabaneroUPC++ • BaselineWS
– Uses prior work + some optimizations
• SuccessOnlyWS – Extends BaselineWS by using a novel victim selection
policy that complete removes all inter-node failed steals
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Implementation
10
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Step-1: Failed steal at
intra-node level
Step-2: Local worker
request the leader for inter-node steal
Step-6: Remote victim can send tasks, else it’s a inter-node failed steal
BaselineWS in HabaneroUPC++
Step-4: Lock and wait for tasks from victim
Step-3: Leader finds a victim that has sufficient number of tasks (RDMA)
11
Implementation
Step-5: Remote leader attempts to steal tasks from its local thread-pool
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Step-1: Failed steal at
intra-node level
Step-2: Local worker
request the leader for inter-node steal
SuccessOnlyWS in HabaneroUPC++
Step-3: Leader finds a victim that has sufficient number of tasks (RDMA)
12
Implementation
Step-5: Remote leader attempts to steal tasks from its local thread-pool
Step-6: Repeat Step-3 and Step-4
Step-4: Asynchronous task request
Step-7: One or more remote victim will send tasks. Break out of Step-5 (also if application terminates)
Methodology
• Benchmarks – Two UTS trees T1WL and T3WL – NQueens
• Computing infrastructure – Edison supercomputer at NERSC
• 2x12 cores per node
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Experimental Evaluation
13
Results
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Experimental Evaluation
14
1
10
100
1000
10000
100000
1e+06
1536 3072 6144 12288
To
tal f
aile
d s
tea
ls (
log
sca
le)
Total cores
T1WL T3WL NQ
1
1.01
1.02
1.03
1.04
1.05
1.06
1.07
1.08
1.09
1.1
1536 3072 6144 12288
Sp
ee
du
p o
ver
Ba
selin
eW
S
Total cores
T1WL T3WL NQ
Note: More results are available in the paper
(a) Total inter-node failed steals in BaselineWS
(b) SuccessOnlyWS speedup over BaselineWS
Total cores (24 cores/node) Total cores (24 cores/node)
Higher inter-node failed steals in BaselineWS => Better performance in SuccessOnlyWS
T1WL T3WL NQueens
Summary and Conclusion • Inter-node steals are costlier than intra-node steals • Failed inter-node steals could hamper performance • C++11 lambda function based API to in
HabaneroUPC++ to express complex irregular computation that can participate in distributed work-stealing
• A novel implementation of distributed work-stealing runtime in HabaneroUPC++ PGAS library that completely removes all inter-node failed steals
• Our novel runtime delivers performance benefits up to 7%
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016 15
Summary
Backup Slides
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Existing Techniques for Inter-node Stealing
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016 17
• Thread pool based hybrid runtimes [Lifflander et. al., HPDC’12, Paudel et. al., ICPP’13]
• Communication worker maintain ready queue of tasks even before a remote request arrives [Paudel et. al., ICPP’13]
• Load-aware steal attempts to reduce chances of failure [Dinan et. al., ICPP’08]
• First try random victims and on failing contact set of victims (lifelines) that promises to send tasks whenever they have it ready [Saraswat et. al., PPoPP’11]
Related Work
Inter-node Steal Request from Thief
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Implementation
18
procedure Steal_AsyncAny while (global termination is not detected) V = get a random remote rank if (V has declared task availability in PGAS space) // RDMA if (I did not try to steal from V) queue my rank at V if TryLock (V) is success
save my rank at V wait until V send tasks or decline
Unlock (V) break from while loop if I just received asyncAny tasks if I receive asyncAny tasks from any victim forget that I contacted this victim reset my task receiving status
BaselineWS Runtime
SuccessOnlyWS Runtime 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Inter-node Task Transfer from Victim
Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016
Implementation
19
procedure Send_AsyncAny while (there are pending inter-node steal requests) T = get rank of one of the queued remote thief steal tasks from my local workers and send to T forget that T contacted me break out of the while loop if local steal failed
T = get rank of the only waiting remote thief steal tasks from local workers and send to T declare that now I don’t have any waiting remote thief publish in PGAS space asyncAny count at my place
BaselineWS Runtime
SuccessOnlyWS Runtime
1 2 3 4 5 6 7 8 9 10 11
top related