Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
Azzam Haidar, Yulu Jia, Piotr Luszczek, Stanimire Tomov, Asim YarKhan, Jack Dongarra



Page 2:

Problem Statement: Factorization

11/16/15, Austin, TX, USA ScalA 2015 2

Ax = b,   PA = LU,   Ly = Pb,   Ux = y

Kernels: GETF2, TRSM, GEMM
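The solve sequence on this slide can be sketched in plain Python. This is a minimal unblocked illustration under stated assumptions, not the blocked GETF2/TRSM/GEMM implementation the talk describes. The names `lu_factor` and `lu_solve` are hypothetical, and the permutation is stored as an index list `perm` in which row i of the factored matrix is row `perm[i]` of A.

```python
def lu_factor(a):
    """Return (lu, perm) using partial pivoting; L (unit diagonal) and U share `lu`."""
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy of A
    perm = list(range(n))               # row i of the result is row perm[i] of A
    for k in range(n):
        # pivoting: pick the row with the largest entry in column k
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                      # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]       # Schur complement update
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b from the factors returned by lu_factor."""
    n = len(lu)
    # forward substitution with the row-permuted right-hand side
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    # back substitution: U x = y
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
lu, perm = lu_factor(a)
x = lu_solve(lu, perm, b)
```

For this small system the exact solution is the all-ones vector, so `x` should match it up to rounding.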

Page 3:

Problem Statement: Algorithm

Ax = b,   PA = LU,   Ly = Pb,   Ux = y

Kernels: GETF2, TRSM, GEMM

Page 4:

Problem Statement: Mapping to Hardware

Ax = b,   PA = LU,   Ly = Pb,   Ux = y

Kernels: GETF2, TRSM, GEMM

Page 5:

From Code to DAG

Start with canonical code:

    GETRF(A[1:n, 1:n]) {
      for i = 1, nb, 2*nb, … {
        GETF2(A[i:n, i:i+nb])
        TRSM(A[i:i+nb, i:i+nb], A[i:i+nb, i:n])
        GEMM(A[i:n, i:i+nb], A[i:i+nb, i:n], A[i+nb:n, i+nb:n])
      }
    }

Unwind the function calls:

    GETF2(A[1,1], A[2,1], …)
    TRSM(A[1,1], A[1,2])
    TRSM(A[1,1], A[1,3])
    GEMM(A[2,1], A[1,2], A[2,2])

Track dependences:

The function calls and their dependences form a DAG.

Page 6:

Matrix-View of Dependences

[Figure: matrix view of dependences; labels: GPU, Xeon Phi]

Page 7:

From Code to DAG

Page 8:

Asynchronous Algorithm

Panel factorization

Pivoting (row swaps)

Triangular solve

Matrix multiply (Schur complement)

Schedule tasks according to:
- data sizes
- hardware weights
- affinity
- perform lookahead
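One way to picture lookahead is as a priority rule over ready tasks. The sketch below is an assumption made for illustration, not the scheduler's actual policy: the hypothetical `priority` function marks updates to the next `lookahead` panel columns as urgent, so the next panel factorization can start before the remaining trailing updates of the current step have run.

```python
import heapq


def priority(step, col, lookahead=1):
    """Sort key for ready tasks; smaller keys run earlier.

    Tasks touching a column within `lookahead` of the current step are urgent.
    """
    urgent = col <= step + lookahead
    return (0 if urgent else 1, step, col)


ready = []
# step-0 updates become ready in arbitrary order
for col in (3, 1, 2):
    heapq.heappush(ready, (priority(0, col), f"GEMM(step=0,col={col})"))

# the update of column 1 is served first because it unblocks the next panel
order = [heapq.heappop(ready)[1]]

# the next panel is now ready and overtakes the remaining trailing updates
heapq.heappush(ready, (priority(1, 1), "GETF2(step=1)"))
while ready:
    order.append(heapq.heappop(ready)[1])
```

With `lookahead=1`, `order` places `GETF2(step=1)` ahead of the step-0 updates of columns 2 and 3. A larger `lookahead` widens the set of urgent columns.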

Page 9:

Main Features of the Scheduler

•  Resource capability weight for the task
   – w(kernel, device)
•  Adaptive scheduling with weights
•  Dynamic lookahead
•  Enhanced task priorities
   – Regulates lookahead (DAG exploration order)
•  Transparent data movement
   – Uses (and tracks) asynchronous data transfers
•  Dynamic data redistribution
   – Data is moved as needed and only if needed
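A capability weight w(kernel, device) could feed a device choice along these lines. Everything in the sketch is an assumption made for illustration: the weight values are invented, `pick_device` is a hypothetical helper, and the cost model (queue drain time, plus a flat transfer penalty when the data lives elsewhere, plus work divided by weight) is only one plausible way to combine data size, hardware weight, and affinity.

```python
# Hypothetical relative speeds w(kernel, device); larger means faster.
# The numbers are invented for illustration, not measured.
WEIGHTS = {
    ("GETF2", "cpu"): 1.0, ("GETF2", "gpu"): 0.2,
    ("GEMM", "cpu"): 1.0, ("GEMM", "gpu"): 8.0,
}


def pick_device(kernel, work, busy_until, data_on, transfer_cost=1.0):
    """Choose the device with the earliest estimated finish time.

    work          -- task size in arbitrary units (stands in for data size)
    busy_until    -- {device: time at which its queue drains}
    data_on       -- device that already holds the task's data (affinity)
    transfer_cost -- assumed flat penalty for moving the data elsewhere
    """
    def finish(dev):
        move = 0.0 if dev == data_on else transfer_cost
        return busy_until[dev] + move + work / WEIGHTS[(kernel, dev)]

    return min(busy_until, key=finish)


busy = {"cpu": 0.0, "gpu": 0.0}
gemm_dev = pick_device("GEMM", work=16.0, busy_until=busy, data_on="cpu")
panel_dev = pick_device("GETF2", work=2.0, busy_until=busy, data_on="cpu")
small_gemm_dev = pick_device("GEMM", work=1.0, busy_until=busy, data_on="cpu")
```

Under these made-up weights the large GEMM goes to the GPU, while the panel factorization and the small GEMM stay on the CPU because the transfer penalty outweighs the faster kernel.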

Page 10:

Runtime Scheduler Loop

    def main_thread_loop(user_code, queues, threshold):
        # if there are enough tasks for cores
        if queues.total_length() > threshold:
            # resume user's code for submitting tasks
            task = user_code.get_next_task()
            q = queues.find_closest_queue(task.devices())
            q.insert(task, task.priority())
        else:
            task = queues.steal_task(main_cpu)
            # if tasks available for stealing
            if not task.is_empty():
                task.execute()  # execute single task
            else:
                queues.wait_for_tasks()
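The queue operations the loop relies on (`total_length`, `insert`, `steal_task`) might look roughly like the sketch below. The classes `DeviceQueue` and `Queues` are hypothetical stand-ins, not the runtime's data structures. The sketch assumes larger priority values are served first, that a thief takes the top task from the longest other queue, and that `steal_task` returns `None` when nothing is available (the slide instead checks `task.is_empty()`).

```python
import heapq
import itertools


class DeviceQueue:
    """Priority queue for one device; larger priority values are served first."""

    def __init__(self, device):
        self.device = device
        self._heap = []
        self._seq = itertools.count()   # tie-breaker preserves insertion order

    def insert(self, task, priority):
        heapq.heappush(self._heap, (-priority, next(self._seq), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


class Queues:
    """One DeviceQueue per device, plus the global operations used by the loop."""

    def __init__(self, devices):
        self.by_device = {d: DeviceQueue(d) for d in devices}

    def total_length(self):
        return sum(len(q) for q in self.by_device.values())

    def steal_task(self, thief):
        """Take the top task from the longest non-empty queue other than the thief's."""
        victims = [q for d, q in self.by_device.items()
                   if d != thief and len(q) > 0]
        if not victims:
            return None
        return max(victims, key=len).pop()


queues = Queues(["cpu0", "gpu0"])
queues.by_device["gpu0"].insert("GEMM(2,2)", priority=1)
queues.by_device["gpu0"].insert("GEMM(1,1)", priority=5)
stolen = queues.steal_task("cpu0")
```

Here `stolen` is the higher-priority task, `GEMM(1,1)`, and one task remains queued on `gpu0`. Whether a stolen task can actually run on the thief's device is left out of this sketch.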

Page 11:

Effect of Dynamic Scheduling

Page 12:

Performance: Kepler K20

Page 13:

Performance: Xeon Phi

Page 14:

Performance: Xeon Phi + Kepler K40

Page 15:

Summary of Contributions

•  Fine- and coarse-grained tasks for scheduling
•  Capability weights for hardware description
•  Unified scheduling across CPUs, GPUs, and coprocessors
•  Synchronous memory-transfer model with transparent asynchronous progress