legion: programming distributed heterogeneous ......legion: programming distributed heterogeneous...

1
Tasks are the unit of computation in Legion - Tasks name logical regions and fields they will access at dispatch - Each logical region and field annotated with privilege: read-only, read-write, reduce - Tasks can recursively launch arbitrary sub-tasks for nested parallelism Tasks issued in program order, parallelism inferred from non-interference - Sub-region non-interference: accessing disjoint sub-regions - Field non-interference: accessing disjoint sets of fields - Privilege non-interference: performing non-interfering access (e.g. both read-only) Legion runtime operates similarly to a hardware out-of-order processor Legion: Programming Distributed Heterogeneous Architectures with Logical Regions Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken Stanford University Overview Partitioning Implicit Task Parallelism S3D Evaluation Three Important Trends 1. Cost of Data Movement Dominates 2. Dynamism in Both Hardware and Software 3. Heterogeneity of Processors and Memory Legion Goals 1. Abstractions for logically describing program data to minimize data movement 2. Emphasize runtime decision making to handle dynamism 3. Decouple specification from mapping to handle heterogeneity Sub-Region Non-Interference Mapping Entry Field Logical Regions: A Relational Abstraction for Data Logical regions describe data in a relation style: entries (rows) and fields (columns). Support some relational operators for providing different views on data. - Partitioning is selection (σ) - Field-slicing is projection (π) Relations can encode arbitrary data structures. Field Field Field Entry Entry Entry Entry Entry Entry Entry Entry Logical regions can be partitioned into sub-regions - Partitions are computed dynamically, either disjoint or aliased - Logical regions can be partitioned recursively - Logical regions support multiple partitions Logical region trees capture important properties of program data - Independence: disjoint logical regions in same partition - Locality: elements in same logical regions - Sub-region: parent and child logical regions - Aliasing: aliased partitions and different partitions of same region N S P p 1 p n s 1 s n g 1 g n disjoint possibly overlap Field Field Field Field Field Field Dep. Analysis Distribute Map Execute Resolve Spec. Complete Commit Field Non-Interference Privilege Non-Interference Legion applications are mapped onto different architectures - Tasks assigned to processors, region instances assigned to memories - All decisions are programmatically available via mapper interface - Mapper objects can make dynamic decisions based on runtime information Mapping decisions are independent of correctness - Makes tuning Legion applications simple - Porting can be performed easily and only requires writing a new mapper Mappers are customizable and composable - Provide a default mapper with heuristics - Can write Legion applications and gradually refine mapping by overriding virtual functions - Different mappers in same application t 1 t 2 t 3 t 4 t 5 r c r w r w1 r w2 r n r n1 r n2 $ $ $ N U M A N U M A FB D R A M x86 CUDA x86 x86 x86 Production combustion simulation from the Department of Energy - Full scale application consisting of more than 200K lines of Fortran MPI - Significant task and data level parallelism - Data movement is the limiting factor Ported main simulation loop (100K lines) of S3D into Legion C++ - Interoperate with MPI version - Legion automatically discovers significantly task level parallelism - Can explore different mapping running on Titan, worlld’s #2 supercomputer Many applications ported to Legion, showing both strong and weak scaling - Circuit simulation, AMR, fluid flow, unstructured mesh, S3D - Running on Titan and Keeneland supercomputers at Oak Ridge National Lab - Can handle CPU+GPU, Infiniband, Gemini, Aires, NUMA, … S3D performance results demonstrate that Legion can weak scale - Compare against MPI+OpenACC version tuned by NVIDIA and Cray teams - Take best Legion mapping after trying many different tuning techniques - Between 2.0 and 2.27X faster than MPI+OpenACC 8 16 32 64 128 256 512 Nodes 0 50000 100000 150000 200000 250000 Throughput (Points/s) Legion 64 3 MPI+OpenACC 64 3 8 16 32 64 128 256 512 1024 Nodes 0 20000 40000 60000 80000 100000 120000 140000 Throughput (Points/s) Legion 48 3 MPI+OpenACC 48 3 DME Mechanism Heptane Mechanism

Upload: others

Post on 24-Sep-2020

18 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Legion: Programming Distributed Heterogeneous ......Legion: Programming Distributed Heterogeneous Architectures with Logical Regions Michael Bauer, Sean Treichler, Elliott Slaughter,

Tasks are the unit of computation in Legion -  Tasks name logical regions and fields they will access at dispatch -  Each logical region and field annotated with privilege: read-only, read-write, reduce -  Tasks can recursively launch arbitrary sub-tasks for nested parallelism

Tasks issued in program order, parallelism inferred from non-interference -  Sub-region non-interference: accessing disjoint sub-regions -  Field non-interference: accessing disjoint sets of fields -  Privilege non-interference: performing non-interfering access (e.g. both read-only) Legion runtime operates similarly to a hardware out-of-order processor

Legion: Programming Distributed Heterogeneous Architectures with Logical Regions

Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken Stanford University

Overview

Partitioning

Implicit Task Parallelism S3D

Evaluation

Three Important Trends 1.  Cost of Data Movement Dominates 2.  Dynamism in Both Hardware and Software 3.  Heterogeneity of Processors and Memory Legion Goals 1.  Abstractions for logically describing program data to minimize data movement 2.  Emphasize runtime decision making to handle dynamism 3.  Decouple specification from mapping to handle heterogeneity

Sub-Region Non-Interference

Mapping

Entry

Field Logical Regions: A Relational Abstraction for Data Logical regions describe data in a relation style: entries (rows) and fields (columns). Support some relational operators for providing different views on data. - Partitioning is selection (σ) - Field-slicing is projection (π) Relations can encode arbitrary data structures.

Field Field Field

Entry

Entry

Entry

Entry

Entry

Entry

Entry

Entry

Logical regions can be partitioned into sub-regions -  Partitions are computed dynamically, either disjoint or aliased -  Logical regions can be partitioned recursively -  Logical regions support multiple partitions

Logical region trees capture important properties of program data -  Independence: disjoint logical regions in same partition -  Locality: elements in same logical regions -  Sub-region: parent and child logical regions -  Aliasing: aliased partitions and different partitions of same region

N

SP

p1 pn … s1 sn … g1 gn …

disjoint

possibly overlap

Field Field Field Field Field Field

Dep. Analysis Distribute Map Execute Resolve

Spec. Complete Commit

Field Non-Interference Privilege Non-Interference

Legion applications are mapped onto different architectures -  Tasks assigned to processors, region instances assigned to memories -  All decisions are programmatically available via mapper interface -  Mapper objects can make dynamic decisions based on runtime information

Mapping decisions are independent of correctness -  Makes tuning Legion applications simple -  Porting can be performed easily and only requires writing a new mapper Mappers are customizable and composable -  Provide a default mapper with heuristics -  Can write Legion applications and gradually refine mapping by overriding virtual functions -  Different mappers in same application

1  

t1!

t2!

t3!

t4!t5!

rc

rw

rw1 rw2

rn

rn1 rn2

$

$

$

$

N U M A

N U M A

FB

D R A M

x86

CUDA

x86

x86

x86

Production combustion simulation from the Department of Energy -  Full scale application consisting of more than 200K lines of Fortran MPI -  Significant task and data level parallelism -  Data movement is the limiting factor

Ported main simulation loop (100K lines) of S3D into Legion C++ -  Interoperate with MPI version -  Legion automatically discovers significantly task level parallelism -  Can explore different mapping running on Titan, worlld’s #2 supercomputer

Many applications ported to Legion, showing both strong and weak scaling -  Circuit simulation, AMR, fluid flow, unstructured mesh, S3D -  Running on Titan and Keeneland supercomputers at Oak Ridge National Lab -  Can handle CPU+GPU, Infiniband, Gemini, Aires, NUMA, …

S3D performance results demonstrate that Legion can weak scale -  Compare against MPI+OpenACC version tuned by NVIDIA and Cray teams -  Take best Legion mapping after trying many different tuning techniques -  Between 2.0 and 2.27X faster than MPI+OpenACC

8 16 32 64 128 256 512Nodes

0

50000

100000

150000

200000

250000

Thro

ughp

ut(P

oint

s/s)

Legion 643

MPI+OpenACC 643

8 16 32 64 128 256 512 1024Nodes

0

20000

40000

60000

80000

100000

120000

140000

Thro

ughp

ut(P

oint

s/s)

Legion 483

MPI+OpenACC 483

DME Mechanism Heptane Mechanism