the scala programming language - delite: a framework for...

27
Delite: A Framework for Heterogeneous Parallel DSLs Hassan Chafi, Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Kunle Olukotun Stanford University Tiark Rompf, Martin Odersky EPFL

Upload: others

Post on 27-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Delite: A Framework for

Heterogeneous Parallel DSLs Hassan Chafi, Arvind Sujeeth, Kevin Brown,

HyoukJoong Lee, Kunle Olukotun Stanford University

Tiark Rompf, Martin Odersky

EPFL

Page 2: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Heterogeneous Parallel Programming

Cray

Jaguar

Sun

T2

Nvidia

Fermi

Altera

FPGA

MPI

Pthreads OpenMP

CUDA OpenCL

Verilog VHDL

Page 3: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Programmability Chasm

Too many different programming models

Cray

Jaguar

Sun

T2

Nvidia

Fermi

Altera

FPGA

MPI

Pthreads OpenMP

CUDA OpenCL

Verilog VHDL

Virtual

Worlds

Personal

Robotics

Data informatics

Scientific

Engineering

Applications

Page 4: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,
Page 5: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,
Page 6: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

PPL Vision

Domain Embedding Language (Scala)

Virtual

Worlds

Personal

Robotics Data

informatics

Scientific

Engineering

Physics

(Liszt)

Data Analytics

(OptiQL)

Graph Alg.

(Green Marl)

Machine Learning (OptiML)

Statistics

(R)

Parallel Runtime (Delite, GRAMPS)

Dynamic Domain Spec. Opt. Locality Aware Scheduling

Staging Polymorphic Embedding

Applications

Domain

Specific

Languages

Heterogeneous

Hardware

DSL

Infrastructure

Task & Data Parallelism

Static Domain Specific Opt.

Page 7: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

We need to develop all these DSLs

Current DSL methods are unsatisfactory

DSLs Present New Problem

Page 8: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Current DSL Development Approaches

Stand-alone DSLs Can include extensive optimizations Enormous effort to develop to a sufficient degree of maturity

Actual Compiler/Optimizations Tooling (IDE, Debuggers,…)

Interoperation between multiple DSLs is very difficult

Purely embedded DSLs ⇒ “just a library” Easy to develop (can reuse full host language) Easier to learn DSL Can Combine multiple DSLs in one program Can Share DSL infrastructure among several DSLs Hard to optimize using domain knowledge Target same architecture as host language

Need to do better

Page 9: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Need to Do Better

Goal: Develop embedded DSLs that perform as well as stand-alone ones

Intuition: General-purpose languages should be designed with DSL embedding in mind

Page 10: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Embedded DSL gets it all for free, but can’t change any of it

Modular Staging Approach

Lexer Parser Type

checker Analysis Optimization

Code gen

DSLs adopt front-end from highly expressive

embedding language

but can customize IR and participate in backend phases

Stand-alone DSL implements everything

Typical Compiler

Modular Staging provides a hybrid approach

GPCE’10: Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs

Page 11: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Linear Algebra Example

object TestMatrix {

def example(a: Matrix, b: Matrix, c: Matrix, d: Matrix) = { val x = a*b + a*c val y = a*c + a*d println(x+y) } }

Page 12: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Abstract Matrix Usage

object TestMatrix extends DSLApp with MatrixArith {

def example(a: Rep[Matrix], b: Rep[Matrix],

c: Rep[Matrix], d: Rep[Matrix]) = {

val x = a*b + a*c val y = a*c + a*d println(x+y) } }

Rep[Matrix]: abstract type constructor ⇒ range of possible implementations of Matrix

Operations on Rep[Matrix] defined in MatrixArith trait

Page 13: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Lifting Matrix to Abstract Representation DSL interface building blocks structured as traits

Expressions of type Rep[T] represent expressions of type T Can plug in different representation

Need to be able to convert (lift) Matrix to abstract representation

Need to define an interface for our DSL type

Now can plugin different implementations and representations for the DSL

trait MatrixArith { type Rep[T] implicit def liftMatrixToRep(x: Matrix): Rep[Matrix] def infix_+(x:Rep[Matrix], y: Rep[Matrix]): Rep[Matrix] def infix_*(x:Rep[Matrix], y: Rep[Matrix]): Rep[Matrix] }

Page 14: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Modest Effort

Lifting each new DSL that uses slightly different IR violates Effort criterion

Need a DSL embedding infrastructure

Provide building blocks of common DSL compiler functionality

Page 15: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Intermediate Representation (IR)

Delite DSL Compiler

Provide a common IR that can be extended while still benefitting from generic analysis and opt.

Scala Embedding

Framework

Delite Parallelism

Framework

Base IR

Generic

Analysis & Opt.

Liszt

program

OptiML

program

Page 16: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Now Can Build an IR

Start with common IR structure to be shared among DSLs

Generic optimizations provided to all DSLs Common subexpression elimination Dead code elimination Code motion

trait Expressions { // constants/symbols (atomic) abstract class Exp[T] case class Const[T](x: T) extends Exp[T] case class Sym[T](n: Int) extends Exp[T] // operations (composite, defined in subtraits) abstract class Op[T] // additional members for managing encountered definitions def findOrCreateDefinition[T](op: Op[T]): Sym[T] implicit def toExp[T](d: Op[T]): Exp[T] = findOrCreateDefinition(d) }

Page 17: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Intermediate Representation (IR)

Delite DSL Compiler

Provide a common IR that can be extended while still benefitting from generic analysis and opt.

Extend common IR and provide IR nodes that encode parallel execution patterns

Now can do parallel optimizations and mapping

Scala Embedding

Framework

Delite Parallelism

Framework

Base IR

Generic

Analysis & Opt.

Liszt

program

OptiML

program

Delite IR

Parallelism Analysis,

Opt. & Mapping

Page 18: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Delite Ops

Encode known parallel execution patterns

Data-parallel: map, zip, reduce, …

Delite provides implementations of these patterns for multiple hardware targets

e.g., multi-core, GPU

DSL developer maps each domain operation to the appropriate pattern

Page 19: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Delite Op Fusing

Operates on all loop-based ops

Reduces op overhead and improves locality

Elimination of temporary data structures

Communication through registers

Fuse both dependent and side-by-side operations

Fused ops can have multiple outputs

Algorithm: fuse two loops if

size(loop1) == size(loop2)

No dependencies exist that would require an impossible schedule if fused

e.g., C depends on B, depends on A -> cannot fuse C and A

Page 20: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Intermediate Representation (IR)

Delite DSL Compiler

Provide a common IR that can be extended while still benefitting from generic analysis and opt.

Extend common IR and provide IR nodes that encode parallel execution patterns

Now can do parallel optimizations and mapping

DSL extends appropriate parallel nodes for their operations

Now can do domain-specific analysis and opt.

Scala Embedding

Framework

Delite Parallelism

Framework

Base IR

Generic

Analysis & Opt.

Liszt

program

OptiML

program

DS IR

Domain

Analysis & Opt.

Delite IR

Parallelism Analysis,

Opt. & Mapping

⇒ ⇒

Page 21: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Customize IR with Domain Info

Choose Exp as representation for the DSL types Define Lifting function to create expressions Extend Delite IR with domain-specific node types DSL methods build IR as program runs

trait MatrixArithRepExp extends MatrixArith with Expressions { type Rep[T] = Exp[T] implicit def liftMatrixToRep(x: Matrix) = Const(x) case class Plus(x: Exp[Matrix],y: Exp[Matrix]) extends DeliteOpZip[Matrix] def infix_+(x: Exp[Matrix],y: Exp[Matrix]) = Plus(x, y) }

Page 22: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

The Delite IR

s = sum(M) V1 = exp(V2) M1 = M2 + M3

Domain User

Interface DSL

User

Application

DS IR

Delite Op IR

Base IR

C2 = sort(C1)

Matrix

Plus

Vector

Exp

Matrix

Sum

Domain

Analysis & Opt.

DSL

Author Collection

Quicksort

Reduce Map ZipWith

Parallelism

Analysis & Opt.

Code Generation

Delite

Divide &

Conquer

Op Generic Analysis

& Opt.

Delite

Page 23: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Intermediate Representation (IR)

Delite DSL Compiler

Provide a common IR that can be extended while still benefitting from generic analysis and opt.

Extend common IR and provide IR nodes that encode data parallel execution patterns

Now can do parallel optimizations and mapping

DSL extends appropriate data parallel nodes for their operations

Now can do domain-specific analysis and opt.

Generate an execution graph, kernels and data structures

Scala Embedding

Framework

Delite

Execution

Graph

Delite Parallelism

Framework

Base IR

Generic

Analysis & Opt.

Code Generation

Kernels

(Scala, C,

Cuda, MPI

Verilog, …)

Liszt

program

OptiML

program

DS IR

Domain

Analysis & Opt.

Delite IR

Parallelism Analysis,

Opt. & Mapping

⇒ ⇒

Data Structures

(arrays, trees,

graphs, …)

Page 24: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Delite Runtime

Maps the machine-agnostic DSL compiler output onto the given machine specification for customized execution

Walk-time scheduling produces partial schedules

Generates variants to be selected at run-time once actual execution flow is resolved

Code generation produces fused, specialized kernels to be launched on each resource

Optimizes for the execution plan using the partial schedules

Run-time executor controls and optimizes execution

Launches kernels on resources

Monitors execution and system state to dynamically adjust kernels and schedule

Delite

Execution

Graph

Kernels

(Scala, C,

Cuda, …)

DSL Data

Structures

Local System

GPU

Partial schedules, Fused, specialized kernels

SMP

Machine Inputs Application Inputs

Scheduler

Code Generator

JIT Kernel Fusion, Specialization, Synchronization

Walk-Time

Schedule Dispatch, Memory Management, Lazy Data Transfers

Run-Time

Page 25: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Schedule and Kernel Compilation

Compile execution graph to executables for each resource after scheduling All synchronization is added at this point where required

Kernels specialized based on number of processors allocated for it

e.g., specialize height of tree reduction

Page 26: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

GPU Management

Cuda host thread launches kernels and automatically performs data transfers as required by schedule

Compiler provides helper functions to

Copy data structures between address spaces

pre-allocate outputs and temporaries

select the number of threads & thread blocks

Provides device memory management for kernels

Perform liveness analysis to determine when op inputs and outputs are dead on the GPU

Runtime frees dead data when it experiences memory pressure

Page 27: The Scala Programming Language - Delite: A Framework for …days2011.scala-lang.org/sites/days2011/files/06. Delite.pdf · 2011-09-09 · Heterogeneous Parallel DSLs Hassan Chafi,

Conclusions

Need to simplify the process of developing DSLs for parallelism

Need programming languages to be designed for flexible embedding

Lightweight modular staging in Scala allows for more powerful embedded DSLs

Delite provides a framework for adding parallelism