CUDA - Copperhead
TRANSCRIPT
-
Copperhead: A Python-like Data Parallel Language & Compiler
Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley
Universal Parallel Computing Research Center
University of California, Berkeley
-
Intro to CUDA
Overview
Multicore/Manycore
SIMD
Programming with millions of threads
-
The CUDA Programming Model
CUDA is a recent programming model, designed for:
Manycore architectures
Wide SIMD parallelism
Scalability
CUDA provides:
A thread abstraction to deal with SIMD
Synchronization & data sharing between small groups of threads
CUDA programs are written in C + extensions
OpenCL is inspired by CUDA, but HW & SW vendor neutral
Programming model essentially identical
-
Multicore and Manycore
Multicore: yoke of oxen
Each core optimized for executing a single thread
Manycore: flock of chickens
Cores optimized for aggregate throughput, deemphasizing individual performance
-
Multicore & Manycore, cont.
Specifications         | Core i7 960                                  | GTX285
Processing Elements    | 4 cores, 4-way SIMD @ 3.2 GHz                | 30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max) | 4 cores, 2 threads, 4-wide SIMD: 32 strands  | 30 cores, 32 SIMD vectors, 32-wide SIMD: 30720 strands
SP GFLOP/s             | 102                                          | 1080
Memory Bandwidth       | 25.6 GB/s                                    | 159 GB/s
Register File          | -                                            | 1.875 MB
Local Store            | -                                            | 480 kB
-
SIMD: Neglected Parallelism
It is difficult for a compiler to exploit SIMD
How do you deal with sparse data & branches?
Many languages (like C) are difficult to vectorize
Fortran is somewhat better
Most common solution:
Either forget about SIMD
Pray the autovectorizer likes you
Or instantiate intrinsics (assembly language)
Requires a new code version for every SIMD extension
-
What to do with SIMD?
Neglecting SIMD in the future will be more expensive
AVX: 8 way SIMD, Larrabee: 16 way SIMD
This problem composes with thread level parallelism
-
CUDA
CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
The programmer writes a serial, scalar thread with the intention of launching thousands of threads
Being able to launch 1 million threads changes the parallelism problem
It's often easier to find 1 million threads than 32: just look at your data & launch a thread per element
CUDA is designed for data parallelism
Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+)-way parallelism
-
Hello World
-
CUDA Summary
CUDA is a programming model for manycore processors
It abstracts SIMD, making it easy to use wide SIMD vectors
It provides good performance on today's GPUs
In the near future, CUDA-like approaches will map well to many processors & GPUs
CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation
-
A Parallel Scripting Language
What is a scripting language?
Lots of opinions on this
I'm using an informal definition:
A language where performance is happily traded for productivity
Weak performance requirement of scalability:
My code should run faster tomorrow
What is the analog of today's scripting languages for manycore?
-
Data Parallelism
Assertion: Scaling to 1000 cores requires data parallelism
Accordingly, manycore scripting languages will be data parallel
They should allow the programmer to express data parallelism naturally
They should compose and transform the parallelism to fit target platforms
-
Warning: Evolving Project
Copperhead is still in embryo
We can compile a few small programs
Lots more work to be done in both language definition and code generation
Feedback is encouraged
-
Copperhead = Cu + python
Copperhead is a subset of Python, designed for data parallelism
Why Python?
Extant, well accepted high level scripting language
Free simulator(!!)
Already understands things like map and reduce
Comes with a parser & lexer
The current Copperhead compiler takes a subset of Python and produces CUDA code
Copperhead is not CUDA specific, but the current compiler is
-
Copperhead is not Pure Python
Copperhead is not for arbitrary Python code
Most features of Python are unsupported
Copperhead is compiled, not interpreted
Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction
Copperhead is statically typed
[Diagram: Copperhead as a subset of Python]
-
Saxpy: Hello world
Some things to notice:
Types are implicit
The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell
Typeclasses are fully resolved in CUDA via C++ templates
Functional programming: map, lambda (or the equivalent in list comprehensions)
You can pass functions around to other functions
Closure: the variable a is free in the lambda function, but bound to the a in its enclosing scope
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
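Because Copperhead is a subset of Python, the same definition runs unmodified in the ordinary Python interpreter (the "free simulator" mentioned earlier). A minimal usage sketch; the list() wrapper is only needed because map is lazy in Python 3:

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

print(list(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])))   # [12.0, 24.0, 36.0]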
-
Type Inference, cont.
Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather
Expressions are mapped against templates
Every variable starts out with a unique generic type; types are then resolved by union-find on the abstract syntax tree
Tuple and function types are also inferred
Example: c = a + b, where + : (Num0, Num0) -> Num0
Before unification: a : α145, b : α207, c : α52
After unification: a : Num52, b : Num52, c : Num52
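As an illustration of that union-find step, here is a small, self-contained Python sketch (not the actual Copperhead compiler code) that unifies the fresh type variables for a, b, and c against the + template above:

class TypeVar:
    def __init__(self, name):
        self.name = name          # e.g. 'a145'
        self.parent = self        # union-find parent pointer
        self.concrete = None      # resolved base type, e.g. 'Num'

def find(t):
    while t.parent is not t:
        t.parent = t.parent.parent    # path compression
        t = t.parent
    return t

def unify(t1, t2):
    r1, r2 = find(t1), find(t2)
    if r1 is r2:
        return
    r2.parent = r1
    r1.concrete = r1.concrete or r2.concrete

# fresh generic types for a, b, c, as in the example above
a, b, c = TypeVar('a145'), TypeVar('a207'), TypeVar('a52')

# the template for +: all three positions share a single Num type
num = TypeVar('num'); num.concrete = 'Num'
for var in (a, b, c):
    unify(num, var)

print([find(v).concrete for v in (a, b, c)])   # ['Num', 'Num', 'Num']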
-
Data parallelism
Copperhead computations are organized around data parallel arrays
map performs a forall for each element in an array
Accesses must be local
Accessing non-local elements is done explicitly: shift, rotate, or gather
No side effects allowed
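For example, a finite difference needs each element's right neighbor, so the neighbor array is produced explicitly (here with a shift) and map itself stays purely elementwise. A rough plain-Python sketch; the shift helper and its boundary behavior are illustrative assumptions, not the actual Copperhead primitive:

def shift(x, amount, default=0):
    # shift([1, 2, 3, 4], 1) -> [2, 3, 4, 0]; negative amounts shift the other way
    n = len(x)
    return [x[i + amount] if 0 <= i + amount < n else default for i in range(n)]

def diff(x):
    # x[i+1] - x[i], written as shift + elementwise map instead of indexing inside map
    return list(map(lambda nxt, cur: nxt - cur, shift(x, 1), x))

print(diff([1, 4, 9, 16]))   # [3, 5, 7, -16] (the last element pairs with the boundary default)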
-
Copperhead primitives
map
reduce
Scans:
scan, rscan, segscan, rsegscan
exscan, exrscan, exsegscan, exrsegscan
Shuffles:
shift, rotate, gather, scatter
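To make the less familiar names concrete, here is a plain-Python sketch of plausible semantics for some of these primitives; the names follow the slide, but the argument order and boundary behavior are assumptions, not the actual Copperhead definitions:

from itertools import accumulate
import operator

def scan(f, x):                 # inclusive scan: [x0, f(x0,x1), f(f(x0,x1),x2), ...]
    return list(accumulate(x, f))

def exscan(f, identity, x):     # exclusive scan: result[i] combines x[0..i-1]
    return scan(f, [identity] + list(x[:-1]))

def rscan(f, x):                # reverse (right-to-left) inclusive scan
    return scan(f, x[::-1])[::-1]

def gather(x, indices):         # out[i] = x[indices[i]]
    return [x[i] for i in indices]

def scatter(values, indices, dst):   # out[indices[i]] = values[i]
    out = list(dst)
    for v, i in zip(values, indices):
        out[i] = v
    return out

print(scan(operator.add, [1, 2, 3, 4]))       # [1, 3, 6, 10]
print(exscan(operator.add, 0, [1, 2, 3, 4]))  # [0, 1, 3, 6]
print(gather([10, 20, 30], [2, 0, 1]))        # [30, 10, 20]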
-
Implementing Copperhead
The Copperhead compiler is written in Python
Python provides its own Abstract Syntax Tree
Type inference, code generation, etc. are done by walking the AST
Module(None,
  Stmt(
    Function(
      None,
      'saxpy',
      ['a', 'x', 'y'],
      0,
      None,
      Stmt(
        Return(
          CallFunc(Name('map'),
            Lambda(
              ['xi', 'yi'],
              0,
              Add(
                Mul(
                  Name('a'),
                  Name('xi')
                ),
                Name('yi')
              )
            ),
            Name('x'),
            Name('y'),
            None,
            None
          )
        )
      )
    )
  ))
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
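The dump above comes from Python's older compiler module. As a sketch of the same idea with the modern ast module (purely illustrative, not the Copperhead front end), one can parse the source and walk the nodes that type inference and code generation visit:

import ast

src = """
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)
"""

tree = ast.parse(src)
print(ast.dump(tree))

# example walk: list the function and lambda nodes a code generator would visit
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.Lambda)):
        print(type(node).__name__)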
-
Compiling Copperhead to CUDA
Every Copperhead function creates at least one CUDA device function
Top level Copperhead functions create a CUDA global function, which orchestrates the device function calls
The global function takes care of allocating shared memory and returning data (storing it to DRAM)
Global synchronizations are implemented through multiple phases
All intermediate arrays & plumbing are handled by the Copperhead compiler
-
Saxpy Revisited
template<typename Num>
__device__ Num lambda0(Num xi, Num yi, Num a) {
  return ((a * xi) + yi);
}

template<typename Num>
__device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a, uint _globalIndex, Num& _returnValueReg) {
  Num _xReg, _yReg;
  if (_globalIndex < x.length) _xReg = x[_globalIndex];
  if (_globalIndex < y.length) _yReg = y[_globalIndex];
  if (_globalIndex < x.length) _returnValueReg = lambda0(_xReg, _yReg, a);
}

template<typename Num>
__global__ void saxpy0(Array<Num> x, Array<Num> y, Num a, Array<Num> _returnValue) {
  uint _blockMin = IMUL(blockDim.x, blockIdx.x);
  uint _blockMax = _blockMin + blockDim.x;
  uint _globalIndex = _blockMin + threadIdx.x;
  Num _returnValueReg;
  saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
  if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
-
Phases
[Diagram: a reduction executes in two phases (phase 0, phase 1); a scan executes in three phases (phase 0, phase 1, phase 2)]
-
Copperhead to CUDA, cont.
Compiler schedules computations into phases
Right now, this composition is done greedily
Compiler tracks global and local availability of all variables and creates a phase boundary when necessary
Fusing work into phases is important for performance
Example: B = reduce(map(A)) and D = reduce(map(C)); each map fuses into phase 0 of its reduction, and the final combination of partial results forms phase 1.
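A small plain-Python sketch of why that fusion matters (purely illustrative; on the GPU the corresponding saving is the intermediate array in DRAM and an extra pass over the data):

import operator
from functools import reduce

def square(v):
    return v * v

A = list(range(1, 6))

# Unfused: an intermediate array the size of A is materialized, then reduced.
tmp = list(map(square, A))
B_unfused = reduce(operator.add, tmp)

# Fused: each element of map's output is consumed as soon as it is produced.
B_fused = reduce(operator.add, (square(v) for v in A))

assert B_unfused == B_fused == 55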
-
Copperhead to CUDA, cont.
Shared memory is used only for communicating between threads:
Caching unpredictable accesses (gather)
Accessing elements with a uniform stride (shift & rotate)
Each device function returns its intermediate results through registers
-
Split
This code is decomposed into 3 phases
The Copperhead compiler takes care of intermediate variables
The Copperhead compiler uses shared memory for the temporaries used in the scans here
Everything else is in registers
def split(input, value):
    flags = map(lambda a: 1 if a < value else 0, input)
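A hedged, plain-Python sketch of how a split along these lines can be built from map, exclusive scans, and a scatter (illustrative only, not the slide's original code):

def exscan_add(x):                       # exclusive +-scan
    out, total = [], 0
    for v in x:
        out.append(total)
        total += v
    return out

def split(input, value):
    flags = list(map(lambda a: 1 if a < value else 0, input))
    left_idx = exscan_add(flags)                     # destination of each "< value" element
    right_idx = exscan_add([1 - f for f in flags])   # destination of each remaining element
    n_left = sum(flags)
    out = [None] * len(input)
    for v, f, li, ri in zip(input, flags, left_idx, right_idx):
        out[li if f else n_left + ri] = v            # scatter to the final position
    return out

print(split([5, 1, 7, 2, 9, 3], 4))   # [1, 2, 3, 5, 7, 9]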
-
Interpreting to Copperhead
If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend
Fun project: see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip
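As a toy illustration of that idea (purely hypothetical, not part of the project), a decorator can record the concrete argument types a function is called with at runtime, which is the kind of information a static backend would need:

import functools

def harvest_types(fn):
    observed = set()
    @functools.wraps(fn)
    def wrapper(*args):
        observed.add(tuple(type(arg).__name__ for arg in args))
        return fn(*args)
    wrapper.observed_types = observed
    return wrapper

@harvest_types
def saxpy(a, x, y):
    return list(map(lambda xi, yi: a*xi + yi, x, y))

saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
print(saxpy.observed_types)   # {('float', 'list', 'list')}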
-
Future Work
Finish support for the basics
Compiler transformations:
Nested data parallelism flattening
segmented scans
Retargetability
Thread Building Blocks/OpenMP/OpenCL
Bridge Python and Copperhead
Implement real algorithms with Copperhead:
Vision/Machine Learning, etc.