CUDA - Copperhead
TRANSCRIPT
-
Copperhead: A Python-like Data Parallel Language & Compiler
Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley
Universal Parallel Computing Research Center
University of California, Berkeley
-
Intro to CUDA
Overview
Multicore/Manycore
SIMD
Programming with millions of threads
-
The CUDA Programming Model
CUDA is a recent programming model, designed for:
Manycore architectures
Wide SIMD parallelism
Scalability
CUDA provides:
A thread abstraction to deal with SIMD
Synchronization & data sharing between small groups of threads
CUDA programs are written in C + extensions
OpenCL is inspired by CUDA, but HW & SW vendor neutral
Programming model essentially identical
-
Multicore and Manycore
Multicore: yoke of oxen
Each core optimized for executing a single thread
Manycore: flock of chickens
Cores optimized for aggregate throughput, deemphasizing individual performance
-
Multicore & Manycore, cont.
Specifications         | Core i7 960                                  | GTX285
Processing Elements    | 4 cores, 4-way SIMD @ 3.2 GHz                | 30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max) | 4 cores, 2 threads, 4-wide SIMD: 32 strands  | 30 cores, 32 SIMD vectors, 32-wide SIMD: 30720 strands
SP GFLOP/s             | 102                                          | 1080
Memory Bandwidth       | 25.6 GB/s                                    | 159 GB/s
Register File          | -                                            | 1.875 MB
Local Store            | -                                            | 480 kB
-
SIMD: Neglected Parallelism
It is difficult for a compiler to exploit SIMD
How do you deal with sparse data & branches?
Many languages (like C) are difficult to vectorize
Fortran is somewhat better
Most common solution:
Either forget about SIMD
Pray the autovectorizer likes you
Or instantiate intrinsics (assembly language)
Requires a new code version for every SIMD extension
-
What to do with SIMD?
Neglecting SIMD in the future will be more expensive
AVX: 8 way SIMD, Larrabee: 16 way SIMD
This problem composes with thread level parallelism
-
CUDA
CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
The programmer writes a serial, scalar thread with the intention of launching thousands of threads
Being able to launch 1 million threads changes the parallelism problem
It's often easier to find 1 million threads than 32: just look at your data & launch a thread per element
CUDA is designed for data parallelism
Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+)-way parallelism
-
Hello World
-
CUDA Summary
CUDA is a programming model for manycore processors
It abstracts SIMD, making it easy to use wide SIMD vectors
It provides good performance on today's GPUs
In the near future, CUDA-like approaches will map well to many processors & GPUs
CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation
-
A Parallel Scripting Language
What is a scripting language?
Lots of opinions on this
I'm using an informal definition:
A language where performance is happily traded for productivity
Weak performance requirement of scalability:
My code should run faster tomorrow
What is the analog of today's scripting languages for manycore?
-
Data Parallelism
Assertion: Scaling to 1000 cores requires data parallelism
Accordingly, manycore scripting languages will be data parallel
They should allow the programmer to express data parallelism naturally
They should compose and transform the parallelism to fit target platforms
-
Warning: Evolving Project
Copperhead is still in embryo
We can compile a few small programs
Lots more work to be done in both language definition and code generation
Feedback is encouraged
-
Copperhead = Cu + python
Copperhead is a subset of Python, designed for data parallelism
Why Python?
Extant, well accepted high level scripting language
Free simulator(!!)
Already understands things like map and reduce
Comes with a parser & lexer
The current Copperhead compiler takes a subset of Python and produces CUDA code
Copperhead is not CUDA specific, but the current compiler is
-
Copperhead is not Pure Python
Copperhead is not for arbitrary Python code
Most features of Python are unsupported
Copperhead is compiled, not interpreted
Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction
Copperhead is statically typed
[Diagram: Copperhead as a subset of Python]
-
Saxpy: Hello world
Some things to notice:
Types are implicit
The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell
Typeclasses are fully resolved in CUDA via C++ templates
Functional programming: map, lambda (or the equivalent in list comprehensions)
You can pass functions around to other functions
Closure: the variable a is free in the lambda function, but bound to the a in its enclosing scope
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
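Because Copperhead is a subset of Python, the same definition runs unmodified in the ordinary Python interpreter (the "free simulator" mentioned earlier). A minimal usage sketch; the list() wrapper is only needed because map is lazy in Python 3:

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

print(list(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])))   # [12.0, 24.0, 36.0]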
-
Type Inference, cont.
Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather
Expressions are mapped against templates
Every variable starts out with a unique generic type; types are then resolved by union-find on the abstract syntax tree
Tuple and function types are also inferred
Example: c = a + b, where + : (Num0, Num0) -> Num0
Before unification: a : α145, b : α207, c : α52
After unification: a : Num52, b : Num52, c : Num52
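As an illustration of that union-find step, here is a small, self-contained Python sketch (not the actual Copperhead compiler code) that unifies the fresh type variables for a, b, and c against the + template above:

class TypeVar:
    def __init__(self, name):
        self.name = name          # e.g. 'a145'
        self.parent = self        # union-find parent pointer
        self.concrete = None      # resolved base type, e.g. 'Num'

def find(t):
    while t.parent is not t:
        t.parent = t.parent.parent    # path compression
        t = t.parent
    return t

def unify(t1, t2):
    r1, r2 = find(t1), find(t2)
    if r1 is r2:
        return
    r2.parent = r1
    r1.concrete = r1.concrete or r2.concrete

# fresh generic types for a, b, c, as in the example above
a, b, c = TypeVar('a145'), TypeVar('a207'), TypeVar('a52')

# the template for +: all three positions share a single Num type
num = TypeVar('num'); num.concrete = 'Num'
for var in (a, b, c):
    unify(num, var)

print([find(v).concrete for v in (a, b, c)])   # ['Num', 'Num', 'Num']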
-
Data parallelism
Copperhead computations are organized around data parallel arrays
map performs a forall for each element in an array
Accesses must be local
Accessing non-local elements is done explicitly: shift, rotate, or gather
No side effects allowed
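For example, a finite difference needs each element's right neighbor, so the neighbor array is produced explicitly (here with a shift) and map itself stays purely elementwise. A rough plain-Python sketch; the shift helper and its boundary behavior are illustrative assumptions, not the actual Copperhead primitive:

def shift(x, amount, default=0):
    # shift([1, 2, 3, 4], 1) -> [2, 3, 4, 0]; negative amounts shift the other way
    n = len(x)
    return [x[i + amount] if 0 <= i + amount < n else default for i in range(n)]

def diff(x):
    # x[i+1] - x[i], written as shift + elementwise map instead of indexing inside map
    return list(map(lambda nxt, cur: nxt - cur, shift(x, 1), x))

print(diff([1, 4, 9, 16]))   # [3, 5, 7, -16] (the last element pairs with the boundary default)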
-
Copperhead primitives
map
reduce
Scans:
scan, rscan, segscan, rsegscan
exscan, exrscan, exsegscan, exrsegscan
Shuffles:
shift, rotate, gather, scatter
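To make the less familiar names concrete, here is a plain-Python sketch of plausible semantics for some of these primitives; the names follow the slide, but the argument order and boundary behavior are assumptions, not the actual Copperhead definitions:

from itertools import accumulate
import operator

def scan(f, x):                 # inclusive scan: [x0, f(x0,x1), f(f(x0,x1),x2), ...]
    return list(accumulate(x, f))

def exscan(f, identity, x):     # exclusive scan: result[i] combines x[0..i-1]
    return scan(f, [identity] + list(x[:-1]))

def rscan(f, x):                # reverse (right-to-left) inclusive scan
    return scan(f, x[::-1])[::-1]

def gather(x, indices):         # out[i] = x[indices[i]]
    return [x[i] for i in indices]

def scatter(values, indices, dst):   # out[indices[i]] = values[i]
    out = list(dst)
    for v, i in zip(values, indices):
        out[i] = v
    return out

print(scan(operator.add, [1, 2, 3, 4]))       # [1, 3, 6, 10]
print(exscan(operator.add, 0, [1, 2, 3, 4]))  # [0, 1, 3, 6]
print(gather([10, 20, 30], [2, 0, 1]))        # [30, 10, 20]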
-
Implementing Copperhead
The Copperhead compiler is written in Python
Python provides its own Abstract Syntax Tree
Type inference, code generation, etc. are done by walking the AST
Module(None,
  Stmt(
    Function(
      None,
      'saxpy',
      ['a', 'x', 'y'],
      0,
      None,
      Stmt(
        Return(
          CallFunc(Name('map'),
            Lambda(
              ['xi', 'yi'],
              0,
              Add(
                Mul(
                  Name('a'),
                  Name('xi')
                ),
                Name('yi')
              )
            ),
            Name('x'),
            Name('y'),
            None,
            None
          )
        )
      )
    )
  ))
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
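The dump above comes from Python's older compiler module. As a sketch of the same idea with the modern ast module (purely illustrative, not the Copperhead front end), one can parse the source and walk the nodes that type inference and code generation visit:

import ast

src = """
def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)
"""

tree = ast.parse(src)
print(ast.dump(tree))

# example walk: list the function and lambda nodes a code generator would visit
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.Lambda)):
        print(type(node).__name__)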
-
Compiling Copperhead to CUDA
Every Copperhead function creates at least one CUDA device function
Top level Copperhead functions create a CUDA global function, which orchestrates the device function calls
The global function takes care of allocating shared memory and returning data (storing it to DRAM)
Global synchronizations are implemented through multiple phases
All intermediate arrays & plumbing are handled by the Copperhead compiler
-
Saxpy Revisited
template<typename Num>
__device__ Num lambda0(Num xi, Num yi, Num a) {
  return ((a * xi) + yi);
}

template<typename Num>
__device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a, uint _globalIndex, Num& _returnValueReg) {
  Num _xReg, _yReg;
  if (_globalIndex < x.length) _xReg = x[_globalIndex];
  if (_globalIndex < y.length) _yReg = y[_globalIndex];
  if (_globalIndex < x.length) _returnValueReg = lambda0(_xReg, _yReg, a);
}

template<typename Num>
__global__ void saxpy0(Array<Num> x, Array<Num> y, Num a, Array<Num> _returnValue) {
  uint _blockMin = IMUL(blockDim.x, blockIdx.x);
  uint _blockMax = _blockMin + blockDim.x;
  uint _globalIndex = _blockMin + threadIdx.x;
  Num _returnValueReg;
  saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
  if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
-
Phases
[Diagram: a reduction executes in two phases (phase 0, phase 1); a scan executes in three phases (phase 0, phase 1, phase 2)]
-
Copperhead to CUDA, cont.
Compiler schedules computations into phases
Right now, this composition is done greedily
Compiler tracks global and local availability of all variables and creates a phase boundary when necessary
Fusing work into phases is important for performance
Example: B = reduce(map(A)) and D = reduce(map(C)); each map fuses into phase 0 of its reduction, and the final combination of partial results forms phase 1.
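A small plain-Python sketch of why that fusion matters (purely illustrative; on the GPU the corresponding saving is the intermediate array in DRAM and an extra pass over the data):

import operator
from functools import reduce

def square(v):
    return v * v

A = list(range(1, 6))

# Unfused: an intermediate array the size of A is materialized, then reduced.
tmp = list(map(square, A))
B_unfused = reduce(operator.add, tmp)

# Fused: each element of map's output is consumed as soon as it is produced.
B_fused = reduce(operator.add, (square(v) for v in A))

assert B_unfused == B_fused == 55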
-
Copperhead to CUDA, cont.
Shared memory is used only for communicating between threads:
Caching unpredictable accesses (gather)
Accessing elements with a uniform stride (shift & rotate)
Each device function returns its intermediate results through registers
-
Split
This code is decomposed into 3 phases
The Copperhead compiler takes care of intermediate variables
The Copperhead compiler uses shared memory for the temporaries used in the scans here
Everything else is in registers
def split(input, value):
    flags = map(lambda a: 1 if a < value else 0, input)
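A hedged, plain-Python sketch of how a split along these lines can be built from map, exclusive scans, and a scatter (illustrative only, not the slide's original code):

def exscan_add(x):                       # exclusive +-scan
    out, total = [], 0
    for v in x:
        out.append(total)
        total += v
    return out

def split(input, value):
    flags = list(map(lambda a: 1 if a < value else 0, input))
    left_idx = exscan_add(flags)                     # destination of each "< value" element
    right_idx = exscan_add([1 - f for f in flags])   # destination of each remaining element
    n_left = sum(flags)
    out = [None] * len(input)
    for v, f, li, ri in zip(input, flags, left_idx, right_idx):
        out[li if f else n_left + ri] = v            # scatter to the final position
    return out

print(split([5, 1, 7, 2, 9, 3], 4))   # [1, 2, 3, 5, 7, 9]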
-
Interpreting to Copperhead
If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend
Fun project: see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip
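As a toy illustration of that idea (purely hypothetical, not part of the project), a decorator can record the concrete argument types a function is called with at runtime, which is the kind of information a static backend would need:

import functools

def harvest_types(fn):
    observed = set()
    @functools.wraps(fn)
    def wrapper(*args):
        observed.add(tuple(type(arg).__name__ for arg in args))
        return fn(*args)
    wrapper.observed_types = observed
    return wrapper

@harvest_types
def saxpy(a, x, y):
    return list(map(lambda xi, yi: a*xi + yi, x, y))

saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
print(saxpy.observed_types)   # {('float', 'list', 'list')}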
-
Future Work
Finish support for the basics
Compiler transformations:
Nested data parallelism flattening
segmented scans
Retargetability
Thread Building Blocks/OpenMP/OpenCL
Bridge Python and Copperhead
Implement real algorithms with Copperhead:
Vision/Machine Learning, etc.