[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Klöckner, NYU)



DESCRIPTION

http://cs264.org

Abstract: High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. Scripting languages such as Python, on the other hand, favor ease of use over computational speed and do not generally emphasize parallelism. PyOpenCL and PyCUDA are two packages that attempt to join the two together. By showing concrete examples, both at the toy and the whole-application level, this talk aims to demonstrate that combining these opposites yields a programming environment that is greater than the sum of its two parts.

Speaker biography: Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at the Department of Applied Mathematics at Brown University. He worked on a variety of topics, all aiming to broaden the utility of discontinuous Galerkin (DG) methods. This included their use in the simulation of plasma physics and the demonstration of their particular suitability for computation on throughput-oriented graphics processors (GPUs). He also worked on multi-rate time stepping methods and shock capturing schemes for DG. In the fall of 2010, he joined the Courant Institute of Mathematical Sciences at New York University as a Courant Instructor, where he works on problems in computational electromagnetics with Leslie Greengard. His research interests include:

- Discontinuous Galerkin and integral equation methods for wave propagation
- Programming tools for parallel architectures
- High-order unstructured particle-in-cell methods for plasma simulation

TRANSCRIPT

Page 1: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives

Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA

Andreas Klöckner

Courant Institute of Mathematical Sciences, New York University

March 31, 2011

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 2: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives

Thanks

Jan Hesthaven (Brown)

Tim Warburton (Rice)

Leslie Greengard (NYU)

PyOpenCL, PyCUDA contributors

Nvidia Corp., AMD Corp.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 3: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 4: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Outline

1 Introduction
  A Common Theme
  Intro to OpenCL

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 6: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

How are High-Performance Codes constructed?

“Traditional” construction of high-performance codes:

C/C++/Fortran
Libraries

“Alternative” construction of high-performance codes:

Scripting for ‘brains’
GPUs for ‘inner loops’

Play to the strengths of each programming environment.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 7: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Outline

1 Introduction
  A Common Theme
  Intro to OpenCL

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 8: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

What is OpenCL?

OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec]

Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)

Vendor-neutral

Comes with RTCG

Defines:

Host-side programming interface (library)

Device-side programming language (!)

Big deal?

Big deal!

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 11: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Who?


OpenCL Working Group

• Diverse industry participation

- Processor vendors, system OEMs, middleware vendors, application developers

• Many industry-leading experts involved in OpenCL’s design

- A healthy diversity of industry perspectives

• Apple made initial proposal and is very active in the working group

- Serving as specification editor

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 12: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

When?


OpenCL Timeline

• Six months from proposal to released OpenCL 1.0 specification

- Due to a strong initial proposal and a shared commercial incentive

• Multiple conformant implementations shipping

- Apple’s Mac OS X Snow Leopard now ships with OpenCL

• 18 month cadence between OpenCL 1.0 and OpenCL 1.1

- Backwards compatibility protects software investment

Timeline:

Jun 08: Apple proposes OpenCL working group and contributes draft specification to Khronos
Dec 08: Khronos publicly releases OpenCL 1.0 as royalty-free specification
May 09: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations
2H 09: Multiple conformant implementations ship across diverse OS and platforms
Jun 10: OpenCL 1.1 specification released and first implementations ship

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 13: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Why?

Processor Parallelism

CPUs: multiple cores driving performance increases
GPUs: increasingly general-purpose data-parallel computing
(plus graphics APIs and shading languages, and multi-processor programming, e.g. OpenMP)

Emerging intersection: heterogeneous computing

OpenCL is a programming framework for heterogeneous compute resources.

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 14: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

CL vs CUDA side-by-side

CUDA source code:

__global__ void transpose(
    float *A_t, float *A, int a_width, int a_height)
{
    int base_idx_a = blockIdx.x * BLK_SIZE
        + blockIdx.y * A_BLOCK_STRIDE;
    int base_idx_a_t = blockIdx.y * BLK_SIZE
        + blockIdx.x * A_T_BLOCK_STRIDE;
    int glob_idx_a = base_idx_a
        + threadIdx.x + a_width * threadIdx.y;
    int glob_idx_a_t = base_idx_a_t
        + threadIdx.x + a_height * threadIdx.y;

    __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];
    A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
    __syncthreads();
    A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
}

OpenCL source code:

__kernel void transpose(
    __global float *a_t, __global float *a,
    unsigned a_width, unsigned a_height)
{
    int base_idx_a = get_group_id(0) * BLK_SIZE
        + get_group_id(1) * A_BLOCK_STRIDE;
    int base_idx_a_t = get_group_id(1) * BLK_SIZE
        + get_group_id(0) * A_T_BLOCK_STRIDE;
    int glob_idx_a = base_idx_a
        + get_local_id(0) + a_width * get_local_id(1);
    int glob_idx_a_t = base_idx_a_t
        + get_local_id(0) + a_height * get_local_id(1);

    __local float a_local[BLK_SIZE][BLK_SIZE+1];
    a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
    barrier(CLK_LOCAL_MEM_FENCE);
    a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
}

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 15: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL ↔ CUDA: A dictionary

OpenCL                        CUDA
Grid                          Grid
Work Group                    Block
Work Item                     Thread
__kernel                      __global__
__global                      __device__
__local                       __shared__
__private                     __local__
image<n>d_t                   texture<type, n, ...>
barrier(CLK_LOCAL_MEM_FENCE)  __syncthreads()
get_local_id(0/1/2)           threadIdx.x/y/z
get_group_id(0/1/2)           blockIdx.x/y/z
get_global_id(0/1/2)          – (reimplement)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 16: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL: Execution Model

[Figure: an nD grid of work groups; each work group in turn consists of a grid of work items, e.g. Work Group (1,0) containing Items (0,0) through (3,3).]

Two-tiered parallelism:

Grid = Nx × Ny × Nz work groups
Work group = Sx × Sy × Sz work items
Total: ∏_{i ∈ {x,y,z}} Si·Ni work items

Communication/synchronization only within a work group

Work group maps to a compute unit

Grid/group ≈ outer loops in an algorithm

Device language: get_{global,group,local}_{id,size}(axis)
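As a concrete illustration of these index functions, here is a minimal sketch (using the PyOpenCL API shown later in this talk; kernel and buffer names are made up) that launches 2 work groups of 4 work items and checks that get_global_id(0) equals get_group_id(0)*get_local_size(0) + get_local_id(0):

import numpy
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

ids = numpy.zeros(8, dtype=numpy.int32)
ids_dev = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=ids.nbytes)

prg = cl.Program(ctx, """
    __kernel void record_ids(__global int *out)
    {
        /* the global id decomposes into group and local id */
        out[get_global_id(0)] =
            get_group_id(0)*get_local_size(0) + get_local_id(0);
    }
    """).build()

prg.record_ids(queue, ids.shape, (4,), ids_dev)   # grid: 2 groups of 4 items
cl.enqueue_read_buffer(queue, ids_dev, ids).wait()
assert (ids == numpy.arange(8)).all()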

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 17: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL: Computing as a Service

[Figure: a host (CPU) with its memory, connected to several compute devices, each with its own memory; devices are grouped into platforms, e.g. Platform 0: CPUs, Platform 1: GPUs.]

Compute Device: think “chip”, has a memory interface
Compute Unit: think “processor”, has instruction fetch
Processing Element: think “SIMD lane”

Host side: Python

Device language: ∼ C99

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 31: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL Object Diagram


Figure 2.1 - OpenCL UML Class Diagram

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 32: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:

Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput

→ They complement each other.

CPU: largely restricted to control tasks (∼1000/sec)

Scripting is fast enough for that.

Python + CUDA = PyCUDA

Python + OpenCL = PyOpenCL

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 33: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Outline

1 Introduction

2 Programming with PyOpenCL
  First Contact
  About PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 35: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Dive into PyOpenCL

import pyopencl as cl, numpy

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_global_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (1,), a_dev)

(The quoted string is the compute kernel.)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 37: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Dive into PyOpenCL

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, a_dev, result).wait()

import numpy.linalg as la
assert la.norm(result - 2*a) == 0

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 38: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Outline

1 Introduction

2 Programming with PyOpenCL
  First Contact
  About PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 39: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL: Completeness

PyOpenCL exposes all of OpenCL.

For example:

Every GetInfo() query

Images and Samplers

Memory Maps

Profiling and Synchronization

GL Interop
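For instance, every clGetDeviceInfo() query is exposed, both through get_info() and as a Python attribute; a minimal sketch:

import pyopencl as cl

for platform in cl.get_platforms():
    print platform.name, platform.vendor
    for dev in platform.get_devices():
        # two equivalent spellings of the same info query
        print dev.get_info(cl.device_info.GLOBAL_MEM_SIZE)
        print dev.global_mem_size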

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 40: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL: Completeness

PyOpenCL supports (nearly) every OS that has an OpenCL implementation.

Linux

OS X

Windows

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 41: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Automatic Cleanup

Reachable objects (memory, streams, . . . ) are never destroyed.

Once unreachable, they are released at an unspecified future time.

Scarce resources (memory) can be explicitly freed: obj.release()

Correctly deals with multiple contexts and dependencies (based on OpenCL's reference counting).
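A short sketch of the explicit path, reusing the buffer from the earlier example:

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
# ... enqueue work using a_dev ...
a_dev.release()   # free the device memory now, not at garbage-collection time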

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 42: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL: Documentation

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 43: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL Philosophy

Provide complete access

Automatically manage resources

Provide abstractions

Allow interactive use

Check for and report errors automatically

Integrate tightly with numpy

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 44: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL, PyCUDA: Vital Information

http://mathema.tician.de/software/pyopencl (or /pycuda)

Complete documentation

X Consortium License (no warranty, free for all use)

Convenient abstractions: arrays, elementwise operations, reduction, scan

Requires: numpy, Python 2.4+ (Win/OS X/Linux)

Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, . . . )

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 45: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Capturing Dependencies

B = f(A)
C = g(B)
E = f(C)
F = h(C)
G = g(E, F)
P = p(B)
Q = q(B)
R = r(G, P, Q)

[Figure: the dependency DAG formed by these computations.]

Switch the queue to out-of-order mode!

Specify dependencies as a list of events using the wait_for= optional keyword to enqueue_XXX.

Can also enqueue a barrier.

Common use case: transmit/receive from other MPI ranks.

Possible on Nvidia Fermi: submit parallel work to increase machine use.
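A minimal sketch of this event machinery (knl_f and knl_g stand in for whatever kernels compute f and g; they are illustrative, not part of PyOpenCL):

queue = cl.CommandQueue(ctx,
    properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

# B = f(A), then C = g(B): the second kernel waits only on the first
evt_b = cl.enqueue_nd_range_kernel(queue, knl_f, a.shape, None)
evt_c = cl.enqueue_nd_range_kernel(queue, knl_g, a.shape, None,
    wait_for=[evt_b])

cl.enqueue_barrier(queue)   # everything enqueued after this sees B and C complete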

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 47: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation
  The Idea
  RTCG in Action

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 49: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Metaprogramming

Idea → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result

(The human supplies the idea; the machine runs the rest of the pipeline.)

In GPU scripting, GPU code does not need to be a compile-time constant.

(Key: code is data; it wants to be reasoned about at run time.)

Good for code generation: PyCUDA, PyOpenCL

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 58: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Machine-generated Code

Why machine-generate code?

Automated tuning (cf. ATLAS, FFTW)

Data types

Specialize code for the given problem

Constants are faster than variables (→ register pressure)

Loop unrolling
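As a small illustration of the last two points, one might paste a loop bound into the kernel text before compilation; a sketch (the kernel name and layout are invented for illustration):

n_small = 4
source = """
    __kernel void scale_rows(__global float *a)
    {
        /* %(n)d is a compile-time constant here: the compiler can unroll */
        for (int i = 0; i < %(n)d; ++i)
            a[get_global_id(0)*%(n)d + i] *= 2;
    }
    """ % {"n": n_small}
prg = cl.Program(ctx, source).build()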

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 59: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:

Simple %-operator substitution

Combine with C preprocessor: simple, often sufficient

Use a templating engine (Mako works very well)

codepy:

Build C syntax trees from Python
Generates readable, indented C

Many ways of evaluating code; the most important one:

Exact device timing via events
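A minimal timing sketch (the kernel is the twice example from earlier):

queue = cl.CommandQueue(ctx,
    properties=cl.command_queue_properties.PROFILING_ENABLE)

evt = prg.twice(queue, a.shape, (256,), a_dev)
evt.wait()
elapsed_ns = evt.profile.end - evt.profile.start   # device-side timestamps, in ns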

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 60: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code GenerationThe IdeaRTCG in Action

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 61: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program?

Abstraction is good:

import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 62: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.array: Simple Linear Algebra

pyopencl.array.Array:

Meant to look and feel just like numpy.

pyopencl.array.to_device(ctx, queue, numpy_array)

numpy_array = ary.get()

+, -, *, /, fill, sin, arange, exp, rand, . . .

Mixed types (int32 + float32 = float64)

print cl_array for debugging.

Allows access to raw bits

Use as kernel arguments, memory maps
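A few of these in action (a_gpu as created above; pyopencl.clmath supplies the transcendental functions):

import pyopencl.clmath as clmath

b_gpu = clmath.sin(a_gpu)   # elementwise sin, computed on the device
c_gpu = a_gpu + 2*b_gpu     # numpy-style arithmetic between device arrays
c_gpu.fill(0)               # in-place fill
print c_gpu                 # fetches and prints the contents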

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 63: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.elementwise: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

n = 10000
a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))
b_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))

from pyopencl.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(ctx,
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = cl_array.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 64: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

RTCG via Substitution

source = ("""
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned lid = get_local_id(0);
        unsigned gsize = get_global_size(0);
        unsigned work_item_start = get_local_size(0)*get_group_id(0);
        %(loop_prep)s;
        for (unsigned i = work_item_start + lid; i < n; i += gsize)
            %(operation)s;
    }
    """ % {
    "arguments": ", ".join(arg.declarator() for arg in arguments),
    "operation": operation,
    "name": name,
    "loop_prep": loop_prep,
    })

prg = cl.Program(ctx, source).build()
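A sketch of how the placeholders might be filled in (the argument classes are PyOpenCL's own; the kernel name and operation are invented for illustration):

import numpy
from pyopencl.tools import VectorArg, ScalarArg

name = "double_them"
arguments = [VectorArg(numpy.float32, "y"),
             ScalarArg(numpy.uint32, "n")]
operation = "y[i] *= 2"
loop_prep = ""

# each argument knows its own C declarator, e.g. "__global float *y"
print ", ".join(arg.declarator() for arg in arguments)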

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 65: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
        __global ${ type_name } *tgt,
        __global const ${ type_name } *op1,
        __global const ${ type_name } *op2)
    {
        int idx = get_local_id(0)
            + ${ local_size } * ${ thread_strides } * get_group_id(0);

        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${ offset }] =
                op1[idx + ${ offset }]
                + op2[idx + ${ offset }];
        % endfor
    }""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)

knl = cl.Program(ctx, str(rendered_tpl)).build().add

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 66: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.reduction: Reduction made easy

Example: A dot product calculation

from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="__global const float *x, __global const float *y")

import pyopencl.clrandom as cl_rand
x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
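Since the two reductions round differently in single precision, one would compare the results with a tolerance rather than exactly:

# relative tolerance accounts for float32 rounding in both reductions
assert abs(x_dot_y - x_dot_y_cpu) < 1e-4 * abs(x_dot_y_cpu)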

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 67: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.scan: Scan made easy

Example: A cumulative sum computation

from pyopencl.scan import InclusiveScanKernel
knl = InclusiveScanKernel(ctx, np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 68: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 PerspectivesPyCUDADG-FEM on the GPU“Automatic” GPU ProgrammingConclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 70: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Whetting your appetite

import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 71: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Whetting your appetite

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)
    {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a

(The quoted string is the compute kernel.)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 73: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

PyOpenCL ↔ PyCUDA: A (rough) dictionary

PyOpenCL                     PyCUDA
Context                      Context
CommandQueue                 Stream
Buffer                       mem_alloc / DeviceAllocation
Program                      SourceModule
Kernel                       Function
Event (e.g. enqueue_marker)  Event

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 74: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  “Automatic” GPU Programming
  Conclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 75: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Discontinuous Galerkin Method

Let Ω := ⋃_k D^k ⊂ R^d.

Goal

Solve a conservation law on Ω:   u_t + ∇ · F(u) = 0

Example

Maxwell's equations: EM field E(x, t), H(x, t) on Ω governed by

    ∂_t E − (1/ε) ∇ × H = −j/ε,      ∂_t H + (1/µ) ∇ × E = 0,
    ∇ · E = ρ/ε,                     ∇ · H = 0.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 78: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Metaprogramming DG: Flux Terms

0 = ∫_{D^k} u_t φ + [∇ · F(u)] φ dx − ∮_{∂D^k} [n · F − (n · F)*] φ dS_x

where the surface integral is the flux term.

Flux terms:

vary by problem

expression specified by user

evaluated pointwise

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 81: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F*)_E := (1/2) [n × (⟦H⟧ − α n × ⟦E⟧)]

User writes: vectorial statement in mathematical notation

flux = 1/2*cross(normal, h.int - h.ext - alpha*cross(normal, e.int - e.ext))

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 82: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F*)_E := (1/2) [n × (⟦H⟧ − α n × ⟦E⟧)]

We generate: scalar evaluator in C (6×)

a_flux += ((((val_a_field5 - val_b_field5)*fpair->normal[2]
        - (val_a_field4 - val_b_field4)*fpair->normal[0])
      + val_a_field0 - val_b_field0)*fpair->normal[0]
    - (((val_a_field4 - val_b_field4)*fpair->normal[1]
        - (val_a_field1 - val_b_field1)*fpair->normal[2])
      + val_a_field3 - val_b_field3)*fpair->normal[1])
    *value_type(0.5);

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 83: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loop Slicing for element-local parts of GPU DG

Per block: K_L element-local matrix multiplications + matrix load preparation.

Question: How should one assign work to threads?

w_s: in sequence (amortize preparation)
w_i: “inline-parallel” (exploit register space)
w_p: in parallel

[Figure: thread/time diagrams for the three work assignments.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 85: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loop Slicing for Differentiation

[Plot: execution time [ms] (0.8-2.2) vs. w_p (15-30) for local differentiation, matrix-in-shared, order 4, with microblocking; point size denotes w_i ∈ {1, . . . , 4}; color denotes w_s (1.0-3.0).]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 86: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

[Plot: GFlops/s (0-300) vs. polynomial order N (0-10), comparing GPU and CPU.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 87: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Memory Bandwidth on a GTX 280

[Plot: global memory bandwidth [GB/s] (20-200) vs. polynomial order N (1-9) for Gather, Lift, Diff, Assy., and Peak.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 88: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

GPU DG Showcase

Electromagnetism

Poisson

CFD

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 91: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  “Automatic” GPU Programming
  Conclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 92: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.

Obvious idea: let the computer do it.

One way: smart compilers

GPU programming requires complex tradeoffs
Tradeoffs require heuristics
Heuristics are fragile

Another way: dumb enumeration

Enumerate loop slicings
Enumerate prefetch options
Choose by running the resulting code on actual hardware

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 95: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loo.py Example

Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500
knl = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 96: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loo.py Status

Limited scope:

Requires input/output separation
Kernels must be expressible using the “loopy” model (i.e. indices decompose into “output” and “reduction”)
Enough for DG, LA, FD, . . .

Kernel compilation limits trial rate

Non-goal: peak performance

Good results currently for dense linear algebra and (some) DG subkernels

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 98: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  “Automatic” GPU Programming
  Conclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 99: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Where to from here?

PyCUDA, PyOpenCL, hedge

→ http://www.cims.nyu.edu/~kloeckner/

GPU RTCG

AK, N. Pinto et al., PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, submitted.

GPU-DG Article

AK, T. Warburton, J. Bridge, J. S. Hesthaven, “Nodal Discontinuous Galerkin Methods on Graphics Processors”, J. Comp. Phys., 228 (21), 7863–7882.

Also: Intro in GPU Computing Gems Vol 2

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 100: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Conclusions

GPUs to me: architecture choice now widely available

Fun time to be in computational science

GPUs and scripting work surprisingly well together

Exploit a natural task decomposition in computational codes

RTCG: crucial tool

GPU Scripting great for prototyping

. . . and just as suitable for production code

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 101: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Questions?


Thank you for your attention!

http://www.cims.nyu.edu/~kloeckner/

image credits

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 102: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Image Credits

Dictionary: sxc.hu/topfer

C870 GPU: Nvidia Corp.

OpenCL Logo: Apple Corp./Ars Technica

OS Platforms: flickr.com/aOliN.Tk

Old Books: flickr.com/ppdigital

Floppy disk: flickr.com/ethanhein

Machine: flickr.com/13521837@N00

Adding Machine: flickr.com/thomashawk

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 103: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Plot: flop rates for 16 GPUs vs. 64 CPU cores; GFlops/s (0-4000) vs. polynomial order N (0-10).]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 104: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

Outline

5 OpenCL implementations

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 105: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

The Nvidia CL implementation

Targets only GPUs

Notes:

Nearly identical to CUDA

No native C-level JIT in CUDA (→ PyCUDA)

Page-locked memory: use CL_MEM_ALLOC_HOST_PTR.
(Careful: double meaning. Genuinely overlapped transfers need page-locked memory.)

No linear memory texturing

CUDA device emulation mode deprecated → use AMD's CPU CL (faster, too!)
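A sketch of requesting page-locked memory through PyOpenCL (the size comes from a host array a, as in the earlier examples):

mf = cl.mem_flags
# ALLOC_HOST_PTR asks the implementation for pinned host memory,
# which genuinely overlapped transfers require
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=a.nbytes)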

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 106: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

The Apple CL implementation

Targets CPUs and GPUs

General notes:

Different header name: OpenCL/cl.h instead of CL/cl.h. Use -framework OpenCL for C access.

Beware of an imperfect compiler cache implementation (it ignores include files).

CPU notes:

One work item per processor

GPU: similar to the hardware vendor's implementation. (New: Intel w/ Sandy Bridge)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 107: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

The AMD CL implementation

Targets CPUs and GPUs (from both AMD and Nvidia)

GPU notes:

Wide SIMD groups (64)

Native 4/5-wide vectors

But: a very flop-heavy machine; vectors may be ignored for memory-bound workloads

→ Both implicit and explicit SIMD

CPU notes:

Many work items per processor (emulated)

General:

cl_amd_printf

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA