[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Klöckner, NYU)



DESCRIPTION

http://cs264.org

Abstract: High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. Scripting languages such as Python, on the other hand, favor ease of use over computational speed and do not generally emphasize parallelism. PyOpenCL and PyCUDA are two packages that attempt to join the two together. By showing concrete examples, both at the toy and the whole-application level, this talk aims to demonstrate that combining these opposites yields a programming environment that is greater than the sum of its two parts.

Speaker biography: Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at the Department of Applied Mathematics at Brown University. He worked on a variety of topics, all aiming to broaden the utility of discontinuous Galerkin (DG) methods. This included their use in the simulation of plasma physics and the demonstration of their particular suitability for computation on throughput-oriented graphics processors (GPUs). He also worked on multi-rate time stepping methods and shock capturing schemes for DG. In the fall of 2010, he joined the Courant Institute of Mathematical Sciences at New York University as a Courant Instructor, where he works on problems in computational electromagnetics with Leslie Greengard. His research interests include:

- Discontinuous Galerkin and integral equation methods for wave propagation
- Programming tools for parallel architectures
- High-order unstructured particle-in-cell methods for plasma simulation

TRANSCRIPT

Page 1: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives

Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA

Andreas Klöckner

Courant Institute of Mathematical Sciences, New York University

March 31, 2011

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 2: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives

Thanks

Jan Hesthaven (Brown)

Tim Warburton (Rice)

Leslie Greengard (NYU)

PyOpenCL, PyCUDA contributors

Nvidia Corp., AMD Corp.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 3: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 4: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Outline

1 Introduction
  A Common Theme
  Intro to OpenCL

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 6: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

How are High-Performance Codes constructed?

“Traditional” construction of high-performance codes:

C/C++/Fortran
Libraries

“Alternative” construction of high-performance codes:

Scripting for ‘brains’
GPUs for ‘inner loops’

Play to the strengths of each programming environment.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 7: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Outline

1 Introduction
  A Common Theme
  Intro to OpenCL

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 8: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

What is OpenCL?

OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec]

Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)

Vendor-neutral

Comes with RTCG

Defines:

Host-side programming interface (library)

Device-side programming language (!)

Big deal?

Big deal!

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 11: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Who?


OpenCL Working Group

• Diverse industry participation

- Processor vendors, system OEMs, middleware vendors, application developers

• Many industry-leading experts involved in OpenCL’s design

- A healthy diversity of industry perspectives

• Apple made initial proposal and is very active in the working group

- Serving as specification editor

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 12: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

When?


OpenCL Timeline

• Six months from proposal to released OpenCL 1.0 specification

- Due to a strong initial proposal and a shared commercial incentive

• Multiple conformant implementations shipping

- Apple’s Mac OS X Snow Leopard now ships with OpenCL

• 18 month cadence between OpenCL 1.0 and OpenCL 1.1

- Backwards compatibility protects software investment

Timeline:

Jun 08: Apple proposes OpenCL working group and contributes draft specification to Khronos
Dec 08: Khronos publicly releases OpenCL 1.0 as royalty-free specification
May 09: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations
2H 09: Multiple conformant implementations ship across diverse OS and platforms
Jun 10: OpenCL 1.1 specification released and first implementations ship

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 13: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Why?

Processor Parallelism

CPUs: multiple cores driving performance increases
GPUs: increasingly general-purpose data-parallel computing
(plus graphics APIs and shading languages, and multi-processor programming, e.g. OpenMP)

Emerging intersection: heterogeneous computing

OpenCL is a programming framework for heterogeneous compute resources.

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 14: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

CL vs CUDA side-by-side

CUDA source code:

__global__ void transpose(
    float *A_t, float *A, int a_width, int a_height)
{
    int base_idx_a = blockIdx.x * BLK_SIZE
        + blockIdx.y * A_BLOCK_STRIDE;
    int base_idx_a_t = blockIdx.y * BLK_SIZE
        + blockIdx.x * A_T_BLOCK_STRIDE;
    int glob_idx_a = base_idx_a
        + threadIdx.x + a_width * threadIdx.y;
    int glob_idx_a_t = base_idx_a_t
        + threadIdx.x + a_height * threadIdx.y;

    __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];
    A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
    __syncthreads();
    A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
}

OpenCL source code:

__kernel void transpose(
    __global float *a_t, __global float *a,
    unsigned a_width, unsigned a_height)
{
    int base_idx_a = get_group_id(0) * BLK_SIZE
        + get_group_id(1) * A_BLOCK_STRIDE;
    int base_idx_a_t = get_group_id(1) * BLK_SIZE
        + get_group_id(0) * A_T_BLOCK_STRIDE;
    int glob_idx_a = base_idx_a
        + get_local_id(0) + a_width * get_local_id(1);
    int glob_idx_a_t = base_idx_a_t
        + get_local_id(0) + a_height * get_local_id(1);

    __local float a_local[BLK_SIZE][BLK_SIZE+1];
    a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
    barrier(CLK_LOCAL_MEM_FENCE);
    a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
}

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 15: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL ↔ CUDA: A dictionary

OpenCL                        CUDA
Grid                          Grid
Work Group                    Block
Work Item                     Thread
__kernel                      __global__
__global                      __device__
__local                       __shared__
__private                     __local__
image<n>d_t                   texture<type, n, ...>
barrier(CLK_LOCAL_MEM_FENCE)  __syncthreads()
get_local_id(0/1/2)           threadIdx.x/y/z
get_group_id(0/1/2)           blockIdx.x/y/z
get_global_id(0/1/2)          – (reimplement)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 16: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL: Execution Model

[Figure: an nD grid of work groups; each work group in turn consists of a grid of work items, e.g. Work Group (1,0) containing Items (0,0) through (3,3).]

Two-tiered parallelism:

Grid = Nx × Ny × Nz work groups
Work group = Sx × Sy × Sz work items
Total: ∏_{i ∈ {x,y,z}} Si·Ni work items

Communication/synchronization only within a work group

Work group maps to a compute unit

Grid/group ≈ outer loops in an algorithm

Device language: get_{global,group,local}_{id,size}(axis)
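As a concrete illustration of these index functions, here is a minimal sketch (using the PyOpenCL API shown later in this talk; kernel and buffer names are made up) that launches 2 work groups of 4 work items and checks that get_global_id(0) equals get_group_id(0)*get_local_size(0) + get_local_id(0):

import numpy
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

ids = numpy.zeros(8, dtype=numpy.int32)
ids_dev = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=ids.nbytes)

prg = cl.Program(ctx, """
    __kernel void record_ids(__global int *out)
    {
        /* the global id decomposes into group and local id */
        out[get_global_id(0)] =
            get_group_id(0)*get_local_size(0) + get_local_id(0);
    }
    """).build()

prg.record_ids(queue, ids.shape, (4,), ids_dev)   # grid: 2 groups of 4 items
cl.enqueue_read_buffer(queue, ids_dev, ids).wait()
assert (ids == numpy.arange(8)).all()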

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 17: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL: Computing as a Service

[Figure: a host (CPU) with its memory, connected to several compute devices, each with its own memory; devices are grouped into platforms, e.g. Platform 0: CPUs, Platform 1: GPUs.]

Compute Device: think “chip”, has a memory interface
Compute Unit: think “processor”, has instruction fetch
Processing Element: think “SIMD lane”

Host side: Python

Device language: ∼ C99

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 31: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

OpenCL Object Diagram


Figure 2.1 - OpenCL UML Class Diagram

Credit: Khronos Group

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 32: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL

Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:

Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput

→ They complement each other.

CPU: largely restricted to control tasks (∼1000/sec)

Scripting is fast enough for that.

Python + CUDA = PyCUDA

Python + OpenCL = PyOpenCL

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 33: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Outline

1 Introduction

2 Programming with PyOpenCL
  First Contact
  About PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 35: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Dive into PyOpenCL

import pyopencl as cl, numpy

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_global_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (1,), a_dev)

(The quoted string is the compute kernel.)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 37: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Dive into PyOpenCL

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, a_dev, result).wait()

import numpy.linalg as la
assert la.norm(result - 2*a) == 0

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 38: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Outline

1 Introduction

2 Programming with PyOpenCL
  First Contact
  About PyOpenCL

3 Run-Time Code Generation

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 39: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL: Completeness

PyOpenCL exposes all of OpenCL.

For example:

Every GetInfo() query

Images and Samplers

Memory Maps

Profiling and Synchronization

GL Interop
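For instance, every clGetDeviceInfo() query is exposed, both through get_info() and as a Python attribute; a minimal sketch:

import pyopencl as cl

for platform in cl.get_platforms():
    print platform.name, platform.vendor
    for dev in platform.get_devices():
        # two equivalent spellings of the same info query
        print dev.get_info(cl.device_info.GLOBAL_MEM_SIZE)
        print dev.global_mem_size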

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 40: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL: Completeness

PyOpenCL supports (nearly) every OS that has an OpenCL implementation.

Linux

OS X

Windows

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 41: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Automatic Cleanup

Reachable objects (memory, streams, . . . ) are never destroyed.

Once unreachable, they are released at an unspecified future time.

Scarce resources (memory) can be explicitly freed: obj.release()

Correctly deals with multiple contexts and dependencies (based on OpenCL's reference counting).
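A short sketch of the explicit path, reusing the buffer from the earlier example:

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
# ... enqueue work using a_dev ...
a_dev.release()   # free the device memory now, not at garbage-collection time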

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 42: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL: Documentation

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 43: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL Philosophy

Provide complete access

Automatically manage resources

Provide abstractions

Allow interactive use

Check for and report errors automatically

Integrate tightly with numpy

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 44: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

PyOpenCL, PyCUDA: Vital Information

http://mathema.tician.de/software/pyopencl (or /pycuda)

Complete documentation

X Consortium License (no warranty, free for all use)

Convenient abstractions: arrays, elementwise operations, reduction, scan

Requires: numpy, Python 2.4+ (Win/OS X/Linux)

Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, . . . )

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 45: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL

Capturing Dependencies

B = f(A)
C = g(B)
E = f(C)
F = h(C)
G = g(E, F)
P = p(B)
Q = q(B)
R = r(G, P, Q)

[Figure: the dependency DAG formed by these computations.]

Switch the queue to out-of-order mode!

Specify dependencies as a list of events using the wait_for= optional keyword to enqueue_XXX.

Can also enqueue a barrier.

Common use case: transmit/receive from other MPI ranks.

Possible on Nvidia Fermi: submit parallel work to increase machine use.
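A minimal sketch of this event machinery (knl_f and knl_g stand in for whatever kernels compute f and g; they are illustrative, not part of PyOpenCL):

queue = cl.CommandQueue(ctx,
    properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

# B = f(A), then C = g(B): the second kernel waits only on the first
evt_b = cl.enqueue_nd_range_kernel(queue, knl_f, a.shape, None)
evt_c = cl.enqueue_nd_range_kernel(queue, knl_g, a.shape, None,
    wait_for=[evt_b])

cl.enqueue_barrier(queue)   # everything enqueued after this sees B and C complete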

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 47: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation
  The Idea
  RTCG in Action

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 49: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Metaprogramming

Idea → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result

(The human supplies the idea; the machine runs the rest of the pipeline.)

In GPU scripting, GPU code does not need to be a compile-time constant.

(Key: code is data; it wants to be reasoned about at run time.)

Good for code generation: PyCUDA, PyOpenCL

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 58: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Machine-generated Code

Why machine-generate code?

Automated tuning (cf. ATLAS, FFTW)

Data types

Specialize code for the given problem

Constants are faster than variables (→ register pressure)

Loop unrolling
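As a small illustration of the last two points, one might paste a loop bound into the kernel text before compilation; a sketch (the kernel name and layout are invented for illustration):

n_small = 4
source = """
    __kernel void scale_rows(__global float *a)
    {
        /* %(n)d is a compile-time constant here: the compiler can unroll */
        for (int i = 0; i < %(n)d; ++i)
            a[get_global_id(0)*%(n)d + i] *= 2;
    }
    """ % {"n": n_small}
prg = cl.Program(ctx, source).build()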

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 59: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:

Simple %-operator substitution

Combine with C preprocessor: simple, often sufficient

Use a templating engine (Mako works very well)

codepy:

Build C syntax trees from Python
Generates readable, indented C

Many ways of evaluating code; the most important one:

Exact device timing via events
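A minimal timing sketch (the kernel is the twice example from earlier):

queue = cl.CommandQueue(ctx,
    properties=cl.command_queue_properties.PROFILING_ENABLE)

evt = prg.twice(queue, a.shape, (256,), a_dev)
evt.wait()
elapsed_ns = evt.profile.end - evt.profile.start   # device-side timestamps, in ns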

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 60: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code GenerationThe IdeaRTCG in Action

4 Perspectives

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 61: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program?

Abstraction is good:

import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 62: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.array: Simple Linear Algebra

pyopencl.array.Array:

Meant to look and feel just like numpy.

pyopencl.array.to_device(ctx, queue, numpy_array)

numpy_array = ary.get()

+, -, *, /, fill, sin, arange, exp, rand, . . .

Mixed types (int32 + float32 = float64)

print cl_array for debugging.

Allows access to raw bits

Use as kernel arguments, memory maps
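A few of these in action (a_gpu as created above; pyopencl.clmath supplies the transcendental functions):

import pyopencl.clmath as clmath

b_gpu = clmath.sin(a_gpu)   # elementwise sin, computed on the device
c_gpu = a_gpu + 2*b_gpu     # numpy-style arithmetic between device arrays
c_gpu.fill(0)               # in-place fill
print c_gpu                 # fetches and prints the contents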

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 63: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.elementwise: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

n = 10000
a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))
b_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))

from pyopencl.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(ctx,
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = cl_array.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 64: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

RTCG via Substitution

source = ("""
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned lid = get_local_id(0);
        unsigned gsize = get_global_size(0);
        unsigned work_item_start = get_local_size(0)*get_group_id(0);
        %(loop_prep)s;
        for (unsigned i = work_item_start + lid; i < n; i += gsize)
            %(operation)s;
    }
    """ % {
    "arguments": ", ".join(arg.declarator() for arg in arguments),
    "operation": operation,
    "name": name,
    "loop_prep": loop_prep,
    })

prg = cl.Program(ctx, source).build()
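A sketch of how the placeholders might be filled in (the argument classes are PyOpenCL's own; the kernel name and operation are invented for illustration):

import numpy
from pyopencl.tools import VectorArg, ScalarArg

name = "double_them"
arguments = [VectorArg(numpy.float32, "y"),
             ScalarArg(numpy.uint32, "n")]
operation = "y[i] *= 2"
loop_prep = ""

# each argument knows its own C declarator, e.g. "__global float *y"
print ", ".join(arg.declarator() for arg in arguments)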

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 65: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
        __global ${ type_name } *tgt,
        __global const ${ type_name } *op1,
        __global const ${ type_name } *op2)
    {
        int idx = get_local_id(0)
            + ${ local_size } * ${ thread_strides } * get_group_id(0);

        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${ offset }] =
                op1[idx + ${ offset }]
                + op2[idx + ${ offset }];
        % endfor
    }""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)

knl = cl.Program(ctx, str(rendered_tpl)).build().add

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 66: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.reduction: Reduction made easy

Example: A dot product calculation

from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="__global const float *x, __global const float *y")

import pyopencl.clrandom as cl_rand
x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
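Since the two reductions round differently in single precision, one would compare the results with a tolerance rather than exactly:

# relative tolerance accounts for float32 rounding in both reductions
assert abs(x_dot_y - x_dot_y_cpu) < 1e-4 * abs(x_dot_y_cpu)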

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 67: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives Idea RTCG in Action

pyopencl.scan: Scan made easy

Example: A cumulative sum computation

from pyopencl.scan import InclusiveScanKernel
knl = InclusiveScanKernel(ctx, np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 68: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 PerspectivesPyCUDADG-FEM on the GPU“Automatic” GPU ProgrammingConclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 70: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Whetting your appetite

import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 71: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Whetting your appetite

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)
    {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a

(The quoted string is the compute kernel.)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 73: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

PyOpenCL ↔ PyCUDA: A (rough) dictionary

PyOpenCL                     PyCUDA
Context                      Context
CommandQueue                 Stream
Buffer                       mem_alloc / DeviceAllocation
Program                      SourceModule
Kernel                       Function
Event (e.g. enqueue_marker)  Event

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 74: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  “Automatic” GPU Programming
  Conclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 75: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Discontinuous Galerkin Method

Let Ω := ⋃_k D^k ⊂ R^d.

Goal

Solve a conservation law on Ω:   u_t + ∇ · F(u) = 0

Example

Maxwell's equations: EM field E(x, t), H(x, t) on Ω governed by

    ∂_t E − (1/ε) ∇ × H = −j/ε,      ∂_t H + (1/µ) ∇ × E = 0,
    ∇ · E = ρ/ε,                     ∇ · H = 0.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 78: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Metaprogramming DG: Flux Terms

0 = ∫_{D^k} u_t φ + [∇ · F(u)] φ dx − ∮_{∂D^k} [n · F − (n · F)*] φ dS_x

where the surface integral is the flux term.

Flux terms:

vary by problem

expression specified by user

evaluated pointwise

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 81: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F*)_E := (1/2) [n × (⟦H⟧ − α n × ⟦E⟧)]

User writes: vectorial statement in mathematical notation

flux = 1/2*cross(normal, h.int - h.ext - alpha*cross(normal, e.int - e.ext))

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 82: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F*)_E := (1/2) [n × (⟦H⟧ − α n × ⟦E⟧)]

We generate: scalar evaluator in C (6×)

a_flux += ((((val_a_field5 - val_b_field5)*fpair->normal[2]
        - (val_a_field4 - val_b_field4)*fpair->normal[0])
      + val_a_field0 - val_b_field0)*fpair->normal[0]
    - (((val_a_field4 - val_b_field4)*fpair->normal[1]
        - (val_a_field1 - val_b_field1)*fpair->normal[2])
      + val_a_field3 - val_b_field3)*fpair->normal[1])
    *value_type(0.5);

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 83: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loop Slicing for element-local parts of GPU DG

Per block: K_L element-local matrix multiplications + matrix load preparation.

Question: How should one assign work to threads?

w_s: in sequence (amortize preparation)
w_i: “inline-parallel” (exploit register space)
w_p: in parallel

[Figure: thread/time diagrams for the three work assignments.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 85: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loop Slicing for Differentiation

[Plot: execution time [ms] (0.8-2.2) vs. w_p (15-30) for local differentiation, matrix-in-shared, order 4, with microblocking; point size denotes w_i ∈ {1, . . . , 4}; color denotes w_s (1.0-3.0).]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 86: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

[Plot: GFlops/s (0-300) vs. polynomial order N (0-10), comparing GPU and CPU.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 87: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Memory Bandwidth on a GTX 280

[Plot: global memory bandwidth [GB/s] (20-200) vs. polynomial order N (1-9) for Gather, Lift, Diff, Assy., and Peak.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 88: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

GPU DG Showcase

Electromagnetism

Poisson

CFD

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 91: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  “Automatic” GPU Programming
  Conclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 92: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.

Obvious idea: let the computer do it.

One way: smart compilers

GPU programming requires complex tradeoffs
Tradeoffs require heuristics
Heuristics are fragile

Another way: dumb enumeration

Enumerate loop slicings
Enumerate prefetch options
Choose by running the resulting code on actual hardware

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 95: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loo.py Example

Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500
knl = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 96: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Loo.py Status

Limited scope:

Requires input/output separation
Kernels must be expressible using the “loopy” model (i.e. indices decompose into “output” and “reduction”)
Enough for DG, LA, FD, . . .

Kernel compilation limits trial rate

Non-goal: peak performance

Good results currently for dense linear algebra and (some) DG subkernels

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 98: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Outline

1 Introduction

2 Programming with PyOpenCL

3 Run-Time Code Generation

4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  “Automatic” GPU Programming
  Conclusions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 99: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Where to from here?

PyCUDA, PyOpenCL, hedge

→ http://www.cims.nyu.edu/~kloeckner/

GPU RTCG

AK, N. Pinto et al., PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, submitted.

GPU-DG Article

AK, T. Warburton, J. Bridge, J. S. Hesthaven, “Nodal Discontinuous Galerkin Methods on Graphics Processors”, J. Comp. Phys., 228 (21), 7863–7882.

Also: Intro in GPU Computing Gems Vol 2

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 100: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Conclusions

GPUs to me: architecture choice now widely available

Fun time to be in computational science

GPUs and scripting work surprisingly well together

Exploit a natural task decomposition in computational codes

RTCG: crucial tool

GPU Scripting great for prototyping

. . . and just as suitable for production code

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 101: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Questions?


Thank you for your attention!

http://www.cims.nyu.edu/~kloeckner/

image credits

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 102: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions

Image Credits

Dictionary: sxc.hu/topfer

C870 GPU: Nvidia Corp.

OpenCL Logo: Apple Corp./Ars Technica

OS Platforms: flickr.com/aOliN.Tk

Old Books: flickr.com/ppdigital

Floppy disk: flickr.com/ethanhein

Machine: flickr.com/13521837@N00

Adding Machine: flickr.com/thomashawk

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 103: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Plot: flop rates for 16 GPUs vs. 64 CPU cores; GFlops/s (0-4000) vs. polynomial order N (0-10).]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 104: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

Outline

5 OpenCL implementations

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 105: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

The Nvidia CL implementation

Targets only GPUs

Notes:

Nearly identical to CUDA

No native C-level JIT in CUDA (→ PyCUDA)

Page-locked memory: use CL_MEM_ALLOC_HOST_PTR.
(Careful: double meaning. Genuinely overlapped transfers need page-locked memory.)

No linear memory texturing

CUDA device emulation mode deprecated → use AMD's CPU CL (faster, too!)
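A sketch of requesting page-locked memory through PyOpenCL (the size comes from a host array a, as in the earlier examples):

mf = cl.mem_flags
# ALLOC_HOST_PTR asks the implementation for pinned host memory,
# which genuinely overlapped transfers require
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=a.nbytes)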

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 106: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

The Apple CL implementation

Targets CPUs and GPUs

General notes:

Different header name: OpenCL/cl.h instead of CL/cl.h. Use -framework OpenCL for C access.

Beware of an imperfect compiler cache implementation (it ignores include files).

CPU notes:

One work item per processor

GPU: similar to the hardware vendor's implementation. (New: Intel w/ Sandy Bridge)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 107: [Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)

Implementations

The AMD CL implementation

Targets CPUs and GPUs (from both AMD and Nvidia)

GPU notes:

Wide SIMD groups (64)

Native 4/5-wide vectors

But: a very flop-heavy machine; vectors may be ignored for memory-bound workloads

→ Both implicit and explicit SIMD

CPU notes:

Many work items per processor (emulated)

General:

cl_amd_printf

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA