Portability with OpenCL
High Performance Landscape
• High-performance computing is trending towards large numbers of cores and accelerators
• It would be nice to write your code once and have it be portable across these platforms
• The selection of a programming model can impact your portability
Platforms
• Multi-core CPU – an 8-core Sandy Bridge with AVX instructions has 64 GPU core-equivalents (each core can issue 5 instructions per clock cycle and can store the state of two threads); standard CPUs execute at a higher clock rate than a standard GPU
• The Intel Xeon Phi Coprocessor (IXPC) has up to 61 cores, each performing 16 single-precision operations in a single instruction, or 976 GPU core-equivalents
• An NVIDIA Kepler SMX GPU has 2880 GPU cores
• An AMD Radeon GPU has up to 32 compute units that can issue 4 instructions per cycle, which could be treated as 128 cores
CPU+Accelerator architecture
Memory Management
• CUDA C requires the programmer to allocate device memory and explicitly copy data between host and device.
• OpenCL requires the programmer to allocate buffers and copy data to the buffers. It hides some important details, however, in that it doesn't expose exactly where the buffer lives at various points during the program execution.
• OpenACC allows the programmer to rely entirely on the compiler for memory management to get started, but offers optional data constructs and clauses to control and optimize when data is allocated on the device and moved between host and device.
Parallelism Scheduling
• CUDA and OpenCL have thread, block, and grid abstractions
• OpenMP provides core control, but most of the process is automated
• OpenACC exposes three levels of parallelism: gang, worker, and vector parallelism
Multithreading
• All three devices require oversubscription to keep the compute units busy; that is, the program must expose extra (slack) parallelism so a compute unit can swap in another active thread when a thread stalls on memory or other long latency operations
• In CUDA and OpenCL, slack parallelism comes from creating blocks or workgroups larger than the number of cores.
• OpenMP allocates multiple iterations per core.
• OpenACC worker-level parallelism is intended to address this issue directly. On GPUs, iterations of a worker-parallel loop will run on the same core.
Caching and Scratchpad Memories
• In CUDA and OpenCL, the programmer must manage the scratchpad memory explicitly, using the CUDA __shared__ or OpenCL __local memory qualifiers
• OpenACC has a cache directive to allow the programmer to tell the implementation what data has enough reuse to cache locally
Portability
• There are three levels of portability.
• First is language portability, meaning a programmer can use the same language to write a program for different targets, even if the programs must be different.
• Second is functional portability, meaning a programmer can write one program that will run on different targets, though not all targets will get the best performance.
• Third is performance portability, meaning a programmer can write one program that gives good performance across many targets.
Portability
• CUDA provides reasonable portability across NVIDIA devices, but there is no pretense that this provides cross-vendor portability, or even performance portability of CUDA source code.
• OpenCL is designed to provide language and functional portability. Research has demonstrated that even across similar devices, like NVIDIA and AMD GPUs, retuning or rewriting a program can have a significant impact on performance.
• OpenACC is also intended to provide performance portability across devices, and there is some initial evidence to support this claim.
OpenCL to CUDA Data Parallelism Model Mapping

| OpenCL Parallelism Concept | CUDA Equivalent |
|----------------------------|-----------------|
| kernel                     | kernel          |
| host program               | host program    |
| NDRange (index space)      | grid            |
| work item                  | thread          |
| work group                 | block           |
Overview of OpenCL Execution Model
Mapping of OpenCL Dimensions and Indices to CUDA

| OpenCL API Call    | Explanation                                                          | CUDA Equivalent                       |
|--------------------|----------------------------------------------------------------------|---------------------------------------|
| get_global_id(0)   | global index of the work item in the x dimension                     | blockIdx.x * blockDim.x + threadIdx.x |
| get_local_id(0)    | local index of the work item within the work group in the x dimension | threadIdx.x                           |
| get_global_size(0) | size of the NDRange in the x dimension                               | gridDim.x * blockDim.x                |
| get_local_size(0)  | size of each work group in the x dimension                           | blockDim.x                            |
Conceptual OpenCL Device Architecture
Mapping OpenCL Memory Types to CUDA

| OpenCL Memory Type | CUDA Equivalent |
|--------------------|-----------------|
| global memory      | global memory   |
| constant memory    | constant memory |
| local memory       | shared memory   |
| private memory     | local memory    |
OpenCL Context for Device Management
Structure of OpenCL main program
• Get information about the platforms and devices available on the system
• Select devices to use
• Create an OpenCL command queue
• Create memory buffers on the device
• Transfer data from host to device memory buffers
• Create kernel program object
• Build (compile) kernel in-line (or load precompiled binary)
• Create OpenCL kernel object
• Set kernel arguments
• Execute kernel
• Read kernel memory and copy to host memory
Platform
"The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."
Platforms are represented by a cl_platform_id, initialized with clGetPlatformIDs()
http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
Simple code for identifying the platform

//Platform
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

Arguments: the number of platform entries to return (1 here), a pointer that receives the list of OpenCL platform IDs found (in our case just one platform, written to &platform), and a pointer that receives the number of OpenCL platforms available (ignored if NULL).
Context
“The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects.”
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Code for creating a context

//Context
cl_context_properties props[3];
props[0] = (cl_context_properties) CL_CONTEXT_PLATFORM;
props[1] = (cl_context_properties) platform;
props[2] = (cl_context_properties) 0;
cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

//Context info
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);
Command Queue
“An object that holds commands that will be executed on a specific device.
The command-queue is created on a specific device in a context.
Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. ...”
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Simple code for creating a command queue

// Create command-queue
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);
Allocating memory on device
Use clCreateBuffer:

cl_mem clCreateBuffer(cl_context context,   // OpenCL context, from clCreateContextFromType()
                      cl_mem_flags flags,   // bit field to specify type of allocation/usage (CL_MEM_READ_WRITE, ...)
                      size_t size,          // number of bytes in the buffer memory object
                      void *host_ptr,       // pointer to buffer data (may be previously allocated)
                      cl_int *errcode_ret)  // returns error code if an error occurs

Returns the memory object.
Sample code for allocating memory on device for source data
// source data on host, two vectors
int *A, *B;
A = new int[N];
B = new int[N];
for(int i = 0; i < N; i++) {
    A[i] = rand() % 1000;
    B[i] = rand() % 1000;
}
...

// Allocate GPU memory for source vectors
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, A, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, B, NULL);
Sample code for allocating memory on device for the output vector

// Allocate GPU memory for output vector
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int)*N, NULL, NULL);
Kernel Program
Simple programs might keep the kernel in the same file as the host code, as in our CUDA examples. In that case the kernel needs to be formed into strings in a character array. If the kernel is in a separate file, the host program can read that file in as a character string.
Kernel program
const char* OpenCLSource[] = {
  "__kernel void vectorAdd (const __global int* a,",
  "                         const __global int* b,",
  "                         __global int* c)",
  "{",
  "  unsigned int gid = get_global_id(0);",
  "  c[gid] = a[gid] + b[gid];",
  "}"
};
...
int main(int argc, char **argv) { ... }

Notes:
• If in the same program as the host, the kernel needs to be given as strings (it can also be a single string)
• __kernel is the OpenCL qualifier to indicate kernel code
• __global is the OpenCL qualifier to indicate kernel memory (memory objects allocated from the global memory pool)
• Double underscores are optional in OpenCL qualifiers
• get_global_id(0) returns the global work-item ID in the given dimension (0 here)
Kernel in a separate file
// Load the kernel source code into the array source_str
FILE *fp;
char *source_str;
size_t source_size;

fp = fopen("vector_add_kernel.cl", "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char*)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);

http://mywiki-science.wikispaces.com/OpenCL
Create kernel program object
const char* OpenCLSource[] = {...};
int main(int argc, char **argv) ...

// Create OpenCL program object
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);

Arguments: the context, the number of strings in the kernel program array (7 here), the array of strings, an array of string lengths (used if the strings are not null-terminated), and a pointer used to return an error code on error. This example uses a single file for both host and kernel code; clCreateProgramWithSource() can equally be used with a separate kernel file read into the host program.
Build kernel program
// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

Arguments: the program object from clCreateProgramWithSource(), the number of devices, the list of devices (if more than one), build options, a function pointer to a notification routine called when the build completes, and arguments for that routine. If a notification routine is given, clBuildProgram returns immediately; otherwise it returns only when the build is complete.
Creating Kernel Objects
// Create a handle to the compiled OpenCL function
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "vectorAdd", NULL);

Arguments: the built program from clBuildProgram(), the function name (the function with the __kernel qualifier), and a pointer used to return an error code.
Set Kernel Arguments
// Set kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector2);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUOutputVector);

Arguments: the kernel object from clCreateKernel(), which argument is being set (the index), the size of the argument, and a pointer to the data for the argument (here, buffers from clCreateBuffer()).
Enqueue a command to execute kernel on device

// Work sizes
size_t WorkSize[1] = {N};          // Total number of work items
size_t localWorkSize[1] = {256};   // Number of work items in a work group

// Launch the kernel
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
                       WorkSize, localWorkSize, 0, NULL, NULL);

Arguments: the command queue, the kernel object from clCreateKernel(), the number of dimensions of the work items (1 here), an offset used with the work-item indices, an array giving the number of global work items, an array giving the number of work items that make up a work group, the number of events to complete before this command, the event wait list, and an event object.
Function to copy from buffer object to host memory
The following function enqueues commands to read from a buffer object into host memory:

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t cb,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Function to copy from host memory to buffer object
The following function enqueues commands to write to a buffer object from host memory:

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t cb,
                            const void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event)
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Copy data back from kernel
// Copy the output back to CPU memory
int *C;
C = new int[N];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0, N*sizeof(int), C, 0, NULL, NULL);

Arguments: the command queue from clCreateCommandQueue(), the device buffer from clCreateBuffer(), whether the read is blocking (CL_TRUE here), the byte offset in the buffer, the size of the data to read in bytes, a pointer to the host buffer to write the data into, the number of events to complete before this command, the event wait list, and an event object.
Results from GPU
cout << "C[" << 0 << "]: " << A[0] << "+" << B[0] << "=" << C[0] << "\n";
cout << "C[" << N-1 << "]: " << A[N-1] << "+" << B[N-1] << "=" << C[N-1] << "\n";

(C++ output here)
Clean-up
// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
clReleaseContext(GPUContext);
Compiling
Need the OpenCL header:

#include <CL/cl.h>
(for Mac: #include <OpenCL/opencl.h>)

and link to the OpenCL library.

Compile the OpenCL host program main.c using gcc, in two phases:

gcc -c -I /path-to-include-dir-with-cl.h/ main.c -o main.o
gcc -L /path-to-lib-folder-with-OpenCL-libfile/ main.o -o host -lOpenCL

Ref: http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/