1
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 7, 2011, OpenCL.ppt
OpenCL
These notes introduce OpenCL.
2
OpenCL (Open Computing Language)
A standard based upon C for portable parallel applications
Task parallel and data parallel applications
Focuses on multi-platform support (multiple CPUs, GPUs, …)
Development initiated by Apple.
Developed by the Khronos Group, which also manages OpenGL
OpenCL 1.0: 2008. Released with Mac OS X 10.6 (Snow Leopard)
OpenCL 1.1: June 2010
Similarities with CUDA
Implementation available for NVIDIA GPUs
Wikipedia, “OpenCL”, http://en.wikipedia.org/wiki/OpenCL
3
OpenCL Programming Model
Uses a data parallel programming model, similar to CUDA
Host program launches kernel routines as in CUDA, but allows for just-in-time compilation during host execution.
OpenCL “work items” correspond to CUDA threads
OpenCL “work groups” correspond to CUDA thread blocks
Work items in the same work group can be synchronized with a barrier, as in CUDA.
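The correspondence between work items and CUDA threads can be made concrete with the index arithmetic. The following is a minimal host-side sketch in plain C++ (no OpenCL calls). The function name globalId is hypothetical; it only mirrors the value get_global_id(0) would return for a one-dimensional range, assuming a zero global offset.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, for illustration only: computes the 1-D global
// work-item ID from the work-group ID, the work-group size and the local ID,
// assuming a zero global offset.
//   OpenCL: get_group_id(0) * get_local_size(0) + get_local_id(0)
//   CUDA:   blockIdx.x      * blockDim.x        + threadIdx.x
std::size_t globalId(std::size_t groupId, std::size_t localSize, std::size_t localId) {
    assert(localId < localSize);  // a local ID is always smaller than the group size
    return groupId * localSize + localId;
}
```

For example, with work groups of 256 work items, the work item with local ID 5 in work group 3 has global ID 3*256 + 5 = 773.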
4
Sample OpenCL code to add two vectors
To illustrate OpenCL commands, we will use OpenCL code to add two vectors, A and B, which are transferred to the device (GPU), with the result, C, returned to the host (CPU), similar to CUDA vector addition.
5
Structure of OpenCL main program
Get information about platform and devices available on system
Select devices to use
Create an OpenCL command queue
Create memory buffers on device
Transfer data from host to device memory buffers
Create kernel program object
Build (compile) kernel in-line (or load precompiled binary)
Create OpenCL kernel object
Set kernel arguments
Execute kernel
Read results from device memory and copy to host memory.
6
Platform
"The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."
Platforms are represented by a cl_platform_id object, initialized with clGetPlatformIDs()
http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
7
Simple code for identifying platform
//Platform
cl_platform_id platform;
clGetPlatformIDs (1, &platform, NULL);
Arguments of clGetPlatformIDs():
1: number of platform entries
&platform: list of OpenCL platforms found (platform IDs). In our case just one platform, identified by &platform
NULL: returns number of OpenCL platforms available. If NULL, ignored.
8
Context
“The environment within which the kernels execute and the domain in which synchronization and memory management is defined.
The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects.”
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
9
Code for context
//Context
// Property list: name/value pairs, terminated by 0
cl_context_properties props[3];
props[0] = (cl_context_properties) CL_CONTEXT_PLATFORM;
props[1] = (cl_context_properties) platform;
props[2] = (cl_context_properties) 0;
cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
//Context info
size_t ParmDataBytes;
// First call returns the size of the device list in bytes
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
// Second call fills in the device list
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);
10
Command Queue
“An object that holds commands that will be executed on a specific device.
The command-queue is created on a specific device in a context.
Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. ...”
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
11
Simple code for creating a command queue
// Create command-queue
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);
12
Allocating memory on device
Use clCreateBuffer:
cl_mem clCreateBuffer(cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret)
context: OpenCL context, from clCreateContextFromType()
flags: bit field to specify type of allocation/usage (CL_MEM_READ_WRITE, …)
size: no of bytes in buffer memory object
host_ptr: ptr to buffer data (may be previously allocated)
errcode_ret: returns error code if an error
Return value: memory object
13
Sample code for allocating memory on device for source data
// source data on host, two vectors
int *A, *B;
A = new int[N];
B = new int[N];
for(int i = 0; i < N; i++) {
    A[i] = rand()%1000;
    B[i] = rand()%1000;
}
…
// Allocate GPU memory for source vectors
// (CL_MEM_COPY_HOST_PTR copies the host data into the new buffer)
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, A, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, B, NULL);
14
Sample code for allocating memory on device for results on GPU
// Allocate GPU memory for output vector
cl_mem GPUOutputVector =
clCreateBuffer(GPUContext,CL_MEM_WRITE_ONLY,sizeof(int)*N,
NULL,NULL);
15
Kernel Program
Simple kernel programs might be in the same file as the host code, as in our CUDA examples.
In that case the kernel needs to be formed into strings in a character array.
If the kernel is in a separate file, that file can be read into the host program as a character string.
16
Kernel program
const char* OpenCLSource[] = {
"__kernel void vectorAdd (const __global int* a,",
" const __global int* b,",
" __global int* c)",
"{",
" unsigned int gid = get_global_id(0);",
" c[gid] = a[gid] + b[gid];",
"}"
};
…
int main(int argc, char **argv){
…
}
If in same program as host, kernel needs to be strings (a single string also works)
__kernel: OpenCL qualifier to indicate kernel code
__global: OpenCL qualifier to indicate kernel memory (memory objects allocated from global memory pool)
Double underscores optional in OpenCL qualifiers
get_global_id(0): returns global work-item ID in given dimension (0 here)
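As a sketch of the single-string alternative, the same kernel could be held in one C++11 raw string literal. The names kOpenCLSourceSingle and declaresKernel are hypothetical and not part of the original slides. With this form the host would pass a count of 1 (instead of 7) and the address of the pointer to clCreateProgramWithSource().

```cpp
#include <cassert>
#include <cstring>

// The vectorAdd kernel as one raw string literal instead of seven strings.
// Newlines inside the literal are preserved, so the OpenCL compiler sees
// ordinary multi-line source text.
static const char* kOpenCLSourceSingle = R"CLC(
__kernel void vectorAdd(const __global int* a,
                        const __global int* b,
                        __global int* c)
{
    unsigned int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
)CLC";

// Hypothetical sanity check: true if the source text contains the given
// kernel name, which is the name later passed to clCreateKernel().
bool declaresKernel(const char* kernelName) {
    assert(kernelName != nullptr);
    return std::strstr(kOpenCLSourceSingle, kernelName) != nullptr;
}
```

A raw string avoids the per-line quoting and commas of the string array, at the cost of requiring a C++11 compiler for the host program.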
17
Kernel in a separate file
// Load the kernel source code into the array source_str
// (MAX_SOURCE_SIZE must be defined elsewhere in the host program)
FILE *fp;
char *source_str;
size_t source_size;
fp = fopen("vector_add_kernel.cl", "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char*)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
http://mywiki-science.wikispaces.com/OpenCL
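Since the host program in these notes is C++, the file could also be read with the standard stream library, which avoids the fixed MAX_SOURCE_SIZE buffer. This is a minimal sketch, not the method of the referenced page; the function name readKernelSource is hypothetical.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads an entire text file into a std::string.
// Sets ok to false (and returns an empty string) if the file cannot be opened.
std::string readKernelSource(const std::string& path, bool& ok) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    ok = in.is_open();
    if (!ok) {
        return std::string();
    }
    std::ostringstream contents;
    contents << in.rdbuf();  // copy the whole file into the string stream
    return contents.str();
}
```

The returned string's c_str() pointer and size() could then be passed to clCreateProgramWithSource() with a count of 1.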
18
Create kernel program object
const char* OpenCLSource[] = {…};
int main(int argc, char **argv)…
// Create OpenCL program object
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
7: number of strings in kernel program array
NULL (4th argument): array of string lengths, used if the strings are not null-terminated
NULL (5th argument): used to return error code if error
This example uses a single file for both host and kernel code. Can also use clCreateProgramWithSource() with a separate kernel file read into the host program.
19
Build kernel program
// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram,0,NULL,NULL,NULL,NULL);
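Arguments of clBuildProgram():
OpenCLProgram: program object from clCreateProgramWithSource()
0: number of devices
NULL: list of devices (NULL here, so the program is built for all devices associated with it)
NULL: build options
NULL: function ptr to notification routine called when the build is complete. If given, clBuildProgram will return immediately; otherwise it returns only when the build is complete
NULL: arguments for notification routine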
20
Creating Kernel Objects
// Create a handle to the compiled OpenCL function
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "vectorAdd", NULL);
OpenCLProgram: built program from clBuildProgram()
"vectorAdd": function name with __kernel qualifier
NULL: return error code
21
Set Kernel Arguments
// Set kernel arguments
clSetKernelArg(OpenCLVectorAdd,0,sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd,1,sizeof(cl_mem), (void*)&GPUVector2);
clSetKernelArg(OpenCLVectorAdd,2,sizeof(cl_mem), (void*)&GPUOutputVector);
Arguments of clSetKernelArg():
OpenCLVectorAdd: kernel object from clCreateKernel()
0, 1, 2: which argument
sizeof(cl_mem): size of argument
(void*)&GPUVector1, …: pointer to data for argument, from clCreateBuffer()
22
Enqueue a command to execute kernel on device
// Set the work sizes
size_t WorkSize[1] = {N}; // Total number of work items
size_t localWorkSize[1] = {256}; // No of work items in work group
// Launch the kernel
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, localWorkSize, 0, NULL, NULL);
Arguments of clEnqueueNDRangeKernel():
OpenCLVectorAdd: kernel object from clCreateKernel()
1: dimensions of work items
NULL: offset used with work item
WorkSize: array describing no of global work items
localWorkSize: array describing no of work items that make up a work group
0: number of events to complete before this command
NULL: event wait list
NULL: event
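In OpenCL 1.x the global work size must be evenly divisible by the work-group size when a work-group size is given, so the launch above assumes N is a multiple of 256. A minimal sketch of the usual workaround is to round the global size up; the helper name roundUpGlobalSize is hypothetical. With a padded size, the kernel would also need an extra length argument and a guard such as if (gid < n), which is a change not shown in the original slides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds n up to the next multiple of localSize, so the
// result can be used as the global work size for a work-group size of localSize.
std::size_t roundUpGlobalSize(std::size_t n, std::size_t localSize) {
    assert(localSize > 0);
    std::size_t remainder = n % localSize;
    return remainder == 0 ? n : n + (localSize - remainder);
}
```

For example, N = 1000 with a work-group size of 256 gives a global size of 1024, so the last 24 work items have no element to process and must be masked out by the guard.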
23
Function to copy from buffer object to host memory
The following function enqueues a command to read from a buffer object to host memory:
cl_int clEnqueueReadBuffer (cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_read,
size_t offset,
size_t cb,
void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event)
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
24
Function to copy from host memory to buffer object
The following function enqueues a command to write to a buffer object from host memory:
cl_int clEnqueueWriteBuffer (cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
size_t offset,
size_t cb,
const void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event)
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
25
Copy data back from kernel
// Copy the output back to CPU memory
int *C;
C = new int[N];
clEnqueueReadBuffer(GPUCommandQueue,GPUOutputVector,
CL_TRUE, 0, N*sizeof(int), C, 0, NULL, NULL);
Arguments of clEnqueueReadBuffer():
GPUCommandQueue: command queue from clCreateCommandQueue()
GPUOutputVector: device buffer from clCreateBuffer()
CL_TRUE: read is blocking
0: byte offset in buffer
N*sizeof(int): size of data to read in bytes
C: pointer to buffer in host to write data
0: number of events to complete before this command
NULL: event wait list
NULL: event
26
Results from GPU
cout << "C[" << 0 << "]: " << A[0] << "+" << B[0] << "=" << C[0] << "\n";
cout << "C[" << N-1 << "]: " << A[N-1] << "+" << B[N-1] << "=" << C[N-1] << "\n";
C++ here
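Printing only the first and last elements does not check the rest of the result. As a sketch of a fuller host-side check, the output could be compared against a serial computation of the same sums. The helper name firstMismatch is hypothetical and is not part of the original program.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the index of the first element where
// c[i] != a[i] + b[i], or n if all n elements match.
std::size_t firstMismatch(const int* a, const int* b, const int* c, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && c != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        if (c[i] != a[i] + b[i]) {
            return i;
        }
    }
    return n;
}
```

In the program above this would be called as firstMismatch(A, B, C, N) after the blocking read; a return value equal to N means every element of C matches A + B.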
27
Clean-up
// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseContext(GPUContext);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
28
Compiling
Need OpenCL header:
#include <CL/cl.h>
(For mac: #include <OpenCL/opencl.h> )
and link to the OpenCL library.
Compile OpenCL host program main.c using gcc, two phases:
gcc -c -I /path-to-include-dir-with-cl.h/ main.c -o main.o
gcc -L /path-to-lib-folder-with-OpenCL-libfile/ -l OpenCL main.o -o host
Ref: http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
29
Make File (Program called scalarmulocl)
CC = g++
LD = g++ -lm
CFLAGS = -Wall -shared
CDEBUG =
LIBOCL = -L/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/lib
INCOCL = -I/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/inc
SRCS = scalarmulocl.cpp
OBJS = scalarmulocl.o
EXE = scalarmulocl.a
all: $(EXE)
$(OBJS): $(SRCS)
	$(CC) $(CFLAGS) $(INCOCL) -I/usr/include -c $(SRCS)
$(EXE): $(OBJS)
	$(LD) -L/usr/local/lib $(OBJS) $(LIBOCL) -o $(EXE) -l OpenCL
clean:
	rm -f $(OBJS) *~
	clear
References: http://mywiki-science.wikispaces.com/OpenCL
Submitted by: Manisha Mishra
30
Compiling and Executing the program
To compile: make
To run: ./scalarmulocl.a
Snapshot: [screenshot not reproduced in this transcript]
Submitted by: Manisha Mishra
Questions
32
More Information
Chapter 11 of Programming Massively Parallel Processors by D. B. Kirk and W-M W. Hwu, Morgan Kaufmann, 2010
33
clGetPlatformIDs
Obtain the list of platforms available.
cl_int clGetPlatformIDs(cl_uint num_entries,
cl_platform_id *platforms,
cl_uint *num_platforms)
Parameters
num_entries: The number of cl_platform_id entries that can be added to platforms. If platforms is not NULL, num_entries must be greater than zero.
platforms: Returns a list of OpenCL platforms found. The cl_platform_id values returned in platforms can be used to identify a specific OpenCL platform. If the platforms argument is NULL, this argument is ignored. The number of OpenCL platforms returned is the minimum of the value specified by num_entries and the number of OpenCL platforms available.
num_platforms: Returns the number of OpenCL platforms available. If num_platforms is NULL, this argument is ignored.
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
34
Includes
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h> //OpenCL header for C
#include <iostream> //C++ input/output
using namespace std;