GPU PROGRAMMING
David Gilbert
California State University, Los Angeles

Outline
• CUDA
• CPU vs GPU Architecture
• Scalability
• Blocks
• Performance
• Speed Up
• Graphics Cards
• How It Works
• Program Flow
• When to Use the GPU
• Example: Matrix Row Sum
• References

CUDA
• Compute Unified Device Architecture (CUDA)
• High-performance computing on your GPU
• CUDA is Nvidia's proprietary architecture for GPU computing; OpenCL is an open alternative that also runs on AMD/ATI hardware

CPU vs GPU Architecture
• The arithmetic logic units (ALUs) do the computations; a GPU has many more of them than a CPU

Scalability
• Code scales upward automatically
• GPUs with more cores will execute the same code in less time
• You can add graphics cards to your computer for further gains; at best, throughput grows roughly linearly with the number of cards

Blocks
• Essentially groups of threads
• The number of blocks and the threads per block are defined in host code before the kernel is launched
• To find a thread's global index from its block:

i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;
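The index arithmetic can be checked on the CPU without a GPU. The sketch below is a plain C model of one dimension of a launch; `global_index` and `print_launch` are hypothetical helper names, not part of CUDA, and the block index is scaled by the block size before the thread index is added.

```c
#include <stdio.h>

/* Hypothetical helper: models blockIdx * blockDim + threadIdx for one dimension. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Prints the global index of every thread in a 1D launch. */
void print_launch(int num_blocks, int block_dim)
{
    for (int b = 0; b < num_blocks; ++b)
        for (int t = 0; t < block_dim; ++t)
            printf("block %d, thread %d -> index %d\n",
                   b, t, global_index(b, block_dim, t));
}
```

Calling `print_launch(4, 4)` lists sixteen distinct indices, 0 through 15, one per thread.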

Performance
• Supercomputer performance is measured in floating-point operations per second (FLOPS)
  • Megaflops = 10^6 FLOPS
  • Gigaflops = 10^9 FLOPS
  • Teraflops = 10^12 FLOPS
  • Petaflops = 10^15 FLOPS
• Japan’s K Computer: 10.51 petaflops
• Nvidia GTX 480: ~1300 gigaflops
• Core i7 920 @ 3.4 GHz: 69 gigaflops

Graphics Cards
• Consumer
  • AMD 6950, $250
    • 2.25 TFLOPs single-precision compute power
    • 562.5 GFLOPs double-precision compute power
    • 1408 stream processors
  • Nvidia GTX 470, $150
    • 1.09 TFLOPs single-precision compute power
    • 544.32 GFLOPs double-precision compute power
    • 448 CUDA cores
    • About $138 per single-precision TFLOP
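The price and throughput figures for the two cards can be compared as dollars per TFLOP. This is a minimal sketch; `cost_per_tflop` is a hypothetical helper name, and the inputs are the prices and single-precision peaks listed on the slide.

```c
/* Dollars per TFLOP: price in dollars divided by peak throughput in TFLOPs. */
double cost_per_tflop(double price_usd, double peak_tflops)
{
    return price_usd / peak_tflops;
}
```

With the slide's numbers, `cost_per_tflop(150.0, 1.09)` is roughly 137.6 for the GTX 470 and `cost_per_tflop(250.0, 2.25)` is roughly 111.1 for the AMD 6950.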

Speed Up?

How It Works
• The computer dumps the load onto the GPU
• The GPU does the computing
• The GPU returns the results to system memory
• This transfer is the biggest bottleneck in the system

[Diagram: CPU and GPU, with Code and Results passing between them]

Program Flow
1. Allocate System Memory
2. Allocate Device Memory
3. Copy Memory from System to Device
4. Execute the Code
5. Copy Results back to the System from the Device
6. Free Device Memory
7. Process Results
8. Free System Memory

• Steps 3 and 5 create the bottleneck

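When to Use the GPU
• Let dT = transfer time between device and system
• Let st = serial execution time
• Let pt = parallel execution time

Use the GPU when two transfers (one to the device, one back) plus the parallel run take less time than the serial run:

2(dT) + pt < st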
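The inequality above translates directly into a check. This is a minimal sketch; `gpu_is_worthwhile` is a hypothetical name, and all three times are assumed to be measured in the same unit.

```c
#include <stdbool.h>

/* Returns true when two transfers plus the parallel run beat the serial run. */
bool gpu_is_worthwhile(double dT, double pt, double st)
{
    return 2.0 * dT + pt < st;
}
```

For example, with dT = 1, pt = 2 and st = 5 the GPU wins (4 < 5), while with dT = 3 the transfers alone outweigh the serial time and it does not.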

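Example: Matrix Row Sum

Input matrix:
0.5   0.25  0.25  0
0.25  0.25  0.25  0.25
0     0.5   0.5   0
0     0     0.75  0.25

Block size, 4X1. Each block holds one column of the matrix:
Column 1: 0.5, 0.25, 0, 0
Column 2: 0.25, 0.25, 0.5, 0
Column 3: 0.25, 0.25, 0.5, 0.75
Column 4: 0, 0.25, 0, 0.25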

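Example: Matrix Row Sum
// Device code
__global__ void RowSum(float* B, float* Sum, int N, int M)
{
    // One thread per row: i is the global row index.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        float total = 0.0f;
        // B is stored row-major, so element (i, j) is B[i * M + j].
        for (int j = 0; j < M; ++j)
            total += B[i * M + j];
        Sum[i] = total;
    }
}
• B is the matrix being summed, stored as a flat row-major array
• Sum is the array storing the row sums, one entry per row
• N is # of rows
• M is # of cols
• Each thread writes only its own Sum[i], so no two threads update the same element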
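A serial version of the same computation gives a reference to check GPU results against and a way to measure the serial time st. This is a sketch in plain C; `row_sum_cpu` is a hypothetical name, and it assumes B is a flat row-major array of N x M floats.

```c
#include <stddef.h>

/* Serial reference: Sum[i] is the sum of row i of B, with B stored row-major. */
void row_sum_cpu(const float *B, float *Sum, int N, int M)
{
    for (int i = 0; i < N; ++i) {
        float total = 0.0f;
        for (int j = 0; j < M; ++j)
            total += B[(size_t)i * M + j];
        Sum[i] = total;
    }
}
```

For the 4 x 4 example matrix every row sums to 1, so the four entries of Sum should all be 1.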

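Example: Matrix Row Sum
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    int M = 4, N = 4;

    // Allocate System Memory
    size_t size = N * M * sizeof(float);   // the whole N x M matrix
    size_t sumSize = N * sizeof(float);    // one sum per row
    float* h_B = (float*)malloc(size);
    float* h_sum = (float*)malloc(sumSize);

    // Fill the matrix (row-major) with the example values
    const float example[16] = { 0.5f,  0.25f, 0.25f, 0.0f,
                                0.25f, 0.25f, 0.25f, 0.25f,
                                0.0f,  0.5f,  0.5f,  0.0f,
                                0.0f,  0.0f,  0.75f, 0.25f };
    for (int k = 0; k < N * M; ++k)
        h_B[k] = example[k];

    // Allocate Device Memory
    float *d_B, *d_sum;
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_sum, sumSize);

    // Copy System Memory to Device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Execute the code with enough blocks to cover all N rows
    int threadsPerBlock = 4;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    RowSum<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_sum, N, M);

    // Copy Results from Device Back to System Memory
    cudaMemcpy(h_sum, d_sum, sumSize, cudaMemcpyDeviceToHost);

    // Free Device Memory
    cudaFree(d_B);
    cudaFree(d_sum);

    // Process Results
    for (int i = 0; i < N; ++i)
        printf("Row %d sum = %f\n", i, h_sum[i]);

    // Free System Memory
    free(h_B);
    free(h_sum);

    return 0;
}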

Example: Matrix Row Sum
• Now, imagine a matrix of 1000 x 1000
• I don’t guarantee that this code will run
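For a larger matrix, the launch needs enough blocks to cover every row. The sketch below shows the usual ceiling division; `blocks_for` is a hypothetical helper name, and it assumes one thread per row, which the slides do not state explicitly.

```c
/* Smallest block count such that blocks * threads_per_block >= n. */
int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With 1000 rows and an assumed 256 threads per block, `blocks_for(1000, 256)` gives 4 blocks, or 1024 threads; the 24 surplus threads must be skipped by a bounds check in the kernel.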

References
• Newegg.com
• CUDA C Programming Guide
  http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
• AMD.com
  http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6950/Pages/amd-radeon-hd-6950-overview.aspx
• PCGameshardware.com
  http://www.pcgameshardware.com/aid,743498/Geforce-GTX-480-and-GTX-470-reviewed-Fermi-performance-benchmarks/Reviews/
• Nvidia.com
  http://www.nvidia.com/object/product_geforce_gtx_470_us.html


