GPU Computing Gems Chapter 34 Review

Book Review

Chapter 34
Experiences on Image and Video Processing with CUDA and OpenCL
Tim Child
October 2011

Outline
Authors
Chapter Contents
Problem Statement
Technology or Algorithm
Background Subtraction
Pearson's Correlation Coefficient
Key Insights
Case Study 1: R-T Video Background Subtraction
Case Study 2: Cross Correlation
Final Evaluation
Reviewer's Conclusions
Q&A

Authors
Alptekin Temizel
Tugba Halici
Berker Logoglu
Tugba Taskaya Temizel
Fatih Omruuzun
Ersin Karaman

Problem Statement
Guide users in implementing GPU algorithms using CUDA and OpenCL
Video and image processing
Comparison of CUDA and OpenCL
Multiple architectures studied
Advantages of each
Video and image processing

Technology
Hardware Set-up Descriptions
4 GPU Configurations (3 CUDA, 1 OpenCL)
1 CPU Configuration (OpenMP on 4 Cores)

Algorithms
1. Background Subtraction Algorithm
   Find moving pixels
   Update the background
   Update the motion threshold

2. Pearson's Correlation Coefficient
   PMCC = cov(x, y) / (σx σy)

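Background Subtraction Algorithm
Find Moving Pixels
Compare each pixel with the same pixel in the prior 2 images; the pixel is moving if
|I_n(x,y) - I_{n-1}(x,y)| > T_n(x,y) && |I_n(x,y) - I_{n-2}(x,y)| > T_n(x,y)

Update the Background
B_{n+1}(x,y) = α B_n(x,y) + (1 - α) I_n(x,y)   if pixel (x,y) is not moving
B_{n+1}(x,y) = B_n(x,y)                        if pixel (x,y) is moving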

Updating Threshold
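The moving-pixel test and background update on this slide can be sketched on the CPU as follows. This is a minimal illustration, not the chapter's kernel: it assumes absolute frame differences, a blending weight named alpha (value chosen arbitrarily here), flat lists of grayscale values, and a threshold that is held fixed. The function name is hypothetical.

```python
def background_step(prev2, prev1, curr, background, threshold, alpha=0.9):
    """One background-subtraction step over flat lists of grayscale pixels.

    Returns (moving_mask, new_background). alpha is an assumed blending
    weight; the per-pixel threshold is left unchanged in this sketch.
    """
    moving = []
    new_background = []
    for i2, i1, i0, b, t in zip(prev2, prev1, curr, background, threshold):
        # A pixel is moving when it differs from BOTH prior frames by more than T.
        is_moving = abs(i0 - i1) > t and abs(i0 - i2) > t
        moving.append(is_moving)
        # Stationary pixels blend the current frame into the background;
        # moving pixels keep the old background value.
        if is_moving:
            new_background.append(b)
        else:
            new_background.append(alpha * b + (1 - alpha) * i0)
    return moving, new_background
```

Each pixel is handled independently of its neighbours, which is why the same per-pixel body maps naturally onto one GPU thread per pixel.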

Background Subtraction

Pearson's Correlation Coefficient
PMCC values lie in the range [-1, 1]

Value   Description
 1      Perfect match
 0      No correlation
-1      Perfect negative correlation
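As a small worked illustration of the three values above, the coefficient can be computed as the covariance divided by the product of the standard deviations. The sketch below assumes population (divide-by-n) statistics and treats each image as a flat list of pixel values; the function name pmcc is just a label chosen here.

```python
import math

def pmcc(x, y):
    """Pearson correlation of two equal-length sequences of pixel values.

    Uses population covariance and standard deviations. A constant input
    has zero standard deviation and would raise ZeroDivisionError.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)
```

For example, pmcc([1, 2, 3, 4], [2, 4, 6, 8]) is 1 up to rounding, while reversing the second list gives -1.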

Case Study 1: R-T Video Subtraction
Measure speed-up for various image sizes and architectures

Experiment 1: Single Kernel
Best speed-up 5.6x

Experiment 2: Single vs. 3 Kernels
Using multiple kernels decreases performance by 43% (due to independent memory accesses)

Experiment 3: Single Kernel, Serial vs. Async I/O
Async I/O increased performance by 61%-91%

Experiment 4: 8-bit vs. 32-bit Global Memory Access
32-bit access increased performance by 8%-29%
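One way to picture the 8-bit versus 32-bit comparison in Experiment 4: four 8-bit pixels fit in a single 32-bit word, so a read of one word can deliver four pixels at once. The host-side sketch below only illustrates that packing; it is not the chapter's code, the helper names are hypothetical, and little-endian byte order is an assumption.

```python
import struct

def pack_pixels(pixels):
    """Pack 8-bit pixels (length must be a multiple of 4) into 32-bit words."""
    raw = bytes(pixels)
    # "<" = little-endian, "I" = unsigned 32-bit integer.
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

def unpack_word(word):
    """Recover the four 8-bit pixels stored in one 32-bit word."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]
```

With this layout an image of N pixels needs N/4 word reads instead of N byte reads, which is presumably where the reported gain comes from.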

CS 1 Exp 1: Speed-Up W/O I/O

CS 1 Exp 1: Speed-Up With I/O

CS 1 Exp 2: Multiple Kernels

CS 1 Exp 3, 4: Async I/O, 8-bit vs. 32-bit

Case Study 2: Cross Correlation
Compute PMCC between images

Experiment 1: Global Memory Only
Best speed-up 13.25x

Experiment 2: Global Memory & Shared Memory with Coalesced Access
Speed-up of 4.34x (coalesced), 5.6x (shared)

Experiment 3: Increasing Number of Images
Performance gains increase with the number of images (89x speed-up at 8K images)

CS 2 Exp 1: Using Global Memory

CS 2 Exp 2: Speed-Up Due to Shared or Coalesced Memory

CS 2 Exp 3: Increased Number of Images

Key Insights
CUDA is faster than OpenCL*
Speed-ups between 11.6x and 89x are possible
Larger image sizes give more speed-up
I/O is significant and needs to be optimized
Efficient GPU memory access is vital
Using more kernels increases memory access and reduces performance

*Researchers didn't effectively investigate why

Reviewer's Conclusions
Useful experiments
Conclusions are valid and make sense
Sometimes not clear what the baseline is
Hardware is now dated
Didn't explore why CUDA differs from OpenCL

Q&A
