GPU Computing Gems Chapter 34 Review
DESCRIPTION
A book review of GPU Computing Gems, Chapter 34, for the Silicon Valley HPC Meet-Up Group
TRANSCRIPT
Book Review
Chapter 34: Experiences on Image and Video Processing with CUDA and OpenCL
Tim Child, October 2011
Outline
Authors
Chapter Contents
  Problem Statement
  Technology or Algorithm
    Background Subtraction
    Pearson's Correlation Coefficient
Key Insights
  Case Study 1: R-T Video Background Subtraction
  Case Study 2: Cross Correlation
Final Evaluation
Reviewer's Conclusions
Q&A
Authors
Alptekin Temizel
Tugba Halici
Berker Logoglu
Tugba Taskaya Temizel
Fatih Omruuzun
Ersin Karaman
Problem Statement
Guide users in implementing GPU algorithms
  Using CUDA and OpenCL
  Video and image processing
Comparison of CUDA and OpenCL
  Multiple architectures studied
  Advantages of each
  Video and image processing
Technology
Hardware Set-up Descriptions
  4 GPU configurations (3 CUDA, 1 OpenCL)
  1 CPU configuration (OpenMP on 4 cores)
Algorithms
1. Background Subtraction Algorithm
  Find moving pixels
  Update the background
  Update the motion threshold
2. Pearson's Correlation Coefficient
  $\mathrm{PMCC} = \dfrac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$
Background Subtraction Algorithm
Find Moving Pixels
Compare each pixel with the same pixel in the prior 2 images. A pixel is moving if
$|I_n(x,y) - I_{n-1}(x,y)| > T_n(x,y)$ and $|I_n(x,y) - I_{n-2}(x,y)| > T_n(x,y)$
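Update the Background
$$B_{n+1}(x,y) = \begin{cases} \alpha B_n(x,y) + (1-\alpha)\, I_n(x,y) & \text{if pixel } (x,y) \text{ is not moving} \\ B_n(x,y) & \text{if pixel } (x,y) \text{ is moving} \end{cases}$$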
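The two steps above can be sketched on the CPU as plain per-pixel functions, which is roughly the work one GPU thread would do per pixel. This is a minimal sketch and not the chapter's code. It assumes absolute differences in the moving-pixel test and a blend factor, here called alpha, in the background update. The function names and sample intensities are made up for illustration.

```python
def moving_mask(cur, prev1, prev2, thresh):
    """Return a 2-D list of booleans: True where a pixel is judged to be moving.

    A pixel counts as moving only if it differs from BOTH of the two
    previous frames by more than its per-pixel threshold.
    (Absolute differences are an assumption of this sketch.)
    """
    return [
        [
            abs(c - p1) > t and abs(c - p2) > t
            for c, p1, p2, t in zip(cur_row, p1_row, p2_row, t_row)
        ]
        for cur_row, p1_row, p2_row, t_row in zip(cur, prev1, prev2, thresh)
    ]


def update_background(bg, cur, moving, alpha=0.9):
    """Blend stationary pixels into the background; keep moving pixels unchanged.

    alpha is a hypothetical blend factor in [0, 1].
    """
    return [
        [
            b if m else alpha * b + (1.0 - alpha) * c
            for b, c, m in zip(bg_row, cur_row, m_row)
        ]
        for bg_row, cur_row, m_row in zip(bg, cur, moving)
    ]


# Tiny 1x3 "image" with made-up intensities.
prev2 = [[10, 10, 10]]
prev1 = [[10, 10, 10]]
cur = [[10, 60, 12]]
thresh = [[5, 5, 5]]
bg = [[10.0, 10.0, 10.0]]

mask = moving_mask(cur, prev1, prev2, thresh)
new_bg = update_background(bg, cur, mask, alpha=0.5)
```

In this example only the middle pixel is flagged as moving, so its background value is left alone while the other two are blended with the current frame.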
Updating Threshold
Background Subtraction
Pearson's Correlation Coefficient
PMCC values lie in the range [-1, 1]
Value   Description
 1      Perfect match
 0      No correlation
-1      Perfect negative correlation
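The coefficient and its range can be checked with a short CPU sketch that follows the definition directly (covariance divided by the product of the standard deviations). The function name and the flattened sample "images" are illustrative only. A constant image has zero standard deviation and would cause a division by zero here.

```python
import math


def pmcc(x, y):
    """Pearson's correlation coefficient of two equal-length sequences.

    Computed from the definition: cov(x, y) / (std(x) * std(y)).
    Raises ZeroDivisionError if either input is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


# Flattened pixel values of three made-up images.
image_a = [1, 2, 3, 4]
image_b = [2, 4, 6, 8]   # a scaled copy of image_a
image_c = [4, 3, 2, 1]   # image_a reversed

match = pmcc(image_a, image_b)      # close to  1 (perfect match)
opposite = pmcc(image_a, image_c)   # close to -1 (perfect negative correlation)
```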
Case Study 1: R-T Video Background Subtraction
Measure speed-up for various image sizes and architectures
Experiment 1: Single Kernel
  Best speed-up 5.6x
Experiment 2: Single vs. 3 Kernels
  Using multiple kernels decreased performance by 43% (due to independent memory accesses)
Experiment 3: Single Kernel, Serial/Async I/O
  Async I/O increased performance by 61% to 91%
Experiment 4: 8-bit vs. 32-bit Global Memory Access
  32-bit access increased performance by 8% to 29%
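One common way to get 32-bit accesses from 8-bit image data is to treat four neighbouring pixels as a single word, so each load moves four pixels instead of one. Whether the chapter does exactly this is an assumption. The sketch below only illustrates the packing idea on the CPU, with hypothetical helper names and made-up pixel values.

```python
def pack_pixels(pixels):
    """Pack 8-bit pixels into 32-bit words, four pixels per word (little-endian)."""
    if len(pixels) % 4 != 0:
        raise ValueError("pixel count must be a multiple of 4")
    data = bytes(pixels)  # raises ValueError if a value is outside 0..255
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, len(data), 4)]


def unpack_word(word):
    """Split one 32-bit word back into its four 8-bit pixels."""
    return list(word.to_bytes(4, "little"))


row = [10, 20, 30, 40, 50, 60, 70, 80]  # one made-up row of 8-bit pixels
words = pack_pixels(row)

byte_loads = len(row)    # loads needed when reading one pixel at a time
word_loads = len(words)  # loads needed when reading four pixels at a time
```

Here the same row takes 8 loads at 8-bit granularity and 2 loads at 32-bit granularity. Fewer, wider memory transactions are one plausible reason for the gain reported in Experiment 4.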
CS 1 Exp 1: Speed-Up W/O I/O
CS 1 Exp 1: Speed-Up With I/O
CS 1 Exp 2: Multiple Kernels
CS 1 Exp 3, 4: Async I/O, 8-bit vs. 32-bit
Case Study 2: Cross Correlation
Compute PMCC between images
Experiment 1: Global Memory Only
  Best speed-up 13.25x
Experiment 2: Global Memory & Shared Memory with Coalesced Access
  Speed-up of 4.34x (coalesced) and 5.6x (shared)
Experiment 3: Increasing Number of Images
  See the effect of increasing the number of images
  Performance gains increase with the number of images (89x speed-up at 8K images)
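To see why coalesced access matters in Experiment 2, a simplified model helps: count how many aligned memory segments a warp of 32 threads touches in one load. The 128-byte segment size and the stride of 64 elements are assumptions of this model, not figures from the chapter, and real coalescing rules differ between GPU generations.

```python
def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned memory segments covered by a list of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


WARP_SIZE = 32   # threads per warp on NVIDIA GPUs
ELEM_BYTES = 4   # one 32-bit element per thread

# Coalesced: thread t reads element t, so neighbouring threads read neighbouring words.
coalesced = [t * ELEM_BYTES for t in range(WARP_SIZE)]

# Strided: thread t reads element t * 64, e.g. walking down a column of a 64-wide array.
strided = [t * 64 * ELEM_BYTES for t in range(WARP_SIZE)]

coalesced_segments = segments_touched(coalesced)
strided_segments = segments_touched(strided)
```

In this model the coalesced pattern fits in a single segment while the strided pattern touches 32, one per thread, which is the kind of extra memory traffic that coalescing avoids.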
CS 2 Exp 1: Using Global Memory
CS 2 Exp 2: Speed-Up Due to Shared or Coalesced Memory
CS 2 Exp 3: Increased Number of Images
Key Insights
CUDA is faster than OpenCL*
Speed-ups between 11.6x and 89x are possible
Larger image sizes give more speed-up
I/O is significant and needs to be optimized
Efficient GPU memory access is vital
Using more kernels increases memory access and reduces performance
*The researchers did not effectively investigate why
Reviewer's Conclusions
Useful experiments
Conclusions are valid and make sense
Sometimes it is not clear what the baseline is
Hardware is now dated
Did not explore why CUDA differs from OpenCL
Q&A