Accelerating sequential computer vision algorithms using commodity parallel hardware
Platform Parallel Netherlands GPGPU-day, 28 June 2012
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
Van de Loosdrecht Machine Vision BV
Limerick Institute of Technology
-
Overview
Introduction
Computer vision algorithms and parallelization
Benchmarking
Run-time prediction of whether parallelization is beneficial
Progress
Future work
Summary and preliminary conclusions
Questions
-
Introduction
Manager NHL Centre of Expertise in Computer Vision
University of professional education, Leeuwarden
4.5 FTE
Since 1996: 160 industrial projects
Managing director Van de Loosdrecht Machine Vision BV
VisionLab: development environment for Computer Vision with Pattern matching, Neural networks and Genetic algorithms
Portable library, > 100,000 lines of ANSI C++
Windows, Linux and Android
x86, x64, ARM and PowerPC
Student Limerick Institute of Technology (Ireland)
Research master project, 1 September 2011 to 1 July 2013
-
VisionLab: development environment for Computer Vision
-
Motivation
Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures
-
Aims and objectives
Compare existing programming languages and environments for parallel computing
Choose one standard for
Multi-core CPU programming
GPU programming
Re-implement a number of standard and well-known algorithms
Compare performance to existing sequential implementation of VisionLab
Evaluate test results, benefits and costs of parallel approaches to implementation of computer vision algorithms
-
Requirements
Primary target system
Conventional PC or intelligent camera
Windows or Linux, on an x86 or x64 processor
Important option: easy porting (Android, ARM, PowerPC)
Existing scripts and applications should not have to be modified in order to benefit from parallelization
Run-time prediction of whether parallelization is beneficial
Chosen standards must be
An industry standard
Vendor independent
For CPU
ANSI C++ based
Efficient parallelization for majority of code
-
Related research
Other research projects
Compare best sequential with best parallel algorithm
Often specific domain and hardware
Frameworks for automatic parallelization
Still in research, not yet generically applicable
Special points of interest in my project
Generic library
Portability and vendor independence
Run-time prediction of whether parallelization is beneficial
Variance in execution times
100,000 lines of ANSI C++
-
Choice of standard for multi-core CPU programming (1 Oct 2011)
Columns are requirements, rows are candidate standards.
Standard | Industry standard | Maturity | Acceptance by market | Future developments | Vendor independence | Portability | Scalable to ccNUMA (optional) | Vector capabilities (optional) | Effort for conversion
Array Building Blocks | No | Beta | New, not ranked | Good | Poor | Poor | No | Yes | Huge
C++11 Threads | Yes | Partly new | New, not ranked | Good | Good | Good | No | No | Huge
Cilk Plus | No | Good | Rank 6 | Good | Reasonable, no MSVC | Reasonable | No | Yes | Low
MCAPI | No | Poor | Not ranked | Poor | Poor | Poor | Yes | No | Huge
MPI | Yes | Excellent | Rank 7 | Good | Good | Good | Yes | No | Huge
OpenMP | Yes | Excellent | Rank 1 | Good | Good | Good | Yes, only GNU | No | Low
Parallel Patterns Library | No | Reasonable | New, not ranked | Good | Poor, only MSVC | Poor | No | No | Huge
Posix Threads | Yes | Excellent | Not ranked | Poor | Good | Good | No | No | Huge
Thread Building Blocks | No | Good | Rank 3 | Good | Reasonable | Reasonable | No | No | Huge
-
Choice of standard for GPU programming (1 Oct 2011)
Columns are requirements, rows are candidate standards.
Standard | Industry standard | Maturity | Acceptance by market | Future developments | Expected familiarization time | Hardware vendor independence | Software vendor independence | Portability | Heterogeneous
Accelerator | No | Good | Not ranked | Bad | Medium | Bad | Bad | Poor | No
CUDA | No | Excellent | Rank 5 | Good | High | Bad | Bad | Bad | No
Direct Compute | No | Poor | Not ranked | Unknown | High | Bad | Bad | Bad | No
HMPP | No | Poor | Not ranked | Plan for open standard | Medium | Reasonable | Bad | Good | Yes
OpenCL | Yes | Good | Rank 2 | Good | High | Excellent | Good | Good | Yes
PGI Accelerator | No | Reasonable | Not ranked | Unknown | Medium | Bad | Bad | Bad | No
-
Computer vision algorithms and parallelization
Classification of image operators
Low level image operators
Point operators
Local neighbour operators
Global operators
Connectivity based operators
High level image operators
Often built on the low level operators
Specials
Pattern matcher, neural network, genetic algorithm, etc
Idea: design and implement skeletons for parallelizing representative operators of each class
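A minimal sketch of what such a skeleton could look like for the point-operator class, assuming OpenMP on a multi-core CPU. The names PointOperatorSkeleton and Threshold are hypothetical and not taken from VisionLab, and an image is simplified to a flat vector of pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical skeleton for point operators: every output pixel depends only
// on the input pixel at the same position, so iterations are independent.
template <typename Pixel, typename Op>
void PointOperatorSkeleton(const std::vector<Pixel>& src,
                           std::vector<Pixel>& dst, Op op) {
    assert(src.size() == dst.size());
    const long n = static_cast<long>(src.size());
    // Without OpenMP support the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        dst[i] = op(src[i]);
    }
}

// Illustrative point operator: binary threshold on 8-bit pixels.
struct Threshold {
    unsigned char low;
    unsigned char high;
    unsigned char operator()(unsigned char p) const {
        return static_cast<unsigned char>((p >= low && p <= high) ? 1 : 0);
    }
};
```

Under these assumptions, a new point operator only has to supply the per-pixel function object; the loop and its parallelization stay in one place.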
-
Benchmarking
Benchmark protocol
Data analysis with R
Speedup graphs
Speedup tables
Tables of median execution times
Tables of best work-group sizes
Violin plots
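The slides list median execution times and speedup as benchmark outputs but do not give the formulas. The sketch below assumes that speedup is the ratio of the median sequential time to the median parallel time; the function names Median and Speedup are illustrative only (the talk itself uses R for the analysis).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of measured execution times (any consistent time unit).
// The vector is taken by value so the caller's measurements stay unsorted.
double Median(std::vector<double> times) {
    assert(!times.empty());
    std::sort(times.begin(), times.end());
    const std::size_t mid = times.size() / 2;
    if (times.size() % 2 == 1) {
        return times[mid];
    }
    return 0.5 * (times[mid - 1] + times[mid]);
}

// Assumed definition: speedup = median sequential time / median parallel time.
double Speedup(const std::vector<double>& sequential,
               const std::vector<double>& parallel) {
    const double par = Median(parallel);
    assert(par > 0.0);
    return Median(sequential) / par;
}
```

The median is less sensitive to occasional slow runs than the mean, which matters when execution times vary between repetitions.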
-
OpenMP
Progress
Script commands added
Framework for benchmarking
More than 160 operators parallelized
Run-time prediction of whether parallelization is beneficial
Calibration procedure
-
Example speedup graph (i7 2600)
-
Run-time prediction of whether parallelization is beneficial
The speed-up depends on
Size of image
Pixel type
Content of the image
Parameters like size of neighbourhood
Etc.
Calibration procedure for OpenMP
Simple and fast procedure for global optimization
Complex and slow procedure for finer-grained optimization of each (sub)operator
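The slides do not describe how the prediction or the calibration is implemented. As one possible sketch, assume that calibration yields a break-even image size and that only image size is considered (pixel type, image content and operator parameters are ignored here). Calibration, Calibrate, ParallelizationBeneficial and Invert are hypothetical names; the OpenMP if clause is standard.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical calibration result: parallelize at or above this pixel count.
struct Calibration {
    std::size_t breakEvenPixels;
};

// Derive the break-even size from timings measured at ascending image sizes:
// the smallest size from which the parallel run is faster at every larger size.
Calibration Calibrate(const std::vector<std::size_t>& sizes,
                      const std::vector<double>& seqTimes,
                      const std::vector<double>& parTimes) {
    assert(sizes.size() == seqTimes.size() && sizes.size() == parTimes.size());
    Calibration cal;
    cal.breakEvenPixels = static_cast<std::size_t>(-1);  // default: never parallelize
    for (std::size_t i = sizes.size(); i-- > 0; ) {
        if (parTimes[i] < seqTimes[i]) {
            cal.breakEvenPixels = sizes[i];
        } else {
            break;
        }
    }
    return cal;
}

// Simplified run-time prediction based on image size only.
bool ParallelizationBeneficial(const Calibration& cal, std::size_t nrPixels) {
    return nrPixels >= cal.breakEvenPixels;
}

// Example operator: runs in parallel only when the prediction says it pays off.
void Invert(std::vector<unsigned char>& image, const Calibration& cal) {
    const long n = static_cast<long>(image.size());
    const bool parallel = ParallelizationBeneficial(cal, image.size());
    #pragma omp parallel for if(parallel)
    for (long i = 0; i < n; ++i) {
        image[i] = static_cast<unsigned char>(255 - image[i]);
    }
}
```

Because the decision is made inside the operator, calling scripts and applications need no changes, which matches the requirement stated earlier in the talk.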
-
Variations in execution times, violin plot
-
OpenCL
Progress
Toolbox for using OpenCL kernels from scripts and C++
Script commands added for host API
Framework for benchmarking
Implementation of first kernels
-
OpenCL development in VisionLab
-
Optimal number of local histograms per work-group (GTX 560 Ti)
-
Speedup graph of Histogram, 16 local histograms per work-group (GTX 560 Ti)
-
Violin plot for Histogram, 16 local histograms/wg (GTX 560 Ti)
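The figures and the OpenCL kernel are not reproduced in this transcript. As a CPU-side illustration of the general local-histogram idea only (not the VisionLab kernel), the sketch below lets each chunk of the image fill its own private histogram and merges them afterwards, which avoids contention on a single shared histogram. HistogramWithLocals and nrLocals are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Histogram of an 8-bit image computed via nrLocals private histograms.
std::vector<unsigned long> HistogramWithLocals(
        const std::vector<unsigned char>& image, std::size_t nrLocals) {
    assert(nrLocals > 0);
    const std::size_t nrBins = 256;
    std::vector<std::vector<unsigned long> > locals(
        nrLocals, std::vector<unsigned long>(nrBins, 0));
    // Each local histogram covers one contiguous chunk of the image.
    const std::size_t chunk = (image.size() + nrLocals - 1) / nrLocals;
    const long n = static_cast<long>(nrLocals);
    #pragma omp parallel for
    for (long k = 0; k < n; ++k) {
        const std::size_t begin = static_cast<std::size_t>(k) * chunk;
        const std::size_t end = std::min(begin + chunk, image.size());
        for (std::size_t i = begin; i < end; ++i) {
            ++locals[k][image[i]];  // no other chunk writes to locals[k]
        }
    }
    // Sequential merge of the local histograms into the final result.
    std::vector<unsigned long> hist(nrBins, 0);
    for (std::size_t k = 0; k < nrLocals; ++k) {
        for (std::size_t b = 0; b < nrBins; ++b) {
            hist[b] += locals[k][b];
        }
    }
    return hist;
}
```

More local histograms reduce contention but increase the merge cost, which is one reason the best count has to be found by measurement.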
-
X-files: OpenCL versus OpenMP
-
Future work OpenCL
Near future
Memory transfers
Pinned
Zero copy APU
Implementing more vision operators
More distant future
Intelligent buffer management?
Automatic tuning of parameters?
Run-time prediction of whether parallelization is beneficial?
Heterogeneous computing?
OpenMP 4.0, OpenACC, C++ AMP?
-
The future: XIMEA Currera G (APU based)
-
Summary and preliminary conclusions
Standards chosen: OpenMP and OpenCL
Integration of OpenMP and OpenCL in VisionLab
Benchmark environment
OpenMP
Embarrassingly parallel algorithms are easy to convert
More than 160 operators parallelized
Run-time prediction implemented
OpenCL
Scripting host-side code shortens development time
Portable functionality
Portable performance is not easy
Still have to learn a lot about GPUs
Searching for sparring partners
-
Questions?
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
www.nhl.nl/computervision
Van de Loosdrecht Machine Vision BV
www.vdlmv.nl