Accelerating sequential computer vision algorithms using commodity parallel hardware

Platform Parallel Netherlands GPGPU-day, 28 June 2012

Jaap van de Loosdrecht

NHL Centre of Expertise in Computer Vision

Van de Loosdrecht Machine Vision BV

Limerick Institute of Technology

Overview

Introduction

Computer vision algorithms and parallelization

Benchmarking

Run-time prediction of whether parallelization is beneficial

Progress

Future work

Summary and preliminary conclusions

Questions

Introduction

Manager NHL Centre of Expertise in Computer Vision

University of professional education, Leeuwarden

4.5 FTE

Since 1996: 160 industrial projects

Managing director Van de Loosdrecht Machine Vision BV

VisionLab: development environment for Computer Vision with Pattern matching, Neural networks and Genetic algorithms

Portable library, > 100,000 lines of ANSI C++

Windows, Linux and Android

x86, x64, ARM and PowerPC

Student Limerick Institute of Technology (Ireland)

Research master project, 1 September 2011 to 1 July 2013

VisionLab: development environment for Computer Vision

Motivation

Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures

Aims and objectives

Compare existing programming languages and environments for parallel computing

Choose one standard for

Multi-core CPU programming

GPU programming

Re-implement a number of standard and well-known algorithms

Compare performance to existing sequential implementation of VisionLab

Evaluate test results, benefits and costs of parallel approaches to implementation of computer vision algorithms

Requirements

Primary target system

Conventional PC or intelligent camera

Windows or Linux on x86 or x64

Important option: easy porting (Android, ARM, PowerPC)

Existing scripts and applications should not have to be modified in order to benefit from parallelization


Chosen standards must be

An industry standard

Vendor independent

For CPU

ANSI C++ based

Efficient parallelization for majority of code

Related research

Other research projects

Compare best sequential with best parallel algorithm

Often specific domain and hardware

Framework for auto parallelisation

Still in research, not yet generically applicable

Special points of interest in my project

Generic library

Portability and vendor independence


Variance in execution times

100,000 lines of ANSI C++

Choice of standard for multi-core CPU programming (1 Oct 2011)

| Standard | Industry standard | Maturity | Acceptance by market | Future developments | Vendor independence | Portability | Scalable to ccNUMA (optional) | Vector capabilities (optional) | Effort for conversion |
|---|---|---|---|---|---|---|---|---|---|
| Array Building Blocks | No | Beta | New, not ranked | Good | Poor | Poor | No | Yes | Huge |
| C++11 Threads | Yes | Partly new | New, not ranked | Good | Good | Good | No | No | Huge |
| Cilk Plus | No | Good | Rank 6 | Good | Reasonable, no MSVC | Reasonable | No | Yes | Low |
| MCAPI | No | Poor | Not ranked | Poor | Poor | Poor | Yes | No | Huge |
| MPI | Yes | Excellent | Rank 7 | Good | Good | Good | Yes | No | Huge |
| OpenMP | Yes | Excellent | Rank 1 | Good | Good | Good | Yes, only GNU | No | Low |
| Parallel Patterns Library | No | Reasonable | New, not ranked | Good | Poor, only MSVC | Poor | No | No | Huge |
| Posix Threads | Yes | Excellent | Not ranked | Poor | Good | Good | No | No | Huge |
| Thread Building Blocks | No | Good | Rank 3 | Good | Reasonable | Reasonable | No | No | Huge |

Choice of standard for GPU programming (1 Oct 2011)

| Standard | Industry standard | Maturity | Acceptance by market | Future developments | Expected familiarization time | Hardware vendor independence | Software vendor independence | Portability | Heterogeneous |
|---|---|---|---|---|---|---|---|---|---|
| Accelerator | No | Good | Not ranked | Bad | Medium | Bad | Bad | Poor | No |
| CUDA | No | Excellent | Rank 5 | Good | High | Bad | Bad | Bad | No |
| Direct Compute | No | Poor | Not ranked | Unknown | High | Bad | Bad | Bad | No |
| HMPP | No | Poor | Not ranked | Plan for open standard | Medium | Reasonable | Bad | Good | Yes |
| OpenCL | Yes | Good | Rank 2 | Good | High | Excellent | Good | Good | Yes |
| PGI Accelerator | No | Reasonable | Not ranked | Unknown | Medium | Bad | Bad | Bad | No |

Computer vision algorithms and parallelization

Classification image operators

Low level image operators

Point operators

Local neighbour operators

Global operators

Connectivity based operators

High level image operators

Often built on the low level operators

Specials

Pattern matcher, neural network, genetic algorithm, etc

Idea: design and implement skeletons for parallelizing representative operators of each class
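To make the skeleton idea concrete for the point operator class, here is a minimal sketch, not the VisionLab implementation. It assumes a C++11 compiler (for the lambda) and an image stored as a flat `std::vector`; the names `pointOperatorSkeleton` and `threshold` are hypothetical. The skeleton owns the parallel loop, and each concrete operator only supplies the per-pixel function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical skeleton for the point operator class: every output pixel
// depends only on the input pixel at the same position, so the iterations
// are independent and the loop can be shared among threads.
template <typename Pixel, typename Op>
void pointOperatorSkeleton(const std::vector<Pixel>& src,
                           std::vector<Pixel>& dst, Op op)
{
    assert(src.size() == dst.size());
    const long n = static_cast<long>(src.size());
    // Without OpenMP support the pragma is ignored and the loop runs
    // sequentially, so existing callers keep working unchanged.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        dst[i] = op(src[i]);
    }
}

// Example representative: a binary threshold expressed through the skeleton.
// Pixels inside [low, high] become 1, all others 0.
inline std::vector<unsigned char> threshold(const std::vector<unsigned char>& src,
                                            unsigned char low, unsigned char high)
{
    std::vector<unsigned char> dst(src.size());
    pointOperatorSkeleton(src, dst, [low, high](unsigned char p) {
        return static_cast<unsigned char>((p >= low && p <= high) ? 1 : 0);
    });
    return dst;
}
```

Other point operators would reuse the same skeleton with a different per-pixel function. Local neighbour, global and connectivity based operators would each need their own skeleton, because their data dependencies differ.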

Benchmarking

Benchmark protocol

Data analysis with R

Speedup graphs

Speedup tables

Median of execution time tables

Best work-group size tables

Violin plots
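The statistics behind the median and speedup tables can be sketched as follows. This is an illustration in C++, not the R scripts used in the project. It assumes that speedup is the ratio of the median sequential time to the median parallel time; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of repeated execution-time measurements (any consistent unit).
// The vector is taken by value so the caller's data is not reordered.
inline double medianTime(std::vector<double> times)
{
    assert(!times.empty());
    std::sort(times.begin(), times.end());
    const std::size_t mid = times.size() / 2;
    return (times.size() % 2 == 1) ? times[mid]
                                   : 0.5 * (times[mid - 1] + times[mid]);
}

// Speedup of the parallel implementation relative to the sequential one,
// assumed here to be the ratio of the two medians.
inline double speedup(const std::vector<double>& sequentialTimes,
                      const std::vector<double>& parallelTimes)
{
    const double par = medianTime(parallelTimes);
    assert(par > 0.0);
    return medianTime(sequentialTimes) / par;
}
```

The median is less sensitive to occasional outliers than the mean, which matters when execution times vary from run to run.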

OpenMP

Progress

Script commands added

Framework for benchmarking

> 160 operators are parallelized


Calibration procedure

Example speedup graph (i7 2600)


The speed-up depends on

Size of image

Pixel type

Content of the image

Parameters like size of neighbourhood

Etc.

Calibration procedure for OpenMP

Simple and fast procedure for global optimization

Complex and slow procedure for finer-grained optimization of each (sub)operator
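One way the result of a calibration could be used at run time is sketched below. It assumes, purely for illustration, that calibration yields a single break-even image size per operator; the `Calibration` struct and the function names are hypothetical and not taken from VisionLab. The OpenMP `if` clause keeps the loop on one thread when the prediction says parallelization does not pay off.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical calibration result: the smallest image size (in pixels) at
// which the parallel version was measured to be faster on this machine.
struct Calibration {
    std::size_t minPixelsForParallel;
};

// Run-time prediction: parallelize only when the image is large enough.
inline bool parallelIsBeneficial(const Calibration& cal, std::size_t nrPixels)
{
    return nrPixels >= cal.minPixelsForParallel;
}

// Point operator that adds a constant to every pixel. The OpenMP `if`
// clause falls back to a single thread for images below the threshold.
inline void addConstant(std::vector<int>& image, int value, const Calibration& cal)
{
    const long n = static_cast<long>(image.size());
    const bool useParallel = parallelIsBeneficial(cal, image.size());
    #pragma omp parallel for if(useParallel)
    for (long i = 0; i < n; ++i) {
        image[i] += value;
    }
}
```

A real predictor would probably also take pixel type and operator parameters into account, since the speed-up depends on those as well.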

Variations in execution times, violin plot

OpenCL

Progress

Toolbox for using OpenCL kernels from scripts and C++

Script commands added for host API

Framework for benchmarking

Implementation of the first kernels

OpenCL development in VisionLab

Optimal number of local histograms per work-group (GTX 560 Ti)

Speedup graph of Histogram, 16 local histograms per work-group (GTX 560 Ti)

Violin plot for Histogram, 16 local histograms/wg (GTX 560 Ti)
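The local histogram idea can be illustrated with a CPU analogue in C++. This is a sketch under stated assumptions, not the OpenCL kernel used in the benchmarks: the function name `histogramWithLocals` is hypothetical, the image is assumed to be 8-bit, and the image is simply cut into contiguous chunks. Each chunk fills its own private histogram, and the local histograms are summed at the end, so no two workers update the same counter.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Histogram = std::array<unsigned int, 256>;

// CPU analogue of the local histogram technique: split the image into
// nrLocal contiguous chunks, let each chunk fill its own histogram, then
// reduce the local histograms into one result.
inline Histogram histogramWithLocals(const std::vector<unsigned char>& image,
                                     int nrLocal)
{
    assert(nrLocal > 0);
    std::vector<Histogram> locals(static_cast<std::size_t>(nrLocal), Histogram{});
    const std::size_t n = image.size();

    // Each iteration writes only to locals[g], so no atomics are needed here.
    #pragma omp parallel for
    for (int g = 0; g < nrLocal; ++g) {
        const std::size_t begin = n * static_cast<std::size_t>(g) / nrLocal;
        const std::size_t end = n * (static_cast<std::size_t>(g) + 1) / nrLocal;
        for (std::size_t i = begin; i < end; ++i) {
            ++locals[g][image[i]];
        }
    }

    Histogram total{};  // zero-initialised
    for (const Histogram& local : locals) {
        for (std::size_t bin = 0; bin < total.size(); ++bin) {
            total[bin] += local[bin];
        }
    }
    return total;
}
```

The number of local histograms trades contention against the cost of the final reduction, which is why the best value has to be found by benchmarking on each device.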

X-files: OpenCL versus OpenMP

Future work OpenCL

Near future

Memory transfers

Pinned

Zero copy APU

Implementing more vision operators

More distant future

Intelligent buffer management?

Automatic tuning of parameters?

Run-time prediction of whether parallelization is beneficial?

Heterogeneous computing?

OpenMP 4.0, OpenACC, C++ AMP?

The future: XIMEA Currera G (APU based)

Summary and preliminary conclusions

Choice made for standards OpenMP and OpenCL

Integration OpenMP and OpenCL in VisionLab

Benchmark environment

OpenMP

Embarrassingly parallel algorithms are easy to convert

More than 160 operators parallelized

Run time prediction implemented

OpenCL

Scripting host side code accelerates development time

Portable functionality

Portable performance is not easy

Still have to learn a lot about GPUs

Searching for sparring partners

Questions?

Jaap van de Loosdrecht

NHL Centre of Expertise in Computer Vision

www.nhl.nl/computervision

Van de Loosdrecht Machine Vision BV

www.vdlmv.nl
