Accelerating sequential computer vision algorithms using commodity parallel hardware Platform Parallel Netherlands GPGPU-day, 28 June 2012 Jaap van de Loosdrecht NHL Centre of Expertise in Computer Vision Van de Loosdrecht Machine Vision BV Limerick Institute of Technology

  • Accelerating sequential computer vision algorithms using commodity parallel hardware

    Platform Parallel Netherlands GPGPU-day, 28 June 2012

    Jaap van de Loosdrecht

    NHL Centre of Expertise in Computer Vision Van de Loosdrecht Machine Vision BV

    Limerick Institute of Technology

  • Overview

    Introduction

    Computer vision algorithms and parallelization

    Benchmarking

    Run-time prediction of whether parallelization is beneficial

    Progress

    Future work

    Summary and preliminary conclusions

    Questions

  • Introduction

    Manager NHL Centre of Expertise in Computer Vision

    University of professional education, Leeuwarden

    4.5 FTE

    Since 1996: 160 industrial projects

    Managing director Van de Loosdrecht Machine Vision BV

    VisionLab: development environment for Computer Vision with Pattern matching, Neural networks and Genetic algorithms

    Portable library, > 100,000 lines of ANSI C++

    Windows, Linux and Android

    x86, x64, ARM and PowerPC

    Student Limerick Institute of Technology (Ireland)

    Research master project, 1 September 2011 to 1 July 2013

  • VisionLab: development environment for Computer Vision

  • Motivation

    Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures

  • Aims and objectives

    Compare existing programming languages and environments for parallel computing

    Choose one standard for

    Multi-core CPU programming

    GPU programming

    Re-implement a number of standard and well-known algorithms

    Compare performance to existing sequential implementation of VisionLab

    Evaluate test results, benefits and costs of parallel approaches to implementation of computer vision algorithms

  • Requirements

    Primary target system

    Conventional PC or intelligent camera

    Windows or Linux, on x86 or x64

    Important option: easy porting (Android, ARM, PowerPC)

    Existing scripts and applications should not have to be modified in order to benefit from parallelization

    Run-time prediction of whether parallelization is beneficial

    Chosen standards must be

    An industry standard

    Vendor independent

    For CPU

    ANSI C++ based

    Efficient parallelization for majority of code

  • Related research

    Other research projects

    Compare best sequential with best parallel algorithm

    Often specific domain and hardware

    Framework for auto-parallelization

    Still in research, not yet generically applicable

    Special points of interest in my project

    Generic library

    Portability and vendor independence

    Run-time prediction of whether parallelization is beneficial

    Variance in execution times

    100,000 lines of ANSI C++

  • Choice of standard for multi-core CPU programming (1 Oct 2011)

    Standard \ Requirement    | Industry standard | Maturity   | Acceptance by market | Future developments | Vendor independence | Portability | Scalable to ccNUMA (optional) | Vector capabilities (optional) | Effort for conversion
    --------------------------+-------------------+------------+----------------------+---------------------+---------------------+-------------+-------------------------------+--------------------------------+----------------------
    Array Building Blocks     | No                | Beta       | New, not ranked      | Good                | Poor                | Poor        | No                            | Yes                            | Huge
    C++11 Threads             | Yes               | Partly new | New, not ranked      | Good                | Good                | Good        | No                            | No                             | Huge
    Cilk Plus                 | No                | Good       | Rank 6               | Good                | Reasonable, no MSVC | Reasonable  | No                            | Yes                            | Low
    MCAPI                     | No                | Poor       | Not ranked           | Poor                | Poor                | Poor        | Yes                           | No                             | Huge
    MPI                       | Yes               | Excellent  | Rank 7               | Good                | Good                | Good        | Yes                           | No                             | Huge
    OpenMP                    | Yes               | Excellent  | Rank 1               | Good                | Good                | Good        | Yes, only GNU                 | No                             | Low
    Parallel Patterns Library | No                | Reasonable | New, not ranked      | Good                | Poor, only MSVC     | Poor        | No                            | No                             | Huge
    Posix Threads             | Yes               | Excellent  | Not ranked           | Poor                | Good                | Good        | No                            | No                             | Huge
    Thread Building Blocks    | No                | Good       | Rank 3               | Good                | Reasonable          | Reasonable  | No                            | No                             | Huge

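  • Choice of standard for GPU programming (1 Oct 2011)

    Standard \ Requirement | Industry standard | Maturity   | Acceptance by market | Future developments    | Expected familiarization time | Hardware vendor independence | Software vendor independence | Portability | Heterogeneous
    -----------------------+-------------------+------------+----------------------+------------------------+-------------------------------+------------------------------+------------------------------+-------------+--------------
    Accelerator            | No                | Good       | Not ranked           | Bad                    | Medium                        | Bad                          | Bad                          | Poor        | No
    CUDA                   | No                | Excellent  | Rank 5               | Good                   | High                          | Bad                          | Bad                          | Bad         | No
    Direct Compute         | No                | Poor       | Not ranked           | Unknown                | High                          | Bad                          | Bad                          | Bad         | No
    HMPP                   | No                | Poor       | Not ranked           | Plan for open standard | Medium                        | Reasonable                   | Bad                          | Good        | Yes
    OpenCL                 | Yes               | Good       | Rank 2               | Good                   | High                          | Excellent                    | Good                         | Good        | Yes
    PGI Accelerator        | No                | Reasonable | Not ranked           | Unknown                | Medium                        | Bad                          | Bad                          | Bad         | No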

  • Computer vision algorithms and parallelization

    Classification of image operators

    Low level image operators

    Point operators

    Local neighbour operators

    Global operators

    Connectivity based operators

    High level image operators

    Often built on the low level operators

    Specials

    Pattern matcher, neural network, genetic algorithm, etc.

    Idea: design and implement skeletons for parallelizing representative operators of each class
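
    The skeleton idea can be illustrated for the simplest class, the point operators. The sketch below is not VisionLab code: `pointOperator` and `threshold` are hypothetical names, and the image is assumed to be a flat vector of 8-bit pixels. It only shows why this class is easy to parallelize with OpenMP: each output pixel depends on exactly one input pixel, so the loop iterations are independent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical skeleton for a point operator (not the VisionLab API).
// Every output pixel depends only on the input pixel at the same position,
// so the iterations can be distributed over threads without synchronization.
template <typename Pixel, typename Op>
void pointOperator(const std::vector<Pixel>& src, std::vector<Pixel>& dst, Op op) {
    assert(src.size() == dst.size());
    const long long n = static_cast<long long>(src.size());
    // If OpenMP is not enabled, the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        dst[static_cast<std::size_t>(i)] = op(src[static_cast<std::size_t>(i)]);
    }
}

// Example operator built on the skeleton: pixels inside [low, high] become 1,
// all other pixels become 0.
inline std::vector<std::uint8_t> threshold(const std::vector<std::uint8_t>& src,
                                           std::uint8_t low, std::uint8_t high) {
    std::vector<std::uint8_t> dst(src.size());
    pointOperator(src, dst, [low, high](std::uint8_t p) -> std::uint8_t {
        return (p >= low && p <= high) ? 1 : 0;
    });
    return dst;
}
```

    A new point operator then only has to supply the per-pixel function; the parallel loop lives in one place. Local neighbour, global and connectivity based operators need different skeletons because their pixels are not independent in this way.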

  • Benchmarking

    Benchmark protocol

    Data analysis with R

    Speedup graphs

    Speedup tables

    Median of execution time tables

    Best work-group size tables

    Violin plots
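
    The slides list median execution times and speedup as reported quantities but do not give the formulas. A minimal sketch, assuming speedup is defined as the ratio of the median sequential time to the median parallel time (the function names are illustrative, and the actual analysis in the talk was done in R):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of measured execution times (any consistent unit).
// The vector is taken by value because nth_element reorders its argument.
double median(std::vector<double> times) {
    assert(!times.empty());
    const std::size_t mid = times.size() / 2;
    std::nth_element(times.begin(), times.begin() + mid, times.end());
    const double upper = times[mid];
    if (times.size() % 2 == 1) {
        return upper;
    }
    // Even number of samples: average the two middle values.
    const double lower = *std::max_element(times.begin(), times.begin() + mid);
    return 0.5 * (lower + upper);
}

// Assumed definition: speedup = median sequential time / median parallel time.
double speedup(const std::vector<double>& sequential, const std::vector<double>& parallel) {
    return median(sequential) / median(parallel);
}
```

    The median is less sensitive than the mean to the occasional slow run, which matters when execution times vary as much as the violin plots in this talk suggest.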

  • OpenMP

    Progress

    Script commands added

    Framework for benchmarking

    More than 160 operators parallelized

    Run-time prediction of whether parallelization is beneficial

    Calibration procedure

  • Example speedup graph (i7 2600)

  • Run-time prediction of whether parallelization is beneficial

    The speed-up depends on

    Size of image

    Pixel type

    Content of the image

    Parameters like size of neighbourhood

    Etc.

    Calibration procedure for OpenMP

    Simple and fast procedure for global optimization

    Complex and slow procedure for finer-grained optimization of each (sub)operator
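
    The slides do not describe how the prediction works internally. One plausible scheme, shown here purely as an assumption, is to calibrate a break-even image size per operator and to run the parallel version only at or above it. `CalibrationSample`, `calibrateThreshold` and `useParallel` are hypothetical names, and image size is the only factor modelled; pixel type, image content and neighbourhood size are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One calibration measurement for a single operator (hypothetical structure).
struct CalibrationSample {
    std::size_t pixels;   // image size in pixels
    double seqTime;       // measured sequential execution time
    double parTime;       // measured parallel execution time
};

// Returns the smallest measured image size from which the parallel version is
// faster for that size and for every larger measured size.
// Returns the maximum size_t value if the largest measured size is not faster.
// Samples must be sorted by ascending image size.
std::size_t calibrateThreshold(const std::vector<CalibrationSample>& samples) {
    assert(std::is_sorted(samples.begin(), samples.end(),
                          [](const CalibrationSample& a, const CalibrationSample& b) {
                              return a.pixels < b.pixels;
                          }));
    std::size_t threshold = std::numeric_limits<std::size_t>::max();
    for (auto it = samples.rbegin(); it != samples.rend(); ++it) {
        if (it->parTime < it->seqTime) {
            threshold = it->pixels;
        } else {
            break;  // first size (from the top) where parallel is not faster
        }
    }
    return threshold;
}

// Run-time decision: use the parallel implementation only at or above the threshold.
bool useParallel(std::size_t pixels, std::size_t threshold) {
    return pixels >= threshold;
}
```

    With such a threshold the decision costs a single comparison per call, which fits the requirement that existing scripts and applications benefit without modification.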

  • Variations in execution times, violin plot

  • OpenCL

    Progress

    Toolbox for using OpenCL kernels from scripts and C++

    Script commands added for host API

    Framework for benchmarking

    Implementation of first kernels

  • OpenCL development in VisionLab

  • Optimal number of local histograms per work-group (GTX 560 Ti)

  • Speedup graph of Histogram, 16 local histograms per work-group (GTX 560 Ti)

  • Violin plot for Histogram, 16 local histograms/wg (GTX 560 Ti)
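
    The OpenCL kernel behind these histogram slides is not included in the transcript. The sketch below is a CPU-side analogue in C++ with OpenMP, not the GPU kernel: it only illustrates the general technique of accumulating into several local histograms and merging them afterwards, which reduces contention on a single shared histogram. `histogramWithLocalCopies` and the interleaved pixel assignment are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram = std::array<std::uint32_t, 256>;

// Computes the histogram of an 8-bit image using nrLocal partial histograms.
// Each partial histogram is updated by exactly one loop iteration, so no
// atomic operations are needed; the partial results are merged at the end.
Histogram histogramWithLocalCopies(const std::vector<std::uint8_t>& image, int nrLocal) {
    assert(nrLocal > 0);
    std::vector<Histogram> local(static_cast<std::size_t>(nrLocal), Histogram{});
    const std::size_t n = image.size();
    const std::size_t stride = static_cast<std::size_t>(nrLocal);

    // If OpenMP is not enabled, the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for
    for (int h = 0; h < nrLocal; ++h) {
        // Local histogram h handles pixels h, h + nrLocal, h + 2 * nrLocal, ...
        Histogram& mine = local[static_cast<std::size_t>(h)];
        for (std::size_t i = static_cast<std::size_t>(h); i < n; i += stride) {
            ++mine[image[i]];
        }
    }

    // Reduction step: sum the partial histograms into the final result.
    Histogram total{};
    for (const Histogram& part : local) {
        for (std::size_t bin = 0; bin < total.size(); ++bin) {
            total[bin] += part[bin];
        }
    }
    return total;
}
```

    The number of local histograms trades reduced contention against the cost of the extra memory and the final merge, which is presumably why the best value had to be measured per device.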

  • X-files: OpenCL versus OpenMP

  • Future work OpenCL

    Near future

    Memory transfers

    Pinned

    Zero copy APU

    Implementing more vision operators

    More distant future

    Intelligent buffer management?

    Automatic tuning of parameters?

    Run-time prediction of whether parallelization is beneficial?

    Heterogeneous computing?

    OpenMP 4.0, OpenACC, C++ AMP?

  • The future: XIMEA Currera G (APU based)

  • Summary and preliminary conclusions

    Choice made for standards OpenMP and OpenCL

    Integration of OpenMP and OpenCL in VisionLab

    Benchmark environment

    OpenMP

    Embarrassingly parallel algorithms are easy to convert

    More than 160 operators parallelized

    Run time prediction implemented

    OpenCL

    Scripting host-side code shortens development time

    Portable functionality

    Portable performance is not easy

    Still have to learn a lot about GPUs

    Searching for sparring partners

  • Questions?

    Jaap van de Loosdrecht

    NHL Centre of Expertise in Computer Vision

    www.nhl.nl/computervision

    Van de Loosdrecht Machine Vision BV

    www.vdlmv.nl