
Post on 19-Jul-2020


Accelerated Multiple Region Evaluation for Human Motion Tracking

David Concha, Raúl Cabido, Antonio S. Montemayor, Juan José Pantrigo {david.concha, raul.cabido, antonio.sanz, juanjose.pantrigo}@urjc.es

http://www.gavab.es/capo

1. Introduction

In this work we present a study of different NVIDIA CUDA approaches to the problem of evaluating the pixels of a region of interest (ROI) in an image. This problem usually appears as part of higher-level methods such as image retargeting, completion, video summarization, object detection and visual tracking. Because these methods evaluate thousands of ROIs, their performance is in many cases far from interactive.

In a visual tracking context, the interesting pixels of an image (target candidates) are usually recovered by a segmentation process. To track these targets, a temporal estimation filter evaluates the binary image through these ROIs. This evaluation usually accounts for a high percentage of the computational cost of the overall tracking method. In the general case, ROIs can be represented as translated, rotated and scaled bounding boxes of different sizes, which leads to complex memory access patterns (many pixels, bad memory alignment, etc.). Some approaches to reduce this computation have gained popularity, such as the Integral Image for axis-aligned ROIs [Viola 2004], but none of them handles general ROIs while taking into account the different technologies of the GPU architecture.

2. Multiple ROI evaluation on GPU

2.1. OpenGL+Cg

ROI weights are computed by creating a grid of quad primitives, one for each ROI. The rasterizer returns the interpolated coordinates and fetches the ROIs automatically. A multipass 2D reduction is then applied to retrieve the weight of each ROI.

Pros: exploits hardware interpolation capabilities.

Cons: features added in CUDA compute devices are not exploited.
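The multipass 2D reduction can be illustrated with a CPU reference in plain C++. This is a minimal sketch, not the authors' Cg shader: it assumes a square ROI tile whose side is a power of two, and the names reducePass and reduceTile are hypothetical. Each call to reducePass plays the role of one render pass in which every output pixel sums a 2x2 block of the previous level.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reduction pass: every output cell sums a 2x2 block of the input.
// `side` is the width and height of the square input tile (must be even).
std::vector<float> reducePass(const std::vector<float>& in, std::size_t side) {
    assert(side % 2 == 0 && in.size() == side * side);
    const std::size_t half = side / 2;
    std::vector<float> out(half * half);
    for (std::size_t y = 0; y < half; ++y) {
        for (std::size_t x = 0; x < half; ++x) {
            const std::size_t i = (2 * y) * side + 2 * x;  // top-left of the block
            out[y * half + x] = in[i] + in[i + 1] + in[i + side] + in[i + side + 1];
        }
    }
    return out;
}

// Repeats the pass until a single value remains: log2(side) passes in total.
// `side` must be a power of two (an assumption of this sketch).
float reduceTile(std::vector<float> tile, std::size_t side) {
    assert(side > 0 && (side & (side - 1)) == 0);
    assert(tile.size() == side * side);
    while (side > 1) {
        tile = reducePass(tile, side);
        side /= 2;
    }
    return tile[0];
}
```

For a binary 4x4 tile, two passes (4x4 to 2x2 to 1x1) leave the number of foreground pixels in the single remaining cell.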

2.2. OpenGL+CUDPP/Thrust

ROIs are rendered using the OpenGL API, stored in texture memory and mapped into the CUDA address space. Weighting is done using the CUDPP/Thrust reduction primitives.

Pros: exploits CUDA features and hardware interpolation capabilities.

Cons: ROIs are not stored in aligned memory, so a sort operation has to be applied first, which adds a performance penalty.

2.3. CUDA+CUDPP/Thrust

Given a ROI configuration, each CUDA thread computes a ROI texture coordinate, fetches a ROI texel and writes the value into global memory. After the kernel execution, the ROIs are stored in aligned memory, and they are evaluated using CUDPP or Thrust primitives without any prior data rearrangement.

Pros: exploits CUDA features and hardware interpolation capabilities. Data rearrangement is not required.

Cons: many memory accesses are required.
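The two stages of the CUDA+CUDPP/Thrust approach, a gather pass that writes every ROI sample to contiguous memory followed by a per-ROI reduction, can be sketched on the CPU in plain C++. This is an illustrative reference only: the Roi layout, the nearest-neighbour sampling (standing in for the hardware texture fetch) and the function names are assumptions, and the loops stand in for CUDA threads and for the library's reduction primitive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical ROI description: centre, size in samples, rotation and scale.
struct Roi {
    float cx, cy;   // centre in image coordinates
    int   w, h;     // number of samples along each ROI axis
    float angle;    // rotation in radians
    float scale;    // distance between neighbouring samples, in pixels
};

// Binary image stored row-major; pixels outside the image read as 0.
struct Image {
    int width, height;
    std::vector<float> data;
    float at(int x, int y) const {
        if (x < 0 || y < 0 || x >= width || y >= height) return 0.0f;
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Stage 1: one inner iteration per sample (one CUDA thread in the GPU design).
// Samples are written to a contiguous buffer, ROI after ROI; offsets[k] marks
// where ROI k starts, so no later rearrangement is needed.
std::vector<float> gatherSamples(const Image& img, const std::vector<Roi>& rois,
                                 std::vector<std::size_t>& offsets) {
    std::vector<float> buffer;
    offsets.assign(1, 0);
    for (const Roi& r : rois) {
        const float c = std::cos(r.angle), s = std::sin(r.angle);
        for (int j = 0; j < r.h; ++j) {
            for (int i = 0; i < r.w; ++i) {
                // Sample position relative to the ROI centre, then rotated.
                const float u = (i - 0.5f * (r.w - 1)) * r.scale;
                const float v = (j - 0.5f * (r.h - 1)) * r.scale;
                const int x = static_cast<int>(std::lround(r.cx + c * u - s * v));
                const int y = static_cast<int>(std::lround(r.cy + s * u + c * v));
                buffer.push_back(img.at(x, y));
            }
        }
        offsets.push_back(buffer.size());
    }
    return buffer;
}

// Stage 2: segmented sum producing one weight per ROI.
std::vector<float> reduceSegments(const std::vector<float>& buffer,
                                  const std::vector<std::size_t>& offsets) {
    assert(!offsets.empty() && offsets.back() == buffer.size());
    std::vector<float> weights;
    for (std::size_t k = 0; k + 1 < offsets.size(); ++k) {
        float sum = 0.0f;
        for (std::size_t n = offsets[k]; n < offsets[k + 1]; ++n) sum += buffer[n];
        weights.push_back(sum);
    }
    return weights;
}
```

The intermediate buffer is what makes this variant memory-heavy: every sample is written once in the first stage and read again in the second.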

2.4. CUDA

This approach reduces the number of memory accesses by fetching the ROIs and computing their weights in a single kernel. Each thread is responsible for computing a single weight: it reads a ROI configuration from global memory and calculates all of its texture coordinates. A global operation such as a sum-reduction can then be computed by accumulating the values stored at these positions in a register.

Pros: exploits hardware interpolation and CUDA features, memory is aligned, and the number of memory accesses is reduced.

Cons: with a small number of ROIs to compute, occupancy is low.
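The single-kernel idea can be sketched as a CPU reference in plain C++. One call to roiWeight stands in for one CUDA thread: it reads a ROI configuration, walks all of its sample coordinates and accumulates the fetched values in a local variable (a register on the GPU), so no intermediate buffer is written. The Roi fields, the software bilinear fetch (a stand-in for the hardware-interpolated texture fetch, with integer coordinates taken as pixel centres) and all names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical ROI description: centre, samples per axis, rotation, spacing.
struct Roi { float cx, cy; int w, h; float angle, scale; };

// Binary image stored row-major; pixels outside the image read as 0.
struct Image {
    int width, height;
    std::vector<float> data;
    float at(int x, int y) const {
        if (x < 0 || y < 0 || x >= width || y >= height) return 0.0f;
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Software bilinear interpolation between the four surrounding pixels.
float fetchBilinear(const Image& img, float x, float y) {
    const float fx = std::floor(x), fy = std::floor(y);
    const int x0 = static_cast<int>(fx), y0 = static_cast<int>(fy);
    const float ax = x - fx, ay = y - fy;
    const float top = (1.0f - ax) * img.at(x0, y0) + ax * img.at(x0 + 1, y0);
    const float bottom = (1.0f - ax) * img.at(x0, y0 + 1) + ax * img.at(x0 + 1, y0 + 1);
    return (1.0f - ay) * top + ay * bottom;
}

// Work of one thread: fetch every sample of one ROI and sum it in place.
float roiWeight(const Image& img, const Roi& r) {
    assert(r.w > 0 && r.h > 0);
    const float c = std::cos(r.angle), s = std::sin(r.angle);
    float acc = 0.0f;  // plays the role of the per-thread register
    for (int j = 0; j < r.h; ++j) {
        for (int i = 0; i < r.w; ++i) {
            const float u = (i - 0.5f * (r.w - 1)) * r.scale;
            const float v = (j - 0.5f * (r.h - 1)) * r.scale;
            acc += fetchBilinear(img, r.cx + c * u - s * v, r.cy + s * u + c * v);
        }
    }
    return acc;
}

// One weight per ROI; on the GPU each iteration would be an independent thread.
std::vector<float> evaluateRois(const Image& img, const std::vector<Roi>& rois) {
    std::vector<float> weights;
    weights.reserve(rois.size());
    for (const Roi& r : rois) weights.push_back(roiWeight(img, r));
    return weights;
}
```

Because parallelism here is one thread per ROI, a small ROI set maps to few threads, which is the occupancy drawback noted for this approach.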

3. Study case

As an application of multiple ROI evaluation on GPU, we tackle a visual tracking problem with eight degrees of freedom (DOFs). In a particle filter framework, the number of required particles grows with the size of the state space. High-dimensional problems such as human motion tracking therefore require the evaluation of a large number of rotated and scaled segments to keep the target tracked.

A commodity 2008 GPU (GeForce GTX 260) can process up to 70 frames per second, evaluating more than 24k segments of different sizes and rotations at 640x480 video resolution.
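As a rough reading of these figures, and assuming the 24k segments are evaluated for every frame (the text does not state this explicitly), the time budget per frame and the average time available per segment would be:

```latex
t_{\text{frame}} = \frac{1}{70\ \text{s}^{-1}} \approx 14.3\ \text{ms},
\qquad
t_{\text{segment}} \lesssim \frac{14.3\ \text{ms}}{24\,000} \approx 0.6\ \mu\text{s}
```

Under that assumption, each rotated and scaled segment would have to be evaluated in roughly 0.6 microseconds or less on average.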

[Figure: diagrams of the GPU approaches. Recoverable labels: OpenGL rendered ROIs; Interoperability; ROIs mapped into the address space of CUDA; Sort; ROIs stored in contiguous memory blocks; Input frame; Texture fetches; CUDA threads; Memory writes; Global memory; ROI 1, ROI 2.]

[Viola 2004] P. Viola and M. J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision, 57(2):137-154, 2004.

This research was partially supported by the Spanish Ministry of Education and Science CICYT TIN2011-28151 and the NVIDIA Professor Partnership Program.