lecture 1 object detection - class

Lecture 1

Object Detection

Bill Triggs

Laboratoire Jean Kuntzmann, Grenoble, France

[email protected]

International Computer Vision Summer School, Sicily

July 2008


What do we need to build a good object detector?

● There may be lighting variations, changes in appearance, complex backgrounds
– We need robust visual features

● Instances may have variable geometry or internal degrees of freedom
– Orientation, 3D pose, body pose, facial expression
– We need a flexible recognition method

● Instances may occur anywhere in the image and at any scale
– We need a good search control strategy...

What do we need to build a good object detector?

● There may be overlapping instances or detections
– We need a detection postprocessing strategy

● The method is likely to be based on learning and will need to be validated
– We need labelled training and validation sets

● Computational cost or embeddability may be an issue
– We need to review the whole system for efficiency

A Naive Image Scanning Detector – Template Matching

Match window against a rigid template, e.g. by correlation

Scan image at all scales and locations

Object detections

Detection Phase

Scale-space pyramid

Detection window

Return above-threshold matches as detections
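The scanning step on this slide can be sketched in a few lines. This is a minimal illustration only, assuming normalised cross-correlation as the matching score and a single scale; the function names `ncc` and `scan` and the threshold value are hypothetical, and images are plain lists of lists to keep the sketch dependency-free.

```python
import math

def ncc(window, template):
    """Normalised cross-correlation between two equal-sized 2-D lists."""
    w = [v for row in window for v in row]
    t = [v for row in template for v in row]
    mw, mt = sum(w) / len(w), sum(t) / len(t)
    num = sum((a - mw) * (b - mt) for a, b in zip(w, t))
    den = math.sqrt(sum((a - mw) ** 2 for a in w) * sum((b - mt) ** 2 for b in t))
    # A flat window or template has no contrast to correlate with.
    return num / den if den > 0 else 0.0

def scan(image, template, threshold=0.9):
    """Slide the template over every location of one pyramid level.

    Returns a list of (row, col, score) for above-threshold matches.
    """
    th, tw = len(template), len(template[0])
    hits = []
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(window, template)
            if score >= threshold:
                hits.append((r, c, score))
    return hits
```

A full detector would repeat `scan` on each level of the scale-space pyramid. Note that neighbouring windows can also score above threshold, which is one reason a strategy for overlapping detections is needed.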

Problems with this approach

• It is photometrically too rigid to resist changes in lighting and appearance variations

• It is geometrically too rigid to resist shape variations

• It does not have a strategy for overlapping detections

Anatomy of a Modern Object Detector

• Strong image preprocessing and feature normalization for resistance to illumination changes

• Local rectification and pooling for resistance to small shape variations

• Overcomplete feature set for rich description

• Machine learning based decision rule to capture statistics and variability of real application

• Postprocessing to fuse multiple detections

Image Scanning Detectors

Fuse multiple detections in 3-D position & scale space

Extract features over windows

Scan image(s) at all scales and locations

Object detections with bounding boxes

Detection Phase

Scale-space pyramid

Detection window

Run window classifier at all locations

Image Preprocessing

● Preprocessing is often neglected but it can make a huge difference in performance
● One example of a preprocessing chain

input image → strong gamma compression → centre-surround filter → robust local contrast normalization → highlight suppression

Performance Improvements from Preprocessing

Face Recognition Grand Challenge 1.0.4 dataset, various features, baseline LDA classifier

Local Binary Pattern Features

● Descriptors based on local thresholding or ranking of pixel or edge intensities are very resistant to illumination changes

● Local Binary Patterns
– threshold ring of pixels at value of central pixel
– locally histogram resulting binary codes
– currently one of the best descriptors for face recognition
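The two LBP steps listed here (thresholding a ring of pixels, then histogramming the codes) can be sketched as follows. This is a minimal illustration assuming the basic 8-neighbour ring at radius 1; the function names and the bit ordering are arbitrary choices made for this example.

```python
def lbp_codes(image):
    """8-neighbour LBP code for every interior pixel of a 2-D list of intensities."""
    # Ring of neighbour offsets (row, col), clockwise from the top-left.
    ring = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for r in range(1, len(image) - 1):
        row = []
        for c in range(1, len(image[0]) - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(ring):
                # Threshold each ring pixel at the value of the central pixel.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes

def lbp_histogram(codes):
    """256-bin histogram of the codes in one local region."""
    hist = [0] * 256
    for row in codes:
        for code in row:
            hist[code] += 1
    return hist
```

Because each code depends only on the ordering of intensities, any monotonic brightness change (for example doubling every pixel and adding an offset) leaves the codes unchanged.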

Detectors using Local Filters

● Convolution filters inspired by V1 simple cell responses, multiscale image representations
– Gaussian derivatives, Gabor filters, log-polar Gabor filters, steerable filters, Haar wavelets
– use a number of orientations (4-12)
– output is typically squared or rectified before use

2nd & 3rd order Gaussian derivative, scaled Gaussian derivative and log-polar Gabor filters

2nd order steerable filter and its frequency response

Haar wavelets

Haar Wavelet / SVM Human Detector

[Papageorgiou & Poggio, 1998]

Training: training set (2k positive / 10k negative) → Haar wavelet descriptors (1326-D descriptor) → support vector machine

Testing: test image → multi-scale search → test descriptors → support vector machine → results

Which Descriptors are Important?

32x32 descriptors (HVX) 16x16 descriptors (HVX)

Mean response difference between positive & negative training examples

Essentially just a coarse-scale human silhouette template!

Some Detection Results

Detectors using Edge / Gradient Orientation Histograms

● Divide local region into spatial cells
● Calculate orientation of image gradient at each pixel
● Pool quantized orientations over each cell
– descriptor contains an orientation histogram for each cell
– weight votes by gradient magnitude

● Can also use edge orientations from a discrete edge detector

● Basis of the popular SIFT, HOG, Generalized Shape Context methods

orientation voting and pooling into spatial cells

Cf. Shape context
– pool counts of edge pixels into log-polar spatial bins
– centre descriptor on regularly spaced / all edge pixels
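The orientation voting and pooling step for a single spatial cell can be sketched as below. This is a simplified illustration, assuming centred differences for the gradient, unsigned orientation over 0 to 180 degrees, and hard assignment to one bin (no interpolation between bins); `cell_histogram` and its parameters are names chosen for this example.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientation for one cell.

    `cell` is a 2-D list of intensities; border pixels are skipped because
    the centred difference needs a neighbour on each side.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            # Fold the orientation into [0, 180) so opposite directions vote together.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = int(angle / bin_width) % n_bins
            hist[b] += magnitude  # weight the vote by gradient magnitude
    return hist
```

A descriptor for a local region would concatenate one such histogram per cell.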

Histogram of Oriented Gradient (HOG) Person Detector

● This simple detector is still one of the best generic human detectors

● It is a good illustration of
– the power that modern features and training methods have given to basic template matching
– the need for good engineering and attention to detail

N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005

Feature Extraction

Input image → Normalise gamma → Compute gradients → Weighted vote in spatial & orientation cells → Contrast normalise over overlapping spatial cells → Collect HOGs over detection window → Linear SVM

Feature vector f = [ ..., ..., ...]

Figure labels: detection window, block, cell, overlap of blocks

Overview of Learning Phase

Input: Annotations on training images → Create fixed-resolution normalised training image data set → Encode images into feature spaces → Learn binary classifier → Resample negative training images to create hard examples → Encode images into feature spaces → Learn binary classifier → Object/Non-object decision

Re-training reduces false positives by an order of magnitude!

HOG Descriptors

Parameters
– Gradient scale
– Orientation bins
– Percentage of block overlap

Schemes
– RGB or Lab, colour/gray-space
– Block normalisation: L2-norm, $v \leftarrow v / \sqrt{\|v\|_2^2 + \epsilon}$, or L1-norm, $v \leftarrow v / (\|v\|_1 + \epsilon)$

Figure labels: R-HOG/SIFT (cell, block), C-HOG (center bin)
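The L2 and L1 block normalisation schemes named on this slide can be sketched as follows. This is an illustration under stated assumptions: a block is taken to be a list of per-cell histograms, the value of `eps` is arbitrary, and the function names are hypothetical.

```python
import math

def l2_normalise(v, eps=1e-3):
    """L2-norm scheme: v <- v / sqrt(||v||_2^2 + eps)."""
    norm = math.sqrt(sum(x * x for x in v) + eps)
    return [x / norm for x in v]

def l1_normalise(v, eps=1e-3):
    """L1-norm scheme: v <- v / (||v||_1 + eps)."""
    norm = sum(abs(x) for x in v) + eps
    return [x / norm for x in v]

def block_descriptor(cell_hists, normalise=l2_normalise):
    """Concatenate the histograms of the cells in one block, then normalise."""
    v = [x for hist in cell_hists for x in hist]
    return normalise(v)
```

With overlapping blocks, each cell histogram appears in several block descriptors, each time normalised against a different neighbourhood. The small `eps` keeps the division well defined for blocks with almost no gradient energy.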

Evaluation Data Sets

MIT pedestrian database
– Overall 709 annotations + reflections
– Train: negative data unavailable
– Test: 200 positive windows, negative data unavailable

INRIA person database
– Overall 1774 annotations + reflections
– Train: 1218 negative images
– Test: 453 negative images

Performance on MIT Dataset

● R-HOG and C-HOG give near perfect separation on MIT database
● Both have 1-2 orders of magnitude fewer false positives than wavelets and similar descriptors

Performance on INRIA Database

Influence of Parameters

Gradient smoothing, σ

Reducing gradient scale from 3 to 0 decreases false positives by a factor of 10

Orientation bins, β

Increasing orientation bins from 4 to 9 decreases false positives by a factor of 10

Influence of Parameters

Normalisation method

● Strong local normalisation is essential

Block overlap

● Overlapping blocks improve performance, but descriptor size increases

Influence of Block and Cell Size

● Trade off between need for local spatial invariance and need for finer spatial resolution


Which Cues are Important?

Panels: input example, average gradients, weighted pos wts, weighted neg wts, outside-in weights

● Most important cues are head, shoulder, leg silhouettes
● Vertical gradients inside a person are counted as negative
● Overlapping blocks just outside the contour are most important

Merging Overlapping Detections

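Multi-scale dense scan of detection window → Clip detection score (threshold, bias) → Robust mode detection (mean shift) over [x, y, s (in log)] → Final detections

$$H_i = [\exp(s_i)\,\sigma_x,\ \exp(s_i)\,\sigma_y,\ \sigma_s]$$

$$f(\mathbf{x}) = \sum_{i=1}^{n} w_i \exp\left(-\left\| (\mathbf{x} - \mathbf{x}_i) / H_i \right\|^2 / 2\right)$$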
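Mode detection over the detections in [x, y, log-scale] space can be sketched with a weighted mean-shift iteration. This is a minimal illustration, assuming a Gaussian kernel whose per-detection bandwidth is diagonal and grows with the detection scale in x and y; the default sigma values, the merge rule, and all function names are hypothetical choices for this example. Weights stand for clipped detection scores and are assumed to be strictly positive.

```python
import math

def bandwidth(s, sigma_x=8.0, sigma_y=16.0, sigma_s=0.5):
    """Per-detection smoothing (exp(s) sigma_x, exp(s) sigma_y, sigma_s)."""
    return (math.exp(s) * sigma_x, math.exp(s) * sigma_y, sigma_s)

def mean_shift_mode(start, points, weights, n_iter=100, tol=1e-6):
    """Climb from `start` to a local maximum of the weighted kernel density."""
    x = list(start)
    for _ in range(n_iter):
        num = [0.0, 0.0, 0.0]
        den = [0.0, 0.0, 0.0]
        for p, w in zip(points, weights):
            h = bandwidth(p[2])
            d2 = sum(((x[k] - p[k]) / h[k]) ** 2 for k in range(3))
            g = w * math.exp(-d2 / 2.0)
            for k in range(3):
                # Each detection pulls the estimate with strength g / h_k^2.
                num[k] += g * p[k] / h[k] ** 2
                den[k] += g / h[k] ** 2
        new_x = [num[k] / den[k] for k in range(3)]
        shift = max(abs(new_x[k] - x[k]) for k in range(3))
        x = new_x
        if shift < tol:
            break
    return x

def fuse_detections(points, weights, merge_dist=1.0):
    """Run mean shift from every detection and keep one copy of each mode."""
    modes = []
    for p in points:
        m = mean_shift_mode(p, points, weights)
        h = bandwidth(m[2])
        is_new = all(
            sum(((m[k] - q[k]) / h[k]) ** 2 for k in range(3)) > merge_dist ** 2
            for q in modes
        )
        if is_new:
            modes.append(m)
    return modes
```

Each returned mode is one fused detection; its third coordinate is a log scale, so the bounding box size is recovered by exponentiating it.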

Influence of Mean Shift Kernel

Spatial smoothing aspect ratio as per window shape, smallest sigma approx. equal to stride/cell size

Relatively independent of scale smoothing, sigma equal to 0.4 to 0.7 octaves gives good results

Influence of Other Parameters

Panels: different mappings; effect of scale-ratio

Hard clipping of SVM scores gives better results than simple probabilistic mapping of the scores

Fine scale sampling improves recall

Results Using Static HOG

No temporal smoothing of detections

Conclusions for Static HOG Human Detector

● Fine grained features improve performance
– Rectify fine gradients then pool spatially
● No gradient smoothing, [1 0 -1] derivative mask
● Orientation voting into fine bins
● Spatial voting into coarser bins
– Use gradient magnitude (no thresholding)
– Strong local normalization
– Use overlapping blocks
– Robust non-maximum suppression
● Fine scale sampling, hard clipping & anisotropic kernel

Human detection rate of 90% at 10^-4 false positives per window

Slower than integral images of Viola & Jones, 2001

Applications to Other Classes

M. Everingham et al. The 2005 PASCAL Visual Object Classes Challenge. Proceedings of the PASCAL Challenge Workshop, 2006.

Parameter Settings

● Most HOG parameters are stable across different classes

● Parameters that change
– Gamma compression
– Normalisation methods
– Signed/un-signed gradients

Results from Pascal VOC 2006

Class      Cambridge  ENSMP  HOG    Laptev  TUD    TKK
Car        0.254      0.398  0.444  -       -      0.222
Dog        0.118      -      -      -       -      0.113
Cow        0.149      0.159  0.212  0.224   -      0.252
Sheep      0.131      -      0.251  -       -      0.227
Person     0.030      -      0.164  0.114   0.074  0.039
Bus        0.138      -      0.117  -       -      0.169
Bicycle    0.249      -      0.414  0.440   -      0.303
Motorbike  0.178      -      0.390  0.318   0.153  0.265
Horse      0.091      -      -      0.140   -      0.137
Cat        0.151      -      -      -       -      0.160

Laptev = HOG + AdaBoost

HOG outperformed other methods for 4 out of 10 classes

Its AdaBoost variant outperformed other methods for 2 out of 10 classes

Finding People in Videos

● Motivation
– Human motion is very characteristic

● Requirements
– Must work for moving camera and background
– Robust coding of relative motion of human parts

● Method
– Use differential flow for resistance to camera motion
– HOG like spatial histogramming for robust coding of relative motion

Motion HOG Processing Chain

Input image, consecutive image → Normalise gamma & colour → Compute optical flow → Compute differential flow → Accumulate votes for differential flow orientation over spatial cells → Normalise contrast within overlapping blocks of cells → Collect HOGs for all blocks over detection window

Panels: flow field, magnitude of flow, differential flow X, differential flow Y

Figure labels: detection windows, block, cell, overlap of blocks

Overview of Feature Extraction

Appearance channel: input image → Static HOG encoding

Motion channel: input image, consecutive image(s) → Motion HOG encoding

Both channels → Collect HOGs over detection window → Linear SVM → Object/Non-object decision

Data Set

Train: 5 DVDs, 182 shots

Test 1: Same 5 DVDs, 50 shots

Test 2: 6 new DVDs, 128 shots

Motion Boundary Histograms

Panels: first frame, second frame, estd. flow, flow mag., x-flow diff, y-flow diff, avg. x-flow diff, avg. y-flow diff

Treat x, y-flow components as independent images

Take their local gradients separately, and compute HOGs as in static images

Flow discontinuities follow occlusion boundaries, so this encodes depth and motion boundaries
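The idea of treating each flow component as an image and computing an orientation histogram of its gradients can be sketched as below. This is a simplified illustration for a single cell, assuming centred differences, unsigned orientations and hard binning; the function names are hypothetical, and the flow components are plain 2-D lists.

```python
import math

def orientation_histogram(channel, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientation of one channel."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, len(channel) - 1):
        for c in range(1, len(channel[0]) - 1):
            gx = channel[r][c + 1] - channel[r][c - 1]
            gy = channel[r + 1][c] - channel[r - 1][c]
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / bin_width) % n_bins] += math.hypot(gx, gy)
    return hist

def motion_boundary_histogram(flow_x, flow_y, n_bins=9):
    """Histograms of the x-flow and y-flow gradients, concatenated."""
    return (orientation_histogram(flow_x, n_bins)
            + orientation_histogram(flow_y, n_bins))
```

A spatially constant flow field, such as one produced by pure camera translation, has zero gradient everywhere and therefore contributes nothing, while a step in the flow at an occlusion boundary produces strong votes.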

Internal Motion Histograms

● Alternatively, we can use orientations of flow differences not boundaries

● This captures relative motions of body parts

● We tested several different coding schemes based on finite spatial (inter-part) displacements

IMH Encoding Schemes

● Simple difference
– Take x, y differentials of flow vector images [Ix, Iy ]
– Variants may use larger spatial displacements while differencing, e.g. [1 0 0 0 -1]
● Center cell difference
● Wavelet-style cell differences

[Figure: cell weighting masks with +1, -1 and -2 entries for the center cell and wavelet-style difference schemes]

Flow Methods

● Proesmans' flow [Proesmans et al., ECCV 1994]
– 15 seconds per frame
● Our flow method
– Multi-scale pyramid based method, no regularization
– Brightness constancy based damped least squares solution on 5×5 window
– 1 second per frame
● MPEG-4 based block matching
– Runs in real-time

$$[x, y]^T = (A^T A + \beta I)^{-1} A^T b$$

Panels: input image, Proesmans' flow, our multi-scale flow
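The damped least squares solve for one window can be sketched directly. The slide does not define A and b, so this example assumes the usual brightness constancy setup: each row of A holds the spatial derivatives (Ix, Iy) at one pixel of the window and b holds the negated temporal derivatives. The function name and the default `beta` are hypothetical.

```python
def damped_flow(ix, iy, it, beta=1e-3):
    """Solve (A^T A + beta I) [u, v]^T = A^T b for one window.

    ix, iy, it are flat lists of spatial and temporal derivatives, one entry
    per pixel of the window (25 entries for a 5x5 window).
    Assumption: rows of A are (Ix, Iy) and b = -It.
    """
    # Entries of the 2x2 matrix A^T A + beta I.
    a11 = sum(x * x for x in ix) + beta
    a12 = sum(x * y for x, y in zip(ix, iy))
    a22 = sum(y * y for y in iy) + beta
    # Entries of the vector A^T b.
    b1 = -sum(x * t for x, t in zip(ix, it))
    b2 = -sum(y * t for y, t in zip(iy, it))
    # Closed-form inverse of a 2x2 matrix; beta > 0 keeps det positive.
    det = a11 * a22 - a12 * a12
    u = (a22 * b1 - a12 * b2) / det
    v = (a11 * b2 - a12 * b1) / det
    return u, v
```

The damping term means a textureless window, where all spatial derivatives vanish, yields zero flow rather than a division by zero.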

Performance Comparison

Panels: only motion information; appearance + motion

With motion only, MBH scheme on Proesmans’ flow works best

Combined with appearance, centre difference IMH performs best

Trained on Static & Flow

Panels: tested on flow only; tested on appearance + flow

Adding static images during test reduces performance margin

No deterioration in performance on static images

Motion HOG Video

No temporal smoothing, each pair of frames treated independently

AdaBoost Cascade Face Detector

● A computationally efficient architecture that rapidly rejects unpromising windows
– A chain of classifiers that each reject some fraction of the negative training samples while keeping almost all positive ones
● Each classifier is an AdaBoost ensemble of rectangular Haar-like features sampled from a large pool

[Viola & Jones, 2001]

Rectangular Haar features and the first two features chosen by AdaBoost
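Rectangular Haar-like features are cheap because any box sum can be read from an integral image with four lookups. The sketch below illustrates this for one two-rectangle feature; the function names and the left-minus-right layout are choices made for this example, and images are plain lists of lists.

```python
def integral_image(image):
    """ii[r][c] holds the sum of image[0:r][0:c]; row 0 and column 0 are zeros."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four integral image lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a window; `width` should be even."""
    half = width // 2
    return (box_sum(ii, top, left, height, half)
            - box_sum(ii, top, left + half, height, half))
```

The cost of `two_rect_feature` does not depend on the window size, which is what makes evaluating many such features per window affordable.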

Dynamic Pedestrian Detection

Viola, Jones and Snow, ICCV 2003

Similar to the above face detector but also includes motion derivative filters

Convolutional Neural Nets

● A series of banks of convolution filters that alternately analyse the output images of the previous bank (“simple cells”) and spatially pool the resulting rectified responses (“complex cells”)
● Trained by gradient descent on large training sets

AT&T system – reads ~10% of U.S. cheques

[LeCun, 1992-8]

Rotation Invariant Neural Net Face Detector

Learn rectifier network for rotations, then upright face detector

[Rowley et al., 1998]

Convolutional Net Multipose Face Detector

● Net is trained to produce zero for a non-face, a unit-vector encoding the facial pose for a face
● At run time must run a descent search to find best putative pose for observed image, then check whether “face” is likely given this

[Osadchy, 2007]

Exemplar based Pedestrian Detector

● Build model by clustering training examples hierarchically
● At run-time, use similarity tree to find similar examples quickly

[D.Gavrila, ICPR'98]

Distance Transform based Edge Template Matching

[Gavrila, Philomin, ICCV'99]

For best results, use DT over orientated edges

Learning to Detect Object Contours by Cue Combination

Brightness, colour & texture gradient, combined with boosted logistic regression

[Martin et al., PAMI'04]

Capturing Local Statistics

● Many approaches capture local image content using statistics or distributions of primitive descriptors over local image regions
– e.g. tile local region with small cells, find statistics in each cell
– captures local context, increases robustness to spatial displacements

● Capture distributions using mixture models, histograms of quantized descriptor values

● Capture statistics using moments, pairwise correlations...

Wavelet Histogram Face Detector

● Crudely quantize wavelet responses (3-5 levels)
● Partition wavelets into groups of 5-8 with strong mutual information
● For each group build histogram of log P(object)/P(non-object)
● Final classifier is naïve Bayes combination of histogram lookups
● Learn frontal and profile face detectors and combine outputs

Wavelets with strong MI with indicated one, and a chosen coefficient pair

Some detections

Green regions strongly support face, red regions support non-face

[Schneiderman, IJCV'02]

Learning Based Feature Detectors

● Many kinds of local cues are informative, but responses are typically strongly correlated

● Naïve Bayes feature combination doesn't work well, but we can learn to combine cues to produce a stronger detector

● e.g. Maximum Entropy learning of distribution, ML of presence decisions

Maximum Entropy Learning

● Models joint distribution by matching predicted and empirical 1D projections (e.g. histograms of linear filter responses)

[Sidenbladh & Black]

Maximum Entropy Learning

Local Descriptor Methods

● Represent image as a set of descriptors over local image regions (patches)

● Patches contain a lot of information about image content

● Locality reduces interference from

– occlusion & clutter

– lighting variations (local normalization)

– global effects of changes in form or viewpoint

● But it fragments the scene – global form is harder to see

“Texton” / “Bag of Features” Image Classification

● Classify images by their distributions of local patch appearances
– Sample patches densely, randomly, at salient interest points...
– Characterize appearance using any local descriptor (e.g. SIFT)
– Characterize descriptor set or distribution by vector quantizing descriptors against a large dictionary of patches and histogramming results
– Learn classification rules for classes of images using ML over the BoF histograms

● Inter-patch relationships and global image structure are ignored
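The vector quantization and histogramming step of the bag of features pipeline can be sketched as follows. This is a minimal illustration assuming a dictionary that has already been learned (for example by clustering training descriptors) and nearest-neighbour assignment under squared Euclidean distance; the function names are hypothetical.

```python
def nearest_word(descriptor, dictionary):
    """Index of the dictionary entry closest to the descriptor."""
    return min(
        range(len(dictionary)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(descriptor, dictionary[k])),
    )

def bof_histogram(descriptors, dictionary):
    """Normalised histogram of dictionary assignments for one image."""
    hist = [0.0] * len(dictionary)
    for d in descriptors:
        hist[nearest_word(d, dictionary)] += 1.0
    total = sum(hist)
    # Normalise so images with different numbers of patches are comparable.
    return [h / total for h in hist] if total else hist
```

The resulting histogram records which patch types occur and how often, but not where they occur, which is why inter-patch relationships and global structure are lost. A classifier is then trained on these histograms.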

Extremely Randomized Clustering Forests

● Instead of vector quantization, quantize against an ensemble of discriminatively trained random decision trees

● Each leaf of each tree has a separate bin
● Then learn linear SVM classifier over these bins
● Fast and works very well

Object Localization in Bag of Features Models

● BoF models work surprisingly well for content based image classification because certain patches are very characteristic of certain object classes
– e.g. this can be seen in the linear SVM weights
● We can use these to approximately localize the object
– iterate updating the location mask and using it to remake histogram

Bicycle localization with randomized forest features

Local Feature Based Pedestrian Detector

Combines
● bottom-up local cues from bag of interest point recognition
● probabilistic top-down segmentation for good handling of occlusions

(Leibe & Schiele, CVPR'05)

Implicit Shape Model - Leibe and Schiele, 2003

Interest Points → Matched Codebook Entries → Probabilistic Voting → Voting Space (continuous) → Backprojection of Maxima → Backprojected Hypotheses → Segmentation → Refined Hypotheses (uniform sampling)

Leibe and Schiele, 2003, 2005

Learning Surface Orientation & Type

● Learning based features (detectors) for vertical, horizontal left/right/centre facing and “porous” vs. solid surfaces
– logistic AdaBoosted decision trees over a large set of local cues

Using Geometric Context to Aid Detection

● Making sense of city scenes by combining surface orientation cues, object detector responses, horizon estimates

Panels: image, P(object), P(surfaces), P(viewpoint), P(object | surfaces, viewpoint)

[Hoiem, CVPR'06]

Image Parsing

● Attempts to synthesize entire scenes from component models using multilevel MCMC sampling

– faces, letters, background...

[Zhu et al, 2003...]

The End
