Overview - Pennsylvania State University


Page 1: Overview - Pennsylvania State University

Overview

•  Recall last class: Boosting is a way of generating a strong classifier as a weighted ensemble of weak ones

•  Today: Support Vector Machine (SVM) training generates a strong classifier directly

•  Case Study: Dalal and Triggs pedestrian detector

Page 2: Overview - Pennsylvania State University

Support Vector Machines

SVM slides from Kristen Grauman, UT-Austin

Other good resources:
•  Presentation slides from Christoph Lampert: https://sites.google.com/site/christophlampert/teaching/kernel-methods-for-object-recognition
•  Simple tutorial document by Chris Williams: http://www.inf.ed.ac.uk/teaching/courses/iaml/docs/svm.pdf
•  Video lecture by Pat Winston: https://www.youtube.com/watch?v=_PwhiWxHK8o

Page 3: Overview - Pennsylvania State University

Linear classifiers

Find linear function to separate positive and negative examples

Page 4: Overview - Pennsylvania State University

Lines in R2

ax + cy + b = 0

Let  w = [a, c]ᵀ,  x = [x, y]ᵀ

Page 5: Overview - Pennsylvania State University

Lines in R2

Let  w = [a, c]ᵀ,  x = [x, y]ᵀ

Then  ax + cy + b = 0  becomes  w · x + b = 0

Page 6: Overview - Pennsylvania State University

Linear classifiers

•  Find linear function to separate positive and negative examples

xi positive:  w · xi + b ≥ 0
xi negative:  w · xi + b < 0

Which line is best?
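Concretely, classifying with a given line is just a sign test on w · x + b. A minimal sketch in Python (the line and points are made-up numbers, not from the slides):

    import numpy as np

    # Line x - y + 0.5 = 0, i.e. w = [1, -1], b = 0.5
    w = np.array([1.0, -1.0])
    b = 0.5

    def classify(x):
        # positive if w.x + b >= 0, negative otherwise
        return 1 if w @ x + b >= 0 else -1

    print(classify(np.array([2.0, 1.0])))  # +1 (positive side)
    print(classify(np.array([0.0, 2.0])))  # -1 (negative side)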

Page 7: Overview - Pennsylvania State University

Support Vector Machines (SVMs)

•  Discriminative classifier based on optimal separating line (for 2d case)

•  Maximize the margin between the positive and negative training examples

Page 8: Overview - Pennsylvania State University

Support vector machines

•  Want line that maximizes the margin.

xi positive (yi = 1):   w · xi + b ≥ 1
xi negative (yi = −1):  w · xi + b ≤ −1

For support vectors:  w · xi + b = ±1

[Figure: separating line with margin and support vectors]

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 9: Overview - Pennsylvania State University

Lines in R2

w · x + b = 0,  where  w = [a, c]ᵀ,  x = [x, y]ᵀ  (i.e. ax + cy + b = 0)

What is the distance D from a point (x0, y0) to the line?

Page 10: Overview - Pennsylvania State University

Lines in R2

w · x + b = 0,  where  w = [a, c]ᵀ,  x = [x, y]ᵀ  (i.e. ax + cy + b = 0)

Distance from point (x0, y0) to the line:

D = (a x0 + c y0 + b) / √(a² + c²) = (wᵀ x0 + b) / ||w||


Page 12: Overview - Pennsylvania State University

Support vector machines

•  Want line that maximizes the margin.

xi positive (yi = 1):   w · xi + b ≥ 1
xi negative (yi = −1):  w · xi + b ≤ −1

For support vectors:  w · xi + b = ±1

Distance between point xi and the line:  |w · xi + b| / ||w||

For support vectors wᵀx + b = ±1, so each lies at distance 1/||w|| from the line, and the margin is

M = 1/||w|| − (−1/||w||) = 2/||w||

Page 13: Overview - Pennsylvania State University

Support vector machines

•  Want line that maximizes the margin.

xi positive (yi = 1):   w · xi + b ≥ 1
xi negative (yi = −1):  w · xi + b ≤ −1

For support vectors:  w · xi + b = ±1

Distance between point xi and the line:  |w · xi + b| / ||w||

Therefore, the margin is 2 / ||w||

Page 14: Overview - Pennsylvania State University

Finding the maximum margin line

1.  Maximize margin 2/||w||
2.  Correctly classify all training data points:

    xi positive (yi = 1):   w · xi + b ≥ 1
    xi negative (yi = −1):  w · xi + b ≤ −1

Quadratic optimization problem:

    Minimize  (1/2) wᵀw
    Subject to  yi (w · xi + b) ≥ 1

One constraint for each training point. Note the sign trick: multiplying by yi folds both inequalities into one.

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
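In practice this quadratic program is handed to an off-the-shelf solver. A minimal sketch using scikit-learn (an assumed substitute for the optimization software the slides leave unnamed; a large C approximates the hard margin):

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data, labels yi in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1, 1, -1, -1])

    # Solves: minimize (1/2) w'w  subject to  yi (w . xi + b) >= 1
    clf = SVC(kernel="linear", C=1e6)   # large C ~ hard margin
    clf.fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]
    print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))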

Page 15: Overview - Pennsylvania State University

Finding the maximum margin line

•  Solution:  w = Σi αi yi xi
   (αi: learned weight;  xi: support vector)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 16: Overview - Pennsylvania State University

Finding the maximum margin line

•  Solution:
     w = Σi αi yi xi
     b = yi − w · xi   (for any support vector)

•  Classification function:
     f(x) = sign(w · x + b) = sign(Σi αi yi xi · x + b)
   If f(x) < 0, classify as negative; if f(x) > 0, classify as positive.

•  Notice that it relies on an inner product between the test point x and the support vectors xi

•  (Solving the optimization problem also involves computing the inner products xi · xj between all pairs of training points)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
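The solution above can be read off a fitted linear SVM. A sketch with scikit-learn (attribute names are that library's convention: dual_coef_ stores αi yi for the support vectors):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    alpha_y = clf.dual_coef_[0]        # alpha_i * yi, support vectors only
    sv = clf.support_vectors_

    w = alpha_y @ sv                   # w = sum_i alpha_i yi xi
    b = clf.intercept_[0]              # equals yi - w . xi for any support vector

    x = np.array([1.5, 1.0])
    f = np.sign(alpha_y @ (sv @ x) + b)   # sign(sum_i alpha_i yi xi . x + b)
    print(f, np.sign(w @ x + b))          # the two forms agree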

Page 17: Overview - Pennsylvania State University

Questions

•  What if the features are not 2d?
•  What if the data is not linearly separable?
•  What to do for more than two classes?

Page 18: Overview - Pennsylvania State University

Questions

•  What if the features are not 2d?
   –  Generalizes to d dimensions: replace line with "hyperplane"

•  What if the data is not linearly separable?
•  What to do for more than two classes?

Page 19: Overview - Pennsylvania State University

Planes in R3

ax + by + cz + d = 0

Let  w = [a, b, c]ᵀ,  x = [x, y, z]ᵀ.  Then the plane is  w · x + d = 0

Distance from point (x0, y0, z0) to the plane:

D = (a x0 + b y0 + c z0 + d) / √(a² + b² + c²) = (wᵀ x0 + d) / ||w||

Page 20: Overview - Pennsylvania State University

Hyperplanes in Rn

w1 x1 + w2 x2 + … + wn xn + b = 0

Hyperplane H is the set of all vectors x ∈ Rn which satisfy:  wᵀx + b = 0

Distance from point x to hyperplane H:  D(H, x) = |wᵀx + b| / ||w||
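A quick numerical check of the distance formula (Python sketch; the plane and point are made up):

    import numpy as np

    # Hyperplane w . x + b = 0 in R^3
    w = np.array([1.0, 2.0, 2.0])
    b = -3.0

    x0 = np.array([2.0, 1.0, 1.0])
    D = abs(w @ x0 + b) / np.linalg.norm(w)   # |w'x0 + b| / ||w||
    print(D)   # (2 + 2 + 2 - 3) / 3 = 1.0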

Page 21: Overview - Pennsylvania State University

Questions

•  What if the features are not 2d?
•  What if the data is not linearly separable?

Page 22: Overview - Pennsylvania State University

Nonlinear SVMs

Slide from Andrew Zisserman

Page 23: Overview - Pennsylvania State University

Nonlinear SVMs

Slide from Andrew Zisserman

Page 24: Overview - Pennsylvania State University

Nonlinear SVMs

Slide from Andrew Zisserman

Page 25: Overview - Pennsylvania State University

The Kernel Trick

•  Recall we transformed linear regression into nonlinear regression using a feature vector Φ(x) and, ultimately, the "kernel trick."

•  We also use the kernel trick here to transform a linear classifier into a nonlinear one.

Page 26: Overview - Pennsylvania State University

Example Kernel

Slide from Andrew Zisserman

Page 27: Overview - Pennsylvania State University

Example Kernels

Slide from Andrew Zisserman

Page 28: Overview - Pennsylvania State University

Nonlinear SVMs

•  The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that

     K(xi, xj) = φ(xi) · φ(xj)

•  This gives a nonlinear decision boundary in the original feature space:

     Σi αi yi K(xi, x) + b

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
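A sketch of this decision function evaluated by hand against scikit-learn's RBF SVM (the Gaussian kernel here is one common choice of K, not one the slide prescribes):

    import numpy as np
    from sklearn.svm import SVC

    # XOR-like data: not linearly separable in the original space
    X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

    def K(a, b, gamma=1.0):
        # RBF kernel: K(a, b) = exp(-gamma ||a - b||^2)
        return np.exp(-gamma * np.sum((a - b) ** 2))

    x = np.array([0.9, 0.8])
    # sum_i alpha_i yi K(xi, x) + b, summed over support vectors
    f = sum(ay * K(sv, x) for ay, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
    f += clf.intercept_[0]
    print(np.sign(f), clf.predict([x])[0])   # the two agree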

Page 29: Overview - Pennsylvania State University

Questions

•  What if the features are not 2d?
•  What if the data is not linearly separable?
•  What to do for more than two classes?

Page 30: Overview - Pennsylvania State University

Multi-class SVMs

•  Achieve a multi-class classifier by combining a number of binary classifiers

•  One vs. all
   –  Training: learn an SVM for each class vs. the rest
   –  Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value

•  One vs. one
   –  Training: learn an SVM for each pair of classes
   –  Testing: each learned SVM "votes" for a class to assign to the test example

(Both schemes are sketched below.)
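Both schemes are available off the shelf. A sketch with scikit-learn (its estimators, not anything the slides name: LinearSVC trains one-vs-all by default, SVC one-vs-one):

    import numpy as np
    from sklearn.svm import SVC, LinearSVC

    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 2))
    y = np.repeat([0, 1, 2], 20)   # three toy classes

    ova = LinearSVC().fit(X, y)    # one binary SVM per class
    ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

    x = np.array([[0.5, -0.2]])
    print(ova.decision_function(x))  # 3 scores: pick the class with the highest
    print(ovo.decision_function(x))  # 3 pairwise scores: classes vote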

Page 31: Overview - Pennsylvania State University

Software for SVMs

Page 32: Overview - Pennsylvania State University

SVMs for recognition

1. Define a vector representation for each example.

2. Select a kernel function.

3. Compute pairwise kernel values between labeled examples.

4. Give this "kernel matrix" to SVM optimization software to identify support vectors & weights.

5. To classify a new example: compute kernel values between new input and support vectors, apply weights, check sign of output.
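This recipe maps directly onto a precomputed-kernel interface. A sketch with scikit-learn (its kernel="precomputed" option stands in for the unnamed "SVM optimization software"; the linear kernel is just an example):

    import numpy as np
    from sklearn.svm import SVC

    def K(A, B):
        # Step 2: choose a kernel; here the plain inner product
        return A @ B.T

    X_train = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])  # step 1
    y_train = np.array([1, 1, -1, -1])

    gram = K(X_train, X_train)                           # step 3: kernel matrix
    clf = SVC(kernel="precomputed").fit(gram, y_train)   # step 4

    X_new = np.array([[1.5, 1.0]])
    k_new = K(X_new, X_train)   # step 5: kernel values vs. the training examples
    print(clf.predict(k_new))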

Page 33: Overview - Pennsylvania State University

Case Study: Pedestrian Detector

Navneet Dalal and Bill Triggs, “Histograms of Oriented Gradients for Human Detection,” CVPR 2005

Page 34: Overview - Pennsylvania State University

Dalal and Triggs CVPR’05

•  Detect upright pedestrians
•  Histogram of oriented gradient feature vector
•  Linear SVM classifier; sliding window detector

64×128 HOG descriptor

Page 35: Overview - Pennsylvania State University

HOG Feature Extraction

Page 36: Overview - Pennsylvania State University

HOG Feature Extraction: Cells

64×128 window → compute gradients

Each cell contains a histogram of gradient orientations, weighted by gradient magnitude
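A minimal sketch of one cell's histogram (Python; assumes the paper's defaults of 8×8 cells and 9 unsigned bins, and uses hard binning where the paper interpolates votes between neighboring bins):

    import numpy as np

    def cell_histogram(patch, n_bins=9):
        # patch: 8x8 grayscale cell; gradients from centered [-1, 0, 1] filters
        gx = np.zeros_like(patch); gy = np.zeros_like(patch)
        gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
        gy[1:-1, :] = patch[2:, :] - patch[:-2, :]
        mag = np.hypot(gx, gy)
        ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned, 0-180 degrees

        # Each pixel votes its gradient magnitude into its orientation bin
        hist = np.zeros(n_bins)
        bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
        np.add.at(hist, bins.ravel(), mag.ravel())
        return hist

    print(cell_histogram(np.random.rand(8, 8)))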

Page 37: Overview - Pennsylvania State University

HOG Feature Extraction: Blocks

“Each scalar cell response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant but good normalization is critical and including overlap significantly improves the performance.” — Dalal & Triggs, CVPR’05

2×2 block of cells → normalize → [ , , , , … , ]

Page 38: Overview - Pennsylvania State University

HOG Design Choices

Parameters
•  Gradient scale
•  Orientation bins
•  Block overlap area

Other choices
•  RGB or Lab, color/gray
•  Block normalization:  L2-norm  v ← v / √(||v||₂² + ε²);  L2-Hys;  or L1-sqrt, built on the L1 norm  v ← v / (||v||₁ + ε)

[Figure: cell, center bin, block geometry; R-HOG/SIFT vs. C-HOG block layouts]
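A sketch of the normalization schemes named above (numpy; the 0.2 clipping threshold in L2-Hys follows Lowe's SIFT convention, which the paper adopts):

    import numpy as np

    def l2_norm(v, eps=1e-5):
        # v <- v / sqrt(||v||_2^2 + eps^2)
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)

    def l2_hys(v, clip=0.2, eps=1e-5):
        # L2-normalize, clip components at 0.2, then renormalize
        return l2_norm(np.minimum(l2_norm(v, eps), clip), eps)

    def l1_sqrt(v, eps=1e-5):
        # v <- sqrt(v / (||v||_1 + eps))
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))

    block = np.random.rand(36)   # 2x2 cells x 9 bins, concatenated
    print(l2_hys(block)[:4], l1_sqrt(block)[:4])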

Page 39: Overview - Pennsylvania State University

Parameter and design choices were guided by extensive experimentation measuring their empirical effect on detector performance (e.g., miss rate).

Page 40: Overview - Pennsylvania State University

Dalal & Triggs Detector

•  Default detector configuration:
   –  RGB colour space with no gamma correction;
   –  [−1, 0, 1] gradient filter with no smoothing;
   –  linear gradient voting into 9 orientation bins in 0°–180°;
   –  16×16 pixel blocks of four 8×8 pixel cells;
   –  Gaussian spatial window with σ = 8 pixels;
   –  L2-Hys (Lowe-style clipped L2 norm) block normalization;
   –  block spacing stride of 8 pixels (hence 4-fold coverage of each cell);
   –  64×128 detection window;
   –  linear SVM classifier.

(A library call matching most of these defaults is sketched below.)
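Most of these defaults map onto scikit-image's hog function; a sketch under the assumption that that library is acceptable here (it omits the Gaussian spatial window):

    import numpy as np
    from skimage.feature import hog

    window = np.random.rand(128, 64)   # one 64x128 detection window (grayscale)

    descriptor = hog(
        window,
        orientations=9,           # 9 bins over 0-180 degrees
        pixels_per_cell=(8, 8),   # 8x8 pixel cells
        cells_per_block=(2, 2),   # 16x16 pixel blocks of four cells
        block_norm="L2-Hys",      # Lowe-style clipped L2 normalization
    )
    print(descriptor.shape)   # (3780,) = 7x15 blocks x 4 cells x 9 bins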

Page 41: Overview - Pennsylvania State University

Detector Architecture

Learning phase:
   Create normalised training data set →
   Encode images into feature vectors →
   Learn binary classifier

Detection phase:
   Scan image at all scales and locations →
   Run classifier to obtain object/non-object decisions →
   Fuse multiple detections in 3-D position & scale space →
   Object detections with bounding boxes

Page 42: Overview - Pennsylvania State University

Positive and negative examples

[Positive training crops]  + thousands more…

[Negative training crops]  + millions more…

Page 43: Overview - Pennsylvania State University
Page 44: Overview - Pennsylvania State University
Page 45: Overview - Pennsylvania State University

Person detection with HOG & linear SVM

[Dalal and Triggs, CVPR 2005]

Soft (C=0.01) linear SVM trained with SVMLight.

Page 46: Overview - Pennsylvania State University

To detect people at all locations and scales:

•  Sliding window using learnt HOG template
•  Post-processing using non-maxima suppression

(A sketch of the scanning loop follows.)
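A sketch of the scanning loop (Python; score_fn is a hypothetical hook that would wrap HOG extraction plus the linear SVM's decision value):

    import numpy as np
    from skimage.transform import rescale

    def detect(image, score_fn, win=(128, 64), stride=8, scale_step=1.2, thresh=0.0):
        # Scan every location at every scale; keep windows the classifier accepts
        detections, s = [], 1.0
        img = image
        while img.shape[0] >= win[0] and img.shape[1] >= win[1]:
            for y in range(0, img.shape[0] - win[0] + 1, stride):
                for x in range(0, img.shape[1] - win[1] + 1, stride):
                    score = score_fn(img[y:y + win[0], x:x + win[1]])
                    if score > thresh:
                        # (x, y, w, h, score) in original-image coordinates
                        detections.append((x * s, y * s, win[1] * s, win[0] * s, score))
            s *= scale_step
            img = rescale(image, 1.0 / s)   # shrinking the image finds larger people
        return detections

Here score_fn could be, e.g., lambda w: clf.decision_function([hog(w)])[0], reusing the descriptor and classifier sketched earlier.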

Pages 47–52: Overview - Pennsylvania State University

To detect people at all locations and scales:

•  Sliding window using learnt HOG template
•  Post-processing using non-maxima suppression

[Pages 47–52 repeat this text over a sequence of detection-result images]

Page 53: Overview - Pennsylvania State University

Non-maximum Suppression across Scales
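A sketch of greedy non-maximum suppression over (x, y, w, h, score) boxes (a common variant; Dalal and Triggs themselves fuse detections with mean-shift mode finding in 3-D position/scale space):

    def iou(a, b):
        # Intersection-over-union of two (x, y, w, h, score) boxes
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2 = min(a[0] + a[2], b[0] + b[2])
        y2 = min(a[1] + a[3], b[1] + b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        return inter / (a[2] * a[3] + b[2] * b[3] - inter)

    def nms(boxes, iou_thresh=0.5):
        # Keep the highest-scoring box; drop any box overlapping a kept one
        kept = []
        for b in sorted(boxes, key=lambda b: b[4], reverse=True):
            if all(iou(b, k) < iou_thresh for k in kept):
                kept.append(b)
        return kept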

Page 54: Overview - Pennsylvania State University
Page 55: Overview - Pennsylvania State University
Page 56: Overview - Pennsylvania State University

Dalal and Triggs Summary

•  HOG feature representation
•  Linear SVM classifier; sliding window detector
•  Non-maximum suppression across scale
•  Use of detector performance metrics to guide tuning of system parameters
•  Detection rate 90% at 10⁻⁴ FP per window
•  Slower than Viola-Jones detector