lecture 14: introduction to object recognition bag of words...

Lecture 14Fei-Fei Li

Lecture 14: Introduction to Object Recognition & Bag‐of‐Words (BoW) Models

Professor Fei‐Fei LiStanford Vision Lab

8‐Nov‐111


What we will learn today?

• Introduction to object recognition– Representation– Learning– Recognition

• Bag of Words models (Problem Set 4 (Q2))– Basic representation– Different learning and recognition algorithms

8‐Nov‐112


What are the different visual recognition tasks?

8‐Nov‐113


Classification: Does this image contain a building? [yes/no]

Yes!

8‐Nov‐114


Classification:Is this an beach?

8‐Nov‐115


Image Search

Organizing photo collections

8‐Nov‐116


Detection:Does this image contain a car? [where?]

car

8‐Nov‐117


Building

clock

personcar

Detection:Which object does this image contain? [where?]

8‐Nov‐118


clock

Detection:Accurate localization (segmentation)

8‐Nov‐119


Object: Person, back;1‐2 meters away

Object: Police car, side view, 4‐5 m away

Object: Building, 45º pose, 8‐10 meters awayIt has bricks

Detection: Estimating object semantic & geometric attributes

8‐Nov‐1110


Applications of computer vision

SurveillanceAssistive technologies

Security Assistive driving

Computational photography

8‐Nov‐1111


Categorization vs Single instance recognitionDoes this image contain the Chicago Macy building’s?

8‐Nov‐1112


Where is the crunchy nut?

Categorization vs Single instance recognition

8‐Nov‐1113


+ GPS

•Recognizing landmarks in mobile platforms

Applications of computer vision

8‐Nov‐1114


Activity or Event recognitionWhat are these people doing?

8‐Nov‐1115


Visual Recognition

• Design algorithms that are capable to–Classify images or videos–Detect and localize objects– Estimate semantic and geometrical attributes

– Classify human activities and events

Why is this challenging?8‐Nov‐1116


How many object categories are there?

8‐Nov‐1117


Challenges: viewpoint variation

Michelangelo 1475-1564

8‐Nov‐1118


Challenges: illumination

image credit: J. Koenderink

8‐Nov‐1119


Challenges: scale

8‐Nov‐1120


Challenges: deformation

8‐Nov‐1121


Challenges: occlusion

Magritte, 1957

8‐Nov‐1122


Challenges: background clutter

Kilmeny Niland. 1995

8‐Nov‐1123


Challenges: intra‐class variation

8‐Nov‐1124

Lecture 14

• Turk and Pentland, 1991• Belhumeur, Hespanha, & Kriegman, 1997• Schneiderman & Kanade 2004• Viola and Jones, 2000

• Amit and Geman, 1999• LeCun et al. 1998• Belongie and Malik, 2002

• Schneiderman & Kanade, 2004• Argawal and Roth, 2002• Poggio et al. 1993

Some early works on object categorization

8‐Nov‐11


Basic issues

• Representation– How to represent an object category; which classification scheme?

• Learning– How to learn the classifier, given training data

• Recognition– How the classifier is to be used on novel data

8‐Nov‐1126


Representation‐ Building blocks: Sampling strategies

RandomlyMultiple interest operators

Interest operators Dense, uniformly

Image cred

its: L. Fei‐Fei, E. N

owak, J. Sivic

8‐Nov‐1127


Representation– Appearance only or location and appearance

8‐Nov‐1128


Representation

–Invariances• View point• Illumination• Occlusion• Scale• Deformation• Clutter• etc.

8‐Nov‐1129


Representation

– To handle intra‐class variability, it is convenient to describe an object categories using probabilistic models

– Object models: Generative vs Discriminative vs hybrid

8‐Nov‐1130

Lecture 14

Object categorization: the statistical viewpoint

)|( imagezebrap

)( ezebra|imagnopvs.

)|()|(

imagezebranopimagezebrap

• Bayes rule:

8‐Nov‐1131

Lecture 14


)|( imagezebrap

)( ezebra|imagnopvs.

• Bayes rule:

)()(

)|()|(

)|()|(

zebranopzebrap

zebranoimagepzebraimagep


posterior ratio likelihood ratio prior ratio

8‐Nov‐11

Lecture 14


• Bayes rule:

)()(

)|()|(

)|()|(

zebranopzebrap



posterior ratio likelihood ratio prior ratio

• Discriminative methods model posterior

• Generative methods model likelihood and prior

8‐Nov‐11


Discriminative models

Zebra

Non‐zebra

Decisionboundary

)|()|(


• Modeling the posterior ratio:

8‐Nov‐1134



Support Vector Machines

Guyon, Vapnik, Heisele, Serre, Poggio…

Boosting

Viola, Jones 2001, Torralba et al. 2004, Opelt et al. 2006,…

106 examples

Nearest neighbor

Shakhnarovich, Viola, Darrell 2003Berg, Berg, Malik 2005...

Neural networks

Source: Vittorio Ferrari, Kristen Grauman, Antonio Torralba

Latent SVMStructural SVM

Felzenszwalb 00Ramanan 03…

LeCun, Bottou, Bengio, Haffner 1998Rowley, Baluja, Kanade 1998…

8‐Nov‐1135


Generative models• Modeling the likelihood ratio:

)|()|(


0 0.2 0.4 0.6 0.8 10

1

2

3

4

5

clas

s de

nsiti

es

p(x|C1)

p(x|C2)

x

8‐Nov‐1136

Lecture 14

)|( zebranoimagep)|( zebraimagep

Generative models

0

1

2

3

4

5

clas

s de

nsiti

es

p(x|C1)

p(x|C2)

High Low

Low High

8‐Nov‐1137


Generative models• Naïve Bayes classifier

– Csurka Bray, Dance & Fan, 2004

• Hierarchical Bayesian topic models (e.g. pLSAand LDA)

– Object categorization: Sivic et al. 2005, Sudderth et al. 2005– Natural scene categorization: Fei‐Fei et al. 2005

• 2D Part based models‐ Constellation models: Weber et al 2000; Fergus et al 200‐ Star models: ISM (Leibe et al 05)

• 3D part based models: ‐multi‐aspects: Sun, et al, 2009

8‐Nov‐1138


Basic issues




8‐Nov‐1139


• Learning parameters: What are you maximizing? Likelihood (Gen.) or performances on train/validation set (Disc.)

Learning

8‐Nov‐1140



• Level of supervision• Manual segmentation; bounding box; image labels; noisy labels

Learning

• Batch/incremental

• Priors

8‐Nov‐1141



• Level of supervision• Manual segmentation; bounding box; image labels; noisy labels

Learning

• Batch/incremental

• Training images:•Issue of overfitting•Negative images for discriminative methods

• Priors

8‐Nov‐1142


Basic issues




8‐Nov‐1143


– Recognition task: classification, detection, etc..

Recognition

8‐Nov‐1144


Recognition– Recognition task– Search strategy: Sliding Windows

• Simple• Computational complexity (x,y, S, , N of classes)

‐ BSW by Lampert et al 08

‐ Also, Alexe, et al 10

Viola, Jones 2001,

8‐Nov‐1145




• Localization• Objects are not boxes



Viola, Jones 2001,

8‐Nov‐1146




• Localization• Objects are not boxes• Prone to false positive



Non max suppression: Canny ’86….Desai et al , 2009

Viola, Jones 2001,

8‐Nov‐1147


Recognition

Category: carAzimuth = 225ºZenith = 30º

•Savarese, 2007 •Sun et al 2009• Liebelt et al., ’08, 10•Farhadi et al 09

‐ It has metal‐ it is glossy‐ has wheels

•Farhadi et al 09 • Lampert et al 09• Wang & Forsyth 09

– Recognition task– Search strategy– Attributes

8‐Nov‐1148


Semantic:•Torralba et al 03• Rabinovich et al 07• Gupta & Davis 08• Heitz & Koller 08• L‐J Li et al 08• Yao & Fei‐Fei 10

Recognition– Recognition task– Search strategy– Attributes– Context

Geometric• Hoiem, et al 06• Gould et al 09• Bao, Sun, Savarese 10

8‐Nov‐1149


Basic issues




8‐Nov‐1150


Part 1: Bag‐of‐words models

This segment is based on the tutorial “Recognizing and Learning Object Categories: Year 2007”, by Prof L. Fei‐Fei, A. Torralba, and R. Fergus

8‐Nov‐1151


Related works

• Early “bag of words” models: mostly texture recognition– Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001;

Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003;

• Hierarchical Bayesian models for documents (pLSA, LDA, etc.)– Hoffman 1999; Blei, Ng & Jordan, 2004; Teh, Jordan, Beal & Blei, 2004

• Object categorization– Csurka, Bray, Dance & Fan, 2004; Sivic, Russell, Efros, Freeman &

Zisserman, 2005; Sudderth, Torralba, Freeman & Willsky, 2005;

• Natural scene categorization– Vogel & Schiele, 2004; Fei‐Fei & Perona, 2005; Bosch, Zisserman &

Munoz, 2006

8‐Nov‐1152


Object Bag of ‘words’

8‐Nov‐1153


Analogy to documentsOf all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

sensory, brain, visual, perception,

retinal, cerebral cortex,eye, cell, optical

nerve, imageHubel, Wiesel

China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with a 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value.

China, trade, surplus, commerce,

exports, imports, US, yuan, bank, domestic,

foreign, increase, trade, value

8‐Nov‐1154


– Independent features

definition of “BoW”

face bike violin

8‐Nov‐1155


definition of “BoW”– Independent features – histogram representation

codewords dictionary

8‐Nov‐1156


categorydecision

Representation

feature detection& representation


image representation

category models(and/or) classifiers

recognitionle

arni

ng

8‐Nov‐1157


1.Feature detection and representation

8‐Nov‐1158



• Regular grid– Vogel & Schiele, 2003– Fei‐Fei & Perona, 2005

8‐Nov‐1159




• Interest point detector– Csurka, et al. 2004– Fei‐Fei & Perona, 2005– Sivic, et al. 2005

8‐Nov‐1160




• Interest point detector– Csurka, Bray, Dance & Fan, 2004– Fei‐Fei & Perona, 2005– Sivic, Russell, Efros, Freeman & Zisserman, 2005

• Other methods– Random sampling (Vidal‐Naquet & Ullman, 2002)– Segmentation based patches (Barnard, Duygulu, Forsyth, de Freitas, Blei, Jordan, 2003)

8‐Nov‐1161



Normalize patch

Detect patches[Mikojaczyk and Schmid ’02]

[Mata, Chum, Urban & Pajdla, ’02]

[Sivic & Zisserman, ’03]

Compute SIFT

descriptor[Lowe’99]

Slide credit: Josef Sivic

8‐Nov‐1162


…


8‐Nov‐1163


2. Codewords dictionary formation

…

8‐Nov‐1164



Clustering/vector quantization

…

Cluster center= code word

8‐Nov‐1165



Fei-Fei et al. 2005

8‐Nov‐1166


Image patch examples of codewords

Sivic et al. 2005

8‐Nov‐1167


Visual vocabularies: Issues

• How to choose vocabulary size?– Too small: visual words not representative of all patches– Too large: quantization artifacts, overfitting

• Computational efficiency– Vocabulary trees

(Nister & Stewenius, 2006)

8‐Nov‐1168


3. Bag of word representation

Codewords dictionary • Nearest neighbors assignment• K‐D tree search strategy

8‐Nov‐1169


3. Bag of word representation

Codewords dictionary codewords

frequ

ency

….

8‐Nov‐1170


feature detection& representation


image representation

Representation

1.2.

3.

8‐Nov‐1171


categorydecision



Learning and Recognition

8‐Nov‐1172




1. Discriminative method: - NN- SVM

2.Generative method: - graphical models

8‐Nov‐1173


category models

Class 1 Class N

… ……

Discriminative classifiers

Model space

8‐Nov‐1174


Discriminative classifiers

Query image

Winning class: pink

Model space

8‐Nov‐1175


Nearest Neighborsclassifier

Query image

Winning class: pink

• Assign label of nearest training data point to each test data point

Model space

8‐Nov‐1176


Query image

• For a new point, find the k closest points from training data• Labels of the k points “vote” to classify• Works well provided there is lots of data and the distance function is good

K- Nearest Neighborsclassifier

Model space

Winning class: pink

8‐Nov‐1177


• For k dimensions: k‐D tree = space‐partitioning data structure for organizing points in a k‐dimensional space• Enable efficient search

from Duda et al.

K- Nearest Neighborsclassifier

• Voronoi partitioning of feature space for 2‐category 2‐D and 3‐D data

• Nice tutorial: http://www.cs.umd.edu/class/spring2002/cmsc420‐0401/pbasic.pdf

8‐Nov‐1178


Functions for comparing histograms• L1 distance

• χ2 distance

• Quadratic distance (cross‐bin)

N

iihihhhD

12121 |)()(|),(

Jan Puzicha, Yossi Rubner, Carlo Tomasi, Joachim M. Buhmann: Empirical Evaluation of Dissimilarity Measures for Color and Texture. ICCV 1999

N

i ihihihihhhD

1 21

221

21 )()()()(),(

ji

ij jhihAhhD,

22121 ))()((),(

8‐Nov‐1179



1. Discriminative method: - NN- SVM

2.Generative method: - graphical models

8‐Nov‐1180


Discriminative classifiers(linear classifier)

Model spacecategory models

Class 1 Class N

… ……

8‐Nov‐1181


Support vector machines• Find hyperplane that maximizes the margin between the positive and

negative examples

MarginSupport vectors

Distance between point and hyperplane: ||||

||wwx bi

Support vectors: 1 bi wx

Margin = 2 / ||w||

Credit slide: S. Lazebnik

i iii y xw

bybi iii xxxw

Classification function (decision boundary):

Solution:

8‐Nov‐1182


Support vector machines• Classification

Margin

bybi iii xxxw

2010

classbifclassbif

wxwx

Test point

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

8‐Nov‐1183


• Datasets that are linearly separable work out great:•

•

• But what if the dataset is just too hard?

• We can map it to a higher‐dimensional space:

0 x

0 x

0 x

x2

Nonlinear SVMs

Slide credit: Andrew Moore

8‐Nov‐1184


Φ: x→ φ(x)

Nonlinear SVMs• General idea: the original input space can always be mapped

to some higher‐dimensional feature space where the training set is separable:

Slide credit: Andrew Moorelifting transformation

8‐Nov‐1185


Nonlinear SVMs• Nonlinear decision boundary in the original feature space:

bKyi

iii ),( xx

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

•The kernel K = product of the lifting transformation φ(x):

K(xi,xjj) = φ(xi ) · φ(xj)NOTE:• It is not required to compute φ(x) explicitly:• The kernel must satisfy the “Mercer inequality”

8‐Nov‐1186


Kernels for bags of features

• Histogram intersection kernel:

• Generalized Gaussian kernel:

• D can be Euclidean distance, χ2 distance etc…

N

iihihhhI

12121 ))(),(min(),(

2

2121 ),(1exp),( hhDA

hhK

J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classifcation of Texture and Object Categories: A Comprehensive Study, IJCV 2007

8‐Nov‐1187


Pyramid match kernel• Fast approximation of Earth Mover’s Distance• Weighted sum of histogram intersections at mutliple resolutions (linear in

the number of features instead of cubic)

K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.

8‐Nov‐1188


Spatial Pyramid Matching

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. S. Lazebnik, C. Schmid, and J. Ponce. CVPR 2006

8‐Nov‐1189


What about multi‐class SVMs?

• No “definitive” multi‐class SVM formulation• In practice, we have to obtain a multi‐class SVM by combining

multiple two‐class SVMs • One vs. others

– Traning: learn an SVM for each class vs. the others– Testing: apply each SVM to test example and assign to it the class of

the SVM that returns the highest decision value

• One vs. one– Training: learn an SVM for each pair of classes– Testing: each learned SVM “votes” for a class to assign to the test

example

Credit slide: S. Lazebnik

8‐Nov‐1190


SVMs: Pros and cons• Pros

– Many publicly available SVM packages:http://www.kernel‐machines.org/software

– Kernel‐based framework is very powerful, flexible– SVMs work very well in practice, even with very small training sample sizes

• Cons– No “direct” multi‐class SVM, must combine two‐class SVMs– Computation, memory

• During training time, must compute matrix of kernel values for every pair of examples

• Learning can take a very long time for large‐scale problems

8‐Nov‐1191

Lecture 14

Object recognition results

• ETH‐80 database of 8 object classes (Eichhorn and Chapelle 2004)

• Features: – Harris detector– PCA‐SIFT descriptor, d=10

Kernel Complexity Recognition rateMatch [Wallraven et al.] 84%

Bhattacharyya affinity [Kondor & Jebara]

85%

Pyramid match 84%Slide credit: Kristen Grauman

8‐Nov‐11



Support Vector Machines

Guyon, Vapnik, Heisele, Serre, Poggio…

Boosting

Viola, Jones 2001, Torralba et al. 2004, Opelt et al. 2006,…

106 examples

Nearest neighbor

Shakhnarovich, Viola, Darrell 2003Berg, Berg, Malik 2005...

Neural networks

Source: Vittorio Ferrari, Kristen Grauman, Antonio Torralba

Latent SVMStructural SVM

Felzenszwalb 00Ramanan 03…

LeCun, Bottou, Bengio, Haffner 1998Rowley, Baluja, Kanade 1998…

8‐Nov‐1193



1. Discriminative method: ‐ NN‐ SVM

2.Generative method: ‐ graphical models

Model the probability distribution that produces a given bag of features

8‐Nov‐1194


Generative models

1. Naïve Bayes classifier– Csurka Bray, Dance & Fan, 2004

2. Hierarchical Bayesian text models (pLSA and LDA)

– Background: Hoffman 2001, Blei, Ng & Jordan, 2004– Object categorization: Sivic et al. 2005, Sudderth et al.

2005– Natural scene categorization: Fei‐Fei et al. 2005

8‐Nov‐1195


• w: a collection of all N codewords in the imagew = [w1,w2,…,wN]

• c: category of the image

Some notations

8‐Nov‐1196

Lecture 14

wN

c

the Naïve Bayes model

)|()( cwpcp)|( wcp

8‐Nov‐11

Prior prob. of the object classes

Image likelihoodgiven the class

Graphical model

Posterior =probability that image I is of category c

Lecture 14

wN

c

the Naïve Bayes model

)|()( cwpcp

N

nn cwpcp

1

)|()(

Object classdecision

)|( wcpc

c maxarg

Likelihood of ith visual wordgiven the class

Estimated by empirical frequencies of code words in images from a given class

8‐Nov‐11

Graphical model


Csurka et al. 2004

8‐Nov‐1199


Csurka et al. 2004

8‐Nov‐11100


Other generative BoWmodels

• Hierarchical Bayesian topic models (e.g. pLSAand LDA)

– Object categorization: Sivic et al. 2005, Sudderth et al. 2005– Natural scene categorization: Fei‐Fei et al. 2005

8‐Nov‐11101


Generative vs discriminative

• Discriminative methods– Computationally efficient & fast

• Generative models– Convenient for weakly‐ or un‐supervised, incremental training

– Prior information– Flexibility in modeling parameters

8‐Nov‐11102


• No rigorous geometric information of the object components

• It’s intuitive to most of us that objects are made of parts – no such information

• Not extensively tested yet for– View point invariance– Scale invariance

• Segmentation and localization unclear

Weakness of BoW the models

8‐Nov‐11103


What have learned today?

• Introduction to object recognition– Representation– Learning– Recognition

• Bag of Words models (Problem Set 4 (Q2))– Basic representation– Different learning and recognition algorithms

8‐Nov‐11104

lecture 14: introduction to object recognition bag of words...

Documents