Visual Object Category Recognition

Ashish Gupta

Centre for Vision, Speech, and Signal Processing

Contents

• Introduction

• Related work

• Overview: Object recognition system

• Object classification & detection

• Conclusions

• Future work

Introduction

Research Topic: Visual object category recognition using weakly supervised learning.

DIPLECS: Artificial cognitive system for autonomous systems.

• Interested in object interactions determined by their functional properties.

• All objects in same category have the same functional properties.

• Recognition is based on an object’s visual properties.

• A very large training set is required to learn the large appearance variation in a category.

• We therefore use very large image collections such as Flickr® and Google™ Images.

• These images are noisy and incompletely labelled.

• Weakly supervised learning is therefore used, because it can handle corrupt and noisy training data.

Challenges

Intra-category appearance, pose, clutter, scale, occlusion, illumination, articulation, camouflage

Background

Work done

Visual Recognition System

SIFT feature descriptor

Occurrence frequency of visual words is characteristic of the object

Object model: bag of visual words

Creating a visual codebook


A test image can be classified based on the distance of its normalized codebook from the codebooks of positive and negative training samples.

Figure: codebooks of the positive samples, the negative samples, and the test image
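The classification rule above can be sketched in a few lines. This is a minimal illustration, not the system used in this work: it assumes the "normalized codebook" of an image is its L1-normalized histogram of visual-word occurrences, that the codebook itself is already given, and that the distance is Euclidean. The function names, the 2-D toy descriptors (real SIFT descriptors are 128-D) and the nearest-mean decision are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the visual word closest (Euclidean) to one descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalized occurrence frequency of each visual word in one image."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = sum(counts.values()) or 1
    return [counts.get(k, 0) / total for k in range(len(codebook))]

def mean_histogram(histograms):
    """Element-wise mean of several histograms of equal length."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]

def classify(test_hist, pos_mean, neg_mean):
    """True if the test histogram is closer to the positive mean."""
    return math.dist(test_hist, pos_mean) < math.dist(test_hist, neg_mean)

# Toy data: three visual words and 2-D descriptors (assumption for brevity).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
pos_images = [[(0.1, 0.0), (0.9, 1.1), (1.0, 0.9)],
              [(1.1, 1.0), (0.0, 0.2), (0.8, 1.0)]]
neg_images = [[(5.1, 4.9), (4.8, 5.2), (0.1, 0.1)],
              [(5.0, 5.0), (5.2, 5.1), (4.9, 4.7)]]

pos_mean = mean_histogram([bow_histogram(im, codebook) for im in pos_images])
neg_mean = mean_histogram([bow_histogram(im, codebook) for im in neg_images])

test_image = [(1.0, 1.2), (0.2, 0.1), (0.9, 0.8)]
test_hist = bow_histogram(test_image, codebook)
is_positive = classify(test_hist, pos_mean, neg_mean)
print(test_hist, is_positive)
```

The sketch keeps no spatial information about the key-points, which is the property of the bag-of-words model that matters for the localization results later in the talk.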


Visual codebooks for positive and negative samples of ‘car’ category in PASCAL VOC 2006


Visual codebooks for ‘car’ and ‘cow’ categories in PASCAL VOC 2009 dataset

Classification

ROC (Receiver Operating Characteristics): evaluating classification performance.

ROC for ‘car’ category in PASCAL VOC 2006

The linear kernel, K(x, y) = xᵀy, was used because it is fast to evaluate.
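As a rough illustration of the two ideas on this slide, the sketch below evaluates a linear kernel and builds an ROC curve by sweeping a threshold over classifier scores. The weight vector, the toy histograms and the function names are hypothetical, and the scores here are simply dot products with an assumed weight vector rather than the output of the classifier trained in this work.

```python
def linear_kernel(x, y):
    """K(x, y) = x^T y for two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs; labels are 1 or 0."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with equal scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical weight vector and 2-bin histograms (assumptions).
w = (1.0, -1.0)
samples = [(0.8, 0.2), (0.6, 0.4), (0.3, 0.7), (0.7, 0.3), (0.1, 0.9)]
labels = [1, 1, 0, 0, 0]

scores = [linear_kernel(w, x) for x in samples]
points = roc_points(scores, labels)
area = area_under_curve(points)
print(points, area)
```

An area of 1 corresponds to a perfect ranking of positives above negatives, and 0.5 to chance.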

Improve Classification

Larger Visual Codebook:

• More representative of category

• Higher computational cost

ROC of ‘car’ category in PASCAL VOC 2006 for codebook sizes from 20 to 20000 visual words.


Training and test images in the dataset scaled down by same factor.

Training and test images scaled down by different factors.


Figure: whether a test image is classified correctly (Y/N) for training sample datasets 1 and 2 at scale-down factors /1 and /2


ROC for 20 visual categories in PASCAL VOC 2009

The PASCAL VOC 2009 dataset is larger and more challenging than the 2006 dataset.


ROC for PASCAL VOC 2009 with training and test images scaled down by a factor of 2

ROC for PASCAL VOC 2009 using a universal visual vocabulary

Object localization using sliding window

The poor localization results are due to:

• Lack of structural information in the bag-of-words object model

• The classifier learning the object background rather than the object

Visual codebook

Training images with bounding boxes

Training images without bounding boxes

Good Codebook with equal population of positive and negative visual words

Positive background different from negative images

Positive background similar to negative images

When bounding boxes are not used, the codebook consists mostly of negative visual words.

Classification is based on object context (background) rather than on object features.


The detection at each iteration estimates a bounding box, which provides a better visual codebook, which in turn leads to better detection.

• Key-point configurations form a discriminative object feature set.

• A configuration of visual words adds structural information to the bag-of-words model.

Object detection

• Harvest frequent and discriminative configurations.

• Encode configurations as transaction vectors.

• An association between a transaction vector and the training type is an association rule.

• The Apriori algorithm finds association rules with high confidence in a support-confidence framework.

Figure: transaction vector encoding a key-point configuration

Apriori algorithm

• Uses breadth-first search and a tree structure.

• Longer configurations have lower support because they are infrequent, but higher confidence because they are more discriminative.

• Downward closure lemma: prune configurations with infrequent subsets.
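The breadth-first search and the downward-closure pruning described above can be sketched as follows. This is a simplified, assumed formulation: a transaction is a set of visual-word IDs with a class label, support is the fraction of transactions containing a configuration, and the confidence of a rule "configuration → label" is the fraction of those transactions that carry the label. The toy transactions and all names are hypothetical, and the tree structure used for candidate counting is omitted.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t, _ in transactions) / len(transactions)

def confidence(itemset, label, transactions):
    """Fraction of transactions containing the itemset that have the label."""
    covered = [lab for t, lab in transactions if itemset <= t]
    return sum(lab == label for lab in covered) / len(covered) if covered else 0.0

def apriori(transactions, min_support):
    """Frequent itemsets found level by level (breadth-first)."""
    items = sorted({i for t, _ in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = []
    k = 1
    while level:
        frequent.extend(level)
        previous = set(level)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: drop candidates with an infrequent (k-1)-subset.
        candidates = [c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, k - 1))]
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return frequent

# Toy transactions: (set of visual-word IDs, training type). Hypothetical data.
transactions = [
    (frozenset({1, 2, 3}), "pos"),
    (frozenset({1, 2}), "pos"),
    (frozenset({1, 2, 4}), "pos"),
    (frozenset({1, 4}), "neg"),
    (frozenset({3, 4}), "neg"),
    (frozenset({2, 4}), "neg"),
]

frequent = apriori(transactions, min_support=0.5)
rules = {s: confidence(s, "pos", transactions) for s in frequent}
for itemset, conf in rules.items():
    print(sorted(itemset), support(itemset, transactions), conf)
```

In this toy data the pair {1, 2} has lower support than the single word {1} but higher confidence for the positive class, which matches the trade-off stated in the second bullet.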

Object localization

Flowchart: training data set → generate transactions → Apriori data mining → association rules. Test image (from the test data set) → generate transactions → generate confidence for each transaction → threshold confidence.

• A confidence is assigned to every key-point in the image.

• Key-points with sufficiently high confidence are retained.

• Key-points that occur on common background objects such as doors and windows can also have high confidence.
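The thresholding step can be illustrated with a small sketch. The key-point coordinates, the confidence values and the threshold are made up, and taking the bounding box of the retained key-points as the localization estimate is an assumption made for the example, not a step stated on the slide.

```python
def retain_keypoints(keypoints, threshold):
    """Keep (x, y, confidence) key-points whose confidence meets the threshold."""
    return [kp for kp in keypoints if kp[2] >= threshold]

def bounding_box(keypoints):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the key-points."""
    xs = [kp[0] for kp in keypoints]
    ys = [kp[1] for kp in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical key-points: three on the object, one low-confidence point,
# and one high-confidence point on a background object such as a window.
keypoints = [
    (40, 60, 0.90),
    (55, 70, 0.80),
    (62, 58, 0.95),
    (10, 15, 0.20),
    (200, 20, 0.85),
]

retained = retain_keypoints(keypoints, threshold=0.75)
box = bounding_box(retained)
print(retained, box)
```

The single high-confidence background key-point stretches the box well beyond the three object key-points, which is the failure mode noted in the last bullet.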

Object classification using Apriori

Flowchart: training data set → generate transactions → Apriori data mining → association rules. Test images (from the test data set) → generate transactions → generate confidence for each transaction → sum confidence.

ROC for ‘car’ in PASCAL VOC 2006

The summed confidence score depends upon object scale in the image, which explains the comparatively poor performance of this approach.
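A toy calculation shows why a summed score is sensitive to scale. It assumes, for illustration only, that an object imaged at a larger scale produces proportionally more key-points with similar per-transaction confidences; the numbers are invented.

```python
def summed_confidence(confidences):
    """Image-level score: sum of the per-transaction confidences."""
    return sum(confidences)

# Same object, hypothetical confidences; the larger view yields more key-points.
small_view = [0.90, 0.80, 0.85]
large_view = small_view * 4

score_small = summed_confidence(small_view)
score_large = summed_confidence(large_view)
print(score_small, score_large)
```

Both images contain the object with the same per-transaction confidence, yet the scores differ by a factor of four, so a single decision threshold treats them differently.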

Conclusions

• The ‘bag-of-words’ model is good for classification, but poor for localization.

• Separate foreground-background for better visual codebooks.

• The good classification on the PASCAL VOC 2006 dataset is attributed to recognition of object context rather than object features.

• The dataset utilized should have sufficient variation in the appearance of the object and its background.

• A larger visual vocabulary gives slightly better classification, but is computationally more expensive.

• The visual vocabulary built consists mostly of background visual words, since bounding boxes are not utilized during training.

• Improving the proportion of visual words representing the object in the vocabulary is vital for good classification.

• Incorporate the object boundary contour into the descriptor.

• Use of frequent and discriminative key-point configurations is a promising approach for object localization.

• A low-quality dataset results in a weak visual codebook and classifiers biased to the training data.

• Classification using key-point configurations was poor compared to ‘bag-of-words’ for PASCAL VOC 2006.

Future Work

• Improve the visual codebook by increasing the proportion of visual words pertaining to object features. Combine Apriori-based localization and clustering for visual word selection in an iterative approach.

• Model visual scene information (use the GIST descriptor by Torralba). Learn co-occurrence statistics of a scene and a visual category. Recognition of the scene serves as a prior for object presence and improves object recognition performance.

• Improve object localization by using context priming.

• Model object contextual information to aid foreground-background disambiguation for better object localization.

• Share feature information between visual categories. The size of a universal visual vocabulary should increase sub-linearly with the number of visual categories.

• Combine image segmentation and classification to improve the object model and provide better classification performance.

• Build a hierarchical framework for visual categorization:

• Representation: combine local and global features.

• Model: combine semantic and structural object models.

• Classification: combine generative and discriminative approaches.

Questions?
