
Classemes and Other Classifier-based Features for Efficient Object Categorization

Alessandro Bergamo, and Lorenzo Torresani, Member, IEEE

Abstract—This paper describes compact image descriptors enabling accurate object categorization with linear classification models, which offer the advantage of being efficient to both train and test. The shared property of our descriptors is the use of classifiers to produce the features of each image. Intuitively, these classifiers evaluate the presence of a set of basis classes inside the image. We first propose to train the basis classifiers as recognizers of a hand-selected set of object classes. We then demonstrate that better accuracy can be achieved by learning the basis classes as "abstract categories" collectively optimized as features for linear classification. Finally, we describe several strategies to aggregate the outputs of basis classifiers evaluated on multiple subwindows of the image in order to handle cases when the photo contains multiple objects and large amounts of clutter. We test our descriptors on challenging benchmarks of object categorization and detection, using a simple linear SVM as classifier. Our results are on par with those achieved by the best systems in these fields but are produced at orders of magnitude lower computational costs and using an image representation that is general and not specifically tuned for a predefined set of test classes.

Index Terms—Object categorization, image features, attributes.


1 INTRODUCTION

In this work we consider the problem of object class recognition in large image databases. Over the last few years this topic has received a growing amount of attention in the vision community [1], [2]. We argue, however, that nearly all proposed systems have focused on a scenario involving two restrictive assumptions: the first is that the recognition problem involves a fixed set of categories, known before the creation of the database; the second is that there are no constraints on the learning and testing time of the object classifiers, besides the requirement that training and testing must be feasible. Such assumptions are reflected in the characteristics of the most popular object categorization benchmarks [3], [4], which measure the performance of recognition systems solely in terms of classification accuracy over a predefined set of classes, and without consideration of the computational costs.

We believe that these two assumptions do not meet the requirements of modern applications of large-scale object categorization. For example, test-recognition efficiency is a fundamental requirement to be able to scale object classification to Web photo repositories, such as Flickr, which are growing at rates of several million new photos per day. Furthermore, while a fixed set of object classifiers can be used to annotate pictures with a set of predefined tags, the interactive nature of searching and browsing large image collections calls for the ability to allow users to define their own personal query categories to be recognized and retrieved from the database, ideally in real-time. Depending on the application, the user can define the query category either by supplying a set of image examples of the desired class, by performing relevance feedback on images retrieved for predefined tags, or perhaps by bootstrapping the recognition via text-to-image search. In all these cases, the classifiers cannot be precomputed during an offline stage, and thus both training and testing must occur efficiently at query-time in order to provide results to the user in reasonable time.

A. Bergamo and L. Torresani are with the Department of Computer Science, Dartmouth College, Hanover, NH 03755, U.S.A. E-mail: {aleb, lorenzo}@cs.dartmouth.edu

In this paper we consider the problem of designing a system that can address these requirements: our goal is to develop an approach that enables accurate real-time recognition of arbitrary categories in gigantic image collections, where the classes are not defined in advance, i.e., they are not known at the time of the creation of the database. We propose to achieve this goal by means of image descriptors designed to enable good recognition accuracy with simple linear classifiers, which can be trained efficiently and – perhaps even more crucially – can be tested in just a few seconds even on databases containing millions of images. Rather than optimizing classification accuracy for a fixed set of classes, our aim is to learn a general image representation which can be used to describe and recognize arbitrary categories, even novel classes not present in the training set used to learn the descriptor.

We propose to use as entries of our image descriptor the outputs of a predefined set of nonlinear classifiers evaluated on low-level features computed from the photo. This implies that a simple linear classification model applied to this descriptor effectively implements a nonlinear function of the original low-level features. As demonstrated in recent literature on object categorization [5], these nonlinearities are critical to achieve good categorization accuracy with low-level features. However, the advantage of our approach is that our classification model, albeit nonlinear in the low-level features, remains linear in our descriptor and thus it enables efficient training and testing. In other words, the nonlinear classifiers implementing our features are used as a classification basis to recognize new categories via linear models. Based on this intuition, we refer to our features as basis classes. The final classifier for a novel class is obtained by linearly combining the outputs of the nonlinear classifiers, which we can pre-compute and store for every image in the database, thus enabling efficient novel-class recognition even in large datasets.

In this work we investigate and propose several strategies to define and train the basis classifiers. In §3.2 we first propose to use as basis classes a hand-selected set of real-world object classes: in this case each basis classifier is simply trained as a traditional object classification model optimized to recognize a particular object category. This idea is evocative of the use of attributes [6], [7], [8], which are fully-supervised classifiers trained to recognize certain properties in the image such as "has beak" or "near water". While attributes have been used as features for recognition in specialized domains (e.g., animal recognition [8] or face identification [6]), we demonstrate that by choosing a large and varied set of object categories as basis classes, it is possible to employ the resulting descriptor as an effective universal feature representation for general object categorization. Furthermore, we show that this feature vector can be binarized with little loss of recognition accuracy to produce a compact binary code that allows even gigantic image collections to be kept in memory for more efficient testing.

However, in this work we demonstrate that a better classifier-based representation can be obtained by learning the basis classes as general "abstract categories" optimized to yield useful features for linear models, rather than hand-defining them as real object classes. We propose two distinct strategies to learn the basis classes automatically: the first (§3.3) optimizes the basis classifiers for linear classification, i.e., it trains them to produce good recognition accuracy when used as features with linear models; the second strategy (§3.4) constrains each basis class to be a super-category obtained by grouping a set of object classes such that, collectively, they are easy to distinguish from other sets of categories. When compared to existing work on learning compact image codes, we show that our feature vectors provide better accuracy for the same descriptor size.

While multiclass recognition of a fixed set of categories is not our main motivating application, we show that our approaches achieve excellent performance even on this task. On the Caltech256 benchmark, a simple linear SVM trained on our representation outperforms the state-of-the-art LP-β classifier [5] trained on the same low-level features used to learn our descriptor. On the 2010 ImageNet Challenge (ILSVRC2010) database, linear classification with our features achieves recognition accuracy only 10.3% lower than the winner of the competition [9], whose computational cost for training and testing is several orders of magnitude higher compared to our approach.

Finally, we present the first comprehensive evaluation of classifier-based descriptors on object detection and scene recognition datasets. In order to render the descriptor capable of handling the presence of multiple objects and clutter typical of these datasets, we propose to apply our basis classifiers on many subwindows of the photo. We describe and test several strategies to aggregate the features produced from multiple subwindows into a single compact descriptor that can be used by linear classification models. We show that the resulting descriptors yield a significant boost in accuracy compared to feature vectors built from basis classifiers evaluated on the whole image, and are on par with specialized object detectors tuned on the test classes. However, the training and testing of linear classifiers using our descriptors have considerably lower computational cost.

2 RELATED WORK

The problem of object class recognition in large datasets has been the subject of much recent work. While nonlinear classifiers are recognized as the state-of-the-art in terms of categorization accuracy [5], [10], they are difficult to scale to large training sets. Thus, much more efficient linear models are typically adopted in recognition settings involving a large number of object classes, with many image examples per class [1]. As a result, much work in the last few years has focused on methods to retain high recognition accuracy even with linear classifiers. We can loosely divide these methods into three categories.

The first category comprises techniques to approximate nonlinear kernel distances via explicit feature maps [11], [12]. For many popular kernels in computer vision, these methods provide analytical mappings to higher-dimensional feature spaces where inner products approximate the kernel distance. This makes it possible to achieve results comparable to those of the best nonlinear classifiers with simple linear models. However, these methods are typically applied to hand-crafted features that are already high-dimensional, and they map them to spaces of further increased dimensionality (2 or 3 times as large, depending on the implementation). As a result, they pose high storage costs in large-scale settings.


A second line of work [2] involves the use of vectors containing a huge number of features (up to several millions) so as to obtain a high degree of linear separability. The idea is similar to that of explicit feature maps, with the difference that these high-dimensional signatures are not produced with the goal of approximating kernel distances between lower-dimensional features but rather to yield higher accuracy with linear models. In order to be able to keep large datasets in memory with such representations, the vectors must be stored in compressed form and then decompressed on the fly "one at a time" during training and testing [2], [13].

Finally, the third strand of related work involves the use of image descriptors encoding categorical information as features: the image is represented in terms of its relation to a set of basis object classes [14], [15], [16] or as the response map to a set of detectors [17]. Even linear models applied to these high-level representations have been shown to produce good categorization accuracy. These descriptors can be viewed as generalizing attributes [6], [7], [8], which are semantic characteristics selected by humans as associated with the classes to be recognized.

The three approaches presented in this paper are closely related to this third line of work, as they all represent images in terms of the outputs of classifiers learned for a set of basis classes. Our first approach – classemes – (§3.2) was originally presented in [15] and uses a large set of real object classes as features. In §3.3, instead, we describe PiCoDes (first introduced in [18]), which are binary descriptors explicitly optimized for linear classification accuracy on a large set of training object classes. The third method – meta-classes – (§3.4) was originally presented in [19] and learns the basis classes as a set of abstract categories capturing useful properties for object class recognition. The learning of these abstract categories is much more efficient than in the case of PiCoDes, and it enables training of descriptors of much larger size.

In this article we also consider how to extend classifier-based descriptors to yield good recognition accuracy even for photos containing multiple objects and clutter. Our approach is inspired by Li et al. [17], who have proposed to use the localized outputs of object detectors as image features, to encode spatial information into a global image descriptor. However, this representation is very high-dimensional (the "ObjectBank" descriptor includes over 40,000 real-valued entries) and as such is not suited for recognition in large databases. Here we present several strategies to aggregate the outputs of basis classifiers evaluated on multiple subwindows into a single low-dimensional descriptor. We show that simple linear classifiers trained on our aggregate vector produce results approaching the state-of-the-art on challenging object-detection and scene recognition benchmarks.

3 TECHNICAL APPROACH

3.1 General framework

In this section we introduce our general framework of classifier-based image descriptors. At a high level, our approach involves representing an image $x$ as a $C$-dimensional vector $\mathbf{h}(x)$, where the $c$-th entry is the output of a classifier $h_c$ evaluated on $x$:

$$\mathbf{h}(x) = \big[h_1(x), \ldots, h_C(x)\big]^\top \quad (1)$$

The classifiers $h_{1 \ldots C}$ (the basis classifiers) are learned during an offline stage from a large labeled dataset of images $D^S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ denotes the $i$-th image in the database and $y_i \in \{1, \ldots, K\}$ indicates the class label of the object present in the photo. The database $D^S$ represents our general visual knowledge base; it should be very large and ideally include all possible visual concepts of the world. Our approaches analyze $D^S$ to automatically extract the $C$ visual classifiers $h_{1 \ldots C}$ to be subsequently used as universal image features for class recognition. We call these $C$ visual features basis classes. Note that the basis classes may be abstract categories, i.e., visual categories that do not necessarily exist in the real world but that are useful to represent and discriminate the $K$ classes in $D^S$.

Before describing how we define and learn the basis classifiers, we want to discuss the motivation behind this representation. Intuitively, our descriptor represents an image in terms of its visual closeness to the set of $C$ basis classes. We propose to extract this descriptor for every image $x$ in the database where we want to perform the final recognition. As shown in our experiments, this classifier-based representation provides us with two fundamental advantages: 1) Compactness: only $C$ entries are needed to describe each image. By limiting $C$ to relatively small values (say, a few thousand), we can store even large image collections in memory (rather than on the hard drive) for more efficient recognition. 2) High accuracy with linear models: this representation enables good classification accuracy even with simple linear models of the form $\theta^\top \mathbf{h}(x)$. Intuitively this happens because a linear combination of these $C$ features implements a powerful weighted sum of $C$ classifiers, which reuses the basis classes as building blocks to recognize novel categories.

Let us now consider the problem of how to define the basis classifiers. We propose to train our basis classifiers on a low-level representation of the image: given an image $x$ we first extract a set of $M$ low-level features $\{f_m(x)\}_{m=1}^{M}$, capturing different visual cues, such as the color distribution or the spatial layout of the image (our low-level features include common descriptors such as SIFT [20] and GIST [21]; see Sec. 4.2 for further details). In order to obtain a powerful image representation, we want our basis classifiers $h_{1 \ldots C}$ to be nonlinear functions of the features $\{f_m(x)\}$. This is motivated by the observation that several recent papers have shown that such nonlinearities are crucial to achieve good classification accuracy (see, e.g., [5], [15]). However, at the same time we want to define a representation that is efficient to compute. This implies that the basis classifiers $h_{1 \ldots C}$ must be fast to evaluate. In order to achieve this twofold purpose, we lift each feature vector $f_m(x) \in \mathbb{R}^{d_m}$ to a higher-dimensional feature space via the explicit map proposed in [12], so that inner products in this space approximate well a predefined nonlinear kernel distance $K_m(\cdot,\cdot)$ over the original features $f_m(x)$. Formally, the explicit map $\Psi_m$ for feature $m$ is a function $\Psi_m : \mathbb{R}^{d_m} \rightarrow \mathbb{R}^{d_m(2r+1)}$, where $r \in \mathbb{Z}^+$, such that $\langle \Psi_m(f_m(x)), \Psi_m(f_m(x')) \rangle \approx K_m(f_m(x), f_m(x'))$. For the family of additive kernels the map can be computed analytically. The parameter $r$ is a positive integer controlling the target dimensionality. It has been shown in [12] that good approximations can be obtained even when setting $r$ to very small values (in this work we use $r = 1$; for more details see Sec. 4.2). This trick allows us to approximate a computationally-expensive kernel-based classifier with a linear model that can be trained and evaluated very efficiently. The concatenation of the $M$ lifted-up features produces a vector $\Psi(x)$ of dimensionality $D = \sum_{m=1}^{M} d_m(2r+1)$:

$$\Psi(x) = \big[\Psi_1(f_1(x))^\top, \ldots, \Psi_M(f_M(x))^\top\big]^\top \quad (2)$$

Using this formulation based on explicit feature maps, we implement each basis classifier $h_c$ as a linear combination of approximate kernels parameterized by a single projection vector $a_c$:

$$h_c(x) = \tau_c\big(a_c^\top [\Psi(f(x)); 1]\big) \quad (3)$$

where $\tau_c$ is a function to either quantize or scale the classifier output. The constant value 1 is appended to the vector $\Psi(x)$ to implement a bias coefficient without explicitly keeping track of it. We can then stack together the parameters of all basis classifiers in a single matrix $A = [a_1 | \ldots | a_C]$. For notational convenience, we set the following equivalences: $h_c(x) \equiv h(x; a_c) = \tau_c(a_c^\top [\Psi(f(x)); 1])$ and $\mathbf{h}(x) \equiv \mathbf{h}(x; A)$.

We can now even more clearly recognize the two above-mentioned advantages enabled by our descriptor, i.e., compactness and high accuracy with linear models. Specifically, this formulation implies that a simple linear classification model $\theta^\top \mathbf{h}(x)$ trained for a new class effectively implements an approximate linear combination of multiple low-level kernel distances, using our basis classifiers as individual components:

$$\theta^\top \mathbf{h}(x) = \sum_{c=1}^{C} \theta_c\, \tau_c\big(a_c^\top [\Psi(f(x)); 1]\big) \quad (4)$$

The great advantage is that the model $\theta$, albeit nonlinear in the low-level features, remains linear in our descriptor and thus can be efficiently trained and tested. Moreover, note that while the extraction of the low-level features $\{f_m(x)\}_{m=1}^{M}$ is needed to calculate the descriptor $\mathbf{h}(x)$, the storage of these low-level features is not necessary for the subsequent recognition. This implies that only the compact feature vectors $\mathbf{h}(x)$ need to be stored in the database.
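To make Eqs. 3 and 4 concrete, the following sketch (hypothetical NumPy code, not the authors' released implementation; the toy dimensions and the helper names describe and score_novel_class are ours) computes a classifier-based descriptor from the lifted feature vector and then scores a novel class with a linear model.

```python
import numpy as np

def describe(psi, A, tau=lambda z: z):
    """Classifier-based descriptor (Eq. 3): psi is the lifted feature vector
    Psi(x), A is a (D+1) x C matrix of basis-classifier parameters whose last
    row holds the bias terms, and tau quantizes or scales the outputs."""
    psi_aug = np.append(psi, 1.0)        # append the constant 1 for the bias
    return tau(A.T @ psi_aug)            # C-dimensional descriptor h(x)

def score_novel_class(h, theta, bias=0.0):
    """Linear model on the descriptor (Eq. 4): a weighted combination of the
    pre-computed basis-classifier outputs."""
    return float(theta @ h + bias)

# Toy usage with random data (the paper's implementation uses D = 68,580 and C = 2048).
rng = np.random.default_rng(0)
D, C = 600, 128
psi = rng.random(D)
A = rng.standard_normal((D + 1, C))
theta = rng.standard_normal(C)
h = describe(psi, A, tau=lambda z: (z > 0).astype(np.float32))   # binary variant
print(score_novel_class(h, theta))
```

Once h has been pre-computed and stored for every database image, recognizing a new class reduces to the final dot product, which is what enables efficient query-time training and testing.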

Our descriptors can be interpreted as implementing a form of deep network [22], where low-level features are pumped through a set of learned nonlinear functions organized in layers. However, our approach differs from traditional deep learning methods in terms of the training objective, the optimization algorithm, the type of nonlinearities, and also in our use of the output layer as a general representation for broad class recognition. Our approach can also be viewed as a dimensionality-reduction method: the function $\mathbf{h}$ indeed maps the high-dimensional descriptor of Eq. 2 to a space of much lower dimensionality, i.e., $C \ll D$. In the experimental section we will show that such a representation is able to retain or even enrich the original representation, allowing efficient linear classifiers to achieve state-of-the-art classification accuracy on several challenging benchmarks.

In the next sections we propose different methods to discover the basis classes and learn the basis classifiers from a given database $D^S$.

3.2 Classemes

CLASSEMES are the classifier-based features that we first introduced in [15]. In this approach, the basis classes are defined directly as the real-world object categories of the labeled training set $D^S$ (thus $C = K$). Each basis classifier $h_c$ is a 1-vs-the-rest binary classifier trained to recognize the $c$-th category. Specifically, we train $h_c$ using a training set $D^S_c$ that contains the images of the $c$-th category as positive examples, and a random subset of images sampled from the other $K-1$ classes as negative examples. In this work we implement the CLASSEMES $h_c(x)$ using the LP-β classification model [5], which has been shown to yield state-of-the-art results on several categorization benchmarks. The LP-β classifier is defined as a linear combination of $M$ nonlinear SVMs, each trained on a different low-level feature vector $f_m(x)$. While [5], [15] employ exact kernels to render the classifier nonlinear, here we use approximate kernel distances by adopting the explicit feature map described in Sec. 3.1. As already noted, this modification speeds up the training and the evaluation of the basis classifiers by orders of magnitude and thus enables a very efficient extraction of CLASSEMES. Formally, each basis classifier $h_c(x)$ is defined as:

$$h_c(x) = \tau_c\Big(\sum_{m=1}^{M} \beta_{m,c}\big[\mathbf{w}_{m,c}^\top \Psi_m(f_m(x)) + b_{m,c}\big]\Big) \quad (5)$$


Note that by re-arranging the terms in Eq. 5, we can express $h_c$ in the same form as Eq. 3. Following the customary training scheme of LP-β, we first learn the parameters $\{\mathbf{w}_{m,c}, b_{m,c}\}$ for each feature $m$ independently, by training the hypothesis $h_{m,c}(x) = \mathbf{w}_{m,c}^\top \Psi_m(f_m(x)) + b_{m,c}$ using the traditional SVM large-margin objective on the training set $D^S_c$. In a second step, we optimize over the parameter vector $\beta_c = [\beta_{1,c}, \ldots, \beta_{M,c}]^\top$ subject to the constraint $\sum_m \beta_{m,c} = 1$. This last optimization can be shown to reduce to a simple linear program (see [5] for further details). An advantage of this training procedure is that it is embarrassingly parallel, as each hypothesis $h_c$ can be learned independently from the others. This fact makes the method very scalable with the size of the training database $D^S$.

The functions $\tau_c$ in Eq. 5 are quantizers applied after the learning to reduce the dimensionality of the descriptor. In our original work [15] we evaluated different levels of quantization, realizing different trade-offs between descriptor compactness and the final accuracy of the visual recognition system. In this work, we consider the following two options (a small evaluation sketch follows the list):
• CLASSEMES: we use the raw outputs of the LP-β classifiers as features, i.e., $\tau_c(z) = z, \forall c \in \{1, \ldots, C\}$;
• CLASSEMES-BIT: we create a binary vector by thresholding the values at zero, i.e., we choose $\tau_c(z) = \mathbf{1}[z > 0], \forall c \in \{1, \ldots, C\}$, where $\mathbf{1}[\cdot]$ is the 0-1 indicator function of its boolean argument.
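As a concrete illustration, here is a minimal sketch (assumed NumPy code, not the released classeme extractor; the array layout is our own choice) that evaluates Eq. 5 for all C basis classifiers and applies either the identity or the thresholding quantizer.

```python
import numpy as np

def classemes(psi_per_feature, W, b, beta, binary=False):
    """Evaluate Eq. 5 for all C classemes.
    psi_per_feature: list of M arrays, the lifted features Psi_m(f_m(x)).
    W[m]: (d_m', C) per-feature SVM weights;  b[m]: (C,) biases.
    beta: (M, C) LP-beta mixing weights, with columns summing to 1."""
    scores = np.zeros(beta.shape[1])
    for m, psi_m in enumerate(psi_per_feature):
        scores += beta[m] * (psi_m @ W[m] + b[m])    # beta_{m,c} * [w^T Psi_m + b]
    return (scores > 0).astype(np.uint8) if binary else scores  # CLASSEMES-BIT / CLASSEMES
```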

3.3 PiCoDes

As we will show in the experiments, the training procedure presented in Sec. 3.2 can efficiently learn classifier-based descriptors yielding state-of-the-art accuracy. However, it also comes with a few shortcomings:
• The basis classifiers are learned disjointly, by optimizing an objective that is unrelated to their actual usage as features at application time (see Eq. 4).
• The basis classes are hand-selected real object classes. The choice of classes to use is left to the designer of the descriptor, but there is no well-defined criterion to determine the best set of categories.
• The quantization of the descriptor is applied as a post-processing step and is not considered in the learning objective of the basis classifiers.

We now describe a method that addresses these limitations. The idea of the approach is quite simple: given that we want to use basis classifiers as features with linear models, we design our learning objective to enforce that linear combinations of these basis classifiers must yield good accuracy with respect to our training set. In other words, rather than hand-designing the basis classes, we learn abstract categories aimed at optimizing linear classification when they are used as features. This learning objective decouples the number of training classes from the target dimensionality of the binary descriptor and thus it allows us to optimize our descriptor for any arbitrary length. Furthermore, it enables direct optimization of the learning parameters with respect to the desired quantization function $\tau_c$, avoiding the suboptimal use of quantizers applied as a post-processing step.

Figure 1 provides a visualization of a few basis classifiers learned by our training procedure. This figure suggests that our learned features describe the image in terms of binary visually-interpretable properties corresponding, e.g., to particular shape, texture or color patterns.

We first introduced this approach in [18]. We called these features PICODES, which stands for "Picture Codes" but also "Pico-Descriptors". In order to obtain a highly compact descriptor, we optimize the PICODES features to be binary, i.e., $h(x; a_c) = \mathbf{1}[a_c^\top x > 0]$ (note that this has the same form as Eq. 3, with the only difference that here we choose $\tau_c(z) = \mathbf{1}[z > 0]$).

3.3.1 Learning the basis classifiers

In order to learn the PICODES basis classifiers, we use a binary encoding of the labels for the training examples in $D^S$: for each example $i$, we define $y_i \in \{-1,+1\}^K$ to be a vector describing its category label, with $y_{ik} = +1$ iff the $i$-th example belongs to class $k$. As seen shortly, this labeling will allow us to implement a form of "one-vs-the-rest" strategy for training the basis classifiers.

We want to optimize the parameter matrix $A$ so that linear combinations of the resulting basis classifiers yield good categorization accuracy on $D^S$. In order to formalize this objective, for each training class $k$ we introduce auxiliary parameters $(w_k, b_k)$, which define a linear classifier operating on the PICODES features and distinguishing the $k$-th class from all the others. Thus, we state our overall objective as a joint optimization over our representation (parameterized by $A$) and the auxiliary parameters defining the $K$ linear classifiers, so as to obtain the best possible linear classification accuracy on $D^S$. Specifically, we use a large-margin formulation which trades off between a small classification error and a large margin when using the output bits of the basis classifiers as features in a one-versus-all linear SVM:

$$E(A, w_{1..K}, b_{1..K}) = \sum_{k=1}^{K} \left\{ \frac{1}{2}\|w_k\|^2 + \frac{\lambda}{N} \sum_{i=1}^{N} \ell\Big[y_{i,k}\big(b_k + w_k^\top \mathbf{h}(x_i; A)\big)\Big] \right\} \quad (6)$$

where $\ell[\cdot]$ is the traditional hinge loss function.


Fig. 1. Visualization of PICODES. Each row of images illustrates a particular bit in our 128-bit descriptor. These 6 randomly chosen bits are shown as follows: for bit $c$, all images are sorted by the non-binarized classifier outputs $a_c^\top x$ and the 10 smallest and largest are presented on each row. Note that $a_c$ is defined only up to sign, so the patterns to which the bits are specialized may appear in either the "positive" or "negative" columns.

Expanding, we get

$$E(A, w_{1..K}, b_{1..K}) = \sum_{k=1}^{K} \left\{ \frac{1}{2}\|w_k\|^2 + \frac{\lambda}{N} \sum_{i=1}^{N} \ell\Big[y_{i,k}\Big(b_k + \sum_{c=1}^{C} w_{kc}\, \mathbf{1}[a_c^\top x_i > 0]\Big)\Big] \right\}$$

We propose to jointly learn the linear SVMs $(w_k, b_k)$ and the parameters $a_c$ of the basis classifiers using the method described below.

3.3.2 Optimization

Minimizing this objective requires optimizing directly over binary features, which is difficult and non-convex. Thus, we use an alternation scheme implementing block-coordinate descent, alternating between the two following steps:

1. Learn classifiers. We fix $A$ and optimize the objective with respect to $w$ and $b$ jointly. This optimization is convex and equivalent to traditional linear SVM learning.

2. Learn PICODES. Given the current values of $w$ and $b$, we minimize the objective with respect to $A$ by updating one basis classifier at a time. Let us consider the update of $a_c$ with fixed parameters $w_{1..K}, b, a_1, \ldots, a_{c-1}, a_{c+1}, \ldots, a_C$. It can be seen (see the supplementary material) that in this case the objective becomes:

$$E(a_c) = \sum_{i=1}^{N} v_i\, \mathbf{1}[z_i a_c^\top x_i > 0] + \text{const} \quad (7)$$

where $z_i \in \{-1,+1\}$ and $v_i \in \mathbb{R}^+$ are known values computed from the fixed parameters. Unfortunately, this objective is not convex and not trivial to optimize. Thus, we replace it with the following convex upper bound defined in terms of the hinge function $\ell$:

$$E(a_c) = \sum_{i=1}^{N} v_i\, \ell(z_i a_c^\top x_i) \quad (8)$$

This objective can be globally optimized using an LP solver or software for SVM training. We had success with LIBLINEAR [23], which deals nicely with the large problem sizes of both our optimization steps.

We have also experimented with several other optimization methods, including stochastic gradient descent applied to a modified version of our objective, but they led to inferior results. Please refer to the supplementary material for further details.
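The sketch below illustrates one round of this alternation (hypothetical code under several assumptions: the per-example weights v_i and signs z_i of Eq. 7 are supplied by a reweight helper following the derivation in the supplementary material, and scikit-learn's LIBLINEAR-based LinearSVC is used as the weighted hinge-loss solver, which adds a small regularizer not present in Eq. 8).

```python
import numpy as np
from sklearn.svm import LinearSVC

def picodes_round(X, Y, A, lam, reweight):
    """One block-coordinate-descent round of the PiCoDes objective (Eq. 6).
    X: (N, D) lifted image features;  Y: (N, K) labels in {-1, +1};
    A: (D, C) current basis classifiers;
    reweight(c, H, W, B, Y): assumed helper returning the weights v and signs z
    for bit c, following the derivation in the supplementary material."""
    H = (X @ A > 0).astype(np.float64)                        # current codes h(x_i; A)
    # Step 1: learn the K one-vs-the-rest linear SVMs on the binary codes.
    svms = [LinearSVC(C=lam).fit(H, Y[:, k]) for k in range(Y.shape[1])]
    W = np.stack([s.coef_.ravel() for s in svms], axis=0)     # (K, C)
    B = np.array([s.intercept_[0] for s in svms])             # (K,)
    # Step 2: update each basis classifier a_c via the convex bound (Eq. 8),
    # i.e. a weighted hinge loss solved with an off-the-shelf SVM package.
    for c in range(A.shape[1]):
        v, z = reweight(c, H, W, B, Y)
        bit_svm = LinearSVC(fit_intercept=False).fit(X, z, sample_weight=v)
        A[:, c] = bit_svm.coef_.ravel()
        H[:, c] = (X @ A[:, c] > 0).astype(np.float64)        # refresh the c-th bit
    return A, W, B
```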

3.4 Meta-Classes

We have seen that the method to learn CLASSEMES introduced in Sec. 3.2 is very simple to implement and scalable to the size of the training database, but it is suboptimal in several aspects. Conversely, the PICODES approach described in Sec. 3.3 provides principled learning of abstract basis classes explicitly optimized for good accuracy with linear models, but it requires a computationally expensive minimization, which in practice can be run only for very compact dimensionalities (our largest PiCoDes contain 2048 bits and required several weeks to be learned). We now introduce a descriptor-learning algorithm that combines the advantages of the previous two methods: 1) scalable learning; 2) automatic discovery of abstract basis classes yielding good recognition when used as features with linear models. We originally presented this approach in [19].

We refer to the basis classes learned by this method as "meta-classes". Intuitively, we want our meta-class classifiers to be "repeatable" (i.e., they should produce similar outputs on images of the same object category) and to capture properties of the image that are useful for categorization. We formalize this intuition by defining each meta-class to be a subset of object classes in the training set. Specifically, we hierarchically partition the set of training object classes such that each meta-class subset can be easily recognized from the others. This criterion forces the classifiers trained on the meta-classes to be repeatable. At the same time, since the meta-classes are superclasses of the original training categories, by definition the classifiers trained on them will capture common visual properties shared by similar classes while being effective to discriminate visually-dissimilar object classes.

Fig. 2. Meta-class tree. This figure shows a small portion of the tree learned by the meta-class approach described in Sec. 3.4. The rectangular nodes represent the leaves of the tree, associated with real categories of the ImageNet dataset. The inner round nodes represent the meta-classes automatically learned by our method, which tends to group together categories that are difficult to tell apart.

3.4.1 Discovering the meta-classes

In this section we describe the procedure used to discover the meta-classes. Our method is an instance of the algorithm for label tree learning described in [24]. (Note that other label-tree learning methods, such as [25], [26], could also be applied to our task of meta-class training.) This algorithm learns a tree structure of classifiers (the label tree) and was proposed to speed up categorization in settings where the number of classes is very large. Here, instead, we use the label tree training procedure to learn meta-classes, i.e., sets of classes that can be easily recognized from the others. We provide below a review of the label tree algorithm, contextualized for our objective.

Let $\ell_D$ be the set of distinct class labels in the training set $D^S$, i.e., $\ell_D \equiv \{1, \ldots, K\}$. The label tree is generated in a top-down fashion starting from the root of the tree. Each node has an associated set of object class labels. The label set of the root node is set equal to $\ell_D$. Let us now consider a node with label set $\ell$. We now describe how to generate its two children. (Although the label tree can have an arbitrary branching factor at each node, in our work we use binary trees.) The two children define a partition of the label set of the parent: if we denote with $\ell_L$ and $\ell_R$ the label sets of the two children, then we want $\ell_L \cup \ell_R = \ell$ and $\ell_L \cap \ell_R = \emptyset$. Ideally, we want to choose the partition $\{\ell_L, \ell_R\}$ so that a binary classifier $h_{(\ell_L,\ell_R)}(x)$ trained to distinguish these two meta-classes makes as few mistakes as possible. In order to select the best classifier, we should train a classifier for each of the possible $(|\ell|(|\ell|-1)/2 - 1)$ partitions of $\ell$, but this operation is prohibitively expensive. Instead, we take inspiration from the work of [24] and we use the confusion matrix of one-vs-the-rest classifiers learned for the individual object classes to determine a good partition of $\ell$: intuitively, our goal is to include classes that tend to be confused with each other in the same label subset. More formally, let $h_1, \ldots, h_{|\ell_D|}$ be the one-vs-the-rest LP-β classifiers learned for the individual object classes using the training set $D^S$. Let $A \in \mathbb{R}^{|\ell_D| \times |\ell_D|}$ be the confusion matrix of these classifiers evaluated on a separate validation set $D^{val} \subset D^S$: $A_{ij}$ gives the number of samples of class $i$ in $D^{val}$ that have been predicted to belong to class $j$ (we assume the winner-take-all strategy for multiclass classification). Since this matrix is not symmetric in general, we compute its symmetrized version as $B = (A + A^\top)/2$. Then, for each node we propose to partition its label set $\ell$ into the subsets $\ell_L \subset \ell$, $\ell_R \equiv \ell - \ell_L$ that maximize the following objective:

$$E(\ell) = \sum_{i,j \in \ell_L} B_{ij} + \sum_{p,q \in \ell_R} B_{pq}. \quad (9)$$

The objective encourages the inclusion in the same subset of classes that are difficult to tell apart, thus favoring the creation of meta-classes containing common visual properties. At the same time, maximizing this objective will tend to produce meta-classes $\ell_L$, $\ell_R$ that are easy to separate from each other.

Note that the optimization of Eq. 9 can be formulated as a graph partitioning problem [27]. We compute the solution $\ell_L$ by applying spectral clustering [28] to the matrix $B$: this is equivalent to solving a relaxed, normalized version of Eq. 9 that penalizes unbalanced partitions. We repeat this process recursively on each node until it contains a single class label, i.e., $|\ell| = 1$.
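A rough sketch of the recursive partitioning is given below (illustrative code; scikit-learn's spectral clustering with a precomputed affinity is used as a stand-in for the relaxed, normalized solver of Eq. 9, and B is the symmetrized confusion matrix).

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def build_label_tree(labels, B):
    """Recursively split a label set into meta-classes (Sec. 3.4.1).
    labels: list of class indices at the current node.
    B: (K, K) symmetrized confusion matrix, B = (A + A.T) / 2.
    Returns a nested tuple (left_subtree, right_subtree) or a leaf label."""
    if len(labels) == 1:
        return labels[0]                       # leaf: a single real class
    if len(labels) == 2:
        return (labels[0], labels[1])          # trivial split
    sub = B[np.ix_(labels, labels)]            # confusion among the node's labels
    # 2-way spectral clustering groups classes that are often confused,
    # approximating the maximization of Eq. 9 with balanced partitions.
    assign = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(sub)
    left = [l for l, a in zip(labels, assign) if a == 0]
    right = [l for l, a in zip(labels, assign) if a == 1]
    if not left or not right:                  # degenerate split: fall back to an arbitrary one
        left, right = labels[:1], labels[1:]
    return (build_label_tree(left, B), build_label_tree(right, B))

# Each inner node of the returned tree defines one meta-class split (l_L, l_R).
```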

Fig. 2 shows a portion of our meta-class tree learned from the ImageNet dataset [1]. As expected, we found that our meta-classes tend to group together object classes that are visually similar although not necessarily semantically related (e.g., lantern and electric lamp, but also hurricane lamp and perfume).

3.4.2 Learning the meta-class classifiers

The learning of the basis classifiers is now straightforward: at each node $\ell$ of the label tree we train an LP-β classifier on the binary split $\{\ell_L, \ell_R\}$. Specifically, let $I_\ell$ denote the indices of the training examples having class labels in $\ell$. Then, we form the labeled set $D^{(\ell_L,\ell_R)} = \{(x_i, +1) : i \in I_{\ell_L}\} \cup \{(x_i, -1) : i \in I_{\ell_R}\}$ and use it to train the meta-class classifier $h_{(\ell_L,\ell_R)}(x)$. This procedure yields in total $\bar{C}$ basis classifiers, where $\bar{C}$ is the number of inner nodes of the label tree. Note that this learning is embarrassingly parallelizable, as we can learn each hypothesis independently. We also include in the descriptor the outputs of the one-vs-the-rest classifiers $h_1, \ldots, h_{|\ell_D|}$, as we have found that this improves the final performance of the classifier. The image descriptor $\mathbf{h}(x)$ is then a $C$-dimensional vector, where $C = \bar{C} + K$.

We experiment with two versions of the function $\tau$ in Eq. 3, thus giving rise to the two following variants of the meta-class descriptor:
• MC: we convert the raw score of each LP-β classifier into a probabilistic output by means of a sigmoid function, i.e., we set $\tau_c(z) = 1/(1 + \exp(-\alpha_c z + \beta_c))$. We learn the parameters of the sigmoid by means of Platt's scaling procedure [29], using the validation set $D^{val}$ (see the calibration sketch below). We found that this sigmoidal normalization yields a large boost in the accuracy of the final classifier trained on this representation, probably because it makes the range of classification scores more homogeneous and reduces outlier values.
• MC-BIT: we create a binary vector by setting $\tau_c(z) = \mathbf{1}[z > 0], \forall c \in \{1, \ldots, C\}$.
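The calibration used by the MC variant can be sketched as follows (illustrative code; a one-dimensional logistic regression fitted on held-out scores is used here as a simple stand-in for Platt's scaling procedure [29], and fit_sigmoid is a name of our own).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sigmoid(val_scores, val_labels):
    """Fit tau_c(z) = 1 / (1 + exp(-(alpha*z + beta))) for one basis classifier
    from raw validation scores and binary labels (+1 if the image belongs to
    the meta-class, -1 otherwise)."""
    lr = LogisticRegression().fit(val_scores.reshape(-1, 1), val_labels)
    alpha, beta = lr.coef_[0, 0], lr.intercept_[0]
    return lambda z: 1.0 / (1.0 + np.exp(-(alpha * z + beta)))

# MC: apply the fitted sigmoid to each raw LP-beta score;
# MC-BIT: simply threshold the raw scores at zero instead.
```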

3.5 Encoding local information

The framework described in Sec. 3.1 provides methods to compute global classifier-based descriptors, i.e., feature vectors where each individual entry is the output of a classifier evaluated over the entire image. While our experiments show that these representations produce high accuracy on the task of whole-image classification, they are clearly not suitable for recognition when the object of interest occupies only a small region of the photo. In this section we describe how to extend our framework to encode local information in the descriptor, so as to handle cases when the image contains small or multiple objects.

We present three different strategies to capture local information. Each of these local encoding methods can be applied to any of the descriptors described in Secs. 3.2, 3.3, and 3.4. At a high level, each local encoding method is defined by a splitting function $s$ that decomposes the image into a set of $M$ subimages, and a combiner $m$ that aggregates the $M$ vectors produced by extracting our classifier-based descriptor $\mathbf{h}$ from the individual subimages. More formally, let $\mathcal{X}$ be the space of all images. The splitting function $s : \mathcal{X} \rightarrow \mathcal{X}^M$ takes an image $x \in \mathcal{X}$ as input and outputs $M$ rectangular subimages $\{x^{(m)}\}_{m=1}^{M}$. Then, we define our final image descriptor $\bar{\mathbf{h}}(x)$ capturing local information as $\bar{\mathbf{h}}(x) = m\big(\mathbf{h}(x^{(1)}), \ldots, \mathbf{h}(x^{(M)})\big)$. In the next subsections we describe several choices of the functions $s$ and $m$, giving rise to different $\bar{\mathbf{h}}(x)$.

3.5.1 SPCAT: Spatial Pyramid + Concatenation

This is an instantiation of the spatial pyramid method proposed in [30], where the function $s$ decomposes the image into a hierarchical partition of rectangular subimages. The pyramid consists of $L$ layers, where the first layer (layer 0) is the image itself. Each layer spatially subdivides each subimage of the previous layer into a grid of 2×2 subimages of equal size. Hence the total number of rectangular subimages defined by the pyramid is $M = \sum_{l=0}^{L-1} 4^l$.

The combiner $m$ takes as input the set of descriptors computed from the individual subimages of the pyramid, and concatenates them together:

$$\bar{\mathbf{h}}(x) = \big[\mathbf{h}(x^{(1)})^\top, \ldots, \mathbf{h}(x^{(M)})^\top\big]^\top.$$

The final descriptor $\bar{\mathbf{h}}$ has dimensionality $C \cdot M$, where $C$ is the length of the individual descriptors associated to the regions (as defined in Sec. 3.1).

3.5.2 SPLPOOL: Spatial Pyramid + Layer Pooling

In this strategy the function $s$ still implements a spatial pyramid. The combiner function $m$, instead, concatenates descriptors obtained by pooling (or aggregating) the feature vectors within each layer. Specifically, let $x^{(l,g)}$ be the $g$-th subimage of the $l$-th layer. In the case of binary descriptors (CLASSEMES-BIT, PICODES, and MC-BIT) we keep the final features binary by performing max-pooling within each layer:

$$\bar{\mathbf{h}}(x) = \begin{bmatrix} \mathbf{h}(x) \\ \max_{g=1,\ldots,4} \mathbf{h}(x^{(1,g)}) \\ \vdots \\ \max_{g=1,\ldots,4^{L-1}} \mathbf{h}(x^{(L-1,g)}) \end{bmatrix}$$

where the function $\max$ computes the component-wise maximum. In the case of real-valued descriptors (CLASSEMES and MC), we instead average the feature values within each layer (thus replacing $\max$ with the sample mean of each feature). We also tried max-pooling for the real-valued descriptors but we found empirically that this yields lower accuracy. With this local encoding strategy, the final vector $\bar{\mathbf{h}}(x)$ has dimensionality $L \cdot C$.
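A corresponding sketch of SPLPOOL (illustrative code under the same assumptions as the SPCAT sketch above):

```python
import numpy as np

def splpool(img, describe, L=2, binary=True):
    """SPLPOOL: pool descriptors within each pyramid layer and concatenate the
    per-layer results (Sec. 3.5.2). Max-pooling keeps binary descriptors binary;
    real-valued descriptors are averaged instead."""
    pool = np.max if binary else np.mean
    layers = [describe(img)]                          # layer 0: the whole image
    H, W = img.shape[:2]
    for l in range(1, L):
        n = 2 ** l
        hs, ws = H // n, W // n
        cells = [describe(img[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws])
                 for i in range(n) for j in range(n)]
        layers.append(pool(np.stack(cells), axis=0))  # component-wise pooling
    return np.concatenate(layers)                     # dimensionality L * C
```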

3.5.3 OBJPOOL: Objectness + Pooling

The encoding methods of subsections 3.5.1 and 3.5.2 are limited by the fact that the subdivision of the image is fixed, and does not take into account the actual locations of the objects in the photo. Intuitively, we would like the function $s$ to split the image into regions containing objects so as to encode more relevant subimages. To this end, we propose to implement $s$ as a class-generic object detector producing a candidate set of regions that are likely to contain an object. For this purpose, we define the function $s$ to return the $M$ rectangular subimages $\{x^{(m)}\}_{m=1}^{M}$ that have the greatest objectness measure [31]. Then, the combiner $m$ performs average pooling, i.e., it computes the average of each feature entry over all $M$ subimages. Also in this case we tried max-pooling, but again we found this strategy to produce inferior results compared to average pooling. We append the resulting descriptor to the feature vector computed from the entire image:

$$\bar{\mathbf{h}}(x) = \Big[\mathbf{h}(x)^\top, \ \frac{1}{M}\sum_{m=1}^{M} \mathbf{h}(x^{(m)})^\top\Big]^\top.$$

Thus, the final dimensionality is in this case $2 \cdot C$.
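OBJPOOL can be sketched in the same style (illustrative code; objectness_windows is a hypothetical helper returning the M windows with the highest objectness score [31]).

```python
import numpy as np

def objpool(img, describe, objectness_windows, M=10):
    """OBJPOOL: append to the whole-image descriptor the average of the
    descriptors computed on the M most object-like windows (Sec. 3.5.3)."""
    boxes = objectness_windows(img, M)                   # [(x0, y0, x1, y1), ...]
    crops = [img[y0:y1, x0:x1] for (x0, y0, x1, y1) in boxes]
    pooled = np.mean(np.stack([describe(c) for c in crops]), axis=0)
    return np.concatenate([describe(img), pooled])       # dimensionality 2 * C
```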

4 EXPERIMENTS

4.1 Datasets

In this work we make use of several datasets, for both learning and testing our descriptors. Here we briefly summarize their characteristics:
• Caltech 256 [3]: popular benchmark for object categorization involving ∼30K images partitioned into 256 visual categories. Each photo contains a single centered object.
• ImageNet [1] (Spring 2010 release): large-scale image dataset consisting of more than 15K object categories and 11M pictures. Most images contain a single object.
• ILSVRC2010 [4]: large-scale benchmark for object categorization. It is a subset of the ImageNet dataset: it contains 1000 categories and 1.2M images.
• PASCAL VOC 2007 [32]: benchmark for object detection and classification including 20 object categories and 9,963 images. Despite the small size of this dataset, the classification task is very challenging, as each image may contain multiple objects whose positions and scales vary greatly.
• MIT 67 [33]: benchmark for indoor scene recognition. It consists of 67 indoor scene categories (e.g., corridors, bookstores) for a total of 15,620 images, each usually containing multiple objects.
• SUN 397 [34]: large-scale benchmark for indoor/outdoor scene recognition. It contains 397 scene categories for a total of 108,754 images.

4.2 Low-level descriptors

As previously described in Sec. 3.1, our descriptors are built upon a set of low-level features. For the features that are based on the aggregation of local descriptors, we also exploit the Spatial Pyramid method [30] to encode weak geometry into the representation. In particular we use a pyramid (SP) of $L$ layers, and for each layer $l \in \{0, \ldots, L-1\}$ we partition the image into a $2^l \times 2^l$ grid of cells and extract a low-level feature vector from each cell. We use the following low-level features:
• Color GIST [21]: we first resize the images to 32×32 pixels, without maintaining the aspect ratio, and then compute the orientation histograms on a 4×4 grid. We use 3 scales, with the number of orientations per scale being 8, 8, 4.
• Oriented HOG [35] (4 SP layers): histogram of oriented gradients computed using 20 bins.
• Unoriented HOG [35] (4 SP layers): a histogram of unoriented gradients quantized into 40 bins.
• SSIM [36] (3 SP layers): we compute a histogram by extracting the 30-dimensional self-similarity descriptor every 5 pixels, and by quantizing it into 300 cluster centroids obtained from K-means.
• SIFT [20] (3 SP layers): the 128-dimensional SIFT descriptors are computed from the interest points returned by a DoG detector [37]. We finally compute a bag-of-words histogram of these descriptors, using a K-means vocabulary of 500 words.

In our work we treat each pyramid layer as if it were a separate low-level feature vector. The concatenation of all these $M = 15$ feature vectors yields a 22,860-dimensional vector. As described in Sec. 3.1, we make use of the explicit map proposed in [12] to perform efficient non-linear classification with an approximated intersection kernel. The map is a function $\Psi(f; r, L, \gamma)$ that takes a feature vector $f$ as input and is parameterized by $r$ (number of samples), $L$ (period of the sampling), and $\gamma$ (normalization of the kernel). We set $r = 1$ for all the features, producing feature vectors three times as big as the original vectors, and $\gamma = 1$ as suggested in [12]. For each feature vector, we select the parameter $L$ by grid search, minimizing the error between the exact kernel distance $K$ and the approximated one, i.e., $\min_L \sum_{i,j=1}^{N} \big| \langle \Psi(f_i), \Psi(f_j) \rangle - K(f_i, f_j) \big|$, computed on a validation set consisting of $N = 2560$ images randomly sampled from the Caltech 256 dataset [3]. The concatenation of all these mapped low-level feature vectors forms the vector $\Psi$ introduced in Eq. 2, which for this implementation has dimensionality $D = 68{,}580$. In this section we will refer to this descriptor as PSI. Finally, we want to emphasize that our approach is obviously not constrained to work only with the particular choice of low-level features described above. Therefore, it could easily take advantage of more powerful low-level features, such as the recently introduced Fisher Vectors [2], which have been shown to lead to state-of-the-art results in object categorization.

  Name                               Dimens.          Storage size per image
  Our descriptors
  PSI (Sec. 4.2)                     68,580           268 KB
  CLASSEMES (Sec. 3.2)               2,659            11 KB
  CLASSEMES-BIT (Sec. 3.2)           2,659 (bin)      0.33 KB
  PICODES (Sec. 3.3)                 2,048 (bin)      0.25 KB
  MC (Sec. 3.4)                      15,232           60 KB
  MC-BIT (Sec. 3.4)                  15,232 (bin)     1.9 KB
  MC-LSH (Sec. 3.4)                  200 K (bin)      25 KB
  X+SPCAT L0L1 (Sec. 3.5.1)          5×               5×
  X+SPLPOOL L0L1 (Sec. 3.5.2)        2×               2×
  X+OBJPOOL (Sec. 3.5.3)             2×               2×
  Prior work
  Gong et al. 2011 [38] (ITQ)        2,048 (bin)      0.25 KB
  Gong et al. 2011 [38] (CCA-ITQ)    2,048 (bin)      0.25 KB
  Li et al. 2010 [17] (OBJECTBANK)   44.6 K           175 KB
  Lin et al. 2011 [9]                1.1 M            4.5 MB
  Sanchez et al. 2011 [2] (FV)       1 M              4 MB
  Harzallah et al. 2009 [39]         69.3 K           4.78 MB
  Song et al. 2011 [40]              192 K            750 KB
  Elfiky et al. 2012 [41]            18 K             71 KB
  Xiao et al. 2010 [34]              40 K             155 KB

TABLE 1. Descriptors considered in our comparison. This table presents: the name of the descriptor and the section where it is introduced; the native dimensionality and whether the descriptor is binary; the memory required to store a single image descriptor.
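The period selection can be sketched as follows (illustrative code; hom_ker_map is a hypothetical wrapper around the explicit map of [12], e.g., VLFeat's homogeneous kernel map, and the exact distance shown is the histogram intersection kernel).

```python
import numpy as np

def intersection_kernel(fi, fj):
    """Exact histogram intersection kernel between two feature vectors."""
    return np.minimum(fi, fj).sum()

def select_period(feats, hom_ker_map, candidates, r=1):
    """Grid search over the sampling period L, minimizing the absolute error
    between the exact kernel and its explicit-map approximation on a set of
    validation feature vectors (one low-level feature type at a time)."""
    best_L, best_err = None, np.inf
    for L in candidates:
        mapped = [hom_ker_map(f, period=L, order=r) for f in feats]
        err = sum(abs(np.dot(mapped[i], mapped[j]) - intersection_kernel(feats[i], feats[j]))
                  for i in range(len(feats)) for j in range(len(feats)))
        if err < best_err:
            best_L, best_err = L, err
    return best_L
```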

4.3 Learning classifier-based descriptors

In this section we describe in detail how we implemented the classifier-based descriptors proposed in Sec. 3. A summary of the descriptors described in this section is provided in Table 1.

4.3.1 Classemes

The training dataset $D^S$ for Classemes is obtained from concepts in the Large Scale Concept Ontology for Multimedia (LSCOM) [42], which includes textual descriptions of a collection of concepts selected to be useful, observable and feasible for automatic visual detection, and as such are likely to form a good basis for image retrieval and object recognition tasks.

For each of the selected 2659 categories, the top 150 images were gathered from the bing.com image search engine, using the LSCOM textual descriptions as queries. These pictures form the training set $D^S$, which in total contains ∼400K images. We finally learned the classeme descriptors CLASSEMES and CLASSEMES-BIT using the procedure described in Sec. 3.2. Since the basis classifiers can be learned independently, we exploited a cluster of machines to reduce the training time.

Note that the textual descriptions of 28 categories overlap with the names of some categories of the popular benchmark Caltech 256 [43], which is used as one of the test datasets in Sec. 4.5.1. This "pre-training" of the query classes is in practice not significant: removing the corresponding 28 values from the descriptor causes a negligible drop in performance on Caltech 256 (< 1%).

4.3.2 PiCoDes

We built the training set $D^S$ from 2659 randomly selected ImageNet synsets, using 30 images per category, for a total of ∼80K images. In order to avoid pre-learning the test classes in the descriptor, we avoided picking as training synsets categories belonging to the ILSVRC2010 dataset or related to Caltech 256 (we performed sub-string matching comparison between the synset tags and the Caltech 256 keywords, removing in total 711 ImageNet classes). This allows us to evaluate our descriptor in a scenario where each test class to recognize is effectively novel, i.e., not present in the training set used to learn the descriptor.

Note that the learning procedure described in Sec. 3.3 is not easily parallelizable, and it requires continuous access to the high-dimensional image descriptors $\Psi(x)$ of Eq. 2, which are very costly in terms of storage (see PSI in Tab. 1). Thus, in practice, we compress the vectors via PCA, producing a vector of 6415 dimensions. More formally, for the PICODES learning we set $\Psi(x) \equiv P\Psi(x)$, where $P$ contains the top 6415 PCA components. The dimensionality of 6415 was chosen based on a preliminary experiment of multiclass classification on Caltech 256 (please refer to the supplementary material).

Given the training set $D^S$, we train the PiCoDes descriptor as described in Sec. 3.3, performing 15 iterations. For each target dimensionality $C$, we learned multiple descriptors using different values of the hyper-parameter $\lambda$, and kept the descriptor that gave the lowest multiclass error on a validation set consisting of 5 images per class. In this paper we report results for $C = 2048$.

4.3.3 Meta-Classes

We formed the training set DS from 8000 randomly sampled synsets, using at most 1000 examples per category. We also created a validation set Dval with the same categories and 80 examples per class. As done for PiCoDes, we selected the training classes such that the synsets of these categories do not contain any of the Caltech 256 or ILSVRC2010 class labels, so as to avoid “pre-learning” the test classes during the feature-training stage. We learned the MC and MC-BIT descriptors following the procedure described in Sec. 3.4, using the validation set to compute the confusion matrix and the two parameters of the sigmoid function for each meta-class. Since the basis classifiers can be learned independently, we used a cluster of multiple machines to reduce the training time.
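
The text does not spell out how the two sigmoid parameters are estimated; a plausible Platt-style calibration on the validation scores is sketched below, with all names hypothetical.

    # Plausible Platt-style calibration (our guess at the "two sigmoid parameters"):
    # fit p(y=1|s) = 1 / (1 + exp(a*s + b)) on validation-set classifier scores.
    import numpy as np
    from scipy.optimize import minimize

    def fit_sigmoid(scores, labels):
        # scores: raw meta-class classifier outputs on Dval;
        # labels: 1 if the validation image belongs to the meta-class, else 0.
        def nll(params):
            a, b = params
            p = 1.0 / (1.0 + np.exp(a * scores + b))
            eps = 1e-12
            return -np.sum(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
        res = minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead")
        return res.x                               # (a, b) for this meta-class

    def calibrated_output(score, a, b):
        return 1.0 / (1.0 + np.exp(a * score + b))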


[Fig. 3 plot omitted; x-axis: number of training examples per class, y-axis: accuracy (%); curves: classemes-bit, classemes, PiCoDes, mc-bit, mc, Psi, LP-β (approximate kernels) [5], ObjectBank [17], ITQ [38], CCA-ITQ [38].]

Fig. 3. Multi-class recognition on Caltech 256 using different image representations. The classification model is a linear SVM (except for LP-β). The accuracy is plotted as a function of the training set size.

4.4 Evaluation setup

The next sections describe the experimental evaluations that we have performed using our descriptors as well as other popular image representations. All the experiments involve the use of a classifier to predict the correct object/scene label of an image; we adopt the very efficient linear SVM model (with the only exception of LP-β). For multi-class categorization, we used the one-vs-the-rest strategy and performed the prediction with winner-take-all. The SVM hyperparameter is selected using either 5-fold cross-validation or the validation set, when available.
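
A minimal sketch of this protocol, using scikit-learn's LinearSVC as the linear SVM (the hyperparameter search over C is omitted for brevity), is:

    # One linear SVM per class (one-vs-the-rest) and winner-take-all prediction
    # over the raw decision values.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_one_vs_rest(X, y, classes, C=1.0):
        models = {}
        for c in classes:
            models[c] = LinearSVC(C=C).fit(X, (y == c).astype(int))
        return models

    def predict_winner_take_all(models, X):
        classes = sorted(models)
        scores = np.column_stack([models[c].decision_function(X) for c in classes])
        return np.array(classes)[scores.argmax(axis=1)]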

4.5 Experiments: Object Class Recognition

4.5.1 Caltech 256

We present experiments obtained with several image descriptors on the challenging Caltech 256 benchmark. We follow the standard approach of learning the classifiers for different numbers of training examples per class {5, 10, . . . , 50}. We evaluate the model on a test set consisting of 25 examples per class, and report the accuracy as the mean of the diagonal of the confusion matrix. We compare our descriptors CLASSEMES, CLASSEMES-BIT, PICODES, MC and MC-BIT with the following methods:

• PSI: the concatenation of the mapped low-level features (introduced in Sec. 4.2). This baseline is interesting to consider as it shows the accuracy that can be obtained by directly training the final classifier on the low-level image representation that we have used to learn our descriptors.

• ITQ: the embedding method introduced in [38], which learns a binary code by directly minimizing the quantization error of mapping the input data to vertices of the binary hypercube. As training data for the learning we use 2560 images from the Caltech 256 dataset: the samples are converted into PSI descriptors using the same low-level features, feature mapping, and PCA projection that we adopted for PiCoDes. We tried different PCA subspace dimensionalities; here we report the results obtained with the setting yielding the best final accuracy. We learn an ITQ descriptor of 2048 bits in order to compare it with PICODES.

• CCA-ITQ: same as ITQ, but instead of PCA we use Canonical Correlation Analysis (CCA) (as suggested in [38]), which performs discriminative dimensionality reduction.

• OBJECTBANK: the descriptor introduced in [17], which encodes into a single vector both semantic and spatial information about objects detected in the input image, using a set of pre-trained detectors.

• LP-β: the variant of the LP-β [5] multiple-kernel combiner described in Sec. 3.2, based on the same low-level features used by classemes and PiCoDes (see Sec. 4.2).

Figure 3 shows the multi-class recognition accuracy of the different approaches as a function of the number of training examples per class. Note that the difference in performance between CLASSEMES and CLASSEMES-BIT is negligible on this benchmark, probably indicating that the accuracy of the classifier is saturated at this dimensionality and cannot exploit the additional information provided by the continuous data. OBJECTBANK performs moderately well but much worse than our descriptors. PICODES outperforms CLASSEMES since the former is explicitly optimized for linear classification, which is the actual usage of the descriptors in this test. For the same dimensionality and storage cost, PICODES also outperforms ITQ and CCA-ITQ. We observed that for very small descriptor dimensionalities, CCA-ITQ performs slightly better than PICODES (e.g., by a margin of 2% for 128 bits). Note, however, that in our tests the CCA transformation was given the advantage of using the Caltech 256 classes as training data.

The methods PSI and LP-β use multiple features and nonlinear kernels (albeit in approximate form), and hence their recognition accuracies are among the best in the literature. However, for small numbers of training examples per class, PICODES surprisingly matches LP-β. Furthermore, note from Table 1 that their storage cost is two orders of magnitude higher than that of PICODES.

Finally, we can see that our descriptors MC and MC-BIT greatly outperform all the other representations; note that the binarized version (MC-BIT) is only 1% worse than the real-valued descriptor (MC) while being 32 times more compact (the storage size for MC-BIT is less than 2 KB per image). Moreover, we use a simple linear model, which enables efficient training and recognition, and the storage requirement for these descriptors is only a few KBytes per image.


TABLE 2
Object class recognition on ILSVRC 2010. For each descriptor, we report: the top-1 accuracy on the test set; the storage size of a hypothetical database consisting of 10 million images; the average time required to evaluate a linear classifier on a single image, without including the feature-extraction time (see supplementary material for details). All the methods use a linear SVM as classifier.

  Method                            Top-1 accuracy (%)   Storage, 10M images (GB)   Recognition time (µs)
  CLASSEMES-BIT                     22.15                3.09                       6.67
  CLASSEMES                         29.95                99.05                      1.53
  PICODES                           22.66                2.38                       5.14
  MC-BIT                            36.71                18                         38.23
  MC-LSH                            42.15                232                        501.97
  Lin et al. 2011 [9]               52.9                 43,945                     796.60
  Sanchez et al. 2011 [2] (FV)      n/a                  39,041                     603.1
  Sanchez et al. 2011 [2] (FV+PQ)   54.3                 1,220                      2131.0
  Gong et al. 2011 [38] (ITQ)       21.24                2.38                       5.14

4.5.2 ILSVRC 2010

We now present results of multiclass recognition on the ILSVRC 2010 dataset. The large size of this database poses new challenges and issues that are not present in smaller databases, as already noted in [44], [2], [9]. Yet, our binary CLASSEMES-BIT, PICODES, and MC-BIT features render the learning on this database relatively fast to accomplish even on a budget PC, as we can represent these descriptors using a single bit per dimension, storing the entire ILSVRC2010 training set in at most 2.13 GB of memory. This allows us to use efficient software for batch training of linear SVMs: we have had good success with LIBLINEAR [23], simply modifying the code to support input data in bitmap format. To further speed up the learning we reduce the negative training set by sampling n = 150 examples per class. According to our study, this sampling causes only a negligible drop in accuracy. Training a single one-vs-the-rest linear SVM using our MC-BIT descriptor takes on average 50 seconds, so the entire multiclass training for ILSVRC2010 can be accomplished in 14 hours on a single-core computer. In practice the learning can be made highly parallel when multiple cores and machines are available. As a comparison, the winning system of the ILSVRC2010 challenge [9] required a week of training on a powerful cluster of machines with specialized hardware.
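
The memory arithmetic is straightforward: with one bit per dimension, a descriptor of d bits occupies d/8 bytes per image. The sketch below illustrates the idea of storing descriptors as packed bitmaps and evaluating a linear classifier on them; it is not the modified LIBLINEAR code, and the packing scheme is our assumption.

    # Illustration of the memory trick (not the modified LIBLINEAR code): each
    # binary descriptor is stored as a packed bitmap, and a linear SVM score is
    # computed on the unpacked 0/1 vector.
    import numpy as np

    def pack(descriptors_01):
        # descriptors_01: (n_images, d) array of 0/1 values -> packed uint8 rows.
        return np.packbits(descriptors_01.astype(np.uint8), axis=1)

    def score_packed(packed_row, w, bias, d):
        x = np.unpackbits(packed_row)[:d]        # back to a 0/1 vector of length d
        return float(w @ x + bias)               # linear SVM decision value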

In Table 2 we show the results of the analysis we made with our descriptors and other approaches reported in the literature on the ILSVRC2010 dataset. Note that the MC-LSH method is simply a compressed form of MC, obtained using LSH with 200K random projections. Our experiments suggest that this compression yields no significant degradation in accuracy, while considerably reducing the storage requirement. We can notice that the performances of the descriptors CLASSEMES and MC-LSH are remarkably superior to those of the binary versions CLASSEMES-BIT and MC-BIT, yielding improvements of +7.8% and +5.44%, respectively. This indicates that the real-valued descriptors are more informative than the corresponding binary versions, but this additional expressiveness is exhibited only in large-scale scenarios (again, we remind the reader that on the small Caltech 256 dataset no significant improvement was obtained using real-valued descriptors). In particular, MC-LSH achieves a recognition rate of 42.15% and a top-5 accuracy of 64.01% (top-5 accuracy is the traditional performance measure of ILSVRC 2010). The systems of [2] and [9] are based on very high-dimensional image signatures and linear classifiers. Note that although these approaches provide better raw accuracy, their storage requirements and prediction times are orders of magnitude more costly and clearly inapplicable in our motivating large-scale scenarios (see the discussion in Sec. 4.7). The embedding method proposed in [38] produces a descriptor that is comparable in storage size and recognition time to CLASSEMES-BIT and PICODES, but it yields lower accuracy.
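
A standard way to obtain such a code, and our assumption of the exact recipe used for MC-LSH, is to take the sign of random projections of the real-valued MC vector:

    # Hedged sketch of the MC-LSH compression: sign of random projections of the
    # real-valued MC descriptor (200K projections in the experiment above). In
    # practice the projection matrix would be generated and applied in chunks.
    import numpy as np

    def lsh_code(mc, n_bits=200_000, seed=0):
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((n_bits, mc.shape[0]))   # shared across the database
        return (R @ mc > 0).astype(np.uint8)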

We also provide results achieved with the individual subcomponents of MC-BIT: “MC-BIT-TREE”, which consists of the 7232 meta-class classifiers learned for the inner nodes of the label tree, yields a top-1 accuracy of 33.87%; “MC-BIT-1VSALL”, which contains the outputs of the 8000 one-vs-the-rest classeme classifiers, yields 30.64%. Note that the accuracy obtained using only the label-tree features is clearly superior to the one produced by the classeme features alone. This indicates that the grouping of classes performed by the label-tree learning produces features that lead to better generalization on novel classes. However, the complete MC-BIT descriptor yields even higher accuracy (36.71%), suggesting that there is value in using both subcomponents. Please refer to the supplementary material for additional information.

A natural question is: how do our learned meta-classes compare to the nodes in the hand-constructed ImageNet hierarchy? For example, [16] showed that this hierarchy provides semantic knowledge that can be exploited to define an effective metric for retrieval. In order to run a fair comparison with our method, we pruned the original ImageNet tree so as to leave only the nodes corresponding to the 8,000 synsets used to train our MC descriptor. This produced a new semantic tree with 1,528 internal nodes. As before, we then trained a binary classifier for each of these semantic classes and obtained a new binary descriptor of 1,528 features. We found that these “semantic meta-classes” yield an accuracy of 18.37% on ILSVRC 2010. However, a random selection of 1,528 features from our MC-BIT descriptor performs much better, yielding an average accuracy of 21.14% (the average is computed over 10 random selections of 1,528 features). This suggests that meta-class features automatically learned by considering visual relations between classes are more effective than attributes based on human-defined notions of semantic similarity. Further results are given in the supplementary material.


TABLE 3
Object-class categorization results obtained on PASCAL 2007 using our descriptors and other methods in the literature. The performance measure is the mean of the Average Precision. Note that the methods marked with * make additional use of ground-truth bounding boxes for training the model. For our descriptors, the classification model is a linear SVM.

  Method                               mAP
  CLASSEMES-BIT                        0.427
  CLASSEMES-BIT + SPLPOOL L0L1         0.452
  CLASSEMES                            0.438
  CLASSEMES + SPLPOOL L0L1             0.447
  PICODES                              0.437
  PICODES + SPLPOOL L0L1               0.455
  MC-BIT                               0.527
  MC-BIT + SPLPOOL L0L1                0.53
  MC                                   0.532
  MC + OBJPOOL                         0.55
  Li et al. 2010 [17] (OBJECTBANK)     0.452
  Harzallah et al. 2009 [39]*          0.635
  Song et al. 2011 [40]*               0.705


4.5.3 PASCAL 2007

We now present categorization results on PASCAL 2007. To the best of our knowledge, this is the first comprehensive analysis of the recognition accuracy of classifier-based descriptors on a detection dataset, which includes images containing multiple objects whose position and scale vary greatly. For this reason we tested our descriptors using the local-encoding extensions described in Sec. 3.5. We implemented SPLPOOL L0L1 using a pyramid of two levels, and the extension OBJPOOL using the 25 subwindows with the greatest ObjectNess score. Table 3 summarizes the results in terms of mAP. Despite the simplicity of the linear classification model that we are using, our descriptors yield good accuracy while enabling extremely efficient prediction. Moreover, note that all the proposed local-encoding strategies boost the accuracy of the raw descriptors. In particular, OBJPOOL produces the best results while only doubling the storage cost.
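
As a rough illustration of the OBJPOOL aggregation (our reading of Sec. 3.5, with hypothetical helper functions), the descriptor is evaluated on the top-scoring ObjectNess windows, max-pooled, and concatenated with the whole-image descriptor, which is why the storage cost only doubles:

    # Rough illustration of OBJPOOL; helper names are hypothetical, and the exact
    # pooling details are assumptions made for the sake of the example.
    import numpy as np

    def crop(image, window):
        x0, y0, x1, y1 = window                  # window = (x0, y0, x1, y1) in pixels
        return image[y0:y1, x0:x1]

    def objpool(image, windows, extract_descriptor):
        # windows: top-K subwindows ranked by ObjectNess score (K = 25 above).
        whole = extract_descriptor(image)
        pooled = np.max([extract_descriptor(crop(image, w)) for w in windows], axis=0)
        return np.concatenate([whole, pooled])   # twice the size of the raw descriptor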

4.6 Experiments on Scene Recognition

4.6.1 MIT 67

In this section we present experiments performed on the MIT 67 benchmark. We tested our descriptor MC and the variants described in Sec. 3.5. Specifically, for a first set of experiments we tested the descriptor MC-2048dims, created by selecting from MC the 2048 most active features according to the criterion of Recursive Feature Elimination (RFE) [45]. The feature selection was performed using ILSVRC 2010, removing at each iteration 50% of the features.

TABLE 4
Scene recognition on MIT 67 using our descriptors and other methods in the literature. For our image representations, the classification model is a linear SVM. Our descriptor MC + OBJPOOL outperforms all prior methods on this test.

  Method                              Accuracy (%)
  MC-2048dims                         44.6
  MC-2048dims + SPCAT L0L1            46.9
  MC-2048dims + SPLPOOL L0L1L2        49.6
  MC-2048dims + OBJPOOL               49.6
  MC + OBJPOOL                        55.9
  Elfiky et al. 2012 [41]             48.9
  Li et al. 2010 [17] (OBJECTBANK)    37.6

This lower-dimensional descriptor reduced the computational requirements and allowed us to easily perform an extensive set of evaluations. Table 4 summarizes the results of our experiments and includes the recognition rates of several other methods presented in the literature. We can see that the plain descriptor MC-2048dims is already very competitive, yielding accuracy close to other state-of-the-art methods. All the local-encoding variants of Sec. 3.5 boost the accuracy. In particular, MC-2048dims + SPLPOOL L0L1L2 is superior to MC-2048dims + SPCAT L0L1 while producing a smaller descriptor; MC-2048dims + OBJPOOL is the best method as it produces a descriptor that is only twice as big as the original one while yielding the best recognition accuracy (+2.5% over MC-2048dims). Finally, the full-dimensional MC + OBJPOOL yields an accuracy of 56% that, to the best of our knowledge, is the best published result for this benchmark.
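
One plausible reading of the RFE selection used to obtain MC-2048dims is sketched below: train a linear model, discard the half of the features with the smallest aggregate weight magnitude, and repeat until 2048 features remain. The use of LinearSVC and this importance criterion are our assumptions, not the exact procedure of [45].

    # Illustrative RFE-style selection; the importance criterion is an assumption.
    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe_select(X, y, target_dim=2048):
        active = np.arange(X.shape[1])
        while active.size > target_dim:
            clf = LinearSVC(C=1.0).fit(X[:, active], y)
            importance = np.abs(clf.coef_).sum(axis=0)   # summed over the 1-vs-rest models
            keep = max(target_dim, active.size // 2)
            active = active[np.argsort(importance)[-keep:]]
        return active                                    # indices of the retained MC features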

4.6.2 SUN 397

To conclude, we present experiments on the large-scale scene recognition benchmark SUN 397. We tested our binary descriptors CLASSEMES-BIT, PICODES and MC-BIT, and the real-valued MC. Table 5 shows our results. As already noticed in our prior evaluations, PICODES outperforms CLASSEMES-BIT, and the larger MC-BIT is the best-performing binary descriptor. The accuracy obtained with MC approaches the result of the method introduced in [34], which is a multiple-kernel combiner with 15 types of features and is thus orders of magnitude more expensive to train and test, while also requiring much more storage.

4.7 Experiments on Object-Class Search

4.7.1 ILSVRC 2010

We present here results for our motivating problem: fast novel-class recognition in a large-scale database. For this experiment we again use the ILSVRC2010 dataset. However, this time for each class we learn a binary linear SVM using as positive examples all the images of that class in the ILSVRC2010 training set; as negative examples we use 4995 images obtained by sampling 5 images from each of the other 999 categories.


TABLE 5
Scene recognition on SUN 397 using our descriptors and other methods in the literature. The classification model for our image representations is a linear SVM.

  Method                    Accuracy (%)
  CLASSEMES-BIT             17.6
  PICODES                   27.1
  MC-BIT                    34.8
  MC                        36.8
  Xiao et al. 2010 [34]     38.0

[Fig. 4 plot omitted; x-axis: K, y-axis: mean precision @ K (%); curves: mc-bit, mc-bit (2048 dims), PiCoDes, classemes-bit, ITQ [38], LSH.]

Fig. 4. Object-class search on ILSVRC2010: precision in retrieving images of a novel class from a dataset of 150,000 photos. For each query, the true positives are only 0.1% of the database. The classification model is a linear SVM.

Then we use the classifier to rank the 150,000 images of the test set. We measure performance in terms of mean precision at K, i.e., the average proportion of images of the relevant class in the top K. Note that for each class, the database contains only 150 positive examples and 149,850 distractors from the other classes. Figure 4 shows the accuracy obtained with MC-BIT, which on this task outperforms CLASSEMES-BIT by 15%. We also compared our descriptors to 2048-bit codes learned by ITQ [38] and LSH (we ran these methods on a compressed version of the representation PSI obtained via PCA). It can be seen that, for the same target dimensionality, all of our methods outperform these baselines.
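
The retrieval measure is easy to state in code; a minimal sketch (with hypothetical names) is:

    # Precision at K for one class query, averaged over classes to obtain the
    # curves of Fig. 4.
    import numpy as np

    def precision_at_k(scores, is_relevant, k):
        # scores: classifier outputs on the 150,000 test images;
        # is_relevant: boolean array marking the ~150 true positives.
        topk = np.argsort(-scores)[:k]
        return is_relevant[topk].mean()

    def mean_precision_at_k(per_class_scores, per_class_relevance, k):
        return np.mean([precision_at_k(s, r, k)
                        for s, r in zip(per_class_scores, per_class_relevance)])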

While the systems described in [2], [9] achieve higher multiclass recognition accuracy than our method on ILSVRC2010, where the classes are predefined (see Sec. 4.5.2), we point out that these approaches are not scalable in the context of real-time object-class search in large databases. Table 2 (column three) reports the storage required by different methods for a database containing 10M images. In their proposed form, [2] and [9] require storing high-dimensional real-valued vectors. Even when splitting the data across different machines, these approaches remain clearly not scalable in real scenarios. In [2], product quantization (PQ) is used to compress the size of the data, but even in this case a 10M-image database would require 610 GB. Our approach saves an order of magnitude of storage, resulting in the most scalable method in terms of memory utilization.

In addition, our system also outperforms the other competing methods in terms of recognition time. The last column of Table 2 shows the average search time per image for a single object-class query. Our methods are by far the fastest. The system proposed in [2], which was relatively scalable in terms of storage, is the slowest one. Our approach provides a 10-fold or greater speedup over these systems and is the only one computing results in times acceptable for interactive search. We also note that sparse retrieval models and top-k ranking methods [46] could be used with our binary codes to achieve further speedups on the class-search problem.

5 CONCLUSIONS

In this paper we have presented three image descriptors which measure the closeness of a given image to a set of high-level visual concepts, called basis classes. The concepts are either chosen a priori or automatically learned. We implement this similarity measure using the outputs of non-linear classifiers trained on an offline labeled dataset. We also propose methods to aggregate the outputs of the basis classifiers evaluated on subwindows of the image into a single feature vector, thus rendering the descriptor more robust to clutter and multiple objects. Our descriptors are designed to be very compact, yet they achieve state-of-the-art accuracy even with simple linear classifiers. We tested and compared our descriptors on several challenging benchmarks and several tasks: object categorization, scene recognition and novel object-class search. Thanks to the compactness and the rich visual information incorporated into the descriptors, the proposed framework enables real-time object-class training and search in databases containing millions of images with good accuracy. The software for the extraction of all our descriptors is publicly available.

ACKNOWLEDGMENT

We are grateful to Andrew Fitzgibbon for his contribution to the design of classemes and PICODES. We thank Martin Szummer for discussions and Chen Fang for programming help. This research was funded in part by Microsoft Research, NSF CAREER award IIS-0952943 and NSF award CNS-1205521.

REFERENCES
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR, 2009.
[2] J. Sanchez and F. Perronnin, “High-dimensional signature compression for large-scale image classification,” in CVPR, 2011, pp. 1665–1672.


[3] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Institute of Technology, Tech. Rep. 7694, 2007.
[4] A. Berg, J. Deng, and L. Fei-Fei, “Large scale visual recognition challenge,” 2010, http://www.image-net.org/challenges/LSVRC/2010/.
[5] P. Gehler and S. Nowozin, “On feature combination for multiclass object classification,” in ICCV, 2009.
[6] N. Kumar, A. Berg, P. Belhumeur, and S. Nayar, “Attribute and simile classifiers for face verification,” in ICCV, 2009.
[7] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in CVPR, 2009.
[8] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in CVPR, 2009.
[9] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. S. Huang, “Large-scale image classification: Fast feature extraction and SVM training,” in CVPR, 2011.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114.
[11] S. Maji and A. C. Berg, “Max-margin additive classifiers for detection,” in ICCV, 2009.
[12] A. Vedaldi and A. Zisserman, “Efficient additive kernels via explicit feature maps,” PAMI, 2011.
[13] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. on PAMI, 2011.

[14] G. Wang, D. Hoiem, and D. Forsyth, “Learning image similarity from Flickr using stochastic intersection kernel machines,” in ICCV, 2009.
[15] L. Torresani, M. Szummer, and A. Fitzgibbon, “Efficient object category recognition using classemes,” in ECCV, 2010.
[16] J. Deng, A. Berg, and F.-F. Li, “Hierarchical semantic indexing for large scale image retrieval,” in CVPR, 2011.
[17] L. Li, H. Su, E. Xing, and L. Fei-Fei, “Object Bank: A high-level image representation for scene classification & semantic feature sparsification,” in NIPS, 2010.
[18] A. Bergamo, L. Torresani, and A. Fitzgibbon, “PiCoDes: Learning a compact code for novel-category recognition,” in NIPS, 2011, pp. 2088–2096.
[19] A. Bergamo and L. Torresani, “Meta-class features for large-scale object categorization on a budget,” in CVPR, 2012, pp. 3085–3092.
[20] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
[21] A. Oliva and A. Torralba, “Building the gist of a scene: The role of global image features in recognition,” Visual Perception, Progress in Brain Research, vol. 155, 2006.
[22] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009.
[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” JMLR, vol. 9, 2008.

[24] S. Bengio, J. Weston, and D. Grangier, “Label embedding trees for large multi-class tasks,” in NIPS, 2010.
[25] T. Gao and D. Koller, “Discriminative learning of relaxed hierarchy for large-scale visual recognition,” in ICCV, 2011.
[26] J. Deng, S. Satheesh, A. C. Berg, and F.-F. Li, “Fast and balanced: Efficient label tree learning for large scale object recognition,” in NIPS, 2011, pp. 567–575.
[27] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007.
[28] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in NIPS, 2001.
[29] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers. MIT Press, 1999.
[30] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
[31] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE PAMI, vol. 34, no. 11, pp. 2189–2202, 2012.
[32] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[33] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in CVPR, 2009, pp. 413–420.

[34] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010, pp. 3485–3492.
[35] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[36] E. Shechtman and M. Irani, “Matching local self-similarities across images and videos,” in CVPR, 2007.
[37] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms,” http://www.vlfeat.org/, 2008.
[38] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in CVPR, 2011, pp. 817–824.
[39] H. Harzallah, F. Jurie, and C. Schmid, “Combining efficient object localization and image classification,” in ICCV, 2009, pp. 237–244.
[40] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, “Contextualizing object detection and classification,” in CVPR, 2011, pp. 1585–1592.
[41] N. M. Elfiky, J. Gonzalez, and F. X. Roca, “Compact and adaptive spatial pyramids for scene recognition,” Image Vision Comput., vol. 30, no. 8, pp. 492–500, 2012.
[42] “LSCOM: Cyc ontology dated 2006-06-30,” http://lastlaugh.inf.cs.cmu.edu/lscom/ontology/LSCOM-20060630.txt, http://www.lscom.org/ontology/index.html.
[43] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Institute of Technology, Tech. Rep. 7694, 2007.
[44] J. Deng, A. C. Berg, K. Li, and F.-F. Li, “What does classifying more than 10,000 image categories tell us?” in ECCV, 2010.
[45] O. Chapelle and S. S. Keerthi, “Multi-class feature selection with support vector machines,” Proc. Am. Stat. Ass., 2008.
[46] M. Rastegari, C. Fang, and L. Torresani, “Scalable object-class retrieval with approximate and top-k ranking,” in ICCV, 2011.

Alessandro Bergamo is a Ph.D. student in the Computer Science Department at Dartmouth College. He received his B.S. from the University of Padua and his M.S. from the University of Milan in Italy. His research interests span the areas of Machine Learning and Computer Vision, with special focus on the design of image descriptors and methods for efficient novel-class recognition and search in large-scale databases.

Lorenzo Torresani is an Assistant Professor in the Computer Science Department at Dartmouth College. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. He has worked at several industrial research labs including Microsoft Research, Like.com and Digital Persona. His research is in computer vision and machine learning. In 2001, Torresani and his coauthors received the Best Student Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). He is the recipient of a National Science Foundation CAREER Award.