

Learning with Few Examples Using a Constrained Gaussian Prior on Randomized Trees

Erik Rodner, Joachim Denzler

Chair for Computer Vision, Friedrich Schiller University of Jena
Email: {rodner,denzler}@informatik.uni-jena.de

Abstract

Machine learning with few training examples always leads to over-fitting problems, whereas human individuals are often able to recognize difficult object categories from only a single view. It is a common belief that this is mostly achieved by transferring knowledge from related classes. Therefore, we introduce a new hybrid classifier for learning with very few examples by exploiting interclass relationships. The approach consists of a randomized decision tree structure which is significantly enhanced using maximum a posteriori (MAP) estimation.

For this reason, a constrained Gaussian is introduced as a new parametric family of prior distributions for multinomial distributions to represent shared knowledge of related categories. We show that the resulting MAP estimation leads to a simple recursive estimation technique, which is applicable beyond our hybrid classifier.

Experimental evaluation on two public datasets (including the very demanding Mammals database) shows the benefits of our approach compared to the base randomized trees classifier.

1 Motivation and Introduction

Learning to recognize objects of different categories is one of the major research fields within computer vision and machine learning. Although current state-of-the-art approaches reach impressive results on difficult datasets [3], they are not able to handle very small training sets.

Without additional information, machine learning with few training examples always reduces to an ill-posed optimization problem. It is a common paradigm within the community that information from similar object categories or classification tasks (support classes / tasks) is the most useful source for enhancing the generalization ability of weak representations [4]. This principle is known in the literature as learning to learn, knowledge transfer or transfer learning. Handling few training examples therefore requires classifiers that use interclass relationships (interclass transfer).

Research in this field is often motivated by closing the gap between human and machine vision quality. However, learning with weak representations is also a typical task demanded by industry. Gathering training material is often expensive and time-consuming and can have a significant impact on the overall cost of resulting systems.

Previous work on interclass transfer differs in the type of information transferred. Miller et al. [17] try to estimate shared geometric transformations, which can be applied indirectly to a new class representation. Another idea is to assume shared structures in feature space and estimate a metric or transformation from support classes [7, 19, 1]. This mostly leads to methods similar to linear discriminant analysis, without clear motivation and suitable comparisons to the standard approach.

Torralba et al. [21] use a discriminative boosting technique which exploits shared class boundaries within feature space to allow more efficient multi-class learning with a sub-linear number of features. In contrast, Fei-Fei et al. [5] develop a generative framework with MAP estimation of model parameters using a prior distribution estimated from support classes. Contextual information as a helpful information source in object recognition systems can be regarded as a special case of interclass transfer [8, 12, 20].

Our work is motivated by the basic ideas in [21, 5] and combines them in a new manner. We propose to use a combined discriminative and generative technique as the framework of a new transfer learning approach.



The key concept is a MAP estimation of the parameters of a discriminative classifier. In contrast to other hybrid methods, such as [13], the feature space is efficiently partitioned using a discriminative random forest classifier (extremely randomized trees (ERT), [10, 2]). This partitioning and an additional prior on the distribution within each decision tree are afterwards used as prior knowledge in a MAP estimation of the model parameters of a new class with few training examples (Figure 1).

This allows us to develop a classifier which efficiently combines the generalization power of discriminative approaches with the advantages of generative approaches, providing better models of weak representations.

The use of ensembles of trees, which is the base discriminative part of our method, prevents over-fitting by randomization and thus solves one of the main problems original decision tree classifiers suffer from. Recent applications [20, 6, 15] show the great potential and wide applicability of this base classifier.

We first review randomized decision trees as presented by [10]. Next we show that Bayesian estimation using a prior distribution is a well-founded possibility to transfer knowledge from related classes, and how to apply this idea to decision trees. In section 4 a compact model for multinomial distributions, called constrained Gaussian prior, is introduced, which is the theoretical key concept of our method. Finally, experiments on publicly available datasets demonstrate the benefits of our hybrid classifier. This includes an evaluation using the very difficult Mammals database of [9]. The use of this database leads to a short discussion (section 5.3) about the ability of our method to naturally transfer context. A summary of our findings concludes the paper.

2 Randomized Decision Trees

The discriminative part of our approach is based on extremely randomized trees (ERT) as introduced by Geurts et al. [10]. Decision tree classifiers are (mostly) binary trees which store a weak classifier and posterior distributions p(Ω_i | n) for each class i at all nodes n. A weak classifier is a binary function consisting of a one-dimensional feature and a simple threshold. In the classification phase an example (feature vector, image) is traversed down the tree according to the output of each weak classifier on the path until a leaf node is reached. The posterior of the leaf node is the result of the decision tree and estimates the probability that an arbitrary example of class i reaches the leaf.

Standard decision tree approaches suffer from two serious problems: long training time and over-fitting. The ERT approach solves both issues by random sampling. Training time is easily reduced by an approximate search for the most informative weak classifier in each node, instead of evaluating every feature and threshold. The selection is done by choosing the weak classifier with the highest gain in information from a random fraction of features and thresholds.

Given enough training data for each class i, generalization performance can be greatly increased by learning S decision trees (a forest) with random subsets of the training data. Given the leaf nodes of the forest n^L = (n^L_1, ..., n^L_S), the overall posterior can be obtained by simple averaging [2]:

    p(\Omega_i \mid n^L) = \frac{1}{S} \sum_{s=1}^{S} p(\Omega_i \mid n^L_s) .   (1)

This special case of bagging [2] reduces over-fitting effects without the need for additional tree pruning. Unbalanced training sets lead to a biased decision tree classifier; therefore we follow [20] and weight each example with the inverse class frequency.
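
The following minimal Python sketch illustrates the averaging of equation (1). It is not the authors' implementation; the per-tree posterior lookup is a placeholder interface assumed purely for illustration.

    import numpy as np

    def forest_posterior(tree_posteriors, x):
        """Equation (1): average the per-tree leaf posteriors for one example x.
        `tree_posteriors` is a list with one callable per tree, each returning the
        class posterior p(class | reached leaf) stored in the leaf that x falls into."""
        return np.mean([p(x) for p in tree_posteriors], axis=0)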

3 Transfer Learning using ERT

Learning with few or only one single example per class often leads to over-fitting of the model parameters estimated by the classifier. When using decision trees, the ability of random forests to efficiently learn robust classifier combinations cannot be exploited, due to the fact that they need to additionally reduce the training data. Within each tree, a single example only increases the posterior probabilities in a single branch, which can be seen as memorizing the training example completely. For this reason our approach uses interclass transfer to learn shared features from support classes and incorporates this information to boost the generalization ability of a new class. The main steps of our algorithm are summarized in Figure 2.


Figure 1: General concept of our approach for transfer learning in the context of learning with few examples: common knowledge about the recognition task is obtained from training data of related similar classes (support classes) and used to build a classifier which is able to efficiently learn from a weak representation (new class). Images are collected from the Mammals database of [9], which is used for experiments in section 5.

3.1 Discriminative Structure

Despite the well-founded probabilistic background of Bayesian methods, discriminative classifiers often lead to a better generalization performance, as can be seen in the success of support vector machines as a standard classifier within the machine learning community.

For this reason we propose, as a first step, to train a discriminative random forest with examples of all classes. All subsequent steps that incorporate knowledge of the new class use this fixed discriminative structure. This concept has been used in [11, 15] to reuse features and reduce computation times. Our approach can be seen as a refinement step of this fixed discriminative classifier using a Bayesian framework. In the following, we describe our method for a single tree of the forest.

3.2 Generative Transfer Learning

As proposed by [5], transfer learning can be done by MAP estimation of the model parameters θ related to the new class γ. Knowledge from a set S of support classes is incorporated within this Bayesian framework as a prior distribution of the parameters θ. These classes represent a subset of all available categories which are assumed to be related to the new class. The fundamental assumption therefore is that it is possible to estimate a suitable prior distribution useful for regularizing the parameter estimation of a related class. This assumption can be expressed mathematically by:

    \theta^{\mathrm{MAP}} = \arg\max_{\theta} \; p(T^{\gamma} \mid \theta) \, p(\theta \mid T^{S}) ,   (2)

where T^γ denotes the training data of the new class and T^S denotes the training data of all support classes. Please note that this directly incorporates the main assumption by using a prior distribution depending on the training data of all support classes T^S.

With a fixed tree structure, incorporating training data of the new class reduces to estimating the posterior distributions in the leaves.


(I) Learn a randomized decision forest using all classes [10].

(II) Calculate node probabilities for each node and each class (equation (3)).

(III) Approximate prior knowledge using a constrained Gaussian prior (equation (6)).

(IV) MAP estimation of the leaf probabilities of the class with few training examples using a CGP (section 4.1).

(V) Calculate node posteriors from the estimated leaf probabilities.

(VI) Build additional discriminative levels (section 3.4).

Figure 2: Main steps of our approach. Steps II to VI are performed independently for each tree of the forest.

Using a single training example and updating the posterior distributions without additional knowledge changes only a single branch from the root to a leaf. The distribution of the new class is then restricted to a small area of feature space, defined by the single leaf node reached by the training sample. It is obvious that this leads to an over-fitting situation. Therefore we propose to use MAP estimation of the leaf probabilities in terms of (2), which allows us to incorporate prior knowledge as a well-defined distribution. Our MAP estimation technique, as well as the underlying model distributions, are presented in section 4.

3.3 Leaf probabilities as a parameter

The event of an example reaching node n is defined by the probability mass of a part of the feature space, described by the path from the root to n. For this reason, we denote this event by Ω_n, analogous to Ω_i representing the part of the feature space related to class i. If the sum of training example weights reaching each node, c(Ω_i | Ω_n) (a non-normalized posterior distribution, i.e. a sum of weights), is stored, node posterior distributions p(Ω_i | Ω_n) are easily converted to node probabilities p(Ω_n | Ω_i) using the following recursive formula (p denotes the parent node of n):

    p(\Omega_n \mid \Omega_i) = p(\Omega_p \mid \Omega_i) \, p(\Omega_n \mid \Omega_p, \Omega_i) = p(\Omega_p \mid \Omega_i) \, \frac{c(\Omega_i \mid \Omega_n)}{c(\Omega_i \mid \Omega_p)} .   (3)

The trivial case is the probability of the root node r: p(Ω_r | Ω_i) = 1. Our parameter vector θ for MAP estimation consists of the node probability of each leaf l:

    \theta_l = p(\Omega_l \mid \Omega_\gamma) .   (4)

Additionally, we use θ_i to denote the vector of leaf probabilities corresponding to an arbitrary class i. Due to the fact that the leaves of a decision tree induce a partitioning of the feature space into disjoint subsets Ω_l, each instance of the parameter vector is a discrete multinomial distribution. For this reason any suitable distribution over discrete distributions can be used to model a prior and perform MAP estimation. In section 4 we present our model of the prior as well as the resulting MAP estimation of discrete distributions. At this point we assume that the leaf probabilities of the new class are well estimated. Inner node probabilities can then be calculated by simple recursive summation within the tree. The last step is to recompute all posterior distributions, which can be achieved by the usual application of Bayes' law:

    p(\Omega_i \mid \Omega_n) = \frac{p(\Omega_n \mid \Omega_i) \, p(\Omega_i)}{p(\Omega_n)} = \frac{p(\Omega_n \mid \Omega_i) \, p(\Omega_i)}{\sum_{i'} p(\Omega_n \mid \Omega_{i'}) \, p(\Omega_{i'})} .   (5)
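
To make equations (3) and (5) concrete, here is a small Python sketch under assumed data structures (a children map and per-node count vectors); it is an illustration, not the authors' code.

    import numpy as np

    def node_probabilities(children, counts, root=0, eps=1e-12):
        """Equation (3): convert stored weight sums c(class | node) into node
        probabilities p(node | class). `children[n]` lists the child ids of node n,
        `counts[n]` is the per-class vector of example-weight sums at node n."""
        probs = {root: np.ones_like(counts[root], dtype=float)}  # p(root | class) = 1
        stack = [root]
        while stack:
            parent = stack.pop()
            for child in children.get(parent, []):
                ratio = counts[child] / np.maximum(counts[parent], eps)
                probs[child] = probs[parent] * ratio             # equation (3)
                stack.append(child)
        return probs

    def node_posterior(p_node_given_class, class_priors):
        """Equation (5): recompute p(class | node) from p(node | class) by Bayes' law."""
        joint = p_node_given_class * class_priors
        return joint / joint.sum()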

3.4 Additional Discriminative Levels

All machine learning approaches using the interclass paradigm within a single classification task have to cope with a common issue: transferring knowledge from support classes can lead to confusion with the new class. For example, using prior information from camel images to support a dromedary class within an animal classification task enables the transfer of shared features like fur color or head appearance. However, the classifier has to use additional features like shape information to discriminate between the two object categories.

To solve this problem we propose to build additional discriminative levels of the decision tree after the MAP estimation of the leaf posterior distributions. Starting from a leaf node with non-zero posterior probability of the new class, we use the method of [10] to further extend the tree. The training data in this case consists of all samples of the new class and the samples of all other classes which reached the leaf. All of the samples are weighted by the corresponding unnormalized posterior distribution. This procedure allows us to find new discriminative features, especially between the new class and all support classes.

4 Constrained Gaussian Prior

The selection of a suitable parametric model for the prior distribution is a difficult task. The model parameter to be estimated is a discrete distribution itself, thus the prior distribution is additionally constrained to the (n−1)-simplex. A common choice is a conjugate prior, such as a Dirichlet distribution for multinomial distributions. This family of parametric distributions typically has as many hyper-parameters as the underlying problem has parameters. Therefore one needs a huge set of samples from the support classes to estimate optimal Dirichlet parameters.

For this reason we propose to use a constrained Gaussian distribution (CGD), which is a much simpler family of parametric distributions. With θ ≥ 0 we define the density as:

    p(\theta \mid T^{S}) \;\propto\; \mathcal{N}(\theta \mid \mu^{S}, \sigma^2 I) \; \delta\!\left(1 - \sum_l \theta_l\right) .   (6)

We multiply with the δ-term (δ(0) = 1 and δ(x) = 0 for all x ≠ 0) to ensure that the support of the density function is the simplex of all feasible discrete distributions. The use of σ²I as the covariance matrix is an additional assumption which is useful to derive an efficient MAP estimation algorithm (section 4.1).
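
As a small illustration (not taken from the paper), the unnormalized log-density of equation (6) can be evaluated as follows; the tolerance used for the simplex check is an arbitrary assumption.

    import numpy as np

    def cgd_log_density_unnormalized(theta, mu, sigma2, tol=1e-9):
        """Unnormalized log-density of the constrained Gaussian of equation (6):
        an isotropic Gaussian N(theta | mu, sigma2 * I) restricted to the simplex."""
        theta = np.asarray(theta, dtype=float)
        if np.any(theta < 0.0) or abs(theta.sum() - 1.0) > tol:
            return -np.inf  # outside the support, i.e. the delta term is zero
        return -np.sum((theta - np.asarray(mu)) ** 2) / (2.0 * sigma2)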

Figure 3 presents a graphical comparison between a Dirichlet distribution and our CGD for some parameter values. Note that our model can approximate a Dirichlet for the whole variety of parameters. The main difference is the elliptic shape of the iso-lines, in contrast to a more triangular shape, which is of course more appropriate for the simplex.

This simple model allows us to estimate the hyper-parameters µ^S and σ in the usual way and to perform MAP estimation efficiently. The mean vector µ^S can be estimated analogously to a non-constrained Gaussian, because the simplex is a convex set.

Figure 3: Comparison between a constrained Gaussian (left column) and a Dirichlet distribution (right column), shown for σ² = 0.1 with α = (1.6, 1.6, 1.6)^T and for σ² = 0.03 with α = (2.2, 6, 6)^T. Color represents the value of the density at the position on the simplex. The mean of the CGD is set to the mean of the corresponding Dirichlet distribution.

Within our application to decision trees, µ^S is estimated using the leaf probabilities of the support classes:

    \mu^{S} = \frac{1}{|S|} \sum_{i \in S} \theta_i .   (7)
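
A minimal sketch of equation (7), assuming the support classes' leaf probability vectors θ_i have already been computed (for example with the node probability routine sketched above):

    import numpy as np

    def cgp_mean(support_leaf_probs):
        """Equation (7): CGP mean as the average of the support classes'
        leaf probability vectors; each row of the input is one θ_i."""
        mu = np.asarray(support_leaf_probs, dtype=float).mean(axis=0)
        return mu  # remains on the simplex because every θ_i sums to one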

As usual, modelling the unknown distribution with a Gaussian parametric family is mostly motivated by practical computational issues rather than theoretical results. Applied to our leaf probabilities with regularization using support classes, this simple model can of course lead to a wrong estimate in some cases. For example, the support classes could share a common feature which is not related to the new class. In spite of these possible failure cases, we will show that, due to the high dimensionality of θ and careful averaging using bagging, our simple Gaussian model is sufficient to increase the performance on different classification tasks.

Page 6: Learning with Few Examples Using a Constrained Gaussian Prior …LFE.pdf · Machine learning with few training examples al-ways leads to over-fitting problems, whereas human individuals

4.1 MAP Estimation using a CGP

MAP estimation using complex parametric distributions often requires nonlinear optimization techniques. In contrast to such approaches, we show that by using our constrained Gaussian as a prior of a multinomial distribution, it is possible to derive a closed-form solution of the global optimum depending on a single Lagrange multiplier.

We start by writing the objective function of the MAP estimation as a Lagrange function of our simplex constraint and the posterior:

    L(\theta, \lambda) = \log\bigl( p(T^{\gamma} \mid \theta) \, p(\theta \mid T^{S}) \bigr) + \lambda \left( \sum_l \theta_l - 1 \right) .   (8)

The likelihood has the simple form of a multinomial and depends on a discrete histogram c = (c_l)_{l=1}^{m} representing the number of samples in each component:

    p(T^{\gamma} \mid \theta) \;\propto\; \prod_l (\theta_l)^{c_l} .   (9)

Within our application to the leaf probabilities of decision trees, the absolute number of examples reaching a node is used: c_l = c(Ω_γ | Ω_l). Hence, the overall objective function can be written as:

    \sum_l \left( c_l \log(\theta_l) - \frac{1}{2\sigma^2} (\theta_l - \mu_l)^2 + \lambda \theta_l \right) - \lambda .

The normalization factors of our prior and the likelihood can be neglected, because they are single independent additive constants in L. Setting the gradient ∂L/∂θ_l (θ, λ) to zero leads to m independent equations:

    0 = \frac{c_l}{\theta_l} - \frac{1}{2\sigma^2} \cdot 2 \cdot (\theta_l - \mu_l) + \lambda .   (10)

Note that with σ² → ∞ we get a non-informative prior, which reduces MAP to ML estimation. It is easy to prove that under "non-degenerate conditions" all entries of the optimal vector θ^MAP are positive. Therefore we can assume θ_l > 0 for each l, and it is possible to obtain a simple quadratic equation in θ_l:

    0 = \theta_l^2 + \theta_l (-\mu_l - \lambda\sigma^2) - \sigma^2 c_l .   (11)

Therefore the optimization problem only has a single positive solution depending on λ:

    \theta_l = \frac{\mu_l + \lambda\sigma^2}{2} + \sqrt{ \left( \frac{\mu_l + \lambda\sigma^2}{2} \right)^{2} + \sigma^2 c_l } .   (12)

Estimating the Lagrange multiplier is done with a simple fixed point iteration, derived from our simplex constraint:

    1 \overset{!}{=} \sum_l \theta_l = \sum_l \frac{\mu_l + \lambda\sigma^2}{2} + \sum_l \sqrt{g_l(\lambda)} = \frac{1}{2} + \frac{1}{2} m \lambda \sigma^2 + \sum_l \sqrt{g_l(\lambda)} ,   (13)

where

    g_l(\lambda) = \left( \frac{\mu_l + \lambda\sigma^2}{2} \right)^{2} + \sigma^2 c_l   (14)

is an abbreviation for the term under the square root. This finally leads to the following recursion formula:

    \lambda_{i+1} = \frac{1 - 2 \sum_l \sqrt{g_l(\lambda_i)}}{m \sigma^2} .   (15)

So far, we cannot prove convergence theoretically, but in our application we find a suitable solution within a few iterations. Better techniques for solving equation (13) beyond a simple fixed point iteration might lead to better numerical stability, but are not investigated here. Given the Lagrange multiplier λ, we can easily estimate the whole vector θ using equation (12).
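
The complete estimation procedure of this section fits into a few lines of Python. The following sketch (not the authors' code) combines the closed form of equation (12) with the fixed-point iteration of equation (15); the initialization of λ and the fixed iteration count are assumptions, since they are not specified above.

    import numpy as np

    def map_estimate_cgp(c, mu, sigma2, n_iter=50):
        """MAP estimate of a multinomial under a constrained Gaussian prior (section 4.1).
        c: histogram of counts c_l, mu: prior mean on the simplex, sigma2: prior variance."""
        c = np.asarray(c, dtype=float)
        mu = np.asarray(mu, dtype=float)
        m = len(c)

        def g(lam):
            return ((mu + lam * sigma2) / 2.0) ** 2 + sigma2 * c      # equation (14)

        lam = 0.0                                                     # assumed initialization
        for _ in range(n_iter):
            lam = (1.0 - 2.0 * np.sqrt(g(lam)).sum()) / (m * sigma2)  # equation (15)
        theta = (mu + lam * sigma2) / 2.0 + np.sqrt(g(lam))           # equation (12)
        return theta

With σ² → ∞ the prior becomes non-informative and the result approaches the maximum likelihood estimate c / Σ_l c_l, matching the remark above.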

5 Experiments

To show its applicability, we experimentally evaluated our approach. For a comparative analysis the use of publicly available datasets is an important aspect. On that account, the database of handwritten Latin letters [7] and the demanding Mammals database [9] are used. The diversity of the two databases allows us to assess the general performance gain of our method independently of the classification task.

Evaluation criteria are the unbiased average recognition rate of the whole classification task and the single recognition rate of the new class. We perform a Monte Carlo analysis by randomly selecting f training examples of the new class, which leaves the rest for testing. To estimate recognition rates for a fixed value of f we average the results of multiple runs. This also averages out the influence of the highly randomized nature of our base classifier.
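
This protocol can be sketched as follows (an illustration only; `train_and_eval` and the number of runs are placeholders, not taken from the paper):

    import numpy as np

    def monte_carlo_rate(train_and_eval, num_new_class_examples, f, runs=20, seed=None):
        """Average recognition rate over several random splits: f examples of the
        new class are used for training, the remaining ones for testing."""
        rng = np.random.default_rng(seed)
        rates = []
        for _ in range(runs):
            idx = rng.permutation(num_new_class_examples)
            train_idx, test_idx = idx[:f], idx[f:]
            rates.append(train_and_eval(train_idx, test_idx))
        return float(np.mean(rates))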

Our experimental evaluation aims at analyzing the gain of our transfer learning approach compared to a standard ERT classifier. We do not focus on the development of new feature types suitable for a specific recognition task. For this reason our choice of features is quite standard and not optimized (sections 5.1 and 5.2).

The variance σ² of the CGP is an important parameter of our method. It controls the influence of the prior and therefore indirectly our assumption about how closely the new class is related to the support classes. An optimal value for σ² could be obtained by cross-validation, but we fixed the value to 10⁻⁵ in all experiments.

Another issue is the selection of support classes from all available classes. Our main assumption in equation (2) suggests that those categories have to share common features, shape, appearance or context (section 5.3). Automatically estimating class similarity would be optimal to provide support class subsets. However, this leads to a vicious circle, because one would have to estimate models in advance. Therefore we select support classes manually.

We build an ensemble of 10 decision trees without limits on the depth or the example counts within each node. During training we test a maximum of 500 random features with 15 thresholds drawn from a uniform distribution at each node.
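
For readers who want an approximately comparable baseline, a similar forest can be configured with scikit-learn's ExtraTreesClassifier. This is only a rough stand-in (not the setup used for the reported numbers), since scikit-learn does not expose the "500 random features with 15 thresholds per node" search described above.

    from sklearn.ensemble import ExtraTreesClassifier

    # Rough stand-in for the base ERT ensemble described above.
    forest = ExtraTreesClassifier(
        n_estimators=10,          # ensemble of 10 trees
        max_depth=None,           # no limit on tree depth
        min_samples_split=2,      # no limit on example counts per node
        max_features="sqrt",      # candidate features per split (no exact equivalent)
        class_weight="balanced",  # inverse class frequency weighting (cf. section 2)
    )
    # forest.fit(X_train, y_train); posteriors = forest.predict_proba(X_test)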

5.1 Latin Letters

The database of [7] is a collection of images containing handwritten Latin letters, resulting in 26 object categories. For each object class 59-60 images are provided.

Features  Images of this database are binary, so a very simple feature extraction method is used.

amur tiger, red panda, alpaca, hippopotamus, baboon, tapir, sea lion,
black lemur, hyena, asian elephant, caracal, opossum, beluga whale,
siberian tiger, ocelot, hartebeest, cape buffalo, moose, african elephant,
ferret, llama, howler monkey, bighorn sheep, marmot, bengal tiger,
african lion, lemur, white rhinoceros

Table 1: Categories used for experiments with the Mammals database [9]; the class with very few training examples is sea lion, and the support classes are african lion, beluga whale, hyena and hippopotamus.

We divide the whole image into an equally spaced w_x × w_y grid. In each cell of the grid, the ratio of black pixels to all pixels within the cell is used as a single feature. This leads to a feature vector with w_x · w_y dimensions. In all experiments we used the values w_x = 8 and w_y = 12.
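
A minimal sketch of this feature extraction, assuming the binary image is a 2D NumPy array in which black pixels are encoded as 1; the assignment of w_x and w_y to image width and height is an assumption.

    import numpy as np

    def grid_features(binary_image, wx=8, wy=12):
        """Ratio of black pixels per cell of an equally spaced wx-by-wy grid (section 5.1)."""
        h, w = binary_image.shape
        ys = np.linspace(0, h, wy + 1, dtype=int)  # wy cells along the image height
        xs = np.linspace(0, w, wx + 1, dtype=int)  # wx cells along the image width
        feats = [binary_image[ys[j]:ys[j + 1], xs[i]:xs[i + 1]].mean()
                 for j in range(wy) for i in range(wx)]
        return np.asarray(feats)  # wx * wy = 96 dimensions for wx = 8, wy = 12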

5.2 Mammals Database

The Mammals database of Fink and Ullman [9] consists of over 45000 images categorized into 409 animal types. Categorization on this database is very difficult because of the large intraclass variability, the included art drawings, and the mix of head and body images. In all our experiments we used a subset of the object categories in the database (Table 1).

The database was specifically collected to push progress in the area of object recognition using the interclass transfer paradigm. Many object categories share common features due to evolutionary relationships. We selected support categories with rather generic properties shared with the new class "sea lion". Therefore our results may provide a lower bound on the performance gain that can be achieved with very similar support classes. It is important to note that we did not manually align or flip images, nor did we use bounding box information.

Features  Due to large intraclass variations, global features are not suitable for this classification task. Therefore we build upon the work of [16]. Standard SIFT descriptors d^I_k are evaluated at several random locations within the image I (e.g. g = 1000 positions). All descriptors are used to train the classifier. Classification of an image I is done by simply averaging all descriptor posterior probabilities:

    p(\Omega_i \mid I) = \frac{1}{g} \sum_{k=1}^{g} p(\Omega_i \mid d^I_k) .   (16)
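
A sketch of this classification rule, with the per-descriptor posterior treated as a placeholder callable (for example, the forest posterior from section 2):

    import numpy as np

    def classify_image(descriptor_posterior, descriptors):
        """Equation (16): average the posteriors of all descriptors of one image
        and return the most probable class together with the averaged posterior."""
        p = np.mean([descriptor_posterior(d) for d in descriptors], axis=0)
        return int(np.argmax(p)), p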

Figure 4: Results for the Latin Letters database of Fink et al. [7]: recognition rates of our classifier and the standard approach applied to handwritten Latin letters with few training examples for the letter "e" (support classes: "a", "c", "d", "b"; 30 training examples used for all other classes). The two plots show the average recognition rate of all classes and the recognition rate of "e", each as a function of the number of training examples used for class "e". Note that due to the high randomization, error bars display 0.25σ ranges.

The features are based on gray values; no color information is used. We also tested a codebook approach [18] with a codebook of size 1000 obtained by online k-means. This method only reached an average recognition rate of about 17%, compared to 22% achieved by the method of [16].

5.3 Transferring context information

The success of orderless bag-of-features (BoF) approaches as a standard method for object classification can be traced back to two aspects. First of all, geometric models are difficult to learn and current methods are not yet robust enough to handle large intraclass variation; thus, orderless methods provide more robust classifiers. Another advantage of BoF methods, which seems to be the most important one, is the indirect use of context information. Background and foreground features are used equally in all steps of the training and classification process. For this reason, a huge amount of background features influences a BoF method.

Our method naturally transfers common features, which can be object-specific or contextual, to a new class. For example, it could transfer the knowledge of a typical desert-like background from a camel class to a dromedary class.

5.4 Evaluation

Results of the comparison between our approach and the standard ERT classifier [10] are presented in Figures 4 and 5. Both the average recognition rate of the whole classification task and the recognition rate of the new class are plotted as a function of the number of training examples. Our combined generative/discriminative transfer learning approach outperforms the standard ERT classifier when using very few training examples. This shows that improving the recognition of the new class does not reduce the classification performance of the other classes on average.

When training with a single example of the Latin letter "e", a recognition rate of 35.71% was reached on average, compared to 11.93% for the standard ERT classifier. Using two training examples, the gap even increases, with 60% for our approach compared to 31% for the standard one.

An important aspect of our method can be seen in both experiments. Beyond a certain number of training images for the new class, MAP estimation no longer improves the average recognition rate of the whole classification task, although it still increases the classification accuracy of the new class. Thus, the performance on the other classes decreases due to confusion with the new class. This can be traced back to the fact that the prior has too much weight in spite of a very informative likelihood.


Figure 5: Results for the very difficult Mammals database [9] with 28 categories: recognition rates of our classifier and the standard approach applied to images of mammals with few training examples for the class "sea lion" (support classes: african lion, beluga whale, hyena, hippopotamus; 50 training examples used for all other classes). The two plots show the average recognition rate of all classes and the recognition rate of "sea lion", each as a function of the number of training examples used for class "sea lion". Note that due to the high randomization, error bars display 0.25σ ranges.

6 Conclusion

We argue that learning with few examples can be solved by incorporating prior knowledge of related classes (interclass transfer paradigm). For this reason, a combined discriminative/generative classifier was presented that uses interclass relationships to support object classes with few training examples.

As a first step, a discriminative randomized decision tree classifier is learned from all classes. The key concept of our approach is a subsequent MAP estimation of the leaf probabilities within each tree for one class with a weak training representation. The Bayesian formulation allows us to infer knowledge from related classes as a prior distribution, which requires a suitable parametric model. Therefore we introduced a new family of prior distributions for a multinomial distribution, a Gaussian constrained to the simplex. The resulting MAP estimation leads to easily solvable equations without the need for complex nonlinear optimization techniques.

Experiments performed on two publicly available datasets (the Latin letters of [7] and the Mammals database [9]) show how our method, applied to a classification task, significantly boosts the recognition rate of one class with very few examples. In the case of the Latin letters database we were able to improve recognition using a single training example of the class "e" by about 23 percentage points. The experimental evaluation using the Mammals database shows that our method is able to reach a gain of up to 19 percentage points using two training examples within a very difficult classification task.

7 Further Work

Our compact model of a prior for multinomial distributions can be applied beyond our hybrid classifier. Thus, we are interested in determining in which situations a CGP is more efficient and suitable than a common Dirichlet distribution.

One issue of our algorithm is of course the manually selected variance of the prior, which controls the influence of the related classes. Therefore another question of interest is whether it is possible to estimate σ automatically, using all the information about the classification task that can be extracted from the training images.

Acknowledgments  We would like to thank ROBOT Visual Systems GmbH for financial support and all of our colleagues for valuable comments.

References

[1] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 17–24, New York, NY, USA, 2007. ACM Press.

[2] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.

[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.

[4] Li Fei-Fei. Knowledge transfer in learning to recognize visual objects classes. In Proceedings of the International Conference on Development and Learning (ICDL), 2006.

[5] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594, 2006.

[6] Andras Ferencz, Erik G. Learned-Miller, and Jitendra Malik. Building a classification cascade for visual identification from one example. In ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision, volume 1, pages 286–293, Washington, DC, USA, 2005. IEEE Computer Society.

[7] Michael Fink. Object classification from a single example utilizing class relevance pseudo-metrics. In Advances in Neural Information Processing Systems, volume 17, pages 449–456. The MIT Press, 2004.

[8] Michael Fink and Pietro Perona. Mutual boosting for contextual inference. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schoelkopf, editors, Advances in Neural Information Processing Systems. MIT Press, 2003.

[9] Michael Fink and Shimon Ullman. From aardvark to zorro: A benchmark for mammal image classification. International Journal of Computer Vision, 77(1-3):143–156, 2008.

[10] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[11] D. Hoiem, C. Rother, and J. Winn. 3D LayoutCRF for multi-view object class recognition and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1–8, 2007.

[12] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Geometric context from a single image. In International Conference on Computer Vision (ICCV), volume 1, pages 654–661. IEEE, October 2005.

[13] Alex D. Holub, Max Welling, and Pietro Perona. Hybrid generative-discriminative visual categorization. International Journal of Computer Vision, 77(1-3):239–258, 2008.

[14] Susan S. Jones and Linda B. Smith. The place of perception in children's concepts. Cognitive Development, 8:113–139, April-June 1993.

[15] Vincent Lepetit, Pascal Lagger, and Pascal Fua. Randomized trees for real-time keypoint recognition. In CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 775–781, Washington, DC, USA, 2005. IEEE Computer Society.

[16] Raphael Maree, Pierre Geurts, Justus Piater, and Louis Wehenkel. Random subwindows for robust image classification. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 34–40. IEEE, June 2005.

[17] Erik G. Miller, Nicholas E. Matsakis, and Paul A. Viola. Learning from one example through shared densities on transforms. In CVPR, volume 1, pages 464–471, Los Alamitos, CA, USA, 2000. IEEE Computer Society.

[18] Eric Nowak, Frederic Jurie, and Bill Triggs. Sampling strategies for bag-of-features image classification. In Ales Leonardis, Horst Bischof, and Axel Pinz, editors, European Conference on Computer Vision, volume 3954 of Lecture Notes in Computer Science, pages 490–503. Springer, 2006.

[19] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Learning visual representations using images with captions. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1–8, Minneapolis, Minnesota, USA, June 2007. IEEE Computer Society.

[20] Jamie Shotton, Matthew Johnson, and Roberto Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR 2008, 2008. Accepted for publication.

[21] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854–869, 2007.