
Discriminative Transfer Learning with Tree-based Priors

Nitish Srivastava [email protected]

Department of Computer Science, University of Toronto.

Ruslan Salakhutdinov [email protected]

Department of Statistics and Computer Science, University of Toronto.

Abstract

This paper proposes a way of improving classification performance for classes which have very few training examples. The key idea is to discover classes which are similar and transfer knowledge among them. Our method organizes the classes into a tree hierarchy. The tree structure can be used to impose a prior over classification parameters. We show that these priors can be combined with discriminative models such as deep neural networks. Our method benefits from the power of discriminative training of deep neural networks, while at the same time using tree-based priors over the classification parameters. We also propose an algorithm for learning the underlying tree structure. This gives the model some flexibility to tune the tree so that it is pertinent to the task being solved. We show that the model can transfer knowledge across related classes using fixed trees. Moreover, it can learn new meaningful trees, usually leading to improved performance. Our method achieves state-of-the-art classification results on the CIFAR-100 image dataset and the MIR Flickr multimodal dataset.

1. Introduction

Learning classifiers that generalize well is a hard problem when only a few training examples are present. For example, if we had only 5 images of a cheetah, it would be hard to train a classifier to be good at distinguishing cheetahs against hundreds of other classes. Any powerful enough machine learning model would severely overfit the few examples, unless it is held back by strong regularizers. This typically results in poor generalization performance. This paper is based on the idea that performance can be improved using the natural structure inherent in the set of classes. For example, we know that cheetahs are related to tigers, lions, jaguars and leopards. Having labeled examples from these related classes should make the task of learning from 5 cheetah examples much easier. Knowing the class structure should allow us to borrow "knowledge" from relevant classes so that only the distinctive features specific to cheetahs need to be learned. At the very least, the model should confuse cheetahs with these animals rather than with completely unrelated classes, such as cars or lamps. In other words, there is a need to develop methods of transferring knowledge from related tasks towards learning a new task. In our endeavour to scale machine learning algorithms towards AI, it is imperative that we have good ways of transferring knowledge across related problems.

Presented at the International Conference on Machine Learning (ICML) workshop on Challenges in Representation Learning, Atlanta, Georgia, USA, 2013. Copyright 2013 by the author(s).

Finding relatedness is also a hard problem. This is because, in the absence of any prior knowledge, in order to know which classes are related, we need to have a good model for each class. But to have a good model for each class, we need to know which classes are related. This creates a cyclic dependency. One way to circumvent it is to use an external knowledge source, such as a human, to specify the class structure by hand. However, this might be inconvenient to do for every classification problem that comes up. It is also hard to scale to tens of thousands of labels. It would be more elegant to directly learn this structure from data.

This paper proposes a way of learning class structure and classifier parameters in the context of deep neural networks, with the aim of improving classification accuracy for classes with few examples. Deep neural networks trained discriminatively with backpropagation have been shown to achieve state-of-the-art performance on hard classification problems with large amounts of labeled data (Bengio & LeCun, 2007; Krizhevsky et al., 2012; Lee et al., 2009; Dean et al., 2012).

Our model augments neural networks with a generative prior over the weights in the last layer. The prior is structured so that related classes share the same prior. As the number of examples grows, the effect of this prior diminishes and the model approaches a standard neural network. If there are only a few examples for a particular class, the prior has a strong effect, driving the classification parameters for that class close to the shared prior. The shared prior captures the features that are common across all members of that superclass. Therefore, a class with few examples can access this information just by virtue of being a member of the superclass. In this way, information is transferred among related classes.

Learning a hierarchical structure over classes has been extensively studied in the machine learning, statistics, and vision communities. A large class of models based on hierarchical Bayesian modeling has been used for transfer learning (Shahbaba & Neal, 2007; Evgeniou & Pontil, 2004; Sudderth et al., 2008; Canini & Griffiths, 2009). The hierarchical topic model over image features of Bart et al. (2008) can discover visual taxonomies in an unsupervised fashion from large datasets, but was not designed for rapid learning of new categories. Fei-Fei et al. (2006) also developed a hierarchical Bayesian model for visual categories, with a prior on the parameters of new categories that was induced from other categories. However, their approach is not well-suited as a generic approach to transfer learning, as they learned a single prior shared across all categories. Recently introduced Hierarchical-Deep models combine hierarchical non-parametric Bayesian models with Deep Boltzmann Machines (Salakhutdinov et al., 2011a), which allows one to represent both a layered hierarchy of features and a tree-structured hierarchy of super-classes.

However, almost all of the above-mentioned models are generative by nature. These models typically resort to MCMC approaches for inference, which are hard to scale to large datasets, and they tend to perform worse than discriminative approaches, particularly as the number of labeled examples increases.

Most similar to our work is that of Salakhutdinov et al. (2011b), which defined a generative prior over the classifier parameters and a prior over tree structures to identify relevant categories. However, that work focused on a very specific object detection task and used an SVM model with pre-defined HOG features as its input. As we argue in this paper, committing to a-priori defined feature representations, instead of learning them from data, can be detrimental. This is especially important when learning complex tasks, as it is often difficult to hand-craft high-level features explicitly in terms of raw sensory input. As we show in our experimental results, learning (1) low-level features that abstract from the raw high-dimensional sensory input (e.g. pixels), (2) high-level class-specific features, and (3) a hierarchy of classes for sharing knowledge among related classes is crucial for rapid learning of new categories from few examples, while at the same time building on, and not disrupting, representations of old ones (Bart & Ullman, 2005; Biederman, 1995). This allows us to achieve state-of-the-art results on two challenging real-world datasets, CIFAR-100 and MIR Flickr, which contain many classes and thousands of examples.

2. Model Description

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ be a set of $N$ data points and $\mathcal{Y} = \{y_1, y_2, \ldots, y_N\}$ be the set of corresponding labels, where each label $y_i$ is a $K$-dimensional vector of targets. These targets could be binary, one-of-$K$, or real-valued. In our setting, it is useful to think of each $x_i$ as an image and $y_i$ as a one-of-$K$ encoding of the label. Our model is a multi-layer neural network (see Fig. 1(a)). Let $w$ denote the set of all parameters of this network (weights and biases for all the layers), excluding the top-level weights, which we denote separately as $\beta \in \mathbb{R}^{D \times K}$. Here $D$ is the number of hidden units in the last hidden layer. The conditional distribution over $\mathcal{Y}$ can be expressed as

$$P(\mathcal{Y}|\mathcal{X}) = \int_{w,\beta} P(\mathcal{Y}|\mathcal{X}, w, \beta)\, P(w)\, P(\beta)\, dw\, d\beta. \qquad (1)$$

In general, this integral will be intractable, and we would typically resort to MAP estimation to determine the values of the model parameters $w$ and $\beta$ that maximize

$$\log P(\mathcal{Y}|\mathcal{X}, w, \beta) + \log P(w) + \log P(\beta).$$

Here, $\log P(\mathcal{Y}|\mathcal{X}, w, \beta)$ is the log-likelihood function and the other terms are priors over the model's parameters. A typical choice of prior is a Gaussian distribution with diagonal covariance. For example,

$$\beta_k \sim \mathcal{N}\!\left(0, \tfrac{1}{\lambda} I_D\right), \quad \forall k \in \{1, \ldots, K\}.$$

Here $\beta_k$ denotes the classifier parameters for class $k$. Note that this prior assumes that each $\beta_k$ is independent of all the other $\beta_i$'s. In other words, a priori, the weights for label $k$ are not related to any other label's weights.


Figure 1. Our model: a deep neural network with tree-based priors over the classification parameters. (a) The network maps the input $x$ through low-level feature layers (parameters $w$) to high-level features $f_w(x)$, which the $D \times K$ top-level weights $\beta$ (one vector per class, e.g. $\beta_{car}$, $\beta_{cat}$) turn into predictions $y$ over the $K$ classes. (b) The prior over $\beta$: class weights such as $\beta_{car}$, $\beta_{truck}$, $\beta_{cat}$, $\beta_{dog}$ are tied to superclass parameters $\theta_{vehicle}$ and $\theta_{animal}$.

This is a reasonable assumption when nothing is known about the labels, and it works quite well for most applications with a large number of labeled examples per class. However, if we know that the classes are closely related to one another, priors which respect these relationships may be more suitable. Such priors are crucial for classes that only have a handful of training examples, since the effect of the prior is more pronounced there. In this work, we focus on developing such a prior.
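As a point of reference (a standard identity stated here for clarity, not taken from the original text), MAP estimation under the independent Gaussian prior above reduces to ordinary weight decay:

$$-\log \mathcal{N}\!\left(\beta_k \,\middle|\, 0, \tfrac{1}{\lambda} I_D\right) = \frac{\lambda}{2}\|\beta_k\|^2 + \text{const},$$

so the standard training objective is the negative log-likelihood plus an $L_2$ penalty pulling each $\beta_k$ towards zero. The tree-based prior developed below replaces this pull towards zero with a pull towards a shared superclass parameter.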

2.1. Learning With a Fixed Tree Hierarchy

Let us first assume that the classes have been organized into a fixed tree hierarchy which is available to us. We will relax this assumption later by placing a hierarchical non-parametric prior over tree structures. For simplicity, consider a two-level hierarchy, as shown in Fig. 1(b). There are $K$ leaf nodes corresponding to the $K$ classes. They are connected to $S$ super-classes which group together similar basic-level classes. Leaf node $k$ is associated with a weight vector $\beta_k \in \mathbb{R}^D$. Each super-class node $s$ is associated with a vector $\theta_s \in \mathbb{R}^D$, $s = 1, \ldots, S$. We define the following generative model for $\beta$:

$$\theta_s \sim \mathcal{N}\!\left(0, \tfrac{1}{\lambda_1} I_D\right), \qquad \beta_k \sim \mathcal{N}\!\left(\theta_{\mathrm{parent}(k)}, \tfrac{1}{\lambda_2} I_D\right). \qquad (2)$$

This prior expresses relationships between classes. For example, it asserts that $\beta_{car}$ and $\beta_{truck}$ are both deviations from $\theta_{vehicle}$. Similarly, $\beta_{cat}$ and $\beta_{dog}$ are deviations from $\theta_{animal}$. Eq. 1 can now be re-written to include $\theta$ as follows:

$$P(\mathcal{Y}|\mathcal{X}) = \int_{w,\beta,\theta} P(\mathcal{Y}|\mathcal{X}, w, \beta)\, P(w)\, P(\beta|\theta)\, P(\theta)\, dw\, d\beta\, d\theta. \qquad (3)$$

We can perform MAP inference to determine the values of $\{w, \beta, \theta\}$ that maximize

$$\log P(\mathcal{Y}|\mathcal{X}, w, \beta) + \log P(w) + \log P(\beta|\theta) + \log P(\theta).$$

In terms of a loss function, we wish to minimize

$$L(w, \beta, \theta) = -\log P(\mathcal{Y}|\mathcal{X}, w, \beta) - \log P(w) - \log P(\beta|\theta) - \log P(\theta) \qquad (4)$$

$$= -\log P(\mathcal{Y}|\mathcal{X}, w, \beta) + \frac{\lambda_2}{2}\|w\|^2 + \frac{\lambda_2}{2}\sum_{k=1}^{K}\|\beta_k - \theta_{\mathrm{parent}(k)}\|^2 + \frac{\lambda_1}{2}\|\theta\|^2. \qquad (5)$$

Note that by fixing the value of $\theta = 0$, this loss function recovers our standard loss function. The choice of normal distributions in Eq. 2 leads to a nice property: maximization over $\theta$ given $\beta$ can be done in closed form. It just amounts to taking a (scaled) average of all the $\beta_k$'s which are children of $\theta_s$. Let $C_s = \{k \mid \mathrm{parent}(k) = s\}$; then

$$\theta^*_s = \frac{1}{|C_s| + \lambda_1/\lambda_2} \sum_{k \in C_s} \beta_k. \qquad (6)$$
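For completeness, Eq. 6 follows directly from Eq. 5 (a short derivation added here for clarity): setting the gradient with respect to $\theta_s$ to zero gives

$$\frac{\partial L}{\partial \theta_s} = -\lambda_2 \sum_{k \in C_s} (\beta_k - \theta_s) + \lambda_1 \theta_s = 0 \;\;\Rightarrow\;\; \left(|C_s| + \lambda_1/\lambda_2\right)\theta_s = \sum_{k \in C_s} \beta_k,$$

which is exactly the scaled average above.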

Therefore, the loss function in Eq. 5 can be optimized by iteratively performing the following two steps. In the first step, we maximize over $w$ and $\beta$ keeping $\theta$ fixed. This can be done using standard stochastic gradient descent (SGD). Then we maximize over $\theta$ keeping $\beta$ fixed. This can be done in closed form using Eq. 6. In practical terms, the second step is almost instantaneous and only needs to be performed after every $T$ gradient descent steps, where $T$ is around 10-100. Therefore, learning is almost identical to standard gradient descent. Most importantly, it allows us to exploit the structure over labels at a very nominal cost in terms of computational time. Note that the model is discriminative in nature.


1: Given: X, Y, classes K, superclasses S, initial z, L, M.
2: Initialize w, β.
3: repeat
4:    // Optimize w, β with fixed z.
5:    w, β ← SGD(X, Y, w, β, z) for L steps.
6:    // Optimize z, β with fixed w.
7:    RandomPermute(K)
8:    for k in K do
9:       for s in S ∪ {s_new} do
10:         z_k ← s
11:         β^s ← SGD(f_w(X), Y, β, z) for M steps.
12:      end for
13:      s' ← ChooseBestSuperclass(β^1, β^2, ...)
14:      β ← β^{s'}, z_k ← s', S ← S ∪ {s'}
15:   end for
16: until convergence

Figure 2. Procedure for learning the tree. The illustration shows a leaf node k = van (alongside existing leaves car, truck, cat and dog) being proposed as a child of s = vehicle, s = animal, or a new superclass s_new.

As the number of examples per class increases, the effect of the generative prior is attenuated and we recover the standard neural network model.
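The following NumPy sketch (ours, not the authors' code) illustrates the alternating scheme for a generic network: gradient updates on (w, β) that include the tree-prior penalty of Eq. 5, interleaved with the closed-form update of Eq. 6 every T steps. The gradient callables are hypothetical stand-ins for backpropagation through the network.

    import numpy as np

    def update_theta(beta, parent, num_super, lam1, lam2):
        # Closed-form update of Eq. 6: each theta_s is a scaled average of its children's beta_k.
        theta = np.zeros((num_super, beta.shape[1]))
        for s in range(num_super):
            children = np.where(parent == s)[0]
            if len(children) > 0:
                theta[s] = beta[children].sum(axis=0) / (len(children) + lam1 / lam2)
        return theta

    def train(grad_w, grad_beta, w, beta, parent, num_super,
              lam1=1.0, lam2=0.1, lr=0.01, T=100, num_steps=1000):
        # grad_w / grad_beta return gradients of -log P(Y|X,w,beta) (from backprop).
        theta = update_theta(beta, parent, num_super, lam1, lam2)
        for step in range(num_steps):
            w = w - lr * (grad_w(w, beta) + lam2 * w)
            beta = beta - lr * (grad_beta(w, beta) + lam2 * (beta - theta[parent]))
            if (step + 1) % T == 0:
                theta = update_theta(beta, parent, num_super, lam1, lam2)
        return w, beta, theta

    # Toy usage with dummy quadratic gradients, 4 classes in 2 superclasses:
    parent = np.array([0, 0, 1, 1])
    w, beta, theta = train(lambda w, b: w, lambda w, b: b,
                           np.zeros(3), np.random.randn(4, 3), parent, num_super=2)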

2.2. Learning the Tree Hierarchy

So far we have assumed that our model is presented with a fixed tree hierarchy. Now we show how the tree structure can be learned during training. Let $z$ be a $K$-length vector that specifies the tree structure, that is, $z_k = s$ indicates that class $k$ is a child of superclass $s$. We place a non-parametric Chinese Restaurant Process (CRP) prior over $z$. This prior $P(z)$ gives the model the flexibility to have any number of superclasses, and it makes certain tree structures more likely than others. The CRP prior extends a partition of $k$ classes to a new class by adding the new class either to one of the existing superclasses or to a new superclass. The probability of adding it to superclass $s$ is $\frac{c_s}{k + \gamma}$, where $c_s$ is the number of children of superclass $s$. The probability of creating a new superclass is $\frac{\gamma}{k + \gamma}$. In essence, the prior prefers to add a new node to an existing large superclass instead of spawning a new one. The strength of this preference is controlled by $\gamma$.
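For concreteness, the CRP assignment probabilities described above can be computed as in the following sketch (our illustration; the variable names are not from the paper):

    def crp_probs(children_counts, gamma):
        # children_counts[s] = number of classes currently under superclass s.
        # Returns the probability of joining each existing superclass,
        # followed by the probability of creating a new superclass.
        k = sum(children_counts)
        probs = [c / (k + gamma) for c in children_counts]
        probs.append(gamma / (k + gamma))
        return probs

    print(crp_probs([3, 1], gamma=1.0))   # [0.6, 0.2, 0.2]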

Equipped with the CRP prior over z, the conditionalover Y takes the following form

$$P(\mathcal{Y}|\mathcal{X}) = \sum_z \left( \int_{w,\beta,\theta} P(\mathcal{Y}|\mathcal{X}, w, \beta)\, P(w)\, P(\beta|\theta, z)\, P(\theta)\, dw\, d\beta\, d\theta \right) P(z). \qquad (7)$$

MAP inference in this model leads to the following optimization problem:

$$\max_{w,\beta,\theta,z}\; \log P(\mathcal{Y}|\mathcal{X}, w, \beta) + \log P(w) + \log P(\beta|\theta, z) + \log P(\theta) + \log P(z). \qquad (8)$$

Maximization over $z$ is problematic because the domain of $z$ is a huge discrete set. However, it can be approximated using a simple and parallelizable search procedure.

We first initialize the tree sensibly. This can be done by hand or by extracting a semantic tree from WordNet (Miller, 1995). Let the number of superclasses in the tree be $S$. We optimize over $\{w, \beta, \theta\}$ for a few steps using this tree. Then a leaf node is picked uniformly at random from the tree and $S+1$ tree proposals are generated as follows. $S$ proposals are generated by attaching this leaf node to each of the $S$ superclasses. One additional proposal is generated by creating a new superclass and attaching the label to it. This process is shown in Fig. 2. We then re-estimate $\{\beta, \theta\}$ for each of these $S+1$ trees for a few steps, which can be easily parallelized on a GPU. The best tree is then picked using a validation set. This process is repeated by picking another node and again trying all possible locations for it. After all labels have been exhausted, we start the process all over again, using the learned tree in place of the given tree. In practical terms, this amounts to interrupting the stochastic gradient descent after every 10,000 steps to find a better tree.
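A simplified, sequential sketch of this search (ours; in the paper the S+1 proposals are evaluated in parallel on a GPU, and finetune_top_layer and validation_accuracy below are hypothetical stand-ins for the short SGD runs and the validation check):

    import random

    def refine_tree(z, num_super, finetune_top_layer, validation_accuracy):
        # z[k] = superclass index of class k; index num_super denotes a brand-new superclass.
        order = list(range(len(z)))
        random.shuffle(order)
        for k in order:
            best_s, best_acc = z[k], float('-inf')
            for s in range(num_super + 1):              # S existing superclasses + 1 new one
                candidate = list(z)
                candidate[k] = s
                beta = finetune_top_layer(candidate)        # re-estimate beta, theta for a few steps
                acc = validation_accuracy(beta, candidate)  # pick the best tree on a validation set
                if acc > best_acc:
                    best_s, best_acc = s, acc
            z[k] = best_s
            if best_s == num_super:                     # the new superclass was chosen
                num_super += 1
        return z, num_super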

3. Experiments on CIFAR-100

The CIFAR-100 dataset (Krizhevsky, 2009) consists of 32 × 32 color images belonging to 100 classes. These classes are grouped into 20 superclasses, each containing 5 classes. For example, the superclass fish contains aquarium fish, flatfish, ray, shark and trout; and the superclass flowers contains orchids, poppies, roses, sunflowers and tulips. Some examples from this dataset are shown in Fig. 3. There are 600 examples of each class, of which 500 are in the training set and the remaining 100 are in the test set. We preprocessed the images by applying global contrast normalization followed by ZCA whitening.
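A minimal NumPy sketch of this preprocessing (our interpretation of standard global contrast normalization and ZCA whitening; the regularization constants are illustrative, not the authors' exact values):

    import numpy as np

    def global_contrast_normalize(X, eps=1e-8):
        # X: (num_images, num_pixels). Center each image and rescale it to unit norm.
        X = X - X.mean(axis=1, keepdims=True)
        norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
        return X / np.maximum(norms, eps)

    def zca_whiten(X, eps=1e-2):
        # Decorrelate pixels using the training-set covariance; ZCA keeps images image-like.
        mean = X.mean(axis=0)
        Xc = X - mean
        cov = Xc.T.dot(Xc) / Xc.shape[0]
        U, S, _ = np.linalg.svd(cov)
        W = U.dot(np.diag(1.0 / np.sqrt(S + eps))).dot(U.T)
        return Xc.dot(W), mean, W

    # X_train: (N, 3072) array of flattened 32x32x3 CIFAR-100 images.
    # X_gcn = global_contrast_normalize(X_train)
    # X_white, mean, W = zca_whiten(X_gcn)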


Figure 3. Examples from CIFAR-100. Five randomly chosen examples from 8 of the 100 classes are shown, with rows labeled whale, dolphin; willow tree, oak tree; lamp, clock; leopard, tiger; ray, flatfish. Classes in each row belong to the same superclass.

Figure 4. Classification results on CIFAR-100. Left: test set classification accuracy (%) for the Baseline, Fixed Tree, and Learned Tree models as the number of training examples per class varies from 10 to 500. Right: per-class improvement (%) over the baseline for the different methods when trained on 10 examples per class, with classes sorted by improvement.

3.1. Model Architecture and Training Details

We used a convolutional neural network with 3 convolutional hidden layers followed by 2 fully connected hidden layers. All hidden units used a rectified linear activation function. Each convolutional layer was followed by a max-pooling layer. Dropout (Hinton et al., 2012) was applied to all the layers of the network, with the probability of retaining a hidden unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going from the input to the convolutional layers to the fully connected layers). Max-norm regularization (Hinton et al., 2012) was used for the weights in both the convolutional and fully connected layers. The initial tree was chosen based on the superclass structure given in the dataset. We learned a tree using the algorithm in Fig. 2 with L = 10,000 and M = 100. The final learned tree is provided in the Appendix.

3.2. Experiments on Few Examples per Class

In our first set of experiments, we work in a scenario where each class has very few examples. The aim is to assess whether the proposed model allows related classes to borrow information from each other. For a baseline, we use a standard convolutional neural network with the same architecture as our model. This is an extremely strong baseline and already achieves state-of-the-art results, outperforming all previously reported results on this dataset, as shown in Table 1. We create 5 subsets of the data by randomly choosing 10, 25, 50, 100 and 250 examples per class, and train three models on each subset: the baseline, our model with the given tree structure, which we call FixedTree, and our model with a learned tree structure, which we call LearnedTree.

The test set performance of these models is compared in Fig. 4(a). Observe that if the number of examples per class is small, the FixedTree model already provides a significant improvement over the baseline. Learning the tree structure further improves the results. For example, when training with 10 examples per class, the baseline model gets 12.81% accuracy, the FixedTree model improves this to 16.29%, and LearnedTree further improves this to 18.52%. Thus the model gets almost 50% relative improvement when few examples are available. As the number of examples increases, the relative improvement decreases.


Method                                                   Test Accuracy %
Conv Net + max pooling                                   56.62 ± 0.03
Conv Net + stochastic pooling (Zeiler & Fergus, 2013)    57.49
Conv Net + maxout (Goodfellow et al., 2013)              61.43
Conv Net + max pooling + dropout (Baseline)              62.80 ± 0.08
Baseline + Fixed Tree                                    61.70 ± 0.06
Baseline + Learned Tree                                  63.15 ± 0.15

Table 1. Comparison of classification results on CIFAR-100.

Figure 5. Results on CIFAR-100 with few examples for the dolphin class. Left: test set classification accuracy (%) for the Baseline, Fixed Tree, and Learned Tree models as the number of training cases for dolphin varies from 5 to 500. Right: accuracy when classifying a dolphin as whale or shark is also counted as correct (1 to 500 training cases for dolphin).

For 500 examples per class, the FixedTree model achieves a test accuracy of 61.70%, which is worse than the baseline's 62.80%. However, the LearnedTree model improves upon the baseline, achieving a classification accuracy of 63.15%. Table 1 compares the performance of our models trained on the full training set to other methods.

Next, we analyze the LearnedTree model to find the source of the improvements. We take the model trained on 10 examples per class and look at the classification accuracy separately for each class. The aim is to find which classes gain or suffer the most. Fig. 4(b) shows the improvement obtained by different classes over the baseline, where the classes are sorted by the value of the improvement. Observe that about 70 classes benefit to different degrees from learning a hierarchy for parameter sharing, whereas about 30 classes perform worse as a result of transfer. For the LearnedTree model, the classes which improve most are willow tree (+26%) and orchid (+25%). The classes which lose most from the transfer are ray (-10%) and lamp (-10%).

We hypothesize that the classes which gain a lot are those that are very similar to other classes within their superclass and therefore stand to gain the most from transferring knowledge. For example, the superclass for willow tree contains other trees, such as maple tree and oak tree. However, ray belongs to the superclass fish, which contains more typical examples of fish that are very dissimilar to it in appearance. With the fixed tree, such transfer was found to hurt performance (ray did worse by -29%). However, when the tree was learned, this class split away from the fish superclass to join a new superclass and did not suffer as much. Similarly, lamp was under household electrical devices along with keyboard and clock. Putting different kinds of electrical devices under one superclass makes semantic sense but does not help for visual recognition tasks. This highlights a key limitation of hierarchies based on semantic knowledge and advocates the need to learn the hierarchy so that it is relevant to the task at hand.

3.3. Experiments on Few Examples for One Class

In this set of experiments, we work in a scenario where there are lots of examples for different classes, but only a few examples of one particular class. The aim is to see whether the model transfers information from the other classes that it has learned to this "rare" class. We construct training sets by randomly drawing either 5, 10, 25, 50, 100, 250 or 500 examples from the dolphin class and all 500 training examples for the other 99 classes. We train the baseline, FixedTree and LearnedTree models on each of these datasets. The objective was kept the same as before and no special attention was paid to the dolphin class. Fig. 5(a) shows the test accuracy for correctly predicting the dolphin class. We see that transfer learning helped tremendously.


Figure 6. Some examples from the MIR-Flickr dataset. Each instance is an image along with user-assigned textual tags, and each image has multiple classes. The five examples shown carry classes such as (baby, female, people, portrait), (plant life, river, water), (clouds, sea, sky, transport, water), (animals, dog, food) and (clouds, sky, structures), with tags ranging from "claudia" and no text at all to "barco, pesca, boattosail, navegacao", "watermelon, hilarious, chihuahua, dog" and "colours, cores, centro, commercial, building".

For example, with 10 cases, the baseline gets 0% accuracy whereas the transfer learning model gets around 3%. Even with 250 cases, the LearnedTree model gives a significant improvement (31% to 34%).

In addition to performing well on the class with few examples, we would also want any errors to be sensible. To check whether this is indeed the case, we evaluate the performance of the above models treating the classification of a dolphin as shark or whale as also correct, since we believe these to be reasonable mistakes. Fig. 5(b) shows the classification accuracy under this assumption for the different models. Observe that the transfer learning methods provide significant improvements over the baseline. Even when we have just 1 example for dolphin, the accuracy jumps from 45% to 52%.

4. Experiments on MIR Flickr

The Multimedia Information Retrieval Flickr data set (Huiskes & Lew, 2008) consists of 1 million images collected from the social photography website Flickr along with their user-assigned tags. Among the 1 million images, 25,000 have been annotated using 38 labels. These labels include object categories, such as bird, tree and people, as well as scene categories, such as indoor, sky and night. Each image has multiple labels. Some examples are shown in Fig. 6.

We choose this dataset because it is different from CIFAR-100 in many ways. In the CIFAR-100 dataset, our model was trained using image pixels as input and each image belonged to only one class. MIR-Flickr, on the other hand, is a multimodal dataset for which we used standard computer vision image features and word counts as inputs. Each instance belongs to multiple classes. Therefore, the classification task is modeled as learning multiple logistic regressors, as opposed to a multinomial regression for CIFAR-100. The CIFAR-100 models used a multi-layer convolutional network initialized with random weights, whereas for this dataset we use a fully connected neural network initialized by unrolling a Deep Boltzmann Machine (DBM) (Salakhutdinov & Hinton, 2009). Moreover, instead of having the same number of examples per class, this dataset offers a more natural distribution where some classes occur more often than others. For example, sky occurs in over 30% of the instances, whereas baby occurs in fewer than 0.4%. The dataset also contains 975,000 unlabeled images, which we use for unsupervised training of the DBM. We use the publicly available features and train-test splits from (Srivastava & Salakhutdinov, 2012). Mean Average Precision (MAP) is used as the performance metric.
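For reference, Mean Average Precision over the 38 labels can be computed as in the following sketch (using scikit-learn's average_precision_score; the variable names are ours):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_average_precision(y_true, y_score):
        # y_true: (num_examples, num_labels) binary matrix of ground-truth labels.
        # y_score: matrix of predicted probabilities from the per-label logistic outputs.
        aps = [average_precision_score(y_true[:, j], y_score[:, j])
               for j in range(y_true.shape[1])]
        return float(np.mean(aps))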

4.1. Model Architecture and Training Details

In order to make our results directly comparable to (Srivastava & Salakhutdinov, 2012), we use the same network architecture as described therein. The authors of the dataset (Huiskes et al., 2010) provide a high-level categorization of the classes, which we use to create an initial tree. This tree structure and the one learned by our model are shown in the Appendix. We used the algorithm in Fig. 2 with L = 500 and M = 100.

4.2. Classification Results

For a baseline we use a Multimodal DBM model after finetuning it discriminatively with dropout. This model already achieves state-of-the-art results and represents a very strong baseline. The results of the experiment are summarized in Table 2. The baseline achieves a MAP of 0.641, whereas our model with a fixed tree improves this to 0.647. Learning the tree structure further pushes this up to 0.651. For this dataset, the learned tree was not significantly different from the given tree. Therefore, we expect the improvement from learning the tree to be marginal. However, the improvement over the baseline is significant, showing that in this case too, transferring information between related classes helps.

Looking closely at the source of the gains, we find that, similar to CIFAR-100, some classes gain and others lose, as shown in Fig. 7(a).


Figure 7. Results on MIR Flickr. Left (a, class-wise improvement): improvement in Average Precision over the baseline for the Fixed Tree and Learned Tree models, with classes sorted by improvement. Right (b, improvement vs. number of examples): improvement of the Learned Tree model over the baseline for each class, plotted against the fraction of test instances that contain that class. Each dot corresponds to a class. Classes with few examples (towards the left of the plot) usually get significant improvements.

Method                                                                      MAP
Logistic regression on Multimodal DBM (Srivastava & Salakhutdinov, 2012)    0.609
Multiple Kernel Learning SVMs (Guillaumin et al., 2010)                     0.623
Multimodal DBM + finetuning + dropout (Baseline)                            0.641 ± 0.004
Baseline + Fixed Tree                                                       0.648 ± 0.004
Baseline + Learned Tree                                                     0.651 ± 0.005

Table 2. Mean Average Precision obtained by different models on the MIR-Flickr data set.

It is encouraging to note that classes which occur rarely in the dataset improve the most. This can be seen in Fig. 7(b), which plots the improvement of the LearnedTree model over the baseline against the fraction of test instances that contain that class. For example, the average precision for baby, which occurs in only 0.4% of the test cases, improves from 0.173 (baseline) to 0.205 (learned tree). This class borrows from people and portrait, both of which occur very frequently. The performance on sky, which occurs in 31% of the test cases, stays the same.

5. Conclusion

We proposed a model that augments standard neural networks with generative priors over the classification parameters. These priors follow the hierarchical structure over classes and enable the model to transfer knowledge from related classes. We also proposed a way of learning the hierarchical structure. Experiments show that the model achieves state-of-the-art results on two challenging datasets.

References

Bart, E. and Ullman, S. Cross-generalization: Learning novel classes from a single example by feature replacement. In CVPR, pp. 672–679, 2005.

Bart, E., Porteous, I., Perona, P., and Welling, M. Unsupervised learning of visual taxonomies. In CVPR, pp. 1–8, 2008.

Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 2007.

Biederman, I. Visual object recognition. An Invitation to Cognitive Science, 2:152–165, 1995.

Canini, Kevin R. and Griffiths, Thomas L. Modeling human transfer learning with the hierarchical Dirichlet process. In NIPS 2009 workshop: Nonparametric Bayes, 2009.

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M. A., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, 2012.

Evgeniou, Theodoros and Pontil, Massimiliano. Regularized multi-task learning. In ACM SIGKDD, 2004.

Fei-Fei, Li, Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(4):594–611, April 2006.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. ArXiv e-prints, February 2013.

Guillaumin, M., Verbeek, J., and Schmid, C. Multimodal semi-supervised learning for image classification. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 902–909, June 2010.

Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

Huiskes, Mark J. and Lew, Michael S. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.

Huiskes, Mark J., Thomee, Bart, and Lew, Michael S. New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Multimedia Information Retrieval, pp. 527–536, 2010.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. MIT Press, 2012.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th International Conference on Machine Learning, pp. 609–616, 2009.

Miller, George A. WordNet: a lexical database for English. Commun. ACM, 38(11):39–41, November 1995. ISSN 0001-0782. doi: 10.1145/219717.219748. URL http://doi.acm.org/10.1145/219717.219748.

Salakhutdinov, R., Tenenbaum, J., and Torralba, A. Learning to learn with compound hierarchical-deep models. In NIPS. MIT Press, 2011a.

Salakhutdinov, R., Torralba, A., and Tenenbaum, J. Learning to share visual appearance for multiclass object detection. In CVPR, 2011b.

Salakhutdinov, R. R. and Hinton, G. E. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.

Shahbaba, Babak and Neal, Radford M. Improving classification when a class hierarchy is available using a hierarchy-based prior. Bayesian Analysis, 2(1):221–238, 2007.

Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep Boltzmann machines. In Bartlett, P., Pereira, F. C. N., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 2231–2239. MIT Press, 2012.

Sudderth, E. B., Torralba, A., Freeman, W. T., and Willsky, A. S. Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77(1-3):291–330, 2008.

Zeiler, Matthew D. and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013.


Appendices

A. Experimental Details for CIFAR-100

In this section we describe the experimental setup for CIFAR-100, including details of the architecture and the structures of the initial and learned trees.

A.1. Architecture and training details

We choose the architecture that worked best for the baseline model using a validation set. All the experiments reported in the paper for this dataset use this network architecture. It consists of three convolutional layers followed by 2 fully connected layers. Each convolutional layer is followed by a max-pooling layer. The convolutional layers have 96, 128 and 256 filters respectively. Each convolutional layer has a 5 × 5 receptive field applied with a stride of 1 pixel. Each max-pooling layer pools 3 × 3 regions at strides of 2 pixels. The two fully connected hidden layers have 2048 units each. All units use the rectified linear activation function. Dropout was applied to all the layers of the network with the probability of retaining a unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going from the input to the convolutional layers to the fully connected layers). In addition, the max-norm constraint with c = 4 was used for all the weights.
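A PyTorch sketch of this architecture (our reconstruction from the description above; padding, the exact placement of dropout, and the max-norm implementation are assumptions, and the Dropout arguments are drop probabilities, i.e. one minus the retention probabilities quoted in the text):

    import torch.nn as nn

    cifar100_net = nn.Sequential(
        nn.Dropout(p=0.1),                                        # input, retain 0.9
        nn.Conv2d(3, 96, kernel_size=5, stride=1, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Dropout(p=0.25),                                       # retain 0.75
        nn.Conv2d(96, 128, kernel_size=5, stride=1, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Dropout(p=0.25),                                       # retain 0.75
        nn.Conv2d(128, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Dropout(p=0.5),
        nn.Flatten(),
        nn.Linear(256 * 3 * 3, 2048), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(2048, 2048), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(2048, 100),   # top-level weights: the beta of Section 2
    )
    # The max-norm constraint (c = 4) would be enforced by clipping the norm of each
    # weight vector after every gradient update; it is not a built-in layer.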

Training the model on very small subsets of the data poses multiple problems. One of them is holding out a validation set. When training with 10 examples per class, we want to mimic a situation where we just have these 10 examples, and not a separate large validation set which has many more examples of this class. Therefore, we held out 30% of the training data as validation, even when only 10 training examples are present. In order to reduce noise, we repeated the experiment multiple times, selecting 7 random training cases and 3 random validation cases. Once we used the validation set to determine the hyperparameters, we combined it with the training set and trained the model down to the training set cross entropy that was obtained when the validation set was kept separate. This allows us to use the full training set, which is crucial in very small dataset regimes.
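This "retrain to a matched training cross-entropy" step can be sketched as follows (entirely our illustration; train_one_epoch and mean_cross_entropy are hypothetical helpers for one pass of SGD and for evaluating the average training cross-entropy):

    def retrain_to_target(model, full_train_set, target_cross_entropy,
                          train_one_epoch, mean_cross_entropy, max_epochs=500):
        # After hyperparameters are chosen on the held-out split, fold the validation
        # data back in and train until the training cross-entropy reaches the value
        # that was observed when the validation set was kept separate.
        for _ in range(max_epochs):
            train_one_epoch(model, full_train_set)
            if mean_cross_entropy(model, full_train_set) <= target_cross_entropy:
                break
        return model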

A.2. Initial and Learned Trees

The authors of the CIFAR-100 data set (Krizhevsky, 2009) divide the set of 100 classes into 20 superclasses. These classes are shown in Table 3. Using this tree as an initialization, we used our model to learn the tree.

The learned tree is shown in Table 4. This tree was learned with 50 examples per class. The superclasses do not have names since they were learned. We can see that the tree makes some sensible moves. For example, it creates a superclass for shark, dolphin and whale. It creates another superclass for worm-like creatures such as caterpillar, worm and snake; however, this class also includes snail, which is harder to explain. It includes television along with other furniture items, removing it from the household electrical devices superclass. However, some choices do not seem good. For example, clock is put together with bicycle and motorcycle. The only similarity between them is the presence of circular objects.

B. Experimental Details for MIR Flickr

In this section we describe the experimental setup for MIR Flickr, including details of the architecture and the structures of the initial and learned trees.

B.1. Architecture and training details

The model was initialized by unrolling a multimodal DBM. The DBM consisted of two pathways. The image pathway had 3857 input units, followed by 2 hidden layers of 1024 hidden units each. The text pathway had 2000 input units (corresponding to the top 2000 most frequent tags in the dataset), followed by 2 hidden layers of 1024 hidden units each. These two pathways were combined at the joint layer, which had 2048 hidden units. All hidden units were logistic. The DBM was trained using publicly available features and code. We additionally finetuned the entire model using dropout (p = 0.8 at all layers) and used that as our baseline. We split the 25,000 labeled examples into 10,000 for training, 5,000 for validation and 10,000 for test.
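A PyTorch sketch of the corresponding feed-forward network obtained after unrolling (our reconstruction; the DBM pretraining itself is not shown, and dropout is shown only at the joint layer for brevity):

    import torch
    import torch.nn as nn

    class MultimodalNet(nn.Module):
        def __init__(self, num_labels=38):
            super().__init__()
            # Image pathway: 3857 input features -> two hidden layers of 1024 logistic units.
            self.image_path = nn.Sequential(nn.Linear(3857, 1024), nn.Sigmoid(),
                                            nn.Linear(1024, 1024), nn.Sigmoid())
            # Text pathway: 2000 tag counts -> two hidden layers of 1024 logistic units.
            self.text_path = nn.Sequential(nn.Linear(2000, 1024), nn.Sigmoid(),
                                           nn.Linear(1024, 1024), nn.Sigmoid())
            # Joint layer of 2048 units on top of the concatenated pathways.
            self.joint = nn.Sequential(nn.Linear(2048, 2048), nn.Sigmoid(), nn.Dropout(p=0.2))
            # One logistic output per label; these weights play the role of beta.
            self.classifier = nn.Linear(2048, num_labels)

        def forward(self, image_feats, tag_counts):
            h = torch.cat([self.image_path(image_feats), self.text_path(tag_counts)], dim=1)
            return torch.sigmoid(self.classifier(self.joint(h)))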

B.2. Initial and Learned Trees

This dataset contains 38 classes, which can be categorized under a tree as shown in Table 5. The superclasses are based on the higher-level concepts specified in (Huiskes et al., 2010). The learned tree is shown in Table 6.


Superclass                        Classes
aquatic mammals                   dolphin, whale, seal, otter, beaver
fish                              aquarium fish, flatfish, ray, shark, trout
flowers                           orchid, poppy, rose, sunflower, tulip
food containers                   bottle, bowl, can, cup, plate
fruit and vegetables              apple, mushroom, orange, pear, sweet pepper
household electrical devices      clock, keyboard, lamp, telephone, television
household furniture               bed, chair, couch, table, wardrobe
insects                           bee, beetle, butterfly, caterpillar, cockroach
large carnivores                  bear, leopard, lion, tiger, wolf
large man made outdoor things     bridge, castle, house, road, skyscraper
large natural outdoor scenes      cloud, forest, mountain, plain, sea
large omnivores and herbivores    camel, cattle, chimpanzee, elephant, kangaroo
medium sized mammals              fox, porcupine, possum, raccoon, skunk
non insect invertebrates          crab, lobster, snail, spider, worm
people                            baby, boy, girl, man, woman
reptiles                          crocodile, dinosaur, lizard, snake, turtle
small mammals                     hamster, mouse, rabbit, shrew, squirrel
trees                             maple tree, oak tree, palm tree, pine tree, willow tree
vehicles 1                        bicycle, bus, motorcycle, pickup truck, train
vehicles 2                        lawn mower, rocket, streetcar, tank, tractor

Table 3. Given tree hierarchy for the CIFAR-100 dataset.

Superclass        Classes
superclass 1      dolphin, whale, shark
superclass 2      aquarium fish, trout, flatfish
superclass 3      orchid, poppy, rose, sunflower, tulip, butterfly
superclass 4      bottle, bowl, can, cup, plate
superclass 5      apple, mushroom, orange, pear, sweet pepper
superclass 6      keyboard, telephone
superclass 7      bed, chair, couch, table, wardrobe, television
superclass 8      bee, beetle, cockroach, lobster
superclass 9      bear, leopard, lion, tiger, wolf
superclass 10     castle, house, skyscraper, train
superclass 11     cloud, forest, mountain, plain, sea
superclass 12     camel, cattle, chimpanzee, elephant, kangaroo, dinosaur
superclass 13     fox, porcupine, possum, raccoon, skunk
superclass 14     snail, worm, snake, caterpillar, ray
superclass 15     baby, boy, girl, man, woman
superclass 16     crocodile, lizard, turtle
superclass 17     hamster, mouse, rabbit, shrew, squirrel, beaver, otter
superclass 18     maple tree, oak tree, palm tree, pine tree, willow tree
superclass 19     streetcar, bus, motorcycle, road
superclass 20     lawn mower, pickup truck, tank, tractor
superclass 21     bicycle, motorcycle, clock
superclass 22     crab, spider
superclass 23     bridge, seal
superclass 24     rocket
superclass 25     lamp

Table 4. Learned tree hierarchy for the CIFAR-100 dataset.


Superclass        Classes
animals           animals, bird, dog, bird*, dog*
people            baby*, people*, female*, male*, portrait*, baby, people, female, male, portrait
plants            flower, tree, plant life, flower*, tree*
sky               clouds, clouds*, sky
water             water, river, lake, sea, river*, sea*
time of day       night, night*, sunset
transport         transport, car, car*
structures        structures
food              food
indoor            indoor

Table 5. Given tree hierarchy for the MIR-Flickr dataset.

Superclass        Classes
superclass 1      animals, dog, dog*, bird*
superclass 2      bird, sky, clouds, clouds*
superclass 3      baby*, baby, people*, female*, male*, portrait*, portrait
superclass 4      people, female, male
superclass 5      flower, flower*, tree*
superclass 6      tree, plant life
superclass 7      water, river, sea
superclass 8      lake, river*, sea*
superclass 9      night, night*
superclass 10     transport
superclass 11     car, car*
superclass 12     sunset
superclass 13     structures
superclass 14     food
superclass 15     indoor

Table 6. Learned tree hierarchy for the MIR-Flickr dataset.