IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. V, NO. N, MONTH YYYY

Using Language to Learn Structured Appearance Models for Image Annotation

Michael Jamieson, Student Member, IEEE, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson, Member, IEEE, Sven Wachsmuth, Member, IEEE

Abstract— Given an unstructured collection of captioned images of cluttered scenes featuring a variety of objects, our goal is to simultaneously learn the names and appearances of the objects. Only a small fraction of local features within any given image are associated with a particular caption word, and captions may contain irrelevant words not associated with any image object. We propose a novel algorithm that uses the repetition of feature neighborhoods across training images and a measure of correspondence with caption words to learn meaningful feature configurations (representing named objects). We also introduce a graph-based appearance model that captures some of the structure of an object by encoding the spatial relationships among the local visual features. In an iterative procedure we use language (the words) to drive a perceptual grouping process that assembles an appearance model for a named object. Results of applying our method to three data sets in a variety of conditions demonstrate that from complex, cluttered, real-world scenes with noisy captions, we can learn both the names and appearances of objects, resulting in a set of models invariant to translation, scale, orientation, occlusion, and minor changes in viewpoint or articulation. These named models, in turn, are used to automatically annotate new, uncaptioned images, thereby facilitating keyword-based image retrieval.

Index Terms— Language–Vision integration, Image annotation, Perceptual grouping, Appearance models, Object recognition

I. MOTIVATION

Manual annotation of new images in large image collections is prohibitively expensive for commercial databases, and overly time-consuming for the home photographer. However, low-cost imaging, storage and communication technologies have already made accessible millions of images that are meaningfully associated with text in the form of captions or keywords. It is tempting to see these pairings of visual and linguistic representations as a kind of distributed Rosetta Stone from which we may learn to automatically translate between the names of things and their appearances. Even limited success in this challenging project would support at least partial automatic annotation of new images, enabling search of image databases by both image features and keywords that describe their contents.

Any such endeavor faces the daunting challenge of the perceptual grouping problem. Regardless of the type of image feature used, a word typically refers not to a single feature, but to a configuration of features that form the object of interest. The problem is particularly acute since any given image may contain multiple objects or configurations; moreover, the meaningful configurations may be easily lost among a huge number of irrelevant or accidental groupings of features. Without substantial

Manuscript received <month> dd, yyyy; revised <month> dd, yyyy. Accepted for publication November 2008. M. Jamieson, A. Fazly, S. Stevenson, and S. Dickinson are with the Department of Computer Science at the University of Toronto; S. Wachsmuth is with the Faculty of Technology at Bielefeld University.

bottom-up grouping hints, it is a nearly hopeless task to glean the meaningful feature configurations from a single image–caption pair. Given a collection of images, however, one can look for patterns of features that appear much more often than expected by chance. Usually, though, only a fraction of these recurring configurations correspond to salient objects that are referred to by words in the captions. Our system searches for meaningful feature configurations that appear to correspond to a caption word. From these starting points it iteratively constructs flexible appearance models that maximize word–model correspondence.

Our approach is best-suited to learning the appearance of objects distinguished by their structure (e.g., logos or landmarks) rather than their color and texture (e.g., tigers or blue sky). By detecting the portions of an object with distinctive structure, we can find whether the object is present in an image and where (part of) the object appears, but we do not determine its full extent. Therefore our system is appropriate for annotation but only limited localization. Our specific focus is on learning correspondences between the names and appearances of exemplar objects from relatively noisy and complex training data rather than attempting to learn the more highly-variable appearance of object classes from less ambiguous training sets. However, our framework and the structure of our appearance model are designed to learn to recognize any objects that appear as multiple parts in a reasonably consistent configuration. We therefore believe that with the right choice of features, our framework could be adapted to learn the appearance of object classes such as cars, jets, or motorcycles.

A. Background

The literature on automatic image annotation describes a large number of proposals [1], [3]–[8], [10], [12], [13], [17], [19], [21], [23]. Like our previous efforts ([15], [16]) and the work presented here, these systems are designed to learn meaningful correspondences between words and appearance models from cluttered images of multiple objects paired with noisy captions, i.e., captions that contain irrelevant words.

Many of these approaches associate a caption word with a probability distribution over a feature space dominated by color and texture (though the set of features may include position [1], [6], or simple shape information [2], [12]). This type of representation is less reliant on perceptual grouping than a shape model or a structured appearance model because color and texture are relatively robust to segmentation errors and the configuration of features is not critical. This type of representation is a good fit for relatively structureless materials such as grass, sand or water. In addition, such representations can often be more effective than more rigid models for object classes such as animals (e.g., Berg and Forsyth [4]) that have a consistent structure but are so variable


in articulation and pose that the structure is difficult to discern in 2D images. However, highly structured objects, such as buildings and bicycles, that may lack distinctive color or texture may still be recognized if individually ambiguous parts of appearance can be grouped together into a meaningful configuration.

A number of researchers have addressed the problems of perceptual grouping and learning of configurations in automatic annotation. For instance, Barnard et al. [2] acknowledge that coarse granularity features, such as regions, may be oversegmented during feature extraction and propose a ranking scheme for potential merges of regions based on a similarity of word–region associations. It is possible, however, for different parts of an object to be associated with very different words, and hence such an approach is problematic for grouping the object's parts in such cases. In a similar vein, Quattoni et al. [23] use the co-occurrence of caption words and visual features to merge together "synonymous" features. This lowers the dimensionality of their bag-of-features image representation and therefore allows image classifiers to be trained with fewer labeled examples. Such a system, which models visual structure as a mixture of features, can represent complex objects as the co-occurrence of distinctive parts within an image [6], [8], [13], [17], [21], [23]. However, these models contain no spatial relationships (even proximity) between parts that would allow them to represent true part configurations. Carbonetto et al. [7] use a Markov random field model that can successfully recognize a set of adjacent regions with widely varying appearance as being associated with a given word. Their method is not capable of learning structured configurations of features, however, since the spatial relationships between the object parts are not part of the learned region–word pairings. The multi-resolution statistical model proposed by Li and Wang [19] can represent configurations of visual features across multiple scales. However, the system does not perform grouping, as each semantic class is associated with a layout for the entire image, where the division into parts is predefined. Other work avoids the perceptual grouping problem by focusing on domains where there exists detailed prior knowledge of the appearance of the objects of interest, as in the task of matching names with faces [4]. Our work focuses on using correspondences between image features and caption words to guide the grouping of image features into explicit, meaningful configurations.

Methods for grouping individual features of various types into meaningful configurations are reasonably common in the broader object recognition literature. For instance, Fergus et al. [14] learn object appearance models consisting of a distinctive subset of local interest features and their relative positions, by looking for a subset of features and relationships that repeat across a collection of object images. Crandall and Huttenlocher [11] use graph models in which vertices are oriented edge templates rather than local feature detections, and edges represent spatial relationships. Sharing elements with both of the above approaches, our appearance models (carried over from [16]) are collections of distinctive local interest features tied together in a graph in which edges represent constrained spatial relationships. This type of representation is sufficiently flexible to handle occlusion, minor changes in scale and viewpoint, and common deformations. While both [14] and [11] can learn an appearance model from very noisy training images, the methods (like most object recognition systems) require training images in which a single object of the desired category occupies a large portion of the image. Our

method makes these structured appearance models more applicable to automatic image annotation by relaxing that constraint.

While many structured appearance models use features designed for object classes, our system uses features that are best suited to learning the appearance of exemplar objects (such as St. Paul's Cathedral) rather than a broad class of objects (such as cathedrals in general). The world is full of exemplars, and there has been a great deal of work in sorting and annotating exemplar images, such as the method proposed by Simon et al. [24] for organizing collections of related photographs into labeled canonical views. While our current detection method is not as scalable as high-performance exemplar image retrieval systems such as that proposed by Philbin et al. [22], our use of language can improve text-based querying and link together widely different appearances or views of a single exemplar.

The learning algorithm proposed here is an extension and refinement of the algorithms presented in our previous work [15], [16]. In [15] we represented appearance as an unstructured local collection of features and used a translation model to find correspondences between words and appearance. In [16] we added spatial relationships to the appearance model and introduced a more direct word–model correspondence measure. Here, we introduce a novel unified framework for evaluating both the goodness of a detection and the appropriateness of associating the detection with a caption word. We have also modified the improvement mechanism to learn more spatial relationships between features, and we present extensive new evaluation and analysis.

B. An Overview of Our Approach

The goal of this work is to annotate exemplar objects appearing in images of cluttered scenes, such as the images shown in Figure 1(a). A typical such image, with hundreds (or even thousands) of local features, contains a huge number of possible feature configurations, most of which are noise or accidental groupings. A complex configuration of features that occurs in many images is unlikely to be an accident, but may still correspond to common elements of the background or other unnamed structures. The only evidence on which to establish a connection between words and configurations of visual features is their co-occurrence across the set of captioned images. The key insight is that this evidence can guide not only the annotation of complex feature configurations, but also the search for meaningful configurations themselves. Accordingly, we have developed a novel algorithm that uses language cues in addition to recurring visual patterns to incrementally learn strong object appearance models from a collection of noisy image–caption pairs (as in Figure 1(a-c)). The result of learning is a set of exemplar object appearance models paired with their names, which can be used for annotating similar objects in new (unseen and uncaptioned) images; a sample annotation is shown in Figure 1(d).

The overall structure of our proposed system is depicted in Figure 2. The system is composed of two main parts: a learning component, and an annotation component. Both components work with a particular representation of the images and objects, and use the same algorithm to detect instances of an object model in an image. Details of our image and object representation, as well as the detection algorithm, are presented in Section II. The two stages of the learning component, including methods for expanding and evaluating appearance models, are elaborated on in Section III. Finally, in Section IV, we present the annotation


[Fig. 1 panels — sample image captions: (a) "Mats Sundin of the Toronto Maple Leafs misses a scoring chance against Ryan Miller of the Buffalo Sabres." (b) "Toronto Maple Leafs vs. Montreal Canadiens." (c) "Florida's Olli Jokinen gets bumped by Alexei Ponikarovsky of the Maple Leafs." (d) output annotation: "Maple Leafs"]

Fig. 1. (a-c) A sample input image–caption collection, where each image contains hundreds of local (SIFT) features (yellow crosses). From the input training collection, associations between structured subsets of local features and particular nouns are learned. (d) A sample output of our system, where the object (the Maple Leafs logo) is detected (shown with red features and green relationships in a yellow box), and annotated with its name ("Maple Leafs"). The annotation is performed using a word–appearance association discovered from the training image–caption collection.

component, demonstrating the promising performance of our system in discovering meaningful word–appearance pairs.

II. REPRESENTING AND MATCHING OBJECTS

Our learning framework allows a one-to-many relationship between words and appearance models. It is thus not necessary that a single model capture object appearance from all possible viewpoints. Moreover, since we deal with exemplar objects, our method need not handle the changes in texture and structural detail that are possible within a class of objects. In order to serve as a robust object detector, however, it is important that the appearance model representation be invariant to reasonable changes in lighting, scale, orientation, articulation and deformation. The representation must also be reliably detectable, in order to avoid false annotations.

We represent our images using easily-extractable local interest features that can be used reliably to detect exemplar objects in highly cluttered scenes. We represent small patches of an instance of an object using such local features, and capture its structure using the pairwise spatial relationships between the patches. From the object instances, we then construct an abstract object appearance model in the form of a graph, by modeling the recurrent local features as vertices and the recurrent spatial relationships between pairs of local features as edges. The model that is built this way is a reliable descriptor of the object appearance and at the same time flexible enough to handle common deformations. Details of our choices for the representation of images and objects are given in Sections II-A and II-B, respectively. Given an object model (in the form of a graph), we need a method for detecting instances of the object in an image—that is, to find a matching between model vertices and local image features. Our detection algorithm is presented in detail in Section II-C.

A. Image Representation

We represent an image as a set I of local interest points p_m, i.e., I = {p_m | m = 1 ... |I|}, referred to hereafter as points or image points. These points are detected using Lowe's SIFT method [20], which defines a point p_m in terms of its Cartesian position x_m, scale λ_m and orientation θ_m. In addition to spatial parameters, for each point we also extract a feature vector f_m that encodes a portion of the image surrounding the point. Since f_m is extracted relative to the spatial coordinates of p_m, it is invariant to changes in position, scale and orientation. While our approach is not dependent on a particular point detection method or feature encoding, we use the PCA-SIFT feature encoding developed by Ke and Sukthankar [18] because it allows for fast feature comparison and low memory requirements. This feature encoding is reasonably robust to lighting changes, minor deformations and changes in perspective. Since individual features capture small, independent patches of object appearance, the overall representation is robust to occlusion and articulation.

The continuous feature vector f_m is supplemented by a quantized descriptor c_m for each image point, in order to support the ability to quickly scan for potentially matching features. Following Sivic and Zisserman [25], we use the K-means algorithm to generate a set of cluster centers, C = {f_c | c = 1 ... |C|}, from a set of features randomly selected from a stock image collection. The index of the cluster center closest to f_m is used as the descriptor c_m associated with p_m.
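As a concrete illustration, the following is a minimal sketch of this quantization step, assuming the cluster centers have already been produced by K-means on a separate stock collection (the function and variable names are ours, not the paper's):

```python
import numpy as np

def quantize_descriptors(features, centers):
    # features: (N, d) array of PCA-SIFT descriptors f_m
    # centers:  (|C|, d) array of cluster centers f_c from K-means
    # Returns, for each point p_m, the index c_m of the nearest center.
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```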

In addition to describing each point individually, we also attempt to capture the local spatial configuration of points using neighborhoods that describe the local context of a point. Each point p_m is associated with a neighborhood n_m that is the set of its spatially closest neighbors p_n, according to the Δx_mn distance measure taken from Carneiro and Jepson [9]:

$$\Delta x_{mn} = \frac{\|x_m - x_n\|}{\min(\lambda_m, \lambda_n)} \qquad (1)$$

This normalized distance measure makes neighborhoods more robust to changes in scale, as newly-introduced fine-scale points are less likely to push coarse-scale points out of the neighborhood when the scale of an object increases.

To summarize, each image is represented as a set of points, I = {p_m | m = 1 ... |I|}, in which p_m is a 6-tuple of the form (f_m, x_m, λ_m, θ_m, c_m, n_m). In addition, a vector of transformation-invariant spatial relationships r_mn is defined between each pair of neighboring points, p_m and p_n, including the relative distance between the two points (Δx_mn), the relative scale difference between them (Δλ_mn), the relative heading from p_m to p_n (Δφ_mn), and the relative heading in the opposite direction (Δφ_nm). That is, r_mn = (Δx_mn, Δλ_mn, Δφ_mn, Δφ_nm), where the spatial relationships are taken from Carneiro and Jepson [9], and are calculated as in Equations (1) above, and (2) and (3)


[Fig. 2 diagram: an image–caption collection (cluttered scenes of multiple objects, with noisy captions) feeds the Learning Component, which discovers object models and their associated names. Stage 1 (Initialization): find good seed models for each caption word, using detection of high-confidence instances of a given model in a given image. Stage 2 (Improvement): repeatedly expand a model G to G' by adding a point or a spatial constraint, detect instances of the new model G' in all images, evaluate G', and accept the change if the new model has higher model–word confidence. The resulting learned model–word pairs drive the Annotation Component, which detects model instances in uncaptioned images and labels them with the associated word to produce annotated images.]

Fig. 2. A pictorial representation of our system for learning and annotating objects.

below:

$$\Delta\lambda_{mn} = \frac{\lambda_m - \lambda_n}{\min(\lambda_m, \lambda_n)} \qquad (2)$$

$$\Delta\phi_{mn} = \Delta\theta\!\left(\tan^{-1}(x_m - x_n) - \theta_m\right) \qquad (3)$$

where Δθ(.) ∈ [−π, +π] denotes the principal angle. The spatial relationships are not part of the stored image representation, but are calculated on demand when object appearance models are being built (see Section II-B) or detected (see Section II-C).
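The spatial relationship vector r_mn can be computed directly from Equations (1)–(3); the sketch below assumes each point carries a 2-D position x, a scale λ and an orientation θ, and that Δφ_nm is computed symmetrically by swapping the roles of the two points (helper names are ours):

```python
import numpy as np

def principal_angle(a):
    # wrap an angle into [-pi, +pi]
    return (a + np.pi) % (2 * np.pi) - np.pi

def spatial_relationships(xm, lm, tm, xn, ln, tn):
    # Equations (1)-(3): relative distance, relative scale,
    # and relative headings in both directions.
    dx = np.linalg.norm(xm - xn) / min(lm, ln)            # Delta x_mn
    dl = (lm - ln) / min(lm, ln)                          # Delta lambda_mn
    heading = np.arctan2(xm[1] - xn[1], xm[0] - xn[0])    # angle of x_m - x_n
    dphi_mn = principal_angle(heading - tm)               # Delta phi_mn
    dphi_nm = principal_angle(heading + np.pi - tn)       # Delta phi_nm
    return dx, dl, dphi_mn, dphi_nm
```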

B. Object Appearance Model

An object model describes the distinctive appearance of an object as a particular set of local features that have a more-or-less structured arrangement. We represent this structured configuration of features as a graph G = (V, E). Each vertex v_i ∈ V is composed of a continuous feature vector f_i, and a cluster index vector c_i containing indices for the |c_i| nearest cluster centers to f_i, i.e., v_i = (f_i, c_i). Associating each model vertex with a set of clusters allows for fast comparison of features during model detection while minimizing the effects of quantization noise. Note that model vertices, unlike image points, do not include spatial information (i.e., position, orientation, and scale), because the model must be invariant to translation, rotation, and scale. Each edge e_ij ∈ E encodes the expected spatial relationship between two vertices v_i and v_j, in four parts: e_ij = (Δx_ij, Δλ_ij, Δφ_ij, Δφ_ji) (as defined in Equations (1)–(3) above).

We assume objects are spatially coherent, and hence two nearby points on an object are expected to be related non-accidentally, i.e., through a geometric relation. Thus in our framework only a connected graph is considered to be a valid object appearance model; edges are allowed between any pair of vertices (and thus models are not restricted to trees). Models that can encode all the non-accidental relationships between pairs of nearby image points are generally more distinctive, and more robust to occlusion and inconsistent point detection, than models that are restricted to trees.

The parameters of a model's vertices (f_i and c_i) as well as those of the edges (Δx_ij, Δλ_ij, Δφ_ij, and Δφ_ji) are calculated from the corresponding parameters of the images (points and the spatial relationships between them) through an iterative process of model construction and detection (see Sections II-C and III).
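For concreteness, the graph structure described above might be represented as follows (a sketch with illustrative names, not the authors' implementation):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Vertex:
    f: np.ndarray            # expected feature vector f_i
    c: list[int]             # indices of the |c_i| nearest cluster centers

@dataclass
class Edge:
    dx: float                # expected relative distance   (Delta x_ij)
    dl: float                # expected relative scale      (Delta lambda_ij)
    dphi_ij: float           # expected heading from i to j (Delta phi_ij)
    dphi_ji: float           # expected heading from j to i (Delta phi_ji)

@dataclass
class Model:
    vertices: list[Vertex] = field(default_factory=list)
    edges: dict[tuple[int, int], Edge] = field(default_factory=dict)  # keyed by (i, j)
```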

C. Detecting Instances of an Object

We use a heuristic algorithm that searches for high-confidence instances of an appearance model in an image. Though each model is intended to robustly describe an object's appearance, no observed instance of an object is expected to fit its model exactly. Deformation, noise, and changes in perspective can distort the features encoded at points and/or the spatial relationships between them. Our detection algorithm should thus be capable of finding partial matches of a model (representing some visible part of an object). At the same time, the algorithm should distinguish a partial match which has only a few observed vertices from an accidental collection of background elements. Our algorithm thus assigns, to each observed instance of a model, a confidence score that meets the above requirements, thereby determining how well the instance fits the model. Below we first explain our proposed detection confidence score, and then present the details of our detection algorithm.

1) Detection Confidence: An observed instance O of an object appearance model G is a set of vertex–point associations, O = {(v_i, p_m) | v_i ∈ V, p_m ∈ I}, where O defines a one-to-one mapping between a subset of the model vertices and some subset of points in an image. Our detection confidence score defines the goodness of fit between a model G and an observation O as the likelihood of O being a true instance of G and not a chance configuration of background features. Let H_G be the hypothesis that O is a true instance of the appearance model G, while H_B is the competing hypothesis that O is a chance assortment of background features. The detection confidence is:

$$\mathit{Conf}_{detect}(O, G) = P(H_G \mid O) = \frac{p(O \mid H_G)\,P(H_G)}{p(O \mid H_G)\,P(H_G) + p(O \mid H_B)\,P(H_B)} \qquad (4)$$

where p(O|H_B) is the background likelihood of O, and p(O|H_G) is the model likelihood of O. We use P(.) to indicate probability functions and p(.) to indicate probability density functions.


The above equation can be rewritten as:

$$\mathit{Conf}_{detect}(O, G) = \frac{\dfrac{p(O \mid H_G)\,P(H_G)}{p(O \mid H_B)\,P(H_B)}}{\dfrac{p(O \mid H_G)\,P(H_G)}{p(O \mid H_B)\,P(H_B)} + 1} \qquad (5)$$

Thus the detection confidence can be calculated from the prior likelihood ratio, P(H_G)/P(H_B), and the observation likelihood ratio, p(O|H_G)/p(O|H_B). The prior likelihood ratio is set to a fixed value, empirically determined from experiments on a held-out subset of the training data (see the Appendix). Next, we explain how we estimate the observation likelihood ratio from a set of training image–caption pairs.
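Given the two ratios, Equation (5) is straightforward to evaluate; the following sketch assumes both ratios have already been computed:

```python
def detection_confidence(obs_likelihood_ratio, prior_ratio):
    # Equation (5): combine p(O|H_G)/p(O|H_B) with the fixed prior P(H_G)/P(H_B).
    r = obs_likelihood_ratio * prior_ratio
    return r / (r + 1.0)
```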

The background likelihood of an observation O (i.e., p(O|H_B)) is independent of the model G and depends only on the features f_m of the observed points and their spatial relationships r_mn. In other words, p(O|H_B) can be estimated from the background feature probability, p(f_m|H_B), and the background distribution of spatial relationships, p(r_mn|H_B). p(f_m|H_B) reflects how common a feature is across a set of stock images with a wide variety of objects, while p(r_mn|H_B) reflects the distribution of relationships observed between neighboring points across the stock image set.

According to the background hypothesis, all point feature vectors f_m are i.i.d. (independent and identically distributed). While some local features represent one of a few very common visual patterns, other local feature values are quite rare, and therefore more distinctive. A Gaussian mixture model (GMM) allows us to approximate such a structured background feature distribution:

$$p(f_m \mid H_B) = \sum_{c=1}^{|C|} \omega_c \, n(f_m \mid f_c, \sigma_c) \qquad (6)$$

where ω_c is a weight term (Σ_c ω_c = 1) and n(f_m|f_c, σ_c) is a multivariate normal distribution with mean f_c and diagonal covariance values σ_c². Each mean f_c is a member of the cluster centroid set C, found using K-means as explained in Section II-A above. Given these fixed means, the weights ω_c and standard deviations σ_c are determined using the EM algorithm on the same (large) set of stock images that we use to find clusters in C. This approach provides a smoother background feature distribution than using the statistics of the K-means clusters directly and is less computationally expensive than full GMM clustering.
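A minimal sketch of Equation (6), assuming (for brevity) a single standard deviation per mixture component and EM-fitted weights:

```python
import numpy as np

def background_feature_likelihood(f, centers, weights, sigmas):
    # Equation (6): Gaussian mixture over the K-means centers,
    # approximating the background feature density p(f | H_B).
    d = centers.shape[1]
    diff = f[None, :] - centers                        # (|C|, d)
    norm = (2.0 * np.pi) ** (d / 2.0) * sigmas ** d    # per-component normalizer
    dens = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / sigmas ** 2) / norm
    return float(np.sum(weights * dens))
```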

According to the background hypothesis, all spatial relationship vectors r_mn between neighboring points are also i.i.d. Since histograms of relative distance (Δx_mn) and relative scale (Δλ_mn) are unimodal in the stock set, we model them with a normal distribution:

$$p(\Delta x_{mn} \mid H_B) = n(\Delta x_{mn} \mid \mu_{xB}, \sigma_{xB}) \qquad (7)$$

$$p(\Delta\lambda_{mn} \mid H_B) = n(\Delta\lambda_{mn} \mid \mu_{\lambda B}, \sigma_{\lambda B}) \qquad (8)$$

where the means (μ_xB and μ_λB) and standard deviations (σ_xB and σ_λB) are the sample statistics for pairs of neighboring points in the stock image set. For the two relative heading terms (Δφ_mn, Δφ_nm) we did not observe any tendency to a particular value in the stock image set, hence we model them as uniformly distributed.

Calculation of the model likelihood of an observation, p(O|H_G), is more complex, since the components of the appearance model are not identically distributed. In order to account for model vertices that are occluded or lost due to inconsistent interest point detection, we assume each vertex v_i ∈ V is observed with probability α_v.¹ In matching an image point p_m to a model vertex v_i, we do not require the feature vector f_m of p_m to be precisely equal to f_i, but we assume that f_m is normally distributed with mean f_i and standard deviation σ_f. This vertex feature deviation σ_f is approximately equal to the background's mean cluster variance σ_c. While a graph vertex v_i ∈ V defines the expected feature vector of a model observation, a graph edge e_ij ∈ E defines the expected spatial relationships between certain pairs of observed points. If O contains observations (v_i, p_m) and (v_j, p_n) and e_ij ∈ E, then the elements of the spatial relationship r_mn are independent and normally distributed with means (Δx_ij, Δλ_ij, Δφ_ij, Δφ_ji) and standard deviations (σ_x, σ_λ, σ_φ, σ_φ). If the model does not specify a relationship between the vertices (e_ij ∉ E), then the elements of r_mn are distributed according to the background hypothesis.

From the above formulations for the background and model likelihoods of an observation, we can calculate the observation likelihood ratio as:

$$\frac{p(O \mid H_G)}{p(O \mid H_B)} = \prod_{v_i \notin O} (1 - \alpha_v) \prod_{(v_i, p_m) \in O} \frac{\alpha_v \, p(f_m \mid f_i)}{p(f_m \mid H_B)} \prod_{e_{ij} \in O} \frac{p(r_{mn} \mid e_{ij})}{p(r_{mn} \mid H_B)} \qquad (9)$$

where p(f_m|f_i) and p(r_mn|e_ij) reflect how well the observed points and their spatial relationships match the values expected by the appearance model G (as explained in the paragraph above). We consider e_ij to be in O if both (v_i, p_m) ∈ O and (v_j, p_n) ∈ O. Note that the observation likelihood ratio takes into account the size of the observed and unobserved portions of the model, the background likelihood of the observed features and spatial relationships, as well as how well the observed points and spatial relationships fit G.
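The product in Equation (9) can be sketched as below; the density functions p_feat_model, p_feat_bg, p_rel_model and p_rel_bg, the observation map, and the point attributes are assumptions on our part (reusing the Model and spatial_relationships sketches above):

```python
def observation_likelihood_ratio(model, obs, alpha_v,
                                 p_feat_model, p_feat_bg,
                                 p_rel_model, p_rel_bg):
    # Equation (9). `obs` maps vertex index -> observed image point.
    ratio = 1.0
    # unobserved vertices each contribute a factor (1 - alpha_v)
    for i in range(len(model.vertices)):
        if i not in obs:
            ratio *= (1.0 - alpha_v)
    # observed vertices: feature match vs. background
    for i, p in obs.items():
        ratio *= alpha_v * p_feat_model(p, model.vertices[i]) / p_feat_bg(p)
    # model edges with both endpoints observed: spatial match vs. background
    for (i, j), e in model.edges.items():
        if i in obs and j in obs:
            pm, pn = obs[i], obs[j]
            r_mn = spatial_relationships(pm.x, pm.l, pm.t, pn.x, pn.l, pn.t)
            ratio *= p_rel_model(r_mn, e) / p_rel_bg(r_mn)
    return ratio
```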

2) Detection Algorithm: To detect an instance of an object model in an image, we need to find a high-confidence association between the model vertices and image points. Given that a typical image contains thousands of points, determining the optimal association is potentially quite expensive. We thus propose a greedy heuristic that efficiently searches the space of possible associations for a nearly-optimal solution. Individual vertices of a model may be unobserved (e.g., due to occlusion), and therefore some edges in the model may also not be instantiated. Our detection heuristic thus allows for a connected model to be instantiated as disconnected components. To reduce the probability of false detections, the search for disconnected parts is confined to the neighborhood of observed vertices, and isolated singleton points are ignored. That is, in a valid model instance, each observed vertex shares a link with at least one other observed vertex. Also, our detection algorithm only reports those observations O with a detection confidence greater than a threshold, i.e., Conf_detect(O, G) ≥ T_d. We set T_d to be quite low so that potentially meaningful detections are not overlooked.

Algorithm 1 presents the greedy heuristic that detects instances of an appearance model G within the image representation I. The actual implementation can detect more than one instance of G within I by suppressing points in previously detected instances.

¹More accurately, we treat α_v not as the empirically observed probability of model vertex survival, but rather as what we would like it to be (i.e., as a user-defined parameter to guide detection). We chose α_v = 0.5 because it is a reasonable compromise between a high α_v which requires that almost all elements of a model be reproduced, and a low α_v where a great majority of the model vertices can be absent with little impact on detection confidence.


The first step is to find the set A of potential associations between image points and model vertices, that is, all vertex–point pairs whose match (p(f_m|f_i)) is higher than expected by chance. The algorithm then calculates the set of potential model edges L linking these vertex–point pairs. The set of observed correspondences, O, is assembled across several iterations by greedily adding the vertex–point pairs from A that maximize the detection confidence, Conf_detect(O, G). Each time O is expanded, elements of A and L that are incompatible with the current observed set are pruned away. The greedy expansion of O continues until there are no available correspondences that could increase Conf_detect(O, G).

Algorithm 1 Detects instances of G in I

FindModelInstance(G, I)
1) Find the set A of all potential vertex–point associations:
   A = {(v_i, p_m) | c_m ∈ c_i, p(f_m|f_i) > p(f_m|H_B)}
2) Find the set L of all potential links between elements of A:
   L = {((v_i, p_m), (v_j, p_n)) | p_n ∈ n_m, p(r_mn|e_ij) > p(r_mn|H_B)}
3) Set the initial instance O to the pair {(v_i, p_m), (v_j, p_n)}, such that the link ((v_i, p_m), (v_j, p_n)) ∈ L and Conf_detect(O, G) is maximum.
4) Remove (v_i, p_m) from A if either v_i or p_m is part of another vertex–point association ∈ O.
5) Remove ((v_i, p_m), (v_j, p_n)) from L if neither end is in A.
6) Let A_adj be the subset of A that is linked through L with an element of O.
7) If A_adj contains associations that could increase Conf_detect(O, G), add to O the association that leads to the greatest increase, and go to step 4.
8) Let L_neigh be the subset of L within the union of the neighborhoods of observed points in O.
9) If L_neigh contains observed links that could increase Conf_detect(O, G), add to O the pair of associations with the link that produces the greatest increase, and go to step 4.
10) Return (O, Conf_detect(O, G)).

III. DISCOVERING WORD–APPEARANCE ASSOCIATIONS

We propose an unsupervised learning algorithm that builds structured appearance models for the salient objects appearing in a set of training image–caption pairs. Salient objects are those that appear in many images, and are often referred to in the captions. Because each image contains many features of non-salient objects, and each caption may contain words irrelevant to the displayed objects, the algorithm has to discover which image features and words are salient. The algorithm learns object models through discovering strong correspondences between configurations of visual features and caption words. The output is a set of appearance models, each associated with a caption word, which can be used for the annotation of new images.

The learning algorithm has two stages. First, an initialization stage determines a structured seed model for each caption word, by finding recurring neighborhoods of features that also co-occur with the word. Second, an improvement stage iteratively expands each initial seed model into an appearance model that covers a larger portion of the object of interest, and at the same time is more strongly associated with the corresponding caption word. The two stages of learning use a novel measure of correspondence between a caption word and an appearance model, which is explained in Section III-A. We then explain the initialization and improvement stages of the learning algorithm in Sections III-B and III-C, respectively.

A. Word–Appearance Correspondence Confidence

Here, we explain the word–appearance correspondence score used by our learning algorithm. The learning algorithm seeks pairs of words and appearance models that are representations of the same object in different modalities (linguistic and visual). We assume that both the word and the model instance are present in an image because the object is present. We thus define the correspondence score as a measure of confidence that a given appearance model is a reliable detector for the object referred to by a word. In other words, the correspondence score reflects the amount of evidence, available in a set of training images, that a word and an object model are generated from a common underlying source object.

Consider a set of k (training) captioned images. We represent the occurrence pattern of a word w in the captions of these images as a binary vector r_w = {r_wi | i = 1, ..., k}. Similarly, we use a binary vector, q_G = {q_Gi | i = 1, ..., k}, to indicate, for each training image, whether it contains at least one true observation of model G. However, even if we detect model G in the ith training image, we cannot be certain that this is a true observation of G (q_Gi = 1) instead of a random assortment of background features (q_Gi = 0). Therefore, we treat q_Gi as a hidden variable and associate it with an observed value, o_Gi ∈ [0, 1], that reflects the likelihood of model G being present in image i. We set o_Gi to the maximum of the detection confidence scores, Conf_detect(O, G), over all the detected instances of a given object model G in a given image i.

It is always possible that the word occurrence pattern, r_w, and the observed confidence pattern, o_G = {o_Gi | i = 1, ..., k}, are independent (the null hypothesis or H_0). Alternatively, instances of the word w and model G may both derive from a hidden common source object (the common-source hypothesis or H_C). According to H_C, some fraction of image–caption pairs contain a hidden source s, which may emit the word w and/or the appearance model G. The existence of the appearance model in turn influences our observed confidence values, o_Gi. We define the correspondence between w and G as the likelihood P(H_C|r_w, o_G), and rewrite it as in Equation (11) below:

$$\mathit{Conf}_{corr}(G, w) = P(H_C \mid r_w, o_G) = \frac{p(r_w, o_G \mid H_C)\,P(H_C)}{p(r_w, o_G \mid H_C)\,P(H_C) + p(r_w, o_G \mid H_0)\,P(H_0)} \qquad (10)$$

where:

$$p(r_w, o_G \mid H_C) = \prod_i \sum_{s_i} P(s_i)\,P(r_{wi} \mid s_i)\,p(o_{Gi} \mid s_i) \qquad (11)$$

$$p(r_w, o_G \mid H_0) = \prod_i P(r_{wi})\,p(o_{Gi}) \qquad (12)$$

where s_i ∈ {0, 1} represents the presence of the common source in image–caption pair i. To calculate the likelihoods of the observed confidence values under the two competing hypotheses (p(o_Gi|s_i) and p(o_Gi)) we marginalize over the unobserved variable q_Gi:

$$p(o_{Gi} \mid s_i) = p(o_{Gi} \mid q_{Gi}=1)\,P(q_{Gi}=1 \mid s_i) + p(o_{Gi} \mid q_{Gi}=0)\,P(q_{Gi}=0 \mid s_i) \qquad (13)$$

$$p(o_{Gi}) = p(o_{Gi} \mid q_{Gi}=1)\,P(q_{Gi}=1) + p(o_{Gi} \mid q_{Gi}=0)\,P(q_{Gi}=0) \qquad (14)$$

where we choose p(o_Gi|q_Gi = 1) to be proportional to the detection confidence o_Gi (hence p(o_Gi|q_Gi = 0) will be proportional


to 1 − o_Gi). The intuition is that o_Gi is usually high when the model G is present in image i, and is low when the model is not present.

To get the likelihood of observed data under H_C, defined in Equations (11) and (13), we also need to estimate the parameters P(s_i), P(r_wi|s_i), and P(q_Gi|s_i). P(r_wi|s_i) and P(q_Gi|s_i = 0) are given fixed values according to assumptions we make about the training images, which we elaborate on in the Appendix. P(s_i) and P(q_Gi|s_i = 1) are given maximum likelihood estimates (MLEs) determined using expectation maximization over the training data. The MLEs for parameters under H_0 are more straightforward. P(r_wi) in Equation (12) is the observed probability for word occurrence in the training data while P(q_Gi) is the inferred probability of model occurrence: Σ_i o_Gi / k.
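Putting Equations (10)–(14) together, a minimal sketch (assuming the common-source parameters P(s), P(r_wi|s) and P(q_Gi|s) have already been estimated, e.g., by the EM procedure described above):

```python
import numpy as np

def corr_confidence(r_w, o_G, p_s, p_r_given_s, p_q_given_s, prior_hc=0.5):
    # r_w: length-k binary word occurrences; o_G: length-k detection confidences.
    # p_r_given_s[s] = P(r_wi = 1 | s); p_q_given_s[s] = P(q_Gi = 1 | s).
    def p_o_given_q(o, q):                 # chosen proportional to o (or 1 - o)
        return o if q == 1 else 1.0 - o

    p_r = float(np.mean(r_w))              # P(r_wi) under H_0
    p_q = float(np.mean(o_G))              # inferred P(q_Gi) = sum_i o_Gi / k
    lik_hc, lik_h0 = 1.0, 1.0
    for r, o in zip(r_w, o_G):
        # H_C: marginalize over the hidden source s_i (Eqs. 11, 13)
        term = 0.0
        for s, ps in ((0, 1.0 - p_s), (1, p_s)):
            pr = p_r_given_s[s] if r else 1.0 - p_r_given_s[s]
            po = (p_o_given_q(o, 1) * p_q_given_s[s]
                  + p_o_given_q(o, 0) * (1.0 - p_q_given_s[s]))
            term += ps * pr * po
        lik_hc *= term
        # H_0: word and model occurrences are independent (Eqs. 12, 14)
        po0 = p_o_given_q(o, 1) * p_q + p_o_given_q(o, 0) * (1.0 - p_q)
        lik_h0 *= (p_r if r else 1.0 - p_r) * po0
    num = lik_hc * prior_hc                # Equation (10)
    return num / (num + lik_h0 * (1.0 - prior_hc))
```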

B. Model Initialization

The goal of the model initialization stage is to quickly find for each word a set of seed appearance models that are fruitful starting points for building strong object detectors. Later, the seed object models are iteratively expanded and refined into larger and more distinctive appearance models using image captions as a guide. The existence of good starting points strongly affects the final outcome of the iterative improvement process. Nonetheless, a balance must be reached between time spent searching for good seeds and time spent in the improvement stage refining the seeds.

The most straightforward starting points are singleton features, but their relationship to an object may be too tenuous to provide effective guidance for building strong object models [15]. At the same time, trying all possible configurations of even a small number of features as seeds is impractical. The neighborhood pattern introduced by Sivic and Zisserman [25] roughly describes the appearance of a point's local context as a bag of features. The neighborhood pattern is more distinctive than a singleton but less complex than our configurational models. Our initialization module uses neighborhood patterns with potentially meaningful word correspondences to help construct seed appearance models.

Recall that each point p_m in an image has associated with it a vector of neighboring points n_m. The neighborhood pattern η_m is a sparse binary vector that denotes which quantized feature descriptors c are present in p_m's neighborhood (including c_m). Thus η_m roughly captures the types of feature vectors present within the small portion of the image centered at p_m. Two neighborhood patterns η_m and η_l are considered similar (η_m ≈ η_l) if they have at least t_η quantized feature descriptors in common. We use a modified version of the two-stage clustering method described in [25] to identify clusters of similar neighborhood patterns in the training images. The first stage identifies potential cluster centers with a relatively low similarity threshold (t_η = 6) but requires that the quantized descriptor of the central feature of the neighborhoods match. The second pass that greedily forms the clusters has a higher similarity threshold (t_η = 8) but does not require a matching central descriptor.
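The similarity test on neighborhood patterns reduces to counting shared quantized descriptors; a minimal sketch, representing each pattern as a set of cluster indices:

```python
def similar_patterns(eta_m, eta_l, t_eta):
    # Two neighborhood patterns are similar if they share at least
    # t_eta quantized feature descriptors.
    return len(set(eta_m) & set(eta_l)) >= t_eta

# e.g., only 3 shared descriptors fails the first-stage threshold t_eta = 6
print(similar_patterns({1, 4, 7, 9}, {1, 4, 9, 12}, 6))   # False
```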

Each resulting neighborhood cluster N_m is a set of neighborhoods with patterns similar to η_m (N_m = {n_l | η_l ≈ η_m}). Each neighborhood cluster represents a weak, unstructured appearance context that occurs multiple times across the training image collection. We represent the occurrence pattern of a given neighborhood cluster N in the training images as a binary vector q_N = {q_Ni | i = 1, ..., k}, where q_Ni = 1 if image i contains a member of N, and zero otherwise.

The initialization module measures the association between each caption word w and neighborhood cluster N, using the correspondence confidence score of Equation (11) above, i.e., Conf_corr(N, w).² We assume that occurrences of a neighborhood cluster N that has a better-than-chance correspondence to a word w may spatially overlap the object referred to by w. We therefore select for each word w the 20 neighborhood clusters N with the strongest correspondence score and attempt to extract from each cluster a seed appearance model with the best possible correspondence with w. Since the simplest detectable and structured appearance model is a single pair of linked vertices, we search for two-vertex seed appearance models G that could explain the presence of frequently-occurring point pairs in N and that also have strong correspondence with w.

Two neighboring points are considered to be a pair in a neighborhood cluster N if at least one end of the pair belongs to one of the neighborhoods in N. We extract groups of "similar" pairs in N, i.e., those that share the same appearance cluster pairs (c_m, c_n). Each such group P is viewed as a set of observations for a potential two-vertex appearance model G. For each P with members appearing in more than one image, we propose up to 20 distinct appearance models whose features and spatial relationships are randomly drawn (without replacement) from the pairs in P. For each model G, we calculate a score that reflects how well P supports model G, by treating the K pairs in the group as observations O_k:

$$\mathit{Supp}(P, G) = \sum_{k=1,\ldots,K} p(O_k \mid H_G) \qquad (15)$$

We keep the model with the highest Supp(P, G) for each group of pairs. Then, for as many as 50 models from different pair groups with the highest support, we calculate Conf_corr(G, w). The model with the highest correspondence confidence is selected as word w's seed model for the current neighborhood cluster N. Each starting seed model is therefore initialized from a different neighborhood cluster. While it is possible that different seed models may converge on the same appearance during the following iterative improvement stage, ideally the set of learned appearance models will cover different distinctive parts of the object and also provide coverage from a variety of viewpoints.

C. Iterative Improvement

The improvement stage iteratively makes simple changes to the seed object models found at the initialization stage, guided by the correspondence between caption words and models. More specifically, the improvement algorithm starts with a seed model G for a given word w, makes a simple modification to this model (e.g., adds a new vertex), and detects instances of the new model G′ in the training images (using the detection algorithm presented in Section II-C). The new model is accepted as a better object detector and replaces its predecessor if it has a higher correspondence score with w, i.e., if Conf_corr(G′, w) > Conf_corr(G, w). In other words, the improvement algorithm performs a greedy search through the space of appearance models to find a reliable object detector for any given word.

At each iteration, the algorithm tries to expand the current model by adding a new vertex and linking it with one of the

²In our calculation of Conf_corr(N, w), we replace q_G with q_N, indicating the occurrences of N in the training images.


existing vertices. Vertex candidates are drawn from points that fall within the neighborhood of the detected instances of the current model. To ensure a strong correspondence between the growing model and its associated word, only model detections that occur within images with the desired caption word w are considered for this purpose. Consider a detected point p_m that corresponds to a model vertex v_i. Each point p_n that is in the neighborhood of p_m but not part of the detection is a candidate to form a new vertex v_j. The corresponding edge candidate, e_ij, inherits the spatial relationship vector r_mn between the two points p_m and p_n. As in the initialization stage, from the observations and their neighboring points, we form groups P each containing observations of a potential one-vertex extension to the current model. The observations are pairs (p_m, p_n) with their first points being observations of the same existing vertex v_i, and their second points sharing the same cluster index c_n. For each pair group P, we propose up to 20 two-vertex incremental models ∆G = ({v_i, v_j}, e_ij) that bridge the gap between the existing model and the new vertex. We keep the incremental model with the highest Supp(P, ∆G). The incremental models from different pair groups form a queue of potential (vertex, edge) additions, prioritized by their degree of support.

An augmented model, G′, is constructed by removing the top incremental model ∆G from the queue and incorporating it into the existing model (G′ = G ∪ ∆G). If the augmented model does not improve the correspondence score with w, then the change is rejected and the next iteration begins with the next candidate ∆G to be tested. If the augmented model does improve the correspondence score, the change is accepted, and the algorithm attempts to establish additional edges between the existing vertices and the new vertex. The edge candidates are prioritized based on their support among detections of the new model G′, as this reflects the number of times the two end point vertices have been observed together, as well as the consistency of their spatial relationship over those observations. New edges are tested sequentially and those that improve the correspondence score are added. Generally, if the model vertices have very consistent spatial relationships across the current detections, the model will tend to accept many edges. If the underlying object is more deformable or viewed from a variety of perspectives, fewer edges are likely to be accepted.

Once a new vertex is added and connected to the model, and additional edges are either accepted or rejected, a new iteration begins with the new model as the starting point. If none of the proposed model extensions are accepted, the model G paired with the caption word w is added to the set of discovered word–appearance associations to be used for future annotation of new images.
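At a high level, the improvement loop can be sketched as follows; the helpers propose_extensions, extend and corr_confidence_for are placeholders of our own for the candidate-generation, model-merging and Conf_corr computations described in this section:

```python
def improve_model(seed, word, images, captions):
    # Greedy refinement: accept an extension only if it raises Conf_corr(G, w).
    model = seed
    best = corr_confidence_for(model, word, images, captions)
    improved = True
    while improved:
        improved = False
        # candidate (vertex, edge) additions, ordered by support Supp(P, dG)
        for extension in propose_extensions(model, word, images, captions):
            candidate = extend(model, extension)       # G' = G plus dG
            score = corr_confidence_for(candidate, word, images, captions)
            if score > best:
                model, best = candidate, score
                improved = True
                break                                  # restart from the larger model
    return model, best
```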

IV. EVALUATION: ANNOTATING OBJECTS IN UNCAPTIONED IMAGES

In previous sections, we have presented the components of our learning framework for discovering object models and their associated names from a set of training image–caption pairs. Ultimately, we want to use the discovered model–word pairs to detect and annotate new instances of the objects in unseen and uncaptioned images. For detection of new instances of a given object model, we use the detection algorithm presented in Section II-C.2. Our confidence that a word w is appropriate to annotate a detected object O in an image depends on two factors: (i) our confidence that the observed instance O is a true instance of the given model G and not a chance configuration of background features—i.e., the detection confidence Conf_detect(O, G); and (ii) our confidence that the appearance model G and the word w represent the same object—i.e., the correspondence confidence Conf_corr(G, w). We can thus associate an annotation confidence to every object instance that is to be labeled in a new image. The annotation confidence is defined as the product of the detection confidence and the correspondence confidence, as in:

$$\mathit{Conf}_{annotate}(O, w, G) = \mathit{Conf}_{detect}(O, G) \times \mathit{Conf}_{corr}(G, w) \qquad (16)$$

We annotate a new instance of an object only if the annotation confidence is greater than a threshold, i.e., if Conf_annotate(O, w, G) > T_a. The value of T_a could be determined by a user, depending on whether they desire more detections (higher recall) or fewer detections with higher confidence (higher precision). In all experiments reported here, we set the threshold very high, i.e., T_a = 0.95.
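A minimal sketch of the annotation step (Equation (16) plus the threshold T_a = 0.95 used in the experiments); detect_instances and the stored model–word confidences are assumed inputs of our own naming:

```python
def annotate_image(image, model_word_pairs, detect_instances, T_a=0.95):
    # model_word_pairs: iterable of (model, word, Conf_corr(G, w)) triples.
    # detect_instances(model, image) yields (observation, Conf_detect(O, G)) pairs.
    labels = []
    for model, word, conf_corr in model_word_pairs:
        for obs, conf_detect in detect_instances(model, image):
            if conf_detect * conf_corr > T_a:          # Equation (16)
                labels.append((word, obs))
    return labels
```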

The following sections present the results of applying our learning, detection, and annotation algorithms to two types of data sets: a small set of real images of toys that we captured and annotated ourselves (Section IV-A), and two larger and more challenging sets of real-world images of sports scenes and landmarks, respectively, downloaded from the web (Section IV-B). Our choices for the parameters involved in the confidence scores and the algorithms are given in the Appendix.

A. Experiments on a Controlled Data Set

Here, we report the performance of our method applied to a set of 228 images of arrangements of children's toys, generated under controlled conditions. The data set was first used in our experiments presented in an earlier paper [15]. Throughout this article, we refer to this data set as the TOYS data set. The original color photographs are converted to intensity images with a resolution of 800x600. Most images contain 3 or 4 toy objects out of a pool of 10, though there are a handful of examples of up to 8 objects. The objects are not arranged in any consistent pose and many are partially occluded. The images are collected against approximately 15 different backgrounds of varying complexity. The pool of 228 images is randomly divided into a training set of 128 and a test set of 100. Each training image is annotated with the unique keyword for each object of interest shown and between 2 and 5 other keywords uniformly drawn from a pool of distractor labels. Note that the objects of interest never appear individually and the training data contains no information as to the position or pose of the labeled objects.
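The captioning protocol just described can be mimicked with a short sketch (names and pool contents are illustrative, not the actual label sets):

import random

def make_training_caption(objects_in_image, distractor_pool, rng=random.Random(0)):
    # One keyword per object of interest, plus 2-5 distractor labels drawn
    # uniformly from a separate pool; no position or pose information is recorded.
    n_distractors = rng.randint(2, 5)
    distractors = rng.sample(distractor_pool, n_distractors)
    return sorted(set(objects_in_image) | set(distractors))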

Figure 3 displays some example images from the test set and the associated annotations produced by our system. False annotations are written in italics and missed annotations in parentheses. While there is no direct limit on the size of learned appearance models, they tend to cover small, distinctive patches of the objects. In many cases, the size of learned models is limited by the availability of sufficient repeatable visual structure. Some objects with large areas of sufficient detail, such as the 'Cash' object and the two books 'Franklin' and 'Rocket', can support larger models (as in Figure 3(d)). Our method learns multiple appearance models for many of the objects, such as the two models shown in Figure 3(b) for 'Bongos'. It is thus possible to detect an object even when a distinctive part of it is occluded.


(a) Bus, Ernie, Dino, Drum    (b) Horse, Drum, Bus, Bongos, Cash, Bug
(c) Ernie, Horse, Rocket, Cash, (Bongos)    (d) Franklin, Rocket, Cash

Fig. 3. Sample detections of objects in the TOYS test set. False detections are shown in italics while missed detections are placed in parentheses.

The system can sometimes detect planar objects such as the 'Rocket' book under significant perspective distortion (as in Figure 3(c)).

Our earlier work ([15], [16]) has shown promising results on the TOYS data set, while using simpler appearance models and/or less efficient learning or detection algorithms. To evaluate the improvement in annotation performance, we performed a single training run of our method on the same 128 training images used in previous work and compared results on the 100-image test set. Figure 4 shows the precision–recall curves for four systems: the current system (with and without spatial relationships), the ICCV system presented in [16], and the CVPR system described in [15]. To implement a system without spatial relations, we remove all spatial contributions to the detection confidence measure, Conf_detect(O, G). Therefore, edge position and connectivity play no role in the detection mechanism. The only remaining spatial constraint in the resulting bag-of-features model is that each vertex added to a detection must fall within the neighborhood of a vertex that is already part of the detection (a constraint of the underlying detection algorithm). Our new system achieves the high annotation precision of ICCV with about 15% higher overall recall, due to our new detection confidence measure. Training without spatial relationships incurs approximately a 12% penalty in recall, indicating that more distinctive object models can be constructed by finding recurring spatial relationships among image points. While the CVPR system and the 'No Spatial' variant of our current system both represent appearance as an unstructured collection of local features, a variety of changes in the initialization mechanism, the representation of individual features, and the detection and correspondence measures greatly improve overall performance.3

Table I shows the per-object precision and recall values of our current system. In this and subsequent tables, the Frequency column shows the number of captions within the test set that contain at least one instance of the corresponding word.

3 The improved precision of the ICCV system over the CVPR system is due in part to the addition of spatial relationships to the appearance models, as well as to improvements in the correspondence confidence measure and initialization method.

[Figure: precision–recall curves, "Performance on the TOYS Image Set"; x-axis: Recall, y-axis: Precision; curves: Current, Current (No Spatial), ICCV, CVPR.]

Fig. 4. A comparison of precision–recall curves over the TOYS test set, for four systems: current, current without spatial relationships, ICCV [16], and CVPR [15]. CVPR used local bag-of-features models that were initialized with singleton features. ICCV added spatial relationships and neighborhood initialization. The current system adds detection and annotation confidence scores and builds models with more spatial relationships.

TABLE I
PERFORMANCE RESULTS ON THE TOYS TEST SET; T_a = 0.95.

Name       Precision  Recall  Frequency
Horse      1.00       0.82    28
Rocket     1.00       0.82    44
Drum       1.00       0.81    32
Franklin   1.00       0.73    33
Bus        1.00       0.60    57
Bongos     1.00       0.53    36
Bug        1.00       0.53    51
Dino       1.00       0.12    42
Cash       0.97       0.76    46
Ernie      0.94       0.41    39

All of the precision and recall values we report are based on word occurrence in the captions of the test set; if the system does not detect a word that appears in the caption, that instance is counted as a miss (a false negative), even if the named object does not actually appear in the image. The method performs best on objects that have large, roughly planar surfaces and distinctive structural details, such as 'Rocket', 'Franklin', 'Drum' and 'Horse'. The only instances of these objects that cannot be detected with very high precision are either highly occluded or viewed from an atypical angle (such as edge-on for a book). Our system has the greatest difficulty with the 'Ernie' and 'Dino' objects, perhaps because they lack fine-scale surface details and distinctive textures. For instance, the striped pattern of the shirt of the 'Ernie' doll is somewhat distinctive within the image set, but the face lacks sufficient keypoints to build a reliable model. The 'Dino' object is particularly difficult, as specular reflections significantly alter the local-feature description of its surface appearance depending on perspective and lighting.
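The caption-based scoring described at the start of this subsection can be summarized in a short sketch (the dict-of-sets structures are assumptions for illustration):

def word_precision_recall(annotations, captions):
    # annotations, captions: image id -> set of words (system output / ground truth)
    scores = {}
    words = {w for cap in captions.values() for w in cap}
    for w in words:
        tp = sum(1 for i, ann in annotations.items() if w in ann and w in captions[i])
        fp = sum(1 for i, ann in annotations.items() if w in ann and w not in captions[i])
        fn = sum(1 for i, cap in captions.items() if w in cap and w not in annotations.get(i, set()))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        frequency = sum(1 for cap in captions.values() if w in cap)
        scores[w] = (precision, recall, frequency)
    return scores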

Precision and recall indicate whether the system is detecting objects in the correct images, but not how often the models are detected on the objects themselves. To evaluate the locational accuracy of the detections, we manually defined bounding boxes for all named objects in the test image set (this data was not available to the system). A detection was considered accurate if all of the detected model vertices were located within the correct bounding box. Overall, 98.8% of above-threshold detections on the TOYS data set are completely within the boundaries defined for the object. The lowest accuracy was for the 'Franklin' object, on which 5.2% of detections were partially outside the object bounds.
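The locational-accuracy criterion amounts to the following check (a sketch; the box and point representations are assumed):

def detection_inside_box(detected_points, box):
    # A detection is accurate only if every detected model vertex lies inside
    # the manually defined bounding box (x_min, y_min, x_max, y_max).
    x_min, y_min, x_max, y_max = box
    return all(x_min <= x <= x_max and y_min <= y <= y_max for x, y in detected_points)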


B. Experiments on Web Data Sets

This section reports the performance of our system on two larger and more challenging sets of images downloaded from the web. The first set, which we refer to as the HOCKEY data set, includes 2526 images of National Hockey League (NHL) players and games, with associated captions, downloaded from a variety of sports websites. The second set, which we refer to as the LANDMARK data set, contains 3258 images of 27 well-known buildings and locations, with associated tags, downloaded from the Flickr website4. Due to space considerations, our analysis focuses mainly on the results on the HOCKEY data set (Sections IV-B.1 through IV-B.3), though most of the same phenomena also appear in the LANDMARK results (Section IV-B.4).

1) Annotation Performance on the HOCKEY Data Set: The HOCKEY set contains examples of all 30 NHL teams and is divided into 2026 training and 500 test image–caption pairs. About two-thirds of the captions are full-sentence descriptions, whereas the remainder simply name the two teams involved in the game (see Figure 1, page 3 for examples of each type). We automatically process captions of the training images, removing capitalization, punctuation, and plural indicators, and dropping words that occur in less than 1% of the captions. Captions of the test images are only used for evaluation purposes.
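The caption preprocessing can be approximated as follows (a sketch; in particular, the plural-stripping heuristic is an assumption, not the exact rule used):

import re
from collections import Counter

def preprocess_captions(captions, min_fraction=0.01):
    # Lowercase, strip punctuation, crudely remove a trailing plural 's', and
    # drop words occurring in fewer than 1% of the captions.
    tokenized = []
    for text in captions:
        words = re.findall(r"[a-z]+", text.lower())
        words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
        tokenized.append(words)
    doc_freq = Counter(w for words in tokenized for w in set(words))
    keep = {w for w, n in doc_freq.items() if n >= min_fraction * len(captions)}
    return [[w for w in words if w in keep] for words in tokenized]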

Most images are on-ice shots and display multiple players in a variety of poses and scales. We thus expect our system to learn distinctive appearance models for the players of each team, and to discover meaningful associations between the models and the corresponding team names. A team's logo is perhaps the most distinctive appearance model that our system could learn for the team. In addition, there may be other visual appearances that unambiguously identify a particular team, such as shoulder patches or sock patterns. Note that we do not incorporate any prior knowledge into our system about which objects or words are of interest. The system is expected to learn these from the pairings of the images and captions in the training data. The only information we provide to our system is a link between a team's name (e.g., Bruins) and its city name (e.g., Boston), because most NHL teams are referred to by both names. Our system thus treats the two words (team name and city name) as the same word when learning model–word associations. The final vocabulary extracted from the training captions contains 237 words, of which only 30 are team designations. As we will see later, our system learns appearance models for many of these words, including team names as well as other words.

We experiment with different degrees of spatial constraints imposed by our detection algorithm, in order to see how that affects the learning and annotation performance of our system. Our detection confidence score, presented in Section II-C, has a set of parameters corresponding to the model spatial relationship variances. We set each of these to a fraction 1/Γ of the corresponding background variance, where Γ is the spatial tolerance parameter that determines the amount of spatial constraint required by the detection algorithm to consider an observation as a true instance of a given model.

High values of Γ thus imply a narrow acceptability range for spatial relationships between any two connected vertices of a model, resulting in tighter spatial constraints when detecting instances of the model. In contrast, a low value of Γ translates into looser spatial constraints in detection.

4 www.flickr.com

[Figure: precision–recall curves, "Precision vs. Recall for Different Spatial Settings"; x-axis: Recall, y-axis: Precision; curves: Γ = 100, Γ = 10, Γ = 1, No Spatial.]

Fig. 5. Precision–recall curves of our current system on test HOCKEY images, over a wide range of spatial settings. Although the system can detect objects even without spatial constraints (the 'No Spatial' case), the use of moderate spatial constraints (when Γ = 10) offers the best performance.

Figure 5 shows the overall precision–recall curves for various settings of the spatial tolerance parameter Γ, including a version in which the spatial relationships between points do not contribute to the detection confidence function (No Spatial). We use the notation Γ_i to refer to an implementation of the system where Γ is given the value i.

The curves indicate that the implementation that ignores spatial factors (No Spatial) and Γ_1 (where model spatial variances are identical to background spatial variances) are roughly equivalent. Remaining differences are due to the fact that Γ_1 can learn some edge connection structure while the 'No Spatial' model cannot. Very strong spatial constraints, as in Γ_100, may result in brittle model detections, but the system is still capable of producing reasonably good results (e.g., substantially better than a bag-of-features model; see Figure 5). Nonetheless, the results confirm that moderate spatial constraints are generally more effective, as with Γ_10. One might expect that stronger spatial constraints (higher values of Γ) would always lead to higher precision. This is not necessarily the case, because even a configuration of relatively few observed vertices may have high detection confidence if their spatial relationships conform to the model's tight constraints. Moreover, many of the false annotations on test images are not the result of incorrect model detection at annotation time, but are due to learning a spurious word–appearance correspondence. Finally, a model that requires a rigid spatial configuration among its components may not grow to the size of a model that is more accommodating. The resulting smaller models may be less distinctive, even though each edge is more precise in capturing a particular spatial configuration.

To analyze the sensitivity of our method to the precise value of the spatial tolerance parameter Γ, we perform experiments with a smaller range of values for this parameter. Figure 6 shows the precision–recall curves for Γ set to 5, 10, and 20. The results confirm that our method is not sensitive to small changes in the value of Γ. This might indicate that the iterative improvement process can optimize appearance models to compensate for different spatial tolerance values within a reasonable range. It is also possible that the models themselves are effective across a range of Γ values, such that, for example, system performance would not be adversely affected if different values of Γ were used for training and annotation.


[Figure: precision–recall curves, "Sensitivity to Small Changes in Γ"; x-axis: Recall, y-axis: Precision; curves: Γ = 20, Γ = 10, Γ = 5.]

Fig. 6. Precision–recall curves for our current system are relatively unaffected by modest changes in the spatial tolerance parameter Γ.

TABLE II
INDIVIDUAL PRECISION AND RECALL VALUES FOR 23 OF THE TEAM NAMES (OF 30) DETECTED WITH HIGH CONFIDENCE IN THE HOCKEY TEST SET; Γ = 10 AND T_a = 0.95.

Name                  Precision  Recall  Frequency
Tampa Bay Lightning   1.00       0.51    49
Los Angeles Kings     1.00       0.33    36
New York Islanders    1.00       0.32    60
Calgary Flames        1.00       0.31    26
Dallas Stars          1.00       0.31    42
Minnesota Wild        1.00       0.26    35
Chicago Blackhawks    1.00       0.24    35
New York Rangers      1.00       0.21    42
Buffalo Sabres        1.00       0.19    32
Carolina Hurricanes   1.00       0.17    30
Nashville Predators   1.00       0.10    20
Colorado Avalanche    1.00       0.09    23
Toronto Maple Leafs   0.96       0.30    73
Ottawa Senators       0.90       0.16    58
Pittsburgh Penguins   0.89       0.28    29
Atlanta Thrashers     0.86       0.17    35
New Jersey Devils     0.85       0.19    59
Detroit Red Wings     0.83       0.24    42
San Jose Sharks       0.83       0.22    23
Florida Panthers      0.83       0.20    25
Montreal Canadiens    0.80       0.17    23
Vancouver Canucks     0.57       0.10    40
Boston Bruins         0.44       0.24    17

2) An Analysis of the Annotation Results for the HOCKEY Set: Our results presented above show that Γ_10 has the best overall performance. We thus focus on Γ_10, analyzing various aspects of its performance. Table II shows the annotation performance of Γ_10 on the test images, focusing on the team names only. The system has high-confidence detections for 23 of the 30 teams. (There are 2 additional teams for which the system learns high-confidence models, but does not detect them in the test images.) Results show that the annotation of the test images generally has high precision but low recall. The low recall is partly because of our choice of a high annotation threshold, but also due to the fact that the captions that we use as ground truth often mention teams that are not visible in the image. In addition, a hockey player has a highly variable appearance depending on viewing angle and pose. A model that captures the appearance of the front logo will not help annotate a view of a player from the side.

An important factor that determines whether the system can learn high-confidence models for a team, and detect instances of it in the test images, is the number of training examples that contain both the team's name and some appearance of it. It is hard for the system to learn a high-confidence model for a team if the team's logo appears in a small number of training images. Even if the system learns a model for such a team, the model may be too specific to be useful for detecting new instances (which may differ in viewpoint, for example). On average, teams detected in the test set are mentioned in the captions of 114 training images, while teams with no test set detections are mentioned in the captions of 57 training images. The system detects only two teams (the Bruins and the Canadiens) with fewer than 60 caption-mentions, and fails to detect only one team (the Flyers) with more than 60 caption-mentions in the training set.

Note also that a team name's mention in the caption of an image does not necessarily mean that the corresponding object (e.g., the team's logo) appears in the image. After analyzing 15 of the teams in Table II, we found that only 30–40% of the training images that mention a team's name also contain a visible (less than half obscured) instance of one of the team's logos. Therefore, a team with fewer than 60 training examples will almost always have fewer than 24 usable instances of the team logo. In addition, in some cases these instances will display various versions of the team's logo. For example, images of the Maple Leafs regularly show three different versions of their logo. The training set contains four completely different Buffalo Sabres chest logos, and many of these are placed on different backgrounds for home and away jerseys. As we see later in Figure 7(d), the system does not learn the new Sabres logo, but it does have a strong model for the older version of the logo displayed by the fan in the background.

For teams detected with relatively high recall, the system tends to learn separate models for each variation of the chest logo and often additional models for other distinctive parts of a player's uniform, such as shoulder patches or sock patterns. In some cases, the system learns several high-confidence models for the same logo. Some of these are redundant models that have nearly identical detection patterns, while in other cases the models describe the logo in different modes of perspective distortion. Teams with lower recall tend to have only one or two high-confidence appearance models describing a single logo variation. The training restriction of only 20 seed appearance models for each named object is perhaps ill-advised, given that multiple appearance models are helpful and only about 1 in 6 seed models leads to a high-confidence object appearance model.

The visual properties of a team's logo also affect whether the system learns an association between its appearance and the team's name. For example, whereas the Philadelphia Flyers are mentioned in the captions of 168 training images, their primary logo lacks detailed texture and so attracts relatively few interest points. This may explain why the system did not learn a reliable detector for this logo.

3) Sample Annotations from the HOCKEY Set: Figure 7 shows some example annotations in the test images for Γ_10. As expected, team logos tend to be the most useful patches of appearance for recognizing the teams. The system is often able to detect learned logos that are distorted or partially occluded. In some cases, however, the system fails to detect learned logos that are relatively clean and undistorted. These failures might be due to contrast reversal from a logo appearing against a different background, or perhaps to missed detections by the underlying interest point detector.

While the team logos tend to be the most distinctive aspect of a player's appearance, the system has learned alternate models for a variety of teams. Figure 8 displays two such examples, including a shoulder patch and a sock pattern. The system might also learn an association between a team name and an appearance patch that is not part of a player at all.


(a) Panthers   (b) Devils
(c) Maple Leafs and Islanders   (d) Sabres
(e) Penguins   (f) Lightning and Red Wings
(g) Stars   (h) Wild and Canadiens

Fig. 7. Sample annotations of our system in the hockey test images. The system automatically discovers chest logos as the most reliable method for recognizing a team.

Figure 9 displays instances of a particular type of false annotation. Variants of a white net pattern against a dark background are learned independently as reasonably high-confidence appearance models for several teams. This appears to be a result of overfitting. While this back-of-the-net appearance is quite common across the NHL image collection, certain variations of this appearance have a weak initial correspondence with particular teams. During iterative improvement, the system is able to tailor each team's version of the net appearance model to fit only the specific instances of the net appearance in the training data that are annotated with the desired team name. In this way, a model can have few false positives in the training set, even though the described appearance is not meaningfully associated with the team name. We are currently considering approaches for reducing the occurrence of such overfitting while still allowing the improvement stage to encode meaningful differences in appearance.

As mentioned previously, the system is not restricted to finding visual correspondences solely for team names.

(a) Flames shoulder patch (b) Blackhawks sock pattern

Fig. 8. Alternate models can be learned, such as shoulder patches or sock patterns.

(a) Red Wings (b) Canucks

Fig. 9. Due to overfitting, our system learns bad models (a white net on a dark background) for several teams.

Figure 10 shows a few example detections of models learned for other words. Two of the models associate the distinctive pads of the Calgary Flames goalie Mikka Kiprusoff with the labels 'mikka' and 'kiprusoff'. Most of the other high-confidence model–word associations link the appearance of a team logo with other words associated with the team. For instance, Figure 10(b) shows a learned association between the Islanders logo and words related to their home arena of "Nassau Coliseum in Uniondale, New York". Other associations are with fragments of a team's name that are not among the words we manually linked to identify a team. For instance, the system learned associations between the words wing and los and the logos of the Detroit Red Wings and the Los Angeles Kings. We associate the words angeles and kings with the same caption token, so that either word could be used to represent the team. The word los was not one of the words we linked, but it still has a strong correspondence in our training set with the Los Angeles Kings logo. The fact that the components of a multi-word name can independently converge on similar visual descriptions means that it may be possible to automatically learn such compound names based on visual model similarity, instead of manually linking the components prior to learning.

(a) 'mikka', 'kiprusoff'   (b) 'nassau', 'coliseum', 'uniondale', 'new', 'york'

Fig. 10. Words other than team names for which the system discovered strong models include the name of the Calgary Flames' goalie and the name and location of the New York Islanders' arena.


TABLE III
PRECISION AND RECALL VALUES FOR 11 WORDS WITH AT LEAST 10 DETECTIONS IN THE TEST IMAGES; Γ = 10, T_a = 0.95.

Name          Precision  Recall  Frequency
'york'        1.00       0.22    81
'new'         0.95       0.16    120
'los'         0.92       0.34    32
'maple'       0.86       0.16    37
'uniondale'   0.85       0.31    35
'coliseum'    0.85       0.20    35
'nassau'      0.83       0.29    35
'wings'       0.73       0.17    47
'rutherford'  0.70       0.33    21
'canada'      0.47       0.16    20
'california'  0.40       0.20    20

Table III gives precision and recall values for all words (other than the team names) with at least 10 detections in the test images. These results may be somewhat understated, as the test image captions are not as consistent in mentioning these background words as they are with the team names. For instance, a detection of the word maple on the Toronto Maple Leafs logo will count as a true positive if the caption of the test image is "Maple Leafs vs Lightning", but as a false positive if the caption is the semantically equivalent "Toronto vs Tampa Bay".

4) Annotation Performance on the LANDMARK Data Set: The LANDMARK data set includes images of 27 famous buildings and locations with some associated tags downloaded from the Flickr website, randomly divided into 2172 training and 1086 test image–caption pairs. Like the NHL logos, each landmark appears in a variety of perspectives, scales and (to a lesser extent) orientations. Whereas a hockey logo is deformable and may appear in several different versions, the appearance of the landmarks can be strongly affected by varying lighting conditions, and different faces of a structure can present dramatically different appearances. Another difference is that the HOCKEY captions usually mention two teams, whereas each LANDMARK image is associated with only a single landmark.

Table IV presents precision and recall information for the 26 of 27 landmarks for which high-confidence detections were made in the test set. Compared to the HOCKEY results (Table II), recall is generally higher. This is probably because the test image labels used as ground truth are somewhat more reliable in the LANDMARK set; there are fewer instances where a test caption mentions a landmark that is not present in the image. The only landmark of the set that was not learned to some degree was the Sydney Opera House, perhaps due to the smooth, relatively textureless exterior of the building. Our system recognized only one of the CN Tower images in the test set, probably because the more distinctive pattern of the observation level makes up such a small fraction of the overall tower.

Figure 11 shows 8 examples of the detected landmarks. In general, the learned models tend to cover highly textured areas of a landmark. In some cases, however, the profile of a structure such as the towers of the Golden Gate Bridge or the Statue of Liberty is distinctive enough to form a strong model. Since many of the building details display different shadowing at different times of the day, multiple models are often learned to cover these different appearances. Due to the repetitive surface structure of many of the landmarks, an image often contains multiple detections of a single appearance model (e.g., Figure 11(c)-(e)). The system also learned associations between landmark names and some nearby objects, such as an adjacent office tower that the system associated with the Empire State Building.

TABLE IV
INDIVIDUAL PRECISION AND RECALL VALUES FOR THE 26 OF 27 LANDMARKS DETECTED WITH HIGH CONFIDENCE IN THE LANDMARK TEST SET; Γ = 10 AND T_a = 0.95.

Name                    Precision  Recall  Frequency
Rushmore                1.00       0.71    35
St. Basil's Cathedral   1.00       0.69    35
Statue of Liberty       1.00       0.61    36
Great Sphinx            1.00       0.45    40
Notre Dame Cathedral    1.00       0.40    40
Stonehenge              1.00       0.36    42
St. Peter's Basilica    1.00       0.15    41
Chichen Itza            1.00       0.05    37
CN Tower                1.00       0.03    34
Golden Gate Bridge      0.97       0.73    45
Cristo Redentor         0.96       0.55    44
Eiffel Tower            0.95       0.61    33
Taj Mahal               0.89       0.52    33
Big Ben                 0.88       0.68    44
Colosseum               0.87       0.33    39
Tower Bridge            0.82       0.79    47
White House             0.81       0.38    45
US Capitol              0.80       0.80    45
Reichstag               0.80       0.53    45
St. Paul's Cathedral    0.75       0.69    48
Arc de Triomphe         0.71       0.57    42
Parthenon               0.71       0.29    35
Burj Al Arab            0.71       0.23    43
Leaning Tower           0.70       0.93    43
Empire State Building   0.62       0.55    38
Sagrada Familia         0.54       0.40    35

(a) St. Basil's Cathedral   (b) Cristo Redentor
(c) Golden Gate Bridge   (d) Leaning Tower
(e) Sagrada Familia   (f) Great Sphinx
(g) Statue of Liberty   (h) Eiffel Tower

Fig. 11. Sample annotations of our system in the landmark test images. Multiple detections of a single model within an image can be seen in (c)-(e).


C. Summary of Results

The results presented in this section show that our system is able to find many meaningful correspondences between word labels and visual structures in three very different data sets. While the system performs comparably over a range of spatial parameter settings, the addition of spatial relationships to the appearance model can dramatically improve the strength and number of the learned correspondences. Our framework can deal with occlusion and the highly variable appearance of some objects by associating multiple visual models with a single word, but this depends on the presence of a sufficient number and variety of training examples.

V. CONCLUSIONS

We have proposed an unsupervised method that uses language both to discover salient objects and to build distinctive appearance models for them, from cluttered images paired with noisy captions. The algorithm simultaneously learns appropriate names for these object models from the captions. Note that we do not incorporate any prior knowledge into our system about which objects or words are of interest; the system has to learn these from the pairings of the images and captions in the training data. We have devised a novel appearance model that captures the common structure among instances of an object by using pairs of points, together with their spatial relationships, as the basic distinctive portions of an object. We have also introduced a novel detection method that can reliably be used to find and annotate new instances of the learned object models in previously unseen (and uncaptioned) test images. Given the abundance of existing images paired with text descriptions, such methods can be very useful for the automatic annotation of new or uncaptioned images, and hence can help in the organization of image databases, as well as in content-based image search.

At this juncture, there are two complementary directions for future work. First, we would like to use insights from computational linguistics to move beyond individual words to groups of words (e.g., word combinations such as compound nouns and collocations, or semantically related words) that correspond to distinct visual patterns. Second, though local features provide a strong basis for detecting unique objects, they are less ideal for detecting object categories or general settings. However, the grouping problem persists for most types of visual features and most forms of annotation. We expect that many of the mechanisms for addressing the grouping problem for local features will also apply to other feature classes, such as contours.

APPENDIX

This section elaborates on how we set the values of the various parameters required by our model and methods, including the parameters of our image and object representation and of our detection confidence score from Section II, as well as the parameters of the model–word correspondence confidence score (used in learning) from Section III.

For our image representation presented in Section II-A, we use a cluster set of size 4000 to generate the quantized descriptors c_m associated with each image point. We set the neighborhood size, |n_m|, to 50, as experiments on the CVPR data set indicated this was an appropriate tradeoff between distinctiveness and locality in a neighborhood. (Neighborhoods in the range of 40-70 points produced roughly equivalent results.) Recall from Section II-B that in an appearance model, each vertex is associated with a vector of neighboring cluster centers, c_i. We set |c_i| = 20 to minimize the chance of missing a matching feature due to quantization noise, at the expense of slowing down the model detection function.

We set the parameters of the word–appearance correspondence measure, Corr(w, G), according to the following assumptions. Reflecting high confidence in the captions of training images, we set P(r_wi = 1 | s_i = 1) = 0.95, since we expect to observe an object's name in a caption whenever the object is present in the corresponding image. Similarly, we expect that few images will be labelled w when the object is not present. However, even setting a low background generation rate (P(r_wi = 1 | s_i = 0) = 0.5) could lead to a relatively large number of false w labels if few image–caption pairs contain the object (P(s_i = 1) is low). We expect the false labels to be a small fraction of the true labels. Since the expected rate of true word labels is P(r_wi = 1 | s_i = 1) P(s_i = 1) and the expected rate of false word labels is P(r_wi = 1 | s_i = 0) P(s_i = 0), we therefore set:

P(r_wi = 1 | s_i = 0) = 0.05 · P(r_wi = 1 | s_i = 1) P(s_i = 1) / P(s_i = 0).    (17)

Similarly, during model learning, we wish to assert that a high-confidence model should have a low ratio of false positives to true positives. Therefore we set:

P(q_Gi = 1 | s_i = 0) = 0.05 · P(q_Gi = 1 | s_i = 1) P(s_i = 1) / P(s_i = 0).    (18)

However, when evaluating correspondences between words and neighborhoods to find good seed models, we set this target number of false positives, as a fraction of the true positives, much higher, at 1. We do so because the starting seed models are expected to be less distinctive; hence even promising seeds may have a large number of false positives. Performing experiments on a held-out validation set of 500 HOCKEY training images, we found the precise value of P(H_0)/P(H_C) to have little effect on the overall performance; we thus set P(H_0)/P(H_C) = 100,000.
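The rule in Eqs. (17) and (18) can be written as a one-line helper; with the reported value P(r_wi = 1 | s_i = 1) = 0.95 and an assumed, purely illustrative P(s_i = 1) = 0.1, it gives a background generation rate of roughly 0.0053:

def false_label_rate(p_true_given_present, p_present, target_ratio=0.05):
    # Choose the background rate so that the expected number of false labels is
    # target_ratio times the expected number of true labels.
    p_absent = 1.0 - p_present
    return target_ratio * p_true_given_present * p_present / p_absent

# Example (P(s_i = 1) = 0.1 is a hypothetical value, not taken from the data sets):
# false_label_rate(0.95, 0.1) == 0.05 * 0.95 * 0.1 / 0.9, approximately 0.0053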

At the initialization stage, up to 20 neighborhood clusters N are selected to generate seed models, which are further modified in up to 200 stages of iterative improvement.

If the detection threshold T_d is set too low, the detection algorithm can report a large number of low-probability detections, which can impose a computational burden on later stages of processing. However, the threshold should not be set so high that detections that might lead to model improvements are ignored. We err on the side of caution and set T_d = 0.001. The parameters of the background likelihood (used in measuring detection confidence) are set according to sample statistics of a collection of 10,000 commercial stock photographs. Based on this wide sample of visual content, the distance variance σ²_xB is set to 200, and the scale variance σ²_λB is set to 5. Each of the model spatial relationship variances is set to a fraction 1/Γ of the corresponding background variance, i.e.:

σ²_x = σ²_xB / Γ,    σ²_λ = σ²_λB / Γ,    and    σ²_φ = π² / (3Γ),

where Γ is the spatial tolerance parameter that determines the expected degree of spatial variability among observed object models. In most experiments, we set Γ = 10, a value we found to result in reasonably good performance on the HOCKEY validation set. After estimating the mixture of Gaussian distributions based on the stock photo collection, we observed that the average variance among the background feature clusters was 1.31, so we set the model feature variance slightly higher, at σ²_f = 1.5.


Based on empirical findings on the held-out subset of the HOCKEY training data, we set P(H_B)/P(H_G) = 100,000, though we found a broad range of values to yield similar results.
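For convenience, the parameter values reported in this appendix can be gathered in one place; this configuration sketch (with Γ = 10) uses illustrative names, not the authors' code:

import math

GAMMA = 10                              # spatial tolerance parameter
PARAMS = {
    "num_clusters": 4000,               # size of the descriptor cluster set
    "neighborhood_size": 50,            # |n_m|
    "vertex_cluster_list": 20,          # |c_i|
    "detection_threshold": 0.001,       # T_d
    "annotation_threshold": 0.95,       # T_a
    "sigma2_x": 200 / GAMMA,            # model distance variance (background 200)
    "sigma2_lambda": 5 / GAMMA,         # model scale variance (background 5)
    "sigma2_phi": math.pi**2 / (3 * GAMMA),  # model orientation variance
    "sigma2_f": 1.5,                    # model feature variance (background average 1.31)
    "prior_ratio": 100000,              # P(H_B)/P(H_G) and P(H_0)/P(H_C)
}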

ACKNOWLEDGMENT

The authors gratefully acknowledge the support of Idee Inc., CITO, and NSERC Canada.

REFERENCES

[1] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107-1135, 2003.
[2] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, and D. Forsyth. The effects of segmentation and feature choice in a translation model of object recognition. In CVPR, 2003.
[3] K. Barnard and Q. Fan. Reducing correspondence ambiguity in loosely labeled training data. In CVPR, 2007.
[4] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y.-W. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In CVPR, 2004.
[5] T. L. Berg and D. A. Forsyth. Animals on the web. In CVPR, 2006.
[6] D. Blei and M. Jordan. Modeling annotated data. In SIGIR, 2003.
[7] P. Carbonetto, N. de Freitas, and K. Barnard. A statistical model for general contextual object recognition. In ECCV, 2004.
[8] G. Carneiro, A. Chan, P. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):394-410, 2007.
[9] G. Carneiro and A. Jepson. Flexible spatial configuration of local image features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2089-2104, 2007.
[10] M. L. Cascia, S. Sethi, and S. Sclaroff. Combining textual and visual cues for content-based image retrieval on the World Wide Web. In Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries, June 1998.
[11] D. J. Crandall and D. P. Huttenlocher. Weakly supervised learning of part-based spatial models for visual object recognition. In ECCV, 2006.
[12] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, volume 4, pages 97-112, 2002.
[13] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In CVPR, 2004.
[14] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In CVPR, 2005.
[15] M. Jamieson, S. Dickinson, S. Stevenson, and S. Wachsmuth. Using language to drive the perceptual grouping of local image features. In CVPR, 2006.
[16] M. Jamieson, A. Fazly, S. Dickinson, S. Stevenson, and S. Wachsmuth. Learning structured appearance models from captioned images of cluttered scenes. In ICCV, 2007.
[17] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In SIGIR, 2003.
[18] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In CVPR, 2004.
[19] J. Li and J. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(9):1075-1088, 2003.
[20] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[21] F. Monay and D. Gatica-Perez. Modeling semantic aspects for cross-media image indexing. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(10):1802-1817, 2007.
[22] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[23] A. Quattoni, M. Collins, and T. Darrell. Learning visual representations using images with captions. In CVPR, 2007.
[24] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In ICCV, 2007.
[25] J. Sivic and A. Zisserman. Video data mining using configurations of viewpoint invariant regions. In CVPR, 2004.

Michael Jamieson received the B.A.Sc. and M.A.Sc. degrees in Systems Design Engineering from the University of Waterloo in 1999 and 2003, respectively. From 1999 to 2000 he helped design radar countermeasures at Vantage Point International, and since 2003 he has developed visual search algorithms at Idee Inc. He is currently a Ph.D. candidate in the Computer Science department at the University of Toronto. His research interests include visual search and automatic image annotation.

Afsaneh Fazly received bachelor's and master's degrees in Computer Engineering from Sharif University of Technology (Iran), and master's and Ph.D. degrees in Computer Science from the University of Toronto. She is currently a postdoctoral fellow at the University of Toronto. Dr. Fazly's primary area of research is computational linguistics and cognitive science, with an interest in approaches that integrate these disciplines with others in artificial intelligence, such as computational vision.

Suzanne Stevenson received a B.A. in Computer Science and Linguistics from William and Mary, and M.S. and Ph.D. degrees from the University of Maryland, College Park. From 1995-2000, she was on the faculty at Rutgers University, holding joint appointments in the Department of Computer Science and in the Rutgers Center for Cognitive Science. She is now at the University of Toronto, where she is Associate Professor of Computer Science, and Vice-Dean, Students, Faculty of Arts and Science.

Sven Dickinson received the B.A.Sc. degree in systems design engineering from the University of Waterloo, in 1983, and the M.S. and Ph.D. degrees in computer science from the University of Maryland, in 1988 and 1991, respectively. He is currently Professor of Computer Science at the University of Toronto's Department of Computer Science. His research interests include object recognition, shape representation, object tracking, vision-based navigation, content-based image retrieval, and language-vision integration.

Sven Wachsmuth received the Ph.D. degree in computer science from Bielefeld University, Germany, in 2001. He now holds a staff position there as a senior researcher in Applied Informatics. In 2008 he joined the Cluster of Excellence Cognitive Interaction Technology at Bielefeld University and is heading the Central Lab Facilities. His research interests include cognitive and perceptive systems, human-machine communication, and the interaction of speech and visual processing.