
Building descriptive and discriminative visual codebook for large-scale image applications

Qi Tian & Shiliang Zhang & Wengang Zhou &

Rongrong Ji & Bingbing Ni & Nicu Sebe

Published online: 18 November 2010. © Springer Science+Business Media, LLC 2010

Abstract Inspired by the success of textual words in large-scale textual information processing, researchers are trying to extract visual words from images that function similarly to textual words. Visual words are commonly generated by clustering a large number of image local features, and the cluster centers are taken as visual words. This approach is simple and scalable, but it results in noisy visual words. Many works have been reported that try to improve the descriptive and discriminative ability of visual words. This paper gives a comprehensive survey on visual vocabulary and details several state-of-the-art algorithms. A comprehensive review and summarization of the related works on visual vocabulary is first presented. Then, we introduce our recent algorithms on descriptive and discriminative visual word generation, i.e., latent visual context analysis for descriptive visual word identification [74], descriptive visual word and visual phrase generation [68], the contextual visual vocabulary which combines both semantic contexts and spatial contexts [69], and visual vocabulary hierarchy optimization [18]. Additionally, we introduce two interesting post-processing strategies to further improve the performance of visual vocabulary: spatial coding [73], which efficiently removes mismatched visual words between images for more reasonable image similarity computation, and user-preference-based visual word weighting [44], which makes the image similarity computed from visual words more consistent with users' preferences and habits.

Multimed Tools Appl (2011) 51:441–477. DOI 10.1007/s11042-010-0636-6

Q. Tian (*), Computer Science Department, University of Texas at San Antonio, San Antonio, TX 78249, USA. e-mail: [email protected]

S. Zhang, Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China. e-mail: [email protected]

W. Zhou, EEIS Department, University of Science and Technology of China, Hefei 230027, China. e-mail: [email protected]

R. Ji, Harbin Institute of Technology, Harbin 150001, Heilongjiang, China. e-mail: [email protected]

B. Ni, National University of Singapore, 4 Engineering Drive 3, Singapore 117576, Singapore. e-mail: [email protected]

N. Sebe, Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 14, 38100 Povo, Trento, Italy. e-mail: [email protected]

Keywords Visual vocabulary · Large-scale image retrieval · Image search re-ranking · Feature space quantization

1 Introduction

Nowadays, due to the fast development of the Internet, personal computers, and digital devices such as digital cameras, digital video recorders, smart phones, etc., the amount of multimedia data available on the Internet has been increasing explosively. For example, websites such as YouTube, Flickr, and Facebook host billions of images and millions of hours of digital video, and users keep uploading photos and videos every minute. Current systems commonly utilize textual information such as the title and surrounding text to index these data, and text-based search engines such as Bing and Google are the dominant tools for image and video search. Although these systems have been successfully used for many years, they still present shortcomings because they are largely based on textual information, and image content only serves as an auxiliary clue. Note that the captions and surrounding textual descriptions of these large-scale multimedia data may be missing or noisy, while manually labeling or refining them is expensive. Therefore, it is urgent and meaningful to develop automatic visual content based multimedia data indexing, processing, and searching strategies.

Traditional textual information retrieval is successful in processing large-scale textual data. For example, the Google and Bing search engines can answer users' textual queries instantaneously and accurately over billions of web pages. In textual information retrieval, textual words, which are compact and descriptive, are used as the basic features of documents. Inspired by the success of textual words, researchers are trying to identify basic visual elements from images, namely visual words [45, 55], which could function just like textual words. With descriptive visual words, the well-developed algorithms for textual information retrieval can be leveraged for computer vision and multimedia tasks. Moreover, with visual words as the image feature, the current information retrieval framework can be leveraged for large-scale image applications.

Traditionally, visual words are created by clustering a large number of local features, such as SIFT [36], in unsupervised ways. After that, each cluster center is taken as a visual word, and a corresponding visual vocabulary is generated. With the visual vocabulary, an image can be transformed into a Bag-of-visual-Words (BoW) representation. This is simply achieved by extracting image local features and replacing them with their nearest visual words. Based on the visual vocabulary representation, we can refer to document analysis methods, such as TF-IDF [51], pLSA [13], LDA [3], and LSI [10], for image applications. Thanks to its scalability and simplicity, the BoW representation has become very popular in computer vision and visual content analysis in recent years. With such a representation, researchers have developed various successful algorithms for computer vision tasks such as video event detection [58, 62], object recognition [5, 7, 21, 26, 32–34, 40, 43, 46, 53, 64, 71], image annotation [34, 61], etc.

The BoW representation has been illustrated as one of the most promising solutions for large-scale near-duplicate image and video retrieval [15–17, 45, 55, 60, 62, 68–70]. Near-duplicate image retrieval differs from common image retrieval in that the target images are usually obtained by editing the original image with changes in color, scale, partial occlusion, etc. Currently, most large-scale image retrieval works focus on large-scale near-duplicate image retrieval [15–17, 45, 55, 62, 68–70].
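For concreteness, the following is a minimal sketch of the BoW pipeline described above, assuming a precomputed set of SIFT-like descriptors and a flat visual vocabulary of cluster centers; the function and variable names are illustrative and not taken from any of the cited systems.

```python
import numpy as np

def bow_histogram(descriptors, visual_words):
    """Assign each local descriptor to its nearest visual word and
    return the normalized Bag-of-visual-Words histogram."""
    # pairwise squared Euclidean distances: (num_descriptors, num_words)
    d2 = ((descriptors[:, None, :] - visual_words[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                       # hard assignment to nearest word
    hist = np.bincount(nearest, minlength=len(visual_words)).astype(float)
    return hist / max(hist.sum(), 1.0)                # term-frequency vector

# toy usage: 200 SIFT-like descriptors quantized against a 50-word vocabulary
rng = np.random.default_rng(0)
print(bow_histogram(rng.random((200, 128)), rng.random((50, 128))).shape)   # (50,)
```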

Building a visual vocabulary to translate image local features [36] into visual words is the most crucial step in the BoW representation. Many large-scale retrieval applications involve gigantic databases. To ensure efficient online search in these cases, hierarchical quantization is usually adopted to generate classic visual words. The dominant methods [45, 48, 54, 55] are hierarchical quantization approaches, such as the Vocabulary Tree (VT) [45], Approximate K-Means [48], the K-D Tree [36], and their variants [9, 15, 21, 57, 63]. Typically, these approaches quantize image descriptors using a hierarchical subspace division (such as hierarchical K-means clustering) to produce visual words.
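The hierarchical quantization idea can be sketched as a small recursive routine in the spirit of the Vocabulary Tree, assuming a toy k-means and a small branching factor and depth; this illustrates the general scheme only and is not the implementation of [45] or [48].

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """A tiny k-means used only to illustrate the hierarchical split."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers, labels

def build_tree(X, branch=4, depth=3):
    """Recursively divide the descriptor set; leaf centers act as visual words."""
    if depth == 0 or len(X) <= branch:
        return {"center": X.mean(0), "children": None}
    _, labels = kmeans(X, branch)
    children = [build_tree(X[labels == j], branch, depth - 1)
                for j in range(branch) if np.any(labels == j)]
    return {"center": X.mean(0), "children": children}

def quantize(x, node):
    """Greedily descend the tree and return the leaf (visual word) center."""
    while node["children"] is not None:
        node = min(node["children"], key=lambda c: np.sum((x - c["center"]) ** 2))
    return node["center"]
```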

It is simple and scalable to generate classic visual words with unsupervised hierarchical subspace division. However, many works show that the classic visual words are not as descriptive as desired. For example, in [69], Zhang et al. show that the classic visual words are noisy and that there are many mismatched visual words between two images. We argue that the ineffectiveness of the traditional visual vocabulary is largely due to three innate shortcomings: 1) the extracted local features are not stable, i.e., some local features are sensitive to affine transformations; 2) the classic visual words are generated from single local features, so they cannot preserve enough spatial information; and 3) the clustering process is unsupervised, so the generated visual words might not be semantically reasonable, and many noisy visual words can be generated. The reasons for the low descriptive power of visual words are still being studied by researchers in related fields, and the ineffectiveness is not limited to these three reasons. Generating visual vocabularies with more descriptive power is still an active and popular research topic in the computer vision, pattern recognition, and multimedia research communities.

In this paper, we introduce our recent algorithms for descriptive and discriminative visual word generation. Latent visual context analysis is proposed for identifying visual words representative of a certain image set returned by image search engines. The identified representative visual words are then utilized to re-rank the image set so that the top-ranked images are more relevant to the query. Novel visual word and visual word pair selection algorithms are proposed to obtain descriptive visual words and descriptive visual phrases, which function more like textual words and textual phrases. Previous approaches for spatial context modeling commonly combine several visual words together. However, visual word combination magnifies the quantization error introduced in visual vocabulary generation and degrades the performance of the corresponding combinations. To suppress such quantization error, the contextual visual vocabulary is proposed by clustering local features in groups rather than as single local features. Therefore, the spatial contextual information in local feature groups can be kept, and the semantic contextual information can be preserved in the learned group distance metric. Meanwhile, we also propose to optimize the visual word hierarchy to overcome the shortcomings of hierarchical K-means clustering based visual word generation, e.g., the asymmetric division of the feature space and the magnification of this asymmetric division in hierarchical clustering, which are also closely related to the ineffectiveness of the classic visual vocabulary.

Besides the algorithms for descriptive and discriminative visual word generation, we also propose novel post-processing strategies to further improve the performance of the visual vocabulary in visual matching and image retrieval. Spatial coding is proposed to filter the mismatched visual words between two images based on spatial consistency clues. After removing the mismatched visual words, the influence of cluttered backgrounds can be suppressed, and hence the similarity computed between two images becomes more reasonable. We also study how to combine user preference information in image similarity computation. A user preference and habit based visual word weighting strategy is proposed to make the similarity computed between two images more consistent with users' preferences and habits. The novel weighting strategy shows promising performance in image retrieval applications.

In addition to the introduction of our recent algorithms, this paper also gives a comprehensive survey on visual vocabulary. The related works are reviewed, compared, and summarized into different categories. This paper is organized as follows. Section 2 introduces and summarizes the related work on visual codebook generation. We introduce our recent works on descriptive and discriminative visual codebook generation in detail in Section 3. Our algorithms for visual codebook post-processing are introduced in Section 4. Section 5 concludes this paper.

2 Related work

There are many ways to improve the discriminative and descriptive power of visual words, such as optimizing visual word generation, combining visual words into high-order visual words, and post-processing with geometric verification or visual context analysis. In the following, the related work is reviewed.

2.1 Optimizing hierarchical K-means based visual vocabulary

To overcome the shortcomings of unsupervised K-means clustering based visual vocabulary, many works have proposed novel feature quantization algorithms [26, 43, 46], targeting more effective and discriminative visual vocabularies. For example, an interesting work is reported by Lazebnik et al. [26]: using the results of K-means as initializations, the authors generate discriminative vocabularies according to the Information Loss Minimization theory. In [43], the Extremely Randomized Clustering Tree is proposed for visual vocabulary generation, which shows promising performance in image classification. To reduce the quantization error, soft-quantization [15, 49] quantizes a SIFT descriptor to multiple visual words. Query expansion [5] reissues the highly ranked images from the original query as new queries to boost recall; however, it may fail on queries with poor initial recall. To improve precision, Hamming Embedding [16] enriches the visual word with compact information from its original local descriptor via Hamming codes, and feature scale and orientation values are used to filter false matches. Although many approaches have been proposed for effective visual vocabularies and show impressive performance in many vision tasks, most of them are expensive to compute and are designed for small-scale applications. Moreover, most of the generated vocabularies are oriented toward specific problems (e.g., image classification, object recognition, etc.); thus they are still not comparable with textual words, which can be used as effective features and perform impressively in various information retrieval applications.
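As a brief illustration of the soft-quantization idea [15, 49], a descriptor can be distributed over its k nearest visual words with distance-based weights instead of a single hard assignment; the Gaussian kernel and its width sigma below are illustrative choices, not the exact weighting of the cited works.

```python
import numpy as np

def soft_assign(descriptor, visual_words, k=3, sigma=0.2):
    """Return the indices of the k nearest visual words and their soft weights."""
    d2 = ((visual_words - descriptor) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    w = np.exp(-d2[nearest] / (2.0 * sigma ** 2))     # closer words get larger weights
    return nearest, w / w.sum()
```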

In addition, since the visual vocabulary is commonly generated by hierarchically dividing the feature space, its hierarchical structure is closely associated with the image retrieval performance. Such a hierarchical structure contains the genesis of many crucial problems in patch-based visual retrieval, such as quantization errors [21], term weighting efficiency [45], middle-level effectiveness [45], and model flexibility. However, in-depth optimization of this hierarchical construction process is left without consideration in former papers [10, 15, 21, 45, 57, 63]. Recent research has already revealed the negative effect of hierarchical quantization bias in visual word generation. For instance, Philbin et al. [49] adopted visual word fuzzy matching to alleviate the negative impact of hierarchical quantization in Approximate K-Means. Nister et al. [45] combined the middle-level nodes of the Vocabulary Tree into a unified BoW vector for ranking similarities. Yang [63] investigated the effects of vocabulary size, term weighting, and stop-word removal on PASCAL and TRECVID. However, to the best of our knowledge, how to optimize the hierarchical vocabulary structure during the quantization process has not yet been considered in the literature. Current methods usually resort to refining the visual words at the lowest level to improve retrieval [30, 36, 41, 54, 55], without a feasible way to directly optimize the vocabulary hierarchy during construction. In texton codebook generation, Jurie et al. [21] adopted scalable acceptance-radius clustering to refine the K-means clustering bias in texton generation. Jegou et al. [15] investigated a two-layer clustering scheme together with a contextual distance measurement to learn a similarity metric in building the visual vocabulary. However, the effectiveness of these methods in deeper hierarchies and scalable databases is restricted, especially for large-scale applications with millions of descriptors. Learning-based word selection is another solution [29, 57] in a supervised scenario: Wang et al. [57] proposed a codeword selection strategy by boosting among leaf nodes; Leung et al. [29] investigated mutual information, odds ratio, and linear SVMs to optimize the original codebook. However, due to the restriction of training examples, a supervised learning strategy is unsuitable for large-scale retrieval tasks. Further problems come from their generality, in which the learned visual vocabularies and similarity measurements cannot be directly reapplied to a new database or maintained in an incremental database. In [18], Ji et al. presented an unsupervised metric learning approach to refine the original visual codebook distribution, making it more similar to the textual word distribution. Based on such refinement, they have reported better performance than state-of-the-art codebook models. In addition, Ji et al. [19] proposed a vocabulary shift model to automatically adapt an original visual vocabulary across different databases.

In large-scale image retrieval systems, the state-of-the-art approaches leverage scalable textual retrieval techniques for image search. Similar to textual words in information retrieval, local SIFT descriptors [36] are quantized to visual words. Inverted file indexing is then applied to index images via the contained visual words [45]. However, the discriminative power of visual words is far less than that of textual words due to quantization. Unlike a textual word in information retrieval, a visual word carries no semantic information, and consequently it is difficult to determine the optimal size of the visual codebook. In practice, the visual codebook size is usually determined by experimental study, and with the increasing size of the image database to be indexed (e.g., greater than one million images), the discriminative power of visual words decreases sharply. Visual words usually suffer from the dilemma of discrimination versus ambiguity. On one hand, if the visual word codebook is large enough, the ambiguity of features is mitigated and different features can be easily distinguished from each other; however, similar descriptors polluted by noise may be quantized to different visual words. On the other hand, the variation of similar descriptors is diluted when using a small visual codebook; therefore, different descriptors may be quantized to the same visual word and cannot be discriminated from each other. More discussion about the effect of codebook size can be found in [9].
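The inverted-file indexing and TF-IDF weighting mentioned above can be sketched as follows; the data layout (a per-image dictionary of visual word term frequencies) is an assumption made for illustration.

```python
import math
from collections import defaultdict

def build_inverted_index(image_bows):
    """image_bows: {image_id: {word_id: term_frequency}}.
    Returns the inverted file {word_id: [(image_id, tf), ...]} and IDF weights."""
    inverted = defaultdict(list)
    for img, bow in image_bows.items():
        for word, tf in bow.items():
            inverted[word].append((img, tf))
    n_images = len(image_bows)
    idf = {w: math.log(n_images / len(posts)) for w, posts in inverted.items()}
    return inverted, idf

def score_query(query_bow, inverted, idf):
    """Accumulate TF-IDF scores over the postings of the query's visual words."""
    scores = defaultdict(float)
    for word, q_tf in query_bow.items():
        for img, tf in inverted.get(word, []):
            scores[img] += q_tf * tf * idf.get(word, 0.0) ** 2
    return sorted(scores.items(), key=lambda kv: -kv[1])
```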


2.2 Supervised visual vocabulary generation

Serving as the foundational component in state-of-the-art visual search and recognition systems, most existing visual vocabularies are built solely on the visual statistics of local image patches [14, 16, 18, 45, 48]. In contrast, in most visual search and recognition scenarios, the quantized codewords are processed together in the subsequent steps (such as similarity ranking or classifier training), which place more emphasis on their semantic discriminability. In such a case, correlative semantic labels serve as important cues to guide the quantizer to build more semantically discriminative codewords. Consequently, it is natural to embed correlative labels for supervised vocabulary construction.

In the literature, there are former works on learning-based codebook construction [26, 31, 39, 42]. Mairal et al. [39] built a supervised vocabulary by learning discriminative and sparse coding models for object categorization. Lazebnik et al. [26] proposed to construct supervised codebooks by minimizing mutual information loss to index fully labeled data. Moosmann et al. [42] proposed an ERC-Forest that considers semantic labels as stopping tests in building supervised indexing trees. Another group of related works [34, 47, 67] refines (merges or splits) the initial codewords to build class(image)-specific vocabularies for categorization. Similar works can also be found in Learning Vector Quantization [24, 26, 50] in data compression, which adopts self-organizing maps [25] or regression loss minimization [50] to build codebooks that minimize training data distortion after compression. There are also previous works on learning visual parts [1, 27] by clustering local patches with spatial configurations. Recently, Ji et al. [19] presented a semantic embedding framework to integrate semantic information from Flickr labels for supervised vocabulary construction. Their main contribution is a Hidden Markov Random Field model to supervise feature space quantization, with special consideration of label correlations: local visual features are modeled as an Observed Field, which follows visual metrics to partition the feature space; semantic labels are modeled as a Hidden Field, which imposes generative supervision on the Observed Field with WordNet-based correlation constraints as a Gibbs distribution.

In addition, to generate the visual vocabulary from single local image descriptors, K-means clustering commonly employs a general distance metric, such as the Euclidean distance or the L1-norm, to cluster or quantize the local features. This is unsatisfactory since it largely neglects the semantic contexts of the local features. With a general distance metric, local visual features with similar semantics may be far away from each other, while features with different semantics may be close to each other. For instance, with unsupervised clustering, local features with similar semantics can be clustered into different visual words, while local features with different semantics can be assigned to the same visual word. This defect results in incompact and non-descriptive visual words, which are also closely related to the mismatches that occur between images. There have been some previous works attempting to address this phenomenon by posing supervised distance metric learning [34, 61, 66, 71]. In [34], the classic visual vocabulary is used as the basis, and a semantically reasonable distance metric is learned to generate a more effective high-level visual vocabulary. However, the generated visual vocabulary is small-scale problem oriented. In a recent work [61], the authors capture the semantic context in each object category by learning a set of reasonable distance metrics between local features. Then, semantic-preserving visual vocabularies are generated for different object categories. Experiments on a large-scale image database demonstrate the effectiveness of the proposed algorithm in image annotation. However, the codebooks in [34] are created for individual object categories, so they are not universal and general enough, which limits their applications. Generally, although promising progress has been made, most of these methods are oriented toward small-scale problems [34, 66, 71] or do not take the spatial contexts into consideration [34, 61].

2.3 High order visual word generation

It has been illustrated that a single local feature cannot preserve the spatial information in images, which has been proven important for visual matching and recognition [55, 59, 60, 68, 69, 71, 72]. Therefore, losing spatial information is one of the important reasons for the low descriptive power of classic visual words. To combine BoW with spatial information, many works combine multiple visual words with spatial information [2, 35, 38, 53, 59, 60, 64, 66, 70, 71]. This may be achieved, for example, by using feature pursuit algorithms such as AdaBoost [56], as demonstrated by Liu et al. [32]. The visual word correlogram and correlation [53], which are leveraged from the color correlogram, are utilized to model the spatial relationships between visual words for object recognition in [53]. In recent work [60], visual words are bundled, and the corresponding image indexing and visual word matching algorithms are proposed for large-scale near-duplicate image retrieval. Proposed as descriptive visual word pairs in [66, 68], the Visual Phrase captures the spatial information between two visual words and presents better discriminative ability than the traditional visual vocabulary in object categorization tasks. Generally, considering visual words in groups rather than as single visual words can effectively capture the spatial configuration among them. However, the quantization error introduced during visual vocabulary generation may degrade the matching accuracy of visual word combinations. Intuitively, after quantization, local features that are near each other in the descriptor space may be quantized into different visual words, and this error may be magnified by general visual word combinations [69]. To overcome the magnified quantization error, Zhang et al. proposed the contextual visual vocabulary, which first groups several local features together and then quantizes the local feature groups into visual words [69]. Therefore, the spatial information within each group can be preserved in the generated visual words. More importantly, since there is no visual word combination operation, the quantization error can be suppressed. Experiments show that the contextual visual vocabulary outperforms the bundled feature [60].

2.4 Post processing

Unlike textual words in information retrieval [35], the geometric relationship among visual words plays a very important role in identifying images. Geometric verification [15, 16, 48, 60] has recently become very popular as an important post-processing step to improve the retrieval precision. However, due to the expensive computational cost of full geometric verification, it is usually applied only to some top-ranked candidate images. In web image retrieval, however, the number of potential candidates may be very large. Therefore, applying full geometric verification only to the top-ranked images may be insufficient for sound recall.

Geometric information of local features plays a key role in image identification. Although exploiting geometric relationships with full geometric verification (RANSAC) [8, 16, 36, 48] can greatly improve retrieval precision, full geometric verification is computationally expensive. In [30, 55], local spatial consistency from some spatial nearest neighbors is used to filter false visual-word matches. However, the spatial nearest neighbors of local features may be sensitive to the image noise incurred by editing. In [60], the bundled feature approach groups features in local MSER [41] regions to increase the discriminative power of local features. The matching score of bundled feature sets is used to weight the visual word votes for image similarity. Since false feature matches between bundles still exist, the bundle weight will be degraded by such false matches.

In the field of vocabulary post-refinement, VisualRank [20] builds an image graph, intuitively determines the pair-wise image similarity by the number of shared SIFT features, and computes the image rank value directly through an iterative procedure similar to PageRank [4]. Similar algorithms are proposed by Wang et al. [59] and Zhang et al. [68] to rank the importance of visual words. There is an underlying assumption in VisualRank that all matched local features in an image are equally important. In fact, given an image set returned by a text-based image search engine, some local features are expected to be more discriminative than others. Therefore, it is preferable to give these discriminative features a larger weight, which can be achieved by analyzing both the latent semantic context and the visual word link context to infer image significance.

Moreover, based on BoW, many topic models, such as Latent Semantic Analysis (LSA) [6], Probabilistic Latent Semantic Analysis (pLSA) [12], and Latent Dirichlet Allocation (LDA) [3], can be used to analyze the topics within images. As generative data models, pLSA and LDA are based on statistical foundations and require the number of latent topics to be determined beforehand. LSA, instead, is based on Singular Value Decomposition (SVD) and explores the higher-order semantic structure in an implicit manner without knowing the number of latent topics. In this paper, LSA is adopted to explore the underlying implicit semantic context in conjunction with visual words and to generate visual word similarity in the latent semantic sense.

Further, another feasible alternative is to regard the context relationships between images and visual words as visual hyperlinks and to construct a visual word link graph and an image link graph, respectively. Then, the visual significance discovery for visual words and images is fulfilled by analyzing the corresponding visual graphs. In [22, 23], link analysis techniques are also used for the recognition of object categories. Without quantizing local features to visual words, this approach constructs a visual similarity network (VSN), where the nodes are features extracted from all images and the edges link features that have been matched across images. The weight of a graph edge reflects the correspondence consistency and is obtained with a spectral technique [28]. Both Kim's link analysis approach and our method are based on graph link analysis, but the link relationship definitions are quite different. Although good performance is achieved in object categorization, that approach ignores the local geometric constraints of features in images. Besides, without feature quantization it does not suffer from quantization error, but its time cost is expensive. Also, the work in [60] makes use of MSER regions [41] to impose local geometric constraints.

3 Descriptive and discriminative visual codebook construction

In this part, we will introduce our recent works [18, 68–70, 72, 74] for descriptive and discriminative visual vocabulary generation in detail.

3.1 Latent visual context analysis for descriptive visual word identification

Latent Visual Context Analysis (LVCA) is proposed in [74] for selecting the most descriptive and discriminative visual words. It mainly consists of two parts: visual word context analysis and image context analysis. Visual word context analysis is carried out from two perspectives, namely, the latent semantic topic context and the visual word link graph. Then Visual Word Rank is performed to discover the visual word significance, which is adopted for image context learning with image link graph analysis.

3.1.1 Visual word context analysis

Visual word context is learned through visual word graph analysis. We construct a visual word graph, where each node denotes a visual word and the edge linking two nodes is weighted by their similarity. The problem then becomes how to define the pair-wise visual word similarity. We propose to formulate it from two perspectives: the latent semantic topic and the visual word link graph. After that, Random Walk [4, 20] can be employed to discover the significance of the graph nodes.

Visual word similarity decomposition Visual similarity is related to human psychological cognition, which is a very complex process to simulate. We approach it from two kinds of visual context: the first is related to latent semantic analysis, and the second is about the visual link graph.

Before re-ranking, it is necessary to explore the pair-wise relationships between visual words, in other words, the similarity between visual words. We propose to formulate the similarity definition for a visual word pair (i, j) as follows,

W_{VW}(i, j) = \alpha \cdot W^{s}_{VW}(i, j) + (1 - \alpha) \cdot W^{g}_{VW}(i, j)    (1)

where W^{s}_{VW} is formulated with the latent-semantics-related visual context, W^{g}_{VW} is defined with the visual-link-graph-related context, and \alpha is a weighting factor with range 0 < \alpha < 1. In the following subsections, we explain the formulation of these two decomposed similarity components for visual words.

Latent semantic similarity analysis Since the image collection is returned by text query retrieval, it is reasonable to assume that there exist some topics among these images. The latent semantic context beneath the visual word-image relationship can be explored by means of the semantic model LSA [6].

According to LSA, given a visual word-image matrix M_0 of size m \times n, each column of which is a normalized histogram of visual word occurrences in the corresponding image, it can be decomposed into the product of three other matrices by singular value decomposition (SVD) as follows,

M_0 = T_0 \cdot S_0 \cdot D_0^{T}    (2)

where T_0 and D_0 are column-orthonormal matrices and S_0 is a diagonal matrix with all diagonal elements positive and in decreasing order. The sizes of T_0, S_0, and D_0 are m \times k, k \times k, and n \times k, respectively. To maintain the real data structure and at the same time ignore sampling errors or unimportant details, only the top t (t < k) largest diagonal elements of S_0 are kept while the remaining smaller ones are set to zero. This is equivalent to deleting the zero rows and columns of S_0 to obtain a compact matrix S and deleting the corresponding columns of T_0 and D_0 to yield T and D, respectively. Consequently, a reduced matrix M is defined as follows,

M = T \cdot S \cdot D^{T}    (3)

Assume that row-normalizing M yields M_R. Then the dot product between two row vectors of M_R reflects the extent to which two terms have a similar pattern of occurrence across the set of images [6]. Therefore, the pair-wise row vector similarity matrix of M_R can be defined as,

U = M_R \cdot (M_R)^{T}    (4)

Accordingly, we can define the pair-wise visual word similarity in the sense of the latent semantic context as follows,

W^{s}_{VW}(i, j) = h(U(i, j))    (5)

where h(\cdot) is a non-decreasing function.
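A minimal numpy sketch of Eqs. 2–5 is given below, assuming the visual word-image matrix has images as columns and taking h(·) as the identity function.

```python
import numpy as np

def lsa_word_similarity(M0, t=10):
    """M0: m x n visual word-image matrix (each column a normalized BoW histogram).
    Keeps the top-t singular values and returns the m x m matrix U of Eq. 4."""
    T0, s0, D0t = np.linalg.svd(M0, full_matrices=False)    # Eq. 2
    M = T0[:, :t] @ np.diag(s0[:t]) @ D0t[:t, :]             # Eq. 3, reduced matrix
    MR = M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)  # row-normalize
    return MR @ MR.T                                         # Eq. 4; W_s(i, j) = h(U(i, j))
```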

Visual word link graph Generally, visual words are intrinsically related through images and work as visual concept carriers. Usually, a visual concept is composed of a set of visual atoms under some geometric constraints. Therefore, it is necessary to incorporate the local geometric relationships among visual words to analyze the visual link context between visual words. The MSER region [41] can serve for this task.

Consequently, a four-layer visual word graph is constructed. As illustrated in Fig. 1, there are two intermediate layers, i.e., the image layer and the MSER layer. Visual words do not transit to each other directly. Instead, a visual word V_i first transits to an image that contains V_i, then further to an MSER region in that image, and finally to another visual word V_j that shares the same MSER region. The intuition behind this is that if a user is viewing an image, he or she is most likely attracted by some local features, and other local features within the neighborhood may also be of interest. Such transition behavior naturally reflects the co-occurrence context of visual words.

Based on the visual word graph, we essentially define a propagation probability matrix W on the edges of the graph as follows,

P(V_j \mid V_i) = \sum_{k=1}^{N} \sum_{t=1}^{N_k} P(V_j \mid M_{k,t}) \cdot P(M_{k,t} \mid I_k) \cdot P(I_k \mid V_i)    (6)

where N denotes the total number of images to be ranked, N_k denotes the number of MSER regions in the k-th image, and M_{k,t} denotes the set of visual words in the t-th MSER region of the k-th image. P(V_j | M_{k,t}) is defined as the normalized term frequency of the visual word V_j in M_{k,t}; P(M_{k,t} | I_k) is defined as the normalized MSER-frequency of M_{k,t} in image I_k; and P(I_k | V_i) is defined as the inverse image frequency of visual word V_i for image I_k: P(I_k | V_i) = 1 / N(V_i), where N(V_i) is the total number of images that contain visual word V_i.

Fig. 1 An illustration of the four layers (visual word, image, MSER region, visual word) between two nodes in the visual word link graph. The red ellipses in layer 3 denote detected MSER regions

Further, the visual link similarity for visual words can be defined as follows,

W^{g}_{VW}(i, j) = P(V_j \mid V_i)    (7)
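The propagation probability of Eq. 6 can be sketched as below, assuming each image is given as a list of MSER regions and each region as a list of visual word ids; P(M_{k,t} | I_k) is simplified here to a uniform distribution over the regions of an image, and the transition from V_i is restricted to images containing V_i.

```python
import numpy as np

def word_transition_matrix(images, n_words):
    """images: list of images, each a list of MSER regions, each a list of word ids.
    Returns P with P[i, j] approximating P(V_j | V_i) of Eq. 6."""
    n_img_with_word = np.zeros(n_words)                  # N(V_i)
    for regions in images:
        for w in {w for reg in regions for w in reg}:
            n_img_with_word[w] += 1

    P = np.zeros((n_words, n_words))
    for regions in images:
        n_regions = max(len(regions), 1)
        present = {w for reg in regions for w in reg}
        for reg in regions:
            if not reg:
                continue
            tf = np.bincount(reg, minlength=n_words) / len(reg)    # P(V_j | M_k,t)
            for vi in present:                                      # P(I_k | V_i) = 1 / N(V_i)
                P[vi] += tf * (1.0 / n_regions) * (1.0 / n_img_with_word[vi])
    return P
```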

Visual word rank After obtaining W^{s}_{VW} and W^{g}_{VW}, W_{VW} is calculated with Eq. 1. Similar to VisualRank [20], the idea of PageRank [4] can be adopted to discover the visual word significance. Consequently, the visual word significance vector R is iteratively defined as follows,

R = d \cdot W^{*} \cdot R + (1 - d) \cdot p, \quad \text{where } p = \left[\tfrac{1}{n}\right]_{n \times 1}    (8)

where W^{*} is the column-normalized version of the transposition of W_{VW}, p is a distracting vector for random-walk behavior [20], and d is a constant damping factor. The iteration of Eq. 8 is considered converged when the change of R is small enough or a maximal iteration number, such as 100, is reached.
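A power-iteration sketch of Eq. 8 is given below; the same routine can be reused for the image rank of Eq. 12 by passing W_Img instead of W_VW.

```python
import numpy as np

def random_walk_rank(W, d=0.85, tol=1e-6, max_iter=100):
    """Iterate R = d * W_star * R + (1 - d) * p (Eq. 8), where W_star is the
    column-normalized transpose of W and p is the uniform distracting vector."""
    n = W.shape[0]
    W_star = W.T / np.maximum(W.T.sum(axis=0, keepdims=True), 1e-12)
    R = np.full(n, 1.0 / n)
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        R_new = d * W_star @ R + (1 - d) * p
        if np.abs(R_new - R).sum() < tol:
            return R_new
        R = R_new
    return R
```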

3.1.2 Image context analysis

Since images are represented by visual words, the image visual context can also be deduced from the significance of visual words. We explore the image context through an image link graph with visual words as a latent layer, and formulate the weight of a graph edge with the visual word significance. Then, the image significance is also discovered with the random walk re-ranking method [4, 20].

Image link graph Generally, images are related through the intermediate medium of visual words, which work as visual hyperlinks. Different visual words cast different votes to the images that contain them, according to their significance. These context relationships can be represented with a three-layer graph, as illustrated in Fig. 2.

Fig. 2 An illustration of the three layers (image, visual word, image) between two nodes in the image link graph

Based on the above discussion, the image transition probability from image I_i to image I_j can be defined as follows,

P(I_j \mid I_i) = \frac{1}{N_i} \sum_{V_k \in I_j, I_i} P(I_j \mid V_k) \cdot P(V_k \mid I_i) \cdot f(R_k)    (9)

where P(I_j | V_k) is defined as the inverse image frequency of visual word V_k in image I_j, P(V_k | I_i) denotes the normalized term frequency of visual word V_k in image I_i, R_k is the significance value of visual word V_k obtained with Eq. 8, f(\cdot) is a non-decreasing function, and N_i is a normalization factor such that the sum of the transition probabilities from the i-th image to all other images is one.

Image rank From the discussion above, the image transition probability is obtained. However, it does not necessarily define the pair-wise image similarity, since the more features an image contains, regardless of their importance, the larger the probability that it is propagated to from other images. Therefore, a regularization term should be included to penalize images with too many features. We formulate the image visual similarity heuristically as follows,

W_{Img}(i, j) = P(I_j \mid I_i) \cdot \tau(I_j)    (10)

where \tau(I_j) is the regularization term for the j-th image, defined as

\tau(I_j) = \frac{1}{\sqrt{\left(N(I_j) + \bar{N}\right) / n}}    (11)

where N(I_j) denotes the number of features in the j-th image, \bar{N} is the average number of local features per image, and n is a constant.

Similar to the visual word rank, we also adopt the idea of PageRank to explore the image significance. Consequently, the image significance S_{Img} is iteratively defined as follows,

S_{Img} = d \cdot U \cdot S_{Img} + (1 - d) \cdot p, \quad \text{where } p = \left[\tfrac{1}{n}\right]_{n \times 1}    (12)

where U is the column-normalized version of the transposition of W_{Img} defined in Eq. 10. The convergence condition is the same as for Eq. 8. For more details, refer to [74].

3.2 Descriptive visual words and descriptive visual phrases

The Descriptive Visual Words (DVWs) and Descriptive Visual Phrases (DVPs) [68] are proposed based on the classic visual words [45]. The basic idea is that, because images are carriers of different visual objects or visual scenes, the visual elements and their combinations that are descriptive of certain objects or scenes can be selected as DVWs and DVPs, respectively. The DVWs and DVPs composed of these elements and combinations function more like textual words than the classic visual words because: 1) they are compact in describing specific objects or scenes; 2) only unique and effective visual elements and combinations are selected, which significantly reduces the negative effects of visual features from background clutter; and 3) being based on a large-scale image training set containing various scenes and objects, DVWs and DVPs may present better descriptive ability for the real world and can be scalable and applicable to various applications.

To gather reliable statistics on a large-scale image dataset, we collected about 376,500 images belonging to 1,506 object and scene categories by downloading and selecting images from Google Image. A classic visual word vocabulary is first generated from the collected image database. Then, the classic visual words extracted from each category are considered as the DVW candidates for the corresponding category. DVP candidates in each category are generated by identifying the co-occurring visual word pairs within a certain spatial distance. A novel visual-word-level ranking algorithm, VisualWordRank, which is similar to PageRank [4] and VisualRank [20], is proposed for identifying and selecting DVWs. Based on the proposed ranking algorithms, DVWs and DVPs for different objects or scenes are discriminatively selected. The final DVW and DVP set is generated by combining all the selected candidates across the different categories. In the next part we will introduce the generation of DVWs and DVPs in detail.

3.2.1 Descriptive visual word selection

DVWs are designed to capture certain objects or scenes; thus several unique properties are desired of them: 1) if an object or scene appears in some images, the DVWs descriptive of it should appear more frequently in these images, and they should be less frequent in images that do not contain that object or scene; 2) they should be frequently located on the object or scene, even when the scene or object is surrounded by cluttered background. Inspired by PageRank [4] and VisualRank [20], we design a novel visual-word-level ranking algorithm, VisualWordRank, to effectively incorporate these two criteria for DVW selection.

According to the first criterion, the frequency-of-occurrence information of DVW candidates in the total image set and in each individual image category is important for identifying DVWs. Besides the frequency information of single visual words, if two visual words frequently co-occur within a short spatial distance in images containing the same object or scene, strong spatial consistency can be inferred between them in such images. Considering that these images contain the same object but different backgrounds, the spatially consistent visual words are more likely to be located on the object.

Therefore, we combine two clues to identify DVWs: 1) each DVW candidate's frequency information, and 2) its co-occurrence with other candidates. This can be formalized as a visual word ranking problem which is very similar to the problem of webpage ranking. Thus, we propose the VisualWordRank algorithm, which leverages the idea of the well-known PageRank [4]. Based on the same idea as PageRank, for an image category C we build a VWnum(C) \times VWnum(C) matrix R^{(C)} to combine the frequency and co-occurrence clues for DVW selection, where VWnum(C) is the number of DVW candidates for category C. In matrix R^{(C)} we define the diagonal elements as:

R^{(C)}_{i,i} = f^{(C)}_{i} \cdot \ln\!\left(1 / F_{i}\right)    (13)

where i is a DVW candidate, and F_i and f^{(C)}_{i} denote its average frequency over all categories and its within-category frequency in category C, respectively. R^{(C)}_{i,i} stands for the inherent importance of candidate i; thus, i is inherently more significant to category C if R^{(C)}_{i,i} has a larger value. f^{(C)}_{i} and F_i are computed beforehand when transforming the images in the training dataset into BoW.

The non-diagonal element R^{(C)}_{i,j} is defined as the average co-occurrence frequency of visual words i and j in image category C:

R^{(C)}_{i,j} = T^{(C)}_{i,j}    (14)

where T^{(C)}_{i,j} is from the DVP candidate computation in Section 3.2.1. With the matrix R^{(C)}, we set the initial rank value of each DVW candidate to be equal and then start the rank-updating iterations. The detailed description of VisualWordRank can be found in [68]. During the iterations, the candidates having large inherent importance and strong co-occurrence with highly weighted candidates will be ranked highly. After several iterations, the DVW set for object category C can be generated by selecting the top N ranked DVW candidates or by choosing the ones with rank values larger than a threshold. Examples of selected DVWs are illustrated in Fig. 3.
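The construction of R^(C) and the rank update can be sketched as follows, assuming the within-category frequencies f^(C), the average frequencies F, and the co-occurrence matrix T^(C) are precomputed arrays; the update rule below is an ordinary PageRank-style iteration used for illustration, and the exact update of VisualWordRank is given in [68].

```python
import numpy as np

def visual_word_rank(f_C, F, T_C, d=0.85, iters=50):
    """Build R^(C) from Eqs. 13-14 and iterate a PageRank-style update;
    the top-ranked candidates are selected as DVWs for category C."""
    R = T_C.astype(float).copy()                                      # Eq. 14 (off-diagonal)
    np.fill_diagonal(R, f_C * np.log(1.0 / np.maximum(F, 1e-12)))     # Eq. 13 (diagonal)
    R_col = R / np.maximum(R.sum(axis=0, keepdims=True), 1e-12)       # column-normalize
    n = len(f_C)
    rank = np.full(n, 1.0 / n)                                        # equal initial rank values
    for _ in range(iters):
        rank = d * R_col @ rank + (1 - d) / n
    return rank
```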

3.2.2 Descriptive visual phrase selection

Similar to the DVW selection, the DVP selection aims to select the visual word pairs that are descriptive of certain objects or scenes. Since the co-occurrence (i.e., spatial relationship) information of visual word pairs has already been integrated into the generated DVP candidates, we now compute the DVP candidate frequencies within a certain category and over all categories. According to TF-IDF weighting in information retrieval theory, a DVP candidate is considered important to a category if it appears more often in it and less often in others. Based on this strategy, the importance of a DVP candidate k to a category C is computed as:

VPI^{(C)}_{k} = VPf^{(C)}_{k} \cdot \ln\!\left(1 / VPF_{k}\right)    (15)

where VPI^{(C)}_{k} is the importance of the DVP candidate k to the category C, and VPf^{(C)}_{k} and VPF_{k} stand for the frequencies of occurrence of candidate k in category C and in all categories, respectively. Suppose there are M image categories and the two visual words contained in k are visual word i and visual word j, respectively; then VPf^{(C)}_{k} and VPF_{k} can be computed with Eq. 16:

VPf^{(C)}_{k} = T^{(C)}_{i,j}, \qquad VPF_{k} = \frac{1}{M} \sum_{m=1}^{M} T^{(m)}_{i,j}    (16)

Consequently, after computing the importance of each DVP candidate, the DVPs for category C can be identified and selected from the top-ranked VPI^{(C)} values.
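Following Eqs. 15–16, the DVP importance for a category can be computed directly from the per-category co-occurrence matrices; the tensor layout below is an assumption made for illustration.

```python
import numpy as np

def dvp_importance(T, c):
    """T: array of shape (M, n_words, n_words) holding T^(m) for each of the M categories.
    Returns the importance VPI^(C)[i, j] (Eq. 15) of every word pair for category c."""
    VPf_C = T[c]                                   # Eq. 16: within-category frequency
    VPF = T.mean(axis=0)                           # Eq. 16: average over all categories
    return VPf_C * np.log(1.0 / np.maximum(VPF, 1e-12))   # Eq. 15
```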

Fig. 3 (a) DVW candidates before VisualWordRank; (b) selected DVWs in the corresponding categories

In Fig. 4, the visual words are denoted as green dots, and the dots connected by red lines denote the selected DVPs. As can be clearly observed, most of the selected DVPs appear on the object and maintain obvious spatial characteristics of the corresponding object. For more details, refer to [68].

Fig. 4 The selected DVPs in: "inline skate", "revolver", "cannon", and "accordion"

3.3 Contextual visual vocabulary

The ineffectiveness of classic visual words may be closely related to two causes: 1) a single local feature cannot preserve enough spatial information, and 2) the distance metric used for clustering is not semantic-aware. We address these two challenges in a unified framework by casting the problem as learning a discriminant group distance between local feature groups [69]. In contrast to previous methods, we take groups of local features into consideration instead of treating the local features independently. In this way, the spatial contextual information between local features can be modeled, and the magnified quantization error of directly combining visual words can be suppressed. Based on the spatial configuration of the local features within a feature group, we define the spatial contextual similarity [69] between two local feature groups (see Fig. 5). The learned group distance is further applied to create a visual vocabulary from local feature groups, namely the contextual visual vocabulary, which incorporates both spatial and semantic level contextual information.

3.3.1 Local feature group detection

As shown in Fig. 5, each local feature group contains several local features. In our formulation, the discriminant group distance is defined between two groups containing the same number of local features to simplify the computation.

In order to achieve scale invariance, we use the scale information [36] of the local features as the basis for computing the spatial distance related to the co-occurrence between local features. To make the detected local feature groups repeatable and efficient to compute, we note that, according to Liu et al. [32], if too many local features or visual words are combined, the repeatability of the combination decreases. In addition, if more local features are contained in each group, there are more possible feature-to-feature matches between two groups, which may make the computation of the corresponding spatial contextual similarities time-consuming. Therefore, we fix the number of local features in each local feature group at 2.

To detect local feature groups containing two local features, we use the detector illustrated in Fig. 6. In the figure, a circle with radius R is centered at a local feature. A local feature group is formed by the centered local feature and another local feature within the circle. The radius R is computed as

R = S_{center} \cdot \lambda    (17)

to achieve scale invariance, where S_{center} is the scale of the centered local feature and \lambda is a parameter that controls the spatial span of the local feature group, which in turn affects the co-occurrence relation between the local features.
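The detector of Fig. 6 and Eq. 17 can be sketched as below, assuming keypoints are given as (x, y, scale) tuples; the value of lam is illustrative.

```python
def detect_feature_groups(keypoints, lam=2.0):
    """For each centered feature, pair it with every other feature lying inside
    the circle of radius R = S_center * lambda (Eq. 17); each pair is a 2-feature group."""
    groups = []
    for i, (xi, yi, si) in enumerate(keypoints):
        R = si * lam
        for j, (xj, yj, _) in enumerate(keypoints):
            if i != j and (xi - xj) ** 2 + (yi - yj) ** 2 <= R ** 2:
                groups.append((i, j))          # (center feature, neighbor feature)
    return groups
```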

3.3.2 Computation of discriminant group distance

The discriminant group distance is computed as the spatial context weighted Mahalanobis distance based on the local features in two groups. Note that the discriminant group distance is defined between groups containing an identical number of local features. With n local features in each group, there are n! possible feature-to-feature matches between two groups. We call each possible match a match order r. It is reasonable to seek the best match order for the group distance computation. Consequently, to define the group distance, we first compute the best match order between two groups, and hence obtain the corresponding spatial contextual similarity based on the spatial relationship of the two feature groups. After that, we further learn a weighted Mahalanobis distance metric to model the distance between the two groups.

Fig. 5 The discriminant group distance measures the spatial context weighted Mahalanobis distance between two local feature groups

Fig. 6 The utilized local feature group detector: a circle of radius R is centered at local feature P_center; the detected groups are (P_center, P_a), (P_center, P_b), and (P_center, P_c)

Spatial contextual similarity We define the spatial context of each local feature group as the orientation and scale relationships between the local features inside the group. Because each local feature contains two aspects of spatial information, i.e., scale and orientation, the spatial contextual similarity between local feature groups is defined as:

SimCxt(I, J) = \max_{r} \left( SimS^{(I,J)}_{r} + SimO^{(I,J)}_{r} \right) / 2    (18)

where SimCxt(I, J) denotes the spatial contextual similarity between local feature groups I and J. We compute SimCxt based on the scale and orientation relationships between the local features inside the feature groups. Intuitively, if the local features in two groups present similar scale ratios and similar angles by their orientation cues, the two groups will have a large SimCxt.
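Eq. 18 can be sketched as below for 2-feature groups described by their scales and orientations. The concrete scale-ratio and relative-angle similarities used here are illustrative stand-ins for SimS and SimO, whose exact definitions are given in [69].

```python
import itertools
import numpy as np

def spatial_contextual_similarity(scales_I, oris_I, scales_J, oris_J):
    """Maximize (SimS + SimO) / 2 over all match orders r between two groups (Eq. 18)."""
    best = 0.0
    for perm in itertools.permutations(range(len(scales_J))):     # every match order r
        ratio_I = scales_I[0] / scales_I[1]
        ratio_J = scales_J[perm[0]] / scales_J[perm[1]]
        sim_s = np.exp(-abs(np.log(ratio_I / ratio_J)))           # similar scale ratios -> ~1
        ang_I = (oris_I[0] - oris_I[1]) % (2 * np.pi)
        ang_J = (oris_J[perm[0]] - oris_J[perm[1]]) % (2 * np.pi)
        diff = min(abs(ang_I - ang_J), 2 * np.pi - abs(ang_I - ang_J))
        sim_o = 1.0 - diff / np.pi                                # similar relative angles -> ~1
        best = max(best, (sim_s + sim_o) / 2.0)
    return best
```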

Formulation of the discriminant group distance Recall that each local feature groupcontains both the appearance (i.e., the SIFT descriptor) and spatial contextual information.We combine these two information by learning a spatial context weighted Mahalanobisdistance, which is called discriminant group distance, i.e.,

DGDðGI ;GJ jAÞ ¼ dAIJ

¼ WIJPn

k¼1 ðDðkÞI � Dðk»Þ

J ÞT

AðDðkÞI � Dðk»Þ

J Þwhere WIJ ¼ 1� SimCxtðI ;J Þ

ð19Þ

where W_IJ denotes the spatial contextual weight derived from the spatial contextual similarity, D_J^(k*) is the local feature matched with D_I^(k) under the best match order, and A is the 128×128 matrix to be learned from the semantic labels of the local feature groups. Intuitively, we try to find a distance metric that makes feature groups with similar semantic contexts close to each other and those with different semantics far apart. To achieve this, suppose we are given a set of M labeled examples (G_I, y_I), I = 1, ..., M, where G_I and y_I denote the feature group and its label, respectively. Following Globerson et al. [11], for each group G_I we define the ideal conditional distribution p_0 and the conditional distribution p_A as:

$p_0(I \mid J) \propto \begin{cases} 1 & y_I = y_J \\ 0 & y_I \neq y_J \end{cases}, \qquad p_A(I \mid J) = \dfrac{e^{-d^A_{IJ}}}{\sum_{K \neq J} e^{-d^A_{JK}}}, \quad I \neq J \qquad (20)$

Therefore, our metric learning should seek a matrix A* such that p_{A*}(G_J | G_I) is as close as possible to the ideal conditional distribution p_0(G_J | G_I). We define the objective function as:

$f(A) = \sum_{I, J} \mathrm{KL}\big[\, p_0(G_J \mid G_I) \,\|\, p_A(G_J \mid G_I) \,\big] \quad \text{s.t. } A \in \mathrm{PSD} \qquad (21)$

where PSD stands for the set of positive semi-definite matrices. Then, similar to [11], we compute A* as: $A^{*} = \arg\min_{A} f(A)$.



Optimization of the discriminant group distance With the Kullback-Leibler divergence computation and Eq. 20, we may rewrite Eq. 21 as:

$f(A) = \sum_{I, J : \, y_J = y_I} d^A_{IJ} + \sum_{I} \log \sum_{J \neq I} e^{-d^A_{IJ}} \qquad (22)$

A crucial property of this optimization problem, shown by Globerson and Roweis [11], is that the objective function is convex with respect to A. As in [11], we also utilize a simple gradient descent method, specifically the projected gradient approach. The detailed algorithm can be found in [11]. The learned matrix A makes the final distance metric incorporate the semantic contexts of each local feature group. For more details, refer to [69].
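A minimal sketch of this projected-gradient optimization is given below, assuming the best-match-order difference vectors and the spatial contextual weights have already been computed. It follows the gradient of the KL objective in Eq. 21 in the style of [11]; the function names, array layouts, and step size are our own illustrative assumptions, not the exact implementation of [69].

```python
import numpy as np

def project_to_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learn_group_metric(diffs, weights, labels, n_iter=100, lr=1e-3):
    """
    Projected-gradient sketch for Eqs. 19-22 in the style of [11].
    diffs   : (M, M, n, d) array, diffs[I, J, k] = D_I^(k) - D_J^(k*) under the best match order
    weights : (M, M) array of spatial contextual weights W_IJ = 1 - SimCxt(I, J)
    labels  : (M,) array of semantic labels y_I
    """
    M, _, n, d = diffs.shape
    A = np.eye(d)
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)
    p0 = same / np.maximum(same.sum(axis=1, keepdims=True), 1e-12)   # ideal distribution (Eq. 20)
    for _ in range(n_iter):
        # Spatial context weighted Mahalanobis group distances d^A_IJ (Eq. 19)
        dist = weights * np.einsum('ijkd,de,ijke->ij', diffs, A, diffs)
        np.fill_diagonal(dist, np.inf)                               # exclude J == I
        pA = np.exp(-dist)
        pA /= pA.sum(axis=1, keepdims=True)                          # conditional p_A (Eq. 20)
        # Gradient of the KL objective (Eq. 21): sum_IJ (p0 - pA) * W_IJ * sum_k diff diff^T
        coef = (p0 - pA) * weights
        grad = np.einsum('ij,ijkd,ijke->de', coef, diffs, diffs)
        A = project_to_psd(A - lr * grad)                            # gradient step + PSD projection
    return A
```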

3.3.3 Contextual visual vocabulary computation

To simplify the computation of the cluster center, we employ hierarchical K-centers clustering to cluster the local feature groups. Different from K-means, the cluster center of K-centers is simply updated as the data point having the maximum similarity with the other data points in the same cluster. Therefore, we need to store a group-to-group similarity matrix for each cluster to update the cluster center. Intuitively, once the similarity matrix of a cluster is computed, the clustering operation in its corresponding sub-clusters can be finished efficiently. The clustering finally produces a hierarchical vocabulary tree, and each cluster center of the leaf nodes is taken as a contextual visual word.
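A sketch of one level of this K-centers clustering is shown below; recursing on each resulting cluster yields the hierarchical vocabulary tree. The function and parameter names are illustrative, and similarity(a, b) is assumed to be a (symmetric) group-to-group similarity, e.g., one derived from the discriminant group distance.

```python
import random

def k_centers(groups, k, similarity, n_iter=10):
    """
    One level of K-centers clustering over local feature groups.
    Unlike k-means, the cluster center is the member having the maximum total
    similarity to the other members of its cluster, so only a group-to-group
    similarity function is needed (no mean in feature space).
    """
    centers = random.sample(groups, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each group joins its most similar center.
        clusters = [[] for _ in range(k)]
        for g in groups:
            best = max(range(k), key=lambda c: similarity(g, centers[c]))
            clusters[best].append(g)
        # Update step: the new center maximizes total within-cluster similarity.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = max(members,
                                 key=lambda m: sum(similarity(m, o) for o in members))
    return centers, clusters

# Recursing on each cluster with k_centers(cluster, k, similarity) yields the
# hierarchical vocabulary tree; each leaf center becomes a contextual visual word.
```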

Since the contextual visual words are generated from local feature groups rather than single local features, each of them preserves rich spatial contextual information. Meanwhile, because the distance utilized for clustering is more discriminative, the contextual visual words have more capacity to represent the image concept than the traditional visual words.

3.4 Visual vocabulary hierarchy optimization

3.4.1 Vocabulary hierarchy investigation

In previous work [18], Ji et al. found that denser regions in the local feature distribution correspond to more generic image patches, while sparser regions are more discriminative. This is validated in Fig. 7. In TF-IDF term weighting [45, 52], a visual word obtains less weight when it appears in more images, and vice versa. However, earlier papers [21, 37] have shown that the k-means process divides the feature space asymmetrically, moving clusters toward denser regions because of its "mean-shift"-like update rule. Moreover, we have found that the hierarchical quantization process in vocabulary generation iteratively magnifies this asymmetric division. More hierarchy levels lead to more asymmetric metrics in clustering, which biases the word distribution toward the denser regions and over-fits it to the feature density. As a result, the discriminative patches (the sparser regions that rarely appear in images) are coarsely quantized (given lower IDF and contributing less to ranking), while the generic patches (the denser regions that frequently appear in many images) are finely quantized (gaining an inappropriately higher IDF in ranking). This is exactly the reason why IDF shows limited improvement in [55, 63]: the similarity metric biases are magnified by hierarchical clustering.



Furthermore, the nearest neighbor search within leaf nodes is inaccurate because of its local nature within the vocabulary structure. With hierarchical quantization, the nearest neighbor search in leaf nodes becomes increasingly inaccurate as the nearest neighbor scale grows.

Table 1 presents our validation of this assessment. We investigated the quantization errors in a 3-branch, 5-level vocabulary tree. In Table 1, NN denotes the nearest neighbor search scope and GNP denotes the number of branches searched in parallel in the GNP [54] search extension. We selected 3 K images from our urban street scene database (introduced in Section 3.4.3) to form a Vocabulary Tree (VT), with 0.5 M features (on average 2 K features in each visual word). We compared the matching ratio between the global-scale NN and the leaf-scale NN, where the global scale is the overall feature space while the leaf scale is inside leaf nodes. We extended the leaf scale to include more local neighbors using Greedy N-best Path (GNP) [54]. The quantization error was evaluated by the matching ratio (%) to see to what extent the VT quantization would cause mismatches of feature points. From Table 1, the match ratios between the inside-leaf and the global-scale search results are extremely low when the GNP number is small.

It is well known that the distribution of textual words in documents follows Zipf's Law [65], in which the most frequent words account for a dominant portion of word occurrences. On the contrary, the visual word distribution does not follow Zipf's Law [65] because of the inappropriate hierarchical generation. Furthermore, as the hierarchy level increases, the word distribution becomes more and more uniform. Consequently, we can explain the phenomenon observed in an earlier paper [63] that IDF is less effective for large word volumes. Moreover, we can also infer that, with the current suboptimal visual vocabulary, the "stop words" removal technique will not be very helpful: in many cases, there are no very frequent visual words in images analogous to "a" or "the" in documents. Our inference has been validated by the "stop words" removal experiments in [63].
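This observation can be checked with a simple sketch that fits the slope of log(frequency) versus log(rank) over the visual-word occurrence counts; a textual vocabulary typically yields a slope near -1 (Zipf-like), whereas a much flatter slope indicates the near-uniform distribution discussed above. The helper below is illustrative only.

```python
import numpy as np

def zipf_slope(word_counts):
    """
    Fit log(frequency) = a + s * log(rank) by least squares and return the slope s.
    Textual vocabularies typically give s close to -1 (Zipf's Law); a slope much
    closer to 0 indicates a near-uniform visual word distribution.
    """
    freq = np.sort(np.asarray(word_counts, dtype=float))[::-1]
    freq = freq[freq > 0]
    rank = np.arange(1, len(freq) + 1)
    slope, _ = np.polyfit(np.log(rank), np.log(freq), 1)
    return slope
```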

3.4.2 Density-based metric learning for hierarchy optimization

In previous work [18], Ji et al. presented a Density-based Metric Learning (DML) algorithm to optimize vocabulary tree generation in an unsupervised manner. We achieve better retrieval precision using the DML-based tree because: (1) its visual word distribution follows Zipf's Law [65] more closely; and (2) it reduces the overall quantization error: the meaningful words (sparse clusters) are quantized more finely while the meaningless words

Fig. 7 Feature-word frequency distribution (in log-log plot) of the finest level in hierarchical quantization. The fewer features a cluster has, the smaller the portion of images it contains, and hence the more discriminative it is in the BoW vector [18]



(denser clusters) are quantized coarsely. Moreover, we present an "If...Else..." hierarchical rejection strategy to deal with the hierarchical quantization error in VT-based retrieval. It combines the idea of the boosting chain classifier [56] with the nature of the tree hierarchy in an unsupervised manner to improve retrieval.

The main idea of our hierarchy optimization is that the vocabulary hierarchy iteratively magnifies the asymmetric feature space quantization of k-means clustering, which results in the suboptimal visual words and suboptimal IDF reported in earlier papers [57, 63]. DML addresses this issue by constructing a Density Field, which evaluates the distribution tightness of local features, to guide the quantization procedure in the local feature space. By extending denser clusters and shrinking sparser clusters, the similarity metric in k-means is modified to correct the biased hierarchical quantization. Subsequently, a refined hierarchical division is achieved by DML-based hierarchical k-means, which offers more "meaningful" visual words.

Accordingly, we propose the following algorithms to improve the visual word hierarchy. First, the Density Field in the SIFT feature space is estimated using the density of each SIFT point as a discrete approximation. The density of a SIFT point is defined by the kernel density estimation in Eq. 23:

$D(i) = \frac{1}{n} \sum_{j=1}^{n} e^{-\|x_i - x_j\|_{L_2}} \qquad (23)$

where D(i) is the point-density of the i-th SIFT point, n is the total number of SIFT points in the dataset, and x_j is the j-th SIFT point. We adopt the L2 distance to evaluate the distance between two SIFT points.

To decrease the computational cost, we approximate the density of each SIFT point using only its local neighbors as:

$\hat{D}(i, m) = \frac{1}{m} \sum_{k=1}^{m} e^{-\|x_i - x_k\|_{L_2}} \qquad (24)$

where D̂(i, m) is the point-density of the i-th SIFT feature in its m-neighborhood. To estimate the point density with this neighborhood approximation, we only need to compute the local neighbors of each SIFT point: (1) cluster the database into k clusters, O(k×h×l), with h iterations on l points, and (2) perform a nearest neighbor search in each cluster, O(l/k). By choosing a large k (e.g., 2000), our DML is very efficient; by using a heap structure, it can be further accelerated. Then, the similarity metric in hierarchical k-means is refined by density-based adaptation:

$\mathrm{DisSim}(C, i) = \mathrm{AveDen}(C) \times \mathrm{Dis}(C_{center}, i) \qquad (25)$

where DisSim(C, i) is the density-adapted metric used in DML-based k-means between cluster C and the i-th point, AveDen(C) is the average density of the SIFT points in cluster C, and Dis(C_center, i) is the distance between the center SIFT point of C and the i-th SIFT point.
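The following sketch puts Eqs. 24 and 25 together: a brute-force m-neighborhood density estimate and the density-adapted metric used inside DML-based hierarchical k-means. In practice the neighborhoods would be restricted to coarse clusters (and accelerated with a heap) as described above; the function names here are our own.

```python
import numpy as np

def approx_point_density(features, m=50):
    """
    Eq. 24: point-density of each SIFT descriptor from its m nearest neighbors.
    A brute-force version; in practice the m-neighborhood is searched only inside
    a coarse cluster (and accelerated with a heap), as described above.
    """
    features = np.asarray(features, dtype=float)
    density = np.zeros(len(features))
    for i in range(len(features)):
        dists = np.linalg.norm(features - features[i], axis=1)
        nearest = np.sort(dists)[1:m + 1]          # skip the point itself
        density[i] = np.mean(np.exp(-nearest))
    return density

def dml_dissimilarity(center, point, ave_density):
    """
    Eq. 25: DisSim(C, i) = AveDen(C) * Dis(C_center, i), the density-adapted metric
    used in place of the plain distance inside DML-based hierarchical k-means.
    """
    return float(ave_density * np.linalg.norm(np.asarray(center) - np.asarray(point)))
```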

Table 1 Hierarchical quantization error test [19]

NN\GNP 1 3 5 10 15 20

50 41.46% 73.34% 85.00% 94.53% 97.11% 98.18%

200 57.46% 66.21% 79.00% 92.02% 95.00% 97.48%

1,000 11.54% 38.27% 51.57% 67.48% 85.16% 94.91%

2,000 6.38% 25.68% 40.59% 58.54% 79.21% 92.42%



3.4.3 Experimental verification

In our experiments [18], two datasets were investigated: Scity and UKBench. Scity consists of 24,500 street-side photos captured automatically along Seattle streets by a car. We resized these photos to 400×300 pixels and extracted 300 features from each photo on average. Every six successive photos were defined as a scene. The UKBench dataset contains 10,000 images of CD covers and indoor objects.

In both datasets, each category was divided into a query set (to test performance) and a training set (to create the VT). In the Scity dataset, the last image of each scene was used for the query test, and the first five were used to construct the ground truth set. In the UKBench dataset, in each category, the first image was added to the query set and the rest were used to construct the ground truth set.

We used the Success Rate at N (SR@N) to evaluate performance. This evaluation measure is commonly used in evaluating Question Answering (QA) systems. SR@N represents the probability of finding a correct answer within the top N results. Given n queries, SR@N is:

$SR@N = \frac{\sum_{q=1}^{n} \theta\big(N - pos(a_q)\big)}{n} \qquad (26)$

a_q is the answer to query q, pos(a_q) is its position, and θ(·) is the Heaviside function: θ(x)=1 if x≥0, and θ(x)=0 otherwise. When N=4, SR@N is identical to the criterion in [45].
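A small sketch of the SR@N evaluation of Eq. 26 is given below; the variable names are hypothetical.

```python
def success_rate_at_n(ranked_results, ground_truth, n):
    """
    Eq. 26: SR@N = (1 / n_queries) * sum_q theta(N - pos(a_q)),
    i.e. the fraction of queries whose first correct answer is within the top N.
    ranked_results[q] : ranked list of database image ids returned for query q
    ground_truth[q]   : set of correct image ids for query q
    """
    hits = sum(1 for q, results in enumerate(ranked_results)
               if any(r in ground_truth[q] for r in results[:n]))
    return hits / len(ranked_results)
```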

We built a 2-branch, 12-level VT for both Scity and UKBench. In both trees, if a node had fewer than 2,000 features, we stopped its k-means division, whether or not it had reached the deepest level. In tree adaptation, the maximum threshold was set to 20,000 and the minimum threshold to 500.

Figures 8 and 9 present the SR@N on UKBench before and after DML; for both, four comparisons were conducted: (1) traditional leaf comparison, in which only the leaf nodes were used for BoW-based ranking; (2) leaf comparison + IDF weighting; (3) hierarchical chain, in which we adopt the hierarchical rejection chain for multi-evidence search; and (4) hierarchical chain + IDF, which combines the BoW vectors at each hierarchical level and their IDF as weights in similarity ranking.

Fig. 8 Performance comparison on UKBench original tree [18]



It is obvious that, before DML learning, the real power of the hierarchical rejection chain and hierarchical IDF cannot be expressed. However, the combined performance enhancements of both methods after DML learning were significant, almost 20% over the leaf-level baseline. Comparing Fig. 8 with Fig. 9, the DML-based VT performs much better than the original VT, about 20% higher than its best results. The same result holds for Scity (Figs. 10 and 11). For more details, refer to [19].

4 Post processing for large-scale image application

Rather than generating more descriptive and discriminative visual vocabularies, some works have proposed post-processing strategies to improve the performance of the classic visual vocabulary and filter mismatched visual word matches. In this part, we introduce two state-of-the-art post-processing strategies, i.e., utilizing spatial constraints to filter mismatched visual words between images [73] and utilizing user log information to improve the performance of image retrieval [44].

Fig. 9 Performance comparison on UKBench DML tree [18]

Fig. 10 Performance comparison on Scity original tree [18]



4.1 Spatial information based visual word match verification

The spatial relationships among visual words in an image are critical in identifying partial-duplicate image patches. After SIFT quantization, matching pairs of local features between two images can be obtained. However, the matching results are usually polluted by false matches. Generally, geometric verification [36, 48] can be adopted to refine the matching results by discovering the transformation and filtering out false positives. Since full geometric verification with RANSAC [8] is computationally expensive, it is usually adopted only as a post-processing stage. Motivated by this problem, we propose the spatial coding scheme [73].

4.1.1 Spatial coding

Spatial coding encodes the relative positions between each pair of features in an image. Two binary spatial maps, called the X-map and the Y-map, are generated. The X-map and Y-map describe the relative spatial positions between each feature pair along the horizontal (X-axis) and vertical (Y-axis) directions, respectively. For instance, given an image I with K features {v_i}, i = 1, 2, ..., K, its X-map and Y-map are both K×K binary matrices defined as follows:

$X\!map(i, j) = \begin{cases} 0 & \text{if } x_i < x_j \\ 1 & \text{if } x_i \geq x_j \end{cases} \qquad (27)$

$Y\!map(i, j) = \begin{cases} 0 & \text{if } y_i < y_j \\ 1 & \text{if } y_i \geq y_j \end{cases} \qquad (28)$

where (x_i, y_i) and (x_j, y_j) are the coordinates of features v_i and v_j, respectively. In fact, the X-map and Y-map impose loose geometric constraints among local features.

Further, we advance our spatial coding to more general formulations, so as to impose stricter geometric constraints. The image plane can be evenly divided into 4·r parts, with each quadrant uniformly divided into r parts. Correspondingly, two relative spatial maps GX and GY are desired to encode the relative spatial positions of feature pairs.

Fig. 11 Performance comparison on Scity DML tree [18]



For an image plane divided uniformly into 4·r fan regions with one feature as the reference origin point as discussed above, we decompose the division into r independent sub-divisions, each uniformly dividing the image plane into four parts. Each sub-division is then encoded independently, and their combination leads to the final spatial coding maps. Fig. 12 illustrates the decomposition of the image plane division with r=2 and feature v2 as the reference origin. As shown in Fig. 12(a), the image plane is divided into 8 fan regions. We decompose it into two sub-divisions: Fig. 12(b) and (c). The spatial maps of Fig. 12(b) can be generated by Eqs. 27 and 28. The sub-division in Fig. 12(c) can be encoded in a similar way: it just needs to rotate all the feature coordinates and the division lines counterclockwise until the two division lines become horizontal and vertical, respectively, as shown in Fig. 12(d). After that, the spatial maps can be easily generated by Eqs. 27 and 28.

Consequently, the general spatial maps GX and GY are both 3-D matrices and can be generated as follows. Specifically, the location (x_i, y_i) of feature v_i is rotated counterclockwise by $\theta = \frac{k\pi}{4r}$ (k = 0, 1, ..., r-1) about the image origin, yielding the new location (x_{ik}, y_{ik}):

$\begin{pmatrix} x_{ik} \\ y_{ik} \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix} \qquad (29)$

Then GX and GY are defined as,

$GX(i, j, k) = \begin{cases} 0 & \text{if } x_{ik} < x_{jk} \\ 1 & \text{if } x_{ik} \geq x_{jk} \end{cases} \qquad (30)$

$GY(i, j, k) = \begin{cases} 0 & \text{if } y_{ik} < y_{jk} \\ 1 & \text{if } y_{ik} \geq y_{jk} \end{cases} \qquad (31)$

Fig. 12 An illustration of spatial coding with r=2 for image features. (a) shows the image plane division with feature 2 as the origin point; (a) can be decomposed into (b) and (c); rotating (c) counterclockwise by π/4 yields (d)



4.1.2 Spatial verification

Spatial coding plays an important role in spatial verification. Since the problem we focus on is partial-duplicate image retrieval, there is an underlying requirement that the target image must share some duplicated patches, or, in other words, the same or a very similar spatial configuration of matched feature points. Due to unavoidable quantization error, false feature matches are usually incurred. To define the similarity between images more accurately, it is desirable to remove such false matches. Our spatial verification with spatial coding performs this task.

Suppose that a query image Iq and a matched image Im are found to share N pairs of matched features through SIFT quantization. Then the corresponding sub-spatial-maps of these matched features for both Iq and Im can be generated and denoted as (GXq, GYq) and (GXm, GYm). For efficient comparison, we perform a logical Exclusive-OR (XOR) operation on GXq and GXm, and on GYq and GYm, respectively, as follows:

$V_x(i, j, k) = GX_q(i, j, k) \oplus GX_m(i, j, k) \qquad (32)$

$V_y(i, j, k) = GY_q(i, j, k) \oplus GY_m(i, j, k) \qquad (33)$

Ideally, if all N matched pairs are true, Vx and Vy will be zero for all of their entries. If some false matches exist, the corresponding entries in GXq and GXm may be inconsistent, and likewise for GYq and GYm. Those inconsistencies will cause the corresponding exclusive-OR entries of Vx and Vy to be 1. Denote

$S_x(i) = \sum_{j=1}^{N} \bigcup_{k=1}^{r} V_x(i, j, k), \qquad S_y(i) = \sum_{j=1}^{N} \bigcup_{k=1}^{r} V_y(i, j, k) \qquad (34)$

Consequently, by checking the values of Sx and Sy, the false matches can be identified and removed. The corresponding entries in Vx and Vy for the remaining true matches will all be zero.
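A sketch of this verification step is given below. The XOR and union operations follow Eqs. 32-34; the greedy strategy of repeatedly dropping the match with the largest inconsistency count until all remaining entries are consistent is one plausible reading of the removal step, stated here as an assumption rather than the exact procedure of [73].

```python
import numpy as np

def spatial_verification(GXq, GYq, GXm, GYm):
    """
    Spatial verification via spatial coding (Eqs. 32-34).
    GXq, GYq, GXm, GYm : N x N x r sub-maps restricted to the N matched features
    of the query image and the matched database image.
    Returns a boolean mask marking the matches kept as true matches.
    """
    keep = np.ones(GXq.shape[0], dtype=bool)
    while keep.any():
        idx = np.where(keep)[0]
        sub = np.ix_(idx, idx)
        Vx = np.logical_xor(GXq[sub], GXm[sub])          # Eq. 32
        Vy = np.logical_xor(GYq[sub], GYm[sub])          # Eq. 33
        S = Vx.any(axis=2).sum(axis=1) + Vy.any(axis=2).sum(axis=1)   # Eq. 34: Sx + Sy
        if S.max() == 0:          # all remaining matches are mutually consistent
            break
        keep[idx[np.argmax(S)]] = False                  # drop the most inconsistent match
    return keep
```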

Figure 13 shows two instances of spatial verification with spatial coding, on a relevant image pair and on an irrelevant image pair. Both image pairs have many matches of local features after quantization.

Fig. 13 An illustration of spatial verification with spatial coding on a relevant pair (left column) and an irrelevant pair (right column). (a), (d) Initial matched feature pairs after quantization; (b), (e) false matches detected by spatial verification; (c), (f) true matches that pass the spatial verification



For the left "Mona Lisa" instance, after spatial verification via spatial coding, the false matches are discovered and removed, while the true matches are satisfactorily kept. For the right instance, although the two images are irrelevant in content, 12 matched feature pairs are still found after quantization. However, after spatial verification, most of the mismatched pairs are removed and only 3 pairs of matches are kept. Moreover, it can be observed that those 3 pairs of features do share high geometric similarity.

In spatial verification, the main computational operations are logical Exclusive-OR and addition. Therefore, unlike full geometric verification with RANSAC [8, 36, 48], the computational cost is very low.

4.2 User preference based visual word weighting

Typically, the image features extracted from the query image are compared with those of the images in the database, and the database images that share the most features with the query image are returned. In this sense, a distance metric that defines the image similarity between the query image and the database images is a key component of a CBIR system. A straightforward and commonly used criterion for designing this image similarity function is to maximize the visual similarity between the query image and the returned candidate images in the database. This criterion seems correct at first glance; however, it ignores the human users' factor. Human users always have certain preferences regarding the expected image search results, which might also change over time. From this point of view, images that are more interesting to the users should be emphasized more and have higher probabilities of being returned to the users. Take automobiles as an example: the state-of-the-art car models are more likely to be frequently queried. Then, if a user gives an example query image containing a car, the system should be aware that the user is more likely interested in the new car models instead of the old ones. Figure 14 illustrates such an example. From this perspective, designing a new retrieval scheme that considers the users' experience to some extent is necessary [44].

The main idea of this work [44] is to incorporate the users' behavior information from the query log data into the visual word weighting framework. Intuitively, the more frequently visited visual words (from the image queries) will have higher weights, which will

Fig. 14 An automobile example. It shows that for conventional CBIR systems, the returned image search results might contain a large portion of images uninteresting to the users. In this work, our task is to incorporate the users' factor into the CBIR system to improve the satisfaction level of the users



bias the image search results towards the more interesting database images. More specifically, we formulate our query-log based weighting as well as the conventional tf-idf weighting scheme into a unified probabilistic framework, regarding the query-log related component and the term frequency component as complementary prior information from the users' preference and the data distribution, respectively. By marginalizing over all database images and all image classes, the query-log related prior probability for each visual word is obtained and serves as the importance weight for re-weighting each visual word. This new visual word importance weighting scheme is called qf-tf-idf, where qf denotes the query frequency. This novel method could serve as an important building block for a user-centric CBIR system.

From a probabilistic point of view, the inverse document frequency weight ω_D can be considered as the prior distribution of the importance of each visual word, conditioned on the image database, which can be re-written as:

$\omega_D \propto p(V \mid D) = \prod_i p(V_i \mid D), \qquad (35)$

$p(V_i \mid D) \propto idf_i \qquad (36)$

Here we assume that, given the database images, the distribution of each visual word is independent. This prior information is based only on the intrinsic image data distribution. However, given that the users' query log information is available (an example of the query log file structure is illustrated in Fig. 15), the prior distribution of each visual word will be conditioned not only on the database image collection, but also on the query log information. These two resources constitute two important and complementary aspects of the prior information on the visual words' distribution, namely, the data domain and the user domain. Therefore, the prior distribution of the visual words in terms of both the image collection and the query log can be defined as p(Vi|D,Q), where we use Q to denote the query log set. Assuming that the query log and the image collection information are independent, we can further decompose the prior distribution into two terms:

$p(V \mid D, Q) \propto p(V \mid D) \cdot p(V \mid Q) = \prod_i p(V_i \mid D) \cdot \prod_i p(V_i \mid Q) \qquad (37)$

where the independence of each visual word Vi is assumed as well.

p(Vi|Q) denotes the prior probability given the users' query log information. This prior probability measures the importance of the visual word in terms of the users' favor and preference information, and can serve as a new user-centric importance weight, denoted as ω_Q, such that:

$\omega_Q \propto p(V \mid Q) = \prod_i p(V_i \mid Q) \qquad (38)$

where we still assume the independence of each visual word Vi.



To get insight into the query-log based prior information, we can further decompose the term p(Vi|Q) into:

$p(V_i \mid Q) = \sum_{j=1}^{N} \sum_{k=1}^{K} p(V_i \mid Q, I_j, k)\, p(k \mid I_j, Q)\, p(I_j \mid Q) = \sum_{j=1}^{N} \sum_{k=1}^{K} p(V_i \mid k)\, p(k \mid I_j)\, p(I_j \mid Q) \qquad (39)$

These terms are explained as follows: p(Ij|Q) measures the prior probability of the database images given the users' query log information, defined as:

$p(I_j \mid Q) = \frac{qf_j}{\sum_i qf_i} \qquad (40)$

which is a normalized frequency value that measures how often the image Ij is used as a query image. A higher value of p(Ij|Q) indicates that the image Ij is more interesting to the users, and hence it should contribute more to weighting the visual words. Note that:

$\sum_{j=1}^{N} p(I_j \mid Q) = 1 \qquad (41)$

Fig. 15 An example of the simplified query log file structure. The left column is the database image ID, the middle column illustrates the corresponding database images, and the right column denotes the number of times that each image is queried by the users



p(k|Ij) is the probability that the image Ij belongs to the k-th image class (concept). If we assume each database image has a single label, it can be further defined as:

$p(k \mid I_j) = \begin{cases} 1, & \text{if } I_j \in S_k \\ 0, & \text{otherwise} \end{cases} \qquad (42)$

where Sk denotes the subset of the images belonging to the k-th class. Similarly, the add-to-one constraint applies:

$\sum_{k=1}^{K} p(k \mid I_j) = 1 \qquad (43)$

p(Vi|k) is the likelihood that the visual word Vi appears in the images from the k-th class, which is defined as:

$p(V_i \mid k) = \frac{C_{ik}}{\sum_{i=1}^{M} C_{ik}} \qquad (44)$

$C_{ik} = \left| \{ I : V_i \in I,\ I \in S_k \} \right| \qquad (45)$

As can be observed, p(Vi|k) measures the importance of the visual word Vi for the class k, and those visual words that are shared by more images from the k-th class will take higher p(Vi|k) values. Also note that:

$\sum_{i=1}^{M} p(V_i \mid k) = 1 \qquad (46)$

Note that the three terms p(Ij|Q), p(k|Ij) and p(Vi|k) work in a cooperative manner that propagates the importance of each database image (from the query frequency) into the specific visual words that belong to certain image categories. In this way, not only the interesting image but also the other images in the same category contribute to the weights of the visual words that frequently occur inside that category. In other words, the importance of the interesting images is propagated to the other images from the same category, and this generality property is very important for the system to deal with unseen images. In contrast, those visual words which either appear randomly inside the same category or receive little importance from only uninteresting images will finally get low weights. These features (visual words) can be regarded as spurious features. We utilize p(Vi|Q) to define the proposed query frequency based weight, i.e., qf_i = p(Vi|Q). Combining it with the inverse document frequency idf_i, the new dissimilarity measurement between two images can be defined as:

$d(I_1, I_2) = \sum_i \left| tf_i^1 - tf_i^2 \right|_{L_p} \cdot idf_i \cdot qf_i \qquad (47)$
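The sketch below assembles the qf weight of Eqs. 39-45 under the single-label assumption of Eq. 42 and plugs it into the dissimilarity of Eq. 47 (using a per-element power as one reading of the Lp notation). Variable names and the binary bag-of-words layout are illustrative assumptions.

```python
import numpy as np

def query_frequency_weights(bow, labels, query_counts):
    """
    qf_i = p(V_i | Q) from Eqs. 39-45 under the single-label assumption (Eq. 42).
    bow          : (N, M) binary matrix, bow[j, i] = 1 if word V_i occurs in image I_j
    labels       : (N,) class index of each database image
    query_counts : (N,) number of times each image appears in the query log
    """
    bow = np.asarray(bow, dtype=float)
    labels = np.asarray(labels)
    query_counts = np.asarray(query_counts, dtype=float)
    p_img = query_counts / query_counts.sum()                # Eq. 40: p(I_j | Q)
    qf = np.zeros(bow.shape[1])
    for k in np.unique(labels):
        in_class = (labels == k)
        c_ik = bow[in_class].sum(axis=0)                     # Eq. 45: C_ik
        p_word_given_class = c_ik / max(c_ik.sum(), 1.0)     # Eq. 44: p(V_i | k)
        qf += p_word_given_class * p_img[in_class].sum()     # Eq. 39 with 0/1 p(k | I_j)
    return qf

def qf_tf_idf_distance(tf1, tf2, idf, qf, p=1):
    """Eq. 47: d(I1, I2) = sum_i |tf1_i - tf2_i|^p * idf_i * qf_i."""
    return float(np.sum(np.abs(tf1 - tf2) ** p * idf * qf))
```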

An illustration of the qf-tf-idf visual word weighting scheme is shown in Fig. 16.

The visual word weighting scheme developed above provides several advantages. 1) It is indeed user-centric and directly aims at improving the users' satisfaction level. 2) Due to its simplicity, the proposed qf-tf-idf weighting scheme is quite efficient and can seamlessly adopt the current statistics of the users' preference information as a building block for a CBIR system. 3) Encapsulated in a probabilistic framework, the proposed method is robust as well



as effective. Several examples of image retrieval results from the Oxford Buildings dataset and the INRIA Holidays dataset [16] are also given in Figs. 17 and 18. In Fig. 18, the top 10 returned results from the INRIA Holidays dataset for both tf-idf (in the middle yellow frame) and our approach (in the right blue frame) are shown for each query (the left column in the red frame). It can be observed from these instances that the consistency of the returned images with the query is greatly enhanced by our approach. Similar results can be observed for the Oxford Buildings dataset [48] in Fig. 17. For more details, please refer to [44].

Fig. 16 An illustration of the qf-tf-idf visual word weighting scheme. The first row of images shows the visual words from frequently queried images (left) and less frequently queried images (right). The middle row shows the query frequency based weights (left) and the inverse document frequency based weights (right) for all the visual words. The last row shows the final weights combining both aspects

Fig. 17 Eleven query images and the returned results from the Oxford Buildings dataset [48]. The leftmost column in the red frame represents the query images. The top 10 ranked search results using the conventional visual word weighting scheme and our proposed new weighting method are illustrated in the middle yellow frame and the right blue frame, respectively



5 Conclusions and discussions

In this paper, we comprehensively review and summarize the recent works on visual vocabulary. The related works are summarized into four categories, i.e., optimizing hierarchical k-means based visual vocabulary, supervised visual vocabulary generation, high-order visual word generation, and post-processing. In addition, we introduce several state-of-the-art algorithms in detail. The latent visual context analysis [74], descriptive visual words and descriptive visual phrases [68], contextual visual vocabulary [69], and visual vocabulary hierarchy optimization [19] are introduced in detail for more descriptive and discriminative visual vocabulary generation. Spatial coding [73] and the user preference based visual word weighting strategy [44] are presented as post-processing strategies to further improve the performance of the visual vocabulary in image matching and image retrieval.

Improving the descriptive and discriminative power of visual vocabulary is still an active research topic in the multimedia and computer vision communities. Many problems and topics still need to be further explored. For example, future research topics on visual vocabulary might include, but are not limited to, the following aspects:

1. Adaptive visual vocabulary generation. It is known that applying a pre-trained visual vocabulary to datasets largely different from the training dataset degrades its performance. However, training a new visual vocabulary can be expensive. Therefore, it would be meaningful to develop visual vocabularies that can adapt to different datasets.

2. Although many works have been reported on visual vocabulary, there is still a lack of theory and guidance on visual vocabulary size. It is interesting to study whether an optimal visual vocabulary size exists for specific problems.

3. The relationships between visual words and semantics could be explored. It is well known that there are clear relationships between textual words and semantics, which have been successfully utilized in natural language processing and machine translation. It would be

Fig. 18 Twenty query images and the returned results from the INRIA Holidays dataset [16]. The leftmost column in the red frame represents the query images. The top 10 ranked search results using the conventional visual word weighting scheme and our proposed new weighting method are illustrated in the middle yellow frame and the right blue frame, respectively



interesting to explore the relationships between visual words and semantics. This relationship could be helpful for image understanding and bridging the semantic gap.

4. Many works have been reported on visual words and visual phrases. It is interesting to study whether a visual grammar exists in images, which would be a higher-level representation built on visual words and visual phrases. If such a visual grammar could be extracted, it could be utilized as an image model or topic model for image categorization and image understanding.

5. Current visual codebooks for large-scale image applications contain hundreds of thousands or even millions of visual words. They are therefore not compact and cannot be adopted in mobile visual search applications. Consequently, developing effective and compact visual codebooks for mobile applications would be another interesting topic.

Acknowledgement This work is supported in part by NSF IIS 1052851 and by Akiira Media Systems, Inc. The work of Nicu Sebe has been supported by the FP7 IP GLOCAL European project and by the FIRB S-PATTERN project.

References

1. Agarwal S, Roth D (2002) Learning a sparse representation for object detection. ECCV
2. Battiato S, Farinella G, Gallo G, Ravi D (2009) Spatial hierarchy of textons distribution for scene classification. Proc. Eurocom Multimedia Modeling, pp 333–342
3. Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. JMLR 3:993–1022
4. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. WWW
5. Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. ICCV
6. Deerwester S, Dumais S, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
7. Duygulu P, Barnard K, Freitas J, Forsyth D (2002) Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. ECCV
8. Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24:381–395
9. Gemert V, Veenman C, Smeulders A, Geusebroek J (2010) Visual word ambiguity. T-PAMI 32(7):1271–1283
10. Grauman K, Darrell T (2007) Approximate correspondences in high dimensions. NIPS
11. Globerson A, Roweis S (2006) Metric learning by collapsing classes. Adv Neural Inf Process Syst 18:451–458
12. Hofmann T (1999) Probabilistic latent semantic indexing. ACM SIGIR
13. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. ML 41:177–196
14. Indyk P, Thaper N (1998) Fast image retrieval via embeddings. Symposium on Theory of Computing
15. Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. CVPR
16. Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. ECCV
17. Ji R, Xie X, Yao H, Wu Y, Ma W (2008) Incremental indexing of visual vocabulary for scalable retrieval. ICME
18. Ji R, Xie X, Yao H, Ma W (2009) Vocabulary hierarchy optimization for effective and transferable retrieval. CVPR
19. Ji R, Yao H, Sun X, Zhong B, Gao W (2010) Towards semantic embedding in visual vocabulary. CVPR
20. Jing Y, Baluja S (2008) VisualRank: applying pagerank to large-scale image search. IEEE Trans on PAMI 30:1877–1890
21. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. IJCV, pp 604–610
22. Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling of object categories using link analysis techniques. CVPR
23. Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. ACM MIR
24. Kohonen T (1986) Learning vector quantization for pattern recognition. Tech. Rep. TKK-F-A601, Helsinki Institute of Technology
25. Kohonen T (2000) Self-organizing maps, 3rd edition. Springer-Verlag



26. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebook by information loss minimization. PAMI 31(7):1294–1309
27. Leibe B, Leonardis A, Schiele B (2004) Combined object categorization and segmentation with an implicit shape model. ECCV
28. Leordeanu M, Hebert M (2005) A spectral technique for correspondence problems using pairwise constraints. ICCV
29. Leung T, Malik J (2001) Representing and recognizing the visual appearance of materials using 3-d textons. IJCV
30. Li F, Pietro P (2007) A bayesian hierarchical model for learning natural scene categories. ICCV
31. Li T, Mei T, Kweon I, Hua X (2010) Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology
32. Liu D, Hua G, Viola P, Chen T (2008) Integrated feature selection and higher-order spatial feature extraction for object categorization. CVPR, pp 1–8
33. Liu C, Yuen J, Torralba A (2009) Dense scene alignment using SIFT flow for object recognition. CVPR
34. Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance. CVPR
35. Liu D, Hua X, Yang L, Wang M, Zhang H (2009) Tag ranking. WWW
36. Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
37. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297
38. Marszalek M, Schmid C (2006) Spatial weighting for bag-of-features. CVPR, pp 2118–2125
39. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. NIPS
40. Marszalek M, Schmid C (2007) Semantic hierarchies for visual object recognition. CVPR
41. Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. BMVC
42. Moosmann F, Triggs B, Jurie F (2006) Fast discriminative visual codebooks using randomized clustering forests. NIPS
43. Moosmann F, Nowak E, Jurie F (2008) Randomized clustering forests for image classification. PAMI 30(9):1632–1646
44. Ni B, Tian Q, Yang L, Yan S (2010) Query-log aware content based image retrieval. To be submitted
45. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. CVPR, pp 2161–2168
46. Perronnin F (2008) Universal and adapted vocabularies for generic visual categorization. PAMI 30(7):1243–1256
47. Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. ECCV
48. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. CVPR
49. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. CVPR
50. Rao A, Miller D, Rose K, Gersho A (1996) A generalized VQ method for combined compression and estimation. ICASSP
51. Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York
52. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24(5):513–523
53. Savarese S, Winn J, Criminisi A (2006) Discriminative object class models of appearance and shape by correlatons. CVPR, pp 2033–2040
54. Schindler G, Brown M (2007) City-scale location recognition. CVPR
55. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. ICCV, pp 1470–1477
56. Viola P, Jones M (2001) Robust real-time face detection. ICCV, pp 7–14
57. Wang L (2007) Toward a discriminative codebook: codeword selection across multi-resolution. CVPR
58. Wang F, Jiang Y, Ngo C (2008) Video event detection using motion relativity and visual relatedness. ACM Multimedia, pp 239–248
59. Wang S, Huang Q, Jiang S, Qin L, Tian Q (2009) Visual context rank for web image re-ranking. ACM workshop on LSMRM
60. Wu Z, Ke Q, Sun J (2009) Bundling features for large-scale partial-duplicate web image search. CVPR
61. Wu L, Hoi S, Yu N (2009) Semantic-preserving bag-of-words models for efficient image annotation. ACM workshop on LSMRM, pp 19–26
62. Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. PAMI 30(11):1985–1997
63. Yang J (2007) Evaluating bag-of-visual-words representations in scene classification. ACM Multimedia



64. Yang L, Meer P, Foran D (2007) Multiple class segmentation using a unified framework over mean-shift patches. CVPR, pp 1–8
65. Yates R, Neto B (1999) Modern information retrieval. Addison Wesley Longman Publishing Co. Inc
66. Yuan J, Wu Y, Yang M (2007) Discovery of collocation patterns: from visual words to visual phrases. CVPR, pp 1–8
67. Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive review. IJCV
68. Zhang S, Tian Q, Hua G, Huang Q, Li S (2009) Descriptive visual words and visual phrases for image applications. ACM Multimedia
69. Zhang S, Huang Q, Hua G, Jiang S, Gao W, Tian Q (2010) Building contextual visual vocabulary for large-scale image applications. ACM Multimedia
70. Zhang S, Huang Q, Lu Y, Gao W, Tian Q (2010) Building pair-wise visual word tree for efficient image re-ranking. ICASSP
71. Zheng Y, Zhao M, Neo S, Chua T, Tian Q (2008) Visual synset: a higher-level visual representation. CVPR, pp 1–8
72. Zhou W, Li H, Lu Y, Tian Q (2010) Large scale partial-duplicate image retrieval with bi-space quantization and geometric consistency. ICASSP
73. Zhou W, Lu Y, Song Y, Li H, Tian Q (2010) Spatial coding for large-scale partial-duplicate web image search. ACM Multimedia
74. Zhou W, Tian Q, Yang L, Li H (2010) Latent visual context analysis for image re-ranking. ACM International Conference on Image and Video Retrieval (CIVR), Xi'an, China

Qi Tian is currently an Associate Professor in the Department of Computer Science, the University of Texas at San Antonio (UTSA). During 2008–2009, he took a one-year Faculty Leave at Microsoft Research Asia (MSRA) in the Media Computing Group (former Internet Media Group). He received his Ph.D. in 2002 from UIUC. Dr. Tian's research interests include multimedia information retrieval and computer vision. He has published about 110 refereed journal and conference papers in these fields. His research projects were funded by NSF, ARO, DHS, HP Lab, SALSI, CIAS, CAS and Akiira Media System, Inc. He was the co-author of a Best Student Paper in ICASSP 2006, and co-author of a Best Paper Candidate in PCM 2007. He was a nominee for the 2008 UTSA President Distinguished Research Award. He received the 2010 ACM Service Award for ACM Multimedia 2009. He has been a Guest Editor of IEEE Transactions on Multimedia, ACM Transactions on Intelligent Systems and Technology, Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, and EURASIP Journal on Advances in Signal Processing, and is an associate editor of IEEE Transactions on Circuits and Systems for Video Technology and a member of the Editorial Board of the Journal of Multimedia. He is a Senior Member of IEEE (2003), and a Member of ACM (2004).



Shiliang Zhang is currently pursuing the Ph.D. degree in the Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences. His research interests include image and video processing, large-scale image retrieval, multimedia content analysis, computer vision and pattern recognition. His current research is focused on visual codebook learning and large-scale image retrieval.

Wengang Zhou received his B.S. degree in Electronic Information Engineering from Wuhan University, Wuhan, China, in 2006. He is currently a Ph.D. student in signal and information processing in the EEIS Department, University of Science and Technology of China. His research interests include multimedia and computer vision. He has done some work in Partial Differential Equations (PDE) for bio-medical image processing. His current research is focused on large-scale multimedia information retrieval.



Rongrong Ji is currently a PostDoc Researcher at the Electronic Engineering Department, Columbia University. He obtained his Ph.D. from the Computer Science Department, Harbin Institute of Technology. His research interests include image retrieval and annotation, and video retrieval and understanding. During 2007–2008, he was a research intern at the Web Search and Mining Group, Microsoft Research Asia, mentored by Xing Xie, where he received a Microsoft Fellowship in 2007. During May–June 2010, he was a visiting student at the University of Texas at San Antonio, in cooperation with Professor Qi Tian. During July–November 2010, he was also a visiting student at the Institute of Digital Media, Peking University, under the supervision of Professor Wen Gao. He has published over 40 refereed journal and conference papers, including in Pattern Recognition, CVPR, and ACM Multimedia. He serves as a reviewer for IEEE Transactions on Multimedia, SMC, TKDE, and the ACM Multimedia conference, among others, as an associate editor of the International Journal of Computer Applications, and as a special session chair at ICIMCS 2010. He is a member of the IEEE.

Bingbing Ni received his B.Eng. degree in Electrical Engineering from Shanghai Jiao Tong University (SJTU), China, in 2005. He is currently a Ph.D. candidate in Electrical and Computer Engineering at the National University of Singapore (NUS). His research interests are in the areas of computer vision, machine learning and multimedia.



Nicu Sebe received his PhD degree in 2001 from the University of Leiden, The Netherlands. He is with the Faculty of Cognitive Sciences, University of Trento, Italy, where he is leading the research in the areas of multimedia information retrieval and human-computer interaction in computer vision applications. He is the author of Robust Computer Vision—Theory and Applications (Kluwer, April 2003) and of Machine Learning in Computer Vision (Springer, May 2005). He was involved in the organization of the major conferences and workshops addressing the computer vision and human-centered aspects of multimedia information retrieval, among which as a General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference, FG 2008, the ACM International Conference on Image and Video Retrieval (CIVR) 2007, and WIAMIS 2009, and as one of the initiators and a Program Co-Chair of the Human-Centered Multimedia track of the ACM Multimedia 2007 conference. He is the general chair of ACM CIVR 2010 and a track chair of ICPR 2010. He has served as a guest editor for several special issues in IEEE Computer, Computer Vision and Image Understanding, Image and Vision Computing, Multimedia Systems, and ACM TOMCCAP. He has been a visiting professor at the Beckman Institute, University of Illinois at Urbana-Champaign, and in the Electrical Engineering Department, Darmstadt University of Technology, Germany. He was the recipient of a British Telecomm Fellowship. He is the co-chair of the IEEE Computer Society Task Force on Human-centered Computing and is an associate editor of Machine Vision and Applications, Image and Vision Computing, Electronic Imaging and the Journal of Multimedia.
