
A Survey on Image Retrieval Performance of Different Bag of Visual Words Indexing Techniques

Jit Mukherjee
School of Information Technology
Indian Institute of Technology Kharagpur
Email: [email protected]

Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
Email: [email protected]

Pabitra Mitra
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
Email: [email protected]

Abstract—In this paper, a survey is carried out on the image retrieval performance of the bag of visual words (BoVW) method under different indexing techniques. The bag of visual words method is a content based image retrieval technique in which an image is represented as a sparse vector of occurrences of visual words. Different indexing techniques are used to compute near-similar visual word vectors of a query image: locality sensitive hashing, SR-tree based indexing, and naive L1 and L2 norm based distance computation. Standard datasets such as UKBench [19] and the holiday dataset [9], as well as images from SMARAK¹, are used for performance analysis.

Keywords—SURF, Bag of Visual Words, SR-tree, LSH, L1 & L2 Norm.

I. INTRODUCTION

Bag of Words is a widely used method in information retrieval, where a document is indexed by converting it into an order-less collection of words. In this approach, only the vocabulary is considered, disregarding the grammar [4], [1]. In computer vision, a similar concept is used: an image is represented as a sparse vector of the number of occurrences of visual words, which are themselves represented by multidimensional feature vectors. The idea was first used for texture recognition [13]. In content based image retrieval, the BoVW technique is found to outperform several other methods in the literature, and it is widely used in many systems [22], [20], [14]. In the BoVW technique, a query image is searched by the frequency of occurrences of visual words belonging to a precomputed corpus of feature vectors, called a bag of visual words. In forming the description of an image as a collection of visual words, its key points are detected, and functional variations around each point's neighborhood, such as the distribution of gradient directions, are used to describe the feature point in a multidimensional real space. The pool of multidimensional feature descriptors from all the images is then clustered to form a visual vocabulary or codeword dictionary. Different clustering techniques have been compared for this visual codebook generation [11]. In a conventional local feature based image retrieval system, these key points are serially searched for similar key points using nearest neighbor queries, range queries, etc. But matching key points in a large dataset takes a very large amount of time. In the bag of visual words representation, the frequency of occurrence of each visual word in an image is computed, and this acts as a sparse representative vector of the whole image. The code words in the visual vocabulary are indexed and searched by the query image's representative sparse vector, which takes much less time than matching each feature descriptor against every other descriptor. The BoVW process of image searching has the following computational steps:

¹ https://imedix.iitkgp.ac.in/SMARAK/

1) Detection and description of interest points in the training images,
2) Building a visual vocabulary,
3) Image representation and indexing, and
4) Searching a query image by its representative visual code word.

In this paper, we present a survey of different search strategies employed in bag of visual words retrieval and compare their performance with respect to time and accuracy. The survey aims to guide the choice of strategy under different circumstances. The techniques used in this survey are briefly described below.

II. DETECTION AND DESCRIPTION OF INTEREST POINTS

Several efficient methods, such as SIFT [15], [6], [16], [17], ASIFT [18], [15], [16], PCA-SIFT [10], [17], SURF [2], [17], and BRISK [24], have been proposed for key-point detection and description. On account of its low time complexity, wide applicability, and simplicity, Speeded Up Robust Features (SURF) [2] is used here for key-point detection.

A. SURF key-point Detection

In the Speeded Up Robust Features (SURF) detection technique, the local maxima of the determinant of the Hessian matrix are computed for selecting interest points. At various scales, the convolution of the image at a point with second order Gaussian derivatives is carried out. This is approximated with box filters computed over integral images.

A circular region of radius 6 times the scale of the detected key point is considered for making the measure rotation invariant. Using a sliding window of 60 degrees, the Haar wavelet responses are calculated. The direction which maximizes the sum of the Haar wavelet responses is chosen as the dominant direction of the selected key point. Thus, rotation and scale invariance are ensured in SURF. It is fast because of its approximation property and computation over integral images [2].


B. SURF key-point Description

For each key point, a 64-dimensional vector is computed. A square descriptor window of size 20 times the scale used in filtering is constructed around the key point and divided into 4×4 subregions. For each subregion, four values reflecting the statistics of local gradients are computed: $\sum d_x$, $\sum |d_x|$, $\sum d_y$, and $\sum |d_y|$, where $d_x$ and $d_y$ are the Haar wavelet responses in the horizontal and vertical directions. The 4×4 subregions with four values each yield the 64 dimensions of the descriptor.
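As a concrete illustration, the following is a minimal sketch of extracting SURF key points and 64-dimensional descriptors through the legacy OpenCV C interface (Section VI reports that SMARAK is implemented in C with OpenCV). The header locations, the Hessian threshold of 500, and the availability of cvExtractSURF vary across OpenCV versions, so treat this as an assumption-laden sketch, not the authors' code.

    #include <stdio.h>
    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    int main(int argc, char **argv)
    {
        /* Load the image in grayscale; SURF operates on intensity only. */
        IplImage *img = cvLoadImage(argv[1], CV_LOAD_IMAGE_GRAYSCALE);
        CvMemStorage *storage = cvCreateMemStorage(0);
        CvSeq *keypoints = 0, *descriptors = 0;

        /* Hessian threshold 500 (an assumed value); extended = 0 selects
         * the 64-dimensional descriptor variant described above. */
        CvSURFParams params = cvSURFParams(500, 0);
        cvExtractSURF(img, 0, &keypoints, &descriptors, storage, params, 0);

        printf("%d key points detected\n", keypoints->total);

        cvReleaseImage(&img);
        cvReleaseMemStorage(&storage);
        return 0;
    }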

III. BUILDING A VISUAL VOCABULARY

These multidimensional feature descriptors are clustered by the K-means clustering algorithm [8] to generate a visual vocabulary. Each cluster is treated as a distinct visual word in the vocabulary, represented by its cluster center. In the bag of words method for text retrieval, the size of the vocabulary is predetermined. Similarly, in the bag of visual words method, the size of the vocabulary is determined by the clustering algorithm (i.e., K, the number of clusters). This size also depends on the dataset considered for building the vocabulary. If the vocabulary is very small, it is difficult to differentiate two distinct sets of features. On the other hand, a very large vocabulary lacks generalization and becomes more prone to noise. However, a large vocabulary is found to improve performance on larger datasets compared to a smaller vocabulary. In the bag of visual words representation, only the count of occurrences of visual words is used; no spatial arrangement or ordering is considered. After clustering of the high dimensional descriptors is complete, the cluster centers of the vocabulary are stored for computing the BoVW vector of a query image.

The K-means clustering algorithm is very costly in terms of time. As the number of clusters increases, the cost of computing the bag of visual words for each training image also increases drastically. Parallel programming is used to speed up the computation. We have implemented an MPI (Message Passing Interface) [21] based K-means clustering approach in this work [26]. In this approach, the data points are evenly distributed among the processes, and the initial K points are selected as the cluster centers. In each process, each data point is assigned to its nearest cluster center, and the new centroid of each cluster is then calculated. The overall process is repeated until the change between the newly computed cluster centers and the old ones falls below a threshold [26].
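To make the parallel scheme concrete, here is a minimal sketch, not the authors' code, of one iteration of MPI-parallel K-means as described above: each process assigns its local share of the points to the nearest of the K current centers, and partial sums and counts are combined with MPI_Allreduce before the centers are recomputed. The dimension D = 64 (the SURF descriptor length) and all identifiers are illustrative assumptions.

    #include <mpi.h>
    #include <float.h>
    #include <stdlib.h>

    #define D 64   /* SURF descriptor dimension */

    /* Squared Euclidean distance between two D-dimensional vectors. */
    static double sq_l2(const double *a, const double *b)
    {
        double s = 0.0;
        for (int i = 0; i < D; ++i) {
            double t = a[i] - b[i];
            s += t * t;
        }
        return s;
    }

    /* One K-means iteration over the n_local points of this process,
     * stored row-major in x (n_local x D). centers (K x D) is updated. */
    void kmeans_step(const double *x, int n_local, int K, double *centers)
    {
        double *sum  = calloc((size_t)K * D, sizeof *sum);
        double *gsum = calloc((size_t)K * D, sizeof *gsum);
        int *cnt  = calloc(K, sizeof *cnt);
        int *gcnt = calloc(K, sizeof *gcnt);

        /* Assignment step: each local point votes for its nearest center. */
        for (int i = 0; i < n_local; ++i) {
            int best = 0;
            double bestd = DBL_MAX;
            for (int k = 0; k < K; ++k) {
                double d = sq_l2(&x[i * D], &centers[k * D]);
                if (d < bestd) { bestd = d; best = k; }
            }
            for (int j = 0; j < D; ++j)
                sum[best * D + j] += x[i * D + j];
            cnt[best]++;
        }

        /* Combine the partial sums and counts of all processes. */
        MPI_Allreduce(sum, gsum, K * D, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Allreduce(cnt, gcnt, K, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Update step: every process recomputes the same new centers. */
        for (int k = 0; k < K; ++k)
            if (gcnt[k] > 0)
                for (int j = 0; j < D; ++j)
                    centers[k * D + j] = gsum[k * D + j] / gcnt[k];

        free(sum); free(gsum); free(cnt); free(gcnt);
    }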

IV. IMAGE REPRESENTATION AND INDEXING

An image is represented as a sparse histogram vector of visual word occurrences. This is analogous to the representation of a document in conventional information retrieval, where documents are represented as sparse vectors of terms. Each descriptor of an image is mapped to the visual word nearest to it. However, similar images with different resolutions and scales may contain different numbers of key points, which would break the uniformity of their representative vectors. So, each vector is normalized by the total number of descriptors found in the respective image. The resulting vector of each image is indexed as its bag of visual words representation.
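Reusing the sq_l2 helper and the dimension D from the K-means sketch above, the following hypothetical routine turns an image's descriptors into its normalized BoVW vector: each descriptor votes for its nearest of the K cluster centers, and the histogram is divided by the descriptor count so that images of different sizes remain comparable.

    #include <float.h>
    #include <string.h>

    /* Build the normalized BoVW histogram of one image from its n_desc
     * descriptors (row-major, n_desc x D) and the K cluster centers. */
    void bovw_vector(const double *desc, int n_desc,
                     const double *centers, int K, double *hist /* K, out */)
    {
        memset(hist, 0, (size_t)K * sizeof *hist);
        for (int i = 0; i < n_desc; ++i) {
            int best = 0;
            double bestd = DBL_MAX;
            for (int k = 0; k < K; ++k) {
                double d = sq_l2(&desc[i * D], &centers[k * D]);
                if (d < bestd) { bestd = d; best = k; }
            }
            hist[best] += 1.0;   /* count of the nearest visual word */
        }
        for (int k = 0; k < K; ++k)
            hist[k] /= n_desc;   /* normalize by the descriptor count */
    }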

Fig. 1. Block Diagram of Bag of Visual Words Method

V. SEARCHING A QUERY IMAGE BY VISUAL CODE WORD

A query image is encoded into a vector of counts of visual word occurrences in the generated vocabulary. Local feature descriptors are computed from the query image by the same technique used for vocabulary generation. These 64-dimensional descriptors are searched for their nearest cluster centers. This can be achieved either by indexing and searching the cluster centers in a multidimensional index structure or by a naive search with an L2 distance measure; both methods are considered here. For indexing the cluster centers, SR-tree based indexing [12] is used. Each descriptor vector is searched for its nearest neighbor in the index file. The nearest cluster point (visual word) is tagged as the representative visual code word of the descriptor, and the count of that visual word is incremented by one in the representative sparse vector of the query image.

Nearest cluster points are also obtained by finding the minimum distance between a descriptor and each cluster point in the 64-dimensional space. This process is iterated over all descriptors to form the resulting bag of visual words vector of the query image. Once the query image is mapped to visual words in the vocabulary, the search for similar images is initiated.

Locality sensitive hashing [5], [23], the SR-tree [12], the L1 norm, and the L2 norm are used in our study to find the nearest visual word vectors of the query image's representative vector. These methods are described below. A block diagram of the overall method is shown in Fig. 1.

A. Locality Sensitive Hashing (LSH)

The core concept of locality sensitive hashing (LSH) is that if two points are close in an $N$-dimensional space, they stay close when projected into an $M$-dimensional space with $M < N$. Conversely, if those points are far apart in the $N$-dimensional space, they will most likely stay far apart in the $M$-dimensional space [23]. A visual word vector $\vec{a}$ of a training image $I$ is projected onto randomly chosen vectors $\vec{b}$ a specific number of times (empirically chosen here as 1000). As stated in [5], [23], the random values are drawn from a standard normal distribution with zero mean and unit standard deviation. The scalar projection $h(\vec{a})$ is given by

$$h(\vec{a}) = \vec{b} \cdot \vec{a} \qquad (1)$$

where $\vec{a}$ is the bag of visual words vector and $\vec{b}$ is a randomly selected vector. The scalar projection is then quantized into a set of hash bins:

$$H_{b,q}(\vec{a}) = \left\lfloor \frac{\vec{b} \cdot \vec{a} + q}{w} \right\rfloor \qquad (2)$$

where $w$ is the width of the quantization bin and $q$ is an offset drawn uniformly from $[0, w)$. In our experiments, the bucket width $w$ is empirically chosen as 0.1. For a query image, the vector is projected with the same projection vectors, and hash bins are computed accordingly. The nearest neighbor search is performed by examining all the points that fall into the same hash bins.
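A minimal sketch of this hashing step in C, with the names and the use of rand() as illustrative assumptions: each of the n_hash projection vectors $\vec{b}$ (with standard normal entries, here generated via a Box-Muller transform) and offsets q maps the BoVW vector to one integer bin per Eq. (2).

    #include <math.h>
    #include <stdlib.h>

    /* One standard normal sample via the Box-Muller transform, used to
     * fill the entries of the random projection vectors b. */
    double gauss(void)
    {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
    }

    /* Hash one K-dimensional BoVW vector a into n_hash bins.
     * b holds the projection vectors (n_hash x K, row-major), q the
     * offsets drawn uniformly from [0, w), and w is the bucket width. */
    void lsh_hash(const double *a, int K, const double *b, const double *q,
                  double w, int n_hash, long *bins /* n_hash, out */)
    {
        for (int h = 0; h < n_hash; ++h) {
            double dot = 0.0;
            for (int k = 0; k < K; ++k)
                dot += b[h * K + k] * a[k];            /* b . a, Eq. (1) */
            bins[h] = (long)floor((dot + q[h]) / w);   /* quantize, Eq. (2) */
        }
    }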

B. Sphere Rectangle Tree (SR-tree) based Indexing

The sphere rectangle tree (SR-tree) is a multidimensional indexing structure. It has been found to outperform [12] several index structures such as the SS-tree [25], R-tree [7], and R*-tree [3], which compute nearest vectors within a minimum bounding rectangle (MBR) or a minimum bounding sphere (MBS). A bounding sphere divides a space into short-diameter regions, whereas a bounding rectangle divides it into short-volume regions, thereby improving the disjointness of the regions [12]. The SR-tree index structure uses both, specifying each region as the intersection of a bounding sphere and a bounding rectangle. Insertion of a new point in the SR-tree follows the SS-tree insertion algorithm: a new entry is accommodated in the subtree whose centroid is nearest to the entry being made. On insertion of a new point, the corresponding bounding rectangle and bounding sphere are updated. Nearest neighbor search in the SR-tree structure is a depth-first search (DFS) [12]. By this, a candidate set of BoVW vectors nearest to the query image's BoVW vector is generated [12].
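The intersection property is what drives the pruning during that depth-first search. The sketch below, an illustration under assumed names rather than the SR-tree paper's code, shows the key computation: the lower bound on the distance from a query point to a region is the larger of its distances to the bounding sphere and to the bounding rectangle, and a subtree can be skipped whenever this bound exceeds the best distance found so far.

    #include <math.h>

    #define DIM 64  /* assumed vector dimension */

    typedef struct {
        double center[DIM]; /* bounding sphere center */
        double radius;      /* bounding sphere radius */
        double lo[DIM];     /* bounding rectangle: lower corner */
        double hi[DIM];     /* bounding rectangle: upper corner */
    } sr_region;

    /* Lower bound on the distance from query q to the region, which is
     * the intersection of the sphere and the rectangle: take the larger
     * of the two individual lower bounds. */
    double region_mindist(const double *q, const sr_region *r)
    {
        double ds2 = 0.0, dr2 = 0.0;
        for (int i = 0; i < DIM; ++i) {
            double dc = q[i] - r->center[i];
            ds2 += dc * dc;
            double t = q[i] < r->lo[i] ? r->lo[i] - q[i]
                     : q[i] > r->hi[i] ? q[i] - r->hi[i] : 0.0;
            dr2 += t * t;
        }
        double ds = sqrt(ds2) - r->radius;   /* distance to sphere surface */
        if (ds < 0.0) ds = 0.0;              /* query inside the sphere */
        double dr = sqrt(dr2);               /* distance to rectangle */
        return ds > dr ? ds : dr;
    }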

C. Choice of Distance Functions

In choosing distance functions, we consider two norms, namely the L1 and L2 norms. These belong to the Lp spaces (Lebesgue spaces). The Lp norm (p-norm) of an n-dimensional vector x, for any real p ≥ 1, is computed as

$$\|x\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_n|^p \right)^{1/p} \qquad (3)$$

The L1 norm corresponds to the Manhattan distance, and the L2 norm is the Euclidean norm.

This is a naive approach: the L1 norm of the difference between each stored vector and the query image's vector is computed, the K nearest neighbors are selected, and they are marked as the images similar to the query image. The L2 norm is used in the same way.
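The exhaustive scan is then a single loop over the database. Below is a minimal sketch, with illustrative names, of the distance of Eq. (3) for p = 1 or 2 and the naive nearest-image search; extending it to the top K matches is mechanical.

    #include <float.h>
    #include <math.h>
    #include <stddef.h>

    /* Lp distance between two n-dimensional vectors for p = 1 or p = 2. */
    double lp_dist(const double *a, const double *b, int n, int p)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = fabs(a[i] - b[i]);
            s += (p == 1) ? d : d * d;
        }
        return (p == 1) ? s : sqrt(s);
    }

    /* Exhaustive scan: return the index of the stored BoVW vector (db is
     * n_img x K, row-major) nearest to the query under the chosen norm. */
    int nearest_image(const double *query, const double *db,
                      int n_img, int K, int p)
    {
        int best = -1;
        double bestd = DBL_MAX;
        for (int i = 0; i < n_img; ++i) {
            double d = lp_dist(query, &db[(size_t)i * K], K, p);
            if (d < bestd) { bestd = d; best = i; }
        }
        return best;
    }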

TABLE I. mAP, RECALL AND F-SCORE ON THE UKBENCH DATASET AT TOP FOUR

          L1 Norm   L2 Norm   SR-tree   Locality Sensitive Hashing
mAP       86.25     74.06     69.68     73.43
Recall    86.25     74.06     69.68     73.43
F-score   86.25     74.06     69.68     73.43

VI. RESULTS & DISCUSSION

In this section, the performance of the bag of visual words based retrieval technique coupled with different indexing schemes is discussed. For performance analysis, the UKBench dataset [19] and the holiday dataset [9] are used. Images from SMARAK, a web-based image archival system for archiving and sharing Indian cultural heritage images developed in our laboratory, are also used. The performance metrics used here are mean average precision (mAP), average recall, and F-score. Precision (P) and recall (R) are defined as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. The F-measure (F-score) is the harmonic mean of precision and recall, computed as

$$\text{F-measure} = \frac{2 \times P \times R}{P + R}$$

The UKBench dataset consists of 10200 images with at least 4 occurrences of each sample; more than two thousand samples have exactly four occurrences. Key points of each image are computed by the SURF key-point detector and descriptor. The vocabulary of this large dataset is obtained by K-means clustering with K = 1000. The bags of visual words of all the images are thus formed and indexed for searching. The mAP, average recall, and F-score measures for the UKBench dataset at top four and top five are shown in Table I and Table II. From the holiday dataset [9], 678 images are used for experimentation. Images in the holiday dataset have similar instances, but the number of instances varies from 2 to 8. The dissimilar images in the holiday and UKBench datasets vary in several aspects, such as color, illumination, camera viewpoint, etc. The results for the holiday dataset are shown in Table III. On both the holiday and UKBench datasets, it is observed that the L1 norm gives better performance than the L2 norm across all the methods. A typical example of the top four results from the UKBench dataset is shown here. Fig. 2 shows results retrieved using the L1 norm at top four. The results are ranked and shown in clockwise order from the top left image in the second row; all ranked results in this paper are shown this way. Fig. 3 shows the retrieved results for the same query image when the L2 norm is used instead of the L1 norm. Similarly, the images retrieved using LSH and SR-tree indexing are shown in Fig. 4 and Fig. 5, respectively.


TABLE II. mAP, RECALL AND F-SCORE ON THE UKBENCH DATASET AT TOP FIVE

          L1 Norm   L2 Norm   SR-tree   Locality Sensitive Hashing
mAP       70.25     62.50     57        61.75
Recall    87.81     78.12     71.25     77.18
F-score   78.05     69.44     63.33     68.60

TABLE III. mAP, RECALL AND F-SCORE ON THE HOLIDAY DATASET AT TOP FIVE

          L1 Norm   L2 Norm   SR-tree   Locality Sensitive Hashing
mAP       46        42        40        42
Recall    86.06     80        74.33     79.53
F-score   60        55.05     52.01     54.9

Fig. 2. Retrieved images using the L1 norm. Top: query image; results in clockwise order from top left.

Fig. 3. Retrieved images using the L2 norm. Top: query image; results in clockwise order from top left.

Fig. 4. Retrieved images using LSH. Top: query image; results in clockwise order from top left.

Fig. 5. Retrieved images using the SR-tree. Top: query image; results in clockwise order from top left.

The precision and recall values on the UKBench dataset at top 4 are identical because every image in the UKBench dataset has exactly 4 similar instances. The precision on the holiday dataset is found to be low because the number of similar images per query there varies from 2 to 8, while precision and recall are measured at top 5; images with more than 5 similar instances therefore lower the precision. If the values were taken at top 8 or 10 instead, recall would suffer because of the images with fewer similar instances. It has been found that, in all cases, the L1 norm based exhaustive search performs best among the searching techniques considered here. The search with the L2 norm also performs better than SR-tree based indexing and LSH. But, as expected, exhaustive search takes longer than the methods using SR-tree and LSH indexing (Table IV). For a large database of images, the computation time of exhaustive search with the L1 and L2 norms increases significantly. Locality sensitive hashing has also been tested with a Cauchy distribution, as mentioned in [5], but its performance under the Cauchy distribution is found to be poorer than under the standard normal distribution used here. As exhaustive search techniques are not suitable for real time systems like SMARAK, appropriate indexing and hashing techniques are incorporated to enhance the speed of searching.


TABLE IV. AVERAGE RUNNING TIME ON THE UKBENCH DATASET

               L1 Norm   L2 Norm   SR-tree   Locality Sensitive Hashing
Average time   0.9 ms    0.9 ms    0.4 ms    0.34 ms

TABLE V. PRECISION OF L1 NORM AND LSH IN SMARAK FOR IMAGES DOWNLOADED FROM GOOGLE, AT TOP 10

Technique                      Precision
L1 Norm                        76.38%
Locality Sensitive Hashing     67.75%


As expected, exhaustive search mechanisms using the L1 and L2 norms produce better quality results than those obtained using the SR-tree and LSH indexing schemes, but the use of indexing reduces search time. Of the two indexing schemes used in this survey, LSH provides faster response times and is more precise than SR-tree based indexing. Table VI shows the average search time for heritage images in SMARAK under the different methods. The SMARAK system has been implemented in the C programming language with the OpenCV application programming interface on a PC (Intel(R) Xeon(R) CPU, 8 processors, 16 GB RAM) in a Linux environment. In the SMARAK database, images are contributed by different users in the domain of Indian cultural heritage; hence, only precision is computed here, as the exact number of images similar to a given image is difficult to estimate. The images searched here are taken from Google image search results; these searches were carried out in Google by providing specific metadata related to SMARAK. The results from Google image search are stored and then searched for similar images in SMARAK. Table V shows the average precision at top ten of the two searching techniques, namely L1 based exhaustive search and locality sensitive hashing based search. The query images vary in size and are usually significantly larger than UKBench and holiday images; thus they take more time to search, as observed from Table VI, where search times for query images with different spatial resolutions are reported. Hence, it may be summarized that, among the discussed methods, the L1 norm based exhaustive similarity search provides the best quality of search results, but since an interactive system requires fast response times, the LSH indexing based search technique is recommended: it is fast and provides satisfactory results in retrieving similar images.

VII. CONCLUSION

In this paper, different search mechanisms are compared by judging their performance for bag of visual words based image retrieval. It has been experimentally found that L1 norm based similarity search performs best at providing relevant images, compared with L2 norm based exhaustive search and the SR-tree and LSH indexing based techniques.

TABLE VI. AVERAGE RUNNING TIME OF L1 NORM AND LSH FOR HERITAGE IMAGES

Spatial Resolution   L1 Norm      Locality Sensitive Hashing
800 × 500            8 seconds    5 seconds
1000 × 800           20 seconds   14 seconds
2000 × 1000          51 seconds   39 seconds

But for a huge database and large images, searching with the L1 or L2 norm takes significantly more computation time, whereas indexing using the SR-tree and LSH reduces it to a great extent. Hence, a real time interactive system like SMARAK needs to incorporate such mechanisms.

VIII. ACKNOWLEDGMENT

This work was carried out under the sponsorship of the Dept. of Science and Technology (DST), Govt. of India, through sanction number NRDMS/11/1586/2009.

REFERENCES

[1] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto, Modern information retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999.

[2] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, Speeded-up robust features (SURF), Computer Vision and Image Understanding 110 (2008), no. 3, 346–359.

[3] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Special Interest Group On Management Of Data (SIGMOD) Rec. 19 (1990), no. 2, 322–331.

[4] Maya Carrillo and Aurelio López-López, Concept based representations as complement of bag of words in information retrieval, Artificial Intelligence Applications and Innovations (AIAI), Springer, 2010, pp. 154–161.

[5] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the twentieth annual symposium on Computational geometry, ACM, 2004, pp. 253–262.

[6] Jun Jie Foo and Ranjan Sinha, Pruning SIFT for scalable near-duplicate image matching, Proceedings of the eighteenth conference on Australasian database, Australian Computer Society, 2007, pp. 63–71.

[7] Antonin Guttman, R-trees: a dynamic index structure for spatial searching, International Conference on Management of Data, ACM, 1984, pp. 47–57.

[8] J. A. Hartigan and M. A. Wong, A K-means clustering algorithm, Applied Statistics 28 (1979), 100–108.

[9] Hervé Jégou, Matthijs Douze, and Cordelia Schmid, Hamming embedding and weak geometric consistency for large scale image search, European Conference on Computer Vision, Springer, 2008, pp. 304–317.

[10] I. T. Jolliffe, Principal component analysis, Springer-Verlag, 1986.

[11] Frédéric Jurie and Bill Triggs, Creating efficient codebooks for visual recognition, Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1, IEEE Computer Society, 2005, pp. 604–610.

[12] Norio Katayama and Shin'ichi Satoh, The SR-tree: an index structure for high-dimensional nearest neighbor queries, Special Interest Group On Management Of Data (SIGMOD) Rec. 26 (1997), no. 2, 369–380.

[13] Thomas Leung and Jitendra Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, International Journal of Computer Vision 43 (2001), no. 1, 29–44.

[14] Jialu Liu, Image retrieval based on bag-of-words model, Computing Research Repository (CoRR) (2013).

[15] David G. Lowe, Object recognition from local scale-invariant features, Proceedings of the International Conference on Computer Vision, IEEE Computer Society, 1999, pp. 1150–1157.

[16] David G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004), no. 2, 91–110.

[17] Juan Luo and Gwon Oubong, A comparison of SIFT, PCA-SIFT and SURF, International Journal of Image Processing (IJIP) 3 (2009), no. 4, 143–152.

[18] Jean-Michel Morel and Guoshen Yu, ASIFT: a new framework for fully affine invariant image comparison, SIAM Journal on Imaging Sciences (SIIMS) 2 (2009), no. 2, 438–469.

[19] D. Nister and H. Stewenius, Scalable recognition with a vocabulary tree, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2006, pp. 2161–2168.

[20] Eric Nowak, Frédéric Jurie, and Bill Triggs, Sampling strategies for bag-of-features image classification, Proceedings of the 9th European Conference on Computer Vision, Part IV, Springer-Verlag, 2006, pp. 490–503.

[21] Michael J. Quinn, Parallel programming in C with MPI and OpenMP, McGraw-Hill Education Group, 2003.

[22] J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings of the Ninth IEEE International Conference on Computer Vision, Volume 2, IEEE Computer Society, 2003, pp. 1470–1477.

[23] M. Slaney and M. Casey, Locality-sensitive hashing for finding nearest neighbors, IEEE Signal Processing Magazine 25 (2008), no. 2, 128–131.

[24] Stefan Leutenegger, Margarita Chli, and Roland Siegwart, BRISK: binary robust invariant scalable keypoints, Proceedings of the IEEE International Conference on Computer Vision, IEEE Computer Society, 2011, pp. 2548–2555.

[25] David A. White and Ramesh Jain, Similarity indexing with the SS-tree, International Conference on Data Engineering, IEEE Computer Society, 1996, pp. 516–523.

[26] Jing Zhang, Gongqing Wu, Xuegang Hu, Shiying Li, and Shuilong Hao, A parallel K-means clustering algorithm with MPI, Proceedings of the 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, IEEE Computer Society, 2011, pp. 60–64.
