
Improvements to the Descriptor of SIFT by BOF Approaches

Zhouxin Yang
Graduate School of Engineering
Hiroshima University
Higashi-Hiroshima, Japan 739-8725
Email: [email protected]

Takio Kurita
Graduate School of Engineering
Hiroshima University
Higashi-Hiroshima, Japan 739-8725
Email: [email protected]

Abstract—The efficacy and efficiency of SIFT have made it a state-of-the-art feature descriptor. It has been widely used in many computer vision applications such as image classification. A large number of methods, e.g. PCA-SIFT, have been contributed to further improve its performance, each focusing on a different component of it. Differing from those previous works, we broach a new scheme to improve the performance of SIFT's descriptor in this paper. We first establish the connection between SIFT and the bag of features (BOF) model in descriptor construction. Based on this connection, we then introduce approaches of BOF, e.g. the preservation of spatial information (we adopt spatial pyramid matching as an example to achieve this goal), into SIFT to enhance its robustness. Experimental results in scene matching and image classification show that the BOF-driven SIFT effectively and consistently outperforms the original SIFT.

I. INTRODUCTION

In computer vision, the feature descriptor is an underlying factor used to summarize the appearance of an image patch. It is expected to be informative, compact and discriminative. Many feature descriptors bearing different functionalities have emerged recently. Among these descriptors, the scale invariant feature transform (SIFT) [1] has received intensive attention and become state-of-the-art for its reliability, efficiency and simplicity. Generally, SIFT comprises two components: the key-point detector and the key-point descriptor. The detector discovers key-points in an image which are invariant to transformation. The key-point descriptor is then used to describe the appearance of the region surrounding this key-point¹. SIFT is extensively used as a local feature descriptor in computer vision applications such as image classification, image retrieval and image stitching.

Bag of features (BOF) [2], which draws its idea from text categorization, is a popular model in image classification and action recognition. BOF represents the query image by an occurrence histogram² counting the occurrences of the descriptors extracted from this image with respect to the visual words in a vocabulary. Usually, four steps are needed to construct the occurrence histogram: descriptor extraction, vocabulary learning, coding of descriptors and pooling of descriptors. As a prevailing model, BOF has seen great advancements contributed by studies in each of these four steps.

¹In this paper, we refer to SIFT by its descriptor, as other documents do.
²This is indeed a descriptor of the query image. However, in order to avoid confusion with local descriptors such as SIFT, we refer to it as the occurrence histogram hereafter.

Studies on descriptor extraction aim to enhance the expressive and discriminative power of the descriptor. Studies in [3], [4] learn higher-level feature descriptors, which are refined to summarize more visual information upon lower-level feature descriptors. The learning of the vocabulary is also widely studied. In order to obtain a more compact and representative vocabulary, works such as [5], which tries to learn a structured vocabulary oriented by the Fisher discrimination criterion, have been presented. Recently, many studies have come to focus on the coding and pooling steps. Sparse coding [6], locality-constrained linear coding [7] and local soft-assignment [8] have achieved notable success by limiting the coding of descriptors to their neighboring visual words. Spatial pyramid matching (SPM) [9] and object-centric spatial pooling (OCP) [10] retain spatial information of descriptors during pooling, which leads to significant improvement of the BOF model in image classification.

Motivated by the study in [12], we broach a new scheme to improve the performance of SIFT in this paper. We first establish the connection between SIFT and BOF in descriptor construction, and then leverage approaches proposed for BOF, such as the SPM introduced above, in the construction of the descriptor in SIFT to improve its performance. To evaluate the efficacy of this combination, we compare the performance of the BOF-driven SIFT with the original SIFT in experiments on scene matching and image classification. Experimental results show that the utilization of these BOF approaches successfully improves the performance of the original SIFT.

The rest of this paper is organized as follows. In section II, we manifest the connection between SIFT and BOF in descriptor construction, following the four steps used to build the occurrence histogram in the BOF model. In section III, we introduce two approaches of BOF into the descriptor construction of SIFT. In section IV, we compare the performance of the BOF-driven SIFT with the original SIFT in experiments on scene matching and image classification. In section V, we conclude our work.

II. CONNECTION BETWEEN SIFT AND BOF

In this section, we build the connection between SIFT and BOF in descriptor construction step by step, following the four steps used to construct the occurrence histogram in the BOF model.


A. Descriptor extraction

In BOF, local feature descriptors, such as SIFT, are extracted densely or sparsely on the given image. The study in [11] found that dense extraction of descriptors performs better than sparse extraction since it retains more visual information. Let $x_i \in \mathbb{R}^p$ denote a $p$-dimensional descriptor.

In SIFT, gradients are extracted for all pixels in the image region surrounding a given key-point. Let $x^m_i$ and $x^\theta_i$ represent the magnitude value and the orientation value of a gradient, respectively. We can treat $x^\theta_i$ as the descriptor in BOF, with $p = 1$. The calculation of $x^\theta_i$ for all pixels is a dense descriptor extraction.
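As an illustration of this view, the following minimal sketch (not the paper's implementation) computes the dense per-pixel gradient magnitudes and orientations of a patch; the function name and the use of OpenCV Sobel filters are our own choices.

```python
import numpy as np
import cv2

def dense_gradients(patch):
    """Per-pixel gradient magnitude and orientation for a grayscale patch.

    Every pixel yields one 'descriptor' x_i^theta (orientation, p = 1)
    with an associated weight x_i^m (magnitude).
    """
    gx = cv2.Sobel(patch, cv2.CV_32F, 1, 0, ksize=1)  # horizontal derivative
    gy = cv2.Sobel(patch, cv2.CV_32F, 0, 1, ksize=1)  # vertical derivative
    magnitude = np.sqrt(gx ** 2 + gy ** 2)            # x^m
    orientation = np.arctan2(gy, gx) % (2 * np.pi)    # x^theta in [0, 2*pi)
    return magnitude, orientation
```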

B. Vocabulary learning

In BOF, a vocabulary is learned on the extracted descriptors by some method such as k-means. Let $V \in \mathbb{R}^{p \times n}$ denote a vocabulary with $n$ visual words, and $v_j \in \mathbb{R}^p$ the $j$th visual word in this vocabulary.

In SIFT, the orientation dimension of the gradient space is evenly divided into bins. We can view the centers of these bins, which are orientation values, as visual words. Usually, the number of bins is set to 8, namely, $n = 8$. The vocabulary $V$ in SIFT is then a manually defined one. Let $v^\theta_j$ denote the value of the center of the $j$th bin. Due to the even distribution of these visual words along the orientation dimension, their values can be simply calculated as follows,

$$v^\theta_j = \frac{2j + 1}{2n} \cdot 2\pi, \quad j \in [0, n - 1]. \quad (1)$$

C. Coding of descriptors

In BOF, the coding of descriptors assigns these descriptors to their corresponding visual words in the vocabulary with corresponding coefficients. The number of visual words and the values of the coefficients vary with different coding schemes. Let $u_{ij}$ denote the coding coefficient of descriptor $x_i$ to visual word $v_j$.

In SIFT, the orientation value $x^\theta_i$ of a pixel is assigned to its two most neighboring bins, along with its corresponding magnitude value as a weight. This procedure can be viewed as the coding of $x^\theta_i$ to its corresponding visual words $v^\theta_j$ by a local soft-assignment scheme, which posits that a descriptor only contributes to its locally nearest visual words. The number of locally nearest visual words in SIFT is two, since $x^\theta_i$ is only assigned to two bins. In the case of evenly distributed visual words, the coding coefficient $u_{ij}$ can be calculated by

$$u_{ij} = \begin{cases} 1 - \dfrac{|x^\theta_i - v^\theta_j|}{dis} & v^\theta_j \in v_{x^\theta_i} \\ 0 & v^\theta_j \notin v_{x^\theta_i} \end{cases}, \quad j \in [0, n - 1], \quad (2)$$

where $dis$ is the Euclidean distance between two neighboring visual words and is identical for any pair of neighboring visual words since the vocabulary is evenly distributed, and $v_{x^\theta_i}$ is the set of locally nearest visual words to which $x^\theta_i$ is assigned.
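A minimal sketch of this coding step, assuming the standard $n = 8$ bins and a circular distance along the orientation axis; the helper name is hypothetical:

```python
import numpy as np

def orientation_coding(x_theta, n=8):
    """Local soft-assignment of one orientation value to its two nearest
    bin centers (Eqs. (1)-(2)); returns the length-n coefficient vector."""
    v = (2 * np.arange(n) + 1) / (2 * n) * 2 * np.pi   # bin centers, Eq. (1)
    dis = 2 * np.pi / n                                 # spacing of adjacent words
    d = np.abs(x_theta - v)                             # circular distance
    d = np.minimum(d, 2 * np.pi - d)
    u = np.zeros(n)
    nearest = np.argsort(d)[:2]                         # two nearest bins only
    u[nearest] = 1.0 - d[nearest] / dis                 # Eq. (2)
    return u
```

For an orientation lying between two adjacent centers, the two coefficients sum to one, exactly the bilinear histogram interpolation of the original SIFT.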

Fig. 1: The gradient space is evenly divided into 6 regions (small rectangles) with $S^\theta = 3$ and $S^m = 2$. Each region's center (circle) is then a visual word of the manually defined vocabulary $V$. A descriptor (cross) is assigned to its 4 nearest visual words (hollow circles).

D. Pooling of descriptors

The pooling of descriptors summarizes the descriptors' coefficients for each visual word via schemes such as sum-pooling, average-pooling and max-pooling. Sum-pooling adds up the coefficients for each visual word. Average-pooling also adds up the coefficients for each visual word but divides the sum by the number of coefficients of this visual word. Max-pooling simply selects the maximum coefficient for each visual word.

In SIFT, the region surrounding the key-point is tiled evenly into blocks. Block descriptors are constructed on each block and finally concatenated together to form the key-point descriptor. In each block, the assignments of orientation values to each bin of the histogram are summed up with corresponding weights such as the magnitude value. This procedure can be interpreted as the sum-pooling of descriptors to visual words in BOF. Denoting the pooling result of the $j$th visual word by $w_j$, it is calculated by

$$w_j = \sum_i u_{ij} \times x^m_i \times w^s_i \times w^g_i, \quad (3)$$

where $w^s_i$ is a spatial weight resulting from the bilinear interpolation of $x^\theta_i$ into the same entry of neighboring block descriptors, and $w^g_i$ is a Gaussian weight calculated from the relative location of $x^\theta_i$ with respect to the center of the surrounding region, for the purpose of giving less emphasis to orientation values that are spatially far away.
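Eq. (3) is a plain weighted sum; a sketch of the pooling over one block, assuming the coding coefficients and the three weights have already been computed as arrays:

```python
import numpy as np

def sum_pool(u, magnitude, w_spatial, w_gauss):
    """Sum-pooling of coded orientations into one block histogram, Eq. (3).

    u:         (num_pixels, n) coding coefficients u_ij
    magnitude: (num_pixels,)   gradient magnitudes x_i^m
    w_spatial: (num_pixels,)   bilinear block weights w_i^s
    w_gauss:   (num_pixels,)   Gaussian distance weights w_i^g
    """
    weights = magnitude * w_spatial * w_gauss          # per-pixel total weight
    return (u * weights[:, None]).sum(axis=0)          # w_j for each visual word
```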

III. INTRODUCING BOF APPROACHES INTO SIFT

Following the operations in [12], we introduce two approaches of BOF into SIFT to improve its performance. These two approaches are the preservation of locality and the preservation of spatial information.

A. Approach 1: the preservation of locality

The preservation of locality supposes that during the coding and pooling of descriptors, a descriptor only interacts with its locally near descriptors or visual words in the descriptor space. This preservation endeavors to fend off descriptors that diverge largely from this descriptor. It can also be explained from the manifold point of view [7]. Local feature descriptors usually dwell approximately in a low-dimensional manifold in the descriptor space, which indicates that only within a locally small region can the Euclidean distance between two descriptors precisely approximate the actual geodesic distance. Outside this region, descriptors with a small Euclidean distance may in fact have a large geodesic distance.

Fig. 2: An example of a spatial pyramid of the region surrounding a key-point, with 3 layers.

The preservation of locality in SIFT can be achieved simply by using the whole gradient as the descriptor rather than merely the orientation value. In the original SIFT, the coding and pooling of descriptors are biased along the orientation dimension. As a result, in the gradient space, gradients with greatly distinct magnitude values but similar orientation values are eventually assigned to the same visual word. This bias can be remedied if the vocabulary spans the whole gradient space, enabling descriptors to be assigned more closely to their corresponding visual words.

Using the whole gradient, i.e. $x_i = (x^\theta_i, x^m_i)$, as the descriptor, we follow the original SIFT and define the vocabulary to be evenly distributed in the descriptor space. This amounts to dividing the descriptor space into $n$ identical regions and taking the center of each region as a visual word to form the vocabulary $V$, as shown in Fig. 1. If we scale the ranges of the orientation dimension and magnitude dimension into $L^\theta$ and $L^m$ respectively, the distances between any two adjacent visual words along the orientation dimension and magnitude dimension, i.e. $dis^\theta$ and $dis^m$, can be calculated by

$$dis^\theta = \frac{L^\theta}{S^\theta}, \quad dis^m = \frac{L^m}{S^m}, \quad (4)$$

where $S^\theta$ and $S^m$ are the numbers of sections into which the orientation dimension and magnitude dimension are divided. The orientation value and magnitude value of visual word $v_{jk}$ ($j \in [0, S^\theta - 1]$, $k \in [0, S^m - 1]$) are then obtained by

$$v^\theta_{jk} = \frac{2j + 1}{2S^\theta} L^\theta, \quad v^m_{jk} = \frac{2k + 1}{2S^m} L^m. \quad (5)$$

Also, the coding coefficient of descriptor $x_i$ to visual word $v_{jk}$ is obtained by

$$u_{ijk} = \begin{cases} \left(1 - \dfrac{|x^\theta_i - v^\theta_{jk}|}{dis^\theta}\right) \times \left(1 - \dfrac{|x^m_i - v^m_{jk}|}{dis^m}\right) & v_{jk} \in v_{x_i} \\ 0 & v_{jk} \notin v_{x_i}, \end{cases} \quad (6)$$

where $v_{x_i}$ is the set of locally nearest visual words to which $x_i$ is assigned. We constrain the number of locally nearest visual words of a descriptor to four. Please refer to Fig. 1 for an illustrated explanation.
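The following sketch implements Eqs. (4)-(6) for one gradient, assuming $S^\theta = 8$ orientation sections alongside the paper's $S^m = 2$, with a circular distance along the orientation axis; the clamping of out-of-range magnitude coefficients to zero is our own defensive choice:

```python
import numpy as np

def gradient_coding_2d(x_theta, x_m, S_theta=8, S_m=2,
                       L_theta=2 * np.pi, L_m=1.0):
    """Bilinear assignment of a whole gradient (orientation, magnitude) to
    its four nearest visual words on the 2-D grid (Eqs. (4)-(6)).
    Returns the (S_theta, S_m) coefficient matrix u_ijk."""
    dis_theta = L_theta / S_theta                      # Eq. (4)
    dis_m = L_m / S_m
    v_theta = (2 * np.arange(S_theta) + 1) / (2 * S_theta) * L_theta  # Eq. (5)
    v_m = (2 * np.arange(S_m) + 1) / (2 * S_m) * L_m

    d_theta = np.abs(x_theta - v_theta)                # circular orientation distance
    d_theta = np.minimum(d_theta, L_theta - d_theta)
    d_m = np.abs(min(x_m, L_m) - v_m)                  # clip magnitude to its range

    u = np.zeros((S_theta, S_m))
    for j in np.argsort(d_theta)[:2]:                  # two nearest orientations
        for k in np.argsort(d_m)[:2]:                  # two nearest magnitudes
            u[j, k] = (1 - d_theta[j] / dis_theta) * \
                      max(0.0, 1 - d_m[k] / dis_m)     # Eq. (6)
    return u
```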

B. Approach 2: the preservation of spatial information

The preservation of the spatial information of descriptors in the resulting occurrence histogram of BOF has contributed to the success of many recent studies. SPM in [9] pools descriptors in a spatial pyramid of the given image. OCP in [10] first localizes the foreground region and then pools descriptors in the foreground and background regions separately. The better results obtained by these studies have exhibited the necessity of spatial information preservation. Indeed, the original SIFT also tries to preserve the spatial information of the gradients extracted from the region surrounding the key-point, by dividing this region into blocks and giving each gradient spatial weights such as $w^s_i$ and $w^g_i$, which vary with the relative location of the gradient in the block and the surrounding region, during pooling.

To preserve more spatial information, we adopt the SPM technique in SIFT. We build a spatial pyramid with $T$ layers, where each layer is a copy of the region surrounding the key-point, and tile the $t$th layer evenly into $4^{t-1}$ blocks³. An example of a spatial pyramid with $T = 3$ is given in Fig. 2. We then pool the gradients in each block to build block descriptors and concatenate the block descriptors on each layer to form the layer descriptor. The key-point descriptor is finally constructed by concatenating all layer descriptors together. Following the operation in the original SIFT to reduce the effects of illumination changes, we normalize each layer descriptor to unit length, suppress large entries in the descriptor with some threshold and normalize it to unit length again.

³The evenly divided surrounding region in the original SIFT can be viewed as a special spatial pyramid with only one layer.
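A sketch of how the layer descriptors might be assembled, assuming a hypothetical `block_hist(t, b)` callback that returns the pooled histogram of block b on layer t; the clipping threshold of 0.2 is the value commonly used in SIFT and is an assumption here, since the paper only says "some threshold":

```python
import numpy as np

def spatial_pyramid_descriptor(block_hist, T=3, clip=0.2):
    """Assemble the key-point descriptor from a T-layer spatial pyramid.

    Layer t is tiled into 4**(t - 1) blocks.  Each layer descriptor is
    L2-normalized, clipped, and re-normalized to reduce illumination effects.
    """
    layers = []
    for t in range(1, T + 1):
        blocks = [block_hist(t, b) for b in range(4 ** (t - 1))]
        layer = np.concatenate(blocks).astype(float)
        layer /= np.linalg.norm(layer) + 1e-12         # first normalization
        layer = np.minimum(layer, clip)                # suppress large entries
        layer /= np.linalg.norm(layer) + 1e-12         # re-normalize
        layers.append(layer)
    return np.concatenate(layers)                      # key-point descriptor
```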

IV. EXPERIMENTS AND RESULTS

We conduct experiments on scene matching and image classification to compare the performance of the BOF-driven SIFT, which leverages approaches of BOF in SIFT, with the original SIFT.

A. Common parameters in implementation

We use the SIFT key-point detector in OpenCV [14] for both the original SIFT and the BOF-driven SIFT, the SIFT descriptor extractor in OpenCV for the original SIFT, and a modified version of it for the BOF-driven SIFT. We adopt $L^m = 1$, $S^m = 2$ and $T = 3$ for the BOF-driven SIFT, and all other parameters are set the same as in the original SIFT. We use the L2-norm rule $H = H / \sqrt{\|H\|^2}$ for normalization, where $H$ is the key-point descriptor for the original SIFT and the layer descriptor for the BOF-driven SIFT.

B. Experiment 1: scene matching

1) Dataset and evaluation method: We use the Oxford dataset [13] (shown in Fig. 3) for this experiment. This dataset contains eight categories of scenes for the evaluation of a descriptor's performance under rotation and scale changes (boat and bark), viewpoint changes (graffiti and wall), image blur (bikes and trees), JPEG compression (ubc) and illumination changes (leuven). Each category contains one reference image (image 1) and five transformed images (image 2 to image 6) for matching. The degree of distortion and transformation increases progressively from image 2 to image 6.

We adopt the "recall versus 1-Precision (1 minus precision)" curve of [13] to display the matching results. The recall and 1-Precision are calculated as follows,

$$\text{recall} = \frac{\#\text{correct matches}}{\#\text{correspondences}}, \quad (7)$$

Fig. 3: Image 1 (above) and image 4 (below) from each category of the Oxford dataset: (a) bark, (b) boat, (c) graffiti, (d) wall, (e) bikes, (f) trees, (g) ubc, (h) leuven.

$$1 - \text{Precision} = \frac{\#\text{false matches}}{\#\text{correct matches} + \#\text{false matches}}, \quad (8)$$

where #correspondences represents the number of corresponding regions in the two matching images and is the same for all descriptors due to the identical key-point detector and surrounding region⁴. A good descriptor is expected to have high recall and low 1-Precision. Two descriptors in two matching images are assumed to be matched if their Euclidean distance is below a given threshold. The "recall versus 1-Precision" curve is obtained by varying this threshold.
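A sketch of how such a curve can be traced from a set of candidate matches with ground-truth labels; all argument names are our own:

```python
import numpy as np

def recall_1precision_curve(distances, is_correct, n_correspondences, thresholds):
    """Trace the 'recall versus 1-Precision' curve of Eqs. (7)-(8).

    distances:  (num_candidates,) descriptor distances of candidate matches
    is_correct: (num_candidates,) boolean ground-truth labels
    """
    curve = []
    for t in thresholds:
        accepted = distances < t                        # accept matches below t
        correct = int(np.sum(accepted & is_correct))
        false = int(np.sum(accepted & ~is_correct))
        recall = correct / n_correspondences            # Eq. (7)
        one_minus_p = false / max(correct + false, 1)   # Eq. (8), guard /0
        curve.append((one_minus_p, recall))
    return curve
```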

2) Results: Due to space limitations, we only show one representative matching result for each category, as shown in Fig. 4 (please refer to the supplementary material for more matching results). In this figure, "original" represents the original SIFT, "app1" represents the BOF-driven SIFT combining the original SIFT with approach 1 of section III, and "app1_app2" a combination of the original SIFT with approaches 1 and 2.

From these matching results, we can observe that the BOF-driven SIFTs outperform the original SIFT in all categories. Moreover, the curves of the BOF-driven SIFTs are smoother and more consistent than those of the original SIFT, which demonstrates the robustness of the BOF-driven SIFT against varied and diverse transformations. On closer observation, the recalls of the BOF-driven SIFTs show a large increase over the original SIFT in boat and graffiti, which contain structured images for the evaluation of affine transformation. The locality preservation in the BOF-driven SIFTs helps to segregate descriptors that diverge greatly after transformation, raising the recall. As for bark and wall, which were also created for the evaluation of affine transformation, the improvements of the BOF-driven SIFTs are not as significant as in boat and graffiti due to the unstructured and repeated textures in these two categories. The addition of spatial information preservation to "app1" generally contributes to higher recalls except in bark, boat and trees. Because of the irregular clusters and blobs in the images of these three categories, the spatial information preservation instead brings in superfluous information and deteriorates the performance of "app1_app2" relative to "app1". The BOF-driven SIFTs also obtain a remarkable increase of recall in ubc, which evaluates JPEG compression.

⁴Note that for some matching pairs (e.g. matching between image 1 and image 3 in bark), #correspondences is zero because of the large distortion in the transformed image.

TABLE I: Classification results

Descriptor    Recognition rate (Average ± Deviation)
original      56.847 ± 0.650%
app1          58.482 ± 0.596%
app1_app2     59.409 ± 0.777%

C. Experiment 2: image classification

1) Dataset and evaluation method: To further evaluate the performance of the BOF-driven SIFT, we also conduct an experiment on image classification. The dataset used in this experiment is the fifteen scenes dataset from [9], which contains fifteen categories of scenes such as industrial, kitchen and coast (some examples of this dataset are displayed in Fig. 5). Each category is composed of 200 to 400 images of average size 250 × 300. Although color images are available in some categories, we transform all images into grayscale for the experiment.

The experiment is repeated five times, and the average of all per-category recognition rates for each descriptor, along with the standard deviation, is calculated to present each descriptor's performance. In each run, we randomly select 50 images from each category for training and use the remaining images for testing. The original SIFT and the BOF-driven SIFTs are extracted on a grid with a spacing of 8 pixels horizontally and vertically, covering the whole image. A vocabulary of 200 visual words is learned by k-means on the descriptors extracted from the training images. A multi-class SVM trained by libsvm [15] is used for classification.
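For reference, a compact sketch of this evaluation pipeline; we substitute scikit-learn's KMeans and SVC (the latter wraps libsvm) and hard assignment for the BOF histograms, and the descriptor and label containers are assumed to exist, one descriptor matrix per image:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bof_histogram(descriptors, vocab):
    """Hard-assign descriptors to nearest visual words and build the
    normalized occurrence histogram of one image (a minimal BOF baseline)."""
    d = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(vocab)).astype(float)
    return hist / (hist.sum() + 1e-12)

# train_descriptor_sets / test_descriptor_sets: lists of per-image
# descriptor matrices; train_labels / test_labels: label arrays (assumed)
vocab = KMeans(n_clusters=200, n_init=4).fit(
    np.vstack(train_descriptor_sets)).cluster_centers_
X_train = np.array([bof_histogram(d, vocab) for d in train_descriptor_sets])
X_test = np.array([bof_histogram(d, vocab) for d in test_descriptor_sets])
clf = SVC().fit(X_train, train_labels)                 # multi-class SVM (libsvm)
accuracy = clf.score(X_test, test_labels)
```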

2) Results: Classification results for all descriptors are shown in Table I. We can see that the BOF-driven SIFTs ("app1" and "app1_app2") both outperform the original SIFT, although the increments in recognition rate are modest due to the large intra-category and small cross-category variation of the dataset, which makes the contribution of the descriptor improvement to the final recognition rate small. "app1_app2" achieves a higher recognition rate than "app1" due to the addition of spatial information preservation. This experimental result further confirms the close relationship between SIFT and BOF in descriptor construction, as well as the efficacy of combining BOF approaches with the original SIFT.

Fig. 4: Comparison of matching results between image 1 and image 2 (or image 6 for wall) in the 8 categories: "recall versus 1 − precision" curves of "original", "app1" and "app1_app2" on (a) bark, (b) bikes, (c) boat, (d) graf, (e) leuven, (f) trees, (g) ubc and (h) wall.

Fig. 5: Examples from each category in the fifteen scenes dataset: bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office and store.

V. CONCLUSION

In this paper, motivated by the study in [12], we introduce a new scheme to improve the performance of SIFT's descriptor by combining it with approaches of BOF, based on the connection between them in descriptor construction. To evaluate how these combinations help to increase the robustness of SIFT, we conduct experiments on scene matching and image classification for these BOF-driven SIFTs and the original SIFT. Desirable results are obtained from these experiments, showing the efficacy of these combinations. Although we only introduce two approaches of BOF into SIFT in this paper, it is possible to apply more advancements of BOF to SIFT to further improve its performance based on the connection established in this paper. Moreover, as a generalization of the original SIFT, previous methods for the improvement of SIFT are also applicable to the BOF-driven SIFT.

REFERENCES

[1] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", IJCV, vol. 60, no. 2, pp. 91-110, 2004.

[2] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints", ECCV, pp. 1-22, 2004.

[3] G. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.

[4] Y. L. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning Mid-Level Features for Recognition", CVPR, pp. 2559-2566, 2010.

[5] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Fisher discrimination dictionary learning for sparse representation", ICCV, pp. 543-550, 2011.

[6] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation", PAMI, vol. 31, no. 2, pp. 210-227, 2009.

[7] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification", CVPR, pp. 3360-3367, 2010.

[8] L. Liu, L. Wang, and X. Liu, "In defense of soft-assignment coding", ICCV, pp. 2486-2493, 2011.

[9] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories", CVPR, vol. 2, pp. 2169-2178, 2006.

[10] O. Russakovsky, Y. Lin, K. Yu, and F. Li, "Object-centric spatial pooling for image classification", ECCV, vol. 2, pp. 1-15, Oct. 2012.

[11] F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories", CVPR, pp. 524-531, 2005.

[12] Z. Yang and T. Kurita, "BOG: an extension of HOG by interpreting it as bag of features", MVA, pp. 415-418, 2013.

[13] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", PAMI, vol. 27, no. 10, pp. 1615-1630, 2005.

[14] G. Bradski, "The OpenCV Library", Dr. Dobb's Journal of Software Tools, 2000.

[15] C. Chang and C. Lin, "LIBSVM: A library for support vector machines", ACM Trans. on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
