
Cross-domain Image Localization by Adaptive Feature Fusion

Neelanjan Bhowmik, Li Weng, Valérie Gouet-Brunet, Bahman Soheilian
Univ. Paris-Est, LASTIG MATIS, IGN, ENSG, F-94160 Saint-Mandé, France

Abstract—We address the problem of cross-domain image localization, i.e., estimating the pose of a landmark from visual content acquired under various conditions, such as old photographs, paintings, photos taken at a particular season, etc. We explore a 2D approach where the pose is estimated from geo-localized reference images that visually match the query image. This work focuses on the retrieval of similar images, which is a challenging task for images across different domains. We propose a Content-Based Image Retrieval (CBIR) framework that adaptively combines multiple image descriptions. A regression model is used to select the best feature combinations according to their spatial complementarity, globally for a whole dataset as well as adaptively for each given image. The framework is evaluated on different datasets and the experiments prove its advantage over classical retrieval approaches.

I. INTRODUCTION

Image-based localization is an alternative to conventional signal-based positioning solutions represented by GPS. This technology is particularly interesting when precise localization solutions are unavailable. It can potentially be used for automatic navigation [1]. It is also a useful tool for location-related multimedia applications [2], [3], such as landmark recognition and augmented reality. The goal of image-based localization is to infer where an image was taken by comparing it with a database of geo-referenced visual information. Based on this principle, the localization problem is typically modelled as an image feature retrieval scenario, and solved by exact or approximate nearest neighbor search. More specifically, features are extracted from a query image and compared with features in a database; the location is inferred from e.g. the best matches. Depending on the required accuracy, there are mainly two tasks: 1) place recognition and 2) camera pose estimation. The former estimates the zone where the image was acquired; the latter estimates the pose with six degrees of freedom (6-DOF). Related research mainly focuses on accuracy, efficiency, and scale. In order to derive a solution for large cities, or even the whole globe, various efforts have been devoted to database indexing and query strategies [4], [5], [6], [7].
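To make the retrieval principle above concrete, the following minimal sketch (our toy example, not the paper's system) matches query descriptors to geo-referenced database descriptors by exact nearest-neighbor search and infers the place by voting over the best matches. At scale, approximate search with indexing structures such as vocabulary trees [11] would replace the brute-force loop.

```python
import numpy as np

def localize_by_voting(query_desc, db_desc, db_zones):
    """query_desc: (m, d); db_desc: (n, d); db_zones: n zone labels."""
    votes = {}
    for q in query_desc:
        # exact nearest neighbor by Euclidean distance
        nn = int(np.argmin(np.linalg.norm(db_desc - q, axis=1)))
        votes[db_zones[nn]] = votes.get(db_zones[nn], 0) + 1
    return max(votes, key=votes.get)

rng = np.random.default_rng(0)
db_desc = rng.normal(size=(100, 64))                 # mock local descriptors
db_zones = ["zone_%d" % (i % 5) for i in range(100)]
# noisy views of database items belonging to zone_2
query = db_desc[[2, 7, 12, 17]] + 0.05 * rng.normal(size=(4, 64))
print(localize_by_voting(query, db_desc, db_zones))  # -> "zone_2"
```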

In this work, we study a novel application called cross-domain image localization [8], [9]. Our goal is to localize query images which differ in characteristics from the database images, such as postcards vs. street views. The challenge is tackled in two steps: 1) retrieve the most similar images from a database; 2) perform place recognition or pose estimation. This paper only covers the first step, where our contribution is about using multiple features for image retrieval.

This work was supported by the French project POEME ANR-12-CORD-0031 and the European project Things2Do KET ENIAC/JU Call 2013-2.

Since existing work typically uses a single kind of visual features, such as SIFT [10], we propose to improve retrieval accuracy by incorporating multiple feature categories. This approach has not been widely considered for image localization. We design a novel framework for query-adaptive feature fusion and automatic algorithm configuration. It can effectively improve localization accuracy by increasing retrieval precision.

The rest of the paper is organized as follows: Section II revisits related work on image localization; Section III describes our proposed methodology; Section IV presents the experimental results; Section V concludes the work.

II. RELATED WORK

Existing work on image-based localization typically uses a 3D point retrieval approach. Place recognition and camera pose estimation are solved by point-based voting and matching. For example, Schindler et al. propose a city-scale place recognition scheme [4]. They use a vocabulary tree [11] to index SIFT features with improved strategies for tree construction and traversal. Irschara et al. [12] consider sparse place recognition using 3D point clouds. They not only use real views, but also generate synthetic views to extend localization capability. A greedy set cover algorithm is used for selecting a representative subset of views. The camera pose is estimated after point correspondences are found. Li et al. [5] address city-scale place recognition and focus on query efficiency. They prioritize certain database features according to a set covering criterion different from [12]. Zamir and Shah [13] use Google street-view images for place recognition. They distinguish single image localization and image group localization. Corresponding voting and post-processing schemes are derived to refine the matching. Chen et al. [14] study the localization of mobile phone images using street-view databases. They propose to enhance the matching by aggregating the query results from two datasets with different viewing angles. Sattler et al. [6] propose to accelerate 2D-to-3D matching by associating 3D points with visual words and prioritizing certain words. Li et al. [7] consider worldwide image pose estimation. They propose a co-occurrence prior based RANSAC and bidirectional matching to maintain efficiency and accuracy. Lim et al. [1] address real-time 6-DOF estimation in large scenes for auto-navigation. They use a dense local descriptor for fast key-point extraction, and a binary descriptor for key-point tracking. Aubry et al. [15] propose a technique to align 2D depictions of an architectural site with a 3D model.

Another category of localization techniques, to which our work belongs, is based on image retrieval. Conventionally, this is only used for place recognition [3], [8]. Recently, Song et al. [16] propose to estimate the 6-DOF pose after image retrieval.

Besides the mainstream approaches, others utilize additional means. For example, Gupta et al. [17] exploit the relative trajectory from visual odometry to assist localization. Kendall et al. [18] use a neural network for small-scale relocalization.

III. IMAGE RETRIEVAL BY ADAPTIVE FEATURE FUSION

Nowadays, various local visual features, which describe different image characteristics, have been proposed, each with respective pros and cons. The goal of feature fusion is to improve content representation by determining the best combination of features for a given application [19], [20]. The recently proposed fusion of inverted index (FII) [21] is an effective feature fusion strategy for image retrieval. It works by integrating the responses to a query in finer subdivisions of multi-indexing structures, which are constructed from multiple codebooks (where visual features are clustered into a vocabulary of visual words). In this work, we further improve the FII technique by incorporating a complementarity regression model, as described below.
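As a rough, hedged illustration of the idea behind fusing inverted indices: each feature type has its own codebook and inverted index (visual word -> image ids), and a query accumulates votes across all indices. The actual FII method [21] integrates responses in the finer cells of a joint multi-index structure; this simplified sum-of-votes sketch (names and data are ours) only conveys the general idea.

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """image_words: {image_id: iterable of visual word ids} for one feature type."""
    index = defaultdict(set)
    for img, words in image_words.items():
        for w in words:
            index[w].add(img)
    return index

def fused_query(query_words_per_type, indices):
    """Accumulate votes over several inverted indices (one per feature type)."""
    scores = defaultdict(int)
    for words, index in zip(query_words_per_type, indices):
        for w in words:
            for img in index.get(w, ()):
                scores[img] += 1
    return sorted(scores, key=scores.get, reverse=True)

# toy data: visual words from two hypothetical detector/descriptor channels
idx_a = build_inverted_index({"img1": [3, 5], "img2": [5, 9]})
idx_b = build_inverted_index({"img1": [1], "img2": [1, 4]})
print(fused_query([[5], [1, 4]], [idx_a, idx_b]))  # img2 ranked first
```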

Our hypothesis in this work is that retrieval performance is related to the spatial complementarity of the features to combine. We consider three criteria for complementarity between features: spatial distribution of feature points (CSDi) [22], contribution measures (CSCo) [23], and spatial cluster complementarity (CSCl) [24]. We propose to learn a linear regression model, which estimates the mean average precision (mAP) for different feature combinations according to the number of detected interest keypoints per image (Kp) and the complementarity scores. The mAP is an overall measure of retrieval performance, computed by averaging the average precision across all queries. Our objective is to select the best feature combination for each query so that mAP is maximized. We employ the following linear regression model:

mAP = β1 Kp + β2 CSDi + β3 CSCo + β4 CSCl    (1)

where βi are the learnt model coefficients. Given a training dataset, the training consists of the following steps:

1) For each image, x detectors D1, ..., Dx are used to extract local features, leading to x sets of keypoints. The number of keypoints and the complementarity scores are computed for all C(x, 2) = x(x−1)/2 detector pairs (Di, Dj), i ≠ j.

2) For each detector pair (Di, Dj), i ≠ j, one mAP is computed using the training set and the FII approach.

3) The model parameters are learnt according to Eq. (1).
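Step 2 relies on the mAP measure defined above. For reference, a minimal sketch of its computation follows (our simplification), assuming binary relevance over each ranked result list and that every relevant image appears in the list; the full definition divides by the total number of relevant images in the database instead.

```python
import numpy as np

def average_precision(relevance):
    """relevance: binary relevance of one query's ranked results."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision@k
    return float((prec_at_k * rel).sum() / rel.sum())       # mean over hits

def mean_average_precision(relevance_per_query):
    return float(np.mean([average_precision(r) for r in relevance_per_query]))

print(mean_average_precision([[1, 0, 1], [0, 1, 1]]))  # ~0.708
```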

Given a testing dataset, complementarity scores and keypoint counts are computed for all detector pairs, in the same way as in the training procedure. Then we predict the mAP, denoted by mAP′, using the previously trained regression model. The detector combination with the highest mAP′ is selected.
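A minimal end-to-end sketch of this train-then-select procedure is given below. The feature values and mAP figures are hypothetical placeholders, and the model is fitted by ordinary least squares without an intercept, matching Eq. (1).

```python
import numpy as np

# hypothetical training data: rows = detector pairs,
# columns = Kp, CSDi, CSCo, CSCl
X_train = np.array([[1200, 0.61, 0.45, 0.52],
                    [ 800, 0.40, 0.38, 0.47],
                    [1500, 0.72, 0.50, 0.58],
                    [1000, 0.52, 0.41, 0.50]], dtype=float)
y_train = np.array([0.512, 0.384, 0.548, 0.481])  # measured mAP per pair (FII runs)

# least-squares fit of Eq. (1), no intercept term
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# test-time features of the same detector pairs on a new dataset
pairs = ["hesaff-colsym", "colsym-har", "hesaff-mser"]
X_test = np.array([[1100, 0.58, 0.44, 0.50],
                   [ 900, 0.44, 0.40, 0.49],
                   [1400, 0.70, 0.49, 0.56]], dtype=float)

map_pred = X_test @ beta                     # mAP' for each pair
best = pairs[int(np.argmax(map_pred))]       # pair with highest predicted mAP'
print(best, map_pred)
```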

The above approach can predict the best detector combination globally for a dataset, as well as for each query image. The quality of image retrieval can be enhanced by image-by-image prediction and adaptive selection of detector combinations, as demonstrated in Section IV. The model can also adaptively select some FII parameters, such as the number of nearest neighbors k, by varying it during training.

IV. EXPERIMENTS AND EVALUATION

The experiment results are mainly based on two datasets:

1) Paris (Paris DB): a public benchmark¹ consisting of 6412 images collected from Flickr by searching for different Paris landmarks.

2) ParisCrossDomain DB (PCD DB): approximately 6500 images, consisting of the Paris DB images and some old images of Paris monuments.

Some examples are shown in Fig. 1. The second dataset represents a cross-domain application, because old/modified images of Paris monuments are used as queries. We have also performed experiments with other datasets of different content, such as Inria Holiday² and Oxford³. However, we only present the results of Paris DB and PCD DB due to space limits.

Fig. 1. Datasets: (a) Paris DB; (b) cross-domain images from PCD DB.

We select six representative detectors: Hessian affine (hesaff) [25], color symmetry (colsym) [26], MSER (mser) [27], Harris (har) [28], Star (star) [29], and oriented and rotated BRIEF (orb) [30]. They detect feature points at corners, blobs, symmetries, etc. Three local descriptors, SIFT [10], SURF [31] and shape context (SC) [32], are used to describe the detected points. The descriptors are then used jointly in our feature fusion based image retrieval process. The performance is measured with mAP. The codebook size and the value of k during nearest neighbor (k-NN) search are two important parameters of the FII system. The optimal codebook size in our experiments is approximately 20% of the total number of points of each detector combination. The parameter k is varied between 2 and 10 for adaptive configuration selection.
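As a hedged illustration of these two parameters, the sketch below builds a codebook whose vocabulary size is 20% of the pooled point count and enumerates candidate k values. The use of scikit-learn's KMeans is our assumption; the paper does not name a clustering implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, ratio=0.2):
    """descriptors: (n, d) pooled local descriptors of one detector combination."""
    n_words = max(2, int(ratio * len(descriptors)))   # ~20% of the points
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

rng = np.random.default_rng(0)
codebook = build_codebook(rng.normal(size=(500, 32)))  # -> 100 visual words
K_CANDIDATES = (2, 5, 10)  # k values explored for adaptive configuration
print(codebook.n_clusters, K_CANDIDATES)
```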

A. Prediction of retrieval performance

We first examine the prediction of retrieval performance using the complementarity scores and the regression model. During training, the complementarity scores and the numbers of keypoints are computed for each detector combination using the images of Paris DB. During testing, we use the trained regression model to compute mAP′ for PCD DB. Prediction results of all detector combinations are presented in Table I for both the training and testing datasets. The combinations 'hesaff-star' and 'hesaff-har' show the best mAP′ for PCD DB. Therefore, they are used for image retrieval.

B. Image retrieval performance

The image retrieval results for Paris DB and PCD DB are presented in this section. Due to space limits, we only show the results of the two best predicted detector pairs. We also select one of the worst predicted combinations to validate the prediction. According to the prediction, the best performance should be obtained with the 'hesaff-star' combination.

¹ http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/
² https://lear.inrialpes.fr/~jegou/data.php
³ http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/


TABLE I. DETECTOR COMBINATIONS AND PREDICTED mAP′.

Paris DB
  Detector pair   mAP′     Detector pair   mAP′
  hesaff-colsym   0.512    hesaff-mser     0.548
  hesaff-har      0.481    hesaff-star     0.547
  hesaff-orb      0.501    colsym-mser     0.429
  colsym-har      0.384    colsym-star     0.457
  colsym-orb      0.410    mser-har        0.481
  mser-star       0.526    mser-orb        0.510
  har-star        0.481    har-orb         0.458
  star-orb        0.456

PCD DB
  Detector pair   mAP′     Detector pair   mAP′
  hesaff-colsym   0.497    hesaff-mser     0.535
  hesaff-har      0.537    hesaff-star     0.549
  hesaff-orb      0.512    colsym-mser     0.468
  colsym-har      0.485    colsym-star     0.458
  colsym-orb      0.375    mser-har        0.519
  mser-star       0.494    mser-orb        0.465
  har-star        0.521    har-orb         0.493
  star-orb        0.461

We denote the effective mAP by mAP♮. In Table II (the 'FII' rows), we observe that the highest mAP♮ for PCD DB (0.409) is indeed obtained by 'hesaff-star', followed by 'hesaff-har'. The worst mAP♮ is obtained with 'colsym-orb', which again validates the prediction. Therefore, the regression model is effective. Next, we compare our results with a state-of-the-art late fusion (LF) retrieval technique [19].

TABLE II. RETRIEVAL PERFORMANCE OF FEATURE FUSION.

  Method    Dataset    Detector pair   k-NN   mAP♮
  FII       Paris DB   hesaff-mser     2      0.589
  FII       Paris DB   hesaff-star     2      0.570
  FII       Paris DB   har-colsym      2      0.371
  FII       PCD DB     hesaff-star     2      0.409
  FII       PCD DB     hesaff-har      2      0.398
  FII       PCD DB     colsym-orb      2      0.227
  LF [19]   Paris DB   hesaff-mser     2      0.541
  LF [19]   Paris DB   hesaff-star     2      0.535
  LF [19]   PCD DB     hesaff-star     2      0.365
  LF [19]   PCD DB     hesaff-har      2      0.362

We use the best performing detector combinations of LF retrieval. The results are presented in Table II ('LF' rows). By comparison, our fusion method exhibits superior performance. Additionally, we observe that for PCD DB with 'hesaff-star', FII outperforms LF for the first ten retrievals, with an mAP of 0.656 (vs. 0.420). This is important for applications which require accurate retrieval at the top, such as image localization. We also present mAP♮ values with single detectors in Table III. Compared with Table II, the advantage of feature fusion is clear.

TABLE III. RETRIEVAL PERFORMANCE OF SINGLE DETECTORS.

  Paris DB                      PCD DB
  Detector  k-NN  mAP♮          Detector  k-NN  mAP♮
  hesaff    2     0.546         hesaff    2     0.351
  mser      2     0.523         star      2     0.287

C. Adaptive selection of k and detector combinations

The effective mAP can be further improved by adaptive selection of detector combinations with varying k values when training the regression model. Table IV lists mAP♮ values for six configurations, obtained by the two best detector combinations and three k values (k = 2, 5, 10). Compared with the previous results (k = 2), the best mAP♮ is further increased by 1.4% and 8.8% for Paris DB and PCD DB respectively. We also observe from Fig. 2 that most queries (about 91% and 50%) are executed with k = 2, followed by k = 5; during the similarity search of a query, higher values of k can include dissimilar neighbors. Although certain combinations and k values dominate the results, these results demonstrate that adaptive selection of feature combinations and other configurations can improve retrieval performance, especially for images of convoluted content, e.g. artworks, postcards, or other cross-domain representations with diverse content (see Fig. 2b). A minimal sketch of this per-query selection is given after Fig. 2.

TABLE IV. EFFECTIVE mAP♮ OBTAINED FOR ALL THE DATASETS, BY SELECTING THE OPTIMAL DETECTOR PAIR AND OPTIMAL VALUE k FOR EACH QUERY IMAGE.

  Dataset    Detector combination                      k = 2   k = 5   k = 10
  Paris DB   hesaff-mser                               0.589   0.571   0.531
  Paris DB   hesaff-star                               0.570   0.544   0.512
  Paris DB   Adaptive combination (k ∈ {2, 5, 10})     0.597
  PCD DB     hesaff-star                               0.409   0.402   0.372
  PCD DB     hesaff-har                                0.398   0.393   0.354
  PCD DB     Adaptive combination (k ∈ {2, 5, 10})     0.445

Fig. 2. Distribution of predicted k values and detector pairs across the queries: (a) Paris DB and (b) PCD DB.
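The per-query adaptive selection just evaluated can be sketched as follows, assuming one regression model per (detector pair, k) configuration; all names and numbers are illustrative placeholders, not values from the paper.

```python
import numpy as np

def select_configuration(query_features, betas):
    """query_features: {(pair, k): vector [Kp, CSDi, CSCo, CSCl] for this query};
    betas: {(pair, k): learnt coefficient vector of Eq. (1)}."""
    predicted = {cfg: float(np.dot(f, betas[cfg]))   # predicted mAP' per config
                 for cfg, f in query_features.items()}
    return max(predicted, key=predicted.get)

betas = {("hesaff-star", 2): np.array([1e-4, 0.2, 0.1, 0.1]),
         ("hesaff-star", 5): np.array([1e-4, 0.1, 0.1, 0.1]),
         ("hesaff-har", 2):  np.array([1e-4, 0.2, 0.1, 0.05])}
feats = {cfg: np.array([950.0, 0.55, 0.42, 0.48]) for cfg in betas}
print(select_configuration(feats, betas))   # -> ('hesaff-star', 2) here
```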

D. Cross-domain image localization by feature fusion

In this section, we demonstrate cross-domain image localization with the proposed retrieval framework. A set of street-level geo-referenced images acquired by a mobile mapping system called Stereopolis [33] was used. We sample about 7000 images acquired in the 4th and 5th districts of Paris, covering a distance of 51 km. About 6% of these images contain partial or full views of historical monuments such as Notre Dame, the Sorbonne, the Pantheon, etc. Note that the known pose parameters of the images are not required for training or retrieval; these parameters are only used to show the pose of retrieved images on a map and to check whether the monument of interest is in the field of view (see Fig. 4). Non-georeferenced images, such as old postcards of Paris monuments, are used as queries for localization. An example is depicted in Fig. 3a: the query image (Pantheon) is an old postcard. It is executed with the 'hesaff-mser' combination according to the prediction model trained on Paris DB. Although the query is different in style, our retrieval technique is able to find relevant images that contain the same monument or similar geographic areas. After the retrieval, we use the poses of the retrieved images to mark them on a geographical map in Fig. 4. The indicated numbers are the rankings of the retrieved images according to their similarity to the query. For localization purposes, it is crucial to retrieve the most similar images in the first responses. We see that seven out of ten marks point to the correct monument. We also find that using a single feature (e.g. hesaff) only retrieves four out of ten (not shown due to space limits). Thus our fusion approach is effective. Another example is shown in Fig. 3b with a photo query: all ten retrieved images include the correct monument, which again validates our method.

Fig. 3. Cross-domain localization of (a) Pantheon and (b) Notre Dame: the 10 first retrieved results in decreasing order of similarity, from left to right and top to bottom.

Fig. 4. Location estimation of the Pantheon query (Fig. 3a).
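The paper displays the retrieved poses on a map and inspects them visually. Purely as an illustrative assumption, the sketch below shows one simple way to collapse the top retrieved poses into a single point estimate by rank-weighted averaging; the coordinates roughly approximate the Pantheon's location.

```python
import numpy as np

def estimate_location(retrieved_poses, top=10):
    """retrieved_poses: (n, 2) array of (lat, lon), ordered by similarity."""
    poses = np.asarray(retrieved_poses[:top], dtype=float)
    weights = 1.0 / (np.arange(len(poses)) + 1.0)   # higher ranks weigh more
    return tuple(np.average(poses, axis=0, weights=weights))

poses = [(48.8462, 2.3464), (48.8463, 2.3460), (48.8497, 2.3505)]
print(estimate_location(poses))   # rank-weighted centroid near the Pantheon
```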

V. CONCLUSIONS

We have presented an approach dedicated to the effective fusion of descriptions for image retrieval. It is able to perform combinations adaptively for each image, by training a regression model on the spatial complementarity of the descriptions. Compared with the state of the art, this adaptive model improves the precision of retrieval across different visual domains, where the relevance of a description may vary from one domain to another. This improvement has a strong impact on image-based localization frameworks involving an image retrieval step, such as [16], where the 6-DOF estimation relies on the retrieval of images from a geolocalized dataset. The capability of localizing cross-domain content from artwork, multimedia, etc. opens the opportunity to link content with map databases and provides new tools for the promotion of visual and geographical content in various domains including culture, tourism, history, and social sciences.




REFERENCES

[1] H. Lim, S. N. Sinha, M. F. Cohen, and M. Uyttendaele, "Real-time image-based 6-DOF localization in large-scale environments," in CVPR, June 2012, pp. 1043–1050.

[2] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: Exploring photo collections in 3D," ACM Trans. Graph., vol. 25, no. 3, pp. 835–846, Jul. 2006.

[3] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg, "Mapping the world's photos," in WWW. ACM, 2009, pp. 761–770.

[4] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in CVPR, June 2007, pp. 1–7.

[5] Y. Li, N. Snavely, and D. P. Huttenlocher, "Location recognition using prioritized feature matching," in ECCV, 2010, pp. 791–804.

[6] T. Sattler, B. Leibe, and L. Kobbelt, "Fast image-based localization using direct 2D-to-3D matching," in ICCV, Nov 2011, pp. 667–674.

[7] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, "Worldwide pose estimation using 3D point clouds," in ECCV, 2012, pp. 15–29.

[8] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, "Data-driven visual similarity for cross-domain image matching," ACM Trans. Graph., vol. 30, no. 6, p. 10, Dec. 2011.

[9] J. Huang, R. Feris, Q. Chen, and S. Yan, "Cross-domain image retrieval with a dual attribute-aware ranking network," in ICCV, 2015, pp. 1062–1070.

[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91–110, Nov. 2004.

[11] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in CVPR, vol. 2, 2006, pp. 2161–2168.

[12] A. Irschara, C. Zach, J. M. Frahm, and H. Bischof, "From structure-from-motion point clouds to fast location recognition," in CVPR, June 2009, pp. 2599–2606.

[13] A. R. Zamir and M. Shah, "Accurate image localization based on Google Maps Street View," in ECCV, 2010, pp. 255–268.

[14] D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk, "City-scale landmark identification on mobile devices," in CVPR, June 2011, pp. 737–744.

[15] M. Aubry, B. C. Russell, and J. Sivic, "Painting-to-3D model alignment via discriminative visual elements," ACM Trans. Graph., vol. 33, no. 2, p. 14, Apr. 2014.

[16] Y. Song, X. Chen, X. Wang, Y. Zhang, and J. Li, "6-DOF image localization from massive geo-tagged reference images," IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1542–1554, Aug 2016.

[17] A. Gupta, H. Chang, and A. Yilmaz, "GPS-denied geo-localisation using visual odometry," ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. III-3, pp. 263–270, 2016.

[18] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in ICCV, 2015, pp. 2938–2946.

[19] N. Neshov, "Comparison on Late Fusion Methods of Low Level Features for Content Based Image Retrieval," in Artificial Neural Networks and Machine Learning. Springer, 2013, vol. 8131, pp. 619–627.

[20] E. Rashedi, H. Nezamabadi-pour, and S. Saryazdi, "A simultaneous feature adaptation and feature selection method for content-based image retrieval systems," Knowledge-Based Systems, vol. 39, pp. 85–94, 2013.

[21] N. Bhowmik, R. Gonzalez V., V. Gouet-Brunet, H. Pedrini, and G. Bloch, "Efficient fusion of multidimensional descriptors for image retrieval," in ICIP, Oct 2014, pp. 5766–5770.

[22] S. Ehsan, A. F. Clark, and K. D. McDonald-Maier, "Rapid online analysis of local feature detectors and their complementarity," Sensors, vol. 13, no. 8, p. 10876, 2013.

[23] G. Gales, A. Crouzil, and S. Chambon, "Complementarity of feature point detectors," in VISAPP, 2010, pp. 334–339.

[24] K. Mikolajczyk, B. Leibe, and B. Schiele, "Local features for object class recognition," in ICCV, vol. 2, Oct 2005, pp. 1792–1799.

[25] K. Mikolajczyk and C. Schmid, "Scale and affine invariant interest point detectors," IJCV, vol. 60, no. 1, pp. 63–86, 2004.

[26] G. Heidemann, "Focus-of-attention from local color symmetries," PAMI, vol. 26, no. 7, pp. 817–830, July 2004.

[27] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in BMVC, 2002, pp. 36.1–36.10.

[28] C. Schmid and R. Mohr, "Local grayvalue invariants for image retrieval," PAMI, vol. 19, no. 5, pp. 530–534, May 1997.

[29] M. Agrawal, K. Konolige, and M. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching," in ECCV, 2008, pp. 102–115.

[30] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in ICCV, Nov 2011, pp. 2564–2571.

[31] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "Speeded-up robust features (SURF)," CVIU, vol. 110, no. 3, pp. 346–359, Jun 2008.

[32] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," PAMI, vol. 24, no. 4, pp. 509–522, Apr 2002.

[33] N. Paparoditis, J.-P. Papelard, B. Cannelle, A. Devaux, B. Soheilian, N. David, and E. Houzay, "Stereopolis II: A multi-purpose and multi-sensor 3D mobile mapping system for street visualisation and 3D metrology," RFPT, vol. 200, pp. 69–79, Oct. 2012.