


3D model search and pose estimation from single images using VIP features

Changchang Wu², Friedrich Fraundorfer¹, Jan-Michael Frahm², Marc Pollefeys¹,²

¹Department of Computer Science, ETH Zurich, Switzerland
{fraundorfer, marc.pollefeys}@inf.ethz.ch

²Department of Computer Science, UNC Chapel Hill, USA
{ccwu, jmf}@cs.unc.edu

Abstract

This paper describes a method to efficiently search for 3D models in a city-scale database and to compute the camera poses from single query images. The proposed method matches SIFT features (from a single image) to viewpoint invariant patches (VIP) from a 3D model by warping the SIFT features approximately into the orthographic frame of the VIP features. This significantly increases the number of feature correspondences, which results in reliable and robust pose estimation. We also present a 3D model search tool that uses a visual word based search scheme to efficiently retrieve 3D models from large databases using individual query images. Together, the 3D model search and the pose estimation represent a highly scalable and efficient city-scale localization system. The performance of the 3D model search and pose estimation is demonstrated on urban image data.

1. Introduction

Searching for 3D models is a key feature in city-wide localization and pose estimation from mobile devices. From a single snapshot image the corresponding 3D model needs to be found, and 3D-2D matches between the model and the image need to be established to estimate the user's pose (see illustration in Fig. 1). The main challenges so far are the correspondence problem (3D-2D) and the scalability of the approach. In this paper we contribute to both of these topics. The first contribution is a 3D-2D matching method that is based on viewpoint invariant patches (VIP) and can deal with severe viewpoint changes. The second contribution is the use of a visual word based recognition scheme for efficient and scalable database retrieval. Our database consists of small individual 3D models that represent parts of a large-scale reconstruction. Each 3D model is textured and is represented by a collection of VIP features in the database. When querying with an input image, the input image's SIFT features are matched with the database's VIP features to determine the corresponding 3D model.

Figure 1. Mobile vision based localization: A single image from a mobile device is used to search for the corresponding 3D model in a city-scale database and thus determine the user's location. SIFT features extracted from the query image are matched to VIP features from the 3D models in the database. (Panels: query image; matching part of the 3D model; full 3D model from 1.3 M images.)

Finally, 3D-2D matches between the 3D model and the input image are established for pose estimation.

Viewpoint invariant patches (VIP) have so far been used for registering 3D models to each other [9]. The main idea is to create ortho-textures for the 3D models and detect local features, e.g. SIFT, on them. For this, planes in the 3D model are detected and a virtual camera is set fronto-parallel to each plane. Features are then extracted from the virtual camera image, from which the perspective transformation of the initial viewpoint change is removed.

In this paper we extend this method to create matches between a 3D model and a single image (3D-2D). In the original method, features from both models are represented in the canonical (orthographic) form. In our case, only the features from the 3D model are represented in the canonical form, while the features from the single image are perspectively transformed. While matching will not work for features under large perspective transformation, features which are almost fronto-parallel will match very well with the canonical representation. Under the assumption that the camera of the query image and the 3D plane of the matching features are parallel, we can generate hypotheses for the camera pose of the query image. Using these hypotheses, we can warp parts of the query image so that they match the perspective transform of the canonical features of the 3D model. This allows us to generate many additional matches for robust and reliable pose estimation. For exhaustive search in large databases this method would be too slow; therefore we use the method described by Nister and Stewenius [5] for an efficient model search. The model search works with quantized SIFT (and VIP) descriptor vectors, so-called visual words.

The paper is structured as follows. The next section describes relevant related work. Section 3 describes the first contribution of this paper, pose estimation using VIP and SIFT features. Section 4 describes how to search for 3D models in large databases efficiently. Section 5 shows experiments on urban image data, and finally Section 6 draws conclusions.

2. Related work

Many texture based feature detectors and descriptors have been developed for robust wide-baseline matching. One of the most popular is Lowe's SIFT detector [3]. The SIFT detector defines a feature's scale in scale space and a feature orientation from the gradient histogram in the image plane. Using the orientation, the SIFT detector generates normalized image patches to achieve invariance to 2D similarity transformations. Many feature detectors, including affine covariant features, use the SIFT descriptor to represent patches. SIFT descriptors are also used to encode VIP features; however, the VIP approach works with other feature descriptors, too. Mikolajczyk et al. give a comparison of several local features in [4]. The recently proposed VIP features [9] go beyond affine invariance to robustness to projective transformations. The authors investigated the use of VIP features to align 3D models, but they did not investigate the case of matching VIP to features from single images.

Most vision based location systems so far have been demonstrated on small databases [6, 8, 11]. Recently Schindler et al. [7] presented a scheme for city-scale environments. The method uses a visual word based recognition scheme following the approach in [5, 2]. However, Schindler et al. only focused on location recognition; the pose of the user is not computed. Our proposed method combines both scalable location recognition and pose estimation. Pose estimation alone is the focus of the work in [10]. The authors propose a method to accurately compute the camera pose from 3D-2D matches. High accuracy is achieved by extending the set of initial matches with region growing. Their method could be used as a last step in our localization approach to refine the computed pose.

3. Pose from SIFT-VIP matches

Figure 2. VIPs detected on a 3D model.

3.1. Viewpoint-Invariant Patch (VIP) detection

VIPs are features that can be extracted from textured 3D models, which combine images with corresponding depth maps. VIPs are invariant to 3D similarity transformations. They can be used to robustly and efficiently align 3D models of the same scene from videos taken from significantly different viewpoints. In this paper we mostly consider 3D models obtained from video by SfM, but the method is equally applicable to textured 3D models obtained using LIDAR or other sensors. The robustness to 3D similarities exactly corresponds to the ambiguity of 3D models obtained from images, while the ambiguities of other sensors can often be described by a 3D Euclidean transformation or with even fewer degrees of freedom. The undistortion is based on local scene planes or on local planar approximations of the scene. Conceptually, for every point on the surface the local tangent plane's normal is estimated and a texture patch is generated by orthogonal projection onto the plane. Within the local ortho-texture patch it is determined whether the point corresponds to a local extremal response of the Difference-of-Gaussians (DoG) filter in scale space. If it is, the orientation is determined in the tangent plane by the dominant gradient direction, and a SIFT descriptor on the tangent plane is extracted. Using the tangent plane avoids the poor repeatability of interest point detection under projective transformations seen in popular feature detectors [4].
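A rough sketch of the ortho-texture sampling idea, under our own simplifications (a pinhole source camera with known 3x4 projection matrix P, and a fixed metric patch size instead of scale-space selection; function and parameter names are ours, not the authors'):

    import numpy as np
    import cv2

    def ortho_texture_patch(img, P, X, n, half_size=0.5, res=64):
        # Orthonormal basis (u, v) spanning the tangent plane with normal n.
        n = n / np.linalg.norm(n)
        u = np.cross(n, [0.0, 0.0, 1.0])
        if np.linalg.norm(u) < 1e-6:               # normal (anti)parallel to z
            u = np.cross(n, [0.0, 1.0, 0.0])
        u = u / np.linalg.norm(u)
        v = np.cross(n, u)
        # Regular metric grid on the plane, centered at X (orthographic sampling).
        t = np.linspace(-half_size, half_size, res)
        grid = X + t[None, :, None] * u + t[:, None, None] * v    # res x res x 3
        pts_h = np.concatenate([grid, np.ones((res, res, 1))], axis=2)
        uv = pts_h @ P.T                                          # project into image
        px = (uv[..., :2] / uv[..., 2:3]).astype(np.float32)
        return cv2.remap(img, px[..., 0], px[..., 1], cv2.INTER_LINEAR)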

Page 3: 3D model search and pose estimation from single images ......3D model search and pose estimation from single images using VIP features Changchang Wu2, Friedrich Fraundorfer1, 1 Department

Figure 3. (a) Initial SIFT-VIP matches. As expected, most matches lie on the fronto-parallel plane (left image is the query image). (b) Camera pose estimated from a SIFT-VIP match (red). (c) Resulting set of matches established with the proposed method. The initial set of 17 matches could be extended to 92 correct matches. The method established many matches on the other plane, too.

Viewpoint-normalized image patches need to be generated to describe VIPs. Viewpoint normalization is similar to the normalization of image patches according to scale and orientation performed in SIFT, and to the normalization according to the fitted ellipse in affine covariant feature detectors. The viewpoint normalization can be divided into the following steps:

1. Warp the image texture for each 3D point, conceptually, using an orthographic camera with its optical axis parallel to the local surface normal and passing through the 3D point. This step makes the VIP invariant to the intrinsics and extrinsics of the original camera, generating an ortho-texture patch.

2. Verify the VIP, and find its orientation and size. Keep a 3D point as a VIP feature only when its corresponding pixel in the ortho-texture patch is a stable 2D image feature. As in [3], a DoG filter and local extrema suppression are used. The VIP orientation is found from the dominant gradient direction in the ortho-texture patch.

With the virtual camera, the size and orientation of a VIP can be obtained by transforming the scale and orientation of its corresponding image feature to world coordinates. A VIP is then fully defined as (x, σ, n, d, s) where

• x is its 3D position,

• σ is the patch size,

• n is the surface normal at this location,

• d is the texture's dominant orientation as a vector in 3D,

• s is the SIFT descriptor that describes the viewpoint-normalized patch. Note that a SIFT feature is a SIFT descriptor plus its position, scale and orientation.

Fig. 2 shows VIP features detected on a 3D model.
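As a concrete illustration of this five-tuple, a minimal record type (field names are our own, not from the original implementation):

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class VIP:
        x: np.ndarray    # 3D position of the feature
        sigma: float     # patch size
        n: np.ndarray    # surface normal at this location
        d: np.ndarray    # dominant texture orientation as a 3D vector
        s: np.ndarray    # SIFT descriptor of the viewpoint-normalized patch (128-dim)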

3.2. Matching VIP with SIFT

To match SIFT features from a single image with VIP features from a 3D model, the SIFT features extracted from the image need to be fronto-parallel (or close to fronto-parallel) to the VIP features in the model. This might hold only for the fraction of features whose scene plane happens to be parallel to the image plane. For all other features we warp the corresponding image areas so that they approximately match the canonical form of the VIP features. The projective warp is computed in the following steps:

1. Compute the approximate camera position of the query image in the local coordinate frame from at least one fronto-parallel SIFT-VIP match.

2. Determine the image areas that need to be warped by projecting the VIP features of the model into the query image.

3. Compute the warp homography for each image area from the 3D plane of the VIP and the estimated camera pose.

The whole idea is based on the assumption that initial matches between VIP and SIFT features are fronto-parallel (see Fig. 3(a) for example matches). This assumption allows us to compute an estimate for the camera pose of the query image. The VIP feature is located on a plane in 3D and is defined by the feature's center point X (in 3D) and the normal vector n of the plane. Our assumption is that the image plane of the SIFT feature is parallel to this plane and that the principal ray of the camera points in the direction of n, connecting X and the center of the SIFT feature x. This fixes the camera pose along the normal vector n. The distance d between the camera center and the plane can be computed from the scale ratio of the matched feature pair with the help of the focal length f:

d = f S / s    (1)

The focal length f of the camera can be taken from the EXIF data of the image or from camera calibration. S is the scale of the VIP feature and s is the scale of the matching SIFT feature. The missing rotation r around the principal axis can finally be recovered from the dominant gradient directions of the image patches. Fig. 3(b) shows a camera pose estimated from a SIFT-VIP match. With the camera P now fully defined, this approximation can be used to compute the necessary warps. For each VIP feature in the 3D model we determine the corresponding image region in the query image by projecting the VIP region (specified by center point and scale) onto the image plane. Next we compute the homography H that warps the image region to the canonical form of the VIP feature:

H = R + (1/d) T N^T    (2)

where R and T are the rotation and translation from P to the virtual camera of the VIP feature, and N is the normal vector of the VIP plane in the coordinate system of P. Finally, we look for stable 2D image features in the warped image area by applying the SIFT detector.
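To make these two steps concrete, here is a hedged numerical sketch (all values and names invented for illustration, not the authors' implementation): eq. (1) gives the camera-to-plane distance, the camera center is placed along the plane normal, and eq. (2) yields the warp, applied here in pixel coordinates by treating the VIP's virtual camera as a pinhole for simplicity.

    import numpy as np
    import cv2

    # Pose hypothesis from one fronto-parallel SIFT-VIP match (eq. 1).
    f, S, s = 1000.0, 0.5, 20.0        # focal length [px], VIP scale [m], SIFT scale [px]
    d = f * S / s                      # camera-to-plane distance: 25.0 m
    X = np.array([2.0, 1.0, 0.0])      # VIP center in world coordinates
    n = np.array([0.0, 0.0, 1.0])      # VIP plane normal (unit length)
    C = X + d * n                      # hypothesized camera center along the normal

    # Warp an image region to the canonical VIP frame (eq. 2).
    def warp_to_canonical(img, K, R, T, N, d, size=64):
        # R, T: rotation/translation from the query camera P to the VIP's
        # virtual camera; N: VIP plane normal in P's coordinate system.
        H = R + np.outer(T, N) / d         # plane-induced homography, eq. (2)
        H_pix = K @ H @ np.linalg.inv(K)   # lift to pixel coordinates
        return cv2.warpPerspective(img, H_pix, (size, size))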

Clearly our assumptions are not met exactly, which results in an inaccurate camera pose estimate. SIFT descriptors, which were developed for wide-baseline matching, enable matching within a certain range of viewpoint change, and thus the camera plane might not be exactly parallel to the VIP feature plane. However, we do not depend on an exact pose estimate for this step. We account for the uncertainty in the camera pose by enlarging the region to warp. In addition, remaining differences between the VIP and SIFT features can be compensated by SIFT matching. Fig. 3 shows examples of final SIFT-VIP matches. The initial matching between SIFT and VIP features results in 17 matches. From these a camera pose estimate can be computed, which allows us to warp the SIFT detections in the input image into an approximately fronto-parallel configuration. Matching the rectified SIFT detections with the VIP features yields 92 correct matches.

Algorithm 1: 3D model search and pose estimation

1. Extract SIFT features from the query image.
2. Compute the visual word document vector for the query image.
3. Compute L2 distances to all document vectors in the 3D model database (inverted file query).
4. Use the 3D model corresponding to the smallest distance as the matching 3D model.
5. Match SIFT features from the query image to VIP features from the database 3D model (nearest neighbor matching).
6. Compute camera pose hypotheses from the SIFT-VIP matches.
7. Warp the query image according to the camera pose hypotheses and extract fronto-parallel SIFT features.
8. Match the fronto-parallel SIFT features to VIP features.
9. Compute the final pose from the SIFT-VIP matches.

3.3. Pose estimation

The 3D-2D matches between VIP and SIFT features can now be used to compute the camera pose accurately and thus determine the location of the user within the map. The main benefit for pose estimation is that we can significantly increase the number of feature matches, which results in reliable and robust pose estimation. An outline of the complete localization method is given in Algorithm 1.
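The pose solver itself is not spelled out here; one standard way to implement this step is RANSAC-based PnP on the 3D-2D correspondences, sketched below with OpenCV (our choice of solver, not necessarily the authors'):

    import numpy as np
    import cv2

    def estimate_pose(pts3d, pts2d, K):
        # pts3d: Nx3 VIP positions, pts2d: Nx2 matched SIFT locations, K: 3x3 intrinsics.
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d.astype(np.float32), pts2d.astype(np.float32), K, None)
        R, _ = cv2.Rodrigues(rvec)     # axis-angle to rotation matrix
        return R, tvec, inliers        # camera pose and the inlier matches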

4. Efficient 3D model search in large databases

For pose estimation as described in the previous section, the corresponding 3D model needs to be known. For large databases, necessary for city-wide localization, an exhaustive search through all the 3D models is not possible. Thus a first step prior to pose estimation is to search for the corresponding 3D model. Our database consists of small individual 3D models that represent parts of a large-scale vision based 3D reconstruction, created as described in [1]. Each individual 3D model is represented by a set of VIP features extracted from the model texture. These features are used to create a visual word database as described in [5]. This allows for an efficient model search to determine the 3D model needed for pose estimation.

Similar to [5], VIP features are first extracted from the 3D models. Each VIP descriptor is quantized by a hierarchical vocabulary tree. All visual words from one 3D model form a document vector, a v-dimensional vector where v is the number of possible visual words; it is usually extremely sparse. For a model query, the similarity between the query document vector and all document vectors in the database is computed. As the similarity score we use the L2 distance between document vectors. The organization of the database as an inverted file and the sparseness of the document vectors allow very efficient scoring. For scoring, the different visual words are weighted by the inverse document frequency (IDF) measure. The database models are ranked by the L2 distance, and the vector with the lowest distance is reported as the most similar match.
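As a simplified sketch of this scoring (our own illustration of the scheme in [5]; a production system would fold the IDF weights and normalization into the inverted file): for L2-normalized vectors, ||q − m||² = 2 − 2 q·m, so ranking by smallest L2 distance reduces to accumulating the largest dot products over the inverted file.

    import numpy as np
    from collections import Counter, defaultdict

    def tfidf_vector(words, idf):
        # L2-normalized, IDF-weighted document vector as a sparse dict.
        v = {w: c * idf.get(w, 0.0) for w, c in Counter(words).items()}
        norm = np.sqrt(sum(x * x for x in v.values())) or 1.0
        return {w: x / norm for w, x in v.items()}

    def best_model(query_words, model_vectors, inverted_file, idf):
        q = tfidf_vector(query_words, idf)
        dots = defaultdict(float)
        for w, qw in q.items():                  # visit only models sharing a word
            for m in inverted_file.get(w, []):
                dots[m] += qw * model_vectors[m].get(w, 0.0)
        # ||q - m||^2 = 2 - 2 q.m for unit vectors, so maximize the dot product.
        return max(dots, key=dots.get)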

In the next step, initial SIFT-VIP matches are sought to start the pose estimation algorithm. Corresponding features can be determined efficiently using the quantized visual word description: features with the same visual word are reported as matches, which takes only O(n) time, where n is the number of features.
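A minimal sketch of this linear-time matching (hypothetical data layout: each feature paired with its quantized word id):

    from collections import defaultdict

    def match_by_word(sift_words, vip_words):
        # sift_words, vip_words: lists of (feature_index, visual_word_id) pairs.
        buckets = defaultdict(list)
        for i, w in vip_words:
            buckets[w].append(i)                  # index VIP features by word
        return [(j, i) for j, w in sift_words     # one pass over the SIFT features
                for i in buckets.get(w, [])]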

The visual word description is very compact. The plain visual word database size is

DB_inv = 4 f I,    (3)

where f is the maximum number of visual words per model and I is the number of models in the database. The factor 4 comes from the use of 4-byte integers to hold the model index where a visual word occurred. If we assume an average of 1000 visual words per model, a database containing 1 million models would only need 4 GB of RAM. In addition to the visual words, we also need to store the 2D coordinates, scale and rotation of the SIFT features, and additionally the 3D coordinates, plane parameters and virtual camera of the VIP features, which still allows a huge number of models to be stored in the database.
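Plugging the numbers from the text into eq. (3):

    f_words = 1000                       # average visual words per model
    n_models = 1_000_000                 # models in the database
    db_bytes = 4 * f_words * n_models    # eq. (3): 4,000,000,000 bytes = 4 GB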

5. Experiments

5.1. SIFT-VIP matching results

We conducted an experiment to compare standard SIFT-SIFT matching with our proposed SIFT-VIP matching. Fig. 4(a) shows the established SIFT-SIFT matches. Only 10 matches could be detected, and many of them are actually mis-matches. When computing the initial SIFT-VIP matches, the number of correspondences increases to 25, most of them correct (see Fig. 4(b)). The proposed method is then able to detect 91 correct SIFT-VIP matches, as shown in Fig. 4(c). This is a significantly higher number of matches, which allows a more accurate pose estimation. Note that the matches are nicely distributed on two scene planes. Fig. 4(d) shows the resulting pose estimate in red. Fig. 5 shows the camera position hypotheses from single SIFT-VIP matches in green; each match generates one hypothesis. The red camera is the correct camera pose. All the camera estimates are set fronto-parallel to the VIP feature in the 3D model, and therefore the camera estimates generated from the plane not fronto-parallel to the real camera position are off. However, many pose hypotheses are very close to the correct solution, and each of them can be used to extend the initial SIFT-VIP matches to a larger set.

Fig. 6 shows an example with three scene planes. The 105 (partially incorrect) SIFT-SIFT matches are extended to 223 correct SIFT-VIP matches covering all three scene planes.

Fig. 6(b) shows examples of orthographic VIP patches. The images show, from left to right, the extracted SIFT patches from the query image, the warped SIFT patches, and the VIP patches of the 3D model. Ideally the warped SIFT patches and the VIP patches should be perfectly aligned. However, as the initial SIFT-VIP matches are not exactly fronto-parallel, the camera pose is inaccurate and the patches are not perfectly registered. But the difference is not very large, which means that our simple pose estimation works impressively well.

5.2. 3D model search performance evaluation

In this experiment we show the performance of the 3D model search. The video data used to create the models was acquired with car-mounted cameras while driving through a city. Two cameras were mounted on the roof of a car: one pointing straight sideways, the other pointing forward at a 45° angle. The fields of view of the two cameras do not overlap, but as the system moves over time the captured scene parts do overlap. To obtain ground truth data for the camera motion, the image acquisition was synchronized with a highly accurate GPS-inertial system. Accordingly, we know the location of the camera for each video frame. In this experiment a 3D model database represented by VIP features is created from the side camera video. The database is queried with the video frames from the forward camera, which are represented by SIFT features. The database contains 113 3D models, which are queried with 2644 images. The query video frames have a resolution of 1024×768, which resulted in up to 5000 features per frame. The vocabulary tree used was trained on general image data from the web. The 3D model search results are visualized by plotting lines between frame-to-3D-model matches (see Fig. 7). The identical camera paths of the forward and side cameras are shifted by a small amount in the x and y directions to make the matching links visible. We only draw matches below a distance threshold of 10 m so that mis-matches are filtered out. The red markers are the query camera positions and the green markers are the 3D model positions in the database. In the figure the top-10 ranked matches are drawn. Usually one considers the top-n ranked matches as possible hypotheses and verifies the correct one geometrically; in our case this can be done by the pose estimation. Fig. 8 shows some correct example matches.

5.3. 3D model query with cell phone images

We developed an application that allows querying a 3D city-model database from arbitrary input images (see screenshot in Fig. 9). The database so far contains 851 3D models and the retrieval works in real time. Fig. 9(b) shows a query with a cell phone image. The cell phone query image has a different resolution and was taken a month later; nevertheless, we could match it perfectly.


Figure 4. Comparison of standard SIFT-SIFT matching and our proposed SIFT-VIP method. (a) SIFT-SIFT matches: only 10 matches could be found, most of them mis-matches. (b) Initial SIFT-VIP matches: 25 matches could be found, most of them correct. (c) Resulting set of matches established with the proposed method; the initial set of 25 matches could be extended to 91 correct matches. (d) The SIFT-VIP matches in 3D showing the estimated camera pose (red).


Figure 5. Camera pose hypotheses from SIFT-VIP matches (green). The ground truth camera pose of the query image is shown in red. Multiple hypotheses are very close to the real camera pose.

6. Conclusion

In this paper we addressed two important topics in visual localization. First, we investigated the case of 3D-2D pose estimation using VIP and SIFT features. We showed that it is possible to match images to 3D models by matching SIFT features to VIP features, and we demonstrated that the number of initial SIFT-VIP matches can be increased significantly by warping the query features into the orthographic frame of the VIP features. This increases the reliability and robustness of pose estimation. Second, we demonstrated a 3D model search scheme that scales efficiently up to city scale. Localization experiments with images from camera phones showed that this approach is suitable for city-wide localization from mobile devices.



Figure 6. (a) SIFT-VIP matches and estimated camera pose for a scene with 3 planes. (b) Examples of warped SIFT patches and orthographic VIP patches. From left to right: extracted SIFT patch from the query image, warped SIFT patch, VIP patch of the 3D model. The VIP patches are impressively well aligned to the warped SIFT patches, despite the inaccuracies of the camera pose.


Figure 7. 3D model search. Red markers are query camera positions and green markers are the 3D model positions in the database. Lines show matches below a 10 m distance threshold. Each match should be seen as a match hypothesis which is to be verified by the geometric constraints of pose estimation.

Figure 8. Matches from the 3D model search. Left: query image from the forward camera. Right: retrieved 3D models.


Figure 9. (a) Screenshots of our 3D model search tool. The query image can be selected from a list on the left; as a result, the corresponding 3D model shows up. (b) Query with an image from a camera phone.

References

[1] A. Akbarzadeh, J. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, D. Nister, and M. Pollefeys. Towards urban 3D reconstruction from video. In 3D Data Processing, Visualization and Transmission, pages 1–8, 2006.

[2] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007.

[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[4] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–72, 2005.

[5] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, New York City, New York, pages 2161–2168, 2006.

[6] D. Robertson and R. Cipolla. An image-based system for urban navigation. In Proc. 14th British Machine Vision Conference, London, UK, pages 1–10, 2004.

[7] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, pages 1–7, 2007.

[8] H. Shao, T. Svoboda, T. Tuytelaars, and L. Van Gool. HPAT indexing for fast object/scene recognition based on local appearance. In Conference on Image and Video Retrieval, pages 71–80, 2003.

[9] C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3D model matching with viewpoint invariant patches (VIPs). In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[10] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respect to a 3D model. In Proc. 3-D Digital Imaging and Modeling (3DIM), pages 159–166, 2007.

[11] W. Zhang and J. Kosecka. Image based localization in urban environments. In 3D Data Processing, Visualization and Transmission, pages 33–40, 2006.