
Real Time Registration of RGB-D Data using Local Visual Features and 3D-NDT Registration

Henrik Andreasson and Todor Stoyanov
Center of Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden

Abstract— Recent increased popularity of RGB-D capable sensors in robotics has resulted in a surge of related RGB-D registration methods. This paper presents several RGB-D registration algorithms based on combinations of local visual features and geometric registration. Fast and accurate transformation refinement is obtained by using a recently proposed geometric registration algorithm, based on the Three-Dimensional Normal Distributions Transform (3D-NDT). Results obtained on standard data sets demonstrate mean translational errors on the order of 1 cm and rotational errors below 1 degree, at frame processing rates of about 15 Hz.

I. INTRODUCTION

In the past year, the use of inexpensive sensors that provide both depth images and either RGB or intensity images has revolutionized many aspects of mobile robotics. With the introduction of the Microsoft Kinect structured light camera, algorithms utilizing combined RGB and depth (RGB-D) information have become an increasingly important topic in mapping, navigation and perception.

Recent works present several mapping strategies based on RGB-D sensors. Camera tracking methods based on dense models have been shown to produce very accurate environmental models from depth information alone [1]. Approaches that utilize both depth and intensity (or RGB) information are also being investigated, and some promising RGB-D SLAM results have recently been reported [2]. SLAM techniques today typically formulate the estimation problem as a graph structure, which is then optimized with a back-end, for example the g2o framework [3]. Essentially, two types of constraints are encoded in the graph: consecutive (frame-by-frame) estimates, also called visual odometry, and constraints imposed by loop closure. Consecutive frames are commonly assumed to have a rather small change in pose, which is successfully exploited in [4], where the authors obtain accurate tracking of the relative pose by minimizing a re-projection error function between consecutive frames. For detecting loop closures, visual features (for example SURF [5]) and vocabularies [6] describing the frames are commonly used.

This work intends to be suitable both for obtaining consecutive frame-to-frame registration estimates at high frame rates and for detecting loop closure events. The algorithms presented here start from a similar vantage point as the RGB-D SLAM approach presented in [2], but focus solely on frame-to-frame pose estimation and do not perform graph relaxation. The initial visual feature based estimates are then used to seed a recently proposed fast 3D registration algorithm [7], obtaining refined and improved tracking performance. An uncertainty or covariance estimate of each estimated constraint is also directly obtainable from the registration method. Due to the high frame rate of RGB-D sensors, many methods (including the one presented here) attempt to achieve real-time performance, which for vision applications is generally taken to be at least 10 Hz [8].

The next section presents our proposed combined visual and geometric registration method. Section III then presents results obtained on some of the TUM RGB-D data sets from [9] and preliminary comparisons with previously reported results on the same benchmark suite [4]. Finally, we discuss some directions for future improvement of the proposed methods.

II. APPROACH

The suggested approach assumes that both an RGB (or intensity) image I_RGB and a depth image I_D are acquired at approximately the same time, I_t = {I_RGB, I_D}. The main idea is to utilize standard vision techniques (local visual features) to find corresponding regions in I_t and I_t+1, and then use RANSAC to find a consistent alignment. The geometrical information in the vicinity of the selected features can then be efficiently utilized as input to a fast 3D registration method.

A. Visual Features Registration

For each image I_RGB, a set of visual features F is extracted. To find correspondences between features from two images I_t and I_t+1, a brute force search is done between the image features F_t and F_t+1, where the best associations are kept as a set of candidate matches C. Every feature point f_i in C is then associated to a neighbourhood of pixels in the depth image I_D. The pixels in each neighbourhood can further be represented as a set of corresponding 3D points. The average 3D point of each neighbourhood can then be viewed as the geometrical position of the feature point f_i. By choosing a neighbourhood size of one pixel, a direct point-to-point correspondence can be established. Conversely, choosing a larger neighbourhood size (or support size) leads to averaging of the feature position and a point-to-region correspondence. Finally, the points in each feature neighbourhood are stored and later utilized in the geometric registration step.
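To make the lifting of a visual feature to 3D concrete, the following sketch backprojects a square depth-image neighbourhood around a keypoint through a pinhole camera model and averages the resulting points. It is an illustration only: the default intrinsics are typical Kinect values rather than parameters from the paper, and the helper name is our own.

```python
import numpy as np

def feature_position_3d(u, v, depth, half_size=7,
                        fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Mean 3D position of the valid depth pixels in a
    (2*half_size+1) x (2*half_size+1) neighbourhood around keypoint (u, v).

    `depth` is assumed to be a float image in meters; half_size=7 yields
    the 15x15 support size used in the evaluation, while half_size=0
    gives the direct point-to-point correspondence instead.
    """
    u, v = int(round(u)), int(round(v))
    h, w = depth.shape
    us, vs = np.meshgrid(
        np.arange(max(u - half_size, 0), min(u + half_size + 1, w)),
        np.arange(max(v - half_size, 0), min(v + half_size + 1, h)))
    z = depth[vs, us]
    valid = np.isfinite(z) & (z > 0)       # discard missing depth readings
    if not np.any(valid):
        return None
    z = z[valid]
    x = (us[valid] - cx) * z / fx          # pinhole backprojection
    y = (vs[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)   # kept for the geometric step
    return points.mean(axis=0)             # point-to-region 3D position
```

The per-neighbourhood `points` array is exactly what the geometric registration step reuses, so in practice it would be cached at extraction time rather than recomputed.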

The set of candidate matches C and the set of feature point 3D positions can be used to establish a least-squares error transformation. However, due to local visual similarity, a large portion of outliers and wrong associations can be expected.

[Figure 1: flowchart of the data processing flow for two input RGB-D images. All variants extract SURF features and brute-force match correspondences. RANSAC' stops after finding the best-fit RANSAC transformation. The remaining variants compute the 3D-NDT at the feature neighbourhoods and set the 3D position of each feature to its cell mean before computing a 3D-NDT registration: NDTF (and NDTF', without the cell-mean step), NDTF2 (trim outlier features), NDTF3 (filter features by depth) and NDTF4 (no correspondences). 3D-NDT-D2D instead computes the 3D-NDT of the full point clouds.]

Fig. 1. Different registration algorithms tested.

Thus, a 3-point RANSAC algorithm is deployed as a consistency filtering step. Three feature pairs from C are randomly selected and used to determine a closed-form solution for rotation and translation [10]. The translation and rotation estimate are then used to transform all feature points into a common coordinate system. The number of transformation inliers — associated matching features that are close to their corresponding feature points — is used as a ranking criterion to select the best consistent transformation. Finally, the transformation obtained is used as an initial estimate in the geometrical registration step.
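A minimal sketch of this filtering step is given below, assuming matched Nx3 arrays of feature positions. The closed-form rotation and translation follow the SVD construction of Arun et al. [10]; the iteration count and the 3 cm inlier threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def arun_transform(P, Q):
    """Closed-form least-squares (R, t) with Q ~ R @ P + t (Arun et al. [10])."""
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # reflection-safe rotation
    return R, q0 - R @ p0

def ransac_alignment(P, Q, iters=500, inlier_thresh=0.03, seed=0):
    """3-point RANSAC over matched 3D feature positions P[i] <-> Q[i]."""
    rng = np.random.default_rng(seed)
    best_R, best_t, best_inliers = None, None, -1
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = arun_transform(P[idx], Q[idx])
        # Transform all features into a common frame and count inliers.
        residuals = np.linalg.norm(P @ R.T + t - Q, axis=1)
        n_inliers = int(np.sum(residuals < inlier_thresh))
        if n_inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, n_inliers
    return best_R, best_t, best_inliers
```

The winning (R, t) then seeds the 3D-NDT registration described next.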

B. Geometric Registration using the 3D-NDT

Several 3D geometric registration approaches could be used to successfully perform fine registration and adjustment of the obtained poses. Local registration methods, such as the Iterative Closest Point (ICP) method [11], are often used as a post-processing step for feature-based registration methods (see, for example, FPFH features combined with a non-linear Levenberg-Marquardt registration [12]). However, even state of the art algorithms like Generalized ICP [13] can be too slow to ensure real time operation, due to the large number of points that need to be considered. The 3D-NDT (Normal Distributions Transform) is a spatial representation that reduces the number of components in a point cloud by an order of magnitude, by locally representing space as Gaussian probability density functions. The 3D-NDT has previously been used for registration in a Point-to-Distribution fashion [14], maximizing the probability of a set of points, given the probability distributions in a 3D-NDT model of a known point set. In a recent article, Stoyanov et al. [7] propose to directly register two 3D-NDT representations (Distribution-to-Distribution, or 3D-NDT-D2D), thus maximizing the overlap between two sets of Gaussian PDFs.
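Schematically, and as a paraphrase of the cost in [7] rather than an exact restatement, each pair of Gaussian components (means mu_A, mu_B and covariances Sigma_A, Sigma_B, one from each model) contributes a negative L2-overlap term to the objective, evaluated at the current pose (R, t); here d_1 and d_2 denote scaling constants:

```latex
f_{AB}(R, t) = -d_1 \exp\left( -\frac{d_2}{2}\,
    \mu_{AB}^{\top} \left( R\,\Sigma_A R^{\top} + \Sigma_B \right)^{-1}
    \mu_{AB} \right),
\qquad
\mu_{AB} = R\,\mu_A + t - \mu_B .
```

Summing such terms over component pairs and minimizing with respect to the pose maximizes the overlap between the two sets of Gaussians.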

In principle, any 3D geometric registration algorithm could be used to refine the initial alignment obtained using visual features and RANSAC. The choice of 3D-NDT-D2D for this task is motivated not only by its fast runtimes, but also by its suitability to a particularly relevant modification. Instead of computing the 3D-NDT representation of the full depth images, we propose to only consider the immediate local neighbourhoods of each of the detected local visual feature points. Thus, the number of Gaussian components can be decreased substantially and, consequently, a full 3D point cloud corresponding to the depth image need not be computed. Several options for performing the 3D-NDT registration were explored and are further described in the following subsection.
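As a sketch of this modification, the per-feature Gaussian components can be built directly from the cached neighbourhood points of Section II-A; the minimum point count and the regularization term below are our own illustrative safeguards, not details from the paper.

```python
import numpy as np

def ndt_component(points):
    """Gaussian (mean, covariance) summarizing one feature neighbourhood.

    `points` is the (N, 3) array of backprojected depth pixels stored
    during feature extraction, so no full point cloud is ever built.
    """
    if points is None or len(points) < 5:  # too few samples for a
        return None                        # well-conditioned covariance
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)     # 3x3 sample covariance
    cov += 1e-6 * np.eye(3)                # keep the covariance invertible
    return mean, cov

def ndt_of_image(neighbourhoods):
    """One Gaussian per detected feature: typically tens of components
    instead of the hundreds of thousands of points in a full depth image."""
    return [c for c in map(ndt_component, neighbourhoods) if c is not None]
```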

C. Tested Algorithm Variations

Several options for performing the combined visual and geometrical registration exist and were explored. Figure 1 shows a schematic overview of the different registration method variations tested. All algorithms start from a common input — two consecutive RGB-D images I_1, I_2. The first option tested is labeled RANSAC' and consists of just two major blocks — extracting and matching SURF features, and running RANSAC to find a consistent transformation between the 3D coordinates of the feature points. The second option is the main RANSAC variation considered, which performs an additional step — the 3D coordinates of the feature points are calculated as the mean position of all points in the feature neighbourhood.

Fig. 2. An example registration of two consecutive frames, using the RANSAC (in (a)) and NDTF (in (b)) algorithms. Problematic areas are highlighted.

The first, baseline 3D-NDT-Feature registration algorithm (NDTF) keeps track of not only the mean, but also the covariance of each feature neighbourhood. In this way, each RGB-D image I is represented as a set of Gaussian distributions around feature points. NDTF performs a 3D-NDT-D2D registration step on these Gaussian distributions, taking into consideration the correspondences and transformation found by RANSAC. The two variations NDTF2 and NDTF3 test two possible noise-reduction steps for the features during the RANSAC step — namely, a statistical outlier removal at 90% significance and a distance-based feature rejection, respectively. The NDTF4 variation skips the RANSAC and correspondence selection steps altogether and directly performs a registration on the distribution sets. Finally, a standard 3D-NDT-D2D registration is performed for comparison, using a regular-grid 3D-NDT model of the full point clouds, computed from the depth images. An example registration of two consecutive frames using the RANSAC and NDTF algorithms is shown in Figure 2.

III. EVALUATION AND RESULTS

To obtain real-time performance (>10 Hz), the RGB images were first downscaled by a factor of 1/16 (a factor of four in both width and height). The feature detector and descriptor used was the standard SURF implementation in OpenCV [15]. All evaluations were performed using a single core with 4800 bogomips (Intel Q6600). The evaluation utilizes the RGB-D data sets collected by Sturm et al. [9]. Results for the method proposed by Steinbrücker et al. and Generalized ICP (GICP) are taken from the paper [4].
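The visual front-end of this setup can be reproduced with a few OpenCV calls, as sketched below. SURF requires the opencv-contrib build; the crossCheck flag is one plausible reading of keeping only "the best associations", and rescaling the keypoints back to full resolution for the depth lookup is our own convenience choice.

```python
import cv2
import numpy as np

def detect_and_match(rgb1, rgb2, scale=0.25):
    """SURF detection on images downscaled by 4x per dimension (1/16 of
    the pixels), followed by brute-force descriptor matching.

    Returns matched keypoint pixel coordinates at full resolution, ready
    for the depth-image neighbourhood lookup of Section II-A.
    """
    surf = cv2.xfeatures2d.SURF_create()   # needs opencv-contrib-python
    small1 = cv2.resize(rgb1, None, fx=scale, fy=scale)
    small2 = cv2.resize(rgb2, None, fx=scale, fy=scale)
    kp1, des1 = surf.detectAndCompute(small1, None)
    kp2, des2 = surf.detectAndCompute(small2, None)

    # Brute-force search; crossCheck keeps only mutual best associations.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(des1, des2)

    # Rescale keypoints to full-resolution pixel coordinates.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches]) / scale
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches]) / scale
    return pts1, pts2
```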

Table I summarizes the accuracy of the different evaluated algorithms on several of the standard RGB-D benchmark sets. The reported results show the mean translational errors over all frame pairs, as calculated using the ground truth path and evaluation tool provided in the benchmark. We note that the performance of the visual features / RANSAC combination and the 3D-NDT-D2D depth-based registration is comparable over all the data sets. Both algorithms have average translation errors on the order of 2-3 centimeters. Next, we note that three of the combined algorithms (NDTF, NDTF2, and NDTF3) produce comparable results, with an error of about 1.5 centimeters, or half the error of the previous two algorithms. The differences between the performance of these three variations will be discussed in subsequent paragraphs. Finally, the variation NDTF4, which relies on the registration of only the geometrical information around each visual feature, performs substantially worse, with mean errors of several centimeters. The lack of correspondences and the very sparse models used result in local minima problems for the registration method in this case and need further investigation.

Table II summarizes the runtime performance (in seconds) of each of the proposed algorithms over all the data sets. Again, the runtime performance of the first three NDTF variations is very similar, with a mean of about 15 Hz. NDTF4 performs much faster, due to the sparsity and the lack of correspondence searches, but the obtained accuracy is suboptimal, as already discussed. The RANSAC visual algorithm is marginally faster than the NDTF approaches, totaling around 0.05 seconds per frame. The full 3D-NDT-D2D registration algorithm runs a couple of times slower, at around 0.15 seconds per frame, which is still two orders of magnitude faster than the geometrical registration algorithm (GICP) used as a baseline by Steinbrücker et al.

Looking at the results in Tables I and II, the performance of the two modified versions NDTF2 and NDTF3 does not appear to differ much from that of the standard NDTF version. Some additional tests were performed to investigate this similarity in performance. Figure 3(b) shows how the mean translation error of NDTF2 varies for all data sets when the feature cutoff threshold is changed.

[Figure 3: two plots of mean translation error (in meters). Panel (a): error as a function of the support size (0 to 40 pixels). Panel (b): error as a function of the maximum keypoint distance (2 to 10 meters). One curve per data set: xyz, rpy, 360, floor, desk, desk2, room.]

Fig. 3. Mean translation errors using different support sizes (in (a)) and different maximum keypoint distances (in (b)).

Dataset   NDTF   NDTF2  NDTF3  NDTF4  RANSAC  3D-NDT-D2D [7]
1-360     0.014  0.015  0.014  0.081  0.035   0.046
1-desk    0.016  0.017  0.016  0.056  0.021   0.021
1-desk2   0.016  0.017  0.016  0.063  0.024   0.027
1-floor   0.015  0.016  0.015  0.049  0.020   0.021
1-room    0.011  0.011  0.011  0.035  0.025   0.020
2-desk    0.005  0.005  0.005  0.006  0.024   0.011

TABLE I. Mean translational errors (in meters) for different registration approaches. NDTF: the standard version without any trimming. NDTF2: with 90% trimming. NDTF3: removes all keypoints with a distance greater than 6 meters from the camera. NDTF4: uses the extracted features to create the NDT but does not use any correspondences in the matches. RANSAC: the estimate found by only running RANSAC. 3D-NDT-D2D: non-visual-feature approach using 3D-NDT distribution-to-distribution matching [7].

Dataset   NDTF   NDTF2  NDTF3  NDTF4  RANSAC  3D-NDT-D2D [7]
1-360     0.057  0.058  0.057  0.061  0.048   0.218
1-desk    0.069  0.069  0.070  0.049  0.041   0.128
1-desk2   0.065  0.065  0.065  0.062  0.055   0.142
1-floor   0.041  0.053  0.054  0.038  0.047   0.105
1-room    0.072  0.064  0.072  0.069  0.045   0.158
2-desk    0.069  0.068  0.069  0.031  0.048   0.135

TABLE II. Mean processing times in seconds for the different approaches.

Cutting off far-away features is motivated by the severe degradation in accuracy of the Kinect sensor at distances above 3.5 meters. The results, however, seem surprising at first: using a cutoff threshold does not offer an improvement in any of the cases, and in fact can only make results worse. The error naturally increases if the cutoff threshold used is too low (lower than or equal to 3 meters for most data sets), due to a depletion of feature points. Using a higher cutoff threshold, however, has no overall effect. Two possible explanations for this effect are the low percentage of feature points at long distances (in the considered data sets) and the general robustness of RANSAC in selecting a well-fitting transformation in the presence of a high number of outliers. Similarly, the possible improvements due to trimming the RANSAC matches to only use 90% of inliers in the transformation computation have little effect on the subsequent 3D-NDT-D2D registration performance.

The impact of the support size (the surrounding region around each keypoint used to calculate the 3D-NDT distribution) is depicted in Fig. 3(a). To obtain a reasonable distribution, it is important that the size is not too small. All other results presented utilize a support size of 15 x 15 pixels.

Table III also demonstrates the performance of two of the key algorithms, RANSAC and NDTF, without the mean neighbourhood filter. These variants, labeled RANSAC' and NDTF', set the 3D position of each keypoint directly to the value calculated from the depth image. Thus, when keypoints are detected in noisy regions or close to edges in the depth image, the associated 3D positions can vary substantially between different frames. Although the results obtained show a worse performance for the unfiltered keypoint versions of both algorithms, the performance drop is not very significant, testifying yet again to the robustness of the 3-point RANSAC method used.

Some measure of a baseline comparison with state of the art approaches can be obtained by studying the results of Steinbrücker et al. [4]. Another interesting point of comparison would be the open loop registration part of the RGB-D SLAM algorithm [2], but unfortunately only values for the absolute error after loop closure have been reported.

Dataset   RANSAC  RANSAC'  NDTF   NDTF'
1-360     0.035   0.035    0.014  0.015
1-desk    0.021   0.022    0.016  0.019
1-desk2   0.024   0.025    0.016  0.017
1-floor   0.020   0.023    0.015  0.018
1-room    0.025   0.025    0.011  0.010
2-desk    0.024   0.025    0.005  0.005

TABLE III. Mean translation error (in meters), where RANSAC' and NDTF' set the keypoint 3D position directly to the closest depth value.

Dataset           mean x (m)  median x (m)  mean θ (deg)  median θ (deg)  mean fps (Hz)
1-360             0.014       0.010         0.592         0.483           17.5
1-desk2           0.016       0.012         1.009         0.820           15.3
1-floor           0.015       0.007         1.027         0.402           24.1
1-room            0.011       0.008         0.635         0.502           13.8
1-desk            0.0165      0.0122        1.1567        0.909           14.4
G-ICP [13]        0.0103      -             0.0154        -               0.13
Steinbrücker [4]  0.0053      -             0.0065        -               12.5
2-desk            0.0048      0.0039        0.3413        0.2961          19.7
G-ICP [13]        0.0062      -             0.0060        -               0.13
Steinbrücker [4]  0.0015      -             0.0027        -               12.5

TABLE IV. Mean and median translational errors, mean and median rotational errors, and mean frame rate, obtained using NDTF. The G-ICP [13] and Steinbrücker [4] rows give reference results on the 1-desk and 2-desk sequences, respectively.

Table IV shows a comparison between the NDTF algorithm and the Generalized ICP and visual odometry approaches. The mean translation errors obtained by the proposed approach are clearly less accurate than those of the visual odometry, but similar to the ones obtained by Generalized ICP. We note, however, the relatively large differences between the mean and median values of our approach, which testifies to the presence of some large outliers. Indeed, examining the errors obtained for a single data set (freiburg2_desk), shown in Figure 4, we can detect several clear outliers in both translational and rotational errors. Identifying such outliers online should be possible by examining the shape of the 3D-NDT-D2D objective at the (local) minimum, and will be one important future direction. Another important direction for improvement will be the investigation and reduction of the (relatively) large rotational errors obtained by our approach.

Finally, we demonstrate the stability of the proposed approach (NDTF is used here) to the skipping of frames. Figure 5 shows a plot of the mean translation error obtained, depending on the number of frames skipped — i.e., the result at n on the x axis of the plot is obtained by registering every n-th frame. Naturally, the accuracy of the algorithm degrades as the number of skipped frames increases, but the rate of the drop in performance is relatively low. The performance is naturally influenced by the overlap between frames and the speed of movement of the sensor, but the preliminary results suggest that future applications featuring faster camera motions or less processing power will be possible.

IV. DISCUSSION

This paper presented several RGB-D registration methods based on a combination of local visual features and 3D-NDT registration. The implementation of all variants presented here is available as a ROS package (ndt_feature_reg) at http://code.google.com/p/oru-ros-pkg/. The results obtained on standard data sets have demonstrated promising performance, with mean translational errors on the order of 1 cm and rotational errors usually below 1 degree. This performance is attained in near real-time, at a frame processing rate of about 15 Hz. Future work will concentrate on lowering the rotational errors obtained and identifying local minima in the registration, as well as on applications as a registration front-end in a SLAM system.

ACKNOWLEDGMENTS

This work has been partially supported by the European Union FP7 Project "RobLog — Cognitive Robot for Automation of Logistic Processes".

REFERENCES

[1] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, Oct. 2011, pp. 127–136.

[2] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, "An evaluation of the RGB-D SLAM system," in Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), St. Paul, MN, USA, May 2012.

[3] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Shanghai, China, May 2011.

[4] F. Steinbrücker, J. Sturm, and D. Cremers, "Real-time visual odometry from dense RGB-D images," in Workshop on Live Dense Reconstruction with Moving Cameras at the Intl. Conf. on Computer Vision (ICCV), 2011.

[5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, June 2008. [Online]. Available: http://dx.doi.org/10.1016/j.cviu.2007.09.014

[6] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proceedings of the International Conference on Computer Vision, vol. 2, Oct. 2003, pp. 1470–1477. [Online]. Available: http://www.robots.ox.ac.uk/~vgg

[7] T. Stoyanov, M. Magnusson, and A. J. Lilienthal, "Point Set Registration through Minimization of the Distance between 3D-NDT Models," in Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), May 2012.

[8] D. Beymer and K. Konolige, "Real-time tracking of multiple people using continuous detection," in Proc. ICCV Frame-rate Workshop, 1999.

[9] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, W. Burgard, D. Cremers, and R. Siegwart, "Towards a benchmark for RGB-D SLAM evaluation," in Proc. of the RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf. (RSS), Los Angeles, USA, June 2011.

[10] K. S. Arun, T. S. Huang, and S. D. Blostein, "Least-squares fitting of two 3-D point sets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 698–700, 1987.

[11] P. Besl and N. McKay, "A Method for Registration of 3D Shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, pp. 239–256, 1992.

[12] R. B. Rusu, N. Blodow, and M. Beetz, "Fast Point Feature Histograms (FPFH) for 3D registration," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2009, pp. 3212–3217.

[13] A. Segal, D. Haehnel, and S. Thrun, "Generalized-ICP," in Proceedings of Robotics: Science and Systems, Seattle, USA, June 2009.

[14] M. Magnusson, T. Duckett, and A. J. Lilienthal, "3D Scan Registration for Autonomous Mining Vehicles," Journal of Field Robotics, vol. 24, no. 10, pp. 803–827, Oct. 2007.

[15] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.

[Figure 4: per-frame error plots.]

Fig. 4. Translation error (in (a)) and orientation error (in (b)) using NDTF on the freiburg2_desk data set.

[Figure 5: two plots of translation error (in meters) as a function of the number of skipped frames (0 to 60).]

Fig. 5. Translation error while skipping intermediate frames. Figure 5(a): freiburg1_desk data set. Figure 5(b): freiburg2_desk data set.
