
Fusing 2D and 3D Clues for 3D Tracking Using Visual and Range Data

O. Serdar Gedik and A. Aydin Alatan, Senior Member, IEEE
gedik, [email protected]

Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey

Abstract—3D tracking of rigid objects is required in many applications, such as robotics or augmented reality (AR). The availability of accurate pose estimates increases reliability in robotic applications and decreases jitter in AR scenarios. Pure vision-based 3D trackers require either manual initializations or offline training stages, whereas trackers relying on pure depth sensors are not suitable for AR applications. In this paper, an automated 3D tracking algorithm is proposed, based on the fusion of vision and depth sensors via an Extended Kalman Filter (EKF) that incorporates a novel observation weighting method. Moreover, novel feature selection and tracking schemes, based on the intensity and shape index map (SIM) data of the 3D point cloud, increase 2D and 3D tracking performance significantly. The proposed method requires neither manual initialization of pose nor offline training, while enabling highly accurate 3D tracking. The accuracy of the proposed method is tested against a number of conventional techniques and superior performance is observed.

Keywords- 3D tracking, sensor fusion, EKF

I. INTRODUCTION

Model-based 3D tracking stands for the estimation of the rotation and translation parameters, $R_o$ and $t_o$, between the reference frames of a known object and the camera. The object is typically known by means of a Computer-Aided-Design (CAD) model, snapshots captured from different viewpoints (key-frames) or a Point-Cloud-Model (PCM) obtained by using a depth sensor. For each of these representations, the 3D object coordinates with respect to the object reference frame, $[X_o\ Y_o\ Z_o]^T$, are available, and the 2D pixel coordinates, $[x_o\ y_o]^T$, are extracted from video frames. Hence, the 2D and 3D coordinates are related by the perspective camera relation in homogeneous coordinates, see Fig. 1:

$$s \begin{bmatrix} x_o \\ y_o \\ 1 \end{bmatrix} = K \, [R_o \,|\, t_o] \begin{bmatrix} X_o \\ Y_o \\ Z_o \\ 1 \end{bmatrix} \qquad (1)$$

where K is the internal camera calibration matrix and s is a scale factor. Assuming calibrated cameras, the aim is to recover $R_o$ and $t_o$ at all time instants.
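To make (1) concrete, the following minimal NumPy sketch projects a 3D object point into pixel coordinates; the calibration matrix and pose values are illustrative and not taken from the paper.

```python
import numpy as np

def project_point(K, R_o, t_o, X_obj):
    """Project a 3D object point into pixel coordinates via the relation in (1)."""
    X_cam = R_o @ X_obj + t_o      # object frame -> camera frame
    x_hom = K @ X_cam              # homogeneous pixel coordinates (s*x, s*y, s)
    return x_hom[:2] / x_hom[2]    # divide out the scale factor s

# Illustrative values only (not from the paper)
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])
R_o = np.eye(3)                    # identity rotation for the example
t_o = np.array([0.05, 0.0, 1.0])   # metres
print(project_point(K, R_o, t_o, np.array([0.10, -0.05, 0.20])))
```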

2D pixel and 3D object coordinate correspondences are required in order to solve (1). Classical model-based tracking algorithms establish these correspondences by either exploiting manual initializations of pose in order to align CAD model edges with edges extracted from images, or using off-line training stages in order to develop 2D-3D correspondences that can be used during the online stage. However, CAD

Fig. 1: 3D tracking principles

models might not be available, or off-line training before 3D tracking is typically not desired in practical AR scenarios.

The proposed method overcomes such limitations, eliminating dependence on any CAD model or offline training by integrating vision and depth sensors within a single formulation. By utilizing two sets of measurements for each object point, namely 2D pixel coordinates from the vision sensor and 3D coordinates from the depth sensor, the proposed method tries to solve the 2D-3D matching and 3D tracking problems in an accurate manner. This paper is organized as follows: Section II is devoted to the model-based tracking algorithms in the literature. The proposed method is introduced in Section III. Section IV presents experimental results. Conclusions are given in Section V.

II. RELATED WORK

Model-based 3D tracking algorithms that exploit vision sensors estimate pose so that 2D features extracted from video frames are aligned with the projections of 3D object features. In the literature, these algorithms can be classified as marker-based and marker-free methods [1]. Marker-free methods are more widely used, since they do not utilize special markers. Marker-free methods mostly rely on edges or point features (texture) in order to estimate the object pose.

Motivated by the observation that edges are largely invariant to lighting conditions, edge-based trackers, which generally require CAD models, have been proposed under the marker-free framework [6], [7]. The initial pose required to project the CAD model can be


obtained from the pose of the previous video frame [8] or using inertial sensors [9]. Edge-based trackers that utilize only vision sensors suffer from the necessity of manual initialization, which is required in order to project the CAD model on the image, and the associated performances are quite sensitive to initial pose estimates. In order to account for such limitations, texture-based methods emerged [2].

Texture-based methods are based on matching image features between consecutive frames or between frames and key-frames [2], [10], [11]. In an offline training period, with the help of the features extracted and matched among the key-frames, 2D-3D correspondences are obtained by back-projections of the CAD model or using structure from motion (SfM) techniques. In the online period, features extracted from the video images are matched to key-frame features to establish the 2D-3D correspondences required for 3D pose estimation.

In order to account for the limitations of edge and texture-based methods, algorithms combining both approaches have also been proposed in the literature. Edge-based tracking may be initialized using texture-based tracking [12] or vice versa [13].

On the other hand, model-based tracking algorithms utilizing pure range (depth) data are mainly based on registration of 3D point clouds. For instance, as proposed in [14], the Iterative Closest Point (ICP) algorithm first matches the closest points between two point sets and calculates the pose based on this association. Then, it aligns the points by this pose estimate, and continues to match nearest points and estimate pose in an iterative manner. The authors in [15] use articulated ICP (which also utilizes texture information) in a Kalman Filter (KF) formulation. Coherent Point Drift (CPD) [16] is a recent probabilistic point set registration algorithm utilizing a Gaussian mixture model. A local descriptor based 3D-3D registration approach is also proposed in [18].

In a recent approach [17], multiple Kinects are utilized for a human tracking application. The human body is approximated as a cylinder whose model parameters and pose are estimated in a Bayesian framework.

As already mentioned, model-based trackers utilizing only vision sensors suffer from the necessity of CAD models or offline training stages. For augmented reality applications, CAD models might not be available, or offline training is not suitable. On the other hand, depth-sensor-based trackers suffer from inherent sensor noise and are also not suitable for AR applications due to the lack of color data. Moreover, 3D-3D registration can easily be trapped in a local minimum. Inspired by the availability of cheap vision-depth camera bundles, such as Microsoft Kinect [19], in this paper a 3D tracking algorithm based on the fusion of vision and depth sensors via an Extended Kalman Filter (EKF) is proposed. The proposed 3D tracking algorithm requires neither manual pose initialization nor offline training and aims at highly accurate 3D tracking.

III. PROPOSED ALGORITHM

In the proposed formulation, the object of interest is simply selected automatically by using depth thresholding in the first frame; hence, a colored PCM of the object is obtained. Depending on the application, the object of interest could instead be selected by any suitable object detection routine, or, for 3D map fusion applications, the whole frame could be tracked.

A. 3D Tracking Using EKF

Fusion of vision and depth sensors is accomplished using the Kalman Filter, which has proven itself in the tracking literature. However, due to the non-linearities involved, the EKF is utilized. A constant velocity motion model, as in the well-known MonoSLAM algorithm [20], is utilized in the proposed EKF formulation. Such a model is suitable for free-moving hand-held camera scenarios. The states of the filter are defined as follows:

$R_{do} = R(\rho_{do}, \theta_{do}, \phi_{do})$: rotation matrix (at the specified instant) between the object and depth camera reference frames, defined by the angles $\rho$, $\theta$ and $\phi$ about the x, y and z axes, respectively,

$t_{do} = [t_{x_{do}}, t_{y_{do}}, t_{z_{do}}]^T$: translation parameters (at the specified instant) between the object and depth camera reference frames in the x, y and z directions, respectively,

$v_{do} = [\dot{\rho}_{do}, \dot{\theta}_{do}, \dot{\phi}_{do}, \dot{t}_{x_{do}}, \dot{t}_{y_{do}}, \dot{t}_{z_{do}}]^T$: associated velocity parameters between the object and depth camera reference frames.

In the proposed method, the PCM is used for the object representation. Thus, the measurements associated with each object point are 3D coordinate measurements from the depth sensor and 2D pixel coordinate measurements from the vision sensor:

$[X_{od_i}\ Y_{od_i}\ Z_{od_i}]^T$: 3D coordinates of the ith object point measured by the depth camera with respect to the depth camera reference frame,

$[x_{o_i}\ y_{o_i}]^T$: 2D pixel coordinates of the ith object point.

Having defined the states and measurements for the EKF, the state update and measurement equations can be written as follows [3]:

1) State Update Equations: State update equations define the transition from the previous state $x_{t-1}$ to the current state $x_t$ when the input $u_t$ is applied to the system. Note that the subscript t stands for time instant; in our case, it stands for the frame index:

$$x_t = g(x_{t-1}, u_t) + \varepsilon_t \qquad (2)$$

Please note that $u_t$ is included for the sake of completeness. The rotation and translation parameters between the depth camera and the object are updated linearly according to the following constant velocity model:

$$[\rho_{do}, \theta_{do}, \phi_{do}, t_{x_{do}}, t_{y_{do}}, t_{z_{do}}]^T_t = [\rho_{do}, \theta_{do}, \phi_{do}, t_{x_{do}}, t_{y_{do}}, t_{z_{do}}]^T_{t-1} + [\dot{\rho}_{do}, \dot{\theta}_{do}, \dot{\phi}_{do}, \dot{t}_{x_{do}}, \dot{t}_{y_{do}}, \dot{t}_{z_{do}}]^T_{t-1} + \varepsilon^{i}_t \qquad (3)$$

where $\varepsilon^{i}_t$ is the state update noise. On the other hand, the velocity parameters only have noise updates in order to account for possible accelerations. Note that the state update noise covariance matrix is defined as $R_t$.

$$[v_{do}]_t = [v_{do}]_{t-1} + \varepsilon^{ii}_t \qquad (4)$$
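A minimal sketch of the constant velocity prediction of (2)-(4), assuming the 12-dimensional state is stored as a NumPy vector with the 6 pose parameters followed by the 6 velocity parameters (an assumed layout; the paper does not fix one). The Jacobian G is reused later in the EKF iterations of (15).

```python
import numpy as np

def g(x_prev, u=None):
    """Constant velocity process model of (3)-(4); u is unused, noise is handled by the EKF."""
    pose, vel = x_prev[:6], x_prev[6:]
    return np.concatenate([pose + vel, vel])   # pose += velocity, velocity unchanged

def G_jacobian():
    """Jacobian of g(.): identity plus a velocity-to-pose coupling block."""
    G = np.eye(12)
    G[:6, 6:] = np.eye(6)
    return G
```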


Fig. 2: Typical color-depth calibration images

2) Measurement Equations: Measurement equations relate the current states, $x_t$, and the current measurements, $z_t$, as follows:

$$z_t = h(x_t) + \varepsilon_t \qquad (5)$$

For each object point i, the measurement equations relate the 3D coordinates with respect to the object reference frame, $[X_{o_i}\ Y_{o_i}\ Z_{o_i}]^T$, the 3D depth camera measurements, $[X_{od_i}\ Y_{od_i}\ Z_{od_i}]^T$, and the 2D coordinate measurements, $[x_{o_i}\ y_{o_i}]^T$. $z_t$ is a 5N vector, N specifying the number of object points:

$$\begin{bmatrix} X_{od_i} \\ Y_{od_i} \\ Z_{od_i} \end{bmatrix}_t = R_{do} \begin{bmatrix} X_{o_i} \\ Y_{o_i} \\ Z_{o_i} \end{bmatrix} + t_{do} + \varepsilon^{i}_t \qquad (6)$$

$$\alpha_i \begin{bmatrix} x_{o_i} \\ y_{o_i} \\ 1 \end{bmatrix}_t = K \left[ R_{dv} \left[ R_{do} \begin{bmatrix} X_{o_i} \\ Y_{o_i} \\ Z_{o_i} \end{bmatrix} + t_{do} \right] + t_{dv} \right] + \varepsilon^{ii}_t \qquad (7)$$

$\alpha_i$ is the scale factor, K is the internal calibration matrix of the vision sensor, $R_{dv}$ and $t_{dv}$ include the rotation and translation parameters between the vision and depth sensors, and $\varepsilon^{i}_t$ and $\varepsilon^{ii}_t$ are observation noise terms, whose diagonal covariance matrices are equal to $Q^{XYZ}_t$ and $Q^{xy}_t$, respectively, where the subscript t denotes time instant. $Q^{XYZ}_t$ and $Q^{xy}_t$ are parameters depending on sensor specifications, and the current experiments assume that they are diagonal. Throughout the study, the EKF implementation of [21] is used.
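A sketch of the per-point measurement function h(.) implied by (6)-(7), assuming the rotation is composed from the angles about the x, y and z axes in Rz·Ry·Rx order (the paper names the axes but not the composition order, so this is an assumption).

```python
import numpy as np

def euler_to_R(rho, theta, phi):
    """Rotation matrix from angles about the x, y, z axes (assumed Rz @ Ry @ Rx order)."""
    cx, sx = np.cos(rho), np.sin(rho)
    cy, sy = np.cos(theta), np.sin(theta)
    cz, sz = np.cos(phi), np.sin(phi)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def h_point(pose, X_obj, K, R_dv, t_dv):
    """Predicted measurements for one object point: 3D as in (6), 2D as in (7)."""
    R_do, t_do = euler_to_R(*pose[:3]), pose[3:6]
    X_depth = R_do @ X_obj + t_do            # (6): point in the depth-camera frame
    x_hom = K @ (R_dv @ X_depth + t_dv)      # (7): projection into the vision sensor
    return X_depth, x_hom[:2] / x_hom[2]
```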

3) Sensor Calibration: The RGBD sensor module needs to be calibrated once, both internally and externally. The routine of [4] is used to calibrate the color camera of the module. Fig. 2 shows typical calibration images utilized for external calibration of the color and depth cameras. Sub-pixel accurate centers of elliptical projections are used to develop 2D-3D correspondences between the two sensors, and these correspondences are used to estimate $R_{dv}$ and $t_{dv}$ using a Perspective-n-Point (PnP) and Levenberg-Marquardt (LM) based algorithm [5], which minimizes the reprojection error. The mean reprojection error after calibration is about 0.8 pixels.
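The external calibration step can be sketched with OpenCV's PnP solver, whose iterative mode performs a Levenberg-Marquardt style reprojection refinement; the ellipse-centre detection and the exact routine of [5] are not reproduced here, so the 2D-3D correspondences are assumed to be given.

```python
import numpy as np
import cv2

def calibrate_depth_to_color(pts3d_depth, pts2d_color, K_color):
    """Estimate R_dv, t_dv from 2D-3D correspondences by PnP with iterative (LM) refinement."""
    ok, rvec, tvec = cv2.solvePnP(
        pts3d_depth.astype(np.float64),   # Nx3 target centres in the depth-camera frame
        pts2d_color.astype(np.float64),   # Nx2 sub-pixel centres in the colour image
        K_color, None,                    # no distortion model in this sketch
        flags=cv2.SOLVEPNP_ITERATIVE)     # iteratively minimises the reprojection error
    R_dv, _ = cv2.Rodrigues(rvec)
    return R_dv, tvec.reshape(3)
```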

B. Spatially Distributed Feature Selection Using Texture and 3D Curvatures

As already mentioned, the PCM is used in the proposed formulation. Depending on the size of the object, the associated PCM could contain thousands of 3D points. Unfortunately, due to the increased computational complexity, all of the points in the PCM cannot be utilized within the EKF formulation. Therefore, points with high reliability should be selected. To this aim, we propose a procedure that locates features with high spatial and textural quality.

In the feature point extraction literature, a widely used approach for intensity data is to utilize image gradients in order to locate points with high textural derivatives [22], [23]. On the other hand, feature point detectors operating on range (or 3D) data typically utilize curvature information [24].

Since the selected points are used for 3D tracking, in addition to the spatio-textural qualities of the points, the relation between 3D tracking quality and the point locations in 3D space should be considered for feature selection. Intuitively, a positive correlation is expected between 3D tracking quality and the spatial spread of these points in 3D space. In order to verify such a hypothesis, initially, an artificial motion scenario is defined for the Face sequence of [25]. Then, the associated 2D and 3D measurements are obtained for different subsets of the PCM using the motion scenario and fed to the EKF in order to estimate the states, as highlighted in Sub-section III-A. Table I indicates that as the spatial spread of the points in 3D space, measured by the norm of the standard deviation vector $\sigma_{XYZ}$ of the three dimensions, increases, the associated pose estimation errors tend to decrease. (Rotation errors are in milli-radians and translation errors are in mm.)

The availability of 3D information as well as intensity information should direct one to exploit both sources while locating features with high cornerness measures. However, since structural details are lost in naive depth data (please refer to Fig. 3), it is not suitable for such a purpose. At this point, the Shape Index (SI) [28], which utilizes the scene information in terms of the principal curvatures $\kappa_1$ and $\kappa_2$, comes as a useful nonlinear transformation to extract the structural details in the depth data. After transformation of the depth data, the Shape Index Map (SIM) is obtained by calculating per-pixel SI values:

$$SI = \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\!\left(\frac{\kappa_1 + \kappa_2}{\kappa_1 - \kappa_2}\right) \qquad (8)$$
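Given per-pixel principal curvature maps, the SIM follows directly from (8); the sketch below assumes the curvatures have already been estimated from the depth data, which is not detailed in the paper.

```python
import numpy as np

def shape_index_map(k1, k2):
    """Per-pixel Shape Index of (8); k1, k2 are principal curvature maps with k1 >= k2."""
    # arctan2 equals arctan of the ratio when k1 - k2 >= 0 and avoids division by zero
    return 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
```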

TABLE I: Relation between 3D spatial spread of the points and 3D tracking errors

Point Spread (mm)   Rot-x  Rot-y  Rot-z  Tr-x  Tr-y  Tr-z
σ_XYZ: 44.92        3.8    3.6    4.1    2.1   3.0   1.9
σ_XYZ: 59.83        3.6    2.4    3.5    1.7   3.1   1.2
σ_XYZ: 68.76        3.1    2.3    3.4    1.6   2.5   1.2

Considering the observation depicted in Table I, in order to select good features to track, the following method is proposed:

I. Divide the region of interest into regular rectangular patches, as in Fig. 4. Within each intensity and SIM patch, calculate the cornerness measures of each pixel, namely $C_{intensity_i}$ and $C_{SIM_i}$, using any suitable method.

II. Calculate the final weighted cornerness measure for each pixel i as:

$$C_i = \lambda C_{intensity_i} + (1 - \lambda) C_{SIM_i} \qquad (9)$$

where $\lambda$ controls the individual contributions of the intensity and SIM cornerness measures.


(a) RGB Data (b) Depth Data

(c) SIM Data

Fig. 3: RGB, depth and SIM data

Fig. 4: Regular sampling patches

III. Within each patch, select a single pixel with the maximum cornerness measure $C_i$.

The patch size and $\lambda$ parameters are selected based on experiments and literature results. Decreasing or increasing $\lambda$ too much (i.e., towards 0 and 1, respectively) decreases the overall 3D tracking performance, which actually reveals the strength of the combined feature selector. Thus, $\lambda$ is selected as 0.5. Among corner detectors, Harris [22] and Shi-Tomasi [23] are tested, and Harris is observed to have better localization.
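A sketch of steps I-III, using OpenCV Harris responses on the intensity image and on the SIM; normalising the two responses before mixing them with λ is an assumption, since the paper only specifies the weighted sum of (9).

```python
import numpy as np
import cv2

def select_features(intensity, sim, patch=40, lam=0.5):
    """Pick one feature per rectangular patch by the weighted cornerness of (9)."""
    c_int = cv2.cornerHarris(intensity.astype(np.float32), 2, 3, 0.04)
    c_sim = cv2.cornerHarris(sim.astype(np.float32), 2, 3, 0.04)
    # normalise both responses to [0, 1] so they are comparable (assumed, not stated)
    c_int = (c_int - c_int.min()) / (np.ptp(c_int) + 1e-12)
    c_sim = (c_sim - c_sim.min()) / (np.ptp(c_sim) + 1e-12)
    C = lam * c_int + (1.0 - lam) * c_sim            # eq. (9)
    feats = []
    h, w = C.shape
    for y0 in range(0, h, patch):
        for x0 in range(0, w, patch):
            block = C[y0:y0 + patch, x0:x0 + patch]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            feats.append((x0 + dx, y0 + dy))         # step III: best pixel in the patch
    return np.array(feats, dtype=np.float32)
```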

C. Measurement Tracking Using Optical Flow Estimation on 2D and 3D Data

At each time instant, the EKF is fed with 2D ($[x_{o_i}\ y_{o_i}]^T$) and 3D ($[X_{od_i}\ Y_{od_i}\ Z_{od_i}]^T$) measurements belonging to the object to be tracked (see the relations in (6)-(7)). Therefore, these measurements must be transferred between consecutive frames. A trivial solution to this problem is tracking of the 2D measurements via optical flow estimation on the intensity (luminance) data. Then, since the two sensors are externally calibrated, the associated 3D measurements at the next time instant are also obtained. However, due to sensor noise or changes in illumination, 2D tracking might be erroneous, which causes error accumulation in the 3D tracker.

Fig. 5: Template Inverse Matching

Therefore, the main requirement to minimize 3D tracking error is to increase the accuracy of 2D optical flow estimation by utilizing the additional 3D data available. A trivial solution might be the direct utilization of depth maps from the depth sensor during optical flow estimation. It is experimentally observed that naive utilization of raw depth maps during optical flow estimation does not yield satisfactory performance, since many points cannot be matched due to the absence of structural details within the depth data (please refer to Fig. 3).

At this point, we propose a feature tracker which utilizes intensity and transformed depth (SIM) data together. Since structural details are more distinct in SIM data, it is used to assist intensity-based optical flow estimation in tracking the same features.

Ideally, a 2D measurement $[x_{o_i}\ y_{o_i}]^T_t$ should match the same pixel coordinate whether it is tracked using intensity or SIM data. However, since the two sensors have different noise characteristics, for an arbitrary feature, the intensity and SIM trackers might match different locations, and one of these matches could be more accurate than the other. In order to detect their accuracies, we utilize template inverse matching (TIM) (illustrated in Fig. 5) [29], which simply calculates the Euclidean distance between a 2D measurement $[x_{o_i}\ y_{o_i}]^T_t$, associated with feature-i at time t, and the 2D measurement $[x_{o_i}\ y_{o_i}]'^T_t$ obtained by tracking the correspondence of i at time t+1 backward, as follows:

$$d_{TIM_i} = \left\| [x_{o_i}\ y_{o_i}]^T_t - [x_{o_i}\ y_{o_i}]'^T_t \right\| \qquad (10)$$

Ideally zero, $d_{TIM_i}$ increases as the tracking quality of feature-i decreases. Consequently, in order to track features across consecutive frames, the following optical flow estimation algorithm is proposed (a code sketch is given after the steps):

I. Track 2D measurements by optical flow estimation (e.g. using [26]) on:
   a. Intensity data
   b. SIM data
II. Calculate TIM errors for the intensity and SIM trackers.
III. For each feature, comparing the TIM errors of the intensity and SIM trackers, assign the final correspondence at t+1 based on the tracker with the minimum error.
IV. Discard features with TIM errors larger than a pre-defined threshold.
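The combined tracker of steps I-IV can be sketched with OpenCV's pyramidal Lucas-Kanade flow ([26]) run on both modalities, using the forward-backward distance of (10) as the TIM error; images are assumed 8-bit, with the SIM pre-scaled to [0, 255], and the threshold value is illustrative.

```python
import numpy as np
import cv2

def tim_error(img0, img1, pts0):
    """Forward-backward (TIM) error of (10) for one modality; pts0 is float32, shape Nx1x2."""
    pts1, st, _ = cv2.calcOpticalFlowPyrLK(img0, img1, pts0, None)
    pts0_back, st_b, _ = cv2.calcOpticalFlowPyrLK(img1, img0, pts1, None)
    err = np.linalg.norm(pts0 - pts0_back, axis=2).ravel()
    err[(st.ravel() == 0) | (st_b.ravel() == 0)] = np.inf   # mark failed tracks
    return pts1, err

def combined_track(int0, int1, sim0, sim1, pts0, thresh=1.0):
    """Steps I-IV: track on intensity and SIM, keep the lower-TIM match per feature."""
    p_int, e_int = tim_error(int0, int1, pts0)               # step I.a + II
    p_sim, e_sim = tim_error(sim0, sim1, pts0)               # step I.b + II
    use_sim = e_sim < e_int
    pts1 = np.where(use_sim[:, None, None], p_sim, p_int)    # step III
    best_err = np.minimum(e_int, e_sim)
    keep = best_err < thresh                                 # step IV
    return pts0[keep], pts1[keep], best_err[keep]
```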


(a) Tracked features on intensity data

(b) Tracked features on SIM data

Fig. 6: Tracked features on intensity and SIM data

Fig. 6 shows features matched between consecutive frames utilizing the proposed approach.

The optical flow literature is mainly concentrated on tracking features with high intensity cornerness measures. However, for relatively low-textured surfaces, the optical flow estimation is expected to be erroneous. Similarly, although the SIM tracker has degraded optical flow estimation performance on spatially smooth regions, it performs promisingly on spatial corners, such as the nose tip or the edge of a table. Hence, when all features are tracked with the SIM tracker, the overall performance is low, as in Table II. However, for a specific feature, the combined tracker switches between the SIM and intensity trackers based on their optical flow estimation accuracies, and hence the resultant performance is much better than that of the individual trackers. As is clear from Table II, combining two independent optical flow estimators by the TIM metric decreases 2D tracking errors (by more than 90% for the tested sequence) and guarantees high quality measurements to be fed to the EKF. Since 2D tracking errors are the main reason for 3D tracking error accumulation [1], this accumulation is reduced significantly.

TABLE II: Mean TIM errors (in pixels) for the tracked features of the 'Freiburg2-Desk' sequence [33]

Intensity Tracker   SIM Tracker   Combined Tracker
1.15                332.43        0.09

D. Measurement Weighting

In order to increase the quality of 3D tracking, another approach could be based on weighting of the 2D and 3D measurements in order to favor good features. Such a weighting is accomplished by changing the measurement noise variances (the diagonal entries of $Q^{XYZ}$ and $Q^{xy}$) in (6)-(7), based on the qualities of the obtained features [30]. At each time instant, the measurement noise variances corresponding to high quality features are decreased, and vice versa. 2D measurement variances are simply weighted using $d_{TIM_i}$, after some normalization, in order to account for possible errors during measurement tracking:

$$\sigma^2_{pix_i} = \frac{n \, d_{TIM_i}}{\sum_{i=1}^{n} d_{TIM_i}} \, \sigma^2_{pix} \qquad (11)$$

$\sigma^2_{pix}$ is the vision sensor noise variance, typically remaining constant in conventional EKF solutions. With the proposed formulation, the mean noise variance is still preserved, whereas the noise is distributed unevenly among the tracked features. Such a weighting scheme is more efficient than the method in [30], where, for each feature, $\sigma^2_{pix}$ is estimated by fitting a Gaussian to the sum-of-squared-distance surface, which is computationally involved.

3D measurements from the depth sensor are weighted by considering two concepts. Since 3D measurements are obtained by tracking 2D measurements, the errors in 2D measurement tracking should affect the 3D measurements. Thus, using the perspective camera model, the following relations are obtained, which claim the noise variance to be proportional to 2D displacement quality and depth:

$$w_{1_i} = (z_{od_i}/f)\, d_{TIM_i}, \qquad w_{1_{i_n}} = w_{1_i} \Big/ \sum_{i=1}^{N} w_{1_i} \qquad (12)$$

where $z_{od_i}$ is the depth measurement of feature-i, f is the depth camera focal length, and N is the number of features.

Furthermore, as shown in Table I, there is a strong relation between tracking quality and the spread of the 3D measurements ($[X_{STD}\ Y_{STD}\ Z_{STD}]^T$). As the norm of the standard deviation vector increases, i.e. the spread of the 3D measurements increases, the pose estimation errors decrease. Thus, this observation motivates us to develop a weighting scheme favoring features that are away from the 3D center of mass of the observations, $[C_X\ C_Y\ C_Z]^T$, as follows:

$$w_{2_i} = 1 \Big/ \left\| [X_{od_i}\ Y_{od_i}\ Z_{od_i}]^T - [C_X\ C_Y\ C_Z]^T \right\|, \qquad w_{2_{i_n}} = w_{2_i} \Big/ \sum_{i=1}^{N} w_{2_i} \qquad (13)$$

Hence, the 3D measurement noise variances of (6) are weighted by the following relation:

$$\sigma^2_{XYZ_i} = \frac{N \,(w_{1_{i_n}} + w_{2_{i_n}})}{\sum_{i=1}^{N} (w_{1_{i_n}} + w_{2_{i_n}})} \, \sigma^2_{XYZ} \qquad (14)$$
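A sketch of the per-feature noise weighting of (11)-(14); the function and variable names are illustrative, and the inputs are the TIM errors, the depth measurements and the 3D coordinates of the currently tracked features.

```python
import numpy as np

def weight_measurement_noise(d_tim, pts3d, depths, f, sigma2_pix, sigma2_xyz):
    """Per-feature 2D and 3D measurement noise variances from (11)-(14)."""
    n = len(d_tim)
    # (11): redistribute the 2D variance according to the TIM errors
    sigma2_pix_i = n * d_tim / np.sum(d_tim) * sigma2_pix
    # (12): weight proportional to depth and 2D displacement quality
    w1 = (depths / f) * d_tim
    w1n = w1 / np.sum(w1)
    # (13): favour features far from the 3D centre of mass
    centre = pts3d.mean(axis=0)
    w2 = 1.0 / (np.linalg.norm(pts3d - centre, axis=1) + 1e-12)
    w2n = w2 / np.sum(w2)
    # (14): redistribute the 3D variance
    w = w1n + w2n
    sigma2_xyz_i = n * w / np.sum(w) * sigma2_xyz
    return sigma2_pix_i, sigma2_xyz_i
```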

Finally, with the addition of the observation weighting scheme, the basic EKF iterations [3] are modified as in (15), where $\mu_t$ and $\Sigma_t$ are the mean and covariance matrices of the states, G and H are the Jacobian matrices of the state transition and measurement functions, $K_t$ is the Kalman gain and $Q_t = diag(Q^{XYZ}_t, Q^{xy}_t)$.

$$\bar{\mu}_t = g(u_t, \mu_{t-1}), \qquad \bar{\Sigma}_t = G_t \Sigma_{t-1} G_t^T + R_t$$
$$Q^{i}_t = diag(\sigma^2_{pix_1}, \ldots, \sigma^2_{pix_N}), \qquad Q^{ii}_t = diag(\sigma^2_{XYZ_1}, \ldots, \sigma^2_{XYZ_N})$$
$$K_t = \bar{\Sigma}_t H_t^T (H_t \bar{\Sigma}_t H_t^T + Q_t)^{-1} \qquad (15)$$
$$\mu_t = \bar{\mu}_t + K_t (z_t - h(\bar{\mu}_t)), \qquad \Sigma_t = (I - K_t H_t)\,\bar{\Sigma}_t$$

The only term related to the quality of the measurements in the Kalman gain calculation is the $Q_t$ term, i.e. the measurement noise covariance matrix. Generally, $Q_t$ is fixed and determined during system design. In the proposed formulation, by evolving the $Q_t$ matrix based on the quality/importance of the features, the Kalman gain is manipulated in order to favor good features. During tracking, with minimal extra computation, the measurement qualities are estimated and exploited during the EKF updates. Thus, discrimination between measurements of the same sensor is accomplished, which is not conventional for an EKF formulation.
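For completeness, one EKF iteration of (15) with the per-feature diagonal measurement covariance can be sketched as follows; stacking all 3D measurements before all 2D measurements in z_t is an assumed ordering, g and G are the process model and its Jacobian sketched earlier, and h stacks the per-point predictions (e.g. of h_point above) into a 5N vector, with H its Jacobian.

```python
import numpy as np

def ekf_step(mu, Sigma, z, u, g, G, h, H, R, sigma2_pix_i, sigma2_xyz_i):
    """One EKF iteration of (15) with the weighted measurement covariance Q_t."""
    # prediction
    mu_bar = g(mu, u)
    G_t = G()
    Sigma_bar = G_t @ Sigma @ G_t.T + R
    # weighted measurement noise: 3N depth entries followed by 2N pixel entries (assumed order)
    Q_t = np.diag(np.concatenate([np.repeat(sigma2_xyz_i, 3),
                                  np.repeat(sigma2_pix_i, 2)]))
    # correction
    H_t = H(mu_bar)
    K_t = Sigma_bar @ H_t.T @ np.linalg.inv(H_t @ Sigma_bar @ H_t.T + Q_t)
    mu_new = mu_bar + K_t @ (z - h(mu_bar))
    Sigma_new = (np.eye(len(mu)) - K_t @ H_t) @ Sigma_bar
    return mu_new, Sigma_new
```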

IV. TEST RESULTS

The performance of the proposed 3D tracking algorithm is compared to that of well-known algorithms in the literature, namely ICP [31], CPD [16] and the quaternion-based 3D pose estimator [27]. (For ICP and CPD, the original implementations are used.) Many variants of the ICP and CPD algorithms are utilized either as the core pose estimator or as a final method to refine initial pose estimates in many approaches, such as [32]. To perform fair comparisons, the high quality features of Sub-section III-B are tracked using the proposed SIM-assisted feature tracker, and the obtained 3D-3D correspondences between model and data are fed to these conventional methods to estimate the associated poses. Previous pose estimates are also utilized to align model and data initially. For the utilized Book and Face sequences [25], since the groundtruth pose parameters are not available, the objective evaluation of the 3D tracking performance is performed in terms of the mean reprojection error, where the reprojection error is defined as:

$$\text{Reprojection Error} = \left\| [x_{o_i}\ y_{o_i}]^T_t - [x_{o_i}\ y_{o_i}]^T_P \right\| \qquad (16)$$

where $[x_{o_i}, y_{o_i}]^T_t$ is the coordinate observation and $[x_{o_i}, y_{o_i}]^T_P$ is the projection obtained using the state estimates for the ith object point at time instant t. Table III shows the mean reprojection errors of all tracked features. Tracking videos are available via [25]. It can be clearly observed from Table III that the proposed EKF scheme, which enables weighting of features based on their qualities/importance, increases the overall tracking performance:

TABLE III: Mean reprojection errors (in pixels)

Method/Error   Proposed   ICP    CPD    Quaternion
Book           2.24       3.38   3.94   3.38
Face           4.00       4.52   4.61   4.39

(a) PCM projections with the proposed method.

(b) PCM projections with the quaternion-based method.

Fig. 7: Model reprojections

In order to compare the proposed tracking scheme and the second-best algorithm in Table III, namely the quaternion-based pose estimator [27], the model PCM is projected on the video frames using the estimated pose parameters. Fig. 7 shows typical reprojections for the Book and Face sequences. It is clear that the projected PCM almost perfectly fits the video frame.

Moreover, in order to examine the improvements due to each of the proposed contributions, the algorithms given in Table IV are compared and the associated performances are given in Table V. Based on these results, one can conclude that all the contributions (from feature selection to sensor fusion) increase the tracking performance significantly.

TABLE IV: Algorithms compared

Method                              Proposed   EKF-1   EKF-2   EKF-3
Measurement Weighting               Yes        Yes     Yes     No
SIM Assisted Feature Tracking       Yes        Yes     No      No
Spatio-Textural Feature Selection   Yes        No      No      No

TABLE V: Mean reprojection errors (in pixels)

Method   Proposed   EKF-1   EKF-2   EKF-3
Book     2.24       4.61    6.88    7.30
Face     4.00       4.50    7.39    7.44

For further objective evaluation of the proposed method, we utilized the sequences provided by TUM, for which groundtruth poses are available [33]. The sequences have different motion and scene characteristics. Microsoft Kinect and Asus Xtion sensors are used for data capture. Since the utilized sequences correspond to static scenes, the colored PCM of the first frame is tracked. Due to space restrictions, only the mean percentage improvements over the conventional methods are reported here. Absolute pose errors for the tested sequences are available as supplemental material [25].


Fig. 8: Panoramic scene generation

Percentage improvement is calculated using the formula:

$$PI = \frac{error_{conventional} - error_{proposed}}{error_{conventional}} \times 100 \qquad (17)$$

TABLE VI: Mean percentage improvement over conventional methods

Method/Error      Rot-x  Rot-y  Rot-z  Tr-x  Tr-y  Tr-z
ICP [31]          41.9   39.8   46.1   24.3  15.7  34.5
CPD [16]          95.3   88.1   97.0   92.7  88.9  98.2
Quaternion [27]   39.8   33.7   33.5   29.3  25.1  35.0

The proposed 3D tracking scheme is utilized in order to register the colored point clouds of the sequences to obtain panoramic 3D models. Similarly, the colored PCM of the initial frame is tracked. Since the sequences are relatively long, the initial PCM may become invisible as time goes on. Thus, when the number of tracked features decreases severely, the current frame is selected as the model frame and its colored PCM is tracked at the following time instants. Finally, the estimated 3D poses are utilized in order to obtain a panoramic 3D model, as shown in Fig. 8. Fig. 9 shows snapshots from selected models, whereas the resulting videos are available in [25]. Note that post-processing (such as volumetric integration, surface fitting, etc.) is not applied to the final point clouds in Fig. 9, yet they are quite satisfactory. Furthermore, in the Freiburg 2 Desk with Person sequence provided by TUM there are moving objects. The final map in Fig. 9f is of high quality for the static parts; hence we can conclude that the tracking performance is not affected by dynamic objects in the scene.

Although the system is implemented in MATLAB, the method is expected to operate in real time on a CPU. It takes about 30 seconds to track a whole 640 × 480 RGBD frame. The utilized PC has a 2.4 GHz i7 processor with 16 GB RAM and runs Windows 7.

(a) Freiburg2 Desk (1300 frames) (b) Freiburg3 Teddy (600 frames)

(c) Freiburg3 Structure Texture Far (800 frames)

(d) Freiburg3 Long Office Household (1400 frames)

(e) Freiburg2 Large No Loop (600 frames)

(f) Freiburg2 Desk with Person (500 frames)

Fig. 9: Snapshots from 3D panoramas

V. CONCLUSIONS

In this paper, a novel, fully automated model-based 3D tracker is proposed that yields improvement over conventional techniques. To sum up, there are three novel concepts in this work. First, the proposed feature selection algorithm guarantees that spatially and texturally important features are selected and fed to the 3D tracker, which increases tracking performance significantly. Moreover, the utilization of intensity and SIM data together during 2D optical flow estimation establishes much more reliable correspondences between consecutive frames, resulting in better 3D tracking. Finally, the novel measurement weighting approach, which favors measurements based on their qualities/importance, is incorporated within the EKF formulation to improve the results. The accuracy of the proposed 3D tracker is verified using publicly available datasets, and superior performance over conventional techniques is observed.

ACKNOWLEDGMENT

This work is partially supported by The Scientific and Technological Research Council of Turkey.


REFERENCES

[1] V. Lepetit, P. Fua, Monocular Model-Based 3D Tracking of Rigid Objects, Now Publishers Inc., 2005.
[2] L. Vacchetti, V. Lepetit, P. Fua, Stable Real-Time 3D Tracking Using Online and Offline Information, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
[3] S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics, MIT Press, 2005.
[4] Camera Calibration Toolbox for Matlab, http://www.vision.caltech.edu/bouguetj/calibdoc/. Accessed 11.11.2012.
[5] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in C - The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992.
[6] G. Reitmayr, T. Drummond, Going Out: Robust Model-Based Tracking for Outdoor Augmented Reality, The International Symposium on Mixed and Augmented Reality, 2006.
[7] A. P. Gee, W. Mayol-Cuevas, Real-Time Model-Based SLAM Using Line Segments, Lecture Notes in Computer Science, Springer, 2006.
[8] M. Vincze, M. Schlemmer, P. Gemeiner, M. Ayromlou, Vision for Robotics: A Tool for Model-Based Object Tracking, IEEE Robotics and Automation Magazine, 2006.
[9] G. Klein, T. Drummond, Robust Visual Tracking for Non-Instrumented Augmented Reality, The International Symposium on Mixed and Augmented Reality, 2003.
[10] I. Skrypnyk, D. Lowe, Scene Modelling, Recognition and Tracking with Invariant Image Features, The International Symposium on Mixed and Augmented Reality, 2004.
[11] J. S. Beis, D. Lowe, Shape Indexing Using Approximate Nearest-Neighbor Search in High-Dimensional Spaces, IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[12] E. Rosten, T. Drummond, Fusing Points and Lines for High Performance Tracking, IEEE International Conference on Computer Vision, 2005.
[13] G. Bleser, H. Wuest, D. Stricker, Online Camera Pose Estimation in Partially Known and Dynamic Scenes, The International Symposium on Mixed and Augmented Reality, 2007.
[14] P. J. Besl, N. D. McKay, A Method for Registration of 3D Shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1992.
[15] M. Krainin, P. Henry, X. Ren, D. Fox, Manipulator and Object Tracking for In-Hand 3D Object Modeling, IJRR, 2011.
[16] A. Myronenko, X. Song, "Point Set Registration: Coherent Point Drift", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[17] F. Faion, M. Baum, U. D. Hanebeck, "Tracking 3D Shapes in Noisy Point Clouds with Random Hypersurface Models", 15th International Conference on Information Fusion, 2012.
[18] N. Gelfand, N. J. Mitra, L. J. Guibas, H. Pottmann, Robust Global Registration, Eurographics Symposium on Geometry Processing, 2005.
[19] XBox Kinect Motion Sensors, http://www.xbox.com/en-GB/kinect. Accessed 1.10.2013.
[20] A. J. Davison, I. D. Reid, N. D. Molton, O. Stasse, MonoSLAM: Real-Time Single Camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[21] ReBEL Toolkit, http://web.cecs.pdx.edu/~ericwan/rebel-toolkit.html. Accessed 5.29.2013.
[22] C. Harris, M. Stephens, A Combined Corner and Edge Detector, Alvey Vision Conference, 1988.
[23] J. Shi, C. Tomasi, Good Features to Track, IEEE Conference on Computer Vision and Pattern Recognition, 1994.
[24] A. Flint, A. Dick, A. Hengel, "Thrift: Local 3D Structure Recognition", 9th Biennial Conference of the APRS on DICTA, 2007.
[25] https://www.dropbox.com/sh/t0n8m1pu70wg5cr/do5Yd739Nj
[26] J.-Y. Bouguet, Pyramidal Implementation of the Lucas Kanade Feature Tracker, Intel Corp.
[27] R. Jain, R. Kasturi, B. G. Schunck, Machine Vision, McGraw-Hill, 1995.
[28] J. J. Koenderink, Solid Shape, MIT Press, 1990.
[29] R. Liu, S. Z. Li, X. Yuan, R. He, Online Determination of Track Loss Using Template Inverse Matching, International Workshop on Visual Surveillance, 2008.
[30] K. Nickels, S. Hutchinson, Weighting Observations: The Use of Kinematic Models in Object Tracking, IEEE International Conference on Robotics and Automation, 1998.
[31] Finite Iterative Closest Point, http://www.mathworks.com/matlabcentral/fileexchange/24301. Accessed 29.05.2013.
[32] M. Ye, X. Wang, R. Yang, F. Ren, M. Pollefeys, Accurate 3D Pose Estimation From a Single Depth Image, IEEE International Conference on Computer Vision, 2011.
[33] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, W. Burgard, D. Cremers, R. Siegwart, "Towards a Benchmark for RGB-D SLAM Evaluation", In Proc. of the RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conference (RSS), 2011.