
Robust multi-view L2 triangulation via optimal inlier selection and 3D structure refinement

Lai Kang a, Lingda Wu a,b, Yee-Hong Yang c

a College of Information System and Management, National University of Defense Technology, Changsha 410073, China
b The Key Lab, The Academy of Equipment Command and Technology, Beijing 101416, China
c Department of Computing Science, University of Alberta, Edmonton, Canada T6G 2E8

Article info

Article history: Received 15 February 2013; Received in revised form 22 January 2014; Accepted 24 March 2014; Available online 2 April 2014.

Keywords: Multi-view triangulation; Optimal inlier selection; 3D structure error bounding; Differential evolution.

Abstract

This paper presents a new robust approach for multi-view L2 triangulation based on optimal inlier selection and 3D structure refinement. The proposed method starts with estimating the scale of noise in image measurements, which affects both the quantity and the accuracy of reconstructed 3D points but is overlooked or ignored in existing triangulation pipelines. A new residual-consensus scheme, within which the uncertainty of epipolar transfer is analytically characterized by deriving its closed-form covariance, is developed to robustly estimate the noise scale. Different from existing robust triangulation pipelines, the issue of outliers is addressed by directly searching for the optimal 3D points that lie within either the theoretically correct error bounds calculated by second-order cone programming (SOCP) or efficiently calculated approximate ranges. In particular, both the inlier selection and the 3D structure refinement are realized in an optimal fashion using Differential Evolution (DE) optimization, which allows flexibility in the design of the objective function. To validate the performance of the proposed method, extensive experiments using both synthetic data and real image sequences were carried out. Compared with state-of-the-art robust triangulation strategies, the proposed method can consistently identify more reliable inliers and hence reconstruct more unambiguous 3D points with higher accuracy than existing methods.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Given a set of m known camera matrices and the corresponding image projections of a 3D point, the reconstruction of the 3D point is known as triangulation. Multi-view triangulation, which concerns the scenario where m ≥ 3, is a fundamental problem in computer vision [1]. It is a trivial problem in the absence of noise because the 3D point is simply the intersection of the lines-of-sight, or rays, of the corresponding image points. In practice, however, these rays never intersect at a common point, since image measurements are unavoidably contaminated by noise. Besides, triangulation is well known to be sensitive to outliers, which are incorrect matches that frequently appear in real images [2-6]. Both of the above issues make finding the optimal solution to multi-view triangulation a challenging problem. While traditional triangulation methods are dominated by linear triangulation followed by local optimization techniques (e.g., bundle adjustment) [1,7], globally optimal formulations of multi-view triangulation problems have recently drawn much attention from the computer vision community [8-12]. This paper proposes a new computational framework for globally computing the optimal multi-view L2 triangulation in the presence of outliers.

Motivation and contributions: State-of-the-art robust triangulation methods use L∞-based algorithms [6] to remove outliers and then run bundle adjustment to refine the 3D structure by minimizing the reprojection error based on the L2 norm. Robust methods [3,6] empirically choose a threshold to identify and remove outliers, and use the reprojection error as the criterion to evaluate the results. In particular, a smaller reprojection error is often assumed to correspond to a more accurately reconstructed 3D structure. Indeed, a smaller reprojection error can always be achieved by choosing a smaller threshold. However, the accuracy of the 3D structure does not necessarily improve accordingly in most cases. As well, the number of successfully reconstructed 3D points decreases noticeably when the threshold decreases. To obtain a more reasonable 3D reconstruction, the chosen scale of noise (i.e., the standard deviation of the noise) should be as close to that of the ground truth as possible. Since outliers frequently appear in




real images due to errors in feature extraction or mismatching of feature points, triangulation methods without outlier handling can produce arbitrary output, as wrong data are used in the fitting. The algorithms developed for outlier handling in triangulation based on the L∞ norm remove potential outliers either iteratively or in a one-shot fashion [6]. One limitation of these algorithms is that many inliers are also incorrectly removed, especially when there is a higher fraction of outliers in the data. Moreover, to the best of our knowledge, the issue of outlier handling in optimal multi-view triangulation under the L2 norm has not been investigated in previous works.

The main contributions of this paper are highlighted as follows:

• The uncertainty of epipolar transfer is analytically characterized by deriving its closed-form covariance. This uncertainty analysis is further incorporated into a new residual-consensus based scheme, which allows robust estimation of the scale of noise in image measurements.

• Two new algorithms to calculate the error bounds of the 3D structure are proposed. One is based on the theoretically correct bounds computed using second-order cone programming (SOCP), and the other on bounds computed using an efficient approximate algorithm. Both algorithms provide valid constraints on the 3D structure and, at the same time, can be used to reduce the search space significantly.

• A new robust formulation of the direct triangulation problem based on the maximum a posteriori (MAP) estimate [13] is proposed. The robustness is in the sense that both outliers and occlusions in image measurements can be handled. In particular, we propose to use a Differential Evolution (DE) algorithm to solve this complex optimization problem. Thus the inlier selection and 3D structure refinement are realized in an optimal fashion. We show that such a strategy can consistently identify more reliable inliers and thus reconstruct more 3D points accurately.

The remainder of this paper is organized as follows. The next section gives an overview of previous works. The background of optimal multi-view triangulation is briefly introduced in Section 3. In Section 4, the key components of our proposed framework are presented in detail. Next, in Section 5, we show the experimental results on both synthetic and real data sets, and the comparison with other methods. Finally, the paper concludes in Section 6.

2. Related works

Under the assumption of Gaussian noise in image measurements, minimizing the L2-norm cost function leads to the maximum likelihood estimate (MLE) of the 3D point in multi-view triangulation [1]. The optimal triangulation problem with two views is addressed by Hartley et al. [14] by solving a polynomial of degree 6, and also by other researchers [15,16]. For the case of three views, it has been shown in [17] that the optimal triangulation problem involves solving a polynomial of degree 47. Although methods for a closed-form solution for optimal triangulation in two or three views exist, these methods are not easy to generalize to more views in practice. The standard method for solving the general m (m ≥ 3)-view triangulation problem is based on local optimization, such as bundle adjustment (BA) [7] initialized with a linear method [1]. Though it has been successfully applied to many vision problems, bundle adjustment requires a relatively accurate initialization. Only recently has a branch-and-bound algorithm that provides a theoretical guarantee of global optimality under the L2 norm been proposed [9]. However, this strategy is computationally expensive [10,18]. Similarly, branch-and-bound is employed in [10], in which a simpler solution is provided to improve the efficiency.

The introduction of L∞-norm based cost functions [8] to multi-view geometry leads to an exciting new direction. It has been shown that many multi-view problems are quasiconvex [8,11,4] under the L∞ norm, which can be exploited for finding globally optimal solutions. In contrast to conventional L2 based methods, the L∞ norm leads to a simpler but geometrically meaningful formulation for many vision problems with globally optimal solutions. A detailed overview of developments in multi-view geometry under the L∞ norm can be found in [19]. The standard approach for solving L∞ problems involves solving a sequence of second-order cone programs (SOCP) with a bisection procedure [20]. The required computation is of polynomial complexity, which implies that the method may not scale well to large problems. To accelerate the computation, several methods have been developed [11,21,12]. One limitation of triangulation methods based on the L∞ norm is that these algorithms are extremely vulnerable to outliers [8].

In recent years, the issue of outlier handling in multi-view geometry has drawn much attention from researchers [2-6]. A popular outlier removal method used in computer vision is RANSAC [1]. However, as RANSAC only works for two views, mismatches in longer point tracks may go undetected [6,3]. Furthermore, since RANSAC is a randomized algorithm, there is no guarantee that all outliers are removed. In practice, RANSAC is used to remove the most egregious outliers in a first stage, and a more sophisticated technique is required to handle the remaining outliers [3]. A pioneering method toward robust geometric estimation is proposed by Ke and Kanade [4]. Instead of minimizing the L∞ norm, the k-th level maxima are minimized in their work. Such a strategy is a generalization of Least Median of Squares (LMedS) in robust regression and thus possesses a high breakdown point. However, as pointed out in [3], the k-th medians are generally not unique; there can be many. Thus, the k-th median algorithm deviates from the original promise of the L∞ idea, which was meant to find a single, unique and global solution. Sim and Hartley [2] develop an algorithm to iteratively remove bad points with the maximal residuals. Theoretical analysis provided in [2] shows that at least one outlier is removed in each iteration. This method is computationally inefficient and may discard inliers as well. The above issues are addressed by Li [3], in which the author develops a method for removing only outliers rather than the whole support set [2]. Nevertheless, the method in [3] still requires a few iterations. Recent approaches described in [22,5,6] recast the problem of outlier removal as a convex optimization problem, in which the objective function is either the maximum infeasibility or the sum of infeasibilities. Although these methods are significantly faster than iterative methods [8,3], there is no guarantee of optimality.

Since many vision problems can be formulated as optimization problems, different optimization techniques have been used in the literature. The one proposed to solve the robust direct triangulation problem in this paper is a class of Evolutionary Algorithms (EAs) called Differential Evolution (DE), which has been shown to be efficient and accurate in many applications [23]. Evolutionary Algorithms in computer vision have been a topic of active research during the last two decades and have been applied to many vision tasks, such as image segmentation [24,25], face detection [26,27], stereo vision [28,29] and, more recently, camera calibration [30]. DE has previously been applied to the problem of two-view triangulation [31] and to direct 3D reconstruction from images [32,33]. Nevertheless, in these existing works on geometric problems using DE, only small-scale problems are considered and no strategies for handling outliers have been presented.

3. Background and notations

Let {P_i | i = 1, …, m} denote a set of m (m ≥ 3) 3×4 camera matrices, U = (Ũ^⊤, 1)^⊤ = (X, Y, Z, 1)^⊤ the homogeneous coordinates of a 3D point, and u = (ũ^⊤, 1)^⊤ = (u, v, 1)^⊤ the homogeneous coordinates of an image point.



The projection of U by each camera matrix P_i is given by [1]

$$u_i \simeq P_i U, \quad (1)$$

where the symbol "≃" denotes equality up to scale. Given the camera matrices {P_i} and the corresponding image measurements {u_i}, the optimal triangulation problem seeks a 3D point U* such that

$$U^{*} = \arg\min_{U}\; \mathcal{J}_U(\{P_i\}, \{u_i\}, U), \quad (2)$$

where J_U(·,·,·) is a cost function. The cost function based on the L2 norm is defined as

$$\mathcal{J}_U(\{P_i\}, \{u_i\}, U) = \sum_{i=1}^{m} d(P_i, u_i, U)^{2}, \quad (3)$$

where the Euclidean image distance

$$d(P_i, u_i, U) = \sqrt{\left(u_i - \frac{[P_i]_1 U}{[P_i]_3 U}\right)^{2} + \left(v_i - \frac{[P_i]_2 U}{[P_i]_3 U}\right)^{2}} \quad (4)$$

is the reprojection error. Here, the notation [P_i]_j denotes the j-th row of the camera matrix P_i. As most existing works on triangulation commonly assume Gaussian noise in image measurements, we also use the same assumption. Minimizing Eq. (3) thus yields the MLE of a 3D point [1]. This objective function is known to be non-linear and non-convex, and hence no trivial solution exists. Moreover, Eq. (2) minimizes a sum of squared Euclidean distances, which is particularly sensitive to outliers. For clarity, the notations used in this paper are listed in Table 1.
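To make the cost concrete, the following minimal Python sketch (numpy assumed; function and variable names are illustrative and not from the paper) evaluates the reprojection error of Eq. (4) and the L2 cost of Eq. (3) for a candidate 3D point.

```python
import numpy as np

def reprojection_error(P, u, X):
    """Euclidean reprojection error d(P, u, X) of Eq. (4).

    P : (3, 4) camera matrix, u : (2,) measured image point,
    X : (3,) Cartesian coordinates of the candidate 3D point.
    """
    Xh = np.append(X, 1.0)              # homogeneous coordinates of the point
    proj = P @ Xh                       # projection, up to scale (Eq. (1))
    return np.linalg.norm(u - proj[:2] / proj[2])

def l2_cost(Ps, us, X):
    """Sum of squared reprojection errors over all views (Eq. (3))."""
    return sum(reprojection_error(P, u, X) ** 2 for P, u in zip(Ps, us))
```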

4. The proposed framework

This section introduces the key components of our new framework for multi-view triangulation in detail. We first present a novel noise scale estimation algorithm based on residual consensus in Section 4.1, and then the error bounding algorithms in Section 4.2. The new robust formulation of the direct multi-view triangulation problem is discussed in Section 4.3, followed by details of applying DE to solve the optimization problem in Section 4.4.

4.1. Robust estimation of noise scale

While conventional methods empirically choose the noise scale and use it as a criterion to identify outliers, the issue of noise scale estimation in the triangulation problem is often overlooked; it is addressed in this paper. Indeed, the selection of the noise scale does have an impact on the accuracy of the 3D reconstruction and on the reprojection error (see Fig. 1). It is easy to see that a smaller noise scale results in lower reprojection errors. However, the accuracy of the 3D reconstruction decreases if either an underestimated or an overestimated noise scale is applied. It is noteworthy that although comparable accuracy of the 3D structure is obtained when σ̂/σ = 0.1, the number of successfully reconstructed 3D points is less than 100, whereas 512 3D points are successfully reconstructed when σ̂/σ = 1 (see Fig. 1(c)). In this subsection, a new algorithm for robustly estimating the scale of noise in image measurements is presented. The proposed algorithm is a robust residual-consensus based approach [34,35], for which an analytical uncertainty analysis of epipolar transfer is derived. The details are presented in the following subsections.

4.1.1. Uncertainty propagation via epipolar transfer

Let u, u′, u″ be the coordinates of three corresponding image points observed by three cameras P₁, P₂ and P₃, respectively (see Fig. 2). The coordinates u″ can be transferred from the correspondence u ↔ u′ by the so-called epipolar transfer [1]:

$$\psi : (u, u') \mapsto u'' = \frac{(F_{13} u) \times (F_{23} u')}{[(F_{13} u) \times (F_{23} u')]_3}, \quad (5)$$

Table 1. Notations used in this paper.

P : 3×4 camera matrix
F : 3×3 fundamental matrix
Ũ : Cartesian coordinates of a 3D point
U : homogeneous coordinates of a 3D point, such that U = (Ũ^⊤, 1)^⊤
Û : column vector obtained by stacking a set of Ũ_i
Θ : matrix obtained by horizontally stacking a set of U_i
$\overline{\widehat{U}}$ : upper bounds of the (x, y, z) coordinates of Û
$\underline{\widehat{U}}$ : lower bounds of the (x, y, z) coordinates of Û
ũ : Cartesian coordinates of an image point
u : homogeneous coordinates of an image point, such that u = (ũ^⊤, 1)^⊤
û : column vector obtained by stacking a set of ũ_i
≃ : equality up to scale
‖·‖ : the Euclidean norm
R^m : real space of dimension m
σ : standard deviation of Gaussian noise
ξ : fraction of outliers
μ_v : mean vector of random vector v
Σ_v : covariance matrix of random vector v
J_ϕ : Jacobian matrix of mapping ϕ
χ²_r : chi-square distribution with r degrees of freedom
χ_r : chi distribution with r degrees of freedom
J(·) : objective function
P^(g) : the g-th generation of the population in DE optimization
[v]_i : the i-th element of vector v
[A]_i : the i-th row vector of matrix A
[A]_{i,j} : the element of matrix A in the i-th row and j-th column
A⁺ : the pseudo-inverse of matrix A
[a]_× : the skew-symmetric matrix associated with the cross product of vector a



where F_{i3} (i = 1, 2) stands for the fundamental matrix between camera P₃ and camera P_i. By setting

$$\hat{u} = \begin{pmatrix} \tilde{u} \\ \tilde{u}' \end{pmatrix} = (u, v, u', v')^{\top}, \quad (6)$$

ψ can be rewritten as

$$\phi : \hat{u} \mapsto u'' = \frac{\vec{A} \times \vec{B}}{C(\hat{u})}, \quad (7)$$

where $\vec{A} = (A_1(\hat{u}), A_2(\hat{u}), A_3(\hat{u}))^{\top}$ and $\vec{B} = (B_1(\hat{u}), B_2(\hat{u}), B_3(\hat{u}))^{\top}$ are vector-valued functions of û, and C(û) is a scalar-valued function of û. In order to characterize the uncertainty of u″, an analytical first-order approximation of the covariance of the epipolar transfer is derived in this paper.

Theorem 1 (Hartley and Zisserman [1]). Let v be a random vector in R^m with mean μ_v and covariance matrix Σ_v, and let φ : R^m → R^n be differentiable in a neighborhood of v. Then φ(v) is a random vector in R^n with mean φ(μ_v) and covariance matrix J_φ Σ_v J_φ^⊤, where J_φ is the Jacobian matrix of φ evaluated at μ_v.

For the mapping ϕ given by Eq. (7), Theorem 1 characterizes the uncertainty of the epipolar transfer based on the first-order approximation of the covariance matrix. Since ϕ involves the computation of a cross product, it is not straightforward to calculate the Jacobian matrix J_ϕ. For completeness, we present the following proposition to facilitate the computation of J_ϕ.

Proposition 1. The Jacobian matrix of the mapping defined by Eq. (7) can be computed as

$$J_\phi = \frac{1}{C}[\vec{A}]_{\times}\frac{\partial \vec{B}}{\partial \hat{u}} - \frac{1}{C}[\vec{B}]_{\times}\frac{\partial \vec{A}}{\partial \hat{u}} - \frac{1}{C^{2}}\frac{\partial C}{\partial \hat{u}}(\vec{A} \times \vec{B}), \quad (8)$$

where [·]_× represents the skew-symmetric matrix associated with the cross product, i.e., for any 3-vectors a, b the relationship a × b = [a]_× b holds.

The proof of Proposition 1 is given in Appendix A. Assume that the image measurements ũ, ũ′ are contaminated by 2-dimensional zero-mean Gaussian noise and that the covariance matrices are

$$\Sigma_{\tilde{u}} = \Sigma_{\tilde{u}'} = \begin{pmatrix} \sigma^{2} & 0 \\ 0 & \sigma^{2} \end{pmatrix}. \quad (9)$$

Then the covariance matrix of the vector û is given by

$$\Sigma_{\hat{u}} = \begin{pmatrix} \sigma^{2} & 0 & 0 & 0 \\ 0 & \sigma^{2} & 0 & 0 \\ 0 & 0 & \sigma^{2} & 0 \\ 0 & 0 & 0 & \sigma^{2} \end{pmatrix}. \quad (10)$$

Given both Σ_û and J_ϕ, Theorem 1 allows us to compute the mean value and the covariance matrix of u″ as

$$\mu_{u''} = \phi(\hat{u}) \quad (11)$$

and

$$\Sigma_{u''} = J_\phi \Sigma_{\hat{u}} J_\phi^{\top} = \sigma^{2} J_\phi J_\phi^{\top}, \quad (12)$$

respectively. By analyzing the measured u″ and the computed μ_{u″}, Σ_{u″}, the scale of noise can be determined. The details are presented in the following subsection.
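The propagation of Eqs. (11) and (12) can be reproduced numerically. The sketch below (numpy assumed; a central finite-difference Jacobian is used in place of the closed-form J_ϕ of Proposition 1, and all names are illustrative) transfers a correspondence (u, u′) to the third view and returns the first-order mean and covariance of u″.

```python
import numpy as np

def epipolar_transfer(F13, F23, u, up):
    """Transfer the 2D correspondence u <-> u' to the third view (Eq. (5))."""
    l1 = F13 @ np.append(u, 1.0)         # epipolar line of u in view 3
    l2 = F23 @ np.append(up, 1.0)        # epipolar line of u' in view 3
    x = np.cross(l1, l2)                 # intersection of the two lines
    return x[:2] / x[2]                  # inhomogeneous u''

def transfer_mean_cov(F13, F23, u, up, sigma, eps=1e-4):
    """First-order mean and covariance of u'' (Eqs. (11)-(12)),
    with a finite-difference Jacobian standing in for Proposition 1."""
    uu = np.hstack([u, up])              # stacked measurement (Eq. (6))
    f = lambda v: epipolar_transfer(F13, F23, v[:2], v[2:])
    J = np.zeros((2, 4))
    for k in range(4):
        d = np.zeros(4)
        d[k] = eps
        J[:, k] = (f(uu + d) - f(uu - d)) / (2.0 * eps)
    return f(uu), sigma ** 2 * J @ J.T   # mu_{u''}, Sigma_{u''}
```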

4.1.2. Residual consensus-based noise scale estimation

Before proceeding any further, let us introduce Theorem 2 and Proposition 2.

Theorem 2 (Hartley and Zisserman [1]). If v is a Gaussian random vector with mean μ_v and covariance matrix Σ_v, then the squared Mahalanobis distance (v − μ_v)^⊤ Σ_v^+ (v − μ_v) follows a χ²_r distribution, where Σ_v^+ is the pseudo-inverse of the covariance matrix Σ_v and r = rank(Σ_v).

Fig. 1. The influence of the ratio σ̂/σ (used noise scale to ground-truth noise scale) on (a) the root mean square reprojection error (RMS2D), (b) the root mean square 3D error (RMS3D) and (c) the number of successfully reconstructed 3D points, for ground-truth noise levels σ = 0.6, 1.2, 1.8, 2.4 and 3.0 pixels. The synthetic data used in this experiment are generated in the "bunny" setup (see Section 5), where 10 cameras and 512 3D points are simulated and 20% of the image measurements are randomly perturbed by up to 100 pixels. Outlier removal by L∞ minimization [6] with the specified noise scale σ̂, followed by bundle adjustment, is used to perform the triangulation.

Fig. 2. Uncertainty propagation via epipolar transfer: in the absence of noise, a triplet of image correspondences u ↔ u′ ↔ u″ strictly follows the epipolar transfer, i.e., u″ lies exactly at the intersection of the epipolar lines F₁₃u and F₂₃u′. For a triplet of image correspondences u ↔ u′ ↔ u″ contaminated with noise, the transferred image point ψ(u, u′) usually does not correspond to u″. Under the assumption of Gaussian noise, the uncertainty of ψ(u, u′) is analytically characterized in this paper.

Proposition 2. Let {u_i} ↔ {u′_i} ↔ {u″_i} (i = 1, …, n) be a set of n triplets of corresponding image points and t′_i the Mahalanobis distance between u″_i and ϕ(û_i). Under the assumption of zero-mean Gaussian noise, the scale of noise σ in the image measurements can be approximated by

$$\sigma = \frac{1}{\arg\max_{t}\{f_{md}(t)\}}, \quad (13)$$

where f_md(t) is the probability density function (pdf) of the distribution of the scaled Mahalanobis distances {t′_i/σ}.

Proof. Let us denote the squared Mahalanobis distance between u″_i and ϕ(û_i) as

$$t_i'^{2} = (u''_i - \phi(\hat{u}_i))^{\top}\,\Sigma^{+}_{\hat{u}_i}\,(u''_i - \phi(\hat{u}_i)) = \sigma^{2}\,(u''_i - \phi(\hat{u}_i))^{\top}\,(J_\phi J_\phi^{\top})^{+}\,(u''_i - \phi(\hat{u}_i)). \quad (14)$$

According to Theorem 2, {t_i'²} follows a χ²_r distribution. Thus {t_i'} follows a χ_r distribution with pdf

$$f_{\chi}(t', r) = \frac{t'\,e^{-t'^{2}/2}}{\Gamma(r/2)}, \quad (15)$$

where Γ(·) is the Gamma function. In our problem, r (the rank of Σ_{u″}) is equal to 2 because [u″]₃ ≡ 1. Since the scale of noise σ in Eq. (14) is unknown, t_i' cannot be calculated. So, dividing both sides of Eq. (14) by σ² and taking the square root of both sides, we get the scaled Mahalanobis distance:

$$t_i = \frac{t_i'}{\sigma} = \sqrt{(u''_i - \phi(\hat{u}_i))^{\top}\,(J_\phi J_\phi^{\top})^{+}\,(u''_i - \phi(\hat{u}_i))}. \quad (16)$$

Note that the right side of Eq. (16) does not depend on σ and can be calculated by Eq. (7) and Proposition 1. Since σ is constant for a given data set, the pdf f_md of the distribution of {t_i} can be calculated using the basic change-of-variable formula in statistics [36], which results in

$$f_{md}(t) = \sigma f_{\chi}(\sigma t, r) = \sigma^{2} t\, e^{-(\sigma t)^{2}/2}. \quad (17)$$

It is easy to verify that f_md(t) has a single peak at 1/σ, which gives the result of Proposition 2. □

The basic idea behind our noise scale estimation algorithm is to locate the peak of the fitted distribution of {t_i}, which can be calculated by Eq. (16); the noise scale σ can then be estimated. Given a set of scaled Mahalanobis residuals {t_i | i = 1, …, n_s}, we use the Kernel Density Estimation (KDE) algorithm to fit their distribution. The kernel density estimated at t is computed as

$$\hat{f}_{md}(t) = \frac{1}{n_s}\sum_{j=1}^{N_c} K\!\left(\frac{t - t_j}{h}\right), \quad (18)$$

where K(·) is the kernel function and h the bandwidth. While many kernels have been proposed in the literature, in this paper we use the commonly used Epanechnikov kernel [37]:

$$K(\tau) = \begin{cases} \frac{3}{4}\,(1 - \|\tau\|^{2}) & \text{if } \|\tau\| \le 1, \\ 0 & \text{otherwise.} \end{cases} \quad (19)$$

The Epanechnikov kernel is an optimal kernel in the sense of the minimum mean integrated square error (MISE) [37]. The same bandwidth is used as in [35]:

$$h = \left(\frac{243\int_{-1}^{1} K(\tau)^{2}\,d\tau}{35\,N_c\left(\int_{-1}^{1} \tau^{2} K(\tau)\,d\tau\right)^{2}}\right)^{1/5} \hat{s}_0, \quad (20)$$

where ŝ₀ is a rough estimate of the noise scale, which can be obtained by existing algorithms such as the median estimator [38]. Once the density function is obtained, we apply a linear search with a sufficiently small step size (e.g., 0.01) to locate the first peak of the estimated density function. Other techniques, such as the mean-shift valley (MSV) [35], can also be used to locate the peak. However, we found that the linear search is fast enough for our problem. An example of the residuals for the data set "Chair" is shown in Fig. 3.

In practice, the input data set may contain a large number of image correspondences. We incorporate the algorithms presented in the aforementioned subsections into a random sampling procedure. The cost function to minimize is defined as

$$\mathcal{J}_{\sigma} = \int_{0}^{\infty}\left(f_{\chi}(t, 2) - \hat{f}_{md}\!\left(\frac{t}{\hat{\sigma}}\right)\right)^{2} dt, \quad (21)$$

where f̂_md(·) and σ̂ are the estimated density function and the corresponding noise scale, respectively. Such a scheme reduces the computational complexity, especially for large-scale data sets. As the number of observations for each 3D point varies, the i-th (1 ≤ i ≤ n) 3D point is selected with probability

$$p_{st}(i) = \frac{n_v^{i}}{\sum_{i=1}^{n} n_v^{i}} \quad (22)$$

in our sampling procedure, where n_v^i is the number of image observations of the i-th 3D point. The noise scale estimation algorithm is summarized in Algorithm 1.

Algorithm 1. Noise scale estimation for triangulation based on residual consensus.

input: camera matrices and image measurements with correspondences
output: the scale of noise σ̂

Initialize maxIter, maxSamp;
Let iter ← 1; σ̂ ← 0; min J_σ ← ∞;
repeat
    for i ← 1 to maxSamp do
        Select a 3D point j with probability p_st(j) defined by Eq. (22);
        Randomly select three views from all the views in which the j-th 3D point is visible;
        Calculate the scaled Mahalanobis distance t_i according to Eq. (16);
    end
    Estimate the pdf f̂_md(t) of the distribution of the scaled Mahalanobis distances {t_i | i = 1, …, N} using Eq. (18);
    Estimate the scale of noise σ̃ based on f̂_md(t) according to Proposition 2;
    Calculate the cost function J_σ defined by Eq. (21);
    if J_σ < min J_σ then
        min J_σ ← J_σ;
        σ̂ ← σ̃;
    end
    iter ← iter + 1;
until iter > maxIter;
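The density-fitting and peak-location steps of Algorithm 1 can be prototyped in a few lines. The Python sketch below (numpy assumed; names are illustrative) takes the scaled residuals t_i of Eq. (16), evaluates the Epanechnikov kernel density of Eqs. (18)-(19) on a grid, and returns σ̂ = 1/t_peak as in Eq. (13); the bandwidth h is passed in rather than derived from Eq. (20).

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel of Eq. (19)."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def estimate_noise_scale(t_scaled, h, grid_step=0.01):
    """Noise scale from scaled Mahalanobis residuals (Proposition 2).

    t_scaled : residuals t_i of Eq. (16); h : kernel bandwidth (cf. Eq. (20)).
    Returns (sigma_hat, t_peak).
    """
    t_scaled = np.asarray(t_scaled, dtype=float)
    grid = np.arange(grid_step, t_scaled.max() + grid_step, grid_step)
    # kernel density estimate of Eq. (18), evaluated on the grid
    dens = epanechnikov((grid[:, None] - t_scaled[None, :]) / h).sum(axis=1) / len(t_scaled)
    t_peak = grid[np.argmax(dens)]       # density maximum (located by linear search in the paper)
    return 1.0 / t_peak, t_peak
```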



4.2. Calculating the error bounds of the 3D structure

Unlike conventional parameter estimation tasks, where the range of each parameter is known a priori (e.g., the rotation space SO(3) can be parameterized by unit quaternions), the range of the coordinates of a 3D point in Euclidean space R³ can be unpredictably large in practice. A readily available approach for error bounding in triangulation is based on Interval Analysis (IA) [39], proposed by Farenzena and Fusiello [40]. Two limitations of the IA-based approach are as follows: (1) it is inefficient for large-scale problems and (2) it cannot yield valid intervals in the presence of outliers. To improve the efficiency and establish a good initialization, we incorporate two error bounding algorithms (one with a theoretical guarantee of correctness and one approximate) into our framework. In the first step, we select an outlier-free subset of image measurements. In the second step, the outlier-free image measurements and the corresponding camera matrices are used to calculate a bounding box for each 3D point.

4.2.1. Outlier-free subset selection of image observations

For each 3D point, three observations are drawn randomly from the image measurements, and we then test whether these samples are all inliers. The above process is repeated until an outlier-free subset is found or the maximum number of iterations has been reached, in which case the 3D point is removed. For completeness, two definitions from [20] are presented in the following.

Definition 1. A k-dimensional second-order (Lorentz) cone Q_k is by definition a cone of the form

$$Q_k = \{\, x = (x_1, x_2) \in \mathbb{R}^{1} \times \mathbb{R}^{k-1} \;\mid\; \|x_2\| \le x_1 \,\}, \quad (23)$$

where ‖·‖ refers to the standard Euclidean norm.

Definition 2. A second-order cone program (SOCP) is an optimization problem of the form

$$\begin{aligned} \min\; & f^{\top} x \\ \text{s.t.}\; & \|A_i x + b_i\| \le c_i^{\top} x + d_i, \quad i = 1, \ldots, m_c, \\ & g_j^{\top} x = h_j, \quad j = 1, \ldots, n_c, \end{aligned} \quad (24)$$

where x, f, c_i, g_j ∈ R^n, d_i, h_j ∈ R, A_i ∈ R^{(n_i − 1)×n} and b_i ∈ R^{n_i − 1}.

The constraint ‖A_i x + b_i‖ ≤ c_i^⊤ x + d_i in Eq. (24) is called a second-order cone constraint, since it is equivalent to requiring the affine function (A_i x + b_i, c_i^⊤ x + d_i) to lie in a second-order cone in R^{n_i}. Recall that for the triangulation problem, the reprojection error (defined in Eq. (4)) of a given 3D point with coordinates U can be rewritten as

$$d(P_i, u_i, U) = \frac{\|(f_1(\tilde{U}), f_2(\tilde{U}))^{\top}\|}{f_3(\tilde{U})}, \quad (25)$$

where f₁(Ũ), f₂(Ũ) and f₃(Ũ) are affine functions of Ũ with coefficients determined by u_i and P_i.

Definition 3. Given a 3D point with homogeneous coordinates U observed by a camera with camera matrix P, the image measurement u is deemed an inlier iff u satisfies the constraint

$$d(P, u, U) \le \gamma, \quad (26)$$

where the threshold

$$\gamma = \sigma \sqrt{P_{\chi^2_2}(\alpha)} \quad (27)$$

depends on the selected confidence level α (α = 0.95 in this paper), and P_{χ²₂} is the cumulative distribution function (cdf) of the χ²₂ distribution.
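Numerically, Eq. (27) corresponds to the familiar two-degree-of-freedom inlier threshold. A small sketch (scipy assumed, and P_{χ²₂}(α) read as the α-quantile of the χ²₂ distribution) is:

```python
from math import sqrt
from scipy.stats import chi2

def inlier_threshold(sigma, alpha=0.95):
    """Threshold gamma of Eqs. (26)-(27): sigma times the square root of the
    alpha-quantile of the chi-square distribution with 2 degrees of freedom."""
    return sigma * sqrt(chi2.ppf(alpha, df=2))

# With alpha = 0.95 this gives gamma ~= 2.45 * sigma.
```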

According to Eq. (25), the constraint d(P_i, u_i, U) ≤ γ defines a second-order cone Q₃ for a fixed threshold γ. By gathering all the 3D points into a column vector and adding one auxiliary variable for each 3D point, we get the unknown vector

$$Y = \big(\underbrace{\tilde{U}_1^{\top}, \tilde{U}_2^{\top}, \ldots, \tilde{U}_n^{\top}}_{\widehat{U}^{\top}:\ \text{stacked 3D points}},\; \underbrace{s_1, s_2, \ldots, s_n}_{\text{auxiliary variables}}\big)^{\top}. \quad (28)$$

Denote the indices of the selected subset of observations for the i-th 3D point by I_i = {p_i^1, p_i^2, p_i^3}. The all-inlier test problem (AITP) is given as

$$(\text{AITP})\quad \begin{aligned} \min\; & \sum_{i=1}^{n} s_i \\ \text{s.t.}\; & \|A_{p_i^j, i} Y + b_{p_i^j, i}\| \le \gamma\,(c_{p_i^j, i}^{\top} Y + d_{p_i^j, i}) + s_i, \\ & \forall\, i \in \{1, 2, \ldots, n\},\ \forall\, j \in \{1, 2, 3\}, \end{aligned} \quad (29)$$

where A_{p_i^j, i}, b_{p_i^j, i}, c_{p_i^j, i} and d_{p_i^j, i} are constants such that

$$d(P_{p_i^j}, u_i^{p_i^j}, U_i) = \frac{\|A_{p_i^j, i} Y + b_{p_i^j, i}\|}{c_{p_i^j, i}^{\top} Y + d_{p_i^j, i}}, \quad (30)$$

where u_i^{p_i^j} represents the projection of the i-th 3D point in the p_i^j-th camera. The AITP (29) has the form of a standard SOCP and can be solved by well-developed numerical optimization tools such as SeDuMi [41], MOSEK [42] and SDPT3 [43]. In this paper, SDPT3 was used to solve all the SOCPs.
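A per-point analogue of the AITP can be prototyped with an off-the-shelf conic solver. The sketch below (cvxpy and numpy assumed; names are illustrative, and the slack is bounded below purely to keep the small SOCP bounded) builds the three cone constraints of Eq. (29) for one 3D point and checks whether the shared slack can be driven below zero, which by Proposition 3 certifies an outlier-free subset.

```python
import numpy as np
import cvxpy as cp

def all_inlier_test(Ps, us, gamma):
    """Single-point version of the AITP (Eq. (29)).

    Ps : list of three 3x4 camera matrices, us : list of three (u, v) points.
    Returns (outlier_free, X_feasible)."""
    X = cp.Variable(3)                       # Cartesian 3D point
    s = cp.Variable()                        # shared auxiliary slack
    cons = [s >= -1.0]                       # keep the problem bounded; only the sign of s matters
    for P, (u, v) in zip(Ps, us):
        r1 = u * P[2] - P[0]                 # coefficients of f1 in Eq. (25)
        r2 = v * P[2] - P[1]                 # coefficients of f2
        f1 = r1[:3] @ X + r1[3]
        f2 = r2[:3] @ X + r2[3]
        f3 = P[2, :3] @ X + P[2, 3]          # depth-like denominator f3
        cons.append(cp.norm(cp.hstack([f1, f2])) <= gamma * f3 + s)
    prob = cp.Problem(cp.Minimize(s), cons)
    prob.solve()
    ok = prob.status in ("optimal", "optimal_inaccurate") and s.value < 0
    return ok, (None if X.value is None else np.asarray(X.value))
```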

Proposition 3. Let s = (s₁, s₂, …, s_n) be the auxiliary variables extracted from the solution to the AITP (29). A component s_i < 0 (1 ≤ i ≤ n) indicates that the corresponding selected subset of image observations for the i-th 3D point is outlier-free.

Proof. Since s_i < 0, it follows that

$$\|A_{p_i^j, i} Y + b_{p_i^j, i}\| \le \gamma\,(c_{p_i^j, i}^{\top} Y + d_{p_i^j, i}) + s_i \le \gamma\,(c_{p_i^j, i}^{\top} Y + d_{p_i^j, i}) + 0, \quad (31)$$

∀ j ∈ {1, 2, 3}. Combining this with Eq. (30), we get

$$d(P_{p_i^j}, u_i^{p_i^j}, U_i) \le \gamma \quad (32)$$

∀ j ∈ {1, 2, 3}, which satisfies the constraint defined in Definition 3. □

Fig. 3. Illustration of noise scale estimation on the data set "Chair" (see Section 5). According to Proposition 2, a candidate noise scale is determined by the scaled Mahalanobis distance that maximizes its probability density f̂_md(t), calculated by the kernel density estimation algorithm; here σ̂ = 0.847.

The basic idea of introducing auxiliary variables is inspired by the outlier removal algorithm proposed by Seo et al. [5]. There are two differences between their algorithm and ours. The first is that we use only three image measurements for each 3D point, which results in a smaller scale and a very sparse structure of the coefficient matrix used in the AITP (29) (an example is shown in Fig. 4). This sparsity can be exploited to significantly improve the efficiency when solving the optimization problem (e.g., using SDPT3 [43]). The second difference is that we use the solution to the AITP (29) to select inliers rather than to remove outliers. Thus, our method does not risk removing inliers, as frequently happens in outlier removal algorithms [5].

In practice, image observations are usually corrupted by outliers, and solving the optimization problem defined in the AITP (29) can only pick out a number of 3D points with outlier-free subsets of image observations. The number of iterations required to find such an outlier-free subset depends on both the number of visible views and the fraction of outliers in the data. For a 3D point visible in n_v (≥ 3) views, the number of iterations n_it required to ensure with probability p_it that at least one of the sets of random samples is an outlier-free subset is given by [1]

$$n_{it} = \frac{\log(1 - p_{it})}{\log\big(1 - (1 - \xi)^{3}\big)}, \quad (33)$$

where ξ is the fraction of outliers in the image measurements. Considering that ξ is usually unknown beforehand, we assume that there are only three inliers for each 3D point (i.e., ξ = (n_v − 3)/n_v), which results in an over-estimated version of n_it:

$$n_{it} = \left\lceil 1 + \frac{\log(1 - p_{it})}{\log\big(1 - (3/n_v)^{3}\big)} \right\rceil, \quad (34)$$

where ⌈·⌉ denotes the ceiling operator. In this paper, p_it is set to 0.95 to ensure reliable results. Usually, only a few iterations are needed before an outlier-free subset is selected. However, the computational complexity increases dramatically in the case of a high fraction of outliers. In order to speed up the selection of outlier-free subsets, a pre-filtering operation is used to detect subsets containing outliers before constructing the problem defined in the AITP (29). In the pre-filtering stage, we compute the Mahalanobis distance between an image point and the transferred one; if this distance is larger than a certain threshold (e.g., 10σ̂), it is safe to conclude that the subset is corrupted by at least one outlier, and it is removed from the current iteration. Only when an outlier-free subset of three views does not exist, and after n_it random samplings have been performed without finding an all-inlier subset, is the 3D point removed from further processing in our framework.
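The over-estimated sample count of Eq. (34) is easy to evaluate; a small helper (illustrative) is shown below.

```python
import math

def num_random_samples(n_views, p_it=0.95):
    """Number of random 3-view samples n_it of Eq. (34), assuming
    pessimistically that only three observations of the point are inliers."""
    if n_views <= 3:
        return 1                             # only one possible triplet
    w = 3.0 / n_views                        # assumed inlier ratio 1 - xi
    return math.ceil(1 + math.log(1 - p_it) / math.log(1 - w ** 3))
```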

4.2.2. The error bounding algorithm

Based on the obtained outlier-free image observations, two new methods for error bounding are presented in this subsection. The first method is based on theoretical analysis and the second one is an approximation. In particular, for the theoretical method, the exact error bounds are calculated by solving 6 convex optimization problems (see Fig. 5). Without loss of generality, let us consider the exact error bounding problem (EEBP):

$$(\text{EEBP})\quad \begin{aligned} \min\; & f_0^{\top} \widehat{U} \\ \text{s.t.}\; & \|A_{p_i^j, i} \widehat{U} + b_{p_i^j, i}\| \le \gamma\,(c_{p_i^j, i}^{\top} \widehat{U} + d_{p_i^j, i}), \\ & \forall\, i \in \{1, 2, \ldots, n\},\ \forall\, j \in \{1, 2, 3\}, \end{aligned} \quad (35)$$

where Û is defined in Eq. (28), and f₀ is a 3n-dimensional vector whose [1 + 3(j − 1)]-th (1 ≤ j ≤ n) elements are 1 and whose other elements are 0. Let us denote the solution by Û*; then the lower x-coordinate bounding vector is given by

$$\underline{\widehat{U}}_x = \big([\widehat{U}^{*}]_1, [\widehat{U}^{*}]_4, \ldots, [\widehat{U}^{*}]_{3n-2}\big)^{\top}. \quad (36)$$

By replacing f₀ with −f₀ in the EEBP, we can get the upper x-coordinate bounding vector $\overline{\widehat{U}}_x$. We can calculate the lower and upper y- and z-coordinate bounds by solving another 4 SOCPs with similarly defined f₀. We denote all these error bounds by $\underline{\widehat{U}}_y$, $\overline{\widehat{U}}_y$, $\underline{\widehat{U}}_z$, $\overline{\widehat{U}}_z$.

Fig. 4. The structure of the coefficient matrix of the AITP (29). In this example, 8 3D points are observed by 10 cameras and 3 views per 3D point are randomly chosen, leading to a sparse coefficient matrix of dimension 32×72, where about 90% of its elements are zeros. Zero and non-zero elements are represented by hollow and solid dots, respectively.

Fig. 5. Illustration of the exact error bounding algorithm. The location of a 3D point is constrained to the intersection of three cones determined by three image observations. The bounding box of the 3D point is calculated by SOCP. See the text for details.
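For a single 3D point, the exact bounds of Eqs. (35)-(36) reduce to minimizing and maximizing each coordinate over the intersection of its three cones. A cvxpy sketch (assumed; names illustrative, and a bound may come back infinite if the cone intersection is unbounded) is:

```python
import numpy as np
import cvxpy as cp

def exact_bounding_box(Ps, us, gamma):
    """Per-point version of the EEBP (Eq. (35)): six small SOCPs give the
    lower and upper x, y, z bounds of the cone intersection."""
    X = cp.Variable(3)
    cons = []
    for P, (u, v) in zip(Ps, us):
        r1, r2 = u * P[2] - P[0], v * P[2] - P[1]
        f3 = P[2, :3] @ X + P[2, 3]
        cons.append(cp.norm(cp.hstack([r1[:3] @ X + r1[3],
                                       r2[:3] @ X + r2[3]])) <= gamma * f3)
    lower, upper = np.zeros(3), np.zeros(3)
    for k in range(3):
        lower[k] = cp.Problem(cp.Minimize(X[k]), cons).solve()
        upper[k] = cp.Problem(cp.Maximize(X[k]), cons).solve()
    return lower, upper
```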

Instead of calculating the exact error bounds, an approximation can be employed as a tradeoff between accuracy and efficiency. Recall that a number of outlier-free subsets of image measurements are found by solving the AITP (29). The solution to this problem also provides us with feasible 3D points. Let U_f be a feasible solution for a 3D point. We use a sphere with radius equal to the maximum perpendicular distance λ between the feasible 3D point and the cone to approximate the search space. Let u be the measured image point, u_f the projection of U_f in the image plane, and u_p the intersection of the cone and the ray originating from u_f and pointing towards u. As shown in Fig. 6, λ is the distance between U_f and the line-of-sight determined by the image point u_p. Let K and P be the intrinsic calibration matrix and the camera matrix, respectively; λ can be calculated as

$$\lambda = \|(I - Q)\,K^{-1} P\, U_f\|, \quad (37)$$

where Q is the line-of-sight projection matrix [44] defined as

$$Q = \frac{(K^{-1} u_p)(K^{-1} u_p)^{\top}}{(K^{-1} u_p)^{\top}(K^{-1} u_p)}. \quad (38)$$

For a subset of size three, min_{j=1}^{3} λ_j is used as the final radius of the bounding sphere, and the results are converted into bounding boxes. Although this approximation may result in an error bound that does not contain the true optimal solution, it is much more efficient than the exact error bounding, as demonstrated in our experiments.
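Evaluating the radius of Eq. (37) only needs a few matrix products; a numpy sketch (names illustrative) follows, where u_p is the homogeneous image point on the cone boundary described above.

```python
import numpy as np

def bounding_radius(K, P, U_f, u_p):
    """Perpendicular distance lambda of Eqs. (37)-(38) between the feasible
    point U_f (homogeneous 4-vector) and the line of sight through u_p."""
    Kinv = np.linalg.inv(K)
    d = Kinv @ u_p                           # direction of the line of sight
    Q = np.outer(d, d) / (d @ d)             # line-of-sight projection matrix, Eq. (38)
    x_cam = Kinv @ (P @ U_f)                 # point mapped into normalized camera coordinates
    return np.linalg.norm((np.eye(3) - Q) @ x_cam)
```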

4.3. A robust formulation of the direct triangulation problem

The direct triangulation problem refers to finding the optimal 3D points that lie within the error bounds obtained by the algorithms described in the preceding subsection. The optimality is in the sense of the minimal sum of squared reprojection errors. Note that outliers have not been removed from the image measurements so far, so instead of minimizing the objective function defined by Eq. (3), a robust formulation that can handle outliers and occlusions in the image measurements is required. The details of the new formulation are presented in the following.

To improve the efficiency of our framework, the complete set of image projections (m cameras and n 3D points) is gathered into a matrix equation:

$$\begin{pmatrix} \lambda_1^1 u_1^1 & - & \cdots & \lambda_n^1 u_n^1 \\ \lambda_1^2 u_1^2 & \lambda_2^2 u_2^2 & \cdots & - \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_1^m u_1^m & - & \cdots & \lambda_n^m u_n^m \end{pmatrix} = \begin{pmatrix} P_1 \\ \vdots \\ P_m \end{pmatrix} \Theta, \quad (39)$$

where λ_j^i (1 ≤ j ≤ n, 1 ≤ i ≤ m) is a scalar factor and u_j^i (1 ≤ j ≤ n, 1 ≤ i ≤ m) is the projection of the j-th 3D point by the i-th camera matrix P_i in homogeneous coordinates (with the third coordinate being 1). Θ is a parameter matrix obtained by stacking all of the homogeneous coordinates of the estimated 3D points, i.e., Θ = (U₁, U₂, …, U_n). The symbol "−" in the estimated image measurement matrix denotes that the corresponding 3D point is invisible in that view due to occlusion. Note that the locations of "−" vary with the data sets, and those shown in Eq. (39) are for illustration only. The Euclidean distance between the measured image points and the projection of Θ is

$$D_{\Theta} = \begin{pmatrix} \|u_1^1 - \bar{u}_1^1\| & - & \cdots & \|u_n^1 - \bar{u}_n^1\| \\ \|u_1^2 - \bar{u}_1^2\| & \|u_2^2 - \bar{u}_2^2\| & \cdots & - \\ \vdots & \vdots & \ddots & \vdots \\ \|u_1^m - \bar{u}_1^m\| & - & \cdots & \|u_n^m - \bar{u}_n^m\| \end{pmatrix}, \quad (40)$$

where ū_j^i denotes the projection of the estimated U_j by the i-th camera.

Since the 3D points are independent of each other, the columns of D_Θ are also independent. Inspired by the basic idea of MAP [13], a robust objective function can be explicitly written as a vector-valued function of Θ:

$$\mathcal{J}_{\Theta}(\{P_i\}, \{u_i\}, \Theta) = \begin{pmatrix} \sum_{1 \le i \le m}\big(T_{in} + \vartheta_1^i\,([D_\Theta]_{i,1} - T_{in})\big) \\ \sum_{1 \le i \le m}\big(T_{in} + \vartheta_2^i\,([D_\Theta]_{i,2} - T_{in})\big) \\ \vdots \\ \sum_{1 \le i \le m}\big(T_{in} + \vartheta_n^i\,([D_\Theta]_{i,n} - T_{in})\big) \end{pmatrix}, \quad (41)$$

where ϑ_j^i (1 ≤ j ≤ n, 1 ≤ i ≤ m) is a binary value determined by

$$\vartheta_j^i = \begin{cases} 0 & \text{if } [D_\Theta]_{i,j} \ge T_{in} \ \vee\ [D_\Theta]_{i,j} = -, \\ 1 & \text{otherwise.} \end{cases} \quad (42)$$

From the definition of J_Θ and ϑ_j^i, it is easy to see that outliers and occlusions in the image measurements are handled in the same way. The threshold T_in is set to σ̂ √(P_{χ²₂}(0.95)) in this paper. Now, the new robust formulation of the direct multi-view triangulation problem (RDTP) can be written as

$$(\text{RDTP})\quad \begin{aligned} \Theta^{*} = \arg\min_{\Theta}\; & \mathcal{J}_{\Theta}(\{P_i\}, \{u_i\}, \Theta) \\ \text{s.t.}\; & [\Theta]_1 \succeq \underline{\widehat{U}}_x,\quad [\Theta]_1 \preceq \overline{\widehat{U}}_x, \\ & [\Theta]_2 \succeq \underline{\widehat{U}}_y,\quad [\Theta]_2 \preceq \overline{\widehat{U}}_y, \\ & [\Theta]_3 \succeq \underline{\widehat{U}}_z,\quad [\Theta]_3 \preceq \overline{\widehat{U}}_z, \\ & [\Theta]_4 \equiv \mathbf{1}, \end{aligned} \quad (43)$$

where the symbols "≼" and "≽" stand for element-wise comparison between vectors (i.e., for two n-vectors a and b, a ≼ b ⇔ [a]_i ≤ [b]_i and a ≽ b ⇔ [a]_i ≥ [b]_i, ∀ i = 1, …, n).
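The robust cost of Eqs. (41)-(42) amounts to clipping the residual matrix at T_in and summing column-wise. A numpy sketch (names illustrative; a boolean visibility mask replaces the "−" entries) is:

```python
import numpy as np

def robust_objective(D, visible, T_in):
    """Vector-valued robust cost of Eq. (41).

    D : (m, n) reprojection distances (Eq. (40)); visible : (m, n) boolean
    mask, False where a point is occluded; T_in : inlier threshold.
    Occluded entries and entries with D >= T_in contribute exactly T_in."""
    theta = visible & (D < T_in)             # indicator of Eq. (42)
    contrib = np.where(theta, D, T_in)       # T_in + theta * (D - T_in)
    return contrib.sum(axis=0)               # one cost per 3D point (per column)
```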

Fig. 6. Illustration of the approximate error bounding algorithm. In this figure, U_f is a feasible solution for a 3D point. A sphere with radius equal to the maximum perpendicular distance λ between the feasible 3D point and the cone imposed by an image observation approximates the search space. See the text for details.

4.4. DE-based minimization process

The multi-view L2 triangulation problem is well known to be non-linear and non-convex. The RDTP (43) is even more complicated, since extra coefficients are used to form the robust formulation. Due to its inherently non-convex and non-linear nature, local methods require a good initial guess to converge to the global optimum for this kind of optimization problem. In this paper, we propose to apply Differential Evolution (DE) [23] to solve the optimization problem. DE is a simple yet powerful real-parameter global optimizer, especially suitable for non-differentiable problems. Similar to many other Evolutionary Algorithms (EAs), DE maintains a population of trial parameter vectors (matrices in this paper), and the fitness of the population is improved iteratively. DE consists of four stages, namely initialization, mutation, crossover and selection. The details are presented in the following, and the overall optimization process is summarized afterwards.

4.4.1. Initialization

Since the search space is bounded by the ranges calculated in Section 4.2, the initialization of the RDTP (43) can easily be realized by selecting a set of feasible trial matrices. Let O denote the feasible region of the parameter matrix Θ. O consists of all matrices S ∈ R^{4×n} (R^{4×n} denotes the space of all 4×n matrices) which satisfy the following:

i. [S]_{1,j} ∈ [ [$\underline{\widehat{U}}_x$]_j, [$\overline{\widehat{U}}_x$]_j ],
ii. [S]_{2,j} ∈ [ [$\underline{\widehat{U}}_y$]_j, [$\overline{\widehat{U}}_y$]_j ],
iii. [S]_{3,j} ∈ [ [$\underline{\widehat{U}}_z$]_j, [$\overline{\widehat{U}}_z$]_j ], and
iv. [S]_{4,j} ≡ 1,

∀ j ∈ {1, …, n}. Let N be the size of the population; the population of the g-th (g ≥ 0) generation is denoted by

$$\mathcal{P}^{(g)} = \{\Theta_1^{(g)}, \Theta_2^{(g)}, \ldots, \Theta_N^{(g)}\}. \quad (44)$$

The initial population P^(0) is generated by randomly selecting Θ_i^(0) ∈ O (i ∈ {1, …, N}). Note that since [Θ]₄ ≡ 1, the following mutation and crossover operations are not applied to this row vector.

4.4.2. Mutation

At each generation, a population of N trial matrices is produced by mutating and recombining trial matrices in the current population. The scheme for generating these trial matrices is one of the main differences between DE and classic EAs, and it forms the crucial idea behind DE. Basically, to produce a new trial matrix, the weighted difference between two trial matrices is added to a third one. We denote all trial matrices created by mutation at the g-th generation by

$$\mathcal{M}^{(g)} = \{\Phi_1^{(g)}, \Phi_2^{(g)}, \ldots, \Phi_N^{(g)}\}, \quad (45)$$

where Φ_i^(g) is obtained by combining three distinct, randomly chosen trial matrices from P^(g). Specifically,

$$\Phi_i^{(g)} = \Theta_{r_0}^{(g)} + F \cdot \big(\Theta_{r_1}^{(g)} - \Theta_{r_2}^{(g)}\big), \quad (46)$$

where the scale factor F ∈ (0, 1] is a positive real number used to control the rate of evolution, and the indices r₀, r₁, r₂ (r₀ ≠ r₁ ≠ r₂) of the trial matrices are randomly generated integers within the range [1, N]. For each trial matrix, r₀, r₁, r₂ are chosen anew.

4.4.3. Crossover

In order to enhance the potential diversity of the population, a crossover operation is employed in DE. We tested two popular crossover schemes, namely exponential crossover and binomial crossover [23], and found no significant difference between them for our problem. The classic binomial crossover is performed on each element of the trial matrix when certain conditions are met. At the g-th generation, the binomial crossover scheme builds a population of N trial matrices

$$\mathcal{C}^{(g)} = \{\Psi_1^{(g)}, \Psi_2^{(g)}, \ldots, \Psi_N^{(g)}\} \quad (47)$$

out of the parameter values copied from P^(g) and M^(g). The crossover operation sets each entry [Ψ_i^(g)]_{j,k} according to

$$[\Psi_i^{(g)}]_{j,k} = \begin{cases} [\Phi_i^{(g)}]_{j,k} & \text{if } rand \le C_r, \\ [\Theta_i^{(g)}]_{j,k} & \text{otherwise,} \end{cases} \quad (48)$$

∀ j ∈ {1, 2, 3}, ∀ k ∈ {1, …, n}, where rand is a random real number between 0 and 1, drawn anew for each entry. In this paper, the crossover operation can be compactly written as

$$\Psi_i^{(g)} = \Theta_i^{(g)} \circ (\mathbf{1} - M_c) + \Phi_i^{(g)} \circ M_c, \quad (49)$$

where the symbol "∘" stands for the Hadamard product (element-by-element matrix multiplication) between two matrices and 1 is a 3×n matrix with all its elements equal to 1. M_c is a 3×n binary matrix which is created anew for each crossover. The matrix M_c is controlled by a pre-defined crossover rate C_r ∈ [0, 1] as follows: for each entry [M_c]_{j,k}, a random real number r_{j,k} between 0 and 1 is generated; the entry is set to 1 if r_{j,k} ≤ C_r, and to 0 otherwise.

4.4.4. Selection

So far, we have obtained three populations P^(g), M^(g) and C^(g) at each generation. The selection operation determines which individuals survive to the next generation. As M^(g) is an intermediary population which is discarded, only the trial matrices in C^(g) compete with the population matrices in P^(g) of the same index.

Recall that the objective function J_Θ (see Eq. (41)) defined on a trial matrix Θ is a vector-valued function, and thus can be denoted by (J₁(Θ), J₂(Θ), …, J_n(Θ))^⊤. The goal of optimal triangulation is to find a trial matrix Θ* such that ∀ Φ ∈ O (Φ ≠ Θ*), ∀ i ∈ {1, …, n}: J_i(Θ*) ≤ J_i(Φ), and ∃ i ∈ {1, …, n}: J_i(Θ*) < J_i(Φ). Given two trial matrices Θ, Φ ∈ O, denote a set of indices by

$$I(\Theta, \Phi) = \{\, i \mid i \in \{1, \ldots, n\} \ \wedge\ \mathcal{J}_i(\Theta) < \mathcal{J}_i(\Phi) \,\}. \quad (50)$$

Then the selection operation can be described as

$$[\Theta_i^{(g+1)}]_{j,k} = \begin{cases} [\Theta_i^{(g)}]_{j,k} & \text{if } k \in I(\Theta_i^{(g)}, \Psi_i^{(g)}), \\ [\Psi_i^{(g)}]_{j,k} & \text{otherwise,} \end{cases} \quad (51)$$

∀ i ∈ {1, 2, …, N}, ∀ j ∈ {1, 2, 3}, ∀ k ∈ {1, 2, …, n}. Hereafter, we use Ω^(g) to denote the best trial matrix at the g-th generation; it follows that

$$[\Omega^{(g)}]_{j,k} = \big[\Theta^{(g+1)}_{\arg\min_i \mathcal{J}_k(\Theta_i^{(g+1)})}\big]_{j,k}, \quad (52)$$

∀ j ∈ {1, 2, 3}, ∀ k ∈ {1, 2, …, n}. The best trial matrix Ω^(g) is used to test the stopping criterion in the next subsection.
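One generation of the DE loop, i.e. the mutation of Eq. (46), the binomial crossover of Eq. (49) and the column-wise selection of Eq. (51), can be sketched as follows (numpy assumed; `objective` returns the n-vector of Eq. (41) for a 4×n trial matrix, `lower` and `upper` are the 3×n bounds of Section 4.2, and the clipping of mutants to those bounds is a practical simplification rather than part of the paper's algorithm).

```python
import numpy as np

def de_generation(pop, objective, lower, upper, F=0.8, Cr=0.95, rng=np.random):
    """Advance a DE population of 4 x n trial matrices by one generation."""
    N, n = len(pop), pop[0].shape[1]
    costs = [objective(T) for T in pop]                 # per-column costs, Eq. (41)
    new_pop = []
    for i, target in enumerate(pop):
        r0, r1, r2 = rng.choice([r for r in range(N) if r != i], 3, replace=False)
        mutant = target.copy()
        mutant[:3] = pop[r0][:3] + F * (pop[r1][:3] - pop[r2][:3])    # Eq. (46)
        mutant[:3] = np.clip(mutant[:3], lower, upper)  # keep mutants inside the error bounds
        mask = rng.random((3, n)) <= Cr                 # binomial crossover mask, Eq. (49)
        trial = target.copy()
        trial[:3][mask] = mutant[:3][mask]
        better = objective(trial) < costs[i]            # compare per 3D point
        child = target.copy()
        child[:3][:, better] = trial[:3][:, better]     # column-wise selection, Eq. (51)
        new_pop.append(child)
    return new_pop
```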

4.4.5. The termination criterion and minimization process

The most commonly used stopping criterion in DE is limiting the maximum number of generations. For our problem, since the number of views and the noise scale vary with the data sets, a suitable number of generations to use as a stopping criterion cannot be easily determined. Therefore, we choose to apply a statistics-based stopping criterion to each 3D point, resulting in a dynamic population structure. At each generation g, we check the condition

$$\mathcal{J}_i(\Omega^{(g)}) - \mathcal{J}_i(\Omega^{(g - N_t)}) < \delta \quad (53)$$

for i = 1, …, n, where δ is a small threshold and N_t is the number of generations to analyze; both control the convergence rate. The i-th column vector for which Eq. (53) holds is removed from all populations. The work flow of the minimization process is summarized in Algorithm 2.



Algorithm 2. Robust direct multi-view triangulation using DEoptimization.

5. Experiments

In order to evaluate the proposed framework, we implemented ourframework in Matlab with SDPT3 [43] and conducted extensiveexperiments. All the experiments were carried out on a machine with2.53 GHz Intel duo core processors and 2 GB of RAM runningWindows 7 operation system. Unless stated otherwise, the followingDE settings were used: the mutation rate F¼0.8, the crossover rateCr¼0.95, size of population N¼20 and δ¼ 0:001; Nt ¼ 20 for thestopping criterion. Three types of data sets including synthetic data,publicly available real image sequences and image sequences capturedby ourselves were used in our experiments. Representative resultsfrom our experiments and comparison with existing algorithms arepresented in this section.

5.1. Experiments with synthetic data

The use of synthetic data offers great flexibility in generatingdata sets under various configurations and also provides thecorresponding ground truth, which facilitates quantitative evalua-tion. In this subsection, the experimental results on a set ofsynthetic data are presented and discussed in detail.

5.1.1. Experimental setup and data setsSynthetic data sets used in this section were generated by a virtual

setup consisting of several simulated cameras mounted around the 3Dmodel “bunny” [45]. The images captured by the simulated camerasrecord the projections of the 3D points of “bunny”. Various config-urations based on this setup were studied by changing the number ofsimulated cameras and 3D points, and adding varying amount of zeromean Gaussian noise as well as varying fraction of outliers to thesynthetic data sets. The resolution of the images captured in this setupis roughly 1000�1000 pixels. An visualization of the experimentalsetup and one image captured in this setup combined with trackedfeatures are shown in Fig. 7.

5.1.2. Accuracy of noise scale estimatorSince the proposed noise scale estimator involves characteriz-

ing the uncertainty of epipolar transfer, the accuracy of the closed-form covariance is studied and compared to that computed using astatistical method approximating the covariance according to thelaws of large numbers [46]. Specifically, in the statistical method,the mean μu″ is approximated by the discrete mean of a suffi-ciently large number Nd of samples defined by

Ed½u″� ¼1Nd

∑Nd

i ¼ 1ui″ ð54Þ

and the corresponding covariance Σu″ by

Covdðu″Þ ¼ Ed½ðui″�Ed½u″�Þðui″�Ed½u″�Þ> �; ð55Þwhere Nd¼1500 in our experiments. To facilitate both the com-parison and the graphical visualization of the estimated covar-iance, the concept of the k-hyper-ellipsoid [46] of uncertainty isused. For any scalar k, the probability that u″ lies inside the k-hyper-ellipsoid defined by the equation

ðu″�μu″Þ>Σu″ðu″�μu″Þ ¼ k2 ð56Þis equal to Pχ2

2ðkÞ. Thus, given a k ð0rkr1Þ, an ellipse can be drawn.

For a triplet of noise-free image correspondences, Gaussian noise withthe same standard deviation is added to the coordinates and thecontaminated ui″ ði¼ 1;…;1500Þ are used in the statistical method.For our analytical method, the covariance is calculated based on asingle triplet of contaminated image correspondences. To demonstratethe stability of the analytical method, 50 runs were performed basedon different instances of contamination. The estimated covarianceusing the statistical method and by our analytical method are shownin Fig. 8, where k is set to 0.75 so that ideally 75% of the transferredimage points ui″ should lie inside the ellipse. From the results in Fig. 8,we can see that the covariance estimated by our algorithm is quiteclose to that estimated by the statistical method, which confirms thatthe first order approximation is accurate enough for our problem.

The performance of our proposed noise scale estimator is studied statistically on synthetic data sets contaminated by different amounts of noise and outliers.

Input: Camera matrices and image measurements with correspondences.
Output: Optimal estimates of sparse 3D points.

Estimate the scale of noise σ according to Algorithm 1;
Calculate the threshold γ defined in Eq. (27);
Calculate the error bounds of the 3D points using the algorithms proposed in Section 4.2;
Initialize the DE settings F, Cr, N, δ, Nt;
Let the index of generation g ← 0;
Initialize the population P(g) as described in Section 4.4.1;
// Evolutionary cycle
repeat
    for i ← 1 to N do
        Evaluate each individual Θ_i^(g) in P(g) by Eq. (41);
        Perform the mutation operation to produce Φ_i^(g) by Eq. (46);
        Perform the crossover operation to produce Ψ_i^(g) by Eq. (49);
    end
    Select the trial matrices Θ_i^(g) (i = 1, …, N) for the next generation P(g+1) by Eq. (51);
    Update the best trial matrix Ω^(g) by Eq. (52);
    Test the termination criteria for each 3D point by Eq. (53);
    Retrieve 3D points from terminated vectors;
    Remove terminated vectors from the trial matrices in P(g+1);
    Let g ← g + 1;
until all 3D points have been retrieved;
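The evolutionary cycle above is an instance of the standard DE/rand/1/bin scheme. The following simplified sketch shows that generic scheme for a single 3D point whose search space is the box given by its error bounds; it is a stand-in for the population-of-matrices formulation of Section 4.4 (Eqs. (41)–(53)), with a placeholder cost function and our own stopping test rather than the paper's exact criterion.

import numpy as np

def de_minimize(cost, lo, hi, F=0.8, Cr=0.95, N=20, delta=1e-3, Nt=20, max_gen=500, rng=0):
    """Generic DE/rand/1/bin within the box [lo, hi]; cost maps a candidate 3D point to a scalar."""
    rng = np.random.default_rng(rng)
    dim = len(lo)
    pop = rng.uniform(lo, hi, size=(N, dim))                  # initial population inside the error bounds
    f = np.array([cost(x) for x in pop])
    best_hist = []
    for g in range(max_gen):
        for i in range(N):
            a, b, c = pop[rng.choice([j for j in range(N) if j != i], 3, replace=False)]
            v = np.clip(a + F * (b - c), lo, hi)              # mutation, clipped to stay inside the bounds
            mask = rng.random(dim) < Cr
            mask[rng.integers(dim)] = True                    # binomial crossover (at least one component taken)
            u = np.where(mask, v, pop[i])
            fu = cost(u)
            if fu <= f[i]:                                    # greedy selection
                pop[i], f[i] = u, fu
        best_hist.append(f.min())
        if len(best_hist) > Nt and best_hist[-Nt - 1] - best_hist[-1] < delta:
            break                                             # no significant improvement over Nt generations
    return pop[f.argmin()], f.min()

In the context of the pipeline above, cost would play the role of the objective of Eq. (41) and [lo, hi] would be the per-point error bounds computed in Section 4.2.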


Fig. 7. (a) The "bunny" setup used to generate the synthetic data sets; different configurations consisting of varying numbers of cameras and 3D points are tested in our experiments. (b) One image (gray solid dots) captured by a simulated camera together with 100 feature tracks, with colored dots denoting the locations of the corresponding image points, for an outlier-free configuration, and (c) for a configuration with 10% outliers.

Fig. 8. Visualization of the covariance of epipolar transfer estimated by the statistical method and by our analytical algorithm using the k-hyper-ellipsoid (k = 0.75) [46], for noise standard deviations σ = 0.5, 2 and 3.5 pixels. The gray dots are randomly generated image points used for the statistical method. The figures in (a), (b) and (c) are centered at the true mean value; the figures in (d), (e) and (f) are centered at the transferred image point. The analytical method was run on 50 instances.

Fig. 9. Statistics of the accuracy of the estimated noise scale on synthetic data sets with different amounts of noise and outliers (absolute error Δσ in pixels versus ground truth std. σ from 0.5 to 3.5 pixels, for outlier fractions ξ = 15%, 30% and 45%). Each box plot visualizes Δσ calculated from 50 instances under a certain level of noise and fraction of outliers. The line within the box denotes the median; the bottom and the top of the box represent the 25th and 75th percentiles, respectively.


Fig. 10. Illustration of error bounding (512 3D points, σ = 1.5 pixels). The search space is constrained to (a) the bounding boxes calculated by the exact bounding algorithm and (b) the approximate error bounds. Both algorithms provide valid error bounds for use in the DE-based minimization.

Fig. 11. Illustration of the DE minimization in our framework. Each reconstructed 3D point (solid dot) is connected to its ground truth (hollow circle) by a line segment. With (a) Nt = 5 and (b) Nt = 10, DE fails to converge to the ground truth value for a noticeable number of 3D points due to premature termination, which can be improved by setting a higher (c) Nt = 15.

Fig. 12. Optimality verification. The two plots show, for each 3D point, the difference between the RMS2D obtained by the provably optimal algorithm (L2, L2) and by our method. (L2, L2) performs better than ours when (a) Nt = 10, and our method successfully found the optimal solution for all 3D points when (b) Nt = 20.

Fig. 13. Comparison of the average CPU time (s) for triangulating a 3D point versus the number of views (3–10), for noise standard deviations σ = 0.5 and σ = 3.5 pixels; the curves compare our method ("New") with the three branch-and-bound algorithms of [9]. The time complexity of our method is not sensitive to the number of views, which enables our framework to handle large-scale problems.


In this experiment, 512 3D points were projected by 10 cameras, and maxIter and maxSamp in Algorithm 1 were set to 10 and 1500, respectively. Outliers were added by randomly perturbing the image points by up to 100 pixels. For each level of noise and fraction of outliers, 50 instances were generated; the absolute errors of the estimated noise scale, Δσ = |σ̂ − σ|, were recorded and are summarized in Fig. 9. As can be seen, the absolute error Δσ of the estimated noise scale ranges from less than 0.1 pixels to about 0.4 pixels for the majority of the tested instances.

As more outliers are added to the data, no dramatic change in Δσ occurs, which implies that our method is robust to outliers.
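This experiment can be summarized programmatically as follows. The sketch assumes that an estimator implementing Algorithm 1 and a synthetic-data generator such as the one sketched in Section 5.1.1 are available and are passed in as functions; all names are ours.

import numpy as np

def delta_sigma_stats(estimator, make_data, sigmas, outlier_frac, runs=50, rng=0):
    """Summarize the absolute noise-scale error |sigma_hat - sigma| over repeated synthetic instances."""
    rng = np.random.default_rng(rng)
    summary = {}
    for sigma in sigmas:
        errs = []
        for _ in range(runs):
            cameras, observations = make_data(sigma=sigma, outlier_frac=outlier_frac, rng=rng)
            errs.append(abs(estimator(cameras, observations) - sigma))
        summary[sigma] = np.percentile(errs, [25, 50, 75])    # box-plot quartiles, as reported in Fig. 9
    return summary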

5.1.3. 3D reconstruction using synthetic data

As detailed in Section 4, our proposed framework consists of three key components, namely noise scale estimation, error bounds estimation and the DE minimization. Once the noise scale is determined, it can be used to select outlier-free subsets of observations and to calculate the corresponding error bounds. Illustrations of error bounding and minimization in our workflow are shown in Figs. 10 and 11, respectively.

In order to verify the optimality of our method, the results obtained by our method are compared with those obtained by three provably optimal algorithms based on branch-and-bound (denoted as (L2, L2), (L2, L1) and (L1, L1) as in the original paper) [9]. Because these algorithms are not robust to outliers, our comparison with them is limited to the outlier-free data set only. In this experiment, 8704 3D points are randomly selected from the "bunny" model and each 3D point is observed by at least 3 and up to 10 cameras.

Table 2. Comparison of the accuracy of 3D reconstruction using different noise scales. The mean and the standard deviation are computed over 20 runs. "Adpt." denotes the noise scale estimated by our method. The best results are highlighted in bold.

Ground truth std. σ   Noise scale used   Mean RMS2D ± Std. Dev.   Mean RMS3D ± Std. Dev.
rand(0.5, 2.5)        0.5                0.716 ± 0.031            3.942 ± 1.691
                      1.0                1.408 ± 0.237            3.209 ± 1.664
                      2.0                2.075 ± 0.647            3.219 ± 1.413
                      3.0                2.181 ± 0.681            4.648 ± 1.513
                      Adpt.              2.106 ± 0.719            2.521 ± 1.350
rand(2.5, 5.0)        2.0                3.015 ± 0.069            6.343 ± 2.195
                      3.0                3.933 ± 0.447            5.446 ± 2.430
                      4.0                4.243 ± 0.727            6.165 ± 2.328
                      5.0                4.324 ± 0.811            6.854 ± 2.750
                      Adpt.              4.214 ± 0.796            4.693 ± 1.623
rand(0.5, 5.0)        0.5                0.704 ± 0.023            7.798 ± 3.050
                      2.0                2.758 ± 0.558            6.058 ± 2.373
                      3.5                3.793 ± 1.204            5.448 ± 1.439
                      5.0                4.019 ± 1.411            6.592 ± 2.204
                      Adpt.              3.876 ± 1.388            4.483 ± 1.978

Fig. 14. Comparison of the CPU time for outlier-free subset selection with and without pre-filtering under varying ξ (fraction of outliers, from 0.05 to 0.70) on a synthetic configuration containing 30 cameras and 5000 3D points with noise standard deviation σ = 1.0 pixels. Each plot shows the number of remaining tracks versus time (seconds), where "remaining tracks" means the number of tracks on which the outlier-free subset selection still needs to be performed.

Table 3. Synthetic data sets used to evaluate the overall performance of triangulation pipelines. σ = 1 for all data sets.

Data set   #Views   #3D points   #Inlier observations   #Outlier observations (%)
SD1        10       128          900                    380 (≈30%)
SD2        10       512          3590                   1530 (≈30%)
SD3        10       2048         14,340                 6140 (≈30%)
SD4        20       4096         49,160                 32,768 (≈40%)
SD5        30       8192         122,880                122,880 (≈50%)
SD6        40       12,288       196,610                294,912 (≈60%)


Gaussian noise with a standard deviation of 1.0 pixels was added to the image measurements. The maximum number of iterations for the comparative algorithms was set to 100. The comparison of the root mean square reprojection error (RMS2D) between our method and (L2, L2) is shown in Fig. 12 ((L2, L2) is used because it consistently yields a lower RMS2D than (L2, L1) and (L1, L1)). Our method successfully finds the globally optimal solution for all 3D points when Nt is set to 20. The computational complexity of our method, as shown in Fig. 13, is significantly lower than that of the three algorithms based on branch-and-bound.

In practice, image measurements are always corrupted by outliers, so a threshold needs to be specified for all robust triangulation methods. The next experiment investigates the influence of the noise scale on 3D reconstruction and verifies that higher accuracy of the 3D structure can be achieved by using our estimated noise scale. To examine the influence of the noise scale statistically, three sets of data were used in this experiment. Each set of data consists of 20 instances, where 512 3D points are observed by 10 cameras. Zero mean Gaussian noise with a standard deviation randomly chosen from a given range is added to the data, and 20% of the image measurements are perturbed by up to 100 pixels. The results obtained using the noise scale estimated by our method (denoted as "Adpt.") are compared with those using fixed noise scales. As shown in Table 2, although a lower reprojection error can be achieved by using a smaller noise scale, using the adaptive noise scale consistently results in a lower root mean square error of the coordinates of the 3D points (RMS3D).
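For completeness, one plausible reading of the two accuracy measures used in Table 2 is sketched below: RMS2D as the root mean square reprojection error of a point over its observations, and RMS3D as the root mean square Euclidean distance between the reconstructed and ground-truth 3D coordinates. The function names are ours.

import numpy as np

def rms2d(P_list, u_list, X):
    """RMS reprojection error (pixels) of one 3D point X over its observations (2-vectors u)."""
    errs = []
    for P, u in zip(P_list, u_list):
        x = P @ np.append(X, 1.0)
        errs.append(np.sum((x[:2] / x[2] - u) ** 2))
    return float(np.sqrt(np.mean(errs)))

def rms3d(X_est, X_true):
    """RMS Euclidean error of reconstructed 3D coordinates (both arrays are N x 3)."""
    return float(np.sqrt(np.mean(np.sum((X_est - X_true) ** 2, axis=1))))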

We also tested the computational complexity of our framework on various data sets. As the outlier-free subset selection algorithm is the most time consuming step, detailed tests were carried out on it. We ran our algorithm on a typical configuration consisting of 30 cameras and 5000 3D points. Varying amounts of outliers (ranging from 5% to 70%) were added to the data sets. The time required for outlier-free subset selection with and without pre-filtering is shown in Fig. 14. The advantage of using pre-filtering is obvious for data sets contaminated with a higher fraction of outliers (ξ ∈ [40%, 70%]), in which case roughly a twofold improvement in computational efficiency is obtained.

In order to handle outliers in triangulation, state-of-the-art methods use L1-based algorithms to remove outliers and then run bundle adjustment to refine the 3D structure [6]. In this experiment, we compare the performance of such triangulation strategies with our proposed framework. To be fair, the same noise scale is used for all methods. The data sets used to test the overall performance of 3D reconstruction are listed in Table 3, and comparisons between our proposed framework and various existing methods are summarized in Table 4. In this table, "Bisc.+BA" refers to using the algorithm proposed by Sim and Hartley [2] to remove outliers and then running bundle adjustment. "LOne+BA" and "Dual+BA" are the two algorithms proposed by Olsson et al. [6]. For the algorithms "Bisc." and "LOne", outliers are removed iteratively, whereas for the algorithm "Dual", outliers are removed in a one-shot manner. Note that as we consider calibrated cameras, both the rotation and the translation of each camera are assumed to be known in our implementation of the three algorithms "Bisc.", "LOne" and "Dual". "New1" and "New2" refer to using our proposed framework with the exact and the approximate error bounding algorithms, respectively. The method "Bisc.+BA" is clearly the most time consuming one, spending more than one hour to reconstruct data set SD3 with only 2048 3D points. Compared with "LOne+BA" and "Dual+BA", our method consistently found more reliable inliers and successfully reconstructed more 3D points. For the large scale data sets (SD5 and SD6), our method outperforms the existing methods in terms of both computational complexity and accuracy of 3D reconstruction. Moreover, while the results obtained by "New2" are quite close to those of "New1", it requires only half the CPU time required by "New1". This result indicates that the approximate error bounding algorithm works well on the tested data sets.
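The classification counts reported in Table 4 (see its footnotes) can be computed from two boolean masks per data set: one marking the ground-truth inlier observations and one marking the observations that a pipeline keeps. A small sketch with our own names:

import numpy as np

def classification_counts(is_inlier_true, kept):
    """#TP: true inliers kept; #FN: true inliers wrongly discarded (cf. the footnotes of Table 4)."""
    is_inlier_true = np.asarray(is_inlier_true, dtype=bool)
    kept = np.asarray(kept, dtype=bool)
    tp = int(np.sum(is_inlier_true & kept))
    fn = int(np.sum(is_inlier_true & ~kept))
    return tp, fn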

5.2. Experiments with real images

In this subsection, we present the experimental results using 9 real image sequences, which are listed in Table 5. Data sets "Plane", "Desk" and "Chair" were captured with a calibrated digital camera, and the method introduced in [47] was used to estimate the external camera parameters. Data sets "Fountain-P11", "Herz-Jesu-P25" and "castle-P30" (including calibration data) listed in Table 5 are publicly available data sets [48]. For all images, SIFT feature points were detected and matched using the algorithm described in [49]. To remove outliers between two images, RANSAC with epipolar geometric verification [1] was used in pairwise feature matching.

Table 5. Real image sequences used in the experiments.

Data set          #Views   #3D points   #Observations   Noise scale σ
Plane             9        11,000       58,790          1.26
Desk              24       18,389       93,334          0.97
Chair             27       17,787       64,928          1.18
Fountain-P11      11       32,159       150,276         0.27
Herz-Jesu-P25     25       47,101       266,495         0.36
castle-P30        30       38,102       203,111         0.36
Fountain-P11*     11       14,027       88,940          0.28
Herz-Jesu-P25*    25       21,987       182,582         0.38
castle-P30*       30       17,793       134,673         0.37

Table 4. Performance of triangulation pipelines on the synthetic data sets listed in Table 3. The best results are highlighted in bold.

Data set   Method     #TP (a)   #FN (b)   #PTS (c)   RMS2D   RMS3D   Time (s)
SD1        Bisc.+BA   634       274       114        1.228   1.721   229.755
           LOne+BA    848       60        120        1.222   1.448   0.482
           Dual+BA    415       493       66         1.232   1.632   2.206
           New1       903       5         128        1.227   1.346   5.071
           New2       903       5         128        1.227   1.346   2.609
SD2        Bisc.+BA   2372      1174      444        1.227   2.162   955.871
           LOne+BA    3356      190       490        1.205   1.746   1.155
           Dual+BA    1556      1990      269        1.231   1.792   4.275
           New1       3527      19        511        1.234   1.428   8.749
           New2       3526      20        511        1.234   1.432   5.019
SD3        Bisc.+BA   9498      4716      1751       1.200   1.936   3790.894
           LOne+BA    13,391    823       1938       1.195   1.560   4.422
           Dual+BA    6656      7558      1105       1.237   1.682   10.520
           New1       14,095    119       2037       1.215   1.467   21.166
           New2       14,080    134       2033       1.216   1.469   11.481
SD4        LOne+BA    46,477    2122      3996       1.264   1.446   25.418
           Dual+BA    6799      41,800    1237       1.217   2.717   125.695
           New1       48,374    224       4095       1.288   1.090   43.942
           New2       48,345    255       4095       1.288   1.156   26.794
SD5        LOne+BA    113,021   8851      7855       1.271   1.771   124.353
           Dual+BA    5148      116,724   1204       1.237   2.362   732.387
           New1       121,368   504       8190       1.305   1.101   103.280
           New2       121,210   662       8191       1.306   1.689   64.220
SD6        LOne+BA    161,220   32,953    10,877     1.256   2.270   385.905
           Dual+BA    2944      191,229   782        1.104   2.779   2220.453
           New1       193,322   851       12,285     1.306   1.779   216.155
           New2       192,974   1199      12,285     1.308   1.976   154.474

(a) Number of correctly identified inlier observations. (b) Number of inlier observations incorrectly classified as outlier observations. (c) Number of successfully reconstructed 3D points.


In order to test more challenging settings, we randomly chose roughly 45% of the feature tracks from the data sets "Fountain-P11", "Herz-Jesu-P25" and "castle-P30", and perturbed 20% of the original data by up to 100 pixels, resulting in three new data sets "Fountain-P11*", "Herz-Jesu-P25*" and "castle-P30*".

5.2.1. Influence of noise scale on 3D reconstruction

In the first experiment, we demonstrate that the selection of the noise scale can affect both the quantity of successfully reconstructed 3D points and the accuracy of the 3D reconstruction. For real images, it is difficult to obtain the ground truth, so we chose to exploit constraints in the scene to facilitate quantitative evaluation. Shown in Fig. 15(a) is one of the 9 frames in the data set "Plane"; two planes can be easily detected in this scene. The boundaries of the two planes were manually specified in each frame so that the 3D points belonging to the two planes can be identified. Then, we applied our method to reconstruct the sparse 3D structure using different scales of noise. For each set of 3D points on a plane, least squares plane fitting was used to reconstruct the plane, and the root mean square perpendicular distance (RMSPD) between each reconstructed 3D point and the corresponding reconstructed plane was used as the error measurement. Such a quantity should reflect the accuracy of the 3D reconstruction since it essentially measures the flatness of the reconstructed planes. The results of this experiment are shown in Table 6 and Fig. 15, which confirm that a smaller reprojection error can always be achieved by choosing a smaller noise scale but the accuracy of the 3D structure decreases. More importantly, if the chosen noise scale is too small, a large number of 3D points cannot be reconstructed.
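The plane-based evaluation described above amounts to a least squares plane fit followed by an RMS of point-to-plane distances. A compact sketch (names are ours):

import numpy as np

def fit_plane(points):
    """Least-squares plane through an N x 3 point set; returns a unit normal n and a point c on the plane."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c)
    return vt[-1], c                      # the normal is the direction of smallest variance

def rmspd(points):
    """Root mean square perpendicular distance of the points to their own fitted plane."""
    n, c = fit_plane(points)
    d = (points - c) @ n
    return float(np.sqrt(np.mean(d ** 2)))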

5.2.2. Comparison of overall performance

In the following experiments, we evaluate the overall performance of our framework using more real image sequences. Our estimated noise scale is used both in the comparative methods and in our method.

Although RANSAC can discard most outliers from two-view correspondences, it cannot detect outliers in longer feature tracks, and the presence of a single outlier may lead to an arbitrary 3D point. To see the sensitivity of triangulation algorithms to outliers, we applied bundle adjustment [7], initialized with the results of linear triangulation [1], to the "Desk" and "Chair" data sets without any outlier removal procedure, and obtained an RMS reprojection error larger than 300 pixels for both data sets. The corresponding reconstructed 3D structures are shown in Fig. 16. As can be seen in this figure, the reconstructed scenes are noisy due to a large fraction of incorrectly estimated 3D points.
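The linear triangulation used here to initialize bundle adjustment is the standard homogeneous DLT method described in [1]; a minimal sketch:

import numpy as np

def linear_triangulation(P_list, u_list):
    """Homogeneous DLT: each view (P, (u, v)) contributes the rows u*P[2]-P[0] and v*P[2]-P[1]."""
    A = []
    for P, (u, v) in zip(P_list, u_list):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]                   # dehomogenize the least-squares solution

Running this once per feature track gives the initial 3D points that bundle adjustment then refines.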

The performance of our framework on real images is also compared with bundle adjustment combined with two state-of-the-art outlier removal algorithms [6]. The results are presented in Table 7, where the number of inlier observations is the number of image observations successfully used in the final reconstruction. In all cases, our methods "New1" and "New2" identified more inlier observations and reconstructed more 3D points. Moreover, the reprojection errors achieved by our methods are quite close to those obtained by "LOne+BA" and "Dual+BA". "LOne+BA" performs best in terms of overall time efficiency; it is roughly 1.5 times faster than "Dual+BA" and 3 times faster than "New1". No significant difference between the running times of "New2" and "LOne+BA" is observed; for the data set "Herz-Jesu-P25", which is the largest data set in this experiment, "LOne+BA" requires 1.2 times the CPU time of "New2".

Table 6. Comparison of 3D reconstruction on the data set "Plane" using different noise scales.

Noise scale σ   #Inlier observations   #PTS     RMS2D   RMSPD
0.2             7468                   2371     0.280   0.824
0.5             29,485                 7522     0.724   0.783
0.9             46,855                 9889     1.216   0.738
Adpt.           51,550                 10,416   1.432   0.723
1.6             52,490                 10,511   1.509   0.730
2.0             52,906                 10,572   1.562   0.850

Fig. 15. (a) One frame of the data set "Plane" and 650 feature tracks with lengths ranging from 2 to 6. (b) When the noise scale is set to 0.2 pixels, we obtain a reprojection error as low as 0.28 pixels, but only 2371 3D points are successfully reconstructed. (c), (d) Two views of the 3D reconstruction when the estimated noise scale is used, in which case 10,416 3D points are successfully reconstructed with the lowest RMSPD.


The results on the three manually perturbed data sets are summarized in Table 8. This experiment aims at testing the performance of triangulation pipelines under a relatively high fraction of outliers. In all three tests, our method successfully reconstructed around 5 times the number of 3D points obtained by "Dual+BA". The performance of "LOne+BA" is more stable than that of "Dual+BA", and it requires the least CPU time. Still, it is noteworthy that our methods can consistently identify more reliable inlier observations in a reasonable amount of time without introducing a noticeable increase in reprojection errors.

Finally, some results of 3D reconstruction using "New2" on the real image sequences are presented for qualitative evaluation (see Figs. 17 and 18). One of the 24 frames from the data set "Desk" and two views of the 3D reconstruction are shown in Fig. 17(a)–(c). Compared with the 3D reconstruction shown in Fig. 16(a), the 3D structure obtained by our method is cleaner. For the data set "Chair", we also obtain better results (shown in Fig. 17(e) and (f)) than those shown in Fig. 16(b), as expected. In Fig. 18, we present one image frame and one view of the 3D reconstruction for the data sets "Fountain-P11", "Herz-Jesu-P25" and "castle-P30". For these three data sets, more feature points were detected and the reconstructed 3D structures are thus denser. Again, the obtained 3D point clouds are rather clean and reliable.

6. Conclusions

This paper proposes a new robust framework for optimal multi-view L2 triangulation. To the best of our knowledge, this is the first work in which the issue of outlier handling in optimal multi-view triangulation under the L2-norm is investigated. The proposed method starts with estimating the scale of noise in image measurements, which affects both the quantity and the accuracy of the reconstructed 3D points but is overlooked in existing research. The robust noise scale estimator follows a residual-consensus based scheme within which the uncertainty of epipolar transfer is analytically characterized by deriving the closed-form covariance of epipolar transfer. The issue of outliers is addressed by directly searching for the optimal 3D point within either the analytically computed bounds using second-order cone programming (SOCP) or the approximately bounded ranges. In particular, we propose to use the Differential Evolution (DE) algorithm to solve this complex optimization problem, so that the inlier selection and 3D structure refinement are realized in an optimal fashion. Extensive experimental results on both synthetic data sets and real image sequences demonstrate that such an optimal inlier selection and 3D structure refinement strategy can consistently identify more reliable inliers and also reconstruct more unambiguous 3D points with higher accuracy without noticeably sacrificing computational efficiency. It is noteworthy that on a few large scale data sets contaminated with outliers, the proposed method even outperforms existing algorithms in terms of efficiency.

Fig. 16. Reconstruction results for (a) the data set "Desk" and (b) the data set "Chair" using bundle adjustment initialized with linear triangulation. The reconstructed 3D structures are noisy due to outliers, and the RMS2D for both data sets is larger than 300 pixels.

Table 7. Comparison of 3D reconstruction on the five real data sets listed in Table 5. The best results are highlighted in bold.

Data set        Method    #Inlier observations   #PTS     RMS2D   Time (s)
Desk            LOne+BA   74,966                 15,605   1.112   35.863
                Dual+BA   72,690                 15,600   1.276   86.176
                New1      77,922                 16,236   1.121   144.119
                New2      76,853                 15,922   1.141   50.183
Chair           LOne+BA   45,473                 14,287   1.121   22.298
                Dual+BA   32,810                 10,877   1.334   39.232
                New1      48,029                 15,052   1.156   137.438
                New2      47,803                 14,980   1.170   47.399
Fountain-P11    LOne+BA   120,878                27,927   0.326   71.351
                Dual+BA   121,035                27,482   0.393   71.580
                New1      123,707                28,651   0.325   227.912
                New2      123,147                28,479   0.325   77.595
Herz-Jesu-P25   LOne+BA   216,205                41,044   0.445   168.126
                Dual+BA   203,409                40,211   0.512   198.619
                New1      220,428                42,110   0.445   397.106
                New2      217,563                41,230   0.451   139.188
castle-P30      LOne+BA   154,724                31,238   0.422   106.256
                Dual+BA   143,881                29,781   0.493   140.765
                New1      161,000                32,757   0.424   309.519
                New2      158,585                32,041   0.430   119.317

Table 8. Comparison of 3D reconstruction on the three manually perturbed real data sets listed in Table 5. The best results are highlighted in bold.

Data set         Method    #Inlier observations   #PTS     RMS2D   Time (s)
Fountain-P11*    LOne+BA   44,670                 11,206   0.304   30.243
                 Dual+BA   10,563                 2650     0.388   50.938
                 New2      51,878                 12,821   0.318   54.563
Herz-Jesu-P25*   LOne+BA   72,478                 16,678   0.403   84.210
                 Dual+BA   13,666                 3519     0.498   262.838
                 New2      89,571                 20,281   0.426   161.032
castle-P30*      LOne+BA   54,067                 12,791   0.388   52.647
                 Dual+BA   11,580                 2960     0.474   218.613
                 New2      66,811                 15,631   0.410   126.856



The sparse 3D structure obtained by our method can be useful in guiding further dense 3D reconstruction. For example, recent research in stereo matching has shown that the use of sparse Ground Control Points (GCPs) [50] can improve reconstruction accuracy. A multi-view stereo framework that exploits more general structural constraints (e.g., lines and planes) recognized from the sparse 3D structure will be investigated in the future.

Fig. 17. (a)–(c) One frame from the data set "Desk" and two views of the reconstructed 3D points and the locations of the cameras, with a thumbnail image frame attached in front of each camera; 15,922 3D points are successfully reconstructed from 76,853 inlier observations selected from 93,334 noisy observations. (d)–(f) One frame from the data set "Chair" and the 3D reconstruction; 14,980 3D points are successfully reconstructed from 47,803 inlier observations selected from 64,928 noisy observations.

Fig. 18. (a) One frame and one view of the 3D reconstruction for the data set "Fountain-P11", for which 28,479 3D points are successfully reconstructed from 123,147 inlier observations selected from 150,276 noisy observations. (b) One frame and one view of the 3D reconstruction for the data set "Herz-Jesu-P25", for which 41,230 3D points are successfully reconstructed from 217,563 inlier observations selected from 266,495 noisy observations. (c) One frame and one view of the 3D reconstruction for the data set "castle-P30", for which 32,041 3D points are successfully reconstructed from 158,585 inlier observations selected from 203,111 noisy observations.



Conflict of interest

None declared.

Acknowledgment

This work was partially supported by the Hunan Provincial Innovation Foundation for Postgraduate (Grant no. CX2010B025), the Chinese Scholarship Council (Grant no. 2010611068), the Natural Sciences and Engineering Research Council of Canada (NSERC) (Grant no. DG000370) and the University of Alberta.

Appendix A. Proof of Proposition 4.2

Proof. The Jacobian matrix of Eq. (7) is given by

$$J_\phi = \frac{1}{C}\,\frac{\partial(\vec{A}\times\vec{B})}{\partial\hat{\mathbf{u}}} - \frac{1}{C^2}\,\frac{\partial C}{\partial\hat{\mathbf{u}}}\,(\vec{A}\times\vec{B}). \qquad (\mathrm{A.1})$$

Since

$$\vec{A}\times\vec{B} = \begin{pmatrix} A_2B_3 - A_3B_2 \\ A_3B_1 - A_1B_3 \\ A_1B_2 - A_2B_1 \end{pmatrix}, \qquad (\mathrm{A.2})$$

the partial derivative of $\vec{A}\times\vec{B}$ with respect to $\hat{\mathbf{u}}$ is given by

$$\frac{\partial(\vec{A}\times\vec{B})}{\partial\hat{\mathbf{u}}} = \begin{pmatrix}
\frac{\partial(A_2B_3 - A_3B_2)}{\partial[\hat{\mathbf{u}}]_1} & \frac{\partial(A_2B_3 - A_3B_2)}{\partial[\hat{\mathbf{u}}]_2} & \frac{\partial(A_2B_3 - A_3B_2)}{\partial[\hat{\mathbf{u}}]_3} & \frac{\partial(A_2B_3 - A_3B_2)}{\partial[\hat{\mathbf{u}}]_4} \\
\frac{\partial(A_3B_1 - A_1B_3)}{\partial[\hat{\mathbf{u}}]_1} & \frac{\partial(A_3B_1 - A_1B_3)}{\partial[\hat{\mathbf{u}}]_2} & \frac{\partial(A_3B_1 - A_1B_3)}{\partial[\hat{\mathbf{u}}]_3} & \frac{\partial(A_3B_1 - A_1B_3)}{\partial[\hat{\mathbf{u}}]_4} \\
\frac{\partial(A_1B_2 - A_2B_1)}{\partial[\hat{\mathbf{u}}]_1} & \frac{\partial(A_1B_2 - A_2B_1)}{\partial[\hat{\mathbf{u}}]_2} & \frac{\partial(A_1B_2 - A_2B_1)}{\partial[\hat{\mathbf{u}}]_3} & \frac{\partial(A_1B_2 - A_2B_1)}{\partial[\hat{\mathbf{u}}]_4}
\end{pmatrix}. \qquad (\mathrm{A.3})$$

For any $i, j = 1, 2, 3$ and $k = 1, 2, 3, 4$, it follows that

$$\frac{\partial(A_iB_j - A_jB_i)}{\partial[\hat{\mathbf{u}}]_k} = \left(A_i\frac{\partial B_j}{\partial[\hat{\mathbf{u}}]_k} - A_j\frac{\partial B_i}{\partial[\hat{\mathbf{u}}]_k}\right) + \left(B_j\frac{\partial A_i}{\partial[\hat{\mathbf{u}}]_k} - B_i\frac{\partial A_j}{\partial[\hat{\mathbf{u}}]_k}\right). \qquad (\mathrm{A.4})$$

Substituting Eq. (A.4) into Eq. (A.3) yields

$$\frac{\partial(\vec{A}\times\vec{B})}{\partial\hat{\mathbf{u}}} = [\vec{A}]_\times\frac{\partial\vec{B}}{\partial\hat{\mathbf{u}}} - [\vec{B}]_\times\frac{\partial\vec{A}}{\partial\hat{\mathbf{u}}}. \qquad (\mathrm{A.5})$$

Combining Eq. (A.1) and Eq. (A.5), we get

$$J_\phi = \frac{1}{C}[\vec{A}]_\times\frac{\partial\vec{B}}{\partial\hat{\mathbf{u}}} - \frac{1}{C}[\vec{B}]_\times\frac{\partial\vec{A}}{\partial\hat{\mathbf{u}}} - \frac{1}{C^2}\,\frac{\partial C}{\partial\hat{\mathbf{u}}}\,(\vec{A}\times\vec{B}). \qquad \square$$
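Eq. (A.5) is the product rule for cross products. As a sanity check, and independently of the specific A, B and C of Eq. (7) (which are defined earlier in the paper), the identity can be verified numerically against central finite differences for any smooth vector-valued A(û) and B(û); the sketch below and its names are ours.

import numpy as np

def skew(v):
    """[v]_x, the 3 x 3 cross-product matrix of a 3-vector v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def check_cross_product_jacobian(A, B, JA, JB, u, eps=1e-6):
    """Compare the analytic Jacobian [A]_x JB - [B]_x JA with central differences of A(u) x B(u)."""
    analytic = skew(A(u)) @ JB(u) - skew(B(u)) @ JA(u)
    numeric = np.zeros_like(analytic)
    for k in range(len(u)):
        e = np.zeros(len(u)); e[k] = eps
        numeric[:, k] = (np.cross(A(u + e), B(u + e)) - np.cross(A(u - e), B(u - e))) / (2 * eps)
    return np.max(np.abs(analytic - numeric))

For instance, with A(û) = M û and B(û) = N û for random 3 × 4 matrices M and N (so that JA = M and JB = N), the returned maximum discrepancy is at the level of floating-point round-off.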

References

[1] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, New York, NY, USA, 2004.
[2] K. Sim, R. Hartley, Removing outliers using the L∞ norm, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 485–494.
[3] H. Li, A practical algorithm for L∞ triangulation with outliers, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[4] Q. Ke, T. Kanade, Quasiconvex optimization for robust geometric reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. (2007) 1834–1847.
[5] Y. Seo, H. Lee, S.W. Lee, Outlier removal by convex optimization for L-infinity approaches, in: PSIVT'09, 2009, pp. 203–214.
[6] C. Olsson, A. Eriksson, R.I. Hartley, Outlier removal using duality, in: CVPR'10, 2010, pp. 1450–1457.
[7] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment – a modern synthesis, in: Vision Algorithms: Theory and Practice, Lecture Notes in Computer Science, Springer-Verlag, London, UK, 2000, pp. 298–375.
[8] R. Hartley, F. Schaffalitzky, L∞ minimization in geometric reconstruction problems, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 504–509.
[9] F. Kahl, S. Agarwal, M.K. Chandraker, D. Kriegman, S. Belongie, Practical global optimization for multiview geometry, Int. J. Comput. Vis. 79 (2008) 271–284.
[10] F. Lu, R. Hartley, A fast optimal algorithm for L2 triangulation, in: Proceedings of the 8th Asian Conference on Computer Vision—Volume Part II, ACCV'07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 279–288.
[11] F. Kahl, R. Hartley, Multiple view geometry under the L∞-norm, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2007) 1603–1617.
[12] H. Li, Efficient reduction of L-infinity geometry problems, in: CVPR'09, 2009, pp. 2695–2702.
[13] P. Torr, Bayesian model estimation and selection for epipolar geometry and generic manifold fitting, Int. J. Comput. Vis. 50 (2002) 35–61.
[14] R.I. Hartley, P. Sturm, Triangulation, 1994.
[15] K. Kanatani, Y. Sugaya, H. Niitsuma, Triangulation from two views revisited: Hartley–Sturm vs. optimal correction, in: Proceedings of the 19th British Machine Vision Conference, 2008, pp. 173–182.
[16] F.C. Wu, Q. Zhang, Z.Y. Hu, Efficient suboptimal solutions to the optimal triangulation, Int. J. Comput. Vis. 91 (2011) 77–106.
[17] H. Stewenius, F. Schaffalitzky, D. Nister, How hard is 3-view triangulation really?, in: Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05), vol. 1, IEEE Computer Society, Washington, DC, USA, 2005, pp. 686–693.
[18] M. Byrod, K. Josephson, Fast optimal three view triangulation, in: Asian Conference on Computer Vision, 2007.
[19] C. Olsson, F. Kahl, Generalized convexity in multiple view geometry, J. Math. Imaging Vis. 38 (2010) 35–51.
[20] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
[21] S. Agarwal, N. Snavely, S.M. Seitz, Fast Algorithms for L∞ Problems in Multiview Geometry, 2008.
[22] A. Dalalyan, R. Keriven, L1-penalized robust estimation for a class of inverse problems arising in multiview geometry, in: Advances in Neural Information Processing Systems, vol. 22, 2009, pp. 441–449.
[23] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization (Natural Computing Series), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
[24] S. Bhandarkar, H. Zhang, Image segmentation using evolutionary computation, IEEE Trans. Evol. Comput. 3 (1999) 1–21.
[25] B. Bhanu, S. Lee, J. Ming, Adaptive image segmentation using a genetic algorithm, IEEE Trans. Evol. Comput. 25 (1995) 1543–1567.
[26] J.-S. Jang, J.-H. Kim, Fast and robust face detection using evolutionary pruning, IEEE Trans. Evol. Comput. 12 (2008) 562–571.
[27] A. Treptow, A. Zell, Combining AdaBoost learning and evolutionary search to select features for real-time object detection, in: Proceedings of the 2004 IEEE Congress on Evolutionary Computation (CEC), Oregon, 2004, pp. 2107–2113.
[28] M. Gong, Y.-H. Yang, Genetic-based stereo algorithm and disparity map evaluation, Int. J. Comput. Vis. 47 (2002) 63–77.
[29] M. Gong, Y.-H. Yang, Quadtree-based genetic algorithm and its applications to computer vision, Pattern Recognit. 37 (2004) 1723–1733.
[30] L. de la Fraga, O. Schütze, Direct calibration by fitting of cuboids to a single image using differential evolution, Int. J. Comput. Vis. 81 (2009) 119–127.
[31] R. Landa-Becerra, L.G. de la Fraga, Triangulation using differential evolution, in: Proceedings of the 2008 Conference on Applications of Evolutionary Computing, Evo'08, Berlin, Heidelberg, 2008, pp. 359–364.
[32] L.G. de la Fraga, I.V. Silva, Direct 3D metric reconstruction from two views using differential evolution, in: Proceedings of the 2008 IEEE Congress on Evolutionary Computation (CEC), Hong Kong, 2008, pp. 3266–3273.
[33] L.G. de la Fraga, I.V. Silva, Direct 3D metric reconstruction from multiple views using differential evolution, in: Applications of Evolutionary Computing, Lecture Notes in Computer Science, vol. 4974, 2008, pp. 341–346.
[34] X. Yu, T.D. Bui, A. Krzyzak, Robust estimation for range image segmentation and reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 530–538.
[35] H. Wang, D. Suter, Robust adaptive-scale parametric model estimation for computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1459–1474.
[36] M. Lavine, Introduction to Statistical Thought, University Press of Florida, Gainesville, FL, 2009.
[37] B.W. Silverman, Density Estimation: For Statistics and Data Analysis, Chapman and Hall, London, 1986.
[38] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, Inc., New York, NY, USA, 1987.
[39] R.E. Moore, Interval Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1966.
[40] M. Farenzena, A. Fusiello, Reconstruction with interval constraints propagation, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, 2006, pp. 1185–1190.
[41] J.F. Sturm, Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones, Optim. Methods Softw. 11 (1) (1999) 625–653.
[42] E.D. Andersen, K.D. Andersen, The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999, pp. 197–232.
[43] K.C. Toh, M. Todd, R.H. Tütüncü, SDPT3—a Matlab software package for semidefinite programming, Optim. Methods Softw. 11 (1999) 545–581.
[44] C.-P. Lu, G.D. Hager, E. Mjolsness, Fast and globally convergent pose estimation from video images, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 610–622.


[45] Digital Bunny Model, ⟨http://graphics.stanford.edu/data/⟩, 1996.
[46] G. Csurka, C. Zeller, Z. Zhang, O.D. Faugeras, Characterizing the uncertainty of the fundamental matrix, Comput. Vis. Image Understand. 68 (1997) 18–36.
[47] N. Snavely, S.M. Seitz, R. Szeliski, Modeling the world from internet photo collections, Int. J. Comput. Vis. 80 (2008) 189–210.
[48] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, U. Thoennessen, On benchmarking camera calibration and multi-view stereo for high resolution imagery, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[49] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2004) 91–110.
[50] L. Wang, R. Yang, Global stereo matching leveraged by sparse ground control points, in: CVPR'11, 2011, pp. 3033–3040.

Lai Kang received the M.S. degree in Systems Engineering and the Ph.D. degree in Control Science and Engineering from the National University of Defense Technology, Changsha, China, in 2008 and 2013, respectively. He is currently a lecturer at the College of Information System and Management, National University of Defense Technology. From September 2010 to August 2012, he was a visiting Ph.D. student with the Computer Graphics Laboratory, Department of Computing Science, University of Alberta, Edmonton, Canada. His current research interests include optical flow estimation, image-based 3D reconstruction and global optimization.

Lingda Wu received the Ph.D. degree in Management Science and Engineering from the National University of Defense Technology, Changsha, China, in 1999. She is currently a professor with the National Laboratory of Electronic Information Equipment System, the Academy of Equipment Command and Technology, Beijing, China. Since 2011, she has served as the Director of the Technical Committee on Multimedia Technology (TCMT), China Computer Federation (CCF). Her research focuses on multimedia information systems and virtual reality technology. She has published over 100 technical papers in academic journals and conference proceedings.

Yee-Hong Yang received the Ph.D. degree from the University of Pittsburgh, Pittsburgh, PA. His research interests cover a wide spectrum of topics in computer vision and computer graphics, which include 2D and 3D shape analysis, edge detection, segmentation, motion analysis, ultrasound image processing, color image processing, physics-based modeling and animation, human body motion analysis and animation, rendering of realistic imagery, image-based modeling and rendering, texture analysis and synthesis, and multi-view computer vision.

Dr. Yang is a senior member of the IEEE and serves on the Editorial Board of the journal Pattern Recognition. He has published over 100 technical papers in international journals and conference proceedings, co-edited one book and served as a guest editor of an international journal. In addition, he has served as a reviewer for numerous international journals and as a committee member for many conferences and review panels. He also co-chaired Vision Interface '98.
