
Optical Flow Based

Structure from Motion

Marco Zucchelli

Stockholm 2002
Doctoral Dissertation

Royal Institute of Technology
Numerical Analysis and Computer Science

Computational Vision and Active Perception Laboratory (CVAP)


Academic dissertation which, with the permission of the Royal Institute of Technology (Kungliga Tekniska Högskolan), is submitted for public examination for the degree of Doctor of Technology on Thursday, June 13, 2002, at 14:00 in lecture hall E1, Lindstedtsvägen 3, Kungliga Tekniska Högskolan, Valhallavägen 79, Stockholm.

TRITA–NA–0211
ISSN 0348–2952
ISBN 91–7283–308–4
CVAP 266

Copyright © Marco Zucchelli, May 2002


Abstract

Reconstructing the 3D shape of a scene from its 2D images is a problem that has attracted a great deal of research. 3D models are nowadays widely used for scientific visualization, entertainment and engineering tasks. Most of the approaches developed by the computer vision community can be roughly classified as feature based or flow based, according to whether the data they use is a set of feature matches or an optical flow field. While a dense optical flow field, due to its noisy nature, is not well suited for tracking, finding corresponding features between views with a large baseline is still an open problem.

The system we develop in this thesis is of a hybrid type. We track sparse features over sequences acquired at 25 Hz from a hand-held camera. During the tracking, good features can be selected as those lying in highly textured areas: this guarantees higher precision in the estimation of feature displacements. Such displacements are used to approximate optical flow. We demonstrate that this approximation is a good one for our working conditions. Using this approach we bypass the matching problem of stereo and the complexity and time integration problems of optical flow based reconstruction. Time integration is obtained by an optimal predict-update procedure that merges measurements by re-weighting them according to their respective covariances.

Most of the research effort of this thesis is focused on the robust estimation of structure and motion from a pair of images and the related optical flow field. We first test a linear solution that has the appealing property of being closed form, but the problem of returning biased estimates. We propose a non-linear refinement of the linear estimator, showing convergence properties and improvements in bias and variance. We further extend the non-linear estimator to incorporate the optical flow covariance matrix (maximum likelihood) and, moreover, we show that, in the case of dense sequences, it is possible to locally time-integrate the reconstruction process for increased robustness. We experimentally investigate the possibility of introducing geometrical constraints in the structure and motion estimation. Such constraints are of bilinear type, i.e. planes, lines and incidence of these primitives are used. For this purpose we present a new motion based segmentation algorithm able to automatically detect and reconstruct planar regions.

To assess the efficacy of our solution the algorithms were tested on a variety of real and simulated sequences.

ISBN 91-7283-308-4 • TRITA-02-11 • ISSN 0348-2952 • ISRN KTH/NA/R 02-11


Acknowledgments

A number of people contributed in different ways to this thesis:

Henrik C. Thanks for supervising my research, for the advice, the discussions and the enthusiasm you always showed.

Jan-Olof E. & Henrik C. Thanks for appointing this Ph.D. position. Thanks to Jan-Olof for always being so optimistic about everything.

E.U. Thanks for supporting my position through the C.A.M.E.R.A. project.

C.A.M.E.R.A. members Thanks for the exciting research environment.

CVAP & CAS members Thanks for the stimulating research discussions and for the fun I had here.

Philip A., Guido Z., Frank H., Lars P., Danica K. Thanks for practical help with my thesis and for the lunches spent talking about soccer.

Jana K. Thanks for inviting me to George Mason University, for the research we did together and for the good time I had there.

Guido C. Thanks for the research we did together and for the late night pizzas. Jana K. also introduced me to home-made Thai food. My mouth is still on fire.

Foundation Blanceflor Boncompagni-Ludovisi, née Bildt Thanks for supporting my visit at George Mason University.

Jose S-V. Thanks for inviting me to VISLAB, for the research, the fun and the nice food.

Etienne G. Thanks for the discussions, for reviewing my articles and for the good time we spent together in Lisbon.

VISLAB members Thanks for making my stay there so much fun.

Obviously I could not have managed without my families, both the Italian and the Swedish one.


Contents

1 Introduction
  1.1 Motivations
  1.2 State of the Art
    1.2.1 Feature Based Reconstruction
    1.2.2 Flow Based Reconstruction
  1.3 Contributions
  1.4 Outline
  1.5 List of Papers

2 Camera Model and Benchmarks
  2.1 Camera Model
  2.2 Camera Calibration
  2.3 Optical Flow Estimation
    2.3.1 Point Features Tracking over Multiple Frames
    2.3.2 Optical Flow Approximation
  2.4 Simulation Benchmarks
  2.5 Error Functions

3 Discrete and Differential Constraints
  3.1 The Discrete Epipolar Constraint
  3.2 The Differential Epipolar Constraint
  3.3 The Differential Epipolar Constraint as a Limit of the Discrete One
  3.4 An Effectiveness Comparison
    3.4.1 Structure and Motion from Optical Flow
    3.4.2 Structure and Motion from Stereo
  3.5 Comparison
  3.6 Summary

4 Linear Recursive Estimation of Structure and Motion
  4.1 Egomotion Estimation
  4.2 Structure from Motion
  4.3 Automatic Rescaling
  4.4 Time Integration
  4.5 Experiments
  4.6 Non-Linear Refinement
    4.6.1 Experiments
  4.7 Summary

5 Maximum Likelihood Structure from Motion
  5.1 Introduction
  5.2 Problem Formulation
  5.3 Two-frames Non-Linear Estimation of Structure and Motion
  5.4 Re-weighted Multi-View Formulation
    5.4.1 Re-weighted Formulation
    5.4.2 Multi-view Structure and Motion Estimation
  5.5 Recursive Structure and Motion Estimation
  5.6 Experiments
  5.7 Summary

6 Automatic Segmentation and Reconstruction of Multiple Planar Scenes
  6.1 Introduction
  6.2 Planar Motion Estimation
    6.2.1 Basic Model and Notation
    6.2.2 Two Frames Re-weighted Estimation
    6.2.3 Multi-frames Re-weighted Estimation
  6.3 Planar Motion Segmentation
    6.3.1 Algorithm
    6.3.2 Final Refinement: Resolving Ambiguities
  6.4 Results
  6.5 Summary

7 Maximum Likelihood Structure and Motion with Geometrical Constraints
  7.1 Introduction
  7.2 Problem Formulation
    7.2.1 Unconstrained Optimization
  7.3 Constrained Non Linear Estimation of Structure and Motion
    7.3.1 Constraints Formulation
    7.3.2 Constrained Optimization
    7.3.3 Constraints Enforcement
  7.4 Results
  7.5 Summary

8 The Calibration Issue
  8.1 Introduction
  8.2 Auto Calibration
  8.3 Egomotion Estimation
    8.3.1 Jepson & Heeger
    8.3.2 MacLean
    8.3.3 Bruss & Horn
    8.3.4 Kanatani A
    8.3.5 Kanatani B
    8.3.6 Ma-Kosecka-Sastry
  8.4 Biased Egomotion
  8.5 Calibration and Egomotion
  8.6 3D Structure Distortion
  8.7 Experiments
  8.8 Summary

9 Conclusions
  9.1 Contributions
  9.2 Open Issues
  9.3 Future Directions

A Optimal Rotation

B Error Propagation

C Maximum-Likelihood Estimator
  C.1 Total Least Squares (TLS)


List of Figures

1.1 The 3D reconstruction process is made of three phases: data acquisition, data processing and visualization of the reconstructed models.
1.2 Block diagram of the algorithm developed in this thesis.

2.1 Pinhole projective camera model.
2.2 Camera motion and optical flow.
2.3 The manifold M and the structure and motion solution θ lie in the parameter space T. Choosing a gauge G that intersects the manifold M at such points fixes a unique solution θ_G.
2.4 Set of images used for camera calibration.
2.5 Camera extrinsic parameters.
2.6 Approximation of the optical flow with the displacement Δx.
2.7 Angular error due to the displacement approximation (dashed line) and due to displacement approximation and tracking error (continuous line). Translational velocity is 0.02 focal lengths per frame in the direction z. Results are similar for different motions.
2.8 Conversion between the two error measures we use for the simulation conditions described above.

3.1 Epipolar geometry. e_l and e_r are the two epipoles.
3.2 (a) Distribution of the variable (Ẑ − Z)/Z for 10 degrees error in the estimation of the optical flow. (b) Average percentage of positive depths as a function of the error on the estimation of the optical flow.
3.3 Performance of the stereo reconstruction as a function of the error in the pixel position of feature points for the minimal configuration of 8 points. The average disparities between the stereo views for the 3 curves are about 10, 20 and 40 pixels.
3.4 (a) Error for structure reconstruction from a single optical flow field. (b) A comparison of stereo and differential reconstruction using 50 points, assuming 10 degrees error on optical flow and 1 pixel error on feature positions. Disparity between consecutive frames is about 1.0 pixel per frame. The reconstruction error function is defined in section 2.5.
3.5 Reconstruction of a synthetic video sequence of 2 frames. The average disparity is about 1 pixel.
3.6 (a) A frame from the sequence. (b) Reconstruction from two frames. Optical flow is approximated with the feature displacements.
3.7 (a) A frame from the sequence. (b) Computed optical flow. (c) Reconstruction from two frames. Optical flow is approximated with the feature displacements.
3.8 (a) A frame from the sequence. (b) Estimated optical flow. (c) Reconstruction from two frames.

4.1 Algorithm overview. The structure and motion problem is reduced into a set of modules that are executed sequentially. The camera is supposed to be calibrated a priori.
4.2 Comparison between the Pentland and the ODE based re-scaling systems for different noise levels. Time integration is performed over 5 frames. Reconstruction error is defined in section 2.5.
4.3 (a) Reconstruction improvement by refining the linear solution. Reconstruction error is defined in section 2.5. (b) Improvement in the estimation of η(t) by refining the linear solution. η error is defined in Eq. 4.26.
4.4 (a) Polar plot of the translational velocity as estimated by the linear algorithm. (b) Translational velocity estimates after the non-linear refinement. Error on optical flow is approximately 10%.
4.5 (a) An image from the sequence. (b) Estimated optical flow. (c) & (e) Reconstructed model of the calibration grid using 2 frames and (d) & (f) reconstructed model using 10 frames. The model is based on 237 features and the average disparity between frames is about 1 pixel.
4.6 (a) An image from the sequence. (b) Estimated optical flow. (c) and (d) Reconstructed model of a Linux box based on 11 frames and 220 features. The average disparity is about 0.9 pixels per frame.
4.7 (a) An image from the sequence. (b) Estimated optical flow. (c) and (d) Reconstructed model of a tea box based on 7 frames and 250 features. The average disparity is about 1.1 pixels per frame.

5.1 Multi-frame setting. I_0 is the reference view. Reconstruction is performed simultaneously between the reference view and the views that do satisfy the instantaneous motion model.
5.2 The aperture problem. The true image velocity u cannot be distinguished from the image velocity normal to the moving contour, u_n, when viewed through an aperture.
5.3 Directional uncertainty is indicated by the drawn ellipse. For sharp corner points the uncertainty is small and anisotropic since the intensity pattern has variations in all directions. For flat corners the uncertainty is larger in the direction tangent to the curve.
5.4 (a) Angle between true and estimated linear velocities for the re-weighted and un-weighted algorithms with constant error ellipse orientation. (b) Standard deviation of the estimated linear velocities for the re-weighted and un-weighted algorithms with random error ellipse orientation. The trials for each noise level were 100.
5.5 Structure and motion error using the two-view algorithms over the pairs {I_0, I_1} and {I_0, I_2} and using the multi-frame algorithm over the 3 frames simultaneously. (a) Linear velocity (v_1, v_2) error. (b) Structure error (see section 2.5 for the definition).
5.6 Polar plots of the linear velocity azimuthal and polar angles. (a) Jepson-Heeger algorithm. (b) Refinement of Jepson-Heeger by non-linear minimization. (c) Average values of the computed linear velocities.
5.7 Multi-frame reconstruction over 3 frames. 245 features were used. Average optical flow is one pixel per frame.
5.8 Multi-frame reconstruction over 3 frames. 271 features were used. Average optical flow is one pixel per frame.
5.9 The Flower garden sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) & (d) Examples of reconstruction.

6.1 Performance of weighted linear least squares for the estimation of the flow matrix. The average values of r_λ for two of the sequences used for tests are also reported.
6.2 B matrix estimation improvement due to multi-frame integration. 10 frames were used.
6.3 (a) Optical flow warping with respect to 4 points on the plane chosen according to the nearest neighbor principle (the 4 points are marked with crosses). (b) Magnitude of the residuals.
6.4 The tea box sequence. (a) An image from the sequence. (b) First iteration of the clustering algorithm. Crosses indicate the initial 5 points selected. (c) Segmentation obtained starting from surface number 2. (d) Segmentation obtained starting from surface number 3. (e) Ambiguous features. (f) Final segmentation after ambiguous features removal.
6.5 The Linux box sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) Segmentation obtained at the first iteration. 4 of the 5 initial features (crosses) on surface nr. 1 (dots) are rejected as too noisy. (d) Final segmentation.
6.6 The calibration grid sequence. (a) A frame from the video sequence. (b) Segmentation obtained after three iterations of the clustering algorithm. (c) Reassigned ambiguous features. (d) Segmentation obtained after reassignment of the ambiguous and the rejected features.
6.7 The aquatic center sequence. (a) A frame from the video sequence. (b) Final segmentation. Triangles mark features that were not assigned to any surface.

7.1 Linear velocity error.
7.2 Structure error.
7.3 2 views of a reconstructed model.
7.4 2 views of a reconstructed model.
7.5 The GMU Building sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) and (d) Reconstruction examples.
7.6 The Aquatic center sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) and (d) Reconstruction examples.

8.1 Geometric interpretation of the error distance.
8.2 Translation bias as a function of calibration error, assuming that the measurement errors were fixed at 15%. Note that the bias of the Jepson-Heeger algorithm is approximately 0 for an error of the calibration parameters of about 15%. This is the situation when the two bias terms cancel each other. FOV for these experiments was 90°.
8.3 The error of the translation estimate expressed in angular units as a function of the calibration error, in the same conditions as in Fig. 8.4. The error is always increasing for increasing calibration error.
8.4 Dependence of the bias due to noisy calibration on the FOV of the camera. 30% noise on camera parameters is generated while optical flow is noiseless. The magnitude of the bias increases with increasing FOV.
8.5 Dependence of the linear velocity bias on the focal length. A data set is generated with f = 1 and then the linear velocity is estimated assuming a measured focal length f ∈ [0.1, 10]. The estimated velocity v_z is plotted normalized to the ground truth.
8.6 Synthetic sequence. (a) Original model. (b) Model distorted by an underestimated focal length. (c) Model distorted by an overestimated focal length.
8.7 (a) Original model with estimated v_z = 0.5530. (b) Model distorted by an underestimated focal length of 50%, v_z = 0.3975. (c) Model distorted by an overestimated focal length of 50%, v_z = 0.7597.

List of Tables

4.1 Planar residuals for the calibration grid sequence. Units are focal lengths.
4.2 Planar residuals. Units are focal lengths.

5.1 Planar residuals, two views setting. Units are focal lengths.

9.1 Planar residuals, two views setting. Units are focal lengths.


Notation

O: camera center of projection
f: camera focal length
(o_x, o_y): camera principal point
(s_x, s_y): camera effective pixel size in the horizontal and vertical direction
k_1: first order radial distortion
π: focal plane
K: matrix of camera internal parameters
M_ext: matrix of camera external parameters
v: camera linear velocity
ω: camera angular velocity
R: camera rotation
T: camera translation
X = [X, Y, Z]: feature point 3D position
x = [x, y]^T: projection over the focal plane, x = [X/Z, Y/Z]^T
x̄ = [x, y, 1]^T: augmented projection
u = [u_x, u_y]^T: optical flow
ū = [u_x, u_y, 0]^T: augmented optical flow
i: feature index, i ∈ {1, ..., N}
j: frame index, j ∈ {1, ..., M}
t_j: time at which the j-th frame is acquired
F: optical flow field, F = {u_i : i ∈ (1 ... N)}
α: reconstruction scale
E: essential matrix
ℰ: essential space
S: special symmetric space
b: planar flow parameters vector
(p, h): plane normal and plane distance from the origin
I(x): image brightness at x
L: Lagrangian
â: denotes an estimated quantity
[a]_× (a ∈ R³): the skew symmetric matrix built from a
C: covariance matrix


Chapter 1

Introduction

1.1 Motivations

Three dimensional perception of the world is a feature common to many biological vision systems. Humans are an amazing example, since they are able to perceive depth, surface orientation and spatial relationships with remarkable accuracy under most circumstances. The primary mechanism used by the human visual system is stereopsis, i.e. the lateral displacement of objects in two retinal images. Motion parallax, i.e. the differential motion of points relative to the fixation point, is another powerful source of spatial information. The computer vision community has gone to great lengths in order to recreate 3-D perception by computers. The focus has mostly been on stereopsis, for the reason that the underlying geometry is well understood and that with a couple of images high quality reconstructions can be obtained. Moreover, the signal to noise ratio, i.e. the ratio between feature displacements and feature position errors, is very good compared to the motion parallax case. On the other hand, stereo vision is a sort of chicken and egg problem, since reconstruction requires matching of features, which is still an open problem today.

In-between approaches that bypass the matching step have also been developed. Tracking over continuous sequences is a well known problem for which robust solutions exist. Features, for example, can be tracked over a video sequence; when the baseline is large enough, the first and last images of the sequence can be used as a stereo pair, and the tracked features are matched features for stereo reconstruction. The major drawback is that many features are lost during the tracking due to occlusions and inefficiencies, and this information is simply discarded. Alternatively, the epipolar constraint can be used to estimate structure and motion between nearby views, and time integration over the whole sequence can be used to improve the estimates. The main problem here is the geometrical nature of such a constraint, which is poorly conditioned for high noise to signal ratios. The reason will be explained further in this thesis and depends upon the fact that the epipolar constraint is not an if-and-only-if one.


The main problems with optical flow based reconstruction are the high complexity and the difficulty of time integration over sequences. This is due to the fact that optical flow, owing to its noisy nature, does not support efficient tracking. The time integration problem has usually been approached in Kalman filter settings, with different constraints on either the structure or the motion of the camera. Generally speaking, Kalman filters are optimal recursive solutions for linear problems with Gaussian errors. This is rarely the case when real images are used, so convergence can be a serious problem. Moreover, dynamical models for the camera motion can be unknown, for example for hand-held cameras.

In this thesis we present a hybrid approach to the Structure from Motion problem. We track sparse features over a continuous sequence and approximate the optical flow at these locations with the feature displacements. The conditions under which this approximation is good will be discussed later. Optical flow is used to estimate structure and motion over consecutive pairs by enforcing the differential epipolar constraint. Reconstructions relative to different frames are then optimally integrated to provide a final model of the filmed scene.

With this technique the matching problem is reduced, and reliable estimates are obtained even with close pairs of views by using the differential constraints instead of discrete ones. As optical flow is estimated at good locations, matching is also directly provided, making time integration more reliable. Complexity is limited by the sparsity of the optical flow field, and Kalman filter updating can be avoided.

In general dense reconstructions are not provided, but we believe that a sparse model is a good starting point for a future update to dense structure.

Figure 1.1. The 3D reconstruction process is made of three phases: data acquisition, data processing and visualization of the reconstructed models.

1.2 State of the Art

Structure from motion (SfM) has been a very active area of computer vision in the past 30 years. The idea is to recover the shape of objects or scenes from a sequence of images acquired by a camera undergoing an unknown motion. Usually it is assumed that the scene is made up of rigid objects possibly undergoing some kind of Euclidean motion. The roots of the Structure from Motion community can be traced back to two key fields, photogrammetry and computer vision. The vision community, which was traditionally driven more by biology and AI roots, extensively developed computer systems inspired either by stereopsis or motion parallax, which are the mechanisms primarily used by humans. Most such approaches can be classified as feature or optical flow based.

1.2.1 Feature Based Reconstruction

Feature based reconstruction is carried out using corresponding features in pairs of images of the same scene taken from different viewpoints. When the relative position and orientation of the two cameras are known, the 3D position of the imaged point can easily be computed by triangulation. The use of the epipolar geometry for the estimation of the relative motion was first proposed by Longuet-Higgins in the early eighties (Longuet-Higgins 1981). The so-called essential matrix linearly constrains feature points in the two images of the stereo pair:

$$\mathbf{x}_1^T E \mathbf{x}_2 = 0 \quad (1.1)$$

The 8 points algorithm developed by the author has the appealing property of being linear. The relative rotation and translation of the cameras can be estimated by a factorization of the essential matrix. When the camera calibration is unknown, the matrix derived from the constraint in Eq. 1.1 is called the fundamental matrix. This can still be used to estimate motion and then structure, but only up to a projective transformation (Faugeras 1992). Despite its simplicity, the 8 points algorithm has often been criticized for its excessive sensitivity to noise, and many other techniques have been developed. These are mostly based on the minimization of functions of the epipolar distances and usually require iterative optimization techniques. A description and comparison of such techniques can be found in (Luong, Deriche, Faugeras & Papadopoulo 1993, Zhang 1996). Hartley showed (Hartley 1997) that the performance of the 8 points algorithm can be drastically improved by renormalizing point feature coordinates. In his experiments he showed that the final performance is very similar to that of more advanced and complex algorithms. Beardsley and Zisserman (Beardsley, Zisserman & Murray 1994) proposed an interesting technique that uses the weighted 8 points algorithm iteratively. At each stage the estimated essential matrix is used to calculate weights for the features used in the computation. Such weights are estimated by calculating the epipolar distances and then used in the next iteration.
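
To make the linear step concrete, the following sketch (a minimal illustration with our own function names, not the implementation used in this thesis) builds one equation per correspondence from Eq. 1.1 and recovers E from the null space of the resulting design matrix via an SVD:

```python
import numpy as np

def eight_point(x1, x2):
    """Linear estimate of the essential matrix from N >= 8
    calibrated correspondences x1[i] <-> x2[i], each of the
    form [x, y, 1]. A minimal sketch: no coordinate
    normalization or outlier handling."""
    # Each correspondence gives one equation x1^T E x2 = 0,
    # linear in the 9 entries of E (stacked row-wise).
    A = np.array([np.outer(p1, p2).ravel() for p1, p2 in zip(x1, x2)])
    # E is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Enforce the essential-matrix structure: two equal singular
    # values and one zero singular value.
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt
```

The final SVD clamp enforces the two-equal-singular-values structure of an essential matrix; the Hartley-style coordinate normalization mentioned above, omitted here, is what makes this step numerically robust in practice.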

The precision of stereo reconstruction can be improved, when more than two views are available, by multi-viewpoint triangulation. Recent research showed that the projections of points and lines are constrained over triplets of images by a tensor called the trifocal tensor (Shashua & Werman 1995). In general the images of lines and points in m frames are constrained by an m-linear constraint. Ma, Kosecka and Sastry showed that constraints of order higher than four are dependent on the epipolar (bilinear), trifocal (trilinear) and quadrilinear ones (Ma, Kosecka & Sastry 1998). In the multi-view reconstruction context these three constraints are used to match feature points across different views and to estimate camera relative motion by their factorization. Triangulation is used to get a first estimate of the structure. Such an initial estimate is then refined by bundle adjustment. Bundle adjustment is a maximum likelihood estimator that consists in minimizing the re-projection error. The name comes from the fact that such a technique adjusts the bundle of rays between each camera center and the set of 3D points. Due to its extensive use, bundle adjustment has raised the interest of a number of researchers, such that conference special sessions have often been dedicated to this technique (Triggs, Zisserman & Szeliski 1999). An appealing property of multi-view stereo is the possibility of estimating the camera internal parameters simultaneously with structure and motion (auto-calibration) (Triggs 1997, Zeller & Faugeras 1996). This is based on the observation that the absolute conic is fixed under rigid camera motion and that its entries are functions of the internal camera parameters.
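
Bundle adjustment itself is compactly stated as nonlinear least squares over all poses and points. The sketch below is a simplified illustration under a calibrated pinhole model (the parameter packing and the use of scipy are our choices, not something taken from this thesis):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, n_cams, n_pts, obs_cam, obs_pt, obs_xy):
    """Reprojection residuals for calibrated cameras (f = 1).
    params packs n_cams poses (rotation vector + translation)
    followed by n_pts 3D points; obs_cam/obs_pt index which
    camera observed which point at image position obs_xy."""
    poses = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    R = Rotation.from_rotvec(poses[obs_cam, :3])
    Xc = R.apply(pts[obs_pt]) + poses[obs_cam, 3:]   # points in camera frames
    proj = Xc[:, :2] / Xc[:, 2:3]                    # pinhole projection
    return (proj - obs_xy).ravel()

# usage sketch:
# res = least_squares(residuals, x0, method="trf",
#                     args=(n_cams, n_pts, obs_cam, obs_pt, obs_xy))
```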

Reconstruction using multi-viewpoint stereo is nowadays a very popular topic. The groups most active in the last decade are those of Oxford (Beardsley, Torr & Zisserman 1996) and Leuven (Koch, Pollefeys & Van Gool 1998) universities. An excellent summary of their efforts is in the Ph.D. thesis of David Nister (Nister 2001a). The author tries to put together different modules independently developed by the research community in order to get an automatic system for dense 3D reconstruction from monocular sequences. The main building blocks are a frame decimator that eliminates redundant frames (Nister 2000), a projective reconstruction module that extracts corners and lines independently from the images, matches them using RANSAC and returns a projective reconstruction, and finally an auto-calibration step (Nister 2001b) that updates the projective reconstruction to a Euclidean one. Nister uses Kalman filtering to integrate reconstructions from different viewpoints. This follows a well known trend: Soatto and Perona (Soatto & Perona 1998a, Soatto & Perona 1998b, Chiuso, Favaro, Jin & Soatto 2000) studied extensively the possibility of applying such linear filtering to the structure from motion problem when dynamical models for camera motion are missing. They show how to design a completely observable filter in which the measurement vectors and the state vectors are the same, so the Kalman filter is linear and hence does not have any linearization problems. They prove the stability of the estimation algorithm they propose and show how to approach the problem of occlusions by using appropriate sub-filters.

In the orthographic limit a very popular approach for multi-viewpoint reconstruction is the Tomasi-Kanade factorization (Costeira & Kanade 1998, Han & Kanade 1999, Morita & Kanade 1997). The authors showed that structure and motion (in this approximation just a rotation) can be computed by factorization of a matrix built from the feature point coordinates. The factorization is based on an SVD and is therefore very easy to implement. More advanced variants were developed later to include lines and covariance matrices (Irani & Anandan 2000).
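
A toy version of this factorization (a sketch with our own names, which omits the metric upgrade step that resolves the remaining affine ambiguity):

```python
import numpy as np

def factorize(W):
    """Tomasi-Kanade style factorization of a 2F x N measurement
    matrix W of image coordinates (F frames, N tracked points).
    Returns affine motion M (2F x 3) and shape S (3 x N); the
    metric upgrade that turns M into rotations is omitted."""
    W = W - W.mean(axis=1, keepdims=True)  # register to the centroid
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Under noise-free orthographic projection W has rank 3,
    # so keep only the three largest singular values.
    M = U[:, :3] * np.sqrt(s[:3])
    S = np.sqrt(s[:3])[:, None] * Vt[:3]
    return M, S
```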

The matching of features between pairs of images is the major weakness of feature based methods. Although huge efforts have been made (Meer, Kim & Rosenfeld 1991, Torr & Murray 1997, Tell 2002), robustness of feature matching techniques is still an open problem.


1.2.2 Flow Based Reconstruction

In the differential setting, matching is replaced by optical flow. This is the feature velocity field generated by the camera motion. The estimation of optical flow is based on the image brightness constancy equation, which states that the apparent brightness I(x, t) of moving objects remains constant over time. This implies:

$$\frac{dI}{dt} = \nabla_x I \cdot \mathbf{u} + \frac{\partial I}{\partial t} = 0 \quad (1.2)$$

There is quite a large number of different techniques for optical flow estimation. Roughly, they can be classified into three groups: feature based, gradient based and correlation based techniques. A review and comparison of the most popular algorithms can be found in (Barron, Fleet & Beuchemein 1994).
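
As an example of a gradient based technique, a Lucas-Kanade style estimator solves Eq. 1.2 in the least squares sense over a small window, assuming the flow is constant inside it (a minimal sketch with our own names):

```python
import numpy as np

def lk_flow(Ix, Iy, It):
    """Least-squares flow for one window from the brightness
    constancy equation Ix*ux + Iy*uy + It = 0 at every pixel.
    Ix, Iy, It are arrays of spatial/temporal derivatives over
    the window."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    # Solve A u = b in the least-squares sense. A^T A is the 2x2
    # structure tensor whose conditioning reflects the aperture
    # problem: it is singular along a straight edge.
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u  # [ux, uy]
```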

The scene structure and camera motion are elegantly tied to the optical flow by the differential epipolar constraint (see Chapter 2 for an explanation of the notation):

$$\mathbf{u}(\mathbf{x}) = \frac{1}{Z} A(\mathbf{x})\mathbf{v} + B(\mathbf{x})\omega \quad (1.3)$$

Structure and motion can be computed either starting from the estimated flow, or by inserting a suitable parameterization of the optical flow u in Eq. 1.2 and extracting structure and motion from the resulting equations. The latter approach is usually known as direct methods (Irani & Anandan 1999) and started being popular at the beginning of the nineties (Hanna 1991, Hanna & Okamoto 1993).

Due to its noisy nature, dense optical flow is not well suited for tracking purposes. Moreover, dense flow fields are estimated at pixel locations, so that some kind of re-sampling is required during the tracking between consecutive frames. The error introduced by this dense tracking is difficult to evaluate (Xiong & Shafer 1998).

Researchers promptly found out that the least squares optimal structure can be estimated directly once the camera velocities are known. The derivative of the squared norm of the optical flow residuals r(x) = u* − u(ω, v, Z) yields a linear expression in the depth Z:

$$\frac{\partial \|\mathbf{r}(\mathbf{x})\|^2}{\partial Z} = 0 \;\Rightarrow\; Z = f(\omega, \mathbf{v}, \mathbf{x}, \mathbf{u}^*) \quad (1.4)$$
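
Carrying out this step explicitly, with the flow model of Eq. 1.3 and the shorthand $\mathbf{w} = \mathbf{u}^* - B\omega$:

$$\|\mathbf{r}\|^2 = \left\| \mathbf{w} - \frac{1}{Z} A\mathbf{v} \right\|^2, \qquad
\frac{\partial \|\mathbf{r}\|^2}{\partial Z} = \frac{2}{Z^2}\,\mathbf{w}^T A\mathbf{v} - \frac{2}{Z^3}\,\|A\mathbf{v}\|^2 = 0
\;\Rightarrow\; Z = \frac{\|A\mathbf{v}\|^2}{\mathbf{w}^T A\mathbf{v}}$$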

For this reason researchers mostly focused on the problem of ego-motion estimation. Camera velocities have often been obtained by eliminating the depth from the residual equation, either through algebraic manipulation (Heeger & Jepson 1992, MacLean 1999, Kanatani 1993b) or by back substitution of the expression for Z (Bruss & Horn 1983). Although the solutions proposed have the appealing property of being closed form or fast converging, they lead to biased estimates of the ego-motion. The reason is that the objective functions minimized do not satisfy the property of being rotationally invariant (Zhang & Tomasi 1999).

As in the discrete case, the Kalman filter has proved to be an efficient tool for multi-view integration, especially for keeping the complexity constant over time. This is particularly important since in the differential setting the number of frames processed is usually large.

Matthies, Szeliski and Kanade (Matthies, Szeliski & Kanade 1988, Matthies, Szeliski & Kanade 1989) proposed a Kalman filter based algorithm where the motion is a lateral translation of known magnitude. In this working condition the optical flow is just the disparity, and their algorithm operates by filtering the optical flow itself. Temporal matching between views is obtained by linear interpolation of the image brightness function. Heel (Heel 1991) proposed an algorithm for general known camera motion. The features' depth is propagated in time by the known motion and estimated at each frame by Eq. 1.2. Temporal matching is here achieved by propagating the structure ahead in time, then re-projecting such structure over the next image and interpolating the image brightness. Later Heel (Heel 1990) extended his approach to unknown motion, showing that the camera linear and angular velocities can be estimated from the prior structure for each frame. Heel's algorithm has been severely questioned by Tomasi (Tomasi 1991) regarding convergence. Xiong and Shafer (Xiong & Shafer 1998) proposed an Extended Kalman filter framework where the structure and motion are the state variables and the optical flow the measurements. They show how to get a complexity O(N) for each iteration of the filter by approximated SVDs. As in Heel's work, they estimate the motion at each instant using the propagated structure from the instant t − 1. They use bilinear interpolation to re-sample pixels in the next image.

Dense flow can create serious problems for these algorithms along occlusion boundaries or in regions of low texture; these problems are in general not addressed. Generally speaking, the Kalman filter is an optimal estimator under the hypothesis of linearity and Gaussian noise. This is rarely the case when real images are used.

The system we develop in this thesis is of a hybrid type: the main building blocks are presented in Fig. 1.2. We track sparse features over sequences acquired at 25 Hz from a hand-held camera. During the tracking, good features can be selected as those lying in highly textured areas. This guarantees higher precision in the estimation of feature displacements. Such displacements are used to approximate optical flow. We demonstrate that this approximation is a good one for our working conditions. Using this approach we bypass the matching problem of stereo and the complexity and time integration problems of optical flow based reconstruction. Time integration is obtained by an optimal predict-update procedure that merges measurements by re-weighting them according to their respective covariances. Most of the research effort of this thesis is focused on the robust estimation of structure and motion from a pair of images and the related optical flow field. We first test a linear solution that has the appealing property of being closed form but the problem of returning biased estimates. We propose a non-linear refinement of the linear estimator, showing convergence properties and improvements in bias and variance. We further extend the non-linear estimator to incorporate the optical flow covariance matrix (maximum likelihood) and, moreover, we show that in the case of dense sequences it is possible to locally time-integrate the reconstruction process for increased robustness. We evaluate the possibility of introducing geometrical constraints in the structure and motion estimation. Such constraints are of bilinear type, i.e. planes, lines and incidence of these primitives are used. For this purpose we present a new motion based segmentation algorithm able to automatically detect and reconstruct planar regions.
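
The covariance re-weighting in the predict-update step amounts to an information-form average of the propagated and the newly measured estimates. A minimal sketch for per-feature depths (our own formulation of the standard update, not code from the thesis):

```python
import numpy as np

def fuse(x_pred, var_pred, x_meas, var_meas):
    """Merge a predicted and a measured estimate by weighting
    each with the inverse of its variance (information form).
    Works elementwise on arrays of per-feature depths."""
    w_pred = 1.0 / var_pred
    w_meas = 1.0 / var_meas
    x = (w_pred * x_pred + w_meas * x_meas) / (w_pred + w_meas)
    var = 1.0 / (w_pred + w_meas)   # fused estimate is never less certain
    return x, var
```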

[Figure 1.2: block diagram of the system, with the modules Calibrated Video Sequence, Acquire Frame n, Tracking from n−1 to n, Structure and Motion, Integration into the model, and 3D Model.]

Figure 1.2. Block diagram of the algorithm developed in this thesis.

Previous research in the discrete setting has shown that geometrical constraints can be effectively used for the solution of the structure from motion problem (Grossman & Santos-Victor 2000a, Szeliski & Torr 1998, Bondyfalat & Bougnoux 1998).

The problem of dense reconstruction is not approached, but we believe that the estimation of camera motion and of the sparse scene structure can be an excellent starting point for dense reconstruction.

1.3 Contributions

In line with the above motivation, the main contributions of this thesis are listed below:

• We provide a performance comparison between differential and discrete approaches. This demonstrates the tradeoff between the two techniques in terms of baselines (chapter 3).

• We provide a linear algorithm for recursive reconstruction (chapter 4). The algorithm estimates structure and motion for consecutive pairs of views and then integrates such reconstructions over time. Since only linear optimization techniques are used, the algorithm is closed form and can run almost in real time. In this framework we also present a new technique to handle the scale indeterminacy between reconstructions from different pairs of frames. Unlike the widely used approach of fixing one feature, we show how the rescaling can be done independently for each feature point by solving a first order differential equation. When speed is not a major concern, we show how estimation improvements can be obtained by a non-linear refinement technique of which the linear algorithm is the initialization.

• Linear techniques are fast and closed in form, but they turn out to be biased due to a poor choice of the objective function. Moreover, due to the aperture problem, optical flow errors are anisotropic. We provide the maximum likelihood formulation of the problem, based on a rotationally invariant objective function that takes into account anisotropies in the errors. We propose a suboptimal iterative solution to the minimization problem that uses the output of the linear algorithm as a starting point (chapter 5).

• We show how to locally integrate reconstruction over several views without the need for rescaling (chapter 5).

• We develop a new motion based segmentation algorithm to automatically segment multiple planar scenes. Since we use the projective flow approximation, the algorithm is capable of simultaneously detecting planes and reconstructing their structure (chapter 6).

• We show how planarity constraints can be incorporated in the non-linear algorithm we proposed in chapter 5. Constraints on the relative orientation of planes can also be incorporated by using a minimization technique called direct projection (chapter 7).

• We analyze how camera calibration errors affect flow based motion and structure estimation (chapter 8). We derive analytical results and then provide tests using a variety of structure from motion algorithms.

1.4 Outline

The thesis is organized according to the structure outlined below:

Chapter 2 This chapter is dedicated to the review of tools and models used throughout the whole thesis. We present the camera model, the tracking strategy adopted and the camera calibration algorithm we used to estimate internal parameters for the real sequences. Moreover, we present the simulation benchmarks and the error functions we used to assess the performance of the algorithms we studied.

Chapter 3 In this chapter the properties of the epipolar and differential epipolar constraints are reviewed. These are the basic tools used for SfM in the feature based and flow based settings. Their efficiency for solving the SfM problem is compared using simple closed-form algorithms.


Chapter 4 In this chapter a recursive algorithm for structure from motion is described. Camera velocities and scene structure are estimated for each frame by a cascade of two linear algorithms. Moreover, we describe a new technique to handle the scale ambiguity across frames. This is based on the solution of a first order differential equation. This technique is compared with the more classical approach of fixing one feature. When speed is not a major concern, better results can be obtained by a non-linear refinement solved with a Gauss-Newton iterative technique. Time integration of the structure estimates is performed with a propagate-update procedure similar to that used for Kalman filters. In our algorithm we do not assume motion smoothness.

Chapter 5 Closed form solutions to the structure from motion problem are shown to be biased and excessively noisy, due to a poor choice of the objective function. We propose a maximum likelihood formulation that takes into account errors on the optical flow parameters. Moreover, we show how to improve the precision of the estimates by local time integration of the differential epipolar constraint. We compare our formulation with previous work both on simulated and real data sets.

Chapter 6 In this chapter we develop a new motion based segmentation algorithm that finds planar surfaces starting from sparse optical flow fields. Optical flow for planar surfaces is reviewed, and a time integrated version of the relative flow equations is used to grow clusters for planar regions present in the image. Unlike many other algorithms, we make use of the projective planar flow equations, so that our algorithm is able to segment and simultaneously estimate the piecewise planar structure.

Chapter 7 Detected planes can be effectively used to improve structure and motion estimation. The algorithm presented in chapter 4 is here modified to constrain feature points to planes and lines, and to incorporate bilinear constraints among such entities, i.e. incidence or parallelism. Compared to chapter 4, the algorithm we use here performs a constrained minimization. We use the method of direct projections to enforce constraints during a Levenberg-Marquardt style iterative optimization.

Chapter 8 This chapter is dedicated to the problem of camera calibration. As we will show, just two of the calibration parameters can be computed simultaneously with the structure and motion. Given the impracticability of camera auto-calibration, it is necessary to resort to off-line calibration strategies. The effect of errors in the calibration parameters on structure and motion estimation has been studied extensively in the discrete setting, while there is a large lack of analytical results and experimental assessment in the differential one. Here we review some of the most popular approaches for SfM and demonstrate analytically and experimentally the biasing effects of noisy calibrations.


1.5 List of Papers

The thesis is based on the following articles:

1 A Comparison of Stereo Based and Flow Based Structure from Parallax

Marco Zucchelli and Henrik I. Christensen

Symposium on Intelligent Robotics Systems 2000, pages 199-207

2 Recursive Flow Based Structure from Parallax with Automatic Rescaling

Marco Zucchelli and Henrik I. Christensen

British Machine Vision Conference 2001, pages 183-192

3 Motion Bias and Structure Distortion induced by Calibration Errors

Marco Zucchelli and J. Kosecka

British Machine Vision Conference 2001, pages 663-672; also submitted to IEEE Pattern Analysis and Machine Intelligence

4 ML Structure and Motion Estimation Integrated over Time

Marco Zucchelli, Jose Santos-Victor and Henrik I. Christensen

International Conference on Pattern Recognition 2002

5 Optical Flow Based Structure and Motion Estimation with Geometrical Constraints

Marco Zucchelli, Jose Santos-Victor and Henrik I. Christensen

International Conference on Pattern Recognition 2002

6 Automatic Segmentation of Multiple Planes

Marco Zucchelli, Jose Santos-Victor and Henrik I. Christensen

submitted to British Machine Vision Conference

7 Automatic Segmentation and Reconstruction of Piecewise Planar Scenes

Marco Zucchelli, Jose Santos-Victor and Henrik I. Christensen

submitted to Computer Vision and Image Understanding

The following paper has also been published but does not belong to the thesis:

8 An Application of the Learnable Evolution Model to 3D Scene Reconstruction

Guido Cervone and Marco Zucchelli

Artificial Intelligence and Applications 2001, pages 403-408


Chapter 2

Camera Model and Benchmarks

Throughout the thesis we will make use of a number of models, approximations and algorithms. This introductory chapter is dedicated to the review and discussion of such issues, and it is intended to be a reference for the reader.

The projective pinhole camera model we adopted is described in detail and the projection equations reviewed. Camera motion is modelled as a rigid body motion and characterized by a linear and an angular velocity vector. Under these assumptions the optical flow equations are deduced and reviewed. The optical flow field is estimated at sparse locations by measuring the displacement of features between consecutive frames, in the same fashion as in the Lucas-Tomasi-Kanade tracker. Such a displacement is shown to be a realistic approximation under our working conditions.

Structure and motion estimation are affected by what is called a gauge indeterminacy. Camera motion and scene structure can be determined just up to a scale and a Euclidean transformation, i.e. a translation and a rotation. We review and discuss such indeterminacy.

In our work we assume that the camera is already calibrated. The reason is that, in the differential framework, just two of the calibration parameters can be estimated simultaneously with the structure and motion. Calibration is performed by using views of a planar pattern of known dimensions. This system is flexible and guarantees precise estimates of the internal parameters.

Throughout the whole thesis we make extensive use of simulated data sets to assess algorithm performance. The benchmarks are presented here together with the error measures.

2.1 Camera Model

The pinhole camera is the most common geometric model for intensity cameras. It is characterized by a focal plane π and a 3D point O called the center of projection (see Fig. 2.1). The distance f between π and O is called the focal length, and the line through O perpendicular to π is the optical axis. The intersection o of the optical axis with π is named the principal point or image center.


[Figure 2.1: pinhole geometry, showing the focal plane π, the center of projection O, the focal length f, the optical axis, an image point [x, y, f], and the camera, image and world reference frames.]

Figure 2.1. Pinhole projective camera model

The image of a 3D point is given by the intersection of the focal plane π and the ray going through the point and the optical center O. The reference frame centered in O and with the axis Z parallel to the optical axis is usually called the camera frame. If X = [X, Y, Z]^T indicates the 3D point position and x its projection in the camera frame, the projection equations have the form:

$$\mathbf{x} = f\left[\frac{X}{Z}, \frac{Y}{Z}, 1\right]^T \quad (2.1)$$

The position of the projected 3D point in a 2D frame centered at the image center is indicated as x = [x, y]^T.

In general the projection of the 3D points over the focal plane is expressed in a local coordinate system. Defining (s_x, s_y) as the effective pixel size in the horizontal and vertical direction, (o_x, o_y) as the coordinates of the optical center over the focal plane and k_1 as the first order radial distortion, the calibrated projection x becomes:

$$\lambda \mathbf{x} = K\mathbf{X} \quad (2.2)$$

where λ is a non-zero constant proportional to the point depth (see also (Trucco & Verri 1998)) and K is the matrix of the internal parameters:


$$K = \begin{bmatrix} -\frac{f}{s_x} & k_1 & o_x \\ 0 & -\frac{f}{s_y} & o_y \\ 0 & 0 & 1 \end{bmatrix} \quad (2.3)$$
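
In code, Eq. 2.3 is a one-liner; this small helper (our own, following the sign convention and the k_1 entry of the matrix above) can feed the projection sketch further below:

```python
import numpy as np

def intrinsics(f, sx, sy, ox, oy, k1=0.0):
    """Internal parameter matrix K of Eq. 2.3."""
    return np.array([[-f / sx, k1, ox],
                     [0.0, -f / sy, oy],
                     [0.0, 0.0, 1.0]])
```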

In general the camera reference frame is not known, and a common problem is determining its location with respect to a fixed world frame. The camera position relative to such a world frame at time t is described by the rotation R(t) around the optical center and the translation T(t) of this with respect to the origin of the world reference frame (rigid body motion). The transformation between the camera and world reference frames is:

$$\mathbf{X}_w = R\mathbf{X}_c + \mathbf{T} \quad (2.4)$$

Defining the extrinsic camera matrix as:

$$M_{ext} = \left(\, R \;\middle|\; -R\mathbf{T} \,\right) \quad (2.5)$$

the general projection x of the point X over the focal plane is defined by:

$$\lambda \mathbf{x} = K M_{ext} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \quad (2.6)$$
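
Putting Eqs. 2.3-2.6 together, a projection is a 3×4 matrix applied to homogeneous world coordinates, followed by division by the depth. A minimal numpy sketch of these equations (function and variable names are ours):

```python
import numpy as np

def project(K, R, T, Xw):
    """Project Nx3 world points Xw with intrinsics K and camera
    pose (R, T), following lambda * x = K [R | -R T] [Xw; 1]."""
    Mext = np.hstack([R, -R @ T.reshape(3, 1)])   # 3x4 extrinsics, Eq. 2.5
    Xh = np.hstack([Xw, np.ones((len(Xw), 1))])   # homogeneous coordinates
    p = (K @ Mext @ Xh.T).T                       # lambda * [x, y, 1]
    return p[:, :2] / p[:, 2:3]                   # divide out the depth
```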

The motion of the camera generates a motion field of the projected points over the optical plane. Assuming that the camera is a rigid body, X_c satisfies the differential equation:

$$\dot{\mathbf{X}}_c = \omega \times \mathbf{X}_c + \mathbf{v} \quad (2.7)$$

where (v, ω) are the linear and angular velocities of the camera. The relationship between the image plane motion field u(x) = [u_x, u_y]^T and the motion of the camera can be expressed as (see Fig. 2.2):

u(x) = (1/Z) A(x) v + B(x) ω    (2.8)

The matrices A(x) and B(x) are functions of the image coordinates, defined as follows (see (Heeger & Jepson 1992)):

B = [ −xy        1 + x²   −y
      −(1 + y²)  xy        x ] ;    A = [ 1  0  −x
                                          0  1  −y ]

It can easily be seen that the optical flow can be separated into two distinct contributions, one generated by the angular velocity and one by the linear translation:

u = u_v + u_ω    (2.9)
u_v = (1/Z) A(x) v    (2.10)
u_ω = B(x) ω    (2.11)
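The decomposition of Eqs. 2.9–2.11 is easy to verify numerically. The sketch below (an illustration, not thesis code; names are mine) evaluates u(x) = (1/Z)A(x)v + B(x)ω and checks that the rotational part is independent of depth:

import numpy as np

def flow_field(x, y, Z, v, w):
    """Optical flow u = (1/Z) A(x) v + B(x) w at the normalized point (x, y)."""
    A = np.array([[1.0, 0.0, -x],
                  [0.0, 1.0, -y]])
    B = np.array([[-x * y, 1.0 + x * x, -y],
                  [-(1.0 + y * y), x * y, x]])
    return A @ v / Z + B @ w

# A purely rotational camera produces flow that does not depend on Z:
u1 = flow_field(0.1, 0.2, 2.0, np.zeros(3), np.array([0.0, 0.01, 0.0]))
u2 = flow_field(0.1, 0.2, 8.0, np.zeros(3), np.array([0.0, 0.01, 0.0]))
print(np.allclose(u1, u2))  # True: no structure information in the rotational part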

Figure 2.2. Camera motion and optical flow

Note that the rotational component of the flow field does not contain any structure information.

Eq. 2.7 and Eq. 2.8 contain inherent indeterminacies which are usually called gauge freedoms (Morris, Kanatani & T. 2001). Under a perspective camera, shape and motion parameters are estimated up to an unknown similarity (scale indeterminacy). Moreover, the absolute rotation and orientation of the world reference frame is arbitrary. Due to such indeterminacies the Structure From Motion problem has, rather than a single solution, a manifold of solutions M that are all mapped into the same measurements by perspective projection. The dimension of the manifold is 7: 3 angles to define the rotation, 3 displacements to define the translation, and a scale. Fixing these quantities is called choosing a gauge. The gauge G is defined by a set of constraint equations over the structure and motion parameters θ (see Fig. 2.3). Morris and Kanade (Morris et al. 2001) showed that the gauge choice affects the parameter uncertainty and derived a Geometric Equivalence Relationship with which covariances under different parameterizations and gauges can be compared.

In this thesis the gauge is fixed so that the world reference frame is characterized by a null rotation and translation. The global scale is fixed such that the initial camera velocity is of unit norm, i.e. ‖v(t = 0)‖ = 1.


Figure 2.3. The manifold M and the structure and motion solution θ lie in the parameter space T. Choosing a gauge G that intersects the manifold M fixes a unique solution θ_G

2.2 Camera Calibration

In this thesis we use calibrated video sequences. The main reason is, as explained in chapter 8, that in the differential setting auto or self calibration cannot be pursued. In general just two of the camera internal parameters can be recovered simultaneously with structure and motion, so an a priori and at least partial camera calibration is necessary.

The sequences shown later on are shot with three different commercial CCD cameras. The cameras were calibrated at the beginning of the acquisition. In general the internal parameters do not change sensibly if the focal length is kept constant. We acquired sequences over a period of some weeks and re-calibrated the camera regularly. No variation in the value of the internal parameters was noticed that was not compatible with the estimation errors.

There exists quite a large number of different calibration techniques, according to the different applications. In most cases either one view of a cube-like calibration grid or several views of a planar grid (see Fig. 2.4) are used (Tsai 1986). For practical reasons we used the latter approach, since it is quite easy to get images of a plane from different view points and the user interaction in the calibration process is quite limited.

For simplicity we assume that the viewed plane is at Z = 0 in the world reference system. For a camera viewing the plane we have:


Figure 2.4. Set of images used for camera calibration

λx = K [R | T] [X, Y, 0, 1]^T = K [r_1  r_2  T] [X, Y, 1]^T    (2.13)

where r_1 and r_2 are the first and second columns of the rotation matrix R. Let us denote H ≐ [h_1 h_2 h_3] = K [r_1 r_2 T]. The 3 × 3 matrix H is a homography and projects the points of the planar surface of equation Z = 0 onto the focal plane. Such a homography is defined up to a scale, so only 8 of its parameters are independent. The matrix H can easily be determined by linear least squares minimization. Defining C = [X, Y, 1]^T and a = [h_1^T, h_2^T, h_3^T]^T, Eq. 2.13 can be rewritten as:

[ C^T   0^T   −x C^T
  0^T   C^T   −y C^T ] a = 0    (2.14)


When n points are available we can rewrite Eq. 2.14 as L a = 0, where L is a 2n × 9 matrix. The homography can be estimated by least squares as the eigenvector associated with the minimum eigenvalue of L^T L. The solution is then refined by solving the problem:

H = arg min_H Σ_i (x_i − x_i^*)^T (x_i − x_i^*)    (2.15)

where x^* indicates the measured projections. Minimization is performed in Matlab by a Levenberg-Marquardt technique.

Using the orthonormality of r_1 and r_2 we obtain the two constraints:

h_1^T K^{−T} K^{−1} h_2 = 0    (2.16)
h_1^T K^{−T} K^{−1} h_1 = h_2^T K^{−T} K^{−1} h_2    (2.17)

The two equations above are linear in the elements of the matrix Q = K^{−T} K^{−1}. Stacking the elements of Q in the 6 × 1 vector q, the system in Eq. 2.16 reduces to:

Gq = 0 (2.18)

where G is a 2m × 6 matrix and m is the number of views of the planar grid. If three or more views of the planar surface are available, the matrix Q can be estimated by computing the eigenvector relative to the minimum eigenvalue of the matrix G^T G. The elements of the matrix K can be estimated uniquely, but up to a constant, by parameterizing Q in terms of the calibration parameters and solving the resulting system of equations (see (Zhang 2000) for more details). The extrinsic parameters can also be easily estimated when K is known by:

r_1 = K^{−1} h_1    (2.19)
r_2 = K^{−1} h_2    (2.20)
r_3 = r_1 × r_2    (2.21)
T = K^{−1} h_3    (2.22)

Due to the noise, the rotation matrix so estimated will not be exactly orthogonal. A method to estimate the best rotation matrix from a general 3 × 3 matrix is described in appendix A.
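The homography step of Eq. 2.14 reduces to a small eigenvalue problem. A minimal sketch (function names are mine; the non-linear refinement of Eq. 2.15 and the intrinsic recovery from Eq. 2.18 are omitted):

import numpy as np

def homography_dlt(C, x):
    """Estimate H mapping planar points C (Nx2, lying on Z = 0) to image points
    x (Nx2). Builds the 2n x 9 matrix L of Eq. 2.14 and returns the eigenvector
    of L^T L associated with the minimum eigenvalue, reshaped row-wise to 3x3."""
    rows = []
    for (X, Y), (u, v) in zip(C, x):
        Ch = np.array([X, Y, 1.0])
        z = np.zeros(3)
        rows.append(np.concatenate([Ch, z, -u * Ch]))
        rows.append(np.concatenate([z, Ch, -v * Ch]))
    L = np.array(rows)
    _, V = np.linalg.eigh(L.T @ L)   # eigenvalues returned in ascending order
    return V[:, 0].reshape(3, 3)

Each view's estimated H then contributes its two rows to the matrix G of Eq. 2.18 via the constraints of Eqs. 2.16–2.17.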

Using this technique precise calibrations can be obtained with little effort. Figures 2.4 and 2.5 report the 12 images used to calibrate a CCD camera and the computed camera extrinsic parameters. The estimated calibration matrix was:

K_1 = [ 1976.8 ± 12.0   0               352.7 ± 7.3
        0               2145.1 ± 12.6   267.9 ± 7.0
        0               0               1           ]    (2.24)

where the errors are three standard deviations.

Figure 2.5. Camera extrinsic parameters

2.3 Optical Flow Estimation

As stated before we are interested in tracking sparse point features. Under the assumption of a dense camera sampling rate, feature displacements are a good approximation of the optical flow (Barron et al. 1994) and at the same time temporal matching is provided.

2.3.1 Point Features Tracking over Multiple Frames

Following (Tomasi & Kanade 1991, Lucas 1985, Adelson & Bergen 1985, Simoncelli 1993, Simoncelli, Adelson & Heeger 1991, Shi & Tomasi 1994) we implemented a Lucas-Kanade tracking algorithm. This assumes that the image intensity pattern I(x) changes with time according to:

I_{t+τ}(x + δ) = I_t(x)    (2.25)

This means that the image at time t + τ can be obtained from the image at time t by moving each point by a suitable amount. An affine motion field is an approximation of the point motion, according to which:

δ = Dx + d (2.26)

Tracking points consists of solving the equation:

I_{t+τ}(x + Dx + d) = I_t(x)    (2.27)


over a window W centered at x in the frame I_t. The solution of such an equation is determined by minimizing:

ε = ∫∫_W [I_{t+τ}(x + Dx + d) − I_t(x)]² w(x) dx    (2.28)

where w(x) is a weight function that we chose to be Gaussian. To minimize the residual in Eq. 2.28 we take the derivatives of the residual ε with respect to the unknowns and then linearize the dissimilarity in Eq. 2.27 by Taylor expansion. This yields a 6 × 6 linear system of the form (Shi & Tomasi 1994):

T f = a (2.29)

where f^T = [D_11, D_12, D_21, D_22, d_1, d_2]. The vector a has the form:

a = ∫∫_W [I_{t+τ}(x) − I_t(x)] [x I_x, x I_y, y I_x, y I_y, I_x, I_y]^T w(x) dx    (2.30)

where I_{x,y} are the derivatives in the x and y directions of the image intensity I(x). The matrix T is defined as:

T = ∫∫_W [ A    B
           B^T  G ] w(x) dx    (2.31)

where A, B and G are functions of I_{x,y} and x. When the inter-frame distance is small, Eq. 2.26 is an overparameterization of the feature motion and may result in large errors. Better results from the matching procedure can be obtained by solving the smaller system:

G d = [a_5, a_6]^T    (2.32)

This is equivalent to assuming that the feature motion δ is a simple translation, i.e. δ = d. The affine similarity is used to monitor the tracker efficiency over multiple frames: the residual ε is estimated between the first and the current frame by fitting the whole affine motion model to the image points. When ε goes over a certain threshold the feature is abandoned. Reasons for going over threshold are, for example, drift or occlusion boundaries.
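As a concrete, deliberately simplified sketch of the pure-translation step (Eq. 2.32), the following illustrative code iterates d ← d + G⁻¹a over a square window. The thesis tracker additionally uses a Gaussian weight w(x), subpixel interpolation and coarse-to-fine processing, all omitted here; names and the nearest-pixel shift are my simplifications:

import numpy as np

def lk_translation(I0, I1, cx, cy, half, iters=5):
    """Solve G d = a (Eq. 2.32) for the translation d of a window of I0
    centered at the integer pixel (cx, cy) into the frame I1. Assumes a
    textured window so that G is invertible."""
    Iy, Ix = np.gradient(I0)                 # template image derivatives
    win = np.s_[cy - half: cy + half + 1, cx - half: cx + half + 1]
    gx, gy = Ix[win].ravel(), Iy[win].ravel()
    G = np.array([[gx @ gx, gx @ gy],
                  [gx @ gy, gy @ gy]])
    d = np.zeros(2)
    for _ in range(iters):
        # evaluate I1 at x + d (nearest-pixel shift, for brevity)
        shifted = np.roll(np.roll(I1, -int(round(d[1])), axis=0),
                          -int(round(d[0])), axis=1)
        err = (I0[win] - shifted[win]).ravel()
        a = np.array([gx @ err, gy @ err])
        d += np.linalg.solve(G, a)           # Gauss-Newton update of d
    return d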

Good features to track are selected by analyzing the eigenvalues of G. This matrix is approximately the Hessian of the image intensity I(x) at x:

G = [ I_x²      I_x I_y
      I_y I_x   I_y²    ]    (2.33)

We want the matrix G to be well conditioned, which requires that the eigenvalues λ_1 and λ_2 are similar in magnitude. In addition we want the region around the observed feature to be well textured, which requires that the eigenvalues are large. In general these two criteria are met when the minimum eigenvalue is sufficiently large. In fact the variations in image intensity in the window are bounded, which also bounds the magnitude of the larger eigenvalue. So imposing the condition:

min(λ1, λ2) > λ (2.34)

helps to select good features to track. Moreover edges, for which λ_1 ≫ λ_2, are not selected. This reduces the error generated by the aperture problem.
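A minimal sketch of this selection rule, assuming scipy is available for the windowed sums (the closed-form expression computes the smallest eigenvalue of the 2 × 2 matrix G of Eq. 2.33; border handling and non-maximum suppression are omitted, and the threshold value is illustrative):

import numpy as np
from scipy.ndimage import uniform_filter

def good_features(I, half=3, lam=1e-2, max_feats=100):
    """Select candidate features by thresholding min(l1, l2) of G (Eq. 2.34)."""
    Iy, Ix = np.gradient(I)
    size = 2 * half + 1
    sxx = uniform_filter(Ix * Ix, size=size)   # windowed entries of G
    syy = uniform_filter(Iy * Iy, size=size)
    sxy = uniform_filter(Ix * Iy, size=size)
    # closed-form smallest eigenvalue of the symmetric 2x2 matrix G
    lmin = 0.5 * (sxx + syy - np.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))
    ys, xs = np.nonzero(lmin > lam)
    order = np.argsort(lmin[ys, xs])[::-1][:max_feats]
    return list(zip(xs[order], ys[order]))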

Techniques like the Lucas-Kanade tracker rely on the assumption that I(x, t) is differentiable and that derivatives can be estimated reliably. In a two frame setting, where temporal smoothing is not allowed, this is achieved by applying the algorithm in a coarse to fine manner (see (Shi & Tomasi 1994)). Other techniques that make use of the derivatives of I(x, t) have been developed by Horn and Schunck (Horn & Schunck 1981) and Nagel (Nagel 1983, Nagel 1987). They are in general aimed at the estimation of dense fields and so are not useful for our work.

In general the use of derivatives can be avoided by region matching, energy based methods and phase based techniques. A description of the most used algorithms and a general comparison of optical flow techniques can be found in (Barron et al. 1994).

2.3.2 Optical Flow Approximation

Optical flow is approximated with the displacement of features between two consecutive frames (Fig. 2.6). It is interesting to see when such an approximation is good, e.g. for motion generated by hand held cameras, and when instead it leads to serious errors. The displacement field is generated by the motion equation:

X(t + Δt) = R^T(t + Δt) (X(t) − T(t + Δt))    (2.35)

For small Δt this can be rewritten as:

X(t + 1) ≅ (I − Δt ω×) X(t) − v Δt    (2.36)

where we used the approximations T(t + Δt) ≅ vΔt and R(t + Δt) ≅ e^{Δt ω(t)×} ≅ I − Δt ω×. The displacement field is easily found to be:

(x(t + 1) − x(t)) / Δt = u(x, y, t) / (1 + Z^{−1}(ω_x Y − ω_y X)Δt + Z^{−1} v_z Δt)    (2.37)

Eq. 2.37 states that even in the case of smooth motion the displacement field can be quite different from the optical flow. To have a satisfactory approximation we must impose that the denominator of Eq. 2.37 is close to 1. Note that the two terms Z^{−1}(ω_x Y − ω_y X)Δt and Z^{−1} v_z Δt represent the motion of the camera along the optical axis, generated respectively by rotation and translation, relative to the distance of the scene from the camera. We finally get the following set of conditions that must be satisfied:


ω̇ Δt² ≪ ω Δt ;  v̇ Δt² ≪ v Δt    (2.38)
ω_x X Δt ≪ Z ;  ω_y Y Δt ≪ Z ;  v_z Δt ≪ Z    (2.39)

Eq. 2.38 implies that the motion must be smooth, so rapid changes in the velocities are not allowed. Eq. 2.39 expresses the fact that the motion along Z must be small compared to the distance of the object from the camera. Fig. 2.7 shows how egomotion estimation behaves when the instantaneous approximation breaks down. Synthetic fields were generated and the linear and rotational velocities estimated with the technique described above. Both the noise-free and the noisy tracking cases are shown. Clearly, when noise is present, a too dense time sampling produces significant errors, due to the fact that the feature displacement becomes smaller while the tracking error is, in general, bounded from below. Observe that a rotation of 90 degrees per second (pretty fast for a video amateur exploring some new environment!) corresponds, at 25 Hz, to ωΔt ≅ 5%. So we can conclude that approximating the optical flow with the displacement field is safe in the case of hand held cameras.
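A quick back-of-the-envelope check of Eq. 2.39 (the point geometry below is illustrative, not from the text):

import numpy as np

# Rotation of 90 deg/s sampled at 25 Hz; a scene point with X = Y = 0.2 Z:
w_dt = np.deg2rad(90.0) / 25.0   # omega * dt per frame, about 0.063 rad
rel = w_dt * 0.2                 # size of Z^-1 (wx Y - wy X) dt relative to 1
print(f"omega*dt = {w_dt:.3f}, denominator correction = {rel:.1%}")
# about 1.3%: the displacement field is a safe approximation of the flow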

Figure 2.6. Approximation of the optical flow with the displacement Δx

Figure 2.7. Angular error due to the displacement approximation (dashed line) and due to displacement approximation and tracking error (continuous line), as a function of the rotation magnitude (degrees). Translational velocity is 0.02 focal lengths per frame in the z direction. Results are similar for different motions.

2.4 Simulation Benchmarks

In this thesis simulations are frequently used to assess algorithm performance. For homogeneity and simplicity, unless otherwise stated, we use the same benchmarks as in (Tomasi & Heeger 1994).

The focal length was set to 1 and the focal plane dimensions to 2 × 2 focal lengths. The field of view is 90°. Random clouds of N = 100 points are generated in a depth range of 2–8 focal lengths. The motion is a combination of rotations and translations. The rotational speed magnitude was constant and chosen to be 0.23 degrees per frame. The magnitude of the linear velocity was chosen to fixate the point at the center of the random cloud. With this setting the average optical flow is about 1 pixel per frame, very similar to real working conditions. Note that the actual parameter defining the working point is the ratio between the linear velocity and the point depths: this is due to the fact that these two terms are numerator and denominator of a fraction in the expression of the optical flow field in Eq. 2.8, and consequently their absolute scale cannot be estimated (scale indeterminacy). Zero-mean Gaussian noise of different standard deviations was added to the components of the velocity to simulate measurement errors.
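A sketch of such a benchmark generator follows. The rotation axis, the random number generator and the way fixation is realized (zero 3D velocity of the cloud center) are my illustrative choices; the flow is evaluated with the components of Eq. 3.9:

import numpy as np

def make_benchmark(N=100, noise=0.0, seed=0):
    """Random cloud at depths 2-8, rotation 0.23 deg/frame, fixating translation.
    Returns image points, (noisy) flows and the true (v, w)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 8.0], size=(N, 3))
    w = np.deg2rad(0.23) * np.array([0.0, 1.0, 0.0])   # arbitrary rotation axis
    c = X.mean(axis=0)
    v = -np.cross(w, c)            # one way to fixate: the center has zero velocity
    x, y, Z = X[:, 0] / X[:, 2], X[:, 1] / X[:, 2], X[:, 2]
    ux = (1 + x * x) * w[1] - y * w[2] - x * y * w[0] + (v[0] - x * v[2]) / Z
    uy = -(1 + y * y) * w[0] + x * w[2] + x * y * w[1] + (v[1] - y * v[2]) / Z
    u = np.stack([ux, uy], axis=1)
    u += noise * rng.standard_normal(u.shape)          # simulated measurement error
    return np.stack([x, y], axis=1), u, v, w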


2.5 Error Functions

We denote with a hat (â) the measured quantities. Bias and sensitivity for the estimates of camera velocities and 3D reconstruction were measured as the mean and standard deviation over a number L of trials. Translation bias was computed as the angle between the true translation direction and the average of the estimates. The average vector v̄ was chosen to be the unit vector that minimizes:

v̄ = arg min_v Σ_{l=1}^{L} cos⁻¹(v̂_l · v)    (2.40)

subject to ‖v‖ = 1. Translation sensitivity was computed as:

σ_v = √( 1/(L−1) Σ_{l=1}^{L} [cos⁻¹(v̄ · v̂_l)]² )    (2.41)

which is the standard deviation of the angles between the true velocity and the estimates.

The angular velocity bias was quantified by computing the average rotation matrix R̄ from the estimates R̂_l. The bias is defined as the angular difference between the true rotation and the average measured rotation:

φ = cos⁻¹[ (Tr(R R̄) − 1) / 2 ]    (2.42)

The rotation sensitivity is computed as the standard deviation of the difference angles φ_l between the estimates R̂_l and R̄ for each trial sample:

σ_φ = √( 1/(L−1) Σ_{l=1}^{L} φ_l² )    (2.43)

Sensitivity in the 3D reconstruction is measured by aligning the ground truth and the reconstructed model. Alignment is performed first by translating the center of mass of the estimated structure to the center of mass of the ground truth. At this point the rotation matrix R that aligns the reconstruction with the ground truth is computed by minimizing:

R = arg min_R Σ_i ‖X_i − R X̂_i‖²    (2.44)

where i ∈ {1 . . . N}. For an optimal solution to this problem we refer to (Kanatani 1993a). At this point the relative scale is computed as:

Figure 2.8. Conversion between the two error measures we use, for the simulation conditions described above (relative optical flow noise in % against optical flow noise in degrees).

α = mean_i ( ‖X_i‖ / ‖X̂_i‖ )    (2.45)

Structure sensitivity is defined as:

σ_X = (1/N) Σ_{i=1}^{N} ‖X_i − X̂_i‖ / (obj. size)    (2.46)

Note that the sensitivity so defined is a pure number. The error on the optical flow is expressed as a percentage of the ground truth:

σ_u = ‖u − û‖ / ‖u‖    (2.47)

Alternatively we represent the 2D velocity of components (u_x, u_y) as a 3D vector u⃗ ≡ (u_x, u_y, 1)^T / √(u_x² + u_y² + 1). The angular error between the correct velocity u⃗ and the estimated velocity û⃗ is defined as:

ψ_E = arccos(u⃗^T · û⃗)    (2.48)

Note that a relative error of about 10% corresponds to an angular error of about 2.5 degrees at about 1 pixel/frame. This error function was first introduced in (Barron et al. 1994). Fig. 2.8 shows the conversion between the two measures for the simulation conditions described above.
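The conversion between the two error measures is a one-liner; a small sketch (illustrative names) of Eq. 2.48 and of the quoted correspondence:

import numpy as np

def angular_flow_error(u_true, u_est):
    """Angular error of Eq. 2.48 between two 2D flow vectors, in degrees."""
    def lift(u):
        w = np.array([u[0], u[1], 1.0])
        return w / np.linalg.norm(w)
    return np.degrees(np.arccos(np.clip(lift(u_true) @ lift(u_est), -1.0, 1.0)))

# a 10% relative error on a 1 pixel/frame flow gives a few degrees:
print(angular_flow_error(np.array([1.0, 0.0]), np.array([1.1, 0.0])))  # ~2.9 deg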

Chapter 3

Discrete and Differential Constraints

In this chapter the differential and discrete epipolar constraints are presented and their properties reviewed. The limits of validity of these two constraints are exposed and their effectiveness in estimating structure and motion is compared. Since the aim is to establish an approximate trade-off between the two techniques, simple structure from motion algorithms are used.

In the discrete setting we chose to re-implement the 8-point algorithm (Longuet-Higgins 1981) for the estimation of the essential matrix. Despite the prevailing view that such an algorithm is very susceptible to noise, Hartley (Hartley 1997) showed that, through a suitable re-normalization, its results are comparable with the best iterative algorithms (Luong et al. 1993, Zhang 1996).

In the differential setting, the egomotion is estimated by the linear subspace method (Heeger & Jepson 1992) (see Chapter 8). Structure is then computed by variable separation as described in Section 3.2.

3.1 The Discrete Epipolar Constraint

We assume that two images of the same scene, which we call left and right, are given. The projections x_l and x_r of a scene point X over the two image planes satisfy the discrete epipolar constraint:

x_l^T R^T T̂ x_r = 0    (3.1)

where R is the relative rotation and T̂ ∈ so(3) is the skew symmetric matrix built from the vector T. The matrix E = R^T T̂ with R ∈ SO(3) and T̂ ∈ so(3) is called the essential matrix, and the set of all the essential matrices is called the essential space, defined to be:


E = { R^T T̂ | R ∈ SO(3), T̂ ∈ so(3) }    (3.2)

Eq. 3.1 can be easily derived by noticing that, given the relative camera position and orientation (R, T), the vectors R X_l, T and X_r all lie in the same plane, called the epipolar plane (see Fig. 3.1). This relationship can be expressed as:

X_r^T [T × (R X_l)] = 0    (3.3)

which in turn is equivalent to Eq. 3.1, using that T × R ≡ T̂ R and taking the transpose of the whole expression. Note that the discrete epipolar constraint is of a geometrical nature and only asserts that the lines generated by the intersection of the epipolar plane with the focal planes of the two cameras are in correspondence. Feature displacements that satisfy the rigid body motion constraint also satisfy the discrete epipolar one, but the converse is not true. For example the displacement X ↦ T × (RX) satisfies the epipolar constraint but is not generated by any rigid motion. The following theorem characterizes the essential space:

Theorem 3.1. Characterization of the Essential Space. A non zero matrix E is an essential matrix if and only if its singular value decomposition E = U Σ V^T satisfies Σ = diag{γ, γ, 0} for some γ > 0 and U, V ∈ SO(3).
Proof: The proof is given in (Ma, Kosecka & Sastry 1997). □

Theorem 3.2. Uniqueness of the Motion Recovery from the Essential Matrix. There exist exactly two 3D displacements (R, T) corresponding to a non zero essential matrix E ∈ E. Given the SVD of E = U Σ V^T, the displacements (R, T) that solve E = R^T T̂ are:

(R_1, T̂_1) = ( U R_Z^T(π/2) V^T ,  U R_Z(π/2) Σ U^T )    (3.4)
(R_2, T̂_2) = ( U R_Z^T(−π/2) V^T ,  U R_Z(−π/2) Σ U^T )    (3.5)

where R_Z(ψ) is the rotation matrix inducing a rotation of angle ψ around the z axis.
Proof: The proof is given in (Ma et al. 1997). □

3.2 The Differential Epipolar Constraint

Eq. 2.8 provides the most straightforward expression of the differential epipolar constraint. Algebraic manipulations can be done to obtain different formulations. We review two such manipulations that will be used throughout the thesis.

Figure 3.1. Epipolar geometry. e_l and e_r are the two epipoles.

Theorem 3.3. A camera moving with velocities (v, ω) generates an optical flow field of the form:

u(x) = (1/Z) A(x) v + B(x) ω    (3.7)

where A and B are defined in Eq. 2.8.
Proof: By definition:

u = (dx/dt, dy/dt)^T = [ (Ẋ Z − Ż X)/Z² , (Ẏ Z − Ż Y)/Z² ]^T    (3.8)

Substituting Eq. 2.7 into Eq. 3.8 we get:

u = [ (1 + x²)ω_y − y ω_z − xy ω_x + (v_x − x v_z)/Z ,  −(1 + y²)ω_x + x ω_z + xy ω_y + (v_y − y v_z)/Z ]^T    (3.9)

which in turn is Eq. 3.7.


□

Note the separability of the differential epipolar constraint equation. Defining:

e ≐ [e_1, e_2]^T = A(x)v / ‖A(x)v‖    (3.10)

the vector ē = [e_2, −e_1]^T is normal to the component of the optical flow generated by the linear velocity and can be used to eliminate it from Eq. 3.7:

ē · (1/Z) A(x) v = 0  ⇒  ē · (u − B(x)ω) = 0    (3.11)

from which ω can be estimated when v is known. When v and ω are known we can estimate the ratio 1/Z from:

1/Z = e^T (u − B(x)ω) / ‖A(x)v‖    (3.12)

Note that this expression for the depth Z is optimal in the least squares sense. Equating to zero the derivative of the squared norm of the velocity residuals we get exactly Eq. 3.12:

∂_Z ‖u(x) − (1/Z) A(x) v − B(x) ω‖² = 0  ⇒  1/Z = e^T (u − B(x)ω) / ‖A(x)v‖    (3.13)
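The separation of Eqs. 3.11–3.12 translates directly into code. An illustrative sketch (names are mine; it assumes the linear velocity v has already been estimated, e.g. by the subspace method discussed later):

import numpy as np

def omega_and_depth(xs, us, v):
    """Given image points xs (Nx2), flows us (Nx2) and a known linear velocity v,
    recover w by least squares on Eq. 3.11 and the inverse depths by Eq. 3.12."""
    def AB(x, y):
        A = np.array([[1.0, 0.0, -x], [0.0, 1.0, -y]])
        B = np.array([[-x * y, 1.0 + x * x, -y], [-(1.0 + y * y), x * y, x]])
        return A, B
    rows, rhs = [], []
    for (x, y), u in zip(xs, us):
        A, B = AB(x, y)
        e = A @ v
        ebar = np.array([e[1], -e[0]])   # normal to A(x)v (Eq. 3.11)
        rows.append(ebar @ B)            # ebar . B(x) w = ebar . u
        rhs.append(ebar @ u)
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    invZ = []
    for (x, y), u in zip(xs, us):
        A, B = AB(x, y)
        Av = A @ v
        e = Av / np.linalg.norm(Av)
        invZ.append(e @ (u - B @ w) / np.linalg.norm(Av))   # Eq. 3.12
    return w, np.array(invZ)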

Theorem 3.4. The optical flow field of a camera moving with velocities (v, ω) satisfies:

(u^T, x^T) [ v̂
             s ] x = 0    (3.14)

where v̂ and ω̂ are the skew symmetric matrices built from the vectors v and ω, and s is the symmetric matrix defined by s := (1/2)(ω̂ v̂ + v̂ ω̂). The space of matrices

S = { (1/2)(ω̂ v̂ + v̂ ω̂) | ω ∈ R³, v ∈ S² }    (3.15)

is called the special symmetric space.
Proof: Multiplying Eq. 2.7 by v × x we get:

Ẋ · (v × x) = (ω × X) · (v × x)    (3.16)

Since X = λx and x^T(v × x) = 0, from Eq. 3.16 we have:

λ u^T v̂ x + λ x^T ω̂ v̂ x = 0    (3.17)

When λ ≠ 0, we have:

u^T v̂ x + x^T ω̂ v̂ x = 0    (3.18)

Using that for a generic skew symmetric matrix Â, b^T Â b = 0, and that the matrix ω̂ v̂ can be decomposed into the sum of a symmetric and a skew symmetric matrix by ω̂ v̂ = (1/2)(ω̂ v̂ + (ω̂ v̂)^T) + (1/2)(ω̂ v̂ − (ω̂ v̂)^T), we get:

u^T v̂ x + x^T s x = 0    (3.19)

□

Theorem 3.5. The optical flow field of a camera moving with velocities (v, ω) satisfies:

λ u + λ̇ x = λ ω × x + v    (3.20)

Proof: The demonstration is straightforward, substituting X = λx in Eq. 2.7 and using that d(λx)/dt = λ̇ x + λ u. □

It is shown in chapter 2 that reconstruction from optical flow is always accompanied by a scaling ambiguity. Indeed, some optical flow fields yield a further ambiguity, since they are compatible with one or more reconstructions that are not just related by a scale factor.

Definition 3.6. An optical flow field u is ambiguous if it admits two decompositions:

u = u_{v1} + u_{ω1} = u_{v2} + u_{ω2}    (3.21)

with u_{v1} ≠ u_{v2} and u_{ω1} ≠ u_{ω2}.

Ambiguous velocity fields are discussed in more detail in (Maybank 1985) and (Horn 1987).

Proposition 3.7. An optical flow field which arises from angular velocity alone is never ambiguous. □

Theorem 3.8. Let u be an ambiguous image velocity field and let:

u = u_{v1} + u_{ω1} = u_{v2} + u_{ω2}    (3.22)

be the two decompositions of u. Then v_1 and v_2 are non-zero and non-parallel. □

Theorem 3.9. Let u be an ambiguous image velocity field and let:

u = u_{v1} + u_{ω1} = u_{v2} + u_{ω2}    (3.23)

be the two decompositions of u. Let ω = ω_1 − ω_2. Then the critical surface ψ has equation:

(ω × X) · (v_2 × X) + (v_2 × v_1) · X = 0    (3.24)

□

Theorem 3.10. There are at most three different reconstructions compatible with an optical flow field. □

A more advanced and complete discussion of the critical surfaces and the proofs of the theorems reported above can be found in (Maybank 1993). Here we are mostly interested in planar surfaces, which will be used further in chapter 7.

Planes are parameterized by the normal to the surface p and the distance h of the plane from the origin of the reference frame. The following theorem holds:

Theorem 3.11. There are exactly two reconstructions compatible with an optical flow generated by a moving planar surface (p, h).
Proof: Points on the planar surface verify:

p^T x = h/Z    (3.25)

Moreover, since the surface is critical, we assume that two sets of rigid velocities (v_1, ω_1) and (v_2, ω_2) generate the same optical flow u. We have:

u = A v_1 (p^T x)/h + B ω_1 = A v_2 / Z + B ω_2    (3.26)

Taking the scalar product of Eq. 3.26 with v_2 × x and using ω = ω_2 − ω_1 we get:

(p^T x / h) [(v_1 × v_2)^T x] = (ω × x) · (v_2 × x) = (ω^T v_2)(x^T x) − (v_2^T x)(ω^T x)    (3.27)

It follows that:

ω^T v_2 = 0    (3.28)
(ω, v_2) = ( c v_1 × v_2 , c^{−1} p/h )    (3.29)

Eliminating v_2 in the expression for ω we finally get:

v_2 = p/h    (3.30)
ω_2 = ω_1 + (p × v_1)/h    (3.31)

To complete the proof we have to show that this yields a reconstruction compatible with u and that the surface obtained is a plane. Define a second optical flow u′:

u′ = A v_2 / Z + B ω_2    (3.32)

Subtracting the expression for u and using Eq. 3.30 we get:

(u − u′) × x = (p × x) [ v_1^T x − c/Z ]    (3.33)

So u − u′ = 0 if and only if v_1^T x = c/Z, which implies that the flow u can arise from a planar surface with normal in the direction of v_1, with velocities (v_2, ω_2).

□

The flow generated by a planar configuration of points can be parameterized in terms of the plane parameters (p, h):

Theorem 3.12. The optical flow generated by a plane moving with velocities (v, ω) can be expressed in the form u(x) = F(x) b, where F(x) is a 2 × 8 matrix depending only on the feature position x and b is an 8 × 1 vector function of the motion and plane parameters.
Proof: Points on a plane satisfy:

Z p^T x = h    (3.34)

Solving for Z and substituting in Eq. 2.8 we get:

u = A v (p^T x)/h + B ω    (3.35)

Expanding the matrix-vector products we finally get:

u(x) = F(x) b    (3.36)

with:

F(x) = [ 1  x  y  0  0  0  x²  xy
         0  0  0  1  x  y  xy  y² ]    (3.37)

and

b1 = −hωy + vzpx b2 = hωx + vzpy (3.38)

b3 = vzpz − vxpx b4 = hωz + vxpy (3.39)

b5 = −hωy − vxpz b6 = vzpz − vypy (3.40)

b7 = −hωz − vypx b8 = hωx − vypz (3.41)

¤
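Since each feature contributes the two rows of F(x) in Eq. 3.37, the 8-vector b can be estimated linearly from at least four flows on the plane. A minimal sketch (not thesis code; names are illustrative):

import numpy as np

def fit_planar_flow(xs, us):
    """Estimate b of Eq. 3.36 from >= 4 point flows on a plane by stacking
    u(x) = F(x) b and solving the linear least squares problem."""
    F_rows, u_rows = [], []
    for (x, y), u in zip(xs, us):
        F_rows.append([1, x, y, 0, 0, 0, x * x, x * y])
        F_rows.append([0, 0, 0, 1, x, y, x * y, y * y])
        u_rows.extend(u)
    b, *_ = np.linalg.lstsq(np.array(F_rows, float),
                            np.array(u_rows, float), rcond=None)
    return b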

Note that the elements of b are invariant under the transformation:

h′ = h    (3.42)
p′ = v/‖v‖    (3.43)
v′ = ‖v‖ p    (3.44)
ω′ = ω + (1/h) p × v    (3.45)

This result is a consequence of Theorem 3.11. When sparse fields are used, a minimal configuration of points is needed in order to get a unique solution.


Theorem 3.13. Structure and motion are uniquely determined by five optical flows not lying on a critical surface.
Proof: The total number of parameters to be determined is N + 5, where N is the number of depths Z and 5 is the number of motion parameters (the 3 components of the angular velocity and the 2 independent components of the linear one). Since each feature provides 2 independent equations, one for each component of the flow, we have:

2N ≥ N + 5 ⇒ N ≥ 5 (3.46)

□

3.3 The Differential Epipolar Constraint as a Limit of the Discrete One

The continuous camera motion is the limit of the discrete motion over a period Δt when Δt → 0. So the differential epipolar constraint must be the limit of the discrete one in the same approximation. We have that:

x_r → x_l + u_l Δt    (3.47)
R → I + ω̂ Δt    (3.48)
T → v Δt    (3.49)

In this approximation, the epipolar constraint becomes:

x_l^T R^T T̂ x_r → x_l^T [(I + ω̂ Δt)(v̂ Δt)] (x_l + u_l Δt)    (3.51)
= (x_l^T v̂ x_l) Δt + (x_l^T v̂ u_l + x_l^T ω̂ v̂ x_l) Δt² + (x_l^T ω̂ v̂ u_l) Δt³    (3.52)
= 0    (3.53)

The term x_l^T v̂ x_l = 0 since v̂ is skew symmetric. Ignoring the third order term we get:

x_l^T R^T T̂ x_r → (u^T v̂ x + x^T s x) Δt² + o(Δt²)    (3.54)

So the limit for Δt → 0 of the epipolar constraint is the differential epipolar constraint in the form of Eq. 3.14. As we demonstrated above, the epipolar constraint is of a geometrical nature and does not imply rigid motion. The same holds for the form of Eq. 3.14. This formulation can be obtained from the rigid body motion equation in Eq. 3.7 by algebraic manipulation (see chapter 8). This observation is important in the context of structure from motion, since in general the use of a weaker constraint can lead to biased and noisy estimations. This problem will be analyzed in detail in chapter 8.


3.4 An Effectiveness Comparison

In this section we provide a comparison between the discrete and differential approaches to reconstruction from monocular video sequences. We assume that a sequence of M frames has been shot, that N features are tracked along the sequence and that the optical flow at the feature locations is given for each frame. We first estimate structure by a stereo technique using the first and the Mth frame. Then, using the differential epipolar constraint, structure is estimated for each frame between 1 and M. Reconstructions relative to different coordinate systems are integrated over time.

The comparison we propose is primarily intended to indicate the major tradeoff between the two approaches.

3.4.1 Structure and Motion from Optical Flow

Figure 3.2. (a) Distribution of the variable (Z − Ẑ)/Z for 10 degrees error in the estimation of the optical flow. (b) Average percentage of positive depths as a function of the error in the estimation of the optical flow.

Structure and motion are estimated with a stratified approach, using the separability property of the differential epipolar constraint in the form of Eq. 3.7. The main steps are:

• Compute the linear velocity using the linear subspace method. We used the bias corrected approach proposed by MacLean (MacLean 1999), discussed in chapter 8.

• Compute the angular velocity by least squares minimization of Eq. 3.11

• Compute the depth by using Eq. 3.12

Structure is estimated for each frame of the sequence. The reconstructions differ by an unknown scale that we indicate with α_j. Moreover, reconstructions are relative to the camera coordinate system and so also differ by a rotation and a translation, the latter still up to the scale α_j.

To integrate the different structures we first re-scale them and then transform them into the same coordinate system. Re-scaling is achieved by fixing one of the static parameters. This method, suggested by Pentland (Azarbayejani & Pentland 1995), consists of fixing the scale α_1, propagating one of the feature 3D positions with respect to the first frame into the second by the estimated motion, i.e. α_1 X_i² = α_1 R_1 X_i¹ + α_1 T_1, and then re-scaling the structure estimated with respect to the second frame so that it matches the propagated feature. This is easily achieved by multiplying the α_2 X² vectors by α_1/α_2 ≐ ‖α_1 X_i²‖ / ‖α_2 X_i²‖. This process is repeated recursively for all the frames in the sequence so that they all have the same scale α_1.

The coordinate frame transformation is achieved by time integrating the estimated motion according to:

T_{j,M} = Σ_{p=j}^{M} v_p Δt    (3.55)
R_{j,M} = Π_{p=j}^{M} e^{ω̂_p Δt}    (3.56)

where T_{j,M} and R_{j,M} are the relative translation and rotation between the camera reference frames at times t_j and t_M. The structure estimates at each instant t_j are expressed in the t_M coordinate frame by:

X_j(M) = R_{j,M} X_j + T_{j,M}    (3.58)

where X_j(M) is the structure estimate at time t_j with respect to the camera reference frame at time t_M. Such a forward propagation introduces a further error that has to be properly modelled to obtain statistically consistent results. If we assume that the covariance matrices Σ_v and Σ_ω of v and ω are time independent we get:

σ_{T_{j,M}} = (M − j) Σ_v    (3.59)
σ_{R_{j,M}} = (M − j) [ ∂(e^{Σ_p ω̂_p}) / ∂(Σ_p ω_p) ] Σ_ω    (3.60)

Taking the differential of Eq. 3.58 we get:

dX_j(M) = R_{j,M} σ_{X_j} + σ_{T_{j,M}} + σ_{R_{j,M}} X_j    (3.61)

When Eq. 3.59 is plugged into Eq. 3.61 we observe that the error in X_j(M) is approximately proportional to M − j. A rough time integration taking into account such an error model is:

Figure 3.3. Performance of the stereo reconstruction (reconstruction and translation error) as a function of the error in the pixel position of feature points for the minimal configuration of 8 points. The average disparities between the stereo views for the 3 curves are about 10, 20 and 40 pixels.

X̄(M) = A Σ_{p=j}^{M} w_p X_p(M)    (3.62)

where w_p = 1/(M − p + 1) and A is the normalization constant, A = 1/Σ_p w_p. Note that the sum in Eq. 3.62 can be broken up into two terms:

X̄(M) = A [ ( Σ_{p=j}^{M−1} w_p X_p(M) ) + X_M(M) ] ≐ A ( X^−(M) + X_M(M) )    (3.63)

Using Kalman filter notation, X^−(M) is the prior estimate of the structure at time t_M based on the observations up to time t_{M−1}.
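A sketch of the weighted integration of Eqs. 3.62–3.63 (illustrative code; it assumes all structure estimates have already been propagated into the reference frame of the last camera by Eq. 3.58):

import numpy as np

def integrate_structure(X_list):
    """X_list[p] is the Nx3 structure estimated at time t_p, expressed in the
    frame of the last camera. Older estimates get the smaller weight
    w_p = 1/(M - p + 1) of Eq. 3.62."""
    M = len(X_list)
    w = np.array([1.0 / (M - p) for p in range(M)])  # 0-based: last gets w = 1
    w /= w.sum()                                     # normalization constant A
    return sum(wp * Xp for wp, Xp in zip(w, X_list))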

Experiments according to the benchmarks described in Chapter 2 are performed in order to assess the efficiency of the differential reconstruction. In Figure 3.2(a) the reconstruction error distribution is shown. As can be noticed, the distribution is not symmetric; this depends upon the fact that the estimation of the linear velocity by subspace techniques is biased. This is discussed in detail in (MacLean 1999) and in Chapter 8. The percentage of negative Z is another important indicator, since the sign of the reconstruction has to be chosen properly. For normal working conditions (about 10 degrees of noise) the pollution from negative Z is of the order of a few percent. Fig. 3.4(a) shows the error on structure estimation for reconstruction performed using a single optical flow field.

3.4.2 Structure and Motion from Stereo

To compute the camera motion we first estimate the essential matrix as in (Hartley 1997) and then we implement the factorization technique of Eq. 3.4. We summarize the basic steps:

• Compute the essential matrix E
• Find the SVD decomposition E = U Σ V^T where Σ = diag{γ_1, γ_2, γ_3}
• Obtain the 2 possible solutions (T_1, R_1) and (T_2, R_2)

For any pair of matchings we have an equation of type:

x_l^T E x_r = 0    (3.64)

which in turn is equivalent to:

[x_l x_r, x_l y_r, x_l, y_l x_r, y_l y_r, y_l, x_r, y_r, 1] · [E_11, E_21, E_31, E_12, E_22, E_32, E_13, E_23, E_33]^T = 0    (3.65)

From all the point matches we obtain a set of linear equations of the form:

Ce = 0 (3.66)

where e is a nine-vector containing the entries of the matrix E. Instead of solving Eq. 3.66 directly, Hartley showed that better results can be obtained by re-normalizing the entries of the matrix C. Re-normalization is achieved by translating the feature points x_l and x_r such that their centroid is at the origin, and then rescaling them such that the average distance from the origin is √2. Thanks to this, the conditioning of the matrix C can be largely improved and more stable results obtained.

Moreover the essential matrix E has rank 2, and this constraint can be imposed to further improve the precision of the estimate. The rank constraint is enforced by replacing E by E′, the closest singular matrix to E in Frobenius norm. This is done by taking the singular value decomposition E = U Σ V^T, where Σ = diag(σ_1, σ_2, σ_3), and choosing E′ = U diag(σ_1, σ_2, 0) V^T.
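A compact sketch of this normalized 8-point procedure (function names are mine; the two leading singular values are kept as estimated, and at least 8 matches are assumed):

import numpy as np

def essential_8pt(xl, xr):
    """Estimate E from Nx2 matched points: isotropic normalization (Hartley 1997),
    linear null-space solve of Eq. 3.65, then rank-2 enforcement."""
    def normalize(p):
        c = p.mean(axis=0)
        s = np.sqrt(2.0) / np.mean(np.linalg.norm(p - c, axis=1))
        N = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        ph = np.column_stack([p, np.ones(len(p))])
        return ph @ N.T, N
    pl, Nl = normalize(xl)
    pr, Nr = normalize(xr)
    C = np.array([np.outer(a, b).ravel() for a, b in zip(pl, pr)])
    _, _, Vt = np.linalg.svd(C)
    E = Vt[-1].reshape(3, 3)                 # null vector e, reshaped row-wise
    U, S, Vt = np.linalg.svd(E)
    E = U @ np.diag([S[0], S[1], 0.0]) @ Vt  # enforce the rank-2 constraint
    return Nl.T @ E @ Nr                     # undo the normalization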

The so estimated essential matrix is then decomposed according to Eq. 3.4. The right solution is determined by the positive depth constraint. In practice we check the dominant sign of the reconstructed depths to discriminate among the possible solutions.

The feature point world positions are computed by triangulation, as the mid point of the minimum segment joining the rays through the optical centers of the cameras and the projected point positions. This is achieved through the least squares solution of the following linear system:

a x_1 − b R^T x_2 + c (x_1 × R^T x_2) = T    (3.67)

where the unknowns a, b, c represent, up to a scale, the lengths of the left ray, the right ray and the joining segment.
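A sketch of this mid-point triangulation (my convention: T is taken as the position of the second camera center in the first camera frame, and the two rays are assumed non-parallel so that the 3×3 system is solvable):

import numpy as np

def triangulate_midpoint(x1, x2, R, T):
    """Solve a*x1 - b*(R^T x2) + c*(x1 x R^T x2) = T (Eq. 3.67) and return the
    middle of the segment joining the two closest ray points."""
    r2 = R.T @ x2
    M = np.column_stack([x1, -r2, np.cross(x1, r2)])
    a, b, c = np.linalg.solve(M, T)
    X_left = a * x1          # closest point along the left ray
    X_right = T + b * r2     # closest point along the right ray
    return 0.5 * (X_left + X_right)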

Figure 3.4. (a) Error for structure reconstruction from a single optical flow field. (b) A comparison of stereo and differential reconstruction using 50 points, assuming 10 degrees error on optical flow and 1 pixel error on feature positions. Disparity between consecutive frames is about 1.0 pixel per frame. The reconstruction error function is defined in Section 2.5.

For the stereo simulation we show the result of the algorithm for three different disparities and for generic motion, using the minimal configuration of 8 points (Fig. 3.3). The errors in egomotion are reported in terms of directional difference and relative module error between the reconstructed and the generated quantities. An error of 1 pixel is reasonable for a feature detector, and we see that a reasonable reconstruction error (about 50%) is obtained for disparities approximately of the order of 20 pixels. Obviously the precision of the reconstruction depends drastically on the number of points used.

3.5 Comparison

Fig. 3.4(b) shows the reconstruction error for the discrete and differential approaches as a function of the number of frames used. Stereo reconstruction is applied to the first and last image, while the differential algorithm is applied to each consecutive pair and the structure estimates are then integrated. Differential reconstruction largely outperforms stereo when the baseline is small. As pointed out in section 3.1, the discrete epipolar constraint is of a geometrical nature and in general is not equivalent to rigid body motion. The consequence of using a weaker constraint is that it is particularly sensitive to the relative error in the input, which is extremely large for short baselines, since the matching error is only weakly dependent on the baseline itself. On the contrary, the differential epipolar constraint is just a re-parameterization of the rigid body motion equation and performs well even when frames are very close. Time integration is shown to improve the reconstruction performance, even if the re-scaling process introduces a cumulative error which in our algorithm is not properly modelled. Tracking inefficiencies are not simulated. In general outliers are more common when matching among views separated by large baselines than when tracking over dense sequences.

Figure 3.5. Reconstruction of a synthetic video sequence of 2 frames. The average disparity is about 1 pixel.

Figure 3.6. (a) A frame from the sequence. (b) Reconstruction from two frames. Optical flow is approximated with the feature displacements.

3.6 Summary

In this chapter the properties of the differential and discrete epipolar constraints are reviewed. Their efficiency in reconstructing from image sequences is tested and compared using simple algorithms. Stereo is well suited when the baseline between images is large enough, which requires re-sampling the video sequence to a lower rate. When dense sequences are used, the differential epipolar constraint is more effective. This is due to the different natures of the two constraints, one being a rigid body motion constraint and the other a geometrical one.

Reconstruction by differential constraints can be affected by error accumulation due to the re-scaling process, if this is not modelled properly. Such a problem shows up when hundreds of reconstructions are integrated. Stereo is not affected by error accumulation, but if multi-viewpoint systems are to be used they also suffer from such a problem.

Figure 3.7. (a) A frame from the sequence. (b) Computed optical flow. (c) Reconstruction from two frames. Optical flow is approximated with the feature displacements.

Figure 3.8. (a) A frame from the sequence. (b) Estimated optical flow. (c) Reconstruction from two frames.

Chapter 4

Linear Recursive Estimation of Structure and Motion

In this chapter we propose a technique to recursively estimate structure and motion from a sequence of images acquired by a calibrated hand held camera. The algorithm is based on the estimation of instantaneous optical flow for each frame of the sequence and on the use of the differential epipolar constraint to compute motion and structure. A new procedure to manage the scale ambiguity across frames is also proposed. Neither user interaction nor any kind of scene knowledge is required, and the algorithm, given its low complexity, is adaptable to real time applications.

Fig. 4.1 shows the stratified structure of the algorithm. Images acquired by a 25 Hz calibrated video camera are passed one by one to the system. Features are tracked over the new frame and the optical flow estimated as described in Section 2.3. Egomotion is estimated by a linear method and the estimate refined by a non linear minimization procedure. Once motion is known, structure relative to the new frame can be computed by a linear algorithm.

As we already discussed, the structure and the linear motion can be determined up to a scale. Since such a scale is, in principle, not unique but different from frame to frame, a rescaling procedure is needed to manage the ambiguity.

4.1 Egomotion Estimation

Egomotion estimation is a 2 step process. We first estimate the linear velocity with the subspace method corrected for bias reduction (MacLean 1999). This algorithm, even if not among the most precise (see (Tomasi & Heeger 1994)), has a closed form solution, which is advantageous for automatic reconstruction. Subsequently a non-linear refinement of the solution is carried out in the same fashion as Bruss and Horn (Bruss & Horn 1983). Both the subspace method and the refinement algorithm are discussed in detail in chapter 8.


Figure 4.1. Algorithm overview: the structure and motion problem is reduced to a set of modules that are executed sequentially (acquire frame n; track features from frame n−1 to n; egomotion estimation; structure estimation up to a scale; rescaling to the zeroth frame's scale; integration into the 3D model). The camera is supposed to be calibrated a priori.

4.2 Structure from Motion

Structure and linear velocity magnitude can be estimated only up to a scale. Letting λ = αZ and ϑ = α‖v‖, we can write:

X = (λ/α) (X/Z, Y/Z, 1)^T ≡ (λ/α) x    (4.1)
v = (ϑ/α) v⃗    (4.2)

where v⃗ is the direction of v. From Eq. 3.14 we get:

λ̇ x + λ u = λ ω × x − ϑ v⃗    (4.3)

Given N feature points we can rewrite Eq. 4.3 in matrix form:

Figure 4.2. Comparison between the Pentland and the ODE based re-scaling systems for different noise levels. Time integration is performed over 5 frames. Reconstruction error is defined in Section 2.5.

S λ = 0    (4.4)

where the unknowns are λ = [λ_1, λ̇_1, ..., λ_N, λ̇_N, ϑ]^T and the 3N × (2N + 1) structure matrix S is:

S = [ x^(1)  −u_x^(1) − (ω × x)_x^(1)   0       0                         . . .  −v⃗_x
      y^(1)  −u_y^(1) − (ω × x)_y^(1)   0       0                         . . .  −v⃗_y
      1      −(ω × x)_z^(1)             0       0                         . . .  −v⃗_z
      0      0                          x^(2)   −u_x^(2) − (ω × x)_x^(2)  . . .  −v⃗_x
      0      0                          y^(2)   −u_y^(2) − (ω × x)_y^(2)  . . .  −v⃗_y
      0      0                          1       −(ω × x)_z^(2)            . . .  −v⃗_z
      ...    ...                        ...     ...                       ...    ...  ]

We find the unknowns solving the orthogonal least squares problem:


min ‖S λ‖²   subject to   ‖λ‖ = 1    (4.5)

The solution is the eigenvector of S^T S corresponding to the smallest eigenvalue (see Appendix C).
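In code this orthogonal least squares problem is one SVD call. A minimal sketch (the sign convention is my illustrative choice; the returned vector has unit norm by construction):

import numpy as np

def solve_structure(S):
    """Solve min ||S lam||^2 subject to ||lam|| = 1 (Eq. 4.5) via the right
    singular vector of S with the smallest singular value, which is the
    eigenvector of S^T S with the minimum eigenvalue."""
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    lam = Vt[-1]
    if lam[-1] < 0:      # one convention: make the velocity scale theta positive
        lam = -lam
    return lam           # [lam_1, lamdot_1, ..., lam_N, lamdot_N, theta]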

4.3 Automatic Rescaling

The optimization presented above returns the depths of the features with respect to the current camera position up to a scale factor:

λi(t) = α(t)Zi(t) (4.6)

where the time dependence expresses the fact that depths are measured with respect to a different coordinate system at each time t. Time integration requires the α's to be identical at all times. Such rescaling usually requires some kind of scene or motion knowledge: for example the modulus of the linear velocity, the distance of some point from the camera or the relative distance of two points can be used for this purpose. In our framework the rescaling can be achieved without the use of any external information. We observe that the ratio η(t) = λ̇/λ is scale independent; if we write η(t) = d ln(λ)/dt and let s = ln(λ) we get:

ṡ = η(t)    (4.7)
s(t_0) = ln(λ(t_0))    (4.8)

This ODE can be solved to rescale all the views to the t_0 scale. We use the notation λ^r(t) for the rescaled depths (notice that at time t_0, λ^r(t_0) ≡ λ(t_0)). We get

λ^r(t) = λ(t_0) e^{∫_{t_0}^{t} η(q) dq} = α(t_0) Z(t_0) e^{∫_{t_0}^{t} η(q) dq} = α(t_0) Z(t)    (4.9)

Approximating the integral by a discrete sum we can rewrite Eq. 4.9 as:

λ^r(t_j) = λ(t_0) Π_{l=1}^{j} e^{η(t_l) Δt_l} = ( λ(t_0) Π_{l=1}^{j−1} e^{η(t_l) Δt_l} ) e^{η(t_j) Δt_j} = λ^r(t_{j−1}) e^{η(t_j) Δt_j}    (4.10)

This is the basic equation to recursively rescale all the views to the same scaling constant: at each time t the ratio η(t) = λ̇/λ can be estimated and, using Eq. 4.9, the up-to-a-scale depth λ(t) can be updated. Note that the global scale is still unknown and that such indeterminacy can be resolved only by scene or motion knowledge.

Partial tracks can be easily managed in this framework. Suppose that a new feature, indicated by N + 1, is tracked at time t = t̄ and its unrescaled depth measured as λ_{N+1}(t̄) = α(t̄) Z_{N+1}(t̄). Rescaling cannot be done using Eq. 4.10, since the previous history of the track is not known. However, the ratio α(t_0)/α(t̄) and the rescaled depth can be estimated as:

κ = E[ λ_i^r(t̄) / λ_i(t̄) ]_{i ≠ N+1}  ⇒  λ_{N+1}^r(t̄) = κ λ_{N+1}(t̄)    (4.11)

At times t > t̄, rescaling can also be done for this track using Eq. 4.10. Note that the integration of the differential equation introduces an additional error of the order o((Δt)^{−1}). For 25 Hz sequences this error is pretty small compared to the measurement errors and can be ignored.
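One step of this recursive autorescaling, including the partial-track rule of Eq. 4.11, can be sketched as follows (illustrative code; entries of lam_r_prev at the positions of new features are placeholders that get overwritten):

import numpy as np

def rescale_step(lam_r_prev, lam, lamdot, dt, new=()):
    """Eq. 4.10 applied per feature: lam_r_prev are the rescaled depths of the
    previous frame, (lam, lamdot) the current up-to-scale depths and their
    derivatives. Indices in `new` have no history and use the mean-ratio of
    Eq. 4.11 instead."""
    eta = lamdot / lam                       # scale-independent ratio lamdot/lam
    lam_r = lam_r_prev * np.exp(eta * dt)    # recursive update (Eq. 4.10)
    old = np.setdiff1d(np.arange(len(lam)), new)
    kappa = np.mean(lam_r[old] / lam[old])   # Eq. 4.11
    lam_r[list(new)] = kappa * lam[list(new)]
    return lam_r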

In Fig. 4.2 the effectiveness of the proposed method is compared with the method of Pentland (see (Azarbayejani & Pentland 1995)) used in the previous chapter. The simulated data sets consist of 100 features and 100 trials, as discussed in Section 2.4. In terms of structure estimation precision the two methods are approximately equivalent. The main advantage of the autorescaling we propose is the fact that features are rescaled independently of each other. With the method of Pentland the precision of the reconstruction of all the points depends on the feature chosen for the rescaling: this approach can lead to severe errors when such a feature is particularly noisy. Moreover, Pentland's approach assumes visibility of the rescaling feature throughout the whole sequence. With our method this is not necessary and features can appear and disappear freely.

4.4 Time Integration

We use exactly the same approach we proposed in the previous chapter. We assume that motion can be approximated as:

T_{j,M} = Σ_{p=j}^{M} v_p Δt    (4.12)

R_{j,M} = ∏_{p=j}^{M} e^{ω_p Δt}    (4.13)

The structure is then predicted over time by:

X_j(M) = R_{j,M} X_j + T_{j,M}    (4.15)

where X_j(M) is the structure estimate at time t_j with respect to the reference frame of the camera at time t_M. Forward propagation introduces an additional error that has to be properly modelled to obtain statistically consistent results. If we assume that the covariance matrices Σ_v and Σ_ω of v and ω are time independent we get:


σ_{T_{j,M}} = (M − j) Σ_v    (4.16)

σ_{R_{j,M}} = (M − j) [ ∂(e^{Σ_p ω_p}) / ∂(Σ_p ω_p) ] Σ_ω    (4.17)

Taking the differential of Eq. 4.15 we get:

dX_j(M) = R_{j,M} σ_{X_j} + σ_{T_{j,M}} + σ_{R_{j,M}} X_j    (4.18)

The error in X_j(M) is approximately proportional to M − j. A rough time integration taking into account the error model is:

X(M) = A Σ_{p=j}^{M} w_p X_p(M)    (4.19)

where w_p = 1/(M − p + 1) and A is the normalization constant, A = 1/Σ w_p. Note that the sum in Eq. 4.19 can be broken up into two terms:

X(M) = A [ ( Σ_{p=j}^{M−1} w_p X_p(M) ) + X_M(M) ] ≐ A ( X^−(M) + X_M(M) )    (4.20)

Using Kalman filter notation, X^−(M) is the prior estimate of the structure at time t_M based on the observations up to time t_{M−1}.
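As a sketch, under the assumption that the estimates X_p(M) have already been propagated to the frame of t_M via Eq. 4.15, the weighted average of Eq. 4.19 is:

```python
import numpy as np

def integrate_structure(X_history):
    """Rough time integration of Eq. 4.19. X_history holds the structure
    estimates X_p(M), p = j..M (oldest first), each an N x 3 array
    expressed in the camera frame at time t_M.

    Weights w_p = 1/(M - p + 1) down-weight old estimates, whose error
    grows roughly linearly with M - p (Eq. 4.18)."""
    K = len(X_history)
    w = 1.0 / np.arange(K, 0, -1)   # oldest -> 1/K, newest -> 1
    A = 1.0 / w.sum()               # normalization constant
    return A * sum(wi * Xi for wi, Xi in zip(w, X_history))
```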

4.5 Experiments

The method has been extensively tested with real sequences acquired by a hand held digital camera. The camera was previously calibrated using different views of a planar grid (one of the planes in Fig. 4.5 was used). Sparse features were tracked using the Kanade-Lucas-Tomasi tracker (Tomasi & Kanade 1991) and optical flow approximated with feature displacements. In Fig. 4.5, Fig. 4.6 and Fig. 4.7 we show examples of the reconstructions obtained with the algorithm presented in this chapter. Fig. 4.5 visually shows the reconstruction precision improvement due to time integration. The ground truth is hard to estimate since features are extracted automatically. To numerically estimate the improvement we computed the average residuals with respect to a planar hypothesis for the three planes that constitute the grid. Results are reported in Tab. 4.1.

It is important to notice that if the focal length is not changed, calibration is well preserved over time: we used the camera for about one month, re-calibrating regularly, and checked that the camera parameters were always consistent with the initial calibration. This is an important observation since, in the differential framework, full auto calibration of the camera cannot be performed, so at least partial knowledge of the camera parameters is necessary (see (Ma, Kosecka & Sastry 2000b) and Chapter 8).


                   2 frames   11 frames
calibration grid   0.0351     0.0081

Table 4.1. Planar residuals for the calibration grid sequence. Units are focal lengths.

4.6 Non-Linear Refinement

Figure 4.3. (a) Reconstruction improvement by refining the linear solution: reconstruction error versus noise (pixels) for the linear and non-linear estimators; the reconstruction error is defined in Section 2.5. (b) Improvement in the estimation of η(t) by refining the linear solution: η error versus noise (pixels); the η error is defined in Eq. 4.26.

The algorithm outlined above has the appealing characteristic of having closed form, since it is composed of three linear modules. It does not, however, provide an optimal solution to the structure and motion problem. For example the linear subspace method, which we used to estimate the translational velocity, is well known to be biased (MacLean 1999, Zucchelli & Kosecka 2001). An optimal solution, in the maximum likelihood sense, is the one that minimizes the squared norm of the velocity residuals (see Appendix C). Dividing by λ, Eq. 3.14 can be rewritten as:

u = ω × x + v/λ − η x    (4.21)

The optimal estimator (ω̂, v̂, λ̂, η̂) of the unknowns is found by minimizing:

(ω̂, v̂, λ̂, η̂) = arg min_{(ω,v,λ,η)} (1/N) Σ_i ‖r(x_i)‖²    (4.22)

where:

r(x) = u − ω × x − v/λ + η x    (4.23)

Note that such an estimator is optimal only if the optical flow measurement errors are Gaussian with identical standard deviations.


Figure 4.4. (a) Polar plot of the translational velocity as estimated by the linear algorithm. (b) Translational velocity estimates after the non-linear refinement. Error on optical flow is approximately 10%.

If speed is not a major concern, a sub-optimal solution to the problem in Eq. 4.22 can be found by an iterative procedure. We use a Gauss-Newton updating procedure, which has quadratic convergence and is simple and stable. Defining Θ = (ω, v, λ, η) and the residual vector ρ(Θ) = [r_1, ..., r_N]^T we get:

J_k ΔΘ_k = −ρ_k    (4.24)

where J is the Jacobian of ρ and k is the iteration index. The system in Eq. 4.24 is composed of N linear equations of the form:

[ 1 0 0 ]               [  0   1  −y ]
[ 0 1 0 ] (1/λ) Δv_k  +  [ −1   0   x ] Δω_k − (1/λ²) v Δλ_k − x Δη_k = r(x)_k    (4.25)
[ 0 0 1 ]               [  y  −x   0 ]

The constraint ‖v‖ = 1 yields, by differentiation, the additional linear equation v^T Δv_k = 0.
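A sketch of one such constrained Gauss-Newton update, assuming the caller has assembled the Jacobian J and the residual vector ρ from Eq. 4.23 (the parameter ordering and the function name are assumptions of this sketch):

```python
import numpy as np

def gauss_newton_step(J, rho, v, v_slice=slice(3, 6)):
    """One update of Eq. 4.24 with the gauge-fixing row v^T dv = 0
    appended. With Theta = (omega, v, lambda, eta), the translational
    velocity is assumed to occupy parameters 3:6."""
    row = np.zeros((1, J.shape[1]))
    row[0, v_slice] = v                  # encodes v^T Delta v_k = 0
    J_aug = np.vstack([J, row])
    rhs = np.concatenate([-rho, [0.0]])
    delta, *_ = np.linalg.lstsq(J_aug, rhs, rcond=None)
    return delta                         # Delta Theta_k
```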

4.6.1 Experiments

We tested the algorithm by initializing with the estimates from the linear approach described above, for different motions. Convergence is fast, within 2 to 7 iterations depending on the amount of noise. Improvements are large both in the global error (see


            linear   refinement
linux box   0.0401   0.0250
tea box     0.0180   0.0120

Table 4.2. Planar residuals. Units are focal lengths.

Fig. 4.3(a)) and in the estimation of the ratio η = λ̇/λ (Fig. 4.3(b)). The error in η is defined as:

σ_η = (1/N) Σ_i | (η_i − η̂_i) / η_med |    (4.26)

where η_med is the true η value for a feature positioned in the center of the cloud. The error measure so defined is a pure number.

In Fig. 4.4 an example of the distribution of the translational velocity polar and azimuthal angle estimates is shown in a polar plot. The samples are generated with linear velocity v = [1, 0.5, 1] and generic rotation. Note the symmetry and the reduced spread of the distribution after the non-linear refinement. In general the same experimental observations hold for different motions.

Due to the unavailability of the ground truth for the real data sets, we assessed the efficiency of our method by measuring the planarity of the 3 planar surfaces of the box-like shapes in Fig. 4.7 and 4.6. This was done by fitting 3 planes to the 3D reconstruction and measuring the average residual of the fit. Results are reported in Tab. 4.2 and refer to structure estimation based on two frames. Units are focal lengths, up to the unknown reconstruction scale.

4.7 Summary

In this chapter a linear recursive algorithm to estimate the point-wise structure of a scene from a calibrated video sequence was presented. The algorithm is based on the extraction of optical flow for each frame and the sequential computation of motion and structure. Moreover, a new way to automatically re-scale the 3D information from different views is presented and tested. The two stages of egomotion and structure computation are approached with linear procedures: this makes our algorithm fast enough to be run in real time. Results on simulated data and real images are presented to validate the effectiveness of the approach.

When speed is not a major concern, better results can be obtained by direct non-linear least squares minimization of the differential epipolar constraint. Using the linear algorithm estimates as starting values, the non-linear minimization shows very good convergence properties and leads to major estimation improvements.


Figure 4.5. (a) An image from the sequence. (b) Estimated optical flow. (c) & (e) Reconstructed model of the calibration grid using 2 frames; (d) & (f) reconstructed model using 10 frames. The model is based on 237 features and the average disparity between frames is about 1 pixel.


Figure 4.6. (a) An image from the sequence. (b) Estimated optical flow. (c) and (d) Reconstructed model of a Linux box based on 11 frames and 220 features. The average disparity is about 0.9 pixels per frame.


Figure 4.7. (a) An image from the sequence. (b) Estimated optical flow. (c) and (d) Reconstructed model of a tea box based on 7 frames and 250 features. The average disparity is about 1.1 pixels per frame.


Chapter 5

Maximum Likelihood Structure from Motion

In the previous chapter we proposed a recursive algorithm to estimate structure and motion from a sequence of images. Linear algorithms have closed form but in general they are noisy and biased. Better results can be obtained by direct minimization of the differential epipolar constraint in the least squares sense. However, we implicitly assumed that the errors along the x and y directions are identical and uncorrelated. This is rarely the case for real data, due to the aperture problem. Instead, one should minimize the covariance weighted squared error. Moreover, when dense sequences are acquired, further robustness can be achieved by integrating the reconstruction of structure over time. This chapter has three main contributions: (i) we show that the minimization of the weighted squared errors (i.e. of the Maximum-Likelihood estimator) outperforms the un-weighted least squares approach, (ii) we show how structure estimation can be locally integrated over time in a multi-view approach that drastically improves estimates, and (iii) we show how to recursively integrate the structure relative to different frames.

5.1 Introduction

Optical flow can effectively be used to estimate structure and motion. In the last 20 years, a number of different solutions to the problem of structure from motion in the differential setting have been proposed. Linear techniques are fast and can be expressed in closed form, but the estimation of motion and structure is biased. Zhang and Tomasi (Zhang & Tomasi 1999) recently showed that the bias is due to a poor choice of the objective function, and that unbiased and consistent estimates can be obtained by direct minimization of the differential epipolar constraint in the least squares sense. However, that approach assumes that errors in the x and y directions are identical and uncorrelated. Whenever this is not true, severe errors and bias can be produced during the minimization process. Instead, it is here proposed to minimize the Mahalanobis


distance (the re-weighted squared error), which takes into account the spatial structure of the error: this is the Maximum Likelihood formulation of the problem.

If more than two images are available, then more information can be used for structure and motion estimation. One possible approach consists in blending the various depth estimates arising from pair-wise application of structure and motion estimation methods. Alternatively we formulate a single estimation problem, where all the information is used simultaneously to determine structure and motion.

In summary, we extend previous work in three fundamental ways: (i) by considering the covariance of the noise in the estimation problem, (ii) by proposing a multi-view approach that increases statistical precision by relying on a reduced number of parameters, and (iii) by recursively integrating the structure estimates relative to different sequence frames.

Figure 5.1. Multi-frame setting. I_0 is the reference view; the frames I_1, ..., I_5 follow along the time axis, up to the limit of validity of the instantaneous motion model. Reconstruction is performed simultaneously between the reference view and the views that satisfy the instantaneous motion model.

5.2 Problem Formulation

As stated before, the relationship between the image plane motion field u(x) and the motion of the camera is given by:

u(x) = (1/Z) A(x) v + B(x) ω + n(x)    (5.1)


where n(x) ∼ N(0, Σ) is zero-mean Gaussian additive noise. The matrices A(x) and B(x) are functions of image coordinates defined as follows:

A = [ 1  0  −x        B = [ −xy        1 + x²   −y
      0  1  −y ] ;          −(1 + y²)  xy        x ]

5.3 Two-frames Non-Linear Estimation of Structure and Motion

Given two views of the same scene, the instantaneous motion model of Eq. 5.1 is valid when the camera rotation is small and the forward translation is small relative to the depth (see Section 2.3.2). Consider M frames I_j, j ∈ {1 ... M}, and let I_0 be the reference view. We further assume that all the image pairs {I_0, I_j} satisfy the small motion approximation and call F_0j the relative optical flow field (see Fig. 5.1). F_0j can be used to estimate the relative motion between I_0 and I_j and the structure relative to I_0.

The residual for the ith feature relative to a pair of frames {I_0, I_j} is defined as:

r_i = u_i − (1/Z_i) A(x_i) v − B(x_i) ω    (5.2)

u_i is the optical flow of the ith feature calculated from the frames {I_0, I_j}; x_i and Z_i denote the feature's position and depth in the reference frame I_0. Stacking the residuals r_i in the 2N × 1 vector ρ = [r_1^T, ..., r_N^T]^T, the motion and structure can be estimated by solving the least squares problem:

(v̂, ω̂, Ẑ) = arg min_{(v,ω,Z)} ‖ρ‖²    (5.3)

where Z = (Z_1, ..., Z_N). The problem in Eq. 5.3 is a non-linear least squares estimation and has to be solved by an iterative technique. We used Gauss-Newton in the form:

J_k Δ[v^T, ω^T, Z^T]^T_k = −ρ_k    (5.4)

where J is the Jacobian of ρ and k is the iteration index. In general, J is rank deficient, due to the fact that the residual function is invariant under the transformation (v, ω, Z) ↦ (αv, ω, αZ) (gauge freedom). The rank deficient linear system 5.4 can be solved in the least squares sense by using the pseudo-inverse of J. Alternatively, the constant α can be fixed by imposing the constraint ‖v‖ = 1. Such a constraint can be differentiated, i.e. v_k^T Δv_k = 0, and this equation added as the last line of the linear system in Eq. 5.4. The resulting system of equations is full rank and can be solved with techniques for full rank least squares problems that are about twice as fast as the pseudo-inverse (Golub & Van Loan 1996).

Initialization: Iterative techniques for non-linear optimization problems are locally convergent and a good initialization is needed in order to find the global minimum. In


Figure 5.2. The aperture problem. The true image velocity u cannot be distinguished from the image velocity normal to the moving contour, u_n, when viewed through an aperture.

our problem, initialization is easier due to the separability of the differential epipolar constraint equation (see Section 3.2). To generate the initial value for (v, ω, Z), it is sufficient to initialize the vector v on the half sphere of radius 1 and then estimate the corresponding ω and Z using Eq. 3.11 and Eq. 3.12. An initial value for v can be obtained either by using an egomotion algorithm (see for example (Bruss & Horn 1983), (Heeger & Jepson 1992), (MacLean 1999)) or by random sampling.

Convergence: We randomly generated initial values of v over a hemisphere of radius one and computed the minima by stopping each branch when the relative residual was not changing by more than 1/1000. We found that, starting with 10-15 random values of v, at least one of the branches converged to the global minimum. Fig. 5.6 shows the improvement in terms of error and bias obtained by refining the Jepson-Heeger estimates of the translational velocity with the non-linear algorithm.

Covariance: Covariance for the structure and motion parameters can be easily estimated by:

C = E[ Δ[v^T, ω^T, Z^T] Δ[v^T, ω^T, Z^T]^T ] = J^{−1} E[ρρ^T] J^{−T} = σ² J^{−1} J^{−T}    (5.5)

where we used the hypothesis that the errors in the optical flow are uncorrelated Gaussians, all of variance σ² (Σ_i = σ² I_{2×2}, which implies E[ρρ^T] ≡ σ² I_{2N×2N}).
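In practice, for a rectangular Jacobian the inverses in Eq. 5.5 are replaced by pseudo-inverses; a minimal sketch under that assumption:

```python
import numpy as np

def parameter_covariance(J, sigma):
    """Covariance of the structure and motion parameters (Eq. 5.5) under
    isotropic flow noise of variance sigma**2. For a rectangular J, the
    Gauss-Newton approximation C = sigma^2 (J^T J)^+ stands in for
    sigma^2 J^{-1} J^{-T}."""
    return sigma**2 * np.linalg.pinv(J.T @ J)
```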


5.4 Re-weighted Multi-View Formulation

Figure 5.3. Directional uncertainty is indicated by the drawn error ellipse (low cornerness versus high cornerness). For sharp corner points the uncertainty is small and isotropic, since the intensity pattern has variations in all directions. For flat corners the uncertainty is larger in the direction tangent to the curve.

In this section we re-formulate the maximum-likelihood and time integrated versions of the algorithm described in the previous section.

5.4.1 Re-weighted Formulation

The algorithm described above gives a consistent and unbiased solution to the problem when errors are isotropic and all equal. However, due to the aperture problem, the flow estimates in the direction of the image gradient are much more precise than those in the normal direction (see Fig. 5.2 and 5.4). Hence, errors are usually elliptic and correlated along the x and y directions. An estimate of the covariance matrix Σ for the computed flow vectors is given by the Hessian of the image gray levels around the considered feature point (Shi & Tomasi 1994):

Σ^{−1} = [ I_xx  I_xy
           I_yx  I_yy ]    (5.6)

where I(x, y) is the image brightness. Assuming that there is no correlation between the noise relative to different features, Eq. 5.3 can be rewritten as:

(v̂, ω̂, Ẑ) = arg min_{(v,ω,Z)} ‖W^{1/2} ρ‖²    (5.7)

where W is the block diagonal matrix whose blocks are the matrices Σ_i^{−1}: this is the maximum-likelihood estimator. The Gauss-Newton iterations associated to Eq. 5.4 become:

W^{1/2} J_k Δ[v^T, ω^T, Z^T]^T_k = −W^{1/2} ρ_k    (5.8)

Again the constraint ‖v‖ = 1, expressed as v_k^T Δv_k = 0, is added as the last line of the linear system in Eq. 5.8.
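The two ingredients of the re-weighted iteration, the per-feature information matrices of Eq. 5.6 and the whitening by W^{1/2} in Eq. 5.8, can be sketched as follows (the patch extraction and the exact gradient operator are assumptions of this sketch):

```python
import numpy as np

def flow_information_matrix(patch):
    """Inverse covariance of a flow vector (Eq. 5.6), built from the
    gray levels around the feature as in Shi & Tomasi: the entries are
    sums of products of the image gradients over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                     [np.sum(gx * gy), np.sum(gy * gy)]])

def whiten_system(J, rho, Sigma_inv_blocks):
    """Apply W^(1/2) to the Gauss-Newton system of Eq. 5.8; W is block
    diagonal with the 2x2 information matrices on its diagonal."""
    n = len(Sigma_inv_blocks)
    W = np.zeros((2 * n, 2 * n))
    for i, S in enumerate(Sigma_inv_blocks):
        W[2*i:2*i + 2, 2*i:2*i + 2] = S
    lam, V = np.linalg.eigh(W)              # W is symmetric PSD
    W_half = V @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ V.T
    return W_half @ J, W_half @ rho
```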


Covariance: Covariance for the structure and motion parameters can be easily estimated by:

C = E[ Δ[v^T, ω^T, Z^T] Δ[v^T, ω^T, Z^T]^T ] = J^{−1} E[ρρ^T] J^{−T} = J^{−1} W^{−1} J^{−T}    (5.9)

5.4.2 Multi-view Structure and Motion Estimation

The algorithm described above can be applied to all image pairs {I_0, I_j} satisfying the small motion approximation, yielding independent estimates of the same structure Z.

Since the parameters Z are shared by all the minimizations of the type of Eq. 5.3, it is possible to minimize all the two-frames residuals simultaneously, in a single non-linear least squares problem. Stacking the linear and angular velocities (v_j, ω_j) between pairs of frames {I_0, I_j} in 3M × 1 vectors v⃗ = [v_1^T ... v_M^T]^T and ω⃗ = [ω_1^T ... ω_M^T]^T, we can formulate the multi-view minimization as:

(v⃗̂, ω⃗̂, Ẑ) = arg min_{(v⃗, ω⃗, Z)} ‖W^{1/2} ρ⃗‖²    (5.10)

where ρ⃗_{2NM×1} = [ρ_1^T, ..., ρ_M^T]^T is obtained by stacking the two-frames residual vectors ρ_j, and W_{2NM×2NM} is the block diagonal weight matrix whose diagonal blocks are the two-frames weight matrices W_j, j ∈ {1 ... M}. For the minimization we used again Gauss-Newton in the form:

W^{1/2} J_k · Δ[v⃗^T, ω⃗^T, Z^T]^T = −W^{1/2} ρ⃗    (5.11)

where J_{2NM×(N+6M)} is the Jacobian of ρ⃗. In the multi-frame setting it is more convenient to handle the scale ambiguity by fixing the norm of Z, which automatically fixes the norms of the different v_j. The equation Z_k^T ΔZ_k = 0 is added as the last line of the system in Eq. 5.11.

The advantage of the multi-frame minimization is that the number of fitted parameters is significantly reduced, hence improving the statistical precision of the estimate. Assuming that Z is estimated M times independently from the two-frames algorithms, the precision of the estimate is about ε_s ≃ 1/√(M(2N − p)), where p denotes the number of estimated parameters, in our problem p = N + 6M. For the multi-frame estimation we get ε_m ≃ 1/√(M2N − p). Convergence properties for the multi-frame minimization are considered in the experiments section. The covariance of the measurements can be easily estimated by extending Eq. 5.5.

5.5 Recursive Structure and Motion Estimation

Time integration is achieved in a recursive manner as described in the previous chapter. Using a notation similar to that of Kalman filtering, we define three structure estimates at time t_M: X^−(M) is the a priori estimate, X_M(M) is the current measurement and X^+(M) is the a posteriori estimate. The relationship among such estimates is:


X^+(M) = ( (C^−)^{−1} + C^{−1} )^{−1} ( (C^−)^{−1} X^−(M) + C^{−1} X_M(M) )    (5.12)

where C is the covariance matrix of the structure estimates. The covariance matrix of the current measurement can be easily estimated by taking the derivatives of the optimization function in Eq. 5.10 at the minimum (see Eq. 5.5). The covariance matrix of the a posteriori structure estimate can be easily obtained by error propagation:

C^+ = ( (C^−)^{−1} + C^{−1} )^{−1}    (5.13)

The a priori covariance is obtained by error propagation using Eq. 4.15 and Eq. 4.18.
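Equations 5.12 and 5.13 are the standard information-form fusion of two Gaussian estimates; a direct sketch:

```python
import numpy as np

def fuse_structure(X_prior, C_prior, X_meas, C_meas):
    """A posteriori structure and covariance from Eq. 5.12 and Eq. 5.13:
    the two estimates are re-weighted by their inverse covariances."""
    I_prior = np.linalg.inv(C_prior)
    I_meas = np.linalg.inv(C_meas)
    C_post = np.linalg.inv(I_prior + I_meas)                   # Eq. 5.13
    X_post = C_post @ (I_prior @ X_prior + I_meas @ X_meas)    # Eq. 5.12
    return X_post, C_post
```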

5.6 Experiments

Figure 5.4. (a) Angle (in degrees) between true and estimated linear velocities for the re-weighted and un-weighted algorithms with constant error ellipse orientation. (b) Standard deviation (in degrees) of the estimated linear velocities for the re-weighted and un-weighted algorithms with random error ellipse orientation. There were 100 trials for each noise level.

We extensively tested the algorithm using synthetic flow fields. Zero-mean Gaussian noise was added to the components of the velocity with different degrees of ellipticity and orientation. The shape of the elliptical uncertainty was varied by changing the value of the parameter r_λ = √(λ_max/λ_min), where λ_max and λ_min are the largest and smallest eigenvalues of the covariance matrix Σ.

Simulations: The two-frames re-weighted algorithm was tested for different ellipticities in the range 0 ≤ r_λ ≤ 20. We performed two different sets of tests. In the first, the errors were elliptical and the orientation of the error ellipses was kept constant. Figure 5.4(a) shows the bias in the estimation of the linear velocity. The un-weighted


Figure 5.5. Structure and motion error using the two-views algorithms over the pairs {I_0, I_1} and {I_0, I_2} and using the multi-frame algorithm over the 3 frames simultaneously, as a function of noise (pixels). (a) Linear velocity (v_1, v_2) error (degrees). (b) Structure error (see Section 2.5 for the definition).

algorithm fails almost systematically to find the correct camera velocity. In the second test, the ellipse orientation was random. Both the un-weighted and re-weighted algorithms lead to an unbiased translational velocity, but Figure 5.4(b) shows that the re-weighted version has globally a lower error, up to 3 times smaller for ellipticity r_λ = 20.

In the case of the multi-view minimization we used 3 views, of which one is fixed as the reference view. We estimated motion and structure parameters (v_1, ω_1, v_2, ω_2, Z) for different noise levels using the two-views algorithm with the image pairs {I_0, I_1} and {I_0, I_2} and using the multi-view algorithm with the 3 views simultaneously. Figure 5.5 clearly shows that the multi-view algorithm outperforms the two-views ones in both the estimation of the structure and the motion.

Real Images: Fig. 5.8 and Fig. 5.7 show two examples of the multi-frame reconstruction using a total of 3 frames and recursive time integration. Features were tracked using the method in (Shi & Tomasi 1994). Sequences are acquired with a hand held commercial camcorder at 25Hz. The average feature motion is about 1 pixel per frame. Due to the unavailability of the ground truth, we assess the efficiency of our method by measuring the planarity of the 3 planar surfaces of the box-like shapes. This is done by fitting 3 planes to the 3D reconstruction and measuring the average residual of the fit. We find that, on average, the re-weighting improves the planarity by about 10% and the multi-frame integration by about 30% with respect to the un-weighted and un-integrated approach. Results are reported in Tab. 5.1. The units are focal lengths.

Convergence: Initialization of the multi-frame algorithm is obtained by first estimating (v_1, ω_1, Z) using the 2-frames algorithm described in 5.4.1. The estimated Z are used to compute the velocities (v_j, ω_j). In this way the vectors Z and v_j all have the same scale. The multi-view minimization is performed by imposing the constraint ‖Z‖ = const.


            normal   weighted   multi-view
linux box   0.0163   0.0146     0.0109
tea box     0.0110   0.0101     0.0078

Table 5.1. Planar residuals, two views setting. Units are focal lengths.

With this initialization the algorithm converges to the absolute minimum approximately 99% of the time.

5.7 Summary

In this chapter we propose a maximum likelihood algorithm for structure and motion estimation. Due to the aperture problem, errors in the optical flow estimation are usually anisotropic. Such errors need to be modelled in the optimization process for the variance not to be greater than necessary. We test our algorithm on both simulated and real data sets. We show how to obtain a simple estimate of the optical flow covariance matrix and how reconstruction is improved by our approach.

When dense sequences are acquired, the instantaneous motion approximation holds for a few frames acquired in a row. This can be used to further improve reconstruction by local time integration of the structure. The main advantage of this technique is the increase in the number of degrees of freedom. This improves the statistics and, in turn, the variance of the estimates. The efficacy of our approach has been tested on both simulated and real data sets.


Figure 5.6. Polar plots of the linear velocity azimuthal and polar angles. (a) Jepson-Heeger algorithm (fov = 50, T = [4, 3, 0.01], N = 100, 100 trials). (b) Refinement of Jepson-Heeger by non-linear minimization (TOL = 1e−2; average number of iterations 2.15, average CPU time 0.1 sec). (c) Average values of the computed linear velocities (real velocity, non-linear average, linear average).


Figure 5.7. Multi-frame reconstruction over 3 frames. 245 features were used. Average optical flow is one pixel per frame.

Figure 5.8. Multi-frame reconstruction over 3 frames. 271 features were used. Average optical flow is one pixel per frame.


Figure 5.9. The Flower garden sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) & (d) Examples of reconstruction.


Chapter 6

Automatic Segmentation and Reconstruction of Multiple Planar Scenes

In this chapter we present a motion based segmentation algorithm to automatically detect multiple planes from sparse optical flow information. An optimal estimate for planar motion in the presence of additive Gaussian noise is first proposed, including directional uncertainty of the measurements (thus coping with the aperture problem) and a multi-frame (n > 2) setting (adding overall robustness). In the presence of multiple planes in motion, the residuals of the motion estimation model are used in a clustering algorithm to segment the different planes. The image motion parameters are used to find an initial cluster of features belonging to a surface, which is then grown towards the surface borders. Initialization is random and only robust statistics and continuity constraints are used. There is no need for using and tuning thresholds. Since the exact parametric planar flow model is used, the algorithm is able to cope efficiently with projective distortions, and 3D motion and structure can be directly estimated.

6.1 Introduction

Planes are common features in both man made and natural environments. The underlying geometry of induced homographies is well understood (Kanatani 1993a) and used to perform different tasks: video stabilization, visualization, 3D analysis (using for example plane + parallax (Sawhney n.d., Irani, Anandan & Cohen 1999, Hartley & Zisserman 2000)), ego-motion estimation, calibration (Malis & Cipolla 2000, Demirdjian, Zisserman & R. 2000, Matsunaga & Kanatani 2000), just to cite a few of them. In the continuous limit the homography is replaced by the Flow Matrix (Kanatani 1993a, Shakernia 1999): this can be calculated from two views (Bergen, Anandan & Hanna


1992, Ju, Black & Jepson 1996), or, as suggested more recently by Irani (Zelnik-Manor & Irani 1999), in a multi-frame context to gain added stability and precision.

Automatic detection of planar surfaces from flow fields belongs to the wider area of motion based segmentation, where the image is partitioned into regions of homogeneous 2D motion based on continuity or on fitting a parametric motion model. There exist quite a large number of different approaches to the solution of this problem. Top-down techniques handle the whole image as the estimation support. A global motion model is estimated and areas that do not conform to such a model are detected, generating a two class partition (Torr, Szeliski & Anandan 2001). The main limitations are (i) the presence of a dominant motion is required and (ii) simply rejecting non conforming pixels does not produce spatially compact regions. Segmentation in a Markovian framework enables the addition of spatial consistency constraints (Odobez & Bouthemy 1995). Simultaneous estimation of models and supports is another approach, useful when a mixture of motion models, none of them dominant, is present. The EM algorithm has been used efficiently for this purpose in (Ayer & Sawhney 1996, Weiss 1997). A more general approach consists in partitioning the image into elementary regions (intensity-based or texture-based regions or square blocks are often used) and searching for motion based regions as clusters of these. A commonly used technique consists in fitting affine motion to the regions and grouping them with a clustering process based on similarity of the model parameters. In (Wang & Adelsson 1993, Ayer & Sawhney 1997) a k-means algorithm was used in the motion parameters space.

In this chapter we present an automatic clustering technique that works on a sparse flow field and is able to find features lying on planar surfaces by analyzing the flow matrix that they generate. There are three main contributions:

• We first show that greater robustness in the estimation of the planar flow parameters can be achieved by re-weighting the linear least squares estimation by the optical flow covariance matrix. Due to the aperture problem, errors in optical flow computation are rarely symmetric, but tend instead to be anisotropic and correlated along the x and y directions. In this case re-weighted least squares is the maximum likelihood estimator. We show how to compute the covariance matrix directly from image gray levels.

• Further robustness can be obtained in the case in which multiple frames are available. In such a case, the fact that the underlying planar geometry is the same in all the views provides a rank constraint over the matrix obtained by stacking together the planar flow parameters for the pairs of frames.

• Planar flow is fitted to a cluster of points and the standard deviation of the residuals is estimated by means of robust statistics. The consistency of each single feature's image motion with the planar hypothesis can now be established by comparing its residual with the estimated standard deviation; features that have residuals larger than 2.5σ are discarded as outliers. Clusters of inliers so obtained are grown outwards as the model allows, using a nearest neighbor algorithm that takes into account continuity (i.e. points closer to a cluster are more likely to belong


to it than others) and robustness (i.e. it is better to start growing the cluster in regions where the flow is estimated robustly).

Since the exact parametric planar model is used, 3D motion and structure can be estimated directly from the segmentation information. As only robust statistics are used, there is no need to tune thresholds, eliminating the need for user interaction. Unlike top-down techniques, no dominant motion is required; unlike affine flow fitting, the method we propose is able to cope efficiently with projective distortions and works well when objects are close to the camera. The algorithm is also tested under adverse conditions, i.e. when cameras are approximately orthographic or surfaces are just approximately planar, obtaining satisfying results.

Although random sampling and robust statistics are used, this is not a RANSAC-like algorithm, for the reason that we do not assume a dominant motion model. In general, we allow any relative dimensions of the planes, while RANSAC is based on a high inliers to outliers ratio (Fischler & Bolles 1981).

6.2 Planar Motion Estimation

6.2.1 Basic Model and Notation

The image motion of points on a planar surface between two image frames can be expressed as (Kanatani 1993a):

u(x) = F(x) · b + n(x)    (6.1)

where F(x) is a 2 × 8 matrix depending only on the pixel coordinates x = (x, y):

F(x) = [ 1  x  y  0  0  0  x²  xy
         0  0  0  1  x  y  xy  y² ]    (6.2)

n ∼ N(0, σ) is Gaussian additive noise and b_{8×1} is the vector of the planar flow parameters. The vector b can be factorized into a shape and a motion part as:

b = S_{8×6} · [ v
                ω ]    (6.3)

with

S = [ f·p_z   0       0        0      f     0
      p_x     0      −p_z      0      0     0
      p_y     0       0        0      0    −1
      0       f·p_z   0       −f      0     0
      0       p_x     0        0      0     1
      0       p_y    −p_z      0      0     0
      0       0      −p_x/f    0     1/f    0
      0       0      −p_y/f  −1/f     0     0 ]    (6.4)

where p = (p_x, p_y, p_z) is the normal to the plane, and ω = (ω_x, ω_y, ω_z) and v = (v_x, v_y, v_z) are the rotational and the linear velocities of the camera respectively. The camera focal length is denoted by f and we assume that it is constant over time.


6.2.2 Two Frames Re-weighted Estimation

Figure 6.1. Performance of weighted linear least squares for the estimation of the flow matrix: error as a function of r_λ for the weighted and unweighted estimators. The average values of r_λ for two of the sequences used for tests (tea box and calibration grid) are also reported.

Figure 6.2. B matrix estimation improvement due to multi-frame integration: error as a function of pixel noise, with and without the rank constraint. 10 frames were used.

If N features are available, stacking the optical flow vectors u_i in the vector U_{2N×1}, the F(x_i) into the matrix G_{2N×8} and the noise n into the vector η_{2N×1}, Eq. 6.1 can be rewritten as:

U = G b + η    (6.5)

The maximum likelihood estimate of the planar flow parameters vector b is given by the weighted linear least squares problem:

b̂ = arg min_b { (U − Gb)^T W (U − Gb) }    (6.6)

The solution of the LLSE problem of Eq. 6.6 is found by solving the re-weighted system of normal equations:

G^T W G b = G^T W U    (6.7)

The weight matrix W is block diagonal and the diagonal blocks are 2 × 2 matrices that represent the covariance of the estimated optical flow vectors. The covariance matrices Σ can be estimated (see (Shi & Tomasi 1994)) as:

Σ^{−1} = [ I_xx  I_xy
           I_yx  I_yy ]    (6.8)

The introduction of re-weighting in Eq. 6.6 is particularly important since, due to the aperture problem, the errors are usually asymmetric. This approach was first used successfully by (Irani & Anandan 2000) in the context of orthographic factorization.
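A sketch of the re-weighted solve of Eq. 6.7; the block assembly of W mirrors Eq. 6.8, and the names are illustrative:

```python
import numpy as np

def planar_flow_fit(G, U, Sigma_inv_blocks):
    """Weighted linear least squares estimate of the 8-vector b
    (Eq. 6.7). G is the stacked 2N x 8 matrix built from Eq. 6.2, U the
    stacked 2N flow components, and W the block diagonal matrix of the
    2x2 inverse covariances of Eq. 6.8."""
    N = len(Sigma_inv_blocks)
    W = np.zeros((2 * N, 2 * N))
    for i, S in enumerate(Sigma_inv_blocks):
        W[2*i:2*i + 2, 2*i:2*i + 2] = S
    return np.linalg.solve(G.T @ W @ G, G.T @ W @ U)
```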


Figure 6.3. (a) Optical flow warping with respect to 4 points on the plane chosen according to the nearest neighbor principle (the 4 points are marked with crosses). (b) Magnitude of the residuals (histogram of hits versus residual; the outliers are clearly separated).

A set of simulated tests was performed in order to assess the improvements of this approach in the estimation of the planar flow parameters. We randomly generated points on a planar surface and then added Gaussian elliptical noise of randomly generated direction to the flow vectors. The shape of the elliptical uncertainty was varied by changing the value of the parameter r_λ = √(λ_max/λ_min), where λ_max and λ_min are the largest and smallest eigenvalues of the covariance matrix Σ. We ran 20 trials for each of the values of r_λ for 100 points on a plane. We defined the residual field as:

r = u − F b̂    (6.9)

The estimation error we used is then:

err = mean_i ( ‖r_i‖_Σ / ‖u_i‖ )    (6.10)

where i = 1 ... N runs over the set of features and ‖·‖_Σ is the Σ-norm. Results reported in Figure 6.1 clearly show the superiority of the weighted approach.

6.2.3 Multi-frames Re-weighted Estimation

If m views of the same planar surface are acquired,m− 1 pairs can be formed betweenthe first and thejth images,j ∈ {2, . . . ,m}. If such pairs are close enough such thatthe instantaneous approximation can be used, the set of flow parameters vectorsbj canstill be estimated by arranging them in a matrixB8×m−1 = [b2, ...,bm] and solving:


G^T W G · B = G^T W [U_2, ..., U_m]    (6.11)

or in short C · B = K. The matrix G depends only on the feature positions in the first frame and is defined in the previous section. The weight matrix W is estimated by doing a temporal average over all the m frames. In this multi-frame setting B can be factorized as:

B = S_{8×6} · [ v_2  ...  v_m
                ω_2  ...  ω_m ]_{6×(m−1)}    (6.12)

Solving for B in Eq. 6.11 is equivalent to solving independently for the b_j of the image pairs. This does not exploit the fact that the underlying plane geometry (expressed by the matrix S) must be the same in all the views. Such a constraint is expressed by the dimensionality of the matrices on the right side of Eq. 6.12, which fixes the rank of B to be at most 6. Lower ranks can be generated by special motion configurations, for example rank(B) = 1 when the motion is constant over time.

Due to the fact that rank(B) ≤ 6 we get that rank(K) ≤ 6. Hence, before solving Eq. 6.11, we can re-project K onto a lower dimensional linear subspace, seeking the matrix K̂ with rank(K̂) ≤ 6 that is closest to K in the sense of the Frobenius norm (Kanatani 1993a).
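The Frobenius-closest matrix of bounded rank is obtained by truncating the singular value decomposition (the Eckart-Young theorem); a minimal sketch of the projection used here:

```python
import numpy as np

def project_to_rank(K, r=6):
    """Closest matrix to K in the Frobenius norm with rank at most r,
    used to enforce the rank constraint before solving Eq. 6.11."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    s[r:] = 0.0                  # zero out the trailing singular values
    return (U * s) @ Vt
```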

Fig. 6.2 shows the error in the estimation of B for a simulated data set. A set of 10 views of the same plane was generated and the matrix B estimated using Eq. 6.11 with or without the rank constraint. A total of 100 features and 20 trials were used. The estimation error was defined as err = ‖B̂ − B‖/‖B‖, where B is the ground truth. The error on the optical flow was varied between 0.05 and 0.4 pixels. The rank constraint clearly increases the performance of the estimation. The un-weighted multi-frame approach was first used in (Zelnik-Manor & Irani 1999).

6.3 Planar Motion Segmentation

We have proposed an optimal solution to the problem of estimation of planar motion in the presence of Gaussian additive noise. Such a method is now used as the core step of the segmentation algorithm, by fitting planar motion to a tentative cluster of points and rejecting outliers by means of robust statistics.

6.3.1 Algorithm

The magnitude of the residuals r = ‖r‖ can be effectively used for segmentation purposes. Figure 6.3 (a) shows the residual flow when the planar flow parameters are estimated by 4 points in the highlighted plane. The norm of the residual vectors is plotted in Figure 6.3 (b): the difference in magnitude between points on the plane and off the


plane is quite obvious. In the multi-frame setting a more robust estimate of the feature residuals is defined as:

r_i = ( Σ_{j=1}^{m} r_{ij} e^{−d_j²} ) / ( Σ_{j=1}^{m} e^{−d_j²} )    (6.13)

where j runs over the frames, i indexes the features, and d_j is the average motion of the features between frame j and the reference frame, which measures the adequacy of the instantaneous approximation.

The selection of inliers and outliers is based on a robust standard deviation estimate (Rousseeuw & Leroy 1987). If a moderate amount of outliers is present in a set of Q features, a robust estimate of the standard deviation of the residuals r_q, q ∈ {1, ..., Q}, can be obtained as:

σ = 1.4826 ( 1 + 5/(Q − l) ) median_q √(r_q²)    (6.14)

where l is the number of fitted parameters, 8 in our problem. Inliers are those that verify r_q ≤ 2.5σ.
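A direct transcription of Eq. 6.14 and the inlier test (l = 8 planar flow parameters; names are illustrative):

```python
import numpy as np

def robust_inliers(residuals, l=8):
    """Robust standard deviation of Eq. 6.14 and the resulting inlier
    mask r_q <= 2.5 sigma; residuals holds the magnitudes r_q."""
    Q = len(residuals)
    sigma = 1.4826 * (1.0 + 5.0 / (Q - l)) * np.median(np.abs(residuals))
    return residuals <= 2.5 * sigma, sigma
```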

The segmentation algorithm is outlined below.

1) Randomly select one point and determine an initial cluster of 5 points by adding its 4 nearest neighbors. The nearest neighbor to a configuration of points is defined as the point closest to the center of mass of the configuration. The center of mass is found as a weighted mean of the feature positions, where the weights are the minimum eigenvalues of the covariance matrices of the flow vectors: this ensures that the algorithm grows, at the beginning, towards areas where the features are tracked robustly which, in turn, helps to get a more precise initial estimate of the planar flow parameters. At the same time this procedure increases the probability that the initial cluster is located on a plane: it is crucial that at each step the number of outliers is small so that Eq. 6.14 can be used.

2) Fit the planar flow parameters and select as good features those for which r_q ≤ 2.5σ. If fewer than 4 features are left, start over; otherwise go to the next step.

3) Add the nearest neighbor feature and start over (see the sketch after this list). Growing the cluster by recursively adding the nearest neighbor exploits plane continuity, i.e. it is more likely that the nearest neighbor to the cluster belongs to the plane than another point very far apart. In general, if no a priori information is available about the filmed scene, this approach turns out to be very effective.

In general, the initial 5 points can be discarded during the growing process. This makes the algorithm more flexible and well behaved even in the case where the 5 points lie on different surfaces: the Nearest Neighbor growing moves into one of the planar surfaces and the initial points that do not lie on this surface are later discarded as outliers. The algorithm ends naturally when all the features have been analyzed and no more inliers are found. Since the detected outliers can belong to another planar surface, the algorithm can be restarted for the next plane detection.
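The nearest neighbor selection used in steps 1 and 3 is the main geometric ingredient of the growing loop; a self-contained sketch, where the optional weighting by the minimum eigenvalue of the flow covariance is the one described in step 1:

```python
import numpy as np

def nearest_neighbor(positions, in_cluster, weights=None):
    """Index of the feature closest to the (weighted) center of mass of
    the current cluster, or None when no feature is left outside.
    positions is N x 2; in_cluster a boolean mask; weights may hold the
    minimum eigenvalues of the flow covariance matrices."""
    w = None if weights is None else weights[in_cluster]
    com = np.average(positions[in_cluster], axis=0, weights=w)
    d = np.linalg.norm(positions - com, axis=1)
    d[in_cluster] = np.inf        # consider only points outside the cluster
    return None if np.isinf(d.min()) else int(d.argmin())
```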


6.3.2 Final Refinement: Resolving Ambiguities

The growing algorithm can sometimes be greedy, including into a given planar area ambiguous points which belong to a neighboring plane but whose flow is similar to that of the first plane. This is the case for the three sequences shown in Figures 6.4, 6.5 and 6.6, where planes are incident. The ambiguous points close to the intersection of the surfaces can easily be found by recursively running the clustering algorithm.

Let us call the three surfaces α, β and γ (see Figure 6.4 (a)) and the 3 clusters obtained by running the clustering algorithm for the first time G¹_α, G¹_β, G¹_γ. At this point we know approximately which features belong to the 3 planes, up to the ambiguous ones close to the incidence. We re-run the algorithm taking as initial feature the one closest to the center of mass of the features in G¹_β. In this way the plane β is, in the new run of the algorithm, detected as the first one and the cluster will tend to invade the surfaces α and γ close to the borders. The algorithm finds 3 more clusters over the 3 surfaces: G²_α, G²_β, G²_γ. We get that the ambiguous features of surfaces α, β are defined as:

B_αβ = G¹_α ∩ G²_β    (6.15)

Running the algorithm a third time, starting from the center of mass of the plane γ, the ambiguous features close to the three intersections can be defined as:

B_αβ = G¹_α ∩ G²_β    (6.16)

B_αγ = G¹_α ∩ G³_γ    (6.17)

B_βγ = G²_β ∩ G³_γ    (6.18)

The final clusters of points that lie on the three surfaces are defined as:

G_α = G¹_α \ (B_αβ ∪ B_αγ)    (6.19)

G_β = G¹_β \ (B_αβ ∪ B_βγ)    (6.20)

G_γ = G¹_γ \ (B_αγ ∪ B_βγ)    (6.21)

The final planar flow parameters b for the three planes are found by refitting the clusters according to Eq. 6.11.

Ambiguous points can eventually be reassigned by checking their residuals with respect to the three final plane hypotheses, i.e. assigning the feature to the plane with respect to which the residual is minimum and verifies Eq. 6.14, where σ is calculated just from the points already in the cluster. Rejected points can also be assigned using this principle. The reason is that the reliability of assigning a point to a cluster depends on how large the cluster is (i.e. how many points the cluster has), due to the fact that the statistical precision of the fit of the B matrix grows as a function of the number of points in the cluster. This means that points that are erroneously discarded when the cluster size is small can be successfully assigned when the cluster is completely grown. An example is shown in Figure 6.6, where we applied reassignment to the calibration grid sequence.


6.4 Results

We tested our algorithm extensively on real images and simulated data.

Figure 6.4 shows the process of iterating three times the clustering and the ambiguous feature detection and removal. The process of finding ambiguous features finds not only features at the intersection of planes but also random outliers. A total of 10 frames was integrated and 237 features were used; the number of features rejected was zero and 83 ambiguous features were found.

Figure 6.5 shows another application for a different sequence made of 11 frames and 254 features. Only 9 features were rejected and 44 features were removed as ambiguous.

Figure 6.6 illustrates an application to a very noisy sequence of 10 frames. The aperture problem is very serious due to the massive presence of edges (see also Figure 6.3). In this case we decided to reassign ambiguous and rejected features. We found that 76 of the 220 features were removed as ambiguous and all of them were reassigned correctly; 53 features were rejected, of which 13 were reassigned, and the reassignment was correct. In Fig. 6.7 the algorithm is applied to segment the walls of a building. The two main surfaces are exactly found, while the features on the third one (triangles) are unclassified. In this case the algorithm cannot go to the third step since fewer than 4 features are left each time the constraint r_q ≤ 2.5σ is enforced.

6.5 Summary

In this chapter we presented a new motion based segmentation technique able to automatically find planar surfaces when a sparse optical flow field is given.

We first formulated an optimal solution to the estimation of planar flow parameters in the presence of Gaussian additive noise. Experiments on simulated data show the improvement in performance we obtained compared with previous approaches.

We then showed how the planar geometry induces a constraint between planar flow parameters estimated using different pairs of frames, and demonstrated the performance improvement obtained by applying such a constraint.

Robust estimation of planar flow parameters is the core of the segmentation algorithm. A cluster of points is initialized randomly and then grown on a plane by means of robust statistics, i.e. finding and eliminating outliers, and proximity constraints, i.e. using the fact that planes are mostly continuous surfaces.

Results with real images were presented to illustrate the performance of the proposed method.


Figure 6.4. The tea box sequence, with surfaces labelled α, β and γ. (a) An image from the sequence. (b) First iteration of the clustering algorithm; crosses indicate the initial 5 points selected. (c) Segmentation obtained starting from surface number 2. (d) Segmentation obtained starting from surface number 3. (e) Ambiguous features and outliers. (f) Final segmentation after ambiguous feature removal.


Figure 6.5. The Linux box sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) Segmentation obtained at the first iteration; 4 of the 5 initial features (crosses) on surface nr. 1 (dots) are rejected as too noisy. (d) Final segmentation.


Figure 6.6. The calibration grid sequence. (a) A frame from the video sequence. (b) Segmentation obtained after three iterations of the clustering algorithm. (c) Reassigned ambiguous features. (d) Segmentation obtained after reassignment of the ambiguous and the rejected features.


Figure 6.7. The aquatic center sequence. (a) A frame from the video sequence. (b) Final segmentation. Triangles mark features that were not assigned to any surface.


Chapter 7

Maximum Likelihood Structure and Motion with Geometrical Constraints

Unbiased and consistent estimates of structure and motion can be obtained by least squares minimization of the differential epipolar constraint. Previous work on this subject does not make use of the geometrical constraints that are often present in natural and man made scenes. This chapter shows how linear constraints among feature points (collinearity and coplanarity) and bilinear relations among such entities (parallelism and incidence) can be incorporated in the minimization process to improve the structure and motion estimates. There are 2 main contributions: (i) the formulation of a constrained minimization problem for structure and motion estimation from optical flow and (ii) the solution of the optimization problem by Levenberg-Marquardt and direct projection. We show that the proposed approach is stable, fast and efficient.

7.1 Introduction

Optical flow has been efficiently used for structure and motion estimation. The differential epipolar constraint has been manipulated in different ways in order to linearize the minimization problem (Heeger & Jepson 1992, Ma et al. 2000b). Although fast and of closed form, linear techniques have been shown to lead to biased estimates of motion and structure. Zhang and Tomasi (Zhang & Tomasi 1999) recently demonstrated that such bias is due to the incorrect choice of the objective function and that unbiased and consistent estimates can be obtained by minimization of the differential epipolar constraint in the least squares sense. However, this approach does not exploit any constraints on the scene geometry. Man made and natural environments feature many geometrical entities (lines and planes for example), often arranged in special configurations


(parallelism, incidence etc). Attempts to incorporate such constraints in reconstruction algorithms have previously been presented in the stereo framework (Grossman & Santos-Victor 2000a, Szeliski & Torr 1998, Bondyfalat & Bougnoux 1998).

In this chapter we approach the problem of using such geometrical constraints in the differential structure from motion setting. We show (i) how to incorporate linear constraints among feature points (collinearity and coplanarity) and bilinear relationships among such entities (parallelism and incidence at a certain angle) in the minimization of the differential epipolar constraint; and (ii) how fast and stable convergence is obtained by the use of a Levenberg-Marquardt iterative minimization and a direct projection method (Scales 1985) for constraint enforcement. The result is a fast and efficient algorithm which drastically reduces the reconstruction error.

7.2 Problem Formulation

As stated before, the relationship between the image plane motion field u(x) and the motion of the camera is expressed as:

u(x) = (1/Z) A(x) v + B(x) ω + n(x)    (7.1)

where (v, ω) are the linear and angular camera velocities and n(x) ∼ N(0, σ) is zero-mean Gaussian additive noise. Z is the depth of the scene points, whose 3D position we indicate with X = [X, Y, Z]. The matrices A(x) and B(x) are functions of image coordinates defined as follows:

B = [ −xy        1 + x²   −y        A = [ 1  0  −x
      −(1 + y²)  xy        x ] ;          0  1  −y ]

7.2.1 Unconstrained Optimization

Given two views of the same scene, the instantaneous motion model of Eq. 7.1 is valid when the camera rotation is small and the forward translation is small relative to the depth. If this condition is met, optical flow between the two frames can be computed and depths and velocities can be estimated. We define the residual for the $i$th feature relative to the pair as:

$$r_i = u_i - \frac{1}{Z_i}A(x_i)v - B(x_i)\omega \qquad (7.2)$$

We stack the residuals $r_i$ in the $2N \times 1$ vector $\rho = [r_1, \ldots, r_N]$ and the structure and motion parameters in the vector $\theta = (v, \omega, Z)$ with $Z = (Z_1, \ldots, Z_N)$. Structure and motion can be estimated by solving the least squares problem:

$$\hat{\theta} = \arg\min_\theta \|\rho(\theta)\|^2 \qquad (7.3)$$


The problem in Eq. 7.3 is a non-linear least squares estimation and has to be solved by an iterative technique. We used Gauss-Newton in the form:

$$J_k \Delta\theta_k = -\rho_k \qquad (7.4)$$

where $J$ is the Jacobian of $\rho$ and $k$ is the iteration index.
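To make the iteration concrete, here is a minimal sketch of this Gauss-Newton loop in Python/NumPy. The residual function `rho` and its Jacobian `jac` are hypothetical stand-ins for Eq. 7.2 and its derivatives with respect to $(v, \omega, Z)$; this is an illustration, not the thesis implementation.

```python
import numpy as np

def gauss_newton(rho, jac, theta0, n_iter=20, tol=1e-10):
    """Minimize ||rho(theta)||^2 with the iteration of Eq. 7.4."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        r = rho(theta)                # 2N-vector of residuals r_i
        J = jac(theta)                # (2N x dim) Jacobian of rho
        # Solve J_k * dtheta = -rho_k in the least squares sense
        dtheta = np.linalg.lstsq(J, -r, rcond=None)[0]
        theta = theta + dtheta
        if np.linalg.norm(dtheta) < tol:
            break
    return theta
```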

7.3 Constrained Non-Linear Estimation of Structure and Motion

7.3.1 Constraints Formulation

We consider only constraints linear in the feature coordinates (i.e. collinearity and coplanarity) and bilinear constraints among such geometrical entities (i.e. parallelism and incidence at a certain angle). Planes are parameterized by a direction $p$ and a distance $d$. Lines are described by the intersection of 2 planes. The constraints can be expressed as:

$$X \cdot p_a = d_a \qquad X \in \text{plane} \qquad (7.5)$$
$$p_a \cdot p_b = \cos\theta_{a,b} \qquad \text{incidence at angle } \theta_{a,b} \qquad (7.6)$$
$$p_a = p_b \qquad \text{parallelism} \qquad (7.7)$$

7.3.2 Constrained Optimization

Coplanarity and collinearity can be easily incorporated in Eq. 7.3 by parameterization. If a set of points $X_h$, $h \in \{1 \ldots H\}$, belongs to a plane, their depths can be parameterized as $Z_h = d / ([x_h, y_h, 1] \cdot p)$. The residual of such points takes the form:

$$r_h = u_h - \frac{[x_h, y_h, 1] \cdot p}{d}\, A(x_h)v - B(x_h)\omega \qquad (7.8)$$

Indicating with $c = [p_1, d_1, \ldots, p_M, d_M]$ the vector of the parameters of the $M$ constraints, structure and motion can be estimated by solving:

$$\hat{\Theta} = \arg\min_\Theta \|\rho(\Theta)\|^2 \qquad (7.9)$$

where $\Theta = (v, \omega, Z_u, c)$ and $Z_u$ is the vector of the depths of the points for which no constraints are available. Gauss-Newton performs rather poorly on this problem, while we found that fast convergence is guaranteed by a Levenberg-Marquardt iteration technique. This consists in solving the system of equations:

$$L_k \Delta\Theta_k = -J_k^T \rho_k \qquad (7.10)$$

where $J$ is the Jacobian of $\rho(\Theta)$. The matrix $L$ is defined as $L_{m,m} = (1+\lambda)(J^TJ)_{m,m}$ and $L_{m,n} = (J^TJ)_{m,n}$ for $m \neq n$. We found that a good choice of the initial value of the parameter $\lambda$ is $\lambda = 10^{-3}$.
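A minimal sketch of one such damped iteration is given below (Python/NumPy). The residual vector and Jacobian are assumed to be supplied by the caller; the update follows Eq. 7.10, and the usual Levenberg-Marquardt schedule of adjusting $\lambda$ (a standard strategy not spelled out in the text) would wrap around this step.

```python
import numpy as np

def lm_step(J, rho, Theta, lam):
    """One Levenberg-Marquardt step L_k dTheta = -J^T rho (Eq. 7.10)."""
    JtJ = J.T @ J
    # L equals J^T J with the diagonal entries scaled by (1 + lambda)
    L = JtJ + lam * np.diag(np.diag(JtJ))
    dTheta = np.linalg.solve(L, -J.T @ rho)
    return Theta + dTheta
```

Starting from $\lambda = 10^{-3}$ as suggested above, $\lambda$ is typically decreased when a step lowers $\|\rho\|^2$ and increased otherwise.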

Page 100: Optical Flow Based Structure from Motion - CiteSeer

82Chapter 7. Maximum Likelihood Structure and Motion with Geometrical Constraints

Constraints among geometrical entities (Eq. 7.6 and Eq. 7.7) define a feasible space in which to search for the absolute minimum of the function we are optimizing. To solve the constrained minimization problem we used a direct projection method (Scales 1985): constraints of the form $f(c) = 0$ are differentiated, obtaining linear equations in $\Delta c$:

$$\frac{\partial f}{\partial c}\Delta c = 0 \qquad (7.11)$$

Such equations are added to the system in Eq. 7.10, which is solved as for normal Levenberg-Marquardt iterations.

7.3.3 Constraints Enforcement

Because the constraints are replaced by their differentials, the incremented structure and motion parameters $\Theta_{k+1}$ do not belong to the feasible space but to its tangent space. To keep the solution in the feasible space the vector $\Theta_{k+1}$ is re-projected onto it, i.e. $\Theta_{k+1} \mapsto P(\Theta_{k+1})$ where $P$ is the projector.

This approach is known as direct projection. The main advantage of this technique is that a constrained minimization problem is transformed into an unconstrained one that can be solved by a standard iterative technique. The drawback is that convergence is not guaranteed, since the minimization method finds a descent direction in the tangent space, which does not ensure that such direction projected onto the feasible space is still descent. Formally, $\|\rho(\Theta_{k+1})\|^2 < \|\rho(\Theta_k)\|^2$ does not imply that $\|\rho(P(\Theta_{k+1}))\|^2 < \|\rho(\Theta_k)\|^2$. Nevertheless, for the class of problems we are trying to solve this approach has shown very good convergence properties. This is reviewed in more detail in the experiments section.
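The following sketch combines the two ingredients just described: the linearized constraints of Eq. 7.11 are stacked under the damped system of Eq. 7.10, and the update is re-projected onto the feasible space. The helpers `constraint_jacobian` and `project` are hypothetical; their implementation depends on the particular constraints used.

```python
import numpy as np

def constrained_lm_step(J, rho, Theta, lam, constraint_jacobian, project):
    """One direct-projection step for the constrained problem of Eq. 7.9."""
    JtJ = J.T @ J
    L = JtJ + lam * np.diag(np.diag(JtJ))     # damped system of Eq. 7.10
    C = constraint_jacobian(Theta)            # df/dTheta, rows of Eq. 7.11
    A = np.vstack([L, C])
    b = np.concatenate([-J.T @ rho, np.zeros(C.shape[0])])
    dTheta = np.linalg.lstsq(A, b, rcond=None)[0]
    # The step lives in the tangent space; map it back onto the feasible space
    return project(Theta + dTheta)
```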

7.4 Results

For constrained minimization the parameters for planes and lines must also be initialized. We first estimate an initial structure and motion as in the unconstrained case. The set of parameters $c$ can initially be estimated by using the initial structure and Eq. 7.5. Such an initial estimate must be refined in order to have $c$ in the feasible space. This is simply done by re-projection. For example, if $M$ perpendicular planes are present, the projector is the unitary transformation that aligns them with $M$ perpendicular unit vectors. The optimal solution to this problem can easily be computed by SVD (see (Kanatani 1993a) and Appendix A).

Simulations. We generated 48 points distributed on 3 orthogonal surfaces for testing purposes. Initialization was obtained by estimating the linear velocity by subspace methods (MacLean 1999) and then using Eq. 3.11 and Eq. 3.12 to estimate $\omega$ and $Z$. We did 4 tests to assess the performance of the constrained minimization: we first used plain Gauss-Newton with no constraints initialized by a linear algorithm (see Chapter 4), we then used one plane constraint, three plane constraints and finally three orthogonal plane constraints. Fig. 7.1 and Fig. 7.2 show, respectively, the linear velocity and structure errors for different noise levels. The error functions are described in Chapter 2.

Page 101: Optical Flow Based Structure from Motion - CiteSeer

7.5. Summary 83

Figure 7.1. Linear velocity error (degrees) as a function of noise (pixels), for the initialization, Gauss-Newton with no constraints, and Levenberg-Marquardt with 1 plane, 3 planes and 3 perpendicular planes.

Real Images. Optical flow is estimated at sparse positions over the image plane by the method described in (Shi & Tomasi 1994). The average disparity for the two sequences used is approximately 1 pixel per frame. The total number of features used is 245 for the Teabox and 271 for the Linuxbox sequence.

Convergence Properties. To speed up the convergence the Levenberg-Marquardt iterative algorithm was used. For a noise of 0.1 pixels, which is the normal amount expected in our working conditions (see (Barron et al. 1994)), the global minimum is found essentially every time within 10-15 steps. For very noisy fields, i.e. 0.3 pixels, the convergence rate is about 70%. The convergence rate for noisy fields can be further increased by randomly initializing the algorithm several times and then taking the solution that generates the minimal residual flow. This process is simplified by the variable separability described in Chapter 5.

7.5 Summary

In this chapter we showed how linear geometrical constraints (collinearity and coplanarity) and bilinear constraints among them (incidence and parallelism) can be used to improve structure and motion estimation in the differential setting. While Gauss-Newton is proved to converge fast in the unconstrained case, Levenberg-Marquardt has to be used when constraints are incorporated.


Figure 7.2. Structure error as a function of noise (pixels), for the initialization, Gauss-Newton with no constraints, and Levenberg-Marquardt with 1 plane, 3 planes and 3 perpendicular planes.

The method of direct projection, used to enforce the constraints, is proved to be stable and efficient. Simulations and tests on real images show the efficiency of the proposed method.


Figure 7.3. 2 views of a reconstructed model.

Figure 7.4. 2 views of a reconstructed model.


Figure 7.5. The GMU Building sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) and (d) Reconstruction examples.


Figure 7.6. The Aquatic center sequence. (a) A frame from the video sequence. (b) Estimated optical flow. (c) and (d) Reconstruction examples.


Chapter 8

The Calibration Issue

While in the discrete setting calibration parameters can be estimated simultaneously with structure and motion, in the differential setting the camera has to be calibrated a priori. Because of this peculiarity of the differential epipolar constraint, it is important to study the sensitivity and robustness of structure and motion recovery with respect to errors in the intrinsic parameters of the camera. This chapter provides such an account: we demonstrate, both analytically and in simulation, the interplay between measurement and calibration errors and their effect on motion and structure estimates. In particular we show that the calibration errors introduce an additional bias towards the optical axis, which has opposite sign to the bias typically observed in egomotion algorithms. The overall bias causes a distortion of the resulting 3D structure, which we express in a parametric form. While the analytical explanations are derived in the context of linear techniques for motion estimation, we verify our observations experimentally on a variety of optimal and sub-optimal motion and structure estimation algorithms. The obtained results illuminate and explain the performance and sensitivity of the differential structure and motion recovery techniques in the presence of calibration errors.

8.1 Introduction

While the basic geometric relationships governing the problem of structure and motion recovery from image sequences are well understood, the existence of robust automatic techniques for recovery of motion and structure is still elusive. Different aspects of the performance and sensitivity of the existing general techniques for motion and structure recovery have been addressed in the past. While the list of references is by no means exhaustive, we mention here a few more recent representative works addressing the problem. The intrinsic sensitivity of the differential formulation of the problem has been studied thoroughly using analytical techniques as well as simulations (Danilidis & Spetsakis 1996, Cheong & Peh 2000, Oliensis 2000). These studies assumed calibrated cameras and focused on determining sensitive directions of motion, dependency


on the depth variation and field of view (Zhang & Tomasi 1999). The process of camera calibration introduces additional errors in the measurements, which affect the final estimates. This is the case both when the camera is calibrated off-line and when self-calibration techniques are used towards this end. With a few exceptions, the study of these effects has not received much attention. Various empirical observations regarding the stability of the estimation of intrinsic parameters and their effect on the structure estimation in the discrete setting have been made by Bougnoux (Bougnoux 1998). He pursued the stratified approach to Euclidean reconstruction and experimentally demonstrated that, in spite of depth distortions caused by calibration errors, the basic geometric relationships (orthogonality, parallelism) were preserved. The effect of calibration errors on motion estimates in the discrete setting has also been explored in (Svoboda & Sturm 1996), assuming noise-free measurements of corresponding points. In (Grossman & Santos-Victor 2000b) the authors derived the covariances of the parameters of an uncalibrated stereo system, both for fixed calibration parameters and for the hypothesis that an a priori Gaussian distribution for the calibration parameters is known. The effect of this prior knowledge on the quality of the final estimates was demonstrated in the context of nonlinear optimization techniques. In the differential setting Cheong (Cheong & Peh 2000) characterized the depth distortions due to a freely varying focal length using the analytical iso-distortion framework developed previously in the uncalibrated case. While the iso-distortion framework enables us to study the intrinsic distortions as a family of transformations parameterized by errors in motion estimates, it does not assume any particular distribution of noise in the image measurements and camera parameters. Hence it is not suitable for quantifying the quality of the final estimates. We present an analytical study of the sensitivity of egomotion and structure estimation assuming both noisy measurements and errors in the intrinsic parameters of the camera. Resorting to some approximations, we both analytically and experimentally demonstrate that the errors in calibration introduce an additional bias term in motion estimation, which reduces the previously observed translation bias reported in (MacLean 1999). The overall translation bias distorts the resulting structure by a skewing or flattening in the direction of the optical axis. These distortions are further accentuated by the errors in focal length and center of the projection. We offer an analytical explanation of the errors introduced by calibration in the context of linear techniques. The observations are verified by simulation of a variety of linear and nonlinear algorithms for structure and motion estimation and confirm the intrinsic nature of the errors, making them independent of the algorithm choice and objective function used.

For the purpose of the analysis we consider a simplified model of the calibration matrix, assuming no skew, $k_1 = 0$, and aspect ratio $s_x/s_y = 1$. The calibration matrix $K$ is then defined as:

$$K = \begin{pmatrix} -f & 0 & o_x \\ 0 & -f & o_y \\ 0 & 0 & 1 \end{pmatrix}$$

The camera's field of view (FOV) $\theta$ is related to the dimension $I$ of the focal plane by $\tan\theta = I/f$. The focus of this chapter is to study the interplay between noisy image velocities and


errors in calibration and their effects on the resulting motion and structure estimates. We initially assume that the calibration parameters are obtained by an off-line calibration procedure. For the purpose of sensitivity analysis assume that the entries of $K$: $f, o_x, o_y$ are corrupted by zero-mean Gaussian noise $n_f \sim N(0, \sigma_f^2)$, $n_{o_x} \sim N(0, \sigma_{o_x}^2)$, $n_{o_y} \sim N(0, \sigma_{o_y}^2)$ respectively. Since the focal length and optical center position can be measured independently and the choice of the axes in the retinal plane is arbitrary, we assume that the focal length $f$ and $o_x, o_y$ are uncorrelated random variables and that the errors in $o_x$ and $o_y$ are identical, with $\sigma_{o_x} = \sigma_{o_y} = \sigma_o$. The errors in calibration affect the coordinates of the feature positions $x$ as well as the image velocities $u$. These are measured in the retinal plane and are expressed as nonlinear functions of the intrinsic parameters of the camera in the following way:

$$\begin{pmatrix} x \\ y \end{pmatrix} = \frac{1}{f}\begin{pmatrix} o_x - x_c \\ o_y - y_c \end{pmatrix} \equiv \frac{1}{f}\begin{pmatrix} \Delta_x \\ \Delta_y \end{pmatrix} \qquad (8.1)$$

$$u = \frac{1}{f}u_c = \frac{1}{f}\left(x_c(t+1) - x_c(t)\right) \qquad (8.2)$$

where $\Delta_x = o_x - x_c$ and $\Delta_y = o_y - y_c$. Note that the image velocities $u$ depend only on the focal length $f$ and are unaffected by the knowledge of the center of the projection. We further assume that the calibration errors are much smaller than the optical flow errors. This approximation is valid for cameras calibrated with a calibration grid, where the camera parameters can be estimated with errors up to a few percent (Bouget n.d.). The optical flow errors are typically on the order of a few tens of percent (Barron et al. 1994), hence $(\sigma_u^2/u^2)/(\sigma_f^2/f^2) \simeq 100$. We will refer to the above assumption later in the text and will use it only to simplify the analytical derivation. The experimental results will be provided for a broader set of conditions.

8.2 Auto Calibration

The possibility of autocalibrating the camera is an appealing feature of stereo-based reconstruction algorithms. In the differential case the differential epipolar constraint becomes:

$$u^T K^{-T}\hat{v}K^{-1}x + x^T K^{-T} s K^{-1} x = 0 \qquad (8.3)$$

The matrix:

$$F = \begin{pmatrix} K^{-T}\hat{v}K^{-1} \\ K^{-T} s K^{-1} \end{pmatrix} \qquad (8.4)$$

is called the differential epipolar matrix, in analogy with the discrete setting. Defining $\hat{v}' = K^{-T}\hat{v}K^{-1}$ and $\hat{\omega}' = K^{-T}\hat{\omega}K^{-1}$ we get from Eq. 8.3:

$$u^T \hat{v}' x + \frac{1}{2}\left(x^T \hat{v}' H^{-1} \hat{\omega}' x + x^T \hat{\omega}' H^{-1} \hat{v}' x\right) = 0 \qquad (8.5)$$

where $H^{-1} = K^{-T}K^{-1}$. From this equation $H$ can be estimated and then decomposed to get the calibration parameters. However, unlike the discrete case, only two of the


intrinsic parameters can be estimated. In fact $H$ is a symmetric matrix that can be decomposed by SVD as:

$$H = R^T \Sigma R \qquad (8.6)$$

with $R \in SO(3)$. Substituting in Eq. 8.5 and using $\hat{v}'' = R\hat{v}'$ and $\hat{\omega}'' = R\hat{\omega}'$ we get:

$$(Ru)^T \hat{v}''(Rx) + \frac{1}{2}\left((Rx)^T \hat{v}'' \Sigma\, \hat{\omega}''(Rx) + (Rx)^T \hat{\omega}'' \Sigma\, \hat{v}''(Rx)\right) = 0 \qquad (8.7)$$

From this equation the three eigenvalues of $\Sigma$ can be determined. Since Eq. 8.7 is homogeneous in $\hat{\omega}''$ and $\Sigma$, the norm of $\Sigma$ has to be fixed, so that only two of the eigenvalues are independent.

8.3 Egomotion Estimation

In this section we review the egomotion algorithms that we will use in the simulations. The structure can be estimated, as soon as the velocities are known, by the variable separation described in Chapter 3.

8.3.1 Jepson & Heeger

Jepson and Heeger (Heeger & Jepson 1992) showed that it is possible to define a set of vectors all normal to the linear velocity $v$. Such vectors are defined as:

$$\tau_i \equiv \sum_k c_{ik}\,[u(x_k) \times x_k] \qquad (8.8)$$

where the $c_{ik}$ are coefficients orthogonal to all the possible quadratic forms in $x$. Using:

$$\begin{pmatrix} \tau_1^T \\ \vdots \\ \tau_N^T \end{pmatrix} v = 0 \qquad (8.9)$$

the linear velocity vector $v$ can be estimated by finding the eigenvector relative to the minimum eigenvalue of the matrix:

$$D = \sum_{i=1}^N w_i\,\tau_i\tau_i^T \qquad (8.10)$$

where the weights $w_i$ can be properly chosen. The rotational velocity is obtained by eliminating the translational component of the velocity field as explained in Section 3.2. As we will show later, this method, even if appealing for its closed form, is seriously affected by bias.
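The final estimation step reduces to a small eigenvalue problem. The sketch below (Python/NumPy) assumes the constraint vectors $\tau_i$ of Eq. 8.8 have already been formed (the construction of the coefficients $c_{ik}$ is omitted) and recovers $v$ from the matrix $D$ of Eq. 8.10.

```python
import numpy as np

def velocity_from_taus(taus, weights=None):
    """Recover v (up to sign and scale) as the eigenvector of
    D = sum_i w_i tau_i tau_i^T with the smallest eigenvalue (Eq. 8.10)."""
    taus = np.asarray(taus, dtype=float)          # (N, 3) array of tau_i
    w = np.ones(len(taus)) if weights is None else np.asarray(weights)
    D = (w[:, None] * taus).T @ taus
    eigvals, eigvecs = np.linalg.eigh(D)          # eigenvalues in ascending order
    return eigvecs[:, 0]
```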

Page 111: Optical Flow Based Structure from Motion - CiteSeer

8.3. Egomotion Estimation 93

8.3.2 MacLean

This method was proposed by MacLean (MacLean 1999) in 1999 and is a corrected subspace method that takes the bias into account.

When the flow is noisy the expected value of the noisy matrix $D^*$ becomes:

$$E[D^*] = D + \sigma^2 M \qquad (8.11)$$

where $\sigma$ is the standard deviation of the noise, which is supposed to be isotropic zero-mean Gaussian. The analytical expression for $M$ can be easily deduced to be:

$$M = \sum_i w_i \begin{pmatrix} 1 & 0 & -x_i \\ 0 & 1 & -y_i \\ -x_i & -y_i & x_i^2 + y_i^2 \end{pmatrix} \qquad (8.12)$$

The noise matrix $M$ is the cause of the biased estimation: in general it perturbs the eigenvectors of $D$ so that bias terms appear. To remove this bias MacLean suggested a whitening technique: multiplying $E[D^*]$ by $M^{-1/2}$ on both sides we get:

$$M^{-1/2}E[D^*]M^{-1/2} = M^{-1/2}DM^{-1/2} + \sigma^2 I_3 \qquad (8.13)$$

Now the eigenvectors of $M^{-1/2}DM^{-1/2} + \sigma^2 I_3$ are also the eigenvectors of $D$ up to a transformation $M^{-1/2}$. The noise matrix $M$ has been reduced to the identity $I_3$, which does not influence the eigenvectors.
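A minimal sketch of this whitening correction, assuming the noisy matrix $D^*$ and the noise matrix $M$ of Eq. 8.12 have already been assembled:

```python
import numpy as np

def maclean_velocity(D_star, M):
    """Bias-corrected velocity via the whitening of Eq. 8.13."""
    # M^{-1/2} from the eigendecomposition of the symmetric matrix M
    vals, vecs = np.linalg.eigh(M)
    M_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Dw = M_inv_sqrt @ D_star @ M_inv_sqrt     # noise reduced to sigma^2 I_3
    _, eigvecs = np.linalg.eigh(Dw)
    v = M_inv_sqrt @ eigvecs[:, 0]            # undo the whitening transform
    return v / np.linalg.norm(v)
```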

8.3.3 Bruss & Horn

Bruss and Horn (Bruss & Horn 1983) defined a residual function as:

$$r(x, v, \omega) = \left(u(x) - \frac{1}{Z}A(x)v - B(x)\omega\right)\|A(x)v\| \qquad (8.14)$$

They eliminated the depth by noticing that the optimal least squares estimate of $Z$ can be expressed as a function of the camera velocities, i.e.:

$$\frac{\partial \|r\|^2}{\partial Z} = 0 \;\Rightarrow\; Z = f(v, \omega, x) \qquad (8.15)$$

Substituting the expression for $Z$ in Eq. 8.14, they get the estimator:

$$(\hat{v}, \hat{\omega}) = \arg\min_{v,\omega} \|u^T\hat{v}x + x^T s x\|^2 \qquad (8.16)$$

Differentiating with respect to $v$ and $\omega$ they obtain a system of 6 equations. Three of these are linear in $\omega$ and the angular velocity can be estimated directly as a function of $v$, $\omega = g(v, x)$. Such expressions can be substituted into the remaining three equations and $v$ estimated from them using a non-linear technique.


8.3.4 Kanatani A

Kanatani (Kanatani 1993b) reduces the differential epipolar constraint to the form:

$$u^T \hat{v}x + x^T K x = 0 \qquad (8.17)$$

where:

$$K = (\omega^T v)\,I_3 - \frac{1}{2}(\omega v^T + v\omega^T) \qquad (8.18)$$

The linear velocity can be estimated by solving the linear least squares problem:

$$(\hat{v}, \hat{K}) = \arg\min_{v,K} \sum_i \|u_i^T\hat{v}x_i + x_i^T K x_i\|^2 \qquad (8.19)$$

The matrix $K$ can later be decomposed to extract the angular velocity:

$$\omega = \frac{1}{2}\left[\mathrm{tr}(K) + 3v^TKv\right]v - 2Kv \qquad (8.20)$$
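Eq. 8.20 is a direct computation once $K$ and $v$ are available; substituting $K$ from Eq. 8.18 with a unit vector $v$ returns $\omega$ exactly. A minimal sketch (the normalization of $v$ is our assumption, since the formula holds for unit $v$):

```python
import numpy as np

def omega_from_K(K, v):
    """Extract omega from K and the linear velocity v (Eq. 8.20)."""
    v = v / np.linalg.norm(v)                 # Eq. 8.20 assumes ||v|| = 1
    return 0.5 * (np.trace(K) + 3.0 * (v @ K @ v)) * v - 2.0 * (K @ v)
```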

8.3.5 Kanatani B

Renormalization consists in observing that extracting the linear velocity from Eq. 8.19 is equivalent to finding the minimum eigenvalue of the matrix $D$ defined in Eq. 8.10. Again, in the presence of noise we have that:

$$E[D^*] = D + \sigma^2 M \qquad (8.21)$$

Defining $c = \sigma^2$, Kanatani claims to find the unbiased solution for $v$ by:

$$(\hat{c}, \hat{v}) = \arg\min_{c,v}\; v^T\left(E[D^*] - cM\right)v \qquad (8.22)$$

8.3.6 Ma-Kosecka-Sastry

This is a recent algorithm described in (Ma, Kosecka & Sastry 2000a). The residual minimized is:

$$r(x, v, s) = (u^T, x^T)\begin{pmatrix} \hat{v} \\ s \end{pmatrix} x \qquad (8.23)$$

The linear velocity $v$ and the matrix $s$ can be estimated by linear least squares when the flow is measured at at least 8 different locations.

The authors demonstrate that the special symmetric matrix $s$ can be decomposed and the camera motion $(v, \omega)$ can be found up to a four-fold degeneracy. A unique solution can be found by comparing the $v$'s estimated from the decomposition of $s$ with the one estimated directly from the least squares minimization of Eq. 8.23.
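A minimal sketch of the linear estimation step (Python/NumPy), with the constraint of Eq. 8.23 stacked into a homogeneous system and solved by SVD; the four-fold decomposition of $s$ and the disambiguation step are omitted:

```python
import numpy as np

def linear_motion_estimate(points, flows):
    """Estimate e = [v, s1..s6] (up to scale) from flow at >= 8 locations.

    points: (N, 3) image points x = (x, y, 1)
    flows:  (N, 3) image velocities u = (u_x, u_y, 0)
    """
    rows = []
    for x, u in zip(points, flows):
        tau = np.cross(u, x)                      # coefficients multiplying v
        p = np.array([x[0]**2, 2*x[0]*x[1], 2*x[0]*x[2],
                      x[1]**2, 2*x[1]*x[2], x[2]**2])  # coefficients of s
        rows.append(np.concatenate([tau, p]))
    A = np.asarray(rows)
    e = np.linalg.svd(A)[2][-1]                   # null-space direction of A
    return e[:3], e[3:]                           # v and the 6 entries of s
```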


Figure 8.1. Geometric interpretation of the error distance.

8.4 Biased Egomotion

The algorithms described above are based on the minimization of the residual $r(v, \omega) = u^T\hat{v}x + x^T\hat{v}\hat{\omega}x$ and, in general, return biased estimates of the camera motion and of the structure. The geometric interpretation is given in Fig. 8.1 for the case of pure translation. The minimized residual $r(v, 0)$ is only proportional to the distance of the optical flow $u$ from the line joining the feature location and the FOE, i.e. $r(v, 0) = d\,([0, 0, 1]^T \times (v \times x))$. For a small field of view $x$ tends to be parallel to the optical axis, so that the proportionality constant $[0, 0, 1]^T \times (v \times x)$ tends to be small when $v$ is also close to the optical axis. Linear subspace methods suffer from the same problem, since constraints of the form $(u \times x)\cdot v$ are minimized.

An analytical estimation of the bias is obtained below. Taking the differential epipolar constraint:

$$u^T\hat{v}x + x^T s x = 0 \qquad (8.24)$$

we can rewrite Eq. 8.24 to highlight the dependence on the optical flow as:

$$[A_1(u)\,|\,A_2]\,e = Ae = 0 \qquad (8.25)$$

where $A_1(u) \in \mathbb{R}^{m\times 3}$ is a linear function of the measured optical flow and $A_2 \in \mathbb{R}^{m\times 6}$ is a function of the image points alone; the vector $e \in \mathbb{R}^9$, $e = [v_x, v_y, v_z, s_1, s_2, s_3, s_4, s_5, s_6]^T$, is associated with the unknown parameters $v$ and the matrix $s$. Minimizing $\|Ae\|^2$ leads to the LLSE estimate of $e$, which is obtained as the eigenvector of $A^TA$ associated with the smallest eigenvalue. The translation estimate $v$ is then directly available and the angular velocity $\omega$ can be obtained by decomposition of the special symmetric matrix $s$. This particular parameterization and the associated algorithm are described in greater detail in (Ma et al. 2000b). Assuming perfect calibration and measurement noise due to the temporal matching, only the columns of $A$ related to the matrix $A_1$ are corrupted by noise. This causes the linear techniques to lead to biased estimates. We will now demonstrate the source of this bias and its interplay with the bias induced by calibration errors. If we assume that calibration is perfect and each component of $u$ is corrupted by $n \sim N(0, \sigma_u^2)$


due to the temporal matching, the noise in image velocities perturbs the matrix $A$ by $\delta A$. This perturbation alters the eigenvectors of $A^TA$. It can be shown using perturbation theory for Hermitian matrices that if $E[\delta(A^TA)] \neq 0$, then the eigenvectors are biased (see (Kosecka & Zucchelli 2001)). First note that the rows of $A_1$ have the form $\tau = u \times x$ while those of $A_2$ are of the form $p = (x^2, 2xy, 2xz, y^2, 2yz, z^2)$. Let us write the noise-free $A^TA$ in terms of the constraints $(\tau_i, p_i)$ and denote its upper-left block as $D$:

$$A^TA = \sum_i \begin{pmatrix} \tau_i\tau_i^T & \tau_i p_i^T \\ p_i\tau_i^T & p_i p_i^T \end{pmatrix} \qquad\text{and}\qquad D = \sum_i \tau_i\tau_i^T$$

The entries of $A^TA$ are nonlinear functions of $x$ and $u$, which means that if these are corrupted by zero-mean Gaussian noise, the expected value of the error $\delta(A^TA)$ in $A^TA$ is not zero. Considering the errors in temporal matching only, it is only the block of the error matrix $\delta(A^TA)$ associated with $D$ that has an expected value different from zero¹. In fact the non-diagonal block elements are linear in $u$ and the $p_ip_i^T$ block is independent of $u$. Denote the noisy constraints $\tilde{\tau}_i = \tau_i + n$. By propagating the errors in the image velocities $u$ due to temporal matching to the constraint coefficients $\tau_i$, one can easily compute the covariance matrix $\Sigma_{\tau_i}$ of $\tau_i$. Let $\tilde{D}$ be the noisy matrix $D$ and $E[\tilde{D}]$ its expected value, such that $E[\tilde{D}] = D + \sigma_u^2\Sigma$, where $\Sigma$ is:

$$\Sigma = \sum_i \begin{pmatrix} 1 & 0 & -x_i \\ 0 & 1 & -y_i \\ -x_i & -y_i & x_i^2+y_i^2 \end{pmatrix} = \sum_i \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & x_i^2+y_i^2-1 \end{pmatrix} \qquad (8.26)$$

Assuming uniformly distributed features, $E[\sum_i x_i]$ and $E[\sum_i y_i]$ are approximately zero, and the part proportional to the identity can be omitted since it does not change the eigenvectors of the matrix $D$. We see that $E[\Sigma] \neq 0$, biasing the final estimate towards the optical axis. This has been previously observed both analytically and in simulations (Heeger & Jepson 1992, MacLean 1999).

8.5 Calibration and Egomotion

Suppose now that, in addition to the errors due to flow computation, we want to understand how the motion estimates are affected by calibration errors. Assume that the calibration parameters form a vector random variable $k = [o_x, o_y, f]^T$, where the individual components are independent normally distributed random variables with $N(0, \Sigma_c)$, where $\Sigma_c = \mathrm{diag}([\sigma_o^2, \sigma_o^2, \sigma_f^2])$. Each component of $u$ is also corrupted by errors $n \sim N(0, \sigma_u^2)$ due to temporal matching. By propagating the errors in calibration to image positions and image velocities, we obtain a characterization of the noisy feature positions $x \sim N(0, \Sigma_x)$ and noisy image velocities $u \sim N(0, \Sigma_u)$ as normal random variables. The form of these covariance matrices can be found in Appendix B. A more detailed derivation can be found in (Kosecka & Zucchelli 2001). Consequently the random

¹The expected values for the remaining blocks are zero. In fact $E[\delta(p_ip_i^T)] = 0$ since $x_i, y_i$ are noise-free, and $E[\delta(p_i\tau_i^T)] \propto E[u_x] = E[u_y] = 0$ under the assumption that the components of $u$ are IID zero-mean Gaussian, $n \sim N(0, \sigma_u^2)$.


Figure 8.2. Translation bias (%) as a function of the calibration error (%) for the Jepson-Heeger, Kanatani A, Kanatani B, Bruss-Horn and MacLean algorithms, assuming that the measurement errors were fixed at 15%. Note that the bias of the Jepson-Heeger algorithm is approximately 0 for an error of the calibration parameters of about 15%; this is the situation when the two bias terms cancel each other. FOV for these experiments was 90°.

variables $x$ and $u$ determine the structure of the error $\delta(A^TA)$ via the constraints $(\tau_i, p_i)$. Similarly to the calibrated case, we will justify that the expected value of $\delta(A^TA)$ is approximately block diagonal. This approximation is valid given the hypothesis that $x$ and $u$ are weakly correlated and that $E[u_i]$, defined in Appendix B, is zero. In such a case we obtain:

$$E[\tau_i p_i^T] \propto E[u_i]E[x_i^3] = 0 \;\Leftarrow\; E[u_i] = 0 \qquad (8.27)$$

The block relative to $p_ip_i^T$ contains coefficients of fourth powers in the image coordinates. Under the assumption that the optical flow errors are bigger than the calibration errors, $\sigma_u^2/u^2 \gg \sigma_f^2/f^2$, the relative errors of $p_ip_i^T$ are small compared to the relative errors of $\tau_i\tau_i^T$.

Then, similarly to the calibrated case, we can approximate $E[\delta(A^TA)]$ considering only the upper-left block $D$, with all the other entries being zero. In the presence of calibration errors $E[\tilde{D}]$ is augmented by two additional bias terms in the following way:

$$E[\tilde{D}] = E[D] + \sigma_1^2\Sigma_1 + \sigma_2^2\Sigma_2 \qquad\text{where}\qquad \Sigma_2 = \sum_i \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & u_{x_i}^2 + u_{y_i}^2 \end{pmatrix} \qquad (8.28)$$


Figure 8.3. The error of the translation estimate, expressed in angular units, as a function of the calibration error for the Jepson-Heeger, Kanatani A, Kanatani B, Bruss-Horn and MacLean algorithms, in the same conditions as in Fig. 8.2. The error always increases for increasing calibration error.

where $\Sigma_1 \equiv \Sigma$ as in Eq. 8.26 and $\sigma_1^2$ and $\sigma_2^2$ have the following form:

$$\sigma_1^2 = \frac{1}{2}\frac{u^2}{f^2}\left(\frac{\sigma_u^2}{u^2} + \frac{\sigma_f^2}{f^2}\right) \qquad\text{and}\qquad \sigma_2^2 = \frac{1}{2}\frac{\Delta^2}{f^2}\left(\frac{\sigma_f^2}{f^2} + \frac{\sigma_o^2}{\Delta^2}\right) \qquad (8.29)$$

where $\Delta^2$ is the average distance of the features from the center of the projection, under the assumption that the features are uniformly distributed in the image plane. Its expected value can be related to the size of the image plane as $E[\Delta^2] = I^2/12$. Using the assumption that $\sigma_u^2/u^2 \gg \sigma_f^2/f^2$ and $E[\Delta^2] = I^2/12$, the expressions for $\sigma_1^2$ and $\sigma_2^2$ can be simplified to obtain:

$$\sigma_1^2 = \frac{1}{2}\frac{\sigma_u^2}{f^2} \qquad\text{and}\qquad \sigma_2^2 = \frac{1}{2}\frac{\sigma_f^2}{f^2}\left(1 + \frac{\tan^2\theta}{12}\right) \qquad (8.30)$$

where $u^2$ and $x^2$ are the average values of the image velocities and feature positions. We can compare the two terms contributing to the bias: $\frac{1-x^2}{2}\frac{\sigma_u^2}{f^2}$ and $\frac{u^2}{2}\frac{\sigma_f^2}{f^2}\left(1+\frac{\tan^2\theta}{12}\right)$. Since $1-x^2$ is on average negative, the two bias terms have opposite signs and dampen (or, under favorable circumstances, cancel) each other. This is demonstrated in Fig. 8.2, where we computed the bias as a function of the calibration error assuming 15% error in the measurements of the optical flow. Fig. 8.3 shows the effect of the errors in calibration


Figure 8.4. Dependence of the bias due to noisy calibration on the FOV of the camera. 30% noise on the camera parameters is generated while the optical flow is noiseless. The magnitude of the bias increases with increasing FOV.

on the precision of the translation estimates. The details of the simulations are outlined in Section 8.7.

The dependence of the $v_z$ bias on $f$ is difficult to derive analytically but can be simply computed experimentally. Results for the Jepson-Heeger algorithm are shown in Fig. 8.5 for a simulated set of data generated with $f = 1$. The linear velocity is then estimated assuming that $f$ is in the range $f \in [0.1, 10]$. As we see, the estimated $v_z$ drops quickly to zero when $f$ is underestimated, and tends to a finite value (in this example $\sim 1.7$) when $f$ is overestimated.

FOV dependency. Eq. 8.30 also reveals the dependence of the motion estimates on the field of view $\theta$ of the camera. As $FOV \to 0$ the term $1 - x^2$ becomes 'more' negative, generating a stronger bias, as previously shown (Tomasi & Heeger 1994). The term $\sigma_2^2$ arising from the noisy calibration instead increases for bigger $FOV$. This effect is demonstrated in Figure 8.4, where we calculated the bias in $v_z$ (the $z$-component of the linear velocity) as a function of $FOV$ for noiseless optical flow and 30% error on the calibration parameters, for $\theta \in [10°, 150°]$.


Figure 8.5. Dependence of the linear velocity bias on the focal length. A data set is generated with $f = 1$ and then the linear velocity is estimated assuming a measured focal length $f \in [0.1, 10]$. The estimated velocity $v_z$ is plotted normalized to the ground truth.

8.6 3D Structure Distortion

In the following section we study the effect of the calibration errors on the 3D structure reconstruction. In order to separate the sensitivity issues pertaining to the translation bias from those related to the structure of the scene, we assume that the motion is purely translational, approximately in the direction of the optical axis. We moreover assume that both the optical center position and the optical flow $u$ are known exactly, so that only the focal length $f$ is affected by error. These assumptions will be relaxed in the experiments and we will demonstrate that the same results hold in the general case.

Given the translation $v_z$ the optimal structure is given by Eq. 2.8:

$$Z \simeq \frac{\|x\|}{\|u\|}v_z \qquad (8.31)$$

The estimated velocity $\hat{v}_z$ is characterized by an error $\sigma_{v_z}$ and a bias $b_{v_z} = E(\hat{v}_z - v_z)$ that propagate into the structure:

$$\sigma_Z \simeq \frac{\|x\|}{\|u\|}\sigma_{v_z} \simeq \frac{Z}{v_z}\sigma_{v_z} \qquad (8.32)$$

$$b_Z = E(\hat{Z} - Z) = \frac{\|x\|}{\|u\|}b_{v_z} = \frac{Z}{v_z}b_{v_z} \qquad (8.33)$$

It can be seen that bias and noise are proportional to the depth.

Page 119: Optical Flow Based Structure from Motion - CiteSeer

8.8. Summary 101

The estimated depth $\hat{Z}$ can be expressed as a function of the measured focal length $\hat{f}$. Representing the experimental data in Fig. 8.5 as a function $a(\hat{f})$, i.e. $\hat{v}_z = a(\hat{f})v_z$, we get, from Eq. 8.31:

$$\hat{Z} = a(\hat{f})Z \qquad (8.34)$$

The structure comes to be:

$$[\hat{X}, \hat{Y}, \hat{Z}] = \left[\frac{a(\hat{f})xZ}{\hat{f}}, \frac{a(\hat{f})yZ}{\hat{f}}, a(\hat{f})Z\right] = a(\hat{f})\left[\frac{xZ}{\hat{f}}, \frac{yZ}{\hat{f}}, Z\right] \qquad (8.35)$$

Parameterizing $\hat{f}$ as $\hat{f} = cf$ we get:

$$[\hat{X}, \hat{Y}, \hat{Z}] = \frac{a(\hat{f})}{c}[X, Y, cZ] \qquad (8.36)$$

The function $a(\hat{f})$ just introduces a scale ambiguity in the structure estimation. Overestimation of the focal length ($c > 1$) tends to skew the structure while underestimation ($c < 1$) tends to flatten it. Fig. 8.6 and Fig. 8.7 show the skew for a general motion on synthetic and real sequences.

8.7 Experiments

The experiments verifying the observations derived in the previous sections were performed both on synthetic and real sequences. For the synthetic data sets the simulation benchmarks are described in Chapter 2. The tested algorithms were those of Bruss and Horn (Bruss & Horn 1983), Heeger and Jepson (Heeger & Jepson 1992), Kanatani (Kanatani 1993b) and MacLean (MacLean 1999), with the implementations made available by (Tomasi & Heeger 1994). The tested algorithms differ in the choice of objective function, leading to linear or nonlinear optimization problems. The linear algorithms of (MacLean 1999) and (Kanatani 1993b) provide a solution for correcting the bias due to errors in temporal matching. Bias and sensitivity were measured over one thousand trials. The reconstruction of the unknown depths was obtained using the algorithm described in (Zucchelli & Christensen 2000). The experiments on both real and simulated sequences are shown in Fig. 8.6 and Fig. 8.7. The figures demonstrate the effect of over- and underestimation of the focal length on the computed 3D structure. The flattening and elongation are clearly noticeable.

8.8 Summary

The analysis carried out in this chapter demonstrates the sensitivity of structure and motion estimation with respect to errors in the camera calibration. As the main contribution we demonstrated that the calibration errors introduce an additional bias in the direction of the optical axis, with the opposite sign to the one produced by errors due to temporal matching. Under favorable circumstances the two bias terms cancel each other, leading


Figure 8.6. Synthetic sequence. (a) Original model. (b) Model distorted by an underestimated focal length. (c) Model distorted by an overestimated focal length.


Figure 8.7. (a) Original model with estimated $v_z = 0.5530$. (b) Model distorted by an underestimated focal length of 50%, $v_z = 0.3975$. (c) Model distorted by an overestimated focal length of 50%, $v_z = 0.7597$.


to unbiased estimates. Moreover, the bias produced by erroneous calibration increases in magnitude for increasing FOV. This is contrary to what happens to the bias produced by noisy image velocities, which, in addition, depends on the direction of translation. The relationship between FOV and the errors caused by noisy calibration is mostly independent of the choice of translation direction. The linear velocity bias propagates to the structure estimation and distorts the resulting 3D structure. This is shown analytically for a simple motion configuration and tested on real sequences for a variety of motions in order to justify the generality of the previous assessment.

The above observations were derived analytically, resorting to some approximations, in the context of linear techniques. More extensive simulations confirmed the reasonableness of the approximations and the intrinsic nature of the calibration errors, independent of the algorithm choice.


Chapter 9

Conclusions

3D models are nowadays widely used in scientific visualization, for entertainment and for a large number of engineering tasks. Because of this, the problem of reconstructing the 3D structure of objects or environments from 2D images of them has attracted a great deal of research in the last 30 years. A large number of different algorithms exist to approach the problem under different working conditions. In general such approaches can be classified into feature based and flow based according to the type of information that they use, i.e. feature correspondences or optical flow. Further information about the scene or the motion can be efficiently used in the reconstruction process. Such information can be made available by the user himself or, in some cases, automatically extracted from the data. The degree of user interaction and the constraints used should depend on the specific problem studied, since higher degrees of precision can be achieved when additional information is supplied.

In this thesis we studied the problem of structure and motion from dense video sequences. In general we do not use any constraint on the motion or the structure, since we would like to provide a fully automatic 3D reconstruction system. The main targets of such a system are video amateurs who want to get 3D models from video films acquired with hand-held digital cameras. Obviously video amateurs cannot be expected to have any knowledge of 3D computer vision, so they should not need to interact with the system.

We propose a hybrid framework where sparse features are tracked over the sequence and their displacements are used to approximate the optical flow. Compared to stereo systems we want to reduce the inefficiencies typical of matching over pairs of images of large baseline. Compared to flow based systems we want to reduce the complexity and use estimates of the flow at highly textured locations.

9.1 Contributions

The research work presented in this thesis is mostly focused on solving the problem of estimating camera motion and structure from a pair of images and the related optical


flow field. We used a hybrid approach in which features are tracked over close-by frames and their displacement is used to approximate the optical flow. Then the differential epipolar constraint is used for estimating structure and motion. The major contributions are:

• A linear algorithm for recursive structure and motion estimation. Due to its closed form such a system does not require initialization and can run in real time. A new auto-rescaling approach is also presented. This is based on the solution of an Ordinary Differential Equation for each feature. The performance is compared to the more standard approach of fixing one feature.

• A non-linear recursive algorithm that uses the linear one as initialization. We show convergence properties and improvements in the bias and variance of the estimates.

• A maximum-likelihood formulation of the structure from motion problem. We show the improvements achieved compared to the un-weighted estimator. We moreover show how to locally integrate the reconstruction over more than two frames. The improvements are assessed.

• A motion based segmentation algorithm that finds planar surfaces by cluster growing. The algorithm segments images into planes and simultaneously reconstructs the underlying scene structure.

• A structure from motion algorithm with geometrical constraints. We reformulate the non-linear algorithm described above in order to accommodate geometrical constraints. We show convergence properties of the iterative constrained minimization. Constraints are enforced by direct projection.

• An analysis of the effects of noisy calibration on the structure and motion estimation. We show that noisy calibration introduces a bias both in the camera motion and in the scene structure. We derive the results analytically and verify them on real and simulated data sets.

A schematic summary of the results in terms of reconstruction precision is presented in Tab. 9.1.

          lin.     non-lin.  non-lin. weight  non-lin. weight multi-view  const.
tea box   0.0180   0.0110    0.0101           0.0078                      10^-16

Table 9.1. Planar residuals, two views setting. Units are focal lengths.

9.2 Open Issues

There are a number of open issues that have not been investigated for lack of time.


Calibration. In the hybrid framework in which we decided to work, cameras have to be calibrated and the calibration parameters must not change during the acquisition. This limitation can be relaxed by allowing for active tracking of the camera focal length. In (Brooks, Chojnacki & Baumela 1997) the authors show how to compute in closed form the camera motion, the focal length and the rate of change of the focal length. Obviously such an approach relies on the fact that the camera center is known a priori and does not change when zooming. This is approximately true for modern cameras, but the problem would require further investigation. In Chapter 8 we studied extensively the effect of calibration errors on structure and motion estimation. In general optical flow is rather noisy, so that some imprecision in the calibration parameters is tolerable.

Time Integration. No dynamical model of the camera motion is available in our working conditions. Translation and rotation are estimated by integrating the computed velocities over time. This introduces an error that cannot be modelled. It would probably be advisable to investigate the possibility of extracting simultaneously the camera motion and the camera position at each time $t_j$. This problem is not addressed in the thesis but left as a future development.

Dense Reconstruction. Dense reconstruction is advisable for visualization tasks. This problem is, for the moment, left open. We just observe that reconstruction systems based on dense flow fields require well textured images and smooth surfaces to work efficiently. This is rarely the case in the real world, so a method to bypass such limitations should be worked out.

9.3 Future Directions

The quality of feature correspondences and optical flow depends drastically on the working conditions. A fully automatic system should be able to work efficiently independently of such conditions. For this reason a hybrid approach that uses both optical flow and correspondences would guarantee a higher degree of robustness. An elegant formulation of this hybrid approach has been proposed by Ma, Kosecka and Sastry (Ma, Sastry & Kosecka 1998) in the form of an optimal control problem. It is assumed that features are tracked over $M$ frames and the optical flow is estimated at the feature locations for each intermediate time $t_j$. The velocities $(v, \omega)$ are the control inputs to the dynamical system:

$$\dot{g} = g\begin{pmatrix} \hat{\omega} & v \\ 0 & 0 \end{pmatrix} \qquad (9.1)$$

with:

$$g = \begin{pmatrix} R & T \\ 0 & 1 \end{pmatrix} \qquad (9.2)$$

The final state cost function is defined as:


$$\varphi(t_M) = \sum_i x_i^T(t_0)\,R^T\hat{T}\,x_i(t_M) \qquad (9.3)$$

and the Lagrangian:

$$L(\omega, v, t) = \sum_i \left(u_i^T\hat{v}x_i + x_i^T\hat{v}\hat{\omega}x_i\right) + \omega^T\omega + v^Tv \qquad (9.4)$$

The optimal motion $(v, \omega)$ is the optimal control law for the dynamical system in Eq. 9.1, subject to the constraint on the final state $\|T\| = 1$, which minimizes:

$$H(v, \omega) = \varphi(t_M) + \int_{t_0}^{t_M} L(v, \omega, t)\,dt \qquad (9.5)$$

By solving this problem the camera motion $(v(t), \omega(t))$ and the displacement $(R(t), T(t))$ are estimated simultaneously. Structure estimation can then be pursued with one of the algorithms described in the thesis, with the advantage that the transformation $(R(t), T(t))$ is estimated directly and that the camera motion is estimated consistently with the epipolar geometry. Stereo triangulation is another available option: the tradeoff between the two approaches is studied in Chapter 3. General solutions to this kind of optimal control problem have not been explored yet.


Appendix A

Optimal Rotation

Given a matrix $Q \in \mathbb{R}^{3\times 3}$ we want to find the rotation matrix $R$ such that:

$$\min_R \|R - Q\|_F \qquad (A.1)$$
$$R^TR = I \qquad (A.2)$$
$$\det(R) = +1 \qquad (A.3)$$

where $\|\cdot\|_F$ is the Frobenius norm. We have that:

$$\|R-Q\|_F^2 = \mathrm{trace}\left((R-Q)^T(R-Q)\right) = 3 + \mathrm{trace}(QQ^T) - 2\,\mathrm{trace}(R^TQ). \qquad (A.4)$$

So solving the problem in Eq. A.1 is equivalent to solving:

$$\max_R\; \mathrm{trace}(R^TQ) \qquad (A.5)$$

Using the SVD of $Q$, $Q = U\Sigma V^T$ where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3)$, and the properties of the trace (Golub & Van Loan 1996), we have:

$$\mathrm{trace}(R^TQ) = \mathrm{trace}(R^TU\Sigma V^T) = \mathrm{trace}(V^TR^TU\Sigma) = \sum_i (V^TR^TU)_{ii}\,\sigma_i \leq \sum_i \sigma_i \qquad (A.6)$$

So the maximum is achieved when $V^TR^TU = I$. Since we want $R$ to be a proper rotation, i.e. $\det(R) = +1$, we finally find (Kanatani 1993a):

$$R = U\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det(VU^T) \end{pmatrix} V^T \qquad (A.7)$$


Appendix B

Error Propagation

By propagating the errors in calibration to image positions and image velocities, we present here the covariance matrices for noisy feature positions and noisy velocities. The noisy feature position $x \sim N(0, \Sigma_x)$ and noisy image velocity $u \sim N(0, \Sigma_u)$ are normal random variables, with the following covariance matrices:

$$\Sigma_x \simeq \frac{1}{2}\frac{\Delta^2}{f^2}\begin{pmatrix} \frac{\sigma_f^2}{f^2} + \frac{\sigma_o^2}{\Delta^2} & \frac{\sigma_f^2}{f^2} \\ \frac{\sigma_f^2}{f^2} & \frac{\sigma_f^2}{f^2} + \frac{\sigma_o^2}{\Delta^2} \end{pmatrix} \qquad \Sigma_u \simeq \frac{1}{2}\frac{u^2}{f^2}\begin{pmatrix} \frac{\sigma_u^2}{u^2} + \frac{\sigma_f^2}{f^2} & \frac{\sigma_f^2}{f^2} \\ \frac{\sigma_f^2}{f^2} & \frac{\sigma_u^2}{u^2} + \frac{\sigma_f^2}{f^2} \end{pmatrix} \qquad (B.1)$$

where $\Delta^2$ is the average distance of the features from the center of the projection, under the assumption that the features are uniformly distributed in the image plane (i.e. on average $\Delta_x^2 \simeq \Delta_y^2$), and $u^2 = u_x^2 + u_y^2$ is the average image velocity.
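For completeness, the covariances of Eq. B.1 can be evaluated numerically as follows (a minimal sketch; the averages $\Delta^2$ and $u^2$ are assumed to be computed from the data):

```python
import numpy as np

def calibration_covariances(f, sigma_f, sigma_o, sigma_u, delta2, u2):
    """Evaluate the covariance matrices of Eq. B.1.

    delta2: average squared distance of features from the projection center
    u2:     average squared image velocity u_x^2 + u_y^2
    """
    rf2 = sigma_f**2 / f**2
    Sigma_x = 0.5 * (delta2 / f**2) * np.array(
        [[rf2 + sigma_o**2 / delta2, rf2],
         [rf2, rf2 + sigma_o**2 / delta2]])
    Sigma_u = 0.5 * (u2 / f**2) * np.array(
        [[sigma_u**2 / u2 + rf2, rf2],
         [rf2, sigma_u**2 / u2 + rf2]])
    return Sigma_x, Sigma_u
```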


Appendix C

Maximum-Likelihood Estimator

Consider a p.d.f. $f(x|\theta)$ with an unknown parameter $\theta$ to be estimated from a set of observations $[x_1, \ldots, x_n]$. The likelihood function of the problem is defined as:

$$L(x|\theta) = \prod_{i=1}^n f(x_i|\theta) \qquad (C.1)$$

$L$ is the joint conditional probability of the observations $[x_1, \ldots, x_n]$ at a fixed $\theta$. According to the Maximum-Likelihood principle we should choose as an estimate of the unknown parameter $\theta$ that particular $\hat{\theta}$ which renders $L$ as large as possible. This means that:

$$L(x|\hat{\theta}) \geq L(x|\theta) \quad \forall\theta \qquad (C.2)$$

If $L$ is twice differentiable with respect to $\theta$, the value $\hat{\theta}$ can be found by solving the equation:

$$\frac{\partial L(x|\theta)}{\partial\theta} = \frac{\partial}{\partial\theta}\prod_{i=1}^n f(x_i|\theta) = 0 \qquad (C.3)$$

with the condition that the second derivative evaluated at $\hat{\theta}$ is negative. For the sake of simplicity it is often easier to maximize the logarithm of $L$:

$$\frac{\partial \ln L(x|\theta)}{\partial\theta} = \frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(x_i|\theta) \qquad (C.4)$$

The Maximum-Likelihood estimators have a set of appealing properties that we report below. Demonstrations can be found in (Frodesen, Skjeggerstad & Tøfte 1978).

• Invariance under parameter transformation: We can choose to maximize the likelihood function as a function of $\tau(\theta)$ instead of as a function of $\theta$ itself. If $\hat{\theta}$ is the maximum-likelihood estimator of $\theta$ then we get:

$$\widehat{\tau(\theta)} = \tau(\hat{\theta}) \qquad (C.5)$$


• Consistency: Under very general conditions the ML estimators are consistent. This means that the ML estimates converge to the true parameter when the sample size increases.

• Unbiasedness: In the asymptotic limit of infinite samples all the ML estimators are unbiased. In favourable situations the ML estimators turn out to be unbiased irrespective of the size of the sample.

• Sufficiency: If there exists a sufficient estimator of the parameter $\theta$, this is also the ML estimator. It can also be shown that sufficient estimators have the minimum attainable variance.

• Efficiency: If a minimum variance bound estimator exists, this is also the ML estimator.

• Asymptotic Normality: The estimator $\hat{\theta}$ is asymptotically normally distributed with mean equal to the true value of $\theta$.

When the p.d.f. is Gaussian, the maximum likelihood estimator is defined by:

$$\frac{\partial}{\partial\theta}\ln\left[\exp\left(-\frac{1}{2}\|W^{-1}(x - g(\theta))\|^2\right)\right] = 0 \qquad (C.6)$$

where $x$ is the vector of observations and $W$ is the covariance matrix. Eq. C.6 is equivalent to:

$$\hat{\theta} = \arg\min_\theta \|W^{-1}(x - g(\theta))\|^2 \qquad (C.7)$$

This is exactly the least squares estimator of $\theta$, so we conclude that when the p.d.f. is Gaussian the least squares estimator of the parameters $\theta$ is also the ML estimator.

C.1 Total Least Squares (TLS)

Total least squares is the least squares solution for the problem $Ay = b$:

$$\hat{y} = \arg\min_y \|Ay - b\|^2 \qquad (C.8)$$

where both $A$ and $b$ are noisy measurements. Taking the derivatives of $\|Ay - b\|^2$ we easily find that the solution is given by:

$$A^TAy = A^Tb \qquad (C.9)$$

The problem $Ay = 0$ also belongs to this category and its solution is the eigenvector associated with the smallest eigenvalue of the matrix $A^TA$. The estimator $\hat{y}$ is the maximum likelihood estimator when the errors on the entries of $A$ and $b$ are independent and identically distributed (Van Huffel & Vandewalle 1991). The solution strategy for the problem in Eq. C.9 depends on the rank of $A$. If $A \in \mathbb{R}^{m\times n}$ with $m \geq n$ is full rank, then the following holds:


Proposition C.1. Let $A \in \mathbb{R}^{m\times n}$ with $m \geq n$ and $A$ full rank. Let $A = QR$ be the QR factorization of $A$. Then there exists a unique factorization of $A$ of the form:

$$A = Q_1R_1 \qquad (C.10)$$

where $Q_1$ and $R_1$ are sub-matrices of $Q$ and $R$ given by:

$$Q_1 = Q(1{:}m, 1{:}n) \qquad R_1 = R(1{:}n, 1{:}n) \qquad (C.11)$$

Moreover $A^TA = R_1^TR_1$.

Theorem C.2. Let $A \in \mathbb{R}^{m\times n}$ with $m \geq n$ and $A$ full rank. Then the unique solution of the problem in Eq. C.9 is:

$$y = R_1^{-1}Q_1^Tb \qquad (C.12)$$

The computational cost of the Gram-Schmidt algorithm to compute the reduced QR factorization is of the order of $mn^2$.

If the matrix $A$ is rank deficient, the solution of the problem in Eq. C.9 is known only up to a vector in the null space of $A$. A further constraint has to be imposed to get a unique solution. Imposing that $y$ has minimal Euclidean norm, we find:

Theorem C.3. Let $A \in \mathbb{R}^{m\times n}$ with $m \geq n$. Then the unique solution to the problem in Eq. C.9 with the minimal norm constraint $\|y\| = \min$ is:

$$y = A^+b \qquad (C.13)$$

where $A^+$ is the generalized pseudo-inverse of $A$.

The number of flops needed to compute the SVD of $A$ is of the order of $2m^2n + 4mn^2 + \frac{9}{2}n^3$. In general, if the matrix $A$ is full rank, the QR and pseudo-inverse solutions are exactly the same, but the reduced QR decomposition is computationally much less expensive than the SVD factorization needed to compute the pseudo-inverse. A more advanced discussion and proofs of the theorems can be found in (Quarteroni, Sacco & Saleri 2000).
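The two solution strategies of Theorems C.2 and C.3 are contrasted in the sketch below (Python/NumPy); for a full-rank $A$ they return the same $y$, while only the pseudo-inverse route applies in the rank-deficient case.

```python
import numpy as np

def solve_qr(A, b):
    """Full-rank solution of A^T A y = A^T b via reduced QR (Theorem C.2)."""
    Q1, R1 = np.linalg.qr(A, mode='reduced')
    return np.linalg.solve(R1, Q1.T @ b)

def solve_min_norm(A, b):
    """Minimum-norm solution y = A^+ b (Theorem C.3)."""
    return np.linalg.pinv(A) @ b
```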


Bibliography

Adelson, E. & Bergen, J. (1985), 'Spatiotemporal energy models for the perception of motion', Journal of the Optical Society of America A 2(2), 284–299.

Ayer, S. & Sawhney, H. (1996), 'Compact representations of videos through dominant and multiple motion estimation', IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8), 814–830.

Ayer, S. & Sawhney, H. (1997), Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding, in 'International Conference on Computer Vision, ICCV1997', pp. 777–784.

Azarbayejani, A. & Pentland, A. (1995), 'Recursive estimation of motion, structure, and focal length', IEEE Transactions on Pattern Analysis and Machine Intelligence 17(6), 562–575.

Barron, J. L., Fleet, D. J. & Beauchemin, S. S. (1994), 'Performance of optical flow techniques', International Journal of Computer Vision 12(1), 43–77.

Beardsley, P., Torr, P. & Zisserman, A. (1996), 3D model acquisition from extended image sequences, in 'European Conference on Computer Vision, ECCV1996', pp. II:683–695.

Beardsley, P., Zisserman, A. & Murray, D. (1994), Navigation using affine structure from motion, in 'European Conference on Computer Vision, ECCV1994', pp. B:85–96.

Bergen, J., Anandan, P. & Hanna, K. (1992), Hierarchical model-based motion estimation, in 'European Conference on Computer Vision, ECCV1992', pp. 237–252.

Bondyfalat, D. & Bougnoux, S. (1998), Imposing Euclidean constraints during self-calibration processes, in 'SMILE Workshop', pp. 224–235.

Bouget, J. (n.d.), 'http://www.vision.caltech.edu/bouguetj/calibdoc/index.html'.

Bougnoux, S. (1998), From projective to Euclidean space under any practical situation, a criticism of self-calibration, in 'International Conference on Computer Vision, ICCV1998', Vol. 2, pp. 790–795.


Brooks, M., Chojnacki, W. & Baumela, L. (1997), 'Determining the egomotion of an uncalibrated camera from instantaneous optical flow', JOSA-A 14(10), 2670–2677.

Bruss, A. & Horn, B. (1983), 'Passive navigation', Computer Vision, Graphics and Image Processing 21, 3–20.

Cheong, L. & Peh, C. (2000), Characterizing depth distortion due to calibration uncertainty, in 'European Conference on Computer Vision, ECCV2000', pp. 665–667.

Chiuso, A., Favaro, P., Jin, H. & Soatto, S. (2000), 3D motion and structure causally integrated over time: Implementation, in 'European Conference on Computer Vision, ECCV2000', pp. II:734–750.

Costeira, J. & Kanade, T. (1998), 'A multibody factorization method for independently moving objects', International Journal of Computer Vision 29(3), 159–179.

Danilidis, K. & Spetsakis, M. (1996), Understanding noise sensitivity in structure from motion, in Y. Aloimonos, ed., 'Visual Navigation', pp. 61–88.

Demirdjian, D., Zisserman, A. & Horaud, R. (2000), Stereo autocalibration from one plane, in 'European Conference on Computer Vision, ECCV2000', Vol. 2, pp. 625–639.

Faugeras, O. (1992), What can be seen in three dimensions with an uncalibrated stereo rig?, in 'European Conference on Computer Vision, ECCV1992', pp. 563–578.

Fischler, M. & Bolles, R. (1981), 'Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography', CACM 24(6), 381–395.

Frodesen, A., Skjeggerstad, O. & Tøfte, H. (1978), Probability and Statistics in Particle Physics, Universitetforlaget.

Golub, G. & Van Loan, C. (1996), Matrix Computations, Johns Hopkins University Press.

Grossman, E. & Santos-Victor, J. (2000a), Dual representations for vision-based 3D reconstruction, in 'British Machine Vision Conference, BMVC2000', pp. 516–525.

Grossman, E. & Santos-Victor, J. (2000b), 'Uncertainty analysis of 3D reconstruction from uncalibrated views', International Vision Conference 18, 685–696.

Han, M. & Kanade, T. (1999), Perspective factorization methods for Euclidean reconstruction, in 'CMU-RI-TR'.

Hanna, K. (1991), Direct multi-resolution estimation of ego-motion and structure from motion, in 'IEEE Workshop on Visual Motion', pp. 156–162.


Hanna, K. & Okamoto, N. (1993), Combining stereo and motion for direct estimationof scene structure,in ‘International Conference on Computer Vision, ICCV1993’,pp. 357–365.

Hartley, R. (1997), ‘In defense of the eight-point algorithm’,IEEE Transactions onPattern Analysis and Machine Intelligence19(6), 580–593.

Hartley, R. & Zisserman, A. (2000),Multiple view geometry in computer vision, Cam-bridge University Press.

Heeger, D. J. & Jepson, A. D. (1992), ‘Subspace methods for recovering rigid motion’,International Journal of Computer Vision7(2), 95–117.

Heel, J. (1990), Direct estimation of structure and motion from multiple frames,in ‘MITAI Memo’.

Heel, J. (1991), Temporal surface reconstruction,in ‘IEEE Conference on ComputerVision and Pattern Recognition, CVPR1991’, pp. 607–612.

Horn, B. (1987), ‘Motion fields are hardly ever ambiguous’,International Journal ofComputer Vision1(3), 239–258.

Horn, B. & Schunck, B. (1981), ‘Determining optical flow’, Artificial Intelligence 17, 185–203.

Irani, M. & Anandan, P. (1999), About direct methods, in ‘Vision Algorithms: Theory and Practice’, pp. 267–277.

Irani, M. & Anandan, P. (2000), Factorization with uncertainty, in ‘European Conference on Computer Vision, ECCV2000’, Vol. 1, pp. 539–553.

Irani, M., Anandan, P. & Cohen, M. (1999), Direct recovery of planar-parallax from multiple frames, in ‘International Conference on Computer Vision, ICCV Vision Algorithms Workshop’, pp. 1–8.

Ju, S., Black, M. & Jepson, A. (1996), Multilayer, locally affine optical flow and regularization with transparency, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1996’, pp. 307–314.

Kanatani, K. (1993a), Geometric Computation for Machine Vision, Clarendon Press.

Kanatani, K. (1993b), Renormalization for unbiased estimation, in ‘International Conference on Computer Vision, ICCV1993’, pp. 599–606.

Koch, R., Pollefeys, M. & Van Gool, L. (1998), Multi viewpoint stereo from uncalibrated video sequences, in ‘European Conference on Computer Vision, ECCV1998’, pp. 55–71.

Kosecka, J. & Zucchelli, M. (2001), Motion bias and structure distortion induced by calibration errors, Technical report, George Mason University.

Longuet-Higgins, H. (1981), ‘A computer algorithm for reconstructing a scene from two projections’, Nature 293, 133–135.

Lucas, B. (1985), Generalized image matching by the method of differences, PhD thesis, Carnegie Mellon University.

Luong, Q., Deriche, R., Faugeras, O. & Papadopoulo, T. (1993), On determining the fundamental matrix: analysis of different methods and experimental results, Technical Report RR-1894, INRIA.

Ma, Y., Kosecka, J. & Sastry, S. (1997), Motion recovery from image sequences: Discrete viewpoint vs. differential viewpoint, Technical Report UCB/ERL M97/42, UC Berkeley.

Ma, Y., Kosecka, J. & Sastry, S. (1998), Motion recovery from image sequences: Discrete viewpoint vs. differential viewpoint, in ‘European Conference on Computer Vision, ECCV1998’, Vol. 2, pp. 337–353.

Ma, Y., Kosecka, J. & Sastry, S. (2000), ‘Linear differential algorithm for motion recovery: A geometric approach’, International Journal of Computer Vision 36(1), 71–89.

Ma, Y., Sastry, S. & Kosecka, J. (1998), Euclidean structure and motion from image sequences, Technical Report UCB/ERL M98/38, UC Berkeley.

MacLean, W. J. (1999), Removal of translational bias when using subspace methods, in ‘International Conference on Computer Vision, ICCV1999’, pp. 753–758.

Malis, E. & Cipolla, R. (2000), Self-calibration of zooming cameras observing an unknown planar structure, in ‘International Conference on Pattern Recognition, ICPR2000’, Vol. 1, pp. 85–88.

Matsunaga, C. & Kanatani, K. (2000), Calibration of a moving camera using a planar pattern: Optimal computation, reliability evaluation, and stabilization by model selection, in ‘European Conference on Computer Vision, ECCV2000’, Vol. 2, pp. 595–609.

Matthies, L., Szeliski, R. & Kanade, T. (1988), Incremental estimation of dense depth maps from image sequences, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1988’, pp. 366–374.

Matthies, L., Szeliski, R. & Kanade, T. (1989), ‘Kalman filter-based algorithms for estimating depth from image sequences’, International Journal of Computer Vision 3(3), 209–238.

Maybank, S. (1985), ‘The angular velocity associated with the optical flow field arising from motion through a rigid environment’, Proceedings of the Royal Society of London A 401, 317–326.

Maybank, S. (1993), Theory of Reconstruction from Image Motion, Vol. 28 of Information Sciences, Springer-Verlag.

Meer, P., Mintz, D., Kim, D. & Rosenfeld, A. (1991), ‘Robust regression methods for computer vision: A review’, International Journal of Computer Vision 6(1), 59–70.

Morita, T. & Kanade, T. (1997), ‘A sequential factorization method for recovering shape and motion from image streams’, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(8), 858–867.

Morris, D., Kanatani, K. & Kanade, T. (2001), Gauge fixing for accurate 3d estimation, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR2001’, pp. II: 343–350.

Nagel, H. (1983), ‘Displacement vectors derived from second-order intensity variations in image sequences’, Computer Vision, Graphics and Image Processing 21(1), 85–117.

Nagel, H. (1987), ‘On the estimation of optical flow: Relations between different approaches and some new results’, Artificial Intelligence 33(3), 299–324.

Nister, D. (2000), Frame decimation for structure from motion, in ‘SMILE 2000’, pp. 2–9.

Nister, D. (2001a), Automatic Dense Reconstruction from Uncalibrated Video Sequences, PhD thesis, Royal Institute of Technology.

Nister, D. (2001b), Calibration with robust use of cheirality by quasi-affine reconstruction of the set of camera projection centres, in ‘International Conference on Computer Vision, ICCV2001’, pp. II: 116–123.

Odobez, J. & Bouthemy, P. (1995), Mrf-based motion segmentation exploiting a 2d motion model robust estimation, in ‘International Conference on Image Processing’, pp. 628–631.

Oliensis, J. (2000), ‘A structure from motion ambiguity’, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(7), 685–700.

Quarteroni, A., Sacco, R. & Saleri, F. (2000), Numerical Mathematics, Texts in Applied Mathematics, Springer.

Rousseeuw, P. & Leroy, A. (1987), Robust Regression and Outlier Detection, John Wiley and Sons.

Sawhney, H. (n.d.), 3d geometry from planar parallax, Technical report, IBM Research Division.

Scales, L. (1985), Introduction to Non-Linear Optimization, Macmillan.

Shakernia, O. (1999), Landing an unmanned air vehicle: Vision based motion estimation and nonlinear control, Master’s thesis, UC Berkeley.

Shashua, A. & Werman, M. (1995), Trilinearity of three perspective views and its associated tensor, in ‘International Conference on Computer Vision, ICCV1995’, pp. 920–925.

Shi, J. & Tomasi, C. (1994), Good features to track, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1994’, pp. 593–600.

Simoncelli, E. (1993), Distributed analysis and representation of visual motion, PhD thesis, Massachusetts Institute of Technology.

Simoncelli, E., Adelson, E. & Heeger, D. (1991), Probability distributions of optical flow, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1991’, pp. 310–315.

Soatto, S. & Perona, P. (1998a), ‘Reducing “structure from motion” part 1: modelling’, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(9), 933–942.

Soatto, S. & Perona, P. (1998b), ‘Reducing “structure from motion” part 2: experimental evaluation’, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(9), 943–961.

Svoboda, T. & Sturm, P. (1996), What can be done with a badly calibrated camera in ego-motion estimation?, Technical report, Czech Technical University.

Szeliski, R. & Torr, P. (1998), Geometrically constrained structure from motion: Points on planes, in ‘SMILE workshop’, pp. 171–186.

Tell, D. (2002), Wide Baseline Matching with Applications to Visual Servoing, PhD thesis, Royal Institute of Technology.

Tomasi, C. (1991), Shape and motion from image streams: A factorization method, Technical Report CMU-CS-TR-91-172, Carnegie Mellon University.

Tomasi, C. & Heeger, D. J. (1994), Comparison of approaches to egomotion computation, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1994’, pp. 315–320.

Tomasi, C. & Kanade, T. (1991), Detection and tracking of point features, Technical Report CMU-CS-91-132, CMU.

Torr, P. & Murray, D. (1997), ‘The development and comparison of robust methods for estimating the fundamental matrix’, International Journal of Computer Vision 24(3), 271–300.

Torr, P., Szeliski, R. & Anandan, P. (2001), ‘An integrated bayesian approach to layer extraction from image sequences’, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(3), 297–303.

Triggs, B. (1997), Autocalibration and the absolute quadric, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1997’, pp. 609–614.

Triggs, B., Zisserman, A. & Szeliski, R. (1999), Vision Algorithms: Theory and Practice, Vol. 1883 of Lecture Notes in Computer Science, Springer.

Trucco, E. & Verri, A. (1998), Introductory Techniques for 3D Computer Vision, Prentice Hall.

Tsai, R. (1986), An efficient and accurate camera calibration technique for 3-d machine vision, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1986’, pp. 364–374.

Van Huffel, S. & Vandewalle, J. (1991), The Total Least Squares Problem: Computational Aspects and Analysis, SIAM, pp. 33–43.

Wang, J. & Adelson, E. (1993), Layered representation for motion analysis, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1993’, pp. 361–366.

Weiss, Y. (1997), Smoothness in layers: Motion segmentation using nonparametric mixture estimation, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1997’, pp. 520–526.

Xiong, Y. & Shafer, S. (1998), ‘Dense structure from a dense optical flow sequence’, Computer Vision and Image Understanding 69(2), 222–245.

Zeller, C. & Faugeras, O. (1996), Camera self-calibration from video sequences: The Kruppa equations revisited, Technical Report 2763, INRIA.

Zelnik-Manor, L. & Irani, M. (1999), Multi-frame alignment of planes, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1999’, Vol. 1, pp. 151–156.

Zhang, T. & Tomasi, C. (1999), Fast, robust, and consistent camera motion estimation, in ‘IEEE Conference on Computer Vision and Pattern Recognition, CVPR1999’, Vol. 1, pp. 164–170.

Zhang, Z. (1996), Determining the epipolar geometry and its uncertainty: A review, Technical Report RR-2927, INRIA.

Zhang, Z. (2000), ‘A flexible new technique for camera calibration’, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334.

Zucchelli, M. & Christensen, H. (2000), A comparison of stereo based and flow based structure from parallax, in ‘Symposium on Intelligent Robotic Systems’, pp. 199–207.

Zucchelli, M. & Kosecka, J. (2001), Motion bias and structure distortion induced by calibration errors, in ‘British Machine Vision Conference, BMVC2001’, pp. 663–672.