Katedra aplikovanej informatiky
Fakulta Matematiky, Fyziky a Informatiky
Univerzita Komenského, Bratislava
Martin Bujňák
Reconstructing 3D mesh from video sequence
Master thesis
BRATISLAVA 2005
Comenius University, Bratislava, Slovakia
Faculty of Mathematics, Physics and Informatics
Department of Applied Informatics
Martin Bujňák:
Reconstructing 3D mesh from video sequence
(Master thesis)
Advisor: RNDr. Martin Samuelčík
Bratislava, April 2005
I honestly declare that I have written the submitted master thesis myself, using only the literature listed in the bibliography.

Bratislava, 29th April 2005
Martin Bujňák
I would like to thank my master thesis advisor Martin Samuelčík for his valuable advice, remarks, and suggestions. I would also like to thank my family for their support and patience during my work.
Abstrakt

My thesis addresses the problem of building a 3D scene from a set of uncalibrated images or from video. The presented algorithm processes its input on-line and is based on tracking salient points through the set of input images, finding the geometric relation between pairs of images, and building a projective reconstruction of the captured scene. Using a linear method, the algorithm finds the intrinsic camera parameters and transforms the scene into metric space. I further present a new algorithm for finding a dense representation of the scene and its triangulation, and I describe an experiment in which I use two-view geometry to estimate the radial distortion of the camera optics.

The algorithm assumes that the camera produces images with zero skew, that the centre of projection lies at the image centre, and that the pixel aspect ratio equals 1. The focal length of the camera may vary during capture. The output satisfies the assumptions on the input camera, and the reconstructed scene is free of projective deformation.

Keywords: Visual modeling, Structure and motion, Dense scene reconstruction, Self-calibration, Radial lens distortion
Abstract
This thesis aims to create a complete 3D reconstruction of a real scene from an uncalibrated video sequence. My work deals with the image feature correspondence problem, reduced to feature tracking throughout the image sequence; camera tracking, i.e. retrieving camera positions and camera calibration; and finally dense scene reconstruction represented as a 3D mesh.

Even though the input consists of uncalibrated images, the algorithm assumes that the images were taken by a camera with the following restrictions on its intrinsic parameters: zero skew, principal point at the image centre, and aspect ratio equal to 1. The camera focal length can vary across the sequence. Images must be processed in the order in which they were captured, and the motion between two consecutive frames is assumed to be small.

The main contributions of this work are a simple feature detector and tracker, a novel fast on-line structure-from-motion algorithm based on two-view geometry, dense reconstruction based on a new stereo algorithm, and 3D mesh extraction. I also describe a linear method for calibrating cameras from the input images alone (self-calibration). An experimental method for detecting radial lens distortion based on two-view geometry is presented as well.

Keywords: Structure-from-motion, Uncalibrated video, Self-calibration, Feature tracking, Dense reconstruction, Radial lens distortion
Table of Contents

Abstrakt
Abstract
Table of Contents
1 Introduction
  1.1 Motivation and goal of this work
  1.2 Outline of the document
2 Feature tracking
  2.1 Introduction
  2.2 Harris based feature tracker
  2.3 Removing outliers
  2.4 Finding more features
  2.5 Results
    2.5.1 Tracker
    2.5.2 Guided matching
3 Structure and Motion
  3.1 Introduction
    3.1.1 Previous work
    3.1.2 Overview
  3.2 Camera pair
    3.2.1 Quasi-calibrated camera pair
    3.2.2 Sparse scene from camera pair
    3.2.3 Small motion - precision issues
  3.3 Updating the structure and motion
    3.3.1 Merging camera pairs
  3.4 Self-calibration
  3.5 Results
4 Dense reconstruction
  4.1 Introduction
  4.2 Overview
    4.2.1 Volumetric methods
    4.2.2 Stereo methods
  4.3 Novel algorithm
    4.3.1 Design
    4.3.2 Rectification
    4.3.3 Initial disparity map
    4.3.4 Refining disparity map using dynamic programming
    4.3.5 Triangulation
    4.3.6 Multi-view linking
  4.4 Results
    4.4.1 Examples
5 Experiments
  5.1 Radial lens distortion
  5.2 Results
6 Conclusion and future work
A Resources
List of Figures

2.1 Self-correlation. Left: a self-correlating feature; left bottom: correlation to two self-correlating features; right: a good feature to track; right bottom: neighbour correlations.
2.2 Principal components marked in red. The only good feature is marked in green.
2.3 A feature matching perfectly to several features in the second image.
2.4 The point corresponding to the point in the right image must lie on (or, due to noise, near) the line in the left image.
2.5 Pair 1-1 will be added if its correlation exceeds the threshold. Pairs 1-3, 2-1, and 2-3 will not be tested because the lengths of the line segments are too different. If 2-2 (if not missing) is used, 3-1 would not be tested due to the ordering criterion.
2.6 Novel detector: traceable features are marked white. Red features are self-correlating.
2.7 KLT good features: traceable features marked white. Bigger motion results in invalid matching on the roofs of the houses.
2.8 Guided matching. Matching is processed on two corresponding epipolar lines.
2.9 More features found using guided matching. Small lines starting in feature points represent feature motion to the previous frame.
3.1 Filled areas denote where the 3D point can be placed. Left image: perpendicular rays lead to the best precision (smaller region); right image: the region grows as the distance between the two cameras decreases.
3.2 Merging a new pair and the existing space using common feature points and corresponding 3D points.
3.3 Structure and motion precision progress. Blue: reconstruction from 5 cameras; green: from 8; red: from 10.
3.4 Structure and motion of a real scene. Video was captured at resolution 640x480.
3.5 Sparse reconstruction of the scene. Resolution of the input video was 320x240.
4.1 Searching for matching pairs as a path-search problem. Cut of the scene on the left, occlusion diagram on the right.
4.2 Stereo ambiguities. It is not possible to detect the true matching from these two views.
4.3 A human cannot match pixels in the background. The background must be removed.
4.4 Area visible from both cameras. Black lines define left-camera visibility and red lines right-camera visibility. The minimal area is selected.
4.5 Rectification process. The width of the rectified image is the difference between the max X and min X radii. The epipolar line is rotated so that it travels at most 1 pixel on the outer circumference (with max X radius).
4.6 Row of the rectified image. Each row corresponds to some epipolar line and its intersection with the image.
4.7 Refined disparity map. Triangle vertices correspond to points for which a corresponding 3D vertex is known from the scene structure.
4.8 Triangulation of small disparity discontinuities.
4.9 Triangulation merging. Input reconstructions in a) and b), result in c); d) shows triangle orientation, all in one image.
4.10 Rectification example. Original image pair (top) and rectified image pair (bottom).
4.11 Dense reconstruction of nature. Two merged camera pairs were used to obtain the disparity map and 8 other cameras were used for photo-consistency checks.
4.12 Dense reconstruction of a human face. 2 cameras were used to obtain the disparity map and 7 cameras for photo-consistency checks. The scene is rendered as a point cloud; each point is drawn as a 2x2-pixel colored splat.
4.13 Reconstruction process. Frame from the input sequence (top-left), feature tracking (top-right), structure and motion with the reconstructed object (bottom-right), final 3D mesh (bottom-left).
4.14 Dense reconstruction of the scene from figure 3.2.2. The scene is reconstructed from two views. Video resolution was 320x240.
4.15 Scene reconstructed using 9 cameras.
4.16 Novel view of an object with many homogeneous regions. The scene is reconstructed from 2 cameras.
5.1 Radial distortion test. Left: original; right: undistorted. The cost-function graph for each feature point is below.
Chapter 1
Introduction
1.1 Motivation and goal of this work
In many present-day applications, such as architectural visualization, cultural heritage, medicine, and the movie and computer game industries, it is necessary to acquire highly detailed, photo-realistic 3D representations of real objects. Several methods exist for creating a virtual copy of an existing real object, from modeling by artists to laser scanning.

In this work I take a closer look at one of the most accessible and cheapest ways of 3D model reconstruction: using video sequences. With such methods the user can freely move a camera around an object or scene and record video. From this video we are able to reconstruct the motion of the camera and a textured 3D scene. Neither the camera position nor the camera settings have to be known a priori.
My approach tracks point features across the video sequence. From the tracked features the algorithm creates two-view geometry using a robust algorithm. Then multiple-view structure and motion is created. Every change to the structure must keep the sparse scene consistent: all re-projected 3D points must lie on their corresponding 2D features (in practice, due to discrete space and noise, we want to achieve minimal quadratic error; further in the text, "error aspects"). If the mean error exceeds a threshold, non-linear minimization is performed. Self-calibration is used to restrict the space ambiguity from projective to metric. The final 3D reconstruction is processed in two steps: the first step creates a hypothesis about the scene using a stereo algorithm, and the second step merges it with the scene using photo-consistency checks.
Note that if we could effectively find the global minimum of the expression

\sum_{i=0}^{m} \sum_{j=0}^{n} d(m_{ij}, P_i X_j)^2,

where m_{ij} is the known 2D re-projection of the unknown 3D point X_j by the unknown camera P_i and d(x, y) is the distance of two 2D points, then all other reconstruction machinery would be unnecessary. Therefore all the effort goes into finding a good initial condition for the numerical minimization methods that minimize this formula.
1.2 Outline of the document

The algorithms are described in three parts: feature tracking, structure and motion with self-calibration, and dense scene reconstruction. Each part contains an overview, a comparison with existing methods, and concludes with results of my novel approaches.

In my work I focus on feature tracking with my own modification of guided matching, experimental radial lens distortion removal, sequential two-view to multiple-view merging, and dense reconstruction. Other intermediate processes are described in less detail, with references to complete descriptions, so that the reader can get a complete view of the problem. The thesis ends with a conclusion and future work.
Chapter 2
Feature tracking
2.1 Introduction

A feature point can be defined as a point that can be differentiated from its neighboring points. Feature matching plays a key role in most photo/video based modeling tools, even though it is fundamentally ill-conditioned. Consider two images of a building with many identical-looking windows: a human finds corresponding windows by counting windows from some edge, or in some similar way. The choice of strategy depends on the scene context, which is complicated to implement. In computer vision we transform this problem into the simpler problem of feature tracking by adding an assumption on the maximal motion between two input frames.

Robust commercial feature tracking packages already exist, such as the one described in [5]. For this work we selected the free KLT feature tracking toolkit [6] and developed a new feature tracker similar to KLT.
2.2 Harris based feature tracker

My feature tracker uses the Harris point feature detector [7]. Feature points are detected in two neighboring images from the sequence. Similarly to KLT, the algorithm selects features that are good to track. As small motion between neighboring images is assumed, the algorithm removes all features that could be mistaken for their neighboring feature points in the same image; further in the text this is referred to as self-correlation (see figure 2.1, left).
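For illustration, the following sketch computes the standard Harris corner response per pixel; the window size and the constant k are illustrative defaults, not values prescribed by the thesis.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def harris_response(img, window=5, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 per pixel.

    img: grayscale image as a 2D float array. window and k are
    illustrative defaults, not values taken from the thesis.
    """
    ix = sobel(img, axis=1)   # horizontal image gradient
    iy = sobel(img, axis=0)   # vertical image gradient
    # Elements of the structure matrix M, averaged over a local window.
    ixx = uniform_filter(ix * ix, window)
    iyy = uniform_filter(iy * iy, window)
    ixy = uniform_filter(ix * iy, window)
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - k * trace ** 2
```

Local maxima of this response above a threshold give the candidate feature points that the self-correlation test then filters.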
Figure 2.1: Self-correlation. Left: a self-correlating feature; left bottom: correlation to two self-correlating features; right: a good feature to track; right bottom: neighbour correlations.

Figure 2.2: Principal components marked in red. The only good feature is marked in green.
Each feature point is extended by its orientation, defined by the two principal axes of the covariance matrix formed from a small region surrounding the feature point (see figures 2.2 and 2.1, top-left). Features with one principal component much bigger than the other are removed; this is typical for edges.

The features remaining in the two images are then matched using zero-mean normalized cross-correlation (ZNCC). I modified ZNCC to take feature orientation into account by changing to a polar coordinate system. Due to image noise and discrete sampling, the feature orientation can change slightly when the image is rotated; to handle this we rotate the feature during the correlation process.
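A minimal sketch of the ZNCC score used for matching follows; the orientation handling described above (polar resampling and rotation of the feature) would happen before the patches are passed to it.

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches.

    Returns a value in [-1, 1]; 1 means a perfect match. A sketch of the
    standard ZNCC score only - resampling to polar coordinates and color
    handling from the text are assumed to happen before this call.
    """
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom < 1e-12:          # flat patch: correlation is undefined
        return 0.0
    return float((a * b).sum() / denom)
```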
Figure 2.3: A feature matching perfectly to several features in the second image.
2.3 Removing outliers

Both KLT and my feature matching algorithm produce well-correlating feature pairs. In real images we noticed that these pairs sometimes do not point to the same object in the scene (see figure 2.3). Such feature pairs (further in the text, outliers; similarly, good correspondences are called inliers) have to be removed, as in later stages they may cause various errors. In my approach I use the RANSAC paradigm [8] to find the two-view geometry: the epipolar geometry described by the fundamental matrix. All inliers satisfy the epipolar constraint

m^T F m' = 0,    (2.1)

where m and m' are two corresponding feature points and F is the fundamental matrix.
Using this constraint the algorithm eliminates almost all outliers, but unfortunately some outliers will persist, because bad matches can also satisfy the epipolar constraint. This occurs when the second feature point lies on the epipolar line of the first point (see figure 2.4). Such bad matches can be removed using another view. In my algorithm I assume that some outliers can persist, and thus the subsequent algorithms must be able to deal with outliers and filter them out.
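As an illustration of the inlier test used inside RANSAC, the following sketch computes the symmetric point-to-epipolar-line distance for a putative pair; a match is kept when this distance falls below a pixel threshold. The function name and convention details are mine, not the thesis's.

```python
import numpy as np

def epipolar_distance(F, m, m2):
    """Symmetric point-to-epipolar-line distance for a putative match.

    Convention follows equation (2.1), m^T F m' = 0: F^T m is the epipolar
    line of m in the second image and F m' the line of m' in the first.
    m, m2: homogeneous 2D points as 3-vectors with last coordinate 1.
    """
    l2 = F.T @ m           # epipolar line of m in the second image
    l1 = F @ m2            # epipolar line of m2 in the first image
    d2 = abs(l2 @ m2) / np.hypot(l2[0], l2[1])
    d1 = abs(l1 @ m) / np.hypot(l1[0], l1[1])
    return d1 + d2
```

RANSAC then repeatedly fits F to a minimal random sample of matches, counts the pairs passing this test, and keeps the F with the most inliers.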
2.4 Finding more features

After the two-view geometry is obtained, guided searching can be performed. The fundamental matrix restricts the search region for each point in the first image to a line in the second image (see figure 2.4).
Figure 2.4: The point corresponding to the point in the right image must lie on (or, due to noise, near) the line in the left image.
Figure 2.5: Pair 1-1 will be added if its correlation exceeds the threshold. Pairs 1-3, 2-1, and 2-3 will not be tested because the lengths of the line segments are too different. If 2-2 (if not missing) is used, 3-1 would not be tested due to the ordering criterion.
In my approach I find the epipolar line for each feature. Guided matching is performed on these lines, taking the ordering constraint into account. Two features are correlated only if the length of the epipolar line segment from the previous feature point is similar to the length on the corresponding epipolar line (see figure 2.5).

From all matches the algorithm calculates a new fundamental matrix using the normalized 8-point algorithm [9], which finds the new fundamental matrix by a linear least-squares method. This method does not distribute the error perfectly, as pointed out in [9]. Therefore the 8-point algorithm is used as an initial estimate for nonlinear numerical minimization of

cost(F) = \sum_i d(m_i, \hat{m}_i)^2 + d(m'_i, \hat{m}'_i)^2,

where \hat{m}_i^T F \hat{m}'_i = 0, m_i and m'_i are the i-th corresponding feature pair, and d(x, y) is the distance of two 2D points.

The cost function is minimized using the Levenberg-Marquardt algorithm [11].
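A sketch of the normalized 8-point estimate described above, assuming (n, 2) arrays of matched points; the Levenberg-Marquardt refinement would start from the F it returns.

```python
import numpy as np

def normalized_eight_point(x1, x2):
    """Linear estimate of the fundamental matrix from n >= 8 matches.

    x1, x2: (n, 2) corresponding points in the first and second image,
    with the convention x1^T F x2 = 0 of equation (2.1). Points are
    normalized (centroid at origin, mean distance sqrt(2)) as in [9],
    F is solved by SVD, forced to rank 2, and de-normalized.
    """
    def normalize(pts):
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0, -s * c[0]],
                      [0, s, -s * c[1]],
                      [0, 0, 1.0]])
        ph = np.column_stack([pts, np.ones(len(pts))])
        return (T @ ph.T).T, T

    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each match contributes one row of the homogeneous system A f = 0.
    A = np.column_stack([
        p1[:, 0] * p2[:, 0], p1[:, 0] * p2[:, 1], p1[:, 0],
        p1[:, 1] * p2[:, 0], p1[:, 1] * p2[:, 1], p1[:, 1],
        p2[:, 0], p2[:, 1], np.ones(len(p1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T1.T @ F @ T2       # undo the normalization
```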
Figure 2.6: Novel detector: traceable features are marked white. Red features are self-correlating.
2.5 Results

2.5.1 Tracker

My feature detector does not differ much from the KLT detector. Compared with KLT, I introduced feature orientation, perform feature correlation in color space, and use a different criterion for selecting features that are good to track. Taking orientation into account means that this feature tracker does not lose features when the camera is rotated; for small camera rotation even KLT works fine. Because feature correlation is performed on color images, my algorithm is able to track more features when they can be differentiated by color.

My criterion for selecting traceable features plays the main role in scenes like the one in figure 2.6 (compare to figure 2.7), where KLT selects features that are not suitable for tracking and will cause many bad matches. For small motion both KLT and my feature tracker give the same results in comparable time.
2.5.2 Guided matching

The guided matching algorithm finds matches that satisfy the epipolar constraint (equation 2.1). If two matched features lie on their epipolar lines, the epipolar constraint is satisfied even if these features are outliers. My algorithm performs guided matching in a way similar to dense stereo algorithms; the difference is that it runs on a smaller number of features instead of all image pixels.

Figure 2.7: KLT good features: traceable features marked white. Bigger motion results in invalid matching on the roofs of the houses.

Imposing the ordering constraint and the length criterion dramatically reduces the number of outliers. The computational complexity stays O(nm) in the worst case, for n features in the first and m features in the second image, but the algorithm runs in expected linear time due to the length criterion. The new criteria also reduce the number of expensive cross-correlation tests: their count is mn in the worst case but likewise linear in expectation. See figures 2.8 and 2.9.
Figure 2.8: Guided matching. Matching is processed on two corresponding epipolar lines.

Figure 2.9: More features found using guided matching. Small lines starting in feature points represent feature motion to the previous frame.
Chapter 3
Structure and Motion
3.1 Introduction

3.1.1 Previous work

Structure and motion is defined as the reconstruction of camera motion (the positions of the cameras) together with a sparse reconstruction of the structure of the scene. Note that it is desirable to find this only from knowledge of image feature correspondences.

In past years many approaches for retrieving structure and motion from an image sequence have been proposed. The most similar, in the way its input and output are defined, is the approach presented by Marc Pollefeys [1]. Pollefeys searches for two good initial frames; from this initial camera pair the structure of the scene is reconstructed, and structure and motion are then completed from knowledge of 2D-3D correspondences. The quality of the reconstruction depends on the selection of the initial frames. Even so, this method is today considered the most robust. The main difference of my method against Pollefeys' is that I process the input on-line and create structure even if it is not yet accurate. Each new frame is classified by whether it can improve the quality of the reconstruction; if some part of the scene can be improved, the structure and motion are updated. In contrast to Pollefeys' method, my method estimates the position of the new camera not only from 2D-3D correspondences but also from two-view geometry. A sequential approach has also been proposed in [16], but unlike that algorithm, my approach does not require quasi-Euclidean initialization.

A different approach was proposed by Kanade et al. [2], using a perspective factorization method.
This method requires that every feature be known in all views. From such features the measurement matrix is built; this matrix is then factorized into P and X (see equation 3.1):

\begin{pmatrix}
\lambda_{11}\begin{pmatrix}x_{11}\\y_{11}\\1\end{pmatrix} & \cdots & \lambda_{1n}\begin{pmatrix}x_{1n}\\y_{1n}\\1\end{pmatrix}\\
\vdots & & \vdots\\
\lambda_{m1}\begin{pmatrix}x_{m1}\\y_{m1}\\1\end{pmatrix} & \cdots & \lambda_{mn}\begin{pmatrix}x_{mn}\\y_{mn}\\1\end{pmatrix}
\end{pmatrix} = PX \qquad (3.1)

The main problem is to find the \lambda values. A complete description of the method and algorithm can be found in [3].

There exist many other methods that require more assumptions or scene markers; many of them are described in [4].
3.1.2 Overview

From the previous stages we have pairs of matching features and the two-view geometry of the last and the new frame from the image sequence. Note that frames are added sequentially. From the relation between the views and the feature correspondences we want to create the structure of the scene and the motion of the camera.

Images are processed as they come, and the existing scene is refined if the information from a new frame leads to a more precise scene. This information is obtained from two-view projective reconstruction (described in section 3.2). I calculate a weight for each 3D point, telling how big the region is where the 3D point can be placed while the reprojection error stays within thresholds (see figure 3.1). All measurements are carried out in image space, so the algorithm works in projective space. Structure and motion is built sequentially by merging each camera pair with the previous structure using common features; there must be enough (at least 4) common feature points. Introducing image-based measurements allows us to measure the amount of motion parallax between image pairs. During the merging step I use this motion to assign weights to the common feature points.
Figure 3.1: Filled areas denote where the 3D point can be placed. Left image: perpendicular rays lead to the best precision (smaller region); right image: the region grows as the distance between the two cameras decreases.
3.2 Camera pair

3.2.1 Quasi-calibrated camera pair

The projective camera pair is created using the epipolar geometry known from the previous step. The canonical pair is defined as

P_1 = [ I_{3x3} | 0_3 ],
P_2 = [ [e_{12}]_x F_{12} + e_{12} a^T | o e_{12} ],    (3.2)

where F_{12} is the fundamental matrix from image 1 to image 2 and e_{12} is the epipole. Note that o and a are free parameters, and changing them leaves the epipolar geometry of the camera pair unchanged [4]: o determines the global scale of the reconstruction and a the position of the reference plane. Thus o can simply be set to one. My algorithm finds a so that camera P_2 satisfies the calibration conditions: zero skew, principal point at the image centre, and varying focal length. Note that at least 3 cameras are needed to perform full calibration under the input assumptions; therefore further in the text I refer to this camera pair as quasi-calibrated.
3.2.2 Sparse scene from camera pair

Having the projection matrices allows calculating a 3D position for each feature pair. Usually this is done by triangulation. Due to noise and the discrete image space, the sight lines may not intersect perfectly. The 3D position of a feature pair can be calculated so that the distance between the reprojected 3D point and the matching 2D points is minimal:

d(m, P_1 M)^2 + d(m', P_2 M)^2,

where m, m' are the corresponding 2D feature points, both corresponding to M, P_1 and P_2 are the camera matrices of the pair, and d(x, y) is the distance of two 2D points. Many methods for obtaining the optimal 3D position are proposed in [4]. In my work I find the 3D position M by minimizing the formula

cost(M) = \sum_i dist(\overline{A_i m_i}, M),

where each summand is the distance between the unknown point M and a sight line (A_i is the camera centre, m_i a point in 3D on the projection plane). Such a point can be computed using the least-squares method. If the reprojected 3D point is too far from any of its 2D features, the feature is considered an outlier.
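A sketch of this least-squares solution for M follows, assuming known camera centres and ray directions; each sight line contributes the normal equation (I - d d^T) M = (I - d d^T) A.

```python
import numpy as np

def point_from_sight_lines(centers, dirs):
    """3D point with minimal summed squared distance to a set of sight lines.

    centers: (n, 3) camera centres A_i; dirs: (n, 3) direction vectors of
    the rays through the matched 2D features. A sketch of the linear
    least-squares solution mentioned in the text.
    """
    S = np.zeros((3, 3))
    b = np.zeros(3)
    for A, d in zip(centers, dirs):
        d = d / np.linalg.norm(d)
        # Projector onto the plane orthogonal to the ray direction d.
        P = np.eye(3) - np.outer(d, d)
        S += P
        b += P @ A
    return np.linalg.solve(S, b)
```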
3.2.3 Small motion - precision issues

Sometimes it is not possible to calculate the projection matrices. This occurs when no motion was made, or when virtual parallax occurred. In such cases the epipolar geometry (fundamental matrix) is poorly estimated. To avoid this, the algorithm performs 2D measurements and skips the frame if the median length of the motion vectors is smaller than a threshold. Virtual parallax, caused by pure rotation around an axis passing through the focal point combined with pure zooming, can be detected by thresholding the eigenvalues of the fundamental matrix.

Even if the fundamental matrix is well defined, discrete space and noise can leave too much freedom for placing a 3D point (see figure 3.1). The error can be enormous when the camera motion is small. For that case I use an image-based measure (weight) for each 3D feature saying how precise the estimate of the 3D point is (the volume of the intersection of the sight lines). Similarly, if the median weight is smaller than a threshold, the frame is skipped. Note that a photo-consistent 3D point lies in the intersection of all sight lines from all cameras in which the 3D point is visible.

Also note that skipped frames are held in memory for feature tracking purposes.
3.3 Updating the structure and motion

In this section I describe how camera pairs are merged with the existing reconstruction. My algorithm merges each new camera pair with the existing structure and motion using their common camera. Merging is performed so that the best of the old and the new scene is used.

Figure 3.2: Merging a new pair and the existing space using common feature points and corresponding 3D points.

The algorithm also calculates the re-projection error of the 3D space for each camera. If the mean error exceeds a given threshold, nonlinear minimization (bundle adjustment [12]) is performed. This problem can be solved effectively by taking the sparsity of the problem into account [13].
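For illustration, a minimal bundle adjustment residual that could be handed to a Levenberg-Marquardt solver; a real implementation would exploit the sparsity of the Jacobian [13] and a better parameterization, so this is a sketch only, with names and packing conventions of my own choosing.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, observations):
    """Residual vector for bundle adjustment over projective cameras.

    params packs n_cams flattened 3x4 camera matrices followed by n_pts
    homogeneous 4-vectors; observations is a list of tuples
    (cam_idx, pt_idx, x, y) of measured 2D features.
    """
    cams = params[:n_cams * 12].reshape(n_cams, 3, 4)
    pts = params[n_cams * 12:].reshape(n_pts, 4)
    res = []
    for ci, pi, x, y in observations:
        proj = cams[ci] @ pts[pi]          # project point into camera ci
        res.extend([proj[0] / proj[2] - x, proj[1] / proj[2] - y])
    return np.asarray(res)

# Usage sketch (x0 is the packed initial estimate from the merging step):
# fit = least_squares(reprojection_residuals, x0,
#                     args=(n_cams, n_pts, observations), method="lm")
```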
3.3.1 Merging camera pairs

For the merging process we have a new camera pair P_1, P_2 in canonical form and an existing reconstruction. Let P be the last camera in the existing motion structure; P corresponds to camera P_1 in the new pair. Merging the pairs means transforming both P_1 and P_2 so that P_1 becomes equal to P. After that, P_2 will not be correctly placed, as it can differ in the position of the reference plane and in the scale factor (see figure 3.2).
Under ideal conditions we can express the homography transformation that will fix camera P_2 from the known 3D space and the common correspondences as

Y = HX,
X = H^{-1}Y,

where X and Y are corresponding 3D points in the new and the existing structure.

From four 3D points we are able to calculate H or H^{-1}. In practice, due to error aspects, we need a robust approach, as not all 3D points are suitable for calculating such a homography. In our approach we select the bundle of points that have the biggest weight (see section 3.2.3). The features are divided into two groups by weight: those where the new space is better and those where the current space is better. Points in one group are used to calculate H and those in the second to calculate H^{-1}. The homographies are calculated using the RANSAC paradigm [8]; all measurements are carried out in 2D.

After transforming P_2 with H we can merge the 3D structure of the new pair into the existing structure. Already known 3D points are merged with the new points and recalculated as described in section 3.2.2.
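A sketch of the direct linear transform for the 3D homography H with Y ~ HX; the thesis wraps an estimator of this kind in RANSAC and selects the points by their weights. The row construction is my own illustration.

```python
import numpy as np

def homography_3d(X, Y):
    """DLT estimate of the 4x4 homography H with Y ~ H X.

    X, Y: (n, 4) arrays of corresponding homogeneous 3D points from the
    new pair and the existing reconstruction. Each correspondence gives
    three linear equations on the 16 entries of H (flattened row-major).
    """
    rows = []
    for x, y in zip(X, Y):
        z = np.zeros(4)
        # Enforce y_k * (H_4 . x) = y_4 * (H_k . x) for k = 1, 2, 3.
        rows.append(np.concatenate([y[3] * x, z, z, -y[0] * x]))
        rows.append(np.concatenate([z, y[3] * x, z, -y[1] * x]))
        rows.append(np.concatenate([z, z, y[3] * x, -y[2] * x]))
    A = np.asarray(rows)
    # Null-space solution: right singular vector of the smallest value.
    return np.linalg.svd(A)[2][-1].reshape(4, 4)
```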
To minimize the accumulated error, the mean 3D-to-2D re-projection error is calculated. If this error exceeds a threshold, the algorithm performs bundle adjustment.
3.4 Self-calibration

Until now we have not cared about the intrinsic camera parameters. The reconstructed space and camera poses are locked by photo-consistency constraints (the reprojection error is small), but such a reconstruction is not unique. We now detect the intrinsic camera parameters using only the images; this is called self-calibration. Many techniques for self-calibration of cameras are described in [4].

Let X be any 3D point of the reconstruction, P any camera, and m the corresponding 2D feature in camera P. For any homography H_{4x4} we get

m = PX = (PH)(H^{-1}X).    (3.3)
This means that we can transform both the cameras and the 3D points so that the reprojection error stays unchanged. Without loss of generality we can assume that H does not shear, rotate, translate, or scale; these components are interesting only if we want to align the reconstruction to some existing space. The only component we care about is the projective part of the homography, the only part that can transform the plane at infinity. Such a homography can be described as

H = \begin{pmatrix} k^{-1} & 0 & 0 & 0\\ 0 & k^{-1} & 0 & 0\\ 0 & 0 & k^{-1} & 0\\ a k^{-1} & b k^{-1} & c k^{-1} & k \end{pmatrix},    (3.4)

where a, b, c, k are unknowns.
The camera matrix can be factorized into an upper triangular 3x3 calibration matrix (3.5), a 3x3 rotation matrix, and a 3x1 translation vector:

K = \begin{pmatrix} a_x & s & x_0\\ 0 & a_y & y_0\\ 0 & 0 & 1 \end{pmatrix},    (3.5)

where a_x, a_y encode the focal length, a_x : a_y is the aspect ratio, s is the skew, and [x_0, y_0] is the principal point.
My algorithm finds a homography H such that all cameras transformed with H have calibration matrices as assumed: skew equal to zero, principal point at the image centre, aspect ratio equal to 1, and varying focal length. The key to finding such a homography lies in the projection of the absolute conic. The absolute conic can be represented using the dual absolute quadric \Omega^* [14]. One of the most important properties of absolute quadrics is that they are invariant to similarity transformations. Another property leads directly to the key for finding H: the projection of the dual absolute quadric is directly related to the intrinsic camera parameters [1]:

K K^T \sim P \Omega^* P^T,    (3.6)

where P is the 3x4 projection matrix.

Since the images are independent of the projective basis of the reconstruction, equation (3.6) is always valid, and constraints on the intrinsics can be translated into constraints on the absolute quadric [1]. With our assumptions, K can be found from equation (3.6) by solving a linear system with one cubic constraint [15].
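As an illustration of equation (3.6), the following sketch recovers K for one camera, assuming the dual absolute quadric Q has already been found from the linear constraints; K is extracted as the upper-triangular factor of P Q P^T. The Cholesky route is my own choice of factorization.

```python
import numpy as np

def intrinsics_from_daq(P, Q):
    """Recover K for one camera from the dual absolute quadric.

    P: 3x4 projective camera; Q: 4x4 symmetric dual absolute quadric,
    so that K K^T ~ P Q P^T (equation 3.6). Assumes the projected conic
    is positive definite (flip the sign of Q otherwise).
    """
    w = P @ Q @ P.T                            # dual image of the absolute conic
    w = w / w[2, 2]
    # w^{-1} = L L^T with L lower triangular, hence K = L^{-T} is upper
    # triangular and satisfies K K^T = w.
    L = np.linalg.cholesky(np.linalg.inv(w))
    K = np.linalg.inv(L).T
    return K / K[2, 2]
```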
Using H, I transform the whole structure and all cameras as shown in equation (3.3). The re-projection error stays unchanged, while the cameras satisfy the "real" conditions.
3.5 Results

The RANSAC paradigm, in both the two-view and the merging-to-n-view processes, makes the algorithm robust to the presence of outliers. Tests on synthetically generated data showed that the algorithm can deal with up to 30% outliers (for 100 feature correspondences). Radial lens distortion influences the 3D scene, but the 3D-to-2D re-projection error stays under 1 pixel. Adding Gaussian noise to the images causes problems for the linear algorithms. Having more cameras means that the 3D points are estimated from more 2D correspondences, so the noise is partly suppressed and the structure does not change dramatically. Because the camera projection matrix is calculated only from the fundamental matrix, numerical minimization of the fundamental matrix is essential in this case.

For noise with radius under 1 pixel, neither the camera motion nor the scene structure changed dramatically: the mean residual error (further in the text, error), measured in pixel space, stays under 0.5 pixels for 100 corresponding points and 3 cameras. With 4 cameras the error dropped under 0.3 pixels; adding more cameras did not change the error as rapidly. Noise with radius 2-3 pixels caused the camera motion to be poorly estimated, and the reprojected structure differed from ground truth by up to 2 pixels measured in image space; ten cameras were able to drop the error under 1.2 pixels. The change in structure and motion is visualized in figure 3.3. Numerical minimization, on both structure and motion, found a new solution with error under 0.8 pixels. For noise with radius 3 pixels and above, the cameras were estimated poorly even after numerical minimization; in such cases it would be better to calculate the projection matrix from 3-, 4-, or more-view image constraints.

Tests on real data give good results even for a low-resolution camera. Figure 3.5 shows a sparse reconstruction of a scene captured by a digital camera at resolution 320x240. Another example is in figure 3.4.
Although quasi-calibration of the pairs is not required, my experience shows that it helps in the merging processes. Merging quasi-calibrated pairs means the sparse 3D space is less distorted by perspective: the 3D points are near their true positions and the space is more uniformly distributed. We attribute the better results to this more uniform distribution of the space.

Figure 3.3: Structure and motion precision progress. Blue: reconstruction from 5 cameras; green: from 8; red: from 10.
Figure 3.4: Structure and motion of a real scene. Video was captured at resolution 640x480.

Figure 3.5: Sparse reconstruction of the scene. Resolution of the input video was 320x240.
Chapter 4
Dense reconstruction
4.1 Introduction

Until now we have worked with sparse data. Sparse data, in combination with the extrinsic and intrinsic camera parameters from the previous process, can be used to align a virtual world with the real world. We can then simply render a virtual scene from a known camera position and merge it with the original image. The sparsity of the reconstruction and the presence of wrong 3D points are not a problem for applications like this one: the reconstructed 3D points are used only for aligning the worlds, which can be done by the user. If there were occluders in front of the virtual objects, the real and rendered images would have to be merged by the user, because the depths of the pixels in the real image are not known.

For visualization or cultural heritage purposes, a sparse reconstruction is insufficient; a dense reconstruction is preferred.
4.2 Overview

In recent years many algorithms for dense reconstruction have been proposed [17]. In this section I describe the approaches that are relevant to mine.

4.2.1 Volumetric methods

Volumetric methods assume that a bounded area in which the objects of interest lie is known. The 3D space is then filled with 3D points/voxels such that the projections of these points/voxels onto the cameras have the same color, i.e. are photo-consistent [19]. In practice it is not that simple, as visibility and many other aspects of the 3D points have to be known. There are many voxel-based methods; a more detailed overview of volumetric methods can be found in [29].

In my approach I use some ideas from two voxel-based algorithms: voxel coloring [25] and space carving [19]. Voxel coloring searches for 3D points that are photo-consistent with all cameras that can see them. Unlike voxel coloring, space carving starts with some initial reconstruction of the scene, for example a cube, and carves away photo-inconsistent points. Recall that the bounds of the area of interest have to be known.

The mathematical background and theory can be found in [19]. Note that both voxel-based methods need to know the visibility or occlusion of voxels to return good results; [26] and [19] deal with how to traverse the space to get a good reconstruction for any camera motion. The quality, and also the numerical complexity, depends on the resolution of the voxel space.
4.2.2 Stereo methods

Unlike volumetric methods, stereo methods use image space to generate the 3D space, using per-pixel correspondences. Stereo algorithms use the epipolar constraint (2.1) to restrict the correspondence search to 1D. In a calibrated stereo rig, the two cameras lie in a configuration where the second camera is purely translated with respect to the first one. Note that in such a configuration the epipoles lie at infinity and the epipolar lines are therefore parallel. Consider the case where the position of the second camera is obtained by pure translation along the X axis of the image space of the first camera. In this configuration the matching process is performed row by row, which brings two advantages: (1) two matching pixels can be expressed as the signed distance between them (called disparity), which saves memory, and (2) we access neighboring blocks of memory, which reduces cache misses and leads to better performance. In general, if we know the fundamental matrix, we can unwarp the space so that the epipoles lie at infinity too. This process is called rectification; for details see section 4.3.2.

The goal is to find which subset of points on one epipolar line matches which subset on the corresponding epipolar line in the other image; in rectified images we work with image scan lines. Finding such subsets can be transformed into a path-search problem and solved using dynamic programming (see figure 4.1). Dynamic programming allows us to incorporate further constraints, such as preserving the order of neighboring pixels, bidirectional uniqueness of matches, and occlusion detection [23], [24]. Although many other algorithms are available, I selected the dynamic programming approach because it provides a good trade-off between quality and speed. Note that each row/line can be matched independently and thus in parallel.

Figure 4.1: Searching for matching pairs as a path-search problem. Cut of the scene on the left, occlusion diagram on the right.

Having more images offers more constraints. [27] rectifies all images against the first view and, instead of directly finding matches, searches for the correct depth. This is a good idea, since disparity changes from view to view but depth can be used as a common search index. For wide-baseline cameras, [28] presents a probabilistic method with good mathematical background and results. Other algorithms, like [21] or the one in [1], merge disparity maps from more camera pairs to obtain a very dense depth map.

A more detailed description of the stereo algorithm and my implementation follows in the next sections.
4.3 Novel algorithm

One of the biggest disadvantages of volumetric methods is that the scene bounding area and the physical properties of the materials have to be known; it is also hard to handle occlusions, and the result is sensitive to noise. The already known scene structure is not used, and it is hard to introduce the epipolar constraint. On the other hand, most stereo algorithms were built to extract a disparity map without any knowledge of the scene structure and camera motion; for these only the fundamental matrix has to be known.

Figure 4.2: Stereo ambiguities. It is not possible to detect the true matching from these two views.
4.3.1 Design

It is not possible to uniquely reconstruct a 3D scene from two views (see figure 4.2). If there are homogeneous regions of a single color, we cannot know which two pixels match in the real world. There are many photo-consistent solutions, and even the human brain is not able to find the correct one (see figure 4.3). Because of that, all we can do is use heuristics to get the "most real" looking scene. Remember that we can use more than two views and thus have more information. In this work a scene hypothesis is built from two views and then merged with the global scene; hypotheses are either broken or confirmed. If there is some uncertainty in the two-view reconstruction, all pixels that cannot be reconstructed uniquely are removed. Since this is not easy to detect, we need to remove all invalid (photo-inconsistent) 3D points in the merging process too.

There are many algorithms that try to reconstruct the scene from only two views. They can achieve better results for two views, but at the cost of speed. This partial solution is filtered and merged with the global space, which is also refined to achieve better precision.

The requirements for the algorithm were:
- To use the already known scene structure: as these 3D points are photo-consistent and have already passed many tests, the surface passes near them.
- To use the camera motion: we can extract two-view geometry from it and also use it for photo-consistency checks.
- To allow the algorithm to be distributed over more CPUs/GPUs to achieve better response time; the aim is real time.
- To allow the user to define scene constraints, like "this is a plane" or "this is an empty area". With a real-time or near real-time response, such an approach can be a very powerful and precise photo-modeling tool.

In this work I use the dynamic programming approach, since with it I can exploit all the constraints and all of these requirements are met.

Figure 4.3: A human cannot match pixels in the background. The background must be removed.
4.3.2 Rectification

The rectification process warps image pairs so that the epipolar lines coincide with the image scan lines. We can achieve this by finding a transformation that moves the epipoles to infinity. The warping must be done so that all pixels in the image are preserved and the size of the rectified image is minimal. Finding such a transformation seems complicated until we realize that after transforming the epipoles to infinity, the epipolar lines become parallel. If we take two corresponding epipolar lines, we can transfer the image pixels on them to rows. By doing this for every pair of epipolar lines, we get an image that satisfies the rectification condition. This method can also handle cases where the epipoles lie inside the image.

In my work I extract a bilinearly filtered image segment under each epipolar line and transfer it to the output rectified image. My algorithm is similar to the one used in [1]; the difference is in the way the epipolar lines are selected. A simple description follows:

Figure 4.4: Area visible from both cameras. Black lines define left-camera visibility and red lines right-camera visibility. The minimal area is selected.
- Enumeration of the resolution of the resulting rectified images. The image height corresponds to the number of epipolar lines used. First, the area visible in both images is found (see figure 4.4). In my implementation I rotate the first epipolar line around its epipole by an angle calculated so that no pixel compression occurs (see figure 4.5). The angle is calculated in both images and the minimal one is used. The image width corresponds to the difference between the outer (max X) and inner (min X) radii of the circles passing through the most distant and the nearest image pixel (see figure 4.5). Note that for camera pairs where the epipole lies inside the image, it suffices to select any epipolar line and rotate it by 180°.
- Rows of the rectified image are built from the intersections between the epipolar lines and the images. The intersection of an epipolar line and an image can be empty, 1 pixel, or a line segment:
  1. Empty intersection: this case cannot occur, since we process only the visible part of the image.
  2. 1 pixel: a bilinearly filtered pixel is read from the original image and stored into the row.
  3. Line segment: a line rasterization algorithm is used to traverse the pixels in the input image. Each traversed pixel is read and placed into the row of the rectified image (see figure 4.6). Note that to increase the quality of the rectified image, we perform bilinear filtering when reading pixels.

Figure 4.5: Rectification process. The width of the rectified image is the difference between the max X and min X radii. The epipolar line is rotated so that it travels at most 1 pixel on the outer circumference (with max X radius).
Note that the lines are rasterized in the direction from the position of the epipole to the outer radius. For configurations with the epipole inside the image we use an oriented fundamental matrix. This can be calculated from any fundamental matrix by correcting its orientation using one known corresponding feature pair, as follows: let l = Fm and l' = F^T m'; if the sign of l · m' differs from the sign of l' · m, then the sign of F must be changed.

The back transformation from the rectified image to the normal image can be expressed as: y is the angle to the first visible epipolar line, and x is the distance from the epipole decreased by the inner radius.

Figure 4.6: Row of the rectified image. Each row corresponds to some epipolar line and its intersection with the image.
4.3.3 Initial disparity map

Consider two rectified images. We know that for any selected point we can find (if it is not occluded) its corresponding point on the same scan line in the second image. The speed of the algorithm depends on the search region; in my thesis I use the already known structure and feature correspondences to find it.

After the two input images from two different views are rectified, we find all common feature points and their matches. Note that the feature points need to be transferred to the rectified space. Delaunay triangulation is then used in the first rectified image to triangulate these points (see figure 4.7). The triangulation is transferred to the second image by swapping the vertices with their corresponding vertices. Since the feature matching is known, the algorithm knows the disparities of all these vertices; note that the disparity stored in the vertices is known with sub-pixel precision.

In the next step the algorithm rasterizes the triangles into a matrix of float values, such that each point inside a triangle A, B, C with disparities D(A), D(B), D(C) gets the disparity u D(A) + v D(B) + w D(C), where u + v + w = 1 and u, v, w are the barycentric coordinates of the point, as sketched below.
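A sketch of this rasterization step, interpolating vertex disparities with barycentric weights inside one triangle; the bounding-box scan and the variable names are my own illustration.

```python
import numpy as np

def rasterize_triangle_disparity(grid, A, B, C, dA, dB, dC):
    """Fill a float disparity matrix inside one triangle of the initial mesh.

    A, B, C: 2D vertex positions in the rectified image; dA, dB, dC their
    sub-pixel disparities. Each covered pixel gets u*dA + v*dB + w*dC
    with u + v + w = 1 the barycentric coordinates of the pixel.
    """
    h, w = grid.shape
    x0 = max(int(min(A[0], B[0], C[0])), 0)
    x1 = min(int(max(A[0], B[0], C[0])), w - 1)
    y0 = max(int(min(A[1], B[1], C[1])), 0)
    y1 = min(int(max(A[1], B[1], C[1])), h - 1)
    # Columns of T are the triangle edge vectors B-A and C-A.
    T = np.array([[B[0] - A[0], C[0] - A[0]],
                  [B[1] - A[1], C[1] - A[1]]])
    Tinv = np.linalg.inv(T)
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            v, wgt = Tinv @ np.array([x - A[0], y - A[1]])
            u = 1.0 - v - wgt
            if u >= 0 and v >= 0 and wgt >= 0:   # pixel lies inside
                grid[y, x] = u * dA + v * dB + wgt * dC
    return grid
```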
In preprocessing, the algorithm marks the triangles that correlate with their corresponding triangle in the second view, and also the triangles where the ratio of triangle areas is too far from 1.

If we have constraints from the user, such as "this is a plane", then we rasterize that plane into the disparity matrix.
4.3.4 Refining the disparity map using dynamic programming

The disparity refinement process uses the two rectified images (left and right), the initial disparity map, and the known structure and motion. Since we work in rectified space, we can treat each scan line independently. For each pixel in the left scan line the algorithm searches for the corresponding pixel in the right scan line. From the initial disparity map we have an estimate of the position of the corresponding pixel; because the initial disparity map can differ from the true disparity, the algorithm searches in a neighborhood of the estimated position.

The algorithm can be described in a few steps:

- Detect homogeneous areas and reduce noise, using a mean filter, in both scan lines.
- Correlate each pixel from the left scan line with all pixels within a predefined radius around the estimated position in the right scan line. A bigger number means a better match; correlations under a predefined threshold are clamped to zero.
- Build a grid graph of size image_width x search_range.
- Weight each edge in the graph as follows:
  - Vertical and horizontal edges have weight zero.
  - The edge between nodes (i, j) and (i+1, j+1) has weight equal to the correlation between the i-th and j-th pixels of the left and right scan lines.
  - Increase the weights of those diagonal edges where the i-th and j-th pixels lie in triangles marked as correlating and j is the choice of the initial disparity.
  - Reconstruct the 3D position from the hypothetical match and test its photo-consistency; decrease the weight if it is not consistent. In this step constraints like "this is an empty area" can be tested.
  - Zero the weights of those diagonal edges where the i-th and j-th pixels lie in triangles where the ratio of the triangle areas is not near 1.
- Find the most expensive path using dynamic programming (a path-search algorithm), as sketched below.

An example of a refined disparity map is shown in figure 4.7.

Figure 4.7: Refined disparity map. Triangle vertices correspond to points for which a corresponding 3D vertex is known from the scene structure.
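A sketch of the path search over the grid graph from the list above, in its simplest form: horizontal and vertical moves (skipped, possibly occluded pixels) contribute nothing, a diagonal move collects the correlation score, and the matching is read back by following the maximizing decisions.

```python
import numpy as np

def best_path(corr):
    """Most expensive monotone path through a correlation grid.

    corr[i, j] is the (clamped) correlation between pixel i of the left
    scan line and candidate j in the right scan line, after the weight
    adjustments from the list above. Returns the accumulated score
    table; the matching is recovered by backtracking the argmax choices.
    """
    n, m = corr.shape
    score = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(
                score[i - 1, j],                       # skip a left pixel
                score[i, j - 1],                       # skip a right pixel
                score[i - 1, j - 1] + corr[i - 1, j - 1])  # match i with j
    return score
```

Because each scan line is independent, calls to this routine can run in parallel, which is what makes the rectified-space formulation attractive for the real-time goal.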
4.3.5 Triangulation

The next step is triangulation of the dense point cloud. This is done by interconnecting neighboring pixels in image space; the 2D positions are then swapped for the 3D coordinates. We need to handle two types of discontinuities: (1) occlusions and (2) small discontinuities caused by noise and the camera's discrete sampling. The second type appears as small or single-point discontinuities with a known neighborhood. These could be removed by interpolating neighboring values, but we do not need to do that, because the triangulation handles it. Triangulation of type (2) is shown in figure 4.8.

Figure 4.8: Triangulation of small disparity discontinuities.

Occlusions and big discontinuities are not triangulated; this space is left undefined in the hope that it can be completed from other views. We also remove triangles whose normal is almost perpendicular to the camera view direction, for example at 85°.

Reconstruction noise can be suppressed by smoothing the disparity map. Smoothing should be performed with respect to edges.
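A sketch of the triangulation rule described in this section: neighboring pixels of the disparity map are connected into two triangles per 2x2 block unless the disparity jump exceeds a threshold, which is treated as an occlusion boundary and left open. The threshold max_jump is illustrative, not a value from the thesis.

```python
import numpy as np

def mesh_from_disparity(disp, max_jump=1.0):
    """Triangulate a dense disparity map by connecting neighboring pixels.

    disp: 2D float array, NaN where the disparity is unknown. Vertices
    are indexed as row * width + column; the returned triangles can then
    have their 2D positions swapped for the reconstructed 3D coordinates.
    """
    h, w = disp.shape
    tris = []
    for y in range(h - 1):
        for x in range(w - 1):
            quad = disp[y:y + 2, x:x + 2]
            if np.isnan(quad).any():
                continue                     # undefined pixel: leave open
            if quad.max() - quad.min() > max_jump:
                continue                     # big discontinuity: occlusion
            a, b = y * w + x, y * w + x + 1
            c, d = (y + 1) * w + x, (y + 1) * w + x + 1
            tris.append((a, b, c))
            tris.append((b, d, c))
    return tris
```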
4.3.6 Multi-view linking

From a two-view reconstruction we get a dense space with its triangulation. Areas that are occluded in both cameras are not reconstructed.

The linking process consists of:

- Refinement of the common space.
- Merging with the new space.
- Updating the triangulation.

The common space is identified using matching transitivity between the new and the old space. All common points are recalculated as in section 3.2.2. Detection of outliers and inconsistencies is similar to the method described in [1].

The new space is merged with respect to the triangulation as follows:

- If two non-parallel triangles intersect, these triangles are subdivided at the intersection (a point or an edge). The new triangles are segmented by triangle normal orientation, and all occluded triangles are removed. The idea of this algorithm is similar to a gift-wrapping algorithm (illustrated in figure 4.9).
- If one triangle occludes another and these triangles do not intersect, then one of the following holds:
  - The two triangles define the same part of the object and are shifted due to noise. This is detected by thresholding the triangle distances in 3D. The space is refined by calculating a weight for each triangle vertex saying how precisely the triangle is calculated; the vertices are then recalculated as in section 3.2.2.
  - The occluded area is photo-inconsistent.
  - The occluder is photo-inconsistent.
  - Both triangles are photo-consistent and outside the testing threshold; such triangles are marked and stay unchanged.
- If two triangles lie one on top of the other, the algorithm removes the one that is completely covered by the other.

There are many other sub-cases, but most of them never occur, since the initial triangulation is filtered of invalid triangles. In the linking process the algorithm removes only those triangles that are occluded or photo-inconsistent.

Figure 4.9: Triangulation merging. Input reconstructions in a) and b), result in c); d) shows triangle orientation, all in one image.

Figure 4.10: Rectification example. Original image pair (top) and rectified image pair (bottom).
4.4 Results

My rectification algorithm unwarps the image so that no pixel compression occurs. We could save image space by removing unused space (see figure 4.6), but then we would break the vertical image continuity: vertical lines would no longer be continuous. An example of a rectified image is in figure 4.10.

Computing the initial disparity is not a time-expensive process, and it reduces the search region, which speeds up the algorithm, since the numerical complexity of the dynamic programming approach is O(width x 2 x search_radius). In our observations, 4% of the image resolution was sufficient; for an image with resolution 640x480 this is around a 20-pixel radius.

Segmentation of homogeneous regions removes noise, so homogeneous regions stay continuous and discontinuities concentrate in occluded areas. In the case where triangles in both views correlate well, the algorithm supports the nodes of the grid that were estimated by the initial process, which means a planar solution is supported. On the other hand, if the ratio of triangle areas in the two views has changed a lot, the space behind these triangles will differ a lot from the planar solution. Testing photo-consistency leads to a performance drop, but it removes photo-inconsistent paths.

With 10 cameras for photo-consistency checks and 2 selected cameras for space reconstruction, the algorithm was able to calculate a float-precision disparity map from 640x480 input in 3 seconds on an AMD AthlonXP 2800+ CPU with 512 MB RAM. All stages (rectification, initial disparity guess, disparity refinement, dense map extraction) were included. See the examples in section 4.4.1.

The triangulation built in 2D space fills gaps, but the reconstruction is affected by noise. Smoothing the disparity maps in combination with a median filter suppresses noise but also deforms edges; it would be better to use some point cloud approximation method.
4.4.1 Examples

Each of these examples was captured by a standard digital hand-held camera at resolution 640x480. Reconstruction time was around 2.5 seconds for each pair of cameras. Merging two reconstructions took under one second.
Figure 4.11: Dense reconstruction of nature. Two merged camera pairs were used to obtain the disparity map and 8 other cameras were used for photo-consistency checks.

Figure 4.12: Dense reconstruction of a human face. 2 cameras were used to obtain the disparity map and 7 cameras for photo-consistency checks. The scene is rendered as a point cloud; each point is drawn as a 2x2-pixel colored splat.
Figure 4.13: Reconstruction process. Frame from the input sequence (top-left), feature tracking (top-right), structure and motion with the reconstructed object (bottom-right), final 3D mesh (bottom-left).

Figure 4.14: Dense reconstruction of the scene from figure 3.2.2. The scene is reconstructed from two views. Video resolution was 320x240.
Figure 4.15: Scene reconstructed using 9 cameras.

Figure 4.16: Novel view of an object with many homogeneous regions. The scene is reconstructed from 2 cameras.
Chapter 5
Experiments
5.1 Radial lens distortion

Optical distortion of the camera lens can move 2D points far from their original position, even by more than 10 pixels. In this experiment I take into account only radial lens distortion, which can be approximated as

\lambda = 1 + \kappa (x^2 + y^2),
x_1 = \lambda x,
y_1 = \lambda y,

where \kappa is the unknown distortion factor.
My algorithm is similar to [10]. In [10] authors modi�ed distortion equation
and this allowed them to modify linear algorithm for calculating fundamental
matrix to calculate distortion too. New equation for F matrix returns more
roots (around 10) and all must be tested. It was also noted that change of radial
distortion equation creates many local minima around global minima.
Our radial lens distortion removal algorithm searches for $\kappa$ directly by minimizing

\sum_i d(m_i, F m_i')^2 + d(F^T m_i, m_i')^2,

where $m_i$, $m_i'$ are corresponding 2D feature points, $F$ is the fundamental matrix and $d(\cdot,\cdot)$ is the distance between a 2D point and a line. For each $\kappa$ we un-distort the feature positions and find a new fundamental matrix. Features are scaled to fit the window. $\kappa$ is found using simulated annealing [18].
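A sketch of this search, reusing undistort from the sketch above and assuming OpenCV's cv2.findFundamentalMat for the linear estimate of F; the cooling schedule and step sizes are illustrative, not the annealing scheme from [18]:

    import numpy as np
    import cv2

    def epipolar_cost(m1, m2, kappa):
        """Symmetric point-to-epipolar-line distance after un-distorting
        both point sets with the candidate kappa. Points are assumed
        already centered and scaled to the window."""
        u1 = np.array([undistort(p, kappa) for p in m1], dtype=np.float64)
        u2 = np.array([undistort(p, kappa) for p in m2], dtype=np.float64)
        F, _ = cv2.findFundamentalMat(u1, u2, cv2.FM_8POINT)
        if F is None:
            return np.inf
        total = 0.0
        for p1, p2 in zip(u1, u2):
            x1 = np.array([p1[0], p1[1], 1.0])
            x2 = np.array([p2[0], p2[1], 1.0])
            l2 = F @ x1    # epipolar line of p1 in image 2
            l1 = F.T @ x2  # epipolar line of p2 in image 1
            total += (x1 @ l1) ** 2 / (l1[0] ** 2 + l1[1] ** 2)
            total += (x2 @ l2) ** 2 / (l2[0] ** 2 + l2[1] ** 2)
        return total

    def anneal_kappa(m1, m2, steps=200, t0=0.05, seed=0):
        """Simulated annealing over the single parameter kappa; F is
        re-estimated for every candidate, which dominates the cost."""
        rng = np.random.default_rng(seed)
        kappa = 0.0
        best = epipolar_cost(m1, m2, kappa)
        t = t0
        for _ in range(steps):
            cand = kappa + rng.normal(0.0, t)
            c = epipolar_cost(m1, m2, cand)
            # accept improvements, and worse moves with Boltzmann probability
            if c < best or rng.random() < np.exp(-(c - best) / max(t, 1e-9)):
                kappa, best = cand, c
            t *= 0.98  # geometric cooling
        return kappa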
Figure 5.1: Radial distortion test. Left: original; right: undistorted. The cost function graph for each feature point is shown below.
The algorithm assumes that $\kappa$ is equal for two neighboring cameras. $\kappa$ for the i-th camera is approximated by averaging the values obtained from both pairs (with the (i-1)-th and the (i+1)-th camera). Experiments showed that this algorithm does not find a good $\kappa$ for small motion, and similarly when the radial distortion is higher, $|\kappa| > 0.2$.
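For illustration, the per-camera estimate is then just the mean of the two pairwise estimates (a trivial sketch with assumed names):

    def kappa_for_camera(i, pair_kappa):
        """kappa for the i-th camera as the mean of the estimates from
        the (i-1, i) and (i, i+1) camera pairs."""
        return 0.5 * (pair_kappa[(i - 1, i)] + pair_kappa[(i, i + 1)])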
5.2 Results
The radial distortion algorithm was tested on grid patterns and real images. I did not aim to find a perfect calibration, because a global nonlinear minimization, such as bundle adjustment, can correct radial distortion too. The numerical complexity of the method depends on the number of tested feature pairs, because the fundamental matrix is recalculated in each iteration. Our experience shows that a linear estimate of the fundamental matrix suffices. For 200 feature points we estimate the radial distortion in under 1 second on a 2.8 GHz AMD CPU powered machine.
For a grid pattern with resolution 512x512 the error against the ground truth was 6 pixels at the image corners before correction and fewer than 1.64 pixels (measured on the feature points) after correction. See Figure 5.1.
Chapter 6
Conclusion and future work
This thesis presented a sequential approach for creating calibrated motion and structure from an un-calibrated video sequence, together with a dense 3D reconstruction of the space. Sequential processing allows us to process the input video directly from the camera stream. The biggest advantage of processing from a stream is that we can skip storing to disk and video compression, which leads to better quality (due to the uncompressed transfer).
Because there is always noise in the images, it is not good to calculate the camera position from only two views. Therefore, in the future it would be better to improve the camera projection matrix calculation using more images, perhaps with a factorization approach. My experience with a real camera also showed that if the principal point is not at the image centre, the scene stays skewed even after self-calibration. For cheap hand-held cameras it is unrealistic to expect the principal point to be at the image center. Allowing the principal point to be constant (non-zero) or to vary leads to a non-linear self-calibration algorithm [4].
Since there are always many ambiguities in the dense reconstruction process, I recommend an algorithm where the user can control the pipeline. This is possible and effective when the algorithm responds to user interaction promptly. The slowest part of my reconstruction pipeline is the dense reconstruction algorithm. Because it works in rectified space, several scan lines can be processed in parallel, and thus the response time can be reduced.
More work also needs to be done on the triangulation process. It would be better to extract a dense point cloud and generate the surface using point cloud approximation algorithms.
Appendix A
Resources
The results of my work, together with some additional materials, are included on the enclosed compact disc. You can find there:

- The text of this master thesis in PDF format.
- The paper submitted to the CESCG 2005 in Budmerice.
- The paper submitted to the ŠVK 2005, FMFI UK, Bratislava.
- Images with results from various parts of the algorithm.
- Videos of the reconstructed scenes.
- Test sets: generated, captured, and standard video and images.
- Papers relevant to my thesis, including most of those referenced in the bibliography.
- Source code of some parts.
Online resources
More recent versions, updates and results can be found on the homepage of DataExpert, s.r.o., http://www.dataexpert.sk.
Bibliography
[1] M. Pollefeys - L. Van Gool - M. Vergauwen - F. Verbiest - K. Cornelis - J. Tops - R. Koch. Visual modeling with a hand-held camera, International Journal of Computer Vision, 59(3), 207-232, 2004.
[2] M. Han - T. Kanade. Creating 3D Models with Uncalibrated Cameras, Proceedings of the IEEE Computer Society Workshop on the Application of Computer Vision (WACV2000), December 2000.
[3] P. Sturm - B. Triggs. A Factorization Based Algorithm for Multi-Image Projective Structure and Motion, 4th European Conference on Computer Vision, Cambridge, England, April 1996, pp. 709-720.
[4] R. Hartley - A. Zisserman. Multiple View Geometry in Computer Vision, Second Edition, Cambridge University Press, UK, March 2004.
[5] A. W. Fitzgibbon - A. Zisserman. Automatic Camera Tracking, Robotics Research Group, Department of Engineering Science, University of Oxford, UK.
[6] S. Birchfield. KLT: An Implementation of the Kanade-Lucas-Tomasi Feature Tracker, Stanford University, http://vision.stanford.edu/~birch
[7] C. Harris - M. Stephens. A combined corner and edge detector, Fourth Alvey Vision Conference, pp. 147-151, 1988.
[8] M. Fischler - R. Bolles. Random Sample Consensus: A Paradigm for Model Fitting, Communications of the ACM, 24(6), 381-395, 1981.
[9] R. Hartley. In defense of the eight-point algorithm, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(6):580-593, June 1997.
[10] A. W. Fitzgibbon. Simultaneous linear estimation of multiple view geometry and lens distortion, Department of Engineering Science, University of Oxford, UK.
[11] W. Press - S. Teukolsky - W. Vetterling. Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1992.
[12] B. Triggs - P. McLauchlan - R. Hartley - A. Fitzgibbon. Bundle Adjustment - A Modern Synthesis, Vision Algorithms: Theory and Practice, Springer Verlag, 298-375, 2000.
[13] M. I. A. Lourakis - A. A. Argyros. The Design and Implementation of a Generic Sparse Bundle Adjustment Software Package Based on the Levenberg-Marquardt Algorithm, Institute of Computer Science - FORTH, Heraklion, Crete, Greece, August 2004.
[14] B. Triggs. The Absolute Quadric, Proc. 1997 Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, pp. 609-617, 1997.
[15] M. Pollefeys - R. Koch - L. Van Gool. Self-Calibration and Metric Reconstruction in spite of Varying and Unknown Intrinsic Camera Parameters, International Journal of Computer Vision, Kluwer Academic Publishers, Boston, 1998.
[16] P. Beardsley - A. Zisserman - D. Murray. Sequential Updating of Projective and Affine Structure from Motion, International Journal of Computer Vision, 23(3), Jun-Jul 1997, pp. 235-259.
[17] M. Pollefeys. 3D Photography, comp290-89 Fall, University of North Carolina, 2004.
[18] V. Kvasnička - J. Pospíchal - P. Tiňo. Evolučné algoritmy, Slovak Technical University, Bratislava, 2000.
[19] K. N. Kutulakos - S. M. Seitz. A Theory of Shape by Space Carving, Proc. Seventh Int'l Conf. Computer Vision, vol. 1, pp. 307-314, 1999.
[20] V. Kolmogorov - R. Zabih. Multi-Camera Scene Reconstruction via Graph Cuts, Proc. Seventh European Conf. Computer Vision, 2002.
[21] G. Zeng - S. Paris - L. Quan - M. Lhuillier. Surface Reconstruction by Propagating 3D Stereo Data in Multiple 2D Images, Dept. of Computer Science, HKUST, Clear Water Bay, Kowloon, Hong Kong.
[22] M. Lhuillier - L. Quan. A Quasi-Dense Approach to Surface Reconstruction from Uncalibrated Images, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 27, No. 3, March 2005.
[23] I. Cox - S. Hingorani - S. Rao. A Maximum Likelihood Stereo Algorithm, Computer Vision and Image Understanding, Vol. 63, No. 3, 1996.
[24] S. Birchfield - C. Tomasi. Depth Discontinuities by Pixel-to-Pixel Stereo, International Journal of Computer Vision, 35(3): 269-293, December 1999.
[25] S. M. Seitz - C. R. Dyer. Photorealistic Scene Reconstruction by Voxel Coloring, International Journal of Computer Vision, 35(2), 151-173, 1999.
[26] W. B. Culbertson - T. Malzbender - G. Slabaugh. Generalized Voxel Coloring, Workshop on Vision Algorithms: Theory and Practice, Corfu, Greece, 1999.
[27] M. Okutomi - T. Kanade. A multiple baseline stereo, IEEE Trans. on Pattern Analysis and Machine Intelligence, 15, 1993.
[28] C. Strecha - R. Fransens - L. Van Gool. Wide-baseline Stereo from Multiple Views: a Probabilistic Account, ESAT-PSI, University of Leuven, Belgium.
[29] C. R. Dyer. Volumetric scene reconstruction from multiple views, FIU01, 2001, 469-489.