8th November 2013
MonoSLAM using SURF Experimentation Project
Student Rik Bosma
3864669
Supervisor Robby Tan
Abstract
Augmented reality systems require position and orientation information of the camera in use. In robotics this problem is known as localization, and multiple researchers have solved it using simultaneous localization and mapping (SLAM) methods. Since augmented reality commonly relies on cameras, it is worth looking at visual SLAM methods to acquire more knowledge about localization and computer vision.
The method considered during this experimentation project is MonoSLAM. Its distinguishing property is that a single camera is used to locate recognizable points in 3D, which is achieved by estimating the depth of a feature over multiple frames. However, this process of feature tracking and depth estimation fails after some time because the feature detection and matching are not robust enough.
To deal with this problem, Speeded Up Robust Features (SURF) are implemented in an attempt to obtain a more robust feature detector and matcher, and hence a more robust localization. The conclusions are positive: the feature detection and matching are more robust, although mismatches can still occur when many features are considered for matching.
Contents
Chapter 1: Introduction
Chapter 2: Related work
Chapter 3: Theory
  MonoSLAM
  Limits
  Hypotheses
  SURF
Chapter 4: Experimental results and evaluation
  MonoSLAM using SURF Implementation
  Observations
  Performance
  Experiments
  Results
  Evaluation
Chapter 5: Conclusion
Chapter 6: Future work
Appendix
References
Chapter 1: Introduction
One of the trends in the field of computer vision is mixed reality, which is about merging the real world and a virtual world: either handling real-world user input in virtual worlds (augmented virtuality), as the PS Move and Kinect do, or merging a virtual environment into the real world (augmented reality), as Layar applications do. The ideal augmented reality situation is a user who can wander around freely in an augmented environment. To achieve this, an absolute positioning system with six degrees of freedom is required. If such a positioning system worked accurately, the possibilities would only be limited by computing power and human creativity, because we would be able to create our own reality within the real world. Imagine the possibilities!
The main goal of the project was to acquire more knowledge of computer vision, with a focus on visual simultaneous localization and mapping (SLAM). The paper and source code of MonoSLAM (Davison, Reid, Molton, & Stasse, 2007) have been used as the basis to analyze this specific method. During the project the Good Features to Track (GFTT) features have been replaced by SURF features to analyze whether those features perform better in detection and matching.
If this method were improved so that it could handle larger environments, AR applications using a single camera could be realized. This could contribute a lot to mobile applications such as games and navigation apps.
First the details of the MonoSLAM algorithm are analyzed, especially the depth retrieval. Once this was done, the major weaknesses of the algorithm were identified. To improve the algorithm, a more advanced feature detection method has been implemented.
The experiments focus on the SURF features in order to test the hypotheses. First of all, the robustness of the features in terms of repeatability is examined. In addition, the robustness of the feature matching during axial translations is analyzed.
In the next chapter related work is considered. Chapter 3 covers the theory behind the MonoSLAM algorithm and the SURF features, and also explains how the SURF features are implemented in the algorithm. In chapter 4 the experiments are described and the results are shown. The limits of MonoSLAM and the conclusions of the thesis are discussed in chapter 5.
Chapter 2: Related work
For several decades navigation has been a central problem within the field of robotics. A number of solutions have been developed over the years; one of the directions many researchers are pursuing is simultaneous localization and mapping (SLAM), as proposed by (Leonard & Durrant-Whyte, 1991).
The idea is that measured points are used to build a map of the environment, and that the change of those measured points relative to the robot is used to approximate the position of the robot.
MonoSLAM focuses on visual SLAM, where the measured points are provided by an algorithm that uses camera images as its sensor input. The points in an image are selected using a feature detector as described by (Shi & Tomasi, 1994), combined with prediction algorithms that predict where a feature should be based on an approximated velocity (Davison, 2003).
However, the feature detector used in the MonoSLAM algorithm is relatively old, and several newer feature detectors might outperform it. One of them, the Speeded Up Robust Features (SURF) detector (Bay, Tuytelaars, & van Gool, 2006), is analyzed in this report.
Chapter 3: Theory
The MonoSLAM algorithm and the SURF features are discussed in this chapter. MonoSLAM is the main reference used during the experimentation project and is explained in detail. To deal with some of its limitations, SURF features are implemented; these are briefly discussed here as well.
MonoSLAM
In this section the MonoSLAM algorithm and each of its steps are discussed. The depth of each feature has to be estimated; therefore there are two types of feature states: partially initialized features, for which the depth has not yet been properly estimated, and fully initialized features, which do have a properly estimated depth.
Algorithm
The main part of the algorithm is an endless loop in which an image is retrieved from a webcam, processed by the MonoSLAM sub-algorithm, and finally everything is rendered. The focus here is on the MonoSLAM sub-algorithm, as shown in Figure 1.
When an image frame from the webcam is retrieved, a MonoSLAM step is performed. First the prediction step is performed to predict the next state of the camera. For this prediction a constant velocity, constant angular velocity motion model is used. The camera represents our position in the world.
Then, based on the prediction, a number of features is selected for the measurements of the SLAM part of the algorithm. Each fully initialized feature is tested for whether it should be visible; if so, the feature is selected, until a specified number of features has been selected.
For the selected features the image is searched within a certain search radius. If a feature is found, its position is updated. When all selected features have been processed, the map is updated.
Then the Kalman filter is updated and the camera state is estimated and normalized. After that, new features are initialized if the number of visible features is too low. Finally, the partially initialized features are updated.
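As a structural sketch, one step of this loop could look roughly as follows in C++. All type and function names here are hypothetical, heavily simplified stand-ins for the corresponding SceneLib2 routines; the block only outlines the control flow of Figure 1, not the actual implementation.

```cpp
#include <opencv2/core/core.hpp>
#include <vector>

// Hypothetical, heavily simplified types; the real SceneLib2 classes are far richer.
struct CameraState { cv::Vec3d r, v, w; cv::Vec4d q; };
struct Feature     { cv::Vec3d y; bool fully_initialized = false; };
struct FeatureMap  { std::vector<Feature> features; };

// Empty stubs standing in for the routines described in the text (not SceneLib2's API).
void PredictMotion(CameraState&) {}                        // constant velocity/angular velocity prediction
std::vector<Feature*> SelectVisibleFeatures(FeatureMap&, const CameraState&, std::size_t) { return {}; }
bool MeasureFeature(const cv::Mat&, const CameraState&, Feature&) { return false; }
void UpdateFilter(CameraState&, FeatureMap&, const std::vector<Feature*>&) {}
std::size_t CountVisible(const FeatureMap&, const CameraState&) { return 0; }
void InitializeNewFeatures(const cv::Mat&, FeatureMap&, const CameraState&) {}
void UpdatePartialFeatures(const cv::Mat&, FeatureMap&, const CameraState&) {}

// One MonoSLAM step, mirroring the six stages of Figure 1.
void MonoSlamStep(const cv::Mat& frame, CameraState& state, FeatureMap& map) {
    PredictMotion(state);                                              // 1. Kalman filter prediction
    std::vector<Feature*> sel = SelectVisibleFeatures(map, state, 12); // 2. select features to measure
    for (Feature* f : sel) MeasureFeature(frame, state, *f);           // 3. search predicted locations
    UpdateFilter(state, map, sel);                                     // 4. Kalman filter update
    if (CountVisible(map, state) < 12)                                 // 5. initialize new feature(s)
        InitializeNewFeatures(frame, map, state);
    UpdatePartialFeatures(frame, map, state);                          // 6. update partial features
}
```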
Virtual representation
The virtual representation of the real world consists of a camera model, which represents the webcam, and a map, which is represented by a combination of a state vector and a covariance matrix filled with the camera state and the features.
Camera
The camera object contains the attributes needed to perform projection and unprojection operations. The camera is represented by a state defined as:
$\mathbf{y}_i^C = \mathbf{R}^{CW}\,(\mathbf{y}_i^W - \mathbf{r}^W)$   (1)
Figure 1 MonoSLAM algorithm (flowchart: Kalman filter prediction step → select features to do measurement → predict locations of selected features → Kalman filter update step → if not enough features are visible, initialize new feature(s) → update partially initialized features).
Where:
$\mathbf{y}_i^W$ = initial camera state,
$\mathbf{r}^W$ = displacement w.r.t. $\mathbf{y}_i^W$,
$\mathbf{R}^{CW}$ = camera orientation.
The initial camera state is the first camera state of an entire sequence. All camera
states are determined relative to this state.
The camera localization is based on image plane measurements. In #REF# a possible case is shown; the numbers on the red triangles represent the sequenced camera states. If the depth of the features is known during both states, then, taking into account the displacement of the features, the new position of the camera can be estimated. The accuracy of the estimation relies heavily on the error in depth, but the impact of this error can be reduced by using more features for this calculation. The image plane measurements are executed during the Kalman filter update, and the features used are the selected fully initialized features as provided by the second step of the algorithm.
For the localization a calibrated pinhole camera model is used. The camera state is the sum of all previous camera displacements, and each image plane measurement of a feature includes a noise term. The mathematical definition of a measurement is $\mathbf{z}_i = \Pi(\mathbf{y}_i^C) + \mathbf{n}_z$, where $\Pi$ denotes the camera projection.
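For reference, a standard calibrated pinhole projection (ignoring lens distortion and up to sign conventions for the image axes, so not necessarily the exact model of the implementation) expands $\Pi$ as:

$\mathbf{z}_i = \begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} u_0 + f k_u \, y^C_{i,x} / y^C_{i,z} \\ v_0 + f k_v \, y^C_{i,y} / y^C_{i,z} \end{pmatrix} + \mathbf{n}_z$

with $(u_0, v_0)$ the principal point, $f$ the focal length and $k_u, k_v$ the number of pixels per unit length.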
State
The state of the camera as stored in the state vector and covariance matrix is represented by the vector:

$\hat{\mathbf{x}}_v = \begin{pmatrix} \mathbf{r}^W \\ \mathbf{q}^{WC} \\ \mathbf{v}^W \\ \boldsymbol{\omega}^C \end{pmatrix}$   (2)

Where:
$\mathbf{q}^{WC}$ = $\mathbf{R}^{CW}$ represented as a quaternion,
$\mathbf{v}^W$ = velocity vector,
$\boldsymbol{\omega}^C$ = angular velocity vector.
The state represents the camera state and adds a velocity and angular velocity
component to it. Those components are predicted by the motion model during the Kalman
filter prediction step.
Map
The probabilistic 3D map is represented by a vector and a covariance matrix. The state
vector consists of the estimated camera state and all fully initialized features. The
mathematical representation is:
$\hat{\mathbf{x}} = \begin{pmatrix} \hat{\mathbf{x}}_v \\ \hat{\mathbf{y}}_1 \\ \hat{\mathbf{y}}_2 \\ \vdots \\ \hat{\mathbf{y}}_n \end{pmatrix}$   (3)

$\hat{\mathbf{y}}_i$ is a 3D position vector representing the position of feature $i$.
Feature tracking
An important part of the MonoSLAM algorithm is feature tracking. The accuracy of the localization relies heavily on the robustness of the tracking. To improve the robustness of the feature detection and matching, a motion model is used to predict where the features "should" be. Robustness also increases when the time step gets smaller; therefore map management is required.
Motion model
The motion model is a constant velocity, constant angular velocity motion model. This means the current average velocities are kept constant, and accelerations are taken into account as Gaussian noise.
The mathematical representation of this model is as follows:
$\begin{bmatrix} \mathbf{r}_{new} \\ \mathbf{q}_{new} \\ \mathbf{v}_{new} \\ \boldsymbol{\omega}_{new} \end{bmatrix} = \begin{bmatrix} \mathbf{r} + (\mathbf{v} + \mathbf{n}_r)\Delta t \\ \mathbf{q} \otimes \mathbf{q}((\boldsymbol{\omega} + \mathbf{n}_\omega)\Delta t) \\ \mathbf{v} + \mathbf{n}_r \\ \boldsymbol{\omega} + \mathbf{n}_\omega \end{bmatrix}$   (4)

Where $\mathbf{n}_r$ and $\mathbf{n}_\omega$ are noise terms.
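A minimal sketch of the deterministic part of this prediction (the noise terms $\mathbf{n}_r$ and $\mathbf{n}_\omega$ only enter through the process noise covariance of the Kalman filter) could look as follows. The quaternion helpers are written out here for clarity and are not SceneLib2's own functions.

```cpp
#include <opencv2/core/core.hpp>
#include <cmath>

// Quaternion stored as (w, x, y, z).
static cv::Vec4d QuatMul(const cv::Vec4d& a, const cv::Vec4d& b) {
    return cv::Vec4d(a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3],
                     a[0]*b[1] + a[1]*b[0] + a[2]*b[3] - a[3]*b[2],
                     a[0]*b[2] - a[1]*b[3] + a[2]*b[0] + a[3]*b[1],
                     a[0]*b[3] + a[1]*b[2] - a[2]*b[1] + a[3]*b[0]);
}

// q(theta): quaternion for a rotation by the axis-angle vector theta = omega * dt.
static cv::Vec4d QuatFromAxisAngle(const cv::Vec3d& theta) {
    double angle = cv::norm(theta);
    if (angle < 1e-12) return cv::Vec4d(1, 0, 0, 0);
    double s = std::sin(0.5 * angle) / angle;
    return cv::Vec4d(std::cos(0.5 * angle), s * theta[0], s * theta[1], s * theta[2]);
}

struct CameraState { cv::Vec3d r, v, w; cv::Vec4d q; };

// Constant velocity, constant angular velocity prediction over a time step dt.
CameraState PredictState(const CameraState& s, double dt) {
    CameraState p = s;
    p.r = s.r + s.v * dt;                                      // r_new = r + v * dt
    cv::Vec4d qn = QuatMul(s.q, QuatFromAxisAngle(s.w * dt));  // q_new = q (x) q(w * dt)
    p.q = qn * (1.0 / cv::norm(qn));                           // keep the quaternion normalized
    // v and w stay constant; n_r and n_w only widen the covariance in the EKF.
    return p;
}
```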
The main goal of the motion model is to make the feature matching more robust. However, because the model uses the previous state to estimate the next step, a cumulative error occurs, which results in drift. When running an implementation of this algorithm, this noise becomes noticeable very quickly and results in many unmatched features. Unmatched features mean a less accurate localization, and a feedback loop is entered because this in turn results in even more unmatched features. At that point the localization is corrupt and no longer usable. This phenomenon can occur within a minute, especially when the calibration target is only partially visible.
Feature detection
The feature detector is based on the work of (Shi & Tomasi, 1994), extended with an identifier and the motion model for more robust matching. However, several adjustments have been applied.
The corner detection works exactly the same as in the GFTT detector, but instead of taking the whole image into account, a small selection is processed. The selection has a size of 80x60 pixels and is placed at a random position within the image. Within this selected area the best feature is detected. This process is repeated to detect the best n features when new features are required by the algorithm; n is a number given by the user representing the number of features the algorithm is allowed to initialize during one frame.
When a feature is detected, it is initialized by retrieving its identifier, and it remains a partially initialized feature as long as its depth still needs to be estimated.
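A sketch of this random-region search is given below, using OpenCV's goodFeaturesToTrack as a stand-in for the custom Shi-Tomasi implementation (the real code computes the corner strength itself); the function and parameter choices here are illustrative.

```cpp
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Try to find one new salient point in a randomly placed 80x60 region of a grayscale image.
// 'existing' holds the image positions of features already in the map (regions must be feature-free).
bool DetectNewFeature(const cv::Mat& gray, const std::vector<cv::Point2f>& existing,
                      cv::RNG& rng, cv::Point2f& out)
{
    const int w = 80, h = 60;
    for (int attempt = 0; attempt < 5; ++attempt) {
        cv::Rect box(rng.uniform(0, gray.cols - w), rng.uniform(0, gray.rows - h), w, h);

        // Reject the region if it already contains a feature.
        bool occupied = false;
        for (std::size_t i = 0; i < existing.size(); ++i)
            if (box.contains(cv::Point(cvRound(existing[i].x), cvRound(existing[i].y)))) {
                occupied = true; break;
            }
        if (occupied) continue;

        // Pick the single strongest Shi-Tomasi corner inside the region.
        std::vector<cv::Point2f> corners;
        cv::goodFeaturesToTrack(gray(box), corners, 1, 0.01, 5.0);
        if (corners.empty()) continue;

        out = corners[0] + cv::Point2f(static_cast<float>(box.x), static_cast<float>(box.y));
        return true;
    }
    return false;  // no valid region found within five tries
}
```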
Feature matching
The matching is performed by searching for the patches' identifiers within a search region. These identifiers are image patches of 11x11 pixels. To avoid mismatches, the feature's new position within the image is estimated using the motion model; a small area around that position is then searched for the feature using the identifier.
The correlation of the identifier with the search region is determined to relocate the feature. Using the relocation data of all tracked features, the new location of the camera is estimated.
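A sketch of such a correlation search using OpenCV's matchTemplate (the original code implements the correlation itself, so the exact score and thresholds here are assumptions):

```cpp
#include <opencv2/imgproc/imgproc.hpp>

// Search for an 11x11 identifier patch inside a square region around the predicted position.
// Returns true and the best-matching position if the correlation score exceeds 'minScore'.
bool RelocateFeature(const cv::Mat& gray, const cv::Mat& patch11x11,
                     cv::Point predicted, int searchRadius, double minScore, cv::Point& found)
{
    // Search window centred on the prediction, clamped to the image borders.
    cv::Rect window(predicted.x - searchRadius - 5, predicted.y - searchRadius - 5,
                    2 * searchRadius + 11, 2 * searchRadius + 11);
    window &= cv::Rect(0, 0, gray.cols, gray.rows);
    if (window.width < 11 || window.height < 11) return false;

    // Normalized correlation of the identifier with every position in the window.
    cv::Mat response;
    cv::matchTemplate(gray(window), patch11x11, response, cv::TM_CCOEFF_NORMED);

    double maxVal; cv::Point maxLoc;
    cv::minMaxLoc(response, 0, &maxVal, 0, &maxLoc);
    if (maxVal < minScore) return false;                                  // not matched this frame

    found = cv::Point(window.x + maxLoc.x + 5, window.y + maxLoc.y + 5);  // patch centre
    return true;
}
```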
Feature depth estimation
During the initialization of a feature, its position and direction relative to the camera are stored. Along the direction of the feature a set of 1D particles is distributed over a certain range; these 1D particles represent a probability density in one dimension. By default there are 100 particles distributed between 0.5 m and 5.0 m. The identifier is retrieved from the image and stored in the feature object.
In the next few frames the algorithm tries to track the feature and updates each particle's probability. After a few frames a peak arises, and when the ratio of the standard deviation of the depth to the estimated depth drops below a certain threshold, the feature is transformed from a partially initialized feature into a fully initialized feature. The fully initialized feature's depth is still uncertain, but during subsequent measurements this uncertainty is reduced.
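As an illustration (not the SceneLib2 code), the particle update can be sketched like this; the likelihood callable is a hypothetical function that would project a depth hypothesis into the current image and score the identifier patch there, and the conversion threshold value is an assumption.

```cpp
#include <cmath>
#include <vector>

struct DepthParticle { double depth; double weight; };

// Create 100 uniformly weighted depth hypotheses between 0.5 m and 5.0 m.
std::vector<DepthParticle> InitDepthParticles() {
    std::vector<DepthParticle> p(100);
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].depth  = 0.5 + 4.5 * static_cast<double>(i) / (p.size() - 1);
        p[i].weight = 1.0 / p.size();
    }
    return p;
}

// One per-frame update: reweight each hypothesis by how well the identifier matches at the
// image location that this depth predicts, then renormalize. Returns true once the depth
// distribution is peaked enough to promote the feature to "fully initialized".
template <typename LikelihoodAt>   // double(double depth) -- hypothetical measurement likelihood
bool UpdateDepthParticles(std::vector<DepthParticle>& particles, LikelihoodAt likelihood,
                          double ratioThreshold /* assumed value */, double& depthEstimate)
{
    double total = 0.0;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].weight *= likelihood(particles[i].depth);
        total += particles[i].weight;
    }
    if (total <= 0.0) return false;
    for (std::size_t i = 0; i < particles.size(); ++i) particles[i].weight /= total;

    // Mean and standard deviation of the depth distribution.
    double mean = 0.0, var = 0.0;
    for (std::size_t i = 0; i < particles.size(); ++i) mean += particles[i].weight * particles[i].depth;
    for (std::size_t i = 0; i < particles.size(); ++i)
        var += particles[i].weight * (particles[i].depth - mean) * (particles[i].depth - mean);

    depthEstimate = mean;
    return std::sqrt(var) / mean < ratioThreshold;   // peaked enough -> fully initialized
}
```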
Map management
An important aspect of the MonoSLAM algorithm is map management. Especially to keep the algorithm real-time, a sparse set of features is required. To manage the map and decide which features may stay and which should be deleted, a kind of "natural selection" is applied to the features: when a fully initialized feature is selected to be measured but fails a number of measurement attempts, the feature is deleted.
Limits
Theoretical limits of the application could be:
- Use of monochrome images instead of color images (ignoring information)
- The Good Features to Track detector (robustness in repeatability and speed)
- Depth estimation (uncertainty might be crucial for localization)
Monochrome images
The features are matched by an identifier, which is an image patch of 11x11 pixels. This identifier contains more information than a plain corner identifier. However, only the pixel intensity is taken into account while matching identifiers, even though color information is also available. (Davison, Reid, Molton, & Stasse, 2007) note that color patches could also be used, but grayscale images are used for performance reasons.
Feature detector
As stated by (Davison, Reid, Molton, & Stasse, 2007), robustness in repeatability is essential for estimating the depth of a feature. However, the robustness of the matching algorithm is also very important, as unmatched features can ruin the localization and thus the algorithm. The whole algorithm relies heavily on the localization, including the feature matching, since the motion model uses the current state to predict the feature locations. This means that if the localization fails, the prediction fails, and many more features become unmatched. Any error in the localization accumulates over time, resulting in drift, which leads to even more unmatched features. The robustness of the feature detection and matching is therefore of extreme importance.
Depth estimation
When a partially initialized feature gets converted into a fully initialized feature, the uncertainty in its depth is still large. In addition, the accuracy of the estimated depth is based on the matched feature positions, which are projected from the camera. Projecting pixels is inaccurate, and combined with features left unmatched due to the motion model's prediction (which takes depth into account), this inaccuracy can lead to the same kind of failure as described for the feature detector.
Hypotheses
The major limitation of the MonoSLAM algorithm arises when mismatches occur. This happens when either the depth is not estimated properly or the feature matching does not work robustly enough.
Depth estimation is based on matching the feature's identifier, so when mismatches occur the depth estimation goes wrong as well. Therefore, considering another feature detector is most beneficial for the algorithm.
The hypotheses are:
1. The SURF detector is more robust in repeatability than MonoSLAM's customized GFTT detector.
2. SURF matching results in fewer unmatched features than MonoSLAM's matching algorithm.
The GFTT detector does not treat a whole frame at once to initialize new features. It randomly chooses a non-overlapping region and picks the "best" spot in it to become a feature. This means not all features can be initialized at once, and strong features might be skipped during the initialization process, because the region containing them happens not to be randomly selected.
The SURF detector treats a whole frame each time and describes all features above a given threshold, resulting in a much more deterministic and robust detection in which no features are skipped.
The SURF detector might also resolve the unmatched feature issues of the feature matching. The current matching method does not take advantage of the feature detector, but searches for correlations with the feature's identifier in the image. A possibly better method is to match all features of two images using their descriptors and to calculate a distance from there. For matching the SURF features to test the second hypothesis, the L1-norm distance is used.
SURF
In the years after the publication of GFTT, researchers developed several new feature detectors. One of them is the SURF detector of (Bay, Tuytelaars, & van Gool, 2006), which is inspired by SIFT features. The strength of this detector is that the features are invariant to scale, rotation and contrast. The detector is also very fast because of its use of an integral image and the Fast-Hessian detector.
Integral image
The integral image is built from the given image. If, for example, the picture is in RGB format, the three channels are first combined into a single intensity value by multiplying each by a constant and summing them. Each entry of the integral image is then this intensity plus the integral values of the pixel above and the pixel to the left, minus the value of the pixel diagonally above-left (so that region is not counted twice).
Sums over smaller boxes within the integral image can also be calculated very efficiently; the sum of pixel intensities within a box is given by:

$\sum = A - B - C + D$   (5)

Where A, B, C and D are the integral image values at the corners bounding the box, as shown in Figure 2.
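A small sketch of both operations, using OpenCV's cv::integral (which produces a summed-area table with one extra row and column) instead of a hand-rolled recurrence:

```cpp
#include <opencv2/imgproc/imgproc.hpp>

// Sum of pixel intensities inside the box [x, x+w) x [y, y+h), using an integral image.
// 'sum' must be the (rows+1) x (cols+1) CV_32S table produced by cv::integral().
int BoxSum(const cv::Mat& sum, int x, int y, int w, int h) {
    // With A, B, C, D the integral values at the four corners bounding the box
    // (top-left, top-right, bottom-left, bottom-right), the box sum is A - B - C + D.
    return sum.at<int>(y, x)            // A
         - sum.at<int>(y, x + w)        // B
         - sum.at<int>(y + h, x)        // C
         + sum.at<int>(y + h, x + w);   // D
}

// Usage sketch:
//   cv::Mat gray = ...;                      // 8-bit grayscale image
//   cv::Mat sum; cv::integral(gray, sum, CV_32S);
//   int s = BoxSum(sum, 10, 20, 9, 9);       // sum over a 9x9 box at (10, 20)
```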
Fast-Hessian detector
The detector determines interest points using the integral image. Using the integral image boxes, only 8 operations are required to compute a box value, independent of the box's size, instead of N×M operations. To calculate a response value for a potential interest point, Haar wavelet responses are computed using several such integral box operations. Calculating Haar wavelets happens very often and would be a serious bottleneck in the algorithm if not performed efficiently.
Figure 2 Integral image (a box bounded by the corner points A, B, C and D, with origin O).
Chapter 4: Experimental results and evaluation
In this chapter the implementation of the code, the experiments and the results are discussed, and the hypotheses are evaluated. The computer used to test the algorithm has an Intel Core i7-2760M CPU @ 2.4 GHz; since the implementation only uses one core under Ubuntu 12.04 LTS, a lot of CPU power is wasted.
MonoSLAM using SURF Implementation
The source code did not need to be built from scratch, because an open source demo implemented by Davison himself is available for download. However, Davison recommends using an updated version of his code implemented by Hanme Kim (Kim, 2013), which is also available for download.
The main reason to use the updated version for the experimentation project is that it supports USB cameras instead of FireWire cameras; the older Linux libraries have also been updated. Furthermore, this implementation has been tested on an Ubuntu 12.04 LTS distribution; therefore all code is implemented and tested on that same distribution. The experiments are also executed on this distribution, to be sure the code works as described by (Davison, Reid, Molton, & Stasse, 2007).
SURF
During the experimentation project a major mistake was made by implementing the SURF detector from scratch. It took two weeks to get it to work properly, while OpenCV has a very efficient implementation ready to use with only a few lines of code. Two lessons were learned here: first, check OpenCV when you are doing computer vision, and second, by implementing it, the inner details of SURF are now fully understood.
Using OpenCV's SURF detector and descriptor, the features of each frame are extracted and provided to the MonoSLAM algorithm as a list of key points and a matrix of descriptors. For matching, OpenCV's Fast Library for Approximate Nearest Neighbors (FLANN) based matcher is used. Note that the CPU versions of the SURF detector, SURF descriptor and FLANN based matcher are used, not the GPU versions.
Within the MonoSLAM algorithm a list of initialized SURF features is stored. When new features have to be initialized, all initialized features are matched against the key points and descriptors provided by the SURF detector. The matching is done using the FLANN matcher implementation in OpenCV, which by default uses L1-norm distances and matches the descriptors of each feature. If available, the best three matches are selected, and the identifier of each patch is checked with a very simple method to avoid mismatches.
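A sketch of this per-frame extraction and matching with the OpenCV 2.4-era API (SURF lives in the nonfree module there) is given below; the threshold value is the one discussed later in this chapter, and the surrounding bookkeeping of the MonoSLAM integration is omitted.

```cpp
#include <opencv2/nonfree/features2d.hpp>      // cv::SURF (nonfree module, OpenCV 2.4)
#include <opencv2/features2d/features2d.hpp>   // cv::FlannBasedMatcher, cv::DMatch
#include <vector>

// Extract SURF key points and descriptors from a grayscale frame and match the descriptors
// of already-initialized features against them with the FLANN based matcher.
void ExtractAndMatch(const cv::Mat& gray, const cv::Mat& trackedDescriptors,
                     std::vector<cv::KeyPoint>& keypoints, cv::Mat& descriptors,
                     std::vector<std::vector<cv::DMatch> >& knnMatches)
{
    cv::SURF surf(8000.0);                      // minimum Hessian threshold used in this project
    surf(gray, cv::Mat(), keypoints, descriptors);

    cv::FlannBasedMatcher matcher;              // default index/search parameters
    if (!trackedDescriptors.empty() && !descriptors.empty())
        matcher.knnMatch(trackedDescriptors, descriptors, knnMatches, 3);  // best three candidates
}
```

In more recent OpenCV versions SURF has moved to the opencv_contrib xfeatures2d module (cv::xfeatures2d::SURF::create), but the structure of the call is the same.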
Observations
Several (re)implementations of MonoSLAM are available. The first implementation tested was a C# reimplementation ([email protected], 2008). However, during runtime strange artifacts appeared as a result of mutilated input frames. The mutilated frames are most likely caused by a bug in the code, but to be sure, the code was compared with the implementation of Hanme Kim. Multiple major differences were found; therefore the C++ reimplementation by Hanme Kim is used for the further analysis of the algorithm and for running the experiments.
Image input
The input images of the C++ implementation do not get mutilated. However, the implementation is not suitable for real augmented reality applications. The image grabber runs in a separate thread, which is a good thing, but it fills a buffer and every frame in this buffer has to be processed; on the other hand, this is better for the localization algorithm. Each image is converted to grayscale and resized to 320x240 pixels.
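The preprocessing itself amounts to little more than the following (a sketch with OpenCV; the real grabber additionally manages the frame buffer):

```cpp
#include <opencv2/imgproc/imgproc.hpp>

// Convert a captured BGR frame to the 320x240 grayscale image the algorithm works on.
cv::Mat Preprocess(const cv::Mat& frame) {
    cv::Mat gray, small;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::resize(gray, small, cv::Size(320, 240));
    return small;
}
```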
Localization
The major goal of the MonoSLAM algorithm is to provide a method for localization. However, when the demo program is run as provided, the camera jumps along the axis of its own view direction. This can be resolved fairly easily by tweaking the MonoSLAM parameters; especially the number of features that should be selected affects the positioning a lot. The difference between a normal USB webcam (Logitech C210) and a wide-angle camera (Logitech C930e) is also noticeable, because fewer features seem to be needed with the wide-angle camera.
After a few seconds of motion it becomes clearly visible that the motion model no longer estimates the new location correctly, and it fails to measure a feature that lies just slightly outside of its elliptical search region. These search ellipses can be shown on the screen if the corresponding checkbox is selected in the user interface. However, the search ellipse drawing method appears to be very expensive, since the frame rate then drops below 2 frames per second. A low frame rate leads to skipped frames, which the algorithm is not able to deal with, so the localization gets corrupted very easily.
There are more ways to corrupt the localization. For example, when the camera slowly turns away and the calibration target is lost, the matching algorithm seems to keep trying to match its corners on the edge of the image. The motion model then apparently tries to predict the next location, which is obviously wrong since the feature is not visible at all.
At the point where the localization gets corrupted, all features start spinning around faster and faster with each time step. The only explanation for this is the motion model, which predicts features to be somewhere they are not, resulting in all features being unmatched. The angular velocity is estimated incorrectly, and the error then magnifies itself each time step. This phenomenon probably has something to do with the depth estimation, which gets corrupted due to the unmatched and wrongly predicted features.
Depth estimation
The depth estimation is essential to the localization and relies heavily on the matching capabilities of the algorithm. As stated in (Davison, Reid, Molton, & Stasse, 2007), any mismatch is critical for the localization, because the position of the feature is then wrong, leaving both the depth estimation and the motion model with a significant error. It is then likely that the localization gets corrupted as described in the paragraph above.
Experimentation scene
During several test rounds it became clear that the MonoSLAM algorithm is not suited for every arbitrary space limited to a range of 5 m. The scene should contain enough potential features and also enough variation in depth between the visible surfaces. Both constraints seem to be very important for doing the localization properly when the camera is translating. Rotating the camera as in the kitchen video (imperialrobotvision, 2010) referenced by (Davison, Reid, Molton, & Stasse, 2007) seems almost impossible.
Performance
The experiment shown in the kitchen video (imperialrobotvision, 2010) referenced by (Davison, Reid, Molton, & Stasse, 2007) uses a small environment with a lot of variation in depth between surfaces, and many recognizable objects are placed within the scene. For the experiments done during this project, a room has been redecorated to obtain similar variations in depth. A picture of the room is shown in Figure 3. There are two calibration targets in the scene; this serves as a test case for both algorithms, to examine their repeatability and the effects on the localization.
Feature detection
The implementation of the MonoSLAM algorithm by Hanme Kim confirms the limitations of the feature detection and tracking. Figure 4 illustrates the feature detection and matching algorithm.
First a region is selected by randomly positioning it within the image. The region is valid if no features lie inside it. If a region does contain a feature, it is randomly repositioned, up to five times; if no valid region is found within five tries, no feature is initialized.
If a valid region is selected, the algorithm checks each pixel within the region for a change in gradients with the pixels to the left and above. The greatest value indicates the strongest corner, and this position is selected as the key point.
When the key point is selected, an identifier in the form of an 11x11 pixel image patch is stored for matching purposes. This identifier is comparable to the descriptor of a SURF feature.
Figure 3 The experimentation room
Note that at most one feature can be retrieved per region selection. This is done to reduce the time needed to detect points within the image and to keep the algorithm real-time; consequently, the algorithm is very slow if several identifiers have to be detected at once. As stated in (Davison, Reid, Molton, & Stasse, 2007), a PC with a 1.6 GHz Pentium M processor needs 4 ms for one feature initialization search. On the PC with the 2.4 GHz CPU, one feature initialization search takes about 1 ms.
Using the SURF detector and descriptor on an input image of 320x240 pixels, the detection takes about 6 ms for ~200 features with a minimum Hessian threshold of 800. However, the MonoSLAM algorithm is too slow to handle 200 features in real time. Therefore a threshold of 8000 is used, resulting in detecting and describing ~30 features within 4 ms.
Feature matching
An image correlation search takes 3 ms for 12 features. With an image correlation search the algorithm searches for the identifier of each feature within a small search region around the location predicted with the motion model. For matching 12 known features against ~30 detected features, the FLANN based matcher takes less than 1 ms, and the matcher considers the entire image.
The FLANN based matcher can be extended with some methods to reduce mismatches. First of all, a radius match can be applied, which only considers two features a match if the distance between their descriptors is less than a certain threshold. Another method is to compare the ratio of the distances of the two best matches against a threshold, accepting the match only when the best candidate is clearly better than the second best.
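Both filters can be applied directly to the k-nearest-neighbour output of the FLANN matcher; a sketch follows (the threshold values are illustrative, not the ones used in the experiments):

```cpp
#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Keep only k-NN matches whose best candidate is both close enough in descriptor space
// (radius criterion) and clearly better than the second-best candidate (ratio criterion).
std::vector<cv::DMatch> FilterMatches(const std::vector<std::vector<cv::DMatch> >& knnMatches,
                                      float maxDistance /* e.g. 0.2f */,
                                      float ratio       /* e.g. 0.7f */)
{
    std::vector<cv::DMatch> good;
    for (std::size_t i = 0; i < knnMatches.size(); ++i) {
        const std::vector<cv::DMatch>& cand = knnMatches[i];
        if (cand.empty()) continue;                                    // no candidate at all
        if (cand[0].distance > maxDistance) continue;                  // radius criterion
        if (cand.size() > 1 && cand[0].distance > ratio * cand[1].distance) continue;  // ratio test
        good.push_back(cand[0]);
    }
    return good;
}
```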
Figure 4 (a) Selecting region (b) Picking feature key point (c) Retrieve identifier
Experiments
Experiments have been conducted to determine the performance of both the GFTT and SURF detectors and of the two matching algorithms. Below, the goal of the experiments is explained and validated, the results are shown, and the hypotheses are evaluated.
Goal
The goal of the experiments is to find out whether the SURF features are superior to the GFTT features in the MonoSLAM application. Therefore the GFTT feature detector and the SURF feature detector are compared. The customized GFTT feature matching algorithm is also compared with a FLANN based matching algorithm.
Validation
The experiments are conducted in a conditioned room, which is shown in Figure 3. There is variation in contrast and depth, and the calibration target is available at a distance of 60 cm. The initialization file for the MonoSLAM algorithm stores the corners of this calibration target together with their identifiers; these are matched in the first frame, so the camera is located correctly in the virtual scene.
The first experiment concerns the detector performance. The question here is how deterministic the detectors are. Therefore the algorithm runs for 30 frames and the positions of the features are output by the program. For the SURF detector 1 frame might produce the same results as 30 frames, but the GFTT detector in the MonoSLAM algorithm only detects 1 feature each frame.
The test sequence has been executed 10 times for each detector. The results are stored in an Excel sheet, which can be found in Appendix #REF#. The data is sorted by position so that unique features are arranged per column.
The second experiment concerns the matching performance. The question here is how well the matching algorithm performs when subjected to viewpoint changes, which are very common in an augmented reality application. Therefore, for each degree of freedom, and for not changing the viewpoint at all, the same experiment was run 3 times.
The test sequence starts as soon as 12 features are initialized. Those 12 features are matched each frame. MonoSLAM might predict that a feature is not visible and will then not attempt to match it. To deal with this, the number of attempts and the number of successful matches are output by the program. Based on those values the ratio of successful to unsuccessful matching attempts can be measured.
However, counting matching attempts is a problem for the FLANN based matcher. The FLANN based matcher has not been fully integrated yet, due to difficulties with the depth retrieval in the MonoSLAM code and the very limited time; this means the MonoSLAM algorithm does not predict for each feature whether it is visible, and if a feature is not visible it simply counts as an unsuccessful match. There are also several ways to cope with mismatches for the FLANN based matcher.
The simplest way to deal with mismatches is the radius match, as used in the experiments. Radius matching only accepts the best matches within a certain distance threshold; if none are found, an unsuccessful attempt is counted. Another common way to reduce mismatches is to calculate the ratio of the two best matches and subject it to a threshold value. The latter method eliminates weak matches and is more reliable, but again, time did not allow implementing this method properly.
The current SURF implementation uses the detector to determine the new position of a feature and partially bypasses the motion model; the identifiers are then checked at the locations given by the SURF detector.
Results
First the data of the experiments is discussed; then the results are shown and discussed as well.
Detector data
When analyzing the data, the first thing noticed is that the GFTT algorithm picks features which already exist, resulting in doubles. Analysis of the code confirms this for features which are detected and partially initialized: the code does not seem to take partially initialized features into account when searching for a salient feature.
Another issue arises here as well, because it is not possible to know which of the doubles are detected in the test sequences. The benefit goes to the GFTT detector, since "a match" is counted for the first detected instance.
The SURF detector also has its limits. Due to the performance of the depth initialization, the threshold at which features are accepted as detected is very high: the minimum Hessian threshold is set to 8000. It is therefore clear that some features are only detected a few times. If the 12 features had been initialized with a threshold of 8000 and relocated with a threshold of 7500, the feature detector would most likely have detected each feature in each frame of this experiment. However, that is not how it is implemented, and therefore the extremely high threshold is used throughout.
The main difference between the data sets is that the features detected by GFTT are assembled over 30 frames, whereas the features of the SURF detector are detected in the 30th frame only.
Matching data
The matching data provide the number of attempts to match a feature and the number of successful attempts measured during one sequence. Each sequence may have a different length, so all measurements are normalized and expressed as percentages of successfully matched features.
Mismatches are not considered, because the identifier of the feature can be used to exclude mismatches for the FLANN based matcher. Furthermore, during one sequence one feature was never attempted to be tracked at all. This feature can be identified in #REF# as (B7 Run 2 F11), both in the "Analysis" and the "Data – B – GFTT" tab. This results in the undefined ratio 0/0, and therefore this feature is not considered in the statistics for that specific run.
Detector statistics
The GFTT detector shows doubles, SURF does not. In Figure 5 and Figure 6 the number of detected features is shown for each detector and each run. The total number of unique features for the GFTT detector is 12 over all runs; for the SURF detector there are 27 unique features.
Figure 5 GFTT features detected during 30 frames. Figure 6 SURF features detected in the 30th frame. (Bar charts of feature counts per run over 10 runs; unique features and doubles are shown separately.)
More importantly, considering unique features only, the GFTT detector has an average detection rate of 74.2% and the SURF detector an average detection rate of 79.6%, which is significantly better even though it is measured over a single frame.
The average count is 8.9 features with a maximum deviation of 4.9 for the GFTT detector. For the SURF detector the average count is 21.5 with a maximum deviation of 4.5. This gives a deviation of 55.1% for GFTT against a deviation of 20.9% for SURF.
… to be continued…
Evaluation
Chapter 5: Conclusion
Chapter 6: Future work
Future work: resolve the mismatches and fix the depth retrieval.
Appendix
References
Bay, H., Tuytelaars, T., & van Gool, L. (2006). SURF: Speeded up robust features. Computer Vision – ECCV 2006 (pp. 404-417). Springer Berlin Heidelberg.
Davison, A. J. (2003). Real-time simultaneous localisation and mapping with a single
camera. Computer Vision, 2003. Proceedings. Ninth IEEE International
Conference on (pp. 1403-1410 vol. 2). IEEE.
Davison, A. J., Reid, I., Molton, N., & Stasse, O. (2007). MonoSLAM: Real-time single
camera SLAM. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
1052-1067.
imperialrobotvision. (2010, November 29). MonoSLAM: Real-Time Single Camera SLAM
[video]. Retrieved from YouTube: http://youtu.be/mimAWVm-0qA
Kim, H. (2013). SceneLib2 - MonoSLAM open-source library. Retrieved from
hanmekim.blogspot.com: http://hanmekim.blogspot.com/2012/10/scenelib2-
monoslam-open-source-library.html
Leonard, J., & Durrant-Whyte, H. (1991). Simultaneous map building and localization for
an autonomous mobile robot. Intelligent Robots and Systems' 91.'Intelligence for
Mechanical Systems, Proceedings IROS'91. IEEE/RSJ International Workshop on (pp.
1442-1447). IEEE.
Shi, J., & Tomasi, C. (1994). Good features to track. Computer Vision and Pattern
Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference
on (pp. 593-600). IEEE.