
Hierarchical Image Geo-Location

On a World-Wide Scale

A Dissertation Presented by

Alexandru Nicolae Vasile

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in

Electrical Engineering

Northeastern University Boston, Massachusetts

December 2014


Keywords: Internet Data, Image Geo-location, Scene Classification, Image Retrieval, 3D Reconstruction, Structure from Motion, 3D Registration, Sensor Fusion, 3D Ladar, Geiger-Mode, Data Filtering, Coincidence Processing


Abstract

Hierarchical Image Geo-Location On a World-Wide Scale

by Alexandru N. Vasile

Doctor of Philosophy in Electrical Engineering

Northeastern University, December 2014

Dr. Octavia Camps, Advisor

There are increasingly vast amounts of imagery and video collected from a variety of sensor modalities. Considering that each individual image may contain considerable amounts of information, the ability to interpret, understand and extract scene information is highly beneficial. In order to enable automated scene understanding, there is a need for an organizing principle to store, visualize and exploit the data. Three-dimensional geometry provides such an organizing principle, as imagery and video have inherent 3D structure and can be associated with geographic coordinates. In this thesis, we leverage multiple large geo-spatial databases to create a 3D world model and develop a hierarchical image geo-location framework using a coarse-to-fine localization approach. Starting at the coarsest level, a query image is geo-located to regions of the world through a probabilistic terrain classification approach using a 6.5 million image Flickr database. Next, a novel medium-scale localization method is developed to rule out most of the regions and establish candidate geo-locations with geo-positioning accuracy at a city level. Results from the combined hierarchical classifier demonstrate a 10% improvement over the current state of the art. A fine-scale geo-location stage was also developed to determine the pose of a query image to street-level geo-positioning accuracy. The fine-scale algorithm introduced an efficient structure-from-motion (SfM) 3D reconstruction approach that scales to city-sized image databases, incorporating ground video imagery as well as aerial video imagery for a more complete 3D city model. The newly developed SfM approach is demonstrated to have an order of magnitude computational speed-up compared to prior work, and is validated to produce a 3D city model that is absolutely geo-located to within 3 meters of 3D Laser radar (Ladar) truth imagery.
The fine geo-location stage was also tested using a 500 image hold-out set and demonstrated to geo-locate close to 80% of query images to within 100 m, exceeding the system goal of street-level geo-positioning accuracy. As a proof of concept, we demonstrate improved image understanding by leveraging the newly developed 3D world model to perform information transference to example query images from other geo-located, labeled data sources.

In support of fine-scale geo-location validation, we also developed an algorithm to process 3D Ladar data using a novel 3D noise filtering technique that is shown to be a significant improvement over the current state of the art, resulting in a 9x improvement in signal-to-noise ratio, a 2-3x improvement in angular and range resolution, a 21% improvement in ground detection and a 5.9x improvement in computational efficiency.


Acknowledgements

I would like to thank my thesis advisor, Professor Octavia Camps, for her advice, guidance and patience throughout the 7-year duration of this research. I would also like to thank my Lincoln Lab group leaders Dr. Richard Heinrichs and Dr. M. Jalal Khan for funding and enthusiastically supporting my research. Many of my colleagues and other researchers helped me in developing the presented algorithms. In particular, Prof. James Hays from Brown University was kind enough to share the image database and baseline code. Also, my former colleague, Karl Ni, was instrumental in getting the funding for the fine-scale geo-location data collects and also contributed to algorithm development. Furthermore, I have had many white-board discussions with my colleague and friend, Luke J. Skelly. Some of the results in this thesis, namely the 3D Ladar processing, were due to our joint collaboration in code and algorithm development. Also, the everyday advice on how to navigate and successfully finish a PhD from my colleague, friend, neighbor and former Master's thesis advisor, Richard Marino, was invaluable for keeping my sanity through the long process. Finally, I would like to thank my family. My wife, Osiris Vasile, and my parents, Stefan and Eliza Vasile, have been a source of constant support, encouragement and inspiration. On numerous occasions, they provided me with advice on how to conduct my thesis research based on their own research experience. Also, my kids, Owen Vasile and Oksana Vasile, brought a lot of joy to my life and kept me well grounded.


Table of Contents

Abstract
List of Figures
List of Tables

1 Introduction
   1.1 Data

2 Coarse Scale Geo-location
   2.1 Background and Related Work
      2.1.1 Determining Geo-spatial Coordinates Given Terrain Type
      2.1.2 Recognizing Terrain Types from an Image
      2.1.3 Towards a Coarse Scale Geo-location Approach
   2.2 Coarse Scale Geo-location Approach
   2.3 Coarse Scale Geo-location Experimental Setup and Results

3 Medium Scale Geo-location
   3.1 Background
   3.2 Learning for Scene Classification and Image Retrieval
   3.3 Medium Scale Geo-location Algorithm and Results

4 Fine Scale Geo-location
   4.1 Background and Related Work
   4.2 Fine Scale Geo-Registration Approach
      4.2.1 3D Reconstruction from Video Imagery
      4.2.2 3D Reconstruction Merge
      4.2.3 Geo-locating a new image
   4.3 Fine Scale Geo-location Results
      4.3.1 Reconstruction and Geo-location of Aerial Imagery
      4.3.2 Reconstruction and Geo-location of Ground Imagery
      4.3.3 Geo-Registration of Combined Aerial-Ground Imagery
      4.3.4 Geo-locating a new image
   4.4 Towards Improved Image and Scene Understanding

5 3D Ladar Processing: An Extension to 2D Image Geo-location
   5.1 Background and Related Work
   5.2 3D Ladar Background
   5.3 3D Ladar Processing Approach
   5.4 3D Filtered Results and Discussion
      5.4.1 Qualitative Results
      5.4.2 Quantitative Results

6 Conclusion
   6.1 Contributions
   6.2 Recent Developments
   6.3 Future Work

References

List of Figures

1.1 3D world-model representation
1.2 Mapping Flickr
1.3 Proposed hierarchical geo-location approach
2.1 Examples of land coverage and terrain type maps
2.2 Images afflicted with various types of noise
2.3 Computing a Gist feature
2.4 Examples of terrain labeled images
2.5 Geographical distribution of photos from Flickr database
2.6 Block diagram of coarse-scale geo-location algorithm
2.7 UNEP Mountains and Tree Cover in Mountain Regions 2002 Database
2.8 Pseudo-code for the first training stage
2.9 Results of processing GLCC land-cover database into 4 general land-types
2.10 Results of processing GLCC land-cover database into 5 general land-types
2.11 Pseudo-code for the proposed coarse geo-location algorithm
3.1 Image attributes that might be used for medium-scale geo-location
3.2 Mean shift clustering of urban-only database using 200 km bandwidth
3.3 Accuracy as function of urban database size for proposed algorithm
3.4 Accuracy of geo-location estimates once lower ranked clusters are considered
3.5 Sample of 2k random data set, showing 30 images
3.6 Geo-location results 1
3.7 Geo-location results 2
3.8 Geo-location results 3
4.1 Example of data used by the 3D reconstruction system
4.2 Structure from motion 3D reconstruction pipeline for video imagery
4.3 3D Geo-location method
4.4 3D Merge method
4.5 Geo-locating a new image
4.6 Aerial 3D reconstruction of 1x1 km area of Lubbock, Texas
4.7 Qualitative geo-registration results of aerial reconstruction
4.8 Quantitative geo-registration results of aerial reconstruction
4.9 Qualitative results of ground reconstruction
4.10 Initial geo-location of ground reconstruction
4.11 Improvement in geo-location after applying 3D Merge algorithm
4.12 Examples of merged aerial-ground reconstruction
4.13 Fine-scale geo-location accuracy for a 500 image test subset
4.14 Towards improved image understanding
5.1 3D Laser Radar (Ladar) system concept
5.2 3D Ladar concept of operations
5.3 Line-of-sight (LOS) coordinate systems for various sensor platforms
5.4 Raw Ladar data showing salt and pepper noise
5.5 Raw 3D Ladar point cloud color-coded by scan pattern induced output variation
5.6 Method for correcting for photon and detector range attenuation effects
5.7 Computation of laser-detector 3D point spread function
5.8 MPSCP algorithm block diagram
5.9 Visual comparison of MAPCP versus MPSCP results
5.10 Coincidence processing quantitative results

List of Tables

2.1 Terrain Coverage Numbers of 48 Contiguous States
2.2 Mapping from USGS land-use terrain classes to a reduced set of terrain classes
2.3 Geo-Label database statistics
2.4 Confusion Matrix for baseline method
2.5 Confusion Matrix for proposed coarse-scale geo-location method
3.1 Medium-scale Geo-Location Confusion Matrix at the city level
4.1 Timing comparison of 3D Reconstruction method to prior state of the art


Chapter 1

Introduction

In the last decade, there has been an explosion in the amount of digital imagery and video. Vast numbers of photos and videos, shot with increasingly high-quality digital cameras and smartphones, can now be accessed via the web using online databases such as Flickr, Facebook, Instagram and YouTube. Though the total number of images and videos is not well known, tens of billions are now accessible on the World Wide Web. Considering that each individual image may contain a considerable amount of information, the ability to interpret, understand and extract scene information is highly beneficial for many communities, including, but not limited to, online social networking sites, intelligence agencies and companies dealing with large-scale data mining.

Image understanding algorithms are designed to take the burden off human analysts and process the data automatically, in a timely manner. With such a vast volume of image data, some organizing principle is needed to enable efficient navigation, understanding and exploitation of these large imagery archives. Fortunately, three-dimensional geometry provides such an organizing principle. For example, suppose we have a set of photos of some ground scene. Those photos represent 2D projections of the 3D world structure onto a variety of image planes. If the geometry of the scene is captured in a 3D map, it can be utilized to mathematically relate the different photos to one another. Moreover, the 3D map connects together data collected by completely different sensors at different times, places and perspectives. For instance, one can relate a photo of a city shot by a ground camera with a corresponding satellite image or a 3D Ladar point cloud. Thus, the 3D map can add a lot of context to a scene and improve scene understanding through the process of information transference from one data modality to another. However, we only get this improvement in scene context and understanding if all these data products are geo-located within the 3D map. Figure 1.1 captures this common 3D world model representation.
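The way a 3D map mathematically relates photos can be made concrete with the pinhole camera model: a single geo-located 3D point projects to predictable pixel locations in any camera whose pose is known. A minimal sketch, where all camera parameters and coordinates are illustrative values rather than anything from the thesis:

```python
import numpy as np

def project(K, R, t, X):
    """Project world point X into pixel coordinates for a camera
    with intrinsics K, rotation R and translation t (x = K[R|t]X)."""
    x_cam = R @ X + t            # world frame -> camera frame
    x_img = K @ x_cam            # camera frame -> image plane
    return x_img[:2] / x_img[2]  # perspective divide

# Hypothetical intrinsics: 1000-pixel focal length, 1280x960 image.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 480.0],
              [   0.0,    0.0,   1.0]])

X = np.array([2.0, 1.0, 0.0])    # a 3D point in a local metric frame

# Two cameras viewing the same point from different positions.
R1, t1 = np.eye(3), np.array([0.0, 0.0, 10.0])   # 10 m in front of the scene
R2, t2 = np.eye(3), np.array([-4.0, 0.0, 10.0])  # shifted 4 m sideways

p1 = project(K, R1, t1, X)   # pixel location of X in photo 1
p2 = project(K, R2, t2, X)   # pixel location of X in photo 2
```

Given enough such point correspondences, the relation can be inverted to recover camera poses and scene structure, which is what the structure-from-motion stage of Chapter 4 does at city scale.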


Fig 1.1 - 3D world-model representation for organizing data from multiple sensing modalities. The 3D map provides a geometrical framework for organizing imagery collected at different times, places and perspectives, enabling improved image context and scene understanding.

Fig 1.2 - Mapping Flickr. A map of geo-tagged images from Flickr, as of April 2011, with data binned into 0.5 x 0.5 degree latitude-longitude squares (about 55x55 km at the Equator) on the Earth's surface [2]. Certain regions, color-coded in magenta, have upwards of 1 million images, translating to thousands of images per square km.


In order to initialize this 3D world model representation, we need to start out with enough imagery that not only has geo-spatial metadata but also samples a wide variety of scenes and locations across the entire world. Fortunately, there is a wealth of imagery available online that has geo-spatial metadata. For example, Flickr has enabled geo-tagging of images since August 2006; within the first day, 1.2 million images were geo-tagged [1]. A more recent map of Flickr from April 2011 in Figure 1.2 reveals an explosion in the number of geo-tagged images [2], with millions of images now available and some locations having upwards of thousands of images per square kilometer.
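The density map in Figure 1.2 amounts to binning geo-tags into fixed latitude-longitude cells and counting photos per cell. A toy sketch of that binning, with made-up coordinates rather than actual Flickr data:

```python
import numpy as np
from collections import Counter

def density_map(coords, cell_deg=0.5):
    """Count geo-tagged photos per cell_deg x cell_deg latitude-longitude
    cell, in the spirit of the Flickr density map of Figure 1.2."""
    counts = Counter()
    for lat, lon in coords:
        # Floor division assigns each geo-tag to its grid cell.
        cell = (int(np.floor(lat / cell_deg)), int(np.floor(lon / cell_deg)))
        counts[cell] += 1
    return counts

# Hypothetical geo-tags: three photos near central Paris, one in Boston.
photos = [(48.86, 2.35), (48.87, 2.29), (48.85, 2.34), (42.36, -71.06)]
cells = density_map(photos)
hotspot = max(cells, key=cells.get)   # densest cell (the Paris one here)
```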

However, most images on the web do not have geo-spatial metadata available, leaving the task up to a human operator to manually annotate the data, which can often be tedious or impractical. In the absence of metadata, we need to rely on scene content to deduce geo-spatial information. Depending on the scene content, such as how many features we might be able to extract and how salient those features are, we can geo-locate an image to various levels of geo-spatial accuracy, such as to a particular continent, region, city, street or even an actual camera pose.

The problem of image geo-location has been addressed by several authors. Most of the geo-location research falls into two categories: (1) localization by landmark recognition using local image features, and (2) localization by similar image retrieval using global features that capture whole-image content. Geo-location by landmark recognition [3][4][5][6] tends to focus on limited image datasets (hundreds of thousands of images) that are already highly localized around a set of landmarks or comprised of images from a single city. Many of these methods apply feature matching and structure-from-motion techniques to estimate camera location and pose. For the problem of geo-location on a world-wide scale using millions of images, direct localization by landmark recognition is not computationally tractable. Another drawback of the above methods is that they require extremely dense sampling of the world, with at least one or two instances that have the exact scene content as the query image. Unfortunately, currently available image databases with world-wide coverage do not have such dense image sampling, leading to poor localization performance.


The second research category, image geo-location by similar image retrieval, holds more promise for geo-location on a world-wide scale. Seminal work by Hays et al. [7][8] demonstrated the feasibility and potential of image localization on a worldwide scale. The method applied a single-stage unsupervised algorithm to a multi-million image world-wide database to directly geo-locate a query image to a set of likely locations in the world. One of the drawbacks of the method was the use of a single-stage classifier, resulting in the need for both a high-dimensional feature space to separate highly complex classes and an unsupervised classification method for computational efficiency. Applying an unsupervised classification method in high dimensions is not ideal, as such methods are known to suffer in classification performance as feature dimensionality increases, due to their inability to discard irrelevant feature dimensions for a given task [8].

To improve on the previously reported methods in [7][8], we propose a hierarchical image geo-location approach. From an algorithmic perspective, a hierarchical geo-location framework has several advantages. Rather than resorting to a high-dimensional feature space to separate highly complex geographic classes in one step, the multiple hierarchical stages solve multiple simpler classification problems, each in a lower dimensional feature space, avoiding the curse of dimensionality [9]. Furthermore, a hierarchical approach has the potential for improved geo-location accuracy by allowing both simple, unsupervised classifiers in the initial stages and more complex classifiers in the later stages. Establishing such a hierarchical framework also makes sense from a computational point of view. Because several models relating to specific locations can exist, comparing all models over the vast space of all possible images may not be computationally feasible. Paring down the search space using coarse geo-location models with rough spatial descriptors on large databases, followed by increasingly complex descriptors applied to reduced-size databases, makes the geo-location problem much more computationally tractable.

From a classification standpoint, the problem of geo-locating an image with no geo-spatial metadata to a city-sized geographic class is very challenging. There are many thousands of such city-sized geographic classes in the world that we need to separate. Besides the sheer number of classes, the boundary between geographic classes (e.g. is this Bangkok or Paris?) is extremely complex because it must divide a spectrum of scene types (indoors, outdoors, close-up, perspective, street, highway, tall and short buildings) that might be present in both locations.

Fig. 1.3 - Proposed hierarchical geo-location approach. The method starts out with a query image and a 3D world model representation composed of several geo-spatial databases with millions of data samples. The coarse-scale geo-location method applies a computationally efficient terrain classifier to the query image in order to reduce the search space. On the resulting reduced-size database, the medium and fine scale geo-location methods apply more complex classifiers in order to obtain improved geo-location accuracy, with eventual localization at the city to street-level scale.

In order to overcome this complex classification problem, in this thesis we propose a novel hierarchical image classification approach that geo-locates a query image of an urban scene to a particular city location in the world. As shown in Figure 1.3, we start out with a query image and a 3D world model representation composed of several large databases, namely the 6.5 million image geo-spatial image database from [7][8], a world-wide land-coverage and terrain type database and a terrain-labeled image database. At the coarse scale, we consider a query image as a whole by extracting rough scene content to assign the image to a land class type, such as urban, forest, coast, country or mountain.


Once a terrain label is obtained, such as ‘urban’ for instance, we can reduce the image and geo-spatial search space by filtering the larger database for images with geo-tags in close proximity to urban areas. This has the effect of reducing the geo-spatial and database search space by anywhere from 70 to 90%. For the medium-scale geo-location method, additional image content is extracted through the use of multiple low-level features per image. To obtain geo-location accuracy at the city level, those features are matched against a pre-computed feature database using a novel supervised classifier to reduce the geographical search space by up to 99%. Once a city location is determined, we further refine the geo-location of the query image to a pose with accuracy at the metric scale using an improved structure-from-motion 3D reconstruction pipeline.
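The coarse-to-fine pruning described above can be sketched as a chain of filters over the database. The schema, field names and values below are invented stand-ins, not the actual system:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    terrain: str    # coarse terrain label, e.g. 'urban'
    city: str       # geo-tag resolved to a city
    feature: float  # stand-in for a real low-level feature vector

# Toy database; real databases hold millions of geo-tagged images.
db = [Entry('urban', 'Boston', 0.9), Entry('forest', 'Lubbock', 0.2),
      Entry('urban', 'Lubbock', 0.4), Entry('urban', 'Boston', 0.8)]

def geolocate(query_terrain, query_feature, database):
    # Coarse stage: the terrain label prunes 70-90% of the database.
    stage1 = [e for e in database if e.terrain == query_terrain]
    # Medium stage: feature matching against the reduced set picks a city.
    best = min(stage1, key=lambda e: abs(e.feature - query_feature))
    # Fine stage (not shown): SfM pose estimation within that city's model.
    return best.city

city = geolocate('urban', 0.85, db)   # -> 'Boston'
```

The point of the design is that each stage only pays its (increasing) per-image cost on the survivors of the previous stage.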

The key contributions of our work are:

1. The development of a new geo-tagged and terrain-labeled large-scale image database to represent the 3D world model, and its application to a novel coarse geo-location method, with terrain classification results that improve on previously reported results by 6%. The coarse geo-location method has several advantages over prior non-hierarchical approaches, namely: (a) the method is robust to noisy geo-labels, (b) the method works in a low dimensional feature space to avoid the curse of dimensionality [9] and (c) the method reduces the database size in order to allow more complex follow-on stages to be computationally tractable.

2. The development of a medium scale geo-location algorithm that improves upon previous image retrieval techniques to geo-locate a query image to city-level accuracy. The hierarchical coarse and medium geo-location framework was tested on a geo-tagged 6.5 million image database and demonstrated a 10% improvement in geo-location accuracy compared to previous methods applied up to city-level geo-location.

3. The development of a fine-scale geo-location approach that is an order of magnitude more computationally efficient, as well as the development of a novel method to pre-process a city database using both aerial and ground video imagery to effectively cover and more uniformly sample a whole city, allowing for the localization of the query image to meter-scale accuracy. The technique is demonstrated with ground video imagery as well as aerial video imagery. Geo-location performance for the reconstructed 3D city model is validated in a systematic manner over a 1x1 km area using 20x40 km 3D Ladar data as truth. Our contribution is to develop: (1) a method to geo-locate images over a wide-scale city area, incorporating both aerial and ground imagery for a more complete city model, and (2) systematic validation of geo-location using wide-area 3D Ladar truth data.

4. A novel method to process noisy 3D Ladar imagery collected from an operational airborne Ladar sensor, in support of fine-scale geo-location validation. The 3D Ladar filtering method is a marked improvement over prior methods in both image quality and speed, with a 9x improvement in signal-to-noise ratio, a 2-3x improvement in angular and range resolution, a 21% improvement in ground detection and a 5.9x improvement in computational efficiency.

The thesis is organized as follows:

- Chapter 2 discusses the coarse-scale geo-location algorithm and presents quantitative results to demonstrate improvement over prior methods.
- Chapter 3 describes the medium scale geo-location algorithm and presents results using the hierarchical coarse and medium scale geo-location to demonstrate improvement over the prior state of the art.
- Chapter 4 describes the fine-scale geo-location method and presents qualitative as well as quantitative results that demonstrate a large speedup compared to prior work as well as high geo-location accuracy, validated against 3D Ladar truth imagery.
- Chapter 5 describes a novel method to process noisy 3D Ladar imagery, followed by a qualitative imagery comparison as well as quantitative metrics that demonstrate a large improvement over the prior state of the art.
- Chapter 6 concludes with a discussion of the significance of the results obtained so far and explores areas of future work.

In the remainder of this chapter, we give an overview of the data sources and data sets used for the rest of the thesis.


1.1 Data

In order to achieve state-of-the-art performance in terms of image classification and geo-location on a world-wide scale, as well as to validate geo-location using truth 3D Ladar data, we need to leverage as many available training and truth data sets as possible. We utilize several sources of data, with varying degrees of labeling accuracy in terms of geo-spatial content as well as image content, namely:

1. A previously existing world-wide terrain classification geospatial database with 1 km resolution, in support of the coarse localization stage.
2. A previously existing truth database of 2689 images with accurate terrain type labels, in support of the coarse localization stage.
3. A 6.5 million image Flickr database with low to medium accuracy geo-spatial data (city to street level accuracy), in support of both the coarse and medium geo-location stages.
4. A new 125,000-frame ground-collected 1 Hz video database with accurate GPS metadata (~10 m), using a 60 Megapixel multi-camera system, in support of the fine geo-location stage.
5. A new 72,000-frame aerial-collected 2 Hz video database with highly accurate GPS/INS metadata (~2 m), using a 66 Megapixel multi-camera system integrated into an airborne platform, in support of the fine geo-location stage.
6. A new 20x40 km 3D Ladar map at 1 meter resolution with sub-meter geo-location accuracy, in support of truth validation of the fine geo-location stage.
7. A new 40 GB 3D Ladar point database, corresponding to 2 billion raw 3D points, to showcase the results of an improved algorithm for processing 3D Ladar data collected using an operational airborne 3D Ladar sensor, in support of validation of the fine-scale geo-location results.
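Geo-positioning accuracies such as the ~10 m and ~2 m GPS figures above, and the 100 m street-level goal, are naturally measured as great-circle distance between estimated and true coordinates. A standard haversine sketch, with arbitrary example coordinates:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points,
    e.g. to score an estimated geo-location against GPS truth."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2.0 * R * math.asin(math.sqrt(a))

# An estimate 0.001 degrees of latitude off truth is ~111 m of error,
# i.e. just outside a 100 m street-level accuracy goal.
err = haversine_m(33.577, -101.855, 33.578, -101.855)
```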


Chapter 2

Coarse Scale Geo-location

Chapter Summary. We present the coarse geo-location method, in which we geo-locate a query image to a particular region of the world by classifying the terrain type in that image based on image content. To achieve the goal of image geo-location by terrain classification, we first create a 3D world model representation composed of a large training database of geo-tagged, terrain-labeled images. This database is created by merging knowledge from three publicly available databases: a geo-spatial terrain type and land coverage database, a 6.5 million image database that is only geo-tagged, and a database of terrain-labeled images. We develop a coarse geo-location method that uses the generated 3D world model to test a hold-out set of 5000 images and demonstrate an improvement over the current state of the art in terrain classification, with over 91% terrain classification accuracy. The resulting terrain label for the image is used to reduce the geographical search space and segment the original large database by filtering for images with geo-tags in close proximity to regions matching the resulting terrain label. The reduction in search space allows the use of more complex medium-scale and fine-scale geo-location classifiers that are both accurate and computationally tractable.
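As a sketch of how a geo-tagged but unlabeled image can inherit a terrain label from a land-coverage database, consider a toy raster lookup. The grid extent, cell size and class names below are illustrative only; the real databases are at 1 km resolution with far richer class sets:

```python
import numpy as np

# Toy 1-degree land-cover raster standing in for the 1 km database.
TERRAIN = np.array([['water', 'coast'],
                    ['urban', 'forest']])  # rows: latitude bands,
                                           # cols: longitude bands

def terrain_label(lat, lon, lat0=42.0, lon0=-72.0, cell=1.0):
    """Return the terrain class of the raster cell containing (lat, lon),
    for a grid whose south-west corner is (lat0, lon0)."""
    row = int((lat - lat0) // cell)
    col = int((lon - lon0) // cell)
    return TERRAIN[row, col]

# A geo-tagged but otherwise unlabeled image inherits its cell's label.
label = terrain_label(43.4, -70.5)   # -> 'forest'
```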

2.1 Background and Related Work

There has been a significant interest in applying machine learning methods to scene

classification and image retrieval. Most of these methods use feature vectors such as

SIFT [10], texton dictionaries [11], color histograms, Gist [12] or a combination of these

[8], with feature vector dimensions on the order of 100-1000. When the training database

is on the order of tens of thousands of examples, the most common and reliable method

to train is using Kernel Support Vector Machines (SVMs). However, the problem of

image geo-location on a world-wide scale requires much larger image training databases,

on the order of millions of images, to have enough sampling of the various locations

around the world.

For computational scalability, retrieval methods that use millions of training samples

typically use the K-nearest neighbor (KNN) training algorithm in a feature space defined

by a few image feature types and then use those nearest neighbors for various tasks, such

as object recognition [13][14][15][16], image completion [8] and object classification.

Nearest neighbor techniques are attractive in that they are trivially parallelizable, require


no training, have good classification performance and perform well from a computational

perspective with query complexity that scales linearly with the size of the data set.

Nearest neighbor methods rely on a feature vector of low to medium size in dimensions

(100-200), either using SIFT [10], texton vocabularies [11], Gist [12] or a combination of

these [8]. Given a new image, the same feature vector is computed and the nearest

neighbors in feature space are found from the training database. Using those neighbors, a

majority rule is implemented to determine the label for the new query image. These KNN

methods tend to work well in low to medium sized dimensional spaces, but tend to suffer

as feature dimensionality increases [8][9]. The reason for this is that nearest neighbor

methods lack one of the fundamental advantages of supervised learning methods, which

is the ability to discard irrelevant feature dimensions for a given task [8].
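The majority-rule KNN scheme described above can be sketched in a few lines. The following is a minimal NumPy illustration; the clustered data and terrain labels are synthetic stand-ins, not the actual Gist features or image databases:

```python
import numpy as np
from collections import Counter

def knn_majority_label(query, train_feats, train_labels, k=5):
    """Return the majority label among the k nearest training vectors (L2)."""
    dists = np.linalg.norm(train_feats - query, axis=1)   # brute-force L2 distances
    nearest = np.argsort(dists)[:k]                       # indices of k nearest neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Synthetic stand-in: three well-separated clusters in a 10-D "feature space"
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(m, 0.3, size=(50, 10)) for m in (0.0, 2.0, 4.0)])
labels = np.array(["forest"] * 50 + ["coast"] * 50 + ["urban"] * 50)

query = rng.normal(2.0, 0.3, size=10)   # drawn near the "coast" cluster
print(knn_majority_label(query, feats, labels, k=7))   # -> coast
```

Note that the brute-force distance computation is trivially parallelizable across the training set, which is the property that makes KNN attractive at the multi-million image scale discussed above.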

In the context of image geo-location, the KNN algorithm has been used in seminal work

by [8] as part of a single stage algorithm that extracted a single feature vector per image

from a 6.5 million image database. To separate the highly complex city-sized classes

present in image geo-location, the feature vectors were high dimensional, of size close to

3000. Given a query image and its associated feature vector, KNN was applied in this

high-dimensional space to retrieve the k-nearest images, thereby directly obtaining the k

most likely candidate geo-locations for the query image [8]. As noted beforehand, KNN

tends to suffer as feature dimensionality increases, so working in a 3000 dimensional

feature space is not ideal. Furthermore, [8] only used data from a geo-tagged image

database and did not use additional knowledge to penalize unlikely matches.

Considering prior research work, there are several lessons to be learned, namely: 1. for

databases on order of millions of images, KNN is one of the only computationally

tractable approaches and 2. KNN performance tends to suffer as dimensionality

increases. One immediate conclusion that can be drawn is that in order to get good

classification performance, we need to use KNN in a lower dimensional space.

Considering that there are more than 3400 cities in the world with population over one

hundred thousand [17], accurate classification at the city level or even street-level using a

low dimensional feature becomes challenging, if not impractical. Thus, we need to

consider solving a simpler classification problem where the world is broken into fewer


class types. This again motivates our initial proposal to implement a hierarchical image

geo-location approach.

One possible solution that reduces the world into a few general classes is to classify

images by terrain type. One advantage of classifying by terrain type is that terrains look

physically different, and thus imagery of the terrains will appear different and have

discernible attributes. For example, deserts look substantially different from forests,

which on the whole look different from urban areas. This good separation in attributes

makes the problem of terrain classification from images very attractive. Another

advantage of classifying by terrain type is that landscapes have a fairly contiguous

distribution, with low spatial variance in terrain types. This contiguous, clumped

distribution of land types is advantageous in several ways, namely: 1. most images should

have no more than one or two terrain types, making accurate classification of the whole

image as single terrain class feasible and 2. considering the low spatial variance in terrain

labels, low accuracy geo-tagged images with no terrain labels might be used to accurately

train a terrain classifier, enabling improved classification. Once we have determined that

an image belongs to a particular terrain type, we can outright reject large contiguous

regions of the world. This in turn reduces our search space, allowing for ever-more

complex algorithms to be used for the follow-on image geo-location stages.

It is noteworthy to mention that research by Hays in [8] transferred information from a

terrain labeled GIS database to show that the terrain label indexed from the computed

geo-location estimate correlated well with the actual image content. However, the

research presented in [8] did not use the additional information for training, but rather

showed the existence of high correlation between the geo-location derived terrain label

and image content; this only demonstrated that information transference from one GIS

database to another is beneficial, but that benefit was not exploited. Our proposed

approach is significantly different; for training we are taking advantage of the heavy

correlation between scene content and GPS derived terrain label to obtain an improved

geo-location estimate.


To achieve the goal of image geo-location by terrain classification, we have to resolve the

following two problems: 1. determine terrain type given geo-spatial coordinates, and 2.

recognize most prevalent terrain type(s) from a single image. Once we have a solution to

these two problems, an algorithmic chain becomes apparent:

1. Starting with a single image, determine the most likely terrain type(s).

2. Mark areas on the globe that belong to or are near to such terrain types, as likely

candidate regions.

3. Take the world-wide geo-tagged image database and down-select to an image

database composed of images with geo-tags that fall within the defined candidate

regions.

4. Pass this reduced database to a next processing stage for further geo-location

refinement.
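The down-select in steps 2-3 of this chain amounts to a mask lookup over a gridded world. A toy sketch follows; the grid resolution, mask, and point-in-region test are illustrative placeholders for the actual geo-spatial indexing:

```python
import numpy as np

def downselect_database(geo_tags, candidate_mask):
    """Keep only images whose geo-tag falls inside a candidate terrain region.

    geo_tags:        (N, 2) array of (row, col) indices into the world grid
                     (a stand-in for lat-lon coordinates after gridding).
    candidate_mask:  2D boolean array marking candidate tiles.
    Returns indices of the surviving images.
    """
    rows, cols = geo_tags[:, 0], geo_tags[:, 1]
    keep = candidate_mask[rows, cols]          # vectorized point-in-region test
    return np.flatnonzero(keep)

# Toy world: 10x10 grid, candidate terrain region = left half
mask = np.zeros((10, 10), dtype=bool)
mask[:, :5] = True

tags = np.array([[2, 1], [3, 8], [7, 4], [9, 9]])   # geo-tags of 4 images
print(downselect_database(tags, mask))               # -> [0 2]
```

The reduced index list is what gets handed to the next (medium-scale) stage, so the cost of the later, more complex classifiers scales with the surviving subset rather than the full database.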

In the next section, we discuss data collections and prior research in support of

developing the above algorithmic method. In particular, we first focus on solving the

problem of determining terrain type given geospatial location, followed by recognition of

terrain types from a single image.

2.1.1 Determining Terrain Type Given Geo-spatial Coordinates

Towards the first goal of determining terrain type given geo-spatial coordinates, we need

training data that has both geo-spatial information as well as annotation of terrain types at

that geo-spatial location. This can be obtained from geological land surveys, with a

multitude of surveys available online from both the United States Geological Survey

(USGS) as well as the European Space Agency (ESA). One example of a geological

survey is the USGS "Global Land Cover Characteristics Data Base” (GLCC) world-wide

land coverage database as well as the US National Land Cover Dataset (NLCD), with

examples shown in Figure 2.1 [16,17]. Listed in Table 2.1 alongside the coverage maps in

Fig. 2.1 are some percentage breakdowns by terrain type. An example coverage area in

the Washington DC metro area is also shown in Fig 2.1-C, where roughly 150 square km

is subdivided as a function of latitude and longitudinal coordinates into several

categories.



Fig. 2.1 - Examples of land coverage and terrain type maps. A) Ecosystems map based on USGS "Global Land Cover Characteristics Data Base (GLCC World)" at a 30 arc second (~1km) resolution using 17 terrain categories, B) National Land Cover Dataset 2011 (NLCD) for the 48 contiguous US states with a spatial resolution of 30m and C) zoomed-in view of the NLCD over the Washington DC area.


Table 2.1

Terrain Coverage Numbers of the 48 Contiguous States (NLCD, 1992)

From the world-wide terrain classification data in Figure 2.1-A, we observe that terrain

labels exhibit low spatial variance, where large contiguous regions are labeled using a

single terrain type. This property might be exceedingly helpful, considering that we have

a large amount of non-terrain labeled image training data that has low accuracy geo-

spatial metadata, as it suggests that terrain classification performance might be insensitive

to the accuracy of an image’s geo-tag (e.g., we can accurately infer the terrain type in an

image solely based on its low accuracy geo-tag). This key insight, that low-accuracy geo-

tagged images may be used for training a terrain classifier, will play a key role in the

development of our proposed terrain classification algorithm.

The data in Table 2.1 show that terrain types are distributed somewhat unevenly, with a

large percentage of terrain labeled into the more broad categories of forest and country

areas, with low percentage representation of coastal/water regions and urban areas. At

first glance, this might appear of concern in terms of how much image database reduction we can obtain if we classify an image into one of these highly represented


classes. As further explained in Section 2.2, the image database density distribution tends

to cancel out this effect, as many more images are collected in urban and coastal

environments, with each final terrain class being more or less evenly represented in the

image database. By having a more even representation of each terrain class in the image

database, we can maintain a predictable reduction in database size that is on the order of

the number of terrain classes, allowing for more complex algorithms to be used in the

next geo-location stages.

While these databases provide a direct method to index geo-spatial data to terrain type,

we still need a training database and classification method to help us recognize terrain

types from our query image.

2.1.2 Recognizing Terrain Types from an Image

While imagery of the same type of terrain can appear different and variable, on the whole

such imagery will look very similar when viewing the image in its entirety. That is, if we

ignore the fine detail of an image, we should still be able to understand the context of an

image. This is evident in Figure 2.2 where the type of terrain can still be deduced despite

the fact that the image has been significantly degraded by noise or blurred out. One

approach to the study of environmental scenes has been to model an image using a

holistic representation [20][21][22]. This area of research has been the subject of intense

study over the last decade, and has grown to model the human visual system with cross-

disciplinary applications in both the computer vision as well as cognitive science

community. The most notable development builds scene understanding by describing an image by its spatial envelope, which captures the gist of the scene [21][22]; researchers have come to know this feature as the Gist feature.


Fig. 2.2 - Images afflicted with various types of noise demonstrating that we can still deduce terrain type despite loss of fine detail. A) Additive speckle applied to an urban scene B) Severe blurring applied to a forest scene.

Images have long been known to have some interesting spectral properties, though they

have not been quantified rigorously until seminal work by Aude Oliva and Antonio

Torralba [21]. The work models the shape of a scene by holistically evaluating the so-

called spatial envelope of a scene in a mathematical and computational way. The model

is a multidimensional space that defines perceptual concepts of naturalness, openness,

roughness, expansion, and ruggedness that describe the dominant scene structure. The

primary goal of the study was to define roughly what a scene actually is, that is, the gist of a scene, hence the name Gist features. Figure 2.3 describes how the Gist

feature is computed using orientation, color and intensity information.


Fig. 2.3 - Computing a Gist feature from an image using orientation, color and intensity information over the whole image at multiple scales. Orientation information is captured using Gabor filters at 4 angles (0,45,90,135) on 4 scales, leading to 16 sub-channels. Color information is captured using red-green and blue-yellow center surround each with 6 scale combinations, leading to 12 sub-channels. Intensity is captured using dark-bright center surround with 6 channel combinations, leading to 6 sub-channels, for a total of 34 sub-channels. Each channel is encoded by a 16 bin histogram, leading to a feature vector of length 544.
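A simplified, orientation-only sketch of such a descriptor is shown below, assuming NumPy and SciPy are available. It covers only the 16 Gabor orientation sub-channels of Fig. 2.3 (4 angles x 4 scales, 16-bin histograms, 256 dimensions) and omits the color and intensity center-surround channels of the full 544-dimension descriptor:

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(theta, wavelength, size=15, sigma=3.0):
    """Real-valued Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def orientation_channels(gray, n_bins=16):
    """Orientation part of a Gist-like descriptor: 4 angles x 4 scales,
    each filter response summarized by a 16-bin histogram (16 x 16 = 256 dims)."""
    feats = []
    for theta in np.deg2rad([0, 45, 90, 135]):       # 4 orientations
        for wavelength in (4, 8, 16, 32):            # 4 scales
            resp = np.abs(convolve(gray, gabor_kernel(theta, wavelength)))
            hist, _ = np.histogram(resp, bins=n_bins)
            feats.append(hist / resp.size)           # normalized histogram
    return np.concatenate(feats)

img = np.random.default_rng(1).random((64, 64))      # stand-in grayscale image
print(orientation_channels(img).shape)               # -> (256,)
```

The key design property exploited later in this chapter is that the descriptor is cheap to compute per image (a fixed bank of linear filters plus histogramming), which keeps both the training and query phases tractable at the multi-million image scale.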

Torralba et al. [21] utilized Gist to classify images into natural and man-made semantic

groups, with each semantic group further split into 4 classes, namely coast, country,

forest, mountain for natural scenes and highway, street, inside-city and tall-buildings for

man-made scenes. In our research, we decided to use the Gist feature for geo-location

since it provides good terrain classification accuracy, with the Gist descriptor fast to

compute, leading to a fast training and testing phase, necessary requirements for

algorithm scalability to large data sets. Since we are implementing a multi-stage

classifier, we choose a lower number of classes for the coarse classification stage, namely

5 instead of Torralba’s 8 original classes. Our coarse geo-location classes are “coast,” “country,” “forest,” “mountain,” and “urban,” with the urban class encapsulating Torralba’s man-made semantic group (highway, inside city, street and tall building).


Example images of these terrain classes are shown in Figure 2.4, from the Torralba et al. [21] image-terrain label database annotated with LabelMe [23].

Fig 2.4 - Examples of terrain labeled images from Torralba et al. [21] truth database.  


2.1.3 Towards a Coarse Scale Geo-location Approach

We now have the necessary methods to go from an image to a rough geo-location on a

world wide scale by first going from an image to a terrain type and then going from a

terrain type to a geo-spatial location. We leverage the Gist feature as designed by

Torralba et al. [21] to achieve the first goal of recognizing terrain types from our query

image. Once we have a terrain type, we can assign the image as belonging to certain

regions of the globe through the use of a world-wide terrain coverage classification

database. While there is quite a lot of previous research work in the area of image to

terrain classification, and extensive work on geo-spatial terrain classification, research on

the combined field of image geo-location on a world wide scale is currently in its

infancy, with the most notable research work done by Hays et al. [5,6]. Indeed, we build upon Hays’ work by using his 6.5 million image Flickr database as well as some of the

image features and matching techniques that he used to achieve improved terrain

classification over current state of the art as well as reasonable computational

performance considering the large scale database used for training. Figure 2.5 shows the

geospatial distribution of the Flickr image database, which we kindly obtained from James Hays for our research work.

Fig. 2.5 - Geographical distribution of photos from the Flickr database in [8]. Photo locations are shown in cyan, with density overlaid using a jet color-map (blue indicates low density, yellow medium, red high density).

Our contribution to Hays et al.'s [7] and Torralba et al.'s [21] research is to improve on classification performance by developing a method to robustly upgrade Hays et al.'s [7] 6.5 million geo-tagged image database with terrain labels. For his work, Hays only used the


geo-tagged image database to directly match a new query image to the closest K images

in his feature space using a K-Nearest Neighbor classifier [24]. His method does not

make use of knowledge that could be gained from a geo-spatial land coverage database,

nor does it use information from a truth database of image to terrain labels, information

that might help by making the geo-location problem more robust to noisy GPS labels. For

our research, we propose to use additional prior knowledge sources in order to improve

both geo-location and terrain classification performance.

We propose an improved method where we first probabilistically label the 6.5 Million

geo-tagged image database used in [7] with terrain classification labels using two

additional knowledge sources, namely: a geo-spatial land-coverage database and a truth

terrain-labeled image database from [21]. The enhanced image database is used to

classify a hold-out set of query images, with a significant improvement over previous

state of the art in terrain classification performance, which in turn enables improved geo-

location capability. Section 2.2 describes the setup of the coarse geo-location stage,

while Section 2.3 describes in detail our coarse geo-location method.

2.2 Coarse Scale Geo-location Approach

The coarse geo-location method builds upon research on image terrain classification from

[21] as well as image geo-location research from [8]. Our method uses the same image

database from [8], but improves on the geo-location approach by not only avoiding the

problem of KNN in high dimensions but also using additional data sources to enrich our

3D world model representation in order to penalize unlikely matches. Starting with the

6.5 million image geo-tagged database from [8], we develop a method to probabilistically

annotate terrain labels to each of the images by combining knowledge from two

additional databases: a world-wide land-coverage and terrain-type geospatial database

and a 2689-image terrain-labeled truth database from [21]. By adding these new data

sources, we are now able to penalize unlikely matches that might otherwise happen with

an image database that only has geo-tags. For instance, the correct recognition of a

coastal image as being a coastal scene would make the image a highly unlikely match to

an image with a geo-tag from an inland area. Thus, our method can robustly discount


images with noisy geo-labels and prevent such images from negatively impacting geo-

location performance.

We create this probabilistically labeled geo-tagged/terrain data set using a two-stage

training algorithm. In the first stage, we use the geo-tags from the 6.5 million image

database, along with the world-wide land coverage geo-labeled database, to weakly label

the images as belonging to a subset of 5 terrain classes. In essence, we are using the geo-

spatial metadata embedded with the image to determine a terrain label probability prior.

In the second stage, we extract feature vectors for each of the images in a 6.5 million

image database as well as from a truth, terrain-labeled, 2689 image database. We

compute a probability that an image falls into a certain terrain class by comparing the

feature vector associated with that particular image to the feature vectors in the truth

terrain-labeled database using a KNN approach. The end-result is an enhanced 6.5

million image database that is not only geo-tagged but also probabilistically labeled by terrain type. For classification, we again use a KNN approach to compute the most likely terrain label given a query image. The outcome is a terrain label that helps reduce our

search space to a subset of the original multi-million image database. Figure 2.6

summarizes the overall algorithmic flow for the proposed algorithm.
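The second-stage soft labeling (formalized later as Eq. 2.2) can be sketched as follows: rather than taking a single majority vote, the K nearest truth-labeled neighbors of an image's feature vector define a probability for each terrain class. The feature vectors and labels below are synthetic stand-ins for the Gist vectors and truth database:

```python
import numpy as np

CLASSES = ("coast", "country", "forest", "mountain", "urban")

def knn_label_probs(query_feat, truth_feats, truth_labels, k=10):
    """P(class | feature) as the fraction of the k nearest truth-labeled
    neighbors (L2 distance) carrying each terrain label."""
    dists = np.linalg.norm(truth_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    probs = {c: 0.0 for c in CLASSES}
    for idx in nearest:
        probs[truth_labels[idx]] += 1.0 / k     # each neighbor casts a 1/k vote
    return probs

rng = np.random.default_rng(2)
truth_feats = rng.random((100, 32))                    # stand-in Gist vectors
truth_labels = [CLASSES[i % 5] for i in range(100)]    # stand-in truth labels

p = knn_label_probs(truth_feats[0], truth_feats, truth_labels, k=10)
print(sum(p.values()))    # class probabilities sum to one (up to float rounding)
```

Running this once per image in the large geo-tagged database yields the soft terrain labels that, combined with the geo-tag prior from the first stage, form the enhanced database.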

In the first training stage, we start labeling the world using 5 terrain types, namely: coast,

country, forest, mountain, and urban. We primarily use the USGS GLCC Database [18]

to assign a subset of labels to each 1km x 1km land tile. From this database, we

determine layers for four of our five classes, namely urban, forest, country and coast.

Table 2.2 summarizes the mapping from the USGS Land Use labeling [25], containing 24

labels, to the reduced set of 4 labels (urban, forest, country and coast).


Fig. 2.6 - Block diagram of coarse-scale geo-location algorithm. Three databases are used to create a multi-million image terrain labeled, geo-tagged database. The training is divided into two stages. The first stage determines for each image a probability prior of a label, given the image’s geo-tag. The second stage extracts a feature vector from a terrain-labeled image truth database and a multi-million image database in order to determine the conditional probability of an image being a particular terrain class given its feature vectors. The enhanced terrain-labeled, geo-tagged database is used to classify a new query image to obtain a terrain label, resulting in a reduction in the size of the image database. The reduction in search space allows the usage of more complex medium-scale and fine-scale geo-location classifiers that are both accurate and computationally tractable.


Table 2.2

Mapping from USGS land-use terrain classes to a reduced set of terrain classes

To reduce gridding errors and improve terrain classification based on wide-field of view

images, or images that might have low accuracy geo-tags, we allow each 1km x 1km tile

to have multiple labels. We apply a 1km image dilation operation for urban, forest and

country label regions, and a 3km dilation operation for coastal regions that are derived

from sea-land contour lines. Since the USGS GLCC Database does not contain labels for

our mountain regions class, we extract this information from UNEP, Mountains and Tree


cover in Mountain Regions 2002 Database [19]. Figure 2.7 shows UNEP-Mountains

layer used for our method. Considering that mountains are a landscape feature that may appear prominently in images taken from neighboring non-mountainous areas, we

perform a 5km dilation operation on the mountains layer obtained from [19]. The result

of this first training stage is a world-wide multi-label image at 1x1km resolution, where

all of the 1x1km pixels are assigned to a subset of terrain labels, producing a terrain

labeled geo-spatial database henceforth referred to as GeoLabel.

Fig. 2.7 - United Nations Environment Programme, Mountains and Tree cover in Mountain Regions 2002 Database [19]. Colors represent various sub-classes of mountain regions.

The newly created GeoLabel terrain coverage database is now used to create probabilistic

terrain class priors on the 6.5 million geo-tagged image database, where the probability of

a terrain label for an image given that image’s geo-tagged latitude-longitude information

can be derived as follows:

\[
P(L_i = c \mid lat_i, lon_i) = \frac{GeoLabel(c,\, lat_i,\, lon_i) + \varepsilon}{\sum_{k \in C} \big( GeoLabel(k,\, lat_i,\, lon_i) + \varepsilon \big)}
\]

[Eq. 2.1]


where C=5 is the number of classes, i is the ith image in the 6.5 million image database,

GeoLabel() is the terrain labeled spatial database indexed in lat-lon coordinates and ɛ is a

small value to prevent the conditional probability of any label from being set to zero

(enables images to not be completely ignored in the case where an image has a noisy geo-

tag, or the case when an image has an accurate geo-tag but contains land coverage types

other than those predicted from the GeoLabel() database). We now have completed the

first training stage, in which we obtain a terrain classification prior by probabilistically

labeling each image given its geo-tag. Figure 2.8 provides pseudo-code with details on

the data sets and algorithmic steps that are part of the first training stage.
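A minimal sketch of the Eq. 2.1 prior lookup is shown below, assuming the GeoLabel database is stored as a (C, H, W) binary mask stack with grid indices standing in for lat-lon coordinates; the array contents and ε value are toy placeholders:

```python
import numpy as np

def terrain_prior(geolabel, row, col, eps=1e-3):
    """P(class | lat-lon) from a multi-label mask stack (sketch of Eq. 2.1).

    geolabel: (C, H, W) binary stack, one dilated mask per terrain class.
    row, col: grid indices of the image's geo-tag (stand-in for lat-lon).
    eps keeps every class probability non-zero, so a noisy geo-tag cannot
    completely rule out any terrain label.
    """
    scores = geolabel[:, row, col].astype(float) + eps
    return scores / scores.sum()                 # normalize over the C classes

# Toy GeoLabel stack: 5 classes on a 4x4 grid; tile (1, 2) is coast + urban
geolabel = np.zeros((5, 4, 4), dtype=np.uint8)
geolabel[0, 1, 2] = 1    # coast mask
geolabel[4, 1, 2] = 1    # urban mask

prior = terrain_prior(geolabel, 1, 2)
print(prior.round(3))    # coast and urban split nearly all of the mass
```

Because of the ε term, even a geo-tag falling on a tile with no matching mask still yields a uniform, non-zero prior, which is exactly the robustness-to-noisy-geo-tags behavior motivated above.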

First Stage Training Pseudo-code:

1. Start off with the GLCC land-cover database image (gusgs2_0ll.img) that has 24 land-cover types.
2. Use the translation index described in Table 2.2 to create a new image with 4 general land-cover types.
3. Create binary masks for the 4 remaining classes (forest, country, coastal, urban).
4. Do a 1km image dilation operation on the country, forest and urban masks using a 3x3km square kernel.
5. Create a coastal contour map: do a 1km image dilation and a 4km image erosion on coastal regions by iterating multiple times with a 3x3km square mask, then save the difference image between the dilated binary mask and the eroded binary mask to define the coastal contour map.
6. Create a lat-lon mask of mountainous regions from the UNEP Mountains data set using Level 8 data (0.6km resolution). Do a 5km image dilation of the mountain mask using 5 applications of a 3x3km square mask.
7. Create the GeoLabel database, a 5-deep image stack of the dilated masks.
8. Use GeoLabel to compute the probability prior as a function of lat-lon.

Fig. 2.8 - Pseudo-code for the first training stage. The pseudo-code describes in detail the steps needed to combine information from two geo-spatial databases to obtain a GeoLabel database with 5 terrain classes, which is used to obtain a terrain label prior conditional on lat-lon location.
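Steps 3-7 of the pseudo-code map directly onto standard morphological operations. A toy sketch using `scipy.ndimage` follows; the 20x20 grid and the input masks are illustrative placeholders for the 1 km world raster:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

SQUARE_3KM = np.ones((3, 3), dtype=bool)   # 3x3 kernel ~ 3x3 km at 1 km/pixel

def build_geolabel(masks):
    """Build a GeoLabel-style stack from raw class masks (pseudo-code steps 3-7).

    masks: dict of binary arrays keyed 'country', 'forest', 'urban', 'coast',
    'mountain'. Dilation radii follow the pseudo-code: 1 km for land classes,
    a dilation/erosion difference band for coast, 5 km for mountains.
    """
    layers = {}
    for name in ("country", "forest", "urban"):                   # step 4
        layers[name] = binary_dilation(masks[name], SQUARE_3KM)
    dilated = binary_dilation(masks["coast"], SQUARE_3KM)         # step 5
    eroded = binary_erosion(masks["coast"], SQUARE_3KM, iterations=4)
    layers["coast"] = dilated & ~eroded                           # contour band
    layers["mountain"] = binary_dilation(masks["mountain"], SQUARE_3KM,
                                         iterations=5)            # step 6
    return np.stack([layers[n] for n in                            # step 7
                     ("coast", "country", "forest", "mountain", "urban")])

# Toy 20x20 world: one urban blob, everything else country
masks = {n: np.zeros((20, 20), dtype=bool)
         for n in ("country", "forest", "urban", "coast", "mountain")}
masks["urban"][8:12, 8:12] = True
masks["country"][:] = True

geolabel = build_geolabel(masks)
print(geolabel.shape)   # -> (5, 20, 20)
```

Because each layer is dilated independently, a single tile can carry multiple labels, which is what makes the stack robust to gridding errors and low-accuracy geo-tags.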

Figure 2.9 depicts the results from pseudo-code step 2 in Figure 2.8, where we create

4 general land types from the GLCC 24-type database, using the translation index

provided in Table 2.2.



Fig. 2.9 - Results of processing the GLCC land-cover database into 4 general land-types, namely coastal (blue), country (cyan), forest (yellow) and urban (red). A) Results depicting coverage for the whole world. B) Zoomed-in view depicting coverage for the north-eastern US.

Figure 2.10 captures the image masks created in pseudo-code steps 3-6 from Figure 2.8.

From the images, we note that coastal and urban regions represent a low percentage of

the globe, while large regions are classified as country, forest and mountain. Table 2.3 adds

further details on the percentages of land area, as well as percentages of images in the

database that fall into each of the five classes. The percent of land area covered varies

significantly, from 0.5% for urban to 7% for coastal and up to 64% for country. Thus,

classification into certain classes, such as urban or coastal, can lead to reductions upwards


of 200x in search area, while other classes such as country lead to modest reductions closer to 2x. Nonetheless, classification into any of these land-types helps reduce the overall

search space, meeting our goal for hierarchical geo-location. Ideally, the classes would be

more uniformly distributed, with a uniform reduction of 5x (given 5 classes), which is not

the case in terms of land-area percentage. However, from an algorithm performance

perspective, uniform class distribution in terms of land area is not as important as uniform class distribution in terms of image database size. The goal of hierarchical

geo-location is to progressively reduce image database size in order to allow for ever-

more complex algorithms to be applied to each reduced set of images. As detailed in

Table 2.3, the class distribution in terms of the number of images in the database is more

evenly balanced, with coast, forest and mountain each accounting for about a ¼ of the

database, and country and urban accounting for about ½. The obtained class distribution

in terms of database size allows anywhere from a 2x to 4x reduction in search space.


Fig. 2.10 - Results of processing the GLCC land-cover database into 5 general land-types. A) Coastal areas mask for the whole world, B) Zoomed-in view of the coastal mask over the continental US, C) Urban mask over the whole world, showing very few large hot-spots of urban areas, D) Zoomed-in view of the urban mask over the continental US, depicting the major cities, with crumb-trails of urban areas along major highways, E) Country mask, F) Forest mask, G) Mountains mask, H) Superposition of all 5 masks in a 5-bit image (32 distinct colors) with a jet color map (blue-yellow-red), using the following least significant bit to most significant bit order: mountain, country, forest, coast, urban. Open-sea is represented as dark-blue, followed by barren mountains, mountains and country areas, etc. Urban-only areas are represented as value 16 (green), with yellow and red representing urban areas with multiple labels, where red represents regions that are both urban and coastal. J) Histogram of land-type combinations for the chosen 5 classes, leading to 32 possible combinations.
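The 5-bit packing described in panel H can be sketched directly (bit order per the caption, least to most significant: mountain, country, forest, coast, urban; the toy masks are illustrative):

```python
import numpy as np

BIT_ORDER = ("mountain", "country", "forest", "coast", "urban")  # LSB -> MSB

def pack_masks(masks):
    """Pack 5 binary masks into one 5-bit label image (values 0-31)."""
    packed = np.zeros(next(iter(masks.values())).shape, dtype=np.uint8)
    for bit, name in enumerate(BIT_ORDER):
        packed |= masks[name].astype(np.uint8) << bit
    return packed

masks = {n: np.zeros((4, 4), dtype=bool) for n in BIT_ORDER}
masks["urban"][0, 0] = True                          # urban-only tile
masks["urban"][1, 1] = masks["coast"][1, 1] = True   # urban + coastal tile

packed = pack_masks(masks)
print(packed[0, 0], packed[1, 1])   # -> 16 24
```

The urban-only value of 16 matches the green region described in the caption, and higher values (such as 24 for urban + coastal) correspond to the yellow-to-red multi-label regions.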


Table 2.3

Geo-Label database statistics

By creating this new GeoLabel database, we have completed the first processing stage to

obtain a probability prior for each class conditioned on the geographic coordinates. We

now describe the second training stage, where we attempt to find a probability prior for

each class label conditioned on the feature vectors extracted from each image. Towards

this goal, we first extract feature vectors for images in both the 6.5 million image geo-

tagged database as well as for images in the terrain-labeled truth database from [21]. We

utilize the Gist feature descriptor [21], which has been shown to work well for terrain

classification and scene categorization [7][8]. Using the extracted Gist feature vectors,

we initialize a KNN classifier on the 2689 image terrain-labeled database from [21]. For

each of the 6.5 million probabilistically labeled images, we run the pre-computed Gist

feature through the KNN classifier. Instead of determining a single label based on the

typical KNN majority-rule, we instead take the K nearest neighbors and their associated

truth labels to find the probability of a terrain label given the image’s Gist feature vector,

Fi:

P(l | F_i) = (1/K) ∑_{j=1..K} 1[l_j = l] [Eq. 2.2]

where K is the number of nearest-neighbor feature vectors under L2 distance, 1[·] is the indicator function, l_j is the truth label of the j-th nearest Gist vector in the truth database T to the Gist vector F_i of image i, and i is the i-th image in the 6.5 million image database. We now combine the information from the first and


second training stages to express the probability of a label for an image given its geo-tag

and Gist feature as:

P(l | G_i, F_i) = P(l | F_i) · P(l | G_i) [Eq. 2.3]

This probabilistically labeled geo-tagged/terrain data set composed of feature vectors,

associated terrain labels and geo-tags serves as the improved representation of the 3D

world model. Next, we train an additional classifier on the world-model database and test

on a hold-out set of images, for which we have both geo-tags and terrain labels. Similar

to [8], we chose a KNN classifier to make the classification problem computationally

tractable. For each test image, we compute a Gist feature and use the KNN classifier

with K′ nearest neighbors (note that this is a different parameter than the K used in Eq. 2.2 above). Unlike [8], which used KNN to label the query image by neighbor majority rule,

we choose the label for the image by computing the label likelihood over the

neighborhood Gist features as follows:

P(l | F_q) = (1/K′) ∑_{j=1..K′} P(l | G_j, F_j) [Eq. 2.4]

where K′ is the number of neighbors used in KNN, F_q is the Gist feature derived from the query image, and j indexes the j-th nearest Gist vector in the 6.5 million image database to F_q. For

reference, the coarse geo-location algorithm is shown in pseudo-code in Figure 2.11.


Fig. 2.11 - Pseudo-code for the proposed coarse geo-location algorithm.
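The three training-stage equations (Eqs. 2.2-2.4) can be sketched in simplified form as follows. This is an illustrative, in-memory Python version with exact nearest-neighbor search; the actual system operates on 544-dimensional Gist features over millions of images, and all function names are ours.

```python
from collections import Counter

def l2(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def label_probs_from_truth(query_feat, truth_db, k):
    """Eq. 2.2: fraction of the k nearest truth neighbors carrying each label.
    truth_db is a list of (feature_vector, label) pairs."""
    nearest = sorted(truth_db, key=lambda item: l2(item[0], query_feat))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: c / k for label, c in counts.items()}

def combine_with_geo_prior(feat_probs, geo_probs):
    """Eq. 2.3: multiply the feature-based and geo-tag-based label
    probabilities and renormalize."""
    joint = {l: feat_probs.get(l, 0.0) * geo_probs.get(l, 0.0)
             for l in set(feat_probs) | set(geo_probs)}
    z = sum(joint.values()) or 1.0
    return {l: p / z for l, p in joint.items()}

def query_label_probs(query_feat, world_db, k):
    """Eq. 2.4: average the combined label probabilities over the K' nearest
    database images. world_db is a list of (feature_vector, label_probs)."""
    nearest = sorted(world_db, key=lambda item: l2(item[0], query_feat))[:k]
    totals = Counter()
    for _, probs in nearest:
        for label, p in probs.items():
            totals[label] += p / k
    return dict(totals)

# Tiny worked example with 1-D stand-in features.
truth_db = [([0.0], "coast"), ([0.1], "coast"), ([1.0], "forest")]
assert label_probs_from_truth([0.05], truth_db, k=2) == {"coast": 1.0}
```

In the real pipeline, `combine_with_geo_prior` would take the terrain prior looked up from the GeoLabel database at the image's geo-tag, and `query_label_probs` would run over the probabilistically labeled 6.5 million image world model.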

Compared to the standard KNN terrain classification approach in [21] where KNN is

trained on a 2689 image terrain-labeled database, our proposed terrain classification method

utilizes knowledge from multiple, much larger databases. This might allow our KNN

method to better distinguish complex boundaries as we now have 6.5 million image

samples for our truth database, which is over three orders of magnitude more data for

training compared to the 2689 image terrain-labeled database in [21]. Furthermore, compared to

the direct, single step geo-location method developed by Hays in [8], our method can

penalize images with incorrect geo-tags, leading to robust classification in the presence of

databases with noisy geo-tags (e.g., an image of a coastal area with a geo-tag far inland).


2.3 Coarse-Scale Geo-location Experimental Setup and Results

The coarse geo-location method was tested on a hold-out set of 5000 test images, 1000

images per terrain class. We extracted a Gist feature for each image, using Gabor filters

at 4 angles and 4 scales (16 channels). Color information is captured using red-green and

blue-yellow center surround, each with 6 scale combinations, leading to 12 sub-channels

[21]. Intensity is captured using dark-bright center surround with 6 scale combinations,

leading to 6 sub-channels, for a total of 34 sub-channels. Each channel is encoded by a 16

bin histogram, leading to a feature vector of length 544. Using the above procedure,

feature vectors were also computed for images in Torralba's 2689 image terrain-labeled database as well as for the entire 6.5 million image Flickr database. To reduce

dimensionality and avoid sparsity concerns, principal component analysis (PCA) was

applied to the training database. For computational reasons, only a subset of the training

database was used for PCA analysis, namely Torralba’s entire 2689 image database as

well as a 50,000 random image selection from the multi-million image Flickr database.

Feature dimensionality was reduced from 544 to 80 dimensions. Based on results from

[12], we chose K=19 as a good value for creation of the geo-tagged/terrain labeled

database. By cross-validation, we found that a parameter of K’=200 worked well for the

k-nearest neighbors used to predict the label of a test image. We compare the results of

our method, shown in Table 2.5, to the baseline method as detailed in [21], shown in

Table 2.4. The baseline had an average accuracy of 85.5%, while our method had an

accuracy of 91.3%, an improvement of 5.7% over the baseline. In particular, our

method was able to much better classify coastal areas with an 11.3% improvement over

the baseline and country areas with a 6.4% improvement. The improved results

demonstrate that the proposed method can leverage the additional database information to

improve accuracy. Also, the high level of correct classification makes the proposed

algorithm suitable for use as a first stage in a hierarchical classification chain.
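The PCA dimensionality-reduction step described above can be sketched with NumPy's SVD. Random data stands in for the Gist training matrix here; in the experiment the basis is fit on Torralba's 2689 images plus the 50,000 random Flickr images, and reused to project the remaining database.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (one feature vector per image) onto the top
    principal components. Returns the reduced data, the mean, and the basis."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components]
    return Xc @ basis.T, mean, basis

# Stand-in for the Gist training matrix: 1000 images x 544 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 544))
X80, mean, basis = pca_reduce(X, 80)
assert X80.shape == (1000, 80)

# New images (e.g., the rest of the Flickr database) reuse the same projection:
x_new = rng.normal(size=(1, 544))
assert ((x_new - mean) @ basis.T).shape == (1, 80)
```

Fitting the basis on a subset and projecting everything else keeps the PCA cost independent of the full 6.5 million image database size.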


Table 2.4.

Confusion Matrix for baseline method. Numbers in red denote correct classification

Table 2.5.

Confusion Matrix for proposed coarse-scale geo-location method.

Numbers in red denote correct classification

In summary, we have developed a method to coarsely geo-locate images on a world-wide

scale by classification of terrain types. Results indicate 91.26% correct classification,

with a 71% weighted average geo-spatial search space reduction and upwards of 99.5%

reduction for urban query images. In terms of image database reduction, the algorithm

resulted in an average 66% reduction, and upwards of 76% for mountain query images.

The new method provides a significant improvement in terms of accuracy, with 5.72%

improvement over the baseline. The proposed method has several advantages over prior

non-hierarchical approaches [8], in that the method is robust to images with noisy geo-

labels, works in a low dimensional feature space to avoid the curse of dimensionality [9]

and reduces the database size in order to allow for more complex follow-on stages to be

computationally tractable. The resulting terrain label for the query image is now used to

reduce our geographical search space, as well as to choose a specific medium-scale geo-location

classifier trained to further distinguish spatial locality within that particular terrain class.


Future areas of research might include expanding the number of terrain classes, allowing

for improved data reduction and geo-location specificity. In particular, the “country” and

“urban” classes tend to account for more than half of all images in the image database

and need to be further sub-divided. Towards that goal, we might consider adding several

additional classes, namely a "savanna/arid" class, as well as further subdividing the "urban" class into a "sub-urban" class, a "dense urban" class, and possibly an "indoors"

class.


Chapter 3

Medium Scale Geo-location

Chapter Summary. We describe a medium-scale geo-location approach, in which a query image has already been labeled as belonging to a certain terrain type by our coarse geo-location approach, leading to a reduction in the geospatial search space, in some cases of up to 99%. Given this reduced search space, we attempt to further geo-locate the image to a few candidate locations in the world. We develop an improved geo-location method using a classifier inspired by SVM-KNN [26][7][8] and demonstrate that the classifier, in conjunction with a set of extracted image features, improves geo-location accuracy compared to prior methods.

3.1 Background

The coarse classification in Chapter 2 enables the capability to distinguish between types

of terrain, whether it is cities, forest, mountains, etc. This is useful because it reduces the

geospatial search space, in some cases by up to 99%. The next step is to geo-locate within

this reduced search space. That is, now that we know this particular photograph is of an

urban scene, we ask questions such as, which city was it taken in? Or, if the image was

classified as forest, which type of forest is it (jungle, deciduous, evergreen)? Of course,

there are limits as to how well a machine can perform. For example, if we take a

picture of a white wall, there is not a lot that an analyst (human or machine) can do to

geo-locate that photograph. Ignoring such pathological situations, though, information

and cues within an image offer considerable potential and can be telling about a geo-location (if not the exact geo-location) simply by observing image attributes.

Fig 3.1 - Image attributes that might be used for medium-scale geo-location.


The decision-making process, as illustrated in Figure 3.1, relies heavily on attributes such

as the type of vegetation and leaves, architecture style, common building colors, texture,

relative height, etc. Such features are recognizable and discernible if the observer has

a priori knowledge of how they appear in digital imagery, where illumination conditions,

resolution, and picture quality play large roles.

For the medium-scale geo-location problem, we first focus on urban scenes to answer the

following question: given this image, which city was the image taken from? Before we

go further and explain the proposed classification approach, we will briefly review

related work on scene classification and image retrieval using image features. We will

explore which of these classification algorithms is capable of handling our difficult classification problem, where the boundary between classes (e.g., is this location Bangkok or Paris?) is extremely complex because the algorithm needs to divide along a wide range of scene types (indoors or outdoors, street or highway, tall or short buildings) that might be present at both locations. We will also

explore which classification algorithms are computationally tractable for our geo-location

problem, where we will have a large scale training database that has millions of images.

3.2 Learning for Scene Classification and Image Retrieval

There has been a significant interest in applying machine learning methods to scene

classification and image retrieval. Most of these methods use feature vectors such as

SIFT [10], texton dictionaries [11], color histograms, Gist [12] or a combination of these

[8], with feature vector dimensions on the order of 100-1000. When the training

databases are on the order of tens of thousands of examples, the most common and

reliable method to train is using Kernel Support Vector Machines (SVMs). An SVM is a supervised learning method that takes an input sample and predicts which of two possible classes the sample belongs to, making the SVM a binary classifier. Given a training set of binary-labeled input data, an SVM training algorithm builds a model that

assigns a new example to one of the two categories. In addition to performing linear

classification, SVMs can also perform non-linear classification using what is known as

the “Kernel Trick”. The kernel trick is a transformation where the input data is mapped


from the original feature space to a much higher dimensional feature space. The reason

for the mapping is to more easily find a separating hyper-plane (a linear decision boundary) that enables good partitioning between two classes that have a complex

separation boundary in the original low dimensional space. Thus for our geo-location

problem, where the boundaries between our classes (is this Bangkok or Paris) are very

complex, SVMs are highly desirable.
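The kernel trick can be made concrete with a small example of our own (not from the thesis): the degree-2 polynomial kernel k(a, b) = (a · b)² computed in the original 2-D space equals an ordinary inner product after an explicit mapping into 3-D, where circularly separable data becomes linearly separable.

```python
import math

def phi(x1, x2):
    """Explicit quadratic feature map whose inner product equals the
    degree-2 polynomial kernel k(a, b) = (a . b)^2."""
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = (1.0, 2.0), (3.0, 4.0)
# Kernel trick: (a . b)^2 is computed entirely in the original 2-D space...
k = dot(a, b) ** 2
# ...yet equals the inner product after the explicit 3-D mapping.
assert abs(k - dot(phi(*a), phi(*b))) < 1e-9

# Circularly separable data (inside vs. outside the unit circle) becomes
# linearly separable in the mapped space: x1^2 + x2^2 compared against 1.
inside, outside = (0.1, 0.2), (2.0, 0.0)
w = (1.0, 0.0, 1.0)  # hyper-plane normal in the mapped space; threshold 1.0
assert dot(w, phi(*inside)) < 1.0 < dot(w, phi(*outside))
```

In practice the mapped space is never materialized; the SVM only ever evaluates the kernel, which is what makes very high-dimensional (even infinite-dimensional) mappings affordable.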

Typical kernel SVM implementations have a training complexity of O(d·N²), where d is the feature dimensionality and N is the number of training examples. For our classification problem, where we have millions of training samples, this approach is computationally intractable. In machine learning literature, there are some

examples of large scale SVM approaches, but many such methods, such as SMO [27]

typically require an N² all-pairs distance computation, which is computationally

intractable with millions of training images. One of the more promising approaches for

large scale SVMs is from Wang et al [28], who use a “histogram intersection kernel”

coupled with an online SVM training method to classify images into Flickr groups and PASCAL categories. With 80 thousand training images and feature dimensionality on the order of d = 200, they can train an SVM in 150 seconds, with classification

performance which is nearly as good as batch-trained SVMs. Nonetheless, even this

method does not directly scale well to training databases on the order of millions of

images.

For computational scalability, retrieval methods that use millions of training samples

typically use KNN training algorithms in a feature space defined by a few image feature

types and then use those nearest neighbors for various tasks, such as object recognition

[29][30][31][32][33], image completion [8] and object classification. Nearest neighbor

techniques are attractive in that they are trivially parallelizable, require no training, have

good classification performance and perform well from a computational perspective with

query complexity that scales linearly with the size of the data. Nearest neighbor methods

rely on feature vectors of low to medium dimensionality (100-200), either using

SIFT [10], texton vocabularies [11], Gist [12] or a combination of these [8]. Given a new

image, the same feature vector is computed and the nearest neighbors in feature space are


found from the training database. Using those neighbors, a majority rule is implemented

to determine the label for the new query image. These KNN methods tend to work well in

low- to medium-dimensional spaces, but tend to suffer as feature dimensionality

increases [8]. The reason for this is that nearest neighbor methods lack one of the fundamental advantages of supervised learning methods, which is the ability to discard irrelevant feature dimensions for a given task.

Nonetheless, with more feature types and therefore higher feature dimensions, there is

potential gain in classification performance as long as we have a computationally

tractable supervised training method that focuses on the relevant features for the given

task or query. One promising approach for high dimensional features is to combine the

supervised learning power of SVM with the computational efficiency of KNN. The

medium-scale geo-location method used in this thesis is inspired by SVM-KNN [8][26]

and prior KNN enhancements [34][35][36][37][38][39][40]. The method is a hybrid of

non-parametric, KNN techniques and parametric, supervised SVM learning techniques.

The philosophy behind this method is that learning becomes easier if we focus on

examining the local space around a query instead of the entire problem domain.

Consider our image geo-location problem where we are attempting to differentiate

between multiple cities (e.g. is this location Bangkok or Paris). The boundaries between

our classes are extremely complex because the boundary must still divide a spectrum of

scene types within a city (indoors or outdoors, close-up or perspective, street or highway,

tall or short buildings) that might be present in both locations. When looking at the

combined training data for both cities, there might not be a simple parametric boundary

between these geographic classes in feature space. However, if we were to look within a

space of similar scenes to the query image (e.g. streets), then it may become much easier

and more feasible to divide the classes. This is exactly what we intend to do with the

KNN-SVM algorithm. Given a query image, we will use KNN to roughly find a local

space of similar scenes and then use an online SVM classifier trained just on the nearest

neighbors to find a possibly non-linear parametric boundary and classify our image. The

proposed KNN-SVM algorithm will not only be computationally tractable, but also have


the potential for significantly improved classification performance over a KNN-only

method.

3.3 Medium Scale Geo-location Algorithm and Results

Our KNN-SVM classifier builds upon the baseline method described in [8]. Given a

query image, we first extract a feature vector using the output of several popular feature detectors from the literature. Given the query image, its corresponding feature vector, as well as the predicted terrain class from the coarse-scale classifier, we propose the following KNN-SVM algorithm:

1. Reduce the original database to images with geo-tags that overlap with the predicted terrain type. Automatically label "regions" using mean-shift clustering with a 200 km bandwidth. For computational efficiency, these steps are performed offline, only once per terrain type.
2. Use the baseline KNN-SVM from [26] with K1 nearest neighbors to find a "region" label, using a minimum cluster size of 5.
3. Run KNN again with data only from that region, using K2 nearest neighbors.
4. Cluster the K2 nearest neighbors on the globe by mean-shift, using a bandwidth of 50 km and a minimum cluster size of 3. Consider each cluster as a city for the SVM.
5. Compute the pair-wise distances between all K2 nearest neighbors using image features with L1 and chi-squared distance.
6. Convert the pair-wise distances into a positive semi-definite kernel matrix using the procedure from [8] and train C 1-vs-all non-linear SVMs.
7. For each of the C classifiers, compute the distance of the query point to the decision boundary using the procedure from [8]. The class with the most positive distance is declared the winner.
8. Estimate the GPS location of the query as the average of all members of the winning class.
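Steps 1 and 4 rely on mean-shift clustering of geo-tag locations. A minimal flat-kernel mean-shift can be sketched in Python as follows; this is our own simplified version that treats coordinates as planar, whereas real geo-tags would require geodesic distances.

```python
def mean_shift(points, bandwidth, min_cluster_size=1, tol=1e-6, max_iter=100):
    """Flat-kernel mean-shift: shift each point to the mean of its neighbors
    within `bandwidth` until convergence, then merge nearby modes into
    clusters and drop clusters smaller than `min_cluster_size`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    modes = [tuple(p) for p in points]
    for _ in range(max_iter):
        moved = False
        new_modes = []
        for m in modes:
            close = [p for p in points if dist(p, m) <= bandwidth]
            mean = tuple(sum(c) / len(close) for c in zip(*close))
            if dist(mean, m) > tol:
                moved = True
            new_modes.append(mean)
        modes = new_modes
        if not moved:
            break

    clusters = {}  # merged mode -> indices of member points
    for i, m in enumerate(modes):
        for center in list(clusters):
            if dist(center, m) <= bandwidth / 2:
                clusters[center].append(i)
                break
        else:
            clusters[m] = [i]
    return {c: ms for c, ms in clusters.items() if len(ms) >= min_cluster_size}

# Two well-separated groups of (lat, lon)-like points -> two clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
clusters = mean_shift(pts, bandwidth=1.0, min_cluster_size=2)
assert len(clusters) == 2
```

Unlike k-means, mean-shift does not require the number of clusters in advance, which is why it suits the world-wide database, where the number of "regions" or "cities" is unknown a priori.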

We tested the new algorithm using a 500 image hold-out set, composed of geo-tagged images from 5 cities with 100 images per city. The images cover the cities of Lubbock, Texas; Boston, MA; Paris, France; Vienna, Austria; and Dubrovnik, Croatia. The images were in part downloaded from Flickr as well as selected from ground

imagery collection campaigns in support of the fine-scale geo-location algorithm

development. Similar to criteria applied in [8], we removed images that were undesirable.

For this experiment, we use the Tiny Images feature, as detailed by Torralba et al. in [41],

to create 16x16 color images as one of our features. In addition, we use color histograms

of size 4x14x14 bins in CIE L*a*b* space for a total of 784 dimensions. Texton features


are also used due to their ability to distinguish well between different building textures in

cities. Similar to [8] we use a 512 entry universal texton dictionary [42] by clustering data

to a set of bank filters with 8 orientations, 2 scales and 2 elongations. Finally, we apply

the same Gist feature descriptor as detailed in Chapter 2, of size 544 dimensions. We use

L1 distance for all image features (Gist, Tiny Images), and chi-squared for histograms

(textons, color). The sub-vectors are concatenated together to create a 2096 dimensional

feature vector. For each query image, we test the image against the whole database to

determine the geo-location performance for finding the particular city amongst data from

the entire world. Successful geo-location for a query image is defined as finding a

location within 200km of the actual GPS location as specified in the geo-tagged metadata

of the query image. By cross-validation, values of K1=2000 and K2=200 were determined to work well in terms of geo-location accuracy.
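The per-feature distances used above (L1 for Gist and Tiny Images, chi-squared for the texton and color histograms) can be sketched as follows. The sub-vector names and the unweighted sum in `combined_distance` are illustrative assumptions of ours; the exact per-feature weighting is not specified here.

```python
def l1_distance(a, b):
    """L1 distance, used for the Gist and Tiny Images sub-vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance, used for the texton and color histograms.
    eps guards against division by zero in empty bins."""
    return 0.5 * sum((x - y) ** 2 / (x + y + eps) for x, y in zip(h1, h2))

def combined_distance(f1, f2):
    """Sum of per-sub-vector distances for two feature dicts, e.g.
    {"gist": [...], "tiny": [...], "texton": [...], "color": [...]}."""
    d = 0.0
    for name in f1:
        metric = l1_distance if name in ("gist", "tiny") else chi2_distance
        d += metric(f1[name], f2[name])
    return d

assert l1_distance([0.0, 1.0], [1.0, 3.0]) == 3.0
assert abs(chi2_distance([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-6
```

The chi-squared distance normalizes each bin's squared difference by the bin mass, so sparsely populated histogram bins do not dominate the comparison the way they would under a plain L2 distance.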

Figure 3.2 captures the result of the mean-shift clustering for the urban-only database,

described in pseudo-code Step 1. The method was implemented in Matlab and took

approximately 16 minutes of run-time. The outcome is the division of the world-wide

data set into 977 clusters. Figure 3.2c provides visual confirmation that the chosen

clusters conceptually capture regional areas. Based on this database labeling, the rest of

the algorithm was run on the 500 image hold-out set. We compare our results to the

KNN-SVM procedure and optimal parameters used by Hays in [8]. Table 3.1 captures

geo-location performance. Results indicate that we can geo-locate a query image to a

particular city with an accuracy of 12% to 18%, with an average of 15%. We also ran the

algorithm from [8] on the urban-only database and obtained an accuracy of 12.5%. Our

method has an absolute improvement of 2.5%, leading to a 20% relative improvement

over previous results. 



Fig 3.2 - Mean shift clustering of the urban-only database using a 200km bandwidth. A) GPS locations for all urban images. B) Clustering of GPS locations for all urban images, color-coded using a wrapped color map (multiple clusters may share the same color). The cluster center is shown with a black round marker. C) Zoom-in of the clustering for the east coast and mid-west USA, confirming that the clustering captures regional areas well.


Table 3.1.

Medium-scale Geo-Location Confusion Matrix at the city level

In addition to testing the imagery on a narrow set of only 5 cities, we also did testing

using a 500 image urban test set with images randomly selected from across the entire

world. We built the test set by randomly drawing 800 images from the urban-only data

set. From this initial set, we removed undesirable photos using the same methodology as

in [8]. In addition to the procedure in [8], we ensured the images did not capture the same

scene area (visually checking images with close geo-tags). Similar to [8], to prevent testing

bias, we removed from the database not only the test images, but also all images from the

same photographers. The resulting set contained 462 images. The set was enriched with

author-collected geo-tagged images to bring the set to a total of 500 images. Using this

new set, we tested the accuracy of our proposed method as a function of database size. For

this experiment, (as well as the next experiment), we used all the im2gps features except

geometric context and 16x16 tiny images. The baseline KNN-SVM algorithm from [8]

was run using Ksl=K=2000 to obtain a fairer comparison to the proposed KNN-

SVM method in that both algorithms now use the same data for further SVM

classification. Results for this test are shown in Figure 3.3. From Figure 3.3, we can see

that accuracy increases with database size, similar to the results obtained in [8]. For the

entire urban-only database, our accuracy rate was 16.2%. We repeated the test using the

method described in [8] and obtained a second curve in Figure 3.3 that has the same trend

as our proposed method. For the method in [8], we obtained an accuracy rate of 12.8%

using the entire urban-only database, resulting in a modest improvement of 3.4% (25%

relative improvement) for our proposed method compared to the method from [8].


So far, our accuracy metric has been based on the top-scored regional city-refined cluster of images. We now relax this condition to determine the rate at which we correctly geo-locate the query image when considering the second through Nth ranked mean-shift regional city-refined clusters, as well as the best cluster, defined as the cluster spatially closest to the ground truth query location. We compare

those results to the baseline KNN-SVM method from [8] with Ksl=K=2000. Results

shown in Figure 3.4 indicate a significant increase for both the baseline and proposed

algorithm in correct geo-location once lower ranked clusters are considered. The

percentage of the data set that meets the new criteria increases from 16.2% when

constrained to the rank 1 cluster to 39.2% when considering up to the top 9 clusters,

reaching 39.8% when expanding criteria to include all found regional city-refined

clusters. Correct classification for both algorithms eventually converges when all clusters

are considered, since both algorithms determine similar overlapping clusters (starting from exactly the same K=2000 nearest neighbors). The proposed algorithm has a

slight advantage when considering up to the 4th top regional city-refined clusters, most

likely due to the refined search enabled by our 2-stage (regional-city) hierarchical KNN-

SVM proposed approach. Chance performance based on random matches is also shown

for comparison. We note that the ratio of geo-location accuracy to chance performance

stays in a range of 8-16x. This demonstrates that the image search

system can be used to obtain correct geo-location with fairly good recall rates, while still

providing accuracy that is about one order of magnitude better than chance.

As a point of comparison to prior approaches, we also tested the accuracy of running the

whole hierarchical algorithm (coarse-scale followed by medium-scale geo-location) on

the entire 6.5 million image database using the 2K random image subset from [8]. In

order to do a comparison with prior reported results in [8], we used all the base im2gps

features (minus geometric context and tiny images) for the first KNN-SVM stage and

added the additional geometric-derived and SIFT-derived features explained in [8] for

performing the second KNN-SVM stage.


Fig 3.3 - Accuracy as a function of urban database size for the proposed algorithm (red) versus the method described in [8] (blue). The accuracy of our proposed method on the entire urban database is 16.2% compared to 12.8% for the previous method.

Fig 3.4 - Accuracy of geo-location estimates once lower ranked clusters are considered. The proposed method had a good recall rate while remaining about an order of magnitude better than chance.

The results indicate a slightly improved accuracy rate of 15.1% compared to the accuracy

rate of 13.75% reported in [8], leading to a small 1.35% absolute improvement and a

modest relative improvement of 9.8%. Although ~15% is a low absolute accuracy rate, it

is worth noting that the 2K random test set has many images that are extremely


difficult, if not impossible to confidently geo-locate. Figure 3.5 shows a representative

sample of the 2K random data set.

 

Fig 3.5 - Sample of 2k random data set, showing 30 images. Many of these photos are very difficult if not impossible to geolocate due to lack of content that is geographically specific.

In Figure 3.6, 3.7 and 3.8, we show some example geo-location results. For each

example, the query image is shown on the top left. The top 6 images from the resulting

geo-location image cluster are shown to the right of the query image. The predicted city-

refined geo-location estimate is shown as a red dot, while the actual ground truth geo-

location is denoted using concentric yellow rings of radius 200km and 750km.

In summary, we developed an improved geo-location method using a classifier inspired

by [26][7][8] and demonstrated that the classifier, in conjunction with a set of extracted

image features, improves geo-location accuracy compared to prior methods. Results with

the new method indicate a slightly improved accuracy rate of 15.1% compared to the

accuracy rate of 13.75% reported in [8], leading to a modest relative improvement of


9.8%. Future work might include appending additional images to the database to

determine if the proposed dual-stage KNN-SVM classifier can take further advantage of

the additional data. In addition, it would be of interest to better explore which features are

more important for geo-location and discard features that have low discriminatory power.

With the development of this medium-scale classifier, we have reduced the geo-location problem to city scale and gained additional scene knowledge by having a region, and possibly a city, associated with the image. In the next chapter, we explore a method to further geo-locate the query image from city scale down to street-level accuracy, or in some cases as far as determining an actual camera pose and location.

Fig 3.6 - Geo-location results 1. A query image from the city of Boston, MA is shown on the top left. The top 6 images from the resulting geo-location image cluster are shown to the right of the query image. The predicted city-refined geo-location estimate is shown as a red dot, while the actual ground truth geo-location is denoted using concentric yellow rings of radius 200km and 750km.


Fig 3.7 - Geo-location results 2. A query image from the Grand Canyon is shown on the top left. The top 6 images from the resulting geo-location image cluster are shown to the right of the query image. The predicted city-refined geo-location estimate is shown as a red dot, while the actual ground truth geo-location is denoted using concentric yellow rings of radius 200km and 750km.


Fig 3.8 - Geo-location results 3. A query image from Paris, France is shown on the top left. The top 6 images from the resulting geo-location image cluster are shown to the right of the query image. The predicted city-refined geo-location estimate is shown as a red dot, while the actual ground truth geo-location is denoted using concentric yellow rings of radius 200km and 750km.


Chapter 4

Fine Scale Geo-location

Chapter Summary. Once we have geo-located a query to a particular city, we proceed to the final step in the geo-location progression: estimating the pose from which that particular image was taken. To achieve this, we first pre-process a training data set using structure-from-motion (SfM) techniques, extracting local features from training images, finding feature correspondences, and upgrading the correspondences to 3D locations to create a 3D model of the city scene. The relative camera poses, along with the 3D reconstruction, are then geo-located using GPS image metadata that may be available for a subset of the training images in our city-wide image database. A query image can then be geo-located and attached to the training image database using a similar SfM procedure. Our contribution to the SfM research area is an efficient method for 3D reconstruction on a city-wide scale that uses ground video imagery as well as aerial video imagery in order to compute a more complete and self-consistent geo-registered 3D city model. The reconstruction results for a 1x1 km city area, covered with a 66 MPix airborne system along with a 60 MPix ground camera system, are presented and validated to geo-register to within 3 meters of prior airborne-collected 3D Ladar data. Compared to prior approaches, the new method has a computational speed-up on the order of 4 to 14x, depending on database size. Furthermore, given a holdout set of 500 query images, the presented method geo-locates close to 80% of the query images to better than 100 m, demonstrating the ability to geo-locate most query images to street-level accuracy.

4.1 Background and Related Work

Automatic 3D reconstruction and geo-location of buildings and landscapes from images

is an active research area. Recent work by [43][44][45][46] has shown the feasibility of

3D reconstruction using tens of thousands of ground-level images from both unstructured

photo collections, such as Flickr, as well as more structured video collections captured

from a moving vehicle [5][47][48][49], with some algorithms incorporating GPS data for

geo-location when available [5][50][51][52]. While 3D reconstructions from ground

imagery provide high-detail street-level views of a city, the resulting reconstructions tend

to be limited to vertical structures, such as building facades, missing a lot of the

horizontal structures, such as roofs, or flat landscape areas, thus leading to an incomplete

model of the city scene [5][46][47][48]. Furthermore, when using GPS data for geo-

location, the 3D model’s geo-registration accuracy and precision might suffer since


street-level GPS solutions are poor, particularly amidst tall buildings on narrow streets

due to multipath reflection errors [5][53]. For video collects with GPS captured from

moving vehicles, the resulting 3D model is typically composed of a single connected

component that might have internal distortions due to GPS drift or discontinuities [5][51].

For unstructured ground photo collections, the issue of geo-registration is further exacerbated, as only a subset of images might have GPS metadata; typical city-sized reconstructions are composed of multiple unconnected 3D models representing popular tourist sites or landmarks [43][44][46], where each connected component might have few or no GPS tie points. Recent work by [46] attempts to resolve this problem; however, it requires additional GIS building data as a unifying world model to connect the various disconnected 3D scene clusters.

While ground-level 3D reconstructions do not capture a complete model of a city’s surface scene, they can be complemented with aerial imagery, which has wider area coverage along with inherently more accurate GPS data. A reconstruction using a combination of aerial and ground imagery might lead to a 3D city model that has both a high level of detail and wide-area coverage. Furthermore, by using aerial imagery to create a reference, geo-registered, self-consistent 3D world model, we might be able to improve both the absolute geo-registration accuracy and the precision of the previously unconnected 3D ground reconstructions.

In this chapter, we describe an efficient method that utilizes both ground video imagery and ultra-high-resolution aerial video imagery to reconstruct a more complete 3D model of a large (1x1 km) city-sized scene. The method starts with two similar structure-from-motion (SfM) algorithms that process the aerial and ground imagery separately. We developed an SfM processing chain similar to [43][44], with several improvements to take advantage of inherent video constraints as well as GPS information to reduce computational complexity. The two separate 3D reconstructions are then

merged using the aerial-derived 3D model as the unifying reference frame to correct for

any remaining GPS errors in the ground-derived 3D scene. To quantify the improvements

in geo-registration accuracy and precision, we compare the aerial-derived 3D model, the

ground-derived 3D model, as well as the merged 3D reconstruction to a previously


collected high-resolution 3D Ladar map, which is considered to be truth data. To the best

of our knowledge, no one has published results of city-sized reconstruction using both

aerial and ground imagery to obtain a more complete 3D model, nor has the geo-location

accuracy and precision of 3D reconstruction been quantified in a systematic manner over

large scale areas using 3D Ladar data as truth.

We utilize an airborne 66 MPix multiple-camera sensor operating at 2 Hz to capture videos of large city-sized areas (2x2 km), with an example image shown in Figure 4.1-A/B. In addition, we captured ground-based video data at 1 Hz with five 12 MPix Nikon D5000 cameras mounted on a moving vehicle, as shown in Figure 4.1-C/D. A total of 72,000 aerial images and 125,000 ground images were captured in support of fine-scale geo-location. The algorithm was tested using 250 66-MPix aerial video frames along with 34,400 ground images to create a dense 3D reconstruction of a 1x1 km area of Lubbock, Texas. A Ladar map of the city at 50 cm grid sampling with 0.5 m geo-registration accuracy is used to determine the geo-registration accuracy and precision of the various 3D reconstructed data sets. The main contributions of our research are:

1. An efficient SfM method that takes into account video constraints as well as GPS information, in combination with a method to merge the 3D aerial and ground reconstructions for a more complete 3D city model, with improvements in geo-registration accuracy and precision for the ground-collected data.

2. The first 3D reconstruction using both aerial and ground imagery on a large city-sized scale (1x1 km).

3. A detailed study showing the geo-location improvements of the merged reconstruction, validated in a systematic manner over a large-scale area using 3D Ladar data as truth.



Fig. 4.1 - Data used by the 3D reconstruction system. A) Example of a 66 MPix image captured at 2 Hz by a multi-camera airborne sensor, covering a ground area of about 2x2 km. B) Zoomed-in view of the same aerial image. C) A five-camera, 12 MPix ground system covering a 180-degree field of view, collected at 1 Hz. D) Example of resulting ground imagery.


The rest of this chapter is organized as follows: Section 4.2 discusses in detail the

developed algorithm along with implementation of the system. Section 4.3 reports the 3D

reconstruction results on a 1x1 km area using aerial data, followed by the 3D

reconstruction results using only the ground imagery. Qualitative as well as quantitative

geo-registration results of the 3D reconstructions are reported for the aerial imagery,

ground collected imagery, merged ground imagery, as well as for the combined aerial-

ground reconstruction. Section 4.4 concludes with a proof-of-concept on how fine-scale

geo-location opens up new paths for improved image understanding.

4.2 Fine Scale Geo-Registration Approach

The developed algorithm can be divided into three stages. The first stage is a 3D reconstruction pipeline that is applied in the same way to both the ground and aerial imagery. The second stage is a 3D merge method that fuses the two 3D reconstructions into a complete city model. The third stage uses the combined 3D reconstruction to geo-locate a new query image. Each stage is described in a separate subsection below.

4.2.1 3D Reconstruction from Video Imagery and Geo-location

The 3D reconstruction pipeline is similar to [43][44], with several improvements that take into account temporal video constraints as well as the availability of GPS information. The processing pipeline, shown in Figure 4.2, can be broken up into the following stages: preprocessing, feature detection, feature matching, initialization, bundle adjustment, and dense 3D reconstruction.

For the pre-processing step, we first record estimates of the camera intrinsic parameters,

such as focal length, and any information related to camera extrinsic parameters, such as

GPS information. For the aerial imagery, the camera intrinsic data are determined using

prior calibration, while for the ground imagery we use JPEG header metadata to determine an initial estimate of the focal length, as well as to record the GPS information on a per


video-frame basis. In the feature detection stage, we find points of interest for each image

using Lowe’s SIFT descriptors [10] and store those SIFT features for each image.

Fig. 4.2 - Structure-from-motion 3D reconstruction pipeline for video imagery. The pipeline takes as input a set of image frames from a video sequence, along with GPS information per frame. Features are detected in each image and matched across multiple images in the sequence. GPS as well as time information is used to remove probable outliers and to significantly speed up the processing time. Once feature matching is completed, bundle adjustment is run by first initializing with a 2-view reconstruction and continually adding additional images to the reconstruction. The result is a sparse 3D reconstruction, which can later be upgraded to a dense reconstruction.

Next, in the feature matching stage, the SIFT descriptors of the points of interest are first matched using Lowe's ratio test [10], with an additional uniqueness test to filter out many-to-one correspondences. The matches are verified using epipolar constraints in the form of RANSAC-based estimation of the essential matrix [54][55]. Typically, the image matching stage is the most computationally expensive stage of the process. For unstructured photo collections, where any image might match any of the remaining images, the process typically takes O(N²) computational time, where N is the number of images in the photo collection. In our case, where we have a video sequence, we can reduce the computational complexity of the matching step by taking into account that


time-neighboring video frames capture similar perspectives of the 3D scene; thus, there is a high likelihood that consecutive video frames will have many feature matches to the current video frame, while video frames further separated in time might have fewer matches. We employ a simple data-driven model that keeps track of the maximum number of matching features. We continue to match neighboring frames further out in time as long as the number of matches does not fall below 25% of the maximal match number, or until a predetermined hard threshold, THorizon, of frames is reached. Based on offline tests of maximal correspondence track lengths, THorizon is set to 40 consecutive frames for the aerial imagery and to 20 for the ground imagery. This moving-horizon, time-based clustering is captured in Equation 4.1 below.

Do feature matching between Im_i and Im_j if and only if:

    Nm_{i,j} / max({Nm_{i,Corr}}) ≥ Ratio = 0.25,

where j ∈ {i+1, ..., i+THorizon}, Im_i is the image at index i, and Nm_{i,j} is the number of matches between Im_i and Im_j.    [Eq. 4.1]

To account for situations where the same scene is revisited after a prolonged time, the above data-driven matching scheme is also performed between the current frame i and key frames i + K·m, where K is the skip-frame interval (set to THorizon/2). This sub-sampling in time allows for loop closure, removing time-based propagation errors in the camera pose estimates.
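The moving-horizon matching schedule described above can be sketched as follows. This is an illustrative reimplementation rather than the dissertation's code: `num_matches` is a hypothetical callback standing in for the actual SIFT matching and RANSAC verification, and the key-frame loop-closure pass is only noted in a comment.

```python
def candidate_pairs(num_frames, t_horizon, num_matches, ratio=0.25):
    """Data-driven moving-horizon frame-pair selection (Eq. 4.1 sketch).

    num_matches(i, j) -> number of verified feature matches between
    frames i and j (hypothetical callback standing in for the matcher).
    """
    pairs = []
    for i in range(num_frames):
        best = 0  # maximal match count seen for frame i so far
        for j in range(i + 1, min(i + t_horizon + 1, num_frames)):
            m = num_matches(i, j)
            best = max(best, m)
            # stop extending once matches fall below 25% of the maximum
            if best > 0 and m < ratio * best:
                break
            pairs.append((i, j))
        # Loop closure: additionally match key frames i + K*m with
        # K = t_horizon // 2 (omitted here for brevity).
    return pairs
```

Per the offline tests described above, `t_horizon` would be 40 for the aerial imagery and 20 for the ground imagery.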

Besides time-based image matching, we also employ space-based constraints. For ground imagery, we only match images whose GPS locations are no more than 300 meters apart (this works well in practice for urban imagery). For aerial imagery, we employ a more sophisticated method that uses GPS and INS information to compute projection matrices and test for camera frusta intersection between pairs of candidate images. This space-based matching constraint is captured in Equation 4.2.


Page 66: Hierarchical image geo-location on a world-wide scale349703/fulltext.pdf · iii Abstract Hierarchical Image Geo-Location On a World-Wide Scale by Alexandru N. Vasile Doctor of Philosophy

56  

Do feature matching between Im_i and Im_j if and only if:

    |GPS_i − GPS_j| < T_GPS, where i and j are ground images, or

    intersect(F_i, F_j) = true, where F_i is the frustum for aerial camera image i.    [Eq. 4.2]
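The ground-imagery half of this gate can be sketched as below; this is an illustrative check, not the dissertation's implementation. GPS positions are taken as (lat, lon) pairs in degrees, and a local equirectangular approximation is used, which is adequate at the 300 m threshold scale; the aerial frustum-intersection test is omitted.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, meters

def within_gps_gate(gps_i, gps_j, t_gps=300.0):
    """First condition of Eq. 4.2: match two ground images only if their
    GPS positions (lat, lon in degrees) are within t_gps meters.  Uses an
    equirectangular approximation, fine at city scales."""
    lat1, lon1 = map(math.radians, gps_i)
    lat2, lon2 = map(math.radians, gps_j)
    x = (lon2 - lon1) * math.cos(0.5 * (lat1 + lat2))
    y = lat2 - lat1
    return EARTH_RADIUS_M * math.hypot(x, y) < t_gps
```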

Furthermore, once initial matching is completed using RANSAC with essential-matrix constraints, we take advantage of any images with available GPS/INS data to remove false matches due to building symmetry. We consider an image to be well matched to another image as long as the GPS/INS-derived relative rotation differs from the essential-matrix-derived rotation by no more than 20 degrees; otherwise, the images are considered to be mismatched due to building axial symmetry, and the image pair and its correspondences are discarded.
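The 20-degree rotation-consistency test can be sketched as follows; a minimal illustration with plain 3x3 nested-list matrices, assuming the GPS/INS-derived and essential-matrix-derived relative rotations are already expressed in a common frame.

```python
import math

def rotation_angle_deg(Ra, Rb):
    """Angle (degrees) of the residual rotation Ra^T @ Rb between two 3x3
    rotation matrices (nested lists); trace(Ra^T Rb) = 1 + 2*cos(angle)."""
    trace = sum(Ra[i][k] * Rb[i][k] for i in range(3) for k in range(3))
    c = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numeric safety
    return math.degrees(math.acos(c))

def rotation_consistent(R_ins, R_essential, tol_deg=20.0):
    """Keep an image pair only if the two relative rotations agree to tol."""
    return rotation_angle_deg(R_ins, R_essential) <= tol_deg
```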

These time-based and space-based image matching methods reduce the computational complexity of image matching from O(N²) for unstructured photo collections to closer to O(N), which leads to significant computational savings. Once pair-wise matching is completed, the final step of the matching process is to combine all the pair-wise matching information to generate consistent tracks across multiple images [43][44][6], where each track represents a single 3D location.
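Merging pair-wise matches into multi-image tracks is a connected-components problem; a minimal union-find sketch (not the dissertation's implementation) is shown below. Each feature is keyed by an (image_id, feature_id) tuple, and tracks that touch the same image twice are dropped as inconsistent.

```python
def build_tracks(pairwise_matches):
    """Combine pairwise feature matches into multi-image tracks; each
    surviving track should correspond to one 3D point.

    pairwise_matches: list of ((img_i, feat_a), (img_j, feat_b)) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for fa, fb in pairwise_matches:
        union(fa, fb)

    tracks = {}
    for f in parent:
        tracks.setdefault(find(f), set()).add(f)
    # Drop inconsistent tracks that visit the same image twice.
    return [t for t in tracks.values()
            if len({img for img, _ in t}) == len(t)]
```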

Once feature matching is completed, the next step is to initialize the reconstruction with a seed model, refine the initial model, and add additional images. Similar to [43][44], our SfM method is incremental: it starts with a two-view reconstruction, adds another view, and triangulates more points, while performing several rounds of non-linear least-squares optimization, known as bundle adjustment (BA), to minimize the re-projection error. Similar to [43], seed initialization starts by finding the pair of images that has the most matches under the essential-matrix constraint while having few matches under a homography constraint (we want an image pair with a large baseline, not just pure rotation). After each bundle adjustment stage, 3D points with re-projection error above a certain pixel threshold are removed, and a final bundle adjustment stage is run. Next, if GPS/INS information is available, the estimated final BA rotation between image pairs is checked to be no more than 20 degrees off from the GPS/INS-derived relative rotation. If the criterion is not met, the initial seed is

Page 67: Hierarchical image geo-location on a world-wide scale349703/fulltext.pdf · iii Abstract Hierarchical Image Geo-Location On a World-Wide Scale by Alexandru N. Vasile Doctor of Philosophy

57  

discarded and the above process repeats for successive candidate seeds until one is found

that passes all the above criteria.
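The seed-pair selection heuristic can be sketched as follows. This is an illustrative version: the inlier counts are assumed to come from the RANSAC verification stage, and the 0.6 homography-to-essential inlier ratio cutoff is a hypothetical value chosen for the example, not a number from the text.

```python
def pick_seed_pair(pair_stats, max_h_ratio=0.6):
    """Choose the two-view initialization pair: maximize essential-matrix
    inliers while rejecting pairs that a homography explains too well
    (which would indicate pure rotation or a small baseline).

    pair_stats maps (i, j) -> (essential_inliers, homography_inliers).
    """
    best_pair, best_count = None, -1
    for pair, (n_essential, n_homography) in pair_stats.items():
        if n_essential == 0 or n_homography / n_essential > max_h_ratio:
            continue  # too homography-like: skip as a seed candidate
        if n_essential > best_count:
            best_pair, best_count = pair, n_essential
    return best_pair
```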

Once we have an initial seed, the above process repeats with each additional image view. The final result of this step is a set of adjusted camera intrinsic and extrinsic matrices, along with a sparse 3D reconstruction of the 3D scene. The geo-registered sparse 3D reconstruction is upgraded to a dense 3D reconstruction using Furukawa et al.'s Patch-based Multi-View Stereo (PMVS) algorithm [56][57].

Fig. 4.3 - 3D Geo-location method. The 3D reconstruction, which has camera pose data in a virtual coordinate space, is geo-located to world space using the GPS metadata available with the original images. A similarity transform is found using a robust least-squares method that utilizes the RANSAC algorithm. The similarity transform is applied to the camera poses in the virtual coordinate space to obtain world-space camera poses that are well aligned to the GPS metadata. The similarity transform is also applied to the 3D reconstruction to obtain an absolutely geo-located 3D data set. The accuracy of the geo-located data set is verified by comparing the data to previously collected 3D truth data, obtained from a high-precision 3D Laser Radar (Ladar) sensor.

Next, both the aerial and ground derived dense reconstructions are geo-located using the process depicted in Figure 4.3. Geodetic data in lat-lon-alt is converted into Earth-Centered, Earth-Fixed (ECEF) world coordinates, a Cartesian coordinate system. Geo-registration is performed by automatically finding a 7-degrees-of-freedom (scale, rotation, and translation) transformation that minimizes the least-squares error between the


metric-scaled camera positions and the GPS-based ECEF camera positions. The method uses random sample consensus (RANSAC) to discard outlier correspondences, which might be caused either by a poor GPS solution or by a poor 3D camera reconstruction.
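The RANSAC-wrapped 7-DoF alignment can be sketched as below, assuming NumPy is available. The closed-form similarity estimate used here is Umeyama's method, which the text does not name explicitly; reconstructed camera centers are aligned to GPS-derived ECEF positions, and outliers are rejected by residual distance. Both the minimal sample size (3 points) and the 5 m inlier tolerance are illustrative choices, not values from the dissertation.

```python
import numpy as np

def similarity_fit(src, dst):
    """Closed-form 7-DoF (scale s, rotation R, translation t) least-squares
    fit mapping src -> dst, both (N, 3) arrays (Umeyama's method)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # avoid reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_similarity(src, dst, iters=200, tol=5.0, seed=0):
    """Robust geo-registration: fit minimal 3-point samples, keep the model
    with the most inliers (residual < tol meters), then refit on inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = similarity_fit(src[idx], dst[idx])
        residual = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
        inliers = residual < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return similarity_fit(src[best_inliers], dst[best_inliers])
```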

The results of the above process are multiple dense, geo-registered 3D models: one derived from aerial imagery, along with possibly multiple separate 3D models derived from ground imagery. The geo-registration accuracy and precision of each reconstructed geo-located model is quantified by automatically aligning each 3D model to a previously collected, geo-located 3D data set. The alignment method used is a modified version of the Iterative Closest Point (ICP) algorithm [58] with six degrees of freedom (rotation + translation), as detailed in [59]. The alignment accuracy is verified manually by super-imposing both data sets in a 3D data viewer. The truth data was derived from an airborne 3D Laser Radar (Ladar) sensor that produced data with sub-meter geo-location accuracy.

Because GPS solutions obtained from ground imagery may be poor due to multi-path effects [53], some parts of the ground 3D reconstruction might include drift errors. In the next section, we discuss a method to correct distortions in the ground-based 3D reconstruction by using the aerial data as a reference frame.

4.2.2 3D Reconstruction Merge

The aerial-based reconstruction was used as a reference frame to merge the ground-derived reconstructions in order to create a more complete and self-consistent 3D model of the city. The first step was to roughly align the ground and aerial 3D reconstructions to remove the overall bias between the multiple 3D models. Similar to the process of determining geo-location accuracy, we used a modified version of the Iterative Closest Point (ICP) algorithm [58] with six degrees of freedom (rotation + translation), as detailed in [59]. The next step was to correct for intra-model distortions. To find the optimal global alignment between the reference aerial model and each of the 3D ground models, we use a combination of ICP along with a multi-position search in xyz space. The result of this method is a set of rigid transforms that best align each ground model to the aerial model. In addition to applying the corrections to the 3D ground models, the corrections are also applied to the extrinsic camera positions of the 2D images that were used to generate the


models. The images, along with the corrected extrinsic matrices, are now considered a single ground component. This data is passed again to PMVS, with the goal of obtaining a higher-density reconstruction compared to the original multiple disconnected 3D ground models. Figure 4.4 details the 3D reconstruction merge method.

Fig. 4.4 - 3D Merge Method. The ground models are first aligned to the aerial 3D data. The result is a set of rigid transforms, one for each of the ground models. Each transform is applied to the camera parameter files associated with its model in order to create a single-component ground data set. The metadata associated with this data set is again passed to PMVS to obtain a refined dense ground reconstruction.

4.2.3 Geo-locating a new image

The SfM algorithm described in Section 4.2.1 is applied in the same way to both the ground and aerial reconstructions, followed by the merge algorithm in Section 4.2.2. The result is a complete, geo-referenced 3D model and image database, with camera poses for each of the reconstructed images. Given a new query image, taken either from the ground or from an aerial platform, we are now in a position to match it to the pre-processed geo-referenced image database. The procedure for achieving this is described in Figure 4.5.

Similar to the SfM reconstruction pipeline, we perform feature detection using SIFT. We apply feature matching using Nistér's 5-point algorithm as the RANSAC constraint, as well as the time-based constraint method described in Section 4.2.1 with a K skip-factor of 5. However, the space-constraint method described in Section 4.2.1 is not applicable, as we assume that no GPS metadata exists for the query image. Depending on how many feature matches are found, there are several possibilities going forward:

1. No matches found: the image cannot be geo-located beyond city scale.

2. Between 1 and 4 matches found: Nistér's 5-point algorithm cannot be used to compute the essential matrix, so a weighted estimate is made based on the number of matches and the GPS locations of the matched images.

3. 5 or more matches found: the image can be added to bundle adjustment using motion-only estimation.
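The three-way dispatch above can be sketched as follows. This is an illustration, not the dissertation's code: the text calls for a "weighted estimate" in case 2 without giving a formula, so the match-count-weighted average of the matched images' GPS positions used here is an assumption.

```python
def geolocate_query(matched):
    """Dispatch on verified feature matches for a query image.

    `matched` is a list of (num_matches, (lat, lon)) entries, one per
    database image matched to the query.  Returns the strategy name and,
    for the weighted case, an estimated (lat, lon)."""
    total = sum(n for n, _ in matched)
    if total == 0:
        # Case 1: cannot geo-locate beyond city scale.
        return "city-scale-only", None
    if total < 5:
        # Case 2: too few matches for Nister's 5-point algorithm;
        # fall back to a match-count-weighted GPS estimate (assumed form).
        lat = sum(n * g[0] for n, g in matched) / total
        lon = sum(n * g[1] for n, g in matched) / total
        return "weighted-gps", (lat, lon)
    # Case 3: enough matches to add the image to bundle adjustment
    # with motion-only estimation.
    return "bundle-adjust", None
```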

Fig. 4.5 - Geo-locating a new image using the SfM reconstructed image database.


After the above stage, we now have a camera pose determined in a virtual coordinate

space. This camera pose can be upgraded to world coordinates using the pre-computed

similarity transform found for the aerial reconstruction, which is also used for geo-

location of the merged aerial-ground reconstruction. The result is a camera pose in world

coordinates that can now be compared to the GPS metadata available with the query

image.

4.3 Fine-scale Geo-location Results

Aerial imagery was collected over Lubbock, Texas using a 66 MPix airborne multi-camera sensor, flying at an altitude of 800 meters in a circular race-track pattern, with a collection line-of-sight 30 degrees off nadir. Using a single 360-degree race-track consisting of 250 images, a 3D reconstruction was computed using the algorithm detailed in Section 4.2. It is worth noting that the 3D reconstruction algorithm does not require a 360-degree view of an area to perform a good reconstruction; the algorithm has been successfully tested on other, more general flight paths, such as straight fly-bys over an area. Ground video imagery was collected using a pickup truck with a 60 MPix multi-camera sensor mounted on top of the cabin roof. Of the 125,000 images collected, 34,400 images overlapped in coverage with the aerial platform and were used to perform a 3D reconstruction in the region of interest.

As the aerial data is collected from above at 30 degrees off nadir, one would expect the aerial data to capture the sides of many, but not all, of the buildings. On the other hand, the ground photos capture primarily the building facades. This might raise the concern that there is no overlap in certain areas when merging the ground and aerial models. In our experience, this situation did not arise in our data set, as most of the buildings are not very tall and are fairly well spaced apart. Such a situation might be of concern for a Manhattan-like city collect and could be resolved by imposing minimum thresholds on scene overlap and residual error.


4.3.1 Reconstruction and Geo-location of Aerial Video Imagery

The results of the 3D aerial reconstruction are qualitatively shown in Figure 4.6. The figure shows the Texas Tech campus, along with its distinctive football stadium. Multiple zoomed-in views of the stadium and other campus buildings are rendered using MeshLab [60] to capture the quality of the 3D reconstruction results. The dense reconstruction has approximately 23 million points, with a 20 cm ground sampling distance and a range resolution of approximately 1 meter (able to resolve the height of AC units on rooftops). One can visually determine that we found a good 3D metric reconstruction (preserving 90° angles), to within a similarity transformation.

Fig. 4.6 - Aerial 3D reconstruction of a 1x1 km area of Lubbock, Texas using a 250-frame 66 MPix video sequence. 3D rendering of the data was achieved using MeshLab.


The 3D model was geo-registered as described in Section 4.2, using the GPS metadata available for each video frame. The quality of the geo-registration was also verified using 3D Ladar truth data. Figure 4.7-A shows a geo-located 3D Ladar truth data set collected at an earlier date from a separate airborne sensor. The Ladar data is geo-located to within 0.5 meters and sampled on a rectangular grid with 50 cm grid spacing. The range/height resolution of the data is about 30 cm. Figure 4.7-B shows the 3D aerial reconstruction in the same coordinate space as the 3D Ladar data, so the viewer can get a rough comparison of the coverage area and observe that the two data sets appear well aligned. Figure 4.7-C shows the two data sets superimposed to qualitatively demonstrate that we obtained a good geo-registration.

Fig. 4.7 - Qualitative geo-registration results of aerial reconstruction. A) 3D Ladar map of Lubbock, Texas displayed using height color-coding, where blue represents low height and yellow/red represent increasing height. B) Geo-registered 3D aerial reconstruction. C) 3D Ladar truth data superimposed onto the 3D aerial reconstruction. Notice that there is no doubling of buildings or sharp discontinuities between the two data sets, indicating a good geo-registration.

Fig. 4.8 - Quantitative geo-registration results of aerial reconstruction. A) Histogram of initial geo-registration error, with a bias of 2.52 m indicating good geo-registration accuracy. The σ = 0.54 m indicates low internal geometric distortion (high precision). B) Histogram of geo-registration error after applying ICP. The bias has been reduced by 2.5 times, to 0.84 m.


A quantitative study is performed to determine geo-location accuracy and precision of the

3D aerial reconstruction by automatically aligning the aerial reconstruction to the 3D

Ladar truth dataset using an ICP algorithm from [59]. The results are shown in Figure

4.8-A: the bias of the geo-registration is 2.52 meters with σ=0.54 meters. The results

validate that we not only have good geo-location accuracy, to within about 3 meters, but

also indicate low geometric distortion of around 0.5 meters, which is on the order of the

accuracy of the truth data. It is noteworthy that the results in Figure 4.8-A are the geo-registration errors prior to ICP alignment; in this study, ICP is only used to find 3D correspondences and verify the initial goodness of geo-registration. Thus, using just the GPS metadata collected with the aerial video imagery, we obtained good geo-location accuracy to within 3 meters and high geo-location precision to within 0.5 meters. Figure 4.8-B shows the geo-registration statistics after ICP alignment, with the bias

reduced by about 2.5 times to 0.84m. Taken altogether, the results suggest the 3D aerial

reconstruction might be readily fused with 3D Ladar data to obtain higher-fidelity

products.
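The bias and σ figures above are first- and second-order statistics over per-correspondence 3D errors. The following is a minimal sketch of how such accuracy/precision numbers can be computed, assuming matched point pairs (e.g., ICP nearest-neighbor correspondences) are already available; this is our own illustration, not the thesis code:

```python
import numpy as np

def registration_error_stats(recon_pts, truth_pts):
    """Geo-registration statistics from matched 3D point pairs.

    The mean error magnitude estimates geo-location accuracy (bias);
    the standard deviation estimates precision (internal distortion).
    """
    errors = np.linalg.norm(recon_pts - truth_pts, axis=1)
    return errors.mean(), errors.std()

# Toy data: truth points shifted by a constant 2.5 m bias plus small jitter,
# mimicking a reconstruction with good precision but a residual geo-bias.
rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 100.0, size=(1000, 3))
recon = truth + np.array([2.5, 0.0, 0.0]) + rng.normal(0.0, 0.1, size=(1000, 3))
bias, sigma = registration_error_stats(recon, truth)
```

On the toy data, the bias recovers the injected 2.5 m shift while σ stays near the jitter level, mirroring how the 2.52 m / 0.54 m figures summarize the aerial comparison.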

4.3.2 Reconstruction and Geo-location of Ground Imagery

Ground video data was collected in Lubbock, Texas covering the same area as the

airborne sensor. The data was collected using five GPS-enabled, 12-Mpix Canon D5000 cameras with a 1 Hz frame-update rate. Figure 4.9-A shows the captured GPS locations

superimposed on a satellite image visualization using Google Earth. Figure 4.9-B shows

the overall 3D reconstruction, height-color coded in shades of purple-green-red. The

reconstruction was composed of 44 separate components with a total of 25 million points;

most of the components tended to capture individual streets, with reconstruction typically

stopping when reaching video data of busy intersections. To appreciate the 3D

reconstruction quality, Figures 4.9-C/D/E capture zoomed-in views of the reconstruction within the University of Texas at Lubbock, with color texture derived from the underlying RGB video frames.



Fig. 4.9 - Qualitative results of ground reconstruction. A) Ground-recorded GPS points overlaid onto a satellite image. B) Ground reconstruction captured from two different views in height-above-ground color-coding (lowest height corresponds to purple; blue/green/red correspond to increasingly higher altitudes). C,D,E) Zoomed-in views of the 3D reconstruction with RGB color-map information obtained from the reconstructed images.

Using the same procedure as for the aerial reconstruction, we geo-registered the 3D

ground model by comparing the GPS data captured for each frame to the bundle-adjusted

camera locations. Figure 4.10-A/B qualitatively capture the initial geo-location error


(prior to ICP alignment): comparison of the 3D Ladar data in Figure 4.10-A to the superimposed 3D ground reconstruction and 3D Ladar data in Figure 4.10-B reveals large geo-location errors, with doubling of building surfaces. The statistics of the geo-registration error prior to ICP alignment are shown in Figure 4.10-C, from which we can determine that we have poor geo-location accuracy, with a geo-registration bias of 9.63 meters, as well as poor geo-location precision, with σ=2.01 meters, indicating that

significant distortions exist within the model. Thus, due to poor GPS ground solutions,

the geo-registration of the ground reconstruction is significantly worse compared to the

3D aerial reconstruction. The reason for these higher geo-registration errors is the presence of slowly varying GPS bias and distortions due to multi-path effects [53], especially present in urban canyons formed by streets surrounded by tall buildings, as

demonstrated in Figure 4.10-D.
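The geo-registration procedure above compares bundle-adjusted camera centers against their per-frame GPS positions. One standard way to realize such an alignment is a least-squares rigid fit (Kabsch); the sketch below is an assumption on our part, not the exact thesis implementation (SfM pipelines often also estimate a scale factor, which is omitted here):

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): R, t minimizing ||R @ src + t - dst||."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # guard against reflections
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known rotation and translation exactly.
rng = np.random.default_rng(1)
cams = rng.uniform(-50.0, 50.0, size=(200, 3))       # bundle-adjusted centers
theta = 0.2
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
gps = cams @ R_true.T + np.array([10.0, -4.0, 2.0])  # per-frame GPS positions
R, t = fit_rigid_transform(cams, gps)
```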

A B

C D Fig. 4.10 - Initial geo-location of ground reconstruction. A) Qualitative view of the 3D Ladar truth data, B) Same 3D Ladar data superimposed with ground reconstruction showing doubling of buildings due to large geo-location bias. C) Histogram of geo-registration errors: the bias is 9.63m with a σ=2.01m. D) Example of GPS errors encountered amidst taller buildings, which lead to poor geo-location of ground data.


A timing comparison was also run to determine the improvement in computational speed

between the approach described in [43] versus our new method. The data was run on a

quad-core Xeon 2.8 GHz machine with 48 GB of RAM (this RAM size was needed to ensure that a dense 3D reconstruction with PMVS was achievable). Results are

summarized in Table 4.1. For the aerial video imagery, our approach resulted in a 3.6x

speedup. The speedup can be attributed directly to the improvements in feature matching based on video constraints, which reduced the feature-matching computational step from O(N²) to O(N). For the larger ground data set, our approach resulted in a

significant 14x speedup. The computational advantage embedded in our feature matching

step becomes more readily apparent with this larger image set.
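The O(N²) to O(N) reduction comes from exploiting video ordering: rather than matching every frame pair, each frame is matched only against a fixed-size temporal neighborhood. A schematic comparison of the candidate-pair counts (illustrative only; the actual matcher of course also compares the pairs' features):

```python
def candidate_pairs_all(n):
    """Exhaustive matching: every frame against every other, O(N^2) pairs."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def candidate_pairs_video(n, window=5):
    """Video-constrained matching: each frame against only its next
    `window` temporal neighbors, O(N * window) = O(N) pairs."""
    return [(i, j) for i in range(n) for j in range(i + 1, min(i + 1 + window, n))]

n_all = len(candidate_pairs_all(1000))     # 499,500 candidate pairs
n_vid = len(candidate_pairs_video(1000))   # 4,985 candidate pairs
```

The window size is a hypothetical parameter; the gap between the two counts grows linearly with the number of frames, which is why the advantage is far larger on the 34,400-image ground set than on the 250-image aerial set.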

Table 4.1
Timing comparison of 3D reconstruction method to prior state of the art

    Data Set                     Bundler v0.4    Our Method
    250-image aerial video       306 min         84 min
    34,400-image ground video    10,710 min      784 min

4.3.3 Geo-Registration of Combined Aerial-Ground Imagery

In order to obtain a better geo-location of the ground 3D data we apply the 3D Merge

algorithm described in Section 4.2.2. The result is a merged 3D aerial-ground

reconstruction that is now self-consistent. Figure 4.11-A/B quantitatively capture the

before and after merge intra-registration errors between the ground and aerial 3D data.

The overall bias is reduced by an order of magnitude, from 8.86m to 0.83m, while the intra-data distortion is reduced from σ=2.52m to σ=0.59m. Thus, the merge method was

able to successfully remove the bias term and also reduce the intra-registration distortion by 4x, producing a more self-consistent complete 3D city model.



Fig. 4.11 - Improvement of geo-location after applying the 3-D Merge algorithm. A) Intra-registration errors between the 3D ground and aerial data before 3-D Merge, with a bias of 8.86m and σ=2.52m. B) Remaining intra-registration errors after the 3-D Merge procedure, with a remaining bias of 0.83m and σ=0.59m. The merge method removed most of the bias term and reduced distortions within the ground data set by 4x, from σ=2.52m to σ=0.59m.

The combined aerial-ground 3D city model was verified against 3D Ladar data to

determine the final geo-registration error. Results indicate a geo-location accuracy (bias)

of 2.82m, with a geo-location precision of σ=0.74m. As expected, the final geo-location

accuracy of 2.82m is limited by how well the aerial data was initially geo-located, which

in our case was with an accuracy/bias of 2.52m. The overall geo-location precision, standing at σ=0.74m, is lower-bounded by both the geo-location precision of the aerial reconstruction (σ=0.54m) and the precision of the ground reconstruction after the

3D merge (σ=0.59m). Thus, the combined aerial-ground data set has geo-registration

accuracy to within approximately 3 meters with geo-location precision on the order of

1m. Figure 4.12 shows the merged aerial-ground 3D reconstruction.


Fig. 4.12 - Examples of merged aerial-ground reconstruction. Aerial results are shown in black-and-white imagery, with the ground data superimposed in color. Note that the aerial and ground data are very well registered, with no visible surface doubling present, visually confirming excellent aerial-to-ground registration.


4.3.4 Geo-Locating a New Image

In order to test the geo-location performance of image query against the reconstructed

database, a 500 image subset was kept for unit testing and not included in generating the

3D reconstructed image database. Furthermore, to prevent matching with other video imagery in close chronological proximity to the selected images, all images within a ±10-second time window were removed, as were all chronologically contiguous images within a ±20m spatial window (e.g., for an image where the vehicle was stopped for a prolonged period of time, all chronologically contiguous images with a GPS value within 20m of the chosen image are removed). The algorithm described in

Section 4.2.3 was applied to the subset of images to generate the results shown in Figure

4.13, which captures geo-location accuracy in terms of the cumulative percentage of images that meet each absolute geo-location accuracy threshold. The geo-location

performance is quite good, with 399 of the 500 (79.8%) images having enough matches

to obtain a pose estimate, with all such estimates having a geo-location accuracy under 84

meters. The remaining 101 images failed to have enough consistent matches, with the

most common failure modes being heavy occlusion of the camera by oncoming traffic, or turns through intersections, where most of the scene structure failed to be static.

About 24% of images aligned to within 5 meters, and 61% were geo-located within 20

meters. Furthermore, most of the images that were added to the bundle adjustment step

(as opposed to having only 3D location estimates) were shown to have better than 20

meter geo-location accuracy, with 53% of images having enough matches to attempt

bundle adjustment and 27% of images geo-located using only a 3D location estimate. It is

also important to note that the GPS truth data is not perfect and does contain significant GPS bias, which tends to make the geo-location error appear worse than it actually is. Nonetheless, the results are very encouraging; a large percentage of

the images have the potential to be added to the reconstructed image database in order to

further improve scene coverage and fidelity.
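The ±10-second / ±20-meter hold-out filtering described at the start of this section can be sketched as follows; the tuple layout and function name are our own, assumed for illustration:

```python
import math

def holdout_filter(frames, q_idx, t_window=10.0, d_window=20.0):
    """Build the matchable database for one held-out query frame.

    `frames` is a time-sorted list of (timestamp, x, y) tuples and q_idx
    indexes the query (the field layout is our assumption). Two rules:
    drop every frame within +/- t_window seconds of the query, and drop
    the chronologically contiguous run of frames whose GPS stays within
    d_window meters of the query (e.g., while the vehicle sat stopped).
    """
    qt, qx, qy = frames[q_idx]
    near = lambda f: math.hypot(f[1] - qx, f[2] - qy) <= d_window

    removed = {q_idx}
    i = q_idx - 1                        # contiguous spatial run, backwards
    while i >= 0 and near(frames[i]):
        removed.add(i)
        i -= 1
    i = q_idx + 1                        # contiguous spatial run, forwards
    while i < len(frames) and near(frames[i]):
        removed.add(i)
        i += 1
    for i, (t, _, _) in enumerate(frames):   # temporal window
        if abs(t - qt) <= t_window:
            removed.add(i)
    return [f for i, f in enumerate(frames) if i not in removed]

# Vehicle stopped at the origin for 60 s, then drives off at 10 m/s.
frames = [(float(t), 0.0, 0.0) for t in range(60)] + \
         [(60.0 + t, 10.0 * (t + 1), 0.0) for t in range(20)]
db = holdout_filter(frames, q_idx=30)
```

In this toy trace, the entire stopped interval is excluded along with the first frames of the drive-away, leaving only frames genuinely distant from the query.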


Fig. 4.13 - Fine-scale geo-location accuracy for a 500 image test subset. Geo-location performance is quite good, with 399 of the 500 (79.8%) images having enough matches to obtain a pose estimate, with all such estimates having a geo-location accuracy under 84 meters. The remaining 101 images failed to have enough consistent matches, with the most common failure modes observed being heavy occlusion of camera by oncoming traffic or turns through intersections, where most of the scene structure failed to be static.
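The cumulative curve of Figure 4.13 is straightforward to compute once per-image geo-location errors are available; a small sketch with made-up error values:

```python
import numpy as np

def cumulative_accuracy(errors_m, thresholds_m):
    """Percent of localized images whose geo-location error is at or
    under each threshold -- the quantity plotted as a cumulative curve."""
    errors = np.asarray(errors_m, dtype=float)
    return [100.0 * float(np.mean(errors <= th)) for th in thresholds_m]

# Toy error distribution (meters) standing in for per-image pose errors.
errs = [2.0, 4.0, 6.0, 12.0, 18.0, 25.0, 40.0, 80.0]
curve = cumulative_accuracy(errs, [5.0, 20.0, 84.0])
```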

4.4 Towards Improved Image and Scene Understanding

In the last three chapters, we have shown a complete system for image geo-location, starting at a world-wide level and working our way down to the city level and finally to the street level, or even an estimated camera pose. The developed system incorporated data from

different modalities, such as 2D GIS, aerial imagery, ground imagery as well as 3D

Ladar, to create a dense 3D world model representation, with enough statistics to be able

to geo-locate a new query image at varying levels of accuracy.

This system provides an immediate benefit in that the image can now be tagged as

coming from a particular region of the world, a particular country and even a particular


street. This additional metadata can be used as a prior for object detection and recognition, tailoring a particular object detection algorithm toward recognizing objects that might be found in that region of the world. However, considerably more information can be extracted about the content of the geo-located image. Consider the 3D city model

generated by the SfM image reconstruction; most of the pixels in the query image can

now be associated with absolute 3D locations, allowing for fusion and information

transference from other geo-located data sources for improved scene understanding.

From geometric context, we can readily classify pixels as being road, ground, trees and

buildings by back-projecting the classified 3D data into the query image plane to assign

pixels as belonging to particular scene classes. An example of such ground and building

classification using 3D imagery is shown in Figure 4.14-A. An example of road detection

using information transference from a 1D GIS road layer is shown in Figure 4.14-B [61].

Further fusion with other data sources, such as GIS layers and 3D Ladar data, can act as a further scene-understanding multiplier, with building names, business names and street

names now being associated with regions of the image, as demonstrated in Figure 4.14-C

[61]. Higher level scene understanding can also be gained, such as occlusion, missing

data reasoning and shadowing effects as demonstrated in Figure 4.14-D [62]. Using this

additional information, it might also be possible to better detect changes due to people, cars or new construction, as regions of the query image that match poorly to the 3D underlay back-projected into the query image camera space.
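The back-projection of classified 3D points into the query image plane follows the standard pinhole model x = K[R|t]X. A minimal sketch with hypothetical camera parameters and illustrative class labels (1 = ground, 2 = building):

```python
import numpy as np

def backproject_labels(points_3d, labels, K, R, t, img_shape):
    """Project labeled 3D points into a query camera, producing a sparse
    per-pixel class map (0 = unlabeled). Standard pinhole model
    x = K [R | t] X; all camera parameters below are illustrative."""
    h, w = img_shape
    cls = np.zeros((h, w), dtype=np.int32)
    cam = points_3d @ R.T + t                # world -> camera coordinates
    in_front = cam[:, 2] > 0                 # keep points ahead of the camera
    uvw = cam[in_front] @ K.T
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)
    for (u, v), lab in zip(uv, np.asarray(labels)[in_front]):
        if 0 <= u < w and 0 <= v < h:
            cls[v, u] = lab
    return cls

# Hypothetical camera: focal length 100 px, principal point (32, 24).
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0]])
labels = [1, 2]                              # e.g., 1 = ground, 2 = building
cmap = backproject_labels(pts, labels, K, R, t, (48, 64))
```

A dense label image would additionally require z-buffering and hole filling; this sketch only shows the projection step.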


Fig. 4.14 - Towards improved image understanding using information transference from other data modalities. A) Scene classification using 3D imagery: blue color-coding represents buildings; red and yellow represent ground. This classification can be back-projected into the query image plane to assign pixels a class label. B) Example of 1D GIS road network information transference to identify pixels that are roads. C) Further information can be transferred, such as building and street names, adding additional scene understanding. D) Occlusion, missing-data reasoning and shadowing effects can be inferred from the fusion of the 2D and 3D imagery. The figure shows 3D data in-painted with a single 2D image. Occluded areas of the 3D map are shown in two shades of gray, with dark gray representing shadowed areas computed based on the sun location at the time the image was collected. This additional information can be used to better detect or track moving objects, such as vehicles and persons, that might be present in the query image.


Chapter 5

3D Ladar Processing: An Extension to 2D Image Geo-location

Chapter Summary: In support of geo-accuracy validation for the fine-scale geo-location method, we present a novel 3D Ladar processing method using data collected by an airborne 3D Ladar sensor. Data collected by 3D Laser Radar (Ladar) systems, which utilize arrays of avalanche photo-diode detectors operating in either Linear or Geiger mode, may include a large number of false detector counts or noise from temporal and spatial clutter. We developed an improved algorithm for noise removal and signal detection, called Multiple-Peak Spatial Coincidence Processing (MPSCP). Field data, collected using an airborne Ladar sensor in support of the 2010 Haiti earthquake operations, were used to test the MPSCP algorithm against the current state-of-the-art, Maximum A-Posteriori Coincidence Processing (MAPCP). Qualitative and quantitative results are presented to determine how well each algorithm removes image noise while preserving signal and reconstructing the best estimate of the underlying 3D scene. The MPSCP algorithm is shown to have a 9x improvement in signal-to-noise ratio, a 2-3x improvement in angular and range resolution, a 21% improvement in ground detection and a 5.9x improvement in computational efficiency compared to MAPCP.

5.1 Background and Related Work

Three-dimensional Laser Radar (3-D Ladar) sensors output range images, which provide

explicit 3-D information about a scene [63][64][65]. MIT Lincoln Laboratory has built a

functional airborne 3-D Ladar system, with an array of avalanche photo-diodes (APDs)

operating in Geiger mode, that actively illuminates an area using a passively Q-switched micro-chip laser with a short pulse width [66][67]. On each laser pulse, light from the laser travels to the target area; some is reflected back and detected by an array of Geiger-mode APDs. Figure 5.1 captures the 3D Ladar system concept.

Recent field tests using the sensor have produced high-quality 3-D imagery of targets for

extremely low signal levels [68][69]. Though there are many advantages to using single-

photon sensitive detector technology, the data collected using these Geiger-mode APDs

are often noisy with unwanted temporal or spatial clutter. It has been shown in previous

publications that by identifying spatial coincidences in data from as few as three laser

pulses, we can significantly reduce the probability of false alarms by several orders of

magnitude [70][71][72].


Fig. 5.1 - 3D Laser Radar (Ladar) system concept. A laser sends out a pulse of light to a target. Some of that light is reflected back and detected by an APD array. The time of flight between the send and receive of the laser pulse is recorded and converted to metric units to create a range image.

The method of finding signal in the presence of noise and clutter by using coincident

spatial data is known as coincidence processing. The more 3D points returned at the same

spatial location, the more likely that the points came from a real scene surface. In this

chapter, we discuss the implementation of a novel processing algorithm, known as Multiple-Peak Spatial Coincidence Processing (MPSCP), and test it against the current state-of-the-art Maximum A-Posteriori Coincidence Processing (MAPCP) algorithm [73]. The contributions of this chapter are as follows:

1. A set of general methods to address typical 3D Ladar processing challenges that are

relevant to most 3D Ladar sensor systems (Linear and Geiger mode).

2. An improved 3D Ladar filtering algorithm that is shown to have a significant

improvement over current state-of-the-art, with qualitative and quantitative results

shown.
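The core coincidence idea, counting returns per spatial cell and keeping only well-supported ones, can be sketched with a fixed threshold (MPSCP, by contrast, sets the threshold adaptively from the locally expected output level; voxel size and threshold below are illustrative):

```python
from collections import Counter

def coincidence_filter(points, voxel=0.5, min_count=3):
    """Keep 3D points whose voxel accumulates at least min_count returns.

    Bare-bones version of the idea: real surfaces yield repeated returns
    at the same spatial location across laser pulses, while noise rarely
    does. A fixed threshold is used here for illustration.
    """
    key = lambda p: (int(p[0] // voxel), int(p[1] // voxel), int(p[2] // voxel))
    counts = Counter(key(p) for p in points)
    return [p for p in points if counts[key(p)] >= min_count]

surface = [(10.0 + 0.01 * i, 5.0, 2.0) for i in range(5)]  # 5 coincident hits
noise = [(1.0, 2.0, 3.0), (7.0, 8.0, 9.0)]                 # isolated returns
kept = coincidence_filter(surface + noise)
```

The five coincident surface returns survive while the isolated noise points are discarded, matching the several-orders-of-magnitude false-alarm reduction reported for as few as three coincident pulses.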

In the remainder of this chapter, we first discuss the challenges of processing 3D Ladar

data and describe how our algorithm addresses those challenges. Quantitative and


qualitative results are shown using data collected over Haiti in support of earthquake

rescue operations using the Airborne Ladar Research Test-bed (ALIRT) platform [74].

5.2 3D Ladar Background

There exists a cause-effect relationship between 3D Ladar system design and data collection methods and the inherent processing challenges that arise and need to be subsequently addressed. First, we introduce the 3D Ladar system design, in which an airborne sensor stares at a pre-designated ground target from multiple perspectives in order to get a more

complete measurement of the scene. This concept of operations is shown in Figure 5.2.

Fig. 5.2 - 3D Ladar concept of operations. An airborne platform stares at a pre-designated ground target from multiple perspectives in order to get a more complete measurement of the scene.

Due to limitations in APD array size (number of pixels), for each viewing perspective the

detector array needs to be scanned using a sinusoid pattern in angle-angle space to get a

higher resolution 3D image of the target. Given the above system design and data

collection methods, a list of processing challenges becomes apparent:

1. Varying signal and noise levels: the scanning pattern can lead to large variations

in the absolute output level (3D point density). Background light and/or detector thermal

excitation can lead to 3D noise points with high spatial coincidence that need to be

filtered out. There is a need to know the output level to dynamically determine the

statistical significance of coincident returns.

2. Photon attenuation: Obscurants in the range direction might reduce the

probability of transmitted photons reaching a ground target. Knowledge of photon


attenuation can be used to dynamically adjust statistical significance of coincident

returns.

3. Detector-specific range attenuation: due to the nature of Geiger-mode APDs, once a

pixel is triggered at a closer range, no hits at further ranges are possible as the pixel needs

to be reset, leading to output level attenuation in the range direction.

4. Laser-detector Point Spread Function (PSF) can lead to 3D blurring of imaged

objects. A method is needed to de-blur the 3D image.

5. Platform attitude errors (GPS/INS) can add further blur to the 3D image.

6. Platform motion and signal aggregation from multiple perspectives: To increase

signal-to-noise (SNR) level, a method is needed to evaluate output level from each

perspective that contributes to a particular 3D location.

7. Automatically determining optimal processing parameters: a data-driven method is needed to obtain good, reproducible, single-run results without human intervention.

Before further discussing and addressing each individual challenge, it is crucial to notice

that most of these challenges are related by a common denominator: sensor line of sight

(LOS). Variations in signal/noise levels are orthogonal to the LOS, while photon/detector

signal/noise attenuation are along the LOS. The Laser-Detector PSF is oriented along the

LOS direction: error in range due to laser-detector timing jitter is, by definition, aligned to the LOS, while platform attitude errors, such as GPS and inertial navigation (INS) errors, can be readily thought of as orthogonal to the LOS.

The crucial insight is that processing in an appropriate line-of-sight coordinate system

plays an important role in decoupling the effects of the various processing challenges

listed above, so that each challenge can be independently addressed. Depending on

airborne platform velocity, range-to-target and target collection size, a LOS coordinate

system can be chosen to approximate the true line-of-sight while avoiding the


computational expense of ray tracing each individual APD array LOS vector and storing

the information in a 3D volumetric signal map.


Fig. 5.3 - Line-of-sight (LOS) coordinate systems for various sensor platforms. A) For slow-moving platforms, a spherical coordinate system (angle-angle-range) gives a good approximation of the LOS while B) a skewed-cylindrical coordinate space (heading-angle-range) gives a good approximation of the LOS for fast-moving airborne platforms.

Figure 5.3 depicts several LOS coordinate systems that might be used. For instance, for airborne platforms that are slow-moving in comparison to the range-to-target distance and target area, a sensor-centered spherical coordinate system (angle-angle-range) can

best approximate the collection volume which tends to resemble a solid angle. For fast-

moving airborne platforms, where target area size is on the same order as platform

motion, a skewed-cylindrical coordinate system can best approximate the LOS

independent of range-to-target. Another possibility to consider for small target areas is an

inverted target-centered spherical coordinate system.

Since the ALIRT system is hosted on a fast-moving airplane and uses the airplane’s

forward heading motion to scan a target area of the same approximate size, we utilize a

skewed-cylindrical coordinate system to best approximate the LOS.


5.3 3D Ladar Processing Approach

The proposed MPSCP algorithm advances current state of the art in 3D Ladar processing

by addressing each of the processing challenges noted in Section 5.2. Figure 5.4-A/B

shows an example of raw 3D data to allow the reader to visually appreciate the large

amount of noise and clutter present.

The noisy 3D data is initially stored in the Universal Transverse Mercator (UTM) coordinate system, a 3D space that can be locally approximated as a Cartesian coordinate space [75].

The first processing step of MPSCP is to transform the data from UTM space to an

appropriate line of sight space, which for our sensor is a skewed cylindrical coordinate

space. Using metadata, such as airborne sensor position, a LOS coordinate basis is

created and the data is transformed. We now proceed to explain how utilizing this data-defined LOS coordinate space leads to improved computational efficiency as well as improved 3D filtering results.
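As a simplified illustration of the UTM-to-LOS transform, the sketch below assumes a single straight-line, non-vertical heading; the real system derives the axis from per-frame platform metadata rather than one fixed vector:

```python
import numpy as np

def utm_to_skewed_cylindrical(points, origin, heading):
    """Map locally-Cartesian UTM points into a skewed-cylindrical LOS
    frame (heading, angle, range). Simplified: `heading` is one unit
    along-track direction (assumed non-vertical); the along-track
    coordinate is the projection onto it, and the perpendicular
    residual supplies the angle/range pair."""
    u = np.asarray(heading, dtype=float)
    u /= np.linalg.norm(u)
    d = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    h = d @ u                                # along-track (heading) coordinate
    perp = d - np.outer(h, u)                # cross-track component
    r = np.linalg.norm(perp, axis=1)         # range along the LOS
    e1 = np.array([0.0, 0.0, 1.0])           # reference "up" for the angle
    e1 -= (e1 @ u) * u
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(u, e1)
    ang = np.arctan2(perp @ e2, perp @ e1)   # angle about the heading axis
    return np.stack([h, ang, r], axis=1)

pts = np.array([[10.0, 0.0, -5.0], [20.0, 3.0, -4.0]])
los = utm_to_skewed_cylindrical(pts, origin=[0.0, 0.0, 0.0],
                                heading=[1.0, 0.0, 0.0])
```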


Fig. 5.4 - Raw Ladar data showing salt-and-pepper noise. A) Height color-coded example of raw 3D data (dark grey: low altitude; white: high altitude). The target area is obscured by the heavy amount of noise. B) Zoomed-in version of the same data set, showing the measured 3D structure embedded in high levels of noise and clutter.

The next MPSCP processing step is to determine expected 3D output level. MPSCP uses

the output level estimate to determine statistical significance of spatially coincident

returns. Statistical significance is determined in terms of a maximum likelihood estimator


given the expected output level. In this fashion, the MPSCP algorithm can dynamically

adjust its internal noise suppression thresholds to work well under most signal conditions.

Variations in output level due to the scanning pattern, photon attenuation, as well as detector attenuation, can be accurately accounted for using the data-defined LOS coordinate

system. We first determine an initial output level, Oinitial, due to variations in scan pattern

dwell times. In our LOS coordinate system, Oinitial varies in only 2 of the 3 dimensions,

namely heading and angle, but not range. Compared to output level estimation in 3D Cartesian coordinates, which would have required computationally expensive 3D ray

tracing and storage of a volumetric 3D array of values, the problem of estimating output

level reduces to a 2D matrix in heading-angle space using our LOS coordinate system.

This leads to increased computational efficiency as well as a simpler implementation with significantly lower memory overhead. An example of the computed

Oinitial output level map is shown in Figure 5.5, with output level back-projected to 3D

from 2D heading-angle space on a per raw 3D point basis.
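A sketch of the O_initial estimate: in the LOS frame the scan-pattern density varies only over heading and angle, so a 2D histogram replaces a full 3D volumetric map, and each raw point inherits the level of its cell (the back-projection step). This is our illustration of the idea, not the thesis code, and the bin counts are arbitrary:

```python
import numpy as np

def initial_output_level(los_points, h_bins=32, a_bins=32):
    """Estimate O_initial as 3D point density over a 2D heading-angle grid.

    The scan-pattern dwell variation has no range dependence in the LOS
    frame, so a 2D histogram suffices; each raw point then inherits the
    level of its (heading, angle) cell.
    """
    h, a = los_points[:, 0], los_points[:, 1]
    grid, h_edges, a_edges = np.histogram2d(h, a, bins=(h_bins, a_bins))
    hi = np.clip(np.digitize(h, h_edges) - 1, 0, h_bins - 1)
    ai = np.clip(np.digitize(a, a_edges) - 1, 0, a_bins - 1)
    return grid, grid[hi, ai]          # 2D level map, per-point level

rng = np.random.default_rng(2)
pts = rng.uniform(0.0, 1.0, size=(5000, 3))   # (heading, angle, range) points
grid, per_point = initial_output_level(pts)
```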


Fig. 5.5 - Raw 3D Ladar point cloud color-coded by scan-pattern-induced output variation. A) Side-view of a target area, showing a notional scan pattern and the estimate of Oinitial (dark grey: low value; white: high value) obtained using the LOS space. B) Heading-angle view of the same target area. The output level is shown to be accurately estimated, with high output levels (white) at the edges of the sinusoid scan pattern due to decreased angular velocity as the scan mirror changes direction, leading to an increase in 3D point density.

The output level is also affected by photon attenuation as well as detector attenuation in

range. Photon attenuation due to line-of-sight blocking needs to be taken into account


when computing the statistical significance of coincident returns. Detector-specific output

attenuation in range has a similar effect as photon attenuation, reducing expected output

level at further range values along the LOS. To account for these data-dependent effects,

the data at a particular heading-angle location (which can be visually represented as a

chimney of data in the range direction) is binned into a range histogram H. For each

range bin i of histogram H, MPSCP keeps track of returns that have occurred at closer

ranges versus returns that have occurred at further ranges to determine an expected output

attenuation value. Equation 5.1 numerically captures the method for determining

attenuated output level, Oattenuated, as a function of range along the LOS, while Figure 5.6

visually describes the method to account for photon and detector range attenuation

effects.

Oattenuated(i) = Oinitial · (1 − CH(i) / CH(N))    [Eq. 5.1]

where Oattenuated(i) is the attenuation-corrected expected output level, Oinitial is the expected output level at a particular heading-angle location [h1, a2] determined solely based on the scan pattern (no attenuation correction), CH(i) is the cumulative histogram of range histogram H at range bin i, and N is the last (furthest) range histogram bin.
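Per heading-angle "chimney", Eq. 5.1 scales the scan-pattern output level by the fraction of the cumulative range histogram not yet consumed at closer ranges. A small sketch (whether bin i itself counts as "closer" is a convention the text leaves open; the histogram values are toy data):

```python
import numpy as np

def attenuated_output_level(range_hist, o_initial):
    """Eq. 5.1 for one heading-angle chimney: scale the scan-pattern
    output level by the fraction of returns not already consumed at
    closer range bins (cumulative histogram C_H)."""
    H = np.asarray(range_hist, dtype=float)
    C = np.cumsum(H)                      # C_H(i), with C[-1] = C_H(N)
    return o_initial * (1.0 - C / C[-1])

hist = [0, 4, 0, 4, 0]                    # returns binned over 5 range bins
o_att = attenuated_output_level(hist, o_initial=10.0)
```

Each burst of returns lowers the expected output level for all range bins behind it, capturing both photon blocking and Geiger-mode detector reset effects.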


Fig. 5.6 - Method for correcting for photon and detector range attenuation effects. A) Example range histogram and cumulative range histogram for a localized heading-angle location. The cumulative histogram is used to compute an attenuated output level, relative to the initial level, in order to account for photon and range blocking effects. B) Side-view of a raw 3D Ladar point cloud color-coded by output attenuation in range, and C) view orthogonal to the LOS, showing the effects of photon/detector attenuation as a function of range. Notice how the output attenuation changes from low (dark grey) to high (white) as the line-of-sight passes through obscuration.

Having determined the expected output level, a method is needed to find spatially

coincident returns to distinguish signal from noise. Spatial coincidence of points is

affected by the laser-detector 3D Point Spread Function (PSF) as well as platform attitude

errors, leading to blurring of the 3D image. The laser-detector 3D (angle-angle-range)

PSF can be decoupled into the angular response to a step-response in range (such as a

ground to building edge), followed by the range response to a flat surface. Figure 5.7

captures the methodology used to determine the 3D PSF. MPSCP uses the PSF as a 3D

matched filter to integrate signal and find 3D locations that have enough returns to be

considered statistically significant. Since our LOS coordinate system is already well

aligned to the 3D PSF, the 3D matched filter can be efficiently applied to the data. The

matched filter is also used for sub-voxel estimation of the filtered return 3D location,

effectively removing the PSF-induced blur.
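The matched-filter idea can be illustrated as follows, assuming a pre-binned 3D photon-count volume and a given PSF kernel (for example a separable Gaussian); all names are hypothetical and this is a sketch, not the MPSCP implementation:

```python
import numpy as np
from scipy import ndimage

def matched_filter_detect(counts, psf, threshold):
    """Correlate a binned 3D photon-count volume with the laser-detector
    PSF and keep voxels whose integrated signal exceeds a noise
    threshold (e.g., one derived from the expected output level)."""
    score = ndimage.correlate(counts, psf, mode="constant")
    return score, score > threshold

def subvoxel_peak(counts, idx, radius=1):
    """Sub-voxel location estimate: centroid of the counts in a small
    window around a detected voxel `idx` (assumed non-empty)."""
    lo = [max(i - radius, 0) for i in idx]
    hi = [i + radius + 1 for i in idx]
    window = counts[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    com = ndimage.center_of_mass(window)
    return [l + c for l, c in zip(lo, com)]
```

Because the volume is binned in the LOS coordinate system, the kernel axes line up with the PSF axes, which is what makes this correlation cheap to apply.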


Fig. 5.7 - Computation of the laser-detector 3D point spread function. A) Angular response to a 3D edge and its Gamma-function fit, B) range response to a flat plate and its associated Gaussian fit.

Another source that affects the spatial coincidence of points is platform attitude errors.

These errors occur due to drift in the GPS/INS solution, as well as due to errors induced

by the scanning hardware: such errors occur during changes in view-point perspective,

which require a sharp step-response in angular space from the scanning hardware. Due to

insufficient bandwidth, small angular errors can occur. These angular errors, combined

with GPS/INS drift, can lead to blurring that can be several times bigger than the 3D

PSF-induced blur. MPSCP corrects for these blurring errors by employing a two-stage

filtering process. Figure 5.8 shows the overall MPSCP processing block diagram. In the

first stage filter, data from each single viewpoint is processed independently: starting

with a noisy 3D data set per viewpoint, a unique data-defined LOS coordinate system is

created and the data is processed along the line of sight to produce a filtered 3D data set

per viewpoint. A secondary output is also created, which consists of the original 3D

noisy data appended with LOS statistics per point, such as the expected output level value

as visualized in Figures 5.5 and 5.6. Using the single-viewpoint 3D filtered data sets, we

align all the data sets to a single reference view, chosen as the data set with the largest

amount of data. The alignment method uses a variant of the Iterative Closest Point

[58][59] algorithm with six degrees of freedom (3D rotation and 3D translation), which

produces alignment results with sub-pixel accuracy. The six-degree-of-freedom transformation is also

applied to the raw 3D point cloud data that has been appended with LOS statistics per

point. To detect weak signals that might have been missed when processing data on a


single-viewpoint basis, a second-stage filter takes the aggregated, de-blurred, multi-

viewpoint data set and processes the data in a similar manner to the first-stage. Since the

data is taken from multiple perspectives, MPSCP defaults to using a UTM-aligned 3D

Cartesian coordinate space to process the aggregated data. The second-stage coincidence

processor uses the expected output level saved on a per-point basis from the first stage

filter to determine a statistical noise threshold to filter the multi-viewpoint aggregated

data set. Compared to the first stage LOS filter, in the multi-viewpoint second stage the

noise filtering and detection is performed along the Z direction using a histogram

comprised of a vertical chimney of data.
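The viewpoint-to-viewpoint alignment step described above can be sketched as a minimal point-to-point ICP loop; the Kabsch-based rigid fit and the loop below are an illustrative simplification of the six-degree-of-freedom variant cited above, not the thesis code:

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_align(src, dst):
    """Least-squares rigid fit (Kabsch): rotation R and translation t
    minimizing ||R @ p + t - q|| over matched point pairs (p, q)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(src, ref, iters=30):
    """Minimal point-to-point ICP: match every source point to its
    nearest reference point, fit a rigid transform, repeat."""
    tree = cKDTree(ref)
    cur = src.copy()
    for _ in range(iters):
        _, idx = tree.query(cur)
        R, t = rigid_align(cur, ref[idx])
        cur = cur @ R.T + t
    return cur
```

In the MPSCP pipeline the recovered transform is also applied to the raw, statistics-appended point cloud so that the second-stage filter sees de-blurred data.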

As shown in the block diagram, MPSCP requires a single input parameter: processing

resolution in meters. This processing resolution is used to automatically determine a

binning size in the LOS coordinate space for the first-stage coincidence processing filter

as well as the 3D PSF matched filter size. An accurate output level estimate is computed

directly from the data, which takes into account photon and detector output attenuation

effects, allowing MPSCP to dynamically adjust its noise-suppression thresholds to filter

most of the noise while keeping weak signals. In comparison, the MAPCP algorithm,

which represents the current state of the art in 3D Ladar processing, does not have this

type of automatic, data-dependent parameter tuning, with the user required to manually

determine the size of the 3D matched filter and manually choose an optimum threshold.

This typically requires multiple runs per data set for an operator to determine a good set

of parameters. Furthermore, since MAPCP does not take into account photon or detector

output attenuation effects, the algorithm has difficulty in keeping weak signals in low

output level regions while at the same time removing noise in high output level regions.


Fig. 5.8 - MPSCP algorithm block diagram. MPSCP has a two-stage filtering process. Data from each viewpoint is first processed independently in its own unique LOS coordinate system. Two outputs are created, namely a filtered 3D data set per viewpoint and the original noisy 3D data with LOS statistics, such as output level on a per point basis. The individual filtered data sets are aligned to remove attitude errors, with the transform applied to the noisy 3D data set. A second filtering stage ingests the aggregated data set to detect weak signals that might have been missed by the first stage filter, leading to a final 3D filtered output.

5.4 3D Filtered Results and Discussion

The MPSCP algorithm was tested against MAPCP on multiple data sets collected over

Port-Au-Prince, Haiti as part of the 2010 earthquake response. The data was used to

determine the navigability of streets as well as to quickly respond to population

movement into tent cities that sprang up overnight. By accurately counting the

number of tents, an accurate assessment could be made of the quantity of essential

supplies for each tent city.

 

5.4.1 Qualitative Results

Figure 5.9-A shows height-intensity color-coded MAPCP results for a target-mode data

set collected from multiple perspectives. Figure 5.9-B shows the MPSCP results for

visual comparison. From the results, one can visually discern that MPSCP has


significantly better angular resolution as well as range resolution compared to MAPCP,

with sharp palm tree branches, sharper building edges and car shapes better resolved. In

addition, the MPSCP results have almost all the noise removed, while the MAPCP

algorithm still has a large amount of noise present (visually seen as salt-and-pepper noise

above road, other open areas). In Figure 5.9-C/D, we are showing the same data set, now

zoomed-in and cropped in the z-direction to reveal the presence of tents. The MPSCP

results, shown in Figure 5.9-D, demonstrate improved 3D scene coverage and

reconstruction under weak signal conditions compared to MAPCP.


Fig. 5.9 - Visual comparison of MAPCP versus MPSCP results on a Haiti tent city data set collected in January 2010. A) MAPCP results and B) MPSCP results. C) Zoomed-in view of the center of the target area showing the tent city under obscurant using MAPCP, and D) same view of MPSCP results. The MPSCP results are shown to have less noise, sharper edges with less blurring on buildings, cars, and palm trees lining the street, and better 3D scene coverage in weak signal areas under obscuration (fewer no-signal voids, shown as black pixels in the image).


5.4.2 Quantitative Results

Using metrics developed by Lopez et al. [73], we quantitatively evaluated the data sets

shown in Figure 5.9. Signal-to-noise ratio (SNR) was measured in a flat area out in the open:

processed 3D points that fell within a height envelope above and below the ground were

considered valid detections; points above or below were considered noise. MPSCP had

an SNR of 97x while MAPCP had an SNR of 10.8x. MPSCP has a 9x improvement in

SNR, close to an order of magnitude better than MAPCP.
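A sketch of this height-envelope SNR metric (the function name and the envelope half-width are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def height_envelope_snr(z, ground_z, half_width=0.5):
    """Points within +/- half_width of the ground height count as valid
    detections; everything else in the flat open area counts as noise."""
    z = np.asarray(z, dtype=float)
    signal = np.abs(z - ground_z) <= half_width
    noise = np.count_nonzero(~signal)
    return np.count_nonzero(signal) / max(noise, 1)
```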

Figure 5.10-A shows the results of a line spread function (LSF) metric to evaluate

angular resolution. The LSF results indicate that MPSCP has a 3x improvement in

angular resolution. Range resolution was measured by segmenting out the roof-top of a

building, followed by slope-bias removal using principal-component analysis to align the

plane normal axis to the z, up direction. The resulting MAPCP and MPSCP range

histograms are shown in Figure 5.10-B; MPSCP has a 2x improvement in range

resolution. Ground scene reconstruction was also evaluated, as shown in Figures 5.10-C

and 5.10-D. The 3D data was cropped in the z direction to include only 3D returns on the

ground and tents; the data was binned in the x-y directions to create a binary filled vs.

empty pixel image. Results indicate that MPSCP found 21% more ground cover

compared to MAPCP.
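The two scene metrics above, slope-bias removal for the range-resolution measurement and x-y occupancy binning for ground coverage, could be sketched as follows (function names, cell size, and height band are illustrative assumptions):

```python
import numpy as np

def deslope_roof(points):
    """Fit the dominant plane of a segmented roof patch with PCA and
    rotate the patch so the plane normal aligns with +z; the residual
    z values can then be histogrammed to measure range resolution."""
    pts = points - points.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    n = vecs[:, 0]                    # smallest component = plane normal
    if n[2] < 0:
        n = -n
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)                # rotate n onto z (Rodrigues)
    s, c = np.linalg.norm(v), n @ z
    if s < 1e-12:
        return pts                    # already z-aligned
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    R = np.eye(3) + K + K @ K * ((1 - c) / s ** 2)
    return pts @ R.T

def ground_coverage(points, z_lo, z_hi, cell=0.5):
    """Crop returns to a height band (ground and tents), bin x-y into
    cells, and report the filled fraction of the scene footprint."""
    p = points[(points[:, 2] >= z_lo) & (points[:, 2] <= z_hi)]
    x_edges = np.arange(points[:, 0].min(), points[:, 0].max() + cell, cell)
    y_edges = np.arange(points[:, 1].min(), points[:, 1].max() + cell, cell)
    counts, _, _ = np.histogram2d(p[:, 0], p[:, 1], bins=[x_edges, y_edges])
    filled = counts > 0
    return filled.mean(), filled
```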

The improved ground signal detection of MPSCP compared to MAPCP, while retaining

high-frequency information out in open areas, can be attributed to the use of dynamic

thresholding based on an accurate output-level estimate that takes into account photon

and detector output attenuation effects due to obscuration. The use of dynamic

thresholding allows the MPSCP algorithm to detect weak signals under obscuration,

while still removing heavy noise in high output level areas. By contrast, MAPCP does not

employ data-driven noise thresholding, leading the algorithm to have difficulty in

keeping weak signals in low output level areas while at the same time removing noise in

high output level areas.

 


Fig. 5.10 - Coincidence processing quantitative results. A) MAPCP vs. MPSCP line spread function (LSF), showing that MPSCP has an improvement of about 3x in angular resolution. B) MAPCP range resolution versus MPSCP range resolution, showing a 2x improvement in the MPSCP result. C) Ground coverage for MAPCP and D) MPSCP, with voids shown as black pixels. MPSCP recovered 21% more ground cover compared to MAPCP.

A timing analysis was run on 4 multi-viewpoint data sets using a 12 core Intel Xeon

3GHz machine. Both MPSCP and MAPCP were run at the same processing resolution

with the default processing parameters. The overall conclusion from the timing results is

that MPSCP is about 6 times faster than MAPCP. Besides extensive testing on 4 multi-

viewpoint data sets collected in Haiti, the algorithm has been successfully tested on a

large scale 3D map data set covering approximately 30 square km of Port-Au-Prince,

Haiti. The MPSCP algorithm produced good, single-run results without the need for

parameter tweaking. The removal of the need for human intervention is of tremendous

importance for algorithm scalability to the large amounts of 3D Ladar data sets generated

in the field.


In summary, we have described a set of general methods to process 3D Ladar data that

are relevant to most 3D Ladar sensor systems, with either Linear-mode or Geiger-mode

APDs. We have also described in detail a novel 3D Ladar filtering algorithm that is

shown to be a significant improvement over the current state of the art. Qualitative results

indicate sharper 3D images with building and tree structure better resolved. The algorithm

was also able to remove more noise while preserving weak signal areas as visually

demonstrated in the form of improved ground coverage under obscuration. The use of

automatic, data-driven parameter tuning allows MPSCP to produce good, single-run

results without the need for human intervention.


Chapter 6

Conclusion

In this chapter, we summarize the contributions of the research work in this thesis, review

recent developments in the literature and discuss promising directions for future research.

6.1 Contributions

Image geo-location on a world-wide scale is a very challenging problem. Besides being

an interesting problem in itself, it can be tremendously useful for many other vision tasks,

such as image retrieval, object detection and recognition. For instance, the distribution of

likely geo-locations of a particular image provides additional context, such as terrain

type, population density, and prominent cultural markers. This additional metadata can be

used as priors for object detection and recognition to tailor a particular object detection

algorithm at recognizing objects that might be found in that particular region of the

world.

For this research, we developed a hierarchical image geo-location and 3D reconstruction

framework using a coarse-to-fine localization approach on a 6.5 million image

database. By design, the approach presented is scalable to larger databases and may be

highly beneficial for many research communities; such communities include, but are not

limited to, online social networking sites, intelligence agencies and companies dealing

with large-scale data mining.

The presented approach starts off with a coarse geo-location method, where a query

image is roughly geo-located to a particular region of the world by classifying the terrain

type in that particular image. To achieve the goal of image geo-location by terrain

classification, we first create a 3D world model representation composed of a large

training database of geo-tagged, terrain labeled images. This database is created by

merging knowledge from three publicly available databases, namely a geo-spatial terrain

type and land coverage database, a 6.5 million image database that is only geo-tagged and


a database of terrain-labeled images. We developed a coarse geo-location method that

uses the generated 3D world model to test a hold-out set of 5000 images. We

demonstrated an improvement over current state of the art in terrain classification, with

over 91% terrain classification accuracy, a significant improvement of 5.72% over

the baseline. The proposed method has several advantages over prior approaches [8], in

that the method is robust to images with noisy geo-labels, works in a low dimensional

feature space to avoid the curse of dimensionality [9] and reduces the database size in

order to allow for more complex follow-on stages to be computationally tractable.

A medium scale geo-location method was implemented that improves upon previous

image retrieval techniques to geo-locate a query image to city-level accuracy. We

developed an improved KNN-SVM approach that is not only computationally tractable,

but also provides significantly improved classification performance over a KNN only

method. The hierarchical coarse and medium geo-location framework was tested on a

geo-tagged 6.5 million image database and demonstrated to have a relative improvement

of 10% in geo-location accuracy compared to previous methods applied up to city level

geo-location. Results summarizing the coarse and medium geo-location method are

published in [76].
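The KNN-SVM idea, shortlist candidates with KNN and then train a local SVM on just that neighborhood, can be sketched as below. A tiny Pegasos-style solver stands in for a production SVM trainer, and all names and parameters are illustrative, not the thesis implementation:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Tiny Pegasos-style linear SVM (binary labels in {-1, +1})."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # fold bias into weights
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (w @ Xb[i])
            w = (1 - eta * lam) * w             # regularization shrink
            if margin < 1:                      # hinge-loss violation
                w = w + eta * y[i] * Xb[i]
    return w

def knn_svm_classify(query, X, labels, k=10):
    """Shortlist the k nearest training samples, then train a local
    one-vs-rest SVM on just that neighborhood and score the query."""
    nn = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
    Xn, yn = X[nn], labels[nn]
    classes = np.unique(yn)
    if len(classes) == 1:                       # neighborhood is pure
        return classes[0]
    qb = np.append(query, 1.0)
    scores = [train_linear_svm(Xn, np.where(yn == c, 1, -1)) @ qb
              for c in classes]
    return classes[int(np.argmax(scores))]
```

Training the SVM only on the local neighborhood is what keeps the approach computationally tractable on a multi-million image database.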

Once we have geo-located a query image to a particular city, we go to the final step in the

geo-location progression by attempting to estimate the pose from where that particular

image was taken. To achieve this, we first process a training data set using structure-from-motion (SfM) techniques, where we take our training images for a particular city,

find feature correspondences and upgrade our correspondences to 3D locations to create a

3D model of the city scene. The relative camera poses, along with the 3D reconstruction,

are then geo-located using GPS image metadata that might be available with a subset of

the training images in our city-wide image database. A query image can then be geo-

located and attached to the training image database using a similar SfM procedure. Our

contribution to the SfM research area is to develop an efficient method to do 3D

reconstruction on a city-wide scale using ground video imagery as well as aerial video

imagery in order to compute a more complete and self-consistent geo-registered 3D city

model. The reconstruction results of a 1x1km city area, covered with a 66 Mega-pixel


airborne system along with a 60 Mega-pixel ground camera system, are presented and

validated to geo-register to within 3 meters to prior airborne-collected Ladar data.

Compared to prior approaches, the new method has a computational speed-up on the

order of 4x to 14x depending on database size. Results summarizing the fine-scale geo-

location approach are published in [77]. As a proof-of-concept, we leveraged the newly

developed 3D world model to perform information transference from other geo-located

labeled data sources to the respective query image in order to demonstrate improved

image understanding.
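The geo-registration step, anchoring the relative SfM poses to the GPS-tagged subset of cameras, can be sketched as a similarity fit (scale, rotation, translation) in the style of Umeyama's method; this is an illustration of the idea, not the thesis implementation, and all names are hypothetical:

```python
import numpy as np

def similarity_align(model_xyz, gps_xyz):
    """Fit s, R, t so that gps ~= s * R @ model + t over the cameras
    that carry GPS metadata (Umeyama-style closed form)."""
    mu_m, mu_g = model_xyz.mean(axis=0), gps_xyz.mean(axis=0)
    Pm, Pg = model_xyz - mu_m, gps_xyz - mu_g
    U, S, Vt = np.linalg.svd(Pg.T @ Pm / len(Pm))   # cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))              # proper rotation
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (Pm ** 2).sum() * len(Pm)
    t = mu_g - s * R @ mu_m
    return s, R, t
```

Once s, R, t are known, the same transform carries the entire reconstruction, cameras and 3D points alike, into geographic coordinates.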

In support of validation of our fine geo-location method, we developed a novel 3D Ladar

processing method using data collected by an airborne 3D Ladar sensor. Data collected

by 3D Laser Radar (Ladar) systems, which utilize arrays of avalanche photo-diode

detectors operating in either Linear or Geiger mode, may include a large number of false

detector counts or noise from temporal and spatial clutter. We present an improved

algorithm for noise removal and signal detection, called Multiple-Peak Spatial

Coincidence Processing (MPSCP). Field data, collected using an airborne Ladar sensor in

support of the 2010 Haiti earthquake operations, were used to test the MPSCP algorithm

against current state-of-the-art, Maximum A-posteriori Coincidence Processing

(MAPCP). Qualitative and quantitative results are presented to determine how well each

algorithm removes image noise while preserving signal and reconstructing the best

estimate of the underlying 3D scene. The MPSCP algorithm is shown to have a 9x

improvement in signal-to-noise ratio, a 2-3x improvement in angular and range

resolution, a 21% improvement in ground detection and a 5.9x improvement in

computational efficiency compared to MAPCP. Results summarizing the 3D Ladar

processing approach are published in [78] [79].


6.2 Recent Developments

In this section, we review recent work in the literature that is related to the work

presented in this thesis.

H. Altwaijry et al. tackled the problem of image geo-location using Google Glass

imagery [80]. They extracted different features and in one case did testing using a

much higher dimensional vector (upwards of 300K dimensions instead of around

~2K dimensions as in our case). The training data set used was quite small at

1,204 images (vs. 6.5 million in our case). For their data set, they obtained good

geo-location accuracy (70-80%), though it was not clear how dense the images

were collected and if the high geo-location accuracy was due to instance level

learning versus the more desirable general image learning. For our geo-location

method, we might consider using some of the features proposed in [80].

G. Baatz et al. focused on geo-location of images in mountainous terrain at the

country level [81]. The research utilizes a high-resolution digital terrain model to

form a sky contour. The sky contour is quantized into a feature vector that can be

matched to a database of GPS labeled images with pre-computed sky-contours. A

variant of ICP [69] is applied to deal with the lack of rotation invariance for the

sky contours. The approach proves to have good geo-location performance with

upwards of 80% of images having a geo-location error lower than 1 km. At its

essence, the method allows for virtual sampling of the earth to densify sparse

remote regions on the globe (a problem that is apparent in our Flickr data set) in

order to allow for good image matching. The technique should work well for

mountainous areas and possibly coastal areas, and might be used to improve our

geo-location method once we detect the presence of mountainous/coastal terrain

from the coarse-scale classifier. However, the presented method in [81] will not

work for other terrain types such as urban, forest or country, which typically contain sky contours that vary strongly with small changes in viewing perspective or, in the case of forests, may not be well defined due to the lack of a contiguous sky region.


T.-Y. Lin et al. presented work on cross-view image geo-localization, where

satellite imagery was used as a truth database and matched to ground-based

imagery [82]. In particular, a land cover database was used to label the satellite

imagery, with that information being used to predict which regions might best

match the query ground-based imagery. The method has some similarities to our

approach, in that the land-cover database is used to extract further information

that can reduce the geo-spatial search. The method was applied to a small region

(city scale) and obtained accuracy results that were 2x better than chance. The

research work has some overlap with our approach and provides another method

to use land-cover databases to improve geo-location of ground based imagery.

Research work in [83][84] focused on matching ultra-wide baseline aerial

imagery in urban environments, an issue that is not directly addressed by our fine-

scale aerial video geo-location method, where we typically have a narrow baseline

between images. The research described promising results for matching images

that have gone through large rotational (more than 30 degrees) and translational

changes, circumstances under which SIFT matching is known to fail. For our

approach, we did in fact use SIFT matching as we did not need to address the

issue of matching over wide baselines since the training imagery is composed of

high-frame rate aerial and ground based video imagery. However, the research in

[83][84] might be used as an extension to the fine-scale geo-location method in

Chapter 4 to further improve both ground and aerial geo-location, where we might

have a query image that has a wide baseline compared to any prior collected

training imagery.

Our review on recent work on image geo-location confirms that other authors are starting

to present geo-location methods that use multiple GIS data sources, along the lines of our

proposed approach. In particular, other researchers have found that using land-cover

databases can add significant information that is helpful to ground-based image geo-

location. The presented methods are quite different than the one proposed in this thesis,


but the work is very much complementary and can be brought into the framework of the

hierarchical geo-location approach.

6.3 Future Work

Future areas of investigation will focus on further improving the coarse geo-location by

upgrading the KNN classifier to a KNN-SVM classifier similar to the one used for the

medium scale geo-location. We would also like to consider expanding the number of

terrain classes, allowing for improved data reduction and geo-location specificity. In

particular, the “country” and “urban” classes tend to account for more than half of all

images in the image database and need to be further sub-divided. Towards that goal, we

might consider adding several additional classes, namely a “savanna/arid” class as well as

further subdividing the “urban” class into “sub-urban” versus “dense urban” classes.

In regards to medium-scale geo-location, future work might include adding additional

features, learning which features are more important for geo-location and discarding

features that have low discriminatory power. For the fine-scale geo-location method,

recent work in the literature suggests that a further hierarchical approach can be applied

within the 3D SfM reconstruction to achieve a higher percentage of images that are part

of the initial 3D reconstruction. We are actively pursuing similar methods for real-time

reconstruction of 2D imagery from small UAVs.

Furthermore, in this thesis we have only dealt with geo-locating a single query image.

Research in [7] has extended this to a sequence of images collected over multiple days.

We would like to continue along that line of research by extending the presented thesis work to the problem of geo-locating a short video sequence, where most of the

information is much more narrowly localized in both time and space, making the problem

more challenging.

In terms of 3D Lidar processing, we are currently making great strides towards improved

signal detection. In particular, the multi-viewpoint 3D filter described in this thesis is not

very good at capturing vertical surfaces since we do peak detection using a histogram


comprised of a vertical chimney of data. We are developing a new multi-viewpoint filter

that addresses this shortcoming. Furthermore, we are developing algorithms that process

large amounts of data (1GB/sec) in real time as well as pursuing extreme Lidar platform

SWaP (size, weight and power) reductions on the order of 10^9 to obtain similar 3D data

collection capabilities on a small UAV as compared to prior airborne systems that need a

much larger, manned, airborne platform.

In addition, we are developing classification algorithms using fused 2D and 3D data for

improved scene understanding and object recognition. These algorithms are to be used to

detect natural versus man-made objects, with further sub-classification into object classes, such

as trees, rivers, buildings, cars, roads and trails.


References

1. http://blog.flickr.net/en/2006/08/29/geotagging-one-day-later/

2. Graham, M., Hale, S. A. and Stephens, M. (2011) Geographies of the World’s Knowledge.

London, Convoco! Edition.

3. W. Zhang and J. Kosecka. Image Based Localization in Urban Environments, 3DPVT 2006

4. Amir Roshan Zamir and Mubarak Shah, Accurate Image Localization Based on Google Maps

Street View, ECCV, 2010

5. M. Pollefeys, D. Nister, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D.

Gallup, S. Kim, P. Merrell, Detailed Real-Time Urban 3D Reconstruction from Video. IJCV,

Volume 78, Issue 2-3:143–167, July 2008

6. Noah Snavely: Scene Reconstruction and Visualization from Internet Photo Collections,

Doctoral thesis, University of Washington, 2008

7. James Hays, Alexei A. Efros. IM2GPS: estimating geographic information from a single

image. CVPR 2008.

8. James Hays, Large Scale Scene Matching for Graphics and Vision, CMU PhD Thesis, 2009.

9. Richard Ernest Bellman; Rand Corporation (1957). Dynamic programming. Princeton

University Press. ISBN 978-0-691-07951-6

10. D. Lowe, Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110,

2004

11. De Bonet, J.S. and Viola, P. 1997. Structure driven image database retrieval. Advances in

Neural Information Processing, 10:866–872.

12. A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in

recognition. In Visual Perception, Progress in Brain Research, volume 155, 2006.

13. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak

geometric consistency for large scale image search. In European Conference on Computer

Vision, volume I, pages 304–317, Oct 2008.

14. D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, volume 2,

pages 2161–2168, 2006.


15. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving

particular object retrieval in large scale image databases. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, 2008.

16. C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via dense scene

alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2009.

17. http://unstats.un.org/unsd/demographic/products/dyb/dybsets/2012.pdf

18. Global Land Cover Characterization Database, http://edc2.usgs.gov/glcc/glcc.php

19. United Nations Environment Programme, Mountains and Treed cover in Mountain Regions

(2002) http://www.unep-wcmc.org/mountains-and-tree-cover-in-mountain-regions-

2002_724.html

20. N. Rasiwasia and N. Vasconcelos, “Holistic context modeling using semantic co-

occurrences," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

(CVPR) (2009).

21. A. Oliva and A. Torralba, “Modeling the Shape of the Scene: A Holistic Representation of the

Spatial Envelope," International Journal of Computer Vision 42(3), 145–175 (2001), URL

http://dx.doi.org/10.1023/A:1011139631724.

22. A. Torralba, “Understanding visual scenes," Video Lecture (2009), URL

http://videolectures.net/nips09_torralba_uvs/.

23. B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman, “Labelme: a database and

web-based tool for image annotation," International Journal of Computer Vision 77(1-3), 157–173

(2008).

24. S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions,” in ACM-SIAM Symposium on Discrete Algorithms (1994), pp. 573–582.

25. USGS Land Use/Land Cover System Legend,

http://edc2.usgs.gov/glcc/globdoc2_0.php#app3

26. Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. CVPR ’06, 2006.


27. John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector

machines, 1998.

28. G. Wang, D. Hoiem, and D. A. Forsyth. Learning image similarity from Flickr groups using

stochastic intersection kernel machines. In ICCV, 2009.

29. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In European Conference on Computer Vision, volume I, pages 304–317, Oct 2008.

30. D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, volume 2,

pages 2161–2168, 2006.

31. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving

particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition, 2008.

32. C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via dense

scene alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2009.

33. Antonio Torralba, Rob Fergus, and Yair Weiss. Small codes and large image databases for

recognition. In CVPR, 2008.

34. Carlotta Domeniconi and Dimitrios Gunopulos. Adaptive nearest neighbor classification

using support vector machines. In NIPS, 2001.

35. J. H. Friedman. Flexible metric nearest neighbor classification. Technical report, Stanford,

Nov. 1994.

36. Pascal Vincent and Yoshua Bengio. K-local hyperplane and convex distance nearest neighbor

algorithms. In NIPS, 2002.

37. Chris Atkeson, Andrew Moore, and Stefan Schaal. Locally weighted learning. AI Review, 11:11–73, April 1997.

38. T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE

PAMI, 18:607–616, 1996.

39. Craig Stanfill and David Waltz. Toward memory-based reasoning. Communications of the

ACM, 29(12):1213–1228, 1986.

40. P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. CVPR, June 2008.


41. A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. MIT-CSAIL-TR-2007-024, 2007.

42. D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images

and its application to evaluating segmentation algorithms and measuring ecological statistics. In

Proc. ICCV, July 2001.

43. Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, Richard Szeliski: Building

Rome in a Day. ICCV 2009

44. Jan-Michael Frahm, Pierre Georgel, David Gallup, Tim Johnson, Rahul Raguram,

Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, Marc Pollefeys:

Building Rome on a Cloudless Day, ECCV 2010

45. Martin Bujnak, Zuzana Kukelova, and Tomas Pajdla: 3D reconstruction from image

collections with a single known focal length, ICCV 2009

46. C. Strecha, T. Pylvanainen, P. Fua: Dynamic and Scalable Large Scale Image

Reconstruction, CVPR 2010.

47. Jan-Michael Frahm, Marc Pollefeys, Svetlana Lazebnik, Christopher Zach, David Gallup,

Brian Clipp, Rahul Raguram, Changchang Wu, Tim Johnson: Fast Robust Large-scale Mapping

from Video and Internet Photo Collections, ISPRS 2010

48. Micusik B., Kosecka J.: Piecewise Planar City 3D Modeling from Street View Panoramic

Sequences, CVPR 2009

49. T. Lee: Robust 3D Street-View Reconstruction using Sky Motion Estimation. 3DIM 2009, in conjunction with ICCV, 2009

50. C. Fruh and A. Zakhor: An Automated Method for Large-scale, Ground-based City Model

Acquisition. IJCV, 60(1), 2004

51. M. Agrawal and K. Konolige: Real-time localization in outdoor environments using stereo vision and inexpensive GPS, ICPR, Vol. 3, pp. 1063–1068, 2006

52. Yuji Yokochi, Sei Ikeda, Tomokazu Sato, Naokazu Yokoya: Extrinsic Camera Parameter Estimation Based on Feature Tracking and GPS Data, ICPR, pp. 369–378, 2006

53. M. Modsching, R. Kramer, and K. ten Hagen: Field trial on GPS Accuracy in a medium size

city: The influence of built-up, WPNC 2006

54. Richard I. Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University

Press, Cambridge, UK, 2004


55. D. Nistér: An efficient solution to the five-point relative pose problem, IEEE Transactions on

Pattern Analysis and Machine Intelligence (PAMI), 26(6):756-770, June 2004

56. Yasutaka Furukawa and Jean Ponce: Accurate, Dense, and Robust Multi-View Stereopsis,

IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009

57. Yasutaka Furukawa and Jean Ponce: Patch-based Multi-View Stereo Software,

http://grail.cs.washington.edu/software/pmvs

58. P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239-256, February 1992

59. S. Rusinkiewicz, M. Levoy, Efficient variants of the ICP algorithm, in: Third International

Conference on 3D Digital Imaging and Modeling (3DIM), June 2001, pp. 145–152

60. MeshLab, http://meshlab.sourceforge.net/

61. Peter Cho, Noah Snavely, “Enhancing Large Urban Photo Collections with 3D Ladar and GIS

Data”, International Journal of Remote Sensing Applications (IJRSA) 2013.

62. A. Vasile, F. R. Waugh, D. Greisokh, and R. M. Heinrichs. “Automatic alignment of color imagery onto 3D laser radar data”. In AIPR ’06: Proceedings of the 35th Applied Imagery and Pattern Recognition Workshop, page 6, Washington, DC, USA, 2006. IEEE Computer Society.

63. A.G. Gschwendtner and W.E. Keicher, “Development of Coherent Laser Radar at

Lincoln Laboratory,” Linc. Lab. J. 12 (2), 2000, pp. 383–396.

64. R.M. Marino, T. Stephens, R.E. Hatch, J.L. McLaughlin, J.G. Mooney, M.E.

O’Brien, G.S. Rowe, J.S. Adams, L. Skelly, R.C. Knowlton, S.E. Forman, and W.R.

Davis, “A Compact 3D Imaging Laser Radar System Using Geiger-Mode APD Arrays:

System and Measurements,” SPIE 5086, 2003, pp. 1-15.

65. M.A. Albota, B.F. Aull, D.G. Fouche, R.M. Heinrichs, D.G. Kocher, R.M. Marino, J.G.

Mooney, N.R. Newbury, M.E. O’Brien, B.E. Player, B.C. Willard, and J.J. Zayhowski, “Three-

Dimensional Imaging Laser Radars with Geiger-Mode Avalanche Photodiode Arrays,” Lincoln

Laboratory Journal, vol. 13, no. 2, 2002, pp. 351-370.

66. J.J. Zayhowski, “Passively Q-Switched Microchip Lasers and Applications,” Rev. Laser Eng. 29 (12), 1988, pp. 841-846.


67. J.J. Zayhowski, “Microchip Lasers,” Lincoln Laboratory Journal, vol. 3, no. 3, 1990, pp. 427-446.

68. R.M. Heinrichs, B.F. Aull, R.M. Marino, D.G. Fouche, A.K. McIntosh, J.J. Zayhowski, T.

Stephens, M.E. O’Brien, and M.A. Albota, “Three-Dimensional Laser Radar with APD Arrays,”

SPIE 4377, 2001, pp. 106-117.

69. M.A. Albota, R.M. Heinrichs, D.G. Kocher, D.G. Fouche, B.E. Player, M.E. O’Brien, B.F. Aull, J.J. Zayhowski, J. Mooney, B.C. Willard, and R.R. Carlson, “Three-Dimensional Imaging Laser Radar with a Photon-Counting Avalanche Photodiode Array and Microchip Laser,” Appl. Opt. 41 (36), 2002, pp. 7671-7678.

70. B.F. Aull, A.H. Loomis, D.J. Young, R.M. Heinrichs, B.J. Felton, P.J. Daniels, and D.J.

Landers, “Geiger-Mode Avalanche Photodiodes for Three-Dimensional Imaging,” Lincoln Laboratory Journal, vol. 13, no. 2, 2002, pp. 335-350.

71. K.A. McIntosh, J.P. Donnelly, D.C. Oakley, A. Napoleone, S.D. Calawa, L.J. Mahoney, K.M.

Molvar, E.K. Duerr, S.H. Groves, and D.C. Shaver, “InGaAsP/InP Avalanche Photodiodes for

Photon Counting at 1.06 μm,” Appl. Phys. Lett. 81, 2505-2507 (2002).

72. D.G. Fouche, “Detection and False-Alarm Probabilities for Laser Radars That Use Geiger-Mode Detectors,” Appl. Opt. 42 (27), 2003, pp. 5388-5398.

73. Jeffrey R. Stevens, Norman A. Lopez, Robin R. Burton, “Quantitative Data Quality Metrics

for 3D Laser Radar Systems”, SPIE Proceedings, 2010, Volume 8037.

74. www.ll.mit.edu/publications/technotes/TechNote_ALIRT.pdf

75. C. F. F. Karney, “Transverse Mercator with an accuracy of a few nanometers,” Journal of

Geodesy, 2011, Volume 85, Number 8, Pages 475-485

76. Alexandru N. Vasile and Octavia Camps, “Hierarchical Image Geo-Location on a World-

Wide Scale”, ISVC 2013, Part II, LNCS 8034, pp. 266-277, 2013

77. Alexandru N. Vasile, Luke J. Skelly, Karl Ni, Richard Heinrichs and Octavia Camps, “Efficient City-sized 3D Reconstruction from Ultra-high Resolution Aerial and Ground Video Imagery”, ISVC 2011, Part I, LNCS 6938, pp. 350–362, 2011

78. Alexandru N. Vasile, Luke J. Skelly, Michael E. O’Brien, Dan G. Fouche, Richard M.

Marino, Robert Knowlton, M. Jalal Khan and Richard M. Heinrichs, “Advanced Coincidence

Processing of 3D Laser Radar Data”, ISVC 2012, Part I, LNCS 7431, pp. 382-393, 2012


79. Alexandru N. Vasile, Luke J. Skelly, Michael E. O’Brien, Dan G. Fouche, Richard M.

Marino, Robert Knowlton, M. Jalal Khan and Richard M. Heinrichs, “Coincidence Processing of

3D Lidar Data for Foliage Penetration Applications”, MSS-EO 2012

80. Altwaijry H., Moghimi M., Belongie S., "Recognizing Locations with Google Glass: A Case

Study", IEEE Winter Conference on Applications of Computer Vision (WACV), Steamboat

Springs, Colorado, March, 2014.

81. G. Baatz, O. Saurer, K. Köser, M. Pollefeys, Large scale visual geo-localization of images in

mountainous terrain, In Proceedings of the 12th European Conference on Computer Vision -

Volume Part II, (2012), pp. 517–530

82. T.-Y. Lin, S. Belongie, J. Hays. Cross-view image geolocalization, in IEEE Conference on

Computer Vision and Pattern Recognition (CVPR) (Portland, OR, June 2013)

83. Altwaijry H., Belongie S., "Ultra-wide Baseline Aerial Imagery Matching in Urban

Environments", British Machine Vision Conference (BMVC), Bristol, September, 2013.

84. Mayank Bansal, Kostas Daniilidis, and Harpreet Sawhney. Ultra-wide baseline façade

matching for geo-localization. In ECCV 2012.