
Building Rome in a Day

Sameer Agarwal1,∗ Noah Snavely2 Ian Simon1 Steven M. Seitz1 Richard Szeliski3

1University of Washington 2Cornell University 3Microsoft Research

Abstract

We present a system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo sharing sites. Our system uses a collection of novel parallel distributed matching and reconstruction algorithms, designed to maximize parallelism at each stage in the pipeline and minimize serialization bottlenecks. It is designed to scale gracefully with both the size of the problem and the amount of available computation. We have experimented with a variety of alternative algorithms at each stage of the pipeline and report on which ones work best in a parallel computing environment. Our experimental results demonstrate that it is now possible to reconstruct cities consisting of 150K images in less than a day on a cluster with 500 compute cores.

1. Introduction

Entering the search term “Rome” on flickr.com returns more than two million photographs. This collection represents an increasingly complete photographic record of the city, capturing every popular site, facade, interior, fountain, sculpture, painting, cafe, and so forth. Most of these photographs are captured from hundreds or thousands of viewpoints and illumination conditions—Trevi Fountain alone has over 50,000 photographs on Flickr. Exciting progress has been made on reconstructing individual buildings or plazas from similar collections [16, 17, 8], showing the potential of applying structure from motion (SfM) algorithms on unstructured photo collections of up to a few thousand photographs. This paper presents the first system capable of city-scale reconstruction from unstructured photo collections. We present models that are one to two orders of magnitude larger than the next largest results reported in the literature. Furthermore, our system enables the reconstruction of data sets of 150,000 images in less than a day.

∗To whom correspondence should be addressed. Email: sagarwal@cs.washington.edu

City-scale 3D reconstruction has been explored previously in the computer vision literature [12, 2, 6, 21] and is now widely deployed, e.g., in Google Earth and Microsoft’s Virtual Earth. However, existing large scale structure from motion systems operate on data that comes from a structured source, e.g., aerial photographs taken by a survey aircraft or street side imagery captured by a moving vehicle. These systems rely on photographs captured using the same calibrated camera(s) at a regular sampling rate and typically leverage other sensors such as GPS and Inertial Navigation Units, vastly simplifying the computation.

Images harvested from the web have none of these simplifying characteristics. They are taken from a variety of different cameras, in varying illumination conditions, have little to no geographic information associated with them, and in many cases come with no camera calibration information. The same variability that makes these photo collections so hard to work with for the purposes of SfM also makes them an extremely rich source of information about the world. In particular, they specifically capture things that people find interesting, i.e., worthy of photographing, and include interiors and artifacts (sculptures, paintings, etc.) as well as exteriors [14]. While reconstructions generated from such collections do not capture a complete covering of scene surfaces, the coverage improves over time, and can be complemented by adding aerial or street-side images.

The key design goal of our system is to quickly produce reconstructions by leveraging massive parallelism. This choice is motivated by the increasing prevalence of parallel compute resources both at the CPU level (multi-core) and the network level (cloud computing). At today’s prices, for example, you can rent 1000 nodes of a cluster for 24 hours for $10,000 [1].

The cornerstone of our approach is a new system for large-scale distributed computer vision problems, which we will be releasing to the community. Our pipeline draws largely from the existing state of the art of large scale matching and SfM algorithms, including SIFT, vocabulary trees, bundle adjustment, and other known techniques. For each stage in our pipeline, we consider several alternatives, as some


algorithms naturally distribute and some do not, and issues like memory usage and I/O bandwidth become critical. In cases such as bundle adjustment, where we found that the existing implementations did not scale, we have created our own high performance implementations. Designing a truly scalable system is challenging, and we discovered many surprises along the way. The main contributions of our paper, therefore, are the insights and lessons learned, as well as the technical innovations that we invented to improve the parallelism and throughput of our system.

The rest of the paper is organized as follows. Section 2 discusses the detailed design choices and implementation of our system. Section 3 reports the results of our experiments on three city-scale data sets, and Section 4 concludes with a discussion of some of the lessons learned and directions for future work.

2. System Design

Our system runs on a cluster of computers (nodes), with one node designated as the master node. The master node is responsible for the various job scheduling decisions.

In this section, we describe the detailed design of our system, which can naturally be broken up into three distinct phases: (1) pre-processing (§2.1), (2) matching (§2.2), and (3) geometric estimation (§2.4).

2.1. Preprocessing and feature extraction

We assume that the images are available on a central store, from which they are distributed to the cluster nodes on demand in chunks of fixed size. This automatically performs load balancing, with more powerful nodes receiving more images to process.

This is the only stage where a central file server is required; the rest of the system operates without using any shared storage. This is done so that we can download the images from the Internet independent of our matching and reconstruction experiments. For production use, it would be straightforward to have each cluster node crawl the Internet for images at the start of the matching and reconstruction process.

On each node, we begin by verifying that the image files are readable, valid images. We then extract the EXIF tags, if present, and record the focal length. We also downsample images larger than 2 megapixels, preserving their aspect ratios and scaling their focal lengths. The images are then converted to grayscale and SIFT features are extracted from them [10]. We use the SIFT++ implementation by Andrea Vedaldi for its speed and flexible interface [20]. At the end of this stage, the entire set of images is partitioned into disjoint sets, one for each node. Each node owns the images and SIFT features associated with its partition.
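The resizing rule above can be sketched as follows. This is a hypothetical helper, not the paper's code: the 2-megapixel cap and the rule of scaling the focal length by the same factor come from the description above, while the function name and the exact rounding are illustrative.

```python
def downsample_params(width, height, focal_px, max_pixels=2_000_000):
    """Return (width, height, focal) after applying the size cap.

    Images above `max_pixels` are scaled down uniformly, which preserves
    the aspect ratio; the focal length (in pixels) is scaled by the same
    factor so that the camera intrinsics stay consistent with the pixels.
    """
    n = width * height
    if n <= max_pixels:
        return width, height, focal_px
    scale = (max_pixels / n) ** 0.5     # uniform scale factor < 1
    return int(width * scale), int(height * scale), focal_px * scale
```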

Figure 1. Our multi-stage parallel matching pipeline. The N input images are distributed onto M processing nodes, after which the following processing stages are performed: (a) SIFT feature extraction, vocabulary tree vector quantization, and term frequency counting; (b) document frequency counting; (c) TFIDF computation and information broadcast; (d) computation of TF-based match likelihoods; (e) aggregation at the master node; (f) round-robin bin-packing distribution of match verification tasks based on the top k1 matches per image; (g) match verification with optional inter-node feature vector exchange; (h) match proposal expansion based on images found in connected components CC using the next k2 best matches per image; (i) more distributed verification; (j) four more rounds of query expansion and verification; (k) track merging based on local verified matches; (l) track aggregation into C connected components; (m) final distribution of tracks by image connected components and distributed merging.

2.2. Image Matching

The key computational tasks when matching two images are the photometric matching of interest points and the geometric verification of these matches using a robust estimate of the fundamental or essential matrix. While exhaustive matching of all features between two images is prohibitively expensive, excellent results have been reported with approximate nearest neighbor search. We use the ANN library [3] for matching SIFT features. For each pair of images, the features of one image are inserted into a k-d tree, and the features from the other image are used as queries. For each query, we consider the two nearest neighbors, and matches that pass Lowe’s ratio test are accepted [10]. We use the priority queue based search method, with an upper bound of 200 on the maximum number of bin visits. In our experience, these parameters offer a good tradeoff between computational cost and matching accuracy. The matches returned by the approximate nearest neighbor search are then pruned and verified using a RANSAC-based estimation of the fundamental or essential matrix [18], depending on the availability of focal length information from the EXIF tags.
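The photometric matching step can be sketched as follows. The paper uses the ANN library's k-d tree with a bounded number of bin visits; this sketch uses exhaustive nearest-neighbor search for clarity, and the 0.8 threshold is Lowe's commonly used ratio value, not a figure taken from the paper.

```python
def ratio_test_matches(query_desc, train_desc, ratio=0.8):
    """Match descriptors with Lowe's ratio test (illustrative sketch).

    Descriptors are sequences of floats. A query feature is matched to
    its nearest neighbor only if that neighbor is clearly closer than
    the second nearest one, which rejects ambiguous matches.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        # distance to every candidate descriptor, ascending
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, t)) ** 0.5, ti)
            for ti, t in enumerate(train_desc)
        )
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches
```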


These two operations form the computational kernel of our matching engine.

Unfortunately, even with a well optimized implementation of the matching procedure described above, it is not practical to match all pairs of images in our corpus. For a corpus of 100,000 images, this translates into 5,000,000,000 pairwise comparisons, which with 500 cores operating at 10 image pairs per second per core would require about 11.5 days to match. Furthermore, this does not even take into account the network transfers required for all cores to have access to all the SIFT feature data for all images.
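The arithmetic behind these figures is easy to check:

```python
# Sanity check of the matching-cost estimate quoted above.
n_images = 100_000
pairs = n_images * (n_images - 1) // 2          # unordered image pairs, ~5 * 10^9
cores = 500
pairs_per_sec_per_core = 10
seconds = pairs / (cores * pairs_per_sec_per_core)
days = seconds / 86_400                          # ~11.5 days of wall-clock time
```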

Even if we were able to do all these pairwise matches, it would be a waste of computational effort, since an overwhelming majority of the image pairs do not match. This is expected from a set of images associated with a broad tag like the name of a city. Thus, we must be careful in choosing the image pairs that the system spends its time matching.

Building upon recent work on efficient object retrieval [15, 11, 5, 13], we use a multi-stage matching scheme. Each stage consists of a proposal and a verification step. In the proposal step, the system determines a set of image pairs that it expects to share common scene elements. In the verification step, detailed feature matching is performed on these image pairs. The matches obtained in this manner then inform the next proposal step.

In our system, we use two methods to generate proposals: vocabulary tree based whole image similarity (§2.2.1) and query expansion (§2.2.4). The verification stage is described in §2.2.2. Figure 1 shows the system diagram for the entire matching pipeline.

2.2.1 Vocabulary Tree Proposals

Methods inspired by text retrieval have been applied with great success to the problem of object and image retrieval. These methods are based on representing an image as a bag of words, where the words are obtained by quantizing the image features. We use a vocabulary tree-based approach [11], where a hierarchical k-means tree is used to quantize the feature descriptors. (See §3 for details on how we build the vocabulary tree using a small corpus of training images.) These quantizations are aggregated over all features in an image to obtain a term frequency (TF) vector for the image, and a document frequency (DF) vector for the corpus of images (Figure 1a–e). The document frequency vectors are gathered across nodes into a single vector that is broadcast across the cluster. Each node normalizes the term frequency vectors it owns to obtain the TFIDF matrix for that node. These per-node TFIDF matrices are broadcast across the network, so that each node can calculate the inner product between its TFIDF vectors and the rest of the TFIDF vectors. In effect, this is a distributed product of the matrix of TFIDF vectors with itself, but each node only calculates the block

of rows corresponding to the set of images it owns. For each image, the top scoring k1 + k2 images are identified, where the first k1 images are used in an initial verification stage, and the additional k2 images are used to enlarge the connected components (see below).
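One way the per-node scoring step could look is sketched below; the function name and data layout are illustrative, not from the paper. Each node owns a block of TFIDF rows, scores its images against the broadcast set by inner products of L2-normalized vectors (i.e. cosine similarity), and keeps the top k1 + k2 per image.

```python
def topk_proposals(local_tfidf, all_tfidf, k1, k2):
    """Per-node vocabulary-tree proposal step (illustrative sketch).

    `local_tfidf` maps the node's own image ids to TFIDF vectors;
    `all_tfidf` holds every image's vector after the broadcast. Returns,
    per owned image, the top k1 ids (initial verification) and the next
    k2 ids (connected-component merging).
    """
    def normalize(v):
        s = sum(x * x for x in v) ** 0.5
        return [x / s for x in v] if s else list(v)

    unit = {j: normalize(v) for j, v in all_tfidf.items()}
    proposals = {}
    for i in local_tfidf:
        scores = sorted(
            ((sum(a * b for a, b in zip(unit[i], unit[j])), j)
             for j in unit if j != i),
            reverse=True,
        )
        top = [j for _, j in scores[: k1 + k2]]
        proposals[i] = (top[:k1], top[k1:])
    return proposals
```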

Our system differs from that of Nister and Stewenius [11], since their system has a fixed database to which they match incoming images. They can therefore store the database in the vocabulary tree itself and evaluate the match score of an image on the fly. In our case, the query set is the same as the database, and it is not available when the features are being encoded. Thus, we must have a separate matrix multiplication stage to find the best matching images.

2.2.2 Verification and detailed matching

The next step is to verify candidate image matches, and to then find a detailed set of matches between matching images. If the images were all located on a single machine, the task of verifying a proposed matching image pair would be a simple matter of running through the image pairs, perhaps with some attention paid to the order in which the verifications are performed so as to minimize disk I/O. However, in our case, the images and their feature descriptors are distributed across the cluster. Thus, asking a node to match the image pair (i, j) requires it to fetch the image features from two other nodes of the cluster. This is undesirable, as there is a large difference between network transfer speeds and local disk transfers. Furthermore, this creates work for three nodes. Thus, the image pairs should be distributed across the network in a manner that respects the locality of the data and minimizes the amount of network transfers (Figure 1f–g).

We experimented with a number of approaches, with some surprising results. We initially tried to optimize network transfers before any verification is done. In this setup, once the master node has all the image pairs that need to be verified, it builds a graph connecting image pairs which share an image. Using MeTiS [7], this graph is partitioned into as many pieces as there are compute nodes. Partitions are then matched to the compute nodes by solving a linear assignment problem that minimizes the number of network transfers needed to send the required files to each node.

This algorithm worked well for small problem sizes, but as the problem size increased, its performance degraded. Our assumption that detailed matches between all pairs of images take the same constant amount of time was wrong: some nodes finished early and were idling for up to an hour.

The second idea we tried was to over-partition the graph into small pieces, and to parcel them out to the cluster nodes on demand. When a node requests another chunk of work, the piece with the fewest network transfers is assigned to it. This strategy achieved better load balancing, but as the size


of the problem grew, the graph we needed to partition grew to be enormous, and partitioning itself became a bottleneck.

The approach that gave the best results was to use a simple greedy bin-packing algorithm (where each bin represents the set of jobs sent to a node), which works as follows. The master node maintains a list of images on each node. When a node asks for work, it runs through the list of available image pairs, adding them to the bin if they do not require any network transfers, until either the bin is full or there are no more image pairs to add. It then chooses an image (list of feature vectors) to transfer to the node, selecting the image that will allow it to add the maximum number of image pairs to the bin. This process is repeated until the bin is full. This algorithm has one drawback: it can require multiple sweeps over all the image pairs needing to be matched. For large problems, the scheduling of jobs can become a bottleneck. A simple solution is to only consider a subset of the jobs at a time, instead of trying to optimize globally. This windowed approach works very well in practice, and all our experiments are run with this method.
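The greedy scheme above can be sketched as follows. This is a simplified, hypothetical rendering, not the paper's implementation: the windowed variant and all I/O are omitted, and the names are illustrative.

```python
def fill_bin(pairs, node_images, bin_capacity):
    """Greedy bin packing for match-verification jobs (sketch).

    `pairs` lists image pairs awaiting verification; `node_images` is
    the set of images whose features the requesting node already holds
    (mutated in place as transfers are decided). Free pairs are binned
    first; then the single image that unlocks the most additional pairs
    is transferred, repeating until the bin is full.
    """
    bin_, remaining = [], list(pairs)
    while len(bin_) < bin_capacity and remaining:
        # First, take pairs that need no network transfer at all.
        free = [p for p in remaining
                if p[0] in node_images and p[1] in node_images]
        for p in free:
            if len(bin_) >= bin_capacity:
                break
            bin_.append(p)
            remaining.remove(p)
        if len(bin_) >= bin_capacity or not remaining:
            break
        # Otherwise, transfer the image that unlocks the most pairs.
        gain = {}
        for a, b in remaining:
            missing = [x for x in (a, b) if x not in node_images]
            if len(missing) == 1:
                gain[missing[0]] = gain.get(missing[0], 0) + 1
        if gain:
            node_images.add(max(gain, key=gain.get))
        else:
            # Every remaining pair needs two transfers; start one anyway.
            node_images.add(remaining[0][0])
    return bin_
```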

Verifying an image pair is a two-step procedure, consisting of photometric matching between feature descriptors, and a robust estimation of the essential or fundamental matrix, depending upon the availability of camera calibration information. In cases where the estimation of the essential matrix succeeds, there is a sufficient angle between the viewing directions of the two cameras, and the number of matches is above a threshold, we do a full Euclidean two-view reconstruction and store it. This information is used in later stages (see §2.4) to reduce the size of the reconstruction problem.

2.2.3 Merging Connected Components

At this stage, consider a graph on the set of images with edges connecting two images if matching features were found between them. We refer to this as the match graph. To get as comprehensive a reconstruction as possible, we want the fewest number of connected components in this graph. To this end, we make further use of the proposals from the vocabulary tree to try and connect the various connected components in this graph. For each image, we consider the next k2 images suggested by the vocabulary tree. From these, we verify those image pairs which straddle two different connected components (Figure 1h–i). We do this only for images which are in components of size 2 or more. Thus, images which did not match any of their top k1 proposed matches are effectively discarded. Again, the resulting image pairs are subject to detailed feature matching. Figure 2 illustrates this. Notice that after the first round, the match graph has two connected components, which get connected after the second round of matching.
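The component-straddling filter can be sketched with a union-find structure. This is an illustrative sketch, not the paper's code; it treats "components of size 2 or more" as images that appear in at least one verified pair.

```python
class UnionFind:
    """Minimal union-find over hashable ids (illustrative)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def cross_component_proposals(verified_pairs, vt_suggestions):
    """Keep vocabulary-tree proposals that straddle two different
    connected components of the current match graph. Images absent
    from every verified pair (singleton components) are skipped.
    """
    uf = UnionFind()
    for a, b in verified_pairs:
        uf.union(a, b)
    return [(i, j) for i, j in vt_suggestions
            if i in uf.parent and j in uf.parent and uf.find(i) != uf.find(j)]
```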

2.2.4 Query Expansion

After performing two rounds of matching as described above, we have a match graph which is usually not dense enough to reliably produce a good reconstruction. To remedy this, we use another idea from text and document retrieval research: query expansion [5].

In its simplest form, query expansion is done by first finding the documents that match a user’s query, and then using them to query the database again, thus expanding the initial query. The results returned by the system are some combination of these two queries. In essence, if we were to define a graph on the set of documents, with similar documents connected by an edge, and treat the query as a document too, then query expansion is equivalent to finding all vertices that are within two steps of the query vertex.

In our system, we consider the image match graph, where images i and j are connected if they have a certain minimum number of features in common. Now, if image i is connected to image j and image j is connected to image k, we perform a detailed match to check if image i matches image k. This process can be repeated a fixed number of times or until the match graph converges.
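Generating these two-step proposals can be sketched as follows (an illustrative sketch; the match graph is given as an adjacency dict, and only pairs not already connected by an edge are proposed):

```python
def query_expansion_proposals(match_graph):
    """Propose pairs (i, k) whenever i-j and j-k are edges of the match
    graph but i-k is not, i.e. vertices within two steps of each other.
    Returned pairs are canonicalized as (min, max) and deduplicated.
    """
    proposals = set()
    for i, nbrs in match_graph.items():
        for j in nbrs:
            for k in match_graph.get(j, ()):
                if k != i and k not in nbrs:
                    proposals.add((min(i, k), max(i, k)))
    return sorted(proposals)
```

Each proposed pair then goes through the same detailed geometric verification as any other candidate, which is what keeps the expansion from drifting.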

A concern when iterating rounds of query expansion is drift. Results of the secondary queries can quickly diverge from the original query. This is not a problem in our system, since query expanded image pairs are subject to detailed geometric verification before they are connected by an edge in the match graph.

2.3. Track Generation

The final step of the matching process is to combine all the pairwise matching information to generate consistent tracks across images, i.e., to find and label all of the connected components in the graph of individual feature matches (the feature graph). Since the matching information is stored locally on the compute node that it was computed on, the track generation process proceeds in two stages (Figure 1k–m). In the first, each node generates tracks from all the matching data it has available locally. This data is gathered at the master node and then broadcast over the network to all the nodes. Observe that the tracks for each connected component of the match graph can be processed independently. The track generation then proceeds with each compute node being assigned a connected component for which the tracks need to be produced. As we merge tracks based on shared features, inconsistent tracks can be generated, where two feature points in the same image belong to the same track. In such cases, we drop the offending points from the track.
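The merge-and-clean step can be sketched as follows. This is an illustrative sketch, not the paper's code: features are (image_id, feature_id) pairs, tracks are connected components of the feature graph, and "drop the offending points" is read here as dropping every point from an image that contributes two features to the same track.

```python
def merge_tracks(feature_matches):
    """Group matched feature points into tracks via union-find, then
    drop points from images that appear twice in a track (sketch)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in feature_matches:
        parent[find(a)] = find(b)

    groups = {}
    for f in list(parent):
        groups.setdefault(find(f), []).append(f)

    tracks = []
    for members in groups.values():
        seen = {}
        for img, feat in sorted(members):
            seen[img] = None if img in seen else feat   # mark conflicts
        track = sorted((i, f) for i, f in seen.items() if f is not None)
        if len(track) >= 2:
            tracks.append(track)
    return sorted(tracks)
```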

Once the tracks have been generated, the next step is to extract the 2D coordinates of the feature points that occur in the tracks from the corresponding image feature files. We also extract the pixel color for each such feature point, which is later used for rendering the 3D point with the average


104

693

694

1231

1628

2543

2990

3037

3038

3039

4728

5412

5433

6670

8824

10492

10493

13942

13972

14470

14757

14803

15275

15276

15277

16742

16743

18989

18990

19115

20789

22059

22505

23977

25965

27282

30036

33099

34298

34299

34321

34810

34812

35513

36586

36587

38055

39960

39962

39964

40017

40018

41037

41359

41360

41361

41377

43242

43705

43870

45356

46262

48154

48531

48994

48995

49030

49824

49873

52259

53569

54467

54469

55060

55061

55062

55720

55760

55762

55763

55764

57215

59314

60343

60361

61448

61449

61450

61720

61721

61722

62039

62383

62850

63807

64855

64871

65119

65121

65125

66327

66328

66399

68070

68072

68091

68094

68095

70985

73147

74555

74763

75015

83222

83416

83981

84355

86001

86003

86005

86774

86775

87709

87765

88549

88584

91259

91260

92505

92654

92655

94300

96301

96302

100455

101322

101945

102389

102896

108214

108233

109346

109954

109955

109956

112136

116625

116843 116844

116995

116997

117000

117980

118292

118293

118294

122090

123345

123361

124010

124011

124013

124191

124710

129261

130620

130689

131056

134888

104

693

694

1231

1628

2543

2990

3037

3038

3039

4728

5412

5433

6670

8824

10492

10493

13942

13972

14470

14757

14803

15275

15276

15277

16742

16743

18989

18990

19115

20789

22059

22505

23977

25965

27282

30036

33099

34298

34299

34321

34810

34812

35513

36586

36587

38055

39960

39962

39964

40017

40018

41037

41359

41360

41361

41377

43242

43705

43870

45356

46262

48154

48531

48994

48995

49030

49824

49873

52259

53569

54467

54469

55060

55061

55062

55720

55760

55762

55763

55764

57215

59314

60343

60361

61448

61449

61450

61720

61721

61722

62039

62383

62850

63807

64855

64871

65119

65121

65125

66327

66328

66399

68070

68072

68091

68094

68095

70985

73147

74555

74763

75015

83222

83416

83981

84355

86001

86003

86005

86774

86775

87709

87765

88549

88584

91259

91260

92505

92654

92655

94300

96301

96302

100455

101322

101945

102389

102896

108214

108233

109346

109954

109955

109956

112136

116625

116843 116844

116995

116997

117000

117980

118292

118293

118294

122090

123345

123361

124010

124011

124013

124191

124710

129261

130620

130689

131056

134888

104

693

694

1231

1628

2543

2990

3037

3038

3039

4728

5412

5433

6670

8824

10492

10493

13942

13972

14470

14757

14803

15275

15276

15277

16742

16743

18989

18990

19115

20789

22059

22505

23977

25965

27282

30036

33099

34298

34299

34321

34810

34812

35513

36586

36587

38055

39960

39962

39964

40017

40018

41037

41359

41360

41361

41377

43242

43705

43870

45356

46262

48154

48531

48994

48995

49030

49824

49873

52259

53569

54467

54469

55060

55061

55062

55720

55760

55762

55763

55764

57215

59314

60343

60361

61448

61449

61450

61720

61721

61722

62039

62383

62850

63807

64855

64871

65119

65121

65125

66327

66328

66399

68070

68072

68091

68094

68095

70985

73147

74555

74763

75015

83222

83416

83981

84355

86001

86003

86005

86774

86775

87709

87765

88549

88584

91259

91260

92505

92654

92655

94300

96301

96302

100455

101322

101945

102389

102896

108214

108233

109346

109954

109955

109956

112136

116625

116843 116844

116995

[Figure 2 panels, left to right: Initial Matches, CC Merge, Query Expansion 1, Query Expansion 4, Skeletal Set]

Figure 2. The evolution of the match graph as a function of the rounds of matching, and the skeletal set corresponding to it. Notice how the second round of matching merges the two components into one, and how rapidly the query expansion increases the density of the within-component connections. The last column shows the skeletal set corresponding to the final match graph. The skeletal sets algorithm can break up connected components found during the match phase if it determines that a reliable reconstruction is not possible, which is what happens in this case.

color of the feature points associated with it. Again, this procedure proceeds in two steps. Given the per-component tracks, each node extracts the feature point coordinates and the point colors from the SIFT and image files that it owns. This data is gathered and broadcast over the network, where it is processed on a per-connected-component basis.

2.4. Geometric Estimation

Once the tracks have been generated, the next step is to run structure from motion (SfM) on every connected component of the match graph to recover a pose for every camera and a 3D position for every track. Most SfM systems for unordered photo collections are incremental, starting with a small reconstruction, then growing a few images at a time, triangulating new points, and doing one or more rounds of nonlinear least squares optimization (known as bundle adjustment [19]) to minimize the reprojection error. This process is repeated until no more cameras can be added. However, due to the scale of our collections, running such an incremental approach on all the photos at once is impractical.
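The incremental loop described above can be sketched as follows. All helper callables here (find_next_image, estimate_pose, triangulate_new_points, bundle_adjust) are hypothetical placeholders for the corresponding SfM stages, not our actual implementation:

```python
# Hedged sketch of a generic incremental SfM loop. The helper callables
# are hypothetical stand-ins; a real system would implement them with
# pose estimation, triangulation, and a bundle adjuster such as [19].
def incremental_sfm(images, seed_pair, find_next_image, estimate_pose,
                    triangulate_new_points, bundle_adjust):
    """Grow a reconstruction a few images at a time until no more
    cameras can be registered."""
    reconstruction = {"cameras": set(seed_pair), "points": []}
    remaining = set(images) - set(seed_pair)
    while remaining:
        nxt = find_next_image(reconstruction, remaining)
        if nxt is None:  # no remaining image sees enough known points
            break
        estimate_pose(nxt, reconstruction)       # register the camera
        reconstruction["cameras"].add(nxt)
        triangulate_new_points(nxt, reconstruction)
        bundle_adjust(reconstruction)            # nonlinear refinement
        remaining.discard(nxt)
    return reconstruction
```

The loop structure makes the scaling problem visible: bundle adjustment runs after every few registrations, so its cost dominates as the reconstruction grows.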

The incremental reconstruction procedure described above carries the implicit assumption that all images contribute more or less equally to the coverage and accuracy of the reconstruction. Internet photo collections, by their very nature, are redundant: many photographs are taken from nearby viewpoints, and processing all of them does not necessarily add to the reconstruction. It is thus preferable to find and reconstruct a minimal subset of photographs that captures the essential connectivity of the match graph and the geometry of the scene [8, 17]. Once this is done, we can add back in all the remaining images using pose estimation, triangulate all remaining points, and then do a final bundle adjustment to refine the SfM estimates.

For finding this minimal set, we use the skeletal sets algorithm of [17], which computes a spanning set of photographs that preserves important connectivity information in the image graph (such as large loops). In [17], a two-frame reconstruction is computed for each pair of matching images with known focal lengths. In our system, these pairwise reconstructions are computed as part of the parallel matching process. Once a skeletal set is computed, we estimate the SfM parameters of each resulting component using the incremental algorithm of [16]. The skeletal sets algorithm often breaks up connected components across weakly-connected boundaries, resulting in a larger set of components.
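As a rough illustration of the spanning-set idea (this is not the actual algorithm of [17], which also preserves large loops and reasons about reconstruction uncertainty), the following sketch keeps only a BFS spanning tree per connected component of a toy match graph:

```python
from collections import deque

def spanning_skeleton(adjacency):
    """Toy stand-in for the skeletal sets idea: keep, per connected
    component, a BFS spanning tree of the match graph. The real
    algorithm of [17] additionally preserves large loops and models
    reconstruction accuracy; this sketch preserves only connectivity."""
    kept_edges, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    kept_edges.append((u, v))  # edge kept in skeleton
                    queue.append(v)
    return kept_edges
```

Even this crude version shows the payoff: a component with tens of thousands of pairwise matches can be covered by a set of edges linear in the number of images.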

2.4.1 Bundle Adjustment

Having reduced the size of the SfM problem down to the skeletal set, the primary bottleneck in the reconstruction process is the nonlinear minimization of the reprojection error, or bundle adjustment (BA). The best performing publicly available BA software is Sparse Bundle Adjustment (SBA) by Lourakis & Argyros [9]. The key to its high performance is the use of the so-called Schur complement trick [19] to reduce the size of the linear system (also known as the normal equations) that needs to be solved in each iteration of the Levenberg-Marquardt (LM) algorithm. The size of this linear system depends on both the 3D point and the camera parameters, whereas the size of the Schur complement depends only on the camera parameters. SBA then uses a dense Cholesky factorization to factor and solve the resulting reduced linear system. Since the number of 3D points in a typical SfM problem is usually two or more orders of magnitude larger than the number of cameras, this leads to substantial savings. This works for small to moderately sized problems, but for large problems with thousands of images, computing the dense Cholesky factorization of the Schur complement becomes a space and time bottleneck. For such problems, however, the Schur complement itself is quite sparse (a 3D point is usually visible in only a few cameras), and exploiting this sparsity can lead to significant time and space savings.
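To make the trick concrete, here is a small numerical illustration on a damped normal-equations system H dx = g, partitioned into camera and point blocks. The sizes are toy values and the matrices are random; they are not drawn from an actual SfM problem:

```python
import numpy as np

# Illustrative Schur complement elimination [19] on H dx = g, with H
# partitioned into camera (c) and point (p) blocks. Toy sizes only.
nc, npts = 4, 50                       # camera vs. point parameters
rng = np.random.default_rng(0)
J = rng.standard_normal((200, nc + npts))
H = J.T @ J + 1e-3 * np.eye(nc + npts)  # damped, as in an LM step
g = rng.standard_normal(nc + npts)

B, E = H[:nc, :nc], H[:nc, nc:]
C = H[nc:, nc:]                        # in real BA: block diagonal,
v, w = g[:nc], g[nc:]                  # 3x3 per point, cheap to invert

# Reduced camera system: (B - E C^-1 E^T) dc = v - E C^-1 w.
Cinv_Et = np.linalg.solve(C, E.T)
Cinv_w = np.linalg.solve(C, w)
S = B - E @ Cinv_Et                    # Schur complement, nc x nc
dc = np.linalg.solve(S, v - E @ Cinv_w)  # small dense solve
dp = Cinv_w - Cinv_Et @ dc             # back-substitute for points

# Agrees with solving the full (nc + npts) system directly.
assert np.allclose(np.concatenate([dc, dp]), np.linalg.solve(H, g))
```

The dense solve shrinks from (nc + npts) unknowns to nc unknowns, which is exactly where SBA gets its speedup, and where a dense factorization of S stops scaling once nc reaches the thousands.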

We have developed new high-performance bundle adjustment software that, depending upon the size of the problem, chooses between a truncated and an exact step LM algorithm. In the first case, a block diagonal preconditioned conjugate gradient method is used to approximately solve the normal equations. In the second case, CHOLMOD [4], a sparse direct method for computing the Cholesky factorization, is used to exactly solve the normal equations via the Schur complement trick. The first algorithm has low time complexity per iteration but uses more LM iterations, while the second one converges faster at the cost of more time and memory per iteration. The resulting code uses significantly less memory than SBA and runs up to an order of magnitude faster. The exact runtime and memory savings depend upon the sparsity structure of the linear system involved.

Data set | Images | Cores | Registered | Pairs verified | Pairs found | Matching (hrs) | Skeletal sets (hrs) | Reconstruction (hrs)
Dubrovnik | 57,845 | 352 | 11,868 | 2,658,264 | 498,982 | 5 | 1 | 16.5
Rome | 150,000 | 496 | 36,658 | 8,825,256 | 2,712,301 | 13 | 1 | 7
Venice | 250,000 | 496 | 47,925 | 35,465,029 | 6,119,207 | 27 | 21.5 | 16.5

Table 1. Matching and reconstruction statistics for the three data sets.
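The truncated-step option can be sketched as a conjugate gradient solver with a block-Jacobi (block diagonal) preconditioner. This is illustrative only; it assumes the system size is a multiple of the block size, and the paper's solver operates on the actual BA normal equations:

```python
import numpy as np

def block_jacobi_pcg(H, g, block, tol=1e-8, max_iter=500):
    """Approximately solve H x = g (H symmetric positive definite) with
    conjugate gradients, preconditioned by the inverses of H's diagonal
    blocks of size `block`. A sketch, not the paper's implementation."""
    n = H.shape[0]
    # The preconditioner M^-1: inverted diagonal blocks of H.
    inv_blocks = [np.linalg.inv(H[i:i + block, i:i + block])
                  for i in range(0, n, block)]

    def apply_Minv(r):
        return np.concatenate([inv_blocks[k] @ r[i:i + block]
                               for k, i in enumerate(range(0, n, block))])

    x = np.zeros(n)
    r = g - H @ x
    z = apply_Minv(r)
    p = z.copy()
    for _ in range(max_iter):
        Hp = H @ p
        alpha = (r @ z) / (p @ Hp)
        x += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:   # truncate once residual is small
            break
        z_new = apply_Minv(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

# Example on a small SPD system (block size 3, 12 unknowns).
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 12))
H = A.T @ A + np.eye(12)
g = rng.standard_normal(12)
x = block_jacobi_pcg(H, g, block=3)
```

Each iteration costs only a matrix-vector product plus the cheap block solves, which is why this route trades more LM iterations for far less time and memory per iteration.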

2.5. Distributed Computing Engine

Our matching and reconstruction algorithm is implemented as a two-layered system. At the base of the system is an application-agnostic distributed computing engine. Except for a small core, which is easily ported, the system is cross-platform and runs on all major flavors of UNIX and Microsoft Windows. The system is small and extensible, and while it comes with a number of scheduling policies and data transfer primitives, users are free to add their own.

The system is targeted at batch applications that operate on large amounts of data. It has extensive support for local caching of application data and on-demand transfer of data across the network. If a shared filesystem is available, it can be used, but for maximum performance, all data is stored on local disk and only transferred across the network as needed. It supports a variety of scheduling models, from trivially data-parallel tasks to general map-reduce style computation. The distributed computing engine is written as a set of Python scripts, with each stage of computation implemented as a combination of Python and C++ code.
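As a toy illustration of the map-reduce style scheduling model (the engine's actual API is not shown, and its data locality, caching, and network transfer machinery are omitted here):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_reduce(records, mapper, reducer, workers=4):
    """Tiny map-reduce sketch: run the mapper over records in parallel,
    group the emitted (key, value) pairs by key, then reduce each group.
    Illustrative only; the real engine adds local-disk caching and
    on-demand data transfer across the cluster."""
    grouped = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pairs in pool.map(mapper, records):
            for key, value in pairs:
                grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

# Example: word counting across "documents".
docs = ["a b a", "b c"]
counts = map_reduce(docs,
                    mapper=lambda doc: [(w, 1) for w in doc.split()],
                    reducer=lambda key, values: sum(values))
```

The same shape covers the pipeline's stages: features as mapper outputs keyed by image, verification jobs as reducer work keyed by image pair.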

3. Experiments

We report here the results of running our system on three

city data sets downloaded from flickr.com: Dubrovnik, Croatia; Rome; and Venice, Italy. Figures 3(a), 3(b) and 3(c) show reconstructions of the largest connected components in these data sets. Due to space considerations, only a sample of the results is shown here; we encourage the reader to visit http://grail.cs.washington.edu/rome, where the complete results are posted, and where additional results, data, and code will be posted over time.

The experiments were run on a cluster of 62 nodes, each with dual quad-core processors, on a private network with 1 GB/sec Ethernet interfaces. Each node was equipped with 32 GB of RAM and 1 TB of local hard disk space. The nodes were running the Microsoft Windows Server 2008 64-bit operating system.

The same vocabulary tree was used in all experiments. It was trained off-line on 1,918,101 features extracted from 20,000 images of Rome (not used in the experiments reported here). We trained a tree with a branching factor of 10 and a total of 136,091 vertices. For each image, we used the vocabulary tree to find the top 20 matching images. The top k1 = 10 matches were used in the first verification stage, and the next k2 = 10 matches were used in the second component matching stage. Four rounds of query expansion were done. We found that in all cases, the ratio of the number of matches performed to the number of matches verified starts dropping off after four rounds. Table 1 summarizes statistics of the three data sets.
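The retrieval step can be illustrated with a bag-of-words TF-IDF scorer over already-quantized features. This toy stand-in omits the hierarchical quantization that an actual vocabulary tree [11] performs, and none of the names below come from our implementation:

```python
import math
from collections import Counter

def top_matches(query_words, db_words, k=20):
    """Rank database images against a query by cosine similarity of
    TF-IDF weighted visual-word histograms. Sketch of the scoring in
    [11, 15]; images here are pre-quantized word lists, whereas a real
    vocabulary tree also performs the feature-to-word quantization."""
    n = len(db_words)
    df = Counter()                       # document frequency per word
    for words in db_words.values():
        df.update(set(words))
    idf = {w: math.log(n / df[w]) for w in df}

    def tfidf(words):
        tf = Counter(words)
        vec = {w: tf[w] * idf.get(w, 0.0) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {w: v / norm for w, v in vec.items()}

    q = tfidf(query_words)
    scores = []
    for img, words in db_words.items():
        d = tfidf(words)
        scores.append((sum(q[w] * d.get(w, 0.0) for w in q), img))
    scores.sort(reverse=True)
    return [img for _, img in scores[:k]]
```

In our pipeline, the first k1 of these candidates would go to the initial verification stage and the next k2 to the component matching stage.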

The reconstruction timing numbers in Table 1 bear some explanation. It is surprising that the SfM time for Dubrovnik is so much higher than for Rome, and almost the same as for Venice, both of which are much larger data sets. The reason lies in how the data sets are structured. The Rome and Venice data sets are essentially collections of landmarks, which at large scale have a simple geometry and visibility structure. The largest connected component in Dubrovnik, on the other hand, captures the entire old city. With its narrow alleyways, complex visibility, and widely varying viewpoints, it is a much more complicated reconstruction problem. This is reflected in the sizes of the skeletal sets associated with the largest connected components, as shown in Table 2. As we mentioned above, the skeletal sets algorithm often breaks up connected components across weakly-connected boundaries; therefore, the size of the connected component returned by the matching system (CC1) is larger than the connected component (CC2) that the skeletal sets algorithm returns.

4. Discussion

At the time of writing this paper, searching on Flickr.com

for the keywords “Rome” or “Roma” results in over 2,700,000 images. Our aim is to be able to reconstruct as much of the city as possible from these photographs in 24 hours. Our current system is about an order of magnitude away from this goal. We believe that our current approach can be scaled to problems of this size. However, a number


Figure 3. (a) Dubrovnik: Four different views and associated images from the largest connected component. Note that the component captures the entire old city, with both street-level and roof-top detail. The reconstruction consists of 4,585 images and 2,662,981 3D points with 11,839,682 observed features.

Colosseum: 2,097 images, 819,242 points. Trevi Fountain: 1,935 images, 1,055,153 points. Pantheon: 1,032 images, 530,076 points. Hall of Maps: 275 images, 230,182 points.
(b) Rome: Four of the largest connected components visualized at canonical viewpoints [14].

San Marco Square: 13,699 images, 4,515,157 points. Grand Canal: 3,272 images, 561,389 points.
(c) Venice: The two largest connected components, visualized from two different viewpoints each.

of open questions remain.

The track generation, skeletal sets, and reconstruction algorithms all operate at the level of connected components. This means that the largest few components completely dominate these stages. Currently, our reconstruction algorithm only utilizes multi-threading when solving the bundle adjustment problem. While this helps, it is not a scalable solution to the problem, as depending upon the connectivity pattern of the match graph, this can take an inordinate amount of time and memory. The key reason we are able to reconstruct these large image collections is the large amount of redundancy present in Internet collections, which the skeletal sets algorithm exploits. We are currently exploring ways of parallelizing all three of these steps, with particular emphasis on the SfM system.

Data set | CC1 | CC2 | Skeletal Set | Reconstructed
Dubrovnik | 6,076 | 4,619 | 977 | 4,585
Rome | 7,518 | 2,106 | 254 | 2,097
Venice | 20,542 | 14,079 | 1,801 | 13,699

Table 2. Reconstruction statistics for the largest connected components in the three data sets. CC1 is the size of the largest connected component after matching; CC2 is the size of the largest component after skeletal sets. The last column lists the number of images in the final reconstruction.

The runtime performance of the matching system depends critically on how well the verification jobs are distributed across the network. This is facilitated by the initial distribution of the images across the cluster nodes. An early decision to store images according to the name of the user and the Flickr ID of the image meant that most images taken by the same user ended up on the same cluster node. Looking at the match graph, it turns out (quite naturally in hindsight) that a user's own photographs have a high probability of matching amongst themselves. The ID of the person who took the photograph is just one kind of metadata associated with these images. A more sophisticated strategy would exploit all the textual tags and geotags associated with the images to predict which images are likely to match, and then distribute the data accordingly.
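The locality heuristic above amounts to hashing on shared metadata so that likely-matching images land on the same node. A minimal sketch, with illustrative field names that do not reflect our actual storage schema:

```python
import hashlib

def assign_node(image_metadata, num_nodes):
    """Sketch of metadata-driven placement: co-locate images likely to
    match by hashing on a shared key (here the uploader's user id;
    geotags or textual tags could be folded into the key the same way).
    Field names are hypothetical, not our storage schema."""
    key = image_metadata.get("user_id") or image_metadata["image_id"]
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes
```

Because the hash depends only on the user, all of one user's photos map to one node, so the verification jobs among them need no cross-node image transfer.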

Finally, our system is designed with batch operation in mind. A more challenging problem is to start with the output of our system and add more images as they become available.

Acknowledgement

This work was supported in part by SPAWAR, NSF grant IIS-0811878, the Office of Naval Research, the University of Washington Animation Research Labs, and Microsoft. We also thank Microsoft Research for generously providing access to their HPC cluster. The authors would also like to acknowledge discussions with Steven Gribble, Aaron Kimball, Drew Steedly, and David Nister.

References

[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2.

[2] M. Antone and S. Teller. Scalable extrinsic calibration of omni-directional image networks. IJCV, 49(2):143–174, 2002.

[3] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. JACM, 45(6):891–923, 1998.

[4] Y. Chen, T. Davis, W. Hager, and S. Rajamanickam. Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate. TOMS, 35(3), 2008.

[5] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, pages 1–8, 2007.

[6] C. Fruh and A. Zakhor. An automated method for large-scale, ground-based city model acquisition. IJCV, 60(1):5–24, 2004.

[7] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SISC, 20(1):359, 1999.

[8] X. Li, C. Wu, C. Zach, S. Lazebnik, and J. Frahm. Modeling and Recognition of Landmark Image Collections Using Iconic Scene Graphs. In ECCV, pages 427–440, 2008.

[9] M. I. A. Lourakis and A. A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM TOMS, 36(1):1–30, 2009.

[10] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[11] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.

[12] M. Pollefeys, D. Nister, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. Kim, P. Merrell, et al. Detailed real-time urban 3D reconstruction from video. IJCV, 78(2):143–167, 2008.

[13] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In CVPR, pages 1–7, 2007.

[14] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In ICCV, pages 1–8, 2007.

[15] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, 2003.

[16] N. Snavely, S. M. Seitz, and R. Szeliski. Photo Tourism: Exploring photo collections in 3D. TOG, 25(3):835–846, 2006.

[17] N. Snavely, S. M. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR, pages 1–8, 2008.

[18] P. Torr and A. Zisserman. Robust computation and parametrization of multiple view relations. In ICCV, pages 727–732, 1998.

[19] B. Triggs, P. McLauchlan, R. I. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In Vision Algorithms'99, pages 298–372, 1999.

[20] A. Vedaldi. SIFT++. http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html.

[21] L. Zebedin, J. Bauer, K. Karner, and H. Bischof. Fusion of feature- and area-based information for urban buildings modeling from aerial imagery. In ECCV, pages IV: 873–886, 2008.