


A Contour Completion Model for Augmenting Surface Reconstructions

Nathan Silberman1, Lior Shapira2, Ran Gal2, Pushmeet Kohli3

1 Courant Institute, New York University, 2 Microsoft Research

Abstract. The availability of commodity depth sensors such as Kinect has enabled the development of methods which can densely reconstruct arbitrary scenes. While the results of these methods are accurate and visually appealing, they are quite often incomplete. This is either due to the fact that only part of the space was visible during the data capture process or due to the surfaces being occluded by other objects in the scene. In this paper, we address the problem of completing and refining such reconstructions. We propose a method for scene completion that can infer the layout of the complete room and the full extent of partially occluded objects. We propose a new probabilistic model, Contour Completion Random Fields, that allows us to complete the boundaries of occluded surfaces. We evaluate our method on synthetic and real-world reconstructions of 3D scenes and show that it quantitatively and qualitatively outperforms standard methods. We created a large dataset of partial and complete reconstructions which we will make available to the community as a benchmark for the scene completion task. Finally, we demonstrate the practical utility of our algorithm via an augmented-reality application where objects interact with the completed reconstructions inferred by our method.

Keywords: 3D Reconstruction, Scene Completion, Surface Reconstruction, Contour Completion

1 Introduction

The task of generating dense 3D reconstructions of scenes has seen great progress in the last few years. While part of this progress is due to algorithmic improvements, large strides have been made with the adoption of inexpensive depth cameras and the fusion of color and depth signals. The combined use of depth and color signals has been successfully demonstrated for the production of large-scale models of indoor scenes via both offline [1] and online [2] algorithms. Most RGB+D reconstruction methods require data that show the scene from a multitude of viewpoints and are not well suited to input sequences which contain a single view or a limited number of viewpoints. Moreover, these reconstruction methods are hindered by occlusion, as they make no effort to infer the geometry of parts of the scene that are not visible in the input sequences. Consequently, the resulting 3D models often contain gaps or holes and do not capture certain basic elements of a scene, such as the true extent and shape of objects and scene surfaces.



Fig. 1. Overview of the pipeline: Given a dense but incomplete reconstruction (a), we first detect planar regions (b) and classify them into ceiling (not shown), walls (tan), floor (green) and interior surfaces (pink) (c); non-planar regions are shown in light blue. Next, we infer the scene boundaries (d) and the shape of partially occluded interior objects, such as the dining table (e), to produce a complete reconstructed scene (f). Additional visualizations can be found in the supplementary material.

Accurate and complete surface reconstruction is of special importance in Augmented Reality (AR) applications, which are increasingly being used for both entertainment and commerce. For example, a recently introduced gaming platform [3] asks users to scan an interior scene from multiple angles; using the densely reconstructed model, the platform overlays graphically generated characters and gaming elements. Furniture retailers (such as IKEA) enable customers to visualize how their furniture will look when installed, without having to leave their homes. These applications often require a high-fidelity dense reconstruction so that simulated physical phenomena, such as lighting, shadows and object interactions (e.g. collisions), can be produced in a plausible fashion. Unfortunately, such reconstructions often require considerable effort on the part of the user: applications either demand that users provide video capture of sufficient viewpoint diversity or must operate using an incomplete model of the scene.

Our goal is to complement the surface reconstruction for an input sequence that is limited in viewpoint, and to "fill in" parts of the scene that are occluded or not visible to the camera. This goal is driven by both a theoretical motivation and a practical application. First, basic scene understanding requires high-level knowledge of how objects interact and extend in 3D space. While most scene understanding research is concerned with semantics and pixel labeling, relatively little work has gone into inferring object or surface extent, despite the prevalence and elemental nature of this faculty in humans. Second, online surface reconstruction pipelines such as KinectFusion [2] are highly suitable for AR applications and could benefit from a scene completion phase integrated into the pipeline.

Our method assumes a partial dense reconstruction of the scene, represented as a voxel grid in which each voxel is occupied, free, or of unknown state. We use the KinectFusion [2] method to compute this reconstruction, which also assigns a surface normal and a truncated signed distance function (TSDF) [4] value to the voxels. Given this input, our method first detects planar surfaces in the scene and classifies each one as being part of the scene layout (floor, walls, ceiling) or part of an internal object.

We then use the identities of the planes to extend them by solving a labeling problem using our Contour Completion Random Field (CCRF) model. Unlike pairwise Markov random fields, which essentially encourage short boundaries, the CCRF model encourages the discontinuities in the labeling to follow detected contour primitives such as lines or curves. We use this model to complete both the floor map of the scene and the extents of planar objects in the room. This provides us with a watertight 3D model of the room. Finally, we augment the original input volume to account for the extended and filled scene. The stages of our algorithm are illustrated in figure 1.

Our Contributions: This paper makes the following technical contributions: (1) We introduce the Contour Completion Random Field (CCRF), a novel algorithm for completing or extending contours in 2D scenes. (2) We describe an algorithm for inferring the extent of a scene and its objects. (3) We demonstrate how to efficiently re-integrate the inferred scene elements into a 3D scene representation to allow them to be used by downstream applications.

The rest of the paper is organized as follows. We discuss the relationship of our work to the prior art in dense reconstruction in section 2. In section 3 we describe how we detect and classify planar surfaces in the scene. In section 4 we discuss how we perform scene completion by inferring the extent of scene boundaries and extending internal planes using our novel contour completion MRF. The procedure for augmenting the original (TSDF-based) reconstruction with new values is discussed in section 5. In section 6 we perform qualitative and quantitative evaluation of our methods, and in section 7 we discuss our results, limitations and future work.

2 Related Work

KinectFusion [2] is a real-time dense surface mapping and tracking algorithm. It maintains an active volume in GPU memory, updated regularly with new depth observations from a Kinect camera. Each depth frame is tracked using the iterative closest point algorithm and updates the reconstruction volume, which is represented as a truncated signed distance function (TSDF) grid. At any point, the TSDF volume may be rendered (using ray casting) or transformed into an explicit surface using marching cubes [5] (or a similar algorithm). Our method extends this pipeline by accessing the TSDF volume at certain key frames and augmenting it. As new depth observations reveal previously occluded voxels, our augmentations are phased out. Our scene completion algorithm may be used to augment other surface reconstruction methods, such as [6,7], with little change; however, most of these methods have different motivations and are not real-time. In SLAM++ [8], a database of 3D models is used during the reconstruction process. This helps improve tracking, reduces representation size, and augments the reconstructed scene. However, this method requires intensive preparation time for each new scene in order to capture the various repeated objects found in it.


In recent years there have been many papers on scene understanding from a single image. Barinova et al. [9] geometrically parse a single image to identify edges, parallel lines, vanishing points and a level horizon. Hoiem et al. [10] estimate the layout of a room from a single image using a generalized box detector; this approach is not well suited to completing either highly occluded scenes or non-rectangular objects. Ruiqi and Hoiem [11] attempt to label occluded surfaces using a learning-based approach, which is limited by the number of classes that can be learned and does not infer the extent of the hidden objects. A more recent paper by the same authors [12] is more relevant to our approach: they detect a set of supporting horizontal surfaces in a single-image depth frame and deduce their extent using the features at each point and surrounding predictions. However, they do not attempt to infer the full scene layout and limit themselves to support-related objects.

Min Kim et al. [13] augment and accelerate the reconstruction of complex scenes by carefully scanning and modeling, in advance, objects which repeat in the scene. These objects (e.g. office chairs, monitors) are then matched in the noisy and incomplete point cloud of the larger scene and used to augment it. This approach is less suited to capturing a new environment, in which we cannot model the repeating objects independently (if there are any). Kim et al. [14] jointly estimate the 3D reconstruction of a scene and the semantic labels associated with each voxel on a coarse 3D volume. Their method works best with simple occlusions and does not extend the volume to estimate the overall shape of the room. Zheng et al. [15] employ physical and geometric reasoning to reconstruct a scene. Their method relies on successful object segmentation and a Manhattan-world assumption; they fill occluded parts of objects by extrapolating along the major axes of the scene. However, none of these methods can handle previously unseen objects and surfaces with non-linear boundaries.

3 Plane Detection and Classification

We now describe how our method detects the dominant planes in the partial voxel-based reconstruction. We denote the space of all possible 3D planes by ℋ and the set of planes present in the scene by H ⊂ ℋ. Let the set of all 3D points visible in the scene be denoted by P = {p_1, ..., p_N}. We estimate the most probable set of planes H by minimizing the following energy function:

H^* = \arg\min_{H \subset \mathcal{H}} \sum_{i=1}^{N} f_i(H) + \lambda |H|. \qquad (1)

where λ is a penalty on the number of planes and f_i penalizes points not explained by any plane: f_i(H) = min{ min_{h ∈ H} δ(p_i, h), λ_b }, where the function δ returns 0 if point p_i falls on plane h and infinity otherwise. Minimizing the first term alone has the trivial degenerate solution of including a plane for every point p_i in the set H. This is avoided by the second term of the energy, which acts as a regularizer, adding a penalty that increases linearly with the cardinality of H.
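To make the objective concrete, here is a minimal Python sketch of the energy in equation (1) together with the greedy minimization applied below. The (normal, offset) plane representation, the distance tolerance and all parameter values are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def energy(points, planes, lam=5.0, lam_b=1.0, tol=0.05):
    """Energy of Eq. (1): unexplained-point penalties plus a cost per plane.

    `planes` is a list of (normal, offset) pairs; a point counts as explained
    by a plane when |n . p - d| < tol, a practical stand-in for the paper's
    delta (which is 0 on the plane and infinity off it, capped by lam_b).
    """
    total = lam * len(planes)
    for p in points:
        costs = [0.0 if abs(np.dot(n, p) - d) < tol else np.inf
                 for n, d in planes]
        costs.append(lam_b)          # cap: leave the point unexplained
        total += min(costs)
    return total

def greedy_select(points, candidates, lam=5.0):
    """Greedily add the candidate plane that most reduces the energy."""
    chosen = []
    current = energy(points, chosen, lam)
    while True:
        best, best_e = None, current
        for c in candidates:
            if c in chosen:
                continue
            e = energy(points, chosen + [c], lam)
            if e < best_e:
                best, best_e = c, e
        if best is None:            # no candidate lowers the energy further
            return chosen
        chosen.append(best)
        current = best_e
```

On toy data the greedy loop keeps exactly those planes whose explanatory power outweighs the per-plane cost λ.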


Lemma 1. The energy function defined in equation (1) is a supermodular set function.

Proof in the supplemental material

Computing the Optimal H. Minimization of a supermodular function is an NP-hard problem even when the set of possible elements is finite (which is not true in our case). We employ a greedy strategy, starting from an empty set and repeatedly adding the element that leads to the greatest energy reduction. This method has been observed to achieve a good approximation [16]. We begin by using the Hough transform [17] to select a finite set of candidate planes. In our method, each 3D point and its surface normal vote for a plane equation parameterized by its azimuth θ, elevation ψ and distance from the origin ρ. Each of these votes is accrued in an accumulator matrix of size A × E × D, where A is the number of azimuth bins, E is the number of elevation bins and D is the number of distance bins 1. After each point has voted, we run non-maximal suppression to avoid accepting multiple planes that are too similar.
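The plane Hough voting might be sketched as follows. Only the bin counts A = 128, E = 64 and the dynamic 5 cm distance spacing come from the paper's footnote; the indexing scheme and return values are illustrative assumptions.

```python
import numpy as np

def hough_planes(points, normals, n_az=128, n_el=64, d_step=0.05):
    """Accumulate plane votes over (azimuth, elevation, distance) bins.

    Each 3D point votes for the plane passing through it with its own
    surface normal.  Distance bins are laid out dynamically between the
    minimum and maximum observed plane offsets, d_step apart.
    """
    # Signed offset of each point's plane from the origin: d = n . p.
    d = np.einsum('ij,ij->i', points, normals)
    d_min = d.min()
    n_d = int(np.ceil((d.max() - d_min) / d_step)) + 1
    acc = np.zeros((n_az, n_el, n_d), dtype=int)
    az = np.arctan2(normals[:, 1], normals[:, 0])          # in [-pi, pi]
    el = np.arcsin(np.clip(normals[:, 2], -1.0, 1.0))      # in [-pi/2, pi/2]
    ia = ((az + np.pi) / (2 * np.pi) * (n_az - 1)).astype(int)
    ie = ((el + np.pi / 2) / np.pi * (n_el - 1)).astype(int)
    idd = ((d - d_min) / d_step).astype(int)
    np.add.at(acc, (ia, ie, idd), 1)   # unbuffered accumulation of votes
    return acc, d_min
```

Non-maximal suppression over `acc` (not shown) would then yield the candidate planes.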

Once we have a set of candidate planes, we sort them in descending order by the number of votes they have received and iteratively associate points to each plane. A point can be associated to a plane if it has not previously been associated to any other plane and if its planar disparity and local surface normal difference are small enough 2. As an additional heuristic, each new plane and its associated points are broken into a set of connected components, ensuring that planes are locally connected.

Semantic Labeling. Once we have a set of planes, we classify each one independently into one of four semantic classes: Floor, Wall, Ceiling and Internal. To do so, we train a Random Forest classifier to predict each plane's class using the ground truth labels and 3D features from [18], which capture attributes of each plane including its height in the room, its size and its surface normal distribution. Planes classified as Floor, Wall or Ceiling will be used for inferring the floor plan and scene boundaries (section 4.6), whereas Internal planes will be extended and filled in a subsequent step (section 4.7).
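A much-simplified sketch of the feature extraction, with a hand-written rule standing in for the learned Random Forest (the paper trains on the full [18] feature set; the three features and all thresholds here are assumptions):

```python
import numpy as np

def plane_features(points, normal):
    """Three scalars loosely following Sec. 3: height, size, orientation."""
    h = float(np.mean(points[:, 2]))      # mean height in the room
    size = float(len(points))             # support size as an area proxy
    vert = float(abs(normal[2]))          # 1.0 means a horizontal plane
    return np.array([h, size, vert])

def classify_plane(feats, floor_z=0.1, ceil_z=2.4):
    """Heuristic stand-in for the learned classifier (an assumption, not
    the paper's method): horizontal planes near floor/ceiling height get
    those labels, vertical planes become Wall, everything else Internal."""
    h, _, vert = feats
    if vert > 0.9 and h < floor_z:
        return 'Floor'
    if vert > 0.9 and h > ceil_z:
        return 'Ceiling'
    if vert < 0.3:
        return 'Wall'
    return 'Internal'
```

A real implementation would replace `classify_plane` with a trained model (e.g. a random forest) fit on labeled planes.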

4 Scene Completion

Given the set of detected and classified planes, we infer the true extent of the scene, i.e., obtain a watertight room structure, and extend interior planes based on evidence from the scene itself.

4.1 Completion as a Labeling Problem

We now describe how to estimate the boundaries of planes as seen from a top-down view. We formulate boundary completion as a pixel labeling problem. Consider a set S of nodes that represent grid locations in the top-down view of the scene. We assume that a partial labeling of the nodes i ∈ S in the grid can be observed and is encoded by variables y_i, i ∈ S, where y_i = 1, y_i = 0 and y_i = −1 represent that i belongs to the plane, does not belong to the plane, and has uncertain membership, respectively. Given y, we want to estimate the true extent of the plane, which we denote by x. Specifically, we use the binary variable x_i to encode whether the plane covers the location of node i in the top view: x_i = 1 represents that node i belongs to the plane, while x_i = 0 represents that it does not.

1 We use A = 128, E = 64; D is determined dynamically by spacing bin edges 5 cm apart between the minimum and maximum points.

2 Planar disparity threshold = 0.1, angular disparity threshold = 0.1.

The traditional approach for pixel labeling problems is to use a pairwise Markov Random Field (MRF) model. The energy of any labeling x under the pairwise MRF model is defined as:

E(x) = \sum_{i \in S} \phi_i(x_i) + \sum_{ij \in \mathcal{N}} \phi_{ij}(x_i, x_j),

where φ_i encodes the cost of assigning label x_i and the φ_ij are pairwise potentials that encourage neighboring (N) nodes to take the same label. The unary potentials force the estimated labels x to be consistent with the observations y, i.e., φ_i(x_i) = ∞ if y_i ≠ −1 and x_i ≠ y_i, and φ_i(x_i) = 0 in all other cases, while the pairwise potentials take the form of an Ising model, φ_ij(x_i, x_j) = K[x_i ≠ x_j] for a constant K. The Maximum a Posteriori (MAP) labeling under this model can be computed in polynomial time using graph cuts. However, the results are underwhelming, as the pairwise model does not encode any information about how boundaries should be completed; it simply encourages a labeling with a small number of discontinuities.
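For reference, the baseline energy can be written down directly. This sketch only evaluates the energy of a proposed labeling (the paper minimizes it with graph cuts), and `beta` stands in for the Ising constant K:

```python
import numpy as np

def pairwise_mrf_energy(x, y, beta=1.0):
    """Energy of the baseline pairwise (Ising) MRF on a 2D grid.

    x: proposed binary labeling; y: observations in {-1, 0, 1} where -1
    means 'unknown'.  Unaries are hard constraints (infinite cost when x
    disagrees with an observed y); pairwise terms pay `beta` per label
    discontinuity on the 4-connected grid.
    """
    observed = y != -1
    if np.any(x[observed] != y[observed]):
        return np.inf                    # violates an observation
    # Count label discontinuities between horizontal and vertical neighbors.
    cuts = np.sum(x[:, 1:] != x[:, :-1]) + np.sum(x[1:, :] != x[:-1, :])
    return beta * float(cuts)
```

Minimizing this energy fills the unknown pixels so as to shorten the boundary, which is exactly the behavior the CCRF below is designed to improve on.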

4.2 Contour Completion Random Field

Unlike the standard MRF, which penalizes the number of transitions in the labeling, our Contour Completion Random Field (CCRF) model adds a penalty based on the least number of curve primitives that can explain all the transitions. We implement this by introducing higher-order potentials into the model. These potentials are defined over overlapping sets of edges, where each set follows some simple (low-dimensional) primitive curve shape such as a line or a circle. Formally, the energy function for the CCRF model can be written as:

E(x) = \sum_{i \in S} \phi_i(x_i) + \sum_{g \in G} \Psi_g(x) \qquad (2)

where the Ψ_g are our curve completion potentials and G is the set of edge groups, each of which contains the edges that follow a single curve. The curve completion potentials have a diminishing-returns property. More formally,

\Psi_g(x) = F\Big( \sum_{ij \in E_g} \psi_{ij}(x_i, x_j) \Big), \qquad (3)

where E_g is the set of edges that defines the curve or edge group g, and F is a non-decreasing concave function. In our experiments, we define F as an upper-bounded linear function, i.e., F(t) = min{λt, θ}, where λ is the slope of the function and θ is the upper bound. It can be seen that once a few edges are cut, i.e. t ≥ θ/λ, the remaining edges in the group can be cut without any penalty. This behavior does not prevent the boundary in the labeling from including a large number of edges, as long as they belong to the same group (curve). The exact nature of these groups is described below.

4.3 Defining Edge Groups

We consider two types of edge groups: straight lines and parabolas. While previous work has demonstrated the ability of the Hough transform [17] to detect other shapes, such as circles and ellipses, such higher-parameter shapes require substantially more memory and computation, and we found lines and parabolas sufficiently flexible to capture most of the cases we encountered.

To detect lines, we use a modified Hough transform that detects not only the lines in the image but also the direction of the transition (plane to free space, or vice versa). We use an accumulator with 3 parameters: ρ, the distance from the origin to the line; θ, the angle between the X axis and the vector from the origin to the line; and a quaternary variable d, which indicates the direction of the transition (both bottom-top and left-right directions) 3. Following the accumulation of votes, we run non-maximal suppression and create an edge group for each resulting line.

The standard Hough transform for parabolas requires 4 parameters. To avoid the computational and memory demands of such a design, we introduce a novel and simple heuristic, detailed in the supplemental material.

Fig. 2. Contour Completion Random Field: (a) A top-down view of a partially occluded plane. (b) We detect lines and parabolas along the contour of the known pixels (stippled black lines) and hallucinate parallel lines (in red). (c) We apply CCRF inference to extend the plane.

3 We use 400 angular bins for θ and evenly spaced bins for ρ, 1 unit apart. The minimum number of votes allowed was set to 10.


4.4 Hierarchical Edge Groups

While using detected lines or curves may encourage the correct surface boundaries to be inferred in many cases, in others there is no evidence in the image of how a shape should be completed; see, for example, the right side of the shape in figure 2 and the synthetic examples in figure 3(b). Motivated by the gestalt laws of perceptual grouping, we add edge groups whose use in completion encourages shapes that exhibit simple closure and symmetry. More specifically, for each detected line we add additional parallel edge groups on the occluded side of the shape.

Defining edge groups that completely cover another edge group would clearly lead to double counting. To prevent this, we modify the formulation to ensure that only one group from each hierarchy of edge groups is counted. To be precise, our CCRF model allows edge groups to be organized hierarchically, so that a set of possible edges has a parent and only a single child per parent may be active. This is formalized as:

E(x) = \sum_{i \in S} \phi_i(x_i) + \sum_{g \in G} \min_{k \in c(g)} \Psi_k(x) \qquad (4)

where c(g) denotes the set of child edge groups of each parent g. To summarize, our edge groups are obtained by fitting lines and parabolas to the input image, thus encouraging transitions that are consistent with these edges. As indicated in Equation 4, not all edge groups can be active simultaneously; in particular, any line used to hallucinate a series of edges is considered the parent of its hallucinated child lines. Consequently, we constrain at most a single hallucinated line to be active at a time.

4.5 Inference with Hierarchical Edge Groups

Inference under higher-order potentials defined over edge groups was recently shown to be NP-hard, even for binary random variables, by Jegelka et al. [19], who proposed a special-purpose iterative algorithm for performing approximate inference in this model. Later, Kohli et al. [20] proposed an approximate method that could deal with multi-label variables. Their method transforms edge group potentials into a sum of pairwise potentials with the addition of auxiliary variables, which allows the use of standard inference methods like graph cuts. However, both of these algorithms are unsuitable for the CCRF because of the special hierarchical structure defined over our edge groups.

Inspired by [20], we transform the higher-order curve completion potential (3) into the following pairwise form:

\Psi^p_g(x) = T + \min_{h_g, z} \Big\{ \sum_{ij \in E_g} \theta_{ij}\big((x_i + x_j - 2 z_{ij}) h_g - 2 (x_i + x_j) z_{ij} + 4 z_{ij}\big) - T h_g \Big\}. \qquad (5)

where h_g is the binary auxiliary variable corresponding to the group g, and z_ij, for all ij ∈ E_g, are binary auxiliary variables corresponding to the edges that constitute the edge group g. However, this transformation deviates from the energy of the hierarchical CCRF (equation 4), as it allows multiple edge groups in the hierarchy to be active at once.

To ensure that only one edge group is active in each edge group hierarchy, we introduce a series of constraints on the binary auxiliary variables corresponding to the edge groups. More formally, we minimize the following energy:

E(x) = \sum_{i \in S} \phi_i(x_i) + \sum_{g \in G} \Psi^p_g(x) \quad \text{s.t.} \quad \forall g, \; \sum_{k \in c(g)} h_k \le 1 \qquad (6)

where c(g) denotes the set of child edge groups of each parent edge group g. The minimum energy configuration of this formulation is equivalent to that of the hierarchical CCRF (equation 4).

To find the MAP solution, we now need to minimize the constrained pairwise energy function (equation 6). We observe that once the values of the group auxiliary variables (the h's) are fixed, the resulting energy becomes unconstrained and submodular, and can thus be minimized using graph cuts. We use this observation to perform inference by exhaustively searching over the space of edge-group auxiliary variables while minimizing the rest of the energy using graph cuts. We make the algorithm still more efficient by not exploring the activity of a child edge group unless its parent is already active: we start by exhaustively searching over the auxiliary variables of the parent edge groups (at the top of the hierarchy), and whenever a group variable is found to be active, we check whether one of its child variables can be made active instead.
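The search can be illustrated on toy instances. This sketch brute-forces both the group activity patterns and the labeling x (the paper instead solves each fixed-h subproblem with graph cuts), and its data structures are assumptions for illustration only:

```python
from itertools import product

def minimize_ccrf(n, unary, groups, children):
    """Exhaustive stand-in for the inference of Sec. 4.5 (toy sizes only).

    `unary[i][k]` is the cost of x_i = k; `groups` maps a group id to
    (edges, lam, theta); `children` maps a parent id to its child ids.
    For each parent we let either the parent or (at most) one child
    contribute, mirroring the hierarchical min in Eq. (4).
    """
    def group_cost(x, active):
        total = 0.0
        for g in active:
            edges, lam, theta = groups[g]
            t = sum(1 for i, j in edges if x[i] != x[j])
            total += min(lam * t, theta)      # the F(t) = min(lam*t, theta) cap
        return total

    def energy(x, active):
        return sum(unary[i][x[i]] for i in range(n)) + group_cost(x, active)

    best_e, best_x = float('inf'), None
    parents = [g for g in groups if all(g not in c for c in children.values())]
    # Activity patterns: each parent either stands or is replaced by one child.
    options = [[p] + children.get(p, []) for p in parents]
    for choice in product(*options):
        for x in product((0, 1), repeat=n):   # graph cuts would go here
            e = energy(x, list(choice))
            if e < best_e:
                best_e, best_x = e, list(x)
    return best_x, best_e
```

On a 3-node chain with one line group, the optimum cuts the group exactly once, paying the capped curve cost rather than a per-edge Ising cost.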

4.6 Inferring Scene Boundaries

To extend and fill the scene boundaries, we begin by projecting the free space of the input TSDF and the Wall planes (predicted by our classifier) onto the floor plane. Given the 2D point cloud induced by these projections, we discretize the points to form a projection image, illustrated by figure 2, in which each pixel y_i takes the value free space, wall or unknown. To infer the full scene layout, we apply the CCRF (Equation 2) to infer the values of the unknown pixels. In this case, we consider free space to be the area to be expanded (y_i = 1) and the walls to be the surrounding area that should avoid being filled (y_i = 0). We first detect the lines and curves of the walls to create a series of edge groups. Next, we set φ_i(x_i = 1) = ∞ if y_i = 0 and φ_i(x_i = 0) = ∞ if y_i = 1. Finally, we add a slight bias [6] towards assigning free space: φ_i(x_i = 0) = ε 4.
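The unary terms just described can be tabulated directly; the (H, W, 2) cost-array layout is an assumption of this sketch, while the ∞ constraints and the ε bias follow the text:

```python
import numpy as np

def boundary_unaries(y, eps=1e-6):
    """Unary potentials for scene-boundary completion (Sec. 4.6).

    y is the top-down projection image: 1 = free space, 0 = wall,
    -1 = unknown.  Returns phi of shape (H, W, 2), where phi[..., k] is
    the cost of assigning x = k.
    """
    phi = np.zeros(y.shape + (2,))
    phi[..., 1][y == 0] = np.inf   # observed walls may not become free space
    phi[..., 0][y == 1] = np.inf   # observed free space must stay free
    phi[..., 0][y == -1] = eps     # tiny cost on x = 0 biases unknowns
                                   # toward the free-space label
    return phi
```

These costs plug into the φ_i terms of the CCRF energy; the edge groups supply the rest.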

4.7 Extending Planar Surfaces

Once the scene boundary has been completed, we infer the full extent of internal planar surfaces. For each internal plane, we project the TSDF onto the detected 2D plane as follows. First, we find a coordinate basis for the plane using PCA and estimate its major and minor axes, M and N, respectively. Next, we create an image of size (2N + 1) × (2M + 1) whose center pixel corresponds to the centroid of the plane. We sample a grid of size (2N + 1) × (2M + 1) along the plane basis, and the TSDF values sampled at each grid location are used to assign the image's pixels: if the sampled TSDF value is occupied, y_i is set to 1; if it is free space, y_i is set to 0; and if it is unknown, y_i is set to −1. In practice, we also sample several voxels away from the plane (along the surface normal direction). This heuristic reduces the effects of sensor noise and of error from plane fitting.

4 ε = 1e-6.

Fig. 3. (Top) Pixel-wise classification error on our synthetic dataset. The plot shows performance as a function of the width of the evaluation region. (Bottom) Examples from our dataset of synthetic images showing classification results for occluded pixels (yellow) using four methods.
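The PCA plane basis of section 4.7 might be computed as follows; the use of SVD for the principal directions and the rounding of the half-extents are assumptions of this sketch:

```python
import numpy as np

def plane_basis(points):
    """PCA coordinate frame for a detected plane.

    Returns the centroid, the two in-plane axes (major, minor) and the
    integer half-extents M, N of the point set along them.
    """
    c = points.mean(axis=0)
    # Right-singular vectors of the centered points are the PCA axes,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(points - c, full_matrices=False)
    major, minor = vt[0], vt[1]
    coords = (points - c) @ np.stack([major, minor]).T
    # Small tolerance guards the ceil against floating-point noise.
    M = int(np.ceil(np.abs(coords[:, 0]).max() - 1e-9))
    N = int(np.ceil(np.abs(coords[:, 1]).max() - 1e-9))
    return c, major, minor, M, N
```

The (2N + 1) × (2M + 1) observation image y is then filled by sampling the TSDF on a grid laid out along `major` and `minor` from the centroid.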

Once y has been created, we detect all lines and parabolas in the image and hallucinate the necessary lines to create our edge groups. Next, we assign the local potentials in the same manner as described in Section 4.6.

5 Augmenting the Original Volume

The result of our scene completion is a watertight scene boundary and extended interior planes. As the final step in our pipeline, we augment the original TSDF we imported. For the scene boundary, we simplify the resulting polyline representing the boundary and sample points along it from floor to ceiling height. For the interior planes, we sample points in the extended parts of the planes. For each sampled point (sampled as densely as required, in our case γ) we traverse a Bresenham line through the volume from the voxel closest to the point, in two directions: along its normal and opposite to its normal. For each encountered voxel, we update the TSDF value with the distance to the surface. If the dimensions of the original volume do not suffice to hold the new scene boundaries, we create a new, larger volume and copy the original TSDF into it before augmenting it.
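A minimal sketch of the volume update for one sampled surface point, assuming a voxel-indexed TSDF array with unit voxel size; stepping voxel-by-voxel along ±normal stands in for the Bresenham-line traversal described above:

```python
import numpy as np

def stamp_surface_point(tsdf, point, normal, trunc=3):
    """Write truncated signed distances around one sampled surface point.

    Starting from the voxel nearest `point`, step along +normal and
    -normal and store the signed distance to the surface, clipped to
    [-trunc, trunc].  The truncation width and tie-breaking rule are
    assumptions of this sketch.
    """
    base = np.round(point).astype(int)
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    for step in range(-trunc, trunc + 1):
        v = np.round(base + step * n).astype(int)
        if np.any(v < 0) or np.any(v >= np.array(tsdf.shape)):
            continue                          # outside the volume
        d = float(np.clip(step, -trunc, trunc))
        # Keep whichever value is closest to the surface if already visited.
        if abs(d) < abs(tsdf[v[0], v[1], v[2]]):
            tsdf[v[0], v[1], v[2]] = d
    return tsdf
```

Calling this for every densely sampled boundary and plane point yields an augmented TSDF that downstream meshing (e.g. marching cubes) treats exactly like observed data.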

The augmented TSDF is continuously updated with new evidence in the originating surface reconstruction pipeline (e.g., as the user moves). Augmented areas are phased out as the voxels they filled become known.

6 Experiments

We evaluate our method in several ways. First, we demonstrate the effectiveness of the CCRF model by comparing it to several baselines for performing in-painting of binary images. Second, we compare our results to a recently introduced method [12] that performs extent-reasoning. Third, we demonstrate qualitative results on a newly collected set of indoor scenes.

6.1 Synthetic validation

We generated a dataset of synthetic images inspired by shapes of planar surfaces commonly found in indoor scenes. The dataset contained 16 prototypical shapes meant to resemble objects such as desks, conference tables, beds, kitchen counters, etc. The 16 prototypes were first divided into training and test sets. Each of the 8 training and 8 testing prototypes was then randomly rotated and occluded 10 times, resulting in 80 training and 80 test images.
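The rotate-and-occlude augmentation can be sketched as follows. For simplicity this sketch rotates only by multiples of 90 degrees, whereas the paper's rotations are presumably arbitrary, and the occlusion model (a random rectangle marked unknown) is our assumption:

```python
import numpy as np

def perturb_prototype(shape_mask, rng, max_occ=0.4):
    """Randomly rotate and occlude a binary prototype mask.

    Returns an int8 image with 1 = occupied, 0 = free, -1 = unknown
    (occluded), matching the label convention used for Y.
    """
    occluded = np.rot90(shape_mask, k=rng.integers(4)).astype(np.int8)
    h, w = occluded.shape
    # mark a random rectangle as unknown, simulating an occluder
    y0, x0 = rng.integers(h), rng.integers(w)
    dy = rng.integers(1, max(2, int(h * max_occ)))
    dx = rng.integers(1, max(2, int(w * max_occ)))
    occluded[y0:y0 + dy, x0:x0 + dx] = -1
    return occluded
```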

Since we are primarily concerned with how these methods perform in predicting the boundaries of the input images, we use the evaluation metric of [21], in which we only evaluate the correctness of pixels immediately near the boundary areas. We computed evaluation scores for various widths of the evaluation region. Quantitative results can be found in Figure 3. We compare against several baseline methods commonly used for in-painting binary images, including both 4-connected and 8-connected graph cuts and a non-parametric patch-matching algorithm for in-painting. The ideal parameters for each algorithm were fine-tuned on the training set and then applied to the test set.
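A sketch of such a boundary-band metric, in the spirit of [21] (the band construction here is our assumption; the exact definition used in the paper may differ):

```python
import numpy as np

def dilate(mask, iterations=1):
    """4-connected binary dilation implemented with array shifts."""
    out = mask.copy()
    for _ in range(iterations):
        padded = np.pad(out, 1)
        out = (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1]
               | padded[1:-1, :-2] | padded[1:-1, 2:])
    return out

def boundary_accuracy(pred, gt, band_width):
    """Score `pred` only within `band_width` pixels of the gt boundary."""
    boundary = gt & dilate(~gt)          # gt pixels touching a non-gt pixel
    band = dilate(boundary, band_width)  # evaluation region around boundary
    return float((pred[band] == gt[band]).mean())
```

Varying `band_width` reproduces the evaluation at the different region widths reported in Figure 3.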

6.2 Single Frame Scene Completion

We compare our approach to the work of Guo and Hoiem [12] to demonstrate its applicability to single-frame scene completion. While we produce a binary label indicating which voxels are occupied, [12] outputs a heat map. Therefore, for each frame of the NYU V2 dataset [18], we computed the filled voxels using our regular pipeline. Since KinectFusion was not designed for single-image inputs, a small number of images failed to be fused and were ignored during evaluation.


For the rest, we used the metric of [12] and report the accuracy using the same false positive rate as in our method.

[12] achieved 62.6% accuracy compared to our 60.6%. This demonstrates that while our method was not designed for single-frame completion and does not use RGB data, it achieves comparable results. For visual comparisons, see Figure 4.

Fig. 4. Comparison with [12]: (a) Frames from the NYU2 dataset. (b) Ground truth support extent. (c) Predicted extent from [12]. (d) A top-down view of our completed scene, where inferred scene boundaries are highlighted in white and extended interior objects are highlighted in red.

6.3 Qualitative Analysis

Using a Kinect sensor attached to a notebook computer, we captured more than 100 scenes. Each scene is a collection of color and depth frames, ranging from hundreds to thousands of frames. Each of these sequences was input into the KinectFusion pipeline; for a large number of scenes, we performed multiple reconstructions sharing a starting frame but varying in the number of frames processed. This gave us a baseline for a qualitative and progressive evaluation of our results.

For each such scene, we were able to evaluate the original and the augmented reconstruction at each step. As more evidence is revealed, extended planes are replaced with their physical counterparts, and the scene boundaries are


Fig. 5. Progressive surface reconstruction and scene completion. After 200 frames, large parts of the room are occluded and are filled in by the completion model. After 800 frames, previously occluded areas are observed and replace the completion, and overall the structure of the room is more accurate. For a more detailed figure, please see the supplementary material.

updated to reflect reality. A sample progression can be seen in Figure 5. Additional results can be seen in Figure 7 (top row) and the supplemental material.

We implemented a complete AR system in the Unity 3D framework, in which a user is able to navigate a captured scene, place virtual objects on horizontal supporting planes, and throw balls which bounce around the scene. As seen in Figure 6, with scene completion we are able to place figures on occluded planes and bounce balls realistically off completed surfaces. Video-captured results can be seen in the supplemental materials.


Fig. 6. We created an augmented reality application to demonstrate our results. (a) A ball bouncing on an occluded floor falls through. (b) A ball bouncing on the floor in a completed scene bounces realistically. (c) Virtual objects may be placed on any surface, including occluded planes in the completed scene. The scene shown is the same as in Figure 5. Video clips of this application can be seen in the supplemental material.


7 Discussion and Future Work

We have presented a complete and novel pipeline for surface reconstruction augmentation via scene completion. Our novel contour completion MRF algorithm is able to extend occluded surfaces in a believable manner, paving the way for enhanced AR applications. In this work, we have focused on extending planar surfaces and estimating scene boundaries; however, we recognize the value of augmenting volumetric (non-planar) objects and intend to address it in future work. Moreover, in this work we rely neither on previously seen objects nor on repetition in the scene, both of which we intend to employ in future work. Our current pipeline is implemented in Matlab and does not achieve online performance (processing a volume takes up to a minute). In the future, we intend to integrate our pipeline directly into KinectFusion or an equivalent system.

Failure cases may occur at different stages of our pipeline. In some instances, we fail to detect or correctly classify planes. Sometimes the noise level is too high, hampering the CCRF from fitting good contours. In some cases, we receive faulty volumetric information from the Kinect sensor, mostly due to reflectance and transparencies in the scene. Some examples can be seen in Figure 7, bottom row.

Fig. 7. Example scenes: darker colors represent completed areas; hues signify plane classification. The top row contains mostly successful completions. The bottom row contains failure cases: on the left, the room is huge due to many reflective surfaces; the middle image is of a staircase, vertical in nature, which leads to incorrect plane classifications; and on the right, a high ceiling and windows cause misclassification of the floor plane.

References

1. Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In: ISER. Volume 20. (2010) 22–25

2. Newcombe, R.A., Davison, A.J., Izadi, S., Kohli, P., Hilliges, O., Shotton, J., Molyneaux, D., Hodges, S., Kim, D., Fitzgibbon, A.: KinectFusion: Real-time dense surface mapping and tracking. In: ISMAR. (2011)

3. Qualcomm: Vuforia. Website (2011) https://www.vuforia.com/

4. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: CGI. (1996) 303–312

5. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. In: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH. (1987)

6. Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Reconstructing building interiors from images. In: ICCV. (2009)

7. Furukawa, Y., Curless, B., Seitz, S., Szeliski, R.: Manhattan-world stereo. In: CVPR. (2009)

8. Salas-Moreno, R.F., Newcombe, R.A., Strasdat, H., Kelly, P.H., Davison, A.J.: SLAM++: Simultaneous localisation and mapping at the level of objects. In: CVPR. (June 2013)

9. Barinova, O., Lempitsky, V., Tretiak, E., Kohli, P.: Geometric image parsing in man-made environments. In: ECCV. (2010)

10. Hoiem, D., Hedau, V., Forsyth, D.: Recovering free space of indoor scenes from a single image. (2012)

11. Guo, R., Hoiem, D.: Beyond the line of sight: Labeling the underlying surfaces. In: ECCV. (2012)

12. Guo, R., Hoiem, D.: Support surface prediction in indoor scenes. In: ICCV. (2013)

13. Kim, Y.M., Mitra, N.J., Yan, D.M., Guibas, L.: Acquiring 3D indoor environments with variability and repetition. ACM Transactions on Graphics 31(6) (2012)

14. Kim, B.S., Kohli, P., Savarese, S.: 3D scene understanding by voxel-CRF. In: ICCV. (2013)

15. Zheng, B., Zhao, Y., Yu, J.C., Ikeuchi, K., Zhu, S.C.: Beyond point clouds: Scene understanding by reasoning geometry and physics. In: CVPR. (2013)

16. Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions I. Mathematical Programming 14(1) (1978) 265–294

17. Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1) (1972) 11–15

18. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. (2012)

19. Jegelka, S., Bilmes, J.: Submodularity beyond submodular energies: Coupling edges in graph cuts. In: CVPR. (2011)

20. Kohli, P., Osokin, A., Jegelka, S.: A principled deep random field model for image segmentation. In: CVPR. (2013)

21. Kohli, P., Torr, P.H., et al.: Robust higher order potentials for enforcing label consistency. IJCV 82(3) (2009)