
Hierarchical Surface Prediction for 3D Object Reconstruction

Christian Häne, Shubham Tulsiani, Jitendra Malik
University of California, Berkeley

{chaene, shubhtuls, malik}@eecs.berkeley.edu

Abstract—Recently, Convolutional Neural Networks have shown promising results for 3D geometry prediction. They can make predictions from very little input data such as a single color image. A major limitation of such approaches is that they only predict a coarse resolution voxel grid, which does not capture the surface of the objects well. We propose a general framework, called hierarchical surface prediction (HSP), which facilitates prediction of high resolution voxel grids. The main insight is that it is sufficient to predict high resolution voxels around the predicted surfaces. The exterior and interior of the objects can be represented with coarse resolution voxels. Our approach is not dependent on a specific input type. We show results for geometry prediction from color images, depth images and shape completion from partial voxel grids. Our analysis shows that our high resolution predictions are more accurate than low resolution predictions.

I. INTRODUCTION

We live in a world composed of 3D objects bounded by 2D surfaces. Fundamentally, this means that we can either represent geometry implicitly, as a 3D volume, or explicitly, as a 2D mesh surface living within the 3D space. When working with 3D geometry in practice, one representation is often better suited to a given task than the other.

Recently, 3D prediction approaches which directly learn a function to map an input image to the output geometry have emerged [6], [13], [35]. The function is represented as a Convolutional Neural Network (CNN) and the geometry is represented as a voxel grid, which is a natural choice for CNNs. The advantage of such an approach is that it can represent arbitrary topology and allows for large shape variations. However, such approaches have a major limitation. Due to the cubic growth of the volume with increasing resolution, only a coarse voxel grid is predicted. A common resolution is 32 × 32 × 32; using higher resolutions very quickly becomes computationally infeasible. Moreover, due to the 2D nature of the surface, the ratio of surface (or near-surface) voxels to non-surface voxels becomes smaller and smaller with increasing resolution. Voxels far from the surface are generally predicted correctly quite early in training and hence do not induce any gradients which can be used to teach the system where the surface lies. Therefore, with increasing resolution, an increasing share of the gradient computation is spent on uninformative gradients. The underlying reason this problem exists is that current systems do not exploit the fact that surfaces are only two dimensional. Can we build a system that does?

(a) Input Image (b) 16³ (c) 32³ (d) 64³ (e) 128³ (f) 256³

Figure 1: Illustration of our method on the task of predicting geometry from a single color image. From the image our method hierarchically generates a surface with increasing resolution. In order to have sufficient information for the prediction of the higher resolution, we predict at each level multiple feature channels which serve as input for the next level and at the same time allow us to generate the output of the current level. We start with a resolution of 16³ and divide the voxel side length by two on each level, reaching a final resolution of 256³.

In this paper we introduce a general framework, hierarchical surface prediction (HSP), for high resolution 3D object reconstruction which is organized around the observation that only a few of the voxels are in the vicinity of the object's surface, i.e. the boundary between free and occupied space, while most of the voxels will be "boring", either completely inside or outside the object. The basic principle therefore is to only predict voxels around the surface, which we can now afford to do at higher resolution. The key insight enabling our method to achieve this is to change the standard prediction of free and occupied space into a three label prediction with the labels free space, boundary and occupied space. Furthermore, we do not directly predict high resolution voxels. Instead, we hierarchically predict small blocks of voxels from coarse to fine resolution in an octree, which we call a voxel block octree. Thereby we have at each resolution the signal of the boundary label, which tells us that the descendants of the current voxel will contain both free space and occupied space. Therefore, only nodes containing voxels with the boundary label assigned need higher resolution prediction. By hierarchically predicting only those voxels we build up our octree. This effectively reduces the computational cost and hence allows us to predict significantly higher resolution voxel grids than previous approaches. In our experiments we predict voxels at a resolution of 256 × 256 × 256 (cf. Fig. 1).

One of the fundamental questions which arises once higher resolution predictions are feasible is whether or not a higher resolution prediction leads to more accurate results. In order to analyze this we compare our high resolution predictions to upsampled low resolution predictions in our quantitative evaluation and determine that our method outperforms these baselines. Our results are not only quantitatively more accurate, they are also qualitatively more detailed and have a higher surface quality.

The remainder of the paper is organized as follows. In Sec. II we discuss the related work. We then introduce our framework in Sec. III. Quantitative and qualitative evaluations are presented in Sec. IV and we draw conclusions in Sec. V.

II. RELATED WORK

Traditionally, dense 3D reconstruction from images has been done using a large collection of images. Geometry is extracted by dense matching or direct minimization of reprojection errors. The common methods can be broadly grouped into implicit volumetric reconstruction [7], [18], [19], [22], [38] or explicit mesh based [9], [12] approaches. All these approaches facilitate a reconstruction of arbitrary geometry. However, in general a lot of input data is required to constrain the geometry enough. In difficult cases it is not always possible to recover the geometry of an object with multi-view stereo matching. This can be due to challenging materials such as transparent and reflective surfaces, or a lack of images from all around an object. For such cases object shape priors have been proposed [8], [14], [36].

All the methods mentioned above in general use multiple images as input data. Approaches which are primarily designed to reconstruct 3D geometry from a single color image have also been studied. Blanz and Vetter [1] propose to build a morphable shape model for faces, which allows for the reconstruction of faces from just a single image. In order to build such a model, a lot of manual interaction and high quality scanning of faces is required. Other works propose to learn a shape model from silhouettes [3] or silhouettes and keypoints [16].

Recently, CNNs have also been used to predict geometry from single images. Choy et al. [6] propose to use an encoder-decoder architecture in a recurrent network to facilitate prediction of coarse resolution voxel grids from a single or multiple color images. An alternative approach [13] proposes to first train an autoencoder on the coarse resolution voxel grids and then train a regressor which predicts the shape code from the input image. It has also been shown that CNNs can be leveraged to predict alternative primitive based representations [33]. While these approaches rely on ground truth shapes, CNNs can also be trained to predict 3D shapes using weaker supervision [27], [35], [34], for example image silhouettes. These types of supervision utilize ray formulations which were originally proposed for multi-view reconstruction [23], [29].

Predicting geometry from color images has also been addressed in a 2.5D setting, where the goal is, given a color image, to predict a depth map [11], [21], [30]. These approaches can generally produce high resolution output but only capture the geometry from a single viewpoint. Tatarchenko et al. [32] propose to predict a collection of RGBD (color and depth) images from a single color image and fuse them in a post-processing step.

The volumetric approach has the major limitation that using high resolution voxel grids is computationally demanding and memory intensive. Multi-view approaches alleviate this by the use of data adaptive discretization of the volume with Delaunay tetrahedralization [20], voxel block hashing [25] or octrees [5], [31]. Our work is inspired by these earlier works on octrees. A crucial difference is that in these works the structure of the octree can be determined from the input geometry, but in our case we are looking at a more general problem where the structure of the tree needs to be predicted. Similarly, a very recent work [28] which proposes to use octrees in a CNN also assumes that the structure of the octree is given as input. We predict the structure of the tree together with its content.

Our hierarchical surface prediction framework also relates to coarse-to-fine approaches in optical flow computation. Similar to volumetric 3D reconstruction, dense optical flow is computationally demanding due to the large label space. By computing optical flow hierarchically on a spatial pyramid, e.g. [2], real-time computation of optical flow is feasible [37]. The benefits of using a coarse-to-fine approach in learning based optical flow prediction have also recently been pointed out [26], [15].

III. FORMULATION

In this section we describe our hierarchical surface prediction framework, which allows for high resolution voxel grid predictions. We first briefly discuss how the state-of-the-art, coarse resolution baseline works and then introduce the voxel block octree structure which we are predicting.

A. Voxel Prediction

The basic voxel prediction framework is adapted from [6], [13]. We consider an encoder/decoder architecture which takes an input I, which in our experiments is either a color image, depth image or a partial low resolution voxel grid. A convolutional encoder encodes the input to a feature vector or shape code C which, in all our experiments, is a 128 dimensional vector. An up-convolutional decoder then decodes C into a predicted voxel grid V.

Figure 2: Overview of our system. (Diagram: color image, depth image or partial volume input; convolutional encoder; decoder built from volumetric (up-)convolutions and cropping modules; per-level output over 3 labels: free space / boundary / occupied space.)

A labeling of the voxel space into free and occupied space can be determined using a suitable threshold τ (cf. Sec. IV-D); alternatively, a mesh can be extracted using marching cubes [24] directly on the predicted voxel occupancies at an iso value σ.
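For concreteness, a minimal Python sketch of both extraction options, assuming the prediction is available as a dense NumPy array (the file name and grid size are hypothetical; `marching_cubes` is provided by scikit-image):

```python
import numpy as np
from skimage import measure

# Hypothetical prediction: a dense grid of per-voxel occupancy scores in [0, 1].
occupancy = np.load("predicted_voxels.npy")  # e.g. shape (32, 32, 32)

# Option 1: hard labeling into free / occupied space with a threshold tau.
tau = 0.5  # illustrative value; Sec. IV-D selects thresholds on a validation set
occupied = occupancy > tau

# Option 2: extract a triangle mesh with marching cubes at iso value sigma.
sigma = 0.5
verts, faces, normals, values = measure.marching_cubes(occupancy, level=sigma)
```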

The main problem which prevents us from directly utilizing this formulation with high resolutions for V is that each voxel, even if it is far from the surface, needs to be represented and a prediction made for it. Given that the surface area grows quadratically and the volume cubically with respect to edge division, the ratio of surface to non-surface voxels becomes smaller with increasing resolution.
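This scaling argument is easy to verify numerically; the following sketch voxelizes a sphere (our choice of test shape) at increasing resolutions and counts the fraction of voxels lying on the free/occupied boundary:

```python
import numpy as np

for n in (32, 64, 128, 256):
    ax = np.arange(n)
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    occ = (x - n / 2) ** 2 + (y - n / 2) ** 2 + (z - n / 2) ** 2 < (n / 3) ** 2

    # A boundary voxel is an occupied voxel with at least one free 6-neighbor.
    interior = occ.copy()
    interior[1:-1, 1:-1, 1:-1] &= (
        occ[:-2, 1:-1, 1:-1] & occ[2:, 1:-1, 1:-1]
        & occ[1:-1, :-2, 1:-1] & occ[1:-1, 2:, 1:-1]
        & occ[1:-1, 1:-1, :-2] & occ[1:-1, 1:-1, 2:]
    )
    boundary = occ & ~interior
    print(n, boundary.sum() / n ** 3)  # fraction shrinks roughly like 1/n
```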

B. Voxel Block Octree

In our hierarchical surface prediction method, we propose to predict a data structure with an up-convolutional decoder architecture, which we call the 'voxel block octree'. It is inspired by octree formulations [5], [31] used in traditional multi-view reconstruction approaches. The key insight which allows us to use such a data structure in a prediction framework is to extend the standard two label formulation to a three label formulation with labels inside, boundary and outside. As we will see later, our data structure allows us to generate a complete voxel grid at high resolution while only making predictions around the surface. This leads to an approach which facilitates efficient end-to-end training of an encoder/decoder architecture.

An octree is a tree data structure which is used to partition the 3D space. The root node describes the cube of interest in the 3D space. Each internal node has up to 8 child nodes which describe the subdivision of the current node's cube into the eight octants. Note that we slightly deviate from the standard definition of the octree, where either none or all of the 8 child nodes are present.

We consider a tree with levels ℓ ∈ {1, . . . , L}. Each node of the tree contains a 'voxel block' – a subdivision of the node's 3D space into a voxel grid of size b³. Each voxel in the voxel block at tree levels ℓ < L contains classifier responses for the three labels occupied space, boundary and free space. The nodes in layers ℓ ∈ {1, . . . , L − 1}, therefore, contain voxel blocks B^{ℓ,s} ∈ [0, 1]^{b³×3}, where the index s describes the position of the voxel block with respect to the global voxel grid. In the lowest level L we do not need the boundary label; that level is therefore composed of nodes which contain a voxel block B^{L,s} ∈ [0, 1]^{b³}, where each element of the block describes the classifier response for the binary classification into free or occupied space. Note that each child node describes a space which has a cube side length of only half of the current node. By keeping the voxel block resolution fixed at b³ we also divide the voxel side length by a factor of two. This means that the predictions in the child nodes are of higher resolution. The voxel block resolution b is chosen such that it is big enough for efficient prediction using an up-convolutional decoder network and at the same time small enough such that local predictions around the surface can be made. In our experiments we use voxel blocks with b = 16 and a tree with depth L = 5. A visualization of the voxel block octree is given in Fig. 3.
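For concreteness, the voxel block octree can be sketched as a small Python data structure under the conventions above; the field names and the dictionary-of-children layout are our own choices, not part of the paper:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple
import numpy as np

B = 16  # voxel block size b
L = 5   # tree depth

@dataclass
class OctreeNode:
    level: int                      # l in {1, ..., L}
    position: Tuple[int, int, int]  # index s: block position in the global grid
    block: np.ndarray               # (B, B, B, 3) for l < L; (B, B, B) at l == L
    # Unlike a standard octree, any subset of the 8 children may be present.
    children: Dict[int, "OctreeNode"] = field(default_factory=dict)
```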

The predictions stored at any given level of the tree might be sparse. In order to reconstruct a complete model in high resolution, we upsample all the voxels which have not been predicted at the highest resolution from the closest predicted resolution. As we will see later, the voxels which get upsampled are in the inside and outside of the object, not directly next to the boundary. Therefore the resolution of the actual surface remains high. In order to be able to extract a smooth, high quality surface using marching cubes, it is crucial that all the voxels which are close to the surface are predicted at high resolution and the predicted values are smooth.

Figure 3: Visualization of the voxel block octree. The (left) part depicts the space subdivision which is induced by the octree in the (middle). The (right) side is a 2D visualization of parts of the voxel blocks with a voxel block size of b = 4.

Therefore we aim to always evaluate the highest resolution around the boundary. At this point we would like to note that we can also extract a model at intermediate resolution by considering the boundary label as occupied space. This choice is consistent with the choice of considering any voxel which intersects the ground truth mesh during voxelization as occupied space (cf. Sec. IV-A).

C. Network Architecture

In the previous section we introduced our voxel block octree data structure. We will now describe the encoder/decoder architecture which we are using to predict voxel block octrees using CNNs. An overview of our architecture is depicted in Fig. 2.

The encoder part is identical to the one of the basic voxel prediction pipeline from Sec. III-A, i.e. the input I first gets transformed into a shape code C by a convolutional encoder network. An up-convolutional (also called de-convolutional) decoder architecture [10], [39] predicts the voxel block octree. In order to be able to predict the blocks B^{ℓ,s} and at the same time capture the information which allows for predictions of the higher resolution levels ℓ + 1, . . ., we introduce feature blocks F^{ℓ,s} ∈ R^{(b+2p)³×c}. The spatial extent of the feature blocks is greater than or equal to that of the voxel blocks to allow for a padding p ≥ 0. As we will see later, the padding allows for an overlap when predicting neighbouring octants, which leads to smoother results. The fourth dimension is the number of feature channels, which can be chosen depending on the application. In our experiments we use c = 32 and p = 2.


Figure 4: 2D visualization of the cropping and upsampling module. (a) Cropping: the cropping module (left) crops out the part of the feature block F^{ℓ,s} centered around the child node's octant. (b) Upsampling: the upsampling module (right) then upsamples the feature map using up-convolutional layers to a new feature block F^{ℓ+1,r} with higher spatial resolution. The dashed lines indicate the size of the output blocks and the octants.

(a) Input (b) No Overlap (c) 2 Voxels Overlap

Figure 5: Influence of overlap on the task of geometry prediction from a single depth map. Without overlap the predictions are much less smooth and the grid pattern induced by the individual voxel blocks is much more visible.

The most important part for the understanding of our decoder architecture is how we predict level ℓ + 1 given level ℓ. In the following we will discuss this procedure in detail. It consists of three basic steps:

1) Feature Cropping
2) Upsampling
3) Output Generation

Feature Cropping. At this point we assume that we are given the feature block F^{ℓ,s}. The goal is to generate the output for the child node corresponding to a specific octant O. The feature block F^{ℓ,s} contains information to generate the output for all the child nodes. In order to only process the information relevant for the prediction of the child node corresponding to O, we extract a ((b/2 + 2p)³ × c) region out of the four dimensional tensor F^{ℓ,s}, spatially centered around O. An illustration of this process is given in Fig. 4. If neighboring octants are processed, the extracted feature channels will have some overlap, which helps to generate a smoother output (see Fig. 5).
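A minimal sketch of this cropping step, assuming the feature block is stored as a NumPy array of shape (b + 2p, b + 2p, b + 2p, c) and octants are encoded by three bits (both conventions are our assumptions):

```python
import numpy as np

def crop_octant(F: np.ndarray, octant: int, b: int = 16, p: int = 2) -> np.ndarray:
    """Extract the (b/2 + 2p)^3 x c region centered on one octant."""
    ox, oy, oz = octant & 1, (octant >> 1) & 1, (octant >> 2) & 1
    s = b // 2  # octant side length in voxels
    # Octant o covers [p + o*s, p + o*s + s) in the padded block; keeping
    # p voxels of context on each side gives the slice [o*s, o*s + s + 2p).
    x0, y0, z0 = ox * s, oy * s, oz * s
    return F[x0:x0 + s + 2 * p, y0:y0 + s + 2 * p, z0:z0 + s + 2 * p, :]
```

With p > 0, crops of neighboring octants share 2p voxels of features, which is exactly the overlap discussed above.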

Figure 6: Responses at the highest resolution; the checkered areas indicate regions not predicted at that resolution. (Left) slice through an airplane, (middle) slice through the front legs of a chair, (right) slice through a car.

Upsampling. The upsampling module takes input from the cropping module and predicts a new (b + 2p)³ feature block F^{ℓ+1,r} via up-convolutional and convolutional layers.

Output Generation. The output network takes the prediction of the feature block F^{ℓ+1,r} from the upsampling module and generates the voxel block B^{ℓ+1,r}. This is done using a sequence of convolutional layers. The supervision is given in form of the three ground truth labels for the voxel block B^{ℓ+1,r}, i.e. there is supervision at each level of the tree. Once the output is generated, the child nodes for the next level get generated. The decision on whether to add a child node, and hence a higher resolution prediction, is based on the boundary prediction in the corresponding octant of the voxel block B^{ℓ+1,r}. We compute the maximum boundary prediction response of the corresponding octant O′:

C^{ℓ+1,r}_{O′} = max_{i,j,k ∈ O′} B^{ℓ+1,r}_{i,j,k,2} ,    (1)

with label 2 the boundary label. The child node is generated if C^{ℓ+1,r}_{O′} is above a small threshold γ. The intuition behind this choice is that as soon as there is some evidence that the surface is going through a specific block, we should predict a higher resolution output. On the other hand, if the prediction on a specific level is very certain that there is no boundary within that subtree, a higher resolution prediction is not necessary. For the classes aeroplane, chair and car, in practice only around 5 to 10% of the voxels are predicted at high resolution on average (cf. Fig. 6), which shows that our approach effectively reduces the number of voxels that need to be predicted.
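A sketch of this gating rule; the zero-based channel index of the boundary label and the value of γ are our assumptions (the paper only requires γ to be small):

```python
import numpy as np

BOUNDARY = 1   # boundary channel; (free, boundary, occupied) ordering assumed
GAMMA = 0.01   # illustrative value for the threshold gamma

def octants_to_expand(block: np.ndarray, gamma: float = GAMMA):
    """Return octant ids whose max boundary response exceeds gamma (Eq. 1)."""
    b = block.shape[0]
    s = b // 2
    expand = []
    for octant in range(8):
        ox, oy, oz = octant & 1, (octant >> 1) & 1, (octant >> 2) & 1
        sub = block[ox * s:(ox + 1) * s,
                    oy * s:(oy + 1) * s,
                    oz * s:(oz + 1) * s, BOUNDARY]
        if sub.max() > gamma:  # C^{l+1,r}_{O'} > gamma
            expand.append(octant)
    return expand
```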

In the first level of the tree, which predicts the root node from the shape code C, a small decoding network directly predicts the first feature block without a cropping module in between. Likewise, in the deepest level of the tree, no explicit feature block is needed; therefore the output is directly generated from the cropped features of the previous level. Also note that the output and upsampling modules have their individual filters at each level of the tree. Furthermore, thanks to this architecture, all the convolutions and up-convolutions are standard layers and no special versions for the octree are required. Detailed layer configurations can be found in the supplementary material.

D. Efficient Training with Subsampling

As we have seen above, only predicting voxels on the boundary at each level largely reduces the complexity of the prediction task. In the beginning of training this is not the case, because the boundaries are not yet predicted correctly. Moreover, even once training has advanced enough and the boundaries are mostly placed correctly, an evaluation of the complete tree is still slow for training (approx. 0.7 s for a forward pass alone).

Therefore, we propose to utilize a subsampling of the child nodes during training. This is possible thanks to the hierarchical structure of our prediction. The tree gets traversed in a depth first manner. Each time the boundary label is present in an octant according to Eq. 1, the child node is traversed with a certain probability ρ. Different schedules for ρ can be used. We experimented with both having a fixed probability ρ and starting with a very low ρ which is gradually increased during training. The first version leads to a simpler training procedure and was hence used in our experiments. Thanks to the depth first traversal, the memory consumption (weights and filter responses) grows linearly with the number of levels if the same upsampling and output architecture is used for each level.
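The subsampled traversal can be sketched as follows; `predict_level` and `block_loss` are hypothetical stand-ins for the per-level cropping/upsampling/output networks and their supervision, and `octants_to_expand` is the Eq. 1 gating from the sketch above:

```python
import random

RHO = 0.5  # fixed descent probability; value illustrative

def traverse(features, level, L, predict_level, losses):
    """Depth-first traversal that descends into a boundary octant
    only with probability RHO during training."""
    voxel_block, child_features = predict_level(level, features)
    losses.append(block_loss(voxel_block, level))  # supervision at every level
    if level == L:
        return  # leaf level: binary free/occupied block, no children
    for octant in octants_to_expand(voxel_block):
        if random.random() < RHO:
            traverse(child_features[octant], level + 1, L, predict_level, losses)
```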

In order to be able to train with mini-batches, we sum up all the gradients during traversal of the tree and only do a gradient step as soon as we have done a forward and backward traversal of the subsampled tree for the training examples of the whole mini-batch. As loss function we use cross-entropy.
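In PyTorch-style pseudocode (the paper's implementation is in Torch), the mini-batch handling amounts to standard gradient accumulation; `model.traverse_and_loss` is a hypothetical method returning one cross-entropy loss per visited voxel block:

```python
import torch

optimizer = torch.optim.Adam(model.parameters())  # model assumed defined elsewhere

for batch in loader:                # mini-batch size 4 in our experiments
    optimizer.zero_grad()
    for example in batch:
        losses = model.traverse_and_loss(example)
        torch.stack(losses).sum().backward()  # gradients accumulate in .grad
    optimizer.step()                # one gradient step per mini-batch
```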

IV. EXPERIMENTS

For our evaluation we use the synthetic dataset ShapeNetCore [4], which is commonly used to evaluate geometry prediction networks. It is composed of Computer Aided Design (CAD) models of objects which are organized in categories. We use three categories for our evaluation: aeroplanes, chairs and cars. Our system is implemented in Torch¹ and we use Adam [17] for stochastic optimization, with a mini-batch size of 4.

A. Voxelization of the Ground Truth

In a preprocessing step we voxelize the ground truth data. A common approach is to consider all the voxels which intersect the ground truth mesh as occupied space and then fill in the interior by labeling all voxels which are not reachable through free space from the boundary as occupied space. While this works well for the commonly used 32³ resolution, it does not directly generalize to high resolutions. The reason is that the CAD models are generally not watertight meshes, and with increasing resolution the risk increases that a hole in the mesh prevents filling the interior.

¹ http://torch.ch/

In our case we are voxelizing the models at a resolution of 256³. In order to have filled interiors of the objects at high resolution we utilize a multi-scale approach. We first fill the interior at 32³ resolution and then erode the occupied space by one voxel. We then fix every high resolution voxel which falls within this eroded low resolution occupied space as occupied space, and also fix every high resolution voxel which falls within the original low resolution free space as free space. Furthermore, we fix all the high resolution voxels which intersect the original mesh surface as occupied space. To determine the ground truth label for the remaining high resolution voxels we run a graph-cut based regularization with a small smoothness term and a preference to keep the unlabeled voxels as free space.
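A sketch of the label seeding step under these rules, assuming binary NumPy grids (`occ32`: the filled 32³ occupancy; `hits256`: the 256³ voxels intersecting the mesh surface); the graph-cut regularization itself is not reproduced here:

```python
import numpy as np
from scipy import ndimage

def seed_high_res_labels(occ32: np.ndarray, hits256: np.ndarray):
    eroded32 = ndimage.binary_erosion(occ32)  # erode occupied space by one voxel

    def up(g):  # nearest-neighbor upsample, 32^3 -> 256^3
        return g.repeat(8, axis=0).repeat(8, axis=1).repeat(8, axis=2)

    fixed_occ = up(eroded32) | hits256   # eroded interior, or intersects surface
    fixed_free = up(~occ32) & ~hits256   # original low resolution free space
    unlabeled = ~(fixed_occ | fixed_free)  # left to the graph-cut regularizer
    return fixed_occ, fixed_free, unlabeled
```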

Given the high resolution voxels at 256³ resolution, we can now build the ground truth for all the remaining levels of the pyramid. At each level of the pyramid we label the voxels which contain the boundary at the highest resolution as boundary, and the other voxels as either free or occupied space. Our coarsest resolution is 16³.
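A minimal sketch of one pyramid level, approximating "contains the boundary" by "children are mixed"; the label encoding is our choice:

```python
import numpy as np

FREE, BOUNDARY, OCCUPIED = 0, 1, 2  # encoding assumed for illustration

def level_labels(occ256: np.ndarray, res: int) -> np.ndarray:
    """Three-label ground truth at a coarser resolution res (e.g. 16..128)."""
    f = occ256.shape[0] // res
    counts = occ256.reshape(res, f, res, f, res, f).sum(axis=(1, 3, 5))
    labels = np.full((res, res, res), BOUNDARY, dtype=np.uint8)
    labels[counts == 0] = FREE            # all children free
    labels[counts == f ** 3] = OCCUPIED   # all children occupied
    return labels
```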

B. Baselines

We consider two baselines. Both of them use the state-of-the-art coarse resolution voxel prediction framework from Sec. III-A. For both baselines we use a uniform prediction resolution of 32³ (except for one evaluation where we use 64³ resolution); they differ in the way the ground truth is computed. The first baseline follows the standard approach of labeling all voxels which intersect the ground truth mesh surface as occupied space and then filling in the interior. This can be achieved by downsampling our high resolution ground truth and labeling all low resolution voxels which contain at least one high resolution occupied space voxel as occupied space, and all the other ones as free space. We call this baseline "Low Resolution Hard" (LR Hard). The other baseline uses a soft assignment for the low resolution: the label is given by the ratio of high resolution free to occupied space voxels that the low resolution voxel contains. Therefore these labels can have fractional assignments. We call this baseline "Low Resolution Soft" (LR Soft). Note that the baseline LR Soft makes use of the high resolution voxelization, but the baseline LR Hard is equivalent to voxelizing at low resolution. As our goal is the prediction of high resolution geometry, we trilinearly upsample the raw classification output of the baselines from 32³ to 256³ and conduct the evaluation at high resolution.
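Both baseline targets follow directly from downsampling the binary 256³ voxelization; a sketch, with names of our own choosing:

```python
import numpy as np

def baseline_targets(occ256: np.ndarray, res: int = 32):
    """Derive LR Hard and LR Soft ground truths from the 256^3 voxelization."""
    f = occ256.shape[0] // res
    frac = occ256.reshape(res, f, res, f, res, f).mean(axis=(1, 3, 5))
    lr_hard = (frac > 0).astype(np.float32)  # any occupied child -> occupied
    lr_soft = frac.astype(np.float32)        # fractional occupancy in [0, 1]
    return lr_hard, lr_soft
```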

C. Input Data

To obtain the RGB/Depth images which are used as input for training and testing our CNNs, we render the CAD models in the ShapeNet dataset using Blender². For each CAD model, we render 10 images from random viewpoints obtained by uniformly sampling azimuth from [0, 360) degrees and elevation from [−20, 30] degrees.

² http://www.blender.org

Figure 7: Evaluation on the validation set at different extraction thresholds for the class aeroplane at a resolution of 256³. (Two plots: IoU vs. extraction threshold, and Chamfer Distance vs. extraction threshold, each comparing High Resolution Prediction, Low Resolution Soft Prediction and Low Resolution Hard Prediction.) For IoU higher is better and for CD lower is better.

Metric  Method    Car      Chair    Aero
IoU     LR Hard   0.649    0.368    0.458
        LR Soft   0.679    0.360    0.518
        HSP       0.701    0.378    0.561
CD      LR Hard   0.0142   0.0275   0.0190
        LR Soft   0.0122   0.0321   0.0190
        HSP       0.0108   0.0266   0.0121

Table I: Quantitative analysis for the task of predicting geometry from RGB input.

We also use random lighting variations to render the RGB images.

We also use partial voxel grids as input. In practice such an input can arise when multiple depth maps of an object are available, but they are all from the same side of the object. To imitate this task, we use the ground truth of the LR Soft baseline and randomly zero out the data in one half of the voxel grid. The network then learns to predict the complete high resolution geometry from the partial low resolution input.

Figure 8: Results for the category chairs. Each block from left to right, input image, our proposed HSP, LR Soft, LR Hard.

Figure 9: Results for the category aeroplanes. From left to right, input image, our proposed HSP, LR Soft, LR Hard.


D. Quantitative Evaluation

We conducted a quantitative evaluation on the task of predicting high resolution geometry from a single RGB image. We assign each of the 3D models from the dataset randomly to the train, validation or test split with probabilities 70%, 10% and 20%, respectively. We trained category specific networks for our method, Hierarchical Surface Prediction (HSP), and for both baselines on the three categories aeroplanes, chairs and cars. To quantify the accuracy we use the Intersection over Union (IoU) score and the symmetric Chamfer Distance (CD). We compute the measures for each model and average over the dataset. To compute the measures we binarize the non-binary predictions with a suitable threshold. In order to determine the best threshold for each measure and category, we test an exhaustive set of thresholds on the validation set. The plots given in Fig. 7 show that the choice of a good threshold has a very drastic influence on the error measure. As a first experiment, we study predictions at a coarse resolution of 64³. For this experiment both the baseline and HSP are trained to predict 64³ resolution and the evaluation is conducted at the same resolution. We demonstrate that our hierarchical prediction performs similarly to the LR Hard baseline, i.e. the performance of HSP does not degrade compared to uniform prediction at the same resolution. On the validation set for the class chair, we get an IoU of 43.47% for the baseline LR Hard and 43.12% for HSP. However, unlike HSP, the uniform prediction baselines cannot scale to finer resolutions, and we now empirically show the benefits of being able to predict at finer resolutions.

Figure 10: Results for the category cars. Each block from left to right, input image, our proposed HSP, LR Soft, LR Hard.

Figure 11: Additional applications of our method: (left) reconstruction from real-world RGB images, (right) geometry prediction from partial low resolution input volumes.

The evaluation is conducted at 256³ resolution. We compute the IoU and Chamfer Distance for each of the models in the test set using the per category and per error measure best threshold determined on the validation set. The results are given in Table I. We outperform the baselines for all categories and evaluation metrics.
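As a sketch of this protocol, the IoU computation and the validation-set threshold sweep look as follows (`val_pairs` is a hypothetical list of (prediction, ground truth) grid pairs; the sampled thresholds are illustrative):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, tau: float) -> float:
    occ = pred > tau
    return np.logical_and(occ, gt).sum() / np.logical_or(occ, gt).sum()

def best_threshold(val_pairs, taus=np.linspace(0.05, 0.95, 19)):
    """Pick the threshold maximizing mean IoU over the validation set."""
    mean_ious = [np.mean([iou(p, g, t) for p, g in val_pairs]) for t in taus]
    return float(taus[int(np.argmax(mean_ious))])
```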

E. Qualitative Evaluation

In Figs. 8, 9 and 10 we show example results on the task of geometry prediction from a single color image for the three categories. The results are selected to demonstrate the variability of shapes within a category. Random results are shown in the supplementary material. The results are predicted with the same networks which we used in the quantitative analysis above. The meshes are extracted with marching cubes [24] using the same thresholds as for the quantitative analysis using the Chamfer Distance. For chairs the LR Hard baseline produces visually pleasing results. However, this same baseline does not achieve detailed results on the categories aeroplane and car. Due to the fractional labels, the LR Soft baseline often misses thin structures. The results generated with our proposed HSP have high quality surfaces for all three categories.

We also downloaded two images of cars with a white background from the Internet using Google image search and processed them with our car prediction network, which was trained only on synthetic data. As can be seen in Fig. 11, the network is able to generalize to this type of input. Additionally, we trained a network on the category car which takes a partial low resolution voxel grid as input and predicts a high resolution model of the whole object (cf. Fig. 11).

V. CONCLUSION AND FUTURE WORK

In this work, we presented a hierarchical surface prediction framework which facilitates high resolution 3D reconstruction of objects from very little input data. We presented results on the task of predicting geometry from a single color image, a single depth image and from a partial low resolution voxel grid. Our quantitative evaluation shows that the high resolution prediction leads to more accurate results than the low resolution baselines. Qualitatively, our predicted surfaces are superior to the low resolution baselines.

In this work we have demonstrated that high resolution geometry prediction is feasible; future work will therefore investigate the potential of using prediction based reconstruction methods in multi-view stereo. Furthermore, for almost symmetric objects, methods which explicitly take the symmetry into account are an interesting further direction which might improve the quality of prediction based approaches.

ACKNOWLEDGMENTS

C. Häne received funding from the Early Postdoc.Mobility fellowship No. 165245 of the Swiss National Science Foundation. The project received funding from the Intel/NSF VEC award IIS-153909 and NSF award IIS-1212798.

REFERENCES

[1] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1999.

[2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision (ECCV), 2004.

[3] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? Building 3d morphable models from 2d images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.

[4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3d model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.

[5] J. Chen, D. Bautembach, and S. Izadi. Scalable real-time volumetric surface reconstruction. Transactions on Graphics (TOG), 2013.

[6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision (ECCV), 2016.

[7] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1996.

[8] A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid. Dense reconstruction using 3d object shape priors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[9] A. Delaunoy, E. Prados, P. G. I. Piraces, J.-P. Pons, and P. Sturm. Minimizing the multi-view stereo reprojection error for triangular surface meshes. In British Machine Vision Conference (BMVC), 2008.

[10] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.

[11] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), 2014.

[12] P. Gargallo, E. Prados, and P. Sturm. Minimizing the reprojection error in surface reconstruction from images. In International Conference on Computer Vision (ICCV), 2007.

[13] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision (ECCV), 2016.

[14] C. Häne, N. Savinov, and M. Pollefeys. Class specific 3d object shape priors using surface normals. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[15] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.

[16] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[18] K. Kolev, M. Klodt, T. Brox, S. Esedoglu, and D. Cremers. Continuous global optimization in multiview 3d reconstruction. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2007.

[19] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision (IJCV), 2000.

[20] P. Labatut, J.-P. Pons, and R. Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. In International Conference on Computer Vision (ICCV), 2007.

[21] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[22] V. Lempitsky and Y. Boykov. Global optimization for shape fitting. In Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[23] S. Liu and D. B. Cooper. A complete statistical inverse ray tracing approach to multi-view stereo. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[24] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In SIGGRAPH Computer Graphics, 1987.

[25] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. Transactions on Graphics (TOG), 2013.

[26] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. arXiv preprint arXiv:1611.00850, 2016.

[27] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems (NIPS), 2016.

[28] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3d representations at high resolutions. arXiv preprint arXiv:1611.05009, 2016.

[29] N. Savinov, C. Häne, L. Ladicky, and M. Pollefeys. Semantic 3d reconstruction with continuous regularization and ray potentials using a visibility consistency constraint. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[30] A. Saxena, S. Chung, and A. Ng. Learning depth from single monocular images. In Neural Information Processing Systems (NIPS), 2005.

[31] F. Steinbrücker, J. Sturm, and D. Cremers. Volumetric 3d mapping in real-time on a cpu. In International Conference on Robotics and Automation (ICRA), 2014.

[32] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision (ECCV), 2016.

[33] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[34] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[35] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems (NIPS), 2016.

[36] S. Yingze Bao, M. Chandraker, Y. Lin, and S. Savarese. Dense object reconstruction with semantic priors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[37] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium (DAGM), 2007.

[38] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In International Conference on Computer Vision (ICCV), 2007.

[39] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2010.