Discriminative Shape from Shading in Uncalibrated Illumination

Stephan R. Richter    Stefan Roth
Department of Computer Science, TU Darmstadt

Abstract

Estimating surface normals from just a single image is challenging. To simplify the problem, previous work focused on special cases, including directional lighting, known reflectance maps, etc., making shape from shading impractical outside the lab. To cope with more realistic settings, shading cues need to be combined and generalized to natural illumination. This significantly increases the complexity of the approach, as well as the number of parameters that require tuning. Enabled by a new large-scale dataset for training and analysis, we address this with a discriminative learning approach to shape from shading, which uses regression forests for efficient pixel-independent prediction and fast learning. Von Mises-Fisher distributions in the leaves of each tree enable the estimation of surface normals. To account for their expected spatial regularity, we introduce spatial features, including texton and silhouette features. The proposed silhouette features are computed from the occluding contours of the surface and provide scale-invariant context. Aside from computational efficiency, they enable good generalization to unseen data and importantly allow for a robust estimation of the reflectance map, extending our approach to the uncalibrated setting. Experiments show that our discriminative approach outperforms state-of-the-art methods on synthetic and real-world datasets.

1. Introduction

Estimating surface normals from just a single image is a severely under-constrained problem. Previous work has thus made a number of simplifying assumptions, e.g., presuming smooth surfaces, uniform albedo, and directional lighting from a single light source of known direction. Yet, relying on a single directional light, or assuming a given reflectance map, renders shape from shading impractical. To go beyond the lab, some of these assumptions need to be relaxed. We thus investigate estimating the reflectance map of a diffuse object with uniform albedo together with its surface under uncontrolled illumination, given only a single image (Fig. 1). Moreover, we aim to avoid strong spatial regularization to recover fine surface detail. To address this

Figure 1. Qualitative results for shape and reflectance estimation from a single image: input image [30], estimated normals and reflectance map from our method, and novel view (from left to right).

challenging setting, shading cues need to be generalized to more realistic lighting and also combined due to their complementary strengths. This increases the model and computational complexity, as well as the number of parameters.

We approach these challenges with a discriminative learning approach to shape from shading. This notably makes it easy to combine several shading cues. We begin by considering (1) color, which helps under hued illumination [18], and can be exploited with a second order approximation to Lambertian shading [25]. While powerful, our experimental investigation (Sec. 8.1) shows that correlated color channels (e.g., in near white light) or the presence of noise severely impair accuracy. We thus add (2) local context, which aids disambiguation [33], but until now has been limited to directional lighting. Instead of using colors from a neighborhood directly, we choose a texton filter bank [28] for capturing local context. Employing these filters in a learning framework allows for automatic adaptation to uncontrolled lighting and fine surface detail. We finally exploit (3) silhouette features. Previous work has constrained surface normals at the occluding contour [16, 20], and propagated this information to the surface interior during global reasoning. We generalize the occluding contour constraint to the surface interior and provide (spatial) contour information at every pixel. These silhouette features yield a coarse estimate of the surface from just the silhouette, which we additionally exploit for estimating the reflectance map.

Adopting a learning approach to uncalibrated shape from shading poses several challenges: First, example-based


Figure 2. Pipeline for training and testing. For each test image, we estimate a reflectance map to train the regression forests on synthetically generated data. We optionally enforce integrability on their pixel-wise predictions.

approaches require a database of surfaces imaged under the same conditions as the object in question. Capturing all possible combinations of surfaces and lighting conditions is next to impossible, and placing known objects in the scene [14] is often impractical. Recent learning approaches [5, 19] thus created a database on-the-fly by rendering synthetic shapes under a known lighting condition. We follow this avenue, but capture the variation of realistic surfaces with a significantly larger database of 3D models created by artists. Second, and unlike [5, 19], we aim to cope with unknown illumination at test time. We address this by estimating the reflectance map from silhouette features, and only then training the discriminative approach on the estimated reflectance. Third, (re-)training a model for a lighting condition at test time requires efficient learning and inference. Adapting regression forests to store von Mises-Fisher distributions in the leaves, as well as leveraging the set of cues discussed above, enables us to perform efficient pixel-independent surface normal prediction. To further refine the estimate, we optionally enforce integrability on the predicted surface. Fig. 2 depicts the entire pipeline.

To assess the contribution of the different cues, we first evaluate their importance statistically and as a component of our approach. Moreover, we evaluate and compare our method on both synthetic and real-world data, where it outperforms several state-of-the-art algorithms.

2. Related Work

Shape from shading has an extensive literature [9, 34], limiting us to the most relevant, recent work here. For decades, Lambertian shape from shading has been studied under the assumption of a single white point light source, presuming that this simplifies the problem. However, chromatic illumination not only resembles real-world environments much better, it also constrains shape estimation significantly, permitting substantially increased performance [15, 18]. Nevertheless, these methods rely heavily on favorable illumination and neglect to exploit shape statistics under nearly monochromatic lighting. Their assumption of known illumination further limits practical applicability.

In addition to eliminating the point light assumption, recent work aimed to infer the shape jointly with material properties or illumination. Oxholm and Nishino [22] estimated shape together with the object's bidirectional reflectance distribution function (BRDF) by exploiting orientation cues present in the lighting environment. However, a high-quality environment map needs to be captured. Barron and Malik [1, 2] estimate shape as part of a decomposition of a single image into its intrinsic components. Since they optimize a generative model, training and inference take significant time, and extending the model with additional cues is complicated. Moreover, strong regularity assumptions need to be made, limiting the recovery of fine surface detail.

Learning approaches to shape from shading have so far been limited by the lack of suitable training data. They have relied on range images or synthetic data [3, 19, 31], both causing problems of their own: While the noise of range images is a severe problem in predicting fine-grained surface variations, synthetic datasets often fail to capture the variation of real-world environments. Recently, Khan et al. [19] trained a Gaussian mixture model on the isophotes, using synthetic data and a database of laser scans. Barron and Malik [1] learned their shape priors from half the MIT intrinsic image dataset [13]. In addition, example-based methods [5, 23] have shown reasonable qualitative performance; quantitative results have not been reported, though. While learning methods tend to perform better than their hand-tuned counterparts, they have been limited by simplistic shape priors and the lack of adequate training data.

The work of Hertzmann and Seitz [14] is also related; they used objects of known geometry imaged under the same illumination to reconstruct shape from photometric stereo. Multiple images are required, however, and a known object needs to be present in each image. In contrast, we only need one image of an unknown object and use “example geometry” only to synthesize our training data.

Our regression tree-based predictor is most related to Geodesic Forests [21]; both approaches make pixel-independent predictions by incorporating spatial information directly into the tree-based approach. However, Kontschieder et al. [21] use it for discrete labeling tasks, such as semantic segmentation, have a complex entanglement, and use generalized geodesic distances as spatial features. We instead predict a two-dimensional continuous variable (normal direction), employ the proposed silhouette features, and use only a single stage.

3. Overview

We begin with an overview of our discriminative approach (Fig. 2). Given a test image of a diffuse object with uniform albedo, we first extract color, textons, and our silhouette features (Sec. 6). Based on the silhouette features, we estimate the reflectance map (Sec. 7), with which in turn


we render patches of objects from our database. The features (same as for testing) and surface normal of the central pixel of each synthetic patch form the training dataset, on which the regression forest (Sec. 5) is trained. After training, the forest predicts surface normals for each pixel of the test image from the features we extracted at the beginning. Optionally, we enforce integrability of the normal field.

4. Data for Analysis and Training

High-quality training data for learning surface variation has been scarce. With the advent of low-cost depth sensors, the situation has improved, but real-world range images are often too noisy to learn fine-grained structures. Alternatively, large amounts of synthetic data have been generated, which often resemble simple geometric shapes like cylinders or blobs [3, 5, 18, 31]. However, the employed parametric models often fail to capture real-world surface variations, like self-occlusions or wrinkles from clothing.

By instead leveraging a dataset of shapes created by an artist [8], we combine the advantages of both range maps and synthetic data: First, being manually created by a modeling expert, the shapes resemble real-world objects with complex phenomena like self-occlusions. Second, having a large set of 3D models allows us to render them in different orientations and to generate a virtually infinite set of training images. The 3D models we use cover a range of categories, mainly with an organic shape, such as humans and animals.

It is important to note that our dataset is much larger (100 objects) and more varied than the ones considered in other learning approaches to shape from shading. [19] used only 6 realistic surfaces of the same object class (faces), while [2] used 10 objects by taking half of the MIT intrinsic image dataset for training (the other half for testing). Although we trained our algorithm on a dataset that is qualitatively rather different from all test datasets, we achieve state-of-the-art performance for a variety of test datasets (see Sec. 8).

5. Discriminative Prediction of Normals

Motivated by the success of decision and regression tree-based methods in diverse areas, such as human pose estimation [27], image restoration [17], or semantic labeling [21], we propose to use regression forests for shape from shading. Note that we only describe the basic learning approach here, and defer a more thorough discussion of the used features to Sec. 6. A key motivation for choosing regression forests is that learning and prediction steps are computationally efficient, a crucial requirement as both learning and prediction have to be carried out at test time once the reflectance map has been estimated (Fig. 2). The different trees of the forest can be trained in parallel. Moreover, as we predict the surface normals of objects independently for each pixel, estimation is efficient and parallelizable. Since the prediction does not necessarily result in an integrable normal field, integrability can be enforced in a post-processing step.

Basic regression forest model. Regression forests form a prediction of the output variable (here, at a single pixel) by traversing a tree, such that the traversal path allows the model to choose an appropriate prediction based on the input features [4]. To that end, each (non-leaf) node has a split criterion, here a threshold on one input feature, as usual. Depending on the feature response, the traversal follows the left or right branch, until a leaf is reached. The leaves each store a probability distribution over the output variable, which ultimately enables prediction. For robustness, the predictions of several trees in a forest are averaged.
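To make the traversal concrete, here is a minimal Python sketch of forest prediction. The node layout and names are our own illustration, and storing only a mean direction per leaf is a simplification; the actual model stores full von Mises-Fisher parameters (see below).

```python
import numpy as np

# Minimal regression tree: each non-leaf node holds a feature index and a
# threshold; leaves hold the prediction (here just a mean unit direction).
class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, leaf_mean=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.leaf_mean = leaf_mean  # unit 3-vector stored at a leaf

def predict_tree(node, x):
    """Route a feature vector x down the tree and return the leaf's mean."""
    while node.leaf_mean is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.leaf_mean

def predict_forest(trees, x):
    """Average the per-tree predictions and renormalize to a unit normal."""
    n = np.mean([predict_tree(t, x) for t in trees], axis=0)
    return n / np.linalg.norm(n)
```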

Normal vectors as output. Predicting normal vectors comes with a unique set of challenges. First, the output is a continuous variable, which is normally addressed by storing in each leaf the average of all training samples that have been associated with that particular leaf. This can be thought of as storing the mean of a multivariate Gaussian, which is sensible when the posterior is reasonably approximated by a Gaussian distribution in ℝ^d. The second challenge in predicting surface normals is that they are distributed on a 3-dimensional unit hemisphere; thus a Gaussian assumption is not appropriate. We address this by modeling the prediction in each leaf as a von Mises-Fisher distribution [10]; we store the mean and dispersion parameters.

More specifically, the von Mises-Fisher distribution models unit vectors on a d-dimensional hypersphere. In our case of d = 3, its density function is given as

    p(n; \mu, \kappa) = \frac{\kappa}{2\pi (e^{\kappa} - e^{-\kappa})} \exp(\kappa \mu^\top n),    (1)

where the normal n ∈ S² is distributed around a mean vector µ, ‖µ‖ = 1, with a dispersion κ ∈ ℝ (analogous to the precision of a Gaussian distribution).
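Eq. (1) is straightforward to evaluate; the following sketch (our own, and numerically naive: for large κ the normalizer should be computed in log space) makes the normalizer explicit:

```python
import numpy as np

def vmf_density(n, mu, kappa):
    """Von Mises-Fisher density on S^2, Eq. (1).

    n, mu: unit 3-vectors; kappa: scalar dispersion."""
    norm = kappa / (2.0 * np.pi * (np.exp(kappa) - np.exp(-kappa)))
    return norm * np.exp(kappa * np.dot(mu, n))
```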

Learning. Learning proceeds mostly as for standard regression forests. For every tree of the forest, we randomly choose a 90% subset of the training data. During learning, a random set of features is chosen for each node, from which the best is picked as split criterion. The feature chosen is the one that minimizes the aggregated entropy of the new child nodes compared to the entropy of their parent node. The entropy is calculated from the distribution of the output variable.
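The paper does not spell out the entropy computation for von Mises-Fisher leaves. As a hedged illustration only, the sketch below scores candidate splits with a dispersion proxy, one minus the mean resultant length, which, like the entropy of a fitted von Mises-Fisher distribution, shrinks as the child normals concentrate; the function names and the proxy itself are assumptions, not the authors' criterion.

```python
import numpy as np

def dispersion(normals):
    """Impurity proxy for a set of unit normals: 1 - mean resultant length.
    Zero when all normals agree; grows as they spread."""
    r_bar = np.linalg.norm(normals.sum(axis=0)) / len(normals)
    return 1.0 - r_bar

def split_score(features, normals, f, t):
    """Size-weighted child impurity when thresholding feature f at t."""
    mask = features[:, f] < t
    left, right = normals[mask], normals[~mask]
    if len(left) == 0 or len(right) == 0:
        return np.inf  # degenerate split
    return (len(left) * dispersion(left)
            + len(right) * dispersion(right)) / len(normals)
```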

We estimate the parameters of the von Mises-Fisher distributions following Dhillon and Sra [7], who showed that the maximum likelihood estimate is approximated well by

    \mu = \frac{r}{\|r\|} \quad \text{and} \quad \kappa = \frac{3\bar{R} - \bar{R}^3}{1 - \bar{R}^2},    (2)

where r = \sum_{i=1}^{N} n_i is the resultant vector, and \bar{R} = \|r\|/N is the average resultant length.
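A direct transcription of Eq. (2) (our sketch; `normals` is assumed to be an N×3 array of unit vectors):

```python
import numpy as np

def fit_vmf(normals):
    """Approximate ML fit of a vMF distribution to unit normals, Eq. (2),
    following Dhillon and Sra [7]."""
    r = normals.sum(axis=0)                 # resultant vector r
    R = np.linalg.norm(r) / len(normals)    # average resultant length R-bar
    mu = r / np.linalg.norm(r)
    kappa = (3.0 * R - R**3) / (1.0 - R**2)
    return mu, kappa
```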


As is common for most learning-based methods in shape from shading [5, 19], each unseen reflectance map requires re-training the model. To do this efficiently, we sample 5×5 normal patches from our dataset (Sec. 4) and store normals and silhouette features ahead of time, as they are independent from the lighting condition. When we encounter a new illumination condition, the normals are rendered, the remaining features computed, and the forests trained. Rendering and feature extraction takes less than 1 second for the whole dataset; training takes about 90 seconds and inference less than a second. Note that an important difference to [5, 19] is that we do not require the lighting at test time to be known, but rather estimate it as well (Sec. 7).

We create 10 training images for each of the 100 models in our dataset by placing an (orthographic) camera such that it points to the model center from random positions. We evaluated several numbers of patches on a validation set (different from the models used for testing) and found that 100–200 samples per image give the best trade-off between performance and time needed for training.

Integrability. The regression forests output a surface normal for each pixel independently of the neighboring predictions. Without any spatial regularization, predicted surfaces are more susceptible to image noise; yet penalizing discontinuities usually results in oversmoothed surfaces. Therefore, when fusing pixel-independent predictions, we only enforce integrability, as it is necessary for deriving a valid surface. Integrability requires the surface normal derivatives to fulfill

    \frac{\partial^2 z}{\partial u \, \partial v} = \frac{\partial^2 z}{\partial v \, \partial u},    (3)

where the depth z is a function of the image coordinates u, v. To penalize violations of Eq. (3), several approaches have been proposed, e.g. [24, 26]; we evaluate different choices in Sec. 8.2.
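As one concrete instance, the sketch below projects a normal field onto an integrable surface with the classic Fourier-domain method of Frankot and Chellappa; this is a common stand-in for an orthographic l2 penalty, not necessarily the exact variant evaluated in Sec. 8.2.

```python
import numpy as np

def integrable_surface(nx, ny, nz):
    """Recover an integrable depth map from a (possibly inconsistent)
    normal field via the Frankot-Chellappa Fourier projection."""
    eps = 1e-8
    p = -nx / (nz + eps)   # dz/du implied by the normals
    q = -ny / (nz + eps)   # dz/dv
    h, w = p.shape
    wu = np.fft.fftfreq(w) * 2 * np.pi
    wv = np.fft.fftfreq(h) * 2 * np.pi
    WU, WV = np.meshgrid(wu, wv)
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = WU**2 + WV**2
    denom[0, 0] = 1.0                  # avoid division by zero at DC
    Z = (-1j * WU * P - 1j * WV * Q) / denom
    Z[0, 0] = 0.0                      # depth is recovered up to an offset
    return np.real(np.fft.ifft2(Z))
```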

6. (Spatial) Features

Shape from shading, like other pixel labeling / prediction problems, benefits from taking into account the spatial regularity of the output, or in other words modeling the expected smoothness of the recovered surface. That is, neighboring predictions should account for the fact that their normals are often very similar. Regression forests [4], which we use here, perform pixelwise independent predictions and thus do not necessarily model such regularities well. To address this, regression tree fields [17] predict the parameters of a Gaussian random field instead of the output variables directly; the prediction is obtained by maximum a-posteriori (MAP) estimation in the specified conditional random field. One drawback is that the MAP estimation step incurs a computational overhead, which also makes learning inefficient. Inspired by geodesic forests [21], which include a geodesic distance feature to sidestep explicit modeling of more global dependencies of the output, we here aim to additionally devise spatial features that allow promoting spatial consistency despite pixel-independent prediction.

Basic color feature. Regression trees benefit from input features that strongly correlate with the desired output [4], because a strong correlation allows for splits that reduce the entropy well. Possibly the strongest correlation exists between the surface normal and the color; their relation is described by the rendering equation. In the common Lambertian case, we can take a second order approximation [25]: I_c = n^\top M_c n at each point on the surface, where I_c is the intensity of color channel c, n is the surface normal in homogeneous coordinates, and M_c is a symmetric 4×4 matrix representing the reflectance map for that color channel. A single input image thus puts 3 nonlinear constraints onto the 2 unknowns of the surface normal at each pixel. Under ideal circumstances, the reflectance maps are independent from each other and thus produce small isophotes (areas with the same luminance), such that a surface can be recovered very well with just the color. In this case shape from shading becomes similar to photometric stereo [18]. However, if the reflectance maps and corresponding constraints are more correlated, e.g. in nearly white light, large isophotes result in many surface patches that explain the same color. Moreover, the problem is exacerbated by image noise. Hence, to avoid making strong assumptions about the type of lighting present, we not only consider the color, but look for spatial features that depend on a neighborhood of pixels as well as the object contour, and are able to reduce the remaining ambiguity, even in the absence of an explicit spatial model.
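For illustration, the quadratic shading model is a one-liner per channel (our sketch; the matrices M_c would come from the estimated reflectance map):

```python
import numpy as np

def shade(normal, M):
    """Second-order Lambertian shading, I_c = n^T M_c n [25].

    normal: unit 3-vector; M: list of three symmetric 4x4 matrices,
    one reflectance-map matrix per color channel."""
    n = np.append(normal, 1.0)               # homogeneous coordinates
    return np.array([n @ Mc @ n for Mc in M])
```

With three channels this yields exactly the three quadratic constraints on the two degrees of freedom of the normal discussed above.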

Texton features. To capture how the local variation of the input image correlates with the output, we first compute features from a texton filter bank [28]. Texton filters consist of Gaussians, their derivatives, as well as Laplacians at multiple scales, and have been used in many areas, such as material classification, segmentation, and recognition. While they have been used in shape from texture [32], to our knowledge this is the first application to shape from shading. Before filtering, we convert the image to the L*a*b* opponent color space. Gaussian filters are computed on all channels, while the remaining filters are applied only to the luminance channel. As we will see below, the local context from texton features leads to a strong increase in accuracy compared to using color alone, as their embedding in a discriminative learning framework allows for adaption to various types of surface discontinuities instead of simply assuming smoothness, as has been common in shape from shading.

Magnifying the local context by enlarging the filters can lead to faster convergence to an integrable surface, but requires a larger dataset to capture fine detail and achieve similar generalization. In our experiments, we used filters that match the normal patches in size (5×5).
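A minimal texton-style bank using SciPy might look as follows; the exact filters and scales of [28] differ, so treat the choices here as illustrative placeholders:

```python
import numpy as np
from scipy import ndimage

def texton_responses(lab_image, sigmas=(1.0, 2.0)):
    """Texton-style filter bank: Gaussians on all L*a*b* channels,
    Gaussian derivatives and Laplacians of Gaussian on luminance only."""
    L = lab_image[..., 0]
    responses = []
    for s in sigmas:
        for c in range(3):  # Gaussians on all channels
            responses.append(ndimage.gaussian_filter(lab_image[..., c], s))
        responses.append(ndimage.gaussian_filter(L, s, order=(0, 1)))  # d/dx
        responses.append(ndimage.gaussian_filter(L, s, order=(1, 0)))  # d/dy
        responses.append(ndimage.gaussian_laplace(L, s))               # LoG
    return np.stack(responses, axis=-1)  # per-pixel feature vector
```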



Figure 3. Objects that are convex or composed of convex parts (a) exhibit a strong correlation between silhouette-based features like relative distance (b) or direction to the silhouette (c) and out-of-plane (d) and in-plane components (e) of their surface normals.

6.1. Silhouette features

Projected onto the image plane, normals are not distributed equally across the object. Most objects are roughly convex, or composed of convex parts. Thus, normals at the center of an object tend to face the viewer and normals at the occlusion boundary face away from the viewer [16, 20]. Consequently, the probability of a normal facing a certain direction is not uniform given a position within the projection. Previous work has exploited this fact only by placing priors on the normals at the occlusion boundary and propagating information to the interior with a smoothness prior [22]. As both priors do not consider scale, balancing them can be challenging, even within the same object, as it may contain parts of different scale (e.g., the tail vs. head of the dinosaur in Fig. 3). Here, we consider a relation between the silhouette and the normal that is more explicit and automatically adapts to scale.

To that end, let us first look at the correlation between a point's surface orientation and its position within the object's projection onto the image plane. Consider the object in Fig. 3. As expected [16, 20], and as can be seen in the visualization of the out-of-plane component (d) (white – toward the viewer, black – away), normals are orthogonal to the viewing direction starting at the silhouette. Moving inwards, the normals change until they finally face the viewer. If we now look further at the distance of an interior point to the silhouette (b), we can see some apparent correlation. Similarly, we can see apparent correlation between the direction to the nearest point on the silhouette (c) and the image-plane component of the normal (e). We now formalize and analyze this relationship.

If B denotes the set of points on the occlusion boundary, we define the absolute distance of an interior point p to the contour as

    d_{\text{abs}}(p) = \min_{b \in B} \|p - b\|.    (4)

However, we are not interested in the absolute distance, as it depends on the scale of the object. To make it scale-invariant, we normalize it by the length of the shortest line segment that passes through p and connects the boundary and the medial axis of the object. The medial axis is the set of all points that have 2 closest points on the boundary. If M denotes the medial axis and pb the (infinite) line that passes through p and b, we define the relative distance to the silhouette as

    d_{\text{rel}}(p) = \min_{b \in B} \; \min_{m \in M \cap pb} \frac{\|p - b\|}{\|m - b\|},    (5)

i.e., the relative distance is normalized by the minimal line that passes through p and connects medial axis and contour. In practice, we approximate Eq. (5) using two distance transforms, d_B for the contour set and d_M for the medial axis. We thus define the scale-invariant boundary distance

    d'_{\text{rel}}(p) = \frac{d_B(p)}{d_B(p) + d_M(p)}.    (6)

Finally, we define the direction to the contour as

    \beta(p) = -\frac{\nabla d'_{\text{rel}}(p)}{\|\nabla d'_{\text{rel}}(p)\|}.    (7)
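In code, Eqs. (6) and (7) reduce to two distance transforms and a gradient. The following sketch (our own, using SciPy/scikit-image) assumes a boolean object mask:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import medial_axis

def silhouette_features(mask):
    """Scale-invariant silhouette features, approximating Eqs. (6)-(7).

    mask: boolean array, True inside the object silhouette."""
    d_B = distance_transform_edt(mask)       # distance to the contour
    skel = medial_axis(mask)
    d_M = distance_transform_edt(~skel)      # distance to the medial axis
    d_rel = d_B / (d_B + d_M + 1e-8)         # Eq. (6)
    gy, gx = np.gradient(d_rel)
    norm = np.sqrt(gx**2 + gy**2) + 1e-8
    beta = np.stack([-gx / norm, -gy / norm], axis=-1)  # Eq. (7)
    return d_rel, beta
```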

Statistical analysis. To analyze the correlation between the position relative to the silhouette and surface orientation, we calculated the relative distance and direction to the silhouette for multiple datasets and plotted them against the out-of-plane and the image-plane component of the surface normals (Fig. 4). We analyze three different datasets: synthetic data (“blobby shapes”) [18] in the first column, real-world data from the MIT intrinsic image dataset [13] in the second column, and a collection of 3D models generated by artists in the third column. Across all datasets we observe a strong correlation between the plotted variables. The direction to the silhouette has a strong linear relation to the image-plane component; the relative distance has a quadratic relation to the out-of-plane component. This clearly suggests that the proposed silhouette features should be useful for surface reconstruction from a single image. We evaluate the importance of our input features for surface prediction in Sec. 8.1.

7. Reflectance Map Estimation

Assuming distant light sources and no self-reflections or occlusions, all observable reflectance values of an object with uniform albedo can be mapped one-to-one onto a hemisphere. Moreover, a Lambertian reflectance map can be approximated well by only 9 spherical harmonics coefficients per color channel [25]. Thus, to calibrate against a reflectance map, [18] placed a calibration sphere with the same BRDF as the object of interest in the scene. Barron and Malik [1] obviated the sphere and jointly recovered the reflectance map and the surface with a generative model.

In our discriminative approach, we reconstruct the reflectance map directly from an initial estimate of the surface, which we derive solely from the object silhouette. In particular, we map the input image to a sphere according to our silhouette features. The features define a mapping from a pixel p to polar coordinates on a unit sphere:

    \sigma : \Omega \to S^2, \quad \sigma(p) = \left( \cos^{-1} d'_{\text{rel}}(p), \; \beta(p) \right).    (8)



Figure 4. The correlation between pixel position and surface orientation on multiple datasets. In the main columns we plot direction to silhouette vs. image-plane component of surface normal, and relative distance vs. out-of-plane component of normal, for 3 datasets each: Blobby shapes [18], MIT intrinsic images [13], and our training dataset from Sec. 4.

We now obtain the color at a polar coordinate (i.e., normal or lighting direction) s ∈ S² by averaging the colors of those input pixels p whose mapping is a k-nearest neighbor of s:

    C(s) = \frac{1}{k} \sum_{i=1}^{k} I(P_i), \quad P = \{ p \mid \sigma(p) \in \text{k-NN}(s) \},    (9)

where I is the observed color. The number of neighbors k considered is adjusted for the size of the object.
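A possible implementation of Eqs. (8) and (9) lifts each interior pixel to a coarse normal and answers color queries with a k-d tree; the normal parameterization below is our reading of Eq. (8), not the authors' code:

```python
import numpy as np
from scipy.spatial import cKDTree

def reflectance_sphere(image, mask, d_rel, beta, k=50):
    """Sample the reflectance hemisphere per Eq. (9): the color at a query
    direction s is the mean over the k pixels whose coarse normal (Eq. 8)
    is nearest to s."""
    theta = np.arccos(np.clip(d_rel[mask], 0.0, 1.0))  # out-of-plane angle
    bx, by = beta[mask, 0], beta[mask, 1]
    normals = np.stack([np.sin(theta) * bx,            # coarse per-pixel
                        np.sin(theta) * by,            # normals from the
                        np.cos(theta)], axis=-1)       # silhouette features
    colors = image[mask]
    tree = cKDTree(normals)

    def color_at(s):
        _, idx = tree.query(s, k=k)   # indices of k nearest mapped pixels
        return colors[idx].mean(axis=0)
    return color_at
```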

The silhouette features alone give only a coarse estimate of the surface normals. Moreover, certain objects do not fulfill our assumption of being composed of convex parts. A bowl seen from above will cause problems, for example, though likely also for other algorithms that estimate shape and reflectance. Most objects, however, contain limited concavities, whose effect on the estimated reflectance map is generally compensated by other convexities.

However, since the mapping from pixels to polar coordinates is many-to-one, we average the colors of points with similar distance and direction to the silhouette. This acts as a low-pass filter, effectively reducing estimation errors from incorrectly mapped points. We thus obtain a robust approximation of a calibration sphere without actually having one. From this we can recover the spherical harmonics coefficients of the reflectance in closed form. We found that adjusting the mean and standard deviation of the reflectance map to match the input image (effectively matching brightness and contrast) improves the final estimate.
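Given the sampled sphere, the closed-form recovery amounts to a linear least-squares fit against the 9 second-order real spherical harmonics [25]; a sketch under that assumption:

```python
import numpy as np

def sh_basis(n):
    """First 9 real spherical harmonics (orders 0-2) at unit directions
    n of shape (N, 3); constants follow Ramamoorthi and Hanrahan [25]."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z**2 - 1),
        1.092548 * x * z,
        0.546274 * (x**2 - y**2),
    ], axis=-1)

def fit_sh(normals, colors):
    """Least-squares fit of 9 SH coefficients per color channel to
    (direction, color) samples from the estimated reflectance sphere."""
    B = sh_basis(normals)                             # (N, 9) design matrix
    coeffs, *_ = np.linalg.lstsq(B, colors, rcond=None)
    return coeffs                                     # (9, 3) for RGB
```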

8. Experiments

8.1. Feature evaluation

To understand the contribution of the various input features, we first analyze the qualitative (Fig. 5) and quantitative impact on surface normal prediction. We evaluated our unary input features on the training set of the MIT intrinsic image dataset, rendered under all illuminations from [18]. Note that these illuminations stem from real environment maps and also contain nearly white illumination; this is in contrast to [1], which re-rendered the MIT data with illuminations sampled from their learned prior. After rendering, we added Gaussian noise (σ = 0.001) to the images and thresholded values below 0 and above 1. We trained on the dataset described in Sec. 4. Table 1(a) shows the results evaluated using the median angular error (MAE) and the mean-squared error of the normal (nMSE, see [1]).

We use the basic color feature (RGB) as baseline. Adding our silhouette-based features (+Silh) increases the overall performance. Nonetheless, they are better suited for objects that are round or composed of convex parts with a curved surface. Thus, the performance increases significantly on objects fulfilling these assumptions, but only marginally on planar objects or when self-occlusions are present. In colorful illuminations (Fig. 5, bottom), the silhouette features can be misleading in parts, but overall clearly improve performance. Texton features (+Tex) work particularly well under chromatic illumination, indicating that the captured spatial information eliminates many ambiguities; yet even in white illumination they yield a clear benefit. Their combination (+Silh+Tex) works best overall and is robust w.r.t. the illumination conditions.

8.2. Integrability

We evaluate several approaches for enforcing integrability of the estimated surface in Tab. 1(b). As a simple starting point we choose an l2-penalty on violations of Eq. (3). Next, we employ an l1-penalty [26], and finally an l2-penalty under perspective projection following [24]. The unary predictions are our baseline. Without any post-processing, the surface normals can already be reconstructed with good accuracy. Under synthetic illumination, the performance may even decrease when enforcing integrability. We observed, however, that for real images, which potentially violate the Lambertian assumptions, integrability is particularly helpful. Since the objects in the MIT dataset were presumably recorded with a long focal length, the benefits of a perspective approach are negligible. Leveraging the high performance levels of the regression forest, we adapted the l2-penalty to restrict surface normals to a convex combination of samples drawn from the distributions in the leaf nodes of the trees. This version (conv) clearly outperformed all other approaches, at the price of a much higher run-time. In further experiments, we thus rely on the simple l2-penalty.


[Figure 5 error maps; panels are labeled with the feature subset and the per-image median angular error: ortho-l2 14.9°/6.1°, RGB+Silh+Tex 15.4°/6.1°, RGB+Tex 18.1°/5.8°, RGB+Silh 16.9°/15.8°, RGB 25.4°/13.8°; color scale up to 90°.]

Figure 5. Importance of unary features. For the images in the first column (white illumination – top, colored – bottom), we estimate surfaces using only subsets of features; see text for details. The remaining columns depict the angular error per pixel and its median below.

Features          MAE     nMSE
RGB               13.77   0.179
RGB+Silh          10.90   0.130
RGB+Tex            7.92   0.097
RGB+Silh+Tex       7.09   0.069

(a) Results for unary features.

Method                     MAE    nMSE    run-time
no integrability           7.09   0.069     89.0s
l2, orthographic           7.33   0.057     98.5s
l2, orthographic, conv.    6.46   0.056   1172.8s
l1, orthographic           7.34   0.058     98.0s
l2, perspective            7.42   0.059     97.2s

(b) Results for enforcing integrability.

Table 1. Influence of unary features and integrability constraints. The run-times include training, inference, and post-processing.

8.3. Comparison with other methods

We quantitatively compare against 2 state-of-the-art methods on 3 different datasets, 2 of which are provided with the respective methods, and one recorded by ourselves.

The first method we compare to is the shape-from-shading component of the SIRFS method from Barron and Malik [1] (termed “Cross scale”). We use source code provided by the authors and evaluate both under unknown and given illumination. In unknown illumination, we also consider the accuracy of the estimated reflectance map (lMSE, see [1]). Tab. 2(a) gives results on the dataset of [1], a variant of the MIT intrinsic image dataset [13] re-rendered under chromatic illumination.

The second baseline, the method of Xiong et al. [33] (termed “Local context”), relies on local shading context to infer shape from shading under known illumination. With their method comes a dataset of 10 objects recorded under white directional illumination. Since they also evaluated the shape-from-shading component from [1], we simply restate the results from their paper (Tab. 2(b), first column) and run

Method         nMSE*   nMSE    lMSE
Cross scale    0.058   0.471   0.039
Ours           0.034   0.196   0.013

(a) Results on synthetic images (MIT intrinsic [1]).

                 lab [33]   natural (ours)
Method           MAE        MAE*    MAE     lMSE
Local context    17.27      –       –       –
Cross scale      19.30      20.29   29.29   0.013
Ours             15.96      20.51   23.07   0.002

(b) Results on real images.

Table 2. Comparison to other methods. See text for further explanation. * indicates that the illumination was given.

our algorithm on their dataset.

We did not train our algorithm on any of these datasets, but use our own separate set of artist-created models as described in Sec. 4. We used the training split of the MIT intrinsic images once to set the hyperparameters (number of trees, maximum tree depth, etc.) with Bayesian optimization [29] and used these settings in all of our experiments.

Real-world experiment. Good performance on synthetic data does not always translate to realistic settings [9]. We aim to address this by quantitatively evaluating our method on real-world data. Unfortunately, no shape-from-shading dataset captured under natural illumination exists so far; methods considering natural illumination were instead evaluated only qualitatively or on synthetic data [1, 18].

To record ground truth data with high accuracy, photometric stereo methods are well established [33]. However, the lighting environment must be carefully designed and controlled, and only a normal map is obtained. Since we require the shape to precisely align with the captured images and also aim to record in natural illumination, we would need to either synthesize the illumination in the lab, violating the real-world assumption, or build a controlled light


[Figure 6 panels, labeled: ground truth, input image; ours 22.3° / local context 28.1°; ours 11.5° / local context 15.3°; cross scale 14.0°, 29.9° (lMSE = 0.0052); ours 16.1°, 17.2° (lMSE = 0.0011).]

Figure 6. Comparison on real images with median angular errors. For laboratory illumination (top row), we show a novel view of our best and worst result of a reconstructed surface. The input image and the view of “local context” are taken from [33]. For natural illumination (bottom row), we show the surface normal estimates for known (left) and unknown illumination (right), and the estimated reflectance map for the latter case. Across all conditions, our method reconstructs fine surface detail better than previous approaches.

environment at each scene, which in many realistic scenes is next to impossible. Instead, we took ~200 pictures for each of 4 objects and reconstructed surface meshes using multi-view stereo [11, 12]; we then aligned the meshes with the test images using mutual information [6]. For the test images taken under real illumination, we painted the objects with a white diffuse paint and recorded them together with a calibration sphere in different environments; the latter allows recovering the ground truth illumination. Images and ground truth are publicly available on our website.

We give quantitative results on the dataset in the 3 rightmost columns of Tab. 2(b); the reconstructed surfaces are shown in Fig. 6. As before, our method was not specifically adapted to the dataset; neither is the shape prior used by [1], though the shapes it was trained on (MIT dataset) are still representative (i.e., of similar kind).

We show additional results in Fig. 1 and in the supplementary material.

Results. The quantitative and qualitative results show that our method robustly recovers surfaces and reflectance maps in synthetic, laboratory, and natural illumination. We clearly outperform the cross-scale approach [1] in all metrics; the only exception are the real images with known illumination, where we perform about the same. In the more challenging setting of unknown illumination, however, we perform significantly better. As can be seen in the bottom row of Fig. 6, correctly estimating the reflectance map is crucial to the performance of surface reconstruction. The silhouette features allow for a robust estimate, which is leveraged by our discriminative learning approach. The results in Fig. 6 further highlight that our approach is able to recover fine surface detail on real data, since it does not need to rely on strong spatial regularizers.

We also outperform the local context approach of [33]. One point to note is that our approach can deal with images of different scale (the images of Fig. 6, top are approximately twice the size of those in Fig. 6, bottom). This is due to the scale-invariant nature of our silhouette features.

It may seem surprising that the performance of all methods decreases in realistic settings, since the illumination is more colorful. However, this can be explained by observing that the real data exhibits shadows and fine surface detail, which the synthetic datasets do not. Despite these challenges, our discriminative approach is able to provide high-quality surface estimates in uncalibrated illumination.

9. Conclusion

We presented a discriminative learning approach for estimating the shape of an unknown diffuse object with uniform albedo under uncontrolled illumination, given only a single image. We adapted regression forests to predicting surface normals, and proposed and analyzed suitable features that provide local and scale-invariant object-level context without the need for spatial regularization. Pixel-independent predictions are fused by only enforcing integrability of the reconstructed surface. Silhouette features further enable estimating the unknown reflectance map. As with other learning approaches, we need to train our model for each lighting condition. This poses no major drawback, as the combined training and test time of our efficient approach is on par with the test time of other recent methods that do not use learning. We trained our model on novel, large-scale training data and evaluated it on several challenging datasets, where it outperforms recent approaches from the literature. Experiments on a new real-world dataset demonstrate its ability to recover fine surface detail outside of the laboratory.


Acknowledgement. We thank Simon Fuhrmann for assistance with multi-view stereo capturing. SRR was supported by the German Research Foundation (DFG) within the GRK 1362. SR was supported in part by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement No. 307942, as well as by the EU FP7 project “Harvest4D” (No. 323567).

References

[1] J. T. Barron and J. Malik. Color constancy, intrinsic images, and shape estimation. In ECCV, 2012.
[2] J. T. Barron and J. Malik. Shape, albedo, and illumination from a single image of an unknown object. In CVPR, 2012.
[3] J. Ben-Arie and D. Nandy. A neural network approach for reconstructing surface shape from shading. In ICIP, 1998.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[5] F. Cole, P. Isola, W. T. Freeman, F. Durand, and E. H. Adelson. ShapeCollage: Occlusion-aware, example-based shape interpretation. In ECCV, 2012.
[6] M. Corsini, M. Dellepiane, F. Ponchio, and R. Scopigno. Image-to-geometry registration: A mutual information method exploiting illumination-related geometric properties. Computer Graphics Forum, 28(7):1755–1764, 2009.
[7] I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, The University of Texas at Austin, 2003.
[8] Dosch Design. http://www.doschdesign.com/products/3d/Comic_Characters_V2.html
[9] J.-D. Durou, M. Falcone, and M. Sagona. Numerical methods for shape-from-shading: A new survey with benchmarks. CVIU, 109(1):22–43, 2008.
[10] R. Fisher. Dispersion on a sphere. P. Roy. Soc. Lond. A, 217(1130), 1953.
[11] S. Fuhrmann and M. Goesele. Floating scale surface reconstruction. In SIGGRAPH, 2014.
[12] S. Fuhrmann, F. Langguth, and M. Goesele. MVE – A multi-view reconstruction environment. In GCH, 2014.
[13] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In ICCV, 2009.
[14] A. Hertzmann and S. M. Seitz. Shape and materials by example: A photometric stereo approach. In CVPR, 2003.
[15] R. Huang and W. A. P. Smith. Shape-from-shading under complex natural illumination. In ICIP, 2011.
[16] K. Ikeuchi and B. K. Horn. Numerical shape from shading and occluding boundaries. Artificial Intelligence, 17(1):141–184, 1981.
[17] J. Jancsary, S. Nowozin, and C. Rother. Loss-specific training of non-parametric image restoration models: A new state of the art. In ECCV, 2012.
[18] M. K. Johnson and E. H. Adelson. Shape estimation in natural illumination. In CVPR, 2011.
[19] N. Khan, L. Tran, and M. Tappen. Training many-parameter shape-from-shading models using a surface database. In ICCV Workshops, 2009.
[20] J. J. Koenderink. What does the occluding contour tell us about solid shape? Perception, 13(3), 1984.
[21] P. Kontschieder, P. Kohli, J. Shotton, and A. Criminisi. GeoF: Geodesic forests for learning coupled predictors. In CVPR, 2013.
[22] G. Oxholm and K. Nishino. Shape and reflectance from natural illumination. In ECCV, 2012.
[23] A. Panagopoulos, S. Hadap, and D. Samaras. Reconstructing shape from dictionaries of shading primitives. In ACCV, 2012.
[24] T. Papadhimitri and P. Favaro. A new perspective on uncalibrated photometric stereo. In CVPR, 2013.
[25] R. Ramamoorthi and P. Hanrahan. An efficient representation for irradiance environment maps. In SIGGRAPH, 2001.
[26] D. Reddy, A. Agrawal, and R. Chellappa. Enforcing integrability by error correction using l1-minimization. In CVPR, 2009.
[27] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
[28] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1):2–23, 2009.
[29] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
[30] D. Stockman. Vienna 2010 35. Flickr, 2010. Licensed under https://creativecommons.org/licenses/by-sa/2.0/
[31] G.-Q. Wei and G. Hirzinger. Learning shape from shading by a multilayer network. T. Neural Netw., 7(4):985–995, 1996.
[32] R. White and D. Forsyth. Combining cues: Shape from shading and texture. In CVPR, 2006.
[33] Y. Xiong, A. Chakrabarti, R. Basri, S. J. Gortler, D. W. Jacobs, and T. Zickler. From shading to local shape. TPAMI, 37(1):67–79, 2014.
[34] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah. Shape from shading: A survey. TPAMI, 21(8):690–706, 1999.