Depth Estimation from Monocular Image with Sparse Known Labels

Yaoyu Li, Yifan Zhang, and Hanqing Lu
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

{yaoyu.li,yfzhang,luhq}@nlpr.ia.ac.cn

1. Introduction

Estimating depth information from a single view is an important problem in image understanding. Recently, computer vision has witnessed a series of breakthrough results introduced by deep convolutional neural networks (CNNs), and deep CNNs have increasingly been explored for depth estimation [3].

Estimating depth from a single monocular image is a technically ill-posed problem, as a single image may correspond to an infinite number of real-world scenes. This inherent ambiguity makes it impossible to predict accurate depth from monocular vision alone. However, if a few depth values in the environment can be obtained from more reliable sensors such as laser radar, and these values are combined with the vision model as prior information, the ambiguity can be significantly reduced when estimating the depth of the whole scene.

A 1-line 2D laser radar (e.g., the LMS111, at nearly 5% of the price of a Velodyne HDL-64E) has a simple structure, low power consumption, and a low price. It is widely mounted on robots and autonomous vehicles for range measurement. When the points scanned by the radar are projected onto the 2D image plane, they lie on a nearly horizontal straight line. Combining a 1-line 2D laser radar with a camera is thus a low-cost solution: the laser radar acquires sparse depth values (only one row of pixels in an image) with high confidence. These sparse known values can be treated as prior information, and we propose to leverage them to reduce the ambiguity of the mapping between a single RGB image and its corresponding depth map. By using these known labels, we provide our model with a few relatively accurate depth values as a reference, which largely narrows down the range of plausible depth values for the remaining pixels.
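To make the geometry concrete, the following is a minimal NumPy sketch of projecting single-line laser points into pixel coordinates with a standard pinhole camera model; the intrinsics and extrinsics here are hypothetical placeholders, not calibration values from the paper.

    import numpy as np

    # Hypothetical pinhole intrinsics and laser-to-camera extrinsics
    # (placeholder values, not calibration data from the paper).
    K = np.array([[525.0,   0.0, 319.5],
                  [  0.0, 525.0, 239.5],
                  [  0.0,   0.0,   1.0]])
    R, t = np.eye(3), np.zeros(3)  # assume the laser and camera frames coincide

    def project_laser_points(points_laser):
        """Project N x 3 laser points into pixel coordinates.

        Points scanned in a single horizontal plane land on a nearly
        horizontal line of pixels, matching the observation above.
        """
        points_cam = points_laser @ R.T + t   # laser frame -> camera frame
        uvw = points_cam @ K.T                # apply camera intrinsics
        uv = uvw[:, :2] / uvw[:, 2:3]         # perspective division
        depth = points_cam[:, 2]              # depth along the optical axis
        return uv, depth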

To fuse this prior information into both the learning and inference procedures of our model, we present a new approach for depth estimation from a single image. We estimate the depth map with a deep CNN in two steps. First, we estimate a relative depth map using a fully convolutional residual network (FCRN) with multi-scale supervised loss layers. Second, a fully connected layer is deployed to fuse the estimated depth and the sparse known labels, which serve as absolute information for inference. Our proposed method achieves strong performance on the NYU Depth dataset, and fusing sparse known labels proves helpful for removing much of the ambiguity of this ill-posed problem. Qualitative results are shown in Figure 1.

Figure 1. Examples of depth estimation results. (a) RGB input; (b) ground-truth depth; (c) results of Eigen et al. [3]; (d) results of our model without sparse known labels; (e) results of our model with sparse known labels.

2. Model

We use a deep residual network in an encoder-decoder scheme (see Figure 2). The input to our network is a monocular RGB image. The encoder resembles the ResNet-152 [4] architecture (without the final fully connected layer) and successively extracts low-resolution, high-dimensional features from the input image. The encoder downsamples the input image at 5 scales. The decoder, which up-projects the output of the encoder, consists of unpooling and convolution (we refer to this as up-convolution). Inspired by FlowNet [2], our architecture includes long skip connections between the encoder and decoder. To perform refinement, we up-project a feature map and concatenate it with the corresponding feature maps from the encoder and the upsampled coarser predicted depth map. The details of the decoder are shown in Figure 3.
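As an illustration, here is a minimal PyTorch sketch of one such refinement step, assuming hypothetical channel sizes (the paper does not specify the exact layer configuration).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RefinementBlock(nn.Module):
        """One decoder step: up-project, then fuse the encoder skip connection
        and the upsampled coarser depth prediction (channel sizes are guesses)."""

        def __init__(self, in_ch, skip_ch, out_ch):
            super().__init__()
            self.upconv = nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),   # unpooling
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            # fuse up-projected features + encoder skip + 1-channel coarse depth
            self.merge = nn.Conv2d(out_ch + skip_ch + 1, out_ch,
                                   kernel_size=3, padding=1)
            self.predict = nn.Conv2d(out_ch, 1, kernel_size=3, padding=1)

        def forward(self, x, skip, coarse_depth):
            x = self.upconv(x)
            coarse_up = F.interpolate(coarse_depth, size=x.shape[2:],
                                      mode="bilinear", align_corners=False)
            x = self.merge(torch.cat([x, skip, coarse_up], dim=1))
            depth = self.predict(F.relu(x))   # per-scale depth prediction
            return x, depth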

Figure 2. Illustration of our deep residual encoder-decoder architecture, with simplified skip connections from the corresponding encoder layers to the decoder.

Figure 3. A fraction of the decoder part.

We treat the sparse known labels as a one-channel feature map, in which pixels without a known depth value are filled with zeros. Convolution is translation-equivariant over a feature map, so it would treat pixels carrying a known depth value and zero-filled pixels in the same way. For this reason, we do not apply convolution to the sparse known labels. Instead, to fuse the prior information into our model, we concatenate the output of the decoder with the sparse known labels and subsequently apply a fully connected layer.
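A minimal sketch of this fusion step, using small hypothetical map sizes (a fully connected layer over full-resolution maps would be far larger in practice).

    import torch
    import torch.nn as nn

    class SparseLabelFusion(nn.Module):
        """Concatenate the decoder output with the one-channel sparse label map
        and fuse them with a fully connected layer (sizes here are hypothetical)."""

        def __init__(self, height=60, width=80):
            super().__init__()
            n = height * width
            # two flattened maps (estimated depth + sparse labels) -> fused depth
            self.fc = nn.Linear(2 * n, n)
            self.h, self.w = height, width

        def forward(self, est_depth, sparse_labels):
            # sparse_labels is zero wherever no laser measurement is available
            x = torch.cat([est_depth.flatten(1), sparse_labels.flatten(1)], dim=1)
            return self.fc(x).view(-1, 1, self.h, self.w)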

We use the berHu norm [5] as our loss function. The berHu loss has been shown to yield a better final error than the squared Euclidean norm. To train our model thoroughly, we apply a supervised loss layer on top of the decoder outputs at all 5 scales.
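For reference, the berHu (reverse Huber) loss of [5] uses an L1 penalty for small residuals and a scaled L2 penalty above a threshold c, commonly set to 20% of the maximum absolute residual in the batch; a minimal sketch following that formulation:

    import torch

    def berhu_loss(pred, target):
        """berHu (reverse Huber) loss: L1 for small residuals, scaled L2 above
        the threshold c = 0.2 * max|residual|, following Laina et al. [5]."""
        diff = (pred - target).abs()
        c = 0.2 * diff.max().detach()
        l2_part = (diff ** 2 + c ** 2) / (2.0 * c + 1e-8)
        return torch.where(diff <= c, diff, l2_part).mean()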

3. Experiment

We evaluate on NYU Depth v2, a large RGB-D dataset for scene understanding. The raw dataset consists of 464 scenes captured with a Microsoft Kinect camera. We use the official train/test split, with 249 scenes for training and 215 for testing. For training, we sample equally spaced frames from each training scene and then augment them by the means of [3], resulting in 86k unique images. Finally, we test on the standard test subset, which contains 654 images. To simulate the sparse known labels, we cut out the middle row of the ground-truth depth map.
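A minimal sketch of this simulation step (the array shape handling is illustrative):

    import numpy as np

    def simulate_sparse_labels(depth_gt):
        """Build the one-channel sparse label map by keeping only the middle
        row of the ground-truth depth map, with zeros everywhere else."""
        sparse = np.zeros_like(depth_gt)
        mid = depth_gt.shape[0] // 2
        sparse[mid, :] = depth_gt[mid, :]   # simulate a 1-line laser scan
        return sparse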

Results on the NYU Depth dataset are shown in Table 1. The first row is the result of [3], which performs depth estimation with a multi-scale architecture. The second row is the result of [1], which treats depth estimation as a classification problem.

Table 1. Comparison with the state of the art. The last two rows are the results of our model (1) without sparse known labels and our model (2) with sparse known labels. rel, log10, and rms are error metrics (lower is better); the δ columns are accuracy metrics (higher is better).

                       Error                  Accuracy
    Method             rel    log10  rms     δ<1.25  δ<1.25²  δ<1.25³
    Eigen et al. [3]   0.158  -      0.641   0.769   0.950    0.988
    Cao et al. [1]     0.150  0.065  0.656   0.791   0.952    0.986
    our model (1)      0.168  0.076  0.710   0.741   0.927    0.976
    our model (2)      0.146  0.063  0.652   0.787   0.955    0.988

Compared against these two state-of-the-art methods, our model achieves better performance on 4 out of 6 metrics. The last two rows are the depth estimation results of our model without and with sparse known labels, respectively. As Table 1 shows, our model performs better on all metrics after fusing the sparse known labels. This large improvement shows that it is feasible to reduce the ambiguity of the ill-posed problem by fusing prior information into the learning and inference procedures of a deep CNN.
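For reference, the metrics in Table 1 follow the standard depth-estimation definitions; a minimal sketch of how they are typically computed (a summary of the common formulas, not code from the paper):

    import numpy as np

    def depth_metrics(pred, gt):
        """Standard depth metrics: rel, log10, rms errors (lower is better)
        and the delta < 1.25^k accuracy thresholds (higher is better)."""
        pred, gt = pred.ravel(), gt.ravel()
        rel = np.mean(np.abs(pred - gt) / gt)
        log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
        rms = np.sqrt(np.mean((pred - gt) ** 2))
        ratio = np.maximum(pred / gt, gt / pred)
        deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
        return rel, log10, rms, deltas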

In future work, we will concatenate the sparse known labels with the coarse depth maps at each of the 5 scales of the decoder output, and apply a fully connected layer at each scale to fuse them.

References

[1] Y. Cao, Z. Wu, and C. Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. arXiv, 2016.

[2] A. Dosovitskiy, P. Fischer, E. Ilg, and P. Häusser. FlowNet: Learning optical flow with convolutional networks. In IEEE ICCV, pages 2758-2766, 2015.

[3] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In IEEE ICCV, pages 2650-2658, 2015.

[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770-778, 2016.

[5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), pages 239-248, 2016.
