
COST-AWARE DEPTH MAP ESTIMATION FOR LYTRO CAMERA∗

Min-Jung Kim, Tae-Hyun Oh, In So Kweon

Robotics and Computer Vision Laboratory, KAIST

ABSTRACT

Since commercial light field cameras became available, the light field camera has attracted much interest from the computer vision and image processing communities due to its versatile functions. Most of its special features rely on an estimated depth map, so reliable depth estimation is a crucial step. However, estimating depth on real light field cameras is challenging due to noise and the short baselines among sub-aperture images. We propose a depth map estimation method for light field cameras that exploits both correspondence and focus cues. We aggregate costs from all the sub-aperture images into a cost volume to alleviate noise effects. Owing to the efficiency of the cost volume, cost-aware depth estimation is achieved quickly by discrete-continuous optimization. In addition, we analyze the properties of the correspondence and focus cues and use them to select reliable anchor points. A well-reconstructed initial depth map from these anchors is shown to enhance convergence. We show that our method outperforms the state-of-the-art methods by validating it on real datasets acquired with a Lytro camera.

Index Terms— Lytro, light field camera, depth map, cost volume, discrete-continuous optimization

1. INTRODUCTION

Unlike conventional cameras, Light Field (LF) cameras can capture both spatial and angular light information in a single shot. This capability has drawn much interest to LF cameras and has generated versatile applications such as refocusing [1, 2], segmentation [3], surface reconstruction [4], and view perspective shifting [5]. Most of these applications are based on depth estimation, so a reliable depth estimation process is crucial for LF camera applications.

An early work of Adelson and Wang [5] mentioned the possibility of estimating a depth map using a plenoptic camera. Since the introduction of the LF camera concept, there have been several studies on estimating a reliable depth map from an LF camera [6, 7, 8, 9]. Kim et al. [8] and Wanner et al. [7, 10] analyzed the slope in the epipolar image domain to estimate correspondence information. Yu et al. [6] exploited a 3D line structure in ray space to constrain the solution space by linearly interpolating the disparity along a 3D line. Since these methods [6, 7, 8] work in the epipolar image domain to detect lines or to measure edge confidence, they can be vulnerable to noise. Furthermore, they are restricted to synthetic or large-baseline multi-camera systems, and have rarely been tested on challenging micro-lens-based LF cameras such as Raytrix and Lytro, whose baselines are short and whose input images are noisy. One difference between Raytrix and Lytro is the type of micro-lens array: Lytro uses a single-lens-type array, while Raytrix uses a group-lens-type array that provides an extended depth of field (DoF) and benefits depth estimation. The Lytro has simpler hardware and is at a disadvantage for depth estimation, so this study targets the Lytro [11].

∗ This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2010-0028680).

Fig. 1: (a) A sample sub-aperture image. (b) Depth map that minimizes RGB-gradient consistency. (c) Depth map that maximizes the defocus measure. (d) Initial depth map by our approach. (e) Final depth map based on (d) as the initial.

Some studies [4, 9] have recently introduced and demonstrated applications on real data acquired by micro-lens LF cameras. Tao et al. [4] combined depth maps estimated from correspondence and defocus cues, while Perwaß et al. [9] utilized only the correspondence cue for triangulation. However, without considering the differences between the cues, Tao's method takes the estimated depth from each cue as-is and combines them directly using confidence weights; this can produce artifacts in the final depth.

In this paper, we propose a reliable depth estimation framework for micro-lens-based LF cameras such as the Lytro. As in conventional stereo problems, ambiguities arise from homogeneous regions and the narrow baseline among sub-aperture images. We alleviate these ambiguities by picking out pixels with reliable depth estimates as anchor pixels; the anchor pixels are used to correct unreliable nearby pixels. We analyze the characteristics of the correspondence and focus cues and exploit each cue according to its own properties when selecting reliable pixels. After estimating the anchor pixels, unreliable depth values are corrected under the assumption that nearby pixels with similar color values are likely to have similar depth values. This assumption significantly improves the convergence of the subsequent non-convex optimization step. Finally, we estimate the final depth map by cost-aware optimization on a cost-volume structure. By using the cost volume concept and the full 4D LF information, fast depth map estimation is achieved while noise effects are alleviated. We validate our performance on real data acquired by a Lytro camera and show that our method outperforms the state-of-the-art methods in both computational efficiency and quality.

Fig. 2: Illustration of the cost volume. (a) Cost volume structure. (b) Projection model. (c) Cost volume for the defocus cue.

2. COST VOLUME CONSTRUCTION

LF cameras provide multi-view images in a single shot as sub-aperture images with coplanar image planes. To alleviate noise effects, we fully utilize all the information from the sub-aperture images simultaneously through an efficient 3D volumetric structure, called a cost volume [12]. Our framework first constructs a cost volume by aggregating all the measurements.

The cost volume is defined as a voxel structure of size M × N × L, where M and N are the height and width of the reference image (we choose the center view), and L is the predefined number of depth hypotheses. The depth candidates are non-linearly sampled along the back-projection ray at a pixel location of the reference view. The cost volume configuration for the LF is illustrated in Fig. 2-(a). To use the cost volume structure, the relationship between the sub-aperture images and 3D space needs to be determined. We use the calibration method suggested by Dansereau et al. [13]. Here (i, j) denotes the index of a sub-aperture image (i-th column and j-th row), and (k, l) denotes a pixel location on the sub-aperture image designated by (i, j) (we use this notation throughout). Based on the geometric relationship defined in [13], we can calculate the projected point (k′, l′) on a sub-aperture image (i′, j′) from a 3D point (X, Y, Z) on the ray defined by a pixel (k, l) of the reference image (i, j) by the following equation:

$$k' = \frac{X - (H_{11}\, i' + H_{15} + Z H_{13}\, i' + Z H_{35})}{H_{13} + Z H_{33}}, \qquad
l' = \frac{Y - (H_{22}\, j' + H_{25} + Z H_{42}\, j' + Z H_{45})}{H_{24} + Z H_{44}}, \tag{1}$$

where $H_{pq}$ denotes the entry at $(p, q)$ of the calibration matrix $H$ obtained by [13]. Eq. (1) allows us to linearly calculate the projection location of a 3D point defined in the 3D space whose origin is the camera center of the reference view, as shown in Fig. 2-(b). From here on, we regard the correspondences as known, given $(i, j, k, l)$ and the depth $Z$.

The centroid of each voxel corresponds to a hypothetical 3D point. When we project such a 3D point onto all the sub-aperture images, if the projected points show consistency on the images (e.g., in color), the hypothetical 3D point is regarded as reliable. From this observation, we measure RGB color and color-gradient consistency as a cost and aggregate each cost into the voxel at pixel $p$ and depth $d$ as:

$$C_c(p, d) = \frac{1}{n(S)} \sum_{(i',j') \in S} \big| a(i_r, j_r, k, l) - a(i', j', k', l') \big|, \tag{2}$$

where $p = (k, l)$ in the reference image indexed by $(i_r, j_r)$; $S = \{(i', j') \mid 0 \le k'_{i',j'} < N,\; 0 \le l'_{i',j'} < M\}$; $n(\cdot)$ denotes the number of elements in a set; and $a(i, j, k, l)$ is a feature vector defined as
$$a = \Big[ R, G, B, \tfrac{\partial R}{\partial x}, \tfrac{\partial G}{\partial x}, \tfrac{\partial B}{\partial x}, \tfrac{\partial R}{\partial y}, \tfrac{\partial G}{\partial y}, \tfrac{\partial B}{\partial y} \Big],$$
where $R, G, B$ denote the color intensities. Intuitively, the depth with the minimum cost value along the $d$ axis can be regarded as a strong candidate. We refer to $C_c$ as the consistency cost (volume).
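To make the aggregation concrete, below is a minimal Python/NumPy sketch of Eqs. (1)–(2). The decoded sub-aperture array `lf`, the 5×5 calibration matrix `H`, the depth hypothesis list `depths`, and the `backproject` helper (which maps a reference pixel and a depth to the 3D point on its ray) are all assumptions for illustration, and nearest-neighbor sampling stands in for whatever interpolation an implementation would use.

```python
import numpy as np

def features(img):
    """Per-pixel feature vector a: RGB plus x/y color gradients (9 channels)."""
    gy, gx = np.gradient(img, axis=(0, 1))
    return np.concatenate([img, gx, gy], axis=-1)

def project(H, ip, jp, X, Y, Z):
    """Eq. (1): project the 3D point (X, Y, Z) onto sub-aperture view (ip, jp).
    H indices are 0-based versions of the 1-based H_pq in Eq. (1)."""
    kp = (X - (H[0, 0] * ip + H[0, 4] + Z * H[0, 2] * ip + Z * H[2, 4])) / (H[0, 2] + Z * H[2, 2])
    lp = (Y - (H[1, 1] * jp + H[1, 4] + Z * H[3, 1] * jp + Z * H[3, 4])) / (H[1, 3] + Z * H[3, 3])
    return kp, lp

def consistency_cost(lf, H, backproject, depths, ir, jr):
    """Consistency cost volume C_c of Eq. (2) for the reference view (ir, jr).
    lf: sub-aperture images, shape (ni, nj, M, N, 3); `backproject` is a
    hypothetical helper returning (X, Y) for reference pixel (k, l) at depth Z."""
    ni, nj, M, N, _ = lf.shape
    feats = np.array([[features(lf[i, j]) for j in range(nj)] for i in range(ni)])
    Cc = np.full((M, N, len(depths)), np.inf)
    for d, Z in enumerate(depths):
        for l in range(M):            # row (y)
            for k in range(N):        # column (x)
                X, Y = backproject(H, ir, jr, k, l, Z)
                costs = []
                for i in range(ni):
                    for j in range(nj):
                        kp, lp = project(H, i, j, X, Y, Z)
                        if 0 <= kp < N and 0 <= lp < M:   # membership in the set S
                            f = feats[i, j, int(round(lp)), int(round(kp))]
                            costs.append(np.abs(feats[ir, jr, l, k] - f).sum())
                if costs:
                    Cc[l, k, d] = np.mean(costs)          # (1/n(S)) * sum |a - a'|
    return Cc
```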

While the previous cost volume incorporates only consistency information, the LF configuration can also provide focus information. We construct another cost volume for the focus cue to achieve stable depth estimation. It is built by accumulating local sharpness information, measured with the Sum-Modified-Laplacian (SML) [14]. The cost volume for the focus cue is defined as:

$$C_f(p, d) = \sum_{q \in N(p)} \delta\big(ML(q, d) \ge T\big) \cdot ML(q, d), \tag{3}$$

where $\delta(\cdot)$ is the indicator function, which returns 1 if its argument is true and 0 otherwise; $N(p)$ is the set of neighbors within a radius $R$; and $T$ denotes a predefined threshold. This is illustrated in Fig. 2-(c). $ML(\cdot)$ is the modified Laplacian defined in [14]. This measure is computed on every generated refocus image $I_d$ specified at depth $d$; an intensity value of the refocused image is obtained by averaging the intensities of the 2D points projected from a 3D point at depth $d$ onto each sub-aperture image.
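A minimal sketch of Eq. (3) under the same conventions, assuming the refocus stack `refocused` has already been synthesized by averaging projected sub-aperture intensities as described above; a square window approximates the radius-$R$ neighborhood, and `T` and `R` are the paper's threshold and radius.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def focus_cost(refocused, T, R):
    """Focus cost volume C_f of Eq. (3).
    refocused: grayscale refocus images I_d, shape (L, M, N), one per depth."""
    L, M, N = refocused.shape
    Cf = np.empty((M, N, L))
    for d in range(L):
        I = refocused[d]
        # Modified Laplacian of Nayar & Nakagawa [14]: absolute second
        # differences in x and y (step size 1 for simplicity).
        ml = np.abs(2 * I - np.roll(I, 1, axis=1) - np.roll(I, -1, axis=1)) \
           + np.abs(2 * I - np.roll(I, 1, axis=0) - np.roll(I, -1, axis=0))
        ml[ml < T] = 0.0                          # indicator delta(ML >= T)
        # Sum over the neighborhood N(p): box-filter mean times window area.
        Cf[:, :, d] = uniform_filter(ml, size=2 * R + 1) * (2 * R + 1) ** 2
    return Cf
```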

3. INITIAL DEPTH MAP ESTIMATION

We estimate the depth map by optimizing it in a cost-aware manner. This is typically a non-convex optimization procedure, so the initial guess is important for obtaining a fine depth map. Tao et al. [4] used two depth maps estimated independently from the focus and correspondence cues as initial depth maps. However, we found that the consistency (correspondence) and focus cues should be treated differently according to their characteristics. As shown in Fig. 3, the defocus cost in (b) provides only a rough bound on the possible depth values, rather than an accurate extremum point like the consistency cost in (c). Depth maps obtained by searching the extrema of the two cues are shown in Fig. 1-(b) and (c). The consistency cost provides accurate depth values at pixels with distinctive features, but depth values in homogeneous regions are quite noisy and unreliable. From these observations, we decided to use the defocus cue as a guide. Even in a weakly textured region, the defocus cue can estimate a reliable bound on the depth because SML is measured over a local region. In a homogeneous region, the SML costs are zero for all depth candidates, so unreliable depths there can easily be rejected without any heuristic threshold.

Fig. 3: Analysis of consistency and focus costs. (a) The blue circle denotes the pixel selected to inspect the two costs for each depth candidate. (b) The cost obtained from the defocus cue. (c) The cost obtained from the consistency cue. (d) Green circles denote the pixels selected for (e, f). (e) The consistency cost of pixel A. (f) The consistency cost of pixel B. Red curves represent a quadratic curve locally fitted near the minimum.

Both initial depth maps from the two cues are vulnerable in homogeneous regions, and such unreliable depths can disturb the estimation of the fine depth map (e.g., the 2nd column of Fig. 4-(b)). We therefore select reliable pixels as anchors and use them to correct neighboring pixels. The anchor points are selected by the following criteria (see the sketch after this paragraph): 1) pixels with zero SML cost for all depth candidates are filtered out; 2) if the depth estimated from the consistency cost does not fall within the bound specified by the focus cue, the pixel is filtered out; 3) while the consistency cost at a distinctive feature point shows a sufficiently convex shape, the cost in a weakly textured region has a broad shape, as shown in Fig. 3-(e) and (f). We fit a quadratic function near the minimum of the cost and measure its variance (the inverse of the coefficient of the quadratic term); if the variance is large, the pixel is filtered out.
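The three filters can be sketched as follows, with two assumptions not stated in the paper: the focus cue's bound is approximated by the support of nonzero SML cost along the depth axis, and `var_max` is a hypothetical threshold on the fitted quadratic's variance.

```python
import numpy as np

def select_anchors(Cc, Cf, var_max):
    """Flag reliable anchor pixels using the three criteria of Sec. 3 (a sketch).
    Cc, Cf: cost volumes of shape (M, N, L) as built above."""
    M, N, L = Cc.shape
    anchors = np.zeros((M, N), dtype=bool)
    d_star = np.argmin(Cc, axis=2)                 # depth index minimizing C_c
    for y in range(M):
        for x in range(N):
            f = Cf[y, x]
            if not f.any():                        # 1) zero SML cost at all depths
                continue
            support = np.flatnonzero(f)            # rough depth bound from the focus cue
            d = d_star[y, x]
            if not (support[0] <= d <= support[-1]):   # 2) minimum outside the bound
                continue
            if 1 <= d <= L - 2:                    # 3) quadratic fit near the minimum
                a2 = np.polyfit([-1.0, 0.0, 1.0], Cc[y, x, d - 1:d + 2], 2)[0]
                if a2 <= 0 or 1.0 / a2 > var_max:  # broad minimum -> filtered out
                    continue
            anchors[y, x] = True
    return anchors
```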

Table 1: Performance comparison.

                         Yu et al. [6]   Tao et al. [4]   Ours
  Environment            Matlab          C++              Matlab
  Execution time (min)   12              25               9.6 (w/o parfor), 6 (with parfor)
  Continuous depth       X               O                O
  Metric depth           O               X                O

Most of the pixels in homogeneous regions are filtered out by the previous step. To obtain a complete initial depth map, depth values in the homogeneous regions also need to be estimated. A reasonable assumption is that locally color-consistent regions have similar depth values and that depth varies smoothly within a homogeneous region [15]. Under this assumption, we propagate the depths of the anchor points along color-consistent regions. Inspired by Levin et al. [16], we formulate a depth propagation problem:

$$\min_D \sum_p \Big( D(p) - \sum_{q \in N(p)} w_{pq}\, D(q) \Big)^{\!2}, \tag{4}$$

where $N(p)$ is the set of neighboring pixels of $p$. The intention is that Eq. (4) encourages the depth values of two neighbors to be similar if their colors are similar; thus, we define the weight as $w_{pq} = \exp\!\big(-\|K(p) - K(q)\|^2 / 2\sigma^2\big)$, where $K = [R, G, B]$ at the pixel. Given a set of pixels $p_i$ with reliable depths $D_i$, Eq. (4) is minimized subject to the constraints $D(p_i) = D_i$. Since Eq. (4) is quadratic and the constraints are linear, the optimization can be solved in closed form by pseudo-inversion. The propagated initial depth map is shown in Fig. 1-(d).
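Under the constraints, Eq. (4) reduces to a single sparse linear solve. The sketch below follows the spirit of Levin et al.'s colorization solver [16], with a 4-neighborhood and a hypothetical `sigma`: anchors are clamped, and every other pixel must equal the color-weighted average of its neighbors. It is an illustration under these assumptions, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def propagate_depth(img, anchor_depth, anchors, sigma=0.1):
    """Propagate anchor depths by solving the constrained Eq. (4).
    img: (M, N, 3) float image; anchors: (M, N) bool; anchor_depth: (M, N)."""
    M, N = anchors.shape
    idx = np.arange(M * N).reshape(M, N)
    rows, cols, vals = [], [], []
    b = np.zeros(M * N)
    for y in range(M):
        for x in range(N):
            p = idx[y, x]
            rows.append(p); cols.append(p); vals.append(1.0)
            if anchors[y, x]:                      # hard constraint D(p_i) = D_i
                b[p] = anchor_depth[y, x]
                continue
            nbrs = [(y + dy, x + dx)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < M and 0 <= x + dx < N]
            # w_pq = exp(-||K(p) - K(q)||^2 / (2 sigma^2)), then normalized
            w = np.array([np.exp(-np.sum((img[y, x] - img[v, u]) ** 2) / (2 * sigma ** 2))
                          for v, u in nbrs])
            w /= w.sum()
            for (v, u), wpq in zip(nbrs, w):       # row: D(p) - sum w_pq D(q) = 0
                rows.append(p); cols.append(idx[v, u]); vals.append(-wpq)
    A = sp.csr_matrix((vals, (rows, cols)), shape=(M * N, M * N)).tocsc()
    return spsolve(A, b).reshape(M, N)
```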

4. DISCRETE-CONTINUOUS OPTIMIZATION

The estimated initial depth map looks plausible, but the depth map obtained by Eq. (4) depends only on a color-aware smoothness prior and does not consider any cost in the estimated regions. We apply another refinement step to estimate depth in a cost-aware manner. Since our depth map should lie at minima of the consistency cost volume while neighboring depths should remain similar, we formulate the following objective function:

$$\min_d \sum_p \big( \lambda_s\, w_g(p)\, \|\nabla d(p)\|_H + w_r(p)\, C_c(p, d) \big), \tag{5}$$

where $\|\cdot\|_H$ is the Huber function [17], and $w_g$ and $w_r$ are weights for anisotropic smoothness and for down-weighting the cost of unreliable pixels, respectively. We define $w_g(p) = \exp\!\big(-\|\nabla K(p)\|_2^2 / \sigma_g\big)$ to encourage depth discontinuities at color-inconsistent regions, and set $w_r(p) = 1$ if $p$ is classified as reliable and $1/1000$ otherwise.

In Eq. (5), minimizing $C_c(\cdot)$ is a discrete optimization problem while the smoothness term is defined in the continuous domain, so optimizing Eq. (5) simultaneously is intractable. We instead add a penalty term with an auxiliary variable $z$ to split the discrete and continuous parts. Eq. (5) is then converted to

$$\sum_p \Big( \lambda_s\, w_g(p)\, \|\nabla z(p)\|_H + w_r(p)\, C_c(p, d) + \frac{1}{2\theta}\, \|z - d\|_2^2 \Big). \tag{6}$$

As $\theta \to 0$, the penalty term forces $z \approx d$ and Eq. (6) approaches Eq. (5). We solve Eq. (6) alternately, optimizing one variable while the other is fixed, and vice versa.

For the unknown $z$, the subproblem can be solved efficiently by a conventional primal-dual method [18]. For the unknown $d$, it can be solved by an exhaustive search over all discrete depth candidates. Searching exhaustively at every iteration is time consuming, so we apply the acceleration technique suggested by Newcombe et al. [19]: it suffices to search within the theoretical range
$$d \in \Big[\, z - 2\theta \sqrt{C_c^{\max}(p) - C_c^{\min}(p)},\;\; z + 2\theta \sqrt{C_c^{\max}(p) - C_c^{\min}(p)}\, \Big].$$
This range significantly reduces the number of candidates that must be inspected and helps avoid undesirable local minima. At every iteration we update $\theta$ by multiplying by $\rho = 10^{-3}$, i.e., $\theta_{t+1} = \rho\, \theta_t$. The iteration terminates when $\|d - z\|_2$ becomes almost 0 or $\theta$ falls below $10^{-7}$.

5. EXPERIMENTAL RESULTS

We validate our method by comparison against the state-of-the-art methods on real datasets acquired by a Lytro camera (Fig. 4-(a)) and provide a self-evaluation to analyze the dependency on the initial depth map (Fig. 4-(b)). For a fair comparison, we adjust the parameters of all the methods to use the same depth space: the disparity step is 0.02 pixels, and the same maximum disparity is set for all the methods. Default values are used for the other parameters, and all parameters are kept fixed across all experiments. Yu's method [6] returns disparity as output, which we convert into depth using the known calibration. Tao et al. [4] compute relative depth as output, so we adjust the depth scale for visually easy comparison.

Fig. 4: Experimental results on various real datasets. (a) Comparisons with other methods: captured scene, Wanner et al. (CVPR 2012), Yu et al. (ICCV 2013), Tao et al. (ICCV 2013), and the proposed method. (b) Self-evaluation of the proposed method: captured scene, depth from the RGB-gradient cost, depth from the initial guess, and 3D reconstruction.

The performance comparison is given in Table 1. The proposed method is the fastest despite its unoptimized Matlab implementation. Since Yu's method is based on graph cuts, its final depth map is discrete, whereas ours is continuous by virtue of the discrete-continuous optimization.

In Fig. 4-(a), the results of Yu et al. preserve clear depth discontinuities at object boundaries. However, since their method depends on line detection, depth reversal effects are observed (red circles), and the discrete labeling scheme produces severely quantized depth. Although Tao et al. take advantage of depth from defocus, their method shows artifacts wherever defocus is estimated incorrectly (e.g., distant or homogeneous regions; orange circles). Wanner et al. [7] works well in textured regions but is vulnerable in homogeneous regions. Even though some depth bleeding caused by calibration error appears at object boundaries, our method overall produces continuous and stable depth estimates.

Fig. 4-(b) shows the self-evaluation, which demonstrates the importance of the initial guess. The second and third columns depict the depth maps obtained by the discrete-continuous optimization with different initializations: the former directly uses the initial depth map that minimizes the RGB-gradient consistency cost, while the latter uses the initial map estimated by our approach in Sec. 3. As discussed, the initial map acquired by propagating reliable depth values plays an important role in our optimization; the proposed initialization yields more plausible results with fewer artifacts. We additionally provide 3D reconstruction results in the rightmost column of (b). Since our method produces continuous and metric depth, plausible surface reconstruction can be achieved.

6. CONCLUSION

We proposed a stable depth estimation method for the Lytro camera. We estimated an initial depth map by propagating reliable depth values, filtered using the focus cue and the level of texture, under the assumption that depth varies smoothly within color-consistent regions. The reliable initial depth map enhanced the final solution of the discrete-continuous optimization. The efficiency and quality of our method were validated on various real datasets. In future work, we will extend our framework to video sequences from an LF camera to enhance absolute quality by incorporating multi-view measurements.


7. REFERENCES

[1] R. Ng, Digital Light Field Photography, Ph.D. thesis, Stanford University, 2006.

[2] K. Mitra and A. Veeraraghavan, "Light field denoising, light field superresolution and stereo camera based refocussing using a GMM light field patch prior," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012.

[3] S. Wanner, C. Straehle, and B. Goldluecke, "Globally consistent multi-label assignment on the ray space of 4D light fields," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[4] M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi, "Depth from combining defocus and correspondence using light-field cameras," in IEEE International Conference on Computer Vision (ICCV), 2013.

[5] E. H. Adelson and J. Y. Wang, "Single lens stereo with a plenoptic camera," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 14, no. 2, pp. 99–106, 1992.

[6] Z. Yu, X. Guo, and J. Yu, "Line assisted light field triangulation and stereo matching," in IEEE International Conference on Computer Vision (ICCV), 2013.

[7] S. Wanner and B. Goldluecke, "Globally consistent depth labeling of 4D light fields," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[8] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross, "Scene reconstruction from high spatio-angular resolution light fields," ACM Transactions on Graphics (SIGGRAPH), vol. 32, no. 4, pp. 73:1–73:12, 2013.

[9] C. Perwaß and L. Wietzke, "Single lens 3D-camera with extended depth-of-field," in Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), 2012.

[10] S. Wanner and B. Goldluecke, "Variational light field analysis for disparity estimation and super-resolution," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.

[11] R. Ng, "Lytro official homepage," https://www.lytro.com/about/, 2012.

[12] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz, "Fast cost-volume filtering for visual correspondence and beyond," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[13] D. G. Dansereau, O. Pizarro, and S. B. Williams, "Decoding, calibration and rectification for lenselet-based plenoptic cameras," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[14] S. K. Nayar and Y. Nakagawa, "Shape from focus," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 16, no. 8, pp. 824–831, 1994.

[15] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-ToF cameras," in IEEE International Conference on Computer Vision (ICCV), 2011.

[16] A. Levin, D. Lischinski, and Y. Weiss, "Colorization using optimization," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 689–694, 2004.

[17] P. J. Huber, "Robust estimation of a location parameter," The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.

[18] A. Chambolle, V. Caselles, D. Cremers, M. Novaga, and T. Pock, "An introduction to total variation for image analysis," in Theoretical Foundations and Numerical Methods for Sparse Recovery, De Gruyter, 2010.

[19] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in IEEE International Conference on Computer Vision (ICCV), 2011.