
A Comparative Study of Depth-Map Coding Schemes for 3D Video

Harsh Nayyar
[email protected]

Nirabh Regmi
[email protected]

Audrey Wei
[email protected]

Abstract—This work presents a comparative study of depth-map coding schemes for 3D video. We first investigate two block-based transform coding approaches: the DCT and a trained KLT scheme. We then present a novel approach to depth-map compression. By applying Block Truncation Coding (BTC) with fixed block sizes within a frame, we are able to outperform both the DCT and the KLT with respect to rate and distortion. Our key contribution is the design of an adaptive Block Truncation Coding scheme (A-BTC) that utilizes the Lagrangian optimization framework to adaptively select the BTC block-size. Our results demonstrate that the A-BTC approach generally outperforms all of the other techniques examined in this work.

Keywords—3D video compression; depth-map compression; adaptive block truncation coding (A-BTC); BTC; DCT; KLT

I. INTRODUCTION

3D televisions have recently begun to gain traction in the marketplace, with a variety of stereo displays now available from many manufacturers. Although such displays are an important advance, they are encumbered by the need for special eye-wear and a limited true “look-around” effect. The vision of a true recreation of a 3D scene is likely to be achieved by future autostereoscopic displays. By emitting a large number of views, such displays are not only able to overcome the dependence on special eye-wear but also provide the “look-around” effect that viewers expect from a realistic 3D scene.

Given that autostereoscopic displays require a large number of views (potentially greater than 50), efficient compression techniques are critical for practical systems. In [1], Müller et al. show that encoding these views using a scheme such as the multi-view coding (MVC) profile of H.264/AVC results in a bitrate that increases linearly with the number of views. This can rapidly become infeasible as the required number of views increases.

One proposed solution [1] to this issue is to encode two or three views along with the corresponding depth maps. Such a pipeline is envisioned to allow for the synthesis of an arbitrary number of intermediate views. The critical component in the success of such an approach is the manner in which the depth maps are encoded.

This work assesses the relative performance of the Discrete Cosine Transform (DCT), the Karhunen-Loève Transform (KLT), and Block Truncation Coding (BTC) as applied to depth-map compression. This paper further presents a novel Adaptive Block Truncation Coding (A-BTC) technique that incorporates the Lagrangian optimization framework. This approach adaptively selects the block-size with which the BTC is applied to the depth-maps.

In order to focus our work, we restrict the scope of our analysis to the spatial dimension, and ignore any potential gains that may arise from exploiting temporal redundancy.

In the next section, we formally present the problem this work addresses. Section III describes some related work in this area. Section IV outlines our evaluation framework. Sections V and VI describe the application of the DCT and the KLT to depth-map compression respectively. Section VII presents BTC as a better approach to compressing depth-maps, while Section VIII describes and evaluates our Adaptive Block Truncation Coding scheme. Section IX compares and contrasts the key findings of our study, and the paper concludes in Section X.

II. PROBLEM DESCRIPTION

The intermediate view-synthesis stage is highly dependent on the quality of the input depth-maps. In particular, the edges present in the depth-maps highly influence the quality of the synthesized intermediate view. As a result, it is important to design a coding scheme that preserves these edges while achieving the required compression.

III. RELEVANT LITERATURE

Several different approaches exist in the literature for efficient coding of depth-maps, with some giving particular attention to the preservation of edges.

In [2] the authors present edge-adaptive transforms (EATs) as an alternative to the DCT for encoding depth maps. The EATs avoid filtering across edges in each image block and therefore do not produce large coefficient values. When selectively chosen alongside the DCT, this method helps to decrease the bitrate.

In [3], the authors use a shape-adaptive wavelet transform and an explicit encoding of the locations of major edges to encode depth maps. The paper presents a novel approach to generate small wavelet coefficients along edges to reduce the bits required to code the depth-map.

Platelet coding, as described in [4], is an edge-aware coding scheme that uses a segmentation procedure based on a quad-tree decomposition to model the depth map with piecewise-linear functions. This scheme is effective in producing significantly less degradation along edges.


IV. EVALUATION FRAMEWORK

In order to fairly compare the different schemes that are investigated, we evaluate each scheme by applying our test sequences to the system described below.

A. System Overview

Figure 1 depicts the top-level view of the system that [1] describes. We use this framework in order to evaluate the performance of our compressed depth-maps from each scheme.

Fig. 1. System level overview.

B. Test Sequences

We use the Kendo and Balloons multi-view sequences available from the Tanimoto Laboratory [7]. Specifically, we use the sequences from Cameras 1 and 3 and their associated depth-maps to synthesize an intermediate view corresponding to Camera 2. These sequences are 1024 x 768.

For reference, Figure 2 shows the luminance component of the first frame of each Camera 2 sequence and the associated depth-maps:

Fig. 2. Luminance (top) and example depth-map (bottom) of Balloons (left) and Kendo (right) from Camera 1.

C. Performance Metrics

We calculate PSNR as follows:

$$\mathrm{MSE} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(Y'_{ij} - Y_{ij}\right)^{2}$$

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{255^{2}}{\mathrm{MSE}}\right)$$

In the equations above, it is assumed that the dimension of the test sequence frames is n x m (width x height). It is important to note that we synthesize a reference intermediate view from uncompressed depth-maps; it is with respect to this reference synthesized view that we measure our distortion. The prime denotes the luminance component resulting from synthesis performed with the compressed depth-maps. For a sequence of frames, the PSNR that we report is the average over all frames.
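As an illustration, the following minimal Python sketch (our own, not the authors' implementation) computes the MSE and PSNR of a synthesized luminance frame against the reference synthesized from uncompressed depth-maps, and averages the PSNR over a sequence.

import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray) -> float:
    """PSNR (dB) between two 8-bit luminance frames of equal size."""
    ref = reference.astype(np.float64)
    tst = test.astype(np.float64)
    mse = np.mean((tst - ref) ** 2)          # average over all n*m pixels
    return 10.0 * np.log10(255.0 ** 2 / mse)

def sequence_psnr(ref_frames, test_frames) -> float:
    """Sequence-level figure: the average of the per-frame PSNR values."""
    return float(np.mean([psnr(r, t) for r, t in zip(ref_frames, test_frames)]))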

Finally, we apply entropy coding in each scheme in order to ensure a consistent and fair comparison of rates amongst the schemes.

V. DISCRETE COSINE TRANSFORM

The Discrete Cosine Transform (DCT) is widely used in image compression as it is easy to implement, computationally efficient, and achieves a compression factor comparable to that of an optimum transform, the KLT [5]. Another advantage is that the DCT decorrelates the input signal in a data-independent manner. For these reasons, we find the DCT worthy of inclusion in our comparative study.

In our implementation, we first apply the DCT-II with block-sizes (M) of 8 and 16. We then apply a uniform mid-tread quantizer with step-sizes varying from 2¹ to 2⁸. We construct the pmf for each quantized DCT coefficient, and use it to determine the entropy code for the sequence. A sketch of this pipeline is given below; the results for our test sequences are shown in Figure 3.
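The following minimal Python sketch (our own illustration, not the authors' code; the function names are hypothetical) applies a block-wise DCT-II followed by a uniform mid-tread quantizer to one depth-map block.

import numpy as np

def dct_matrix(M: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size MxM."""
    k = np.arange(M).reshape(-1, 1)
    n = np.arange(M).reshape(1, -1)
    C = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * n + 1) * k / (2.0 * M))
    C[0, :] = np.sqrt(1.0 / M)
    return C

def encode_block(block: np.ndarray, C: np.ndarray, step: int) -> np.ndarray:
    """2-D DCT of one MxM depth block followed by uniform mid-tread quantization."""
    coeffs = C @ block @ C.T                          # separable 2-D DCT-II
    return np.round(coeffs / step).astype(np.int32)   # mid-tread: zero is a reconstruction level

def decode_block(q: np.ndarray, C: np.ndarray, step: int) -> np.ndarray:
    """Dequantize and invert the transform (C is orthonormal, so its inverse is C.T)."""
    return C.T @ (q.astype(np.float64) * step) @ C

In our reading of the scheme, the depth map is tiled into MxM blocks, each block is encoded as above, and the pmf of the quantized coefficients is then gathered to estimate the entropy-coded rate.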

Fig. 3. Rate-Distortion performance for DCT based compression with M = 8, 16 for the first 50 frames in Balloons and Kendo.

The 16x16 DCT appears to be marginally better than the 8x8 DCT for coarse quantization and worse for finer quantization. However, a general claim about DCTs of different block sizes cannot be advanced, since we compare the distortion in the synthesized views rather than in the depth-maps themselves. The relative rate-distortion performance between block sizes is therefore not clear-cut, and the basic assumption that larger block sizes yield both lower rate and lower PSNR does not hold for transform coding in depth-map compression. In Figure 4, the difference between the reconstructed and the original intermediate views is depicted.


Fig. 4. Error visualization for Balloons (frame 1) synthesized using depth-maps compressed with DCT, M = 8, Q = 128.

VI. KARHUNEN-LOÈVE TRANSFORM

The Karhunen-Loève Transform (KLT) exploits the statistical properties of the depth-maps, which makes it signal-dependent and non-separable for image blocks. On the other hand, it is optimal in the sense that the KLT achieves maximal decorrelation of the depth-map values and also maximally compacts their information [5]. For these reasons, we use the KLT as another central comparison with the A-BTC described in Section VIII.

Similar to how the DCT is performed, we perform our KLT scheme using either M = 8 or M = 16 blocks, followed by a uniform mid-tread quantizer with step sizes varying from 2¹ to 2⁸. The training set is constructed by selecting each row of every MxM block for all frames in both views, and forming a matrix of samples as depicted in Figure 5. In this figure, m and n denote the width and height of the test sequences respectively, while p denotes the number of frames. This matrix is the input for the autocorrelation matrix from which we obtain the transform [6]. Finally, we construct the pmf (at each quantization step size) for each quantized KLT coefficient, and use it to determine the entropy code for the sequence. A sketch of the training procedure appears below.

Fig. 5. Construction of training set for KLT.

Figure 6 illustrates the relationship between the 8x8 KLT and the 16x16 KLT for the first 50 frames of both the Balloons and Kendo sequences. There are several important aspects of these results that merit further discussion. First, we notice that the Kendo sequence appears to achieve strictly better performance than the Balloons sequence. Next, we note that there are interesting cross-over points in the curves. Specifically, the results show that the larger 16x16 block size outperforms the 8x8 block for coarse quantization, while for finer quantization the smaller block size yields better performance. We believe that this behaviour can be explained by the uniform mid-tread quantizer that we apply. Because we apply uniform quantization to all of the coefficients, at larger step sizes we stand to lose relatively more information in 8x8 blocks than in 16x16 blocks. Conversely, at lower quantization step sizes, the benefits of a smaller block dominate and allow the 8x8 block to outperform the larger block size. The somewhat unpredictable nature of the performance suggests that the KLT is highly dependent on the training set.

Fig. 6. Rate-Distortion performance for KLT based compression with M = 8, 16 for the first 50 frames in Balloons and Kendo.

Finally, in Figure 7, the difference between the reconstructed and the original intermediate views is shown.

Fig. 7. Error visualization for Balloons (frame 1) synthesized using depth-maps compressed with KLT, M = 8, Q = 128.


VII. FIXED BLOCK-SIZE BLOCK TRUNCATION CODING

In this section, we first describe the basic Block Truncation Coding (BTC) algorithm as outlined in [8]. We investigate this scheme as its authors claim that it is good at preserving edges. We then describe how we calculate the rate for this scheme as we apply it to our depth-map application. Finally, we include our results for this scheme.

A. BTC Algorithm

BTC is a simple two-level non-parametric quantizer that adapts over local regions of the image by preserving local sample moments. It works at the block level, where a block of size M is coded to two quantized values. Several variations of mapping input values to these two levels within the BTC framework have been designed. In this paper we use a quantizer that preserves the first and second sample moments of the image.

Let n be defined as the number of elements in a block of size M (n = M²). The first and second sample moments and the variance of the block are defined as follows:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \overline{X^{2}} = \frac{1}{n}\sum_{i=1}^{n} X_i^{2}, \qquad \sigma^{2} = \overline{X^{2}} - \bar{X}^{2}$$

where $X_i$ is the i-th coefficient in the block. The threshold $X_{th}$ is set to $\bar{X}$, and the two quantized values (a and b) are selected as in the following pseudo-code:

IF $X_i < X_{th}$: output a
ELSE: output b

Let q be defined as the number of $X_i$ in the block that are greater than $X_{th}$.

Then, preserving the first two sample moments of the elements in the block, the two quantized values a and b are given by:

$$a = \bar{X} - \sigma\sqrt{\frac{q}{n-q}}, \qquad b = \bar{X} + \sigma\sqrt{\frac{n-q}{q}}$$
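As an illustration, the following minimal Python sketch (our own, not code from the paper) applies this moment-preserving quantizer to a single MxM block; samples equal to the threshold are assigned to b here, a detail the paper does not specify.

import numpy as np

def btc_block(block: np.ndarray):
    """Two-level moment-preserving BTC of one MxM block.

    Returns the bitmap (True where the output is b) and the two levels (a, b).
    """
    x = block.astype(np.float64)
    n = x.size
    mean = x.mean()                         # first sample moment
    second = (x ** 2).mean()                # second sample moment
    sigma = np.sqrt(max(second - mean ** 2, 0.0))
    bitmap = x >= mean                      # threshold at the block mean
    q = int(bitmap.sum())                   # number of samples at or above the threshold
    if q == 0 or q == n:                    # flat block: both levels collapse to the mean
        return bitmap, mean, mean
    a = mean - sigma * np.sqrt(q / (n - q))
    b = mean + sigma * np.sqrt((n - q) / q)
    return bitmap, a, b

def btc_reconstruct(bitmap: np.ndarray, a: float, b: float) -> np.ndarray:
    """Rebuild the block from the bitmap and the two quantized levels."""
    return np.where(bitmap, b, a)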

B. Rate Calculation

The rate calculation consists of two main components, which allow us to estimate an entropy-based rate that can be compared with our other schemes:

1) Entropy coding for a and b: We do this by calculating the pmf of all unique values of a and b for each block size. We use this pmf to assign bits to the quantized values and so achieve the entropy rate.

2) Encoding the positions of a and b in the block: We interpret the quantized output of the BTC as a bitmap, with a and b corresponding to '1' and '0' respectively. In order to exploit the statistical dependency of the bitmap ordering, we scan the columns of the block to form a sequence, perform run-length coding, and assign bits according to the entropy of the run-length vector (a sketch of this estimate follows the list).
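As an illustration of this rate estimate, the sketch below (our own, with hypothetical helper names) computes the run lengths of a column scan of one BTC bitmap and the total bits implied by their empirical entropy; in practice the run-length statistics would be pooled over all blocks of a given size.

import numpy as np

def run_lengths(bitmap: np.ndarray) -> list:
    """Column-scan the bitmap and return the lengths of runs of identical symbols."""
    seq = bitmap.T.flatten()                # scanning columns yields a 1-D sequence
    runs, count = [], 1
    for prev, cur in zip(seq[:-1], seq[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

def empirical_entropy_bits(symbols) -> float:
    """Total bits implied by the empirical entropy of a symbol sequence."""
    values, counts = np.unique(np.asarray(symbols), return_counts=True)
    p = counts / counts.sum()
    return float(counts.sum() * -(p * np.log2(p)).sum())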

C. Results

In this section, we perform the calculations on the first frame of the Balloons and Kendo sequences, as we have found that the first-frame results are representative of the full sequences. Therefore, in order to save execution time, we restrict our analysis to the first frame of each sequence.

Fig. 8. Rate-Distortion performance for the Fixed BTC scheme for Frame 1 in Balloons and Kendo, with the fixed M varied between 2¹ and 2⁶.

Figure 8 depicts the performance we achieve as we vary the fixed block-size parameter between 2¹ and 2⁶. We observe that as the block size increases, both the rate and the PSNR decrease.

Fig. 9. Error visualization for the Balloons sequence, Frame 1, with M = 64, M = 16, and M = 2 (left to right).

Finally, Figure 9 depicts the error in Balloons for M = 64, M = 16, and M = 2 respectively. There are two important trends here. First, we notice that as we decrease the fixed block-size, the error we observe decreases. Second, and perhaps more critically, we note that the error appears to be concentrated along the edges in the synthesized images.


VIII. ADAPTIVE BLOCK TRUNCATION CODING (A-BTC)

In this section we first describe the motivation behind modifying the fixed block-size BTC into our proposed Adaptive Block Truncation Coding scheme (A-BTC). We further discuss how we incorporate the Lagrangian optimization framework into this scheme. We finally present and discuss the results that the A-BTC achieves.

A. Motivation

The R-D curves of Figure 8 reveal that the fixed block-size BTC performs significantly better with smaller block-sizes, but at a rate penalty. However, in Figure 9 we see that the increased distortion for larger block sizes is actually concentrated around the edges. This suggests that more bits need to be spent to encode the depth-maps around the edges, while fewer bits could suffice to efficiently code the other regions of the depth-maps.

The A-BTC adaptively (according to the Lagrangian cost function) selects different block sizes for encoding different components of a depth-map: smaller block sizes for regions with edges and larger block-sizes for the remaining regions.

B. Algorithm Overview

This section presents a summary of the key steps in our proposed A-BTC scheme.

1) Perform the pre-processing stage to obtain the inputs necessary for the optimization.

2) For each block in a depth-map pair, recursively subdivide the block down to the specified minimum block-size.

3) Evaluate the Lagrangian cost at each block-size level and return it to the parent block.

4) Compare the Lagrangian cost of using the parent block size with the sum of the costs of the four child blocks, and select the option that minimizes the cost.

The specific details of the algorithm are left to the next section. In short, we recursively evaluate Lagrangian costs for the candidate block-sizes, and select the block-sizes that minimize the Lagrangian cost.

C. Algorithmic Details

The A-BTC can be considered an extension of our Fixed BTC scheme as described in Section VII. In other words, as the first step of this scheme, we perform the fixed block-size BTC for each block-size from M = 2¹ to M = 2⁶, for each depth-map pair (left, right). We then proceed to synthesize the intermediate views resulting from each compressed depth-map pair.

The above pre-processing stage provides us with the necessary inputs for performing the Lagrangian optimization. Our Lagrangian optimization seeks to minimize the standard Lagrangian cost function:

$$J = D + \lambda R$$

with λ defined as:

$$\lambda = 0.2\,Q^{2}$$

We can manipulate Q as we see fit in order to control the rate-distortion tradeoff.

The distortion that we use in this formulation is the straightforward MSE distortion. We use the MSE calculation as defined in Section IV-C, but on a per-block basis rather than a per-frame basis.

We now address the rate calculation. This is performed as we describe in Section VII-B, with two additional considerations. Because we are now adapting across varying block sizes (from M = 2¹ to M = 2⁶), we must additionally encode the chosen block size; we make a conservative estimate and use a fixed-length 3-bit code for this information. The second consideration has to do with the fact that we consider both depth-maps in a pair simultaneously. Consequently, we evaluate the rate for both depth-maps, and we use the mean rate to finally evaluate J.

In order to normalize our cost calculation, we multiply the MSE by the number of pixels in the candidate block and compute the total number of bits required to encode the block. This allows us to directly compare the Lagrangian cost of a parent block with the sum of the Lagrangian costs of its four child blocks; a sketch of this recursive comparison is given below.
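To make the recursion concrete, here is a minimal Python sketch of the bottom-up block-size selection under the Lagrangian cost; the helper btc_cost(x0, y0, size), which returns the summed squared error and bit count for coding that block with fixed-size BTC (including the 3-bit block-size signal), is a hypothetical interface rather than the authors' code.

def select_block_sizes(x0, y0, size, min_size, lam, btc_cost):
    """Return (total Lagrangian cost, list of chosen (x, y, size) blocks)."""
    sse, bits = btc_cost(x0, y0, size)
    parent_cost = sse + lam * bits            # J = D + lambda * R, D scaled by pixel count
    if size <= min_size:
        return parent_cost, [(x0, y0, size)]

    # Recursively evaluate the four children and sum their Lagrangian costs.
    half = size // 2
    child_cost, child_blocks = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            c, blocks = select_block_sizes(x0 + dx, y0 + dy, half, min_size, lam, btc_cost)
            child_cost += c
            child_blocks += blocks

    # Keep whichever option (parent block or four children) minimizes the cost.
    if parent_cost <= child_cost:
        return parent_cost, [(x0, y0, size)]
    return child_cost, child_blocks

Calling this on every maximum-size tile of the depth map with lam = 0.2 * Q**2 reproduces the adaptive partition described above, under our assumptions about the cost interface.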

D. Results

We first implement the A-BTC scheme described above with the minimum block-size set to two and the maximum block-size set to 64. Moreover, we vary Q between 2⁰ and 2⁸ to investigate different settings of λ. Figure 10 shows the results we obtain for the first frame of the Balloons and Kendo sequences. The results show that Kendo increasingly outperforms Balloons at higher rates, and that our range of λ can effectively control the rate-distortion tradeoff.

Fig. 10. Rate-Distortion performance for the A-BTC scheme for Frame 1 in Balloons and Kendo, with the maximum block-size set to M = 64 and the minimum block-size set to M = 2. Q sweeps from 2⁰ to 2⁸ in increasing powers of 2.

Next, we decide to investigate the effect of changing the maximum block-size available to the A-BTC scheme. We again set the minimum block-size to two and vary Q between 2⁰ and 2⁸. We investigate setting the maximum block-size to 8, 16, 32, and 64.

Figure 11 shows the results for these settings. We derive a very useful observation from these results: increasing the maximum block-size has the effect of shifting the rate-distortion curve to the left. Intuitively, by offering larger block-sizes to encode low-frequency background blocks, we attain a reduction in rate with no observable distortion penalty. At the same time, we must acknowledge that this effect appears to have a diminishing gain.

Fig. 11. Rate-Distortion performance for the A-BTC scheme for Frame 1 in Balloons as Mmax is varied between 8, 16, 32 and 64.

IX. EVALUATION

We now systematically evaluate the different schemes that we have described above. We first compare our block transform schemes against the Fixed BTC scheme. We then evaluate the gain of the A-BTC over the Fixed BTC.

A. Fixed BTC vs. Block Transforms

In Figure 12, we compare the performance of the Fixed BTC with the performance of the DCT and KLT with block-size M = 8. Meanwhile, Figure 13 compares the performance of the Fixed BTC with the performance of the DCT and KLT with block-size M = 16.

First, we notice that the DCT outperforms the KLT for both M = 8 and M = 16. This suggests that the performance of the KLT scheme is likely rather sensitive to the training set. Constructing a better training set would ideally give us a KLT that outperforms the DCT. However, there is no obvious way to obtain such an optimal training set.

In the Figure 12 results, it is clear that the Fixed BTC scheme outperforms both of the block transform based schemes. At low rates (about 0.05 bpp), we observe a gain of approximately 1.1 dB for both test sequences.

Fig. 12. Rate-Distortion comparison of BTC and Block Transform schemes for M = 8.

The results of Figure 13 are less conclusive. In particular, a key point to notice is that, for the Kendo sequence at low rates, we observe a cross-over in the performance. For these rates, our block transform schemes appear to outperform the Fixed BTC. This obviously requires further analysis. An initial hypothesis might be that this could be resolved by allowing more flexibility in the block sizes. We address this further in Section XI.

Fig. 13. Rate-Distortion comparison of BTC and Block Transform schemes for M = 16.

B. A-BTC vs. Fixed BTC

It is clear from the results of Figure 14 that the adaptive scheme that we propose holds promise. Specifically, we observe that, as expected, we are able to trade off rate and distortion and achieve a performance that is strictly superior to the fixed block-size curves.

X. CONCLUSION

In this work, we conduct an evaluation of three main schemes for compressing depth-maps: the DCT, the KLT, and BTC.


Fig. 14. Rate-Distortion comparison of the A-BTC and Fixed BTC schemes.

Our analysis confirms that depth maps cannot be treated as ordinary images; it is important to pay special attention to edges.

Our work begins with an investigation of the DCT and KLT based block transform schemes. Our analysis reveals that the DCT achieves slightly better performance than the KLT. Moreover, we observe that there is a cross-over when comparing blocks of M = 8 and M = 16.

The main focus of this work is a novel application of BTC to our depth-map compression problem. Our first attempt, the Fixed BTC scheme, is able to generally outperform the block transform approaches. By observing that distortion is concentrated along edges, we use small blocks to encode these regions and larger blocks to encode the remaining regions. This observation leads to our proposed A-BTC method.

We conclude this work with the observation that the A-BTC is a promising new approach to depth-map compression.

XI. FUTURE WORK

In this work, we have proposed a promising scheme which we denote the A-BTC. We believe that there exist several interesting future directions that can be investigated to further develop this work.

First, we would like to improve our optimization framework. In this work, as described above, we calculate the Lagrangian cost function jointly based on both the left and right depth maps. This is sub-optimal. An optimal scheme would iterate through the dependent depth maps until the block-size decisions converge.

Second, we also note that the Lagrangian optimization framework we propose is computationally expensive. Moreover, it will become even more so as we proceed in the direction described directly above. The key reason for this complexity is that the algorithm operates in a bottom-up fashion. We have conducted some initial work on developing a heuristic approach that works from the top down, and we would like to explore this further.

Third, we might also investigate preserving higher-order moments in the BTC. In addition, our results for the A-BTC suggest that increasing the maximum block-size has the effect of shifting the R-D curve to the left, improving R-D performance. We believe that this result merits further study. A study of the characteristics of different block sizes in different depth-map images would be useful in selecting the optimal set of block sizes.

Finally, we would like to perform a comparison with another promising scheme for depth-map compression: the Platelets method. This would involve implementing the method as described in [4].

ACKNOWLEDGEMENTS

The authors would like to sincerely thank Professor Bernd Girod and Mina Makar for their invaluable guidance in the development of this work. Moreover, we would like to extend our gratitude to the Tanimoto Laboratory of Nagoya University for their view-synthesis software and their 'Balloons' and 'Kendo' multi-view test sequences.

REFERENCES

[1] K. Müller, P. Merkle, and T. Wiegand, "3-D video representation using depth maps," Proceedings of the IEEE, vol. PP, no. 99, pp. 1-14, 2010.

[2] G. Shen, W.-S. Kim, S. K. Narang, A. Ortega, J. Lee, and H. Wey, "Edge-adaptive transforms for efficient depth-map coding," in Proc. 28th Picture Coding Symposium (PCS 10), Nagoya, Japan, Dec. 2010.

[3] M. Maitre and M. N. Do, "Depth and depth-color coding using shape-adaptive wavelets," Journal of Visual Communication and Image Representation, vol. 21, no. 5-6, July 2010.

[4] Y. Morvan, D. Farin, and P. H. N. de With, "Novel coding technique for depth images using quadtree decomposition and plane approximation," in Visual Communications and Image Processing 2005, Proc. of SPIE, vol. 5960, pp. 1187-1194, doi: 10.1117/12.631647.

[5] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, pp. 90-93, 1974.

[6] Z. Li and M. Drew, "Karhunen-Loève Transform," in Fundamentals of Multimedia. Upper Saddle River, NJ: Pearson Education, 2004, ch. 8, sec. 5.2, pp. 220-222.

[7] Tanimoto Laboratory, Nagoya University. http://www.tanimoto.nuee.nagoya-u.ac.jp/

[8] E. Delp and O. Mitchell, "Image compression using block truncation coding," IEEE Transactions on Communications, vol. 27, no. 9, pp. 1335-1342, Sep. 1979.