


Integrated Computer-Aided Engineering 19 (2012) 215–227
DOI 10.3233/ICA-2012-0401
IOS Press

Maximum likelihood motion compensation for distributed video coding

Frederik Verbist a,b, Nikos Deligiannis a,b, Marc Jacobs a,b, Joeri Barbarien a,b, Peter Schelkens a,b and Adrian Munteanu a,b,∗
a Department of Electronics and Informatics, Vrije Universiteit Brussel, Brussels, Belgium
b Department of Future Media and Imaging, Interdisciplinary Institute for Broadband Technology, Ghent, Belgium

Abstract. Aspiring to provide robust low-complexity encoding for video, this work presents a hash-based transform domain distributed video codec, featuring a novel maximum likelihood motion compensation technique to generate high quality side information at the decoder. A simple hash is employed to perform overlapped block motion estimation at the decoder, which produces a set of temporal predictors on a pixel basis. For every pixel position, maximum likelihood motion compensation, based on an online estimate of the conditional dependencies between the temporal predictors and the original frame, combines the cluster of temporal predictors into a single value to serve as decoder-generated side information. Efficient low-density parity-check accumulate channel codes refine the side information in the transform domain. Experimental results demonstrate that the proposed system advances over our previous hash-based distributed video coding architectures, delivering state-of-the-art distributed coding performance, in particular for sequences organized in large groups of pictures or containing highly irregular motion. Notwithstanding the presence of a hash, the presented distributed video codec successfully maintains low encoder complexity.

Keywords: Distributed video coding, Wyner-Ziv video coding, side information generation, hash-based motion estimation, motion compensation

1. Introduction

The information theoretic principles of distributed source coding (DSC) offer an intriguing coding logic for designing competitive low-complexity encoding solutions. DSC refers to the separate encoding but joint decoding of multiple correlated sources and is supported by the information theoretic foundations laid by Slepian and Wolf [22]. Although [22] considered a lossless coding scenario, Wyner and Ziv [26] extended these findings by adding a quantization step, hence facilitating distributed lossy compression, referred to as Wyner-Ziv (WZ) coding. Video coding systems designed according to the principles of Wyner-Ziv theory, known as distributed video coding (DVC) or WZ video coding systems, no longer designate the encoder as the sole party responsible for exploiting the spatio-temporal correlation but rather allow this responsibility to be shared with the decoder.

∗Corresponding author: Prof. Adrian Munteanu, Department of Electronics and Informatics, Vrije Universiteit Brussel, Pleinlaan 2, B-1050, Brussels, Belgium. E-mail: [email protected].

Coincidentally, the emergence of low-cost wireless multimedia sensor networks and processing technology [18] has attracted the attention of both the academic world and industry [27]. Wireless video sensors, in combination with tracking and detection algorithms [7], monitoring specific scenes, can provide security and surveillance. In this context, DVC has been identified as a promising enabling technology for low-power mobile media applications [20,27].

In practical DVC, an original source signal X is coded independently using powerful channel codes. The decoder in turn first generates a prediction of the signal X, called the side information (SI) signal Y, which is subsequently used in the channel decoding of X. To this end, an accurate knowledge of the virtual correlation channel, capturing the conditional dependencies between X and its SI, is of paramount importance for efficient decoding.

In this regard, we have introduced the so-called SI dependent (SID) additive correlation channel model

ISSN 1069-2509/12/$27.50 © 2012 – IOS Press and the author(s). All rights reserved


216 F. Verbist et al. / Maximum likelihood motion compensation for distributed video coding

X = Y + N [12]. In particular, an SID channel assumes the additive noise component N to be distributed according to a zero-mean Laplacian with standard deviation σ(y), which varies depending on the actual realization y of the SI Y. It has been demonstrated that the Laplacian SID correlation channel model describes the actual correlation channel more accurately than the conventional SI independent (SII) channel model [12]. Additionally, we have introduced an efficient online SID transform domain correlation channel estimator in [10], which is not confined to a particular WZ architecture.

Producing accurate SI at the decoder is a critical factor for the compression performance of a WZ codec. Motion-compensated interpolation (MCI) based SI generation [3,4] interpolates already decoded frames along the estimated motion field based on a linear translational motion model. However, the accuracy of a linear motion model degrades when the motion becomes highly irregular or the distance to the reference frames increases [16]. In such circumstances, the quality of the resulting SI dwindles and demands a large amount of channel coding rate in order to be corrected, hence reducing compression efficiency.

As a countermeasure, auxiliary information, called hash information, can be sent to the decoder to assist motion estimation, resulting in more accurate SI, even under strenuous conditions [1,6]. In this context, we have introduced a hash-based video coding architecture called spatial domain unidirectional DVC (SDUDVC) in [13]. Coarsely quantized versions of the original non-key frames served as a hash, based on which overlapped block motion estimation (OBME) and probabilistic motion compensation generated the reconstructed frames at the decoder. Lacking true WZ channel coding, the system of [13] evades the use of a feedback channel. Our hash-based video coding scheme presented in [23] not only added a WZ layer in the transform domain to [13], but in addition, it featured a novel probabilistic motion compensation technique, which combines the temporal predictors from the different overlapping blocks by means of weighted averaging, where the weights are calculated based on the combined knowledge of the hash and the estimated correlation statistics.

Our hash-based architectures introduced in [11,14] employed a down-sampled and H.264/AVC intra coded version of the original WZ frames as a hash. The hash-based OBME algorithm was adapted so as to accommodate the alternative hash and the SI values were generated as the average of the candidate temporal predictors.

In addition, [11] refined the quality of the SI during decoding to further boost the compression performance.

This work introduces a novel multi-hypothesis probabilistic motion compensation technique in a hash-based WZ architecture, which combines the large collection of temporal predictors for each pixel, generated through OBME, into a single value. The employed hash comprises a downsampled version of the hash utilized in [13,23]. The new multi-hypothesis maximum likelihood motion compensation (MLMC) technique requires knowledge of the conditional dependency between the temporal predictors and the original source values in order to generate accurate approximations of the original frames at the decoder. To this end, the correlation estimation algorithm from [10] is properly modified to use information from the hash and operate in the spatial domain.

The remainder of the paper is structured as follows. Section 2 provides a detailed description of our DVC scheme. Section 3 offers an in-depth explanation of the novel MLMC approach. Experimental results for the proposed DVC scheme, comparing it against a set of reference codecs, are provided in Section 4. Section 5 concludes the paper.

2. System architecture

2.1. The encoding procedure

A graphical overview of the proposed transform domain WZ video codec is depicted in Fig. 1. The input sequence is split into key frames I and WZ frames X, which are organized into groups of pictures (GOPs). The key frames are coded with H.264/AVC [25], operating in intra frame mode only, while the WZ frames are coded according to the procedure presented in [2], that is, they undergo a DCT followed by quantization and the resulting quantization indices are coded with a Slepian-Wolf (SW) coder based on the rate-adaptive low-density parity-check accumulate (LDPCA) codes presented in [28].

For every WZ frame, the encoder creates hash information, which constitutes a crude representation of the original frame. The hash serves the same function as proposed in [13], namely as an aid to the decoder during motion estimation and SI generation. Figure 2 illustrates the hash formation and coding pipeline. The original WZ frames are downsampled by a factor ξ prior to coarse quantization. Similar to [13,23], uniform quantization with a quantization step-size 2^(K−b) is applied, which is equivalent to retaining the b most significant bitplanes of the original sample values having a bit-depth K. The obtained quantization indices are spatially decorrelated, entropy coded and sent to the decoder.

Fig. 1. Schema of the proposed distributed video coding architecture.

In particular, let X denote the samples in the original WZ frame of width W and height H. Let s = (i, j) indicate a pixel position in the frame, where i ∈ [0, W − 1] and j ∈ [0, H − 1] are the column and row index, respectively. With this notation, X(s) designates the sample at location s in the original WZ frame. Prior to quantization, the WZ frames are downscaled by a factor ξ = 2^k, k ∈ ℕ. Let X′ represent the samples in the downscaled frames with width W′ = W/ξ and height H′ = H/ξ and let X′(s′) represent the sample at position s′ = (i′, j′), where i′ ∈ [0, W′ − 1] and j′ ∈ [0, H′ − 1] are respectively the column and row index in the downsampled frame. Downscaling is performed by decimation without anti-aliasing filtering, that is, the downscaling operation simply removes values from the original frame. In other words, the samples X′(s′) are obtained as X′(s′) = X(ξs′), i′ = 0, 1, …, W′ − 1, j′ = 0, 1, …, H′ − 1. Subsequently, the samples X′ are quantized according to:

X′_b = ⌊X′ / 2^(K−b)⌋    (1)

where ⌊·⌋ stands for taking the integer part, K designates the bit-depth of the original samples and b is the number of retained bitplanes. The resulting hash X′_b is similar to a low-resolution grey-scale image containing 2^b distinct values, as illustrated in Fig. 2 for b = 2 and ξ = 2.
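As an illustration, the hash formation of Eq. (1) can be sketched in a few lines of NumPy; the function and parameter names below are ours, not part of the codec specification:

```python
import numpy as np

def form_hash(frame, xi=2, K=8, b=2):
    """Sketch of Eq. (1): decimate by a factor xi (plain sample dropping,
    no anti-aliasing filter), then uniformly quantize with step 2^(K-b),
    i.e. keep the b most significant bitplanes."""
    down = frame[::xi, ::xi]       # X'(s') = X(xi * s')
    return down >> (K - b)         # floor(X' / 2^(K-b))
```

For K = 8 and b = 2 the resulting hash holds quantization indices in {0, 1, 2, 3}, matching the low-resolution grey-scale image described above.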

In order to remove spatial redundancy while maintaining limited hash encoding complexity, the quantization indices X′_b are decorrelated using the edge-adaptive prediction scheme of JPEG-LS [24]. The resulting prediction errors are mapped to a new set of symbols in the range [0, 2^b − 1] by modulo arithmetic. These symbols are then converted to sequences of binary symbols (bins) using unary coding and every symbol is subsequently entropy coded using binary arithmetic coding.
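A minimal sketch of the modulo mapping and unary binarization steps described above (helper names are illustrative; the JPEG-LS predictor and the binary arithmetic coder are omitted):

```python
def map_residual(pred_error, b=2):
    """Map a prediction error onto the range [0, 2^b - 1] via modulo
    arithmetic (illustrative sketch of the symbol mapping step)."""
    return pred_error % (1 << b)

def unary_bins(symbol):
    """Unary binarization: symbol n becomes n ones followed by a zero."""
    return [1] * symbol + [0]
```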

2.2. The decoding procedure

At the decoder side, the key frame and hash bitstreams are decoded and the reconstructed key frames I and hash X′_b are stored for future referencing. In order to enable LDPCA decoding of the WZ frames, the decoder creates SI in a two-stage process. First, the decoded hash is used to perform hash-based OBME, which generates a cluster of temporal predictors for each pixel in the WZ frame. In the second stage, a novel maximum likelihood motion compensation technique distils a single value from every cluster of predictors to serve as SI.

Before commencing motion estimation, the decoded hash X′_b is upscaled to the original frame dimensions by upsampling via mere zero insertion. It is important to note that no interpolation filter is employed during upscaling. Instead, the zero-valued samples introduced by upsampling will be ignored during the block matching process. This is motivated by the following. The hash frame X′_b contains quantization indices obtained through coarse quantization, attaining 2^b possible levels. To limit the required hash rate, b is kept low, e.g. b = 2. Applying classical interpolation on these sample values is therefore likely to incur a significant interpolation error. In order to avoid contaminating the block matching with low fidelity interpolations, resorting only to the reliable pixel positions in the upsampled hash frame X′_b is a more robust solution. As a side-effect, this approach also reduces the number of operations, thereby limiting the associated complexity.

Fig. 2. Schematic overview of the hash formation and coding pipeline.
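The zero-insertion upscaling, together with a mask flagging the reliable positions s = ξs′, can be sketched as follows (names are illustrative):

```python
import numpy as np

def upscale_hash(hash_frame, xi=2):
    """Upscale the decoded hash by zero insertion (no interpolation filter)
    and return a boolean mask marking the reliable positions s = xi * s',
    which are the only ones used during block matching."""
    Hp, Wp = hash_frame.shape
    up = np.zeros((Hp * xi, Wp * xi), dtype=hash_frame.dtype)
    up[::xi, ::xi] = hash_frame            # reliable samples only
    mask = np.zeros_like(up, dtype=bool)
    mask[::xi, ::xi] = True
    return up, mask
```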

Following the hierarchical B temporal prediction structure used in [21], two previously decoded WZ and/or key frames, termed R_0 and R_1 respectively, serve as past and future reference frames for OBME. The upsampled hash frame X′_b is divided into overlapping blocks β of size B × B pixels, with an overlap of ε pixels. For every such block, the best matching block ϑ, within search range sr, is identified in the reference frames.

The upsampled hash X′_b contains the quantization indices representing the b most significant bitplanes of the original WZ frames at pixel positions s = ξs′, where s′ = (i′, j′), i′ = 0, 1, …, W′ − 1, j′ = 0, 1, …, H′ − 1. Hence, X′_b contains reliable information solely at the pixel positions ξs′. Since the reliable positions contain coarse quantization indices, traditional error metrics, like SAD or MSE, are inaccurate block matching criteria. Therefore, a more suitable matching error is applied.

Let R_t, t ∈ {0, 1} represent the samples of either the past or future reference frame and let v = (v_i, v_j), −sr ≤ v_i, v_j ≤ sr, be a motion vector within the search window defined by sr. The motion estimation algorithm chooses the best motion vector υ according to:

υ = argmax_v ∑_{s=ξs′} δ( X′_b(s) − ⌊R_t(s − v) / 2^(K−b)⌋ ),    (2)

where δ(·) is the Kronecker delta function and s is confined to all reliable pixel positions within a block. In other words, the motion estimation will maximize the number of equal quantization indices of collocated pixels in the blocks under consideration, where only reliable positions are taken into account.
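For a single candidate block, the matching criterion of Eq. (2) reduces to counting agreeing quantization indices at the reliable positions. A sketch, assuming a reliability mask as produced during upscaling (block extraction and the exhaustive search loop over v are omitted; names are ours):

```python
import numpy as np

def match_score(hash_block, ref_block, mask, K=8, b=2):
    """Eq. (2) for one candidate displacement: requantize the reference
    samples to b bitplanes and count the reliable positions whose index
    equals the hash index. Higher counts indicate better matches."""
    ref_idx = ref_block >> (K - b)     # floor(R_t(s - v) / 2^(K-b))
    return int(np.sum((hash_block == ref_idx) & mask))
```

The motion vector υ maximizing this count over the ±sr search window is retained.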

In this way, for every pixel position s in the WZ frame, OBME creates a set of temporal predictors T_s = {ψ_0, ψ_1, …, ψ_{M−1}}. The elements ψ_i correspond to the co-located pixels in the blocks ϑ_i, which have been identified as the best matches for the overlapping blocks β_i covering the pixel position s.

The next step is to combine all the elements of T_s into a single SI value, which is accomplished by means of a novel MLMC technique, explained in detail in Section 3. The resulting SI frame Y is DCT transformed and acts as SI for the WZ decoding of X in the transform domain. An online transform domain correlation channel estimation (TDCCE) is performed, according to the procedure introduced in [10]. In short, the algorithm determines the conditional probability mass function (PMF) of the quantization bins (defined by the already decoded bitplanes of the quantization indices of a coded DCT band) given the SI. From this intermediary PMF, the SID model parameters, that is, the shape parameters of the Laplacian distributions given the SI, are derived. The current model approximation is then used to LDPC decode the next bitplane in the DCT frequency band, after which the conditional PMF and SID model parameters are calculated anew.

When the WZ frames are decoded, they are optimally reconstructed, according to the procedure followed in [3,11,23], after which they undergo an inverse DCT. The reconstructed WZ frames inside a GOP remain stored in the reference frame buffer, where they serve as reference frames for the temporal prediction of WZ frames belonging to a lower temporal level [21].


3. Maximum likelihood motion compensation

3.1. Formulating the likelihood function

This section entails an in-depth presentation of the novel MLMC technique to generate the final SI values after the execution of OBME. Denote the discrete random variables representing the samples in the original WZ frame and the corresponding SI frame by X and Y, respectively. Let T_s = {ψ_0, ψ_1, …, ψ_{M−1}} denote the set containing the candidate temporal predictors obtained by means of OBME for the WZ frame sample at location s. MLMC is responsible for deducing a single value y from the set T_s to serve as SI at pixel position s. With this goal, we propose the following likelihood function:

L_s(x) = ∏_{i=0}^{M−1} p_{Ψ|X}(ψ_i|x),    (3)

where Ψ is a discrete random variable representing the temporal predictors and p_{Ψ|X}(ψ_i|x) is the conditional PMF of Ψ conditioned on X.

In order to maximize the likelihood function given by Eq. (3), the correlation between the candidate predictors and the original source X is first modelled as a conditional PMF. Since the temporal predictors ψ_i correspond to pixels in the reference frames, the conditional PMF between the original source X and the random variable Ψ, that is, p_{X|Ψ}(x|ψ_i), is approximated by the conditional PMF p_{X|R_t}(x|r_t) between X and the source of candidate predictors, namely, the reference frames R_t, t ∈ {0, 1}. Notice that estimating p_{X|R_t}(x|r_t) is conceptually different from TDCCE [10], which estimates the virtual correlation channel for WZ decoding. Specifically, the TDCCE generates the model parameters for the conditional dependencies describing the virtual correlation channel between the original source X and the SI in the transform domain. This means that TDCCE is performed after motion compensated prediction at the decoder and is used to feed the SW decoder with the necessary soft-input information. On the contrary, the correlation estimated in the context of our MLMC (i) is performed prior to motion compensation, (ii) is set in the spatial domain, and (iii) serves as a means to obtain the statistical information required for MLMC.

3.2. Instantiating the correlation model

Denote by R_0, R_1 the discrete random variables representing the samples in the reconstructed past and future reference frames, respectively, while r_t, t ∈ {0, 1} designates the realizations in one of the reference frames, taking on values within the range [0, 2^K − 1], where K is the bit-depth of the original frame samples. In order to estimate p_{X|R_t}(x|r_t), we start by modelling the correlation f_{X|R_t}(x|r_t) between the samples in the original WZ frame and the reference frames R_t. Inspired by [12], f_{X|R_t}(x|r_t) is modelled by a conditional PDF, which is assumed to be Laplacian distributed, centred on r_t with standard deviation σ(r_t), and where the continuous random variable X represents the samples in the original WZ frame. In other words:

f_{X|R_t}(x|r_t) = 1/(√2 σ(r_t)) · e^(−√2 |x − r_t| / σ(r_t)).    (4)
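For reference, Eq. (4) transcribes directly into code (the function name is ours):

```python
import numpy as np

def laplacian_sid_pdf(x, r_t, sigma):
    """Conditional PDF of Eq. (4): a Laplacian centred on the reference
    sample r_t, parameterized by its standard deviation sigma(r_t)."""
    return (1.0 / (np.sqrt(2.0) * sigma)) * np.exp(
        -np.sqrt(2.0) * np.abs(x - r_t) / sigma)
```

Note that the scale parameter of the usual Laplacian density is σ/√2 when the distribution is written in terms of its standard deviation, which yields the √2 factors above.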

The parameters σ(r_t) can be estimated online, similar to the methodology postulated in [10]. We note that the algorithm presented in [10] is set in the transform domain, where the model parameters were successively refined during the SW decoding of the bitplanes of the quantization indices of the different frequency bands. Therefore, the algorithm has to be adapted to the current setting, in which the correlation needs to be estimated in the spatial domain. Moreover, in contrast to [10], there is no opportunity for refining the correlation model.

Extrapolating the principles of [10] to the pixel domain, the PDF f_{X|R_t}(x|r_t) could be estimated from the conditional PMF p_{X_b|R_t}(x_b|r_t), where the discrete random variable X_b represents a coarsely quantized version of the original WZ frame.

However, such a coarsely quantized version of the original WZ frame is only partially available to the decoder. In fact, the hash is the sole reliable source available at the decoder based on which one can gather statistics on the original WZ frame. Bear in mind that the hash is created by downsampling the original WZ frame by a factor ξ = 2^k, k ∈ ℕ, followed by coarse quantization retaining the b most significant bitplanes. That is, the hash contains values in the range [0, 2^b − 1]. Therefore, one can use the hash together with each of the reference frames to obtain an estimate of p_{X_b|R_t}(x_b|r_t), as explained below.

Notice that no viable upscaling method for the hash exists that can generate trustworthy quantization index samples at the missing pixel positions. Therefore, in our technique the reference frames R_t, t ∈ {0, 1} are first downsampled by a factor ξ, according to the same method as during the creation of the hash, yielding the downsampled frames R′_t. Then the correlation between the hash X′_b and the downsampled reference frames R′_t is estimated. As a remark, observe that the downsampling factor ξ influences the statistical support for the correlation estimation.

In the next step of our algorithm, the joint PMF p_{X′_b,R′_t}(x′_b, r′_t) is calculated by normalizing the joint histogram counting the occurrences of co-located pairs (x′_b, r′_t), where x′_b and r′_t take on values in the ranges [0, 2^b − 1] and [0, 2^K − 1], respectively. Then, the conditional PMFs p_{X′_b|R′_t}(x′_b|r′_t) are computed by dividing the joint PMF by the empirical marginal PMF p_{R′_t}(r′_t). The PMF p_{X′_b|R′_t}(x′_b|r′_t) serves as an approximation for the PMF corresponding to the non-downsampled versions of the frames, i.e., p_{X_b|R_t}(x_b|r_t).

We note that, due to downsampling, a particular realization r_t might have no equivalent r′_t in the downsampled reference frames R′_t. Hence, no empirical PMF conditioned on that particular realization of R_t can be derived. This is handled by assigning a maximal level of uncertainty on X_b to unobserved realizations of R_t, that is, a uniform distribution over all 2^b possible quantization indices in X_b.

Formally, the estimation of p_{X_b|R_t}(x_b|r_t) is given by:

p_{X_b|R_t}(x_b|r_t) = { p_{X′_b|R′_t}(x′_b|r′_t)  if r_t ∈ R′_t;  2^(−b)  otherwise,    (5)

with:

∑_{x_b=0}^{2^b−1} p_{X_b|R_t}(x_b|r_t) = 1,  ∀ r_t ∈ [0, …, 2^K − 1].    (6)

Similarly, a specific realization x_b in X_b might have been removed from the hash X′_b due to the downsampling process. However, as x_b ∈ [0, …, 2^b − 1] has a limited range of possible values due to coarse quantization (e.g. b = 2), the absence of a particular realization x_b is assumed to be attributable to its non-occurrence in X_b rather than to its deletion during downsampling.

Once the empirical conditional PMF p_{X_b|R_t}(x_b|r_t) is determined, the standard deviation σ(r_t) of the particular Laplacian distribution f_{X|R_t}(x|r_t) corresponding to the PMF p_{X_b|R_t}(x_b|r_t) is found by solving [10]:

p_{X_b|R_t}(x_b|r_t) − ∫_{q_l(x_b)}^{q_u(x_b)} f_{X|R_t}(x|r_t) dx = 0    (7)

where q_l(x_b) and q_u(x_b) are the respective lower and upper bounds of the quantization interval defined by quantization index x_b. The solution, along with the proof of existence and uniqueness, can be found in [10].
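Since the Laplacian mass over [q_l, q_u] is available in closed form via the CDF, Eq. (7) can be solved numerically. A bisection sketch, under the assumption that r_t lies inside the quantization interval, where the mass decreases monotonically with σ (function names and the bracketing bounds are ours; [10] gives the actual solution method):

```python
import math

def laplace_cdf(x, mu, sigma):
    """CDF of a Laplacian with mean mu and standard deviation sigma."""
    bscale = sigma / math.sqrt(2.0)
    if x < mu:
        return 0.5 * math.exp((x - mu) / bscale)
    return 1.0 - 0.5 * math.exp(-(x - mu) / bscale)

def solve_sigma(p_obs, r_t, q_l, q_u, lo=1e-3, hi=1e3, iters=100):
    """Bisection for Eq. (7): find sigma(r_t) such that the Laplacian mass
    over [q_l, q_u] equals the empirical PMF value p_obs."""
    def mass(sigma):
        return laplace_cdf(q_u, r_t, sigma) - laplace_cdf(q_l, r_t, sigma)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass(mid) > p_obs:
            lo = mid      # too much mass: the distribution must widen
        else:
            hi = mid
    return 0.5 * (lo + hi)
```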

3.3. Performing motion compensation

By assigning every predictor ψ_i to its equal value r_t in the corresponding reference frame, the conditional distribution f_{X|Ψ}(x|ψ_i) for every ψ_i is equal to the corresponding f_{X|R_t}(x|r_t), with σ(ψ_i) = σ(r_t), where the standard deviation σ(r_t) was found by solving Eq. (7). Note that these correlation models are continuous. However, in reality pixel values are discrete, taking on integer values in the range [0, 2^K − 1]. Therefore, we derive the conditional PMF p_{X|Ψ}(x|ψ_i) by integrating f_{X|Ψ}(x|ψ_i) over a unit interval centred on the integer values within the range [0, 2^K − 1]:

p_{X|Ψ}(x|ψ_i) = ∫_{m(x)}^{n(x)} f_{X|Ψ}(x|ψ_i) dx,    (8)

where x = 0, 1, …, 2^K − 1 and the integration bounds are given by Eqs (9) and (10):

m(x) = { x − 0.5  if x > 0;  −∞  if x = 0    (9)

n(x) = { x + 0.5  if x < 2^K − 1;  +∞  if x = 2^K − 1    (10)

Invoking Bayes' law, the value x_ML maximizing the likelihood function L_s(x) in Eq. (3) can be found as:

x_ML = argmax_x ∏_{i=0}^{M−1} [ p_{X|Ψ}(x|ψ_i) p_Ψ(ψ_i) / p_X(x) ]    (11)

where p_X(x) and p_Ψ(ψ_i) are prior PMFs. The posterior PMF p_{X|Ψ}(x|ψ_i) is given by Eq. (8).

It is clear from observing Eq. (11) that the factors p_Ψ(ψ_i) do not depend on x and therefore do not affect the maximization process. Assuming the decoder has no prior information on the random variable X, all realizations x are assumed to be equally likely, thereby rendering p_X(x) of no influence on the maximization process. Integrating these observations and assumptions in Eq. (11) yields:

x_ML = argmax_x ∏_{i=0}^{M−1} p_{X|Ψ}(x|ψ_i)    (12)

Finally, letting x_ML obtained for position s in the WZ frame serve as the SI value for that particular location yields y = x_ML.
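Putting Eqs (8)–(10) and (12) together, MLMC for one pixel position can be sketched as a brute-force search over the candidate intensities (names are illustrative; a practical decoder restricts the search space around the predictor statistics, as described in Section 4):

```python
import math

def mlmc(predictors, sigmas, K=8):
    """Sketch of Eqs (8) and (12): for each candidate intensity x,
    accumulate the log of the Laplacian mass on the unit interval around
    x for every temporal predictor psi_i, and return the maximizing x
    as the side information value."""
    def cdf(x, mu, sigma):                   # Laplacian CDF, std-dev form
        b = sigma / math.sqrt(2.0)
        if x < mu:
            return 0.5 * math.exp((x - mu) / b)
        return 1.0 - 0.5 * math.exp(-(x - mu) / b)

    best_x, best_ll = 0, -math.inf
    for x in range(1 << K):
        lo = -math.inf if x == 0 else x - 0.5            # m(x), Eq. (9)
        hi = math.inf if x == (1 << K) - 1 else x + 0.5  # n(x), Eq. (10)
        ll = 0.0
        for psi, sigma in zip(predictors, sigmas):
            lo_c = 0.0 if lo == -math.inf else cdf(lo, psi, sigma)
            hi_c = 1.0 if hi == math.inf else cdf(hi, psi, sigma)
            ll += math.log(max(hi_c - lo_c, 1e-300))     # log p(x | psi_i)
        if ll > best_ll:
            best_x, best_ll = x, ll
    return best_x
```

With identical σ for all predictors, the maximization behaves like a median of the predictors; a predictor with a large σ (weak correlation) is automatically down-weighted.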


4. Experimental results

4.1. Compression performance

The performance of the proposed transform domain WZ video coding scheme was compared against a germane collection of state-of-the-art video compression systems, including efficient DVC systems as well as low-complexity representatives of conventional video codecs.

The first system, the DISCOVER codec [3], is a well-established reference, delivering state-of-the-art compression performance in DVC. Similar to the proposed scheme, DISCOVER applies WZ coding in the transform domain but generates SI based on the latest MCI techniques [4,5]. The second reference codec is our spatial-domain unidirectional DVC (SDUDVC) scheme proposed in [13]. SDUDVC encodes the key frames of every GOP using H.264/AVC intra frame coding. The remaining frames in a GOP undergo coarse quantization, forming a hash which is coded with a basic entropy codec. At the decoder, hash-based OBME produces the reconstructed frames directly in the spatial domain. SDUDVC delivers respectable compression performance without a transform domain WZ codec or a feedback channel. In addition, we have included a comparison against several of our previous WZ architectures [11,23], all employing a WZ layer in the transform domain and advancing significantly over SDUDVC in terms of compression performance.

Besides these DVC systems, two low-complexity conventional coding schemes are retained as a reference. Both schemes are configurations of the state-of-the-art H.264/AVC codec and avoid computationally expensive motion estimation at the encoder. The first version, H.264/AVC intra, encodes each frame in a video sequence separately, solely exploiting the intra-frame spatial redundancies. H.264/AVC intra frame coding combines multi-hypothesis directional intra prediction followed by discrete cosine transformation of the prediction residuals with advanced context-based entropy coding, and is considered to be one of the most efficient intra frame coding schemes. H.264/AVC intra constitutes a well-established benchmark when evaluating DVC solutions [3,6,9,19].

In a second configuration, additional prediction features of the H.264/AVC coding standard are activated. Apart from exploiting the spatial correlation, H.264/AVC no motion also removes temporal redundancies by means of simple differential coding principles. H.264/AVC no motion generally outperforms H.264/AVC intra frame coding at the cost of slightly increased encoder complexity. However, it must be emphasized that both H.264/AVC configurations are significantly more complex than the assessed DVC solutions.

With regard to the evaluation of the proposed DVC scheme compared to H.264/AVC intra, H.264/AVC no motion and DISCOVER, experimental results are reported on the complete Foreman, Soccer, Carphone and Football sequences at QCIF resolution, a frame rate of 15 Hz and GOP sizes of 2, 4 and 8. Additionally, the results of our SDUDVC [13] on Foreman and Soccer have been included as well.

Concerning the configuration of the proposed DVC, the OBME module was configured with an overlap size ε = 2 and a block size of B = 16. The motion search was executed in an exhaustive manner within a search range of ±15 pixels at integer-pel accuracy. The hash was generated from the original WZ frames, which were downsampled by a factor ξ = 2 prior to quantization. A total number of b = 2 bitplanes were retained in the final hash. Both parameters ξ and b were chosen to obtain a balance between the accuracy of the motion estimation, the reliability of the spatial correlation estimation and the rate required to code the hash. Smaller b and larger ξ may result in a lower hash rate but lead to a poorer quality of the SI, since the motion estimation accuracy drops and the spatial domain correlation estimation algorithm no longer has the statistical support to generate faithful estimates. The MLMC was implemented numerically, where for every pixel position s the search space was shrunk to [μs − σs/2, μs + σs/2], with μs and σs denoting the respective mean and standard deviation of the appropriate set of temporal predictor pixels Ts = {ψ0, ψ1, . . ., ψM−1}.

Fig. 3. Compression performance of the proposed DVC architecture on the Foreman QCIF sequence at 15 Hz for GOP2 (a), GOP4 (b) and GOP8 (c); and on the Soccer QCIF sequence at 15 Hz for GOP2 (d), GOP4 (e) and GOP8 (f).

Figure 3 shows the compression performance of the proposed DVC with respect to the reference codecs for the Foreman (a–c) and Soccer (d–f) sequences. The Foreman sequence exhibits a reasonable amount of motion activity as well as intricate facial movements, while severe camera panning causes a complete scene change towards the end of the sequence. Comparing the compression performance of the different DVC systems for this sequence, the proposed DVC indisputably exhibits the best performance, with Bjøntegaard rate savings [8] of up to 25% with respect to DISCOVER in a GOP of 8. The hash-based SDUDVC fails to reach the efficiency level of both competitor DVC schemes. When comparing the DVC solutions against the traditional predictive H.264/AVC codec, the experimental results indicate that, for a GOP of 2, H.264/AVC no motion is the superior system in terms of compression efficiency. The performance of H.264/AVC intra is considerably lower, operating approximately on par with the proposed DVC and DISCOVER. When the GOP size is increased to 4, H.264/AVC no motion is no longer able to take advantage of the temporal correlation by means of low-complexity motion compensation measures and has to resort primarily to intra prediction, which roughly equalizes its performance with H.264/AVC intra. In some cases, H.264/AVC intra even slightly outperforms H.264/AVC no motion. This seems counterintuitive but results from the fact that intra macroblocks in B-slices cost more to code than in I-slices, due to slight differences in the entropy coding [17]. The convergence of H.264/AVC no motion to H.264/AVC intra with growing GOP size is regularly observed.
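The restricted numerical search described above can be sketched as follows. Note that this is an illustrative sketch, not the codec's implementation: the per-predictor Laplacian noise model and its scale parameters stand in for the online-estimated correlation statistics between the temporal predictors and the original frame, and the candidate count is an arbitrary choice.

```python
import numpy as np

def mlmc_pixel(predictors, scales, n_candidates=32):
    """Numerically maximize the joint likelihood of a pixel value given its
    temporal predictors, restricting the search to [mu - sigma/2, mu + sigma/2]
    as in the text. `scales` are illustrative per-predictor Laplacian scale
    parameters standing in for the online-estimated channel statistics."""
    predictors = np.asarray(predictors, dtype=float)
    mu, sigma = predictors.mean(), predictors.std()
    candidates = np.linspace(mu - sigma / 2.0, mu + sigma / 2.0, n_candidates)
    # Joint log-likelihood under independent Laplacian noise channels:
    # log L(y) = -sum_m |y - psi_m| / b_m  (constants dropped)
    log_lik = -np.abs(candidates[:, None] - predictors[None, :]) / np.asarray(scales, dtype=float)
    return candidates[np.argmax(log_lik.sum(axis=1))]
```

When all predictors agree, the maximizer is that common value; when the scale parameters differ, the result is pulled toward the predictors deemed more reliable, which is the intended effect of the maximum likelihood fusion.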


Fig. 4. Compression performance of the proposed DVC architecture on the Carphone QCIF sequence at 15 Hz for GOP2 (a), GOP4 (b) and GOP8 (c); and on the Football QCIF sequence at 15 Hz for GOP2 (d), GOP4 (e) and GOP8 (f).

The Soccer sequence contains a fragment of a soccer game recorded at medium to short range, resulting in rather complex motion content with a wide range of accelerations and frequent camera panning. It is well known that under such circumstances, WZ based systems are hard-pressed to create accurate SI and suffer a notable performance loss compared to both H.264/AVC intra and H.264/AVC no motion, as confirmed by the experimental results. The rate-distortion behaviour of the relevant codecs for the Soccer sequence is presented in Fig. 3(d–f) for GOP sizes of 2, 4 and 8, respectively. The proposed DVC solution manages to notably reduce the performance gap between classical video coding systems and DVC systems, particularly when the GOP size is large.

Figure 4 shows the compression performance obtained on the Carphone (a–c) and Football (d–f) sequences. Carphone is characterised by complex facial gestures and contains additional high motion restricted to the window area of the fast-moving car. The camera remains fixed throughout the entire sequence. H.264/AVC no motion outperforms the competition in terms of compression efficiency for all GOPs, while the compression performance of H.264/AVC intra is considerably lower. In fact, both the proposed DVC and DISCOVER even outperform H.264/AVC intra at low rates. In the medium rate region, the proposed WZ system maintains a performance similar to H.264/AVC intra, slightly losing ground as the rate increases further. In comparison to DISCOVER, the proposed DVC system attains Bjøntegaard rate savings [8] of up to 13.99% for a GOP size of 8.

Table 1. Bjøntegaard compression gains of the proposed DVC with respect to DISCOVER.

Table 2. Bjøntegaard compression gains of the proposed DVC compared to our previous hash-based DVC introduced in [23]. Negative values constitute a performance loss.

Table 3. Bjøntegaard compression gains of the proposed DVC with respect to our previous hash-based WZ architecture presented in [14].

Football is considered a hard-to-code sequence with a very high degree of motion as well as complex camera movements, as it trails several players during an offensive manoeuvre. Similar to the Soccer sequence, the characteristics of Football highly favour conventional H.264/AVC coding. Regarding the comparison between the different DVC systems, the proposed WZ codec consistently outperforms DISCOVER for all GOP sizes, with Bjøntegaard rate savings [8] of up to 13.68% for a GOP of 8.

Table 1 provides a comprehensive overview of the compression performance comparison between the proposed DVC and the state-of-the-art DISCOVER codec in terms of Bjøntegaard rate savings (%) and distortion reduction (dB) [8]. Table 1 clearly illustrates the proposed DVC's superior performance, in particular when the GOP size grows.

In the following, we compare the proposed WZ architecture with our previous hash-based DVC schemes. The first system, introduced in [23], employs a hash identical to the one used in the SDUDVC [13]. The SI is generated via OBME followed by probabilistic motion compensation, comprising advanced weighted averaging of the candidate predictors.

Table 2 shows the compression performance of the proposed DVC, compared to the architecture from [23], for the Foreman and Carphone sequences in a GOP of 2, 4 and 8. The results show that the proposed DVC suffers a performance loss in a GOP of 2, with Bjøntegaard [8] rate losses of 4.49% and 4.10% for Foreman and Carphone, respectively. However, the proposed DVC recovers for the larger GOP sizes, consistently outperforming [23]. This behaviour is explained by the nature of the hash in [23], which, unlike in the proposed system, is not a downsampled version of the bitplane hash from the SDUDVC. For [23] this (i) results in more accurate motion estimation and (ii) allows for coding only the difference between the hash and the original frame in a WZ manner. However, the bitrate overhead to code the hash is roughly ξ² times higher than the rate of the downsampled version, for a downsampling factor ξ. Such a rate overhead undermines the performance in the rate-distortion sense, in particular when the GOP size grows.

Table 4. The execution time (ms) of the proposed DVC for encoding the entire Foreman and Carphone sequences.
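The roughly ξ²-fold hash-rate overhead discussed above follows from a simple sample count, sketched below for a QCIF frame. The function name is hypothetical; only the frame dimensions and the downsampling factor ξ = 2 come from the text.

```python
def hash_sample_ratio(width, height, xi):
    """Ratio between the number of samples in a full-resolution hash and in a
    hash built from a frame downsampled by a factor xi in each dimension."""
    return (width * height) / ((width // xi) * (height // xi))

# For a QCIF frame (176x144) and xi = 2, the full-resolution hash carries
# xi^2 = 4 times as many samples as the downsampled hash.
ratio = hash_sample_ratio(176, 144, 2)
```

With the same number of retained bitplanes b per sample, the hash rate scales with the sample count, which is why the full-resolution hash of [23] costs roughly ξ² times the rate of the downsampled hash used in the proposed system.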

The second hash-based DVC is our architecture presented in [14], where the hash consists of a downscaled and low-quality H.264/AVC intra coded version of the WZ frames. The SI is created by means of OBME and the obtained candidate temporal predictors are merely averaged. Additionally, since the nature of the hash in [14] is different, the hash is truly upsampled at the decoder using a Lanczos interpolation filter [15] and the sum of absolute differences (SAD) error metric is used during OBME. This architecture was expanded in [11] by adding an SI refinement loop at the decoder. After the Slepian-Wolf decoding of all the bitplanes of the quantization indices in the DC band, the partially decoded WZ frame is transformed back to the spatial domain and the SI generation process is executed anew, yielding an improved SI frame. Adding SI refinement at the decoder significantly boosts the compression performance [11]. However, since the proposed DVC in this work does not feature any SI refinement at the decoder, we have only included the experimental results for the core architecture of [14] without SI refinement. Table 3 shows the compression gains of the proposed DVC with respect to [14], obtained on Foreman and Carphone. The results corroborate that the proposed scheme consistently outperforms [14] for all GOPs.
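An exhaustive SAD-driven block search of the kind used during OBME can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the overlapping-block bookkeeping and hash upsampling of the actual codecs are omitted, leaving just the integer-pel full search over a square range.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block_a.astype(np.int64) - block_b.astype(np.int64)).sum())

def best_match(hash_block, reference, top_left, search_range):
    """Exhaustive integer-pel search around top_left for the reference-frame
    block minimizing the SAD against the (upsampled) hash block.
    Returns the best motion vector (dy, dx) and its SAD cost."""
    y0, x0 = top_left
    b = hash_block.shape[0]
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            # Skip candidate blocks that fall outside the reference frame.
            if 0 <= y and 0 <= x and y + b <= reference.shape[0] and x + b <= reference.shape[1]:
                cost = sad(hash_block, reference[y:y + b, x:x + b])
                if cost < best_cost:
                    best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

In the configuration reported earlier, such a search would run with a block size of 16 and a search range of ±15 pixels; each matched block then contributes temporal predictors to every pixel it overlaps.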

4.2. Complexity evaluation

It must be strongly emphasized that the superior compression performance of H.264/AVC comes at a price. Although H.264/AVC intra and, to a lesser extent, H.264/AVC no motion are considered lightweight encoding configurations of the H.264/AVC standard, their computational complexity is a great deal higher than that of the evaluated DVC systems.

To support this statement, the encoder complexity of the proposed DVC is analysed according to the methodology followed in [9,19], where encoder execution times were compared. In this work, time measurements were carried out on an x86 machine with an Intel(R) Core(TM) i7 CPU running at 2.20 GHz and 16 GB of RAM. Our DVC was written in C++, compiled with Microsoft Visual Studio 2008 and run in release mode under Windows 7. We measured the execution times to encode the entire Foreman and Carphone sequences, at QCIF resolution and a frame rate of 15 Hz, using H.264/AVC intra and the proposed DVC. The quality parameter (QP), controlling the quantization level of the H.264/AVC encoder, was identical for the H.264/AVC intra coded sequences and the key frames of the WZ codec. The quantization matrices (QM), used to quantize the WZ frames in the transform domain, are equal to the QMs introduced in [2].

Table 4 shows the execution time of the proposed DVC, broken down into its prime components, that is, the encoding of the key frames, the LDPCA encoding of the WZ frames and the handling of the hash. As seen in Table 4, the majority of the encoding complexity of the proposed DVC scheme is due to the H.264/AVC intra coding of the key frames. In contrast to MCI-based DVC systems, a hash-based WZ codec allocates additional resources at the encoder to code the hash information. However, the added complexity of creating and coding the hash, comprising quantization, prediction, binarization and entropy coding, is fairly modest with respect to the combined resources required for key frame and WZ encoding.

For comparison, Table 5(a) contains the execution time for encoding Foreman and Carphone with H.264/AVC intra. Table 5(b) summarizes the execution time saved by adopting our hash-based DVC compared to H.264/AVC intra coding, displaying the ratio (%) of the total execution time of the proposed WZ scheme to that of H.264/AVC intra. It is clear that our WZ system brings significant savings, in particular for the larger GOP sizes, requiring only around 30% of the execution time of H.264/AVC intra for a GOP of 8.


Table 5. (a) The execution time (ms) of H.264/AVC intra for encoding the entire Foreman and Carphone sequences and (b) the ratio (%) between the proposed DVC and H.264/AVC intra.

5. Conclusions

This paper proposed a novel MLMC scheme in a hash-based transform domain WZ video coding architecture. Using a downsampled and coarsely quantized version of the original WZ frames as a hash, the proposed DVC generates a large collection of temporal predictors by means of OBME, from which a new multi-hypothesis probabilistic motion compensation technique generates accurate SI. To facilitate MLMC, the spatial domain conditional dependencies between the temporal predictors and the original source are estimated online using an adaptation of our previous algorithm for estimating the transform domain side-information-dependent (SID) correlation channel statistics [10].

Experimental results testify to the state-of-the-art distributed coding performance of the proposed WZ video coding architecture featuring the presented MLMC. The proposed system brings significant gains over the reference DISCOVER codec, especially for sequences containing complex motion patterns or when the distance to the reference frames used for temporal prediction is large. The proposed scheme considerably advances over our previous hash-based DVC solutions and further diminishes the existing performance gap between state-of-the-art transform domain WZ video codecs and traditional low-complexity predictive coding solutions. Taking full advantage of the potential of Wyner and Ziv's philosophy, the encoding complexity of the proposed system is reduced to as little as 30% of the complexity of the benchmark H.264/AVC intra codec.

References

[1] A. Aaron, S. Rane and B. Girod, Wyner-Ziv video coding with hash-based motion compensation at the receiver, in IEEE International Conference on Image Processing, ICIP, Singapore, 2004.

[2] A. Aaron, S. Rane, E. Setton and B. Girod, Transform-domain Wyner-Ziv codec for video, in SPIE Visual Communications and Image Processing Conference, VCIP, San Jose, CA, 2004.

[3] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov and M. Ouaret, The DISCOVER codec: Architecture, techniques and evaluation, in Picture Coding Symposium, PCS 2007, Lisbon, Portugal, 2007.

[4] J. Ascenso, C. Brites and F. Pereira, Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding, in 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic, 2005.

[5] J. Ascenso, C. Brites and F. Pereira, Content adaptive Wyner-Ziv video coding driven by motion activity, in IEEE International Conference on Image Processing, Atlanta, GA, 2006.

[6] J. Ascenso, C. Brites and F. Pereira, A flexible side information generation framework for distributed video coding, Multimedia Tools and Applications 48 (2010), 381–409.

[7] Z. Bankovic, J.M. Moya, A. Araujo, D. Fraga, J.C. Vallejo and J.M. de Goyeneche, Distributed intrusion detection system for wireless sensor networks based on a reputation system coupled with kernel self-organizing maps, Integrated Computer-Aided Engineering 17 (2010), 87–102.

[8] G. Bjøntegaard, Calculation of average PSNR differences between RD-curves, ITU-T Video Coding Experts Group (VCEG), Austin, TX, Document VCEG-M33, April 2001.

[9] C. Brites, J. Ascenso, J.Q. Pedro and F. Pereira, Evaluating a feedback channel based transform domain Wyner-Ziv video codec, Signal Processing: Image Communication 23 (April 2008), 269–297.

[10] N. Deligiannis, J. Barbarien, M. Jacobs, A. Munteanu, A. Skodras and P. Schelkens, Side-information-dependent correlation channel estimation in hash-based distributed video coding, IEEE Transactions on Image Processing 21 (April 2012), 1934–1949.

[11] N. Deligiannis, M. Jacobs, F. Verbist, J. Slowack, J. Barbarien, R. Van de Walle, P. Schelkens and A. Munteanu, Efficient hash-driven Wyner-Ziv video coding for visual sensors, in ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC, Ghent, Belgium, 2011.

[12] N. Deligiannis, A. Munteanu, T. Clerckx, J. Cornelis and P. Schelkens, On the side-information dependency of the temporal correlation in Wyner-Ziv video coding, in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, Taipei, Taiwan, 2009, pp. 709–712.

[13] N. Deligiannis, A. Munteanu, T. Clerckx, J. Cornelis and P. Schelkens, Overlapped block motion estimation and probabilistic compensation with application in distributed video coding, IEEE Signal Processing Letters 16 (September 2009), 743–746.

[14] N. Deligiannis, F. Verbist, J. Barbarien, J. Slowack, R. Van de Walle, P. Schelkens and A. Munteanu, Distributed coding of endoscopic video, in IEEE International Conference on Image Processing, Brussels, Belgium, 2011.

[15] C.E. Duchon, Lanczos filtering in one and two dimensions, Journal of Applied Meteorology 18 (August 1979), 1016–1022.

[16] B. Girod, A. Aaron, S. Rane and D. Rebollo-Monedero, Distributed video coding, Proceedings of the IEEE 93 (January 2005), 71–83.

[17] ITU-T and ISO/IEC JTC1, ITU-T Recommendation H.264, ISO/IEC 14496-10 (MPEG-4 AVC): Advanced Video Coding for Generic Audiovisual Services, May 2003.


[18] H. Mendonca, O. Vybornova, J.Y.L. Lawson and B. Macq, Multi-domain framework for multimedia archiving using multimodal interaction, Integrated Computer-Aided Engineering 18 (2011), 15–28.

[19] F. Pereira, J. Ascenso and C. Brites, Studying the GOP size impact on the performance of a feedback channel based Wyner-Ziv video codec, in IEEE Pacific-Rim Symposium on Image and Video Technology, Santiago, Chile, 2007.

[20] F. Pereira, L. Torres, C. Guillemot, T. Ebrahimi, R. Leonardi and S. Klomp, Distributed video coding: Selecting the most promising application scenarios, Signal Processing: Image Communication 23 (June 2008), 339–352.

[21] H. Schwarz, D. Marpe and T. Wiegand, Overview of the scalable video coding extension of the H.264/AVC standard, IEEE Transactions on Circuits and Systems for Video Technology 17 (September 2007), 1103–1120.

[22] D. Slepian and J.K. Wolf, Noiseless coding of correlated information sources, IEEE Transactions on Information Theory 19 (July 1973), 471–480.

[23] F. Verbist, N. Deligiannis, M. Jacobs, J. Barbarien, P. Schelkens and A. Munteanu, A statistical approach to create side information in distributed video coding, in ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC, Ghent, Belgium, 2011.

[24] M.J. Weinberger, G. Seroussi and G. Sapiro, The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS, IEEE Transactions on Image Processing 9 (August 2000), 1309–1324.

[25] T. Wiegand, G.J. Sullivan, G. Bjøntegaard and A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (July 2003), 560–576.

[26] A.D. Wyner and J. Ziv, The rate-distortion function for source coding with side information at the decoder, IEEE Transactions on Information Theory 22 (January 1976), 1–10.

[27] Z. Xiong, A. Liveris and S. Cheng, Distributed source coding for sensor networks, IEEE Signal Processing Magazine 21 (September 2004), 80–94.

[28] D. Varodayan, A. Aaron and B. Girod, Rate-adaptive codes for distributed source coding, Signal Processing 86 (2006), 3123–3130.