Streaming source coding with delay
Cheng Chang
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2007-164
http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-164.html
December 19, 2007
Copyright © 2007, by the author(s). All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Streaming Source Coding with Delay
by
Cheng Chang
M.S. (University of California, Berkeley) 2004
B.E. (Tsinghua University, Beijing) 2000
A dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Engineering - Electrical Engineering and Computer Sciences
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Anant Sahai, Chair
Professor Kannan Ramchandran
Professor David Aldous
Fall 2007
The dissertation of Cheng Chang is approved.
University of California, Berkeley
Fall 2007
Streaming Source Coding with Delay
Copyright © 2007
by
Cheng Chang
Abstract
Streaming Source Coding with Delay
by
Cheng Chang
Doctor of Philosophy in Engineering - Electrical Engineering and Computer Sciences
University of California, Berkeley
Professor Anant Sahai, Chair
Traditionally, information theory is studied in the block coding context — all the source
symbols are known in advance by the encoder(s). Under this setup, the minimum number of
channel uses per source symbol needed for reliable information transmission is understood
for many cases. This is known as the capacity region for channel coding and the rate region
for source coding. The block coding error exponent (the convergence rate of the error
probability as a function of block length) is also studied in the literature.
In this thesis, we consider the problem that source symbols stream into the encoder in
real time and the decoder has to make a decision within a finite delay on a symbol-by-symbol
basis. For a finite delay constraint, a fundamental question is how many channel uses per
source symbol are needed to achieve a certain symbol error probability. This question, to
our knowledge, has never been systematically studied. We answer the source coding side of
the question by studying the asymptotic convergence rate of the symbol error probability
as a function of delay — the delay constrained error exponent.
The technical contributions of this thesis include the following. We derive an upper
bound on the delay constrained error exponent for lossless source coding and show the
achievability of this bound by using a fixed-to-variable-length coding scheme. We then
extend the same treatment to lossy source coding with delay where a tight bound on the
error exponent is derived. Both delay constrained error exponents are connected to their
block coding counterparts by a “focusing” operator. An achievability result for lossless
multiple terminal source coding is then derived in the delay constrained context. Finally,
we borrow the genie-aided feed-forward decoder argument from the channel coding literature
to derive an upper bound on the delay constrained error exponent for source coding with
decoder side-information. This upper bound is strictly smaller than the error exponent for
source coding with both encoder and decoder side-information. This “price of ignorance”
phenomenon only appears in the streaming with delay setup but not the traditional block
coding setup.
These delay constrained error exponents for streaming source coding are
generally different from their block coding counterparts. This difference has
also been recently observed in the channel coding with feedback literature.
Professor Anant Sahai
Dissertation Committee Chair
Contents
List of Figures v
List of Tables viii
Acknowledgements ix
1 Introduction 1
1.1 Sources, channels, feedback and side-information . . . . . . . . . . . . . . . 3
1.2 Information, delay and random arrivals . . . . . . . . . . . . . . . . . . . . 6
1.3 Error exponents, focusing operator and bandwidth . . . . . . . . . . . . . . 9
1.4 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Lossless source coding with delay . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Lossy source coding with delay . . . . . . . . . . . . . . . . . . . . . 13
1.4.3 Distributed lossless source coding with delay . . . . . . . . . . . . . 14
1.4.4 Lossless Source Coding with Decoder Side-Information with delay . 15
2 Lossless Source Coding 16
2.1 Problem Setup and Main Results . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Source Coding with Delay Constraints . . . . . . . . . . . . . . . . . 16
2.1.2 Main result of Chapter 2: Lossless Source Coding Error Exponent with Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Comparison of error exponents . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Non-asymptotic results: prefix-free coding of length 2 and queueing delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 First attempt: some suboptimal coding schemes . . . . . . . . . . . . . . . . 28
2.3.1 Block Source Coding’s Delay Performance . . . . . . . . . . . . . . . 28
2.3.2 Error-free Coding with Queueing Delay . . . . . . . . . . . . . . . . 29
2.3.3 Sequential Random Binning . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Proof of the Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.1 Achievability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.1 Properties of the delay constrained error exponent Es(R) . . . . . . 46
2.5.2 Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . 52
3 Lossy Source Coding 54
3.1 Lossy source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.1 Lossy source coding with delay . . . . . . . . . . . . . . . . . . . . . 55
3.1.2 Delay constrained lossy source coding error exponent . . . . . . . . . 56
3.2 A brief detour to peak distortion . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Peak distortion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.2 Rate distortion and error exponent for the peak distortion measure . 57
3.3 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4.1 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4.2 Achievability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 Discussions: why peak distortion measure? . . . . . . . . . . . . . . . . . . 63
4 Distributed Lossless Source Coding 65
4.1 Distributed source coding with delay . . . . . . . . . . . . . . . . . . . . . . 65
4.1.1 Lossless Distributed Source Coding with Delay Constraints . . . . . 66
4.1.2 Main result of Chapter 4: Achievable error exponents . . . . . . . . 69
4.2 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 Example 1: symmetric source with uniform marginals . . . . . . . . 73
4.2.2 Example 2: non-symmetric source . . . . . . . . . . . . . . . . . . . 74
4.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 ML decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.2 Universal Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Lossless Source Coding with Decoder Side-Information 90
5.1 Delay constrained source coding with decoder side-information . . . . . . . 90
5.1.1 Delay Constrained Source Coding with Side-Information . . . . . . . 92
5.1.2 Main results of Chapter 5: lower and upper bounds on the error exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Two special cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Independent side-information . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Delay constrained encryption of compressed data . . . . . . . . . . . 96
5.3 Delay constrained Source Coding with Encoder Side-Information . . . . . . 99
5.3.1 Delay constrained error exponent . . . . . . . . . . . . . . . . . . . . 100
5.3.2 Price of ignorance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Special case 1: independent side-information . . . . . . . . . . . . . 103
5.4.2 Special case 2: compression of encrypted data . . . . . . . . . . . . . 103
5.4.3 General cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.1 Proof of Theorem 6: random binning . . . . . . . . . . . . . . . . . . 108
5.5.2 Proof of Theorem 7: Feed-forward decoding . . . . . . . . . . . . . . 110
5.6 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6 Future Work 118
6.1 Past, present and future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.1 Dominant error event . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.2 How to deal with both past and future? . . . . . . . . . . . . . . . . 120
6.2 Source model and common randomness . . . . . . . . . . . . . . . . . . . . 122
6.3 Small but interesting problems . . . . . . . . . . . . . . . . . . . . . . . . . 123
Bibliography 124
A Review of fixed-length block source coding 131
A.1 Lossless Source Coding and Error Exponent . . . . . . . . . . . . . . . . . . 131
A.2 Lossy source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.2.1 Rate distortion function and error exponent for block coding under average distortion . . . . . . . . . . . . . . . . . . . . . . . 133
A.3 Block distributed source coding and error exponents . . . . . . . . . . . . . 135
A.4 Review of block source coding with side-information . . . . . . . . . . . . . 137
B Tilted Entropy Coding and its delay constrained performance 140
B.1 Tilted entropy coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2 Error Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
B.3 Achievability of delay constrained error exponent Es(R) . . . . . . . . . . . 143
C Proof of the concavity of the rate distortion function under the peak distortion measure 145
D Derivation of the upper bound on the delay constrained lossy source coding error exponent 146
E Bounding source atypicality under a distortion measure 149
F Bounding individual error events for distributed source coding 152
F.1 ML decoding: Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . 152
F.2 Universal decoding: Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . 155
G Equivalence of ML and universal error exponents and tilted distributions 158
G.1 Case 1: γH(p_{x|y}) + (1− γ)H(p_{xy}) < R(γ) < γH(p^1_{x|y}) + (1− γ)H(p^1_{xy}) . . . 160
G.2 Case 2: R(γ) ≥ γH(p^1_{x|y}) + (1− γ)H(p^1_{xy}) . . . . . . . . . . . . . . . . . . 165
G.3 Technical Lemmas on tilted distributions . . . . . . . . . . . . . . . . . . . 167
H Bounding individual error events for source coding with side-information 175
H.1 ML decoding: Proof of Lemma 13 . . . . . . . . . . . . . . . . . . . . . . . 175
H.2 Universal decoding: Proof of Lemma 14 . . . . . . . . . . . . . . . . . . . . 177
I Proof of Corollary 1 179
I.1 Proof of the lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
I.2 Proof of the upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
J Proof of Theorem 8 181
J.1 Achievability of Eei(R) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
J.2 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
List of Figures
1.1 Information theory in 50 years . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Shannon’s original problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Separation, feedback and side-information . . . . . . . . . . . . . . . . . . 4
1.4 Noiseless channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Delay constrained source coding for streaming data . . . . . . . . . . . . . . 7
1.6 Random arrivals of source symbols . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Block length vs decoding error . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8 Delay vs decoding error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.9 Focusing bound vs sphere packing bound . . . . . . . . . . . . . . . . . . . 11
2.1 Timeline of source coding with delay . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Timeline of source coding with universal delay . . . . . . . . . . . . . . . . 18
2.3 Source coding error exponents . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Ratio of error exponent with delay to block coding error exponent . . . . . 24
2.5 Streaming prefix-free coding system . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Transition graph of random walk Bk for source distribution . . . . . . . . . 26
2.7 Error probability vs delay (non-asymptotic results) . . . . . . . . . . . . . . 27
2.8 Block coding’s delay performance . . . . . . . . . . . . . . . . . . . . . . . . 28
2.9 Error free source coding for a fixed-rate system . . . . . . . . . . . . . . . . 29
2.10 Union bound of decoding error . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.11 A universal delay constrained source coding system . . . . . . . . . . . . . 41
2.12 Plot of GR(ρ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.13 Derivative of Es(R) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1 Timeline of lossy source coding with delay . . . . . . . . . . . . . . . . . . . 55
3.2 Peak distortion measure and valid reconstructions under different D . . . . 58
3.3 R−D curve under peak distortion constraint is a staircase function . . . . 60
3.4 Lossy source coding error exponent . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 A delay optimal lossy source coding system. . . . . . . . . . . . . . . . . . . 61
4.1 Slepian-Wolf source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Timeline of delay constrained source coding with side-information . . . . . 66
4.3 Rate region for the example 1 source . . . . . . . . . . . . . . . . . . . . . 73
4.4 Error exponents plot: ESW,x(Rx, Ry) plotted for Ry = 0.71 and Ry = 0.97 . 75
4.5 Rate region for the example 2 source . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Error exponents plot for source x for fixed Ry as Rx varies . . . . . . . . . 77
4.7 Error exponents plot for source y for fixed Ry as Rx varies . . . . . . . . . 78
4.8 Two dimensional plot of the error probabilities p_n(l, k), corresponding to error events (l, k), contributing to Pr[x^{n−∆}_1(n) ≠ x^{n−∆}_1] in the situation where inf_{γ∈[0,1]} E_y(R_x, R_y, ρ, γ) ≥ inf_{γ∈[0,1]} E_x(R_x, R_y, ρ, γ) . . . . . 82
4.9 Two dimensional plot of the error probabilities p_n(l, k), corresponding to error events (l, k), contributing to Pr[x^{n−∆}_1(n) ≠ x^{n−∆}_1] in the situation where inf_{γ∈[0,1]} E_y(R_x, R_y, γ) < inf_{γ∈[0,1]} E_x(R_x, R_y, γ) . . . . . . . . 83
4.10 2D interpretation of the score of a sequence pair . . . . . . . . . . . . . . . 87
5.1 Lossless source coding with decoder side-information . . . . . . . . . . . . . 91
5.2 Lossless source coding with side-information: DMC . . . . . . . . . . . . . . 91
5.3 Timeline of delay constrained source coding with side-information . . . . . 92
5.4 Encryption of compressed data with delay . . . . . . . . . . . . . . . . . . . 97
5.5 Compression of encrypted data with delay . . . . . . . . . . . . . . . . . . . 98
5.6 Information-theoretic model of compression of encrypted data . . . . . . . . 98
5.7 Lossless source coding with both encoder and decoder side-information . . . 99
5.8 Timeline of Delay constrained source coding with encoder/decoder side-information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.9 Delay constrained error exponents for source coding with independent decoder side-information . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.10 A binary stream cipher for source {ε, 1− ε} . . . . . . . . . . . . . . . 104
5.11 Delay constrained error exponents for uniform source with symmetric decoder side-information (symmetric channel) . . . . . . . . . . . . . 105
5.12 Uniform source and side-information connected by a symmetric erasure channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.13 Delay constrained error exponents for uniform source with symmetric decoder side-information (erasure channel) . . . . . . . . . . . . . . . 106
5.14 Delay constrained error exponents for general source and decoder side-information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.15 Feed-forward decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1 Past, present and future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 Dominant error event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.1 Block lossless source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.2 Block lossy source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.3 Achievable region for Slepian-Wolf source coding . . . . . . . . . . . . . . . 138
J.1 Another FIFO variable length coding system . . . . . . . . . . . . . . . . . 182
List of Tables
1.1 Delay constrained information theory problems . . . . . . . . . . . . . . . . 8
5.1 A non-uniform source with uniform marginals . . . . . . . . . . . . . . . . 106
6.1 Type of dominant error events for different coding problems . . . . . . . . . 120
Acknowledgements
I would like to thank my thesis advisor, Professor Anant Sahai, for being a great source
of ideas and inspiration through my graduate study. His constant encouragement and
support made this thesis possible. I am truly honored to be Anant’s first doctoral student.
Professors and post-docs in the Wireless Foundations Center have been extremely helpful and inspirational throughout my graduate study. Professor Kannan Ramchandran’s lectures in EE225 turned my research interest from computer vision to fundamental problems
in signal processing and communication. Professor Venkat Anantharam introduced me to
the beauty of information theory, and Professor Michael Gastpar served on my qualification
exam committee. For more than a year Dr. Stark Draper and I worked on the problem
that built the foundations of this thesis. I appreciate the help from all of them.
I would like to thank all my colleagues in the Wireless Foundations Center for making
my life in Cory enjoyable. Particularly, my academic kid brothers Rahul Tandra, Mubaraq
Mishra, Hari Palaiyanur and Pulkit Grover shouldered my workload in group meetings.
Dan Schonberg captained our softball team for six years. My cubicle neighbors, Vinod
Prabhakaran and Jonathan Tsao, were always ready to talk to me on a wide range of
topics. I am also grateful to our wonderful EECS staff members Amy Ng and Ruth Gjerde
for making my life a lot easier.
My years at Berkeley have been a wonderful experience. I would like to thank all my
friends for all those nights out, especially Keda Wang, Po-lung Chia, Will Brett, Beryl
Huang, Jing Xiang and Jiangang Yao for being the sibling figures that I never had.
My parents have been extremely supportive through all these years; I hope I have made you
proud. Last but not least, I would like to thank the people of Harbin, China and Berkeley,
USA, for everything.
Chapter 1
Introduction
In 1948 Claude Shannon published A Mathematical Theory of Communication [72],
which “single-handedly started the field of information theory” [82]. 60 years later, what
Shannon started in that paper has evolved into what is shown in Figure 1.1. Some of the
most important aspects of information theory are illustrated in this picture such as sources,
noisy channels, universality, feedback, security, side-information, interaction between mul-
tiple agents etc. For trained eyes, it is easy to point out some classical information theory
problems such as lossless/lossy source coding [72, 74], channel coding [72], Slepian-Wolf
source coding [79], multiple-access channels [40], broadcast channels [37, 24], arbitrarily
varying channels [53], relay channels [25] and control over noisy channels [69] as subproblems of Figure 1.1. Indeed we can come up with some novel problems by just changing the
connections in Figure 1.1, for example a problem named “On the security of joint multiple
access channel/correlated source coding with partial channel information and noisy feedback” is well within the scope of that picture. And this may have already been published in the
literature.
In this thesis, we study information theory along another dimension: delay. Traditionally, the issue of delay is ignored, as in most information theory problems a message which contains all the information is known at the encoder prior to the whole communication task. Alternatively, the issue of delay is studied in a strict real-time setup such as [58]. We instead
study the intermediate regime where a finite delay for every information symbol is part of
the system design criteria. Related works are found in the networking literature on delay
and protocol [38] and recently on implications of finite delay on channel coding [67]. In
this thesis, we focus on the implications of end-to-end delay on a variety of source coding
problems.
[Figure: sources A and B feed encoders A and B into noisy (memoryless, time-varying, ...) channels, with decoders 1 and 2 delivering to sinks 1 and 2; the diagram also shows feedback, “side-information” on source B or the channel, channel information, control, correlation, and an eavesdropper]
Figure 1.1. Sources, channels, feedback, side-information and eavesdroppers
1.1 Sources, channels, feedback and side-information
Instead of looking at information theory as a big complicated system illustrated in
Figure 1.1, we focus on a discrete-time system that consists of five parts, as shown in
Figure 1.2. We study the source model and the channel model in their simplest forms.
A source is a series of independent identically distributed random variables on a finite
alphabet. A noisy channel is characterized by a probability transition matrix with finite
input and finite output alphabets. The encoder is a map from a realization of the source to
channel inputs and the decoder is a map from the channel outputs to the reconstruction of
the source.
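As a concrete sketch of these definitions (a toy simulation of my own; the alphabet sizes, source distribution and channel matrix below are assumptions for illustration, not anything prescribed by the thesis):

```python
import random

def iid_source(p, n, rng):
    """Draw n symbols iid from distribution p over the alphabet {0, ..., len(p)-1}."""
    return rng.choices(range(len(p)), weights=p, k=n)

def dmc(inputs, W, rng):
    """Pass each input symbol through a memoryless channel with transition
    matrix W, where W[x][y] = Pr(output = y | input = x)."""
    return [rng.choices(range(len(W[x])), weights=W[x], k=1)[0] for x in inputs]

rng = random.Random(0)
p = [0.7, 0.3]                    # iid source distribution (assumed)
W = [[0.9, 0.1], [0.1, 0.9]]      # binary symmetric channel, crossover 0.1 (assumed)
x = iid_source(p, 10, rng)        # a realization of the source
y = dmc(x, W, rng)                # the corresponding channel outputs
```

The encoder and decoder of Figure 1.2 would sit between these two maps, turning source realizations into channel inputs and channel outputs into a reconstruction.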
[Figure: Source → Encoder → Noisy Channel → Decoder → Reconstruction]
Figure 1.2. Shannon’s original problem in [72]
This is the original system setup in Shannon’s 1948 paper [72]. One of the most im-
portant results in that paper is the separation theorem. The separation theorem states
that one can reconstruct the source with high fidelity (defined here as lossless; the lossy version, reconstruction under a distortion measure, is discussed in Chapter 3) if and only if the capacity of the
channel is at least as large as the entropy of the source. The capacity of the channel is the
number of independent bits that can be reliably communicated across the noisy channel per
channel use. The entropy of the source is the number of bits needed on average to describe
the source. Both the entropy of the source and the capacity of the channel are defined
in terms of bits. And hence the separation theorem guarantees the validity of the digital
interface shown in Figure 1.3. In this system setup, where the channel coding and source
coding are two separate parts, the reconstruction has high fidelity as long as the entropy
of the source is less than the capacity of the channel. This theorem, however, does not
state that separate source-channel coding is optimal under every criterion. For example, the
large deviation performance of the joint source channel coding system is strictly better than
separate coding even in the block coding case [27]. However, to simplify the problem, we
study source coding and channel coding separately. In this thesis, we focus on the source
coding problems.
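As a small numeric sketch of the separation condition (the source and channel below are my own assumptions, not the thesis's): for a binary symmetric channel, the capacity has the standard closed form C = 1 − h(ε), and separation says reliable reconstruction needs at least H/C channel uses per source symbol.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in p if q > 0)

def bsc_capacity(eps):
    """Capacity in bits per channel use of a binary symmetric channel
    with crossover probability eps: C = 1 - h(eps)."""
    return 1 - entropy([eps, 1 - eps])

H = entropy([0.5, 0.25, 0.25])    # source entropy: 1.5 bits/symbol (assumed source)
C = bsc_capacity(0.1)             # capacity: about 0.53 bits/use (assumed channel)
k_min = H / C                     # minimum channel uses per source symbol
```

Here roughly three channel uses per source symbol are needed; any fixed rate above k_min permits reliable reconstruction, any rate below it does not.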
Figure 1.3 is the system we are concerned with. It illustrates the separation of source
coding and channel coding and two other important elements in information theory: channel
feedback and side-information of the source. It is well known that feedback does not increase
the channel capacity [26] for memoryless channels. However, as shown in [61, 47, 85, 70, 67,
68] and numerous other papers in the literature, channel feedback can be used to reduce
complexity and increase reliability.
Another important aspect we are going to explore is source side-information. Intuitively
speaking, knowing some information about the source for free can only help the decoder
to reconstruct the source. In the system model shown in Figure 1.3, the source is modeled as iid random variables, the noisy channel is a discrete-time memoryless channel, and the side-information is another iid random sequence which is correlated with the source in a
memoryless fashion. The problem is reduced to a standard source coding problem if both
the encoder and the decoder have the side-information. In [79], it is shown that with decoder
only side-information the source coding system can operate at the conditional entropy rate
instead of the generally higher entropy rate of the source.
[Figure: source encoder → bits → channel encoder → noisy channel → channel decoder → bits → source decoder → reconstruction, with channel feedback to the channel encoder and source side-information at the source decoder]
Figure 1.3. Separation, channel feedback and source side-information
By replacing the noisy channel in Figure 1.3 with a noiseless channel through which R
bits can be instantaneously communicated, we have a source coding system as shown in
Figure 1.4. In the main body of this thesis, Chapters 2-4, we study the model shown in
Figure 1.4.
[Figure: Source → Encoder → bits → Noiseless Channel (rate R) → Decoder → Source Reconstruction, with side-information at the decoder]
Figure 1.4. Source coding problem
1.2 Information, delay and random arrivals
A very important aspect that is missing in the model shown in Figure 1.3 is the timing
of the source symbols and the channel uses. In most information theory problems, the
source sequence is assumed to be ready for the source encoder before the beginning of compression, and hence the source encoder has full knowledge of the entire sequence.
Similarly, the channel encoder has the message represented by a binary string before the
beginning of communication and thus the channel outputs depend on every bit in that
message. This is a reasonable assumption for many applications. For example this is a
valid model for the transmission of Kong Zi (Confucius)’s books to another planet. However,
there are a wide range of problems that cannot be modeled that way. Recent results in the
interaction of information theory and control [69] show that we need to study the whole
information system in an anytime formulation. It is also shown in that paper that the
traditional notion of channel capacity is not enough. On the source coding side, modern
communication applications such as video conferencing [54, 48] require short delay between
the generation of the source and the consumption by the end user. Instead of transmitting
Kong Zi’s books, we are facing the problem of transmitting Wang Shuo [84]’s writings to
our impatient audience on another planet, while Wang Shuo only types one character per
second in his usual unpredictable “I’m your daddy” [78] manner.
In this thesis, we study delay constrained coding for streaming data, where a finite
delay is required between the realization time of a source symbol (a random variable in a
random sequence) and the consumption time of the source symbol at the sink. Without
loss of generality, the source generates one source symbol per second and within a finite
delay ∆ seconds, the decoder consumes that source symbol as shown in Figure 1.5. We
assume the source is iid. Implicitly, this model states that the reconstruction of the source
symbol at the decoder is only dependent on what the decoder receives until ∆ seconds after
the realization of that source symbol. This is in line with the notion of causality defined
in [58]. The formal setup is in Section 2.1.1. In this thesis, we freely interchange “coding
with delay” with “delay constrained coding”.
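In symbols, the setup above can be sketched as follows (the notation here is my own shorthand; the precise definitions are in Section 2.1.1):

```latex
% Sketch of the delay-constrained setup (notation assumed, not verbatim from the thesis).
% By time i + \Delta the decoder has seen the first \lfloor (i+\Delta) R \rfloor encoded
% bits, and must commit to an estimate \hat{x}_i of the symbol realized at time i:
\hat{x}_i = \hat{x}_i\!\left(b_1, \ldots, b_{\lfloor (i+\Delta) R \rfloor}\right).
% The delay constrained error exponent E_s(R) then describes the exponential decay
% of the symbol error probability in the delay \Delta:
\Pr\left[\hat{x}_i \neq x_i\right] \approx 2^{-\Delta E_s(R)}.
```

The key contrast with block coding is that the exponent is measured per unit of delay ∆, not per unit of block length.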
Another timing issue is the randomness in the arrival times of the information (source
symbols) to the encoder. As shown in Figure 1.6, Wang Shuo types a character in a random
fashion. The readers may not care about the random inter-arrivals between his typing. But
with a finite delay, a positive rate (protocol information) has to be used to encode the
timing information [38]. If the readers care about the arrival time, the encoding of the timing information might pose new challenges to the coding system if the inter-arrival time is unbounded. We only have a partial result for geometrically distributed arrival times. We are currently working on Poisson random arrivals, but neither of these results is included in this thesis. Another interesting case is when the channel is a timing channel; then Wang Shuo can send messages by properly choosing his typing times [3].

Figure 1.5. Wang Shuo types one character per second while impatient readers want to read his writings within a finite delay. The rightmost character is the first in “I’m your daddy” [78]; read from right to left.
Figure 1.6. Wang Shuo decides to type in a pattern with random inter-character time. The rightmost character is the first in “I’m your daddy” [78]; read from right to left.
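To see concretely why timing can cost rate, here is a sketch of my own (the geometric model is an assumption matching the partial result mentioned above): if a symbol arrives in each time slot independently with probability q, the inter-arrival time T is Geometric(q), and describing the arrival times losslessly costs at least H(T) = h(q)/q extra bits per source symbol.

```python
from math import log2

def binary_entropy(q):
    """Binary entropy h(q) in bits."""
    return -q * log2(q) - (1 - q) * log2(1 - q)

def geometric_entropy(q):
    """Entropy in bits of T ~ Geometric(q) on {1, 2, ...}: H(T) = h(q)/q."""
    return binary_entropy(q) / q

# With an assumed arrival probability q = 0.5 per slot, the timing
# (protocol) information costs H(T) = 2 bits per source symbol.
overhead = geometric_entropy(0.5)
```

Note that as q shrinks (rarer, burstier arrivals) this overhead per symbol grows without bound, which is why unbounded inter-arrival times pose new challenges.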
We focus on the case where source symbols come to the encoder at a constant rate (no
random arrivals). The problems of interest are summarized in Table 1.1. Each problem is
“coded” by a binary string of length six. A fundamental question is how these six elements
affect the coding problem. This is not fully understood. A brief discussion is given at
the end of the thesis. There are quite a few (≤ 2^6) interesting problems in the table.
For example, the ultimate problem in the table is 111111: “Delay constrained joint source
channel Wyner-Ziv coding with feedback and random arrivals”. To our knowledge, nobody
has published anything about this before. There are many other open questions dealing
with the timing issues of the communication system as shown in the table. We leave them
to future researchers to explore.
              Random   Non-uniform   Noisy     Feedback   Side-         Distortion
              arrival  source        channel              information
Chapter 2        0          1           0          0           0             0
Chapter 3        0          1           0          0           0             1
Chapter 5        0          1           0          0           1             0
Ongoing work     1          1           0          0           0             0
[15]             0          1           1          0           0             0
[67]             0          0           1          1           0             0
Ultimate         1          1           1          1           1             1
Table 1.1. Delay constrained information theory problems
1.3 Error exponents, focusing operator and bandwidth
From the perspective of the large deviation principle [31], an error exponent is the (normalized) logarithm of the probability of the exceptionally rare behavior of the source or channel. This error
exponent result is especially useful when the source entropy is much lower than the channel
capacity. The channel capacity and the source entropy rate characterize the behavior of the
source and channel in the regime of the central limit theorem. And hence error exponent
results always play second fiddle to channel capacity and source entropy rate. Recently, we
studied the third order problem— redundancy rate in the block coding setup [21, 19]. This
is particularly interesting when the channel capacity is close to the source entropy.
In Shannon’s original paper on information theory [72], he studied the source coding and
channel coding problems from the perspective of the minimum average number of channel
uses needed to communicate one source symbol reliably. The setting is asymptotic in nature:
long block lengths, where the number of source symbols is large. Reliable communication means
an arbitrarily small decoding error. However, it is not clear how fast the error converges to
zero as the block length grows. Later, researchers studied this convergence rate, and
it was shown that the error converges to zero exponentially fast in the block length as long
as there is some redundancy in the system, i.e. the channel capacity is strictly higher than the
entropy rate of the source. This exponent is defined as the error exponent, as shown in
Figure 1.7. For channel coding, a lower bound and an upper bound on this error exponent
were derived in some of the early works [75, 76] and [41]. A very nice upper bound derivation
appears in Gallager’s technical report [35]. The lossless source coding error exponent is
completely identified in [29]; the lossy source coding error exponent is studied in [55]. The
joint source channel coding error exponent was first studied in [27].
In the delay constrained setup of streaming source coding and channel coding problems,
we study the convergence rate of the symbol error probability. As shown in Figure 1.8,
the error probabilities go to zero exponentially fast with delay. We ask the fundamental
question: does delay play the same role in delay constrained streaming coding that block
length plays in classical fixed-length block coding?

In [67], Sahai first answered the question for streaming channel coding. It is shown
that without feedback, the delay constrained error exponent does not beat the block coding
upper bound. More importantly, it is shown that, with feedback, the delay constrained
error exponent is higher than its block coding counterpart, as illustrated in Figure 1.9. This
new exponent is termed the “focusing” bound.
Figure 1.7. Decoding error converges to zero exponentially fast with block length given system redundancy. The slope of the curve is the block coding error exponent.
In this thesis, we answer the question for source coding. For both lossless source coding
and lossy source coding with delay, we show that the delay constrained source coding error
exponents are higher than their block coding counterparts in Chapter 2 and Chapter 3
respectively. Similar to channel coding, the delay constrained error exponent and the block
coding error exponent are connected by a “focusing” operator.
Edelay(R) = inf_{α>0} (1/α) Eblock((1 + α)R)     (1.1)

where Edelay(R) and Eblock(R) are the rate-R error exponents of delay constrained and
fixed-length block coding respectively. The conceptual information-theoretic explanation of
this operator is that the coding system can borrow some resources (channel uses) from the
future to deal with the delay constrained dominant error events. This is not the case for
block coding: for fixed-length block coding, the amount of resources is fixed before the
realization of the randomness of the source or the channel.
This thesis is an information-theoretic piece of work, but we also care about applications
(non-asymptotics). Now suppose Wang Shuo’s impatient readers want to know what Wang
Shuo typed ∆ = 20 seconds ago, but they can tolerate an average decoding error of
Pe = 0.1%. A natural system design question is: what is the sufficient and necessary bandwidth
R? We give a one-line derivation here and will not further study the design issue in this
thesis. The error probability decays to zero exponentially with delay:

Pe ≈ 2^{−∆ Edelay(R)}     (1.2)
Because Edelay(R) is monotonically increasing with R, the sufficient and necessary condition
Figure 1.8. Decoding error converges to zero exponentially fast with delay given system redundancy. The slopes of the curves are the delay constrained error exponents.
for pleasing our impatient readers is thus:

R ≈ Edelay^{−1}( −(1/∆) log(Pe) )     (1.3)
An explicit lower bound on R can be trivially derived from (1.3), our recent work on the
block coding redundancy rate problems [21, 19] and a manipulation of the focusing operator
in (1.1). It is left for future researchers.
Figure 1.9. Focusing bound vs sphere packing bound for a binary erasure channel with erasure rate 0.05
1.4 Overview of the thesis
As discussed in previous sections, there is a huge number of information-theoretic problems
that we are interested in. In this thesis we focus only on source coding problems with an
end-to-end decoding delay. The technical results of the thesis are developed in Chapters 2-5.
The structure of every chapter roughly follows the same flow: problem setup → main
theorems → examples → proofs → discussion and ideas for future work. Section 2.1 serves
as the foundation of the thesis; formal definitions of streaming data and of the delay constrained
encoder and decoder are introduced there. Every chapter is mostly presented in a self-contained
fashion.
The organization of the thesis is as follows. We first study the simplest and the most
fundamental problem of all, delay constrained lossless source coding in Chapter 2. In
Chapter 3, we add a distortion measure to the coding system to explore new aspects of delay
constrained problems and to give a more general proof. Then in Chapter 4, the achievability
(a lower bound on the delay constrained error exponent) of the distributed source coding
problem is studied; no general upper bound is known yet. However, in Chapter 5, we give
an upper bound on the delay constrained error exponent for source coding with decoder
side-information, which is a special case of distributed source coding.
1.4.1 Lossless source coding with delay
In Chapter 2, we study the lossless source coding with delay problem. This is the
simplest problem of all in terms of problem setup. Some of the most important concepts
of this thesis are introduced in this chapter. There is one source that generates one source
symbol per second and the encoder can send R bits per second to the decoder. The decoder
wants to recover every source symbol within a finite delay from when the symbol enters the
encoder. We define the delay constrained error exponent Es(R) as the exponential rate at
which the decoding error decays to zero with delay. The delay constrained error exponent
is the main object we study in this thesis.
An upper bound on this error exponent is derived by a “focusing” bound argument. The
key step is to translate the symbol error with delay to the fixed length block coding error.
From there, the classical block coding error exponent [41] result can be borrowed. This
bound is shown to be achievable by implementing an optimal universal fixed-to-variable
length encoder together with a FIFO buffer. A similar scheme based on tilted distribution
coding was proposed in [49] when the encoder knows the distribution of the source. A
new proof for the tilted distribution based scheme is provided based on the large deviation
principle. We analyze the delay constrained error exponent Es(R) in Section 2.5 and show
that there are several differences between this error exponent and the classical block source
coding error exponent. For example, the derivative of the delay constrained error exponent
is positive at the entropy rate, instead of 0 as for block coding error exponents. Also, the delay
constrained error exponent approaches infinity at the logarithm of the alphabet size, which is
not the case for block coding. However, the two are connected through the focusing operator
defined in (1.1). A similar “focusing” operator was recently observed in streaming channel
coding with feedback [67].
For streaming channel coding with delay, the presence of feedback is essential for the
achievability of the “focusing” bound, but for streaming lossless source coding with delay,
no feedback is needed. In order to understand to what kinds of information-theoretic
problems the focusing operator applies, we study a more general problem in Chapter 3.
For the delay constrained point-to-point source coding problems in Chapters 2 and 3, the
focusing bound applies because the encoder has full information about the source. However,
for the distributed source coding problems in Chapters 4 and 5, the focusing operator no
longer applies because the encoders do not have full information about the source randomness.
1.4.2 Lossy source coding with delay
In Chapter 2, the reconstruction has to be exactly the same as the source or else a
decoding error occurs. In Chapter 3, we loosen the requirement for the exactness of the
reconstruction and study the delay constrained source coding problem under a distortion
measure. The source model is the same as that for lossless source coding. We define the
delay constrained lossy source coding error exponent as the exponential convergence rate of
the lossy error probability with delay, where a lossy error occurs if the distortion between
a source symbol and its reconstruction is higher than the system requirement.
Technically speaking, lossless source coding can be treated as a special case of lossy
source coding by properly defining error as a distortion violation. Hence, the results in
Chapter 3 can be treated as a natural generalization of Chapter 2. We prove that the delay
constrained error exponent and the block coding error exponent for peak distortion measure
are connected through the same focusing operator defined in (1.1). The reason is that the
rate distortion function under peak distortion is concave ∩ over the distribution of the
source. This is a property that average distortion measures do not have. The derivations
in this chapter are more general than those in Chapter 2, since we only use the concavity of the
rate distortion function over the distribution. Hence the techniques can be used in other
delay constrained coding problems, for example channel coding with feedback, where the
variable-length code is also concave ∩ in the channel behavior.
1.4.3 Distributed lossless source coding with delay
In Chapter 4, we study delay constrained distributed source coding of correlated sources.
The block coding counterpart was first studied by Slepian and Wolf in [79], where they
determined the rate region for two encoders that intend to communicate two correlated
sources to one decoder without cooperation.
In Chapter 2, we introduced a sequential random binning scheme that achieves the
random coding error exponent. This scheme is not necessary in the point-to-point case
because the random coding error exponent is much smaller than the optimal “focusing”
bound. In distributed source coding, however, sequential random binning is useful. By
using a sequential random binning scheme, we show that a positive delay constrained error
exponent can be achieved for both sources as long as the rate pair is in the interior of the
rate region determined in [79]. The delay constrained error exponents are different from the
block coding ones in [39, 29] because the two problems have different dominant error events.
Similar to fixed-length block coding, we show that both maximum-likelihood decoding
and universal decoding achieve the same error exponent. This is shown through an analysis in
Appendix G, where tilted distributions are used to bridge the two error exponents. Several
important lemmas that are used in other chapters are also proved in Appendix G. The
essence of these proofs is to use Lagrange duality to confine the candidate set of distributions
of a minimization problem to a one-dimensional exponential family.
Unlike what we observed in Chapter 2 and Chapter 3, the delay constrained error
exponent is smaller than the fixed length block coding counterpart. Is it because the
achievability scheme is suboptimal, or does this new problem have different properties than
the point to point source coding problems studied in the previous two chapters? This
question leads us to the study of the upper bound for a special case of the delay constrained
distributed source coding problem in the next chapter.
Chronologically, this is the first project on delay constrained source coding that I was
involved with. Stark Draper, then a post-doc at Cal, my advisor Anant Sahai, and I worked
on this problem from December 2004 to October 2006. Early results were summarized in [32]
and a final version has been submitted [12]. Stark, now a professor in Wisconsin, contributed
a lot to this project, especially to the universal decoding part. I appreciate his help on the
project and his introducing me to Imre Csiszar and Janos Korner’s great book [29].
1.4.4 Lossless Source Coding with Decoder Side-Information with delay
What is missing in Chapter 4 is a non-trivial upper bound on the delay constrained error
exponents. In Chapter 5, we study one special distributed source coding problem, source
coding with decoder side-information, and derive an upper bound on the error exponent.
This problem is a special case of the problem in Chapter 4 since the decoder side-information
can be treated as an encoder with rate higher than the logarithm of the size of the alphabet.
It is a well-known fact that there is a duality between channel coding and source coding
with decoder side-information [2] in the block coding setup. So it is not surprising that we
can borrow a delay constrained channel coding technique, the feed-forward decoder,
first developed by Pinsker [62] and recently clarified by Sahai [67], to solve our
problem in Chapter 5. However, the lower bound and the upper bound are in general not
the same. This leaves great space for future improvements.
We then derive the error exponent for source coding with both decoder and encoder
side-information. This delay constrained error exponent is related to the block coding error
exponent by the focusing operator in (1.1). This error exponent is generally strictly higher
than the upper bound of the delay constrained error exponent with only decoder side-
information. This is similar to the channel coding case, where the delay constrained error
exponent is higher with feedback than without feedback. This phenomenon is called the “price
of ignorance” [18] and is not observed in the block coding context. It shows that in
the delay constrained setup, the traditional compress-first-then-encrypt scheme achieves
higher reliability than the novel encrypt-first-then-compress scheme developed in [50],
although there is no difference in reliability between the two schemes in the block coding
setup. The “price of ignorance” adds to the series of observations that delay is not the same
as block length.
Chapter 2
Lossless Source Coding
In this chapter, we begin by reviewing classical results on the error exponents of lossless
block source coding. In order to understand the fundamental differences between classical
block coding and delay constrained coding, we introduce the setup of the delay constrained
lossless source coding problem and the notion of the error exponent with a delay constraint.
This setup serves as the foundation of the thesis. We then present the main result of this
chapter: the tight delay constrained error exponent for lossless source coding. Some
alternative suboptimal coding schemes are also analyzed, especially the sequential random
binning scheme, which serves as a useful tool in future chapters on distributed lossless
source coding.
2.1 Problem Setup and Main Results
Fixed-length lossless block source coding is reviewed in Section A.1 in the appendix.
We present our result in the delay constrained setup for streaming lossless source coding.
2.1.1 Source Coding with Delay Constraints
We introduce the delay constrained source coding problem in this section; the system model,
setup, and architecture issues are discussed. These are the basics of the whole thesis. (In
this thesis, we freely interchange “coding with delay” with “delay constrained coding”, and
similarly “error exponent with delay” with “delay constrained error exponent”.)
System Model, Fixed Delay and Delay Universality
Figure 2.1. Time line of delay constrained source coding: rate R = 1/2, delay ∆ = 3
As shown in Figure 2.1, rather than being known in advance, the source symbols enter
the encoder in a streaming fashion. We assume that the discrete memoryless source generates
one source symbol xi per second from a finite alphabet X, where the xi are i.i.d. from a
distribution px. Without loss of generality, assume px(x) > 0 for all x ∈ X.

The jth source symbol xj is not known at the encoder until time j; this is the fundamental
difference in the system model from the block source coding setup in Section A.1. The
coding system also commits to a rate R. Rate-R operation means that the encoder sends
1 bit to the decoder every 1/R seconds. It is shown in Proposition 1 and Proposition 2 that
we only need to study the problem when the rate R falls in the interval [H(px), log |X|].
We define the sequential encoder-decoder pair. This is the coding system we study in
the delay constrained setup.

Definition 1 A fixed-delay ∆ sequential encoder-decoder pair (E, D) is a sequence of maps
Ej, j = 1, 2, ... and Dj, j = 1, 2, .... The outputs of Ej are the outputs of the encoder E
from time j − 1 to j:

Ej : X^j −→ {0, 1}^{⌊jR⌋ − ⌊(j−1)R⌋}

Ej(x_1^j) = b_{⌊(j−1)R⌋+1}^{⌊jR⌋}

The output of the fixed-delay ∆ decoder Dj is the decoding decision on xj based on the received
binary bits up to time j + ∆:

Dj : {0, 1}^{⌊(j+∆)R⌋} −→ X

Dj(b_1^{⌊(j+∆)R⌋}) = x̂j(j + ∆)

Hence x̂j(j + ∆) is the estimate of xj at time j + ∆, and thus there is an end-to-end delay
of ∆ seconds between when xj enters the encoder and when the decoder outputs its estimate
of xj. A rate R = 1/2, fixed-delay ∆ = 3 sequential source coding system is illustrated in
Figure 2.1.
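The bit allocation in Definition 1 can be made concrete with a few lines of code: at time j a rate-R encoder emits ⌊jR⌋ − ⌊(j−1)R⌋ bits, so the cumulative number of bits after n seconds is exactly ⌊nR⌋. A minimal sketch (illustrative, not from the thesis):

```python
import math

def bits_at_step(j, R):
    """Bits emitted by a rate-R sequential encoder at time j (Definition 1):
    floor(j*R) - floor((j-1)*R)."""
    return math.floor(j*R) - math.floor((j - 1)*R)

# For R = 1/2 the encoder emits one bit every other second.
schedule = [bits_at_step(j, 0.5) for j in range(1, 7)]
```

For R = 1/2 this gives the pattern 0, 1, 0, 1, 0, 1, matching the timeline of Figure 2.1; the per-step counts telescope, so any rate is met exactly in cumulative terms.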
In this thesis, we focus our study on the fixed delay coding problem with a fixed delay
∆. Another interesting problem is the delay-universal coding problem defined as follows.
The outputs of the delay-universal decoder Dj are the decoding decisions on all the source
symbols that have arrived at the encoder by time j, based on the received binary bits up to time j:

Dj : {0, 1}^{⌊jR⌋} −→ X^j

Dj(b_1^{⌊jR⌋}) = x̂_1^j(j)

where x̂_1^j(j) is the estimate, at time j, of x_1^j, and thus the end-to-end delay of symbol xi
at time j is j − i seconds for i ≤ j. In a delay-universal scheme, the decoder emits revised
estimates for all source symbols so far. This coding system is illustrated in Figure 2.2.
Figure 2.2. Time line of delay universal source coding: rate R = 1/2
Without giving the proof, we state that all the error exponent results in this thesis for
fixed-delay problems also apply to delay universal problems.
Symbol Error Probability, Delay and Error Exponent
For a streaming source coding system, a source symbol that enters the system earlier
can enjoy a higher decoding accuracy than a symbol that enters the system later, because
the earlier source symbol can use more resources (noiseless channel uses). This
is in contrast to block coding systems, where all source symbols share the same noiseless
channel uses. Thus for delay constrained source coding, it is important to study the symbol
decoding error probability instead of the block coding error probability.
Definition 2 A delay constrained error exponent Es(R) is said to be achievable if and only
if for all ε > 0 there exists K < ∞ such that for all ∆ > 0 there exists a fixed-delay ∆
encoder-decoder pair (E, D) with, for all i,

Pr[xi ≠ x̂i(i + ∆)] ≤ K 2^{−∆(Es(R)−ε)}

Note: the order of the quantifiers in this definition is extremely important. Simplistically
speaking, a delay constrained error exponent is said to be achievable (meaning the symbol
error decays exponentially with delay with the claimed exponent for all symbols) if it is
achievable for all fixed delays.
We have the following two propositions, which state that the error exponent is only interesting
if R ∈ [H(px), log |X|]. First we show that the error exponent is only interesting if the rate
R ≤ log |X|.

Proposition 1 Es(R) is not well defined if R > log |X|.
Proof: Suppose R > log |X|. For integer N large enough, we have

2^{⌊NR−2⌋} > 2^{N log |X|} = |X|^N.

Now we construct a very simple block coding system as follows.

The encoder first queues up the N source symbols from time (k−1)N + 1 to kN, k = 1, 2, ...;
those N symbols are x_{(k−1)N+1}, ..., x_{kN}. Since 2^{⌊NR−2⌋} > |X|^N, we can use an injective
map E from X^N to {1, 2, 3, ..., 2^{⌊NR−2⌋}}. Notice that the encoder can send at least ⌊NR − 2⌋ bits
in the time interval (kN, (k + 1)N), so the encoder can send E(x_{(k−1)N+1}, ..., x_{kN}) to the
decoder within the time interval (kN, (k + 1)N). Thus the decoding error for source symbols
x_{(k−1)N+1}, ..., x_{kN} is zero at time (k + 1)N. This is true for all k = 1, 2, .... So the error
probability for any source symbol xn at time n + ∆ is zero for any ∆ ≥ 2N. Thus the error
exponent Es(R) for R > log |X| is not defined, because log 0 is not defined; less mathematically
strictly speaking, we say the error exponent is infinite for R > log |X|. □
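The counting step in the proof is easy to check numerically. The sketch below uses illustrative parameters (|X| = 3, R = 1.75 > log₂ 3 ≈ 1.585, N = 20, none of which come from the thesis): it verifies 2^⌊NR−2⌋ > |X|^N and realizes the injective map by reading a string over X as a base-|X| integer.

```python
import math
from itertools import product

X = ['A', 'B', 'C']      # |X| = 3, log2(3) ~ 1.585
R, N = 1.75, 20          # any R > log2|X| works once N is large enough

# the counting inequality from the proof: 2^{floor(NR-2)} > |X|^N
assert 2**math.floor(N*R - 2) > len(X)**N

def index(block):
    """Injective map from X^N into {0, ..., |X|^N - 1}: base-|X| expansion,
    so distinct blocks always get distinct indices."""
    i = 0
    for symbol in block:
        i = i*len(X) + X.index(symbol)
    return i

# spot-check injectivity on all length-3 blocks (length 20 is too many to enumerate)
codes = [index(b) for b in product(X, repeat=3)]
```

Since each index fits in ⌊NR − 2⌋ bits and that many bits are available per length-N interval, the block is delivered error-free one interval later, exactly as the proof describes.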
The above proposition is for the delay constrained source coding problem in this chapter,
but it should be obvious that similar results hold for the other source coding problems
discussed in future chapters. That is, if the rate of the system is above the logarithm of the
alphabet size, an almost instantaneous error-free coding is possible. This implies that the
error exponent is not well defined in that case.
Secondly, the delay constrained error exponent is zero, i.e. the error probability does
not universally converge to zero, if the rate is below the entropy rate of the source. This
result should not be surprising given that it is also true for block coding. However, for
the completeness of the thesis, we give a proof. The idea of the proof is to translate a delay
constrained problem into a block coding problem; then, by using the classical block coding
error probability result, we lower bound the symbol error probability and thus upper bound
the delay constrained error exponent.
Proposition 2 Es(R) = 0 if R < H(px).
Proof: We prove the proposition by contradiction. Suppose that for some source px the
delay constrained source coding error exponent satisfies Es(R) > 0 for some R < H(px). Then
from Definition 2 we know that for any ε > 0 there exists K < ∞ such that for all ∆ there
exists a delay constrained source coding system (E, D) with

Pr[xn ≠ x̂n(n + ∆)] ≤ K 2^{−∆(Es(R)−ε)} for all n. (2.1)

Notice that (2.1) holds for all n, ∆. We pick n and ∆ as big as needed.
Now we can design a block coding scheme of block length n. The block encoder E is derived
from the delay constrained encoders Ei; the output of the block encoder is the accumulation
of the outputs of the delay constrained encoders:

E(x_1^n) = (E1(x_1^1), E2(x_1^2), ..., E_{n+∆}(x_1^{n+∆})) = b_1^{⌊(n+∆)R⌋} (2.2)

Notice that some of the encoder outputs Ei(x_1^i) can be empty.

The decoder part is similar. The block decoder uses the outputs of all the delay
constrained decoders up to time n + ∆:

D(b_1^{⌊(n+∆)R⌋}) = (x̂1(1 + ∆), x̂2(2 + ∆), ..., x̂n(n + ∆)) = x̂_1^n (2.3)
Now we look at the block error of the block coding system built from the delay constrained
source coding system. First, we fix the ratio of ∆ to n at ∆ = αn; we will choose
a sufficiently large n and a small enough α to construct the contradiction. Writing E = Es(R),
the union bound gives

Pr[x_1^n ≠ x̂_1^n] ≤ Σ_{i=1}^{n} Pr[xi ≠ x̂i] ≤ Σ_{i=1}^{n} K 2^{−∆(E−ε)} = nK 2^{−αn(E−ε)} (2.4)
However, from the classical block source coding theorem in [72], we know that

Pr[x_1^n ≠ x̂_1^n] ≥ 0.5 (2.5)

if the effective rate of the block coding system, ((n + ∆)/n)R = (1 + α)R, is smaller than
the entropy rate of the source H(px). Since R < H(px), this only requires n to be much larger
than ∆, i.e. α small enough, so that (1 + α)R < H(px).
Now, combining (2.4) and (2.5), we have

nK 2^{−αn(E−ε)} ≥ 0.5 (2.6)

Notice that E > ε (for ε chosen small enough), K < ∞ is a constant, and α > 0 is also a
constant; hence there exists n big enough such that nK 2^{−αn(E−ε)} < 0.5. This gives the
contradiction we need. The proposition is proved. □
These two propositions tell us that the delay constrained error exponent Es(R) is only
interesting for R ∈ [H(px), log |X|].

In the proof of Proposition 2, we constructed a block coding system from a delay constrained
source coding system with some delay performance and then built the contradiction.
In this thesis, we also use this technique to prove other theorems: we borrow
classical block coding results to serve as the building blocks of the delay constrained
information theory.

The above two propositions are for the delay constrained lossless source coding problem in
this chapter, but it should be obvious that similar results hold for the other source coding
problems discussed in future chapters. That is, the delay constrained error exponents are zero,
i.e. the error probabilities do not converge to zero universally, if the rate is lower than the
relevant entropy; and if the rate is above the logarithm of the alphabet size of the source,
any exponent is achievable.
2.1.2 Main result of Chapter 2: Lossless Source Coding Error Exponent with Delay
Following the definition of the delay-constrained error exponent for lossless source coding
in Definition 2, we have the following theorem which describes the convergence rate of the
symbol-wise error probability of lossless source coding with delay problem.
Theorem 1 (Delay constrained lossless source coding error exponent) For a source x ∼ px,
the delay constrained source coding error exponent defined in Definition 2 is

Es(R) = inf_{α>0} (1/α) Es,b((α + 1)R) (2.7)

where Es,b(R) is the block source coding error exponent [29] defined in (A.4). This error
exponent Es(R) is both achievable and an upper bound; hence our result is complete.

Recall the definition of the delay constrained error exponent: for all ε > 0, there exists a finite
constant K such that for all i, ∆,

Pr[xi ≠ x̂i(i + ∆)] ≤ K 2^{−∆(Es(R)−ε)}
The result has two parts. First, it states that there exists a coding scheme such that the
error exponent Es(R) can be achieved in a universal setup; this is summarized in Proposition 5
in Section 2.4.1. Second, no coding scheme can achieve a better delay constrained
error exponent than Es(R); this is summarized in Proposition 6 in Section 2.4.2.
Before showing the proof of Theorem 1, we give some numerical results and discuss
some other coding schemes in the next two sections.
2.2 Numerical Results
In this section we evaluate the delay constrained performance for different coding
schemes via an example. Consider a simple source x with alphabet size 3, X = {A, B, C},
and the following distribution:

px(A) = a,  px(B) = (1 − a)/2,  px(C) = (1 − a)/2

where a ∈ [0, 1]; in this section we set a = 0.65.
2.2.1 Comparison of error exponents
The error exponents for both block and delay constrained source coding predict the
asymptotic performance of different source coding systems when the delay is long. We
plot the delay constrained error exponent Es(R), the block source coding error exponent
Es,b(R) and the random coding error exponent Er(R) in Figure 2.3. As shown in Theorem 1
and Theorem 9, these error exponents give the asymptotic performance of the two delay
constrained source coding systems. We also give a simple streaming prefix-free code at rate 3/2 for
this source. It will be shown that, although this prefix-free code is suboptimal, with an error
exponent much smaller than Es(3/2), its error exponent is much larger than Er(3/2), which is
equal to Es,b(3/2) in this case. We give the details of the streaming prefix-free coding system
and analyze its decoding error probability in Section 2.2.2.
Figure 2.3. Different source coding error exponents: delay constrained error exponent Es(R), block source coding error exponent Es,b(R), random coding error exponent Er(R)
In Figure 2.4, we plot the ratio of the optimal delay constrained error exponent over the
block coding error exponent. The ratio tells how many times longer the delay has to be for
the block coding system studied in Section 2.3.1 to achieve the same error probability.
The smallest ratio is around 52, at rate around 1.45.
Figure 2.4. Ratio of the delay optimal error exponent Es(R) over the block source coding error exponent Es,b(R)
2.2.2 Non-asymptotic results: prefix-free coding of length 2 and queueing delay
In this section we show a very simple coding scheme which uses a prefix-free code [26]
instead of the asymptotically optimal universal code studied in Section 2.4.1. The length-two
prefix-free code for source x that we are going to use is:
AA → 0
AB → 1000 AC → 1001 BA → 1010 BB → 1011
BC → 1100 CA → 1101 CB → 1110 CC → 1111
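This code can be checked mechanically. The sketch below (illustrative, with a = 0.65 as in this section) verifies that no codeword is a prefix of another, that the Kraft sum is exactly 1, and that the expected codeword length per two-symbol block, 1·a² + 4·(1 − a²) ≈ 2.73 bits, stays below the 3 bits available every 2 seconds, so the queue analyzed next is stable.

```python
code = {'AA': '0',
        'AB': '1000', 'AC': '1001', 'BA': '1010', 'BB': '1011',
        'BC': '1100', 'CA': '1101', 'CB': '1110', 'CC': '1111'}

# prefix-free: no codeword is a prefix of a different codeword
words = list(code.values())
prefix_free = all(not w2.startswith(w1)
                  for w1 in words for w2 in words if w1 != w2)

# Kraft sum: sum over codewords of 2^{-length}
kraft = sum(2.0**(-len(w)) for w in words)

# expected codeword length per two source symbols, for p = (a, (1-a)/2, (1-a)/2)
a = 0.65
q = a*a                      # probability of the pair AA (codeword of length 1)
expected_len = 1*q + 4*(1 - q)
```

A Kraft sum of exactly 1 means the code wastes no bits combinatorially; its suboptimality comes only from the short block length and the mismatch to the source distribution.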
This prefix-free code is obviously suboptimal: the block length is only 2, and the codeword
lengths are not adapted to the distribution. However, it will be shown that even this
obviously non-optimal code outperforms the block coding schemes in the delay constrained
setup when a is not too small. We analyze the performance of the streaming prefix-free
coding system at R = 3/2. That is, the source generates 1 symbol per second, while 3 bits can
be sent through the rate constrained channel every 2 seconds. The prefix-free stream coding
system operates like a FIFO queueing system with an infinite buffer, which is similar to the
problem studied in [49]. The prefix-free stream coding system is illustrated in Figure 2.5. Following
the prefix code defined earlier, the encoder groups two source symbols x_{2k−1}, x_{2k}
together at time 2k, k = 1, 2, ... and maps them to a prefix-free codeword; the length of the
codeword is either 1 or 4. The buffer is drained at 3 bits per 2 seconds. Write the number of
bits in the buffer at time 2k as Bk. Every two seconds, the number of bits Bk in the buffer
either goes down by 2, if x_{2k−1}x_{2k} = AA, or goes up by 1, if x_{2k−1}x_{2k} ≠ AA. 3 bits
are drained from the FIFO queue every 2 seconds; if the queue is empty, the encoder can
send random bits through the channel without causing confusion at the decoder, because
the source generates 1 source symbol per second.
Figure 2.5. Streaming prefix-free coding system (/ indicates empty queue, * indicates meaningless random bits)
Clearly B_k, k = 1, 2, ... forms a Markov chain with the following transition probabilities: B_k = B_{k−1} + 1 with probability 1 − a², and B_k = B_{k−1} − 2 with probability a², with the boundary condition B_k ≥ 0. The state transition graph is shown in Figure 2.6. For this Markov chain we can easily derive the stationary distribution (if it exists) [33]:

    μ_k = L ( (−1 + √(1 + 4(1−q)/q)) / 2 )^k    (2.8)

where q = a² and L is a normalizing constant. Notice that μ_k is a geometric series, so the stationary distribution exists as long as the ratio is less than 1, i.e. 4(1−q)/q < 8, or equivalently q > 1/3. In this example a = 0.65, so q = a² = 0.4225 > 1/3, and the stationary distribution μ_k exists.
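As a sanity check on (2.8), the geometric stationary profile can be compared against a short Monte-Carlo simulation of the buffer; this is an illustrative sketch only (the reflecting boundary at zero is modeled by flooring the queue, an assumption), not part of the original analysis:

```python
import random

# Illustrative sketch: simulate the buffer random walk B_k for a = 0.65
# (q = a^2), where B_k goes up 1 w.p. 1-q and down 2 w.p. q, floored at 0,
# and compare the empirical state ratios with the geometric ratio in (2.8).
a = 0.65
q = a * a
r = (-1 + (1 + 4 * (1 - q) / q) ** 0.5) / 2  # ratio of the geometric series

random.seed(0)
B, counts, steps = 0, {}, 500000
for _ in range(steps):
    B = B + 1 if random.random() > q else max(B - 2, 0)
    counts[B] = counts.get(B, 0) + 1

for k in range(5):
    print(k, round(counts[k + 1] / counts[k], 3), "predicted", round(r, 3))
```

The empirical ratios between neighboring state occupancies settle near the predicted ratio r ≈ 0.77.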
For the above simple prefix-free coding system, we can easily derive the probability of a decoding error for symbols x_{2k−1}, x_{2k} at time 2k + ∆ − 1; the effective delay for x_{2k−1} is thus ∆. A decoding error can only happen if, at time 2k + ∆ − 1, at least one bit of the prefix-free
Figure 2.6. Transition graph of the random walk B_k for source distribution (a, (1−a)/2, (1−a)/2); q = a² is the probability that B_k goes down by 2.
code describing x_{2k−1}, x_{2k} is still in the queue. This implies that the number of bits in the buffer at time 2k, B_k, is larger than

    ⌊(3/2)(∆ − 1)⌋ − l(x_{2k−1}, x_{2k})

where l(x_{2k−1}, x_{2k}) is the length of the prefix-free codeword for x_{2k−1}, x_{2k}; it is 1 with probability q = a² and 4 with probability 1 − q = 1 − a². Noticing that the length of the prefix-free codeword for x_{2k−1}, x_{2k} is independent of B_k, we have the following upper bound on the error probability of decoding with delay ∆ when the system is in its stationary state:
    Pr[x̂_{2k−1} ≠ x_{2k−1}] ≤ Pr[l(x_{2k−1}, x_{2k}) = 1] Pr[B_k > ⌊(3/2)(∆−1)⌋ − 1]
                               + Pr[l(x_{2k−1}, x_{2k}) = 4] Pr[B_k > ⌊(3/2)(∆−1)⌋ − 4]
     = q Σ_{j=⌊(3/2)(∆−1)⌋}^∞ μ_j + (1−q) Σ_{j=⌊(3/2)(∆−1)⌋−3}^∞ μ_j
     = G ( (−1 + √(1 + 4(1−q)/q)) / 2 )^{⌊(3/2)(∆−1)⌋ − 3}
The last line follows by substituting the stationary distribution μ_j from (2.8); G is a constant whose detailed expression we omit here. From the last line, the error exponent of this prefix-free coding system is evidently

    −(3/2) log( (−1 + √(1 + 4(1−q)/q)) / 2 )
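Plugging in the example value a = 0.65 makes the exponent concrete; the sketch below computes the geometric ratio r, the resulting exponent −(3/2) log₂ r, and, ignoring the constant G (an approximation), the delay needed for a 10⁻⁶ error probability:

```python
import math

# Numeric evaluation for the example a = 0.65; the constant G is ignored
# when extrapolating the delay, which is an approximation.
a = 0.65
q = a * a
r = (-1 + math.sqrt(1 + 4 * (1 - q) / q)) / 2
exponent = -1.5 * math.log2(r)             # delay exponent, in bits per second
delta_1e6 = 6 / (1.5 * -math.log10(r))     # delay for Pe ~ 1e-6, G ignored
print(round(exponent, 3), round(delta_1e6, 1))
```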
With the above evaluation, we now compare three different coding schemes in the non-asymptotic setup. In Figure 2.7, the error probability vs. delay curves are plotted for the three schemes. First we plot the causal random coding error probability; as shown in Figure 2.3, at R = 3/2 the random coding error exponent E_r(R) is the same as the block
coding error exponent E_{s,b}(R). The block coding curve is for a so-called simplex coding scheme, where the encoder first queues up ∆/2 symbols, encodes them into a length-(∆/2)R binary sequence, and uses the next ∆/2 seconds to transmit the message. This coding scheme gives an error exponent of E_{s,b}(R)/2. As can be seen in Figure 2.3, the slope of the simplex block coding curve is roughly half of the block source coding curve's slope.
Figure 2.7. Error probability vs. delay (non-asymptotic results) for block coding, causal random coding, and simple prefix coding.
The slopes of the curves in Figure 2.7 indicate how fast the error probability goes to zero with delay, i.e. the error exponents. Although smaller than the delay-optimal error exponent E_s(R), the simple prefix-free coding scheme has a much higher error exponent than both random coding and simplex block coding. A trivial calculation tells us that in order to get a 10⁻⁶ symbol error probability, the required delay for delay-optimal coding is ∼ 40, for causal random coding around ∼ 303, and for simplex block coding around ∼ 374. Here we run a linear regression on the data y_∆ = log₁₀ P_e(∆), x_∆ = ∆, as shown in Figure 2.3, from ∆ = 80 to ∆ = 100, and extrapolate the ∆ such that log₁₀ P_e(∆) = −6. Thus we see a major delay performance improvement for delay-optimal source coding.
2.3 First attempt: some suboptimal coding schemes
On our way to finding the optimal delay constrained performance, we first studied the classical block coding's delay constrained performance in Section 2.3.1, the random tree code discussed later in Section 2.3.3, and several other known coding systems. The coding systems in the next three subsections are shown to be suboptimal in the sense that they all achieve strictly smaller delay constrained error exponents than that in Theorem 1. This section is by no means an exhaustive study of the delay constrained performance of all possible coding systems; our intention is to look at the obvious candidates and help readers understand the key issues in delay constrained coding.
2.3.1 Block Source Coding’s Delay Performance
Figure 2.8. Block coding's delay performance: over each period of ∆/2 seconds the encoder queues ∆/2 source symbols, and during the next ∆/2 seconds it sends the resulting ∆R/2 bits over the rate-R bit stream.
We analyze the delay constrained performance of traditional block source coding systems. As shown in Figure 2.8, the encoder first encodes ∆/2 source symbols, collected over a period of ∆/2 seconds, into a block of binary bits. During the next ∆/2 seconds, the encoder uses all of the bandwidth to transmit these bits through the rate-R noiseless channel. After all the bits are received by the decoder at time ∆, the effective coding system is a length-∆/2, rate-R block source coding system. Thus the error probability of each symbol is lower bounded by 2^{−(∆/2) E_{s,b}(R)}, as shown in Theorem 9. The delay constrained error exponent for source symbol x_1 is

    E = −(1/∆) log 2^{−(∆/2) E_{s,b}(R)} = E_{s,b}(R)/2    (2.9)
This is half of the block coding error exponent. Another problem with this coding scheme is that the encoder is delay-specific, since ∆ is a parameter of the infrastructure, while the optimal coding scheme shown in Section 2.4.1 is not.
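For concreteness, the halved exponent can be evaluated numerically. The sketch below assumes the Gallager-style form E_{s,b}(R) = sup_{ρ≥0} [ρR − (1+ρ) log₂ Σ_x p_x(x)^{1/(1+ρ)}] and uses the example source (0.65, 0.175, 0.175) at R = 3/2 from this chapter; the specific grid search is illustrative only:

```python
import math

# Grid-search evaluation of the block exponent (an assumed Gallager form)
# for the example source of this chapter at R = 3/2.
p = [0.65, 0.175, 0.175]
R = 1.5

def gallager(rho):
    return rho * R - (1 + rho) * math.log2(sum(x ** (1 / (1 + rho)) for x in p))

E_sb = max(gallager(i / 1000.0) for i in range(5001))  # rho in [0, 5]
print("E_sb(R) ~", round(E_sb, 4), " block scheme exponent ~", round(E_sb / 2, 4))
```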
2.3.2 Error-free Coding with Queueing Delay
For both the block coding in the previous section and the sequential random coding in Section 2.3.3, the coding system commits to some nonzero error probability for symbol x_i at time i + ∆, for every pair (i, ∆), as long as the rate R is smaller than log |X|. The next two coding schemes are different in the sense that the error probability for any source symbol x_i is exactly zero after some finite, sequence-dependent delay ∆_i(x_1^∞). These encoding schemes have two parts: the first part is a zero-error fixed-to-variable or variable-to-variable length source encoder; the second part is a FIFO (first in, first out) buffer. The queue in the buffer is drained at a constant rate of R bits per second. If the queue is empty, the encoder buffer simply sends arbitrary bits through the noiseless channel. The decoder, knowing the rate of the source, simply discards the arbitrary gibberish bits added by the encoder buffer, and its task is then only to decode the source symbols from the received bits that describe the source stream. This coding scheme makes no decoding error on x_i as long as the bits that describe x_i are all received by the decoder. Thus a symbol error occurs only if too many bits are used to describe the source block that x_i is in and some previous source symbols. This class of coding schemes is illustrated in Figure 2.9. The error-free code can be either a variable-to-fixed length code or a variable-to-variable length code.
Figure 2.9. Error-free source coding for a fixed-rate system: streaming data passes through a causal error-free source code and a FIFO encoder buffer into the rate-R bit stream to the decoder.
Prefix-free Entropy Coding with Queueing Delay
Entropy coding [26] is a simple source coding scheme in which the code length of a source symbol is roughly the negative logarithm of the probability of that source symbol. For example, the Huffman code and the arithmetic code are two entropy codes that are thoroughly studied in [26]. An important property of entropy coding is that the average code length per source symbol approaches the entropy rate H(p_x) of the source if the code block is long enough; the code length is independent of the rate R. We argue that the simple prefix-free entropy coding with queueing delay scheme is not always optimal, by the following simple example. Consider a source X = {a, b, c, d} with distribution p_x(a) = 0.91, p_x(b) = p_x(c) = p_x(d) = 0.03, and a system of rate R = 2. Obviously the best coding scheme here is not to compress at all: just send the uncompressed 2-bit binary representations of the source symbols across the rate R = 2 system. The system then achieves zero error for every source symbol x_i instantaneously. Since the rate R matches the alphabet size log |X| exactly, this is, philosophically speaking, in the same spirit as Michael Gastpar's PhD thesis [43], in which the Gaussian source is matched to the Gaussian channel. The delay constrained error exponent is thus infinite, or, strictly speaking, not well defined. One optimal prefix-free code for this source is E(a) = 0, E(b) = 10, E(c) = 110, E(d) = 111, and a coding system using this prefix-free code, or entropy codes of any block length, clearly does not achieve zero error probability at any delay ∆ for source symbol x_i when i is large enough: an error occurs if the source symbols before i are atypical, for example a long string of c's. This case is thoroughly analyzed in Section 2.2.2.
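The queue dynamics behind this failure are easy to see numerically. In the sketch below, the backlog is floored at zero when the queue empties (an assumption about the idle behavior); a typical run of a's leaves the buffer empty, while a run of c's grows the backlog by one bit per symbol:

```python
# Backlog of the FIFO queue for the prefix-free code above at R = 2 bits
# per symbol; the floor at zero models the encoder idling when empty.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
R = 2

def backlog(symbols):
    B = 0
    for s in symbols:
        B = max(B + len(code[s]) - R, 0)
    return B

print(backlog("a" * 50), backlog("c" * 50))
```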
We show in Section 2.4.1 that a simple universal coding with queueing delay scheme is
indeed optimal. This scheme is inspired by Jelinek’s non-universal coding scheme in [49].
We call Jelinek’s scheme tilted entropy coding because the new codes are entropy codes for
a tilted distribution generated from the original distribution. We give a simplified proof of
Jelinek’s result in Appendix B using modern large deviation techniques.
LZ78 Coding with Queueing Delay
In this subsection, we argue that the standard LZ78 coding is not suitable for the delay constrained setup, due to the nature of its dictionary construction.

In their seminal paper [88], Lempel and Ziv proposed a dictionary-based sequential universal lossless data compression algorithm. This coding scheme is causal in nature.
However, the standard LZ78 coding scheme described in [88], and later popularized by Welch in [83], does not achieve any positive delay constrained error exponent. The reason the seemingly causal LZW coding scheme fails to achieve any positive delay constrained error exponent is rooted in the incremental nature of the dictionary. If the dictionary is infinite, as originally designed by Lempel and Ziv, then for a memoryless source and a fixed delay ∆, every word of length 2∆ will be in the dictionary with high probability at sufficiently large time t. Thus the encoding delay at the LZW encoder is at least 2∆ for any source symbol x_i, i > t. According to the definition of the delay constrained error exponent in Definition 2, the achievable error exponent is 0 for such a Lempel-Ziv coding system.

On the other hand, with a finite dictionary, the dictionary mismatches the source statistics if the early source symbols are atypical. This mismatch occurs with positive probability, and the encoding delay for later source symbols is then large, so again no positive delay constrained error exponent can be achieved. Note that we only discuss the popular LZ78 algorithm here; there are numerous other forms of universal Lempel-Ziv coding, the first one in [87]. The delay constrained performance of such coding schemes is left for future study.
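The growing-phrase effect is easy to demonstrate with a minimal LZ78-style incremental parse (a simplified sketch, not the exact coder of [88]): for an iid source, the dictionary phrases, and hence the encoder's wait before each phrase index can be emitted, keep growing:

```python
import random

# Minimal LZ78-style incremental parse: each new phrase is the shortest
# string not yet in the dictionary; return the sequence of phrase lengths.
def lz78_phrase_lengths(seq):
    dictionary, lengths = {}, []
    phrase = ""
    for ch in seq:
        phrase += ch
        if phrase not in dictionary:
            dictionary[phrase] = len(dictionary) + 1
            lengths.append(len(phrase))
            phrase = ""
    return lengths

random.seed(1)
seq = "".join(random.choice("ABC") for _ in range(20000))
lengths = lz78_phrase_lengths(seq)
# Late phrases are much longer on average than early ones.
print(sum(lengths[:100]) / 100, sum(lengths[-100:]) / 100)
```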
2.3.3 Sequential Random Binning
In [12] and [32], we proposed a sequential random binning scheme for delay constrained source coding systems. It is the source-coding counterpart of the tree and convolutional codes used for channel coding [34]. This sequential random binning scheme follows our definition of a delay constrained source coding system in Definition 1, with additional common randomness accessible to both the encoder(s) and the decoder. The encoder is universal in nature, in that it is the same for any source distribution p_x, while the decoder can be an ML decoder or a universal decoder, both discussed in detail later. We define the sequential random binning scheme as follows.
Definition 3 A randomized sequential encoder-decoder pair (a random binning scheme) (E, D) is a sequence of maps E_j, j = 1, 2, ... and D_j, j = 1, 2, .... The output of E_j is the output of the encoder E from time j − 1 to time j:

    E_j : X^j → {0, 1}^{⌊jR⌋−⌊(j−1)R⌋}

    E_j(x_1^j) = b_{⌊(j−1)R⌋+1}^{⌊jR⌋}

The output of the fixed-delay-∆ decoder D_j is the decoding decision on x_j based on the received binary bits up to time j + ∆:

    D_j : {0, 1}^{⌊(j+∆)R⌋} → X

    D_j(b_1^{⌊(j+∆)R⌋}) = x̂_j(j + ∆)
where x̂_j(j + ∆) is the estimate of x_j at time j + ∆, which thus has an end-to-end delay of ∆ seconds. Common randomness, shared between the encoder(s) and the decoder(s), is assumed. Access to common randomness is a standard assumption in the information theory community; common randomness can be generated from correlated data [63]. This is not the main topic of this thesis, in which we simply assume enough common randomness; interested readers may consult [56, 1]. The common randomness allows us to randomize the mappings independently of the source sequence. In this thesis, we only need pairwise independence. Formally, for all i, n and all x̃_{i+1}^n with x̃_{i+1} ≠ x_{i+1}:

    Pr[ E(x_1^i x_{i+1}^n) = E(x_1^i x̃_{i+1}^n) ] = 2^{−(⌊nR⌋−⌊iR⌋)} ≤ 2 × 2^{−(n−i)R}    (2.10)
(2.10) deserves a closer look. The number of output bits of the encoder for source symbols from time i to time n is ⌊nR⌋ − ⌊iR⌋, so (2.10) says that the probability that two length-n source strings which diverge after time i share the same binary representation is 2^{−(⌊nR⌋−⌊iR⌋)}. We define bins as follows: a bin is the set of source strings that share the same binary representation,

    B_s(x_1^n) = { x̃_1^n ∈ X^n : E(x̃_1^n) = E(x_1^n) }    (2.11)

With the notion of bins, (2.10) is equivalent to the following equality for x̃_{i+1} ≠ x_{i+1}:

    Pr[ x_1^i x̃_{i+1}^n ∈ B_s(x_1^i x_{i+1}^n) ] = 2^{−(⌊nR⌋−⌊iR⌋)} ≤ 2 × 2^{−(n−i)R}    (2.12)
In this thesis, the sequential random encoder always works by assigning random parity bits in a causal fashion to the observed source sequence; that is, the bits generated at each time in Definition 3 are iid Bernoulli(1/2) random variables. Since parity bits are assigned causally, if two source sequences x_1^n and x̃_1^n share the same length-l prefix, i.e. x_1^l = x̃_1^l, then their first ⌊lR⌋ parity bits must match. Subsequent parity bits are drawn independently.
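One concrete, purely hypothetical way to realize such causal binning is to derive each parity bit from shared randomness keyed by the prefix observed so far; the keyed hash below merely stands in for genuinely pairwise-independent common randomness (an assumption of this sketch):

```python
import hashlib

# Hypothetical causal random binning at R = 1 bit per symbol: bit j is a
# pseudo-random function of the length-j prefix, so sequences sharing a
# prefix share exactly the corresponding leading parity bits.
def parity_bits(seq, key="shared-randomness"):
    return [hashlib.sha256((key + seq[:j]).encode()).digest()[0] & 1
            for j in range(1, len(seq) + 1)]

x, y = "ABACA", "ABABB"   # agree on the first 3 symbols only
print(parity_bits(x)[:3] == parity_bits(y)[:3])  # shared prefix -> shared bits
```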
At the decoder side, the decoder receives the binary parity checks b_1^{⌊nR⌋} at time n, and hence knows the bin that the source sequence x_1^n lies in. The decoder then has to pick one source sequence out of the bin B_s(x_1^n) as the estimate of x_1^n. If the decoder knows the source distribution p_x, it can simply pick the sequence in the bin with the maximum likelihood; otherwise, in what we call the universal case, the decoder can use a minimum empirical entropy decoding rule. We first summarize the delay constrained error exponent results for both ML decoding and universal decoding, beginning with maximum-likelihood decoding, where the decoder knows the source distribution p_x.
Proposition 3 Given a rate R > H(p_x), there exists a randomized streaming encoder and maximum likelihood decoder pair (per Definition 3) and a finite constant K > 0, such that Pr[x̂_i(i + ∆) ≠ x_i] ≤ K 2^{−∆ E_{ML}(R)} for all i, ∆ ≥ 0, or equivalently for all n ≥ ∆ ≥ 0,

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ K 2^{−∆ E_{ML}(R)}    (2.13)

where

    E_{ML}(R) = sup_{ρ∈[0,1]} [ ρR − (1 + ρ) log( Σ_x p_x(x)^{1/(1+ρ)} ) ]    (2.14)
Second, if the decoder does not know the distribution, it can use the minimum empirical entropy rule: pick the sequence with the smallest empirical entropy inside the received bin. More interestingly, the decoder may prefer not to use its knowledge of the source distribution. The benefit of doing so comes from uncertainty about the source distribution: for example, a source distribution (0.9, 0.051, 0.049) may be mistaken for (0.9, 0.049, 0.051), and an ML decoder based on the wrong distribution may then decode incorrectly. We summarize the universal coding result in the following proposition.
Proposition 4 Given a rate R > H(p_x), there exists a randomized streaming encoder and universal decoder pair (per Definition 3) such that for all ε > 0 there exists a finite K > 0 such that Pr[x̂_i(i + ∆) ≠ x_i] ≤ K 2^{−∆(E_{UN}(R)−ε)} for all i, ∆ ≥ 0, where

    E_{UN}(R) = inf_q { D(q‖p_x) + |R − H(q)|⁺ }    (2.15)

where q is an arbitrary probability distribution on X and |z|⁺ = max{0, z}.
Remark: The error exponents of Propositions 3 and 4 both equal random block-coding
exponents shown in (A.5).
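E_{UN}(R) in (2.15) can be approximated by brute force. The sketch below grids over distributions q on a ternary alphabet, using the example source (0.65, 0.175, 0.175) and R = 3/2 from Section 2.2 as an assumed test case:

```python
import math

# Brute-force grid approximation of E_UN(R) = inf_q [D(q||p) + |R - H(q)|^+]
# for a ternary source; grid resolution 1/200 on each probability.
p = [0.65, 0.175, 0.175]
R = 1.5

def H(q):
    return -sum(x * math.log2(x) for x in q if x > 0)

def D(q, ref):
    return sum(x * math.log2(x / y) for x, y in zip(q, ref) if x > 0)

best = float("inf")
n = 200
for i in range(n + 1):
    for j in range(n + 1 - i):
        q = [i / n, j / n, (n - i - j) / n]
        best = min(best, D(q, p) + max(R - H(q), 0.0))
print("E_UN(R) ~", round(best, 4))
```

The result matches the value of E_{ML}(R) from (2.14) for this source, consistent with the remark above.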
Proof of Proposition 3: ML decoding

To show Propositions 3 and 4, we first develop the common core of the proof in the context of ML decoding. First, we describe the ML decoding rule.
ML decoding rule:
Denote by x̂_1^n(n) the estimate of the source sequence x_1^n at time n:

    x̂_1^n(n) = arg max_{x̃_1^n ∈ B_s(x_1^n)} p_x(x̃_1^n) = arg max_{x̃_1^n ∈ B_s(x_1^n)} ∏_{i=1}^n p_x(x̃_i)    (2.16)

The ML decoding rule in (2.16) is very simple: at time n, the decoder picks the sequence x̂_1^n(n) with the highest likelihood among those in the same bin as the true sequence x_1^n. The estimate of source symbol x_{n−∆} is then simply the (n − ∆)th symbol of x̂_1^n(n), denoted by x̂_{n−∆}(n).
Details of the proof:
The proof strategy is as follows. For a decoding error to occur, there must be some false source sequence x̃_1^n that satisfies three conditions: (i) it must be in the same bin as (share the same parities as) x_1^n, i.e. x̃_1^n ∈ B_s(x_1^n); (ii) it must be at least as likely as the true sequence, i.e. p_x(x̃_1^n) ≥ p_x(x_1^n); and (iii) x̃_l ≠ x_l for some l ≤ n − ∆.
The error probability can be union bounded as follows, as illustrated in Figure 2.10:

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆}]
     = Σ_{x_1^n} Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆} | x_1^n] p_x(x_1^n)    (2.17)
     = Σ_{x_1^n} Σ_{l=1}^{n−∆} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n)    (2.18)
     = Σ_{l=1}^{n−∆} Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n)
     = Σ_{l=1}^{n−∆} p_n(l).    (2.19)
After conditioning on the realized source sequence in (2.17), the remaining randomness is
only in the binning. In (2.18) we decompose the error event into a number of mutually
Figure 2.10. The decoding error probability at n − ∆ can be union bounded by the sum of the probabilities of a first decoding error at l, 1 ≤ l ≤ n − ∆. The dominant error event p_n(n − ∆) is the one in the highlighted oval (shortest delay).
exclusive events by partitioning all source sequences x̃_1^n into sets F_n(l, x_1^n), defined by the time l of the first symbol in which they differ from the realized source x_1^n:

    F_n(l, x_1^n) = { x̃_1^n ∈ X^n : x̃_1^{l−1} = x_1^{l−1}, x̃_l ≠ x_l },    (2.20)

and define F_n(n + 1, x_1^n) = {x_1^n}. Finally, in (2.19) we define

    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n).    (2.21)
We now upper bound p_n(l) using a Chernoff bound argument similar to [39].

Lemma 1 p_n(l) ≤ 2 × 2^{−(n−l+1) E_{ML}(R)}.
Proof:

    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n)
     ≤ Σ_{x_1^n} min[1, Σ_{x̃_1^n ∈ F_n(l,x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)} Pr[x̃_1^n ∈ B_s(x_1^n)]] p_x(x_1^n)    (2.22)
     ≤ Σ_{x_1^{l−1}, x_l^n} min[1, Σ_{x̃_l^n s.t. p_x(x̃_l^n) ≥ p_x(x_l^n)} 2 × 2^{−(n−l+1)R}] p_x(x_1^{l−1}) p_x(x_l^n)    (2.23)
     ≤ 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n s.t. p_x(x̃_l^n) ≥ p_x(x_l^n)} 2^{−(n−l+1)R}] p_x(x_l^n)
     = 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n} 1[p_x(x̃_l^n) ≥ p_x(x_l^n)] 2^{−(n−l+1)R}] p_x(x_l^n)    (2.24)
     ≤ 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n} min[1, p_x(x̃_l^n)/p_x(x_l^n)] 2^{−(n−l+1)R}] p_x(x_l^n)
     ≤ 2 × Σ_{x_l^n} [ Σ_{x̃_l^n} (p_x(x̃_l^n)/p_x(x_l^n))^{1/(1+ρ)} 2^{−(n−l+1)R} ]^ρ p_x(x_l^n)    (2.25)
     = 2 × Σ_{x_l^n} p_x(x_l^n)^{1/(1+ρ)} [ Σ_{x̃_l^n} p_x(x̃_l^n)^{1/(1+ρ)} ]^ρ 2^{−(n−l+1)ρR}
     = 2 × [ Σ_x p_x(x)^{1/(1+ρ)} ]^{(n−l+1)} [ Σ_x p_x(x)^{1/(1+ρ)} ]^{(n−l+1)ρ} 2^{−(n−l+1)ρR}    (2.26)
     = 2 × [ Σ_x p_x(x)^{1/(1+ρ)} ]^{(n−l+1)(1+ρ)} 2^{−(n−l+1)ρR}
     = 2 × 2^{−(n−l+1)[ρR − (1+ρ) log(Σ_x p_x(x)^{1/(1+ρ)})]}    (2.27)
In (2.22) we apply the union bound. In (2.23) we use the fact that after the first symbol in which two sequences differ, the remaining parity bits are independent, and the fact from (2.12) that only the likelihoods of the differing suffixes matter: if x̃_1^{l−1} = x_1^{l−1}, then p_x(x̃_1^n) ≥ p_x(x_1^n) if and only if p_x(x̃_l^n) ≥ p_x(x_l^n). In (2.24), 1(·) is the indicator function, taking the value one if the argument is true and zero if it is false. We get (2.25) by limiting ρ to the range 0 ≤ ρ ≤ 1, since the arguments of the minimizations are both positive and upper bounded by one. We use the iid property of the source, exchanging sums and products, to get (2.26). The bound in (2.27) holds for all ρ in the range 0 ≤ ρ ≤ 1. Maximizing (2.27) over ρ gives p_n(l) ≤ 2 × 2^{−(n−l+1) E_{ML}(R)}, where E_{ML}(R) is defined in (2.14) of Proposition 3. □
Using Lemma 1 in (2.19) gives

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ 2 × Σ_{l=1}^{n−∆} 2^{−(n−l+1) E_{ML}(R)}    (2.28)
     = Σ_{l=1}^{n−∆} 2 × 2^{−(n−l+1−∆) E_{ML}(R)} 2^{−∆ E_{ML}(R)}
     ≤ K_0 2^{−∆ E_{ML}(R)}    (2.29)

In (2.29) we pull out the exponent in ∆; the remaining summation is a sum of decaying exponentials and can thus be bounded by a constant K_0. This proves Proposition 3.
Proof of Proposition 4: Universal decoding
We use the union bound on the symbol-wise error probability introduced in (2.19), but
with minimum empirical entropy, rather than maximum-likelihood, decoding.
Universal decoding rule:

    x̂_l(n) = w[l]_l, where w[l]_1^n = arg min_{x̃_1^n ∈ B_s(x_1^n) s.t. x̃_1^{l−1} = x̂_1^{l−1}(n)} H(x̃_l^n).    (2.30)
We term this a sequential minimum empirical-entropy decoder. The reason for using this
decoder instead of the standard minimum block-entropy decoder is that the block-entropy
decoder has a polynomial term in n (resulting from summing over the type classes) that
multiplies the exponential decay in ∆. For n large, this polynomial can dominate. Using
the sequential minimum empirical-entropy decoder results in a polynomial term in ∆.
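The statistic minimized in (2.30) is just the entropy of the empirical type of the disputed suffix; the small helper below (hypothetical, not from the text) makes this concrete:

```python
import math
from collections import Counter

# Empirical entropy H(x_l^n) of a suffix: the entropy of its empirical type.
def empirical_entropy(seq):
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

# A constant suffix has the smallest possible empirical entropy, so the
# sequential decoder favors such "typical-looking" low-entropy suffixes.
print(empirical_entropy("AAAA"), empirical_entropy("ABAB"))
```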
Details of the proof: With this decoder, errors can only occur if there is some sequence x̃_1^n such that (i) x̃_1^n ∈ B_s(x_1^n); (ii) x̃_1^{l−1} = x_1^{l−1} and x̃_l ≠ x_l, for some l ≤ n − ∆; and (iii) the empirical entropy of x̃_l^n satisfies H(x̃_l^n) ≤ H(x_l^n). Building on the common core of the achievability argument (2.17)–(2.19), with universal decoding in place of maximum likelihood, we obtain the following definition of p_n(l) (cf. (2.21)):

    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. H(x̃_l^n) ≤ H(x_l^n)] p_x(x_1^n)    (2.31)
The following lemma gives a bound on p_n(l).

Lemma 2 For sequential minimum empirical entropy decoding,

    p_n(l) ≤ 2 × (n − l + 2)^{2|X|} 2^{−(n−l+1) E_{UN}(R)}.
Proof: We define P^{n−l} to be the type of the length-(n − l + 1) sequence x_l^n, and T_{P^{n−l}} to be the corresponding type class, so that x_l^n ∈ T_{P^{n−l}}. Analogous definitions hold for P̃^{n−l} and x̃_l^n. We rewrite the constraint H(x̃_l^n) ≤ H(x_l^n) as H(P̃^{n−l}) ≤ H(P^{n−l}). Thus,
    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. H(x̃_l^n) ≤ H(x_l^n)] p_x(x_1^n)
     ≤ Σ_{x_1^n} min[1, Σ_{x̃_1^n ∈ F_n(l, x_1^n) s.t. H(x̃_l^n) ≤ H(x_l^n)} Pr[x̃_1^n ∈ B_s(x_1^n)]] p_x(x_1^n)
     ≤ Σ_{x_1^{l−1}, x_l^n} min[1, Σ_{x̃_l^n s.t. H(x̃_l^n) ≤ H(x_l^n)} 2 × 2^{−(n−l+1)R}] p_x(x_1^{l−1}) p_x(x_l^n)    (2.32)
     ≤ 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n s.t. H(x̃_l^n) ≤ H(x_l^n)} 2^{−(n−l+1)R}] p_x(x_l^n)    (2.33)
     = 2 × Σ_{P^{n−l}} Σ_{x_l^n ∈ T_{P^{n−l}}} min[1, Σ_{P̃^{n−l} s.t. H(P̃^{n−l}) ≤ H(P^{n−l})} Σ_{x̃_l^n ∈ T_{P̃^{n−l}}} 2^{−(n−l+1)R}] p_x(x_l^n)    (2.34)
     ≤ 2 × Σ_{P^{n−l}} Σ_{x_l^n ∈ T_{P^{n−l}}} min[1, (n − l + 2)^{|X|} 2^{−(n−l+1)[R − H(P^{n−l})]}] p_x(x_l^n)    (2.35)
     ≤ 2 × (n − l + 2)^{|X|} Σ_{P^{n−l}} Σ_{x_l^n ∈ T_{P^{n−l}}} 2^{−(n−l+1)|R − H(P^{n−l})|⁺} 2^{−(n−l+1)[D(P^{n−l}‖p_x) + H(P^{n−l})]}    (2.36)
     ≤ 2 × (n − l + 2)^{|X|} Σ_{P^{n−l}} 2^{−(n−l+1) inf_q [D(q‖p_x) + |R − H(q)|⁺]}    (2.37)
     ≤ 2 × (n − l + 2)^{2|X|} 2^{−(n−l+1) E_{UN}(R)}    (2.38)
To show (2.32), we use the bound in (2.12). In going from (2.34) to (2.35), first note that the argument of the innermost summation (over x̃_l^n) does not depend on x̃_l^n itself. We then use the following relations: (i) Σ_{x̃_l^n ∈ T_{P̃^{n−l}}} 1 = |T_{P̃^{n−l}}| ≤ 2^{(n−l+1)H(P̃^{n−l})}, a standard bound on the size of a type class [29]; (ii) H(P̃^{n−l}) ≤ H(P^{n−l}), by the sequential minimum empirical entropy decoding rule; and (iii) the polynomial bound on the number of types [28], which is at most (n − l + 2)^{|X|}. In (2.36) we recall the definition |·|⁺ = max{0, ·}; we pull the polynomial term out of the minimization and use p_x(x_l^n) = 2^{−(n−l+1)[D(P^{n−l}‖p_x) + H(P^{n−l})]} for all x_l^n ∈ T_{P^{n−l}}. It is also in (2.36) that we see why we use a sequential minimum empirical entropy decoding rule instead of a block minimum entropy decoding rule: had we not marginalized out x_1^{l−1} in (2.33), we would have a polynomial term in n rather than n − l out front, which for large n could dominate the exponential decay in n − l. As the expression in (2.37) no longer depends on x_l^n, we simplify by using |T_{P^{n−l}}| ≤ 2^{(n−l+1)H(P^{n−l})}. In (2.38) we use the definition of the universal error exponent E_{UN}(R) from (2.15) of Proposition 4, and again the polynomial bound on the number of types. The remaining steps are straightforward. □
Lemma 2 and Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} p_n(l) imply that

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} (n − l + 2)^{2|X|} 2^{−(n−l+1) E_{UN}(R)}
     ≤ Σ_{l=1}^{n−∆} K_1 2^{−(n−l+1)[E_{UN}(R) − ε]}    (2.39)
     ≤ K 2^{−∆[E_{UN}(R) − ε]}    (2.40)

In (2.39) we absorb the polynomial into the exponent: for all a > 0 and b > 0, there exists a C such that z^a ≤ C 2^{b(z−1)} for all z ≥ 1. We then make the delay-dependent term explicit. Pulling out the exponent in ∆, the remaining summation is a sum of decaying exponentials and can be bounded by a constant; together with K_1, this gives the constant K in (2.40). This proves Proposition 4. Note that the ε in (2.40) does not enter the optimization, because ε > 0 can be chosen equal to any constant; the choice of ε only affects the constant K in Proposition 4.
Discussions on sequential random binning
We have proposed a sequential random binning scheme, summarized in Propositions 3 and 4, which achieves the random block coding error exponent defined in (A.5). Although this error exponent is strictly suboptimal, as shown in Proposition 8 in Section 2.5, the scheme provides a very useful tool for deriving achievability results for delay constrained coding (both source coding and channel coding), as will be seen in later chapters where delay constrained source coding with decoder side-information and distributed source coding problems are discussed. Philosophically speaking, the essence of sequential random binning is to leave the encoder's uncertainty about the source to the future, and to let future binning reduce that uncertainty. This is in contrast to the optimal scheme shown in Section 2.4.1, where the encoder queues up the fixed-to-variable code, so that for a particular source symbol and delay the uncertainty lies in the past.
2.4 Proof of the Main Results
In this section we derive the delay constrained source coding error exponent of Theorem 1. We present a simple universal coding scheme that achieves the error exponent defined in Theorem 1. For the converse, we use a focusing-bound type of argument that parallels the analysis of the delay constrained error exponent for channel coding with feedback in [67].

Following the notion of the delay constrained source coding error exponent E_s(R) in Definition 2, Theorem 1 states that

    E_s(R) = inf_{α>0} (1/α) E_{s,b}((α + 1)R)

In Section 2.4.1 we first show the achievability of E_s(R) by a simple fixed-to-variable length universal code combined with a FIFO queue. Then in Section 2.4.2 we show that E_s(R) is indeed an upper bound on the delay constrained error exponent. This error exponent was first derived in [49] in a slightly different setup, using a non-universal coding scheme (the encoder needs to know the source distribution p_x and the rate R); we give a simple proof of Jelinek's result using large deviation theory in Appendix B.
2.4.1 Achievability
In this section, we introduce a universal coding scheme that achieves the delay constrained error exponent of Theorem 1. The coding scheme depends only on the size of the source alphabet, not on the source distribution. We first describe the scheme. The basic idea is to encode a long block of source symbols into a variable-length binary string: the string first describes the type of the source block, and then indexes the particular source block within its type class by a one-to-one map. This is in line with the Minimum Description Length principle [66, 4]. Source blocks with the same type have codewords of the same length.
Optimal Universal Coding
A block length N is chosen that is much smaller than the target end-to-end delays, while still being large enough; this finite block length N will be absorbed into the finite constant K. For a discrete memoryless source and large block length N, the variable-length code
Figure 2.11. A universal delay constrained source coding system: streaming data passes through an optimal fixed-to-variable length encoder and a FIFO encoder buffer into the rate-R bit stream to the decoder.
consists of two stages: first describing the type of the block ⃗x_i³ using O(|X| log N) bits, and then describing which particular realization has occurred using a variable NH(⃗x_i) bits, where H(⃗x_i) is the empirical entropy of the block. The overhead O(|X| log N) is asymptotically negligible and the code is universal in nature. It is easy to verify that the average code length satisfies lim_{N→∞} E_{p_x}[l(⃗x)]/N = H(p_x). This code is obviously a prefix-free code. Writing l(⃗x_i) for the length of the codeword for ⃗x_i, we have

    NH(⃗x_i) ≤ l(⃗x_i) ≤ |X| log(N + 1) + NH(⃗x_i)    (2.41)

The binary sequence describing the source is fed into the FIFO buffer illustrated in Figure 2.11. Notice that if the buffer is empty, the output of the buffer can be gibberish binary bits; the decoder simply discards these meaningless bits because it is aware that the buffer is empty.
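A concrete two-stage length computation consistent with (2.41) might look as follows (a hypothetical implementation using exact type-class indexing; the real code need only be one-to-one within each type class):

```python
import math
from collections import Counter

# Hypothetical two-stage code length: ceil(|X| log2(N+1)) bits for the type,
# then ceil(log2 |T_P|) bits to index the block within its type class.
def two_stage_length(block, alphabet):
    N = len(block)
    counts = Counter(block)
    type_bits = math.ceil(len(alphabet) * math.log2(N + 1))
    size = math.factorial(N)              # |T_P| = N! / prod_x(n_x!)
    for x in alphabet:
        size //= math.factorial(counts[x])
    index_bits = math.ceil(math.log2(size)) if size > 1 else 0
    return type_bits + index_bits

print(two_stage_length("ABACABAAAB", "ABC"))  # lies between N*H(type) and the (2.41) bound
```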
Proposition 5 For an iid source ∼ p_x using the universal delay constrained code described above, for all ε > 0 there exists K < ∞ such that for all t, ∆:

    Pr[⃗x_t ≠ ⃗x̂_t((t + ∆)N)] ≤ K 2^{−∆N(E_s(R)−ε)}

where ⃗x̂_t((t + ∆)N) is the estimate of ⃗x_t at time (t + ∆)N. Before the proof, we give the following lemma to bound the probabilities of atypical source behavior.

Lemma 3 (Source atypicality) For all ε > 0 and block length N large enough, there exists K < ∞ such that for all n, if r < log |X|:

    K 2^{−nN(E_{s,b}(r)+ε)} ≤ Pr[ Σ_{i=1}^n l(⃗x_i) > nNr ] ≤ K 2^{−nN(E_{s,b}(r)−ε)}    (2.42)

³ ⃗x_i is the ith block of length N.
Proof: We only need to show the case r > H(p_x). By Cramér's theorem [31], for all ε₁ > 0 there exists K₁ such that

    Pr[ Σ_{i=1}^n l(⃗x_i) > nNr ] = Pr[ (1/n) Σ_{i=1}^n l(⃗x_i) > Nr ] ≤ K₁ 2^{−n(inf_{z>Nr} I(z) − ε₁)}    (2.43)

and

    Pr[ Σ_{i=1}^n l(⃗x_i) > nNr ] = Pr[ (1/n) Σ_{i=1}^n l(⃗x_i) > Nr ] ≥ K₁ 2^{−n(inf_{z>Nr} I(z) + ε₁)}    (2.44)

where the rate function I(z) is [31]

    I(z) = sup_{ρ∈R} [ ρz − log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) ]    (2.45)

Here we use this equivalent (K, ε) notation instead of the limits superior and inferior of Cramér's theorem [31].
Write I(z, ρ) = ρz − log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ), so I(z, 0) = 0. For z > Nr > NH(p_x) and large N:

    ∂I(z, ρ)/∂ρ |_{ρ=0} = z − Σ_{⃗x∈X^N} p_x(⃗x) l(⃗x) ≥ 0

By Hölder's inequality, for all ρ₁, ρ₂ and all θ ∈ (0, 1):

    ( Σ_i p_i 2^{ρ₁ l_i} )^θ ( Σ_i p_i 2^{ρ₂ l_i} )^{1−θ} ≥ Σ_i (p_i^θ 2^{θρ₁ l_i})(p_i^{1−θ} 2^{(1−θ)ρ₂ l_i}) = Σ_i p_i 2^{(θρ₁+(1−θ)ρ₂) l_i}

This shows that log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) is convex (∪) in ρ, so I(z, ρ) is concave (∩) in ρ for fixed z. Then for all z > 0 and all ρ < 0 we have I(z, ρ) < 0, which means that the ρ maximizing I(z, ρ) is positive. This implies that I(z) is monotonically increasing in z, and I(z) is obviously continuous. Thus inf_{z>Nr} I(z) = I(Nr).
For ρ ≥ 0, using the upper bound on l(⃗x) in (2.41):

    log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) ≤ log( Σ_{p∈T^N} 2^{−ND(p‖p_x)} 2^{ρ(|X| log(N+1) + NH(p))} )
     ≤ log( (N+1)^{|X|} 2^{−N min_p {D(p‖p_x) − ρH(p)} + ρ|X| log(N+1)} )
     = N( −min_p {D(p‖p_x) − ρH(p)} + ε_N )

where ε_N = (1+ρ)|X| log(N+1)/N goes to 0 as N goes to infinity.

For ρ ≥ 0, using the lower bound on l(⃗x) in (2.41):

    log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) ≥ log( Σ_{p∈T^N} 2^{−ND(p‖p_x) − |X| log(N+1)} 2^{ρNH(p)} )
     ≥ log( 2^{−N min_{p∈T^N} {D(p‖p_x) − ρH(p)} − |X| log(N+1)} )
     = N( −min_p {D(p‖p_x) − ρH(p)} − ε′_N )

where ε′_N = |X| log(N+1)/N − min_p {D(p‖p_x) − ρH(p)} + min_{p∈T^N} {D(p‖p_x) − ρH(p)} goes to 0 as N goes to infinity, and T^N is the set of all types of length-N sequences.
Substitute the above inequalities to I(Nr) defined in (2.45):
I(Nr) ≥ N(supρ>0
minp
ρ(r −H(p)) + D(p‖px) − εN
)(2.46)
And
I(Nr) ≤ N(supρ>0
minp
ρ(r −H(p)) + D(p‖px)+ ε′N)
(2.47)
First fix $\rho$. By a simple Lagrange multiplier argument, among distributions with fixed entropy $H(p)$, the distribution $p$ minimizing $D(p\|p_x)$ is a tilted distribution $p_x^{\alpha}$. It can be verified that $\frac{\partial H(p_x^{\alpha})}{\partial\alpha} \ge 0$ and $\frac{\partial D(p_x^{\alpha}\|p_x)}{\partial\alpha} = \alpha\frac{\partial H(p_x^{\alpha})}{\partial\alpha}$. Thus the distribution minimizing $D(p\|p_x)-\rho H(p)$ is $p_x^{\rho}$. Using some algebra, we have
$$D(p_x^{\rho}\|p_x)-\rho H(p_x^{\rho}) = -(1+\rho)\log\sum_{x\in\mathcal{X}} p_x(x)^{\frac{1}{1+\rho}}$$
Substituting this into (2.46) and (2.47) respectively:
$$I(Nr) \ge N\Big(\sup_{\rho>0}\Big\{\rho r-(1+\rho)\log\sum_{x\in\mathcal{X}} p_x(x)^{\frac{1}{1+\rho}}\Big\} - \epsilon_N\Big) = N\big(E_{s,b}(r)-\epsilon_N\big) \qquad (2.48)$$
The last equality can again be proved by a simple Lagrange multiplier argument. Similarly:
$$I(Nr) \le N\big(E_{s,b}(r)+\epsilon'_N\big) \qquad (2.49)$$
Substituting (2.48) and (2.49) into (2.43) and (2.44) respectively, and letting $\epsilon_1$ be small enough and $N$ big enough so that $\epsilon_N$ and $\epsilon'_N$ are small enough, we get the desired bound in (2.42). $\square$
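The two characterizations equated above — the Gallager-style form $E_{s,b}(r) = \sup_{\rho>0}\{\rho r - (1+\rho)\log_2\sum_x p_x(x)^{1/(1+\rho)}\}$ and the Csiszár-style form $\min\{D(q\|p_x) : H(q)\ge r\}$ — can be cross-checked numerically. The sketch below is illustrative only: the binary source $p=(0.8,0.2)$, the rate $r=0.9$, and the grid searches are assumptions for the example, not values from the thesis.

```python
import math

def esb_gallager(r, p, grid=4000, rho_max=40.0):
    # E_{s,b}(r) = sup_{rho > 0} [ rho*r - (1+rho) * log2( sum_x p(x)^(1/(1+rho)) ) ]
    best = 0.0
    for i in range(1, grid + 1):
        rho = rho_max * i / grid
        e0 = (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))
        best = max(best, rho * r - e0)
    return best

def esb_csiszar(r, p0, grid=200000):
    # E_{s,b}(r) = min { D(q || p) : H(q) >= r }, binary case, by grid search over q
    def h2(q):
        return -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    def d2(q):
        return q * math.log2(q / p0) + (1 - q) * math.log2((1 - q) / (1 - p0))
    return min(d2(i / grid) for i in range(1, grid) if h2(i / grid) >= r)

p = [0.8, 0.2]   # illustrative binary source, H(p) ~ 0.722 bits
r = 0.9          # a rate above the entropy rate
```

The two evaluations agree to within grid resolution, which is a numerical check of the Lagrange-duality step in the proof.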
Now we are ready to prove Proposition 5.
Proof: We give an upper bound on the probability of decoding error on $\vec{x}_t$ at time $(t+\Delta)N$. At time $(t+\Delta)N$, the decoder cannot decode $\vec{x}_t$ with zero error probability iff the binary strings describing $\vec{x}_t$ are not all out of the buffer. Since the encoding buffer is FIFO, this means that the number of outgoing bits from some time $t_1$ to $(t+\Delta)N$ is less than the number of bits in the buffer at time $t_1$ plus the number of incoming bits from time $t_1$ to time $tN$. Suppose the buffer was last empty at time $tN-nN$, where $0\le n\le t$. Given this condition, a decoding error occurs only if $\sum_{i=0}^{n-1} l(\vec{x}_{t-i}) > (n+\Delta)NR$. Write $l_{\max}$ for the longest code length, $l_{\max} \le |\mathcal{X}|\log(N+1)+N\log|\mathcal{X}|$. Then $\Pr[\sum_{i=0}^{n-1} l(\vec{x}_{t-i}) > (n+\Delta)NR] > 0$ only if $n > \frac{(n+\Delta)NR}{l_{\max}} > \frac{\Delta NR}{l_{\max}} \triangleq \beta\Delta$. So
$$\Pr[\vec{x}_t \neq \widehat{\vec{x}}_t((t+\Delta)N)] \le \sum_{n=\beta\Delta}^{t} \Pr\Big[\sum_{i=0}^{n-1} l(\vec{x}_{t-i}) > (n+\Delta)NR\Big] \qquad (2.50)$$
$$\stackrel{(a)}{\le} \sum_{n=\beta\Delta}^{t} K_1 2^{-nN\big(E_{s,b}\big(\frac{(n+\Delta)NR}{nN}\big)-\epsilon_1\big)}$$
$$\stackrel{(b)}{\le} \sum_{n=\gamma\Delta}^{\infty} K_2 2^{-nN(E_{s,b}(R)-\epsilon_2)} + \sum_{n=\beta\Delta}^{\gamma\Delta} K_2 2^{-\Delta N\big(\min_{\alpha>0}\frac{E_{s,b}((1+\alpha)R)}{\alpha}-\epsilon_2\big)}$$
$$\stackrel{(c)}{\le} K_3 2^{-\gamma\Delta N(E_{s,b}(R)-\epsilon_2)} + |\gamma\Delta-\beta\Delta|\, K_3 2^{-\Delta N(E_s(R)-\epsilon_2)}$$
$$\stackrel{(d)}{\le} K 2^{-\Delta N(E_s(R)-\epsilon)}$$
where the $K_i$'s and $\epsilon_i$'s are properly chosen positive real numbers. (a) is true because of Lemma 3. Define $\gamma = \frac{E_s(R)}{E_{s,b}(R)}$; in the first part of (b), we only need the fact that $E_{s,b}(R)$ is non-decreasing in $R$. In the second part of (b), we write $\alpha = \frac{\Delta}{n}$ and take the $\alpha$ that minimizes the error exponent. The first term of (c) comes from the sum of a geometric series. The second term of (c) is by the definition of $E_s(R)$. (d) is by the definition of $\gamma$. $\blacksquare$
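The queueing mechanism in this proof — a decoding deadline is missed exactly when the FIFO backlog of variable-length descriptions has not drained within $\Delta$ block periods — can be illustrated by simulation. The sketch below is a toy fluid model with an idealized type-based code length; the parameters ($p=0.9$, $N=16$, $R=0.8$) are illustrative assumptions, not values from the thesis.

```python
import math
import random

def code_length(block, N):
    # idealized universal code length: type overhead + N * empirical entropy (bits)
    q = sum(block) / N
    h = 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    return math.ceil(math.log2(N + 1) + N * h)

def delay_error_profile(p=0.9, N=16, R=0.8, T=20000, delays=(1, 2, 4, 8), seed=0):
    rng = random.Random(seed)
    drain = N * R      # bits leaving the FIFO buffer per block period (fluid model)
    backlog = 0.0
    finish = []        # time, in block periods, at which each block's bits fully drain
    for t in range(T):
        block = [1 if rng.random() < p else 0 for _ in range(N)]
        backlog += code_length(block, N)
        finish.append(t + backlog / drain)
        backlog = max(0.0, backlog - drain)
    # fraction of blocks whose description had not fully drained by deadline t + d
    return {d: sum(f > t + d for t, f in enumerate(finish)) / T for d in delays}
```

On a typical run the miss frequency falls off rapidly as the allowed delay grows — the qualitative behavior that Proposition 5 sharpens into an exact exponent.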
2.4.2 Converse
To bound the best possible error exponent with fixed delay, we consider a block coding
encoder/decoder pair constructed by the delay constrained encoder/decoder pair and trans-
late the block-coding bounds of [29] to the fixed delay context. The argument is analogous
to the “focusing bound” derivation in [67] for channel coding with feedback.
Proposition 6 For fixed-rate encodings of discrete memoryless sources, it is not possible to achieve an error exponent with fixed delay higher than
$$\inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R) \qquad (2.51)$$
From the definition of the delay constrained source coding error exponent in Definition 2, the statement of this proposition is equivalent to the following mathematical statement: for any $E > \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R)$, there exists a positive real value $\epsilon$, such that for any $K<\infty$, there exist $i>0$ and $\Delta>0$ with
$$\Pr[x_i \neq \widehat{x}_i(i+\Delta)] > K2^{-\Delta(E-\epsilon)}$$
Proof: We show the proposition by contradiction. Suppose that the delay-constrained error exponent can be higher than $\inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R)$. Then according to Definition 2, there exists a delay-constrained source coding system such that for some $E > \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R)$ and for any $\epsilon>0$, there exists $K<\infty$ such that for all $i>0$, $\Delta>0$:
$$\Pr[x_i \neq \widehat{x}_i(i+\Delta)] \le K2^{-\Delta(E-\epsilon)}$$
So we choose some $\epsilon>0$ such that
$$E-\epsilon > \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R) \qquad (2.52)$$
Then consider a block coding scheme $(\mathcal{E},\mathcal{D})$ built on the delay constrained source coding system. The encoder of the block coding system is the same as the delay-constrained source encoder, and the block decoder $\mathcal{D}$ works as follows:
$$\widehat{x}_1^i = (\widehat{x}_1(1+\Delta),\, \widehat{x}_2(2+\Delta),\, \ldots,\, \widehat{x}_i(i+\Delta))$$
Now the block decoding error of this coding system can be upper bounded as follows, for any $i>0$ and $\Delta>0$:
$$\Pr[\widehat{x}_1^i \neq x_1^i] \le \sum_{t=1}^{i}\Pr[\widehat{x}_t(t+\Delta) \neq x_t] \le \sum_{t=1}^{i} K2^{-\Delta(E-\epsilon)} = iK2^{-\Delta(E-\epsilon)} \qquad (2.53)$$
The block coding scheme $(\mathcal{E},\mathcal{D})$ is a block source coding system for $i$ source symbols using $\lfloor R(i+\Delta)\rfloor$ bits, and hence has rate $\frac{\lfloor R(i+\Delta)\rfloor}{i} \le \frac{i+\Delta}{i}R$. From the classical block coding result in Theorem 9, we know that the source coding error exponent $E_{s,b}(R)$ is monotonically increasing and continuous in $R$, so $E_{s,b}\big(\frac{\lfloor R(i+\Delta)\rfloor}{i}\big) \le E_{s,b}\big(\frac{R(i+\Delta)}{i}\big)$. Again from Theorem 9, we know that the block coding error probability can be bounded in the following way:
$$\Pr[\widehat{x}_1^i \neq x_1^i] > 2^{-i\big(E_{s,b}\big(\frac{R(i+\Delta)}{i}\big)+\epsilon_i\big)} \qquad (2.54)$$
where $\lim_{i\to\infty}\epsilon_i = 0$.
Combining (2.53) and (2.54), we have:
$$2^{-i\big(E_{s,b}\big(\frac{R(i+\Delta)}{i}\big)+\epsilon_i\big)} < iK2^{-\Delta(E-\epsilon)}$$
Now let $\alpha = \frac{\Delta}{i}$, $\alpha>0$; then the above inequality becomes $2^{-i(E_{s,b}(R(1+\alpha))+\epsilon_i)} < 2^{-i\alpha(E-\epsilon-\theta_i)}$, and hence:
$$E-\epsilon < \frac{1}{\alpha}\big(E_{s,b}(R(1+\alpha))+\epsilon_i\big)+\theta_i \qquad (2.55)$$
where $\lim_{i\to\infty}\theta_i = 0$. Now the above inequality is true for all $i$ and all $\alpha>0$; since $\lim_{i\to\infty}\theta_i = 0$ and $\lim_{i\to\infty}\epsilon_i = 0$, taking all of this into account, we have:
$$E-\epsilon \le \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}(R(1+\alpha)) \qquad (2.56)$$
Now (2.56) contradicts the assumption in (2.52), thus the proposition is proved. $\blacksquare$
Combining Proposition 5 and Proposition 6, we establish the desired results summarized in Theorem 1. For the source coding with delay constraints problem, we are thus able to give an exact characterization of the rate at which the symbol-wise error probability decays. This error exponent is in general larger than the block coding error exponent, as shown in Figure 2.3.
2.5 Discussions
We first investigate some of the properties of the delay constrained source coding error exponent $E_s(R)$ defined in (2.7). Then we conclude this chapter with some ideas for future work.
2.5.1 Properties of the delay constrained error exponent Es(R)
The source coding with delay error exponent can be parameterized by a single real number $\rho$, in a way that parallels the delay constrained channel coding with feedback error
Figure 2.12. $G_R(\rho)$: a concave $\cap$ function of $\rho$ with $G_R(0)=0$, maximum value $E_{s,b}(R)$ attained at $\rho_R$, and unique positive root $\rho^*$.
exponents for symmetric channels derived in [67]. The parametrization of $E_s(R)$ makes evaluating $E_s(R)$ easier and potentially enables us to study other properties of $E_s(R)$. We summarize the result in the following proposition, first shown in [20].
Proposition 7 Parametrization of $E_s(R)$:
$$E_s(R) = E_0(\rho^*) = (1+\rho^*)\log\Big[\sum_{x\in\mathcal{X}} p_x(x)^{\frac{1}{1+\rho^*}}\Big] \qquad (2.57)$$
where $\rho^*$ satisfies the condition $R = \frac{E_0(\rho^*)}{\rho^*} = \frac{(1+\rho^*)\log[\sum_{x\in\mathcal{X}} p_x(x)^{1/(1+\rho^*)}]}{\rho^*}$, i.e. $\rho^* R = E_0(\rho^*)$.
Proof: For simplicity of notation, define $G_R(\rho) = \rho R - E_0(\rho)$; thus $G_R(\rho^*) = 0$ by the definition of $\rho^*$, and $\max_\rho G_R(\rho) = E_{s,b}(R)$ by the definition of the block coding error exponent $E_{s,b}(R)$ in Theorem 9. As shown in Lemma 31 and Lemma 22 in Appendix G:
$$\frac{dG_R(\rho)}{d\rho} = R - H(p_x^{\rho}), \qquad \frac{d^2G_R(\rho)}{d\rho^2} = -\frac{dH(p_x^{\rho})}{d\rho} \le 0$$
Furthermore, $\lim_{\rho\to\infty}\frac{1}{\rho}G_R(\rho) = R - \log|\mathcal{X}| < 0$, thus $G_R(\infty) < 0$. Obviously $G_R(0) = 0$. So we know that $G_R(\rho)$ is a concave $\cap$ function with a unique root for $\rho>0$, as illustrated in Figure 2.12. Let $\rho_R \ge 0$ be the $\rho$ that maximizes $G_R(\rho)$; by the concavity of $G_R(\rho)$, this is equivalent to $R - H(p_x^{\rho_R}) = 0$. By concavity, we know that $\rho_R < \rho^*$.
Write:
$$F_R(\alpha,\rho) = \frac{1}{\alpha}\big(\rho(\alpha+1)R - E_0(\rho)\big) = \rho R + \frac{G_R(\rho)}{\alpha}$$
Then $E_s(R) = \inf_{\alpha>0}\sup_{\rho\ge0} F_R(\alpha,\rho)$. And for any $\alpha\in\big(0, \frac{\log|\mathcal{X}|}{R}-1\big)$:
$$\frac{\partial F_R(\alpha,\rho)}{\partial\rho} = \frac{(\alpha+1)R - H(p_x^{\rho})}{\alpha}, \qquad \frac{\partial^2 F_R(\alpha,\rho)}{\partial\rho^2} = -\frac{1}{\alpha}\frac{dH(p_x^{\rho})}{d\rho} \le 0$$
$$F_R(\alpha,0) = 0, \qquad F_R(\alpha,\infty) < 0 \qquad (2.58)$$
So for any $\alpha\in\big(0,\frac{\log|\mathcal{X}|}{R}-1\big)$, let $\rho(\alpha)$ be such that $F_R(\alpha,\rho(\alpha))$ is maximized; $\rho(\alpha)$ is thus the unique solution to:
$$(\alpha+1)R - H(p_x^{\rho(\alpha)}) = 0$$
Define $\alpha^* = \frac{H(p_x^{\rho^*})}{R} - 1$, i.e. $(\alpha^*+1)R - H(p_x^{\rho^*}) = 0$, which implies that
$$\alpha^* = \frac{H(p_x^{\rho^*})}{R} - 1 < \frac{H(p_x^{\infty})}{R} - 1 = \frac{\log|\mathcal{X}|}{R} - 1$$
$$\alpha^* = \frac{H(p_x^{\rho^*})}{R} - 1 > \frac{H(p_x^{\rho_R})}{R} - 1 = 0$$
Now we have established that $\alpha^*\in\big(0,\frac{\log|\mathcal{X}|}{R}-1\big)$; by the definition of $\alpha^*$, $\rho^*$ maximizes $F_R(\alpha^*,\rho)$ over all $\rho$. From the above analysis:
$$E_s(R) = \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R) = \inf_{\alpha>0}\sup_{\rho\ge0}\frac{1}{\alpha}\big(\rho(\alpha+1)R - E_0(\rho)\big) = \inf_{\alpha>0}\sup_{\rho\ge0}F_R(\alpha,\rho)$$
$$\le \sup_{\rho\ge0}F_R(\alpha^*,\rho) = F_R(\alpha^*,\rho^*) = \rho^* R \qquad (2.59)$$
The last step is from the definition of $F_R(\alpha,\rho)$ and the definition of $\rho^*$.
On the other hand, since $G_R(\rho^*) = 0$ implies $F_R(\alpha,\rho^*) = \rho^* R + \frac{G_R(\rho^*)}{\alpha} = \rho^* R$ for every $\alpha>0$:
$$E_s(R) = \inf_{\alpha>0}\sup_{\rho\ge0}F_R(\alpha,\rho) \ge \sup_{\rho\ge0}\inf_{\alpha>0}F_R(\alpha,\rho) \ge \inf_{\alpha>0}F_R(\alpha,\rho^*) = F_R(\alpha^*,\rho^*) = \rho^* R \qquad (2.60)$$
Combining (2.59) and (2.60), we derive the desired parametrization of $E_s(R)$. Here we have actually also proved that $(\alpha^*,\rho^*)$ is a saddle point of the function $F_R(\alpha,\rho)$. $\square$
Our next proposition shows that for any rate within the meaningful rate region $(H(p_x), \log|\mathcal{X}|)$, the delay constrained error exponent is strictly larger than the block coding error exponent.
Proposition 8 Comparison of $E_s(R)$ and $E_{s,b}(R)$:
$$\forall R\in(H(p_x),\log|\mathcal{X}|): \quad E_s(R) > E_{s,b}(R)$$
Proof: From Proposition 7, we have $E_s(R) = \rho^* R$. Also from Proposition 7, we know that $\rho_R < \rho^*$, where the block coding error exponent is $E_{s,b}(R) = \rho_R R - E_0(\rho_R)$. By noticing that $E_0(\rho)\ge0$ for all $\rho\ge0$, we have:
$$E_s(R) = \rho^* R > \rho_R R \ge \rho_R R - E_0(\rho_R) = E_{s,b}(R)$$
$\square$
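Propositions 7 and 8 are easy to check numerically: solve $\rho^* R = E_0(\rho^*)$ by bisection on the concave function $G_R(\rho) = \rho R - E_0(\rho)$, then compare $E_s(R) = \rho^* R$ with $E_{s,b}(R) = \max_\rho G_R(\rho)$. The sketch below is illustrative; the Bernoulli source $p=(0.8,0.2)$, the rate $R=0.9$, and the bisection bracket and grid are assumptions for the example.

```python
import math

def E0(rho, p):
    # E0(rho) = (1 + rho) * log2( sum_x p(x)^(1/(1+rho)) )
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def rho_star(R, p, hi=200.0, iters=200):
    # G_R(rho) = rho*R - E0(rho) is concave with G_R(0) = 0 and has a unique
    # positive root rho* when H(p) < R < log2|X|; bisect on the sign change.
    g = lambda rho: rho * R - E0(rho, p)
    lo = 1e-9                # G_R > 0 just above 0, since R > H(p)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def es_block(R, p, grid=100000, rho_max=100.0):
    # E_{s,b}(R) = max_rho [ rho*R - E0(rho) ], by a dense grid (sketch only)
    return max(rho_max * i / grid * R - E0(rho_max * i / grid, p)
               for i in range(1, grid + 1))

p = [0.8, 0.2]          # illustrative source, H(p) ~ 0.722 bits
R = 0.9                 # H(p) < R < log2|X| = 1
rs = rho_star(R, p)
Es = rs * R             # delay constrained exponent (Proposition 7)
Esb = es_block(R, p)    # block coding exponent
```

For this example the delay constrained exponent comes out around 1.9 while the block exponent is only about 0.05, a stark instance of Proposition 8's strict inequality.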
The source coding error exponent $E_{s,b}(R)$ has a zero right derivative at the entropy rate of the source, $R = H(p_x)$; this is a simple corollary of Lemma 23 in Appendix G. The delay constrained source coding error exponent differs from its block coding counterpart, as shown in Figure 2.3: the next proposition shows that its right derivative at the entropy rate of the source is positive.
Proposition 9 Right derivative of $E_s(R)$ at $R = H(p_x)$:
$$\lim_{R\to H(p_x)^+}\frac{dE_s(R)}{dR} = \frac{2H(p_x)}{\sum_x p_x(x)[\log(p_x(x))]^2 - H(p_x)^2}$$
Proof: Proposition 7 shows that the delay constrained error exponent $E_s(R)$ can be parameterized by a real number $\rho$ as $E_s(R) = E_0(\rho)$ and $R(\rho) = \frac{E_0(\rho)}{\rho}$; in particular, $R(\rho) = H(p_x)$ corresponds to $\rho = 0$. By Lemma 23 in Appendix G, we have
$$\frac{d}{d\rho}\frac{E_0(\rho)}{\rho} = \frac{\rho\frac{dE_0(\rho)}{d\rho} - E_0(\rho)}{\rho^2} = \frac{1}{\rho^2}\big(\rho H(p_x^{\rho}) - E_0(\rho)\big) = \frac{1}{\rho^2}D(p_x^{\rho}\|p_x) \qquad (2.61)$$
$$\frac{dE_0(\rho)}{d\rho} = H(p_x^{\rho}) \qquad (2.62)$$
$$\frac{dD(p_x^{\rho}\|p_x)}{d\rho} = \rho\frac{dH(p_x^{\rho})}{d\rho} \qquad (2.63)$$
Now we can follow Gallager's argument on pages 143–144 of [41], noticing that both $E_0(\rho)$ and $\frac{E_0(\rho)}{\rho}$ are positive for $\rho\in(0,+\infty)$. Combining (2.61) and (2.62):
$$\frac{dE_s(R)}{dR} = \frac{dE_0(\rho)/d\rho}{dR(\rho)/d\rho} = \frac{\rho^2 H(p_x^{\rho})}{D(p_x^{\rho}\|p_x)} \qquad (2.64)$$
From (2.64), we get the right derivative of $E_s(R)$ at $R = H(p_x)$:
$$\lim_{R\to H(p_x)^+}\frac{dE_s(R)}{dR} = \lim_{\rho\to0^+}\frac{\rho^2 H(p_x^{\rho})}{D(p_x^{\rho}\|p_x)} = \lim_{\rho\to0^+}\frac{2\rho H(p_x^{\rho}) + \rho^2\frac{dH(p_x^{\rho})}{d\rho}}{\rho\frac{dH(p_x^{\rho})}{d\rho}} \qquad (2.65)$$
$$= \lim_{\rho\to0^+}\frac{2H(p_x^{\rho})}{\frac{dH(p_x^{\rho})}{d\rho}} = \frac{2H(p_x)}{\sum_x p_x(x)[\log(p_x(x))]^2 - H(p_x)^2} \qquad (2.66)$$
(2.65) is true because of L'Hôpital's rule [86]; (2.66) follows by simple algebra. By the Cauchy–Schwarz inequality, $\sum_x p_x(x)[\log(p_x(x))]^2 - H(p_x)^2 > 0$ unless $p_x$ is uniform. Thus the right derivative of $E_s(R)$ at $R = H(p_x)$ is strictly positive in general. $\square$
In Figure 2.13, we plot the positive right derivative of $E_s(R)$ at $R = H(p_x)$ for Bernoulli($\alpha$) sources with $\alpha\in(0, 0.25)$.
Figure 2.13. $\lim_{R\to H(p_x)^+}\frac{dE_s(R)}{dR}$ for Bernoulli($\alpha$) sources: the right derivative of $E_s(R)$ at $R = H(p_x)$ (vertical axis, 0 to 5) is plotted against $\alpha\in(0, 0.25)$ (horizontal axis).
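The closed-form right derivative is simple to evaluate: its denominator is the variance of $-\log_2 p_x(x)$ (the varentropy of the source). Below is a small sketch for the Bernoulli($\alpha$) family of Figure 2.13; the particular $\alpha$ values are illustrative.

```python
import math

def right_derivative_at_entropy(p):
    # lim_{R -> H(p)+} dEs/dR = 2*H(p) / ( sum_x p(x)*log2(p(x))^2 - H(p)^2 )
    H = -sum(x * math.log2(x) for x in p)
    varentropy = sum(x * math.log2(x) ** 2 for x in p) - H * H
    return 2.0 * H / varentropy

# Bernoulli(alpha) sources, as in Figure 2.13
curve = {a: right_derivative_at_entropy([a, 1 - a]) for a in (0.05, 0.1, 0.2)}
```

The derivative is strictly positive for every non-uniform source and grows as $\alpha$ approaches $1/2$, where the varentropy in the denominator vanishes — matching the increasing shape of the plotted curve.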
The most striking difference between the block coding error exponent $E_{s,b}(R)$ and the delay constrained error exponent $E_s(R)$ is their behavior in the high rate regime, as illustrated in Figure 2.3, especially at $R = \log|\mathcal{X}|$. It is obvious that both error exponents are infinite (or, mathematically strictly speaking, not well defined) when $R$ is greater than $\log|\mathcal{X}|$, as shown in Proposition 1. The block coding error exponent $E_{s,b}(R)$ has a finite left limit at $R = \log|\mathcal{X}|$; the next proposition tells us that the delay constrained error exponent, however, has an infinite left limit at $R = \log|\mathcal{X}|$.

Proposition 10 The left limit of the sequential error exponent is infinite as $R$ approaches $\log|\mathcal{X}|$:
$$\lim_{R\to\log|\mathcal{X}|}E_s(R) = \infty$$
Proof: Notice that $E_{s,b}(R)$ is monotonically increasing for $R\in[0,\log|\mathcal{X}|)$ and $E_{s,b}(R) = \infty$ for $R > \log|\mathcal{X}|$. This implies that if $E_{s,b}((1+\alpha)R)$ is finite, $\alpha$ must be less than $\frac{\log|\mathcal{X}|}{R}-1$. So we have, for all $R\in[0,\log|\mathcal{X}|)$:
$$E_s(R) = \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((1+\alpha)R) \ge \frac{1}{\frac{\log|\mathcal{X}|}{R}-1}E_{s,b}(R)$$
Thus the left limit of $E_s(R)$ at $R = \log|\mathcal{X}|$ is:
$$\lim_{R\to\log|\mathcal{X}|}E_s(R) \ge \lim_{R\to\log|\mathcal{X}|}\frac{1}{\frac{\log|\mathcal{X}|}{R}-1}E_{s,b}(R) = \infty$$
$\square$
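This blow-up is visible numerically through the parametrization of Proposition 7: $E_s(R) = \rho^* R$ where $\rho^*$ solves $\rho R = E_0(\rho)$, and the root escapes to infinity as $R\to\log|\mathcal{X}|$. A sketch follows; the Bernoulli source $p=(0.8,0.2)$ and the rate points are illustrative assumptions.

```python
import math

def E0(rho, p):
    # Gallager function E0(rho) = (1+rho) * log2( sum_x p(x)^(1/(1+rho)) )
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def es_delay(R, p, hi=1e5, iters=300):
    # Es(R) = rho* R, where rho* is the unique positive root of rho*R - E0(rho)
    g = lambda rho: rho * R - E0(rho, p)
    lo = 1e-9
    for _ in range(iters):          # bisection on the sign change of g
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return R * (lo + hi) / 2.0

p = [0.8, 0.2]
vals = [es_delay(R, p) for R in (0.9, 0.99, 0.999)]   # R -> log2|X| = 1
```

The computed exponents grow roughly like a constant over $\log|\mathcal{X}| - R$, illustrating the infinite left limit that the block exponent does not share.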
2.5.2 Conclusions and future work
In this chapter, we introduced the general setup of delay constrained source coding and the definition of the delay constrained error exponent. Unlike classical source coding, the source generates symbols in a real-time fashion, and the performance of the coding system is measured by the probability of symbol error with a fixed delay. The error exponent for lossless source coding with delay is completely characterized. The achievability part is proved by implementing a fixed-to-variable length source encoder with a FIFO queue; the symbol error probability with delay is then the same as the probability of an atypical queueing delay. The converse part is proved by constructing a block source coding system out of the delay constrained source coding system. This simple idea can be applied to other source coding and channel coding problems and is called the "focusing bound" in [67]. It was shown that this delay constrained error exponent $E_s(R)$ has different properties than the classical block coding error exponent $E_{s,b}(R)$; in particular, $E_s(R)$ is strictly larger than $E_{s,b}(R)$ for all rates in the meaningful rate region. However, the delay constrained error exponent $E_s(R)$ is not completely understood. For example, we still do not know whether $E_s(R)$ is convex $\cup$ in $R$. To further understand the properties of $E_s(R)$, one may find the rich literature in queueing theory [52] very useful, particularly the study of large deviations in queueing theory [42].
We also analyzed the delay constrained performance of the sequential random binning
scheme. We proved that the standard random coding error exponent for block coding
can be achieved using sequential random binning. Hence, sequential random binning is a
suboptimal coding scheme for lossless source coding with delay. However, sequential random
binning is a powerful technique which can be applied to other delay constrained coding
problems such as distributed source coding [32, 12], source coding with side-information [16],
joint source channel coding, multiple-access channel coding [14] and broadcast channel
coding [15] under delay constraints. Some of these will be thoroughly studied in the next
several chapters of this thesis. The decoding schemes have complexity that grows exponentially with time. However, [60] shows that the random coding error exponent is achievable using a more computationally friendly stack-based decoding algorithm if the source distribution is known.
The most interesting result in this chapter is the "focusing" operator, which connects the delay constrained error exponent to the classical fixed-length block coding error exponent. A similar focusing bound was recently derived for delay constrained channel coding with feedback [67]. In order to understand which problems this type of "focusing" bound applies to, we study a more general problem in the next chapter.
Chapter 3
Lossy Source Coding
Under the same streaming source model as in Chapter 2, we study the delay constrained performance of lossy source coding, where the decoder only needs to reconstruct the source to within a certain distortion. We derive the delay constrained error exponent by applying the "focusing" bound operator to the block error exponent for the peak distortion measure. The proof in this chapter is more general than that in Chapter 2 and can be applied to delay constrained channel coding and joint source-channel coding.
3.1 Lossy source coding
As discussed in the review paper [6], lossy source coding is a source coding model where
the estimate of the source at the decoder does not necessarily have to be exact. Unlike in
Chapter 2, where the source and the reconstruction of the source are in the same alphabet
set X , for lossy source coding, the reconstruction of the source is in a different alphabet
set Y where a distortion measure d(·, ·) is defined on X × Y. We denote by X the source
alphabet and $\mathcal{Y}$ the reconstruction alphabet. We also denote by $y_i$ the reconstruction of $x_i$; this notational difference from Chapter 2, where we used $\widehat{x}_i$ as the reconstruction of $x_i$, is to emphasize the fact that the reconstruction alphabet can be different from the source alphabet.
y ∈ Y:
d : X × Y → [0,∞)
The classical fixed length block coding results for lossy source coding are reviewed in
Section A.2 in the appendix. The core issue we are interested in for this chapter is the impact
of causality and delay on lossy source coding. In [58], the rate distortion performance of a strictly zero delay decoder is studied, and it is shown that the optimal performance can be obtained by time-sharing between memoryless codes; it is thus in general strictly worse than the performance of classical fixed-block source coding, which allows arbitrarily large delay. The large deviation performance of the zero delay decoder problem is studied in [57]. Allowing some finite end-to-end delay, [81, 80] show that the average block coding rate distortion performance can still be approached exponentially with delay.
3.1.1 Lossy source coding with delay
As the common theme of this thesis, we consider a coding system for a streaming source,
drawn iid from a distribution px on finite alphabet X . The encoder, mapping source symbols
into bits at fixed rate R, is causal and the decoder has to reconstruct the source symbols
within a fixed end-to-end latency constraint $\Delta$. The system is illustrated in Figure 3.1, which is almost the same as the lossless case in Figure 2.1 in Chapter 2. The only difference between the two figures is notational: we use $y$ to indicate the reconstruction of the source $x$ in this chapter, instead of $\widehat{x}$ as in Chapter 2.
Figure 3.1. Time line of delay constrained source coding: rate $R = \frac{1}{2}$, delay $\Delta = 3$. Source symbols $x_1, x_2, x_3, \ldots$ enter the encoder, which sends $b_1(x_1^2), b_2(x_1^4), b_3(x_1^6), \ldots$ over the rate limited channel; the decoder outputs the reconstructions $y_1(4), y_2(5), y_3(6), \ldots$
Generalizing the notion of delay constrained error exponent (end-to-end delay perfor-
mance) for lossless source coding in Definition 2, we have the following definition on delay
constrained error exponent for lossy source coding.
Definition 4 A rate $R$ sequential source code as shown in Figure 3.1 achieves error (distortion violation) exponent $E_D(R)$ with delay if for all $\epsilon>0$, there exists $K<\infty$, s.t. $\forall i, \Delta>0$:
$$\Pr[d(x_i, y_i(i+\Delta)) > D] \le K2^{-\Delta(E_D(R)-\epsilon)}$$
The main goal of this chapter is to derive the error exponent defined above. The above lossy source coding system, in a way, imposes a distortion constraint on each symbol. This leads to an interesting relationship between the problem in Definition 4 and lossy source coding under a peak distortion constraint. We briefly introduce the relevant results on lossy source coding with a peak distortion in the next section.
3.1.2 Delay constrained lossy source coding error exponent
In this section, we present the main result of Chapter 3, the delay constrained error
exponent for lossy source coding. This result first appeared in our paper [17]. We investigate
the relation between delay ∆ and the probability of distortion violation Pr[d(xi, yi(i+∆)) >
D], where yi(i+∆) is the reconstruction of xi at time i+∆ and D is the distortion constraint.
Theorem 2 Consider fixed rate source coding of iid streaming data $x_i \sim p_x$, with a non-negative distortion measure $d$. For $D\in(\underline{D}, \overline{D})$ and rates $R\in(R(p_x, D), R_D)$, the following error exponent with delay is optimal and achievable:
$$E_D(R) \triangleq \inf_{\alpha>0}\frac{1}{\alpha}E_D^b((\alpha+1)R) \qquad (3.1)$$
where $E_D^b(R)$ is the block coding error exponent under the peak distortion constraint, as defined in Lemma 6. $\underline{D}$, $\overline{D}$ and $R_D$ are defined in the next section.
3.2 A brief detour to peak distortion
In this section we introduce the peak distortion measure and present the relevant block
coding result for lossy source coding with a peak distortion measure.
3.2.1 Peak distortion
Csiszár introduced the peak distortion measure in [29] as an exercise problem (2.2.12): for a positive function $d$ defined on the finite alphabet set $\mathcal{X}\times\mathcal{Y}$, and for $x_1^N\in\mathcal{X}^N$, $y_1^N\in\mathcal{Y}^N$, we define the distortion between $x_1^N$ and $y_1^N$ as
$$d(x_1^N, y_1^N) \triangleq \max_{1\le i\le N} d(x_i, y_i) \qquad (3.2)$$
This distortion measure reasonably reflects the behavior of the human visual system [51]. Given some finite $D>0$, we say $y$ is a valid reconstruction of $x$ if $d(x,y)\le D$.
The problem is only interesting if the target peak distortion $D$ is higher than $\underline{D}$ and lower than $\overline{D}$, where
$$\underline{D} \triangleq \max_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}} d(x,y), \qquad \overline{D} \triangleq \min_{y\in\mathcal{Y}}\max_{x\in\mathcal{X}} d(x,y)$$
If $D > \overline{D}$, then there exists a $y^*$ such that $d(x, y^*)\le D$ for all $x\in\mathcal{X}$; thus the decoder can reconstruct any source symbol $x_i$ by $y^*$ and the peak distortion requirement is still satisfied. On the other hand, if $D < \underline{D}$, then there exists an $x^*\in\mathcal{X}$ such that $d(x^*, y) > D$ for all $y\in\mathcal{Y}$. Now, assuming $p_x(x^*) > 0$, if $N$ is big enough then with high probability $x_i = x^*$ for some $i$; in this case there is no reconstruction $y_1^N\in\mathcal{Y}^N$ such that $d(x_1^N, y_1^N)\le D$. Note that both $\underline{D}$ and $\overline{D}$ depend only on the distortion measure $d(\cdot,\cdot)$, not on the source distribution $p_x$.
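Both thresholds are one-line computations over the distortion matrix. The sketch below uses a hypothetical $3\times3$ matrix chosen so that $\underline{D} = 0.2$; it is not the matrix of Figure 3.2, whose exact entries are not reproduced here.

```python
def distortion_range(d):
    # d[x][y] is the distortion matrix on X x Y
    # below D_low, some source letter x has no valid reconstruction at all
    D_low = max(min(row) for row in d)
    # at or above D_high, a single letter y* is a valid reconstruction for every x
    D_high = min(max(row[y] for row in d) for y in range(len(d[0])))
    return D_low, D_high

# hypothetical distortion matrix (illustration only)
d = [[0.0, 1.0, 2.0],
     [3.0, 0.0, 1.0],
     [2.0, 3.0, 0.2]]
```

For this matrix the interesting range is $D\in(0.2, 2.0)$; the minimax inequality guarantees $\underline{D}\le\overline{D}$ for any matrix.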
3.2.2 Rate distortion and error exponent for the peak distortion measure
Cf. exercise (2.2.12) of [29]: the peak distortion problem can be treated as a special case of an average distortion problem by defining a new distortion measure $d_D$ on $\mathcal{X}\times\mathcal{Y}$, where
$$d_D(x,y) = \begin{cases} 1 & \text{if } d(x,y) > D \\ 0 & \text{otherwise} \end{cases}$$
with average distortion level 0. Hence the rate distortion theorem and error exponent results follow from the average distortion case. In [29], the rate distortion theorem for the peak distortion measure is derived.

Lemma 4 The rate-distortion function $R(D)$ for peak distortion:
$$R(p_x, D) \triangleq \min_{W\in\mathcal{W}_D} I(p_x, W) \qquad (3.3)$$
Figure 3.2. $d(x,y)$ and valid reconstructions under different peak distortion constraints $D$: $(x,y)$ is linked if $d(x,y)\le D$. Here $\underline{D} = 0.2$ and $\overline{D} = 3$. For $D\in[0.2, 1)$, this is a lossless source coding problem.
where $\mathcal{W}_D$ is the set of all transition matrices that satisfy the peak distortion constraint, i.e. $\mathcal{W}_D = \{W : W(y|x) = 0 \text{ if } d(x,y) > D\}$.
Operationally, this lemma says: for any $\epsilon>0$ and $\delta>0$, for block length $N$ big enough, there exists a code of rate $R(p_x, D)+\epsilon$ such that the peak (maximum) distortion between the source string $x_1^N$ and its reconstruction $y_1^N$ is no bigger than $D$ with probability at least $1-\delta$.
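The minimization in Lemma 4 can be computed with a Blahut–Arimoto style alternating update in which the transition matrix is masked to the allowed pairs $d(x,y)\le D$. This is a sketch, not the thesis's method: the masking trick and the example matrix below are assumptions for illustration (the Hamming-like matrix makes $D<1$ degenerate to lossless coding, as in Figure 3.2).

```python
import math

def peak_rate_distortion(p, d, D, iters=500):
    # R(p, D) = min over W with W(y|x) = 0 whenever d(x,y) > D of I(p, W)
    nx, ny = len(d), len(d[0])
    ok = [[d[x][y] <= D for y in range(ny)] for x in range(nx)]
    if any(not any(row) for row in ok):
        return float('inf')    # D below D_low: some letter has no valid reconstruction
    # start from the uniform channel over the allowed transitions
    W = [[1.0 / sum(ok[x]) if ok[x][y] else 0.0 for y in range(ny)] for x in range(nx)]
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        for x in range(nx):
            z = sum(q[y] for y in range(ny) if ok[x][y])
            W[x] = [q[y] / z if ok[x][y] else 0.0 for y in range(ny)]
    q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    return sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
               for x in range(nx) for y in range(ny) if W[x][y] > 0)

p = [0.7, 0.2, 0.1]
hamming = [[0, 1, 1],
           [1, 0, 1],
           [1, 1, 0]]
```

With this matrix, any $D\in[0,1)$ forces $y = x$, so the routine returns $H(p)\approx 1.157$ bits — the lossless value quoted in Figure 3.3 — while $D\ge1$ allows every transition and gives $R(p,D)=0$.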
This result can be viewed as a simple corollary of the rate distortion theorem in Theo-
rem 10. Similar to the average distortion case, we have the following fixed-to-variable length
coding result for peak distortion measure.
To have $\Pr[d(x_1^N, y_1^N) > D] = 0$, we can implement a universal variable length prefix-free code with code length $l_D(x_1^N)$, where
$$l_D(x_1^N) = N\big(R(p_{x_1^N}, D) + \delta_N\big) \qquad (3.4)$$
where $p_{x_1^N}$ is the empirical distribution of $x_1^N$, and $\delta_N$ goes to 0 as $N$ goes to infinity.
Like the average distortion case, this is a simple corollary of the type covering lemma [29, 28], which is derived from the Johnson–Stein–Lovász theorem [23].
The rate distortion function $R(D)$ for an average distortion measure is in general non-concave (non-$\cap$) in the source distribution $p$, as pointed out in [55]. But for peak distortion, $R(p, D)$ is concave $\cap$ in $p$ for a fixed distortion constraint $D$. The concavity of the rate distortion function under the peak distortion measure is an important property that rate distortion functions of average distortion measures do not have.
Lemma 5 R(p,D) is concave ∩ in p for fixed D.
Proof: The proof is in Appendix C. ¤
Now we study the large deviation properties of lossy source coding under the peak
distortion measure. As a simple corollary of the block-coding error exponents for average
distortion from [55], we have the following result.
Lemma 6 Block coding error exponent under peak distortion:
$$\liminf_{N\to\infty} -\frac{1}{N}\log_2 \Pr[d(x_1^N, y_1^N) > D] = E_D^b(R), \quad \text{where } E_D^b(R) \triangleq \min_{q_x:\, R(q_x, D)>R} D(q_x\|p_x) \qquad (3.5)$$
where $y_1^N$ is the reconstruction of $x_1^N$ using an optimal rate $R$ code.
For lossless source coding, if $R > \log_2|\mathcal{X}|$, the error probability is 0 and the error exponent is infinite. Similarly, for lossy source coding under peak distortion, the error exponent is infinite whenever
$$R > R_D \triangleq \sup_{q_x} R(q_x, D)$$
where $R_D$ depends only on $d(\cdot,\cdot)$ and $D$.
3.3 Numerical Results
Consider the distribution $p_x = (0.7, 0.2, 0.1)$ and the distortion measure on $\mathcal{X}\times\mathcal{Y}$ shown in Figure 3.2. We plot the rate distortion ($R$–$D$) curve under the peak distortion measure in Figure 3.3. The $R$–$D$ curve under peak distortion is a staircase function because the validity of a reconstruction is a 0–1 relation. The delay constrained lossy source coding error exponents are shown in Figure 3.4. The delay constrained error exponent is higher than the block coding error exponent, as was also observed in the lossless case in Chapter 2.
Figure 3.3. The rate distortion curve under peak distortion. For $D_1\in[1,3)$, $R(p_x, D_1) = 0.275$ and $R_{D_1} = 0.997$. For $D_2\in[0.2,1)$, the problem degenerates to lossless coding, so $R(p_x, D_2) = H(p_x) = 1.157$ and $R_{D_2} = \log_2(3) = 1.585$.
3.4 Proof of Theorem 2
In this section, we show that the error exponent in Theorem 2 is both achievable asymp-
totically with delay and that no better exponents are possible.
3.4.1 Converse
The proof of the converse is similar to the upper bound argument in Chapter 2 for lossless source coding with delay constraints; we give the proof in Appendix D.
3.4.2 Achievability
We prove achievability by giving a universal coding scheme illustrated in Figure 3.5.
This coding system looks almost identical to the delay constrained lossless source coding
system in Section 2.4.1 as shown in Figure 2.11. The only difference is the fixed to variable
Figure 3.4. Error exponents of delay constrained lossy source coding and block source coding under the peak distortion measure, plotted against rate for $D_1\in[1,3)$ and $D_2\in[0.2,1)$: each pair of curves starts at $R(p_x, D_1)$ and $R(p_x, D_2)$ respectively, and the exponents go to $\infty$ at $R = R_{D_1}$ and $R = R_{D_2}$.
length encoder. Instead of the universal optimal lossless encoder which is a one-to-one map
from the source block space to the binary string space, we have a variable length lossy
encoder in this chapter. So the average number of bits per symbol for an individual block is
roughly the rate distortion function under peak distortion measure rather than the empirical
entropy of that block.
Figure 3.5. A delay optimal lossy source coding system: streaming data $\ldots, x_i, \ldots$ enters a fixed to variable length encoder; the encoded bits pass through a FIFO buffer drained as a rate $R$ bit stream; the decoder produces the "lossy" reconstructions $\ldots, y_i(i+\Delta), \ldots$ with delay $\Delta$.
A block-length $N$ is chosen that is much smaller than the target end-to-end delays¹, while still being large enough. For a discrete memoryless source $\mathcal{X}$, distortion measure $d(\cdot,\cdot)$, peak distortion constraint $D$, and large block-length $N$, we use the universal variable length prefix-free code in Proposition 4 to encode the $i$th block $\vec{x}_i = x_{(i-1)N+1}^{iN}\in\mathcal{X}^N$. The code length $l_D(\vec{x}_i)$ is as shown in (3.4):
$$NR(p_{\vec{x}_i}, D) \le l_D(\vec{x}_i) \le N\big(R(p_{\vec{x}_i}, D) + \delta_N\big) \qquad (3.6)$$

¹As always, we are interested in the performance with asymptotically large delays $\Delta$.
The overhead δN is negligible for large N , since δN goes to 0 as N goes to infinity. The
binary sequence describing the source is fed into a FIFO buffer described in Figure 3.5. The
buffer is drained at a fixed rate R to obtain the encoded bits. Notice that if the buffer is
empty, the output of the encoder buffer can be gibberish binary bits. The decoder simply
discards these meaningless bits because it is aware that the buffer is empty. The decoder
uses the bits it has received so far to get the reconstructions. If the relevant bits have
not arrived by the time the reconstruction is due, it just guesses and we presume that a
distortion-violation will occur.
As the following proposition indicates, the coding scheme is delay universal, i.e. the
distortion-violation probability goes to 0 with exponent ED(R) for all source symbols and
for all delays ∆ big enough.
Proposition 11 For the iid source $\sim p_x$, peak distortion constraint $D$, and large $N$, using the universal causal code described above, for all $\epsilon>0$, there exists $K<\infty$, s.t. for all $t$, $\Delta$:
$$\Pr[d(\vec{x}_t, \vec{y}_t((t+\Delta)N)) > D] \le K2^{-\Delta N(E_D(R)-\epsilon)}$$
where $\vec{y}_t((t+\Delta)N)$ is the estimate of $\vec{x}_t$ at time $(t+\Delta)N$.
Before proving Proposition 11, we state the following lemma (proved in Appendix E) bound-
ing the probability of atypical source behavior.
Lemma 7 (Source atypicality) For all $\epsilon>0$ and block length $N$ large enough, there exists $K<\infty$, s.t. for all $n$ and for all $r < R_D$:
$$\Pr\Big(\sum_{i=1}^{n} l_D(\vec{x}_i) > nNr\Big) \le K2^{-nN(E_D^b(r)-\epsilon)} \qquad (3.7)$$
Now we are ready to prove Proposition 11.
Proof: At time $(t+\Delta)N$, the decoder cannot decode $\vec{x}_t$ within peak distortion $D$ only if the binary strings describing $\vec{x}_t$ are not all out of the buffer. Since the encoding buffer is FIFO, this means that the number of outgoing bits from some time $t_1$, where $t_1\le tN$, to $(t+\Delta)N$ is less than the number of bits in the buffer at time $t_1$ plus the number of incoming bits from time $t_1$ to time $tN$. Suppose the buffer was last empty at time $tN-nN$, where $0\le n\le t$. Given this condition, the peak distortion constraint is violated only if $\sum_{i=0}^{n-1} l_D(\vec{x}_{t-i}) > (n+\Delta)NR$. Write $l_{D,\max}$ for the longest possible code length:
$$l_{D,\max} \le |\mathcal{X}|\log_2(N+1) + N\log_2|\mathcal{X}|$$
Then $\Pr[\sum_{i=0}^{n-1} l_D(\vec{x}_{t-i}) > (n+\Delta)NR] > 0$ only if $n > \frac{(n+\Delta)NR}{l_{D,\max}} > \frac{\Delta NR}{l_{D,\max}} \triangleq \beta\Delta$. So
$$\Pr\big(d(\vec{x}_t, \vec{y}_t((t+\Delta)N)) > D\big) \le \sum_{n=\beta\Delta}^{t}\Pr\Big[\sum_{i=0}^{n-1} l_D(\vec{x}_{t-i}) > (n+\Delta)NR\Big] \qquad (3.8)$$
$$\stackrel{(a)}{\le} \sum_{n=\beta\Delta}^{t} K_1 2^{-nN\big(E_D^b\big(\frac{(n+\Delta)NR}{nN}\big)-\epsilon_1\big)}$$
$$\stackrel{(b)}{\le} \sum_{n=\gamma\Delta}^{\infty} K_2 2^{-nN(E_D^b(R)-\epsilon_2)} + \sum_{n=\beta\Delta}^{\gamma\Delta} K_2 2^{-\Delta N\big(\min_{\alpha>1}\frac{E_D^b(\alpha R)}{\alpha-1}-\epsilon_2\big)}$$
$$\stackrel{(c)}{\le} K_3 2^{-\gamma\Delta N(E_D^b(R)-\epsilon_2)} + |\gamma\Delta-\beta\Delta|\, K_3 2^{-\Delta N(E_D(R)-\epsilon_2)}$$
$$\stackrel{(d)}{\le} K 2^{-\Delta N(E_D(R)-\epsilon)}$$
where the $K_i$'s and $\epsilon_i$'s are properly chosen real numbers. (a) is true because of Lemma 7. Define $\gamma \triangleq \frac{E_D(R)}{E_D^b(R)}$. In the first part of (b), we only need the fact that $E_D^b(R)$ is non-decreasing in $R$. In the second part of (b), we write $\alpha = \frac{n+\Delta}{n}$ and take the $\alpha$ that minimizes the error exponent. The first term of (c) comes from the sum of a convergent geometric series, and the second is by the definition of $E_D(R)$. (d) is by the definition of $\gamma$. $\blacksquare$
Combining the converse result in Proposition 12 in Appendix D and the achievability
result in Proposition 11, we establish the desired results summarized in Theorem 2.
3.5 Discussions: why peak distortion measure?
For delay constrained lossy source coding, a "focusing" type bound is derived which is quite similar to its lossless source coding counterpart, in the sense that both convert a block coding error exponent into a delay constrained error exponent through the same focusing operator. As shown in Appendix E, the technical reason for the similarity is that the lengths of optimal variable-length codes, or equivalently the rate distortion functions, are concave $\cap$ in the empirical distribution for both lossless source coding and lossy source coding under a peak distortion constraint. This is not the case for rate distortion functions under average distortion measures [29]. Thus the block error exponent for average distortion lossy source coding cannot be converted to a delay constrained error exponent by the same "focusing" operator. This leaves a great amount of future work.
The technical tools in this chapter are the same as those in Chapter 2: the large deviation principle and convex optimization methods. However, the block coding error exponent for lossy source coding under a peak distortion measure cannot be parameterized like the block lossless source coding error exponent. This poses a new challenge to our task. By using only the convexity and concavity of the relevant quantities, and not the particular form of the error exponent, we develop a more general proof scheme for "focusing" type bounds. These techniques can be applied to other problems such as channel coding with perfect feedback, source coding with random arrivals and, undoubtedly, other problems waiting to be discovered.
Chapter 4
Distributed Lossless Source Coding
In this chapter,¹ we begin by reviewing classical results on the error exponents of the distributed source coding problem first studied by Slepian and Wolf [79]. Then we introduce the setup of the delay constrained distributed source coding problem. We then present the main result of this chapter: achievable delay constrained error exponents for delay constrained distributed source coding. The key techniques are sequential random binning, which was
introduced in Section 2.3.3 and a somewhat complicated way of partitioning error events.
This is a “divide and conquer” proof. For each individual error event we use classical block
code bounding techniques. We analyze both maximum likelihood decoding and universal
decoding and show that the achieved exponents are equal. The proof of the equality of
the two error exponents is in Appendix G. From the mathematical point of view, this
proof is the most interesting and challenging part of this thesis. The key tools we use are
the tilted-distribution from the statistics literature and the Lagrange dual from the convex
optimization literature.
4.1 Distributed source coding with delay
In Section A.3 in the appendix, we review the distributed lossless source coding result
in the block coding setup first discovered by David Slepian and Jack Wolf in their classical
paper [79] which is now widely known as the Slepian-Wolf source coding problem. We then1

1The result in this chapter is built upon a research project [12] jointly done by Cheng Chang, Stark
Draper and Anant Sahai. I would like to thank Stark Draper for his great work in the project (especially
the universal decoding part) and for allowing me to put this joint work in my thesis.
Figure 4.1. Slepian-Wolf distributed encoding and joint decoding of a pair of correlated
sources: encoder x maps $x_1, x_2, \ldots, x_N$ at rate $R_x$ and encoder y maps $y_1, y_2, \ldots, y_N$ at
rate $R_y$, with $(x_i, y_i) \sim p_{xy}$, into bit streams for a single joint decoder.
introduce distributed source coding in the delay constrained framework. Then we give our
result on error exponents with delay which first appeared in [12].
4.1.1 Lossless Distributed Source Coding with Delay Constraints
Paralleling the point-to-point lossless source coding with delay problem introduced
in Definition 1 in Section 2.1.1, we have the following setup for distributed source coding
with delay constraints.

Figure 4.2. Delay constrained distributed source coding: rates $R_x = \frac{1}{2}$, $R_y = \frac{1}{3}$, delay $\Delta = 3$.
Encoder x causally emits $a_1(x_1^2), a_2(x_1^4), a_3(x_1^6), \ldots$ and encoder y emits $b_1(y_1^2), b_2(y_1^4), \ldots$
over rate limited channels; the decoder outputs $\widehat{x}_1(4), \widehat{x}_2(5), \widehat{x}_3(6), \ldots$ and $\widehat{y}_1(4), \widehat{y}_2(5), \widehat{y}_3(6), \ldots$
Definition 5 A fixed-delay ∆ sequential encoder-decoder triplet $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is a sequence
of mappings $\mathcal{E}^x_j$, $j = 1, 2, ...$, $\mathcal{E}^y_j$, $j = 1, 2, ...$, and $\mathcal{D}_j$, $j = 1, 2, ...$ The outputs of $\mathcal{E}^x_j$
and $\mathcal{E}^y_j$ are the bits sent from time $j-1$ to $j$:

$$\mathcal{E}^x_j : \mathcal{X}^j \longrightarrow \{0,1\}^{\lfloor jR_x \rfloor - \lfloor (j-1)R_x \rfloor}, \quad \text{e.g. } \mathcal{E}^x_j(x_1^j) = a_{\lfloor (j-1)R_x \rfloor + 1}^{\lfloor jR_x \rfloor},$$
$$\mathcal{E}^y_j : \mathcal{Y}^j \longrightarrow \{0,1\}^{\lfloor jR_y \rfloor - \lfloor (j-1)R_y \rfloor}, \quad \text{e.g. } \mathcal{E}^y_j(y_1^j) = b_{\lfloor (j-1)R_y \rfloor + 1}^{\lfloor jR_y \rfloor}. \tag{4.1}$$

The output of the fixed-delay ∆ decoder $\mathcal{D}_j$ is the decoding decision on $x_j$ and $y_j$ based on
the received binary bits up to time $j + \Delta$:

$$\mathcal{D}_j : \{0,1\}^{\lfloor (j+\Delta)R_x \rfloor} \times \{0,1\}^{\lfloor (j+\Delta)R_y \rfloor} \longrightarrow \mathcal{X} \times \mathcal{Y}$$
$$\mathcal{D}_j(a_1^{\lfloor (j+\Delta)R_x \rfloor}, b_1^{\lfloor (j+\Delta)R_y \rfloor}) = (\widehat{x}_j(j+\Delta), \widehat{y}_j(j+\Delta))$$

where $\widehat{x}_j(j+\Delta)$ and $\widehat{y}_j(j+\Delta)$ are the estimates of $x_j$ and $y_j$ at time $j+\Delta$ and thus have an
end-to-end delay of ∆ seconds. A point-to-point delay constrained source coding system is
illustrated in Figure 2.1.
In this chapter we study two symbol error probabilities. We define the pair of source
estimates of the $(n-\Delta)$th source symbols $(x_{n-\Delta}, y_{n-\Delta})$ at time $n$ as

$$(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) = \mathcal{D}_n\Big(\prod_{j=1}^{n} \mathcal{E}^x_j, \prod_{j=1}^{n} \mathcal{E}^y_j\Big),$$

where $\prod_{j=1}^{n} \mathcal{E}^x_j$ indicates the full $\lfloor nR_x \rfloor$ bit stream from encoder x up to time $n$. We use
$(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n))$ to indicate the estimates of the $(n-\Delta)$th symbol of each source stream
at time $n$; the end-to-end delay is ∆. With these definitions the two error probabilities we
study are

$$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \quad \text{and} \quad \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}].$$
Paralleling the definition of the delay constrained error exponent for point-to-point source
coding in Definition 2, we have the following definition on delay constrained error exponents
for distributed source coding.
Definition 6 A family of rate $(R_x, R_y)$ distributed sequential source codes $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is
said to achieve the delay error exponent pair $(E_x(R_x,R_y), E_y(R_x,R_y))$ if and only if for all
$\epsilon > 0$ there exists $K < \infty$ such that for all $n$ and all $\Delta > 0$:

$$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le K 2^{-\Delta(E_x(R_x,R_y)-\epsilon)} \quad \text{and} \quad \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}] \le K 2^{-\Delta(E_y(R_x,R_y)-\epsilon)}$$
Remark: The error exponents $E_x(R_x,R_y)$ and $E_y(R_x,R_y)$ are functions of both $R_x$ and
$R_y$. In contrast to (A.9), the error exponent we look at is in the delay ∆ rather than the
total sequence length $n$. Naturally, we say the code achieves the delay constrained joint error
exponent $E_{xy}(R_x,R_y)$ if and only if for all $\epsilon > 0$ there exists $K < \infty$ such that for all $n$
and all $\Delta > 0$:

$$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le K 2^{-\Delta(E_{xy}(R_x,R_y)-\epsilon)}$$

By noticing the simple facts that

$$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le 2\max\{\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}],\ \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}]\}$$

and

$$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \ge \max\{\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}],\ \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}]\},$$

it should be clear that

$$E_{xy}(R_x,R_y) = \min\{E_x(R_x,R_y),\ E_y(R_x,R_y)\} \tag{4.2}$$
Following the point-to-point randomized sequential encoder-decoder pair defined in Definition 3,
we have the following definition of randomized sequential encoder-decoder triplets.
Definition 7 A randomized sequential encoder-decoder triplet $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is a sequential
encoder-decoder triplet as defined in Definition 5 with common randomness, shared between
the encoders $\mathcal{E}^x$, $\mathcal{E}^y$ and the decoder2 $\mathcal{D}$. This allows us to randomize the mappings
independently of the source sequences. We only need pair-wise independence; formally,

$$\Pr[\mathcal{E}^x(x_1^i x_{i+1}^n) = \mathcal{E}^x(x_1^i \tilde{x}_{i+1}^n)] = 2^{-(\lfloor nR_x \rfloor - \lfloor iR_x \rfloor)} \le 2 \times 2^{-(n-i)R_x} \tag{4.3}$$

where $\tilde{x}_{i+1} \neq x_{i+1}$, and

$$\Pr[\mathcal{E}^y(y_1^i y_{i+1}^n) = \mathcal{E}^y(y_1^i \tilde{y}_{i+1}^n)] = 2^{-(\lfloor nR_y \rfloor - \lfloor iR_y \rfloor)} \le 2 \times 2^{-(n-i)R_y} \tag{4.4}$$

where $\tilde{y}_{i+1} \neq y_{i+1}$.
2This common randomness is not shared between Ex and Ey.
Recall the definition of bins in (2.11):

$$\mathcal{B}_x(x_1^n) = \{\tilde{x}_1^n \in \mathcal{X}^n : \mathcal{E}^x(\tilde{x}_1^n) = \mathcal{E}^x(x_1^n)\} \tag{4.5}$$

Hence (4.3) is equivalent to the following equality for $\tilde{x}_{i+1} \neq x_{i+1}$:

$$\Pr[\mathcal{E}^x(x_1^i x_{i+1}^n) \in \mathcal{B}_x(x_1^i \tilde{x}_{i+1}^n)] = 2^{-(\lfloor nR_x \rfloor - \lfloor iR_x \rfloor)} \le 2 \times 2^{-(n-i)R_x} \tag{4.6}$$

Similarly for source y:

$$\Pr[\mathcal{E}^y(y_1^i y_{i+1}^n) \in \mathcal{B}_y(y_1^i \tilde{y}_{i+1}^n)] = 2^{-(\lfloor nR_y \rfloor - \lfloor iR_y \rfloor)} \le 2 \times 2^{-(n-i)R_y} \tag{4.7}$$
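The pairwise-independence property (4.6) is easy to illustrate with a toy sequential binner. The sketch below is our own construction (the class `SequentialBinner` and the memoization trick are illustrative assumptions, not the thesis's implementation): at each step the encoder emits $\lfloor jR \rfloor - \lfloor (j-1)R \rfloor$ fresh random bits as a function of the whole prefix, so two sequences that first diverge at symbol $i+1$ share their first $\lfloor iR \rfloor$ bits and collide on the rest with probability exactly $2^{-(\lfloor nR \rfloor - \lfloor iR \rfloor)}$.

```python
import math
import random

class SequentialBinner:
    """Toy sequential random binning: at time j the encoder emits
    floor(j*R) - floor((j-1)*R) fresh random bits that depend on the whole
    source prefix.  Memoizing per prefix plays the role of the common
    randomness shared with the decoder."""

    def __init__(self, rate, seed=0):
        self.rate = rate
        self.rng = random.Random(seed)
        self.cache = {}  # prefix tuple -> bits emitted at that step

    def _bits_at_step(self, prefix):
        if prefix not in self.cache:
            j = len(prefix)
            nbits = math.floor(j * self.rate) - math.floor((j - 1) * self.rate)
            self.cache[prefix] = tuple(self.rng.randrange(2) for _ in range(nbits))
        return self.cache[prefix]

    def encode(self, seq):
        out = []
        for j in range(1, len(seq) + 1):
            out.extend(self._bits_at_step(tuple(seq[:j])))
        return tuple(out)

def collision_rate(rate, n, i, trials=1000):
    """Empirical Pr[bin collision] for two sequences agreeing on the first i
    symbols and differing at symbol i+1; (4.6) predicts exactly
    2^-(floor(n*rate) - floor(i*rate))."""
    hits = 0
    for t in range(trials):
        enc = SequentialBinner(rate, seed=t)
        x = (0,) * n
        x_tilde = (0,) * i + (1,) + (0,) * (n - i - 1)
        hits += enc.encode(x) == enc.encode(x_tilde)
    return hits / trials
```

For $R_x = 1/2$, $n = 8$, $i = 4$ the prediction is $2^{-(4-2)} = 1/4$, and the empirical collision rate matches it closely.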
4.1.2 Main result of Chapter 4: Achievable error exponents
By using a randomized sequential encoder-decoder triplet as in Definition 7, a positive delay
constrained error exponent pair $(E_x, E_y)$ can be achieved whenever the rate pair $(R_x, R_y)$ is in
the classical Slepian-Wolf rate region [79]; both maximum likelihood decoding and
universal decoding achieve positive error exponents there. We summarize the main
results of this chapter in the following two theorems, for maximum likelihood decoding and
universal decoding respectively. Furthermore, in Theorem 5 we show that the two decoding
schemes achieve the same error exponent.
Theorem 3 Let $(R_x, R_y)$ be a rate pair such that $R_x > H(x|y)$, $R_y > H(y|x)$, $R_x + R_y >
H(x,y)$. Then, there exists a randomized encoder pair and maximum likelihood decoder
triplet (per Definition 7) that satisfies the following three decoding criteria.

(i) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le K 2^{-\Delta(E_{ML,SW,x}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{ML,SW,x}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{ML}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{ML}_y(R_x,R_y,\gamma)\Big\}.$$

(ii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}] \le K 2^{-\Delta(E_{ML,SW,y}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{ML,SW,y}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{ML}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{ML}_y(R_x,R_y,\gamma)\Big\}.$$

(iii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le K 2^{-\Delta(E_{ML,SW,xy}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{ML,SW,xy}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{ML}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{ML}_y(R_x,R_y,\gamma)\Big\}.$$

In definitions (i)-(iii),

$$E^{ML}_x(R_x,R_y,\gamma) = \sup_{\rho\in[0,1]}\big[\gamma E_{x|y}(R_x,\rho) + (1-\gamma) E_{xy}(R_x,R_y,\rho)\big]$$
$$E^{ML}_y(R_x,R_y,\gamma) = \sup_{\rho\in[0,1]}\big[\gamma E_{y|x}(R_y,\rho) + (1-\gamma) E_{xy}(R_x,R_y,\rho)\big] \tag{4.8}$$

and

$$E_{xy}(R_x,R_y,\rho) = \rho(R_x+R_y) - \log\Big[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}$$
$$E_{x|y}(R_x,\rho) = \rho R_x - \log\Big[\sum_{y}\Big[\sum_{x} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]$$
$$E_{y|x}(R_y,\rho) = \rho R_y - \log\Big[\sum_{x}\Big[\sum_{y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] \tag{4.9}$$
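For a concrete feel for (4.8) and (4.9), the exponents can be evaluated numerically. The sketch below is our own code (base-2 logarithms, a grid search standing in for the sup over ρ); it uses the symmetric joint pmf of Example 1 later in this chapter.

```python
import math

# Joint pmf indexed as p[x][y]; the symmetric source of Example 1 below.
p = [[0.45, 0.05], [0.05, 0.45]]

def E_xy(Rx, Ry, rho):
    """E_xy(Rx, Ry, rho) of (4.9), logs base 2."""
    s = sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1) for y in (0, 1))
    return rho * (Rx + Ry) - (1.0 + rho) * math.log2(s)

def E_x_given_y(Rx, rho):
    """E_{x|y}(Rx, rho) of (4.9)."""
    inner = [sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1)) for y in (0, 1)]
    return rho * Rx - math.log2(sum(v ** (1.0 + rho) for v in inner))

def E_ML_x(Rx, Ry, gamma, grid=500):
    """E^ML_x(Rx, Ry, gamma) of (4.8): sup over rho in [0,1] on a grid."""
    return max(gamma * E_x_given_y(Rx, g / grid)
               + (1.0 - gamma) * E_xy(Rx, Ry, g / grid)
               for g in range(grid + 1))
```

At $(R_x, R_y) = (0.9, 0.71)$, which lies inside the region since $R_x + R_y \approx 1.61 > H(x,y) \approx 1.47$ bits, $E^{ML}_x(R_x, R_y, 0)$ comes out strictly positive, as Theorem 3 promises.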
Theorem 4 Let $(R_x, R_y)$ be a rate pair such that $R_x > H(x|y)$, $R_y > H(y|x)$, $R_x + R_y >
H(x,y)$. Then, there exists a randomized encoder pair and universal decoder triplet (per
Definition 7) that satisfies the following three decoding criteria.

(i) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le K 2^{-\Delta(E_{UN,SW,x}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{UN,SW,x}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{UN}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{UN}_y(R_x,R_y,\gamma)\Big\}. \tag{4.10}$$

(ii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}] \le K 2^{-\Delta(E_{UN,SW,y}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{UN,SW,y}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{UN}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{UN}_y(R_x,R_y,\gamma)\Big\}. \tag{4.11}$$

(iii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le K 2^{-\Delta(E_{UN,SW,xy}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{UN,SW,xy}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{UN}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{UN}_y(R_x,R_y,\gamma)\Big\}. \tag{4.12}$$

In definitions (i)-(iii),

$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{\bar{x},\bar{y},\tilde{x},\tilde{y}}\ \gamma D(p_{\bar{x}\bar{y}}\|p_{xy}) + (1-\gamma) D(p_{\tilde{x}\tilde{y}}\|p_{xy}) + \big|\gamma[R_x - H(\bar{x}|\bar{y})] + (1-\gamma)[R_x + R_y - H(\tilde{x},\tilde{y})]\big|^+$$

$$E^{UN}_y(R_x,R_y,\gamma) = \inf_{\bar{x},\bar{y},\tilde{x},\tilde{y}}\ \gamma D(p_{\bar{x}\bar{y}}\|p_{xy}) + (1-\gamma) D(p_{\tilde{x}\tilde{y}}\|p_{xy}) + \big|\gamma[R_y - H(\bar{y}|\bar{x})] + (1-\gamma)[R_x + R_y - H(\tilde{x},\tilde{y})]\big|^+ \tag{4.13}$$

where the dummy random variables $(\bar{x},\bar{y})$ and $(\tilde{x},\tilde{y})$ have joint distributions $p_{\bar{x}\bar{y}}$ and $p_{\tilde{x}\tilde{y}}$,
respectively. Recall that the function $|z|^+ = z$ if $z \ge 0$ and $|z|^+ = 0$ if $z < 0$.
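The universal exponents (4.13) can likewise be approximated by brute force over the two dummy joint distributions. The following is our own coarse-grid sketch (base-2 logs; restricting the infimum to grid pmfs means the returned value can only overestimate the true exponent slightly):

```python
import itertools
import math

# Joint pmf of the symmetric Example 1 source.
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def _pmfs(step):
    """All 2x2 pmfs whose masses are multiples of 1/step."""
    for a, b, c in itertools.product(range(step + 1), repeat=3):
        if a + b + c <= step:
            d = step - a - b - c
            yield {(0, 0): a / step, (0, 1): b / step,
                   (1, 0): c / step, (1, 1): d / step}

def _D(q):  # divergence D(q || p), base 2
    return sum(v * math.log2(v / p[k]) for k, v in q.items() if v > 0)

def _H(q):  # joint entropy of q
    return -sum(v * math.log2(v) for v in q.values() if v > 0)

def _H_x_given_y(q):
    hy = -sum(m * math.log2(m)
              for m in (q[(0, 0)] + q[(1, 0)], q[(0, 1)] + q[(1, 1)]) if m > 0)
    return _H(q) - hy

def E_UN_x(Rx, Ry, gamma, step=10):
    """Coarse-grid evaluation of E^UN_x in (4.13)."""
    stats = [(_D(q), _H_x_given_y(q), _H(q)) for q in _pmfs(step)]
    best = float("inf")
    for d_bar, h_cond, _ in stats:          # dummy pair (x-bar, y-bar)
        for d_til, _, h_joint in stats:     # dummy pair (x-tilde, y-tilde)
            rate = gamma * (Rx - h_cond) + (1 - gamma) * (Rx + Ry - h_joint)
            val = gamma * d_bar + (1 - gamma) * d_til + max(0.0, rate)
            best = min(best, val)
    return best
```

For step = 10 the search covers 286 × 286 distribution pairs, which is instantaneous; at $(R_x, R_y) = (0.9, 0.71)$ and $\gamma = 0$ the value is a small positive number, consistent with Theorem 5's claim that it matches the ML exponent.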
Remark: We can compare the joint error events for block and streaming Slepian-Wolf
coding, cf. (4.12) with (A.10). The streaming exponent differs by the extra parameter
γ that must be minimized over. If the minimizing γ = 1, then the block and streaming
exponents are the same. The minimization over γ results from a fundamental difference in
the types of error-causing events that can occur in streaming Slepian-Wolf as compared to
block Slepian-Wolf.
Remark: The error exponents of maximum likelihood and universal decoding in Theo-
rems 3 and 4 are the same. However, because there are new classes of error events possible
in streaming, this needs proof. The equivalence is summarized in the following theorem.
Theorem 5 Let $(R_x, R_y)$ be a rate pair such that $R_x > H(x|y)$, $R_y > H(y|x)$, and $R_x +
R_y > H(x,y)$. Then,

$$E_{ML,SW,x}(R_x,R_y) = E_{UN,SW,x}(R_x,R_y), \tag{4.14}$$

and

$$E_{ML,SW,y}(R_x,R_y) = E_{UN,SW,y}(R_x,R_y). \tag{4.15}$$
Theorem 5 follows directly from the following lemma, shown in the appendix.
Lemma 8 For all $\gamma \in [0,1]$,

$$E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma), \tag{4.16}$$

and

$$E^{ML}_y(R_x,R_y,\gamma) = E^{UN}_y(R_x,R_y,\gamma). \tag{4.17}$$
Proof: This is an important lemma. The proof is extremely laborious, so we put the
details of the proof in Appendix G. The main tool used in the proof is convex optimization.
There is a great deal of detailed analysis of tilted distributions in Appendix G.3, which is
used throughout this thesis. ■
Remark: This theorem allows us to simplify notation. For example, we can define
$E_x(R_x,R_y,\gamma) = E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma)$, and can similarly
define $E_y(R_x,R_y,\gamma)$. Further, since the ML and universal exponents are the same over the
whole rate region, we can define $E_{SW,x}(R_x,R_y) = E_{ML,SW,x}(R_x,R_y) = E_{UN,SW,x}(R_x,R_y)$,
and can similarly define $E_{SW,y}(R_x,R_y)$.
4.2 Numerical Results
To build insight into the differences between the sequential error exponents of Theorems
3-5 and block-coding error exponents, we give some examples of the exponents for binary
sources.
For the point-to-point case, the error exponents of random sequential and block source
coding are identical everywhere in the achievable rate region as can be seen by comparing
Theorem 9 and Propositions 3 and 4. The same is true for source coding with decoder
side-information which will be clear in Chapter 5. For distributed (Slepian-Wolf) source
coding however, the sequential and block error exponents can be different. The reason for
the discrepancy is that a new type of error event can be dominant in Slepian-Wolf source
coding. This is reflected in Theorems 3 - 5 by the minimization over γ. Example 2 illustrates
the impact of this γ term.
Given the sequential random source coding rule, for Slepian-Wolf source coding at very
high rates, where Rx > H(x), the decoder can ignore any information from encoder y and
still decode x with a positive error exponent. However, the decoder could also choose
to decode sources x and y jointly. Figs 4.6.a and 4.6.b illustrate that joint decoding may,
or, surprisingly, may not, help in decoding source x. This is seen by comparing the error exponent
when the decoder ignores the side-information from encoder y (the dotted curves) to the
joint error exponent (the lower solid curves). It seems that when the rate for source y is
low, atypical behaviors of source y can cause joint decoding errors that end up corrupting
x estimates. This holds for both block and sequential coding.
Figure 4.3. Rate region for the Example 1 source (the triangle above the line $R_x + R_y = H(x,y)$);
we focus on the error exponent for source x at fixed encoder y rates $R_y = 0.71$ and $R_y = 0.97$.
4.2.1 Example 1: symmetric source with uniform marginals
Consider a symmetric source where |X | = |Y| = 2, pxy (0, 0) = 0.45, pxy (0, 1) =
pxy (1, 0) = 0.05 and pxy (1, 1) = 0.45. This is a marginally-uniform source: x is
Bernoulli(1/2), y is the output from a BSC with input x , thus y is Bernoulli(1/2) as well.
For this source H(x) = H(y) = log(2) = 1, H(x |y) = H(y |x) = 0.46, H(x , y) = 1.47. The
achievable rate region is the triangle shown in Figure 4.3.
For this source, as will be shown later, the dominant sequential error event lies on the
diagonal line in Fig 4.8. This is to say that

$$E_{SW,x}(R_x,R_y) = E^{BLOCK}_{SW,x}(R_x,R_y) = E^{ML}_x(R_x,R_y,0) = \sup_{\rho\in[0,1]}\big[E_{xy}(R_x,R_y,\rho)\big], \tag{4.18}$$

where $E^{BLOCK}_{SW,x}(R_x,R_y) = \min\{E^{ML}_x(R_x,R_y,0),\ E^{ML}_x(R_x,R_y,1)\}$ as shown in [39].
Similarly for source y:

$$E_{SW,y}(R_x,R_y) = E^{BLOCK}_{SW,y}(R_x,R_y) = E^{ML}_y(R_x,R_y,0) = \sup_{\rho\in[0,1]}\big[E_{xy}(R_x,R_y,\rho)\big]. \tag{4.19}$$
We first show that for this source, $E_{x|y}(R_x,\rho) \ge E_{xy}(R_x,R_y,\rho)$ for all $\rho \ge 0$. By definition:

$$E_{x|y}(R_x,\rho) - E_{xy}(R_x,R_y,\rho)$$
$$= \rho R_x - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] - \Big(\rho(R_x+R_y) - \log\Big[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big)$$
$$= -\rho R_y - \log\Big[2\Big[\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] + \log\Big[2\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}}\Big]^{1+\rho}$$
$$= -\rho R_y - \log[2] + (1+\rho)\log[2]$$
$$= \rho(\log[2] - R_y)$$
$$\ge 0$$

Here the second step uses the symmetry of the source, $\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}} = \sum_x p_{xy}(x,1)^{\frac{1}{1+\rho}}$.
The last inequality is true because we only consider the problem when $R_y \le \log|\mathcal{Y}|$;
otherwise, y is better viewed as perfectly known side-information. Now
$$E^{ML}_x(R_x,R_y,\gamma) = \sup_{\rho\in[0,1]}\big[\gamma E_{x|y}(R_x,\rho) + (1-\gamma)E_{xy}(R_x,R_y,\rho)\big]$$
$$\ge \sup_{\rho\in[0,1]}\big[E_{xy}(R_x,R_y,\rho)\big]$$
$$= E^{ML}_x(R_x,R_y,0)$$

Similarly $E^{ML}_y(R_x,R_y,\gamma) \ge E^{ML}_y(R_x,R_y,0) = E^{ML}_x(R_x,R_y,0)$. Finally,

$$E_{SW,x}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} \frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\Big\} = E^{ML}_x(R_x,R_y,0)$$

In particular $E_x(R_x,R_y,1) \ge E_x(R_x,R_y,0)$, so

$$E^{BLOCK}_{SW,x}(R_x,R_y) = \min\{E^{ML}_x(R_x,R_y,0),\ E^{ML}_x(R_x,R_y,1)\} = E^{ML}_x(R_x,R_y,0)$$

The same proof holds for source y.
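The identity driving this example, $E_{x|y}(R_x,\rho) - E_{xy}(R_x,R_y,\rho) = \rho(\log 2 - R_y)$, is easy to check numerically. The snippet below is our own verification code (base-2 logarithms, so $\log 2 = 1$):

```python
import math

p = [[0.45, 0.05], [0.05, 0.45]]  # Example 1 source

def exponent_gap(Rx, Ry, rho):
    """E_{x|y}(Rx, rho) - E_{xy}(Rx, Ry, rho), both from (4.9), base-2 logs."""
    inner = [sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1)) for y in (0, 1)]
    e_cond = rho * Rx - math.log2(sum(v ** (1.0 + rho) for v in inner))
    s = sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1) for y in (0, 1))
    e_joint = rho * (Rx + Ry) - (1.0 + rho) * math.log2(s)
    return e_cond - e_joint

def joint_entropy():
    """H(x,y) of the Example 1 source in bits (quoted as 1.47 above)."""
    return -sum(v * math.log2(v) for row in p for v in row)
```

For any $\rho$ the gap equals $\rho(1 - R_y)$ bits exactly, and it is nonnegative whenever $R_y \le 1$, matching the derivation above.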
In Fig 4.4 we plot the joint sequential/block coding error exponents $E_{SW,x}(R_x,R_y) =
E^{BLOCK}_{SW,x}(R_x,R_y)$; the error exponents are positive iff $R_x > H(x,y) - R_y = 1.47 - R_y$.
4.2.2 Example 2: non-symmetric source
Consider a non-symmetric source where |X | = |Y| = 2, pxy (0, 0) = 0.1, pxy (0, 1) =
pxy (1, 0) = 0.05 and pxy (1, 1) = 0.8. For this source H(x) = H(y) = 0.42,
Figure 4.4. Error exponent plot: $E_{SW,x}(R_x,R_y)$ as a function of $R_x$, plotted for $R_y = 0.71$
and $R_y = 0.97$. Here $E_{SW,x}(R_x,R_y) = E^{BLOCK}_{SW,x}(R_x,R_y) = E_{SW,y}(R_x,R_y) = E^{BLOCK}_{SW,y}(R_x,R_y)$
and $E_x(R_x) = 0$.
$H(x|y) = H(y|x) = 0.29$ and $H(x,y) = 0.71$. The achievable rate region is shown
in Fig 4.5. In Figs 4.6.a, 4.6.b, 4.6.c and 4.6.d, we compare the joint sequential error
exponent $E_{SW,x}(R_x,R_y)$, the joint block coding error exponent $E^{BLOCK}_{SW,x}(R_x,R_y) =
\min\{E_x(R_x,R_y,0),\ E_x(R_x,R_y,1)\}$ as shown in [39], and the individual error exponent for
source x, $E_x(R_x)$, as shown in Corollary 4. Notice that $E_x(R_x) > 0$ only if $R_x > H(x)$. In
Fig 4.7, we compare the sequential error exponent for source y, $E_{SW,y}(R_x,R_y)$, the block
coding error exponent for source y, $E^{BLOCK}_{SW,y}(R_x,R_y) = \min\{E_y(R_x,R_y,0),\ E_y(R_x,R_y,1)\}$,
and $E_y(R_y)$, which is a constant since we fix $R_y$.
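The quoted Example 2 entropies ($H(x) = H(y) = 0.42$, $H(x|y) = H(y|x) = 0.29$, $H(x,y) = 0.71$) are reproduced below with natural logarithms; that unit choice is our own reading of the numbers, since they do not match base-2 values for this pmf.

```python
import math

# Example 2 source: p(0,0) = 0.1, p(0,1) = p(1,0) = 0.05, p(1,1) = 0.8.
p = {(0, 0): 0.1, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.8}

H_joint = -sum(v * math.log(v) for v in p.values())       # H(x, y) in nats
px0 = p[(0, 0)] + p[(0, 1)]                               # Pr[x = 0] = 0.15
H_x = -(px0 * math.log(px0) + (1 - px0) * math.log(1 - px0))
# By symmetry H(y) = H(x) here, so H(x|y) = H(x, y) - H(y) = H_joint - H_x.
H_x_given_y = H_joint - H_x
```

Running this gives $H(x) \approx 0.423$, $H(x|y) \approx 0.286$ and $H(x,y) \approx 0.708$ nats, matching the rounded values in the text.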
For $R_y = 0.50$, as shown in Figs 4.6.a-b and 4.7.a-b, the difference between the block
coding and sequential coding error exponents is very small for both sources x and y. More
interestingly, as shown in Fig 4.6.a, because the rate of source y is low, it is more likely
that a decoding error occurs due to the atypical behavior of source y. So as $R_x$ increases, it is
sometimes better to ignore source y and decode x individually. This is evident as the dotted
curve rises above the solid curves.
For $R_y = 0.71$, as shown in Figs 4.6.c-d and 4.7.c-d, since the rate for source y is high
enough, source y can be decoded individually with a positive error exponent, as shown in
Fig 4.7.c. But as the rate of source x increases, joint decoding gives a better error exponent.
When $R_x$ is very high, we then observe a saturation of the error exponent on y, as if source
Figure 4.5. Rate region for the Example 2 source (above the line $R_x + R_y = H(x,y)$); we
focus on the error exponent for source x at fixed encoder y rates $R_y = 0.50$ and $R_y = 0.71$.
x is known perfectly to the decoder! This is illustrated by the flat part of the solid curves
in Fig 4.7.c.
4.3 Proofs
In this section we provide the proofs of Theorems 3 and 4, which concern delay
constrained distributed source coding. As with the proofs of the point-to-point source
coding results in Propositions 3 and 4 in Section 2.3.3, we start by proving the case for
maximum likelihood decoding. The universal result follows.
4.3.1 ML decoding
ML decoding rule
First we explain the simple maximum likelihood decoding rule. Very similar to the ML
decoding rule for the point-to-point source coding case in (2.16), the decoding rule is to
simply pick the most likely source sequence pair under the joint distribution pxy .
Denote by $(\widehat{x}_1^n(n), \widehat{y}_1^n(n))$ the estimate of the source sequence pair $(x_1^n, y_1^n)$ at time $n$.
Figure 4.6. Error exponent plots for source x for fixed $R_y$ as $R_x$ varies.
$R_y = 0.50$: (a) Solid curve: $E_{SW,x}(R_x,R_y)$, dashed curve $E^{BLOCK}_{SW,x}(R_x,R_y)$ and dotted
curve: $E_x(R_x)$; notice that $E_{SW,x}(R_x,R_y) \le E^{BLOCK}_{SW,x}(R_x,R_y)$ but the difference is small.
(b) $10\log_{10}\big(E^{BLOCK}_{SW,x}(R_x,R_y)/E_{SW,x}(R_x,R_y)\big)$; this shows the difference is there at high rates.
$R_y = 0.71$: (c) Solid curve $E_{SW,x}(R_x,R_y)$, dashed curve $E^{BLOCK}_{SW,x}(R_x,R_y)$ and dotted
curve: $E_x(R_x)$; again $E_{SW,x}(R_x,R_y) \le E^{BLOCK}_{SW,x}(R_x,R_y)$ but the difference is extremely small.
(d) $10\log_{10}\big(E^{BLOCK}_{SW,x}(R_x,R_y)/E_{SW,x}(R_x,R_y)\big)$; this shows the difference is there at intermediate low rates.
Figure 4.7. Error exponent plots for source y for fixed $R_y$ as $R_x$ varies.
$R_y = 0.50$: (a) Solid curve: $E_{SW,y}(R_x,R_y)$ and dashed curve $E^{BLOCK}_{SW,y}(R_x,R_y)$;
$E_{SW,y}(R_x,R_y) \le E^{BLOCK}_{SW,y}(R_x,R_y)$, and the difference is extremely small. $E_y(R_y)$ is 0
because $R_y = 0.50 < H(y)$.
(b) $10\log_{10}\big(E^{BLOCK}_{SW,y}(R_x,R_y)/E_{SW,y}(R_x,R_y)\big)$; this shows the two exponents are not identical everywhere.
$R_y = 0.71$: (c) Solid curve: $E_{SW,y}(R_x,R_y)$, dashed curve $E^{BLOCK}_{SW,y}(R_x,R_y)$ with
$E_{SW,y}(R_x,R_y) \le E^{BLOCK}_{SW,y}(R_x,R_y)$, and $E_y(R_y)$ constant, shown as a dotted line.
(d) $10\log_{10}\big(E^{BLOCK}_{SW,y}(R_x,R_y)/E_{SW,y}(R_x,R_y)\big)$; notice how the gap goes to infinity when we leave the Slepian-Wolf region.
$$(\widehat{x}_1^n(n), \widehat{y}_1^n(n)) = \arg\max_{\tilde{x}_1^n \in \mathcal{B}_x(x_1^n),\ \tilde{y}_1^n \in \mathcal{B}_y(y_1^n)} p_{xy}(\tilde{x}_1^n, \tilde{y}_1^n) = \arg\max_{\tilde{x}_1^n \in \mathcal{B}_x(x_1^n),\ \tilde{y}_1^n \in \mathcal{B}_y(y_1^n)} \prod_{i=1}^n p_{xy}(\tilde{x}_i, \tilde{y}_i) \tag{4.20}$$

At time $n$, the decoder simply picks the sequence pair $\widehat{x}_1^n(n)$ and $\widehat{y}_1^n(n)$ with the highest
likelihood among the pairs in the same bins as the true sequences $x_1^n$ and $y_1^n$. Now the
estimate of source symbol $n-\Delta$ is simply the $(n-\Delta)$th symbol of $\widehat{x}_1^n(n)$ and $\widehat{y}_1^n(n)$, denoted
by $(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n))$.
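To make the rule concrete, here is a toy block-level simulation, entirely our own construction (uniform random binning plus an exhaustive ML search over the candidate bins, feasible only for tiny n):

```python
import itertools
import math
import random

# Symmetric joint pmf of Example 1, used as the toy source.
P = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def random_bins(n, rate, rng):
    """Assign every length-n binary sequence a uniform bin of floor(n*rate) bits."""
    nbits = math.floor(n * rate)
    return {s: rng.randrange(2 ** nbits)
            for s in itertools.product((0, 1), repeat=n)}

def ml_decode(bin_x, bin_y, bx, by, n):
    """Pick the most likely pair among candidates whose bins match, as in (4.20)."""
    best, best_lik = None, -1.0
    for xs in itertools.product((0, 1), repeat=n):
        if bin_x[xs] != bx:
            continue
        for ys in itertools.product((0, 1), repeat=n):
            if bin_y[ys] != by:
                continue
            lik = math.prod(P[(a, b)] for a, b in zip(xs, ys))
            if lik > best_lik:
                best, best_lik = (xs, ys), lik
    return best

def error_rate(n, Rx, Ry, trials=150, seed=0):
    """Empirical pair-error rate of the toy scheme."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        bin_x, bin_y = random_bins(n, Rx, rng), random_bins(n, Ry, rng)
        pairs = rng.choices(list(P), weights=list(P.values()), k=n)
        xs, ys = tuple(a for a, _ in pairs), tuple(b for _, b in pairs)
        errors += ml_decode(bin_x, bin_y, bin_x[xs], bin_y[ys], n) != (xs, ys)
    return errors / trials
```

As expected, decoding is hopeless when both rates are far below the Slepian-Wolf bounds (the bins are too coarse) and improves sharply once the rates are generous.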
Details of the proof of Theorem 3
In Theorems 3 and 4 three error events are considered: (i) $[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}]$, (ii)
$[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}]$, and (iii) $[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})]$. We develop the error
exponent for case (i). The error exponent for case (ii) follows from a similar derivation, and
that of case (iii) is the minimum of the exponents of cases (i) and (ii) by the simple union
bound argument leading to (4.2).

To lead to the decoding error $[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}]$ there must be some spurious source
pair $(\tilde{x}_1^n, \tilde{y}_1^n)$ that satisfies three conditions: (i) $\tilde{x}_1^n \in \mathcal{B}_x(x_1^n)$ and $\tilde{y}_1^n \in \mathcal{B}_y(y_1^n)$, (ii) it must be
at least as likely as the true pair, $p_{xy}(\tilde{x}_1^n, \tilde{y}_1^n) \ge p_{xy}(x_1^n, y_1^n)$, and (iii) $\tilde{x}_l \neq x_l$ for some $l \le n-\Delta$.
The error probability is

$$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le \Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$$
$$= \sum_{x_1^n, y_1^n} \Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta} \mid x_1^n, y_1^n]\, p_{xy}(x_1^n, y_1^n)$$
$$\le \sum_{x_1^n, y_1^n} p_{xy}(x_1^n, y_1^n) \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} \Pr\big[\exists\, (\tilde{x}_1^n, \tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,x_1^n,y_1^n)\ \text{s.t. } p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n) \ge p_{xy}(x_1^n,y_1^n)\big] \tag{4.21}$$
$$= \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1}\ \sum_{x_1^n,y_1^n} p_{xy}(x_1^n,y_1^n)\, \Pr\big[\exists\, (\tilde{x}_1^n, \tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,x_1^n,y_1^n)\ \text{s.t. } p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n) \ge p_{xy}(x_1^n,y_1^n)\big]$$
$$= \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} p_n(l,k). \tag{4.22}$$
In (4.21) we decompose the error event into a number of mutually exclusive events by
partitioning all spurious source pairs $(\tilde{x}_1^n, \tilde{y}_1^n)$ into sets $\mathcal{F}_n(l, k, x_1^n, y_1^n)$, defined by the
times $l$ and $k$ at which $\tilde{x}_1^n$ and $\tilde{y}_1^n$ diverge from the realized source sequences. The set
$\mathcal{F}_n(l, k, x_1^n, y_1^n)$ is defined as
$$\mathcal{F}_n(l,k,x_1^n,y_1^n) = \{(\tilde{x}_1^n,\tilde{y}_1^n) \in \mathcal{X}^n\times\mathcal{Y}^n\ \text{s.t. } \tilde{x}_1^{l-1} = x_1^{l-1},\ \tilde{x}_l \neq x_l,\ \tilde{y}_1^{k-1} = y_1^{k-1},\ \tilde{y}_k \neq y_k\}. \tag{4.23}$$
In contrast to streaming point-to-point or side-information coding (cf. (4.23) with (2.20)),
the partition is now doubly-indexed. To find the dominant error event, we must search over
both indices. Having two dimensions to search over results in an extra minimization when
calculating the error exponent (and leads to the infimum over γ in Theorem 3).
Finally, to get (4.22) we define $p_n(l,k)$ as

$$p_n(l,k) = \sum_{x_1^n,y_1^n} p_{xy}(x_1^n,y_1^n) \times \Pr\big[\exists\, (\tilde{x}_1^n,\tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,x_1^n,y_1^n)\ \text{s.t. } p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n) \ge p_{xy}(x_1^n,y_1^n)\big].$$
The following lemma provides an upper bound on $p_n(l,k)$:

Lemma 9

$$p_n(l,k) \le 4\times 2^{-(n-l+1)E_x(R_x,R_y,\frac{k-l}{n-l+1})} \quad \text{if } l \le k,$$
$$p_n(l,k) \le 4\times 2^{-(n-k+1)E_y(R_x,R_y,\frac{l-k}{n-k+1})} \quad \text{if } l \ge k, \tag{4.24}$$

where $E_x(R_x,R_y,\gamma)$ and $E_y(R_x,R_y,\gamma)$ are defined in (4.8) and (4.9) respectively. Notice
that $l, k \le n+1$; for $l \le k$, $\frac{k-l}{n-l+1} \in [0,1]$ serves as $\gamma$ in the error exponent $E_x(R_x,R_y,\gamma)$.
Similarly for $l \ge k$.
Proof: The proof is quite similar to the Chernoff bound [8] argument in the proof of
Proposition 3 for the point-to-point source coding problem. We put the details of the proof
in Appendix F.1. □
We use Lemma 9 together with (4.22) to bound $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$, which is an upper
bound on $\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}]$, in two distinct cases. The first, simpler case is when
$\inf_{\gamma\in[0,1]} E_y(R_x,R_y,\gamma) > \inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma)$. To bound $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$ in this
case, we split the sum over the $p_n(l,k)$ into two terms, as visualized in Fig 4.8. There are
$(n+1)\times(n-\Delta)$ such events to account for (those inside the box). The probabilities of the
events within each oval are summed together to give an upper bound on $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$.
We add extra probabilities outside of the box but within the ovals to make the summation
symmetric and thus simpler. Those extra error events do not impact the error exponent because
$\inf_{\gamma\in[0,1]} E_y(R_x,R_y,\gamma) \ge \inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma)$. The possible dominant error events
are highlighted in Figure 4.8. Thus,
$$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} p_n(l,k) + \sum_{k=1}^{n-\Delta}\sum_{l=k}^{n+1} p_n(l,k) \tag{4.25}$$
$$\le 4\times\sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} 2^{-(n-l+1)\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)} + 4\times\sum_{k=1}^{n-\Delta}\sum_{l=k}^{n+1} 2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \tag{4.26}$$
$$= 4\times\sum_{l=1}^{n-\Delta}(n-l+2)2^{-(n-l+1)\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)} + 4\times\sum_{k=1}^{n-\Delta}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)}$$
$$\le 8\sum_{l=1}^{n-\Delta}(n-l+2)2^{-(n-l+1)\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)} \tag{4.27}$$
$$\le \sum_{l=1}^{n-\Delta} C_1 2^{-(n-l+2)[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]} \tag{4.28}$$
$$\le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]} \tag{4.29}$$

(4.25) follows directly from (4.22): in the first term $l \le k$, in the second term $l \ge k$.
In (4.26) we use Lemma 9. In (4.27) we use the assumption that $\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) >
\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$. In (4.28) the $\epsilon > 0$ results from incorporating the polynomial into
the exponent, and can be chosen as small as desired. Combining terms and summing
out the decaying exponential yields the bound (4.29).
The second, more involved case is when
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) < \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$. To bound $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$, we
could use the same bounding technique used in the first case. This gives the error exponent
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)$, which is generally smaller than what we can get by dividing the
error events in a new scheme as shown in Figure 4.9. In this situation we split (4.22) into
three terms, as visualized in Fig 4.9. Just as in the first case shown in Fig 4.8, there are
$(n+1)\times(n-\Delta)$ such events to account for (those inside the box). The error events are
partitioned into 3 regions. Regions 2 and 3 are separated by $k^*(l)$, drawn as a dotted line. In
region 3, we add extra probabilities outside of the box but within the ovals to make the
summation simpler. Those extra error events do not affect the error exponent, as shown in
Figure 4.8. Two-dimensional plot of the error probabilities $p_n(l,k)$ ($l$: index at which $\tilde{x}_1^n$
and $x_1^n$ first diverge; $k$: index at which $\tilde{y}_1^n$ and $y_1^n$ first diverge), corresponding to error
events $(l,k)$, contributing to $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$ in the situation where
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) \ge \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$.
Figure 4.9. Two-dimensional plot of the error probabilities $p_n(l,k)$ (axes as in Figure 4.8;
Regions 1-3, with Regions 2 and 3 separated by the boundary $k^*(l)$ and Region 3 capped by
the horizontal line $k = k^*(n-\Delta)-1$), corresponding to error events $(l,k)$, contributing to
$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$ in the situation where
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) < \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$.
the proof. The possible dominant error events are highlighted in Fig 4.9. Thus,
$$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} p_n(l,k) + \sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} p_n(l,k) + \sum_{l=1}^{n-\Delta}\sum_{k=1}^{k^*(l)-1} p_n(l,k) \tag{4.30}$$

where $\sum_{k=1}^{0} p_n(l,k) = 0$. The lower boundary of Region 2 is $k^*(l) \ge 1$, a function of $n$ and $l$:

$$k^*(l) = \max\Big\{1,\ n+1 - \Big\lceil \frac{\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)}{\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \Big\rceil (n+1-l)\Big\} = \max\{1,\ n+1-G(n+1-l)\} \tag{4.31}$$
where we use G to denote the ceiling of the ratio of exponents. Note that when
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) > \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$, then $G = 1$ and Region 2 of Fig. 4.9
disappears; in other words, the middle term of (4.30) equals zero. This is the first case
considered. We now consider the cases when $G \ge 2$ (because of the ceiling function, G is a
positive integer).
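The boundary (4.31) is simple enough to tabulate. The helper below is our own sketch (`Ex_inf` and `Ey_inf` stand for the two infimum exponents):

```python
import math

def k_star(l, n, Ex_inf, Ey_inf):
    """Region-2 lower boundary (4.31): k*(l) = max(1, n+1 - G(n+1-l)),
    where G is the ceiling of the ratio of the two infimum exponents."""
    G = math.ceil(Ex_inf / Ey_inf)
    return max(1, n + 1 - G * (n + 1 - l))
```

When the y-exponent dominates, $G = 1$ and $k^*(l) = \max(1, l) = l$, so the middle sum in (4.30), which runs from $k = k^*(l)$ to $l-1$, is empty, recovering the first case.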
The first term of (4.30), i.e., Region 1 in Fig. 4.9 where $l \le k$, is bounded in the same
way as the first term of (4.25), giving

$$\sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} p_n(l,k) \le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]}. \tag{4.32}$$
In Fig. 4.9, Region 2 is upper bounded by the 45-degree line and lower bounded by
$k^*(l)$. The second term of (4.30), corresponding to this region where $l \ge k$, satisfies

$$\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} p_n(l,k) \le 4\times\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} 2^{-(n-k+1)E_y(R_x,R_y,\frac{l-k}{n-k+1})}$$
$$= 4\times\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} 2^{-(n-l+1)\frac{n-k+1}{n-l+1}E_y(R_x,R_y,\frac{l-k}{n-k+1})} \tag{4.33}$$
$$\le 4\times\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} 2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)} \tag{4.34}$$
$$= 4\times\sum_{l=1}^{n-\Delta}(l-k^*(l))2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)} \tag{4.35}$$

In (4.33) we note that $l \ge k$, so we can define $\gamma = \frac{l-k}{n-k+1}$ as in (4.34); then $\frac{n-k+1}{n-l+1} = \frac{1}{1-\gamma}$.
The third term of (4.30), i.e., the intersection of Region 3 and the “box” in Fig. 4.9
where $l \ge k$, can be bounded as

$$\sum_{l=1}^{n-\Delta}\sum_{k=1}^{k^*(l)-1} p_n(l,k) \le \sum_{l=1}^{n+1}\sum_{k=1}^{\min\{l,\,k^*(n-\Delta)-1\}} p_n(l,k) \tag{4.36}$$
$$= \sum_{k=1}^{k^*(n-\Delta)-1}\sum_{l=k}^{n+1} p_n(l,k) \tag{4.37}$$
$$\le 4\times\sum_{k=1}^{k^*(n-\Delta)-1}\sum_{l=k}^{n+1} 2^{-(n-k+1)E_y(R_x,R_y,\frac{l-k}{n-k+1})}$$
$$\le 4\times\sum_{k=1}^{k^*(n-\Delta)-1}\sum_{l=k}^{n+1} 2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)}$$
$$\le 4\times\sum_{k=1}^{k^*(n-\Delta)-1}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \tag{4.38}$$

In (4.36) we note that $l \le n-\Delta$, thus $k^*(n-\Delta)-1 \ge k^*(l)-1$; also $l \ge 1$, so $l \ge k^*(l)-1$.
This can be visualized in Fig 4.9 as extending the summation from the intersection of the
“box” and Region 3 to the whole region under the diagonal line and the horizontal line
$k = k^*(n-\Delta)-1$. In (4.37) we simply switch the order of the summation.
Finally, when $G \ge 2$, we substitute (4.32), (4.35), and (4.38) into (4.30) to give

$$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}] \le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]}$$
$$\qquad + 4\times\sum_{l=1}^{n-\Delta}(l-k^*(l))2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)} \tag{4.39}$$
$$\qquad + 4\times\sum_{k=1}^{k^*(n-\Delta)-1}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)}$$
$$\le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]}$$
$$\qquad + 4\times\sum_{l=1}^{n-\Delta}(l-n-1+G(n+1-l))2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)}$$
$$\qquad + 4\times\sum_{k=1}^{n+1-G(\Delta+1)}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \tag{4.40}$$
$$\le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]} + (G-1)C_3 2^{-\Delta[\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)-\epsilon]} + C_4 2^{-[\Delta G\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)-\epsilon]}$$
$$\le C_5 2^{-\Delta\big[\min\{\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\}-\epsilon\big]}. \tag{4.41}$$

To get (4.40), we use the fact that $k^*(l) \ge n+1-G(n+1-l)$ from the definition of $k^*(l)$
in (4.31) to upper bound the second term. We exploit the definition of G to convert the
exponent in the third term to $\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$. Finally, to get (4.41) we gather the
constants together, sum out over the decaying exponentials, and are limited by the smaller
of the two exponents.
Note: in the proof of Theorem 3, we regularly double count error events or add
small extra probabilities to make the summations simpler. It should be clear that
the error exponent is not affected. ■
4.3.2 Universal Decoding
Universal decoding rule
As discussed in Section 2.3.3, we do not use a pairwise minimum joint-entropy decoder
because the polynomial term in n would multiply the exponential decay in ∆. Analogous
to the sequential decoder used there, we use a “weighted suffix entropy” decoder. The
decoding starts by first identifying candidate sequence pairs as those that agree with the
encoding bit streams up to time $n$, i.e., $\tilde{x}_1^n \in \mathcal{B}_x(x_1^n)$, $\tilde{y}_1^n \in \mathcal{B}_y(y_1^n)$. For any one of the
$|\mathcal{B}_x(x_1^n)||\mathcal{B}_y(y_1^n)|$ sequence pairs in the candidate set, i.e., $(\tilde{x}_1^n, \tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n)$, we
compute $(n+1)\times(n+1)$ weighted entropies:

$$H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = H(\tilde{x}_l^n, \tilde{y}_l^n), \quad l = k$$
$$H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = \frac{k-l}{n+1-l}H(\tilde{x}_l^{k-1}|\tilde{y}_l^{k-1}) + \frac{n+1-k}{n+1-l}H(\tilde{x}_k^n, \tilde{y}_k^n), \quad l < k$$
$$H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = \frac{l-k}{n+1-k}H(\tilde{y}_k^{l-1}|\tilde{x}_k^{l-1}) + \frac{n+1-l}{n+1-k}H(\tilde{x}_l^n, \tilde{y}_l^n), \quad l > k.$$
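The three cases above can be computed directly from empirical (plug-in) entropies of the indicated segments. The sketch below is our own helper (names and the plug-in estimator are our assumptions; indices are 1-based as in the text):

```python
import math
from collections import Counter

def _H(samples):
    """Empirical entropy in bits of a list of symbols."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def _H_cond(a, b):
    """Empirical conditional entropy H(a|b) = H(a,b) - H(b)."""
    return _H(list(zip(a, b))) - _H(list(b))

def suffix_entropy(l, k, xs, ys):
    """Weighted suffix entropy H_S(l, k, xs, ys), following the three cases."""
    n = len(xs)
    if l == k:
        return _H(list(zip(xs[l - 1:], ys[l - 1:])))
    if l < k:
        w = (k - l) / (n + 1 - l)
        return w * _H_cond(xs[l - 1:k - 1], ys[l - 1:k - 1]) + \
            (1 - w) * _H(list(zip(xs[k - 1:], ys[k - 1:])))
    w = (l - k) / (n + 1 - k)
    return w * _H_cond(ys[k - 1:l - 1], xs[k - 1:l - 1]) + \
        (1 - w) * _H(list(zip(xs[l - 1:], ys[l - 1:])))
```

A constant pair has zero weighted suffix entropy, while an alternating pair decoded jointly has one bit per symbol, as one would expect from the empirical entropies involved.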
We define the score of $(\tilde{x}_1^n, \tilde{y}_1^n)$ as the pair of integers $(i_x(\tilde{x}_1^n,\tilde{y}_1^n), i_y(\tilde{x}_1^n,\tilde{y}_1^n))$ s.t.

$$i_x(\tilde{x}_1^n,\tilde{y}_1^n) = \max\{i : H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) < H_S(l,k,\bar{x}_1^n,\bar{y}_1^n)\ \forall k = 1,2,...,n+1,\ \forall l = 1,2,...,i,$$
$$\qquad\qquad \forall (\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n)\} \tag{4.42}$$

$$i_y(\tilde{x}_1^n,\tilde{y}_1^n) = \max\{i : H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) < H_S(l,k,\bar{x}_1^n,\bar{y}_1^n)\ \forall l = 1,2,...,n+1,\ \forall k = 1,2,...,i,$$
$$\qquad\qquad \forall (\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n)\} \tag{4.43}$$

While $\mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n)$ is the same set as defined in (4.23), we repeat the definition here for
convenience:

$$\mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = \{(\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{X}^n\times\mathcal{Y}^n\ \text{s.t. } \bar{x}_1^{l-1} = \tilde{x}_1^{l-1},\ \bar{x}_l \neq \tilde{x}_l,\ \bar{y}_1^{k-1} = \tilde{y}_1^{k-1},\ \bar{y}_k \neq \tilde{y}_k\}.$$
The definition of $(i_x(\tilde{x}_1^n,\tilde{y}_1^n), i_y(\tilde{x}_1^n,\tilde{y}_1^n))$ can be visualized by the following procedure.
As shown in Fig. 4.10, for all $1 \le l, k \le n+1$, if there exists $(\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n) \cap
\mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n)$ s.t. $H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) \ge H_S(l,k,\bar{x}_1^n,\bar{y}_1^n)$, then we mark $(l,k)$ on the plane
as shown in Fig. 4.10. Eventually we pick the maximum integer which is smaller than all
marked x-coordinates as $i_x(\tilde{x}_1^n,\tilde{y}_1^n)$ and the maximum integer which is smaller than all
marked y-coordinates as $i_y(\tilde{x}_1^n,\tilde{y}_1^n)$. The score of $(\tilde{x}_1^n,\tilde{y}_1^n)$ tells us the first branching
point (in either x or y) at which a “better” sequence pair, one with a smaller weighted
entropy, exists.
Define the set of winners as the sequences (not sequence pairs) with the maximum score:
$$\mathcal W_x^n = \big\{\bar x_1^n \in B_x(x_1^n) : \exists\, \bar y_1^n \in B_y(y_1^n) \text{ s.t. } i_x(\bar x_1^n, \bar y_1^n) \ge i_x(\tilde x_1^n, \tilde y_1^n),\ \forall (\tilde x_1^n, \tilde y_1^n) \in B_x(x_1^n)\times B_y(y_1^n)\big\}$$
$$\mathcal W_y^n = \big\{\bar y_1^n \in B_y(y_1^n) : \exists\, \bar x_1^n \in B_x(x_1^n) \text{ s.t. } i_y(\bar x_1^n, \bar y_1^n) \ge i_y(\tilde x_1^n, \tilde y_1^n),\ \forall (\tilde x_1^n, \tilde y_1^n) \in B_x(x_1^n)\times B_y(y_1^n)\big\}$$
Figure 4.10. 2D interpretation of the score $(i_x(\bar x_1^n,\bar y_1^n),\, i_y(\bar x_1^n,\bar y_1^n))$ of a sequence pair $(\bar x_1^n, \bar y_1^n)$. If there exists a sequence pair in $F_n(l,k,\bar x_1^n,\bar y_1^n)$ with smaller or equal weighted entropy, then $(l,k)$ is marked with a solid dot. The score $i_x(\bar x_1^n,\bar y_1^n)$ is the largest integer which is smaller than all the $x$-coordinates of the marked points; similarly for $i_y(\bar x_1^n,\bar y_1^n)$.
Then arbitrarily pick one sequence from $\mathcal W_x^n$ and one from $\mathcal W_y^n$ as the decision $(\hat x_1^n(n), \hat y_1^n(n))$ at decision time $n$.
Details of the proof of Theorem 4

Using the above universal decoding rule, we give an upper bound on the decoding error $\Pr[\hat x_{n-\Delta}(n) \ne x_{n-\Delta}]$, and hence derive an achievable error exponent. We bound the probability that there exists a sequence pair in $F_n(l,k,x_1^n,y_1^n) \cap \big(B_x(x_1^n)\times B_y(y_1^n)\big)$ with a smaller weighted suffix entropy as:
$$p_n(l,k) = \sum_{x_1^n}\sum_{y_1^n} p_{xy}(x_1^n, y_1^n)\,\Pr\big[\exists (\tilde x_1^n, \tilde y_1^n) \in \big(B_x(x_1^n)\times B_y(y_1^n)\big)\cap F_n(l,k,x_1^n,y_1^n) \text{ s.t. } H_S(l,k,\tilde x_1^n,\tilde y_1^n) \le H_S(l,k,x_1^n,y_1^n)\big]$$
Note that the $p_n(l,k)$ here differs from the $p_n(l,k)$ defined for ML decoding by replacing the condition $p_{xy}(x_1^n, y_1^n) \le p_{xy}(\tilde x_1^n, \tilde y_1^n)$ with $H_S(l,k,\tilde x_1^n,\tilde y_1^n) \le H_S(l,k,x_1^n,y_1^n)$.
The following lemma, analogous to (4.22) for ML decoding, tells us that the weighted suffix entropy decoding rule is a good one.

Lemma 10 Upper bound on the symbol-wise decoding error $\Pr[\hat x_{n-\Delta}(n) \ne x_{n-\Delta}]$:
$$\Pr[\hat x_{n-\Delta}(n) \ne x_{n-\Delta}] \le \Pr[\hat x_1^{n-\Delta}(n) \ne x_1^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} p_n(l,k)$$
Proof: The first inequality is trivial, so we only need to show the second. According to the decoding rule, $\hat x_1^{n-\Delta} \ne x_1^{n-\Delta}$ implies that there exists a sequence $\bar x_1^n \in \mathcal W_x^n$ s.t. $\bar x_1^{n-\Delta} \ne x_1^{n-\Delta}$. This means that there exists a sequence $\bar y_1^n \in B_y(y_1^n)$ s.t. $i_x(\bar x_1^n, \bar y_1^n) \ge i_x(x_1^n, y_1^n)$. Suppose that $(\bar x_1^n, \bar y_1^n) \in F_n(l,k,x_1^n,y_1^n)$; then $l \le n-\Delta$ because $\bar x_1^{n-\Delta} \ne x_1^{n-\Delta}$. By the definition of $i_x$, we know that $H_S(l,k,\bar x_1^n,\bar y_1^n) \le H_S(l,k,x_1^n,y_1^n)$. Using the union bound we get the desired inequality. $\square$
We only need to bound each individual error probability $p_n(l,k)$ to finish the proof.

Lemma 11 Upper bound on $p_n(l,k)$, $l \le k$: $\forall \varepsilon > 0$, $\exists K_1 < \infty$, s.t.
$$p_n(l,k) \le 2^{-(n-l+1)[E_x(R_x, R_y, \lambda) - \varepsilon]}$$
where $\lambda = (k-l)/(n-l+1) \in [0,1]$.

Proof: The proof bears some similarity to that of the universal decoding result for the point-to-point source coding problem in Proposition 4. The details of the proof are in Appendix F.2. $\square$

A similar derivation yields a bound on $p_n(l,k)$ for $l \ge k$. Combining Lemmas 10 and 11, and then following the same derivation as for ML decoding, yields Theorem 4. $\blacksquare$
4.4 Discussions
We derived achievable delay constrained error exponents for distributed source coding. The key technique is "divide and conquer": we divide the delay constrained error event into individual error events. For each individual error event, we apply the classical block coding analysis to get an individual error exponent for that specific event. Then, by a union bound argument, we determine the dominant error event. This "divide and conquer"
scheme works not only for the Slepian-Wolf source coding in this chapter but also for other information-theoretic problems illustrated in [13, 14]. While the encoder is the same sequential random binning shown in Section 2.3.3, the decoding is quite complicated due to the nature of the problem. As in Section 2.3.3, we derived the error exponent for both ML and universal decoding. To show the equivalence of the two error exponents, we apply Lagrange duality and tilted distributions in Appendix G. There are several other important results in Appendix G; the derivations are extremely laborious, but the reward is quite fulfilling. These results essentially follow the I-projection theory in the statistics literature [30] and were recently discussed in the context of channel coding error exponents [9].
We only showed that some positive delay constrained error exponent is achievable as long as the rate pair is in the interior of the classical Slepian-Wolf region in Figure A.3. In general, the delay constrained error exponent for distributed source coding is smaller than or equal to its block coding counterpart. This is different from what we observed in Chapters 2 and 3. To further understand the difference, we would need a tight upper bound on the error exponent. A simple example shows that the achievable exponent derived here is not tight in general: consider the special case where the two sources are independent. The random coding scheme in this chapter then gives the standard random coding error exponent for each single source, as shown in (2.13) in Section 2.3.3, and this error exponent is strictly smaller than that of the optimal coding scheme in Chapter 2, as shown in Proposition 8 in Section 2.5. The optimal error exponent problem could be an extremely difficult one. It would be interesting to give an upper bound on it; however, we do not have any general non-trivial upper bounds either. In Chapter 5, we will study an upper bound on the delay constrained error exponent for the source coding with decoder side-information problem. This is a special case of what we study in this chapter, since the decoder side-information can be treated as an encoder with rate higher than the logarithm of the alphabet size. However, this upper bound is not tight in general. These open questions are left for future research.
Chapter 5
Lossless Source Coding with
Decoder Side-Information
In this chapter, we study the delay constrained source coding with decoder side-
information problem. As a special case of the Slepian-Wolf problem studied in Chapter 4,
the sequential random binning scheme’s delay performance is derived as a corollary of those
results in Chapters 2 and 4. An upper bound on the delay constrained error exponent is
derived by using the feed-forward decoding scheme from the channel coding literature. The
results in this chapter are also summarized in our paper [18], especially the implications of
these results on compression of encrypted data [50].
5.1 Delay constrained source coding with decoder side-information
In [79], Slepian and Wolf studied distributed lossless source coding problems. One of the problems studied there, source coding with decoder-only side-information, is shown in Figure 5.1, where the encoder has access to the source $x$ only, but not the side-information $y$. It is clear that if the side-information $y$ is also available to the encoder, then to achieve arbitrarily small decoding error with fixed-length block coding, rate at the conditional entropy $H(x|y)$ is necessary and sufficient. Somewhat surprisingly, Slepian and Wolf showed that even without the encoder side-information, rate at the conditional entropy $H(x|y)$ is still
sufficient.

Figure 5.1. Lossless source coding with decoder side-information: the encoder observes the source $x_1, x_2, \dots$ and the decoder observes the side-information $y_1, y_2, \dots$, where $(x_i, y_i) \sim p_{xy}$.
We can also factor the joint probability to treat the source as a random variable $x$ and consider the side-information $y$ as the output of a discrete memoryless channel (DMC) $p_{y|x}$ with $x$ as input. This model is shown in Figure 5.2.
Figure 5.2. Lossless source coding with side-information: the side-information is the output of a DMC with the source $x$ as input.
We first review the error exponent result for source coding with decoder side-information in the fixed-length block coding setup in Section A.4 in the appendix. Then we formally define the delay constrained source coding with side-information problem. In Section 5.1.2 we give a lower bound and an upper bound on the error exponent for this problem. These two bounds do not match in general, which leaves room for future exploration.
5.1.1 Delay Constrained Source Coding with Side-Information
Rather than being known in advance, the source symbols stream into the encoder in real time. We assume that the source generates one pair of source symbols $(x, y)$ per second from the finite alphabet $\mathcal X \times \mathcal Y$. The $j$th source symbol $x_j$ is not known at the encoder until time $j$, and similarly for $y_j$ at the decoder. Rate $R$ operation means that the encoder sends 1 binary bit to the decoder every $\frac{1}{R}$ seconds. For obvious reasons (cf. Propositions 1 and 2), we focus on cases with $H(x|y) < R < \log_2|\mathcal X|$.
Figure 5.3. Delay constrained source coding with side-information: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
Definition 8 A sequential encoder-decoder pair $(\mathcal E, \mathcal D)$ is a sequence of maps $\{\mathcal E_j\}$, $j = 1, 2, \dots$ and $\{\mathcal D_j\}$, $j = 1, 2, \dots$. The outputs of $\mathcal E_j$ are the outputs of the encoder $\mathcal E$ from time $j-1$ to $j$:
$$\mathcal E_j : \mathcal X^j \longrightarrow \{0,1\}^{\lfloor jR\rfloor - \lfloor (j-1)R\rfloor}, \qquad \mathcal E_j(x_1^j) = b_{\lfloor (j-1)R\rfloor+1}^{\lfloor jR\rfloor}$$
The outputs of $\mathcal D_j$ are the decoding decisions on all the arrived source symbols by time $j$, based on the received binary bits up to time $j$ as well as the side-information:
$$\mathcal D_j : \{0,1\}^{\lfloor jR\rfloor} \times \mathcal Y^j \longrightarrow \mathcal X, \qquad \mathcal D_j(b_1^{\lfloor jR\rfloor}, y_1^j) = \hat x_{j-\Delta}(j)$$
where $\hat x_{j-\Delta}(j)$ is the estimate of $x_{j-\Delta}$ at time $j$, so the end-to-end delay is $\Delta$ seconds. A rate $R = \frac{1}{2}$ sequential source coding system is illustrated in Figure 5.3.
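The bit-budget bookkeeping in Definition 8 ($\lfloor jR\rfloor - \lfloor (j-1)R\rfloor$ new bits at time $j$, so exactly $\lfloor jR\rfloor$ bits sent after $j$ symbols) can be sketched as follows. The parity generation itself is abstracted by placeholder coin flips, since the actual binning rule is that of Definition 3; the function names are our choices:

```python
import math
import random

def bits_at_step(j, R):
    """Number of encoder output bits emitted between time j-1 and time j."""
    return math.floor(j * R) - math.floor((j - 1) * R)

def sequential_encode(source, R, seed=0):
    """Causal rate-R encoder: after j symbols, exactly floor(jR) bits have
    been emitted.  Each batch may depend on the whole prefix x_1^j; here
    the parity bits are placeholder coin flips from shared randomness."""
    rng = random.Random(seed)            # stand-in for common randomness
    stream = []
    for j, _symbol in enumerate(source, start=1):
        stream.extend(rng.randint(0, 1) for _ in range(bits_at_step(j, R)))
    return stream
```

For $R = \frac{1}{2}$ the encoder emits one bit every other second, matching Figure 5.3.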
For sequential source coding, it is important to study the symbol-by-symbol decoding error probability instead of the block coding error probability.

Definition 9 A family of rate $R$ sequential source codes $(\mathcal E, \mathcal D^\Delta)$ is said to achieve delay-reliability $E_{si}(R)$ if and only if for all $\varepsilon > 0$ there exists $K < \infty$, s.t. $\forall i, \Delta > 0$:
$$\Pr[x_i \ne \hat x_i(i+\Delta)] \le K 2^{-\Delta(E_{si}(R)-\varepsilon)}$$
Following this definition, we derive both a lower and an upper bound on the delay constrained error exponent for source coding with side-information.
5.1.2 Main results of Chapter 5: lower and upper bound on the error exponents

There are two parts to the main result. First, we give an achievable lower bound on the delay constrained error exponent $E_{si}(R)$. This part is realized by using the same sequential random binning scheme as in Definition 3, together with the maximum-likelihood decoder of Section 4.3.1 or the universal decoder of Section 4.3.2. The lower bound can be treated as a simple corollary of the more general theorems in the distributed source coding setup, Theorems 3 and 4. The second part is an upper bound on $E_{si}(R)$. We apply a modified version of the feed-forward decoder used by Pinsker [62] and recently clarified by Sahai [67]. This feed-forward decoder technique is a powerful tool in the upper bound analysis of delay constrained error exponents; using this technique, we derived an upper bound on the joint source-channel coding error exponent in [15].
A lower bound on $E_{si}(R)$

We state the relevant lower bound (achievability) result, which comes as a simple corollary of the more general result proved in Theorems 3 and 4.

Theorem 6 Delay constrained random source coding theorem: using a random sequential coding scheme and the ML decoding rule with side-information, for all $i, \Delta$:
$$\Pr[\hat x_i(i+\Delta) \ne x_i] \le K 2^{-\Delta E_{si}^{lower}(R)} \tag{5.1}$$
where $K$ is a constant and
$$E_{si}^{lower}(R) = E_{si,b}^{lower}(R) = \min_{q_{xy}}\big\{D(q_{xy}\|p_{xy}) + \max\{0,\ R - H(q_{x|y})\}\big\}$$
where $E_{si,b}^{lower}(R)$ is defined in Theorem 13. An equivalent expression is
$$E_{si}^{lower}(R) = E_{si,b}^{lower}(R) = \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
These two expressions can be shown to be equivalent following the same Lagrange multiplier argument as in [39] and Appendix G.

Similarly, for the universal decoding rule, for all $\varepsilon > 0$ there exists a finite constant $K$, s.t. for all $i$ and $\Delta$:
$$\Pr[\hat x_i(i+\Delta) \ne x_i] \le K 2^{-\Delta(E_{si}^{lower}(R)-\varepsilon)} \tag{5.2}$$
The encoder is the same random sequential encoding scheme described in Definition 3, which can be realized using an infinite constraint-length, time-varying random convolutional code. Common randomness between the encoder and the decoder is assumed. The decoder can use a maximum-likelihood rule or a minimum empirical joint entropy decoding rule, similar to those in the point-to-point source coding setup in Chapter 2. Details of the proof are in Section 5.5.
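The Gallager form of $E_{si}^{lower}(R)$ in Theorem 6 is a one-dimensional maximization over $\rho \in [0,1]$ and is easy to evaluate numerically. A grid-search sketch (the grid resolution and the dict representation of $p_{xy}$ are our choices):

```python
from math import log2

def E_si_lower(R, pxy, grid=2001):
    """max over rho in [0,1] of
       rho*R - log2 sum_y [sum_x p(x,y)^(1/(1+rho))]^(1+rho).
    pxy: dict mapping (x, y) -> probability."""
    ys = {y for (_, y) in pxy}
    best = 0.0
    for i in range(grid):
        rho = i / (grid - 1)
        e0 = log2(sum(
            sum(p ** (1.0 / (1 + rho)) for (x, yy), p in pxy.items() if yy == y)
            ** (1 + rho)
            for y in ys))
        best = max(best, rho * R - e0)
    return best
```

For independent side-information this collapses to the random coding exponent $E_r(R)$ of the source alone; e.g., for a uniform binary source with irrelevant side-information the exponent is $R - 1$ bits for $R > 1$ and $0$ for $R \le 1$.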
An upper bound on $E_{si}(R)$

We give an upper bound on the delay constrained error exponent for source coding with side-information as defined in Definition 9. This bound holds for any generic joint distribution $p_{xy}$. Some special cases will be discussed in Section 5.2.

Theorem 7 For the source coding with side-information problem in Figure 5.2, if the source is iid $\sim p_{xy}$ from a finite alphabet, then the error exponent $E_{si}(R)$ with fixed delay must satisfy $E_{si}(R) \le E_{si}^{upper}(R)$, where
$$E_{si}^{upper}(R) = \min\Big\{\inf_{q_{xy},\,\alpha\ge1:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_{xy}\|p_{xy}),\ \inf_{q_{xy},\,1\ge\alpha\ge0:\,H(q_{x|y})>(1+\alpha)R} \frac{1-\alpha}{\alpha}\,D(q_x\|p_x) + D(q_{xy}\|p_{xy})\Big\}$$
5.2 Two special cases

It is clear that the upper bound and the lower bound on the error exponent $E_{si}(R)$ are not identical. There seems to be a gap between the two bounds. But how big or how small can this gap be? In this section, we answer that question by examining two extreme cases.
5.2.1 Independent side-information
Consider the case where $x$ and $y$ are independent, i.e., there is effectively no side-information. Then clearly this problem of source coding with irrelevant "side-information" at the decoder is the same as the single-source coding problem discussed in Chapter 2. So the upper bound satisfies $E_{si}^{upper}(R) \ge E_s(R)$, where $E_s(R)$ is the delay constrained source coding error exponent defined in Theorem 1 for source $x$. So we only need to show that $E_{si}^{upper}(R) \le E_s(R)$ to establish the tightness of our upper bound for this extreme case where no side-information is present. The proof simply follows the following chain:
$$E_{si}^{upper}(R) \stackrel{(a)}{=} \min\Big\{\inf_{q_{xy},\,\alpha\ge1:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_{xy}\|p_{xy}),\ \inf_{q_{xy},\,1\ge\alpha\ge0:\,H(q_{x|y})>(1+\alpha)R} \frac{1-\alpha}{\alpha}\,D(q_x\|p_x) + D(q_{xy}\|p_{xy})\Big\}$$
$$\stackrel{(b)}{\le} \inf_{q_{xy},\,\alpha\ge0:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_{xy}\|p_{xy})$$
$$\stackrel{(c)}{\le} \inf_{q_x\times p_y,\,\alpha\ge1:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_x\times p_y\,\|\,p_x\times p_y)$$
$$\stackrel{(d)}{=} \inf_{q_x,\,\alpha\ge1:\,H(q_x)>(1+\alpha)R} \frac{1}{\alpha}\,D(q_x\|p_x) \stackrel{(e)}{=} E_s(R) \tag{5.3}$$
(a) is by definition. (b) holds because $D(q_{xy}\|p_{xy}) \ge D(q_x\|p_x)$. (c) is true because of the following two observations: first, $x$ and $y$ are independent, so $p_{xy} = p_x \times p_y$; second, we take the infimum over a subset of all $q_{xy}$, namely the distributions of the form $q_x \times p_y$, under which $x$ and $y$ are independent and the marginal $q_y = p_y$. (d) is true because under both $p$ and $q$, the marginals of $x$ and $y$ are independent with $y \sim p_y$. (e) is by definition.
The lower bound on the error exponent, $E_{si}^{lower}(R)$, has the following form for independent side-information:
$$E_{si}^{lower}(R) \stackrel{(a)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y\Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$\stackrel{(b)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y\Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\, p_y(y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$\stackrel{(c)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y p_y(y)\Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$\stackrel{(d)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - (1+\rho)\log\Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]\Big\} \stackrel{(e)}{=} E_r(R) \tag{5.4}$$
(a) is by definition. (b) holds because the side-information $y$ is independent of the source $x$. (c) and (d) are straightforward algebra. (e) is by the definition in (A.5).
From the above analysis, we see that the upper bound $E_{si}^{upper}(R)$ is tight in this case: it is achieved by the scheme of Chapter 2. Our lower bound coincides with the random coding error exponent for source $x$ alone from Section 2.3.3, which validates our results in Section 5.1.2. Comparing the upper bound in (5.3) and the lower bound in (5.4), we see that the lower bound is strictly below the upper bound, as shown in Proposition 8 in Chapter 2 and illustrated in Figure 2.3.
5.2.2 Delay constrained encryption of compressed data
The traditional view of encryption of a redundant source is in the block coding context, in which all the source symbols are compressed first and then encrypted¹. At the receiver end, decompression of the source comes after decryption. In keeping with the common theme of this thesis, we consider the end-to-end delay between the realization of the source and the decoding/decryption at the sink. The delay constrained compression-then-encryption system is illustrated in Figure 5.4. Without getting into the details of information-theoretic security, we briefly introduce the system. The compression and decompression parts are exactly the delay constrained source coding shown in Figure 2.1 in Chapter 2, where the source is iid $\sim p_s$ on a finite alphabet $\mathcal S$. The secret keys $y_1, \dots$ are iid Bernoulli(0.5) random variables, so the output of the encrypter $b \oplus y$ is independent of the output of the compressor $b$; hence the name "perfect secrecy". Here the $\oplus$ operator is sum mod two, or more generally, the addition operator in the finite field of size 2. Since the encryption and decryption are instantaneous, the overall system has a delay constrained error exponent $E_s(R)$ defined in Theorem 1, Chapter 2.

¹We consider type I Shannon-sense security (perfect secrecy) [73, 46].
Figure 5.4. Encryption of compressed streaming data with delay constraints: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
In the thought-provoking paper [50], Johnson shows that the same compression rate (the entropy rate of the source $s$) can be achieved by first encrypting the source and then compressing the encrypted data without knowing the key, and finally decompressing/decrypting the encoded data using the key as decoder side-information. Recently, practical systems have been built by implementing LDPC codes [36, 65] for source coding with side-information [71]. The delay constrained encryption-then-compression system is shown in Figure 5.5.
The source is iid $\sim p_s$ on a finite alphabet $\mathcal S$. To achieve perfect secrecy, the secret keys $y$ are iid uniform random variables on $\mathcal S$, and hence the encrypted data $x$ is also uniformly distributed on $\mathcal S$, where
$$x = s \oplus y \tag{5.5}$$
and the $\oplus$ operator is addition in the finite field of size $|\mathcal S|$. For a uniform source, no compression can be done. However, since $y$ is known to the decoder and $y$ is correlated with $x$, the source $x$ can be compressed to the conditional entropy $H(x|y)$, which is equal to $H(s)$ because of the relation among $x$, $y$, and $s$ in (5.5). The emphasis of this section is the fundamental information-theoretic model of the problem in Figure 5.5, as shown in Figure 5.6. A detailed discussion of the delay constrained encryption of compressed data problem is in [18].

Figure 5.5. Compression of encrypted streaming data with delay constraints: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
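As a numerical sanity check on the model in (5.5), the following sketch verifies that $x = s \oplus y$ is uniform (up to floating point) whenever the key $y$ is uniform, and that $H(x|y) = H(s)$; the alphabet size and the source pmf are arbitrary choices of ours:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

S = 3                                  # alphabet size |S|
ps = [0.7, 0.2, 0.1]                   # arbitrary source pmf on {0, 1, 2}
# joint pmf of (x, y) with x = s + y mod S and the key y uniform on S
pxy = {((s + y) % S, y): ps[s] / S for s in range(S) for y in range(S)}
px = [sum(p for (x, _), p in pxy.items() if x == xv) for xv in range(S)]

H_s = entropy(ps)
H_x_given_y = entropy(list(pxy.values())) - log2(S)   # H(x,y) - H(y)
```

Here `px` comes out uniform even though `ps` is skewed, while `H_x_given_y` matches `H_s`: this is exactly why rate $H(s)$ suffices even though $x$ by itself is incompressible.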
Figure 5.6. Information-theoretic model of compression of encrypted data: $s$ is the source and the secret key $y$ is uniformly distributed on $\mathcal S$. The encoder faces a uniformly distributed $x = s \oplus y$, and the decoder has side-information $y$ and recovers $\hat s = \hat x \ominus y$.
Notice that the estimates of $x$ and $s$ are related in the following way:
$$\hat s = \hat x \ominus y \tag{5.6}$$
so the estimation problems for $s$ and $x$ are equivalent. The lower bound and the upper bound on the delay constrained error exponent for $x$ are given in Theorem 6 and Theorem 7, respectively, for a general source $x$ and side-information $y \sim p_{xy}$. For the compression of encrypted data problem, as illustrated in Figure 5.6, we have the following corollary giving both a lower and an upper bound.
Corollary 1 Bounds on the delay constrained error exponent for compression of encrypted data: for the compression of encrypted data system in Figure 5.5, and hence its information-theoretic interpretation as a decoder side-information problem for source $x$ given side-information $y$ in Figure 5.6, the delay constrained error exponent $E_{si}(R)$ is bounded by the following two error exponents:
$$E_r(R, p_s) \le E_{si}(R) \le E_{s,b}(R, p_s) \tag{5.7}$$
where $E_r(R, p_s)$ is the random coding error exponent for source $p_s$ defined in (A.5), and $E_{s,b}(R, p_s)$ is the block coding error exponent defined in (A.4); both definitions are in Chapter 2.

It should be clear that this corollary holds for any source/side-information pair as in Figure 5.6, where $x = s \oplus y$ and $y$ is uniform on $\mathcal S$. The proof is in Appendix I.
5.3 Delay constrained Source Coding with Encoder Side-Information

In this section, we study the source coding problem for $x$ from a joint distribution $(x, y) \sim p_{xy}$, where the side-information $y$ is known at both the encoder and the decoder, as shown in Figure 5.7.
Figure 5.7. Lossless source coding with both encoder and decoder side-information: both terminals observe $y_1, y_2, \dots$, where $(x_i, y_i) \sim p_{xy}$.
We first review the error exponent result for the source coding with encoder side-information problem. As shown in Figure 5.7, the sources are iid random variables $x_1^n, y_1^n$ from a finite alphabet $\mathcal X \times \mathcal Y$. Without loss of generality, $p_x(x) > 0,\ \forall x \in \mathcal X$ and $p_y(y) > 0,\ \forall y \in \mathcal Y$. $x_1^n$ is the source known to the encoder, and $y_1^n$ is the side-information known to both the decoder and the encoder. A rate $R$ block source coding system for $n$ source symbols consists of an encoder-decoder pair $(\mathcal E_n, \mathcal D_n)$, where
$$\mathcal E_n : \mathcal X^n \times \mathcal Y^n \to \{0,1\}^{\lfloor nR\rfloor}, \qquad \mathcal E_n(x_1^n, y_1^n) = b_1^{\lfloor nR\rfloor}$$
$$\mathcal D_n : \{0,1\}^{\lfloor nR\rfloor} \times \mathcal Y^n \to \mathcal X^n, \qquad \mathcal D_n(b_1^{\lfloor nR\rfloor}, y_1^n) = \hat x_1^n$$
The error probability is $\Pr(\hat x_1^n \ne x_1^n) = \Pr(x_1^n \ne \mathcal D_n(\mathcal E_n(x_1^n, y_1^n), y_1^n))$. The exponent $E_{ei,b}(R)$ is achievable if there exists a family of $(\mathcal E_n, \mathcal D_n)$ s.t.
$$\lim_{n\to\infty} -\frac{1}{n}\log_2 \Pr(\hat x_1^n \ne x_1^n) = E_{ei,b}(R) \tag{5.8}$$
As a simple corollary of the relevant results of [29, 39], we have the following lemma.

Lemma 12 $E_{ei,b}(R) = E_{si,b}^{upper}(R)$, where $E_{si,b}^{upper}(R)$ is the upper bound for source coding with decoder-only side-information defined in Theorem 13:
$$E_{si,b}^{upper}(R) = \min_{q_{xy}:\,H(q_{x|y})\ge R} D(q_{xy}\|p_{xy}) = \sup_{\rho\ge0}\Big\{\rho R - \log\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big\}$$
Note: this error exponent is both achievable and tight in the usual block coding sense [29].
5.3.1 Delay constrained error exponent

Similar to the previous cases, we have the following notion of delay constrained source coding with both encoder and decoder side-information.

Definition 10 A sequential encoder-decoder pair $(\mathcal E, \mathcal D)$ is a sequence of maps $\{\mathcal E_j\}$, $j = 1, 2, \dots$ and $\{\mathcal D_j\}$, $j = 1, 2, \dots$. The outputs of $\mathcal E_j$ are the outputs of the encoder $\mathcal E$ from time $j-1$ to $j$:
$$\mathcal E_j : \mathcal X^j \times \mathcal Y^j \longrightarrow \{0,1\}^{\lfloor jR\rfloor-\lfloor(j-1)R\rfloor}, \qquad \mathcal E_j(x_1^j, y_1^j) = b_{\lfloor(j-1)R\rfloor+1}^{\lfloor jR\rfloor}$$
Figure 5.8. Delay constrained source coding with encoder side-information: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
The outputs of $\mathcal D_j$ are the decoding decisions on all the arrived source symbols by time $j$, based on the received binary bits up to time $j$ as well as the side-information:
$$\mathcal D_j : \{0,1\}^{\lfloor jR\rfloor} \times \mathcal Y^j \longrightarrow \mathcal X, \qquad \mathcal D_j(b_1^{\lfloor jR\rfloor}, y_1^j) = \hat x_{j-\Delta}(j)$$
where $\hat x_{j-\Delta}(j)$ is the estimate of $x_{j-\Delta}$ at time $j$, so the end-to-end delay is $\Delta$ seconds. A rate $R = \frac{1}{2}$ sequential source coding system is illustrated in Figure 5.8.
The delay constrained error exponent is defined in Definition 11, parallel to the previous definitions of delay constrained error exponents.

Definition 11 A family of rate $R$ sequential source codes $(\mathcal E^\Delta, \mathcal D^\Delta)$ is said to achieve delay-reliability $E_{ei}(R)$ if and only if for all $\varepsilon > 0$ there exists $K < \infty$, s.t. $\forall i, \Delta > 0$:
$$\Pr(x_i \ne \hat x_i(i+\Delta)) \le K 2^{-\Delta(E_{ei}(R)-\varepsilon)}$$
Following the definition of the delay constrained source coding error exponent $E_{ei}(R)$ in Definition 11, we have the following result.

Theorem 8 Delay constrained error exponent with both encoder and decoder side-information:
$$E_{ei}(R) = \inf_{\alpha>0} \frac{1}{\alpha}\, E_{ei,b}((\alpha+1)R)$$
where $E_{ei,b}(R)$ is the block source coding error exponent defined in Lemma 12.
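Once the block exponent of Lemma 12 can be evaluated, the focusing operator of Theorem 8 is again a one-dimensional optimization. A grid-search sketch follows; the grids, the truncation of $\rho$ and $\alpha$, and the dict representation of $p_{xy}$ are our choices (the sup in Lemma 12 is over all $\rho \ge 0$ and is infinite for rates above $\log_2|\mathcal X|$, so the $\rho$ truncation matters only near that edge):

```python
from math import log2

def E_ei_block(R, pxy, rho_max=20.0, grid=401):
    """Gallager form of Lemma 12, sup over rho >= 0 truncated to [0, rho_max]."""
    ys = {y for (_, y) in pxy}
    best = 0.0
    for i in range(grid):
        rho = rho_max * i / (grid - 1)
        e0 = log2(sum(
            sum(p ** (1.0 / (1 + rho)) for (x, yy), p in pxy.items() if yy == y)
            ** (1 + rho)
            for y in ys))
        best = max(best, rho * R - e0)
    return best

def E_ei_delay(R, pxy):
    """Focusing operator: inf over alpha > 0 of E_ei_block((1+alpha)R)/alpha,
    evaluated on a finite alpha grid."""
    alphas = [0.1 * i for i in range(1, 201)]          # alpha in 0.1 .. 20.0
    return min(E_ei_block((1 + a) * R, pxy) / a for a in alphas)
```

For rates below $H(x|y)$ both quantities are zero; above it, the inner infimum trades a larger block exponent at rate $(1+\alpha)R$ against the $1/\alpha$ penalty.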
Similar to the lossless source coding in Chapter 2 and the lossy source coding in Chapter 3, the delay constrained source coding error exponent and the block coding error exponent are connected by the focusing operator. In Appendix J, we first show the achievability of $E_{ei}(R)$ by a simple variable-length universal code and a FIFO queue coding scheme, very similar to that for lossless source coding in Chapter 2. Then we show that $E_{ei}(R)$ is an upper bound on the delay constrained error exponent, by the same argument used in the proof for the lossless source coding error exponent in Chapter 2. Indeed, encoder side-information eliminates all the future randomness in the source and the side-information, so the coding system has the same nature as the lossless source coding system discussed in Chapter 2. Although they take different forms, it should be conceptually clear that $E_{ei}(R)$ and $E_s(R)$ share many characteristics. Without proof, we claim that the properties shown in Section 2.5 for $E_s(R)$ can also be found in $E_{ei}(R)$.
5.3.2 Price of ignorance

In the block coding setup, the presence or absence of encoder side-information does not change the error exponent in a dramatic way. As shown in [29], the difference is only that between the random coding error exponent $E_{si,b}^{lower}(R)$ and the error exponent $E_{si,b}^{upper}(R)$. These two are the same in the low rate regime; this is similar to the well-known channel coding error exponents, where the random coding and sphere packing bounds are the same in the high rate regime [41]. Furthermore, the gap between the cases with and without encoder side-information can be further reduced by using expurgation, as shown in [2].

However, for the delay constrained case, we show that without encoder side-information, the error exponent with only decoder side-information is in general strictly smaller than the error exponent when the encoder side-information is also present.

Corollary 2 Price of ignorance:
$$E_{ei}(R) > E_{si}^{upper}(R)$$
Proof: $D(q_{xy}\|p_{xy}) > D(q_x\|p_x)$ in general. $\blacksquare$
5.4 Numerical results

In this section, we show three examples to illustrate the nature of the upper and lower bounds on the delay constrained error exponent with decoder side-information.

5.4.1 Special case 1: independent side-information

As shown in Section 5.2.1, if the side-information $y$ is independent of the source $x$, the upper bound agrees with the delay constrained source coding error exponent $E_s(R)$ for $x$, and the lower bound agrees with the random coding error exponent $E_r(R)$ for source $x$. This shows that our bounding technique for the upper bound is tight in this case. To close the gap between the upper and lower bounds, the coding scheme should be the delay-optimal coding scheme of Chapter 2; the sequential random binning scheme is clearly suboptimal, because it is proved in Section 2.5 that the delay constrained error exponent $E_s(R)$ is strictly higher than the random coding error exponent $E_r(R)$. This is clearly shown in Figure 5.9.

The source is the same as in Section 2.2: source $x$ with alphabet size 3, $\mathcal X = \{A, B, C\}$, and distribution
$$p_x(A) = 0.65, \qquad p_x(B) = 0.175, \qquad p_x(C) = 0.175$$
The side-information is independent of the source, so its distribution does not matter; we arbitrarily let the marginal be $p_y = (0.420, 0.580)$.
5.4.2 Special case 2: compression of encrypted data

In Section 5.2.2, we showed in Corollary 1 that the delay constrained error exponent for compression of encrypted data from source $p_s$ is sandwiched between the block coding error exponent $E_{s,b}(R, p_s)$ and the random coding error exponent $E_r(R, p_s)$. For source $s$, the stream cipher is $x = s \oplus y$. For the binary case, this stream cipher is illustrated in Figure 5.10, where the source has $p_s(0) = 1-\epsilon$ and $p_s(1) = \epsilon$, and both the key $y$ and the output of the cipher $x$ are uniform on $\{0,1\}$. The bounds on the delay constrained error exponents are plotted in Figure 5.11, where $\epsilon = 0.1$. For this problem, the entropy of the source $H(s)$ is equal to the conditional entropy $H(x|y)$.
For a uniform source $x$ with side-information that is the output of a symmetric channel with $x$ as input, it can be shown that the upper bound and the lower bound agree in the low
Figure 5.9. Delay constrained error exponents for source coding with independent decoder side-information: upper bound $E_{si}^{upper}(R) = E_s(R)$, lower bound $E_{si}^{lower}(R) = E_r(R)$. The rate axis runs from $H(x|y)$ to $\log_2(3)$, where the upper bound grows to $\infty$.
Figure 5.10. A stream cipher $x = s \oplus y$ can be modeled as a discrete memoryless channel (here a binary symmetric channel with crossover probability $\epsilon$), where the key $y$ is uniform and independent of the source $s$. The key $y$ is the input, and the encryption $x$ is the output of the channel.
Figure 5.11. Delay constrained error exponents for uniform source coding with symmetric decoder side-information $x = y \oplus s$: upper bound $E_{si}^{upper}(R) = E_s(R, p_s)$, lower bound $E_{si}^{lower}(R) = E_r(R, p_s)$. These two bounds agree in the low rate regime.
rate regime, just as shown in Figure 5.11. Note: this is a more general problem than the compression of encrypted data problem. A uniform source with an erasure channel between the source and the side-information is illustrated in Figure 5.12; the bounds on the error exponents are shown in Figure 5.13.
Figure 5.12. Uniform source and side-information connected by a symmetric erasure channel with erasure probability $e$.
Figure 5.13. Delay constrained error exponents for uniform source coding with symmetric decoder side-information, where $x$ and $y$ are connected by an erasure channel: upper bound $E_{si}^{upper}(R) = E_r(R, p_s)$ in the low rate regime.
5.4.3 General cases

In this section, we show the upper bound and the lower bound for a general distribution of the source and side-information: the side-information is dependent on the source, and from the encoder's point of view the source is not uniform. Non-uniformity here refers to the joint distribution, including the marginal distributions of the source and the side-information. For example, the following distribution is not uniform although the marginals are uniform:
p_xy(x, y)      x = 1            x = 2            x = 3
y = 1           a                b                1/3 − a − b
y = 2           c                d                1/3 − c − d
y = 3           1/3 − a − c      1/3 − b − d      −1/3 + a + b + c + d

Table 5.1. A non-uniform source with uniform marginals
where $a, b, c, d, a+b, a+c, d+c, d+b \in [0, \frac{1}{3}]$, and the encoder can do more than just sequential random binning. An extreme case in which the encoder only needs to deal with $x \in \{2, 3\}$ is the following: let $a = \frac{1}{3}$ and $b = c = 0$; then the side-information at the decoder can be used to tell when $x = 1$. Without proof, we claim that for this marginal-uniform source, random coding is suboptimal. Now consider the following $2\times2$ source:
$$p_{xy} = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix} \tag{5.9}$$
In Figure 5.14, we plot the upper and lower bounds on the delay constrained error exponent with decoder-only side-information. To illustrate the "price of ignorance" phenomenon, we also plot the delay constrained error exponent with both encoder and decoder side-information, $E_{ei}(R)$. Like the lossless source coding with delay constraints error exponent $E_s(R)$, $E_{ei}(R)$ is related to its block coding error exponent through the focusing operator.
Figure 5.14. Delay constrained error exponents for a general source and decoder side-information: both the upper bound $E_{si}^{upper}(R)$ and the lower bound $E_{si}^{lower}(R)$ are plotted; the delay constrained error exponent with both encoder and decoder side-information, $E_{ei}(R)$, is plotted with dotted lines.
Note: we do not know how to parameterize the upper bound Eupper_si(R) as we did in Proposition 7 in Section 2. Instead, we minimize

Eupper_si(R) = min{ inf_{q_xy, α ≥ 1: H(q_x|y) > (1+α)R} (1/α) D(q_xy ‖ p_xy),
                    inf_{q_xy, 0 ≤ α ≤ 1: H(q_x|y) > (1+α)R} ((1−α)/α) D(q_x ‖ p_x) + D(q_xy ‖ p_xy) }

by brute force over the 5-dimensional space (q_xy, α). This yields the somewhat rough plot shown in Figure 5.14.
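This brute-force minimization is easy to sketch numerically. The following Python sketch is our own illustration, not the code that produced Figure 5.14: the random sampling of candidate joint types q_xy, the α grid, and the example rate are all arbitrary choices.

```python
import numpy as np

def kl(q, p):
    # KL divergence D(q || p) in bits; terms with q = 0 contribute 0.
    m = q > 0
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

def h_x_given_y(q):
    # Conditional entropy H(x|y) in bits for a joint pmf q[x, y].
    qy = q.sum(axis=0)
    h = 0.0
    for j in range(q.shape[1]):
        if qy[j] > 0:
            c = q[:, j] / qy[j]
            m = c > 0
            h -= qy[j] * float(np.sum(c[m] * np.log2(c[m])))
    return h

def upper_bound(p, R, samples=10000, seed=0):
    # Randomly sample candidate joint types q and scan alpha = Delta/n,
    # keeping only pairs satisfying the constraint H(q_{x|y}) > (1 + alpha) R.
    rng = np.random.default_rng(seed)
    alphas = np.geomspace(1e-3, 4.0, 80)
    best = np.inf
    for _ in range(samples):
        q = rng.dirichlet(np.ones(p.size)).reshape(p.shape)
        hxy = h_x_given_y(q)
        dxy = kl(q.ravel(), p.ravel())
        dx = kl(q.sum(axis=1), p.sum(axis=1))
        for a in alphas:
            if hxy <= (1 + a) * R:
                continue
            val = dxy / a if a >= 1 else (1 - a) / a * dx + dxy
            best = min(best, val)
    return best

p = np.array([[0.1, 0.2], [0.3, 0.4]])   # the source in (5.9)
R = h_x_given_y(p) + 0.05                # a rate slightly above H(x|y)
print(upper_bound(p, R))
```

Random search only approximates the infimum from above; a finer grid or a proper optimizer would smooth the curve at the cost of more computation.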
107
5.5 Proofs
The achievability part of the proof is a special case of that for Slepian-Wolf source coding shown in Chapter 4. The upper bound is derived by a feed-forward decoding scheme which was first developed in the channel coding context. It is no surprise that we can borrow channel coding techniques for source coding with decoder side-information, since the duality between the two has long been known. The upper bound and the achievable lower bound are not identical in general.
5.5.1 Proof of Theorem 6: random binning
We prove Theorem 6 by following the proofs for point-to-point source coding in Propo-
sitions 3 and 4. The encoder is the same sequential random binning encoder in Definition 3
with common randomness shared by the encoder and decoder. Recall that the key property of this random binning scheme is pairwise independence: formally, for all i, n and any two sequences that share the prefix x_1^i but have different suffixes x_{i+1}^n ≠ x̃_{i+1}^n,

Pr[E(x_1^i x_{i+1}^n) = E(x_1^i x̃_{i+1}^n)] = 2^(−(⌊nR⌋−⌊iR⌋)) ≤ 2 × 2^(−(n−i)R)    (5.10)
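The pairwise independence property (5.10) can be checked empirically with a toy simulation. This sketch is our own illustration, with the integer rate R = 1 bit per symbol so that ⌊nR⌋ − ⌊iR⌋ = n − i exactly: each new prefix draws a fresh fair parity bit, and we estimate the collision probability of two sequences sharing their first i symbols.

```python
import random

def sequential_bin(x, table, rng):
    # Sequential random binning at rate R = 1 bit/symbol: the bit emitted at
    # step k is a fresh fair coin flip indexed by the whole prefix x[:k], so
    # sequences agreeing up to time i share exactly their first i bits.
    bits = []
    for k in range(1, len(x) + 1):
        key = tuple(x[:k])
        if key not in table:
            table[key] = rng.getrandbits(1)
        bits.append(table[key])
    return tuple(bits)

def collision_rate(n=6, i=3, trials=20000, seed=1):
    # Monte Carlo estimate of Pr[same bin] for two sequences that share the
    # first i symbols and first differ at position i + 1.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        table = {}                      # fresh common randomness each trial
        prefix = [rng.randrange(2) for _ in range(i)]
        a = prefix + [rng.randrange(2) for _ in range(n - i)]
        b = prefix + [1 - a[i]] + [rng.randrange(2) for _ in range(n - i - 1)]
        hits += sequential_bin(a, table, rng) == sequential_bin(b, table, rng)
    return hits / trials

print(collision_rate())   # should be close to 2**-(6-3) = 0.125
```

Since the diverging suffixes index disjoint table entries, the collision probability is exactly 2^(−(n−i)) here, matching (5.10) with equality for integer rates.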
First we show (5.1) in Theorem 6.
ML decoding
ML decoding rule:
Denote by x̂_1^n(n) the estimate of the source sequence x_1^n at time n:

x̂_1^n(n) = argmax_{x̃_1^n ∈ B_x(x_1^n)} p_xy(x̃_1^n, y_1^n) = argmax_{x̃_1^n ∈ B_x(x_1^n)} ∏_{i=1}^n p_xy(x̃_i, y_i)    (5.11)
The ML decoding rule in (5.11) is very simple. At time n, the decoder simply picks the sequence x̂_1^n(n) with the highest joint likelihood with the side-information y_1^n among the sequences in the same bin as the true sequence x_1^n. The estimate of source symbol n − ∆ is then simply the (n − ∆)th symbol of x̂_1^n(n), denoted by x̂_{n−∆}(n).
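At toy block lengths the ML rule (5.11) can be carried out by exhaustive search, which may help build intuition. This Python sketch is our own illustration: the joint pmf, the block length, and the plain per-sequence random binning (standing in for the sequential binning of Definition 3) are all made-up.

```python
import itertools
import random

def ml_decode(bin_of, b, y, p_xy, n):
    # Among all length-n binary sequences whose bin index equals b, return
    # the one maximizing the joint likelihood prod_i p_xy(x_i, y_i).
    best, best_score = None, -1.0
    for cand in itertools.product((0, 1), repeat=n):
        if bin_of(cand) != b:
            continue
        score = 1.0
        for xi, yi in zip(cand, y):
            score *= p_xy[xi][yi]
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy joint pmf: y is x observed through a BSC(0.1); both marginals uniform.
p_xy = [[0.45, 0.05], [0.05, 0.45]]
n, rate_bits = 8, 5
rng = random.Random(0)
table = {}

def bin_of(x):
    # Random binning with common randomness: 5 random bin bits per sequence.
    if x not in table:
        table[x] = rng.getrandbits(rate_bits)
    return table[x]

x = tuple(rng.randrange(2) for _ in range(n))
y = tuple(xi if rng.random() < 0.9 else 1 - xi for xi in x)
x_hat = ml_decode(bin_of, bin_of(x), y, p_xy, n)
print(x_hat, x)   # the estimate is bin-consistent and at least as likely as x
```

By construction the returned estimate lies in the true sequence's bin and has joint likelihood at least that of the truth; whether it equals the truth depends on the bin rate and the side-information quality.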
Details of the proof:

The proof is quite similar to that of Proposition 3; we omit some of the details in this thesis. To cause a decoding error, there must be some false source sequence x̃_1^n that satisfies three conditions: (i) it must be in the same bin (share the same parities) as x_1^n, i.e., x̃_1^n ∈ B_x(x_1^n); (ii) it must be more likely than the true sequence given the same side-information y_1^n, i.e., p_xy(x̃_1^n, y_1^n) > p_xy(x_1^n, y_1^n); and (iii) x̃_l ≠ x_l for some l ≤ n − ∆.
The error probability again can be union bounded as:

Pr[x̂_{n−∆}(n) ≠ x_{n−∆}]
  ≤ Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆}]
  = Σ_{x_1^n, y_1^n} Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆} | x_1^n] p_xy(x_1^n, y_1^n)
  ≤ Σ_{x_1^n, y_1^n} Σ_{l=1}^{n−∆} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. p_xy(x̃_1^n, y_1^n) ≥ p_xy(x_1^n, y_1^n)] p_xy(x_1^n, y_1^n)
  = Σ_{l=1}^{n−∆} Σ_{x_1^n, y_1^n} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. p_xy(x̃_1^n, y_1^n) ≥ p_xy(x_1^n, y_1^n)] p_xy(x_1^n, y_1^n)
  = Σ_{l=1}^{n−∆} p_n(l)    (5.12)
Recall the definition of F_n(l, x_1^n) in (2.20):

F_n(l, x_1^n) = { x̃_1^n ∈ X^n | x̃_1^{l−1} = x_1^{l−1}, x̃_l ≠ x_l }

and we define

p_n(l) = Σ_{x_1^n, y_1^n} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. p_xy(x̃_1^n, y_1^n) ≥ p_xy(x_1^n, y_1^n)] p_xy(x_1^n, y_1^n)    (5.13)
We now upper bound p_n(l) using a Chernoff bound argument similar to [39] and to the proof of Lemma 1. The details of the proof of the following Lemma 13 are in Appendix H.1.

Lemma 13 p_n(l) ≤ 2 × 2^(−(n−l+1) Elower_si(R)).

Using the above lemma and substituting the bound on p_n(l) into (5.12), we prove the ML decoding part, (5.1), of Theorem 6. ¥
Universal decoding (sequential minimum empirical joint entropy decoding)
In this section we prove (5.2) in Theorem 6.
Universal decoding rule:

x̂_l(n) = w[l]_l, where w[l]_1^n = argmin_{x̃_1^n ∈ B_x(x_1^n) s.t. x̃_1^{l−1} = x̂_1^{l−1}(n)} H(x̃_l^n, y_l^n)    (5.14)
We term this a sequential minimum joint empirical-entropy decoder; it closely follows the sequential minimum empirical-entropy decoder for point-to-point source coding in (2.30).
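The score used by this decoder is just the entropy of the empirical joint type of a candidate suffix paired with the side-information. A small sketch (ours; the example sequences are arbitrary) shows how a correlated pair beats an uncorrelated impostor under this score:

```python
from collections import Counter
from math import log2

def empirical_joint_entropy(xs, ys):
    # Entropy of the empirical joint type of the paired sequence
    # ((x_l, y_l), ..., (x_n, y_n)), in bits.
    counts = Counter(zip(xs, ys))
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in counts.values())

# The decoder in (5.14) prefers, among bin-consistent candidates agreeing
# with its past decisions, the suffix with the smallest joint empirical
# entropy against the side-information.
x_true = (0, 0, 1, 0, 1, 1, 0, 0)
y      = (0, 0, 1, 0, 1, 0, 0, 0)      # correlated side-information
x_fake = (1, 0, 1, 1, 0, 1, 0, 1)      # an uncorrelated impostor
print(empirical_joint_entropy(x_true, y) < empirical_joint_entropy(x_fake, y))   # prints True
```

Unlike the ML rule, this score needs no knowledge of p_xy, which is what makes the decoder universal.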
Details of the proof: With this decoder, errors can only occur if there is some sequence x̃_1^n such that (i) x̃_1^n ∈ B_x(x_1^n); (ii) x̃_1^{l−1} = x_1^{l−1} and x̃_l ≠ x_l for some l ≤ n − ∆; and (iii) the joint empirical entropy of x̃_l^n and y_l^n satisfies H(x̃_l^n, y_l^n) < H(x_l^n, y_l^n). Building on the common core of the achievability proof, (5.12), with universal decoding substituted for maximum likelihood, we get the following definition of p_n(l):

p_n(l) = Σ_{x_1^n, y_1^n} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. H(x̃_l^n, y_l^n) ≤ H(x_l^n, y_l^n)] p_xy(x_1^n, y_1^n)    (5.15)
The following lemma gives a bound on p_n(l). The proof is in Appendix H.2.

Lemma 14 For sequential minimum joint empirical-entropy decoding,

p_n(l) ≤ 2 × (n − l + 2)^(2|X||Y|) 2^(−(n−l+1) Elower_si(R)).
Lemma 14 and Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} p_n(l) imply that:

Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} 2 (n − l + 2)^(2|X||Y|) 2^(−(n−l+1) Elower_si(R))
  ≤ Σ_{l=1}^{n−∆} K_1 2^(−(n−l+1)[Elower_si(R) − ε])
  ≤ K 2^(−∆[Elower_si(R) − ε])

where K and K_1 are finite constants. The above analysis follows the same argument as that in the proof of Proposition 4. This concludes the proof of the universal decoding part, (5.2), of Theorem 6. ¥
5.5.2 Proof of Theorem 7: Feed-forward decoding
The theorem is proved by applying a variation of the bounding technique used in [67] (and originating in [62]) for the fixed-delay channel coding problem. Lemmas 15-20 are the source coding counterparts of Lemmas 4.1-4.5 in [67]. The idea of the proof is to first build a feed-forward sequential source decoder which has access to the previous source symbols in addition to the encoded bits and the side-information. The second step is to construct a block source-coding scheme from the optimal feed-forward sequential decoder and to show
[Figure 5.15 appears here: a block diagram of the causal encoder, the two feed-forward decoders, and the delay-∆ feed-forward paths.]

Figure 5.15. A cutset illustration of the Markov chain x_1^n − (x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}) − x̂_1^n. Decoder 1 and decoder 2 are type I and type II delay ∆ rate R feed-forward decoders respectively. They are equivalent.
that if the side-information behaves atypically enough, then the decoding error probability will be large for at least one of the source symbols. The next step is to prove that atypicality of the side-information before that particular source symbol does not cause the error, because of the feed-forward information; thus the cause of the decoding error for that particular symbol is the atypical behavior of the future side-information only. The last step is to lower bound the probability of this atypical behavior and thereby upper bound the error exponent. The proof spans the next several subsections.
Feed-forward decoders
Definition 12 A delay ∆ rate R decoder D^{∆,R} with feed-forward is a decoder D_j^{∆,R} that also has access to the past source symbols x_1^{j−1} in addition to the encoded bits b_1^{⌊(j+∆)R⌋} and side-information y_1^{j+∆}.

Using this feed-forward decoder, the estimate of x_j at time j + ∆ is:

x̂_j(j + ∆) = D_j^{∆,R}(b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x_1^{j−1})    (5.16)
Lemma 15 For any rate R encoder E, the optimal delay ∆ rate R decoder D^{∆,R} with feed-forward only needs to depend on b_1^{⌊(j+∆)R⌋}, y_j^{j+∆}, x_1^{j−1}.

Proof: The source and side-information (x_i, y_i) form an iid random process, and the encoded bits b_1^{⌊(j+∆)R⌋} are functions of x_1^{j+∆}, so the Markov chain y_1^{j−1} − (x_1^{j−1}, b_1^{⌊(j+∆)R⌋}, y_j^{j+∆}) − x_j^{j+∆} holds. Conditioned on the past source symbols, the past side-information is completely irrelevant for estimation. ¤
Notice that for any finite alphabet set X, we can always define the group Z_|X| on X, where the operators − and + are subtraction and addition mod |X|. So we write the error sequence of the feed-forward decoder as x̃_i = x_i − x̂_i. Then we have the following property for the feed-forward decoders.
Lemma 16 Given a rate R encoder E, the optimal delay ∆ rate R decoder D^{∆,R} with feed-forward for symbol j only needs to depend on b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x̃_1^{j−1}.

Proof: Proceed by induction. The claim holds for j = 1 since there are no prior source symbols. Suppose that it holds for all j < k and consider j = k. By the induction hypothesis, the actions of all the prior decoders j < k can be simulated using (b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x̃_1^{j−1}), giving x̂_1^{k−1}. This in turn allows the recovery of x_1^{k−1}, since we also know x̃_1^{k−1}. Thus the decoder is equivalent. ¤
We call the feed-forward decoders in Lemmas 15 and 16 type I and type II delay ∆ rate R feed-forward decoders respectively. Lemmas 15 and 16 tell us that feed-forward decoders can be thought of in three ways: having access to all encoded bits, all side-information and all past source symbols, (b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x_1^{j−1}); having access to all encoded bits, a recent window of side-information and all past source symbols, (b_1^{⌊(j+∆)R⌋}, y_j^{j+∆}, x_1^{j−1}); or having access to all encoded bits, all side-information and all past decoding errors, (b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x̃_1^{j−1}).
Constructing a block code

To encode a block of n source symbols, just run the rate R encoder E, terminated by running the encoder on random source symbols drawn according to the distribution p_x, with matching side-information on the other side. To decode the block, just use the delay ∆ rate R decoder D^{∆,R} with feed-forward, and then use the feed-forward error signals to correct any mistakes that might have occurred. As a block coding system, this hypothetical system never makes an error from end to end. As shown in Figure 5.15, the data processing inequality implies:

Lemma 17 If n is the block length, so the block rate is R(1 + ∆/n), then

H(x̃_1^n) ≥ −(n + ∆)R + nH(x|y)    (5.17)
Proof:

nH(x) =(a) H(x_1^n)
      = I(x_1^n; x̂_1^n)
      =(b) I(x_1^n; x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆})
      =(c) I(x_1^n; y_1^{n+∆}) + I(x_1^n; x̃_1^n | y_1^{n+∆}) + I(x_1^n; b_1^{⌊(n+∆)R⌋} | y_1^{n+∆}, x̃_1^n)
      ≤(d) nI(x; y) + H(x̃_1^n) + H(b_1^{⌊(n+∆)R⌋})
      ≤ nH(x) − nH(x|y) + H(x̃_1^n) + (n + ∆)R

(a) is true because the source is iid. (b) is true because of the data processing inequality applied to the Markov chain x_1^n − (x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}) − x̂_1^n, which gives I(x_1^n; x̂_1^n) ≤ I(x_1^n; x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}), together with the fact that I(x_1^n; x̂_1^n) = H(x_1^n) ≥ I(x_1^n; x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}). Combining the two inequalities gives (b). (c) is the chain rule for mutual information. In (d), first notice that (x, y) is iid across time, thus I(x_1^n; y_1^{n+∆}) = I(x_1^n; y_1^n) = nI(x; y). Secondly, the entropy of a random variable is never less than its mutual information with another random variable, whether conditioned on a third random variable or not. The remaining steps are obvious. Rearranging the final inequality gives (5.17). ¤
Lower bound the symbol-wise error probability
Now suppose this block-code were to be run with the distribution qxy , s.t. H(qx |y ) >
(1+ ∆n )R, from time 1 to n, and were to be run with the distribution pxy from time n+1 to
n + ∆. Write the hybrid distribution as Qxy . Then the block coding scheme constructed in
the previous section will with probability 1 make a block error. Moreover, many individual
symbols will also be in error often:
Lemma 18 If the source and side-information is coming from qxy , then there exists
a δ > 0 so that for n large enough, the feed-forward decoder will make at leastH(qx|y )−n+∆
nR
2 log2 |X |−(H(qx|y )−n+∆n
R)n symbol errors with probability δ or above. δ satisfies
hδ + δ log2(|X | − 1) =12(H(qx |y )− n + ∆
nR),
where hδ = −δ log2 δ − (1− δ) log2(1− δ).
113
Proof: Lemma 17 implies:

Σ_{i=1}^n H(x̃_i) ≥ H(x̃_1^n) ≥ −(n + ∆)R + nH(q_x|y)    (5.18)

so the average entropy per error symbol x̃ is at least H(q_x|y) − ((n+∆)/n)R. Now suppose that H(x̃_i) ≥ (1/2)(H(q_x|y) − ((n+∆)/n)R) for A positions. By noticing that H(x̃_i) ≤ log2 |X|, we have

Σ_{i=1}^n H(x̃_i) ≤ A log2 |X| + (n − A)(1/2)(H(q_x|y) − ((n+∆)/n)R)

With (5.18), we derive the desired result:

A ≥ [ (H(q_x|y) − ((n+∆)/n)R) / (2 log2 |X| − (H(q_x|y) − ((n+∆)/n)R)) ] n    (5.19)

where 2 log2 |X| − (H(q_x|y) − ((n+∆)/n)R) ≥ 2 log2 |X| − H(q_x|y) ≥ 2 log2 |X| − log2 |X| > 0.

Now for the A positions 1 ≤ j_1 < j_2 < ... < j_A ≤ n, the individual entropies satisfy H(x̃_j) ≥ (1/2)(H(q_x|y) − ((n+∆)/n)R). By the property of the binary entropy function,

Pr[x̃_j ≠ x_0] = Pr[x̂_j ≠ x_j] ≥ δ,

where x_0 is the zero element in the finite group Z_|X|. ¤
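Since h_δ + δ log2(|X| − 1) is strictly increasing in δ on [0, 1 − 1/|X|], the δ guaranteed by Lemma 18 can be computed by bisection. A minimal sketch (ours; the gap value is an arbitrary example):

```python
from math import log2

def delta_from_gap(gap, alphabet_size):
    # Solve h(d) + d * log2(|X| - 1) = gap / 2 for d in (0, 1 - 1/|X|) by
    # bisection; the left-hand side increases from 0 to log2(|X|) there.
    def lhs(d):
        return -d * log2(d) - (1 - d) * log2(1 - d) + d * log2(alphabet_size - 1)
    lo, hi = 0.0, 1.0 - 1.0 / alphabet_size
    target = gap / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Binary alphabet with gap H(q_{x|y}) - ((n+Delta)/n) R = 0.5: solve h(d) = 0.25.
print(delta_from_gap(0.5, 2))
```

The larger the entropy gap, the larger the guaranteed per-symbol error probability δ, which is what drives the lower bound on p_xy(E_j*) below.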
We can pick j* = j_{A/2}. By Lemma 18, we know that

min{j*, n − j*} ≥ (1/2) [ (H(q_x|y) − ((n+∆)/n)R) / (2 log2 |X| − (H(q_x|y) − ((n+∆)/n)R)) ] n,

so if we fix ∆/n and let n go to infinity, then min{j*, n − j*} goes to infinity as well.
At this point, Lemmas 15 and 18 together imply that even if the source and side-information only behave like they came from the hybrid distribution Q_xy from time j* to j* + ∆, and the source behaves like it came from the distribution q_x from time 1 to j* − 1, the same minimum error probability δ still holds. Now define the "bad sequence" set E_{j*} as the set of source and side-information sequence pairs for which the type I delay ∆ rate R decoder makes a decoding error at j*. Formally,

E_{j*} = { (x⃗, ȳ) | x_{j*} ≠ D_{j*}^{∆,R}(E(x⃗), ȳ, x_1^{j*−1}) },

where to simplify the notation, we write x⃗ = x_1^{j*+∆}, x̄ = x_{j*}^{j*+∆} and ȳ = y_{j*}^{j*+∆}.
By Lemma 18, Q_xy(E_{j*}) ≥ δ. Notice that E_{j*} does not depend on the distribution of the source, but only on the encoder-decoder pair. Define J = min{n, j* + ∆}, and redefine x̄ = x_{j*}^J, ȳ = y_{j*}^J. Now we write the strongly typical set

A_J^ε(q_xy) = { (x⃗, ȳ) ∈ X^{j*+∆} × Y^{J−j*+1} | ∀x: r_x(x) ∈ (q_x(x) − ε, q_x(x) + ε);
  ∀(x, y) ∈ X × Y with q_xy(x, y) > 0: r_{x̄,ȳ}(x, y) ∈ (q_xy(x, y) − ε, q_xy(x, y) + ε);
  ∀(x, y) ∈ X × Y with q_xy(x, y) = 0: r_{x̄,ȳ}(x, y) = 0 }

where the empirical distribution of (x̄, ȳ) is denoted by r_{x̄,ȳ}(x, y) = n_{x,y}(x̄, ȳ)/(J − j* + 1), and the empirical distribution of x_1^{j*−1} by r_x(x) = n_x(x_1^{j*−1})/(j* − 1).
Lemma 19 Q_xy(E_{j*} ∩ A_J^ε(q_xy)) ≥ δ/2 for large n and ∆.

Proof: Fix ∆/n and let n go to infinity; then min{j*, n − j*} goes to infinity. By the definition of J, min{j*, J − j*} goes to infinity as well. By Lemma 13.6.1 in [26], we know that for all ε > 0, if J − j* and j* are large enough, then Q_xy(A_J^ε(q_xy)^C) ≤ δ/2. By Lemma 18, Q_xy(E_{j*}) ≥ δ. So

Q_xy(E_{j*} ∩ A_J^ε(q_xy)) ≥ Q_xy(E_{j*}) − Q_xy(A_J^ε(q_xy)^C) ≥ δ/2.  ¤
Lemma 20 For all ε < min_{x,y: p_xy(x,y)>0} p_xy(x, y) and all (x⃗, ȳ) ∈ A_J^ε(q_xy),

p_xy(x⃗, ȳ) / Q_xy(x⃗, ȳ) ≥ 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)

where G = max{ |X||Y| + Σ_{x,y: p_xy(x,y)>0} log2(q_xy(x,y)/p_xy(x,y) + 1), |X| + Σ_x log2(q_x(x)/p_x(x) + 1) }.

Proof: For (x⃗, ȳ) ∈ A_J^ε(q_xy), it follows from the definition of the strongly typical set by simple algebra that D(r_{x̄,ȳ}‖p_xy) ≤ D(q_xy‖p_xy) + Gε and D(r_x‖p_x) ≤ D(q_x‖p_x) + Gε. Then

p_xy(x⃗, ȳ)/Q_xy(x⃗, ȳ)
  = [p_x(x_1^{j*−1})/q_x(x_1^{j*−1})] [p_xy(x̄, ȳ)/q_xy(x̄, ȳ)] [p_xy(x_{J+1}^{j*+∆}, y_{J+1}^{j*+∆})/p_xy(x_{J+1}^{j*+∆}, y_{J+1}^{j*+∆})]
  = [2^(−(j*−1)(D(r_x‖p_x)+H(r_x))) / 2^(−(j*−1)(D(r_x‖q_x)+H(r_x)))] [2^(−(J−j*+1)(D(r_{x̄,ȳ}‖p_xy)+H(r_{x̄,ȳ}))) / 2^(−(J−j*+1)(D(r_{x̄,ȳ}‖q_xy)+H(r_{x̄,ȳ})))]
  ≥(a) 2^(−(J−j*+1)(D(q_xy‖p_xy)+Gε) − (j*−1)(D(q_x‖p_x)+Gε))
  = 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)

(a) is true by Equation 12.60 in [26]. ¤
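The identity invoked here, p^n(x_1^n) = 2^(−n(D(r‖p)+H(r))) with r the empirical type of x_1^n, is exact and easy to check numerically. A sketch (ours; the sequence and pmf are arbitrary):

```python
from collections import Counter
from math import log2

def iid_log_prob(seq, p):
    # log2 of the iid probability of seq under pmf p.
    return sum(log2(p[s]) for s in seq)

def type_exponent(seq, p):
    # -n * (D(r || p) + H(r)), where r is the empirical type of seq.
    n = len(seq)
    r = {s: c / n for s, c in Counter(seq).items()}
    d = sum(rs * log2(rs / p[s]) for s, rs in r.items())
    h = -sum(rs * log2(rs) for rs in r.values())
    return -n * (d + h)

seq = (0, 1, 1, 0, 2, 1, 0, 0)
p = {0: 0.5, 1: 0.3, 2: 0.2}
print(iid_log_prob(seq, p), type_exponent(seq, p))   # the two agree
```

This is why the change of measure between p and q in Lemma 20 reduces to a difference of divergences against the same empirical type.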
Lemma 21 For all ε < min_{x,y: p_xy(x,y)>0} p_xy(x, y) and large ∆, n:

p_xy(E_{j*}) ≥ (δ/2) 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)

Proof: Combining Lemmas 19 and 20:

p_xy(E_{j*}) ≥ p_xy(E_{j*} ∩ A_J^ε(q_xy))
  ≥ Q_xy(E_{j*} ∩ A_J^ε(q_xy)) 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)
  ≥ (δ/2) 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)  ¤
Finishing the proof of Theorem 7

Now we are finally ready to prove Theorem 7. Notice that as long as H(q_x|y) > ((n+∆)/n)R, we have δ > 0, letting ε go to 0 while ∆ and n go to infinity proportionally. We have:

Pr[x̂_{j*}(j* + ∆) ≠ x_{j*}] = p_xy(E_{j*}) ≥ K 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x)).

Notice that D(q_xy‖p_xy) ≥ D(q_x‖p_x) and J = min{n, j* + ∆}; then for all possible j* ∈ [1, n] we have the following. For n ≥ ∆:

(J − j* + 1)D(q_xy‖p_xy) + (j* − 1)D(q_x‖p_x) ≤ (∆ + 1)D(q_xy‖p_xy) + (n − ∆ − 1)D(q_x‖p_x)
  ≈ ∆ ( D(q_xy‖p_xy) + ((n − ∆)/∆) D(q_x‖p_x) )

For n < ∆:

(J − j* + 1)D(q_xy‖p_xy) + (j* − 1)D(q_x‖p_x) ≤ nD(q_xy‖p_xy) = ∆ ( (n/∆) D(q_xy‖p_xy) )

Write α = ∆/n; then the upper bound on the error exponent is the minimum of the above exponents over all α > 0, i.e.:

Eupper_si(R) = min{ inf_{q_xy, α ≥ 1: H(q_x|y) > (1+α)R} (1/α) D(q_xy‖p_xy),
                    inf_{q_xy, 0 ≤ α ≤ 1: H(q_x|y) > (1+α)R} ((1−α)/α) D(q_x‖p_x) + D(q_xy‖p_xy) }

This finalizes the proof. ¥
5.6 Discussions
In this chapter, we attempted to derive a tight upper bound on the delay constrained distributed source coding error exponent. For source coding with decoder-only side-information, we gave an upper bound by using a feed-forward decoding argument borrowed from the channel coding literature [62, 67]. The upper bound is shown to be tight in two special cases, namely independent side-information and the low rate regime in the compression of encrypted data problem. We gave a generic achievability result by sequential random binning. The random coding based lower bound agrees with the upper bound in the compression of encrypted data problem. In the independent side-information case, the random coding error exponent is strictly suboptimal. For general cases, there is a gap between the upper and lower bounds over the whole rate region. We believe the lower bound can be improved by using a variable length random binning scheme. This is a difficult problem and should be studied further.
We also studied delay constrained source coding with both encoder and decoder side-information. With encoder side-information, this problem resembles point-to-point lossless source coding, and thus the delay constrained error exponent obeys a focusing type bound. The exponent with both encoder and decoder side-information is strictly higher than the upper bound for the decoder-only case. This phenomenon is called the "price of ignorance" [18] and is not observed in block coding. This is another example showing that block length is not the same as delay.
Chapter 6
Future Work
In this thesis, we studied asymptotic performance bounds for several delay constrained streaming source coding problems where the source is assumed to be iid with a constant arrival rate. What are the performance bounds if the sources are not iid? What if the arrivals are random? What if the distortion is not defined on a symbol by symbol basis? What are the non-asymptotic performance bounds for these problems? Most importantly, the main theme of this thesis is to figure out the dominant error event on a symbol by symbol basis. For a specific source symbol with a finite end-to-end decoding delay constraint, what is the most likely atypical event that causes an error? This is a fundamental question that has not yet been answered in several cases.
6.1 Past, present and future
“This duality can be pursued further and is related to the duality between past and
future and the notions of control and knowledge. Thus we may have knowledge of the past
but cannot control it; we may control the future but have no knowledge of it.”
– Claude Shannon [74]
What is the past, what is the present and what is the future? In delay constrained streaming coding problems, for a source symbol that enters the encoder at time t, we define the past as the time prior to t, the present as time t, and the future as the time between t and t + ∆, where ∆ is the finite delay constraint. Time after t + ∆ is irrelevant. With this definition, past, present and future are defined relative to a particular time t, as shown in Figure 6.1.
[Figure 6.1 appears here: a time axis from 0 marking the past (before t), the present (t), the future (t to t + ∆) and the irrelevant times (after t + ∆).]

Figure 6.1. Past, present and future at time t
6.1.1 Dominant error event
For delay constrained streaming coding, what is the dominant error event (atypical event) for time t? For a particular coding scheme, we define the dominant error event as the event with the highest probability that causes a decoding error for source symbol t. For delay constrained lossless source coding in Chapter 2, we discovered that the dominant error event for optimal coding is the atypical behavior of the source in the past; the future does not matter! This is shown in Figure 6.2. The same goes for the delay constrained lossy source coding problem in Chapter 3. However, the dominant error event for the sequential random binning scheme is the future atypicality of the binning and/or the source behavior, as shown in Section 2.3.3. This illustrates that dominant error events are coding scheme dependent.
We summarize what we know about the dominant error events for the optimal coding schemes in Table 6.1. In the table, we put an X mark if we have completely characterized the nature of the dominant error event, and a ? mark if we only have a conjecture about it. Note: for source coding with decoder only side-information and
[Figure 6.2 appears here: the time axis of Figure 6.1 with the dominant error interval marked between t − ∆/α* and t.]

Figure 6.2. The dominant error event for delay constrained lossless source coding is the atypicality of the source between time t − ∆/α* and time t, where α* is the optimizer in (2.7)
                                                       Past   Future   Both
Point-to-point lossless (Chapter 2)                     X
Point-to-point lossy (Chapter 3)                        X
Slepian-Wolf coding (Chapter 4)                                 ?
Source coding with decoder only side-info (Chapter 5)                    ?
Source coding with both side-information (Chapter 5)    X
Channel coding without feedback [67]                            X
Erasure-like channel coding with feedback [67]                           X
Joint source channel coding [15]                                         ?
MAC [14], broadcast channel coding [13]                         ?

Table 6.1. Type of dominant error events for different coding problems
joint source channel coding, we derive the upper bound by using the feed-forward decoding scheme, and the dominant error event spans across past and future. However, the lower bounds (achievability) are derived by a suboptimal sequential random binning scheme whose dominant error event is in the future.
6.1.2 How to deal with both past and future?
In this thesis, we develop several tools to deal with different delay constrained streaming source coding problems. For the lower bounds on the error exponent, we use sequential random binning to deal with future dominant error events, and we see that this scheme is often optimal, as shown in [67]. We use variable length coding with a FIFO queue to deal with past dominant error events, as shown in Chapters 2 and 3. For the upper bounds, we use the feed-forward decoding scheme and a simple focusing type argument to translate the delay constrained error event into a block coding error event, and then derive the upper bounds. Obviously, more tools are needed to deal with more complex problems. At what point can we claim that we have developed all the needed tools for the delay constrained streaming coding problems in Figure 1.1?
For those problems whose dominant error event spans across past and future, we do not have a concrete idea of what the optimal coding scheme is. The only exception is the erasure-like channel with feedback problem in [67]; this channel coding problem is a special case of source coding with delay. For the more general cases, namely source coding with decoder side-information and joint source channel coding, we believe that a "variable length random binning" scheme should be studied. This is a difficult problem and more serious research needs to be done in this area.
6.2 Source model and common randomness
In this thesis, the sources are always modeled as iid random variables on a finite alphabet
set. This assumption simplifies the analysis and captures the main challenges in streaming
source coding with delay. Our analysis for lossless source coding in Chapter 2 can be easily
generalized to finite alphabet stationary ergodic processes. The infinite alphabet case [59] is
very challenging and our current universal proof technique does not apply. For lossy source
coding in Chapter 3, an important question is: what if the sources are continuous random variables instead of discrete and the distortion measures are L2 distances? Is a variable length vector quantizer with a FIFO queueing system still optimal? Another interesting case is when we make no assumptions about the statistics of the source at all. This is the individual
sequence problem [11]. It seems that our universal variable length code and FIFO queue
would work just fine for individual sequences for the same reason that Lempel-Ziv coding is
optimal for individual sequences. A rigorous proof is needed. In our study of lossy source
coding, we focus on the per symbol loss case. What if the loss function is defined as an
average over a period of time? What can we learn from the classical sliding-block source
coding literature [45, 44, 7, 77]?
Another very interesting problem is when multiple sources have to share the same
bandwidth and encoder/decoder. We recently studied the error exponent tradeoff in the
block coding setup. The delay constrained performance for multiple streaming sources is a
more challenging problem. Also, as mentioned in Section 1.2, random arrivals of the source
symbols pose another dimension of challenges. Other than several special distributions of
the inter-symbol random arrivals, we do not have a general result on the delay constrained
error exponent. For the above problems, techniques from queueing theory [42] might be
useful.
In the achievability proofs for delay constrained distributed source coding in Chapters 4 and 5 and for channel coding [13, 14], we use a sequential random coding scheme.
This assumes that the encoder(s) and the decoder(s) share some common randomness. The
amount of randomness is unbounded since our coding system has an infinite horizon. A
natural question is whether there exists an encoder-decoder pair that achieves the random coding error exponents without the presence of common randomness. For the block coding cases,
the existence is proved by a simple argument [26, 41]. However, this argument does not
apply to the delay constrained coding problems because we are facing infinite horizons.
6.3 Small but interesting problems
In Chapter 2, we showed several properties of the delay constrained source coding error exponent Es(R), including the positive derivative of the error exponent at the entropy rate, etc. One important question is whether the exponent Es(R) is convex ∪. We conjecture that the answer is yes, although the channel coding counterpart can be neither convex nor concave [67]. We also conjecture that the delay constrained lossy source coding error exponent ED(R) is convex ∪. This is a more difficult problem, because there is no obvious parametrization of ED(R); thus the only possible way to show the convexity of ED(R) is through the definition of ED(R), by proving a general result for all focusing type bounds. An important question is: what is a sufficient condition on F(R) such that the following focusing function E(R) is convex ∪?
E(R) = inf_{α>0} (1/α) F((1 + α)R)    (6.1)
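As a sanity check for such questions, the focusing operator is easy to evaluate numerically for candidate F. For example, with F(R) = R² the infimand is ((1+α)²/α)R², minimized at α = 1, so E(R) = 4R², which is again convex ∪. A sketch (ours; the grid and example F are arbitrary) confirming this:

```python
import numpy as np

def focusing(F, R):
    # E(R) = inf_{alpha > 0} (1/alpha) * F((1 + alpha) R), on a log grid.
    alphas = np.geomspace(1e-3, 1e3, 200001)
    return min(F((1 + a) * R) / a for a in alphas)

E = focusing(lambda r: r * r, 0.5)
print(E)   # analytically E(0.5) = 4 * 0.25 = 1, attained at alpha = 1
```

Such numerical experiments can suggest, though of course not prove, which conditions on F preserve convexity under the focusing operator.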
We do not have a parametrization result for the upper bound on the delay constrained source coding with decoder side-information error exponent in Theorem 7. A parametrization result like that in Proposition 7 could greatly simplify the calculation of the upper bound.

We used the feed-forward decoding scheme to upper bound the error exponents. All the problems we have studied using this technique are point-to-point coding problems with one encoder and one decoder. An important future direction is to generalize our current technique to distributed source coding and channel coding. This should be a reasonably easy problem for delay constrained Slepian-Wolf coding, by modifying the feed-forward diagram in Figure 5.15.
Lastly, we study the error exponents in the asymptotic regime, i.e., the error exponent tells how fast the error probability decays to zero when the delay is large, as shown in (1.2) with ≈ instead of =. In order to accurately bound the error probability in the short delay regime, we cannot ignore the usually ignored polynomial terms in the expressions for the error probabilities. This could be an extremely laborious problem that lacks mathematical neatness. We leave this problem to more practically minded researchers.
I do not expect my thesis to be error free. Please send your comments, suggestions, questions, worries, concerns and solutions to the open problems to [email protected]
Bibliography
[1] Rudolf Ahlswede and Imre Csiszar. Common randomness in information theory and
cryptography part I: Secret sharing. IEEE Transactions on Information Theory,
39:1121–1132, 1993.
[2] Rudolf Ahlswede and Gunter Dueck. Good codes can be produced by a few permuta-
tions. IEEE Transactions on Information Theory, 28:430–443, 1982.
[3] Venkat Anantharam and Sergio Verdu. Bits through queues. IEEE Transactions on
Information Theory, 42:4–18, 1996.
[4] Andrew Barron, Jorma Rissanen, and Bin Yu. The minimum description length prin-
ciple in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–
2760, 1998.
[5] Toby Berger. Rate Distortion Theory: A mathematical basis for data compression.
Prentice-Hall, 1971.
[6] Toby Berger and Jerry D. Gibson. Lossy source coding. IEEE Transactions on Infor-
mation Theory, 44:2693–2723, 1998.
[7] Toby Berger and Joseph KaYin Lau. On binary sliding block codes. IEEE Transactions
on Information Theory, 23:343–353, 1977.
[8] Patrick Billingsley. Probability and Measure. Wiley-Interscience, 1995.
[9] Shashi Borade and Lizhong Zheng. I-projection and the geometry of error exponents.
Allerton Conference, 2006.
[10] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
[11] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
[12] Cheng Chang, Stark Draper, and Anant Sahai. Sequential random binning for stream-
ing distributed source coding. submitted to IEEE Transactions on Information Theory,
2006.
[13] Cheng Chang and Anant Sahai. Sequential random coding error exponents for degraded
broadcast channels. Allerton Conference, 2005.
[14] Cheng Chang and Anant Sahai. Sequential random coding error exponents for multiple
access channels. WirelessCom 05 Symposium on Information Theory, 2005.
[15] Cheng Chang and Anant Sahai. Error exponents for joint source-channel coding with
delay-constraints. Allerton Conference, 2006.
[16] Cheng Chang and Anant Sahai. Upper bound on error exponents with delay for lossless
source coding with side-information. ISIT, 2006.
[17] Cheng Chang and Anant Sahai. Delay-constrained source coding for a peak distortion
measure. ISIT, 2007.
[18] Cheng Chang and Anant Sahai. The price of ignorance: the impact on side-information
for delay in lossless source coding. submitted to IEEE Transactions on Information
Theory, 2007.
[19] Cheng Chang and Anant Sahai. Universal quadratic lower bounds on source coding
error exponents. Conference on Information Sciences and Systems, 2007.
[20] Cheng Chang and Anant Sahai. The error exponent with delay for lossless source
coding. Information Theory Workshop, Punta del Este, Uruguay, March 2006.
[21] Cheng Chang and Anant Sahai. Universal fixed-length coding redundancy. Information
Theory Workshop, Lake Tahoe, CA, September 2007.
[22] Jun Chen, Dake He, Ashish Jagmohan, and Luis A. Lastras-Montano. On the duality
and difference between Slepian-Wolf coding and channel coding. Information Theory
Workshop, Lake Tahoe, CA, USA, September 2007.
[23] Gerard Cohen, Iiro Honkala, Simon Litsyn, and Antoine Lobstein. Covering Codes.
Elsevier, 1997.
[24] Thomas M. Cover. Broadcast channels. IEEE Transactions on Information Theory,
18:2–14, 1972.
[25] Thomas M. Cover and Abbas El Gamal. Capacity theorems for the relay channel.
IEEE Transactions on Information Theory, pages 572–584, 1979.
[26] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley
and Sons Inc., New York, 1991.
[27] Imre Csiszar. Joint source-channel error exponent. Problem of Control and Information
Theory, 9:315–328, 1980.
[28] Imre Csiszar. The method of types. IEEE Transactions on Information Theory,
44:2505–2523, 1998.
[29] Imre Csiszar and Janos Korner. Information Theory. Akademiai Kiado, Budapest,
1986.
[30] Imre Csiszar and Paul C. Shields. Information Theory and Statistics: A Tutorial.
http://www.nowpublishers.com/getpdf.aspx?doi=0100000004&product=CIT.
[31] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications.
Springer, 1998.
[32] Stark Draper, Cheng Chang, and Anant Sahai. Sequential random binning for stream-
ing distributed source coding. ISIT, 2005.
[33] Richard Durrett. Probability: Theory and Examples. Duxbury Press, 1995.
[34] David Forney. Convolutional codes III. sequential decoding. Information and Control,
25:267–297, 1974.
[35] Robert Gallager. Fixed composition arguments and lower bounds to error probability.
http://web.mit.edu/gallager/www/notes/notes5.pdf.
[36] Robert Gallager. Low Density Parity Check Codes, Sc.D. thesis. Massachusetts Institute of Technology, 1960.
[37] Robert Gallager. Capacity and coding for degraded broadcast channels. Problemy
Peredachi Informatsii, 10(3):3–14, 1974.
[38] Robert Gallager. Basic limits on protocol information in data communication networks.
IEEE Transactions on Information Theory, 22:385–398, 1976.
[39] Robert Gallager. Source coding with side information and universal coding. Technical
Report LIDS-P-937, Massachusetts Institute of Technology, 1976.
[40] Robert Gallager. A perspective on multiaccess channels. IEEE Transactions on Infor-
mation Theory, 31, 1985.
[41] Robert G. Gallager. Information Theory and Reliable Communication. John Wiley,
New York, NY, 1971.
[42] Ayalvadi Ganesh, Neil O'Connell, and Damon Wischik. Big Queues. Springer,
Berlin/Heidelberg, 2004.
[43] Michael Gastpar. To code or not to code, PhD thesis. Ecole Polytechnique Federale
de Lausanne, 2002.
[44] Robert M. Gray. Sliding-block source coding. IEEE Transactions on Information
Theory, 21:357–368, 1975.
[45] Robert M. Gray, David L. Neuhoff, and Donald S. Ornstein. Nonblock source coding
with a fidelity criterion. The Annals of Probability, 3(3):478–491, 1975.
[46] Martin Hellman. An extension of the Shannon theory approach to cryptography. IEEE
Transactions on Information Theory, 23(3):289–294, 1977.
[47] Michael Horstein. Sequential transmission using noiseless feedback. IEEE Transactions
on Information Theory, 9(3):136 – 143, 1963.
[48] Arding Hsu. Data management in delayed conferencing. Proceedings of the 10th Inter-
national Conference on Data Engineering, page 402, Feb 1994.
[49] Frederick Jelinek. Buffer overflow in variable length coding of fixed rate sources. IEEE
Transactions on Information Theory, 14:490–501, 1968.
[50] Mark Johnson, Prakash Ishwar, Vinod Prabhakaran, Dan Schonberg, and Kannan
Ramchandran. On compressing encrypted data. IEEE Transactions on Signal Pro-
cessing, 52(10):2992–3006, 2004.
[51] Aggelos Katsaggelos, Lisimachos Kondi, Fabian Meier, Jorn Ostermann, and Guido
Schuster. MPEG-4 and rate-distortion-based shape-coding techniques. Proceedings of the IEEE,
special issue on Multimedia Signal Processing, 86:1126–1154, 1998.
[52] Leonard Kleinrock. Queueing Systems. Wiley-Interscience, 1975.
[53] Amos Lapidoth and Prakash Narayan. Reliable communication under channel uncer-
tainty. IEEE Transactions on Signal Processing, 44:2148–2177, 1998.
[54] Michael E. Lukacs and David G. Boyer. A universal broadband multipoint telecon-
ferencing service for the 21st century. IEEE Communications Magazine, 33:36 – 43,
1995.
[55] Katalin Marton. Error exponent for source coding with a fidelity criterion. IEEE
Transactions on Information Theory, 20(2):197–199, 1974.
[56] Ueli M. Maurer. Secret key agreement by public discussion from common information.
IEEE Transactions on Information Theory, 39(3):733–742, 1993.
[57] Neri Merhav and Ioannis Kontoyiannis. Source coding exponents for zero-delay coding
with finite memory. IEEE Transactions on Information Theory, 49(3):609–625, 2003.
[58] David L. Neuhoff and R. Kent Gilbert. Causal source codes. IEEE Transactions on
Information Theory, 28(5):701–713, 1982.
[59] Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan
Zhang. Limit results on pattern entropy. IEEE Transactions on Information The-
ory, 52(7):2954–2964, 2006.
[60] Hari Palaiyanur and Anant Sahai. Sequential decoding using side-information. IEEE
Transactions on Information Theory, submitted, 2007.
[61] Larry L. Peterson and Bruce S. Davie. Computer Networks: A Systems Approach.
Morgan Kaufmann, 1999.
[62] Mark Pinsker. Bounds of the probability and of the number of correctable errors for
nonblock codes. Translation from Problemy Peredachi Informatsii, 3:44–55, 1967.
[63] Vinod Prabhakaran and Kannan Ramchandran. On secure distributed source coding.
Information Theory Workshop, Lake Tahoe, CA, USA, September 2007.
[64] S. Sandeep Pradhan, Jim Chou, and Kannan Ramchandran. Duality between source
coding and channel coding and its extension to the side information case. IEEE Trans-
actions on Information Theory, 49(5):1181– 1203, May 2003.
[65] Thomas J. Richardson and Rudiger L. Urbanke. The capacity of low-density parity-
check codes under message-passing decoding. IEEE Transactions on Information The-
ory, 47(2):599–618, 2001.
[66] Jorma Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[67] Anant Sahai. Why block length and delay are not the same thing. IEEE Transac-
tions on Information Theory, submitted. http://www.eecs.berkeley.edu/~sahai/
Papers/FocusingBound.pdf.
[68] Anant Sahai, Stark Draper, and Michael Gastpar. Boosting reliability over awgn net-
works with average power constraints and noiseless feedback. ISIT, 2005.
[69] Anant Sahai and Sanjoy K. Mitter. The necessity and sufficiency of anytime capacity
for stabilization of a linear system over a noisy communication link: Part I: scalar
systems. IEEE Transactions on Information Theory, 52(8):3369 – 3395, August 2006.
[70] J.P.M. Schalkwijk and Thomas Kailath. A coding scheme for additive noise channels
with feedback–I: No bandwidth constraint. IEEE Transactions on Information Theory,
12(2):172–182, 1966.
[71] Dan Schonberg. Practical Distributed Source Coding and Its Application to the Com-
pression of Encrypted Data, PhD thesis. University of California, Berkeley, Berkeley,
CA, 2007.
[72] Claude Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, 1948.
[73] Claude Shannon. Communication theory of secrecy systems. Bell System Technical
Journal, 28(4):656–715, 1949.
[74] Claude Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE
National Convention Record, 7(4):142–163, 1959.
[75] Claude Shannon, Robert Gallager, and Elwyn Berlekamp. Lower bounds to error
probability for coding on discrete memoryless channels I. Information and Control,
10:65–103, 1967.
[76] Claude Shannon, Robert Gallager, and Elwyn Berlekamp. Lower bounds to error
probability for coding on discrete memoryless channels II. Information and Control,
10:522–552, 1967.
[77] Paul Shields and David Neuhoff. Block and sliding-block source coding. IEEE Trans-
actions on Information Theory, 23:211 – 215, 1977.
[78] Wang Shuo. I’m your daddy. Tianjin People’s Press, 2007.
[79] David Slepian and Jack Wolf. Noiseless coding of correlated information sources. IEEE
Transactions on Information Theory, 19, 1973.
[80] David Tse. Variable-rate Lossy Compression and its Effects on Communication Net-
works, PhD thesis. Massachusetts Institute of Technology, 1994.
[81] David Tse, Robert Gallager, and John Tsitsiklis. Optimal buffer control for variable-
rate lossy compression. 31st Allerton Conference, 1993.
[82] Sergio Verdu and Steven W. McLaughlin. Information Theory: 50 Years of Discovery.
Wiley-IEEE Press, 1999.
[83] Terry Welch. A technique for high-performance data compression. IEEE Computer,
17(6):8–19, 1984.
[84] Wikipedia. Biography of Wang Shuo, http://en.wikipedia.org/wiki/Wang_Shuo.
[85] Hirosuke Yamamoto and Kohji Itoh. Asymptotic performance of a modified Schalkwijk-
Barron scheme for channels with noiseless feedback. IEEE Transactions on Information
Theory, 25(6):729–733, 1979.
[86] Zhusheng Zhang. Analysis— a new perspective. Peking University Press, 1995.
[87] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression.
IEEE Transactions on Information Theory, 23(3):337–343, 1977.
[88] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate
coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.
Appendix A
Review of fixed-length block source
coding
We review the classical fixed-length block source coding results in this appendix.
A.1 Lossless Source Coding and Error Exponent
[Figure: $x_1^n \to$ Encoder $\to b_1^{\lfloor nR \rfloor} \to$ Decoder $\to \hat{x}_1^n$]
Figure A.1. Block lossless source coding
Consider a discrete memoryless iid source with distribution $p_x$ defined on a finite alphabet
$\mathcal{X}$. A rate $R$ block source coding system for $n$ source symbols consists of an encoder-decoder
pair $(\mathcal{E}_n, \mathcal{D}_n)$, as shown in Figure A.1, where

$$\mathcal{E}_n : \mathcal{X}^n \to \{0,1\}^{\lfloor nR \rfloor}, \qquad \mathcal{E}_n(x_1^n) = b_1^{\lfloor nR \rfloor}$$
$$\mathcal{D}_n : \{0,1\}^{\lfloor nR \rfloor} \to \mathcal{X}^n, \qquad \mathcal{D}_n(b_1^{\lfloor nR \rfloor}) = \hat{x}_1^n$$

The probability of block decoding error is $\Pr[x_1^n \neq \hat{x}_1^n] = \Pr[x_1^n \neq \mathcal{D}_n(\mathcal{E}_n(x_1^n))]$.
In his seminal paper [72], Shannon proved that arbitrarily small error probabilities are
achievable by letting $n$ get big, as long as the encoder rate is larger than the entropy of the
source, $R > H(p_x)$, where $H(p_x) = \sum_{x \in \mathcal{X}} -p_x(x)\log p_x(x)$. Furthermore, it turns out that
the error probability goes to zero exponentially in $n$.

Theorem 9 (From [29]) For a discrete memoryless source $x \sim p_x$ and encoder rate $R < \log|\mathcal{X}|$:

$\forall \epsilon > 0$, $\exists K < \infty$, s.t. $\forall n \geq 0$, $\exists$ a block encoder-decoder pair $(\mathcal{E}_n, \mathcal{D}_n)$ such that

$$\Pr[x_1^n \neq \hat{x}_1^n] \leq K 2^{-n(E_{s,b}(R) - \epsilon)} \tag{A.1}$$

This result is asymptotically tight, in the sense that for any sequence of encoder-decoder
pairs $(\mathcal{E}_n, \mathcal{D}_n)$,

$$\limsup_{n \to \infty} -\frac{1}{n}\log \Pr[x_1^n \neq \hat{x}_1^n] = \limsup_{n \to \infty} -\frac{1}{n}\log \Pr[x_1^n \neq \mathcal{D}_n(\mathcal{E}_n(x_1^n))] \leq E_{s,b}(R) \tag{A.2}$$

where $E_{s,b}(R)$ is defined as the block source coding error exponent with the form:

$$E_{s,b}(R) = \min_{q : H(q) \geq R} D(q \| p_x) \tag{A.3}$$
Paralleling the definition of the Gallager function for channel coding [41], as mentioned
as an exercise in [29]:

$$E_{s,b}(R) = \sup_{\rho \geq 0}\big\{\rho R - E_0(\rho)\big\} \tag{A.4}$$

where $E_0(\rho) = (1+\rho)\log\big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\big]$.
In [29], it is shown that if the encoder randomly assigns a bin number in $\{1, 2, \ldots, 2^{\lfloor nR \rfloor}\}$
with equal probability to the source sequence, and the decoder performs a maximum likelihood
or minimum empirical entropy decoding rule over the source sequences in the same bin,
the random coding error exponent for block source coding is:

$$E_r(R) = \min_q \big\{D(q \| p_x) + |R - H(q)|^+\big\} = \sup_{\rho \in [0,1]}\Big\{\rho R - (1+\rho)\log\big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\big]\Big\} \tag{A.5}$$

where $|t|^+ = \max\{0, t\}$. This error exponent is the same as the block coding error exponent
(A.3) in the low rate regime and strictly lower than the block coding error exponent in the
high rate regime [29]. In Section 2.3.3, we show that this random coding error exponent is
also achievable in the delay constrained source coding setup for streaming data. However,
in Section 2.5.1 we will show that both the random coding error exponent and the block coding
error exponent are suboptimal everywhere for $R \in (H(p_x), \log|\mathcal{X}|)$ in the delay constrained
setup.
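The relation between (A.4) and (A.5) is easy to see numerically. The following Python sketch (our own illustration, not part of the dissertation; the binary source and the grid resolutions are arbitrary choices) evaluates both exponents on a grid of $\rho$ and confirms that they coincide at low rates, while the random coding exponent is strictly smaller at high rates, where the optimizing $\rho$ exceeds 1.

```python
import math

def E0(p, rho):
    # Gallager-style function for source coding: (1+rho) log2 sum_x p(x)^(1/(1+rho))
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def Esb(p, R, rho_max=50.0, steps=50000):
    # block source coding exponent (A.4): sup over rho >= 0, on a finite grid
    return max((rho_max * i / steps) * R - E0(p, rho_max * i / steps)
               for i in range(steps + 1))

def Er(p, R, steps=10000):
    # random binning exponent (A.5): same objective, rho restricted to [0, 1]
    return max((i / steps) * R - E0(p, i / steps) for i in range(steps + 1))

p = [0.9, 0.1]                               # example binary source
H = -sum(px * math.log2(px) for px in p)     # H(p), about 0.469 bits
low_gap = Esb(p, 0.6) - Er(p, 0.6)           # low rate: the exponents agree
high_gap = Esb(p, 0.95) - Er(p, 0.95)        # high rate: Er is strictly smaller
```

The comparison only depends on where the unconstrained optimizer of $\rho R - E_0(\rho)$ sits relative to $\rho = 1$.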
A.2 Lossy source coding
Now consider a discrete memoryless iid source with distribution $p_x$ defined on $\mathcal{X}$. A rate
$R$ block lossy source coding system for $N$ source symbols consists of an encoder-decoder
pair $(\mathcal{E}_N, \mathcal{D}_N)$, as shown in Figure A.2, where

$$\mathcal{E}_N : \mathcal{X}^N \to \{0,1\}^{\lfloor NR \rfloor}, \qquad \mathcal{E}_N(x_1^N) = b_1^{\lfloor NR \rfloor}$$
$$\mathcal{D}_N : \{0,1\}^{\lfloor NR \rfloor} \to \mathcal{Y}^N, \qquad \mathcal{D}_N(b_1^{\lfloor NR \rfloor}) = y_1^N$$
Instead of attempting to estimate the source symbols $x_1^N$ exactly as in Chapter 2, in lossy
source coding the design goal of the system is to reconstruct the source symbols $x_1^N$ within
an average distortion $D > 0$.
[Figure: $x_1^N \to$ Encoder $\to b_1^{\lfloor NR \rfloor} \to$ Decoder $\to y_1^N$]
Figure A.2. Block lossy source coding
A.2.1 Rate distortion function and error exponent for block coding under
average distortion
The key issue of block lossy source coding is to determine the minimum rate $R$ such
that the distortion constraint $d(x_1^N, y_1^N) \leq D$ is satisfied with probability close to 1. For
average distortion, the distortion between two sequences is the average of the distortions of
the individual symbols:

$$d(x_1^N, y_1^N) = \frac{1}{N}\sum_{i=1}^{N} d(x_i, y_i)$$
In [74], Shannon first proved the following rate distortion theorem:

Theorem 10 The rate-distortion function $R(D)$ for the average distortion measure is:

$$R(p_x, D) \triangleq \min_{W \in \mathcal{W}_D} I(p_x, W) \tag{A.6}$$

where $\mathcal{W}_D$ is the set of all transition matrices that satisfy the average distortion
constraint, i.e.

$$\mathcal{W}_D = \Big\{W : \sum_{x,y} p_x(x)W(y|x)d(x,y) \leq D\Big\}.$$

Operationally, this theorem says that for any $\epsilon > 0$ and $\delta > 0$, for block length $N$ big enough,
there exists a code of rate $R(p_x, D) + \epsilon$ such that the average distortion between the source
string $x_1^N$ and its reconstruction $y_1^N$ is no bigger than $D$ with probability at least $1-\delta$.
This problem is only interesting if the target average distortion $D$ is higher than $\underline{D}$ and
lower than $\overline{D}$, where

$$\underline{D} \triangleq \sum_{x \in \mathcal{X}} p_x(x)\min_{y \in \mathcal{Y}} d(x,y), \qquad \overline{D} \triangleq \min_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_x(x)d(x,y).$$

Below $\underline{D}$ even the best symbol-by-symbol reproduction cannot meet the constraint, while
$\overline{D}$ is already achievable at rate zero by always outputting a single fixed reproduction symbol.
Interested readers may read [6].
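The minimization in (A.6) can be evaluated with the standard Blahut-Arimoto algorithm, which is not discussed in the dissertation but is sketched here for illustration (the source, distortion matrix, Lagrange slope $s$, and iteration count are our own arbitrary choices). For a fixed slope $s$ the iteration alternates between the optimal test channel for the current output marginal and the marginal induced by that channel; for a binary source under Hamming distortion the result can be checked against the known closed form $R(D) = h(p) - h(D)$.

```python
import math

def ba_point(p, d, s, iters=20000):
    """Blahut-Arimoto iteration for one point of the rate-distortion curve of
    source p under distortion matrix d[x][y], at Lagrange slope s >= 0."""
    nx, ny = len(p), len(d[0])
    q = [1.0 / ny] * ny                           # reproduction marginal, init uniform
    for _ in range(iters):
        # optimal test channel for the current q: W(y|x) proportional to q(y) 2^(-s d(x,y))
        W = [[q[y] * 2.0 ** (-s * d[x][y]) for y in range(ny)] for x in range(nx)]
        W = [[w / sum(row) for w in row] for row in W]
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    D = sum(p[x] * W[x][y] * d[x][y] for x in range(nx) for y in range(ny))
    R = sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
            for x in range(nx) for y in range(ny) if W[x][y] > 0)
    return R, D

p = [0.75, 0.25]                 # binary source
d = [[0, 1], [1, 0]]             # Hamming distortion
R, D = ba_point(p, d, s=2.0)     # slope 2 corresponds to D = 1/(1+2^s) = 0.2
```

The convexity of $R(p_x, D)$ in $D$ mentioned below is what makes this single-slope parameterization trace the whole curve.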
For distortion constraint $D > \underline{D}$, we have the following fixed-to-variable length coding
result for the average distortion measure. To have $\Pr[d(x_1^N, y_1^N) > D] = 0$, we can implement a
universal variable-length prefix-free code with code length $l_D(x_1^N)$, where

$$l_D(x_1^N) = N\big(R(p_{x_1^N}, D) + \delta_N\big) \tag{A.7}$$

where $p_{x_1^N}$ is the empirical distribution of $x_1^N$, and $\delta_N$ goes to 0 as $N$ goes to infinity.
This is a simple corollary of the type covering lemma [29, 28], which is derived from the
Johnson-Stein-Lovász theorem [23].
It is widely known that $R(p_x, D)$ is convex $\cup$ in $D$ for fixed $p_x$ [5]. However, the rate
distortion function $R(p_x, D)$ for the average distortion measure is in general non-concave (non-$\cap$)
in the source distribution $p_x$ for fixed distortion constraint $D$, as pointed out in [55].

Now we present the large deviation properties of lossy source coding under average
distortion measures, from [55].
Theorem 11 Block coding error exponent under average distortion:

$$\liminf_{N \to \infty} -\frac{1}{N}\log_2 \Pr[d(x_1^N, y_1^N) > D] = E_D^{b,average}(R)$$
$$\text{where } E_D^{b,average}(R) = \min_{q_x : R(q_x, D) > R} D(q_x \| p_x) \tag{A.8}$$

where $y_1^N$ is the reconstruction of $x_1^N$ using an optimal rate $R$ code.
A.3 Block distributed source coding and error exponents
In the classic block-coding Slepian-Wolf paradigm [79, 29, 39] (illustrated in Figure 4.1),
full length-$N$ vectors¹ $x^N$ and $y^N$ are observed by their respective encoders before communi-
cation starts. In the block coding setup, a rate-$(R_x, R_y)$ length-$N$ block source code consists
of an encoder-decoder triplet $(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$, as we will define shortly in Definition 13.

Definition 13 A randomized length-$N$ rate-$(R_x, R_y)$ block encoder-decoder triplet
$(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$ is a set of maps²:

$$\mathcal{E}_N^x : \mathcal{X}^N \to \{0,1\}^{NR_x}, \quad \text{e.g., } \mathcal{E}_N^x(x^N) = a^{NR_x}$$
$$\mathcal{E}_N^y : \mathcal{Y}^N \to \{0,1\}^{NR_y}, \quad \text{e.g., } \mathcal{E}_N^y(y^N) = b^{NR_y}$$
$$\mathcal{D}_N : \{0,1\}^{NR_x} \times \{0,1\}^{NR_y} \to \mathcal{X}^N \times \mathcal{Y}^N, \quad \text{e.g., } \mathcal{D}_N(a^{NR_x}, b^{NR_y}) = (\hat{x}^N, \hat{y}^N)$$

where common randomness, similar to the point-to-point source coding case in Section 2.3.3,
shared between the encoders and the decoder is assumed. This allows us to randomize the
mappings independently of the source sequences.
The error probability typically considered in Slepian-Wolf coding is the joint error prob-
ability, $\Pr[(x^N, y^N) \neq (\hat{x}^N, \hat{y}^N)] = \Pr[(x^N, y^N) \neq \mathcal{D}_N(\mathcal{E}_N^x(x^N), \mathcal{E}_N^y(y^N))]$. This probability
is taken over the random source vectors as well as the randomized mappings. An error
exponent $E$ is said to be achievable if there exists a family of rate-$(R_x, R_y)$ encoders and
decoders $(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$, indexed by $N$, such that

$$\lim_{N \to \infty} -\frac{1}{N}\log \Pr[(x^N, y^N) \neq (\hat{x}^N, \hat{y}^N)] \geq E. \tag{A.9}$$

¹In the block coding part, we simply use $x^N$ instead of $x_1^N$ to denote the sequence of random variables
$x_1, \ldots, x_N$.
²For the sake of simplicity, we assume $NR_x$ and $NR_y$ are integers. It should be clear that this assumption
is insignificant in the asymptotic regime where $N$ is big and the integer effect can be ignored.
In this thesis, we study random source vectors $(x^N, y^N)$ that are iid across time but
may have dependencies at any given time:

$$p_{xy}(x^N, y^N) = \prod_{i=1}^{N} p_{xy}(x_i, y_i).$$

Without loss of generality, we assume the marginal distributions are nonzero, i.e. $p_x(x) > 0$
for all $x \in \mathcal{X}$, and $p_y(y) > 0$ for all $y \in \mathcal{Y}$.
For such iid sources, upper and lower bounds on the achievable error exponents are
derived in [39, 29]. These results are summarized in the following theorem.
Theorem 12 (Lower bound) Given a rate pair $(R_x, R_y)$ such that $R_x > H(x|y)$, $R_y >
H(y|x)$, $R_x + R_y > H(x,y)$. Then, for all

$$E < \min_{\bar{x},\bar{y}}\Big\{D(p_{\bar{x}\bar{y}} \| p_{xy}) + \big|\min[R_x + R_y - H(\bar{x},\bar{y}),\ R_x - H(\bar{x}|\bar{y}),\ R_y - H(\bar{y}|\bar{x})]\big|^+\Big\} \tag{A.10}$$

there exists a family of randomized encoder-decoder mappings as defined in Definition 13
such that (A.9) is satisfied. In (A.10) the function $|z|^+ = z$ if $z \geq 0$ and $|z|^+ = 0$ if $z < 0$.

(Upper bound) Given a rate pair $(R_x, R_y)$ such that $R_x > H(x|y)$, $R_y > H(y|x)$, $R_x +
R_y > H(x,y)$. Then, for all

$$E > \min\Big\{\min_{\bar{x},\bar{y} : R_x < H(\bar{x}|\bar{y})} D(p_{\bar{x}\bar{y}} \| p_{xy}),\ \min_{\bar{x},\bar{y} : R_y < H(\bar{y}|\bar{x})} D(p_{\bar{x}\bar{y}} \| p_{xy}),\ \min_{\bar{x},\bar{y} : R_x + R_y < H(\bar{x},\bar{y})} D(p_{\bar{x}\bar{y}} \| p_{xy})\Big\} \tag{A.11}$$

there does not exist a randomized encoder-decoder mapping as defined in Definition 13 such
that (A.9) is satisfied.

In both bounds $(\bar{x}, \bar{y})$ are arbitrary random variables with joint distribution $p_{\bar{x}\bar{y}}$.
Remark: As long as $(R_x, R_y)$ is in the interior of the achievable region, i.e., $R_x > H(x|y)$,
$R_y > H(y|x)$ and $R_x + R_y > H(x,y)$, the lower bound (A.10) is positive. The achievable
region is illustrated in Figure A.3. As shown in [29], the upper and lower bounds (A.11)
and (A.10) match when the rate pair $(R_x, R_y)$ is achievable and close to the boundary of
the region. This is analogous to the high rate regime in channel coding, or the low rate
regime in source coding, where the random coding bound (analogous to (A.10)) and the
sphere packing bound (analogous to (A.11)) agree.
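Membership of a rate pair in the achievable region is determined by three entropies computed directly from the joint distribution. The following sketch (our own illustration; the joint pmf and the tested rate pairs are arbitrary examples) makes the region of Figure A.3 concrete.

```python
import math

def H(probs):
    # entropy in bits of a probability vector
    return -sum(p * math.log2(p) for p in probs if p > 0)

def sw_entropies(pxy):
    # joint and conditional entropies from the joint pmf table pxy[x][y]
    flat = [p for row in pxy for p in row]
    px = [sum(row) for row in pxy]
    py = [sum(pxy[x][y] for x in range(len(pxy))) for y in range(len(pxy[0]))]
    Hxy = H(flat)
    return Hxy, Hxy - H(py), Hxy - H(px)      # H(x,y), H(x|y), H(y|x)

def achievable(Rx, Ry, pxy):
    # interior of the Slepian-Wolf region of Figure A.3
    Hxy, Hx_y, Hy_x = sw_entropies(pxy)
    return Rx > Hx_y and Ry > Hy_x and Rx + Ry > Hxy

pxy = [[0.4, 0.1],
       [0.1, 0.4]]     # H(x|y) = H(y|x) about 0.722, H(x,y) about 1.722 bits
```

Each of the three inequalities corresponds to one face of the region: the two conditional-entropy corners and the sum-rate boundary.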
Theorem 12 can also be used to generate bounds on the exponent for source coding with
decoder side-information (i.e., y observed at the decoder), and for source coding without
side information (i.e., y is independent of x and the decoder only needs to decode x).
Corollary 3 (Source coding with decoder side-information) Consider a Slepian-Wolf prob-
lem where $y$ is known by the decoder. Given a rate $R_x$ such that $R_x > H(x|y)$, then for
all

$$E < \min_{\bar{x},\bar{y}}\big\{D(p_{\bar{x}\bar{y}} \| p_{xy}) + |R_x - H(\bar{x}|\bar{y})|^+\big\}, \tag{A.12}$$

there exists a family of randomized encoder-decoder mappings as defined in Definition 13
such that (A.9) is satisfied.
The proof of Corollary 3 follows from Theorem 12 by letting $R_y$ be sufficiently large
($> \log|\mathcal{Y}|$). Similarly, by letting $y$ be independent of $x$ so that $H(x|y) = H(x)$, we get the
following random-coding bound for the point-to-point case of a single source $x$, which is the
random coding part of Theorem 9 in Section A.1.
Corollary 4 (Point-to-point) Consider a Slepian-Wolf problem where $y$ is independent of
$x$. Given a rate $R_x$ such that $R_x > H(x)$, for all

$$E < \min_{\bar{x}}\big\{D(p_{\bar{x}} \| p_x) + |R_x - H(\bar{x})|^+\big\} = E_r(R_x) \tag{A.13}$$

there exists a family of randomized encoder-decoder triplets as defined in Definition 13 such
that (A.9) is satisfied.
A.4 Review of block source coding with side-information
As shown in Figure 5.1, the sources are iid random variables $x_1^n, y_1^n$ from a finite alphabet
$\mathcal{X} \times \mathcal{Y}$ with distribution $p_{xy}$. Without loss of generality, we assume that $p_x(x) > 0$, $\forall x \in \mathcal{X}$,
and $p_y(y) > 0$, $\forall y \in \mathcal{Y}$. $x_1^n$ is the source known to the encoder and $y_1^n$ is the side-information
known only to the decoder. A rate $R$ block source coding system for $n$ source symbols
consists of an encoder-decoder pair $(\mathcal{E}_n, \mathcal{D}_n)$, where

$$\mathcal{E}_n : \mathcal{X}^n \to \{0,1\}^{\lfloor nR \rfloor}, \qquad \mathcal{E}_n(x_1^n) = b_1^{\lfloor nR \rfloor}$$
$$\mathcal{D}_n : \{0,1\}^{\lfloor nR \rfloor} \times \mathcal{Y}^n \to \mathcal{X}^n, \qquad \mathcal{D}_n(b_1^{\lfloor nR \rfloor}, y_1^n) = \hat{x}_1^n$$
[Figure: axes $R_x$ (up to $\log|\mathcal{X}|$) versus $R_y$ (up to $\log|\mathcal{Y}|$); the achievable region lies above the corner at $(H(x|y), H(y))$ and $(H(x), H(y|x))$, bounded by $R_x \geq H(x|y)$, $R_y \geq H(y|x)$, and the line $R_x + R_y = H(x,y)$.]
Figure A.3. Achievable region for Slepian-Wolf source coding
The error probability is $\Pr[x_1^n \neq \hat{x}_1^n] = \Pr[x_1^n \neq \mathcal{D}_n(\mathcal{E}_n(x_1^n), y_1^n)]$. The exponent $E_{si,b}(R)$
is achievable if $\exists$ a family of $(\mathcal{E}_n, \mathcal{D}_n)$, s.t.

$$\lim_{n \to \infty} -\frac{1}{n}\log_2 \Pr[x_1^n \neq \hat{x}_1^n] = E_{si,b}(R) \tag{A.14}$$
The relevant results of [29, 39] are summarized in the following theorem.

Theorem 13 $E_{si,b}^{lower}(R) \leq E_{si,b}(R) \leq E_{si,b}^{upper}(R)$, where

$$E_{si,b}^{lower}(R) = \min_{q_{xy}}\big\{D(q_{xy} \| p_{xy}) + |R - H(q_{x|y})|^+\big\}$$
$$E_{si,b}^{upper}(R) = \min_{q_{xy} : H(q_{x|y}) \geq R} D(q_{xy} \| p_{xy})$$

These two error exponents can also be expressed in a parameterized way:

$$E_{si,b}^{lower}(R) = \max_{\rho \in [0,1]}\Big\{\rho R - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$E_{si,b}^{upper}(R) = \max_{\rho \in [0,\infty)}\Big\{\rho R - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
It should be clear that both of these bounds are positive only if $R > H(p_{x|y})$. As shown
in [29], the two bounds are the same in the low rate regime. Furthermore, with encoder side-
information, similar to the case in Theorem 9, the block coding error exponent is $E_{si,b}^{upper}(R)$.
Thus in the low rate regime, the error exponents are the same with or without encoder side-
information. Due to the duality between channel coding and source coding with decoder
side-information, in the high rate regime one can derive an expurgated bound, as done by
Gallager [41] for channel coding, to tighten the random coding error exponent. Interested
readers may read Ahlswede's classical paper [2] and some recent papers [22, 64].
Appendix B
Tilted Entropy Coding and its
delay constrained performance
In this appendix, we introduce another way to achieve the delay constrained source
coding error exponent of Theorem 1. This coding scheme is identical to our universal
coding scheme introduced in Section 2.4.1, except that we use a non-universal fixed-to-
variable-length prefix-free source code based on a tilted entropy code. The rest of the system
is identical to that in Figure 2.11. This derivation first appeared in our paper [20] and is
a minor variation of the scheme analyzed in [49]. Just as in the universal coding scheme
in Section 2.4.1 and [49], the queueing behavior is what determines the delay constrained
error exponent. Rather than modeling the buffer as a random walk and analyzing the
stationary distribution of the process as in [49], here we give an alternate derivation by
applying Cramér's theorem.
B.1 Tilted entropy coding
We replace the universal optimal source code with the following tilted entropy code,
first introduced by Jelinek [49]. This code is a Shannon code¹ built for a particular tilted
distribution of $p_x$.

¹Code length proportional to the logarithm of the inverse of probability.

Definition 14 The $\lambda$-tilted entropy code is an instantaneous code $\mathcal{C}_{N\lambda}$ for $\lambda > -1$. It is a
mapping from $\mathcal{X}^N$ to a variable number of binary bits:

$$\mathcal{C}_{N\lambda}(x_1^N) = b_1^{l(x_1^N)}$$

where $l(x_1^N)$ is the codeword length for source sequence $x_1^N$. The first bit is always 1 and
the rest of the codeword is the Shannon codeword based on the $\lambda$-tilted distribution of $p_x$:

$$l(x_1^N) = 1 + \Bigg\lceil -\log_2 \frac{p_x(x_1^N)^{\frac{1}{1+\lambda}}}{\sum_{s_1^N \in \mathcal{X}^N} p_x(s_1^N)^{\frac{1}{1+\lambda}}} \Bigg\rceil \leq 2 - \sum_{i=1}^{N}\log_2 \frac{p_x(x_i)^{\frac{1}{1+\lambda}}}{\sum_{s \in \mathcal{X}} p_x(s)^{\frac{1}{1+\lambda}}} \tag{B.1}$$
From the definition of $l(\vec{x})$, we have the following fact:

$$\log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\lambda l(\vec{x})}\Big) \leq 2\lambda + N\log_2\Big[\Big(\sum_x p_x(x)^{1-\frac{\lambda}{1+\lambda}}\Big)\Big(\sum_x p_x(x)^{\frac{1}{1+\lambda}}\Big)^{\lambda}\Big]$$
$$= 2\lambda + N(1+\lambda)\log_2\Big[\sum_x p_x(x)^{\frac{1}{1+\lambda}}\Big] = 2\lambda + NE_0(\lambda) \tag{B.2}$$
This definition is valid for any $\lambda > -1$. As will be seen later in the proof, for a rate
$R$ delay constrained source coding system, the optimal $\lambda$ is $\lambda = \rho^*$, where $\rho^*$ is defined in
(2.57):

$$\rho^* R = E_0(\rho^*)$$

From (B.1), the longest code length $l_N^{\lambda}$ is

$$l_N^{\lambda} \leq 2 - N\log_2 \frac{p_{x\min}^{\frac{1}{1+\lambda}}}{\sum_{s \in \mathcal{X}} p_x(s)^{\frac{1}{1+\lambda}}} \tag{B.3}$$

where $p_{x\min} = \min_{x \in \mathcal{X}} p_x(x)$. The constant 2 is insignificant compared to $N$.
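Definition 14 can be instantiated directly for small $N$. The following sketch (our own illustration; the source, block length, and tilt are arbitrary choices) computes the lengths in (B.1), checks them against the per-symbol bound, and verifies the Kraft inequality, which guarantees that an instantaneous code with these lengths exists.

```python
import math
from itertools import product

def tilted_lengths(p, N, lam):
    """Codeword lengths of the lambda-tilted entropy code of Definition 14:
    one leading 1 plus a Shannon codeword for the tilted distribution."""
    z1 = sum(px ** (1.0 / (1 + lam)) for px in p)
    Z = z1 ** N                    # normalizer over all length-N sequences
    lengths = {}
    for seq in product(range(len(p)), repeat=N):
        prob = math.prod(p[s] for s in seq)
        q = prob ** (1.0 / (1 + lam)) / Z      # tilted probability of seq
        lengths[seq] = 1 + math.ceil(-math.log2(q))
    return lengths

p, N, lam = [0.9, 0.1], 6, 0.5
L = tilted_lengths(p, N, lam)
kraft = sum(2.0 ** -l for l in L.values())     # <= 1: a prefix-free code exists
```

The extra leading bit only costs a factor of 2 in the Kraft sum, so prefix-freeness is never in danger.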
This variable-length code is turned into a fixed rate $R$ code just as in Section 2.4.1. The
delay constrained source coding scheme is illustrated in Figure 2.11. At time $kN$, $k = 1, 2, \ldots$,
the encoder $\mathcal{E}$ uses the variable-length code $\mathcal{C}_{N\lambda}$ to encode the $k$th source block
$\vec{x}_k = x_{(k-1)N+1}^{kN}$ into a binary sequence $b(k)_1, b(k)_2, \ldots, b(k)_{l(\vec{x}_k)}$. This codeword
is pushed into a FIFO queue with infinite buffer size. The encoder drains a bit from the
queue every $\frac{1}{R}$ seconds. If the queue is empty, the encoder simply sends 0's to the decoder
until new bits are pushed into the queue.
The decoder knows the variable-length codebook, and the prefix-free nature² of the
Shannon code guarantees that everything can be decoded correctly.
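The buffer process driven by this scheme can be visualized with a toy simulation (entirely our own illustration: the parameters, horizon, and seed are arbitrary, and draining $N \cdot R$ bits per block interval is a fluid approximation of one bit every $\frac{1}{R}$ seconds). With the mean codeword length below $N \cdot R$, the buffer occupancy is a random walk with negative drift reflected at 0, which is exactly the picture analyzed in the next section.

```python
import math
import random

random.seed(2007)
p = [0.9, 0.1]
N, R, lam = 20, 0.7, 0.5         # block length, link rate (bits/symbol), tilt
z1 = sum(px ** (1.0 / (1 + lam)) for px in p)

def block_codeword_length():
    # length (B.1) of the tilted code for one random length-N source block
    seq = random.choices(range(len(p)), weights=p, k=N)
    logq = sum(math.log2(p[s] ** (1.0 / (1 + lam)) / z1) for s in seq)
    return 1 + math.ceil(-logq)

# Omega_k: bits left in the FIFO buffer at block boundaries. Each block
# interval adds one codeword and drains N*R bits; the max(0, .) is the
# reflecting barrier (the encoder sends filler 0's when the queue is empty).
omega, trace = 0.0, []
for k in range(20000):
    omega = max(0.0, omega + block_codeword_length() - N * R)
    trace.append(omega)
```

A missed deadline in the next section corresponds to `omega` exceeding roughly $\Delta R$ bits at the time the block of interest enters the queue.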
B.2 Error Events
The number of bits $\Omega_k$ in the encoder buffer at time $kN$ is a random walk process
with negative drift and a reflecting barrier. At time $(k+d)N$, where $dN = \Delta$, the decoder
can make an error in estimating $\vec{x}_k$ if and only if part of the variable-length codeword for source
block $\vec{x}_k$ is still in the encoder buffer.

Since the FIFO queue drains deterministically, this means that when the $k$th block's
codeword entered the queue, it was already doomed to miss its deadline of $dN = \Delta$.
Formally, for an error to occur, the number of bits in the buffer must satisfy $\Omega_k \geq \lfloor dNR \rfloor = \lfloor \Delta R \rfloor$.
Thus, meeting a specific end-to-end latency constraint over a fixed-rate noiseless link is
like the buffer overflow events analyzed in [49]. Define the random time $t_kN$ to be the
last time before time $kN$ when the queue was empty. A missed deadline occurs only if
$\sum_{i=t_k+1}^{k} l(\vec{x}_i) > (d + k - t_k)NR = (kN - t_kN + \Delta)R$.
For arbitrary $1 \leq t \leq k-1$, define the error event probability

$$P_N^{k,d}(t) = \Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) > (d + k - t)NR\Big) = \Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) > (kN - tN + \Delta)R\Big).$$

Using Cramér's theorem [31], we derive an upper bound on $P_N^{k,d}(t)$ that is tight in the
large deviations sense:

$$P_N^{k,d}(t) = \Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) \geq (d+k-t)NR\Big) = \Pr\Big(\frac{1}{k-t}\sum_{i=t+1}^{k} l(\vec{x}_i) \geq \frac{(d+k-t)NR}{k-t}\Big)$$
$$\leq (k-t)|\mathcal{X}|^N 2^{-E_N(\mathcal{C}_{N\lambda}, R, k-t, d)}$$
where, using the fact in (B.2) and noticing that the $2\lambda$ term in (B.2) is insignificant, we have
the large deviation exponent:

$$E_N(\mathcal{C}_{N\lambda}, R, k-t, d) \geq (k-t)\sup_{\rho \geq 0}\Big\{\rho\frac{(d+k-t)NR}{k-t} - \log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\rho l(\vec{x})}\Big)\Big\}$$
$$\geq (k-t)\Big[\lambda\frac{(d+k-t)NR}{k-t} - \log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\lambda l(\vec{x})}\Big)\Big]$$
$$\geq (k-t)N\Big(\lambda\frac{(d+k-t)R}{k-t} - \frac{2\lambda}{N} - E_0(\lambda)\Big)$$
$$= dN\lambda R + (k-t)N\Big[\lambda R - E_0(\lambda) - \frac{2\lambda}{N}\Big] \tag{B.4}$$

²The initial 1 is not really required since the decoder knows the rate at which source symbols are arriving
at the encoder. Thus, it knows when the queue is empty and does not need to even interpret the 0's it
receives.
B.3 Achievability of delay constrained error exponent Es(R)
We only need to show that for any $\epsilon > 0$, by appropriate choice of $N$ and $\lambda$, it is possible
to achieve an error exponent with delay of $E_s(R) - \epsilon$, i.e. for all $j, \Delta$,

$$\Pr[x_j \neq \hat{x}_j(j + \Delta)] \leq K2^{-\Delta(E_s(R) - \epsilon)}$$

From (2.57), we know that $E_s(R) = E_0(\rho^*)$, where $\rho^* R = E_0(\rho^*)$.

Proof: For the tilted entropy source coding scheme $\mathcal{C}_{N\lambda}$, the decoding error probability for the
$k$th source block at time $(k+d)N$ is³ controlled by the $P_N^{k,d}(t)$, where $t_kN$ is the last time
before $kN$ when the buffer was empty:

$$\Pr[x_j \neq \hat{x}_j(j + \Delta)] \leq \sum_{t=0}^{k-1}\Pr\Big(t_k = t,\ \sum_{i=t+1}^{k} l(\vec{x}_i) \geq (d+k-t)NR\Big)$$
$$\leq \sum_{t=0}^{k-1}\Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) \geq (d+k-t)NR\Big)$$
$$\leq \sum_{t=0}^{k-1}(k-t+1)|\mathcal{X}|^N 2^{-E_N(\mathcal{C}_{N\lambda}, R, k-t, d)}$$
$$= 2^{-dN\lambda R}\sum_{t=0}^{k-1}(k-t+1)|\mathcal{X}|^N 2^{-(k-t)N[\lambda R - E_0(\lambda) - \frac{2\lambda}{N}]}$$

The above holds for all $N, \lambda$. Pick $\lambda = \rho^* - \frac{\epsilon}{R}$. From Figure 2.12, we know
that $\lambda R - E_0(\lambda) > 0$. Choose $N > \frac{2\lambda}{\lambda R - E_0(\lambda)}$. Then define

$$K(\epsilon, R, N) = \sum_{i=0}^{\infty}(i+1)|\mathcal{X}|^N 2^{-iN[\lambda R - E_0(\lambda) - \frac{2\lambda}{N}]} < \infty$$

which is guaranteed to be finite since decaying exponentials dominate polynomials in sums.
Thus:

$$\Pr[x_j \neq \hat{x}_j(j + \Delta)] \leq K(\epsilon, R, N)2^{-dN\lambda R} = K(\epsilon, R, N)2^{-\Delta\lambda R} = K(\epsilon, R, N)2^{-\Delta(\rho^* R - \epsilon)}$$

where the constant $K(\epsilon, R, N)$ does not depend on the delay in question. Since $E_0(\rho^*) = \rho^* R$
and the encoder is not targeted to the delay $\Delta$, this scheme achieves the desired delay
constrained exponent as promised.

³Here we denote $j$ by $kN$, and $\Delta$ by $dN$.
Appendix C
Proof of the concavity of the rate
distortion function under the peak
distortion measure
In this appendix, we prove Lemma 5 which states that the rate distortion function
R(p,D) is concave ∩ in p.
Proof: To show that $R(p, D)$ is concave $\cap$ in $p$, it is enough to show that for any two
distributions $p_0$ and $p_1$ and for any $\lambda \in [0,1]$,

$$R(p_\lambda, D) \geq \lambda R(p_0, D) + (1-\lambda)R(p_1, D)$$

where $p_\lambda = \lambda p_0 + (1-\lambda)p_1$. Define:

$$W^* = \arg\min_{W \in \mathcal{W}_D} I(p_\lambda, W)$$

From the definition of $R(p, D)$ we know that

$$R(p_\lambda, D) = I(p_\lambda, W^*) \geq \lambda I(p_0, W^*) + (1-\lambda)I(p_1, W^*) \tag{C.1}$$
$$\geq \lambda\min_{W \in \mathcal{W}_D} I(p_0, W) + (1-\lambda)\min_{W \in \mathcal{W}_D} I(p_1, W) = \lambda R(p_0, D) + (1-\lambda)R(p_1, D)$$

(C.1) is true because $I(p, W)$ is concave $\cap$ in $p$ for fixed $W$ and $p_\lambda = \lambda p_0 + (1-\lambda)p_1$. The
rest follows by definition. □
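The key step (C.1) rests on the concavity of mutual information in the input distribution for a fixed channel, which is easy to spot-check numerically. The sketch below (our own illustration; the channel and the two input distributions are arbitrary choices) evaluates both sides of the inequality.

```python
import math

def mutual_info(p, W):
    # I(p, W) in bits, with q = pW the induced output distribution
    ny = len(W[0])
    q = [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(ny)]
    return sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
               for x in range(len(p)) for y in range(ny) if W[x][y] > 0)

W = [[0.8, 0.2],
     [0.3, 0.7]]                      # a fixed test channel
p0, p1, lam = [0.9, 0.1], [0.2, 0.8], 0.4
plam = [lam * a + (1 - lam) * b for a, b in zip(p0, p1)]
lhs = mutual_info(plam, W)                                    # I(p_lam, W)
rhs = lam * mutual_info(p0, W) + (1 - lam) * mutual_info(p1, W)
```

Concavity holds for every channel and every pair of input distributions, so the check does not depend on the particular numbers chosen.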
Appendix D
Derivation of the upper bound on
the delay constrained lossy source
coding error exponent
In this appendix, we prove the converse of Theorem 2. The proof is similar to that of
the converse of the delay constrained lossless source coding error exponent in Theorem 1.
To bound the best possible error exponent with fixed delay, we consider a block coding
encoder/decoder pair that is constructed by the delay constrained encoder/decoder pair
and translate the block-coding error exponent for peak distortion in Lemma 6 to the delay
constrained error exponent. The arguments are analogous to the “focusing bound” deriva-
tion in [67] for channel coding with feedback and extremely similar to that of the lossless
source coding case in Theorem 1. We summarize the converse of Theorem 2 in the following
proposition.
Proposition 12 For fixed-rate encodings of discrete memoryless sources, it is not possible
to achieve a lossy source coding error exponent with fixed delay higher than

$$\inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R) \tag{D.1}$$

By the definition of the delay constrained lossy source coding error exponent in Definition 4,
the statement of this proposition is equivalent to the following statement:

For any $E > \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R)$, there exists a positive $\epsilon$, such that for any $K < \infty$,
there exist $i > 0$, $\Delta > 0$ with

$$\Pr[d(x_i, y_i(i + \Delta)) \geq D] > K2^{-\Delta(E - \epsilon)}$$
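For the lossless case of Theorem 1, the analogous infimum over $\alpha$ can be evaluated in closed form, which makes the structure of (D.1) easy to sanity-check numerically. The following sketch is entirely our own illustration: it uses the lossless block exponent (A.4) as a stand-in for $E_D^b$, and the source, rate, and grids are arbitrary choices. It compares the infimum over $\alpha$ with $E_0(\rho^*)$, where $\rho^* R = E_0(\rho^*)$ as in (2.57).

```python
import math

def E0(p, rho):
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def Esb(p, R, rho_max=20.0, steps=2000):
    # lossless block exponent (A.4) on a rho grid
    return max((rho_max * i / steps) * R - E0(p, rho_max * i / steps)
               for i in range(steps + 1))

p, R = [0.9, 0.1], 0.6          # source with H(p) about 0.469 < R < log2|X| = 1

# numerical right-hand side: inf over alpha of (1/alpha) Esb((1+alpha) R)
alphas = [0.02 * i for i in range(1, 251)]
E_focus = min(Esb(p, (1 + a) * R) / a for a in alphas)

# closed-form answer for the lossless case: E0(rho*) with rho* R = E0(rho*),
# found by bisection on the nontrivial root of E0(rho) - rho R
lo, hi = 0.01, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if E0(p, mid) - mid * R < 0 else (lo, mid)
E_delay = lo * R
```

The agreement reflects the focusing-bound geometry of [67]: each $\alpha$ corresponds to a candidate dominant error interval of relative length $\alpha$ before the deadline.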
Proof: We show the proposition by contradiction. Suppose that the delay-constrained
error exponent can be higher than $\inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R)$. Then according to Definition 4, there
exists a delay-constrained source coding system such that for some $E > \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R)$
and any positive real value $\epsilon$, there exists $K < \infty$, such that for all $i > 0$, $\Delta > 0$:

$$\Pr[d(x_i, y_i(i + \Delta)) > D] \leq K2^{-\Delta(E - \epsilon)}$$

So we choose $\epsilon > 0$ such that

$$E - \epsilon > \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R) \tag{D.2}$$
Then consider a block coding scheme $(\mathcal{E}, \mathcal{D})$ built on the delay constrained lossy
source coding system. The encoder of the block coding system is the same as the delay-
constrained lossy source encoder, and the block decoder $\mathcal{D}$ works as follows:

$$y_1^i = (y_1(1 + \Delta), y_2(2 + \Delta), \ldots, y_i(i + \Delta))$$

Now the block decoding distortion of this coding system can be upper bounded as
follows, for any $i > 0$ and $\Delta > 0$:

$$\Pr[d(x_1^i, y_1^i) > D] = \Pr\big[\max_{1 \leq t \leq i} d(x_t, y_t(t + \Delta)) > D\big] \leq \sum_{t=1}^{i}\Pr[d(x_t, y_t(t + \Delta)) > D]$$
$$\leq \sum_{t=1}^{i} K2^{-\Delta(E - \epsilon)} = iK2^{-\Delta(E - \epsilon)} \tag{D.3}$$

where the first equality uses the peak distortion measure and the first inequality is the union
bound.
The block coding scheme $(\mathcal{E}, \mathcal{D})$ is a block source coding system for $i$ source symbols
using $\lfloor R(i+\Delta) \rfloor$ bits, hence has rate $\frac{\lfloor R(i+\Delta) \rfloor}{i} \leq \frac{i+\Delta}{i}R$. From the block coding result
in Lemma 6, we know that the lossy source coding error exponent $E_D^b(R)$ is monotonically
increasing in $R$, so $E_D^b\big(\frac{\lfloor R(i+\Delta) \rfloor}{i}\big) \leq E_D^b\big(\frac{R(i+\Delta)}{i}\big)$. Again from Lemma 6, we know that the
block coding error probability can be bounded in the following way:

$$\Pr[d(x_1^i, y_1^i) > D] > 2^{-i(E_D^b(\frac{R(i+\Delta)}{i}) + \epsilon_i)} \tag{D.4}$$

where $\lim_{i \to \infty}\epsilon_i = 0$.
Combining (D.3) and (D.4), we have:

$$2^{-i(E_D^b(\frac{R(i+\Delta)}{i}) + \epsilon_i)} < iK2^{-\Delta(E - \epsilon)}$$

Now let $\alpha = \frac{\Delta}{i}$, $\alpha > 0$; the above inequality becomes
$2^{-i(E_D^b(R(1+\alpha)) + \epsilon_i)} < 2^{-i\alpha(E - \epsilon - \theta_i)}$, and hence:

$$E - \epsilon < \frac{1}{\alpha}\big(E_D^b(R(1+\alpha)) + \epsilon_i\big) + \theta_i \tag{D.5}$$

where $\theta_i = \frac{\log_2(iK)}{i\alpha}$, so $\lim_{i \to \infty}\theta_i = 0$. The above inequality is true for all $i$ and $\alpha > 0$, with
$\lim_{i \to \infty}\theta_i = 0$ and $\lim_{i \to \infty}\epsilon_i = 0$. Taking all these into account, we have:

$$E - \epsilon \leq \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b(R(1+\alpha)) \tag{D.6}$$

Now (D.6) contradicts the assumption in (D.2), thus the proposition is proved. ■
Appendix E
Bounding source atypicality under
a distortion measure
In this appendix, we prove Lemma 7 in Section 3.4.2.
Proof: We only need to show the case $r > R(p_x, D)$. By Cramér's theorem [31], for
all $\epsilon_1 > 0$, there exists $K_1$, such that

$$\Pr\Big[\sum_{i=1}^{n} l_D(\vec{x}_i) > nNr\Big] = \Pr\Big[\frac{1}{n}\sum_{i=1}^{n} l_D(\vec{x}_i) > Nr\Big] \leq K_1 2^{-n(\inf_{z > Nr} I(z) - \epsilon_1)}$$

where the rate function $I(z)$ is [31]:

$$I(z) = \sup_{\rho \geq 0}\Big\{\rho z - \log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\rho l_D(\vec{x})}\Big)\Big\} \tag{E.1}$$

It is clear that $I(z)$ is monotonically increasing and continuous in $z$. Thus

$$\inf_{z > Nr} I(z) = I(Nr) \tag{E.2}$$
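The Chernoff bound behind (E.1) holds for every $n$, not only asymptotically, so it can be checked exactly by enumeration for a small $n$. The sketch below (our own illustration; the toy length variable, $n$, threshold, and grids are arbitrary choices) compares the exact tail probability of an iid sum with $2^{-nI(z)}$.

```python
import math
from itertools import product

vals = [1.0, 2.0, 5.0]          # a toy "codeword length" random variable
probs = [0.7, 0.2, 0.1]
mean = sum(v * p for v, p in zip(vals, probs))   # = 1.6

def rate_fn(z, rho_max=20.0, steps=20000):
    # I(z) = sup_{rho >= 0} { rho z - log2 E[2^(rho X)] }, cf. (E.1)
    best = 0.0
    for i in range(steps + 1):
        rho = rho_max * i / steps
        mgf = sum(p * 2.0 ** (rho * v) for v, p in zip(vals, probs))
        best = max(best, rho * z - math.log2(mgf))
    return best

# for small n the exact tail probability can be enumerated and compared
# with the Chernoff bound 2^(-n I(z)), which holds for every n
n, z = 6, 2.5                   # threshold z above the mean
exact = sum(math.prod(probs[i] for i in seq)
            for seq in product(range(3), repeat=n)
            if sum(vals[i] for i in seq) >= n * z)
bound = 2.0 ** (-n * rate_fn(z))
```

The rate function vanishes at the mean and increases above it, which is the monotonicity used in (E.2).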
Using the upper bound on $l_D(\vec{x})$ in (3.6):

$$\log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\rho l_D(\vec{x})}\Big) \leq \log_2\Big(\sum_{q_x \in \mathcal{T}^N} 2^{-ND(q_x \| p_x)}2^{\rho N(\delta_N + R(q_x, D))}\Big)$$
$$\leq \log_2\Big(2^{N\epsilon_N}2^{-N\min_{q_x}\{D(q_x \| p_x) - \rho R(q_x, D) - \rho\delta_N\}}\Big)$$
$$= N\Big(-\min_{q_x}\{D(q_x \| p_x) - \rho R(q_x, D) - \rho\delta_N\} + \epsilon_N\Big)$$

where $\mathcal{T}^N$ is the set of types of $\mathcal{X}^N$, and $2^{N\epsilon_N}$ is the number of types in $\mathcal{X}^N$, with
$0 < \epsilon_N \leq \frac{|\mathcal{X}|\log_2(N+1)}{N}$, so $\epsilon_N$ goes to 0 as $N$ goes to infinity.
Substituting the above inequalities into (E.1):

$$I(Nr) \geq N\Big(\sup_{\rho \geq 0}\min_{q_x}\big\{\rho(r - R(q_x, D) - \delta_N) + D(q_x \| p_x)\big\} - \epsilon_N\Big) \tag{E.3}$$

Next we show that $I(Nr) \geq N(E_D^b(r) + \epsilon'_N)$ where $\epsilon'_N$ goes to 0 as $N$ goes to infinity. We
first show the existence of a saddle point of the function

$$f(q_x, \rho) = \rho(r - R(q_x, D) - \delta_N) + D(q_x \| p_x)$$
Obviously, for fixed qx , f(qx , ρ) is a linear function of ρ, thus concave ∩. Also for fixed
ρ ≥ 0, f(qx , ρ) is a convex ∪ function of qx , because both −R(qx , D) and D(qx‖px) are
convex ∪ in qx . Write
g(u) = minqx
supρ≥0
(f(qx , ρ) + ρu)
Showing that g(u) is finite around u = 0 establishes the existence of the saddle point as
shown in Exercise 5.25 [10].
minqx
supρ≥0
f(q, ρ) + ρu =(a) minqx
supρ≥0
ρ(r −R(qx , D)− δN + u) + D(qx‖px)
≤(b) minqx :R(qx ,D)≥r−δN+u
supρ≥0
ρ(r −R(qx , D)− δN + u) + D(qx‖px)
≤(c) minqx :R(qx ,D)≥r−δN+u
D(qx‖px)
<(d) ∞
(a) is by definition. (b) is true because R(px , D) < r < RD, thus for very small δN and u,
R(px , D) < r−δN +u < RD. Thus there exists a distribution qx , s.t. R(qx , D) ≥ r−δN +u.
(c) is because R(qx , D) ≥ r− δN +u and ρ ≥ 0. (d) is true because we might as well assume
that px(x) > 0 for all x ∈ X , and r − δN + u < RD. Thus we proved the existence of the
saddle point of f(q, ρ).
$$\sup_{\rho \ge 0} \min_{q_x} f(q_x, \rho) = \min_{q_x} \sup_{\rho \ge 0} f(q_x, \rho) \qquad (E.4)$$
Note that if $R(q_x, D) < r - \delta_N$, then $\rho$ can be chosen arbitrarily large to make $\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)$ arbitrarily large. Thus the $q_x$ minimizing $\sup_{\rho \ge 0}\{\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)\}$ satisfies $r - R(q_x, D) - \delta_N \le 0$. So
$$\min_{q_x}\sup_{\rho \ge 0}\big\{\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)\big\} \stackrel{(a)}{=} \min_{q_x: R(q_x,D) \ge r - \delta_N}\ \sup_{\rho \ge 0}\big\{\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)\big\}$$
$$\stackrel{(b)}{=} \min_{q_x: R(q_x,D) \ge r - \delta_N} D(q_x\|p_x) \stackrel{(c)}{=} E^b_D(r - \delta_N) \qquad (E.5)$$
(a) follows from the argument above. (b) is because $r - R(q_x, D) - \delta_N \le 0$ and $\rho \ge 0$, hence $\rho = 0$ maximizes $\rho(r - R(q_x, D) - \delta_N)$. (c) is by the definition in (3.5). By combining (E.3), (E.4) and (E.5), letting $N$ be sufficiently big so that $\delta_N$ is sufficiently small, and noticing that $E^b_D(r)$ is continuous in $r$, we get the desired bound in (3.7). $\blacksquare$
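The Chernoff-bound side of the Cramér argument above is easy to sanity-check numerically. The sketch below is an illustration, not part of the thesis: the cost distribution is an assumed toy example. It computes the rate function $I(z) = \sup_{\rho\ge 0}\{\rho z - \log_2 E[2^{\rho l}]\}$ on a grid of $\rho$ and verifies that $2^{-nI(z)}$ really upper-bounds the exact tail probability of an i.i.d. sum:

```python
import math

# A hypothetical i.i.d. "distortion cost" l with a small finite support
# (the values and probabilities below are illustrative assumptions).
support = [0.0, 1.0, 3.0]
probs   = [0.5, 0.3, 0.2]

def rate_function(z, rho_grid):
    """Grid approximation of I(z) = sup_{rho>=0} [rho*z - log2 E[2^{rho*l}]]."""
    best = 0.0  # rho = 0 always contributes 0
    for rho in rho_grid:
        mgf = sum(p * 2.0 ** (rho * l) for p, l in zip(probs, support))
        best = max(best, rho * z - math.log2(mgf))
    return best

def exact_tail(n, threshold):
    """Exact Pr[sum of n i.i.d. costs > threshold] by repeated convolution."""
    dist = {0.0: 1.0}
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for l, q in zip(support, probs):
                new[s + l] = new.get(s + l, 0.0) + p * q
        dist = new
    return sum(p for s, p in dist.items() if s > threshold)

n, z = 20, 1.5  # z above the mean E[l] = 0.9, so I(z) > 0
rho_grid = [0.05 * k for k in range(1, 101)]   # rho in (0, 5]
I_z = rate_function(z, rho_grid)
tail = exact_tail(n, n * z)
# Each grid point rho gives a valid Markov/Chernoff bound, so the best one does too:
assert tail <= 2.0 ** (-n * I_z)
```

Each grid point gives a valid Chernoff bound by Markov's inequality, so taking the best grid point preserves validity; refining the grid only tightens the approximation of $I(z)$.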
Appendix F
Bounding individual error events for distributed source coding
In this appendix, we prove Lemma 9 and Lemma 11. The proofs have some similarities to those for point-to-point lossless source coding in Propositions 3 and 4.
F.1 ML decoding: Proof of Lemma 9
In this section we give the proof of Lemma 9; this part has the same flavor as the proof of Proposition 4 in Section 2.3.3. The technical tool used here is the standard Chernoff bound argument, or "the ρ thing" in Gallager's book [41]. This technique was perfected by Gallager in the derivation of the error exponents for a series of problems, cf. the multiple-access channel in [37], the degraded broadcast channel in [40] and distributed source coding in [39].
The bound depends on whether $l \le k$ or $l > k$. Consider the case $l \le k$:
$$p_n(l,k) = \sum_{x_1^n, y_1^n} p_{xy}(x_1^n, y_1^n)\,\Pr\big[\exists\,(\tilde{x}_1^n, \tilde{y}_1^n) \in B_x(x_1^n)\times B_y(y_1^n)\cap F_n(l,k,x_1^n, y_1^n) \text{ s.t. } p_{xy}(\tilde{x}_1^n, \tilde{y}_1^n) \ge p_{xy}(x_1^n, y_1^n)\big]$$
$$\le \sum_{x_1^n, y_1^n}\min\Bigg[1, \sum_{\substack{(\tilde{x}_1^n, \tilde{y}_1^n)\in F_n(l,k,x_1^n,y_1^n):\\ p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n)\ge p_{xy}(x_1^n,y_1^n)}} \Pr[\tilde{x}_1^n\in B_x(x_1^n),\ \tilde{y}_1^n\in B_y(y_1^n)]\Bigg] p_{xy}(x_1^n, y_1^n) \qquad (F.1)$$
$$\le \sum_{x_l^n, y_l^n}\min\Bigg[1, \sum_{\substack{(\tilde{x}_l^n, \tilde{y}_l^n)\text{ s.t. } \tilde{y}_l^{k-1}=y_l^{k-1},\\ p_{xy}(\tilde{x}_l^n,\tilde{y}_l^n)\ge p_{xy}(x_l^n,y_l^n)}} 4\times 2^{-(n-l+1)R_x-(n-k+1)R_y}\Bigg] p_{xy}(x_l^n, y_l^n) \qquad (F.2)$$
$$\le 4\times\sum_{x_l^n, y_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n, \tilde{y}_k^n} 2^{-(n-l+1)R_x-(n-k+1)R_y}\,1\big[p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n) \ge p_{xy}(x_l^n, y_l^n)\big]\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 4\times\sum_{x_l^n, y_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n, \tilde{y}_k^n} 2^{-(n-l+1)R_x-(n-k+1)R_y}\min\bigg[1, \frac{p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)}{p_{xy}(x_l^n, y_l^n)}\bigg]\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 4\times\sum_{x_l^n, y_l^n}\Bigg[\sum_{\tilde{x}_l^n, \tilde{y}_k^n} 2^{-(n-l+1)R_x-(n-k+1)R_y}\bigg[\frac{p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)}{p_{xy}(x_l^n, y_l^n)}\bigg]^{\frac{1}{1+\rho}}\Bigg]^\rho p_{xy}(x_l^n, y_l^n) \qquad (F.3)$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\sum_{x_l^n, y_l^n} p_{xy}(x_l^n, y_l^n)^{\frac{1}{1+\rho}}\Bigg[\sum_{\tilde{x}_l^n, \tilde{y}_k^n}\big[p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)\big]^{\frac{1}{1+\rho}}\Bigg]^\rho$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\sum_{y_l^{k-1}}\Bigg[\sum_{x_l^{k-1}} p_{xy}(x_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Bigg]\Bigg[\sum_{\tilde{x}_l^{k-1}} p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Bigg]^\rho\Bigg[\sum_{\tilde{x}_k^n, \tilde{y}_k^n} p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)^{\frac{1}{1+\rho}}\Bigg]^\rho\sum_{x_k^n, y_k^n} p_{xy}(x_k^n, y_k^n)^{\frac{1}{1+\rho}}$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\Bigg[\sum_{y_l^{k-1}}\Big[\sum_{x_l^{k-1}} p_{xy}(x_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Bigg]\Bigg[\sum_{x_k^n, y_k^n} p_{xy}(x_k^n, y_k^n)^{\frac{1}{1+\rho}}\Bigg]^{1+\rho}$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\Bigg[\sum_y\Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Bigg]^{k-l}\Bigg[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigg]^{(1+\rho)(n-k+1)} \qquad (F.4)$$
$$= 4\times 2^{-(k-l)\big[\rho R_x - \log\big(\sum_y\big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big]^{1+\rho}\big)\big]}\; 2^{-(n-k+1)\big[\rho(R_x+R_y) - (1+\rho)\log\big(\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)\big]}$$
$$= 4\times 2^{-(k-l)E_{x|y}(R_x,\rho) - (n-k+1)E_{xy}(R_x,R_y,\rho)} \qquad (F.5)$$
$$= 4\times 2^{-(n-l+1)\big[\frac{k-l}{n-l+1}E_{x|y}(R_x,\rho) + \frac{n-k+1}{n-l+1}E_{xy}(R_x,R_y,\rho)\big]} \qquad (F.6)$$
$$\le 4\times 2^{-(n-l+1)\sup_{\rho\in[0,1]}\big[\frac{k-l}{n-l+1}E_{x|y}(R_x,\rho) + \frac{n-k+1}{n-l+1}E_{xy}(R_x,R_y,\rho)\big]} \qquad (F.7)$$
$$= 4\times 2^{-(n-l+1)E^{ML}_x\left(R_x,R_y,\frac{k-l}{n-l+1}\right)} = 4\times 2^{-(n-l+1)E_x\left(R_x,R_y,\frac{k-l}{n-l+1}\right)} \qquad (F.8)$$
In (F.1) we explicitly indicate the three conditions that a suffix pair $(\tilde{x}_l^n, \tilde{y}_l^n)$ must satisfy to result in a decoding error. In (F.2) we sum out over the common prefixes $(x_1^{l-1}, y_1^{l-1})$ and use the fact that the random binning is done independently at each encoder, so we can use the inequalities in (4.6) and (4.7). We get (F.3) by limiting $\rho$ to the interval $0 \le \rho \le 1$, as in (2.25). Getting (F.4) from (F.3) follows by a number of basic manipulations; in (F.4) we get the single-letter expression by again using the memorylessness of the sources. In (F.5) we use the definitions of $E_{x|y}$ and $E_{xy}$ from (4.9) in Theorem 3. Noting that the bound holds for all $\rho \in [0,1]$, optimizing over $\rho$ results in (F.7). Finally, using the definition in (4.8) and the remark following Theorem 5 that the maximum-likelihood and universal exponents are equal gives (F.8). The bound on $p_n(l,k)$ when $l > k$ is developed in the same fashion. $\blacksquare$
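The final optimization over $\rho$ in (F.7) is easy to carry out numerically. The sketch below is an illustration under assumed inputs (the 2x2 joint pmf, the rates and the weight $\gamma$ are not from the thesis): it evaluates $E_{x|y}(R_x,\rho)$ and $E_{xy}(R_x,R_y,\rho)$ as read off from (F.4)-(F.5) and maximizes their $\gamma$-weighted combination over a grid of $\rho \in [0,1]$:

```python
import math

# Illustrative joint source pmf on a 2x2 alphabet (an assumed example).
pxy = [[0.4, 0.1],
       [0.2, 0.3]]

def E_x_given_y(Rx, rho):
    # rho*Rx - log2 sum_y [ sum_x p(x,y)^{1/(1+rho)} ]^{1+rho}
    s = sum(sum(pxy[x][y] ** (1.0 / (1 + rho)) for x in range(2)) ** (1 + rho)
            for y in range(2))
    return rho * Rx - math.log2(s)

def E_xy(Rx, Ry, rho):
    # rho*(Rx+Ry) - (1+rho) log2 sum_{x,y} p(x,y)^{1/(1+rho)}
    s = sum(pxy[x][y] ** (1.0 / (1 + rho)) for x in range(2) for y in range(2))
    return rho * (Rx + Ry) - (1 + rho) * math.log2(s)

def EML(Rx, Ry, gamma):
    grid = [k / 200.0 for k in range(201)]      # rho in [0, 1]
    return max(gamma * E_x_given_y(Rx, rho) + (1 - gamma) * E_xy(Rx, Ry, rho)
               for rho in grid)

e = EML(Rx=0.9, Ry=1.2, gamma=0.5)
assert e > 0.0                        # rates exceed the entropy thresholds here
assert EML(1.1, 1.2, 0.5) >= e        # the exponent is nondecreasing in Rx
```

At $\rho = 0$ both exponents vanish, so the supremum is always nonnegative; it is strictly positive exactly when the weighted rate condition of Theorem 3 holds, which the chosen rates satisfy for this example pmf.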
F.2 Universal decoding: Proof of Lemma 11
In this section, we prove Lemma 11. We use the method of types developed by Imre Csiszár; this method is elegantly explained in [29] and [28]. The proof here is similar to that for the point-to-point case in the proof of Proposition 4. However, we need the concept of V-shells defined in [29]. Essentially, a V-shell is the conditional type set of a $y$ sequence given an $x$ sequence, where $x$ and $y$ are from the alphabets $\mathcal{X}$ and $\mathcal{Y}$. The technique of V-shells was originally used in the proofs of channel coding theorems in [29]. Due to the duality between channel coding and source coding with decoder side-information, it is no surprise that it proves a very powerful tool in distributed source coding problems. That being said, here is the proof.
The error probability $p_n(l,k)$ can be bounded starting from (F.2), with the condition
$$(k-l)H(\tilde{x}_l^{k-1}|y_l^{k-1}) + (n-k+1)H(\tilde{x}_k^n, \tilde{y}_k^n) \le (k-l)H(x_l^{k-1}|y_l^{k-1}) + (n-k+1)H(x_k^n, y_k^n)$$
substituted for $p_{xy}(\tilde{x}_l^n, \tilde{y}_l^n) \ge p_{xy}(x_l^n, y_l^n)$. We get
$$p_n(l,k) \le \sum_{P^{n-k}, P^{k-l}}\ \sum_{V^{n-k}, V^{k-l}}\ \sum_{\substack{y_l^{k-1}\in T_{P^{k-l}},\\ y_k^n\in T_{P^{n-k}}}}\ \sum_{\substack{x_l^{k-1}\in T_{V^{k-l}}(y_l^{k-1}),\\ x_k^n\in T_{V^{n-k}}(y_k^n)}} \min\Bigg[1, \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}^{n-k}, P^{k-l}, \tilde{V}^{n-k}, \tilde{V}^{k-l})\, \le\, S(P^{n-k}, P^{k-l}, V^{n-k}, V^{k-l})}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}}\ \sum_{\tilde{x}_l^{k-1}\in T_{\tilde{V}^{k-l}}(y_l^{k-1})}\ \sum_{\tilde{x}_k^n\in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)} 4\times 2^{-(n-l+1)R_x-(n-k+1)R_y}\Bigg] p_{xy}(x_1^n, y_1^n) \qquad (F.9)$$
In (F.9) we enumerate all the source sequences in a way that allows us to focus on the types of the important subsequences. We enumerate the possibly misleading candidate sequences in terms of their suffix types. We restrict the sum to those pairs $(\tilde{x}_1^n, \tilde{y}_1^n)$ that could lead to mistaken decoding, defining the compact notation $S(P^{n-k}, P^{k-l}, V^{n-k}, V^{k-l}) \triangleq (k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})$, which is the weighted suffix entropy condition rewritten in terms of types.

Note that the summations within the minimization in (F.9) do not depend on the arguments within these sums. Thus, we can bound this sum separately to get a bound on the number of possibly misleading source pairs $(\tilde{x}_1^n, \tilde{y}_1^n)$.
$$\sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}^{n-k}, P^{k-l}, \tilde{V}^{n-k}, \tilde{V}^{k-l})\, \le\, S(P^{n-k}, P^{k-l}, V^{n-k}, V^{k-l})}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}}\ \sum_{\tilde{x}_l^{k-1}\in T_{\tilde{V}^{k-l}}(y_l^{k-1})}\ \sum_{\tilde{x}_k^n\in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)} 1$$
$$\le \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}, \tilde{V})\, \le\, S(P, V)}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}} |T_{\tilde{V}^{k-l}}(y_l^{k-1})|\,|T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)| \qquad (F.10)$$
$$\le \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}, \tilde{V})\, \le\, S(P, V)}} |T_{\tilde{P}^{n-k}}|\, 2^{(k-l)H(\tilde{V}^{k-l}|P^{k-l})}\, 2^{(n-k+1)H(\tilde{V}^{n-k}|\tilde{P}^{n-k})} \qquad (F.11)$$
$$\le \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}, \tilde{V})\, \le\, S(P, V)}} 2^{(k-l)H(\tilde{V}^{k-l}|P^{k-l}) + (n-k+1)H(\tilde{P}^{n-k}\times\tilde{V}^{n-k})} \qquad (F.12)$$
$$\le \sum_{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}} 2^{(k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})} \qquad (F.13)$$
$$\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\, 2^{(k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})} \qquad (F.14)$$
In (F.10) we sum over all $\tilde{x}_l^{k-1} \in T_{\tilde{V}^{k-l}}(y_l^{k-1})$ and all $\tilde{x}_k^n \in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)$. In (F.11) we use standard bounds, e.g., $|T_{\tilde{V}^{k-l}}(y_l^{k-1})| \le 2^{(k-l)H(\tilde{V}^{k-l}|P^{k-l})}$ for $y_l^{k-1} \in T_{P^{k-l}}$; we also sum over all $\tilde{y}_k^n \in T_{\tilde{P}^{n-k}}$ in (F.11). By the definition of the decoding rule, $(\tilde{x}_1^n, \tilde{y}_1^n)$ can only lead to a decoding error if $(k-l)H(\tilde{V}^{k-l}|P^{k-l}) + (n-k+1)H(\tilde{P}^{n-k}\times\tilde{V}^{n-k}) \le (k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})$; this gives (F.13). In (F.14) we apply the polynomial bound on the number of types.
We substitute (F.14) into (F.9) and pull out the polynomial term, giving
$$p_n(l,k) \le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}}\ \sum_{\substack{y_l^{k-1}\in T_{P^{k-l}},\\ y_k^n\in T_{P^{n-k}}}}\ \sum_{\substack{x_l^{k-1}\in T_{V^{k-l}}(y_l^{k-1}),\\ x_k^n\in T_{V^{n-k}}(y_k^n)}} \min\Big[1,\ 4\times 2^{-(k-l)[R_x - H(V^{k-l}|P^{k-l})] - (n-k+1)[R_x+R_y-H(V^{n-k}\times P^{n-k})]}\Big] p_{xy}(x_l^n, y_l^n)$$
$$\le 4\,(n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}} 2^{\max\big[0,\ -(k-l)[R_x-H(V^{k-l}|P^{k-l})] - (n-k+1)[R_x+R_y-H(V^{n-k}\times P^{n-k})]\big]}\times 2^{-(k-l)D(V^{k-l}\times P^{k-l}\|p_{xy}) - (n-k+1)D(V^{n-k}\times P^{n-k}\|p_{xy})} \qquad (F.15)$$
$$\le 4\,(n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}} 2^{-(n-l+1)\big[\lambda D(V^{k-l}\times P^{k-l}\|p_{xy}) + \bar\lambda D(V^{n-k}\times P^{n-k}\|p_{xy}) + \big|\lambda[R_x-H(V^{k-l}|P^{k-l})] + \bar\lambda[R_x+R_y-H(V^{n-k}\times P^{n-k})]\big|^+\big]} \qquad (F.16)$$
$$\le 4\,(n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}} 2^{-(n-l+1)\inf_{\tilde{x},\tilde{y},\bar{x},\bar{y}}\big[\lambda D(p_{\tilde{x},\tilde{y}}\|p_{xy}) + \bar\lambda D(p_{\bar{x},\bar{y}}\|p_{xy}) + \big|\lambda[R_x-H(\tilde{x}|\tilde{y})] + \bar\lambda[R_x+R_y-H(\bar{x},\bar{y})]\big|^+\big]} \qquad (F.17)$$
$$\le 4\,(n-l+2)^{4|\mathcal{X}||\mathcal{Y}|}\, 2^{-(n-l+1)E_x(R_x,R_y,\lambda)} \le K_1 2^{-(n-l+1)[E_x(R_x,R_y,\lambda)-\epsilon]} \qquad (F.18)$$
In (F.15) we use the memorylessness of the source and exponential bounds on the probability of observing $(x_l^{k-1}, y_l^{k-1})$ and $(x_k^n, y_k^n)$. In (F.16) we pull $(n-l+1)$ out of all terms, noticing that $\lambda = (k-l)/(n-l+1) \in [0,1]$ and $\bar\lambda \triangleq 1-\lambda = (n-k+1)/(n-l+1)$. In (F.17) we minimize the exponent over all choices of distributions $p_{\tilde{x},\tilde{y}}$ and $p_{\bar{x},\bar{y}}$. In (F.18) we use the definition of the universal random coding exponent
$$E_x(R_x,R_y,\lambda) \triangleq \inf_{\tilde{x},\tilde{y},\bar{x},\bar{y}}\ \lambda D(p_{\tilde{x},\tilde{y}}\|p_{xy}) + \bar\lambda D(p_{\bar{x},\bar{y}}\|p_{xy}) + \big|\lambda[R_x-H(\tilde{x}|\tilde{y})] + \bar\lambda[R_x+R_y-H(\bar{x},\bar{y})]\big|^+$$
where $0\le\lambda\le 1$ and $\bar\lambda = 1-\lambda$. We also incorporate the number of conditional and marginal types into the polynomial bound, as well as the sum over $k$, and then push the polynomial into the exponent, since for any polynomial $F$ and any $E, \epsilon > 0$ there exists $C \in (0,\infty)$ s.t. $F(\Delta)2^{-\Delta E} \le C 2^{-\Delta(E-\epsilon)}$. $\blacksquare$
Appendix G
Equivalence of ML and universal error exponents and tilted distributions
In this appendix, we give a detailed proof of Lemma 8. This section also contains some very useful analysis of the source coding error exponents and their geometry, and of tilted distributions and their properties. The key mathematical tool is convex optimization. The fundamental lemmas in Section G.3, which cannot easily be found in the literature, are used throughout this thesis. For notational simplicity, we change the logarithm from base 2 in the main body of the thesis to base e in this section.
Our goal in Theorem 5 is to show that the maximum likelihood (ML) error exponent equals the universal error exponent. It is sufficient to show that for all $\gamma$,
$$E^{ML}_x(R_x, R_y, \gamma) = E^{UN}_x(R_x, R_y, \gamma)$$
where the ML error exponent is
$$E^{ML}_x(R_x, R_y, \gamma) = \sup_{\rho\in[0,1]}\ \gamma E_{x|y}(R_x, \rho) + (1-\gamma)E_{xy}(R_x, R_y, \rho)$$
$$= \sup_{\rho\in[0,1]}\ \rho R(\gamma) - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) - (1-\gamma)(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big) = \sup_{\rho\in[0,1]} E^{ML}_x(R_x, R_y, \gamma, \rho)$$
We write the function inside the sup as $E^{ML}_x(R_x, R_y, \gamma, \rho)$. The universal error exponent is
$$E^{UN}_x(R_x, R_y, \gamma) = \inf_{q_{xy}, o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \big|\gamma(R_x - H(q_{x|y})) + (1-\gamma)(R_x + R_y - H(o_{xy}))\big|^+$$
$$= \inf_{q_{xy}, o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \big|R(\gamma) - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})\big|^+$$
Here we define $R(\gamma) = \gamma R_x + (1-\gamma)(R_x + R_y) > \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$. For notational simplicity, we write $q_{xy}$ and $o_{xy}$ for two arbitrary joint distributions on $\mathcal{X}\times\mathcal{Y}$ instead of $p_{\tilde{x}\tilde{y}}$ and $p_{\bar{x}\bar{y}}$. We still write $p_{xy}$ for the distribution of the source.
Before the proof, we define a pair of distributions that we need.

Definition 15 (Tilted distribution $p^\rho_{xy}$ of $p_{xy}$) For all $\rho \in [-1, \infty)$:
$$p^\rho_{xy}(x,y) = \frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}$$
The entropy of the tilted distribution is written as $H(p^\rho_{xy})$. Obviously $p^0_{xy} = p_{xy}$.

Definition 16 ($x-y$ tilted distribution $\tilde{p}^\rho_{xy}$ of $p_{xy}$) For all $\rho \in [-1, \infty)$:
$$\tilde{p}^\rho_{xy}(x,y) = \frac{\big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\big]^{1+\rho}}{\sum_t\big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big]^{1+\rho}}\times\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}} = \frac{A(y,\rho)}{B(\rho)}\times\frac{C(x,y,\rho)}{D(y,\rho)}$$
where
$$A(y,\rho) = \Big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho} = D(y,\rho)^{1+\rho}$$
$$B(\rho) = \sum_t\Big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big]^{1+\rho} = \sum_y A(y,\rho)$$
$$C(x,y,\rho) = p_{xy}(x,y)^{\frac{1}{1+\rho}}$$
$$D(y,\rho) = \sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}} = \sum_x C(x,y,\rho)$$
The marginal distribution of $y$ is $\frac{A(y,\rho)}{B(\rho)}$. Obviously $\tilde{p}^0_{xy} = p_{xy}$. We write the conditional distribution of $x$ given $y$ under $\tilde{p}^\rho_{xy}$ as $\tilde{p}^\rho_{x|y}$, where $\tilde{p}^\rho_{x|y}(x,y) = \frac{C(x,y,\rho)}{D(y,\rho)}$, and the conditional entropy of $x$ given $y$ under $\tilde{p}^\rho_{xy}$ as $H(\tilde{p}^\rho_{x|y})$. Obviously $H(\tilde{p}^0_{x|y}) = H(p_{x|y})$. The conditional entropy of $x$ given $y = y$ for the $x-y$ tilted distribution is
$$H(\tilde{p}^\rho_{x|y=y}) = -\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)$$
We introduce $A(y,\rho)$, $B(\rho)$, $C(x,y,\rho)$ and $D(y,\rho)$ to simplify the notation. Some of their properties are shown in Lemma 25.
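Both tilted families are straightforward to compute. The sketch below is a numerical illustration, not part of the thesis: the 2x2 joint pmf is an assumed example. It builds $p^\rho_{xy}$ from Definition 15 and the $x-y$ tilted distribution from Definition 16 via $A, B, C, D$, and checks that $\rho = 0$ recovers $p_{xy}$ and that the tilted entropy grows with $\rho$ (cf. Lemma 22):

```python
import math

# A small assumed joint pmf p_xy on a 2x2 alphabet for illustration.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
X, Y = (0, 1), (0, 1)

def tilted(rho):
    """Standard tilted distribution p^rho_xy (Definition 15)."""
    w = {xy: p ** (1.0 / (1 + rho)) for xy, p in pxy.items()}
    z = sum(w.values())
    return {xy: v / z for xy, v in w.items()}

def xy_tilted(rho):
    """x-y tilted distribution (Definition 16), built from A, B, C, D."""
    C = {(x, y): pxy[(x, y)] ** (1.0 / (1 + rho)) for x in X for y in Y}
    D = {y: sum(C[(x, y)] for x in X) for y in Y}
    A = {y: D[y] ** (1 + rho) for y in Y}
    B = sum(A.values())
    return {(x, y): (A[y] / B) * (C[(x, y)] / D[y]) for x in X for y in Y}

def H(p):
    """Entropy in nats, matching this appendix's base-e convention."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# rho = 0 recovers p_xy for both tilts
for xy in pxy:
    assert abs(tilted(0.0)[xy] - pxy[xy]) < 1e-12
    assert abs(xy_tilted(0.0)[xy] - pxy[xy]) < 1e-12
# Both are genuine distributions, and H(p^rho) grows with rho
assert abs(sum(xy_tilted(0.7).values()) - 1.0) < 1e-12
hs = [H(tilted(r)) for r in (0.0, 0.5, 1.0, 2.0)]
assert all(a <= b + 1e-12 for a, b in zip(hs, hs[1:]))
```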
While tilted distributions commonly arise as optimal distributions in large deviations theory, it is useful to explain why we introduce these two particular tilted distributions. In the proof of Lemma 8, through a Lagrange multiplier argument, we will show that $\{p^\rho_{xy} : \rho \in [-1,\infty)\}$ is the family of distributions that minimize the Kullback-Leibler divergence to $p_{xy}$ subject to a fixed entropy, and that $\{\tilde{p}^\rho_{xy} : \rho \in [-1,\infty)\}$ is the family of distributions that minimize the Kullback-Leibler divergence to $p_{xy}$ subject to a fixed conditional entropy. Using this Lagrange multiplier argument, we parametrize the universal error exponent $E^{UN}_x(R_x, R_y, \gamma)$ in terms of $\rho$ and show the equivalence of the universal and maximum likelihood error exponents.
Now we are ready to prove Lemma 8: $E^{ML}_x(R_x, R_y, \gamma) = E^{UN}_x(R_x, R_y, \gamma)$.
Proof:
G.1 Case 1: $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) < R(\gamma) < \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$

First, from Lemma 31 and Lemma 32:
$$\frac{\partial E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho} = R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})$$
Then, using Lemma 22 and Lemma 26, we have:
$$\frac{\partial^2 E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho^2} \le 0$$
So $\rho$ maximizes $E^{ML}_x(R_x,R_y,\gamma,\rho)$ if and only if:
$$0 = \frac{\partial E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho} = R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}) \qquad (G.1)$$
Because $R(\gamma)$ lies in the interval $[\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}),\ \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})]$ and the entropy functions are monotonically increasing in $\rho$, we can find $\rho^* \in (0,1)$ s.t.
$$\gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) = R(\gamma)$$
Using Lemma 29 and Lemma 30 we get:
$$E^{ML}_x(R_x,R_y,\gamma) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \qquad (G.2)$$
where $\gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) = R(\gamma)$; $\rho^*$ is generally unique because both $H(\tilde{p}^\rho_{x|y})$ and $H(p^\rho_{xy})$ are strictly increasing in $\rho$.
Secondly,
$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{q_{xy},o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma) - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})|^+$$
$$= \inf_b\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \inf_{b \ge \gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+ \qquad (G.3)$$
The last equality is true because, for $b < \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) < R(\gamma)$:
$$\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+ \ \ge\ 0 + R(\gamma) - b \ >\ R(\gamma) - \gamma H(p_{x|y}) - (1-\gamma)H(p_{xy})$$
while the value $R(\gamma) - \gamma H(p_{x|y}) - (1-\gamma)H(p_{xy})$ is attained with zero divergence terms at $b = \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$ by choosing $q_{xy} = o_{xy} = p_{xy}$. Hence no $b$ below $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$ can achieve the outer infimum.
Fixing $b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$, the inner infimum in (G.3) is an optimization problem over $q_{xy}, o_{xy}$ with equality constraints $\sum_x\sum_y q_{xy}(x,y) = 1$, $\sum_x\sum_y o_{xy}(x,y) = 1$ and $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) = b$, and the obvious inequality constraints $0 \le q_{xy}(x,y) \le 1$, $0 \le o_{xy}(x,y) \le 1$ for all $x,y$. In the following formulation, we relax one equality constraint to an inequality constraint $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) \ge b$ to make the optimization problem convex. It turns out later that the optimal solution to the relaxed problem is also the optimal solution to the original problem because $b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$. The resulting optimization problem is:
$$\inf_{q_{xy}, o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy})$$
$$\text{s.t.}\quad \sum_x\sum_y q_{xy}(x,y) = 1, \qquad \sum_x\sum_y o_{xy}(x,y) = 1,$$
$$b - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy}) \le 0,$$
$$0 \le q_{xy}(x,y) \le 1\ \ \forall (x,y)\in\mathcal{X}\times\mathcal{Y}, \qquad 0 \le o_{xy}(x,y) \le 1\ \ \forall (x,y)\in\mathcal{X}\times\mathcal{Y} \qquad (G.4)$$
The above optimization problem is convex because the objective function and the inequality constraint functions are convex and the equality constraint functions are affine [10]. The Lagrangian for this convex optimization problem is:
$$L(q_{xy}, o_{xy}, \rho, \mu_1, \mu_2, \nu_1, \nu_2, \nu_3, \nu_4) = \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \mu_1\Big(\sum_x\sum_y q_{xy}(x,y) - 1\Big) + \mu_2\Big(\sum_x\sum_y o_{xy}(x,y) - 1\Big)$$
$$+\ \rho\big(b - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})\big) + \sum_x\sum_y\big[\nu_1(x,y)(-q_{xy}(x,y)) + \nu_2(x,y)(1 - q_{xy}(x,y)) + \nu_3(x,y)(-o_{xy}(x,y)) + \nu_4(x,y)(1 - o_{xy}(x,y))\big]$$
where $\rho, \mu_1, \mu_2$ are real numbers and $\nu_i \in \mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$, $i = 1,2,3,4$.

According to the KKT conditions for convex optimization [10], $(q_{xy}, o_{xy})$ minimizes the convex optimization problem in (G.4) if and only if the following conditions are simultaneously satisfied for some $\mu_1, \mu_2, \nu_1, \nu_2, \nu_3, \nu_4$ and $\rho$:
$$0 = \frac{\partial L(\cdot)}{\partial q_{xy}(x,y)} = \gamma\Big[-\log(p_{xy}(x,y)) + (1+\rho)\big(1 + \log(q_{xy}(x,y))\big) - \rho\log\Big(\sum_s q_{xy}(s,y)\Big)\Big] + \mu_1 - \nu_1(x,y) - \nu_2(x,y) \qquad (G.5)$$
$$0 = \frac{\partial L(\cdot)}{\partial o_{xy}(x,y)} = (1-\gamma)\Big[-\log(p_{xy}(x,y)) + (1+\rho)\big(1 + \log(o_{xy}(x,y))\big)\Big] + \mu_2 - \nu_3(x,y) - \nu_4(x,y) \qquad (G.6)$$
for all $x, y$, and
$$\sum_x\sum_y q_{xy}(x,y) = 1, \qquad \sum_x\sum_y o_{xy}(x,y) = 1,$$
$$\rho\big(\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) - b\big) = 0, \qquad \rho \ge 0,$$
$$\nu_1(x,y)(-q_{xy}(x,y)) = 0,\quad \nu_2(x,y)(1-q_{xy}(x,y)) = 0 \quad \forall x,y,$$
$$\nu_3(x,y)(-o_{xy}(x,y)) = 0,\quad \nu_4(x,y)(1-o_{xy}(x,y)) = 0 \quad \forall x,y,$$
$$\nu_i(x,y) \ge 0, \quad \forall x,y,\ i = 1,2,3,4 \qquad (G.7)$$
Solving the standard Lagrange multiplier equations (G.5), (G.6) and (G.7), we have:
$$q_{xy}(x,y) = \frac{\big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho_b}}\big]^{1+\rho_b}}{\sum_t\big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho_b}}\big]^{1+\rho_b}}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho_b}}}{\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho_b}}} = \tilde{p}^{\rho_b}_{xy}(x,y)$$
$$o_{xy}(x,y) = \frac{p_{xy}(x,y)^{\frac{1}{1+\rho_b}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho_b}}} = p^{\rho_b}_{xy}(x,y)$$
$$\nu_i(x,y) = 0 \quad \forall x,y,\ i = 1,2,3,4, \qquad \rho = \rho_b \qquad (G.8)$$
where $\rho_b$ satisfies
$$\gamma H(\tilde{p}^{\rho_b}_{x|y}) + (1-\gamma)H(p^{\rho_b}_{xy}) = b \ \ge\ \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$$
and thus $\rho_b \ge 0$, because both $H(\tilde{p}^\rho_{x|y})$ and $H(p^\rho_{xy})$ are monotonically increasing in $\rho$, as shown in Lemma 22 and Lemma 26.

Notice that all the KKT conditions are simultaneously satisfied, with the inequality constraint $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) \ge b$ met with equality. Thus the relaxed optimization problem has the same optimal solution as the original problem, as promised. The optimal $q_{xy}$ and $o_{xy}$ are the $x-y$ tilted distribution $\tilde{p}^{\rho_b}_{xy}$ and the standard tilted distribution $p^{\rho_b}_{xy}$ of $p_{xy}$, with the same parameter $\rho_b \ge 0$ chosen s.t.
$$\gamma H(\tilde{p}^{\rho_b}_{x|y}) + (1-\gamma)H(p^{\rho_b}_{xy}) = b$$
Now we have:
$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \gamma D(\tilde{p}^{\rho_b}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho_b}_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \min\Big[\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}),$$
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\Big] \qquad (G.9)$$
Notice that $H(p^\rho_{xy})$, $H(\tilde{p}^\rho_{x|y})$, $D(p^\rho_{xy}\|p_{xy})$ and $D(\tilde{p}^\rho_{xy}\|p_{xy})$ are all strictly increasing in $\rho > 0$, as shown in Lemma 22, Lemma 23, Lemma 26 and Lemma 27 later in this appendix. We have:
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \qquad (G.10)$$
where $R(\gamma) = \gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy})$. Applying the results in Lemma 28 and Lemma 24, we get:
$$\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})$$
$$= \Big[\gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\Big]\Big|_{\rho=\rho^*} = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \qquad (G.11)$$
This is true because for $\rho$ with $R(\gamma) \ge \gamma H(\tilde{p}^\rho_{x|y}) + (1-\gamma)H(p^\rho_{xy})$, we know $\rho \le 1$ because of the range of $R(\gamma)$: $R(\gamma) < \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$. Substituting (G.10) and (G.11) into (G.9), we get
$$E^{UN}_x(R_x,R_y,\gamma) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}), \quad\text{where } R(\gamma) = \gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) \qquad (G.12)$$
So for $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) \le R(\gamma) \le \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$, from (G.2) we have the desired property:
$$E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma)$$
G.2 Case 2: $R(\gamma) \ge \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$

In this case, for all $0 \le \rho \le 1$:
$$\frac{\partial E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho} = R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}) \ \ge\ R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy}) \ \ge\ 0$$
So $\rho$ takes the value $1$ to maximize $E^{ML}_x(R_x,R_y,\gamma,\rho)$, thus
$$E^{ML}_x(R_x,R_y,\gamma) = R(\gamma) - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big)^2\Big) - 2(1-\gamma)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big) \qquad (G.13)$$
Using the same convex optimization technique as in case G.1, we note that $\rho^* \ge 1$ for $R(\gamma) = \gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy})$. Then, applying Lemma 28 and Lemma 24, we have:
$$\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})$$
$$= \gamma D(\tilde{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy})$$
and
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})$$
$$= \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^{\rho^*}_{x|y}) - (1-\gamma)H(p^{\rho^*}_{xy})$$
$$\ge \gamma D(\tilde{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy})$$
where the last inequality holds because $\rho^* \ge 1$ and, by Lemmas 24 and 28, the expression is nondecreasing in $\rho$ for $\rho \ge 1$. Finally:
$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \gamma D(\tilde{p}^{\rho_b}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho_b}_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \min\Big[\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}),$$
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\Big]$$
$$= \gamma D(\tilde{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy})$$
$$= R(\gamma) - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big)^2\Big) - 2(1-\gamma)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big) \qquad (G.14)$$
The last equality follows by setting $\rho = 1$ in Lemma 29 and Lemma 30. Again $E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma)$, and this finishes the proof. $\blacksquare$
G.3 Technical lemmas on tilted distributions

We now prove the technical lemmas used in the above proof of Lemma 8.

Lemma 22 $\dfrac{\partial H(p^\rho_{xy})}{\partial\rho} \ge 0$
Proof: From the definition of the tilted distribution we have the following observation:
$$\log(p^\rho_{xy}(x_1,y_1)) - \log(p^\rho_{xy}(x_2,y_2)) = \log\big(p_{xy}(x_1,y_1)^{\frac{1}{1+\rho}}\big) - \log\big(p_{xy}(x_2,y_2)^{\frac{1}{1+\rho}}\big)$$
Using this equality, we first derive the derivative of the tilted distribution. For all $x,y$:
$$\frac{\partial p^\rho_{xy}(x,y)}{\partial\rho} = \frac{-1}{(1+\rho)^2}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}\log(p_{xy}(x,y))\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)}{\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)^2} + \frac{1}{(1+\rho)^2}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\log(p_{xy}(s,t))\big)}{\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)^2}$$
$$= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\Big[\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big) - \sum_t\sum_s p^\rho_{xy}(s,t)\log\big(p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)\Big]$$
$$= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\Big[\log(p^\rho_{xy}(x,y)) - \sum_t\sum_s p^\rho_{xy}(s,t)\log(p^\rho_{xy}(s,t))\Big] = -\frac{p^\rho_{xy}(x,y)}{1+\rho}\big[\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big] \qquad (G.15)$$
Then:
$$\frac{\partial H(p^\rho_{xy})}{\partial\rho} = -\frac{\partial\sum_{x,y} p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))}{\partial\rho} = -\sum_{x,y}\big(1+\log(p^\rho_{xy}(x,y))\big)\frac{\partial p^\rho_{xy}(x,y)}{\partial\rho}$$
$$= \sum_{x,y}\big(1+\log(p^\rho_{xy}(x,y))\big)\frac{p^\rho_{xy}(x,y)}{1+\rho}\big(\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big)$$
$$= \frac{1}{1+\rho}\sum_{x,y} p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))\big(\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big)$$
$$= \frac{1}{1+\rho}\Big[\sum_{x,y}p^\rho_{xy}(x,y)\big(\log(p^\rho_{xy}(x,y))\big)^2 - H(p^\rho_{xy})^2\Big]$$
$$= \frac{1}{1+\rho}\Big[\sum_{x,y}p^\rho_{xy}(x,y)\big(\log(p^\rho_{xy}(x,y))\big)^2\sum_{x,y}p^\rho_{xy}(x,y) - H(p^\rho_{xy})^2\Big]$$
$$\stackrel{(a)}{\ge} \frac{1}{1+\rho}\Big[\Big(\sum_{x,y}p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))\Big)^2 - H(p^\rho_{xy})^2\Big] = 0 \qquad (G.16)$$
where (a) is true by the Cauchy-Schwarz inequality. $\blacksquare$
Lemma 23 $\dfrac{\partial D(p^\rho_{xy}\|p_{xy})}{\partial\rho} = \rho\dfrac{\partial H(p^\rho_{xy})}{\partial\rho}$

Proof: As shown in Lemma 29 and Lemma 31 respectively:
$$D(p^\rho_{xy}\|p_{xy}) = \rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{x,y}p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)$$
$$H(p^\rho_{xy}) = \frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho}$$
We have:
$$\frac{\partial D(p^\rho_{xy}\|p_{xy})}{\partial\rho} = H(p^\rho_{xy}) + \rho\frac{\partial H(p^\rho_{xy})}{\partial\rho} - \frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho} = \rho\frac{\partial H(p^\rho_{xy})}{\partial\rho} \qquad (G.17)$$
$\blacksquare$
Lemma 24 $\mathrm{sign}\Big(\dfrac{\partial[D(p^\rho_{xy}\|p_{xy}) - H(p^\rho_{xy})]}{\partial\rho}\Big) = \mathrm{sign}(\rho-1)$

Proof: Combining the results of the previous two lemmas, we have:
$$\frac{\partial\big[D(p^\rho_{xy}\|p_{xy}) - H(p^\rho_{xy})\big]}{\partial\rho} = (\rho-1)\frac{\partial H(p^\rho_{xy})}{\partial\rho}$$
and $\frac{\partial H(p^\rho_{xy})}{\partial\rho} \ge 0$ by Lemma 22, which gives the conclusion. $\blacksquare$
Lemma 25 (Properties of $\frac{\partial A(y,\rho)}{\partial\rho}$, $\frac{\partial B(\rho)}{\partial\rho}$, $\frac{\partial C(x,y,\rho)}{\partial\rho}$, $\frac{\partial D(y,\rho)}{\partial\rho}$ and $\frac{\partial H(\tilde{p}^\rho_{x|y=y})}{\partial\rho}$)

First,
$$\frac{\partial C(x,y,\rho)}{\partial\rho} = \frac{\partial p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\partial\rho} = -\frac{1}{1+\rho}\,p_{xy}(x,y)^{\frac{1}{1+\rho}}\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big) = -\frac{C(x,y,\rho)}{1+\rho}\log(C(x,y,\rho))$$
$$\frac{\partial D(y,\rho)}{\partial\rho} = \frac{\partial\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}}{\partial\rho} = -\frac{1}{1+\rho}\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\log\big(p_{xy}(s,y)^{\frac{1}{1+\rho}}\big) = -\frac{\sum_x C(x,y,\rho)\log(C(x,y,\rho))}{1+\rho} \qquad (G.18)$$
For a differentiable positive function $f(\rho)$,
$$\frac{\partial f(\rho)^{1+\rho}}{\partial\rho} = f(\rho)^{1+\rho}\log(f(\rho)) + (1+\rho)f(\rho)^{\rho}\frac{\partial f(\rho)}{\partial\rho}$$
So
$$\frac{\partial A(y,\rho)}{\partial\rho} = \frac{\partial D(y,\rho)^{1+\rho}}{\partial\rho} = D(y,\rho)^{1+\rho}\log(D(y,\rho)) + (1+\rho)D(y,\rho)^{\rho}\frac{\partial D(y,\rho)}{\partial\rho}$$
$$= D(y,\rho)^{1+\rho}\Big(\log(D(y,\rho)) - \sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log(C(x,y,\rho))\Big) = D(y,\rho)^{1+\rho}\Big(-\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big) = A(y,\rho)\,H(\tilde{p}^\rho_{x|y=y})$$
$$\frac{\partial B(\rho)}{\partial\rho} = \sum_y\frac{\partial A(y,\rho)}{\partial\rho} = \sum_y A(y,\rho)H(\tilde{p}^\rho_{x|y=y}) = B(\rho)\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y}) = B(\rho)\,H(\tilde{p}^\rho_{x|y})$$
And last:
$$\frac{\partial H(\tilde{p}^\rho_{x|y=y})}{\partial\rho} = -\sum_x\Bigg[\frac{\frac{\partial C(x,y,\rho)}{\partial\rho}}{D(y,\rho)} - \frac{C(x,y,\rho)\frac{\partial D(y,\rho)}{\partial\rho}}{D(y,\rho)^2}\Bigg]\Big[1+\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big]$$
$$= -\sum_x\Bigg[\frac{-\frac{C(x,y,\rho)}{1+\rho}\log(C(x,y,\rho))}{D(y,\rho)} + \frac{C(x,y,\rho)\sum_s\frac{C(s,y,\rho)\log(C(s,y,\rho))}{1+\rho}}{D(y,\rho)^2}\Bigg]\Big[1+\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big]$$
$$= \frac{1}{1+\rho}\sum_x\Big[\tilde{p}^\rho_{x|y}(x,y)\log(C(x,y,\rho)) - \tilde{p}^\rho_{x|y}(x,y)\sum_s\tilde{p}^\rho_{x|y}(s,y)\log(C(s,y,\rho))\Big]\big[1+\log(\tilde{p}^\rho_{x|y}(x,y))\big]$$
$$= \frac{1}{1+\rho}\sum_x\tilde{p}^\rho_{x|y}(x,y)\Big[\log(\tilde{p}^\rho_{x|y}(x,y)) - \sum_s\tilde{p}^\rho_{x|y}(s,y)\log(\tilde{p}^\rho_{x|y}(s,y))\Big]\big[1+\log(\tilde{p}^\rho_{x|y}(x,y))\big]$$
$$= \frac{1}{1+\rho}\sum_x\tilde{p}^\rho_{x|y}(x,y)\log(\tilde{p}^\rho_{x|y}(x,y))\Big[\log(\tilde{p}^\rho_{x|y}(x,y)) - \sum_s\tilde{p}^\rho_{x|y}(s,y)\log(\tilde{p}^\rho_{x|y}(s,y))\Big]$$
$$= \frac{1}{1+\rho}\sum_x\tilde{p}^\rho_{x|y}(x,y)\big(\log(\tilde{p}^\rho_{x|y}(x,y))\big)^2 - \frac{1}{1+\rho}\Big[\sum_x\tilde{p}^\rho_{x|y}(x,y)\log(\tilde{p}^\rho_{x|y}(x,y))\Big]^2 \ \ge\ 0 \qquad (G.19)$$
The inequality is true by the Cauchy-Schwarz inequality, noticing that $\sum_x\tilde{p}^\rho_{x|y}(x,y) = 1$. $\blacksquare$
These properties will be used again in the proofs of the following lemmas.
Lemma 26 $\dfrac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} \ge 0$

Proof:
$$\frac{\partial\frac{A(y,\rho)}{B(\rho)}}{\partial\rho} = \frac{1}{B(\rho)^2}\Big(\frac{\partial A(y,\rho)}{\partial\rho}B(\rho) - \frac{\partial B(\rho)}{\partial\rho}A(y,\rho)\Big) = \frac{1}{B(\rho)^2}\Big(A(y,\rho)H(\tilde{p}^\rho_{x|y=y})B(\rho) - H(\tilde{p}^\rho_{x|y})B(\rho)A(y,\rho)\Big) = \frac{A(y,\rho)}{B(\rho)}\big(H(\tilde{p}^\rho_{x|y=y}) - H(\tilde{p}^\rho_{x|y})\big)$$
Now,
$$\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} = \frac{\partial}{\partial\rho}\sum_y\frac{A(y,\rho)}{B(\rho)}\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\Big[-\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big] = \frac{\partial}{\partial\rho}\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})$$
$$= \sum_y\frac{A(y,\rho)}{B(\rho)}\frac{\partial H(\tilde{p}^\rho_{x|y=y})}{\partial\rho} + \sum_y\frac{\partial\frac{A(y,\rho)}{B(\rho)}}{\partial\rho}H(\tilde{p}^\rho_{x|y=y}) \ \ge\ \sum_y\frac{\partial\frac{A(y,\rho)}{B(\rho)}}{\partial\rho}H(\tilde{p}^\rho_{x|y=y})$$
$$= \sum_y\frac{A(y,\rho)}{B(\rho)}\big(H(\tilde{p}^\rho_{x|y=y}) - H(\tilde{p}^\rho_{x|y})\big)H(\tilde{p}^\rho_{x|y=y}) = \sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})^2 - H(\tilde{p}^\rho_{x|y})^2$$
$$= \Big(\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})^2\Big)\Big(\sum_y\frac{A(y,\rho)}{B(\rho)}\Big) - H(\tilde{p}^\rho_{x|y})^2 \ \stackrel{(a)}{\ge}\ \Big(\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})\Big)^2 - H(\tilde{p}^\rho_{x|y})^2 = 0 \qquad (G.20)$$
where (a) is again true by the Cauchy-Schwarz inequality. $\blacksquare$
Lemma 27 $\dfrac{\partial D(\tilde{p}^\rho_{xy}\|p_{xy})}{\partial\rho} = \rho\dfrac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho}$

Proof: As shown in Lemma 30 and Lemma 32 respectively:
$$D(\tilde{p}^\rho_{xy}\|p_{xy}) = \rho H(\tilde{p}^\rho_{x|y}) - \log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big)$$
$$H(\tilde{p}^\rho_{x|y}) = \frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho}$$
We have:
$$\frac{\partial D(\tilde{p}^\rho_{xy}\|p_{xy})}{\partial\rho} = H(\tilde{p}^\rho_{x|y}) + \rho\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} - \frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho} = \rho\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} \qquad (G.21)$$
$\blacksquare$
Lemma 28 $\mathrm{sign}\Big(\dfrac{\partial[D(\tilde{p}^\rho_{xy}\|p_{xy}) - H(\tilde{p}^\rho_{x|y})]}{\partial\rho}\Big) = \mathrm{sign}(\rho-1)$

Proof: Using the previous lemma, we get:
$$\frac{\partial\big[D(\tilde{p}^\rho_{xy}\|p_{xy}) - H(\tilde{p}^\rho_{x|y})\big]}{\partial\rho} = (\rho-1)\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho}$$
Then by Lemma 26, we get the conclusion. $\blacksquare$
Lemma 29
$$\rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big) = D(p^\rho_{xy}\|p_{xy})$$
Proof: Notice that $\log(p_{xy}(x,y)) = (1+\rho)\big[\log(p^\rho_{xy}(x,y)) + \log\big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)\big]$. We have:
$$D(p^\rho_{xy}\|p_{xy}) = -H(p^\rho_{xy}) - \sum_{x,y}p^\rho_{xy}(x,y)\log(p_{xy}(x,y))$$
$$= -H(p^\rho_{xy}) - \sum_{x,y}p^\rho_{xy}(x,y)(1+\rho)\Big[\log(p^\rho_{xy}(x,y)) + \log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big)\Big]$$
$$= -H(p^\rho_{xy}) + (1+\rho)H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) = \rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) \qquad (G.22)$$
$\blacksquare$
Lemma 30
$$\rho H(\tilde{p}^\rho_{x|y}) - \log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) = D(\tilde{p}^\rho_{xy}\|p_{xy})$$
Proof:
$$D(\tilde{p}^\rho_{xy}\|p_{xy}) = \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\log\Bigg(\frac{\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}}{p_{xy}(x,y)}\Bigg)$$
$$= \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\Big[\log\Big(\frac{A(y,\rho)}{B(\rho)}\Big) + \log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big) - \log(p_{xy}(x,y))\Big]$$
$$= -\log(B(\rho)) - H(\tilde{p}^\rho_{x|y}) + \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\big[\log(D(y,\rho)^{1+\rho}) - \log(C(x,y,\rho)^{1+\rho})\big]$$
$$= -\log(B(\rho)) - H(\tilde{p}^\rho_{x|y}) + (1+\rho)H(\tilde{p}^\rho_{x|y}) = -\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) + \rho H(\tilde{p}^\rho_{x|y})$$
where we used $\log(A(y,\rho)/B(\rho)) = \log(D(y,\rho)^{1+\rho}) - \log(B(\rho))$ and $\log(p_{xy}(x,y)) = \log(C(x,y,\rho)^{1+\rho})$. $\blacksquare$
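The closed-form identities of Lemma 29 and Lemma 30 are easy to confirm numerically. The check below uses an assumed 2x2 example pmf (not from the thesis) and natural logarithms, matching this appendix's convention, and evaluates both sides for several values of $\rho$:

```python
import math

# Numeric check of the two identities on an assumed 2x2 example pmf.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
X, Y = (0, 1), (0, 1)

def KL(q, p):
    return sum(q[k] * math.log(q[k] / p[k]) for k in q if q[k] > 0)

def H(q):
    return -sum(v * math.log(v) for v in q.values() if v > 0)

for rho in (0.3, 1.0, 2.5):
    # Lemma 29: standard tilted distribution
    w = {k: p ** (1.0 / (1 + rho)) for k, p in pxy.items()}
    z = sum(w.values())
    p_rho = {k: v / z for k, v in w.items()}
    lhs = rho * H(p_rho) - (1 + rho) * math.log(z)
    assert abs(lhs - KL(p_rho, pxy)) < 1e-9

    # Lemma 30: x-y tilted distribution, via A, B, C, D
    C = {(x, y): pxy[(x, y)] ** (1.0 / (1 + rho)) for x in X for y in Y}
    D = {y: sum(C[(x, y)] for x in X) for y in Y}
    A = {y: D[y] ** (1 + rho) for y in Y}
    B = sum(A.values())
    pt = {(x, y): (A[y] / B) * (C[(x, y)] / D[y]) for x in X for y in Y}
    H_cond = sum((A[y] / B) *
                 -sum((C[(x, y)] / D[y]) * math.log(C[(x, y)] / D[y]) for x in X)
                 for y in Y)
    lhs2 = rho * H_cond - math.log(B)
    assert abs(lhs2 - KL(pt, pxy)) < 1e-9
```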
Lemma 31
$$H(p^\rho_{xy}) = \frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho}$$
Proof:
$$\frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho} = \log\Big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) - \sum_y\sum_x\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)$$
$$= -\sum_y\sum_x\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\log\Bigg(\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\Bigg) = H(p^\rho_{xy}) \qquad (G.23)$$
$\blacksquare$
Lemma 32
$$H(\tilde{p}^\rho_{x|y}) = \frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho}$$
Proof: Notice that $B(\rho) = \sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}$ and $\frac{\partial B(\rho)}{\partial\rho} = B(\rho)H(\tilde{p}^\rho_{x|y})$, as shown in Lemma 25. It is clear that:
$$\frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho} = \frac{\partial\log(B(\rho))}{\partial\rho} = \frac{1}{B(\rho)}\frac{\partial B(\rho)}{\partial\rho} = H(\tilde{p}^\rho_{x|y}) \qquad (G.24)$$
$\blacksquare$
Appendix H
Bounding individual error events for source coding with side-information
In this appendix, we give the proofs of Lemma 13 and Lemma 14. The proofs here are very similar to those for the point-to-point source coding problem in Propositions 3 and 4.
H.1 ML decoding: Proof of Lemma 13
In this section, we prove Lemma 13. The argument resembles the proof of Lemma 9 in
Appendix F.1.
$$p_n(l) = \sum_{x_1^n, y_1^n}\Pr\big[\exists\,\tilde{x}_1^n\in B_x(x_1^n)\cap F_n(l, x_1^n) \text{ s.t. } p_{xy}(\tilde{x}_1^n, y_1^n) \ge p_{xy}(x_1^n, y_1^n)\big]\, p_{xy}(x_1^n, y_1^n)$$
$$\le \sum_{x_1^n,y_1^n}\min\Bigg[1, \sum_{\substack{\tilde{x}_1^n\in F_n(l,x_1^n)\text{ s.t.}\\ p_{xy}(\tilde{x}_1^n, y_1^n) \ge p_{xy}(x_1^n, y_1^n)}}\Pr[\tilde{x}_1^n\in B_x(x_1^n)]\Bigg] p_{xy}(x_1^n, y_1^n) \qquad (H.1)$$
$$\le \sum_{y_1^n}\sum_{x_1^{l-1}, x_l^n}\min\Bigg[1, \sum_{\substack{\tilde{x}_l^n\text{ s.t.}\\ p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)}} 2\times 2^{-(n-l+1)R}\Bigg] p_{xy}(x_1^{l-1}, y_1^{l-1})\, p_{xy}(x_l^n, y_l^n)$$
$$\le 2\times\sum_{y_l^n}\sum_{x_l^n}\min\Bigg[1, \sum_{\substack{\tilde{x}_l^n\text{ s.t.}\\ p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)}} 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$= 2\times\sum_{y_l^n}\sum_{x_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n} 1\big[p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)\big]\, 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 2\times\sum_{y_l^n}\sum_{x_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n}\min\Big[1, \frac{p_{xy}(\tilde{x}_l^n, y_l^n)}{p_{xy}(x_l^n, y_l^n)}\Big] 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 2\times\sum_{y_l^n}\sum_{x_l^n}\Bigg[\sum_{\tilde{x}_l^n}\Big[\frac{p_{xy}(\tilde{x}_l^n, y_l^n)}{p_{xy}(x_l^n, y_l^n)}\Big]^{\frac{1}{1+\rho}}\, 2^{-(n-l+1)R}\Bigg]^\rho p_{xy}(x_l^n, y_l^n)$$
$$= 2\times\sum_{y_l^n}\sum_{x_l^n} p_{xy}(x_l^n, y_l^n)^{\frac{1}{1+\rho}}\Bigg[\sum_{\tilde{x}_l^n} p_{xy}(\tilde{x}_l^n, y_l^n)^{\frac{1}{1+\rho}}\Bigg]^\rho\, 2^{-(n-l+1)\rho R}$$
$$= 2\times\Bigg[\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^\rho\Bigg]^{n-l+1} 2^{-(n-l+1)\rho R}$$
$$= 2\times\Bigg[\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Bigg]^{n-l+1} 2^{-(n-l+1)\rho R} = 2\times 2^{-(n-l+1)\Big(\rho R - \log\big[\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\big]\Big)} \qquad (H.2)$$
for all $\rho\in[0,1]$. Every step of the proof follows the same argument as in the proof of Lemma 9. Finally, optimizing the exponent over $\rho\in[0,1]$ in (H.2) proves the lemma. $\blacksquare$
H.2 Universal decoding: Proof of Lemma 14
In this section, we give the details of the proof of Lemma 14. The argument resembles
that in the proof of Lemma 11 in Appendix F.2.
The error probability $p_n(l)$ can be bounded starting from (H.1), with the condition $H(\tilde{x}_l^n, y_l^n) \le H(x_l^n, y_l^n)$ substituted for $p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)$. We get
$$p_n(l) \le \sum_{P^{n-l}}\sum_{V^{n-l}}\ \sum_{y_l^n\in T_{P^{n-l}}}\ \sum_{x_l^n\in T_{V^{n-l}}(y_l^n)}\min\Bigg[1, \sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}}\ \sum_{\tilde{x}_l^n\in T_{\tilde{V}^{n-l}}(y_l^n)} 2\times 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n) \qquad (H.3)$$
In (H.3) we enumerate all the source sequences in a way that allows us to focus on the types of the important subsequences. We enumerate the possibly misleading candidate sequences in terms of their suffix types. We restrict the sum to those sequences $\tilde{x}_l^n$ that could lead to mistaken decoding, using the notation $H(P^{n-l}\times V^{n-l})$ for the joint entropy of the type $P^{n-l}$ with the V-shell $V^{n-l}$.

Note that the summations within the minimization in (H.3) do not depend on the arguments within these sums. Thus, we can bound this sum separately to get a bound on the number of possibly misleading source sequences $\tilde{x}_l^n$.
$$\sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}}\ \sum_{\tilde{x}_l^n\in T_{\tilde{V}^{n-l}}(y_l^n)} 1 \ \le\ \sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}} |T_{\tilde{V}^{n-l}}(y_l^n)|$$
$$\le \sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}} 2^{(n-l+1)H(\tilde{V}^{n-l}|P^{n-l})} \ \le\ \sum_{\tilde{V}^{n-l}} 2^{(n-l+1)H(V^{n-l}|P^{n-l})} \qquad (H.4)$$
$$\le (n-l+2)^{|\mathcal{X}||\mathcal{Y}|}\, 2^{(n-l+1)H(V^{n-l}|P^{n-l})} \qquad (H.5)$$
Every inequality follows the same argument as in the proof of Lemma 11. (H.4) is true because $H(P^{n-l}\times\tilde{V}^{n-l}) \le H(P^{n-l}\times V^{n-l})$ is equivalent to $H(\tilde{V}^{n-l}|P^{n-l}) \le H(V^{n-l}|P^{n-l})$.
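The two counting facts used here, polynomially many conditional types and $|T_V(y^n)| \le 2^{nH(V|P)}$, can be verified by brute force for a tiny alphabet. The sketch below is illustrative only; the fixed side sequence and the length $n$ are assumptions chosen to keep the enumeration small:

```python
import math
from itertools import product

# Brute-force check of the type-counting bounds for a binary example.
n = 10
y = (0, 0, 0, 0, 0, 0, 1, 1, 1, 1)   # a fixed side sequence; its type P is (0.6, 0.4)
ny = [y.count(0), y.count(1)]

def cond_type(x):
    """V(a|b): empirical conditional distribution of x given y."""
    counts = [[0, 0], [0, 0]]          # counts[b][a]
    for xi, yi in zip(x, y):
        counts[yi][xi] += 1
    return tuple(tuple(c / ny[b] for c in counts[b]) for b in range(2))

def H_cond(V):
    """H(V|P) in bits, with P the type of y."""
    return sum((ny[b] / n) *
               -sum(v * math.log2(v) for v in V[b] if v > 0) for b in range(2))

shells = {}
for x in product((0, 1), repeat=n):
    shells[cond_type(x)] = shells.get(cond_type(x), 0) + 1

# Polynomially many V-shells: at most (n+1)^{|X||Y|}
assert len(shells) <= (n + 1) ** 4
# Each shell satisfies |T_V(y)| <= 2^{n H(V|P)}
for V, size in shells.items():
    assert size <= 2.0 ** (n * H_cond(V))
```

The shell size here is a product of binomial coefficients, one per value of $y$, and the bound follows because each binomial coefficient is at most $2$ raised to the corresponding conditional entropy times the count of that $y$ value.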
We substitute (H.5) into (H.3) and pull out the polynomial term, giving
pn(l) ≤(n− l + 2)|X ||Y|∑
P n−l
∑
V n−l
∑
ynl ∈ T
P n−l
∑
xnl ∈ T
V n−l(ynl
)
min[1,
2× 2−(n−l+1)(R−H(V n−l|P n−l))]pxy (xn
l , ynl )
≤(n− l + 2)|X ||Y|∑
P n−l
∑
V n−l
2−(n−l+1)max0,R−H(V n−l|P n−l)
× 2−(n−l+1)D(P n−l×V n−l‖pxy )
≤2× (n− l + 2)2|X ||Y| × 2−(n−l+1) inf x,yD(px,y‖pxy )+|R−H(V n−l|P n−l)|+
=2× (n− l + 2)2|X ||Y| × 2−(n−l+1)Elowersi (R)
All steps follow the same argument as in the proof of Lemma 11. □
Appendix I
Proof of Corollary 1
I.1 Proof of the lower bound
From Theorem 6, we know that
$$\begin{aligned}
E_{si}(R) \ge E^{\mathrm{lower}}_{si}(R) &= \min_{q_{xy}} \bigl\{ D(q_{xy}\|p_{xy}) + |R - H(q_{x|y})|^+ \bigr\} &\mathrm{(I.1)}\\
&= \max_{\rho \in [0,1]} \Bigl\{ \rho R - \log\Bigl[\sum_y \Bigl[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr] \Bigr\} &\mathrm{(I.2)}
\end{aligned}$$
where the equivalence between (I.1) and (I.2) is shown by a Lagrange multiplier argument in [39] and detailed in Appendix G. Next we show that (I.2) is equal to the random source coding error exponent $E_r(R, p_s)$ defined in (A.5) with $p_x$ replaced by $p_s$.
$$\begin{aligned}
E^{\mathrm{lower}}_{si}(R) &\overset{(a)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(b)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x \Bigl(\frac{p_{x|y}(x|y)}{|\mathcal{S}|}\Bigr)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(c)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\frac{1}{|\mathcal{S}|} \sum_y \Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(d)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(e)}{=} E_r(R, p_s)
\end{aligned}$$
(a) is by definition. (b) holds because the marginal distribution of $y$ is uniform on $\mathcal{S}$, so $p_{xy}(x,y) = p_{x|y}(x|y)\, p_y(y) = p_{x|y}(x|y)\frac{1}{|\mathcal{S}|}$. (c) is true because $x = y \oplus s$. (d) is obvious. (e) is by the definition of the random coding error exponent in Equation (A.5). □
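The chain (a)-(e) can also be checked numerically. Below is a quick sketch for a toy binary case with $\mathcal{S} = \{0,1\}$, $y$ uniform, and $x = y \oplus s$; the noise distribution `p_s = (0.9, 0.1)` is an arbitrary illustrative choice:

```python
from math import log2

# Toy instance (illustrative, not from the text): S = {0,1}, p_s = (0.9, 0.1),
# y uniform on S, x = y XOR s, so p_xy(x, y) = p_s(x XOR y) / |S|.
p_s = [0.9, 0.1]
S = [0, 1]

def lhs(rho):
    """Gallager-style expression over the joint p_xy, as in step (a)."""
    total = 0.0
    for y in S:
        inner = sum((p_s[x ^ y] / len(S)) ** (1 / (1 + rho)) for x in S)
        total += inner ** (1 + rho)
    return log2(total)

def rhs(rho):
    """The same expression reduced to the noise distribution p_s, as in step (d)."""
    return (1 + rho) * log2(sum(p ** (1 / (1 + rho)) for p in p_s))

for rho in [0.0, 0.25, 0.5, 1.0]:
    assert abs(lhs(rho) - rhs(rho)) < 1e-12
print("identity (a)-(e) verified on a toy source")
```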
I.2 Proof of the upper bound
From Theorem 7, we know that
$$\begin{aligned}
E_{si}(R) &\overset{(a)}{\le} E^{\mathrm{upper}}_{si}(R)\\
&\overset{(b)}{=} \min\Bigl\{ \inf_{\substack{q_{xy},\ \alpha \ge 1:\\ H(q_{x|y}) > (1+\alpha)R}} \frac{1}{\alpha} D(q_{xy}\|p_{xy}),\ \inf_{\substack{q_{xy},\ 1 \ge \alpha \ge 0:\\ H(q_{x|y}) > (1+\alpha)R}} \frac{1-\alpha}{\alpha} D(q_x\|p_x) + D(q_{xy}\|p_{xy}) \Bigr\}\\
&\overset{(c)}{\le} \inf_{\substack{q_{xy},\ 1 \ge \alpha \ge 0:\\ H(q_{x|y}) > (1+\alpha)R}} \frac{1-\alpha}{\alpha} D(q_x\|p_x) + D(q_{xy}\|p_{xy})\\
&\overset{(d)}{\le} \inf_{\substack{q_{xy},\ 1 \ge \alpha \ge 0:\\ H(q_{x|y}) > (1+\alpha)R}} D(q_{xy}\|p_{xy})\\
&\overset{(e)}{\le} \inf_{q_{xy}:\ H(q_{x|y}) > R} D(q_{xy}\|p_{xy})\\
&\overset{(f)}{=} E^{\mathrm{upper}}_{si,b}(R) \qquad \mathrm{(I.3)}
\end{aligned}$$
where (a) and (b) are by Theorem 7, and (c) is obvious. (d) is true because $D(q_x\|p_x) \ge 0$. (e) is true because $\{(q_{xy}, 0) : H(q_{x|y}) > R\} \subseteq \{(q_{xy}, \alpha) : H(q_{x|y}) > (1+\alpha)R\}$. (f) is by the definition of the upper bound on the block coding error exponent in Theorem 13. Using the other definition of $E^{\mathrm{upper}}_{si,b}(R)$ in Theorem 13, we have:
$$\begin{aligned}
E^{\mathrm{upper}}_{si,b}(R) &\overset{(a)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(b)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x \Bigl(\frac{p_{x|y}(x|y)}{|\mathcal{S}|}\Bigr)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(c)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\frac{1}{|\mathcal{S}|} \sum_y \Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(d)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(e)}{=} E_{s,b}(R, p_s)
\end{aligned}$$
where (a)-(d) follow exactly the same arguments as those in Section I.1, and (e) is by the definition of the block coding error exponent in (A.4). □
Appendix J
Proof of Theorem 8
The proof here is almost identical to that of the delay-constrained lossless source coding error exponent $E_s(R)$ in Chapter 2. In the converse part, however, we use a different, genie-based argument [67]. Readers may find that proof interesting.
J.1 Achievability of Eei(R)
In this section, we introduce a universal coding scheme that achieves the delay-constrained error exponent in Theorem 8. The coding scheme depends only on the alphabet sizes $|\mathcal{X}|$ and $|\mathcal{Y}|$, not on the distribution of the source. We first describe our universal coding scheme.
A block length $N$ is chosen that is much smaller than the target end-to-end delays, while still being large enough; again we are interested in the performance with asymptotically large delays $\Delta$. For a discrete memoryless source $X$, side information $Y$, and large block length $N$, the best possible variable-length code is given in [29] and consists of two stages: first describing the joint type of the block $(\vec{x}_i, \vec{y}_i)$ using at most $O(|\mathcal{X}||\mathcal{Y}| \log_2 N)$ bits, and then describing which particular realization has occurred using a variable $N H(\vec{x}_i|\vec{y}_i)$ bits, where $\vec{x}_i$ is the $i$-th block of length $N$ and $H(\vec{x}_i|\vec{y}_i)$ is the empirical conditional entropy of the sequence $\vec{x}_i$ given $\vec{y}_i$. The overhead $O(|\mathcal{X}||\mathcal{Y}| \log_2 N)$ is asymptotically negligible, and the code is universal in nature. It is easy to verify that the average code length satisfies $\lim_{N \to \infty} \frac{E_{p_{xy}}[l(\vec{x}, \vec{y})]}{N} = H(p_{x|y})$. This code is obviously a prefix-free code. Write $l(\vec{x}_i, \vec{y}_i)$ as
[Figure J.1: A universal delay-constrained source coding system with both encoder and decoder side-information. Streaming data $X$ and streaming side information $Y$ enter a fixed-to-variable-length encoder; the output bits pass through a FIFO buffer into a rate-$R$ bit stream to the decoder.]
the length of the codeword for $(\vec{x}_i, \vec{y}_i)$, and write $2^{N\varepsilon_N}$ for the number of types in $\mathcal{X}^N \times \mathcal{Y}^N$; then
$$N H(\vec{x}_i|\vec{y}_i) \le l(\vec{x}_i, \vec{y}_i) \le N\bigl(H(\vec{x}_i|\vec{y}_i) + \varepsilon_N\bigr) \qquad \mathrm{(J.1)}$$
where $0 < \varepsilon_N \le |\mathcal{X}||\mathcal{Y}| \frac{\log_2(N+1)}{N}$ goes to 0 as $N$ goes to infinity. The binary sequence describing the source is fed into the FIFO buffer shown in Figure J.1. Notice that if the buffer is empty, the output of the buffer can be gibberish binary bits; the decoder simply discards these meaningless bits because it is aware that the buffer is empty.
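The two-stage code length described above (type description plus an index into the conditional type class) can be sketched as follows. The helper names and the toy sequences are hypothetical, and integer rounding is absorbed by a small slack term relative to (J.1):

```python
from math import log2, ceil
from collections import Counter

def empirical_cond_entropy(x, y):
    """Empirical conditional entropy H(x|y) in bits per symbol."""
    n = len(x)
    joint = Counter(zip(x, y))
    marg_y = Counter(y)
    return sum(c / n * log2(marg_y[b] / c) for (a, b), c in joint.items())

def code_length(x, y, X_size=2, Y_size=2):
    """Two-stage universal code length: joint-type index + index within the shell."""
    N = len(x)
    type_bits = ceil(X_size * Y_size * log2(N + 1))       # describe the joint type
    index_bits = ceil(N * empirical_cond_entropy(x, y))   # index within the type class
    return type_bits + index_bits

# Toy binary blocks of length N = 8 (illustrative choices).
x = [0, 1, 1, 0, 1, 0, 0, 0]
y = [0, 1, 0, 0, 1, 1, 0, 0]
N = len(x)
l = code_length(x, y)
H = empirical_cond_entropy(x, y)
eps_N = 2 * 2 * log2(N + 1) / N  # |X||Y| log2(N+1) / N

# Sanity check mirroring (J.1), with +2 bits of rounding slack from the two ceils.
assert N * H <= l <= N * (H + eps_N) + 2
print(l, H)
```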
Proposition 13 For the iid source $\sim p_{xy}$, using the universal causal code described above, for all $\varepsilon > 0$ there exists $K < \infty$ such that for all $t, \Delta$:
$$\Pr\bigl(\vec{x}_t \ne \hat{\vec{x}}_t((t+\Delta)N)\bigr) \le K 2^{-\Delta N (E_{ei}(R) - \varepsilon)}$$
where $\hat{\vec{x}}_t((t+\Delta)N)$ is the estimate of $\vec{x}_t$ at time $(t+\Delta)N$. Before the proof, we state the following lemma bounding the probability of atypical source behavior.
Lemma 33 (Source atypicality) For all $\varepsilon > 0$ and block length $N$ large enough, there exists $K < \infty$ such that for all $n$, if $r < \log_2 |\mathcal{X}|$:
$$\Pr\Bigl(\sum_{i=1}^{n} l(\vec{x}_i, \vec{y}_i) > nNr\Bigr) \le K 2^{-nN(E_{ei,b}(r) - \varepsilon)} \qquad \mathrm{(J.2)}$$
Proof: We only need to show the case $r > H(p_{x|y})$. By Cramér's theorem [31], for all $\varepsilon_1 > 0$ there exists $K_1$ such that
$$\Pr\Bigl(\sum_{i=1}^{n} l(\vec{x}_i, \vec{y}_i) > nNr\Bigr) = \Pr\Bigl(\frac{1}{n}\sum_{i=1}^{n} l(\vec{x}_i, \vec{y}_i) > Nr\Bigr) \le K_1 2^{-n(\inf_{z > Nr} I(z) - \varepsilon_1)} \qquad \mathrm{(J.3)}$$
where the rate function $I(z)$ is [31]:
$$I(z) = \sup_{\rho \in \mathbb{R}} \Bigl\{ \rho z - \log_2\Bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\Bigr) \Bigr\} \qquad \mathrm{(J.4)}$$
Write $I(z,\rho) = \rho z - \log_2\bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\bigr)$.
Obviously $I(z,0) = 0$, and since $z > Nr > N H(p_{x|y})$, for large $N$:
$$\frac{\partial I(z,\rho)}{\partial \rho}\Big|_{\rho=0} = z - \sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, l(\vec{x},\vec{y}) \ge 0$$
By Hölder's inequality, for all $\rho_1, \rho_2$, and for all $\theta \in (0,1)$:
$$\Bigl(\sum_i p_i 2^{\rho_1 l_i}\Bigr)^{\theta} \Bigl(\sum_i p_i 2^{\rho_2 l_i}\Bigr)^{1-\theta} \ge \sum_i \bigl(p_i^{\theta}\, 2^{\theta \rho_1 l_i}\bigr)\bigl(p_i^{1-\theta}\, 2^{(1-\theta)\rho_2 l_i}\bigr) = \sum_i p_i 2^{(\theta \rho_1 + (1-\theta)\rho_2) l_i}$$
This shows that $\log_2\bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\bigr)$ is a convex function of $\rho$, and thus $I(z,\rho)$ is a concave $\cap$ function of $\rho$ for fixed $z$. Then for all $z > 0$ and all $\rho < 0$, $I(z,\rho) < 0$, which means that the $\rho$ maximizing $I(z,\rho)$ is positive. This implies that $I(z)$ is monotonically increasing in $z$, and $I(z)$ is obviously continuous. Thus
$$\inf_{z > Nr} I(z) = I(Nr) \qquad \mathrm{(J.5)}$$
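The two facts used above, convexity of the log moment generating function in $\rho$ and monotonicity of $I(z)$ above the mean code length, can be checked numerically with a toy length distribution; the values below are illustrative, not from the text:

```python
from math import log2

# Toy codeword-length distribution (illustrative, not from the text).
lengths = [2, 3, 5, 8]
probs = [0.5, 0.25, 0.15, 0.1]
mean_len = sum(p * l for p, l in zip(probs, lengths))

def log_mgf(rho):
    """log2 of E[2^{rho * l}], the term subtracted inside I(z, rho)."""
    return log2(sum(p * 2 ** (rho * l) for p, l in zip(probs, lengths)))

def I(z):
    """Rate function I(z) = sup_rho (rho*z - log_mgf(rho)), via a grid search."""
    grid = [k / 100 for k in range(-500, 501)]
    return max(rho * z - log_mgf(rho) for rho in grid)

# Convexity of log_mgf in rho: nonnegative second differences on a grid.
h = 0.1
for k in range(-20, 20):
    rho = k * h
    assert log_mgf(rho - h) + log_mgf(rho + h) - 2 * log_mgf(rho) >= -1e-9

# I(z) is nondecreasing for z above the mean length, which is what makes
# inf_{z > Nr} I(z) = I(Nr) as in (J.5).
zs = [mean_len + 0.2 * k for k in range(1, 10)]
vals = [I(z) for z in zs]
assert all(v2 >= v1 - 1e-9 for v1, v2 in zip(vals, vals[1:]))
print("log-MGF convex; I(z) nondecreasing above the mean")
```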
Using the upper bound on $l(\vec{x}, \vec{y})$ in (J.1):
$$\begin{aligned}
\log_2\Bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\Bigr) &\le \log_2\Bigl(\sum_{q_{xy} \in \mathcal{T}^N} 2^{-N D(q_{xy}\|p_{xy})}\, 2^{\rho N(\varepsilon_N + H(q_{x|y}))}\Bigr)\\
&\le \log_2\Bigl(2^{N \varepsilon_N}\, 2^{-N \min_{q}\{D(q_{xy}\|p_{xy}) - \rho H(q_{x|y}) - \rho \varepsilon_N\}}\Bigr)\\
&= N\Bigl(-\min_{q}\{D(q_{xy}\|p_{xy}) - \rho H(q_{x|y}) - \rho \varepsilon_N\} + \varepsilon_N\Bigr)
\end{aligned}$$
where $0 < \varepsilon_N \le |\mathcal{X}||\mathcal{Y}| \frac{\log_2(N+1)}{N}$ goes to 0 as $N$ goes to infinity, and $\mathcal{T}^N$ is the set of all types on $\mathcal{X}^N \times \mathcal{Y}^N$.
Substituting the above inequalities into $I(Nr)$ defined in (J.4):
$$I(Nr) \ge N\Bigl(\sup_{\rho} \min_{q} \bigl\{\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})\bigr\} - \varepsilon_N\Bigr) \qquad \mathrm{(J.6)}$$
We next show that $I(Nr) \ge N(E_{ei,b}(r) - \varepsilon)$ where $\varepsilon$ goes to 0 as $N$ goes to infinity. This can be proved by a somewhat complicated Lagrange multiplier method also used in [12]. We give a new proof based on the existence of a saddle point of the min-max function. Define
$$f(q, \rho) = \rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})$$
Obviously, for fixed $q$, $f(q,\rho)$ is a linear function of $\rho$, thus concave $\cap$. Also, for fixed $\rho$, $f(q,\rho)$ is a convex function of $q$. Define $g(u) = \min_q \sup_\rho (f(q,\rho) + \rho u)$; it is enough [10] to show that $g(u)$ is finite in a neighborhood of $u = 0$ to establish the existence of the saddle point.
$$\begin{aligned}
g(u) &\overset{(a)}{=} \min_q \sup_\rho \bigl(f(q,\rho) + \rho u\bigr)\\
&\overset{(b)}{=} \min_q \sup_\rho \bigl(\rho(r - H(q_{x|y}) - \varepsilon_N + u) + D(q_{xy}\|p_{xy})\bigr)\\
&\overset{(c)}{\le} \min_{q:\ H(q_{x|y}) = r - \varepsilon_N + u} D(q_{xy}\|p_{xy})\\
&\overset{(d)}{<} \infty \qquad \mathrm{(J.7)}
\end{aligned}$$
(a) and (b) are by definition. (c) is true because $H(p_{x|y}) < r < \log_2 |\mathcal{X}|$; thus for very small $\varepsilon_N$ and $u$, $H(p_{x|y}) < r - \varepsilon_N + u < \log_2 |\mathcal{X}|$, so there exists a distribution $q$ with $H(q_{x|y}) = r - \varepsilon_N + u$. (d) is true because of the assumption on $p_{xy}$ that the marginal $p_x(x) > 0$ for all $x \in \mathcal{X}$. Thus we have proved the existence of a saddle point of $f(q, \rho)$.
$$\sup_{\rho} \min_{q} f(q,\rho) = \min_{q} \sup_{\rho} f(q,\rho) \qquad \mathrm{(J.8)}$$
Notice that if $H(q_{x|y}) \ne r - \varepsilon_N$, then $\rho$ can be chosen arbitrarily large or small to make $\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})$ arbitrarily large. Thus the $q$ minimizing $\sup_\rho \{\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})\}$ satisfies $r - H(q_{x|y}) - \varepsilon_N = 0$, so:
$$\begin{aligned}
\min_q \sup_\rho \bigl\{\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})\bigr\} &\overset{(a)}{=} \min_{q:\ H(q_{x|y}) = r - \varepsilon_N} D(q_{xy}\|p_{xy})\\
&\overset{(b)}{=} \min_{q:\ H(q_{x|y}) \ge r - \varepsilon_N} D(q_{xy}\|p_{xy})\\
&\overset{(c)}{=} E_{ei,b}(r - \varepsilon_N) \qquad \mathrm{(J.9)}
\end{aligned}$$
(a) follows the argument above. (b) is true because the distribution $q$ minimizing the KL divergence always lies on the boundary of the feasible set [26] when $p$ is not in the feasible set. (c) is by definition. Combining (J.6), (J.8) and (J.9), letting $N$ be sufficiently large so that $\varepsilon_N$ is sufficiently small, and noticing that $E_{ei,b}(r)$ is continuous in $r$, we get the desired bound in (J.2). □
Now we are ready to prove Proposition 13.
Proof: We upper bound the probability of decoding error on $\vec{x}_t$ at time $(t+\Delta)N$. At time $(t+\Delta)N$, the decoder cannot decode $\vec{x}_t$ with zero error probability iff the binary strings describing $\vec{x}_t$ have not all left the buffer. Since the encoding buffer is FIFO, this means that the number of outgoing bits from some time $t_1$ to $(t+\Delta)N$ is less than the number of bits in the buffer at time $t_1$ plus the number of incoming bits from time $t_1$ to time $tN$. Suppose that the buffer was last empty at time $tN - nN$, where $0 \le n \le t$. Given this condition, a decoding error can occur only if $\sum_{i=0}^{n-1} l(\vec{x}_{t-i}, \vec{y}_{t-i}) > (n+\Delta)NR$. Write $l_{\max}$ for the longest codeword length; $l_{\max} \le |\mathcal{X}||\mathcal{Y}| \log_2(N+1) + N \log_2 |\mathcal{X}|$. Then $\Pr\bigl(\sum_{i=0}^{n-1} l(\vec{x}_{t-i}, \vec{y}_{t-i}) > (n+\Delta)NR\bigr) > 0$ only if $n > \frac{(n+\Delta)NR}{l_{\max}} > \frac{\Delta NR}{l_{\max}} \triangleq \beta\Delta$.
$$\begin{aligned}
\Pr\bigl(\vec{x}_t \ne \hat{\vec{x}}_t((t+\Delta)N)\bigr) &\le \sum_{n=\beta\Delta}^{t} \Pr\Bigl(\sum_{i=0}^{n-1} l(\vec{x}_{t-i}, \vec{y}_{t-i}) > (n+\Delta)NR\Bigr) \qquad \mathrm{(J.10)}\\
&\overset{(a)}{\le} \sum_{n=\beta\Delta}^{t} K_1 2^{-nN\bigl(E_{ei,b}\bigl(\frac{(n+\Delta)NR}{nN}\bigr) - \varepsilon_1\bigr)}\\
&\overset{(b)}{\le} \sum_{n=\gamma\Delta}^{\infty} K_2 2^{-nN(E_{ei,b}(R) - \varepsilon_2)} + \sum_{n=\beta\Delta}^{\gamma\Delta} K_2 2^{-\Delta N\bigl(\min_{\alpha>1} \frac{E_{ei,b}(\alpha R)}{\alpha - 1} - \varepsilon_2\bigr)}\\
&\overset{(c)}{\le} K_3 2^{-\gamma\Delta N(E_{ei,b}(R) - \varepsilon_2)} + |\gamma\Delta - \beta\Delta|\, K_3 2^{-\Delta N(E_{ei}(R) - \varepsilon_2)}\\
&\overset{(d)}{\le} K 2^{-\Delta N(E_{ei}(R) - \varepsilon)}
\end{aligned}$$
where the $K_i$'s and $\varepsilon_i$'s are properly chosen real numbers. (a) is true because of Lemma 33. Define $\gamma = \frac{E_{ei}(R)}{E_{ei,b}(R)}$; in the first part of (b), we only need the fact that $E_{ei,b}(R)$ is nondecreasing in $R$. In the second part of (b), we write $\alpha = \frac{n+\Delta}{n}$ and take the $\alpha$ minimizing the error exponent. The first term of (c) comes from the sum of a geometric series. The second term of (c) is by the definition of $E_{ei}(R)$. (d) is by the definition of $\gamma$. ■
J.2 Converse
To bound the best possible error exponent with fixed delay, we consider a genie-aided encoder/decoder pair and translate the block-coding error exponent into a delay-constrained error exponent. The arguments are analogous to the "focusing bound" derivation in [67] for channel coding with feedback.
Proposition 14 For fixed-rate encodings of discrete memoryless sources, it is not possible to achieve a fixed-delay error exponent better than
$$\inf_{\alpha > 0} \frac{1}{\alpha} E_{ei,b}\bigl((\alpha+1)R\bigr) \qquad \mathrm{(J.11)}$$
Proof: For simplicity of exposition, we ignore integer effects arising from the finite nature of $\Delta$, $R$, etc. For every $\alpha > 0$ and delay $\Delta$, consider a code running over its fixed-rate noiseless channel until time $\frac{\Delta}{\alpha} + \Delta$. By this time, the decoder has committed to estimates for the source symbols up to time $i = \frac{\Delta}{\alpha}$. The total number of bits used during this period is $(\frac{\Delta}{\alpha} + \Delta)R$.
Now consider a genie that gives the encoder access to the first i source symbols at
the beginning of time, rather than forcing the encoder to get the source symbols one at a
time. Simultaneously, loosen the requirements on the decoder by only demanding correct
estimates for the first i source symbols by the time ∆α + ∆. In effect, the deadline for
decoding the past source symbols is extended to the deadline of the i-th symbol itself.
Any lower bound on the error probability of the new problem is clearly also a bound for the original problem. Furthermore, the new problem is just a fixed-length block-coding problem requiring the encoding of $i$ source symbols into $(\frac{\Delta}{\alpha} + \Delta)R$ bits. The rate per symbol is
$$\Bigl(\frac{\Delta}{\alpha} + \Delta\Bigr)R \cdot \frac{1}{i} = \Bigl(\frac{\Delta}{\alpha} + \Delta\Bigr)R \cdot \frac{\alpha}{\Delta} = (\alpha+1)R$$
Lemma 12 tells us that such a code has a probability of error that is at least exponential in $i\, E_{ei,b}((\alpha+1)R)$. Since $i = \frac{\Delta}{\alpha}$, this translates into an error exponent of at most $\frac{E_{ei,b}((\alpha+1)R)}{\alpha}$ with delay $\Delta$.
Since this is true for all $\alpha > 0$, we have a bound on the fixed-delay reliability function $E_{ei}(R)$:
$$E_{ei}(R) \le \inf_{\alpha > 0} \frac{1}{\alpha} E_{ei,b}\bigl((\alpha+1)R\bigr)$$
The minimizing $\alpha$ indicates how much of the past ($\frac{\Delta}{\alpha}$) is involved in the dominant error event. ■
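Note that this converse expression matches the exponent $\min_{\alpha>1} \frac{E_{ei,b}(\alpha R)}{\alpha-1}$ appearing in the achievability proof of Proposition 13 under the substitution $\alpha \to \alpha + 1$. A small numerical sketch with a toy convex block exponent; the function `E_b` below is an arbitrary stand-in for $E_{ei,b}$, not the thesis's actual exponent:

```python
# Toy check that the achievability exponent min_{alpha>1} E_b(alpha*R)/(alpha-1)
# and the converse exponent inf_{alpha>0} E_b((alpha+1)*R)/alpha coincide.
R = 0.6
H = 0.5  # stand-in for the conditional entropy threshold

def E_b(r):
    """Toy block error exponent: zero up to H, quadratic above (illustrative)."""
    return max(0.0, r - H) ** 2

alphas = [1 + k / 1000 for k in range(1, 5000)]                 # alpha > 1 grid
ach = min(E_b(a * R) / (a - 1) for a in alphas)                 # achievability form
conv = min(E_b((a + 1) * R) / a for a in (x - 1 for x in alphas))  # converse form

assert abs(ach - conv) < 1e-9
print(ach)
```

For this toy `E_b` the common value is about 0.24, attained at an interior $\alpha$, which is consistent with the remark that the minimizing $\alpha$ identifies the dominant error event.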