Streaming source coding with delay
Cheng Chang
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2007-164
http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-164.html
December 19, 2007
Copyright © 2007, by the author(s). All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Streaming Source Coding with Delay
by
Cheng Chang
M.S. (University of California, Berkeley) 2004
B.E. (Tsinghua University, Beijing) 2000
A dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Engineering - Electrical Engineering and Computer Sciences
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Anant Sahai, Chair
Professor Kannan Ramchandran
Professor David Aldous
Fall 2007
The dissertation of Cheng Chang is approved.
University of California, Berkeley
Fall 2007
Streaming Source Coding with Delay
Copyright © 2007
by
Cheng Chang
Abstract
Streaming Source Coding with Delay
by
Cheng Chang
Doctor of Philosophy in Engineering - Electrical Engineering and Computer Sciences
University of California, Berkeley
Professor Anant Sahai, Chair
Traditionally, information theory is studied in the block coding context — all the source
symbols are known in advance by the encoder(s). Under this setup, the minimum number of
channel uses per source symbol needed for reliable information transmission is understood
for many cases. This is known as the capacity region for channel coding and the rate region
for source coding. The block coding error exponent (the convergence rate of the error
probability as a function of block length) is also studied in the literature.
In this thesis, we consider the problem that source symbols stream into the encoder in
real time and the decoder has to make a decision within a finite delay on a symbol-by-symbol
basis. For a finite delay constraint, a fundamental question is how many channel uses per
source symbol are needed to achieve a certain symbol error probability. This question, to
our knowledge, has never been systematically studied. We answer the source coding side of
the question by studying the asymptotic convergence rate of the symbol error probability
as a function of delay — the delay constrained error exponent.
The technical contributions of this thesis include the following. We derive an upper
bound on the delay constrained error exponent for lossless source coding and show the
achievability of this bound by using a fixed-to-variable-length coding scheme. We then
extend the same treatment to lossy source coding with delay where a tight bound on the
error exponent is derived. Both delay constrained error exponents are connected to their
block coding counterparts by a “focusing” operator. An achievability result for lossless
multiple terminal source coding is then derived in the delay constrained context. Finally,
we borrow the genie-aided feed-forward decoder argument from the channel coding literature
to derive an upper bound on the delay constrained error exponent for source coding with
decoder side-information. This upper bound is strictly smaller than the error exponent for
source coding with both encoder and decoder side-information. This “price of ignorance”
phenomenon only appears in the streaming with delay setup but not the traditional block
coding setup.
These delay constrained error exponents for streaming source coding are
generally different from their block coding counterparts. This difference has
also been recently observed in the channel coding with feedback literature.
Professor Anant Sahai
Dissertation Committee Chair
Contents
List of Figures v
List of Tables viii
Acknowledgements ix
1 Introduction 1
1.1 Sources, channels, feedback and side-information . . . . . . . . . . . . . . . 3
1.2 Information, delay and random arrivals . . . . . . . . . . . . . . . . . . . . 6
1.3 Error exponents, focusing operator and bandwidth . . . . . . . . . . . . . . 9
1.4 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Lossless source coding with delay . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Lossy source coding with delay . . . . . . . . . . . . . . . . . . . . . 13
1.4.3 Distributed lossless source coding with delay . . . . . . . . . . . . . 14
1.4.4 Lossless Source Coding with Decoder Side-Information with delay . 15
2 Lossless Source Coding 16
2.1 Problem Setup and Main Results . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Source Coding with Delay Constraints . . . . . . . . . . . . . . . . . 16
2.1.2 Main result of Chapter 2: Lossless Source Coding Error Exponent with Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Comparison of error exponents . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Non-asymptotic results: prefix-free coding of length 2 and queueing delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 First attempt: some suboptimal coding schemes . . . . . . . . . . . . . . . . 28
2.3.1 Block Source Coding’s Delay Performance . . . . . . . . . . . . . . . 28
2.3.2 Error-free Coding with Queueing Delay . . . . . . . . . . . . . . . . 29
2.3.3 Sequential Random Binning . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Proof of the Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.1 Achievability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.1 Properties of the delay constrained error exponent Es(R) . . . . . . 46
2.5.2 Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . 52
3 Lossy Source Coding 54
3.1 Lossy source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.1 Lossy source coding with delay . . . . . . . . . . . . . . . . . . . . . 55
3.1.2 Delay constrained lossy source coding error exponent . . . . . . . . . 56
3.2 A brief detour to peak distortion . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Peak distortion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.2 Rate distortion and error exponent for the peak distortion measure . 57
3.3 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4.1 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4.2 Achievability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 Discussions: why peak distortion measure? . . . . . . . . . . . . . . . . . . 63
4 Distributed Lossless Source Coding 65
4.1 Distributed source coding with delay . . . . . . . . . . . . . . . . . . . . . . 65
4.1.1 Lossless Distributed Source Coding with Delay Constraints . . . . . 66
4.1.2 Main result of Chapter 4: Achievable error exponents . . . . . . . . 69
4.2 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 Example 1: symmetric source with uniform marginals . . . . . . . . 73
4.2.2 Example 2: non-symmetric source . . . . . . . . . . . . . . . . . . . 74
4.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 ML decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.2 Universal Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Lossless Source Coding with Decoder Side-Information 90
5.1 Delay constrained source coding with decoder side-information . . . . . . . 90
5.1.1 Delay Constrained Source Coding with Side-Information . . . . . . . 92
5.1.2 Main results of Chapter 5: lower and upper bounds on the error exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Two special cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Independent side-information . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Delay constrained encryption of compressed data . . . . . . . . . . . 96
5.3 Delay constrained Source Coding with Encoder Side-Information . . . . . . 99
5.3.1 Delay constrained error exponent . . . . . . . . . . . . . . . . . . . . 100
5.3.2 Price of ignorance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Special case 1: independent side-information . . . . . . . . . . . . . 103
5.4.2 Special case 2: compression of encrypted data . . . . . . . . . . . . . 103
5.4.3 General cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.1 Proof of Theorem 6: random binning . . . . . . . . . . . . . . . . . . 108
5.5.2 Proof of Theorem 7: Feed-forward decoding . . . . . . . . . . . . . . 110
5.6 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6 Future Work 118
6.1 Past, present and future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.1 Dominant error event . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.2 How to deal with both past and future? . . . . . . . . . . . . . . . . 120
6.2 Source model and common randomness . . . . . . . . . . . . . . . . . . . . 122
6.3 Small but interesting problems . . . . . . . . . . . . . . . . . . . . . . . . . 123
Bibliography 124
A Review of fixed-length block source coding 131
A.1 Lossless Source Coding and Error Exponent . . . . . . . . . . . . . . . . . . 131
A.2 Lossy source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.2.1 Rate distortion function and error exponent for block coding under average distortion . . . . . . . . . . . . . . . . . . . . . . . 133
A.3 Block distributed source coding and error exponents . . . . . . . . . . . . . 135
A.4 Review of block source coding with side-information . . . . . . . . . . . . . 137
B Tilted Entropy Coding and its delay constrained performance 140
B.1 Tilted entropy coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2 Error Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
B.3 Achievability of delay constrained error exponent Es(R) . . . . . . . . . . . 143
C Proof of the concavity of the rate distortion function under the peak distortion measure 145
D Derivation of the upper bound on the delay constrained lossy source coding error exponent 146
E Bounding source atypicality under a distortion measure 149
F Bounding individual error events for distributed source coding 152
F.1 ML decoding: Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . 152
F.2 Universal decoding: Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . 155
G Equivalence of ML and universal error exponents and tilted distributions 158
G.1 Case 1: γH(p_{x|y}) + (1− γ)H(p_{xy}) < R(γ) < γH(p^1_{x|y}) + (1− γ)H(p^1_{xy}) . . . 160
G.2 Case 2: R(γ) ≥ γH(p^1_{x|y}) + (1− γ)H(p^1_{xy}) . . . . . . . . . . . . . . . . . . 165
G.3 Technical Lemmas on tilted distributions . . . . . . . . . . . . . . . . . . . 167
H Bounding individual error events for source coding with side-information 175
H.1 ML decoding: Proof of Lemma 13 . . . . . . . . . . . . . . . . . . . . . . . 175
H.2 Universal decoding: Proof of Lemma 14 . . . . . . . . . . . . . . . . . . . . 177
I Proof of Corollary 1 179
I.1 Proof of the lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
I.2 Proof of the upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
J Proof of Theorem 8 181
J.1 Achievability of Eei(R) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
J.2 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
List of Figures
1.1 Information theory in 50 years . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Shannon’s original problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Separation, feedback and side-information . . . . . . . . . . . . . . . . . . 4
1.4 Noiseless channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Delay constrained source coding for streaming data . . . . . . . . . . . . . . 7
1.6 Random arrivals of source symbols . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Block length vs decoding error . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8 Delay vs decoding error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.9 Focusing bound vs sphere packing bound . . . . . . . . . . . . . . . . . . . 11
2.1 Timeline of source coding with delay . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Timeline of source coding with universal delay . . . . . . . . . . . . . . . . 18
2.3 Source coding error exponents . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Ratio of error exponent with delay to block coding error exponent . . . . . 24
2.5 Streaming prefix-free coding system . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Transition graph of random walk Bk for source distribution . . . . . . . . . 26
2.7 Error probability vs delay (non-asymptotic results) . . . . . . . . . . . . . . 27
2.8 Block coding’s delay performance . . . . . . . . . . . . . . . . . . . . . . . . 28
2.9 Error free source coding for a fixed-rate system . . . . . . . . . . . . . . . . 29
2.10 Union bound of decoding error . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.11 A universal delay constrained source coding system . . . . . . . . . . . . . 41
2.12 Plot of GR(ρ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.13 Derivative of Es(R) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1 Timeline of lossy source coding with delay . . . . . . . . . . . . . . . . . . . 55
3.2 Peak distortion measure and valid reconstructions under different D . . . . 58
3.3 R−D curve under peak distortion constraint is a staircase function . . . . 60
3.4 Lossy source coding error exponent . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 A delay optimal lossy source coding system. . . . . . . . . . . . . . . . . . . 61
4.1 Slepian-Wolf source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Timeline of delay constrained source coding with side-information . . . . . 66
4.3 Rate region for the example 1 source . . . . . . . . . . . . . . . . . . . . . 73
4.4 Error exponents plot: ESW,x(Rx, Ry) plotted for Ry = 0.71 and Ry = 0.97 . 75
4.5 Rate region for the example 2 source . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Error exponents plot for source x for fixed Ry as Rx varies . . . . . . . . . 77
4.7 Error exponents plot for source y for fixed Ry as Rx varies . . . . . . . . . 78
4.8 Two dimensional plot of the error probabilities p_n(l, k), corresponding to error events (l, k), contributing to Pr[x^{n−∆}_1(n) ≠ x^{n−∆}_1] in the situation where inf_{γ∈[0,1]} E_y(R_x, R_y, ρ, γ) ≥ inf_{γ∈[0,1]} E_x(R_x, R_y, ρ, γ) . . . . . 82
4.9 Two dimensional plot of the error probabilities p_n(l, k), corresponding to error events (l, k), contributing to Pr[x^{n−∆}_1(n) ≠ x^{n−∆}_1] in the situation where inf_{γ∈[0,1]} E_y(R_x, R_y, γ) < inf_{γ∈[0,1]} E_x(R_x, R_y, γ) . . . . . . . . 83
4.10 2D interpretation of the score of a sequence pair . . . . . . . . . . . . . . . 87
5.1 Lossless source coding with decoder side-information . . . . . . . . . . . . . 91
5.2 Lossless source coding with side-information: DMC . . . . . . . . . . . . . . 91
5.3 Timeline of delay constrained source coding with side-information . . . . . 92
5.4 Encryption of compressed data with delay . . . . . . . . . . . . . . . . . . . 97
5.5 Compression of encrypted data with delay . . . . . . . . . . . . . . . . . . . 98
5.6 Information-theoretic model of compression of encrypted data . . . . . . . . 98
5.7 Lossless source coding with both encoder and decoder side-information . . . 99
5.8 Timeline of Delay constrained source coding with encoder/decoder side-information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.9 Delay constrained error exponents for source coding with independent decoder side-information . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.10 A binary stream cipher for source {ε, 1− ε} . . . . . . . . . . . . . . . 104
5.11 Delay constrained error exponents for uniform source with symmetric decoder side-information (symmetric channel) . . . . . . . . . . . . . 105
5.12 Uniform source and side-information connected by a symmetric erasure channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.13 Delay constrained error exponents for uniform source with symmetric decoder side-information (erasure channel) . . . . . . . . . . . . . . . 106
5.14 Delay constrained error exponents for general source and decoder side-information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.15 Feed-forward decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1 Past, present and future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 Dominant error event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.1 Block lossless source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.2 Block lossy source coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.3 Achievable region for Slepian-Wolf source coding . . . . . . . . . . . . . . . 138
J.1 Another FIFO variable length coding system . . . . . . . . . . . . . . . . . 182
List of Tables
1.1 Delay constrained information theory problems . . . . . . . . . . . . . . . . 8
5.1 A non-uniform source with uniform marginals . . . . . . . . . . . . . . . . 106
6.1 Type of dominant error events for different coding problems . . . . . . . . . 120
Acknowledgements
I would like to thank my thesis advisor, Professor Anant Sahai, for being a great source
of ideas and inspiration through my graduate study. His constant encouragement and
support made this thesis possible. I am truly honored to be Anant’s first doctoral student.
Professors and post-docs in the Wireless Foundations Center have been extremely helpful and inspirational throughout my graduate study. Professor Kannan Ramchandran’s lectures in EE225 turned my research interest from computer vision to fundamental problems
in signal processing and communication. Professor Venkat Anantharam introduced me to
the beauty of information theory, and Professor Michael Gastpar served on my qualification
exam committee. For more than a year Dr. Stark Draper and I worked on the problem
that built the foundations of this thesis. I appreciate the help from all of them.
I would like to thank all my colleagues in the Wireless Foundations Center for making
my life in Cory enjoyable. Particularly, my academic kid brothers Rahul Tandra, Mubaraq
Mishra, Hari Palaiyanur and Pulkit Grover shouldered my workload in group meetings.
Dan Schonberg captained our softball team for six years. My cubicle neighbors, Vinod
Prabhakaran and Jonathan Tsao, were always ready to talk to me on a wide range of
topics. I am also grateful to our wonderful EECS staff members Amy Ng and Ruth Gjerde
for making my life a lot easier.
My years at Berkeley have been a wonderful experience. I would like to thank all my
friends for all those nights out, especially Keda Wang, Po-lung Chia, Will Brett, Beryl
Huang, Jing Xiang and Jiangang Yao for being the sibling figures that I never had.
My parents have been extremely supportive through all these years; I hope I have made you
proud. Last but not least, I would like to thank the people of Harbin, China and Berkeley,
USA, for everything.
Chapter 1
Introduction
In 1948 Claude Shannon published A Mathematical Theory of Communication [72],
which “single-handedly started the field of information theory” [82]. 60 years later, what
Shannon started in that paper has evolved into what is shown in Figure 1.1. Some of the
most important aspects of information theory are illustrated in this picture such as sources,
noisy channels, universality, feedback, security, side-information, interaction between mul-
tiple agents etc. For trained eyes, it is easy to point out some classical information theory
problems such as lossless/lossy source coding [72, 74], channel coding [72], Slepian-Wolf
source coding [79], multiple-access channels [40], broadcast channels [37, 24], arbitrarily
varying channels [53], relay channels [25] and control over noisy channels [69] as subproblems of Figure 1.1. Indeed we can come up with some novel problems by just changing the
connections in Figure 1.1, for example a problem named “On the security of joint multiple
access channel/correlated source coding with partial channel information and noisy feedback” is well within the scope of that picture. And this may have already been published in the
literature.
In this thesis, we study information theory along another dimension: delay. Traditionally, the issue of delay is ignored, as in most information theory problems a message which contains all the information is known at the encoder prior to the whole communication task. Alternatively, the issue of delay is studied in a strict real-time setup such as [58]. We instead
study the intermediate regime where a finite delay for every information symbol is part of
the system design criteria. Related works are found in the networking literature on delay
and protocol [38] and recently on implications of finite delay on channel coding [67]. In
this thesis, we focus on the implications of end-to-end delay on a variety of source coding
problems.
[Figure: sources A and B feed encoders A and B into noisy (memoryless, time-varying, ...) channels, with decoders 1 and 2 delivering to sinks 1 and 2; the diagram also shows feedback, “side-information” on source B or the channel, channel information, control, correlation, and an eavesdropper]
Figure 1.1. Sources, channels, feedback, side-information and eavesdroppers
1.1 Sources, channels, feedback and side-information
Instead of looking at information theory as a big complicated system illustrated in
Figure 1.1, we focus on a discrete-time system that consists of five parts, as shown in
Figure 1.2. We study the source model and the channel model in their simplest forms.
A source is a series of independent identically distributed random variables on a finite
alphabet. A noisy channel is characterized by a probability transition matrix with finite
input and finite output alphabets. The encoder is a map from a realization of the source to
channel inputs and the decoder is a map from the channel outputs to the reconstruction of
the source.
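As a concrete sketch of these definitions (a toy simulation of my own; the alphabet sizes, source distribution and channel matrix below are assumptions for illustration, not anything prescribed by the thesis):

```python
import random

def iid_source(p, n, rng):
    """Draw n symbols iid from distribution p over the alphabet {0, ..., len(p)-1}."""
    return rng.choices(range(len(p)), weights=p, k=n)

def dmc(inputs, W, rng):
    """Pass each input symbol through a memoryless channel with transition
    matrix W, where W[x][y] = Pr(output = y | input = x)."""
    return [rng.choices(range(len(W[x])), weights=W[x], k=1)[0] for x in inputs]

rng = random.Random(0)
p = [0.7, 0.3]                    # iid source distribution (assumed)
W = [[0.9, 0.1], [0.1, 0.9]]      # binary symmetric channel, crossover 0.1 (assumed)
x = iid_source(p, 10, rng)        # a realization of the source
y = dmc(x, W, rng)                # the corresponding channel outputs
```

The encoder and decoder of Figure 1.2 would sit between these two maps, turning source realizations into channel inputs and channel outputs into a reconstruction.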
[Figure: Source → Encoder → Noisy Channel → Decoder → Reconstruction]
Figure 1.2. Shannon’s original problem in [72]
This is the original system setup in Shannon’s 1948 paper [72]. One of the most im-
portant results in that paper is the separation theorem. The separation theorem states
that one can reconstruct the source with high fidelity (defined here as lossless; the lossy version, reconstruction under a distortion measure, is discussed in Chapter 3) if and only if the capacity of the
channel is at least as large as the entropy of the source. The capacity of the channel is the
number of independent bits that can be reliably communicated across the noisy channel per
channel use. The entropy of the source is the number of bits needed on average to describe
the source. Both the entropy of the source and the capacity of the channel are defined
in terms of bits. And hence the separation theorem guarantees the validity of the digital
interface shown in Figure 1.3. In this system setup, where the channel coding and source
coding are two separate parts, the reconstruction has high fidelity as long as the entropy
of the source is less than the capacity of the channel. This theorem, however, does not
state that separate source-channel coding is optimal under every criterion. For example, the
large deviation performance of the joint source channel coding system is strictly better than
separate coding even in the block coding case [27]. However, to simplify the problem, we
study source coding and channel coding separately. In this thesis, we focus on the source
coding problems.
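As a small numeric sketch of the separation condition (the source and channel below are my own assumptions, not the thesis's): for a binary symmetric channel, the capacity has the standard closed form C = 1 − h(ε), and separation says reliable reconstruction needs at least H/C channel uses per source symbol.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in p if q > 0)

def bsc_capacity(eps):
    """Capacity in bits per channel use of a binary symmetric channel
    with crossover probability eps: C = 1 - h(eps)."""
    return 1 - entropy([eps, 1 - eps])

H = entropy([0.5, 0.25, 0.25])    # source entropy: 1.5 bits/symbol (assumed source)
C = bsc_capacity(0.1)             # capacity: about 0.53 bits/use (assumed channel)
k_min = H / C                     # minimum channel uses per source symbol
```

Here roughly three channel uses per source symbol are needed; any fixed rate above k_min permits reliable reconstruction, any rate below it does not.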
Figure 1.3 is the system we are concerned with. It illustrates the separation of source
coding and channel coding and two other important elements in information theory: channel
feedback and side-information of the source. It is well known that feedback does not increase
the channel capacity [26] for memoryless channels. However, as shown in [61, 47, 85, 70, 67,
68] and numerous other papers in the literature, channel feedback can be used to reduce
complexity and increase reliability.
Another important aspect we are going to explore is source side-information. Intuitively
speaking, knowing some information about the source for free can only help the decoder
to reconstruct the source. In the system model shown in Figure 1.3, the source is modeled as iid random variables, the noisy channel is a discrete-time memoryless channel, and the side-information is another iid random sequence which is correlated with the source in a
memoryless fashion. The problem is reduced to a standard source coding problem if both
the encoder and the decoder have the side-information. In [79], it is shown that with decoder
only side-information the source coding system can operate at the conditional entropy rate
instead of the generally higher entropy rate of the source.
[Figure: source encoder → bits → channel encoder → noisy channel → channel decoder → bits → source decoder → reconstruction, with channel feedback to the channel encoder and source side-information at the source decoder]
Figure 1.3. Separation, channel feedback and source side-information
By replacing the noisy channel in Figure 1.3 with a noiseless channel through which R
bits can be instantaneously communicated, we have a source coding system as shown in
Figure 1.4. In the main body of this thesis, Chapters 2-4, we study the model shown in
Figure 1.4.
[Figure: Source → Encoder → bits → Noiseless Channel (rate R) → Decoder → Source Reconstruction, with side-information at the decoder]
Figure 1.4. Source coding problem
1.2 Information, delay and random arrivals
A very important aspect that is missing in the model shown in Figure 1.3 is the timing
of the source symbols and the channel uses. In most information theory problems, the
source sequence is assumed to be ready for the source encoder before the beginning of compression, and hence the source encoder has full knowledge of the entire sequence.
Similarly, the channel encoder has the message represented by a binary string before the
beginning of communication and thus the channel outputs depend on every bit in that
message. This is a reasonable assumption for many applications. For example this is a
valid model for the transmission of Kong Zi (Confucius)’s books to another planet. However,
there are a wide range of problems that cannot be modeled that way. Recent results in the
interaction of information theory and control [69] show that we need to study the whole
information system in an anytime formulation. It is also shown in that paper that the
traditional notion of channel capacity is not enough. On the source coding side, modern
communication applications such as video conferencing [54, 48] require short delay between
the generation of the source and the consumption by the end user. Instead of transmitting
Kong Zi’s books, we are facing the problem of transmitting Wang Shuo [84]’s writings to
our impatient audience on another planet, while Wang Shuo only types one character per
second in his usual unpredictable “I’m your daddy” [78] manner.
In this thesis, we study delay constrained coding for streaming data, where a finite
delay is required between the realization time of a source symbol (a random variable in a
random sequence) and the consumption time of the source symbol at the sink. Without
loss of generality, the source generates one source symbol per second and within a finite
delay ∆ seconds, the decoder consumes that source symbol as shown in Figure 1.5. We
assume the source is iid. Implicitly, this model states that the reconstruction of the source
symbol at the decoder is only dependent on what the decoder receives until ∆ seconds after
the realization of that source symbol. This is in line with the notion of causality defined
in [58]. The formal setup is in Section 2.1.1. In this thesis, we freely interchange “coding
with delay” with “delay constrained coding”.
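In symbols, the setup above can be sketched as follows (the notation here is my own shorthand; the precise definitions are in Section 2.1.1):

```latex
% Sketch of the delay-constrained setup (notation assumed, not verbatim from the thesis).
% By time i + \Delta the decoder has seen the first \lfloor (i+\Delta) R \rfloor encoded
% bits, and must commit to an estimate \hat{x}_i of the symbol realized at time i:
\hat{x}_i = \hat{x}_i\!\left(b_1, \ldots, b_{\lfloor (i+\Delta) R \rfloor}\right).
% The delay constrained error exponent E_s(R) then describes the exponential decay
% of the symbol error probability in the delay \Delta:
\Pr\left[\hat{x}_i \neq x_i\right] \approx 2^{-\Delta E_s(R)}.
```

The key contrast with block coding is that the exponent is measured per unit of delay ∆, not per unit of block length.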
Another timing issue is the randomness in the arrival times of the information (source
symbols) to the encoder. As shown in Figure 1.6, Wang Shuo types a character in a random
fashion. The readers may not care about the random inter-arrivals between his typing. But
with a finite delay, a positive rate (protocol information) has to be used to encode the
timing information [38]. If the readers care about the arrival time, the encoding of the timing information might pose new challenges to the coding system if the inter-arrival time is unbounded. We only have a partial result for geometrically distributed arrival times. We are currently working on Poisson random arrivals, but neither of these results is included in this thesis. Another interesting case is when the channel is a timing channel; then Wang Shuo can send messages by properly choosing his typing times [3].

Figure 1.5. Wang Shuo types one character per second while impatient readers want to read his writings within a finite delay. The rightmost character is the first in “I’m your daddy” [78]; read from right to left.
Figure 1.6. Wang Shuo decides to type in a pattern with random inter-character time. The rightmost character is the first in “I’m your daddy” [78]; read from right to left.
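To see concretely why timing can cost rate, here is a sketch of my own (the geometric model is an assumption matching the partial result mentioned above): if a symbol arrives in each time slot independently with probability q, the inter-arrival time T is Geometric(q), and describing the arrival times losslessly costs at least H(T) = h(q)/q extra bits per source symbol.

```python
from math import log2

def binary_entropy(q):
    """Binary entropy h(q) in bits."""
    return -q * log2(q) - (1 - q) * log2(1 - q)

def geometric_entropy(q):
    """Entropy in bits of T ~ Geometric(q) on {1, 2, ...}: H(T) = h(q)/q."""
    return binary_entropy(q) / q

# With an assumed arrival probability q = 0.5 per slot, the timing
# (protocol) information costs H(T) = 2 bits per source symbol.
overhead = geometric_entropy(0.5)
```

Note that as q shrinks (rarer, burstier arrivals) this overhead per symbol grows without bound, which is why unbounded inter-arrival times pose new challenges.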
We focus on the case where source symbols come to the encoder at a constant rate (no
random arrivals). The problems of interest are summarized in Table 1.1. Each problem is
“coded” by a binary string of length six. A fundamental question is how these six elements
affect the coding problem. This is not fully understood. A brief discussion is given at
the end of the thesis. There are quite a few (≤ 2^6) interesting problems in the table.
For example, the ultimate problem in the table is 111111: “Delay constrained joint source
channel Wyner-Ziv coding with feedback and random arrivals”. To our knowledge, nobody
has published anything about this before. There are many other open questions dealing
with the timing issues of the communication system as shown in the table. We leave them
to future researchers to explore.
              Random   Non-uniform   Noisy     Feedback   Side-         Distortion
              arrival  source        channel              information
Chapter 2        0          1           0          0           0             0
Chapter 3        0          1           0          0           0             1
Chapter 5        0          1           0          0           1             0
Ongoing work     1          1           0          0           0             0
[15]             0          1           1          0           0             0
[67]             0          0           1          1           0             0
Ultimate         1          1           1          1           1             1
Table 1.1. Delay constrained information theory problems
1.3 Error exponents, focusing operator and bandwidth
From the perspective of the large deviation principle [31], an error exponent is the (normalized) logarithm of the probability of the exceptionally rare behavior of the source or channel. This error
exponent result is especially useful when the source entropy is much lower than the channel
capacity. The channel capacity and the source entropy rate characterize the behavior of the
source and channel in the regime of the central limit theorem. And hence error exponent
results always play second fiddle to channel capacity and source entropy rate. Recently, we
studied the third order problem— redundancy rate in the block coding setup [21, 19]. This
is particularly interesting when the channel capacity is close to the source entropy.
In Shannon’s original paper on information theory [72], he studied the source coding and
channel coding problems from the perspective of the minimum average number of channel
uses needed to communicate one source symbol reliably. The setting is asymptotic in nature:
long block lengths, where the number of source symbols is large. Reliable communication means
an arbitrarily small decoding error. However, it is not clear how fast the error converges to
zero as the block length grows. Later, researchers studied this convergence rate, and
it was shown that the error converges to zero exponentially fast in the block length as long
as there is some redundancy in the system, i.e. the channel capacity is strictly higher than the
entropy rate of the source. This exponent is defined as the error exponent, as shown in
Figure 1.7. For channel coding, a lower bound and an upper bound on this error exponent
were derived in some of the early works [75, 76] and [41]. A very nice upper bound derivation
appears in Gallager’s technical report [35]. The lossless source coding error exponent is
completely identified in [29]; the lossy source coding error exponent is studied in [55]. The
joint source channel coding error exponent was first studied in [27].
In the delay constrained setup of streaming source coding and channel coding problems,
we study the convergence rate of the symbol error probability. As shown in Figure 1.8,
the error probabilities go to zero exponentially fast with delay. We ask the fundamental
question: does delay play the same role in delay constrained streaming coding that block
length plays in classical fixed-length block coding?

In [67], Sahai first answered the question for streaming channel coding. It is shown
that without feedback, the delay constrained error exponent does not beat the block coding
upper bound. More importantly, it is shown that, with feedback, the delay constrained
error exponent is higher than its block coding counterpart, as illustrated in Figure 1.9. This
new exponent is termed the “focusing” bound.
Figure 1.7. Decoding error converges to zero exponentially fast with block length given system redundancy. The slope of the curve is the block coding error exponent.
In this thesis, we answer the question for source coding. For both lossless source coding
and lossy source coding with delay, we show that the delay constrained source coding error
exponents are higher than their block coding counterparts in Chapter 2 and Chapter 3
respectively. Similar to channel coding, the delay constrained error exponent and the block
coding error exponent are connected by a “focusing” operator.
Edelay(R) = inf_{α>0} (1/α) Eblock((1 + α)R)     (1.1)

where Edelay(R) and Eblock(R) are the rate-R error exponents of delay constrained and
fixed-length block coding respectively. The conceptual information-theoretic explanation of
this operator is that the coding system can borrow some resources (channel uses) from the
future to deal with the delay constrained dominant error events. This is not the case for
block coding: for fixed-length block coding, the amount of resources is fixed before the
realization of the randomness of the source or the channel.
This thesis is an information-theoretic piece of work, but we also care about applications
(non-asymptotics). Now suppose Wang Shuo’s impatient readers want to know what Wang
Shuo typed ∆ = 20 seconds ago, but they can tolerate an average decoding error of
Pe = 0.1%. A natural system design question is: what is the sufficient and necessary bandwidth
R? We give a one-line derivation here and will not further study the design issue in this
thesis. The error probability decays to zero exponentially with delay:

Pe ≈ 2^{−∆ Edelay(R)}     (1.2)
Because Edelay(R) is monotonically increasing with R, the sufficient and necessary condition
Figure 1.8. Decoding error converges to zero exponentially fast with delay given system redundancy. The slopes of the curves are the delay constrained error exponents.
for pleasing our impatient readers is thus:

R ≈ Edelay^{−1}( −(1/∆) log(Pe) )     (1.3)
An explicit lower bound on R can be trivially derived from (1.3), our recent work on the
block coding redundancy rate problems [21, 19] and a manipulation of the focusing operator
in (1.1). It is left for future researchers.
Figure 1.9. Focusing bound vs sphere packing bound for a binary erasure channel with erasure rate 0.05
1.4 Overview of the thesis
As discussed in previous sections, there is a huge number of information-theoretic problems
that we are interested in. In this thesis we focus only on source coding problems with an
end-to-end decoding delay. The technical results of the thesis are developed in Chapters 2-5.
The structure of every chapter roughly follows the same flow: problem setup → main
theorems → examples → proofs → discussion and ideas for future work. Section 2.1 serves
as the foundation of the thesis; formal definitions of streaming data and of the delay constrained
encoder and decoder are introduced there. Every chapter is mostly presented in a self-contained
fashion.
The organization of the thesis is as follows. We first study the simplest and the most
fundamental problem of all, delay constrained lossless source coding in Chapter 2. In
Chapter 3, we add a distortion measure to the coding system to explore new aspects of delay
constrained problems and to give a more general proof. Then in Chapter 4, the achievability
(a lower bound on the delay constrained error exponent) of the distributed source coding
problem is studied; no general upper bound is known yet. However, in Chapter 5, we give
an upper bound on the delay constrained error exponent for source coding with decoder
side-information, which is a special case of distributed source coding.
1.4.1 Lossless source coding with delay
In Chapter 2, we study the lossless source coding with delay problem. This is the
simplest problem of all in terms of problem setup. Some of the most important concepts
of this thesis are introduced in this chapter. There is one source that generates one source
symbol per second and the encoder can send R bits per second to the decoder. The decoder
wants to recover every source symbol within a finite delay from when the symbol enters the
encoder. We define the delay constrained error exponent Es(R) as the exponential rate at
which the decoding error decays to zero with delay. The delay constrained error exponent
is the main object we study in this thesis.
An upper bound on this error exponent is derived by a “focusing” bound argument. The
key step is to translate the symbol error with delay to the fixed length block coding error.
From there, the classical block coding error exponent [41] result can be borrowed. This
bound is shown to be achievable by implementing an optimal universal fixed-to-variable
length encoder together with a FIFO buffer. A similar scheme based on tilted distribution
coding was proposed in [49] when the encoder knows the distribution of the source. A
new proof for the tilted distribution based scheme is provided based on the large deviation
principle. We analyze the delay constrained error exponent Es(R) in Section 2.5 and show
that there are several differences between this error exponent and the classical block source
coding error exponent. For example, the derivative of the delay constrained error exponent
is positive at the entropy rate, instead of 0 as for block coding error exponents. Also, the delay
constrained error exponent approaches infinity at the logarithm of the alphabet size, which is
not the case for block coding. However, the two are connected through the focusing operator
defined in (1.1). A similar “focusing” operator was recently observed in streaming channel
coding with feedback [67].
For streaming channel coding with delay, the presence of feedback is essential for the
achievability of the “focusing” bound, but for streaming lossless source coding with delay,
no feedback is needed. In order to understand to what kinds of information-theoretic
problems the focusing operator applies, we study a more general problem in Chapter 3.
For the delay constrained point-to-point source coding problems in Chapters 2 and 3, the
focusing bound applies because the encoder has full information about the source. However,
for the distributed source coding problems in Chapters 4 and 5, the focusing operator no
longer applies because the encoders do not have full information about the source randomness.
1.4.2 Lossy source coding with delay
In Chapter 2, the reconstruction has to be exactly the same as the source or else a
decoding error occurs. In Chapter 3, we loosen the requirement for the exactness of the
reconstruction and study the delay constrained source coding problem under a distortion
measure. The source model is the same as that for lossless source coding. We define the
delay constrained lossy source coding error exponent as the exponential convergence rate of
the lossy error probability with delay, where a lossy error occurs if the distortion between
a source symbol and its reconstruction is higher than the system requirement.
Technically speaking, lossless source coding can be treated as a special case of lossy
source coding by properly defining error as a distortion violation. Hence, the results in
Chapter 3 can be treated as a natural generalization of Chapter 2. We prove that the delay
constrained error exponent and the block coding error exponent for peak distortion measure
are connected through the same focusing operator defined in (1.1). The reason is that the
rate distortion function under peak distortion is concave ∩ over the distribution of the
source. This is a property that average distortion measures do not have. The derivations
in this chapter are more general than those in Chapter 2, since we only use the concavity of the
rate distortion function over the distribution. Hence the techniques can be used in other
delay constrained coding problems, for example channel coding with feedback, where the
variable-length code is also concave ∩ in the channel behavior.
1.4.3 Distributed lossless source coding with delay
In Chapter 4, we study delay constrained distributed source coding of correlated sources.
The block coding counterpart was first studied by Slepian and Wolf in [79], where they
determined the rate region for two encoders that intend to communicate two correlated
sources to one decoder without cooperation.
In Chapter 2, we introduced a sequential random binning scheme that achieves the
random coding error exponent. This scheme is not necessary in the point-to-point case
because the random coding error exponent is much smaller than the optimal “focusing”
bound. In distributed source coding, however, sequential random binning is useful. By
using a sequential random binning scheme, we show that a positive delay constrained error
exponent can be achieved for both sources as long as the rate pair is in the interior of the
rate region determined in [79]. The delay constrained error exponents are different from the
block coding ones in [39, 29] because the two problems have different dominant error events.
Similar to fixed-length block coding, we show that both maximum-likelihood decoding
and universal decoding achieve the same error exponent. This is shown through an analysis in
Appendix G, where tilted distributions are used to bridge the two error exponents. Several
important lemmas that are used in other chapters are also proved in Appendix G. The
essence of these proofs is to use Lagrange duality to confine the candidate set of distributions
of a minimization problem to a one-dimensional exponential family.
Unlike what we observed in Chapter 2 and Chapter 3, the delay constrained error
exponent is smaller than the fixed length block coding counterpart. Is it because the
achievability scheme is suboptimal, or does this new problem have different properties than
the point to point source coding problems studied in the previous two chapters? This
question leads us to the study of the upper bound for a special case of the delay constrained
distributed source coding problem in the next chapter.
Chronologically, this is the first project on delay constrained source coding that I was
involved with. Stark Draper, then a post-doc at Cal, my advisor Anant Sahai, and I worked
on this problem from December 2004 to October 2006. Early results were summarized in [32]
and a final version has been submitted [12]. Stark, now a professor in Wisconsin, contributed
a lot to this project, especially to the universal decoding part. I appreciate his help on the
project and his introducing me to Imre Csiszar and Janos Korner’s great book [29].
1.4.4 Lossless Source Coding with Decoder Side-Information with delay
What is missing in Chapter 4 is a non-trivial upper bound on the delay constrained error
exponents. In Chapter 5, we study one special distributed source coding problem, source
coding with decoder side-information, and derive an upper bound on the error exponent.
This problem is a special case of the problem in Chapter 4 since the decoder side-information
can be treated as an encoder with rate higher than the logarithm of the size of the alphabet.
It is a well-known fact that there is a duality between channel coding and source coding
with decoder side-information [2] in the block coding setup. So it is not surprising that we
can borrow a delay constrained channel coding technique, the feed-forward decoder,
first developed by Pinsker [62] and recently clarified by Sahai [67], to solve our
problem in Chapter 5. However, the lower bound and the upper bound are in general not
the same. This leaves great space for future improvements.
We then derive the error exponent for source coding with both decoder and encoder
side-information. This delay constrained error exponent is related to the block coding error
exponent by the focusing operator in (1.1). This error exponent is generally strictly higher
than the upper bound of the delay constrained error exponent with only decoder side-
information. This is similar to the channel coding case, where the delay constrained error
exponent is higher with feedback than without feedback. This phenomenon is called the “price
of ignorance” [18] and is not observed in the block coding context. It shows that in
the delay constrained setup, the traditional compress-first-then-encrypt scheme achieves
higher reliability than the novel encrypt-first-then-compress scheme developed in [50],
although there is no difference in reliability between the two schemes in the block coding
setup. The “price of ignorance” adds to the series of observations that delay is not the same
as block length.
Chapter 2
Lossless Source Coding
In this chapter, we begin by reviewing classical results on the error exponents of lossless
block source coding. In order to understand the fundamental differences between classical
block coding and delay constrained coding, we introduce the setup of the delay constrained
lossless source coding problem and the notion of the error exponent with a delay constraint.
This setup serves as the foundation of the thesis. We then present the main result of this
chapter: the tight delay constrained error exponent for lossless source coding. Some
alternative suboptimal coding schemes are also analyzed, especially the sequential random
binning scheme, which serves as a useful tool in future chapters on distributed lossless
source coding.
2.1 Problem Setup and Main Results
Fixed-length lossless block source coding is reviewed in Section A.1 in the appendix.
We present our result in the delay constrained setup for streaming lossless source coding.
2.1.1 Source Coding with Delay Constraints
We introduce the delay constrained source coding problem in this section; the system model,
setup, and architecture issues are discussed. These are the basics of the whole thesis. (In
this thesis, we freely interchange “coding with delay” with “delay constrained coding”, and
similarly “error exponent with delay” with “delay constrained error exponent”.)
System Model, Fixed Delay and Delay Universality
Figure 2.1. Time line of delay constrained source coding: rate R = 1/2, delay ∆ = 3
As shown in Figure 2.1, rather than being known in advance, the source symbols enter
the encoder in a streaming fashion. We assume that the discrete memoryless source generates
one source symbol xi per second from a finite alphabet X, where the xi are i.i.d. from a
distribution px. Without loss of generality, assume px(x) > 0 for all x ∈ X.

The jth source symbol xj is not known at the encoder until time j; this is the fundamental
difference in the system model from the block source coding setup in Section A.1. The
coding system also commits to a rate R. Rate-R operation means that the encoder sends
1 bit to the decoder every 1/R seconds. It is shown in Proposition 1 and Proposition 2 that
we only need to study the problem when the rate R falls in the interval [H(px), log |X|].
We define the sequential encoder-decoder pair. This is the coding system we study in
the delay constrained setup.

Definition 1 A fixed-delay ∆ sequential encoder-decoder pair (E, D) is a sequence of maps
Ej, j = 1, 2, ... and Dj, j = 1, 2, .... The outputs of Ej are the outputs of the encoder E
from time j − 1 to j:

Ej : X^j −→ {0, 1}^{⌊jR⌋ − ⌊(j−1)R⌋}

Ej(x_1^j) = b_{⌊(j−1)R⌋+1}^{⌊jR⌋}

The output of the fixed-delay ∆ decoder Dj is the decoding decision on xj based on the received
binary bits up to time j + ∆:

Dj : {0, 1}^{⌊(j+∆)R⌋} −→ X

Dj(b_1^{⌊(j+∆)R⌋}) = x̂j(j + ∆)

Hence x̂j(j + ∆) is the estimate of xj at time j + ∆, and thus there is an end-to-end delay
of ∆ seconds between when xj enters the encoder and when the decoder outputs its estimate
of xj. A rate R = 1/2, fixed-delay ∆ = 3 sequential source coding system is illustrated in
Figure 2.1.
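The bit allocation in Definition 1 can be made concrete with a few lines of code: at time j a rate-R encoder emits ⌊jR⌋ − ⌊(j−1)R⌋ bits, so the cumulative number of bits after n seconds is exactly ⌊nR⌋. A minimal sketch (illustrative, not from the thesis):

```python
import math

def bits_at_step(j, R):
    """Bits emitted by a rate-R sequential encoder at time j (Definition 1):
    floor(j*R) - floor((j-1)*R)."""
    return math.floor(j*R) - math.floor((j - 1)*R)

# For R = 1/2 the encoder emits one bit every other second.
schedule = [bits_at_step(j, 0.5) for j in range(1, 7)]
```

For R = 1/2 this gives the pattern 0, 1, 0, 1, 0, 1, matching the timeline of Figure 2.1; the per-step counts telescope, so any rate is met exactly in cumulative terms.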
In this thesis, we focus our study on the fixed delay coding problem with a fixed delay
∆. Another interesting problem is the delay-universal coding problem defined as follows.
The outputs of the delay-universal decoder Dj are the decoding decisions on all the source
symbols that have arrived at the encoder by time j, based on the received binary bits up to time j:

Dj : {0, 1}^{⌊jR⌋} −→ X^j

Dj(b_1^{⌊jR⌋}) = x̂_1^j(j)

where x̂_1^j(j) is the estimate, at time j, of x_1^j, and thus the end-to-end delay of symbol xi
at time j is j − i seconds for i ≤ j. In a delay-universal scheme, the decoder emits revised
estimates for all source symbols so far. This coding system is illustrated in Figure 2.2.
Figure 2.2. Time line of delay universal source coding: rate R = 1/2
Without giving the proof, we state that all the error exponent results in this thesis for
fixed-delay problems also apply to delay universal problems.
Symbol Error Probability, Delay and Error Exponent
For a streaming source coding system, a source symbol that enters the system earlier
can enjoy a higher decoding accuracy than a symbol that enters the system later, because
the earlier source symbol can use more resources (noiseless channel uses). This
is in contrast to block coding systems, where all source symbols share the same noiseless
channel uses. Thus for delay constrained source coding, it is important to study the symbol
decoding error probability instead of the block coding error probability.
Definition 2 A delay constrained error exponent Es(R) is said to be achievable if and only
if for all ε > 0 there exists K < ∞ such that for all ∆ > 0 there exists a fixed-delay ∆
encoder-decoder pair (E, D) with, for all i,

Pr[xi ≠ x̂i(i + ∆)] ≤ K 2^{−∆(Es(R)−ε)}

Note: the order of the quantifiers in this definition is extremely important. Simplistically
speaking, a delay constrained error exponent is said to be achievable (meaning the symbol
error decays exponentially with delay with the claimed exponent for all symbols) if it is
achievable for all fixed delays.
We have the following two propositions, which state that the error exponent is only interesting
if R ∈ [H(px), log |X|]. First we show that the error exponent is only interesting if the rate
R ≤ log |X|.

Proposition 1 Es(R) is not well defined if R > log |X|.
Proof: Suppose R > log |X|. For integer N large enough, we have

2^{⌊NR−2⌋} > 2^{N log |X|} = |X|^N.

Now we construct a very simple block coding system as follows.

The encoder first queues up the N source symbols from time (k−1)N + 1 to kN, k = 1, 2, ...;
those N symbols are x_{(k−1)N+1}, ..., x_{kN}. Since 2^{⌊NR−2⌋} > |X|^N, we can use an injective
map E from X^N to {1, 2, 3, ..., 2^{⌊NR−2⌋}}. Notice that the encoder can send at least ⌊NR − 2⌋ bits
in the time interval (kN, (k + 1)N), so the encoder can send E(x_{(k−1)N+1}, ..., x_{kN}) to the
decoder within the time interval (kN, (k + 1)N). Thus the decoding error for source symbols
x_{(k−1)N+1}, ..., x_{kN} is zero at time (k + 1)N. This is true for all k = 1, 2, .... So the error
probability for any source symbol xn at time n + ∆ is zero for any ∆ ≥ 2N. Thus the error
exponent Es(R) for R > log |X| is not defined, because log 0 is not defined; less mathematically
strictly speaking, we say the error exponent is infinite for R > log |X|. □
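The counting step in the proof is easy to check numerically. The sketch below uses illustrative parameters (|X| = 3, R = 1.75 > log₂ 3 ≈ 1.585, N = 20, none of which come from the thesis): it verifies 2^⌊NR−2⌋ > |X|^N and realizes the injective map by reading a string over X as a base-|X| integer.

```python
import math
from itertools import product

X = ['A', 'B', 'C']      # |X| = 3, log2(3) ~ 1.585
R, N = 1.75, 20          # any R > log2|X| works once N is large enough

# the counting inequality from the proof: 2^{floor(NR-2)} > |X|^N
assert 2**math.floor(N*R - 2) > len(X)**N

def index(block):
    """Injective map from X^N into {0, ..., |X|^N - 1}: base-|X| expansion,
    so distinct blocks always get distinct indices."""
    i = 0
    for symbol in block:
        i = i*len(X) + X.index(symbol)
    return i

# spot-check injectivity on all length-3 blocks (length 20 is too many to enumerate)
codes = [index(b) for b in product(X, repeat=3)]
```

Since each index fits in ⌊NR − 2⌋ bits and that many bits are available per length-N interval, the block is delivered error-free one interval later, exactly as the proof describes.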
The above proposition is for the delay constrained source coding problem in this chapter,
but it should be obvious that similar results hold for the other source coding problems
discussed in future chapters. That is, if the rate of the system is above the logarithm of the
alphabet size, an almost instantaneous error-free coding is possible. This implies that the
error exponent is not well defined in that case.
Secondly, the delay constrained error exponent is zero, i.e. the error probability does
not universally converge to zero, if the rate is below the entropy rate of the source. This
result should not be surprising given that it is also true for block coding. However, for
the completeness of the thesis, we give a proof. The idea of the proof is to translate a delay
constrained problem into a block coding problem; then, by using the classical block coding
error probability result, we lower bound the symbol error probability and thus upper bound
the delay constrained error exponent.
Proposition 2 Es(R) = 0 if R < H(px).
Proof: We prove the proposition by contradiction. Suppose that for some source px the
delay constrained source coding error exponent satisfies Es(R) > 0 for some R < H(px). Then
from Definition 2 we know that for any ε > 0 there exists K < ∞ such that for all ∆ there
exists a delay constrained source coding system (E, D) with

Pr[xn ≠ x̂n(n + ∆)] ≤ K 2^{−∆(Es(R)−ε)} for all n. (2.1)

Notice that (2.1) holds for all n, ∆. We pick n and ∆ as big as needed.
Now we can design a block coding scheme of block length n. The block encoder E is derived
from the delay constrained encoders Ei; the output of the block encoder is the accumulation
of the outputs of the delay constrained encoders:

E(x_1^n) = (E1(x_1^1), E2(x_1^2), ..., E_{n+∆}(x_1^{n+∆})) = b_1^{⌊(n+∆)R⌋} (2.2)

Notice that some of the encoder outputs Ei(x_1^i) can be empty.

The decoder part is similar. The block decoder uses the outputs of all the delay
constrained decoders up to time n + ∆:

D(b_1^{⌊(n+∆)R⌋}) = (x̂1(1 + ∆), x̂2(2 + ∆), ..., x̂n(n + ∆)) = x̂_1^n (2.3)
Now we look at the block error of the block coding system built from the delay constrained
source coding system. First, we fix the ratio of ∆ to n at ∆ = αn; we will choose
a sufficiently large n and a small enough α to construct the contradiction. Writing E = Es(R),
the union bound gives

Pr[x_1^n ≠ x̂_1^n] ≤ Σ_{i=1}^{n} Pr[xi ≠ x̂i] ≤ Σ_{i=1}^{n} K 2^{−∆(E−ε)} = nK 2^{−αn(E−ε)} (2.4)
However, from the classical block source coding theorem in [72], we know that

Pr[x_1^n ≠ x̂_1^n] ≥ 0.5 (2.5)

if the effective rate of the block coding system, ((n + ∆)/n)R = (1 + α)R, is smaller than
the entropy rate of the source H(px). Since R < H(px), this only requires n to be much larger
than ∆, i.e. α small enough, so that (1 + α)R < H(px).
Now, combining (2.4) and (2.5), we have

nK 2^{−αn(E−ε)} ≥ 0.5 (2.6)

Notice that E > ε (for ε chosen small enough), K < ∞ is a constant, and α > 0 is also a
constant; hence there exists n big enough such that nK 2^{−αn(E−ε)} < 0.5. This gives the
contradiction we need. The proposition is proved. □
These two propositions tell us that the delay constrained error exponent Es(R) is only
interesting for R ∈ [H(px), log |X|].

In the proof of Proposition 2, we constructed a block coding system from a delay constrained
source coding system with some delay performance and then built the contradiction.
In this thesis, we also use this technique to prove other theorems: we borrow
classical block coding results to serve as the building blocks of the delay constrained
information theory.

The above two propositions are for the delay constrained lossless source coding problem in
this chapter, but it should be obvious that similar results hold for the other source coding
problems discussed in future chapters. That is, the delay constrained error exponents are zero,
i.e. the error probabilities do not converge to zero universally, if the rate is lower than the
relevant entropy; and if the rate is above the logarithm of the alphabet size of the source,
any exponent is achievable.
2.1.2 Main result of Chapter 2: Lossless Source Coding Error Exponent with Delay
Following the definition of the delay-constrained error exponent for lossless source coding
in Definition 2, we have the following theorem which describes the convergence rate of the
symbol-wise error probability of lossless source coding with delay problem.
Theorem 1 (Delay constrained lossless source coding error exponent) For a source x ∼ px,
the delay constrained source coding error exponent defined in Definition 2 is

Es(R) = inf_{α>0} (1/α) Es,b((α + 1)R) (2.7)

where Es,b(R) is the block source coding error exponent [29] defined in (A.4). This error
exponent Es(R) is both achievable and an upper bound; hence our result is complete.

Recall the definition of the delay constrained error exponent: for all ε > 0, there exists a finite
constant K such that for all i, ∆,

Pr[xi ≠ x̂i(i + ∆)] ≤ K 2^{−∆(Es(R)−ε)}
The result has two parts. First, it states that there exists a coding scheme such that the
error exponent Es(R) can be achieved in a universal setup; this is summarized in Proposition 5
in Section 2.4.1. Second, no coding scheme can achieve a better delay constrained
error exponent than Es(R); this is summarized in Proposition 6 in Section 2.4.2.
Before showing the proof of Theorem 1, we give some numerical results and discuss
some other coding schemes in the next two sections.
2.2 Numerical Results
In this section we evaluate the delay constrained performance for different coding
schemes via an example. Consider a simple source x with alphabet size 3, X = {A, B, C},
and the following distribution:

px(A) = a,  px(B) = (1 − a)/2,  px(C) = (1 − a)/2

where a ∈ [0, 1]; in this section we set a = 0.65.
2.2.1 Comparison of error exponents
The error exponents for both block and delay constrained source coding predict the
asymptotic performance of different source coding systems when the delay is long. We
plot the delay constrained error exponent Es(R), the block source coding error exponent
Es,b(R) and the random coding error exponent Er(R) in Figure 2.3. As shown in Theorem 1
and Theorem 9, these error exponents give the asymptotic performance of the two delay
constrained source coding systems. We also give a simple streaming prefix-free code at rate 3/2 for
this source. It will be shown that, although this prefix-free code is suboptimal, with an error
exponent much smaller than Es(3/2), its error exponent is much larger than Er(3/2), which is
equal to Es,b(3/2) in this case. We give the details of the streaming prefix-free coding system
and analyze its decoding error probability in Section 2.2.2.
Figure 2.3. Different source coding error exponents: delay constrained error exponent Es(R), block source coding error exponent Es,b(R), random coding error exponent Er(R)
In Figure 2.4, we plot the ratio of the optimal delay constrained error exponent over the
block coding error exponent. The ratio tells how many times longer the delay has to be for
the block coding system studied in Section 2.3.1 to achieve the same error probability.
The smallest ratio is around 52, at rate around 1.45.
Figure 2.4. Ratio of the delay optimal error exponent Es(R) over the block source coding error exponent Es,b(R)
2.2.2 Non-asymptotic results: prefix-free coding of length 2 and queueing delay
In this section we show a very simple coding scheme which uses a prefix-free code [26]
instead of the asymptotically optimal universal code studied in Section 2.4.1. The length-two
prefix-free code for source x that we are going to use is:
AA → 0
AB → 1000 AC → 1001 BA → 1010 BB → 1011
BC → 1100 CA → 1101 CB → 1110 CC → 1111
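This code can be checked mechanically. The sketch below (illustrative, with a = 0.65 as in this section) verifies that no codeword is a prefix of another, that the Kraft sum is exactly 1, and that the expected codeword length per two-symbol block, 1·a² + 4·(1 − a²) ≈ 2.73 bits, stays below the 3 bits available every 2 seconds, so the queue analyzed next is stable.

```python
code = {'AA': '0',
        'AB': '1000', 'AC': '1001', 'BA': '1010', 'BB': '1011',
        'BC': '1100', 'CA': '1101', 'CB': '1110', 'CC': '1111'}

# prefix-free: no codeword is a prefix of a different codeword
words = list(code.values())
prefix_free = all(not w2.startswith(w1)
                  for w1 in words for w2 in words if w1 != w2)

# Kraft sum: sum over codewords of 2^{-length}
kraft = sum(2.0**(-len(w)) for w in words)

# expected codeword length per two source symbols, for p = (a, (1-a)/2, (1-a)/2)
a = 0.65
q = a*a                      # probability of the pair AA (codeword of length 1)
expected_len = 1*q + 4*(1 - q)
```

A Kraft sum of exactly 1 means the code wastes no bits combinatorially; its suboptimality comes only from the short block length and the mismatch to the source distribution.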
This prefix-free code is obviously suboptimal: the block length is only 2, and the codeword
lengths are not adapted to the distribution. However, it will be shown that even this
obviously non-optimal code outperforms the block coding schemes in the delay constrained
setup when a is not too small. We analyze the performance of the streaming prefix-free
coding system at R = 3/2. That is, the source generates 1 symbol per second, while 3 bits can
be sent through the rate constrained channel every 2 seconds. The prefix-free stream coding
system operates like a FIFO queueing system with an infinite buffer, which is similar to the
problem studied in [49]. The prefix-free stream coding system is illustrated in Figure 2.5. Following
the prefix code defined earlier, the encoder groups two source symbols x_{2k−1}, x_{2k}
together at time 2k, k = 1, 2, ... and maps them to a prefix-free codeword; the length of the
codeword is either 1 or 4. The buffer is drained at 3 bits per 2 seconds. Write the number of
bits in the buffer at time 2k as Bk. Every two seconds, the number of bits Bk in the buffer
either goes down by 2, if x_{2k−1}x_{2k} = AA, or goes up by 1, if x_{2k−1}x_{2k} ≠ AA. 3 bits
are drained from the FIFO queue every 2 seconds; if the queue is empty, the encoder can
send random bits through the channel without causing confusion at the decoder, because
the source generates 1 source symbol per second.
Figure 2.5. Streaming prefix-free coding system (/ indicates empty queue, * indicates meaningless random bits)
Clearly B_k, k = 1, 2, ... forms a Markov chain with the following transition probabilities: B_k = B_{k−1} + 1 with probability 1 − a², and B_k = B_{k−1} − 2 with probability a², with the boundary condition B_k ≥ 0. The state transition graph is shown in Figure 2.6. For this Markov chain we can easily derive the stationary distribution (if it exists) [33]:

    μ_k = L ( (−1 + √(1 + 4(1−q)/q)) / 2 )^k    (2.8)

where q = a² and L is a normalizing constant. Notice that μ_k is a geometric series, so the stationary distribution exists as long as the ratio is less than 1, i.e. 4(1−q)/q < 8, or equivalently q > 1/3. In this example a = 0.65, so q = a² = 0.4225 > 1/3, and the stationary distribution μ_k exists.
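As a sanity check on (2.8), the geometric stationary profile can be compared against a short Monte-Carlo simulation of the buffer; this is an illustrative sketch only (the reflecting boundary at zero is modeled by flooring the queue, an assumption), not part of the original analysis:

```python
import random

# Illustrative sketch: simulate the buffer random walk B_k for a = 0.65
# (q = a^2), where B_k goes up 1 w.p. 1-q and down 2 w.p. q, floored at 0,
# and compare the empirical state ratios with the geometric ratio in (2.8).
a = 0.65
q = a * a
r = (-1 + (1 + 4 * (1 - q) / q) ** 0.5) / 2  # ratio of the geometric series

random.seed(0)
B, counts, steps = 0, {}, 500000
for _ in range(steps):
    B = B + 1 if random.random() > q else max(B - 2, 0)
    counts[B] = counts.get(B, 0) + 1

for k in range(5):
    print(k, round(counts[k + 1] / counts[k], 3), "predicted", round(r, 3))
```

The empirical ratios between neighboring state occupancies settle near the predicted ratio r ≈ 0.77.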
For the above simple prefix-free coding system, we can easily derive the probability of a decoding error for symbols x_{2k−1}, x_{2k} at time 2k + ∆ − 1; the effective delay for x_{2k−1} is thus ∆. A decoding error can only happen if, at time 2k + ∆ − 1, at least one bit of the prefix-free
Figure 2.6. Transition graph of the random walk B_k for source distribution (a, (1−a)/2, (1−a)/2); q = a² is the probability that B_k goes down by 2.
code describing x_{2k−1}, x_{2k} is still in the queue. This implies that the number of bits in the buffer at time 2k, B_k, is larger than

    ⌊(3/2)(∆ − 1)⌋ − l(x_{2k−1}, x_{2k})

where l(x_{2k−1}, x_{2k}) is the length of the prefix-free codeword for x_{2k−1}, x_{2k}; it is 1 with probability q = a² and 4 with probability 1 − q = 1 − a². Noticing that the length of the prefix-free codeword for x_{2k−1}, x_{2k} is independent of B_k, we have the following upper bound on the error probability of decoding with delay ∆ when the system is in its stationary state:
    Pr[x̂_{2k−1} ≠ x_{2k−1}] ≤ Pr[l(x_{2k−1}, x_{2k}) = 1] Pr[B_k > ⌊(3/2)(∆−1)⌋ − 1]
                               + Pr[l(x_{2k−1}, x_{2k}) = 4] Pr[B_k > ⌊(3/2)(∆−1)⌋ − 4]
     = q Σ_{j=⌊(3/2)(∆−1)⌋}^∞ μ_j + (1−q) Σ_{j=⌊(3/2)(∆−1)⌋−3}^∞ μ_j
     = G ( (−1 + √(1 + 4(1−q)/q)) / 2 )^{⌊(3/2)(∆−1)⌋ − 3}
The last line follows by substituting the stationary distribution μ_j from (2.8); G is a constant whose detailed expression we omit here. From the last line, the error exponent of this prefix-free coding system is evidently

    −(3/2) log( (−1 + √(1 + 4(1−q)/q)) / 2 )
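Plugging in the example value a = 0.65 makes the exponent concrete; the sketch below computes the geometric ratio r, the resulting exponent −(3/2) log₂ r, and, ignoring the constant G (an approximation), the delay needed for a 10⁻⁶ error probability:

```python
import math

# Numeric evaluation for the example a = 0.65; the constant G is ignored
# when extrapolating the delay, which is an approximation.
a = 0.65
q = a * a
r = (-1 + math.sqrt(1 + 4 * (1 - q) / q)) / 2
exponent = -1.5 * math.log2(r)             # delay exponent, in bits per second
delta_1e6 = 6 / (1.5 * -math.log10(r))     # delay for Pe ~ 1e-6, G ignored
print(round(exponent, 3), round(delta_1e6, 1))
```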
With the above evaluation, we now compare three different coding schemes in the non-asymptotic setup. In Figure 2.7, the error probability vs. delay curves are plotted for the three schemes. First we plot the causal random coding error probability; as shown in Figure 2.3, at R = 3/2 the random coding error exponent E_r(R) is the same as the block
coding error exponent E_{s,b}(R). The block coding curve is for a so-called simplex coding scheme, where the encoder first queues up ∆/2 symbols, encodes them into a length-(∆/2)R binary sequence, and uses the next ∆/2 seconds to transmit the message. This coding scheme gives an error exponent of E_{s,b}(R)/2. As can be seen in Figure 2.3, the slope of the simplex block coding curve is roughly half of the block source coding curve's slope.
Figure 2.7. Error probability vs. delay (non-asymptotic results) for block coding, causal random coding, and simple prefix coding.
The slopes of the curves in Figure 2.7 indicate how fast the error probability goes to zero with delay, i.e. the error exponents. Although smaller than the delay-optimal error exponent E_s(R), the simple prefix-free coding scheme has a much higher error exponent than both random coding and simplex block coding. A trivial calculation tells us that in order to get a 10⁻⁶ symbol error probability, the required delay for delay-optimal coding is ∼ 40, for causal random coding around ∼ 303, and for simplex block coding around ∼ 374. Here we run a linear regression on the data y_∆ = log₁₀ P_e(∆), x_∆ = ∆, as shown in Figure 2.3, from ∆ = 80 to ∆ = 100, and extrapolate the ∆ such that log₁₀ P_e(∆) = −6. Thus we see a major delay performance improvement for delay-optimal source coding.
2.3 First attempt: some suboptimal coding schemes
On our way to finding the optimal delay constrained performance, we first studied the classical block coding's delay constrained performance in Section 2.3.1, the random tree code discussed later in Section 2.3.3, and several other known coding systems. The coding systems in the next three subsections are shown to be suboptimal in the sense that they all achieve strictly smaller delay constrained error exponents than that in Theorem 1. This section is by no means an exhaustive study of the delay constrained performance of all possible coding systems; our intention is to look at the obvious candidates and help readers understand the key issues in delay constrained coding.
2.3.1 Block Source Coding’s Delay Performance
Figure 2.8. Block coding's delay performance: over each period of ∆/2 seconds the encoder queues ∆/2 source symbols, and during the next ∆/2 seconds it sends the resulting ∆R/2 bits over the rate-R bit stream.
We analyze the delay constrained performance of traditional block source coding systems. As shown in Figure 2.8, the encoder first encodes ∆/2 source symbols, collected over a period of ∆/2 seconds, into a block of binary bits. During the next ∆/2 seconds, the encoder uses all of the bandwidth to transmit these bits through the rate-R noiseless channel. After all the bits are received by the decoder at time ∆, the effective coding system is a length-∆/2, rate-R block source coding system. Thus the error probability of each symbol is lower bounded by 2^{−(∆/2) E_{s,b}(R)}, as shown in Theorem 9. The delay constrained error exponent for source symbol x_1 is

    E = −(1/∆) log 2^{−(∆/2) E_{s,b}(R)} = E_{s,b}(R)/2    (2.9)
This is half of the block coding error exponent. Another problem with this coding scheme is that the encoder is delay-specific, since ∆ is a parameter of the infrastructure, while the optimal coding scheme shown in Section 2.4.1 is not.
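For concreteness, the halved exponent can be evaluated numerically. The sketch below assumes the Gallager-style form E_{s,b}(R) = sup_{ρ≥0} [ρR − (1+ρ) log₂ Σ_x p_x(x)^{1/(1+ρ)}] and uses the example source (0.65, 0.175, 0.175) at R = 3/2 from this chapter; the specific grid search is illustrative only:

```python
import math

# Grid-search evaluation of the block exponent (an assumed Gallager form)
# for the example source of this chapter at R = 3/2.
p = [0.65, 0.175, 0.175]
R = 1.5

def gallager(rho):
    return rho * R - (1 + rho) * math.log2(sum(x ** (1 / (1 + rho)) for x in p))

E_sb = max(gallager(i / 1000.0) for i in range(5001))  # rho in [0, 5]
print("E_sb(R) ~", round(E_sb, 4), " block scheme exponent ~", round(E_sb / 2, 4))
```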
2.3.2 Error-free Coding with Queueing Delay
For both the block coding in the previous section and the sequential random coding in Section 2.3.3, the coding system commits to some nonzero error probability for symbol x_i at time i + ∆, for every pair (i, ∆), as long as the rate R is smaller than log |X|. The next two coding schemes are different in the sense that the error probability for any source symbol x_i is exactly zero after some finite, sequence-dependent delay ∆_i(x_1^∞). These encoding schemes have two parts: the first part is a zero-error fixed-to-variable or variable-to-variable length source encoder; the second part is a FIFO (first in, first out) buffer. The queue in the buffer is drained at a constant rate of R bits per second. If the queue is empty, the encoder buffer simply sends arbitrary bits through the noiseless channel. The decoder, knowing the rate of the source, simply discards the arbitrary gibberish bits added by the encoder buffer, and its task is then only to decode the source symbols from the received bits that describe the source stream. This coding scheme makes no decoding error on x_i as long as the bits that describe x_i are all received by the decoder. Thus a symbol error occurs only if too many bits are used to describe the source block that x_i is in and some previous source symbols. This class of coding schemes is illustrated in Figure 2.9. The error-free code can be either a variable-to-fixed length code or a variable-to-variable length code.
Figure 2.9. Error-free source coding for a fixed-rate system: streaming data passes through a causal error-free source code and a FIFO encoder buffer into the rate-R bit stream to the decoder.
Prefix-free Entropy Coding with Queueing Delay
Entropy coding [26] is a simple source coding scheme in which the code length of a source symbol is roughly the negative logarithm of the probability of that source symbol. For example, the Huffman code and the arithmetic code are two entropy codes that are thoroughly studied in [26]. An important property of entropy coding is that the average code length per source symbol approaches the entropy rate H(p_x) of the source if the code block is long enough; the code length is independent of the rate R. We argue that the simple prefix-free entropy coding with queueing delay scheme is not always optimal, by the following simple example. Consider a source X = {a, b, c, d} with distribution p_x(a) = 0.91, p_x(b) = p_x(c) = p_x(d) = 0.03, and a system of rate R = 2. Obviously the best coding scheme here is not to compress at all: just send the uncompressed 2-bit binary representations of the source symbols across the rate R = 2 system. The system then achieves zero error for every source symbol x_i instantaneously. Since the rate R matches the alphabet size log |X| exactly, this is, philosophically speaking, in the same spirit as Michael Gastpar's PhD thesis [43], in which the Gaussian source is matched to the Gaussian channel. The delay constrained error exponent is thus infinite, or, strictly speaking, not well defined. One optimal prefix-free code for this source is E(a) = 0, E(b) = 10, E(c) = 110, E(d) = 111, and a coding system using this prefix-free code, or entropy codes of any block length, clearly does not achieve zero error probability at any delay ∆ for source symbol x_i when i is large enough: an error occurs if the source symbols before i are atypical, for example a long string of c's. This case is thoroughly analyzed in Section 2.2.2.
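The queue dynamics behind this failure are easy to see numerically. In the sketch below, the backlog is floored at zero when the queue empties (an assumption about the idle behavior); a typical run of a's leaves the buffer empty, while a run of c's grows the backlog by one bit per symbol:

```python
# Backlog of the FIFO queue for the prefix-free code above at R = 2 bits
# per symbol; the floor at zero models the encoder idling when empty.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
R = 2

def backlog(symbols):
    B = 0
    for s in symbols:
        B = max(B + len(code[s]) - R, 0)
    return B

print(backlog("a" * 50), backlog("c" * 50))
```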
We show in Section 2.4.1 that a simple universal coding with queueing delay scheme is
indeed optimal. This scheme is inspired by Jelinek’s non-universal coding scheme in [49].
We call Jelinek’s scheme tilted entropy coding because the new codes are entropy codes for
a tilted distribution generated from the original distribution. We give a simplified proof of
Jelinek’s result in Appendix B using modern large deviation techniques.
LZ78 Coding with Queueing Delay
In this subsection, we argue that the standard LZ78 coding is not suitable for the delay constrained setup, due to the nature of its dictionary construction.

In their seminal paper [88], Lempel and Ziv proposed a dictionary-based sequential universal lossless data compression algorithm. This coding scheme is causal in nature.
However, the standard LZ78 coding scheme described in [88], and later popularized by Welch in [83], does not achieve any positive delay constrained error exponent. The reason the seemingly causal LZW coding scheme fails to achieve any positive delay constrained error exponent is rooted in the incremental nature of the dictionary. If the dictionary is infinite, as originally designed by Lempel and Ziv, then for a memoryless source and a fixed delay ∆, every word of length 2∆ will be in the dictionary with high probability at sufficiently large time t. Thus the encoding delay at the LZW encoder is at least 2∆ for any source symbol x_i, i > t. According to the definition of the delay constrained error exponent in Definition 2, the achievable error exponent is 0 for such a Lempel-Ziv coding system.

On the other hand, with a finite dictionary, the dictionary mismatches the source statistics if the early source symbols are atypical. This mismatch occurs with positive probability, and the encoding delay for later source symbols is then large, so again no positive delay constrained error exponent can be achieved. Note that we only discuss the popular LZ78 algorithm here; there are numerous other forms of universal Lempel-Ziv coding, the first one in [87]. The delay constrained performance of such coding schemes is left for future study.
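The growing-phrase effect is easy to demonstrate with a minimal LZ78-style incremental parse (a simplified sketch, not the exact coder of [88]): for an iid source, the dictionary phrases, and hence the encoder's wait before each phrase index can be emitted, keep growing:

```python
import random

# Minimal LZ78-style incremental parse: each new phrase is the shortest
# string not yet in the dictionary; return the sequence of phrase lengths.
def lz78_phrase_lengths(seq):
    dictionary, lengths = {}, []
    phrase = ""
    for ch in seq:
        phrase += ch
        if phrase not in dictionary:
            dictionary[phrase] = len(dictionary) + 1
            lengths.append(len(phrase))
            phrase = ""
    return lengths

random.seed(1)
seq = "".join(random.choice("ABC") for _ in range(20000))
lengths = lz78_phrase_lengths(seq)
# Late phrases are much longer on average than early ones.
print(sum(lengths[:100]) / 100, sum(lengths[-100:]) / 100)
```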
2.3.3 Sequential Random Binning
In [12] and [32], we proposed a sequential random binning scheme for delay constrained source coding systems. It is the source-coding counterpart of the tree and convolutional codes used for channel coding [34]. This sequential random binning scheme follows our definition of a delay constrained source coding system in Definition 1, with additional common randomness accessible to both the encoder(s) and the decoder. The encoder is universal in nature, in that it is the same for any source distribution p_x, while the decoder can be an ML decoder or a universal decoder, both discussed in detail later. We define the sequential random binning scheme as follows.
Definition 3 A randomized sequential encoder-decoder pair (a random binning scheme) (E, D) is a sequence of maps E_j, j = 1, 2, ... and D_j, j = 1, 2, .... The output of E_j is the output of the encoder E from time j − 1 to time j:

    E_j : X^j → {0, 1}^{⌊jR⌋−⌊(j−1)R⌋}

    E_j(x_1^j) = b_{⌊(j−1)R⌋+1}^{⌊jR⌋}

The output of the fixed-delay-∆ decoder D_j is the decoding decision on x_j based on the received binary bits up to time j + ∆:

    D_j : {0, 1}^{⌊(j+∆)R⌋} → X

    D_j(b_1^{⌊(j+∆)R⌋}) = x̂_j(j + ∆)
where x̂_j(j + ∆) is the estimate of x_j at time j + ∆, which thus has an end-to-end delay of ∆ seconds. Common randomness, shared between the encoder(s) and the decoder(s), is assumed. Access to common randomness is a standard assumption in the information theory community; common randomness can be generated from correlated data [63]. This is not the main topic of this thesis, in which we simply assume enough common randomness; interested readers may consult [56, 1]. The common randomness allows us to randomize the mappings independently of the source sequence. In this thesis, we only need pairwise independence. Formally, for all i, n and all x̃_{i+1}^n with x̃_{i+1} ≠ x_{i+1}:

    Pr[ E(x_1^i x_{i+1}^n) = E(x_1^i x̃_{i+1}^n) ] = 2^{−(⌊nR⌋−⌊iR⌋)} ≤ 2 × 2^{−(n−i)R}    (2.10)
(2.10) deserves a closer look. The number of output bits of the encoder for source symbols from time i to time n is ⌊nR⌋ − ⌊iR⌋, so (2.10) says that the probability that two length-n source strings which diverge after time i share the same binary representation is 2^{−(⌊nR⌋−⌊iR⌋)}. We define bins as follows: a bin is the set of source strings that share the same binary representation,

    B_s(x_1^n) = { x̃_1^n ∈ X^n : E(x̃_1^n) = E(x_1^n) }    (2.11)

With the notion of bins, (2.10) is equivalent to the following equality for x̃_{i+1} ≠ x_{i+1}:

    Pr[ x_1^i x̃_{i+1}^n ∈ B_s(x_1^i x_{i+1}^n) ] = 2^{−(⌊nR⌋−⌊iR⌋)} ≤ 2 × 2^{−(n−i)R}    (2.12)
In this thesis, the sequential random encoder always works by assigning random parity bits in a causal fashion to the observed source sequence; that is, the bits generated at each time in Definition 3 are iid Bernoulli(1/2) random variables. Since parity bits are assigned causally, if two source sequences x_1^n and x̃_1^n share the same length-l prefix, i.e. x_1^l = x̃_1^l, then their first ⌊lR⌋ parity bits must match. Subsequent parity bits are drawn independently.
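One concrete, purely hypothetical way to realize such causal binning is to derive each parity bit from shared randomness keyed by the prefix observed so far; the keyed hash below merely stands in for genuinely pairwise-independent common randomness (an assumption of this sketch):

```python
import hashlib

# Hypothetical causal random binning at R = 1 bit per symbol: bit j is a
# pseudo-random function of the length-j prefix, so sequences sharing a
# prefix share exactly the corresponding leading parity bits.
def parity_bits(seq, key="shared-randomness"):
    return [hashlib.sha256((key + seq[:j]).encode()).digest()[0] & 1
            for j in range(1, len(seq) + 1)]

x, y = "ABACA", "ABABB"   # agree on the first 3 symbols only
print(parity_bits(x)[:3] == parity_bits(y)[:3])  # shared prefix -> shared bits
```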
At the decoder side, the decoder receives the binary parity checks b_1^{⌊nR⌋} at time n, and hence knows the bin that the source sequence x_1^n lies in. The decoder then has to pick one source sequence out of the bin B_s(x_1^n) as the estimate of x_1^n. If the decoder knows the source distribution p_x, it can simply pick the sequence in the bin with the maximum likelihood; otherwise, in what we call the universal case, the decoder can use a minimum empirical entropy decoding rule. We first summarize the delay constrained error exponent results for both ML decoding and universal decoding, beginning with maximum-likelihood decoding, where the decoder knows the source distribution p_x.
Proposition 3 Given a rate R > H(p_x), there exists a randomized streaming encoder and maximum likelihood decoder pair (per Definition 3) and a finite constant K > 0, such that Pr[x̂_i(i + ∆) ≠ x_i] ≤ K 2^{−∆ E_{ML}(R)} for all i, ∆ ≥ 0, or equivalently for all n ≥ ∆ ≥ 0,

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ K 2^{−∆ E_{ML}(R)}    (2.13)

where

    E_{ML}(R) = sup_{ρ∈[0,1]} [ ρR − (1 + ρ) log( Σ_x p_x(x)^{1/(1+ρ)} ) ]    (2.14)
Second, if the decoder does not know the distribution, it can use the minimum empirical entropy rule: pick the sequence with the smallest empirical entropy inside the received bin. More interestingly, the decoder may prefer not to use its knowledge of the source distribution. The benefit of doing so comes from uncertainty about the source distribution: for example, a source distribution (0.9, 0.051, 0.049) may be mistaken for (0.9, 0.049, 0.051), and an ML decoder based on the wrong distribution may then decode incorrectly. We summarize the universal coding result in the following proposition.
Proposition 4 Given a rate R > H(p_x), there exists a randomized streaming encoder and universal decoder pair (per Definition 3) such that for all ε > 0 there exists a finite K > 0 such that Pr[x̂_i(i + ∆) ≠ x_i] ≤ K 2^{−∆(E_{UN}(R)−ε)} for all i, ∆ ≥ 0, where

    E_{UN}(R) = inf_q { D(q‖p_x) + |R − H(q)|⁺ }    (2.15)

where q is an arbitrary probability distribution on X and |z|⁺ = max{0, z}.
Remark: The error exponents of Propositions 3 and 4 both equal random block-coding
exponents shown in (A.5).
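E_{UN}(R) in (2.15) can be approximated by brute force. The sketch below grids over distributions q on a ternary alphabet, using the example source (0.65, 0.175, 0.175) and R = 3/2 from Section 2.2 as an assumed test case:

```python
import math

# Brute-force grid approximation of E_UN(R) = inf_q [D(q||p) + |R - H(q)|^+]
# for a ternary source; grid resolution 1/200 on each probability.
p = [0.65, 0.175, 0.175]
R = 1.5

def H(q):
    return -sum(x * math.log2(x) for x in q if x > 0)

def D(q, ref):
    return sum(x * math.log2(x / y) for x, y in zip(q, ref) if x > 0)

best = float("inf")
n = 200
for i in range(n + 1):
    for j in range(n + 1 - i):
        q = [i / n, j / n, (n - i - j) / n]
        best = min(best, D(q, p) + max(R - H(q), 0.0))
print("E_UN(R) ~", round(best, 4))
```

The result matches the value of E_{ML}(R) from (2.14) for this source, consistent with the remark above.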
Proof of Proposition 3: ML decoding

To show Propositions 3 and 4, we first develop the common core of the proof in the context of ML decoding. First, we describe the ML decoding rule.
ML decoding rule:
Denote by x̂_1^n(n) the estimate of the source sequence x_1^n at time n:

    x̂_1^n(n) = arg max_{x̃_1^n ∈ B_s(x_1^n)} p_x(x̃_1^n) = arg max_{x̃_1^n ∈ B_s(x_1^n)} ∏_{i=1}^n p_x(x̃_i)    (2.16)

The ML decoding rule in (2.16) is very simple: at time n, the decoder picks the sequence x̂_1^n(n) with the highest likelihood among those in the same bin as the true sequence x_1^n. The estimate of source symbol x_{n−∆} is then simply the (n − ∆)th symbol of x̂_1^n(n), denoted by x̂_{n−∆}(n).
Details of the proof:
The proof strategy is as follows. For a decoding error to occur, there must be some false source sequence x̃_1^n that satisfies three conditions: (i) it must be in the same bin as (share the same parities as) x_1^n, i.e. x̃_1^n ∈ B_s(x_1^n); (ii) it must be at least as likely as the true sequence, i.e. p_x(x̃_1^n) ≥ p_x(x_1^n); and (iii) x̃_l ≠ x_l for some l ≤ n − ∆.
The error probability can be union bounded as follows, as illustrated in Figure 2.10:

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆}]
     = Σ_{x_1^n} Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆} | x_1^n] p_x(x_1^n)    (2.17)
     = Σ_{x_1^n} Σ_{l=1}^{n−∆} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n)    (2.18)
     = Σ_{l=1}^{n−∆} Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n)
     = Σ_{l=1}^{n−∆} p_n(l).    (2.19)
After conditioning on the realized source sequence in (2.17), the remaining randomness is
only in the binning. In (2.18) we decompose the error event into a number of mutually
Figure 2.10. The decoding error probability at n − ∆ can be union bounded by the sum of the probabilities of a first decoding error at l, 1 ≤ l ≤ n − ∆. The dominant error event p_n(n − ∆) is the one in the highlighted oval (shortest delay).
exclusive events by partitioning all source sequences x̃_1^n into sets F_n(l, x_1^n), defined by the time l of the first symbol in which they differ from the realized source x_1^n:

    F_n(l, x_1^n) = { x̃_1^n ∈ X^n : x̃_1^{l−1} = x_1^{l−1}, x̃_l ≠ x_l },    (2.20)

and define F_n(n + 1, x_1^n) = {x_1^n}. Finally, in (2.19) we define

    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n).    (2.21)
We now upper bound p_n(l) using a Chernoff bound argument similar to [39].

Lemma 1 p_n(l) ≤ 2 × 2^{−(n−l+1) E_{ML}(R)}.
Proof:

    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)] p_x(x_1^n)
     ≤ Σ_{x_1^n} min[1, Σ_{x̃_1^n ∈ F_n(l,x_1^n) s.t. p_x(x̃_1^n) ≥ p_x(x_1^n)} Pr[x̃_1^n ∈ B_s(x_1^n)]] p_x(x_1^n)    (2.22)
     ≤ Σ_{x_1^{l−1}, x_l^n} min[1, Σ_{x̃_l^n s.t. p_x(x̃_l^n) ≥ p_x(x_l^n)} 2 × 2^{−(n−l+1)R}] p_x(x_1^{l−1}) p_x(x_l^n)    (2.23)
     ≤ 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n s.t. p_x(x̃_l^n) ≥ p_x(x_l^n)} 2^{−(n−l+1)R}] p_x(x_l^n)
     = 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n} 1[p_x(x̃_l^n) ≥ p_x(x_l^n)] 2^{−(n−l+1)R}] p_x(x_l^n)    (2.24)
     ≤ 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n} min[1, p_x(x̃_l^n)/p_x(x_l^n)] 2^{−(n−l+1)R}] p_x(x_l^n)
     ≤ 2 × Σ_{x_l^n} [ Σ_{x̃_l^n} (p_x(x̃_l^n)/p_x(x_l^n))^{1/(1+ρ)} 2^{−(n−l+1)R} ]^ρ p_x(x_l^n)    (2.25)
     = 2 × Σ_{x_l^n} p_x(x_l^n)^{1/(1+ρ)} [ Σ_{x̃_l^n} p_x(x̃_l^n)^{1/(1+ρ)} ]^ρ 2^{−(n−l+1)ρR}
     = 2 × [ Σ_x p_x(x)^{1/(1+ρ)} ]^{(n−l+1)} [ Σ_x p_x(x)^{1/(1+ρ)} ]^{(n−l+1)ρ} 2^{−(n−l+1)ρR}    (2.26)
     = 2 × [ Σ_x p_x(x)^{1/(1+ρ)} ]^{(n−l+1)(1+ρ)} 2^{−(n−l+1)ρR}
     = 2 × 2^{−(n−l+1)[ρR − (1+ρ) log(Σ_x p_x(x)^{1/(1+ρ)})]}    (2.27)
In (2.22) we apply the union bound. In (2.23) we use the fact that after the first symbol in which two sequences differ, the remaining parity bits are independent, and the fact from (2.12) that only the likelihoods of the differing suffixes matter: if x̃_1^{l−1} = x_1^{l−1}, then p_x(x̃_1^n) ≥ p_x(x_1^n) if and only if p_x(x̃_l^n) ≥ p_x(x_l^n). In (2.24), 1(·) is the indicator function, taking the value one if the argument is true and zero if it is false. We get (2.25) by limiting ρ to the range 0 ≤ ρ ≤ 1, since the arguments of the minimizations are both positive and upper bounded by one. We use the iid property of the source, exchanging sums and products, to get (2.26). The bound in (2.27) holds for all ρ in the range 0 ≤ ρ ≤ 1. Maximizing (2.27) over ρ gives p_n(l) ≤ 2 × 2^{−(n−l+1) E_{ML}(R)}, where E_{ML}(R) is defined in (2.14) of Proposition 3. □
Using Lemma 1 in (2.19) gives

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ 2 × Σ_{l=1}^{n−∆} 2^{−(n−l+1) E_{ML}(R)}    (2.28)
     = Σ_{l=1}^{n−∆} 2 × 2^{−(n−l+1−∆) E_{ML}(R)} 2^{−∆ E_{ML}(R)}
     ≤ K_0 2^{−∆ E_{ML}(R)}    (2.29)

In (2.29) we pull out the exponent in ∆; the remaining summation is a sum of decaying exponentials and can thus be bounded by a constant K_0. This proves Proposition 3.
Proof of Proposition 4: Universal decoding
We use the union bound on the symbol-wise error probability introduced in (2.19), but
with minimum empirical entropy, rather than maximum-likelihood, decoding.
Universal decoding rule:

    x̂_l(n) = w[l]_l, where w[l]_1^n = arg min_{x̃_1^n ∈ B_s(x_1^n) s.t. x̃_1^{l−1} = x̂_1^{l−1}(n)} H(x̃_l^n).    (2.30)
We term this a sequential minimum empirical-entropy decoder. The reason for using this
decoder instead of the standard minimum block-entropy decoder is that the block-entropy
decoder has a polynomial term in n (resulting from summing over the type classes) that
multiplies the exponential decay in ∆. For n large, this polynomial can dominate. Using
the sequential minimum empirical-entropy decoder results in a polynomial term in ∆.
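The statistic minimized in (2.30) is just the entropy of the empirical type of the disputed suffix; the small helper below (hypothetical, not from the text) makes this concrete:

```python
import math
from collections import Counter

# Empirical entropy H(x_l^n) of a suffix: the entropy of its empirical type.
def empirical_entropy(seq):
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

# A constant suffix has the smallest possible empirical entropy, so the
# sequential decoder favors such "typical-looking" low-entropy suffixes.
print(empirical_entropy("AAAA"), empirical_entropy("ABAB"))
```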
Details of the proof: With this decoder, errors can only occur if there is some sequence x̃_1^n such that (i) x̃_1^n ∈ B_s(x_1^n); (ii) x̃_1^{l−1} = x_1^{l−1} and x̃_l ≠ x_l, for some l ≤ n − ∆; and (iii) the empirical entropy of x̃_l^n satisfies H(x̃_l^n) ≤ H(x_l^n). Building on the common core of the achievability argument (2.17)–(2.19), with universal decoding in place of maximum likelihood, we obtain the following definition of p_n(l) (cf. (2.21)):

    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. H(x̃_l^n) ≤ H(x_l^n)] p_x(x_1^n)    (2.31)
The following lemma gives a bound on p_n(l).

Lemma 2 For sequential minimum empirical entropy decoding,

    p_n(l) ≤ 2 × (n − l + 2)^{2|X|} 2^{−(n−l+1) E_{UN}(R)}.
Proof: We define P^{n−l} to be the type of the length-(n − l + 1) sequence x_l^n, and T_{P^{n−l}} to be the corresponding type class, so that x_l^n ∈ T_{P^{n−l}}. Analogous definitions hold for P̃^{n−l} and x̃_l^n. We rewrite the constraint H(x̃_l^n) ≤ H(x_l^n) as H(P̃^{n−l}) ≤ H(P^{n−l}). Thus,
    p_n(l) = Σ_{x_1^n} Pr[∃ x̃_1^n ∈ B_s(x_1^n) ∩ F_n(l, x_1^n) s.t. H(x̃_l^n) ≤ H(x_l^n)] p_x(x_1^n)
     ≤ Σ_{x_1^n} min[1, Σ_{x̃_1^n ∈ F_n(l, x_1^n) s.t. H(x̃_l^n) ≤ H(x_l^n)} Pr[x̃_1^n ∈ B_s(x_1^n)]] p_x(x_1^n)
     ≤ Σ_{x_1^{l−1}, x_l^n} min[1, Σ_{x̃_l^n s.t. H(x̃_l^n) ≤ H(x_l^n)} 2 × 2^{−(n−l+1)R}] p_x(x_1^{l−1}) p_x(x_l^n)    (2.32)
     ≤ 2 × Σ_{x_l^n} min[1, Σ_{x̃_l^n s.t. H(x̃_l^n) ≤ H(x_l^n)} 2^{−(n−l+1)R}] p_x(x_l^n)    (2.33)
     = 2 × Σ_{P^{n−l}} Σ_{x_l^n ∈ T_{P^{n−l}}} min[1, Σ_{P̃^{n−l} s.t. H(P̃^{n−l}) ≤ H(P^{n−l})} Σ_{x̃_l^n ∈ T_{P̃^{n−l}}} 2^{−(n−l+1)R}] p_x(x_l^n)    (2.34)
     ≤ 2 × Σ_{P^{n−l}} Σ_{x_l^n ∈ T_{P^{n−l}}} min[1, (n − l + 2)^{|X|} 2^{−(n−l+1)[R − H(P^{n−l})]}] p_x(x_l^n)    (2.35)
     ≤ 2 × (n − l + 2)^{|X|} Σ_{P^{n−l}} Σ_{x_l^n ∈ T_{P^{n−l}}} 2^{−(n−l+1)|R − H(P^{n−l})|⁺} 2^{−(n−l+1)[D(P^{n−l}‖p_x) + H(P^{n−l})]}    (2.36)
     ≤ 2 × (n − l + 2)^{|X|} Σ_{P^{n−l}} 2^{−(n−l+1) inf_q [D(q‖p_x) + |R − H(q)|⁺]}    (2.37)
     ≤ 2 × (n − l + 2)^{2|X|} 2^{−(n−l+1) E_{UN}(R)}    (2.38)
To show (2.32), we use the bound in (2.12). In going from (2.34) to (2.35), first note that the argument of the innermost summation (over x̃_l^n) does not depend on x̃_l^n itself. We then use the following relations: (i) Σ_{x̃_l^n ∈ T_{P̃^{n−l}}} 1 = |T_{P̃^{n−l}}| ≤ 2^{(n−l+1)H(P̃^{n−l})}, a standard bound on the size of a type class [29]; (ii) H(P̃^{n−l}) ≤ H(P^{n−l}), by the sequential minimum empirical entropy decoding rule; and (iii) the polynomial bound on the number of types [28], which is at most (n − l + 2)^{|X|}. In (2.36) we recall the definition |·|⁺ = max{0, ·}; we pull the polynomial term out of the minimization and use p_x(x_l^n) = 2^{−(n−l+1)[D(P^{n−l}‖p_x) + H(P^{n−l})]} for all x_l^n ∈ T_{P^{n−l}}. It is also in (2.36) that we see why we use a sequential minimum empirical entropy decoding rule instead of a block minimum entropy decoding rule: had we not marginalized out x_1^{l−1} in (2.33), we would have a polynomial term in n rather than n − l out front, which for large n could dominate the exponential decay in n − l. As the expression in (2.37) no longer depends on x_l^n, we simplify by using |T_{P^{n−l}}| ≤ 2^{(n−l+1)H(P^{n−l})}. In (2.38) we use the definition of the universal error exponent E_{UN}(R) from (2.15) of Proposition 4, and again the polynomial bound on the number of types. The remaining steps are straightforward. □
Lemma 2 and Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} p_n(l) imply that

    Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} (n − l + 2)^{2|X|} 2^{−(n−l+1) E_{UN}(R)}
     ≤ Σ_{l=1}^{n−∆} K_1 2^{−(n−l+1)[E_{UN}(R) − ε]}    (2.39)
     ≤ K 2^{−∆[E_{UN}(R) − ε]}    (2.40)

In (2.39) we absorb the polynomial into the exponent: for all a > 0 and b > 0, there exists a C such that z^a ≤ C 2^{b(z−1)} for all z ≥ 1. We then make the delay-dependent term explicit. Pulling out the exponent in ∆, the remaining summation is a sum of decaying exponentials and can be bounded by a constant; together with K_1, this gives the constant K in (2.40). This proves Proposition 4. Note that the ε in (2.40) does not enter the optimization, because ε > 0 can be chosen equal to any constant; the choice of ε only affects the constant K in Proposition 4.
Discussions on sequential random binning
We have proposed a sequential random binning scheme, summarized in Propositions 3 and 4, which achieves the random block coding error exponent defined in (A.5). Although this error exponent is strictly suboptimal, as shown in Proposition 8 in Section 2.5, the scheme provides a very useful tool for deriving achievability results for delay constrained coding (both source coding and channel coding), as will be seen in later chapters where delay constrained source coding with decoder side-information and distributed source coding problems are discussed. Philosophically speaking, the essence of sequential random binning is to leave the encoder's uncertainty about the source to the future, and to let future binning reduce that uncertainty. This is in contrast to the optimal scheme shown in Section 2.4.1, where the encoder queues up the fixed-to-variable code, so that for a particular source symbol and delay the uncertainty lies in the past.
2.4 Proof of the Main Results
In this section we derive the delay constrained source coding error exponent of Theorem 1. We present a simple universal coding scheme that achieves the error exponent defined in Theorem 1. For the converse, we use a focusing-bound type of argument that parallels the analysis of the delay constrained error exponent for channel coding with feedback in [67].

Following the notion of the delay constrained source coding error exponent E_s(R) in Definition 2, Theorem 1 states that

    E_s(R) = inf_{α>0} (1/α) E_{s,b}((α + 1)R)

In Section 2.4.1 we first show the achievability of E_s(R) by a simple fixed-to-variable length universal code combined with a FIFO queue. Then in Section 2.4.2 we show that E_s(R) is indeed an upper bound on the delay constrained error exponent. This error exponent was first derived in [49] in a slightly different setup, using a non-universal coding scheme (the encoder needs to know the source distribution p_x and the rate R); we give a simple proof of Jelinek's result using large deviation theory in Appendix B.
2.4.1 Achievability
In this section, we introduce a universal coding scheme that achieves the delay constrained error exponent of Theorem 1. The coding scheme depends only on the size of the source alphabet, not on the source distribution. We first describe the scheme. The basic idea is to encode a long block of source symbols into a variable-length binary string: the string first describes the type of the source block, and then indexes the particular source block within its type class by a one-to-one map. This is in line with the Minimum Description Length principle [66, 4]. Source blocks with the same type have codewords of the same length.
Optimal Universal Coding
A block length N is chosen that is much smaller than the target end-to-end delays, while still being large enough; this finite block length N will be absorbed into the finite constant K. For a discrete memoryless source and large block length N, the variable-length code
Figure 2.11. A universal delay constrained source coding system: streaming data passes through an optimal fixed-to-variable length encoder and a FIFO encoder buffer into the rate-R bit stream to the decoder.
consists of two stages: first describing the type of the block ⃗x_i³ using O(|X| log N) bits, and then describing which particular realization has occurred using a variable NH(⃗x_i) bits, where H(⃗x_i) is the empirical entropy of the block. The overhead O(|X| log N) is asymptotically negligible and the code is universal in nature. It is easy to verify that the average code length satisfies lim_{N→∞} E_{p_x}[l(⃗x)]/N = H(p_x). This code is obviously a prefix-free code. Writing l(⃗x_i) for the length of the codeword for ⃗x_i, we have

    NH(⃗x_i) ≤ l(⃗x_i) ≤ |X| log(N + 1) + NH(⃗x_i)    (2.41)

The binary sequence describing the source is fed into the FIFO buffer illustrated in Figure 2.11. Notice that if the buffer is empty, the output of the buffer can be gibberish binary bits; the decoder simply discards these meaningless bits because it is aware that the buffer is empty.
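A concrete two-stage length computation consistent with (2.41) might look as follows (a hypothetical implementation using exact type-class indexing; the real code need only be one-to-one within each type class):

```python
import math
from collections import Counter

# Hypothetical two-stage code length: ceil(|X| log2(N+1)) bits for the type,
# then ceil(log2 |T_P|) bits to index the block within its type class.
def two_stage_length(block, alphabet):
    N = len(block)
    counts = Counter(block)
    type_bits = math.ceil(len(alphabet) * math.log2(N + 1))
    size = math.factorial(N)              # |T_P| = N! / prod_x(n_x!)
    for x in alphabet:
        size //= math.factorial(counts[x])
    index_bits = math.ceil(math.log2(size)) if size > 1 else 0
    return type_bits + index_bits

print(two_stage_length("ABACABAAAB", "ABC"))  # lies between N*H(type) and the (2.41) bound
```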
Proposition 5 For an iid source ∼ p_x using the universal delay constrained code described above, for all ε > 0 there exists K < ∞ such that for all t, ∆:

    Pr[⃗x_t ≠ ⃗x̂_t((t + ∆)N)] ≤ K 2^{−∆N(E_s(R)−ε)}

where ⃗x̂_t((t + ∆)N) is the estimate of ⃗x_t at time (t + ∆)N. Before the proof, we give the following lemma to bound the probabilities of atypical source behavior.

Lemma 3 (Source atypicality) For all ε > 0 and block length N large enough, there exists K < ∞ such that for all n, if r < log |X|:

    K 2^{−nN(E_{s,b}(r)+ε)} ≤ Pr[ Σ_{i=1}^n l(⃗x_i) > nNr ] ≤ K 2^{−nN(E_{s,b}(r)−ε)}    (2.42)

³ ⃗x_i is the ith block of length N.
Proof: We only need to show the case r > H(p_x). By Cramér's theorem [31], for all ε₁ > 0 there exists K₁ such that

    Pr[ Σ_{i=1}^n l(⃗x_i) > nNr ] = Pr[ (1/n) Σ_{i=1}^n l(⃗x_i) > Nr ] ≤ K₁ 2^{−n(inf_{z>Nr} I(z) − ε₁)}    (2.43)

and

    Pr[ Σ_{i=1}^n l(⃗x_i) > nNr ] = Pr[ (1/n) Σ_{i=1}^n l(⃗x_i) > Nr ] ≥ K₁ 2^{−n(inf_{z>Nr} I(z) + ε₁)}    (2.44)

where the rate function I(z) is [31]

    I(z) = sup_{ρ∈R} [ ρz − log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) ]    (2.45)

Here we use this equivalent (K, ε) notation instead of the limits superior and inferior of Cramér's theorem [31].
Write I(z, ρ) = ρz − log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ), so I(z, 0) = 0. For z > Nr > NH(p_x) and large N:

    ∂I(z, ρ)/∂ρ |_{ρ=0} = z − Σ_{⃗x∈X^N} p_x(⃗x) l(⃗x) ≥ 0

By Hölder's inequality, for all ρ₁, ρ₂ and all θ ∈ (0, 1):

    ( Σ_i p_i 2^{ρ₁ l_i} )^θ ( Σ_i p_i 2^{ρ₂ l_i} )^{1−θ} ≥ Σ_i (p_i^θ 2^{θρ₁ l_i})(p_i^{1−θ} 2^{(1−θ)ρ₂ l_i}) = Σ_i p_i 2^{(θρ₁+(1−θ)ρ₂) l_i}

This shows that log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) is convex (∪) in ρ, so I(z, ρ) is concave (∩) in ρ for fixed z. Then for all z > 0 and all ρ < 0 we have I(z, ρ) < 0, which means that the ρ maximizing I(z, ρ) is positive. This implies that I(z) is monotonically increasing in z, and I(z) is obviously continuous. Thus inf_{z>Nr} I(z) = I(Nr).
For ρ ≥ 0, using the upper bound on l(⃗x) in (2.41):

    log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) ≤ log( Σ_{p∈T^N} 2^{−ND(p‖p_x)} 2^{ρ(|X| log(N+1) + NH(p))} )
     ≤ log( (N+1)^{|X|} 2^{−N min_p {D(p‖p_x) − ρH(p)} + ρ|X| log(N+1)} )
     = N( −min_p {D(p‖p_x) − ρH(p)} + ε_N )

where ε_N = (1+ρ)|X| log(N+1)/N goes to 0 as N goes to infinity.

For ρ ≥ 0, using the lower bound on l(⃗x) in (2.41):

    log( Σ_{⃗x∈X^N} p_x(⃗x) 2^{ρ l(⃗x)} ) ≥ log( Σ_{p∈T^N} 2^{−ND(p‖p_x) − |X| log(N+1)} 2^{ρNH(p)} )
     ≥ log( 2^{−N min_{p∈T^N} {D(p‖p_x) − ρH(p)} − |X| log(N+1)} )
     = N( −min_p {D(p‖p_x) − ρH(p)} − ε′_N )

where ε′_N = |X| log(N+1)/N − min_p {D(p‖p_x) − ρH(p)} + min_{p∈T^N} {D(p‖p_x) − ρH(p)} goes to 0 as N goes to infinity, and T^N is the set of all types of length-N sequences.
Substitute the above inequalities to I(Nr) defined in (2.45):
I(Nr) ≥ N(supρ>0
minp
ρ(r −H(p)) + D(p‖px) − εN
)(2.46)
And
I(Nr) ≤ N(supρ>0
minp
ρ(r −H(p)) + D(p‖px)+ ε′N)
(2.47)
First fix $\rho$. By a simple Lagrange multiplier argument, among distributions with fixed entropy $H(p)$, the distribution $p$ minimizing $D(p\|p_x)$ is a tilted distribution $p_x^{\alpha}$. It can be verified that $\frac{\partial H(p_x^{\alpha})}{\partial\alpha} \ge 0$ and $\frac{\partial D(p_x^{\alpha}\|p_x)}{\partial\alpha} = \alpha\frac{\partial H(p_x^{\alpha})}{\partial\alpha}$. Thus the distribution minimizing $D(p\|p_x)-\rho H(p)$ is $p_x^{\rho}$. Using some algebra, we have
$$D(p_x^{\rho}\|p_x)-\rho H(p_x^{\rho}) = -(1+\rho)\log\sum_{x\in\mathcal{X}} p_x(x)^{\frac{1}{1+\rho}}$$
Substituting this into (2.46) and (2.47) respectively:
$$I(Nr) \ge N\Big(\sup_{\rho>0}\Big\{\rho r-(1+\rho)\log\sum_{x\in\mathcal{X}} p_x(x)^{\frac{1}{1+\rho}}\Big\} - \epsilon_N\Big) = N\big(E_{s,b}(r)-\epsilon_N\big) \qquad (2.48)$$
The last equality can again be proved by a simple Lagrange multiplier argument. Similarly:
$$I(Nr) \le N\big(E_{s,b}(r)+\epsilon'_N\big) \qquad (2.49)$$
Substituting (2.48) and (2.49) into (2.43) and (2.44) respectively, and letting $\epsilon_1$ be small enough and $N$ big enough so that $\epsilon_N$ and $\epsilon'_N$ are small enough, we get the desired bound in (2.42). $\square$
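The two characterizations equated above — the Gallager-style form $E_{s,b}(r) = \sup_{\rho>0}\{\rho r - (1+\rho)\log_2\sum_x p_x(x)^{1/(1+\rho)}\}$ and the Csiszár-style form $\min\{D(q\|p_x) : H(q)\ge r\}$ — can be cross-checked numerically. The sketch below is illustrative only: the binary source $p=(0.8,0.2)$, the rate $r=0.9$, and the grid searches are assumptions for the example, not values from the thesis.

```python
import math

def esb_gallager(r, p, grid=4000, rho_max=40.0):
    # E_{s,b}(r) = sup_{rho > 0} [ rho*r - (1+rho) * log2( sum_x p(x)^(1/(1+rho)) ) ]
    best = 0.0
    for i in range(1, grid + 1):
        rho = rho_max * i / grid
        e0 = (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))
        best = max(best, rho * r - e0)
    return best

def esb_csiszar(r, p0, grid=200000):
    # E_{s,b}(r) = min { D(q || p) : H(q) >= r }, binary case, by grid search over q
    def h2(q):
        return -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    def d2(q):
        return q * math.log2(q / p0) + (1 - q) * math.log2((1 - q) / (1 - p0))
    return min(d2(i / grid) for i in range(1, grid) if h2(i / grid) >= r)

p = [0.8, 0.2]   # illustrative binary source, H(p) ~ 0.722 bits
r = 0.9          # a rate above the entropy rate
```

The two evaluations agree to within grid resolution, which is a numerical check of the Lagrange-duality step in the proof.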
Now we are ready to prove Proposition 5.
Proof: We give an upper bound on the probability of decoding error on $\vec{x}_t$ at time $(t+\Delta)N$. At time $(t+\Delta)N$, the decoder cannot decode $\vec{x}_t$ with zero error probability iff the binary strings describing $\vec{x}_t$ are not all out of the buffer. Since the encoding buffer is FIFO, this means that the number of outgoing bits from some time $t_1$ to $(t+\Delta)N$ is less than the number of bits in the buffer at time $t_1$ plus the number of incoming bits from time $t_1$ to time $tN$. Suppose the buffer was last empty at time $tN-nN$, where $0\le n\le t$. Given this condition, a decoding error occurs only if $\sum_{i=0}^{n-1} l(\vec{x}_{t-i}) > (n+\Delta)NR$. Write $l_{\max}$ for the longest code length, $l_{\max} \le |\mathcal{X}|\log(N+1)+N\log|\mathcal{X}|$. Then $\Pr[\sum_{i=0}^{n-1} l(\vec{x}_{t-i}) > (n+\Delta)NR] > 0$ only if $n > \frac{(n+\Delta)NR}{l_{\max}} > \frac{\Delta NR}{l_{\max}} \triangleq \beta\Delta$. So
$$\Pr[\vec{x}_t \neq \widehat{\vec{x}}_t((t+\Delta)N)] \le \sum_{n=\beta\Delta}^{t} \Pr\Big[\sum_{i=0}^{n-1} l(\vec{x}_{t-i}) > (n+\Delta)NR\Big] \qquad (2.50)$$
$$\stackrel{(a)}{\le} \sum_{n=\beta\Delta}^{t} K_1 2^{-nN\big(E_{s,b}\big(\frac{(n+\Delta)NR}{nN}\big)-\epsilon_1\big)}$$
$$\stackrel{(b)}{\le} \sum_{n=\gamma\Delta}^{\infty} K_2 2^{-nN(E_{s,b}(R)-\epsilon_2)} + \sum_{n=\beta\Delta}^{\gamma\Delta} K_2 2^{-\Delta N\big(\min_{\alpha>0}\frac{E_{s,b}((1+\alpha)R)}{\alpha}-\epsilon_2\big)}$$
$$\stackrel{(c)}{\le} K_3 2^{-\gamma\Delta N(E_{s,b}(R)-\epsilon_2)} + |\gamma\Delta-\beta\Delta|\, K_3 2^{-\Delta N(E_s(R)-\epsilon_2)}$$
$$\stackrel{(d)}{\le} K 2^{-\Delta N(E_s(R)-\epsilon)}$$
where the $K_i$'s and $\epsilon_i$'s are properly chosen positive real numbers. (a) is true because of Lemma 3. Define $\gamma = \frac{E_s(R)}{E_{s,b}(R)}$; in the first part of (b), we only need the fact that $E_{s,b}(R)$ is non-decreasing in $R$. In the second part of (b), we write $\alpha = \frac{\Delta}{n}$ and take the $\alpha$ that minimizes the error exponent. The first term of (c) comes from the sum of a geometric series. The second term of (c) is by the definition of $E_s(R)$. (d) is by the definition of $\gamma$. $\blacksquare$
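The queueing mechanism in this proof — a decoding deadline is missed exactly when the FIFO backlog of variable-length descriptions has not drained within $\Delta$ block periods — can be illustrated by simulation. The sketch below is a toy fluid model with an idealized type-based code length; the parameters ($p=0.9$, $N=16$, $R=0.8$) are illustrative assumptions, not values from the thesis.

```python
import math
import random

def code_length(block, N):
    # idealized universal code length: type overhead + N * empirical entropy (bits)
    q = sum(block) / N
    h = 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    return math.ceil(math.log2(N + 1) + N * h)

def delay_error_profile(p=0.9, N=16, R=0.8, T=20000, delays=(1, 2, 4, 8), seed=0):
    rng = random.Random(seed)
    drain = N * R      # bits leaving the FIFO buffer per block period (fluid model)
    backlog = 0.0
    finish = []        # time, in block periods, at which each block's bits fully drain
    for t in range(T):
        block = [1 if rng.random() < p else 0 for _ in range(N)]
        backlog += code_length(block, N)
        finish.append(t + backlog / drain)
        backlog = max(0.0, backlog - drain)
    # fraction of blocks whose description had not fully drained by deadline t + d
    return {d: sum(f > t + d for t, f in enumerate(finish)) / T for d in delays}
```

On a typical run the miss frequency falls off rapidly as the allowed delay grows — the qualitative behavior that Proposition 5 sharpens into an exact exponent.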
2.4.2 Converse
To bound the best possible error exponent with fixed delay, we consider a block coding
encoder/decoder pair constructed by the delay constrained encoder/decoder pair and trans-
late the block-coding bounds of [29] to the fixed delay context. The argument is analogous
to the “focusing bound” derivation in [67] for channel coding with feedback.
Proposition 6 For fixed-rate encodings of discrete memoryless sources, it is not possible to achieve an error exponent with fixed delay higher than
$$\inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R) \qquad (2.51)$$
From the definition of the delay constrained source coding error exponent in Definition 2, the statement of this proposition is equivalent to the following mathematical statement: for any $E > \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R)$, there exists a positive real value $\epsilon$, such that for any $K<\infty$, there exist $i>0$ and $\Delta>0$ with
$$\Pr[x_i \neq \widehat{x}_i(i+\Delta)] > K2^{-\Delta(E-\epsilon)}$$
Proof: We show the proposition by contradiction. Suppose that the delay-constrained error exponent can be higher than $\inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R)$. Then according to Definition 2, there exists a delay-constrained source coding system such that for some $E > \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R)$ and for any $\epsilon>0$, there exists $K<\infty$ such that for all $i>0$, $\Delta>0$:
$$\Pr[x_i \neq \widehat{x}_i(i+\Delta)] \le K2^{-\Delta(E-\epsilon)}$$
So we choose some $\epsilon>0$ such that
$$E-\epsilon > \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R) \qquad (2.52)$$
Then consider a block coding scheme $(\mathcal{E},\mathcal{D})$ built on the delay constrained source coding system. The encoder of the block coding system is the same as the delay-constrained source encoder, and the block decoder $\mathcal{D}$ works as follows:
$$\widehat{x}_1^i = (\widehat{x}_1(1+\Delta),\, \widehat{x}_2(2+\Delta),\, \ldots,\, \widehat{x}_i(i+\Delta))$$
Now the block decoding error of this coding system can be upper bounded as follows, for any $i>0$ and $\Delta>0$:
$$\Pr[\widehat{x}_1^i \neq x_1^i] \le \sum_{t=1}^{i}\Pr[\widehat{x}_t(t+\Delta) \neq x_t] \le \sum_{t=1}^{i} K2^{-\Delta(E-\epsilon)} = iK2^{-\Delta(E-\epsilon)} \qquad (2.53)$$
The block coding scheme $(\mathcal{E},\mathcal{D})$ is a block source coding system for $i$ source symbols using $\lfloor R(i+\Delta)\rfloor$ bits, and hence has rate $\frac{\lfloor R(i+\Delta)\rfloor}{i} \le \frac{i+\Delta}{i}R$. From the classical block coding result in Theorem 9, we know that the source coding error exponent $E_{s,b}(R)$ is monotonically increasing and continuous in $R$, so $E_{s,b}\big(\frac{\lfloor R(i+\Delta)\rfloor}{i}\big) \le E_{s,b}\big(\frac{R(i+\Delta)}{i}\big)$. Again from Theorem 9, we know that the block coding error probability can be bounded in the following way:
$$\Pr[\widehat{x}_1^i \neq x_1^i] > 2^{-i\big(E_{s,b}\big(\frac{R(i+\Delta)}{i}\big)+\epsilon_i\big)} \qquad (2.54)$$
where $\lim_{i\to\infty}\epsilon_i = 0$.
Combining (2.53) and (2.54), we have:
$$2^{-i\big(E_{s,b}\big(\frac{R(i+\Delta)}{i}\big)+\epsilon_i\big)} < iK2^{-\Delta(E-\epsilon)}$$
Now let $\alpha = \frac{\Delta}{i}$, $\alpha>0$; then the above inequality becomes $2^{-i(E_{s,b}(R(1+\alpha))+\epsilon_i)} < 2^{-i\alpha(E-\epsilon-\theta_i)}$, and hence:
$$E-\epsilon < \frac{1}{\alpha}\big(E_{s,b}(R(1+\alpha))+\epsilon_i\big)+\theta_i \qquad (2.55)$$
where $\lim_{i\to\infty}\theta_i = 0$. Now the above inequality is true for all $i$ and all $\alpha>0$; since $\lim_{i\to\infty}\theta_i = 0$ and $\lim_{i\to\infty}\epsilon_i = 0$, taking all of this into account, we have:
$$E-\epsilon \le \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}(R(1+\alpha)) \qquad (2.56)$$
Now (2.56) contradicts the assumption in (2.52), thus the proposition is proved. $\blacksquare$
Combining Proposition 5 and Proposition 6, we establish the desired results summarized in Theorem 1. For the source coding with delay constraints problem, we are thus able to give an exact characterization of the rate at which the symbol-wise error probability decays. This error exponent is in general larger than the block coding error exponent, as shown in Figure 2.3.
2.5 Discussions
We first investigate some of the properties of the delay constrained source coding error exponent $E_s(R)$ defined in (2.7). Then we conclude this chapter with some ideas for future work.
2.5.1 Properties of the delay constrained error exponent Es(R)
The source coding with delay error exponent can be parameterized by a single real number $\rho$, in a way that parallels the delay constrained channel coding with feedback error
Figure 2.12. $G_R(\rho)$: a concave $\cap$ function of $\rho$ with $G_R(0)=0$, maximum value $E_{s,b}(R)$ attained at $\rho_R$, and unique positive root $\rho^*$.
exponents for symmetric channels derived in [67]. The parametrization of $E_s(R)$ makes evaluating $E_s(R)$ easier and potentially enables us to study other properties of $E_s(R)$. We summarize the result in the following proposition, first shown in [20].
Proposition 7 Parametrization of $E_s(R)$:
$$E_s(R) = E_0(\rho^*) = (1+\rho^*)\log\Big[\sum_{x\in\mathcal{X}} p_x(x)^{\frac{1}{1+\rho^*}}\Big] \qquad (2.57)$$
where $\rho^*$ satisfies the condition $R = \frac{E_0(\rho^*)}{\rho^*} = \frac{(1+\rho^*)\log[\sum_{x\in\mathcal{X}} p_x(x)^{1/(1+\rho^*)}]}{\rho^*}$, i.e. $\rho^* R = E_0(\rho^*)$.
Proof: For simplicity of notation, define $G_R(\rho) = \rho R - E_0(\rho)$; thus $G_R(\rho^*) = 0$ by the definition of $\rho^*$, and $\max_\rho G_R(\rho) = E_{s,b}(R)$ by the definition of the block coding error exponent $E_{s,b}(R)$ in Theorem 9. As shown in Lemma 31 and Lemma 22 in Appendix G:
$$\frac{dG_R(\rho)}{d\rho} = R - H(p_x^{\rho}), \qquad \frac{d^2G_R(\rho)}{d\rho^2} = -\frac{dH(p_x^{\rho})}{d\rho} \le 0$$
Furthermore, $\lim_{\rho\to\infty}\frac{1}{\rho}G_R(\rho) = R - \log|\mathcal{X}| < 0$, thus $G_R(\infty) < 0$. Obviously $G_R(0) = 0$. So we know that $G_R(\rho)$ is a concave $\cap$ function with a unique root for $\rho>0$, as illustrated in Figure 2.12. Let $\rho_R \ge 0$ be the $\rho$ that maximizes $G_R(\rho)$; by the concavity of $G_R(\rho)$, this is equivalent to $R - H(p_x^{\rho_R}) = 0$. By concavity, we know that $\rho_R < \rho^*$.
Write:
$$F_R(\alpha,\rho) = \frac{1}{\alpha}\big(\rho(\alpha+1)R - E_0(\rho)\big) = \rho R + \frac{G_R(\rho)}{\alpha}$$
Then $E_s(R) = \inf_{\alpha>0}\sup_{\rho\ge0} F_R(\alpha,\rho)$. And for any $\alpha\in\big(0, \frac{\log|\mathcal{X}|}{R}-1\big)$:
$$\frac{\partial F_R(\alpha,\rho)}{\partial\rho} = \frac{(\alpha+1)R - H(p_x^{\rho})}{\alpha}, \qquad \frac{\partial^2 F_R(\alpha,\rho)}{\partial\rho^2} = -\frac{1}{\alpha}\frac{dH(p_x^{\rho})}{d\rho} \le 0$$
$$F_R(\alpha,0) = 0, \qquad F_R(\alpha,\infty) < 0 \qquad (2.58)$$
So for any $\alpha\in\big(0,\frac{\log|\mathcal{X}|}{R}-1\big)$, let $\rho(\alpha)$ be such that $F_R(\alpha,\rho(\alpha))$ is maximized; $\rho(\alpha)$ is thus the unique solution to:
$$(\alpha+1)R - H(p_x^{\rho(\alpha)}) = 0$$
Define $\alpha^* = \frac{H(p_x^{\rho^*})}{R} - 1$, i.e. $(\alpha^*+1)R - H(p_x^{\rho^*}) = 0$, which implies that
$$\alpha^* = \frac{H(p_x^{\rho^*})}{R} - 1 < \frac{H(p_x^{\infty})}{R} - 1 = \frac{\log|\mathcal{X}|}{R} - 1$$
$$\alpha^* = \frac{H(p_x^{\rho^*})}{R} - 1 > \frac{H(p_x^{\rho_R})}{R} - 1 = 0$$
Now we have established that $\alpha^*\in\big(0,\frac{\log|\mathcal{X}|}{R}-1\big)$; by the definition of $\alpha^*$, $\rho^*$ maximizes $F_R(\alpha^*,\rho)$ over all $\rho$. From the above analysis:
$$E_s(R) = \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((\alpha+1)R) = \inf_{\alpha>0}\sup_{\rho\ge0}\frac{1}{\alpha}\big(\rho(\alpha+1)R - E_0(\rho)\big) = \inf_{\alpha>0}\sup_{\rho\ge0}F_R(\alpha,\rho)$$
$$\le \sup_{\rho\ge0}F_R(\alpha^*,\rho) = F_R(\alpha^*,\rho^*) = \rho^* R \qquad (2.59)$$
The last step is from the definition of $F_R(\alpha,\rho)$ and the definition of $\rho^*$.
On the other hand, since $G_R(\rho^*) = 0$ implies $F_R(\alpha,\rho^*) = \rho^* R + \frac{G_R(\rho^*)}{\alpha} = \rho^* R$ for every $\alpha>0$:
$$E_s(R) = \inf_{\alpha>0}\sup_{\rho\ge0}F_R(\alpha,\rho) \ge \sup_{\rho\ge0}\inf_{\alpha>0}F_R(\alpha,\rho) \ge \inf_{\alpha>0}F_R(\alpha,\rho^*) = F_R(\alpha^*,\rho^*) = \rho^* R \qquad (2.60)$$
Combining (2.59) and (2.60), we derive the desired parametrization of $E_s(R)$. Here we have actually also proved that $(\alpha^*,\rho^*)$ is a saddle point of the function $F_R(\alpha,\rho)$. $\square$
Our next proposition shows that for any rate within the meaningful rate region $(H(p_x), \log|\mathcal{X}|)$, the delay constrained error exponent is strictly larger than the block coding error exponent.
Proposition 8 Comparison of $E_s(R)$ and $E_{s,b}(R)$:
$$\forall R\in(H(p_x),\log|\mathcal{X}|): \quad E_s(R) > E_{s,b}(R)$$
Proof: From Proposition 7, we have $E_s(R) = \rho^* R$. Also from Proposition 7, we know that $\rho_R < \rho^*$, where the block coding error exponent is $E_{s,b}(R) = \rho_R R - E_0(\rho_R)$. By noticing that $E_0(\rho)\ge0$ for all $\rho\ge0$, we have:
$$E_s(R) = \rho^* R > \rho_R R \ge \rho_R R - E_0(\rho_R) = E_{s,b}(R)$$
$\square$
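Propositions 7 and 8 are easy to check numerically: solve $\rho^* R = E_0(\rho^*)$ by bisection on the concave function $G_R(\rho) = \rho R - E_0(\rho)$, then compare $E_s(R) = \rho^* R$ with $E_{s,b}(R) = \max_\rho G_R(\rho)$. The sketch below is illustrative; the Bernoulli source $p=(0.8,0.2)$, the rate $R=0.9$, and the bisection bracket and grid are assumptions for the example.

```python
import math

def E0(rho, p):
    # E0(rho) = (1 + rho) * log2( sum_x p(x)^(1/(1+rho)) )
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def rho_star(R, p, hi=200.0, iters=200):
    # G_R(rho) = rho*R - E0(rho) is concave with G_R(0) = 0 and has a unique
    # positive root rho* when H(p) < R < log2|X|; bisect on the sign change.
    g = lambda rho: rho * R - E0(rho, p)
    lo = 1e-9                # G_R > 0 just above 0, since R > H(p)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def es_block(R, p, grid=100000, rho_max=100.0):
    # E_{s,b}(R) = max_rho [ rho*R - E0(rho) ], by a dense grid (sketch only)
    return max(rho_max * i / grid * R - E0(rho_max * i / grid, p)
               for i in range(1, grid + 1))

p = [0.8, 0.2]          # illustrative source, H(p) ~ 0.722 bits
R = 0.9                 # H(p) < R < log2|X| = 1
rs = rho_star(R, p)
Es = rs * R             # delay constrained exponent (Proposition 7)
Esb = es_block(R, p)    # block coding exponent
```

For this example the delay constrained exponent comes out around 1.9 while the block exponent is only about 0.05, a stark instance of Proposition 8's strict inequality.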
The source coding error exponent $E_{s,b}(R)$ has a zero right derivative at the entropy rate of the source, $R = H(p_x)$; this is a simple corollary of Lemma 23 in Appendix G. The delay constrained source coding error exponent differs from its block coding counterpart, as shown in Figure 2.3: the next proposition shows that its right derivative at the entropy rate of the source is positive.
Proposition 9 Right derivative of $E_s(R)$ at $R = H(p_x)$:
$$\lim_{R\to H(p_x)^+}\frac{dE_s(R)}{dR} = \frac{2H(p_x)}{\sum_x p_x(x)[\log(p_x(x))]^2 - H(p_x)^2}$$
Proof: Proposition 7 shows that the delay constrained error exponent $E_s(R)$ can be parameterized by a real number $\rho$ as $E_s(R) = E_0(\rho)$ and $R(\rho) = \frac{E_0(\rho)}{\rho}$; in particular, $R(\rho) = H(p_x)$ corresponds to $\rho = 0$. By Lemma 23 in Appendix G, we have
$$\frac{d}{d\rho}\frac{E_0(\rho)}{\rho} = \frac{\rho\frac{dE_0(\rho)}{d\rho} - E_0(\rho)}{\rho^2} = \frac{1}{\rho^2}\big(\rho H(p_x^{\rho}) - E_0(\rho)\big) = \frac{1}{\rho^2}D(p_x^{\rho}\|p_x) \qquad (2.61)$$
$$\frac{dE_0(\rho)}{d\rho} = H(p_x^{\rho}) \qquad (2.62)$$
$$\frac{dD(p_x^{\rho}\|p_x)}{d\rho} = \rho\frac{dH(p_x^{\rho})}{d\rho} \qquad (2.63)$$
Now we can follow Gallager's argument on pages 143–144 of [41], noticing that both $E_0(\rho)$ and $\frac{E_0(\rho)}{\rho}$ are positive for $\rho\in(0,+\infty)$. Combining (2.61) and (2.62):
$$\frac{dE_s(R)}{dR} = \frac{dE_0(\rho)/d\rho}{dR(\rho)/d\rho} = \frac{\rho^2 H(p_x^{\rho})}{D(p_x^{\rho}\|p_x)} \qquad (2.64)$$
From (2.64), we get the right derivative of $E_s(R)$ at $R = H(p_x)$:
$$\lim_{R\to H(p_x)^+}\frac{dE_s(R)}{dR} = \lim_{\rho\to0^+}\frac{\rho^2 H(p_x^{\rho})}{D(p_x^{\rho}\|p_x)} = \lim_{\rho\to0^+}\frac{2\rho H(p_x^{\rho}) + \rho^2\frac{dH(p_x^{\rho})}{d\rho}}{\rho\frac{dH(p_x^{\rho})}{d\rho}} \qquad (2.65)$$
$$= \lim_{\rho\to0^+}\frac{2H(p_x^{\rho})}{\frac{dH(p_x^{\rho})}{d\rho}} = \frac{2H(p_x)}{\sum_x p_x(x)[\log(p_x(x))]^2 - H(p_x)^2} \qquad (2.66)$$
(2.65) is true because of L'Hôpital's rule [86]; (2.66) follows by simple algebra. By the Cauchy–Schwarz inequality, $\sum_x p_x(x)[\log(p_x(x))]^2 - H(p_x)^2 > 0$ unless $p_x$ is uniform. Thus the right derivative of $E_s(R)$ at $R = H(p_x)$ is strictly positive in general. $\square$
In Figure 2.13, we plot the positive right derivative of $E_s(R)$ at $R = H(p_x)$ for Bernoulli($\alpha$) sources with $\alpha\in(0, 0.25)$.
Figure 2.13. $\lim_{R\to H(p_x)^+}\frac{dE_s(R)}{dR}$ for Bernoulli($\alpha$) sources: the right derivative of $E_s(R)$ at $R = H(p_x)$ (vertical axis, 0 to 5) is plotted against $\alpha\in(0, 0.25)$ (horizontal axis).
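The closed-form right derivative is simple to evaluate: its denominator is the variance of $-\log_2 p_x(x)$ (the varentropy of the source). Below is a small sketch for the Bernoulli($\alpha$) family of Figure 2.13; the particular $\alpha$ values are illustrative.

```python
import math

def right_derivative_at_entropy(p):
    # lim_{R -> H(p)+} dEs/dR = 2*H(p) / ( sum_x p(x)*log2(p(x))^2 - H(p)^2 )
    H = -sum(x * math.log2(x) for x in p)
    varentropy = sum(x * math.log2(x) ** 2 for x in p) - H * H
    return 2.0 * H / varentropy

# Bernoulli(alpha) sources, as in Figure 2.13
curve = {a: right_derivative_at_entropy([a, 1 - a]) for a in (0.05, 0.1, 0.2)}
```

The derivative is strictly positive for every non-uniform source and grows as $\alpha$ approaches $1/2$, where the varentropy in the denominator vanishes — matching the increasing shape of the plotted curve.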
The most striking difference between the block coding error exponent $E_{s,b}(R)$ and the delay constrained error exponent $E_s(R)$ is their behavior in the high rate regime, as illustrated in Figure 2.3, especially at $R = \log|\mathcal{X}|$. It is obvious that both error exponents are infinite (or, mathematically strictly speaking, not well defined) when $R$ is greater than $\log|\mathcal{X}|$, as shown in Proposition 1. The block coding error exponent $E_{s,b}(R)$ has a finite left limit at $R = \log|\mathcal{X}|$; the next proposition tells us that the delay constrained error exponent, however, has an infinite left limit at $R = \log|\mathcal{X}|$.

Proposition 10 The left limit of the sequential error exponent is infinite as $R$ approaches $\log|\mathcal{X}|$:
$$\lim_{R\to\log|\mathcal{X}|}E_s(R) = \infty$$
Proof: Notice that $E_{s,b}(R)$ is monotonically increasing for $R\in[0,\log|\mathcal{X}|)$ and $E_{s,b}(R) = \infty$ for $R > \log|\mathcal{X}|$. This implies that if $E_{s,b}((1+\alpha)R)$ is finite, $\alpha$ must be less than $\frac{\log|\mathcal{X}|}{R}-1$. So we have, for all $R\in[0,\log|\mathcal{X}|)$:
$$E_s(R) = \inf_{\alpha>0}\frac{1}{\alpha}E_{s,b}((1+\alpha)R) \ge \frac{1}{\frac{\log|\mathcal{X}|}{R}-1}E_{s,b}(R)$$
Thus the left limit of $E_s(R)$ at $R = \log|\mathcal{X}|$ is:
$$\lim_{R\to\log|\mathcal{X}|}E_s(R) \ge \lim_{R\to\log|\mathcal{X}|}\frac{1}{\frac{\log|\mathcal{X}|}{R}-1}E_{s,b}(R) = \infty$$
$\square$
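This blow-up is visible numerically through the parametrization of Proposition 7: $E_s(R) = \rho^* R$ where $\rho^*$ solves $\rho R = E_0(\rho)$, and the root escapes to infinity as $R\to\log|\mathcal{X}|$. A sketch follows; the Bernoulli source $p=(0.8,0.2)$ and the rate points are illustrative assumptions.

```python
import math

def E0(rho, p):
    # Gallager function E0(rho) = (1+rho) * log2( sum_x p(x)^(1/(1+rho)) )
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def es_delay(R, p, hi=1e5, iters=300):
    # Es(R) = rho* R, where rho* is the unique positive root of rho*R - E0(rho)
    g = lambda rho: rho * R - E0(rho, p)
    lo = 1e-9
    for _ in range(iters):          # bisection on the sign change of g
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return R * (lo + hi) / 2.0

p = [0.8, 0.2]
vals = [es_delay(R, p) for R in (0.9, 0.99, 0.999)]   # R -> log2|X| = 1
```

The computed exponents grow roughly like a constant over $\log|\mathcal{X}| - R$, illustrating the infinite left limit that the block exponent does not share.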
2.5.2 Conclusions and future work
In this chapter, we introduced the general setup of delay constrained source coding and the definition of the delay constrained error exponent. Unlike classical source coding, the source generates symbols in a real-time fashion, and the performance of the coding system is measured by the probability of symbol error with a fixed delay. The error exponent for lossless source coding with delay is completely characterized. The achievability part is proved by implementing a fixed-to-variable length source encoder with a FIFO queue; the symbol error probability with delay is then the same as the probability of an atypical queueing delay. The converse part is proved by constructing a block source coding system out of the delay constrained source coding system. This simple idea can be applied to other source coding and channel coding problems and is called the "focusing bound" in [67]. It was shown that this delay constrained error exponent $E_s(R)$ has different properties than the classical block coding error exponent $E_{s,b}(R)$; in particular, $E_s(R)$ is strictly larger than $E_{s,b}(R)$ for all rates in the meaningful rate region. However, the delay constrained error exponent $E_s(R)$ is not completely understood. For example, we still do not know whether $E_s(R)$ is convex $\cup$ in $R$. To further understand the properties of $E_s(R)$, one may find the rich literature in queueing theory [52] very useful, particularly the study of large deviations in queueing theory [42].
We also analyzed the delay constrained performance of the sequential random binning
scheme. We proved that the standard random coding error exponent for block coding
can be achieved using sequential random binning. Hence, sequential random binning is a
suboptimal coding scheme for lossless source coding with delay. However, sequential random
binning is a powerful technique which can be applied to other delay constrained coding
problems such as distributed source coding [32, 12], source coding with side-information [16],
joint source channel coding, multiple-access channel coding [14] and broadcast channel
coding [15] under delay constraints. Some of these will be thoroughly studied in the next
several chapters of this thesis. The decoding schemes have complexity that grows exponentially with time. However, [60] shows that the random coding error exponent is achievable using a more computationally friendly stack-based decoding algorithm if the source distribution is known.
The most interesting result in this chapter is the "focusing" operator, which connects the delay constrained error exponent to the classical fixed-length block coding error exponent. A similar focusing bound was recently derived for delay constrained channel coding with feedback [67]. In order to understand which problems this type of "focusing" bound applies to, we study a more general problem in the next chapter.
Chapter 3
Lossy Source Coding
Under the same streaming source model as in Chapter 2, we study the delay constrained performance of lossy source coding, where the decoder only needs to reconstruct the source to within a certain distortion. We derive the delay constrained error exponent by applying the "focusing" bound operator to the block error exponent for the peak distortion measure. The proof in this chapter is more general than that in Chapter 2 and can be applied to delay constrained channel coding and joint source-channel coding.
3.1 Lossy source coding
As discussed in the review paper [6], lossy source coding is a source coding model where
the estimate of the source at the decoder does not necessarily have to be exact. Unlike in
Chapter 2, where the source and the reconstruction of the source are in the same alphabet
set X , for lossy source coding, the reconstruction of the source is in a different alphabet
set Y where a distortion measure d(·, ·) is defined on X × Y. We denote by X the source
alphabet and $\mathcal{Y}$ the reconstruction alphabet. We also denote by $y_i$ the reconstruction of $x_i$; this notational difference from Chapter 2, where we used $\widehat{x}_i$ as the reconstruction of $x_i$, is to emphasize the fact that the reconstruction alphabet can be different from the source alphabet.
y ∈ Y:
d : X × Y → [0,∞)
The classical fixed length block coding results for lossy source coding are reviewed in
Section A.2 in the appendix. The core issue we are interested in for this chapter is the impact
of causality and delay on lossy source coding. In [58], the rate distortion performance of a strictly zero delay decoder is studied, and it is shown that the optimal performance can be obtained by time-sharing between memoryless codes; it is thus in general strictly worse than the performance of classical fixed-block source coding, which allows arbitrarily large delay. The large deviation performance of the zero delay decoder problem is studied in [57]. Allowing some finite end-to-end delay, [81, 80] show that the average block coding rate distortion performance can still be approached exponentially with delay.
3.1.1 Lossy source coding with delay
As the common theme of this thesis, we consider a coding system for a streaming source,
drawn iid from a distribution px on finite alphabet X . The encoder, mapping source symbols
into bits at fixed rate R, is causal and the decoder has to reconstruct the source symbols
within a fixed end-to-end latency constraint $\Delta$. The system is illustrated in Figure 3.1, which is almost the same as the lossless case in Figure 2.1 in Chapter 2. The only difference between the two figures is notational: we use $y$ to indicate the reconstruction of the source $x$ in this chapter, instead of $\widehat{x}$ as in Chapter 2.
Figure 3.1. Time line of delay constrained source coding: rate $R = \frac{1}{2}$, delay $\Delta = 3$. Source symbols $x_1, x_2, x_3, \ldots$ enter the encoder, which sends $b_1(x_1^2), b_2(x_1^4), b_3(x_1^6), \ldots$ over the rate limited channel; the decoder outputs the reconstructions $y_1(4), y_2(5), y_3(6), \ldots$
Generalizing the notion of delay constrained error exponent (end-to-end delay perfor-
mance) for lossless source coding in Definition 2, we have the following definition on delay
constrained error exponent for lossy source coding.
Definition 4 A rate $R$ sequential source code as shown in Figure 3.1 achieves error (distortion violation) exponent $E_D(R)$ with delay if for all $\epsilon>0$, there exists $K<\infty$, s.t. $\forall i, \Delta>0$:
$$\Pr[d(x_i, y_i(i+\Delta)) > D] \le K2^{-\Delta(E_D(R)-\epsilon)}$$
The main goal of this chapter is to derive the error exponent defined above. The above lossy source coding system, in a way, imposes a distortion constraint on each symbol. This leads to an interesting relationship between the problem in Definition 4 and lossy source coding under a peak distortion constraint. We briefly introduce the relevant results on lossy source coding with a peak distortion in the next section.
3.1.2 Delay constrained lossy source coding error exponent
In this section, we present the main result of Chapter 3, the delay constrained error
exponent for lossy source coding. This result first appeared in our paper [17]. We investigate
the relation between delay ∆ and the probability of distortion violation Pr[d(xi, yi(i+∆)) >
D], where yi(i+∆) is the reconstruction of xi at time i+∆ and D is the distortion constraint.
Theorem 2 Consider fixed rate source coding of iid streaming data $x_i \sim p_x$, with a non-negative distortion measure $d$. For $D\in(\underline{D}, \overline{D})$ and rates $R\in(R(p_x, D), R_D)$, the following error exponent with delay is optimal and achievable:
$$E_D(R) \triangleq \inf_{\alpha>0}\frac{1}{\alpha}E_D^b((\alpha+1)R) \qquad (3.1)$$
where $E_D^b(R)$ is the block coding error exponent under the peak distortion constraint, as defined in Lemma 6. $\underline{D}$, $\overline{D}$ and $R_D$ are defined in the next section.
3.2 A brief detour to peak distortion
In this section we introduce the peak distortion measure and present the relevant block
coding result for lossy source coding with a peak distortion measure.
3.2.1 Peak distortion
Csiszár introduced the peak distortion measure in [29] as an exercise problem (2.2.12): for a positive function $d$ defined on the finite alphabet set $\mathcal{X}\times\mathcal{Y}$, and for $x_1^N\in\mathcal{X}^N$, $y_1^N\in\mathcal{Y}^N$, we define the distortion between $x_1^N$ and $y_1^N$ as
$$d(x_1^N, y_1^N) \triangleq \max_{1\le i\le N} d(x_i, y_i) \qquad (3.2)$$
This distortion measure reasonably reflects the behavior of the human visual system [51]. Given some finite $D>0$, we say $y$ is a valid reconstruction of $x$ if $d(x,y)\le D$.
The problem is only interesting if the target peak distortion $D$ is higher than $\underline{D}$ and lower than $\overline{D}$, where
$$\underline{D} \triangleq \max_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}} d(x,y), \qquad \overline{D} \triangleq \min_{y\in\mathcal{Y}}\max_{x\in\mathcal{X}} d(x,y)$$
If $D > \overline{D}$, then there exists a $y^*$ such that $d(x, y^*)\le D$ for all $x\in\mathcal{X}$; thus the decoder can reconstruct any source symbol $x_i$ by $y^*$ and the peak distortion requirement is still satisfied. On the other hand, if $D < \underline{D}$, then there exists an $x^*\in\mathcal{X}$ such that $d(x^*, y) > D$ for all $y\in\mathcal{Y}$. Now, assuming $p_x(x^*) > 0$, if $N$ is big enough then with high probability $x_i = x^*$ for some $i$; in this case there is no reconstruction $y_1^N\in\mathcal{Y}^N$ such that $d(x_1^N, y_1^N)\le D$. Note that both $\underline{D}$ and $\overline{D}$ depend only on the distortion measure $d(\cdot,\cdot)$, not on the source distribution $p_x$.
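Both thresholds are one-line computations over the distortion matrix. The sketch below uses a hypothetical $3\times3$ matrix chosen so that $\underline{D} = 0.2$; it is not the matrix of Figure 3.2, whose exact entries are not reproduced here.

```python
def distortion_range(d):
    # d[x][y] is the distortion matrix on X x Y
    # below D_low, some source letter x has no valid reconstruction at all
    D_low = max(min(row) for row in d)
    # at or above D_high, a single letter y* is a valid reconstruction for every x
    D_high = min(max(row[y] for row in d) for y in range(len(d[0])))
    return D_low, D_high

# hypothetical distortion matrix (illustration only)
d = [[0.0, 1.0, 2.0],
     [3.0, 0.0, 1.0],
     [2.0, 3.0, 0.2]]
```

For this matrix the interesting range is $D\in(0.2, 2.0)$; the minimax inequality guarantees $\underline{D}\le\overline{D}$ for any matrix.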
3.2.2 Rate distortion and error exponent for the peak distortion measure
Cf. exercise (2.2.12) of [29]: the peak distortion problem can be treated as a special case of an average distortion problem by defining a new distortion measure $d_D$ on $\mathcal{X}\times\mathcal{Y}$, where
$$d_D(x,y) = \begin{cases} 1 & \text{if } d(x,y) > D \\ 0 & \text{otherwise} \end{cases}$$
with average distortion level 0. Hence the rate distortion theorem and error exponent results follow from the average distortion case. In [29], the rate distortion theorem for the peak distortion measure is derived.

Lemma 4 The rate-distortion function $R(D)$ for peak distortion:
$$R(p_x, D) \triangleq \min_{W\in\mathcal{W}_D} I(p_x, W) \qquad (3.3)$$
Figure 3.2. $d(x,y)$ and valid reconstructions under different peak distortion constraints $D$: $(x,y)$ is linked if $d(x,y)\le D$. Here $\underline{D} = 0.2$ and $\overline{D} = 3$. For $D\in[0.2, 1)$, this is a lossless source coding problem.
where $\mathcal{W}_D$ is the set of all transition matrices that satisfy the peak distortion constraint, i.e. $\mathcal{W}_D = \{W : W(y|x) = 0 \text{ if } d(x,y) > D\}$.
Operationally, this lemma says: for any $\epsilon>0$ and $\delta>0$, for block length $N$ big enough, there exists a code of rate $R(p_x, D)+\epsilon$ such that the peak (maximum) distortion between the source string $x_1^N$ and its reconstruction $y_1^N$ is no bigger than $D$ with probability at least $1-\delta$.
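The minimization in Lemma 4 can be computed with a Blahut–Arimoto style alternating update in which the transition matrix is masked to the allowed pairs $d(x,y)\le D$. This is a sketch, not the thesis's method: the masking trick and the example matrix below are assumptions for illustration (the Hamming-like matrix makes $D<1$ degenerate to lossless coding, as in Figure 3.2).

```python
import math

def peak_rate_distortion(p, d, D, iters=500):
    # R(p, D) = min over W with W(y|x) = 0 whenever d(x,y) > D of I(p, W)
    nx, ny = len(d), len(d[0])
    ok = [[d[x][y] <= D for y in range(ny)] for x in range(nx)]
    if any(not any(row) for row in ok):
        return float('inf')    # D below D_low: some letter has no valid reconstruction
    # start from the uniform channel over the allowed transitions
    W = [[1.0 / sum(ok[x]) if ok[x][y] else 0.0 for y in range(ny)] for x in range(nx)]
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        for x in range(nx):
            z = sum(q[y] for y in range(ny) if ok[x][y])
            W[x] = [q[y] / z if ok[x][y] else 0.0 for y in range(ny)]
    q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    return sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
               for x in range(nx) for y in range(ny) if W[x][y] > 0)

p = [0.7, 0.2, 0.1]
hamming = [[0, 1, 1],
           [1, 0, 1],
           [1, 1, 0]]
```

With this matrix, any $D\in[0,1)$ forces $y = x$, so the routine returns $H(p)\approx 1.157$ bits — the lossless value quoted in Figure 3.3 — while $D\ge1$ allows every transition and gives $R(p,D)=0$.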
This result can be viewed as a simple corollary of the rate distortion theorem in Theo-
rem 10. Similar to the average distortion case, we have the following fixed-to-variable length
coding result for peak distortion measure.
To have $\Pr[d(x_1^N, y_1^N) > D] = 0$, we can implement a universal variable length prefix-free code with code length $l_D(x_1^N)$, where
$$l_D(x_1^N) = N\big(R(p_{x_1^N}, D) + \delta_N\big) \qquad (3.4)$$
where $p_{x_1^N}$ is the empirical distribution of $x_1^N$, and $\delta_N$ goes to 0 as $N$ goes to infinity.
Like the average distortion case, this is a simple corollary of the type covering lemma [29, 28], which is derived from the Johnson–Stein–Lovász theorem [23].
The rate distortion function $R(D)$ for an average distortion measure is in general non-concave (non-$\cap$) in the source distribution $p$, as pointed out in [55]. But for peak distortion, $R(p, D)$ is concave $\cap$ in $p$ for a fixed distortion constraint $D$. The concavity of the rate distortion function under the peak distortion measure is an important property that rate distortion functions of average distortion measures do not have.
Lemma 5 R(p,D) is concave ∩ in p for fixed D.
Proof: The proof is in Appendix C. ¤
Now we study the large deviation properties of lossy source coding under the peak
distortion measure. As a simple corollary of the block-coding error exponents for average
distortion from [55], we have the following result.
Lemma 6 Block coding error exponent under peak distortion:
$$\liminf_{N\to\infty} -\frac{1}{N}\log_2 \Pr[d(x_1^N, y_1^N) > D] = E_D^b(R), \quad \text{where } E_D^b(R) \triangleq \min_{q_x:\, R(q_x, D)>R} D(q_x\|p_x) \qquad (3.5)$$
where $y_1^N$ is the reconstruction of $x_1^N$ using an optimal rate $R$ code.
For lossless source coding, if $R > \log_2|\mathcal{X}|$, the error probability is 0 and the error exponent is infinite. Similarly, for lossy source coding under peak distortion, the error exponent is infinite whenever
$$R > R_D \triangleq \sup_{q_x} R(q_x, D)$$
where $R_D$ depends only on $d(\cdot,\cdot)$ and $D$.
3.3 Numerical Results
Consider the distribution $p_x = (0.7, 0.2, 0.1)$ and the distortion measure on $\mathcal{X}\times\mathcal{Y}$ shown in Figure 3.2. We plot the rate distortion ($R$–$D$) curve under the peak distortion measure in Figure 3.3. The $R$–$D$ curve under peak distortion is a staircase function because the validity of a reconstruction is a 0–1 relation. The delay constrained lossy source coding error exponents are shown in Figure 3.4. The delay constrained error exponent is higher than the block coding error exponent, as was also observed in the lossless case in Chapter 2.
Figure 3.3. The rate distortion curve under peak distortion. For $D_1\in[1,3)$, $R(p_x, D_1) = 0.275$ and $R_{D_1} = 0.997$. For $D_2\in[0.2,1)$, the problem degenerates to lossless coding, so $R(p_x, D_2) = H(p_x) = 1.157$ and $R_{D_2} = \log_2(3) = 1.585$.
3.4 Proof of Theorem 2
In this section, we show that the error exponent in Theorem 2 is both achievable asymp-
totically with delay and that no better exponents are possible.
3.4.1 Converse
The proof of the converse is similar to the upper bound argument in Chapter 2 for lossless source coding with delay constraints; we give the proof in Appendix D.
3.4.2 Achievability
We prove achievability by giving a universal coding scheme illustrated in Figure 3.5.
This coding system looks almost identical to the delay constrained lossless source coding
system in Section 2.4.1 as shown in Figure 2.11. The only difference is the fixed to variable
Figure 3.4. Error exponents of delay constrained lossy source coding and block source coding under the peak distortion measure, plotted against rate for $D_1\in[1,3)$ and $D_2\in[0.2,1)$: each pair of curves starts at $R(p_x, D_1)$ and $R(p_x, D_2)$ respectively, and the exponents go to $\infty$ at $R = R_{D_1}$ and $R = R_{D_2}$.
length encoder. Instead of the universal optimal lossless encoder which is a one-to-one map
from the source block space to the binary string space, we have a variable length lossy
encoder in this chapter. So the average number of bits per symbol for an individual block is
roughly the rate distortion function under peak distortion measure rather than the empirical
entropy of that block.
Figure 3.5. A delay optimal lossy source coding system: streaming data $\ldots, x_i, \ldots$ enters a fixed to variable length encoder; the encoded bits pass through a FIFO buffer drained as a rate $R$ bit stream; the decoder produces the "lossy" reconstructions $\ldots, y_i(i+\Delta), \ldots$ with delay $\Delta$.
A block-length $N$ is chosen that is much smaller than the target end-to-end delays¹, while still being large enough. For a discrete memoryless source $\mathcal{X}$, distortion measure $d(\cdot,\cdot)$, peak distortion constraint $D$, and large block-length $N$, we use the universal variable length prefix-free code in Proposition 4 to encode the $i$th block $\vec{x}_i = x_{(i-1)N+1}^{iN}\in\mathcal{X}^N$. The code length $l_D(\vec{x}_i)$ is as shown in (3.4):
$$NR(p_{\vec{x}_i}, D) \le l_D(\vec{x}_i) \le N\big(R(p_{\vec{x}_i}, D) + \delta_N\big) \qquad (3.6)$$

¹As always, we are interested in the performance with asymptotically large delays $\Delta$.
The overhead δN is negligible for large N , since δN goes to 0 as N goes to infinity. The
binary sequence describing the source is fed into a FIFO buffer described in Figure 3.5. The
buffer is drained at a fixed rate R to obtain the encoded bits. Notice that if the buffer is
empty, the output of the encoder buffer can be gibberish binary bits. The decoder simply
discards these meaningless bits because it is aware that the buffer is empty. The decoder
uses the bits it has received so far to get the reconstructions. If the relevant bits have
not arrived by the time the reconstruction is due, it just guesses and we presume that a
distortion-violation will occur.
As the following proposition indicates, the coding scheme is delay universal, i.e. the
distortion-violation probability goes to 0 with exponent ED(R) for all source symbols and
for all delays ∆ big enough.
Proposition 11 For the iid source $\sim p_x$, peak distortion constraint $D$, and large $N$, using the universal causal code described above, for all $\epsilon>0$, there exists $K<\infty$, s.t. for all $t$, $\Delta$:
$$\Pr[d(\vec{x}_t, \vec{y}_t((t+\Delta)N)) > D] \le K2^{-\Delta N(E_D(R)-\epsilon)}$$
where $\vec{y}_t((t+\Delta)N)$ is the estimate of $\vec{x}_t$ at time $(t+\Delta)N$.
Before proving Proposition 11, we state the following lemma (proved in Appendix E) bound-
ing the probability of atypical source behavior.
Lemma 7 (Source atypicality) For all $\epsilon>0$ and block length $N$ large enough, there exists $K<\infty$, s.t. for all $n$ and for all $r < R_D$:
$$\Pr\Big(\sum_{i=1}^{n} l_D(\vec{x}_i) > nNr\Big) \le K2^{-nN(E_D^b(r)-\epsilon)} \qquad (3.7)$$
Now we are ready to prove Proposition 11.
Proof: At time $(t+\Delta)N$, the decoder cannot decode $\vec{x}_t$ within peak distortion $D$ only if the binary strings describing $\vec{x}_t$ are not all out of the buffer. Since the encoding buffer is FIFO, this means that the number of outgoing bits from some time $t_1$, where $t_1\le tN$, to $(t+\Delta)N$ is less than the number of bits in the buffer at time $t_1$ plus the number of incoming bits from time $t_1$ to time $tN$. Suppose the buffer was last empty at time $tN-nN$, where $0\le n\le t$. Given this condition, the peak distortion constraint is violated only if $\sum_{i=0}^{n-1} l_D(\vec{x}_{t-i}) > (n+\Delta)NR$. Write $l_{D,\max}$ for the longest possible code length:
$$l_{D,\max} \le |\mathcal{X}|\log_2(N+1) + N\log_2|\mathcal{X}|$$
Then $\Pr[\sum_{i=0}^{n-1} l_D(\vec{x}_{t-i}) > (n+\Delta)NR] > 0$ only if $n > \frac{(n+\Delta)NR}{l_{D,\max}} > \frac{\Delta NR}{l_{D,\max}} \triangleq \beta\Delta$. So
$$\Pr\big(d(\vec{x}_t, \vec{y}_t((t+\Delta)N)) > D\big) \le \sum_{n=\beta\Delta}^{t}\Pr\Big[\sum_{i=0}^{n-1} l_D(\vec{x}_{t-i}) > (n+\Delta)NR\Big] \qquad (3.8)$$
$$\stackrel{(a)}{\le} \sum_{n=\beta\Delta}^{t} K_1 2^{-nN\big(E_D^b\big(\frac{(n+\Delta)NR}{nN}\big)-\epsilon_1\big)}$$
$$\stackrel{(b)}{\le} \sum_{n=\gamma\Delta}^{\infty} K_2 2^{-nN(E_D^b(R)-\epsilon_2)} + \sum_{n=\beta\Delta}^{\gamma\Delta} K_2 2^{-\Delta N\big(\min_{\alpha>1}\frac{E_D^b(\alpha R)}{\alpha-1}-\epsilon_2\big)}$$
$$\stackrel{(c)}{\le} K_3 2^{-\gamma\Delta N(E_D^b(R)-\epsilon_2)} + |\gamma\Delta-\beta\Delta|\, K_3 2^{-\Delta N(E_D(R)-\epsilon_2)}$$
$$\stackrel{(d)}{\le} K 2^{-\Delta N(E_D(R)-\epsilon)}$$
where the $K_i$'s and $\epsilon_i$'s are properly chosen real numbers. (a) is true because of Lemma 7. Define $\gamma \triangleq \frac{E_D(R)}{E_D^b(R)}$. In the first part of (b), we only need the fact that $E_D^b(R)$ is non-decreasing in $R$. In the second part of (b), we write $\alpha = \frac{n+\Delta}{n}$ and take the $\alpha$ that minimizes the error exponent. The first term of (c) comes from the sum of a convergent geometric series, and the second is by the definition of $E_D(R)$. (d) is by the definition of $\gamma$. $\blacksquare$
Combining the converse result in Proposition 12 in Appendix D and the achievability
result in Proposition 11, we establish the desired results summarized in Theorem 2.
3.5 Discussions: why peak distortion measure?
For delay constrained lossy source coding, a "focusing" type bound is derived which is quite similar to its lossless source coding counterpart, in the sense that both convert a block coding error exponent into a delay constrained error exponent through the same focusing operator. As shown in Appendix E, the technical reason for the similarity is that the lengths of optimal variable-length codes, or equivalently the rate distortion functions, are concave $\cap$ in the empirical distribution for both lossless source coding and lossy source coding under a peak distortion constraint. This is not the case for rate distortion functions under average distortion measures [29]. Thus the block error exponent for average distortion lossy source coding cannot be converted to a delay constrained error exponent by the same "focusing" operator. This leaves a great amount of future work.
The technical tools in this chapter are the same as those in Chapter 2: the large deviation principle and convex optimization methods. However, the block coding error exponent for lossy source coding under a peak distortion measure cannot be parameterized like the block lossless source coding error exponent. This poses a new challenge to our task. By using only the convexity and concavity of the relevant quantities, and not the particular form of the error exponent, we develop a more general proof scheme for "focusing" type bounds. These techniques can be applied to other problems such as channel coding with perfect feedback, source coding with random arrivals and, undoubtedly, other problems waiting to be discovered.
Chapter 4
Distributed Lossless Source Coding
In this chapter,¹ we begin by reviewing classical results on the error exponents of the distributed source coding problem first studied by Slepian and Wolf [79]. Then we introduce the setup of the delay constrained distributed source coding problem. We then present the main result of this chapter: achievable delay constrained error exponents for delay constrained distributed source coding. The key techniques are sequential random binning, which was
introduced in Section 2.3.3 and a somewhat complicated way of partitioning error events.
This is a “divide and conquer” proof. For each individual error event we use classical block
code bounding techniques. We analyze both maximum likelihood decoding and universal
decoding and show that the achieved exponents are equal. The proof of the equality of
the two error exponents is in Appendix G. From the mathematical point of view, this
proof is the most interesting and challenging part of this thesis. The key tools we use are
the tilted-distribution from the statistics literature and the Lagrange dual from the convex
optimization literature.
4.1 Distributed source coding with delay
In Section A.3 in the appendix, we review the distributed lossless source coding result
in the block coding setup first discovered by David Slepian and Jack Wolf in their classical
paper [79] which is now widely known as the Slepian-Wolf source coding problem. We then1

1The result in this chapter is built upon a research project [12] jointly done by Cheng Chang, Stark
Draper and Anant Sahai. I would like to thank Stark Draper for his great work in the project (especially
the universal decoding part) and for allowing me to put this joint work in my thesis.
Figure 4.1. Slepian-Wolf distributed encoding and joint decoding of a pair of correlated
sources: encoder x maps $x_1, x_2, \ldots, x_N$ at rate $R_x$ and encoder y maps $y_1, y_2, \ldots, y_N$ at
rate $R_y$, with $(x_i, y_i) \sim p_{xy}$, into bit streams for a single joint decoder.
introduce distributed source coding in the delay constrained framework. Then we give our
result on error exponents with delay which first appeared in [12].
4.1.1 Lossless Distributed Source Coding with Delay Constraints
Paralleling the point-to-point lossless source coding with delay problem introduced
in Definition 1 in Section 2.1.1, we have the following setup for distributed source coding
with delay constraints.

Figure 4.2. Delay constrained distributed source coding: rates $R_x = \frac{1}{2}$, $R_y = \frac{1}{3}$, delay $\Delta = 3$.
Encoder x causally emits $a_1(x_1^2), a_2(x_1^4), a_3(x_1^6), \ldots$ and encoder y emits $b_1(y_1^2), b_2(y_1^4), \ldots$
over rate limited channels; the decoder outputs $\widehat{x}_1(4), \widehat{x}_2(5), \widehat{x}_3(6), \ldots$ and $\widehat{y}_1(4), \widehat{y}_2(5), \widehat{y}_3(6), \ldots$
Definition 5 A fixed-delay ∆ sequential encoder-decoder triplet $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is a sequence
of mappings $\mathcal{E}^x_j$, $j = 1, 2, ...$, $\mathcal{E}^y_j$, $j = 1, 2, ...$, and $\mathcal{D}_j$, $j = 1, 2, ...$ The outputs of $\mathcal{E}^x_j$
and $\mathcal{E}^y_j$ are the bits sent from time $j-1$ to $j$:

$$\mathcal{E}^x_j : \mathcal{X}^j \longrightarrow \{0,1\}^{\lfloor jR_x \rfloor - \lfloor (j-1)R_x \rfloor}, \quad \text{e.g. } \mathcal{E}^x_j(x_1^j) = a_{\lfloor (j-1)R_x \rfloor + 1}^{\lfloor jR_x \rfloor},$$
$$\mathcal{E}^y_j : \mathcal{Y}^j \longrightarrow \{0,1\}^{\lfloor jR_y \rfloor - \lfloor (j-1)R_y \rfloor}, \quad \text{e.g. } \mathcal{E}^y_j(y_1^j) = b_{\lfloor (j-1)R_y \rfloor + 1}^{\lfloor jR_y \rfloor}. \tag{4.1}$$

The output of the fixed-delay ∆ decoder $\mathcal{D}_j$ is the decoding decision on $x_j$ and $y_j$ based on
the received binary bits up to time $j + \Delta$:

$$\mathcal{D}_j : \{0,1\}^{\lfloor (j+\Delta)R_x \rfloor} \times \{0,1\}^{\lfloor (j+\Delta)R_y \rfloor} \longrightarrow \mathcal{X} \times \mathcal{Y}$$
$$\mathcal{D}_j(a_1^{\lfloor (j+\Delta)R_x \rfloor}, b_1^{\lfloor (j+\Delta)R_y \rfloor}) = (\widehat{x}_j(j+\Delta), \widehat{y}_j(j+\Delta))$$

where $\widehat{x}_j(j+\Delta)$ and $\widehat{y}_j(j+\Delta)$ are the estimates of $x_j$ and $y_j$ at time $j+\Delta$ and thus have an
end-to-end delay of ∆ seconds. A point-to-point delay constrained source coding system is
illustrated in Figure 2.1.
In this chapter we study two symbol error probabilities. We define the pair of source
estimates of the $(n-\Delta)$th source symbols $(x_{n-\Delta}, y_{n-\Delta})$ at time $n$ as

$$(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) = \mathcal{D}_n\Big(\prod_{j=1}^{n} \mathcal{E}^x_j, \prod_{j=1}^{n} \mathcal{E}^y_j\Big),$$

where $\prod_{j=1}^{n} \mathcal{E}^x_j$ indicates the full $\lfloor nR_x \rfloor$ bit stream from encoder x up to time $n$. We use
$(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n))$ to indicate the estimates of the $(n-\Delta)$th symbol of each source stream
at time $n$; the end-to-end delay is ∆. With these definitions the two error probabilities we
study are

$$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \quad \text{and} \quad \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}].$$
Paralleling the definition of the delay constrained error exponent for point-to-point source
coding in Definition 2, we have the following definition on delay constrained error exponents
for distributed source coding.
Definition 6 A family of rate $(R_x, R_y)$ distributed sequential source codes $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is
said to achieve the delay error exponent pair $(E_x(R_x,R_y), E_y(R_x,R_y))$ if and only if for all
$\epsilon > 0$ there exists $K < \infty$ such that for all $n$ and all $\Delta > 0$:

$$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le K 2^{-\Delta(E_x(R_x,R_y)-\epsilon)} \quad \text{and} \quad \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}] \le K 2^{-\Delta(E_y(R_x,R_y)-\epsilon)}$$
Remark: The error exponents $E_x(R_x,R_y)$ and $E_y(R_x,R_y)$ are functions of both $R_x$ and
$R_y$. In contrast to (A.9), the error exponent we look at is in the delay ∆ rather than the
total sequence length $n$. Naturally, we say the code achieves the delay constrained joint error
exponent $E_{xy}(R_x,R_y)$ if and only if for all $\epsilon > 0$ there exists $K < \infty$ such that for all $n$
and all $\Delta > 0$:

$$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le K 2^{-\Delta(E_{xy}(R_x,R_y)-\epsilon)}$$

By noticing the simple facts that

$$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le 2\max\{\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}],\ \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}]\}$$

and

$$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \ge \max\{\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}],\ \Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}]\},$$

it should be clear that

$$E_{xy}(R_x,R_y) = \min\{E_x(R_x,R_y),\ E_y(R_x,R_y)\} \tag{4.2}$$
Following the point-to-point randomized sequential encoder-decoder pair defined in Definition 3,
we have the following definition of randomized sequential encoder-decoder triplets.
Definition 7 A randomized sequential encoder-decoder triplet $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is a sequential
encoder-decoder triplet as defined in Definition 5 with common randomness, shared between
the encoders $\mathcal{E}^x$, $\mathcal{E}^y$ and the decoder2 $\mathcal{D}$. This allows us to randomize the mappings
independently of the source sequences. We only need pair-wise independence; formally,

$$\Pr[\mathcal{E}^x(x_1^i x_{i+1}^n) = \mathcal{E}^x(x_1^i \tilde{x}_{i+1}^n)] = 2^{-(\lfloor nR_x \rfloor - \lfloor iR_x \rfloor)} \le 2 \times 2^{-(n-i)R_x} \tag{4.3}$$

where $\tilde{x}_{i+1} \neq x_{i+1}$, and

$$\Pr[\mathcal{E}^y(y_1^i y_{i+1}^n) = \mathcal{E}^y(y_1^i \tilde{y}_{i+1}^n)] = 2^{-(\lfloor nR_y \rfloor - \lfloor iR_y \rfloor)} \le 2 \times 2^{-(n-i)R_y} \tag{4.4}$$

where $\tilde{y}_{i+1} \neq y_{i+1}$.
2This common randomness is not shared between Ex and Ey.
Recall the definition of bins in (2.11):

$$\mathcal{B}_x(x_1^n) = \{\tilde{x}_1^n \in \mathcal{X}^n : \mathcal{E}^x(\tilde{x}_1^n) = \mathcal{E}^x(x_1^n)\} \tag{4.5}$$

Hence (4.3) is equivalent to the following equality for $\tilde{x}_{i+1} \neq x_{i+1}$:

$$\Pr[\mathcal{E}^x(x_1^i x_{i+1}^n) \in \mathcal{B}_x(x_1^i \tilde{x}_{i+1}^n)] = 2^{-(\lfloor nR_x \rfloor - \lfloor iR_x \rfloor)} \le 2 \times 2^{-(n-i)R_x} \tag{4.6}$$

Similarly for source y:

$$\Pr[\mathcal{E}^y(y_1^i y_{i+1}^n) \in \mathcal{B}_y(y_1^i \tilde{y}_{i+1}^n)] = 2^{-(\lfloor nR_y \rfloor - \lfloor iR_y \rfloor)} \le 2 \times 2^{-(n-i)R_y} \tag{4.7}$$
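The pairwise-independence property (4.6) is easy to illustrate with a toy sequential binner. The sketch below is our own construction (the class `SequentialBinner` and the memoization trick are illustrative assumptions, not the thesis's implementation): at each step the encoder emits $\lfloor jR \rfloor - \lfloor (j-1)R \rfloor$ fresh random bits as a function of the whole prefix, so two sequences that first diverge at symbol $i+1$ share their first $\lfloor iR \rfloor$ bits and collide on the rest with probability exactly $2^{-(\lfloor nR \rfloor - \lfloor iR \rfloor)}$.

```python
import math
import random

class SequentialBinner:
    """Toy sequential random binning: at time j the encoder emits
    floor(j*R) - floor((j-1)*R) fresh random bits that depend on the whole
    source prefix.  Memoizing per prefix plays the role of the common
    randomness shared with the decoder."""

    def __init__(self, rate, seed=0):
        self.rate = rate
        self.rng = random.Random(seed)
        self.cache = {}  # prefix tuple -> bits emitted at that step

    def _bits_at_step(self, prefix):
        if prefix not in self.cache:
            j = len(prefix)
            nbits = math.floor(j * self.rate) - math.floor((j - 1) * self.rate)
            self.cache[prefix] = tuple(self.rng.randrange(2) for _ in range(nbits))
        return self.cache[prefix]

    def encode(self, seq):
        out = []
        for j in range(1, len(seq) + 1):
            out.extend(self._bits_at_step(tuple(seq[:j])))
        return tuple(out)

def collision_rate(rate, n, i, trials=1000):
    """Empirical Pr[bin collision] for two sequences agreeing on the first i
    symbols and differing at symbol i+1; (4.6) predicts exactly
    2^-(floor(n*rate) - floor(i*rate))."""
    hits = 0
    for t in range(trials):
        enc = SequentialBinner(rate, seed=t)
        x = (0,) * n
        x_tilde = (0,) * i + (1,) + (0,) * (n - i - 1)
        hits += enc.encode(x) == enc.encode(x_tilde)
    return hits / trials
```

For $R_x = 1/2$, $n = 8$, $i = 4$ the prediction is $2^{-(4-2)} = 1/4$, and the empirical collision rate matches it closely.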
4.1.2 Main result of Chapter 4: Achievable error exponents
By using a randomized sequential encoder-decoder triplet as in Definition 7, a positive delay
constrained error exponent pair $(E_x, E_y)$ can be achieved whenever the rate pair $(R_x, R_y)$ is in
the classical Slepian-Wolf rate region [79]; both maximum likelihood decoding and
universal decoding achieve positive error exponents there. We summarize the main
results of this chapter in the following two theorems, for maximum likelihood decoding and
universal decoding respectively. Furthermore, in Theorem 5 we show that the two decoding
schemes achieve the same error exponent.
Theorem 3 Let $(R_x, R_y)$ be a rate pair such that $R_x > H(x|y)$, $R_y > H(y|x)$, $R_x + R_y >
H(x,y)$. Then, there exists a randomized encoder pair and maximum likelihood decoder
triplet (per Definition 7) that satisfies the following three decoding criteria.

(i) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le K 2^{-\Delta(E_{ML,SW,x}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{ML,SW,x}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{ML}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{ML}_y(R_x,R_y,\gamma)\Big\}.$$

(ii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}] \le K 2^{-\Delta(E_{ML,SW,y}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{ML,SW,y}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{ML}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{ML}_y(R_x,R_y,\gamma)\Big\}.$$

(iii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le K 2^{-\Delta(E_{ML,SW,xy}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{ML,SW,xy}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{ML}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{ML}_y(R_x,R_y,\gamma)\Big\}.$$

In definitions (i)-(iii),

$$E^{ML}_x(R_x,R_y,\gamma) = \sup_{\rho\in[0,1]}\big[\gamma E_{x|y}(R_x,\rho) + (1-\gamma) E_{xy}(R_x,R_y,\rho)\big]$$
$$E^{ML}_y(R_x,R_y,\gamma) = \sup_{\rho\in[0,1]}\big[\gamma E_{y|x}(R_y,\rho) + (1-\gamma) E_{xy}(R_x,R_y,\rho)\big] \tag{4.8}$$

and

$$E_{xy}(R_x,R_y,\rho) = \rho(R_x+R_y) - \log\Big[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}$$
$$E_{x|y}(R_x,\rho) = \rho R_x - \log\Big[\sum_{y}\Big[\sum_{x} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]$$
$$E_{y|x}(R_y,\rho) = \rho R_y - \log\Big[\sum_{x}\Big[\sum_{y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] \tag{4.9}$$
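For a concrete feel for (4.8) and (4.9), the exponents can be evaluated numerically. The sketch below is our own code (base-2 logarithms, a grid search standing in for the sup over ρ); it uses the symmetric joint pmf of Example 1 later in this chapter.

```python
import math

# Joint pmf indexed as p[x][y]; the symmetric source of Example 1 below.
p = [[0.45, 0.05], [0.05, 0.45]]

def E_xy(Rx, Ry, rho):
    """E_xy(Rx, Ry, rho) of (4.9), logs base 2."""
    s = sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1) for y in (0, 1))
    return rho * (Rx + Ry) - (1.0 + rho) * math.log2(s)

def E_x_given_y(Rx, rho):
    """E_{x|y}(Rx, rho) of (4.9)."""
    inner = [sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1)) for y in (0, 1)]
    return rho * Rx - math.log2(sum(v ** (1.0 + rho) for v in inner))

def E_ML_x(Rx, Ry, gamma, grid=500):
    """E^ML_x(Rx, Ry, gamma) of (4.8): sup over rho in [0,1] on a grid."""
    return max(gamma * E_x_given_y(Rx, g / grid)
               + (1.0 - gamma) * E_xy(Rx, Ry, g / grid)
               for g in range(grid + 1))
```

At $(R_x, R_y) = (0.9, 0.71)$, which lies inside the region since $R_x + R_y \approx 1.61 > H(x,y) \approx 1.47$ bits, $E^{ML}_x(R_x, R_y, 0)$ comes out strictly positive, as Theorem 3 promises.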
Theorem 4 Let $(R_x, R_y)$ be a rate pair such that $R_x > H(x|y)$, $R_y > H(y|x)$, $R_x + R_y >
H(x,y)$. Then, there exists a randomized encoder pair and universal decoder triplet (per
Definition 7) that satisfies the following three decoding criteria.

(i) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le K 2^{-\Delta(E_{UN,SW,x}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{UN,SW,x}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{UN}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{UN}_y(R_x,R_y,\gamma)\Big\}. \tag{4.10}$$

(ii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}] \le K 2^{-\Delta(E_{UN,SW,y}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{UN,SW,y}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} \frac{1}{1-\gamma} E^{UN}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{UN}_y(R_x,R_y,\gamma)\Big\}. \tag{4.11}$$

(iii) For all $\epsilon > 0$, there is a finite constant $K > 0$ such that
$\Pr[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})] \le K 2^{-\Delta(E_{UN,SW,xy}(R_x,R_y)-\epsilon)}$ for all $n, \Delta \ge 0$, where

$$E_{UN,SW,xy}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E^{UN}_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} E^{UN}_y(R_x,R_y,\gamma)\Big\}. \tag{4.12}$$

In definitions (i)-(iii),

$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{\bar{x},\bar{y},\tilde{x},\tilde{y}}\ \gamma D(p_{\bar{x}\bar{y}}\|p_{xy}) + (1-\gamma) D(p_{\tilde{x}\tilde{y}}\|p_{xy}) + \big|\gamma[R_x - H(\bar{x}|\bar{y})] + (1-\gamma)[R_x + R_y - H(\tilde{x},\tilde{y})]\big|^+$$

$$E^{UN}_y(R_x,R_y,\gamma) = \inf_{\bar{x},\bar{y},\tilde{x},\tilde{y}}\ \gamma D(p_{\bar{x}\bar{y}}\|p_{xy}) + (1-\gamma) D(p_{\tilde{x}\tilde{y}}\|p_{xy}) + \big|\gamma[R_y - H(\bar{y}|\bar{x})] + (1-\gamma)[R_x + R_y - H(\tilde{x},\tilde{y})]\big|^+ \tag{4.13}$$

where the dummy random variables $(\bar{x},\bar{y})$ and $(\tilde{x},\tilde{y})$ have joint distributions $p_{\bar{x}\bar{y}}$ and $p_{\tilde{x}\tilde{y}}$,
respectively. Recall that the function $|z|^+ = z$ if $z \ge 0$ and $|z|^+ = 0$ if $z < 0$.
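The universal exponents (4.13) can likewise be approximated by brute force over the two dummy joint distributions. The following is our own coarse-grid sketch (base-2 logs; restricting the infimum to grid pmfs means the returned value can only overestimate the true exponent slightly):

```python
import itertools
import math

# Joint pmf of the symmetric Example 1 source.
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def _pmfs(step):
    """All 2x2 pmfs whose masses are multiples of 1/step."""
    for a, b, c in itertools.product(range(step + 1), repeat=3):
        if a + b + c <= step:
            d = step - a - b - c
            yield {(0, 0): a / step, (0, 1): b / step,
                   (1, 0): c / step, (1, 1): d / step}

def _D(q):  # divergence D(q || p), base 2
    return sum(v * math.log2(v / p[k]) for k, v in q.items() if v > 0)

def _H(q):  # joint entropy of q
    return -sum(v * math.log2(v) for v in q.values() if v > 0)

def _H_x_given_y(q):
    hy = -sum(m * math.log2(m)
              for m in (q[(0, 0)] + q[(1, 0)], q[(0, 1)] + q[(1, 1)]) if m > 0)
    return _H(q) - hy

def E_UN_x(Rx, Ry, gamma, step=10):
    """Coarse-grid evaluation of E^UN_x in (4.13)."""
    stats = [(_D(q), _H_x_given_y(q), _H(q)) for q in _pmfs(step)]
    best = float("inf")
    for d_bar, h_cond, _ in stats:          # dummy pair (x-bar, y-bar)
        for d_til, _, h_joint in stats:     # dummy pair (x-tilde, y-tilde)
            rate = gamma * (Rx - h_cond) + (1 - gamma) * (Rx + Ry - h_joint)
            val = gamma * d_bar + (1 - gamma) * d_til + max(0.0, rate)
            best = min(best, val)
    return best
```

For step = 10 the search covers 286 × 286 distribution pairs, which is instantaneous; at $(R_x, R_y) = (0.9, 0.71)$ and $\gamma = 0$ the value is a small positive number, consistent with Theorem 5's claim that it matches the ML exponent.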
Remark: We can compare the joint error events for block and streaming Slepian-Wolf
coding, cf. (4.12) with (A.10). The streaming exponent differs by the extra parameter
γ that must be minimized over. If the minimizing γ = 1, then the block and streaming
exponents are the same. The minimization over γ results from a fundamental difference in
the types of error-causing events that can occur in streaming Slepian-Wolf as compared to
block Slepian-Wolf.
Remark: The error exponents of maximum likelihood and universal decoding in Theo-
rems 3 and 4 are the same. However, because there are new classes of error events possible
in streaming, this needs proof. The equivalence is summarized in the following theorem.
Theorem 5 Let $(R_x, R_y)$ be a rate pair such that $R_x > H(x|y)$, $R_y > H(y|x)$, and $R_x +
R_y > H(x,y)$. Then,

$$E_{ML,SW,x}(R_x,R_y) = E_{UN,SW,x}(R_x,R_y), \tag{4.14}$$

and

$$E_{ML,SW,y}(R_x,R_y) = E_{UN,SW,y}(R_x,R_y). \tag{4.15}$$
Theorem 5 follows directly from the following lemma, shown in the appendix.
Lemma 8 For all $\gamma \in [0,1]$,

$$E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma), \tag{4.16}$$

and

$$E^{ML}_y(R_x,R_y,\gamma) = E^{UN}_y(R_x,R_y,\gamma). \tag{4.17}$$
Proof: This is an important lemma. The proof is extremely laborious, so we put the
details of the proof in Appendix G. The main tool used in the proof is convex optimization.
There is a great deal of detailed analysis of tilted distributions in Appendix G.3, which is
used throughout this thesis. ■
Remark: This theorem allows us to simplify notation. For example, we can define
$E_x(R_x,R_y,\gamma) = E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma)$, and can similarly
define $E_y(R_x,R_y,\gamma)$. Further, since the ML and universal exponents are the same over the
whole rate region, we can define $E_{SW,x}(R_x,R_y) = E_{ML,SW,x}(R_x,R_y) = E_{UN,SW,x}(R_x,R_y)$,
and can similarly define $E_{SW,y}(R_x,R_y)$.
4.2 Numerical Results
To build insight into the differences between the sequential error exponents of Theorems
3-5 and block-coding error exponents, we give some examples of the exponents for binary
sources.
For the point-to-point case, the error exponents of random sequential and block source
coding are identical everywhere in the achievable rate region as can be seen by comparing
Theorem 9 and Propositions 3 and 4. The same is true for source coding with decoder
side-information which will be clear in Chapter 5. For distributed (Slepian-Wolf) source
coding however, the sequential and block error exponents can be different. The reason for
the discrepancy is that a new type of error event can be dominant in Slepian-Wolf source
coding. This is reflected in Theorems 3 - 5 by the minimization over γ. Example 2 illustrates
the impact of this γ term.
Given the sequential random source coding rule, for Slepian-Wolf source coding at very
high rates, where Rx > H(x), the decoder can ignore any information from encoder y and
still decode x with a positive error exponent. However, the decoder could also choose
to decode sources x and y jointly. Figs 4.6.a and 4.6.b illustrate that joint decoding may,
or, surprisingly, may not, help in decoding source x. This is seen by comparing the error exponent
when the decoder ignores the side-information from encoder y (the dotted curves) to the
joint error exponent (the lower solid curves). It seems that when the rate for source y is
low, atypical behaviors of source y can cause joint decoding errors that end up corrupting
x estimates. This holds for both block and sequential coding.
Figure 4.3. Rate region for the Example 1 source (the triangle above the line $R_x + R_y = H(x,y)$);
we focus on the error exponent for source x at fixed encoder y rates $R_y = 0.71$ and $R_y = 0.97$.
4.2.1 Example 1: symmetric source with uniform marginals
Consider a symmetric source where |X | = |Y| = 2, pxy (0, 0) = 0.45, pxy (0, 1) =
pxy (1, 0) = 0.05 and pxy (1, 1) = 0.45. This is a marginally-uniform source: x is
Bernoulli(1/2), y is the output from a BSC with input x , thus y is Bernoulli(1/2) as well.
For this source H(x) = H(y) = log(2) = 1, H(x |y) = H(y |x) = 0.46, H(x , y) = 1.47. The
achievable rate region is the triangle shown in Figure 4.3.
For this source, as will be shown later, the dominant sequential error event lies on the
diagonal line in Fig 4.8. This is to say that

$$E_{SW,x}(R_x,R_y) = E^{BLOCK}_{SW,x}(R_x,R_y) = E^{ML}_x(R_x,R_y,0) = \sup_{\rho\in[0,1]}\big[E_{xy}(R_x,R_y,\rho)\big], \tag{4.18}$$

where $E^{BLOCK}_{SW,x}(R_x,R_y) = \min\{E^{ML}_x(R_x,R_y,0),\ E^{ML}_x(R_x,R_y,1)\}$ as shown in [39].
Similarly for source y:

$$E_{SW,y}(R_x,R_y) = E^{BLOCK}_{SW,y}(R_x,R_y) = E^{ML}_y(R_x,R_y,0) = \sup_{\rho\in[0,1]}\big[E_{xy}(R_x,R_y,\rho)\big]. \tag{4.19}$$
We first show that for this source, $E_{x|y}(R_x,\rho) \ge E_{xy}(R_x,R_y,\rho)$ for all $\rho \ge 0$. By definition:

$$E_{x|y}(R_x,\rho) - E_{xy}(R_x,R_y,\rho)$$
$$= \rho R_x - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] - \Big(\rho(R_x+R_y) - \log\Big[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big)$$
$$= -\rho R_y - \log\Big[2\Big[\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] + \log\Big[2\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}}\Big]^{1+\rho}$$
$$= -\rho R_y - \log[2] + (1+\rho)\log[2]$$
$$= \rho(\log[2] - R_y)$$
$$\ge 0$$

Here the second step uses the symmetry of the source, $\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}} = \sum_x p_{xy}(x,1)^{\frac{1}{1+\rho}}$.
The last inequality is true because we only consider the problem when $R_y \le \log|\mathcal{Y}|$;
otherwise, y is better viewed as perfectly known side-information. Now
$$E^{ML}_x(R_x,R_y,\gamma) = \sup_{\rho\in[0,1]}\big[\gamma E_{x|y}(R_x,\rho) + (1-\gamma)E_{xy}(R_x,R_y,\rho)\big]$$
$$\ge \sup_{\rho\in[0,1]}\big[E_{xy}(R_x,R_y,\rho)\big]$$
$$= E^{ML}_x(R_x,R_y,0)$$

Similarly $E^{ML}_y(R_x,R_y,\gamma) \ge E^{ML}_y(R_x,R_y,0) = E^{ML}_x(R_x,R_y,0)$. Finally,

$$E_{SW,x}(R_x,R_y) = \min\Big\{\inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]} \frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\Big\} = E^{ML}_x(R_x,R_y,0)$$

In particular $E_x(R_x,R_y,1) \ge E_x(R_x,R_y,0)$, so

$$E^{BLOCK}_{SW,x}(R_x,R_y) = \min\{E^{ML}_x(R_x,R_y,0),\ E^{ML}_x(R_x,R_y,1)\} = E^{ML}_x(R_x,R_y,0)$$

The same proof holds for source y.
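The identity driving this example, $E_{x|y}(R_x,\rho) - E_{xy}(R_x,R_y,\rho) = \rho(\log 2 - R_y)$, is easy to check numerically. The snippet below is our own verification code (base-2 logarithms, so $\log 2 = 1$):

```python
import math

p = [[0.45, 0.05], [0.05, 0.45]]  # Example 1 source

def exponent_gap(Rx, Ry, rho):
    """E_{x|y}(Rx, rho) - E_{xy}(Rx, Ry, rho), both from (4.9), base-2 logs."""
    inner = [sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1)) for y in (0, 1)]
    e_cond = rho * Rx - math.log2(sum(v ** (1.0 + rho) for v in inner))
    s = sum(p[x][y] ** (1.0 / (1.0 + rho)) for x in (0, 1) for y in (0, 1))
    e_joint = rho * (Rx + Ry) - (1.0 + rho) * math.log2(s)
    return e_cond - e_joint

def joint_entropy():
    """H(x,y) of the Example 1 source in bits (quoted as 1.47 above)."""
    return -sum(v * math.log2(v) for row in p for v in row)
```

For any $\rho$ the gap equals $\rho(1 - R_y)$ bits exactly, and it is nonnegative whenever $R_y \le 1$, matching the derivation above.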
In Fig 4.4 we plot the joint sequential/block coding error exponents $E_{SW,x}(R_x,R_y) =
E^{BLOCK}_{SW,x}(R_x,R_y)$; the error exponents are positive iff $R_x > H(x,y) - R_y = 1.47 - R_y$.
4.2.2 Example 2: non-symmetric source
Consider a non-symmetric source where |X | = |Y| = 2, pxy (0, 0) = 0.1, pxy (0, 1) =
pxy (1, 0) = 0.05 and pxy (1, 1) = 0.8. For this source H(x) = H(y) = 0.42,
Figure 4.4. Error exponent plot: $E_{SW,x}(R_x,R_y)$ as a function of $R_x$, plotted for $R_y = 0.71$
and $R_y = 0.97$. Here $E_{SW,x}(R_x,R_y) = E^{BLOCK}_{SW,x}(R_x,R_y) = E_{SW,y}(R_x,R_y) = E^{BLOCK}_{SW,y}(R_x,R_y)$
and $E_x(R_x) = 0$.
$H(x|y) = H(y|x) = 0.29$ and $H(x,y) = 0.71$. The achievable rate region is shown
in Fig 4.5. In Figs 4.6.a, 4.6.b, 4.6.c and 4.6.d, we compare the joint sequential error
exponent $E_{SW,x}(R_x,R_y)$, the joint block coding error exponent $E^{BLOCK}_{SW,x}(R_x,R_y) =
\min\{E_x(R_x,R_y,0),\ E_x(R_x,R_y,1)\}$ as shown in [39], and the individual error exponent for
source x, $E_x(R_x)$, as shown in Corollary 4. Notice that $E_x(R_x) > 0$ only if $R_x > H(x)$. In
Fig 4.7, we compare the sequential error exponent for source y, $E_{SW,y}(R_x,R_y)$, the block
coding error exponent for source y, $E^{BLOCK}_{SW,y}(R_x,R_y) = \min\{E_y(R_x,R_y,0),\ E_y(R_x,R_y,1)\}$,
and $E_y(R_y)$, which is a constant since we fix $R_y$.
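The quoted Example 2 entropies ($H(x) = H(y) = 0.42$, $H(x|y) = H(y|x) = 0.29$, $H(x,y) = 0.71$) are reproduced below with natural logarithms; that unit choice is our own reading of the numbers, since they do not match base-2 values for this pmf.

```python
import math

# Example 2 source: p(0,0) = 0.1, p(0,1) = p(1,0) = 0.05, p(1,1) = 0.8.
p = {(0, 0): 0.1, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.8}

H_joint = -sum(v * math.log(v) for v in p.values())       # H(x, y) in nats
px0 = p[(0, 0)] + p[(0, 1)]                               # Pr[x = 0] = 0.15
H_x = -(px0 * math.log(px0) + (1 - px0) * math.log(1 - px0))
# By symmetry H(y) = H(x) here, so H(x|y) = H(x, y) - H(y) = H_joint - H_x.
H_x_given_y = H_joint - H_x
```

Running this gives $H(x) \approx 0.423$, $H(x|y) \approx 0.286$ and $H(x,y) \approx 0.708$ nats, matching the rounded values in the text.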
For $R_y = 0.50$, as shown in Figs 4.6.a-b and 4.7.a-b, the difference between the block
coding and sequential coding error exponents is very small for both sources x and y. More
interestingly, as shown in Fig 4.6.a, because the rate of source y is low, it is more likely
that a decoding error occurs due to the atypical behavior of source y. So as $R_x$ increases, it is
sometimes better to ignore source y and decode x individually. This is evident as the dotted
curve rises above the solid curves.
For $R_y = 0.71$, as shown in Figs 4.6.c-d and 4.7.c-d, since the rate for source y is high
enough, source y can be decoded individually with a positive error exponent, as shown in
Fig 4.7.c. But as the rate of source x increases, joint decoding gives a better error exponent.
When $R_x$ is very high, we then observe a saturation of the error exponent on y, as if source
Figure 4.5. Rate region for the Example 2 source (above the line $R_x + R_y = H(x,y)$); we
focus on the error exponent for source x at fixed encoder y rates $R_y = 0.50$ and $R_y = 0.71$.
x is known perfectly to the decoder! This is illustrated by the flat part of the solid curves
in Fig 4.7.c.
4.3 Proofs
In this section we provide the proofs of Theorems 3 and 4, which concern delay
constrained distributed source coding. As with the proofs of the point-to-point source
coding results in Propositions 3 and 4 in Section 2.3.3, we start by proving the case for
maximum likelihood decoding. The universal result follows.
4.3.1 ML decoding
ML decoding rule
First we explain the simple maximum likelihood decoding rule. Very similar to the ML
decoding rule for the point-to-point source coding case in (2.16), the decoding rule is to
simply pick the most likely source sequence pair under the joint distribution pxy .
Denote by $(\widehat{x}_1^n(n), \widehat{y}_1^n(n))$ the estimate of the source sequence pair $(x_1^n, y_1^n)$ at time $n$.
Figure 4.6. Error exponent plots for source x for fixed $R_y$ as $R_x$ varies.
$R_y = 0.50$: (a) Solid curve: $E_{SW,x}(R_x,R_y)$, dashed curve $E^{BLOCK}_{SW,x}(R_x,R_y)$ and dotted
curve: $E_x(R_x)$; notice that $E_{SW,x}(R_x,R_y) \le E^{BLOCK}_{SW,x}(R_x,R_y)$ but the difference is small.
(b) $10\log_{10}\big(E^{BLOCK}_{SW,x}(R_x,R_y)/E_{SW,x}(R_x,R_y)\big)$; this shows the difference is there at high rates.
$R_y = 0.71$: (c) Solid curve $E_{SW,x}(R_x,R_y)$, dashed curve $E^{BLOCK}_{SW,x}(R_x,R_y)$ and dotted
curve: $E_x(R_x)$; again $E_{SW,x}(R_x,R_y) \le E^{BLOCK}_{SW,x}(R_x,R_y)$ but the difference is extremely small.
(d) $10\log_{10}\big(E^{BLOCK}_{SW,x}(R_x,R_y)/E_{SW,x}(R_x,R_y)\big)$; this shows the difference is there at intermediate low rates.
Figure 4.7. Error exponent plots for source y for fixed $R_y$ as $R_x$ varies.
$R_y = 0.50$: (a) Solid curve: $E_{SW,y}(R_x,R_y)$ and dashed curve $E^{BLOCK}_{SW,y}(R_x,R_y)$;
$E_{SW,y}(R_x,R_y) \le E^{BLOCK}_{SW,y}(R_x,R_y)$, and the difference is extremely small. $E_y(R_y)$ is 0
because $R_y = 0.50 < H(y)$.
(b) $10\log_{10}\big(E^{BLOCK}_{SW,y}(R_x,R_y)/E_{SW,y}(R_x,R_y)\big)$; this shows the two exponents are not identical everywhere.
$R_y = 0.71$: (c) Solid curve: $E_{SW,y}(R_x,R_y)$, dashed curve $E^{BLOCK}_{SW,y}(R_x,R_y)$ with
$E_{SW,y}(R_x,R_y) \le E^{BLOCK}_{SW,y}(R_x,R_y)$, and $E_y(R_y)$ constant, shown as a dotted line.
(d) $10\log_{10}\big(E^{BLOCK}_{SW,y}(R_x,R_y)/E_{SW,y}(R_x,R_y)\big)$; notice how the gap goes to infinity when we leave the Slepian-Wolf region.
$$(\widehat{x}_1^n(n), \widehat{y}_1^n(n)) = \arg\max_{\tilde{x}_1^n \in \mathcal{B}_x(x_1^n),\ \tilde{y}_1^n \in \mathcal{B}_y(y_1^n)} p_{xy}(\tilde{x}_1^n, \tilde{y}_1^n) = \arg\max_{\tilde{x}_1^n \in \mathcal{B}_x(x_1^n),\ \tilde{y}_1^n \in \mathcal{B}_y(y_1^n)} \prod_{i=1}^n p_{xy}(\tilde{x}_i, \tilde{y}_i) \tag{4.20}$$

At time $n$, the decoder simply picks the sequence pair $\widehat{x}_1^n(n)$ and $\widehat{y}_1^n(n)$ with the highest
likelihood among the pairs in the same bins as the true sequences $x_1^n$ and $y_1^n$. Now the
estimate of source symbol $n-\Delta$ is simply the $(n-\Delta)$th symbol of $\widehat{x}_1^n(n)$ and $\widehat{y}_1^n(n)$, denoted
by $(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n))$.
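To make the rule concrete, here is a toy block-level simulation, entirely our own construction (uniform random binning plus an exhaustive ML search over the candidate bins, feasible only for tiny n):

```python
import itertools
import math
import random

# Symmetric joint pmf of Example 1, used as the toy source.
P = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def random_bins(n, rate, rng):
    """Assign every length-n binary sequence a uniform bin of floor(n*rate) bits."""
    nbits = math.floor(n * rate)
    return {s: rng.randrange(2 ** nbits)
            for s in itertools.product((0, 1), repeat=n)}

def ml_decode(bin_x, bin_y, bx, by, n):
    """Pick the most likely pair among candidates whose bins match, as in (4.20)."""
    best, best_lik = None, -1.0
    for xs in itertools.product((0, 1), repeat=n):
        if bin_x[xs] != bx:
            continue
        for ys in itertools.product((0, 1), repeat=n):
            if bin_y[ys] != by:
                continue
            lik = math.prod(P[(a, b)] for a, b in zip(xs, ys))
            if lik > best_lik:
                best, best_lik = (xs, ys), lik
    return best

def error_rate(n, Rx, Ry, trials=150, seed=0):
    """Empirical pair-error rate of the toy scheme."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        bin_x, bin_y = random_bins(n, Rx, rng), random_bins(n, Ry, rng)
        pairs = rng.choices(list(P), weights=list(P.values()), k=n)
        xs, ys = tuple(a for a, _ in pairs), tuple(b for _, b in pairs)
        errors += ml_decode(bin_x, bin_y, bin_x[xs], bin_y[ys], n) != (xs, ys)
    return errors / trials
```

As expected, decoding is hopeless when both rates are far below the Slepian-Wolf bounds (the bins are too coarse) and improves sharply once the rates are generous.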
Details of the proof of Theorem 3
In Theorems 3 and 4 three error events are considered: (i) $[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}]$, (ii)
$[\widehat{y}_{n-\Delta}(n) \neq y_{n-\Delta}]$, and (iii) $[(\widehat{x}_{n-\Delta}(n), \widehat{y}_{n-\Delta}(n)) \neq (x_{n-\Delta}, y_{n-\Delta})]$. We develop the error
exponent for case (i). The error exponent for case (ii) follows from a similar derivation, and
that of case (iii) is the minimum of the exponents of cases (i) and (ii) by the simple union
bound argument leading to (4.2).

To lead to the decoding error $[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}]$ there must be some spurious source
pair $(\tilde{x}_1^n, \tilde{y}_1^n)$ that satisfies three conditions: (i) $\tilde{x}_1^n \in \mathcal{B}_x(x_1^n)$ and $\tilde{y}_1^n \in \mathcal{B}_y(y_1^n)$, (ii) it must be
at least as likely as the true pair, $p_{xy}(\tilde{x}_1^n, \tilde{y}_1^n) \ge p_{xy}(x_1^n, y_1^n)$, and (iii) $\tilde{x}_l \neq x_l$ for some $l \le n-\Delta$.
The error probability is

$$\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}] \le \Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$$
$$= \sum_{x_1^n, y_1^n} \Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta} \mid x_1^n, y_1^n]\, p_{xy}(x_1^n, y_1^n)$$
$$\le \sum_{x_1^n, y_1^n} p_{xy}(x_1^n, y_1^n) \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} \Pr\big[\exists\, (\tilde{x}_1^n, \tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,x_1^n,y_1^n)\ \text{s.t. } p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n) \ge p_{xy}(x_1^n,y_1^n)\big] \tag{4.21}$$
$$= \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1}\ \sum_{x_1^n,y_1^n} p_{xy}(x_1^n,y_1^n)\, \Pr\big[\exists\, (\tilde{x}_1^n, \tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,x_1^n,y_1^n)\ \text{s.t. } p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n) \ge p_{xy}(x_1^n,y_1^n)\big]$$
$$= \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} p_n(l,k). \tag{4.22}$$
In (4.21) we decompose the error event into a number of mutually exclusive events by
partitioning all spurious source pairs $(\tilde{x}_1^n, \tilde{y}_1^n)$ into sets $\mathcal{F}_n(l, k, x_1^n, y_1^n)$, defined by the
times $l$ and $k$ at which $\tilde{x}_1^n$ and $\tilde{y}_1^n$ diverge from the realized source sequences. The set
$\mathcal{F}_n(l, k, x_1^n, y_1^n)$ is defined as
$$\mathcal{F}_n(l,k,x_1^n,y_1^n) = \{(\tilde{x}_1^n,\tilde{y}_1^n) \in \mathcal{X}^n\times\mathcal{Y}^n\ \text{s.t. } \tilde{x}_1^{l-1} = x_1^{l-1},\ \tilde{x}_l \neq x_l,\ \tilde{y}_1^{k-1} = y_1^{k-1},\ \tilde{y}_k \neq y_k\}. \tag{4.23}$$
In contrast to streaming point-to-point or side-information coding (cf. (4.23) with (2.20)),
the partition is now doubly-indexed. To find the dominant error event, we must search over
both indices. Having two dimensions to search over results in an extra minimization when
calculating the error exponent (and leads to the infimum over γ in Theorem 3).
Finally, to get (4.22) we define $p_n(l,k)$ as

$$p_n(l,k) = \sum_{x_1^n,y_1^n} p_{xy}(x_1^n,y_1^n) \times \Pr\big[\exists\, (\tilde{x}_1^n,\tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,x_1^n,y_1^n)\ \text{s.t. } p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n) \ge p_{xy}(x_1^n,y_1^n)\big].$$
The following lemma provides an upper bound on $p_n(l,k)$:

Lemma 9

$$p_n(l,k) \le 4\times 2^{-(n-l+1)E_x(R_x,R_y,\frac{k-l}{n-l+1})} \quad \text{if } l \le k,$$
$$p_n(l,k) \le 4\times 2^{-(n-k+1)E_y(R_x,R_y,\frac{l-k}{n-k+1})} \quad \text{if } l \ge k, \tag{4.24}$$

where $E_x(R_x,R_y,\gamma)$ and $E_y(R_x,R_y,\gamma)$ are defined in (4.8) and (4.9) respectively. Notice
that $l, k \le n+1$; for $l \le k$, $\frac{k-l}{n-l+1} \in [0,1]$ serves as $\gamma$ in the error exponent $E_x(R_x,R_y,\gamma)$.
Similarly for $l \ge k$.
Proof: The proof is quite similar to the Chernoff bound [8] argument in the proof of
Proposition 3 for the point-to-point source coding problem. We put the details of the proof
in Appendix F.1. □
We use Lemma 9 together with (4.22) to bound $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$, which is an upper
bound on $\Pr[\widehat{x}_{n-\Delta}(n) \neq x_{n-\Delta}]$, in two distinct cases. The first, simpler case is when
$\inf_{\gamma\in[0,1]} E_y(R_x,R_y,\gamma) > \inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma)$. To bound $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$ in this
case, we split the sum over the $p_n(l,k)$ into two terms, as visualized in Fig 4.8. There are
$(n+1)\times(n-\Delta)$ such events to account for (those inside the box). The probabilities of the
events within each oval are summed together to give an upper bound on $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$.
We add extra probabilities outside of the box but within the ovals to make the summation
symmetric and thus simpler. Those extra error events do not impact the error exponent because
$\inf_{\gamma\in[0,1]} E_y(R_x,R_y,\gamma) \ge \inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma)$. The possible dominant error events
are highlighted in Figure 4.8. Thus,
$$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} p_n(l,k) + \sum_{k=1}^{n-\Delta}\sum_{l=k}^{n+1} p_n(l,k) \tag{4.25}$$
$$\le 4\times\sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} 2^{-(n-l+1)\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)} + 4\times\sum_{k=1}^{n-\Delta}\sum_{l=k}^{n+1} 2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \tag{4.26}$$
$$= 4\times\sum_{l=1}^{n-\Delta}(n-l+2)2^{-(n-l+1)\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)} + 4\times\sum_{k=1}^{n-\Delta}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)}$$
$$\le 8\sum_{l=1}^{n-\Delta}(n-l+2)2^{-(n-l+1)\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)} \tag{4.27}$$
$$\le \sum_{l=1}^{n-\Delta} C_1 2^{-(n-l+2)[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]} \tag{4.28}$$
$$\le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]} \tag{4.29}$$

(4.25) follows directly from (4.22): in the first term $l \le k$, in the second term $l \ge k$.
In (4.26) we use Lemma 9. In (4.27) we use the assumption that $\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) >
\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$. In (4.28) the $\epsilon > 0$ results from incorporating the polynomial into
the exponent, and can be chosen as small as desired. Combining terms and summing
out the decaying exponential yields the bound (4.29).
The second, more involved case is when
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) < \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$. To bound $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$, we
could use the same bounding technique used in the first case. This gives the error exponent
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)$, which is generally smaller than what we can get by dividing the
error events in a new scheme as shown in Figure 4.9. In this situation we split (4.22) into
three terms, as visualized in Fig 4.9. Just as in the first case shown in Fig 4.8, there are
$(n+1)\times(n-\Delta)$ such events to account for (those inside the box). The error events are
partitioned into 3 regions. Regions 2 and 3 are separated by $k^*(l)$, drawn as a dotted line. In
region 3, we add extra probabilities outside of the box but within the ovals to make the
summation simpler. Those extra error events do not affect the error exponent, as shown in
Figure 4.8. Two-dimensional plot of the error probabilities $p_n(l,k)$ ($l$: index at which $\tilde{x}_1^n$
and $x_1^n$ first diverge; $k$: index at which $\tilde{y}_1^n$ and $y_1^n$ first diverge), corresponding to error
events $(l,k)$, contributing to $\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$ in the situation where
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) \ge \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$.
Figure 4.9. Two-dimensional plot of the error probabilities $p_n(l,k)$ (axes as in Figure 4.8;
Regions 1-3, with Regions 2 and 3 separated by the boundary $k^*(l)$ and Region 3 capped by
the horizontal line $k = k^*(n-\Delta)-1$), corresponding to error events $(l,k)$, contributing to
$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}]$ in the situation where
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) < \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$.
the proof. The possible dominant error events are highlighted in Fig 4.9. Thus,
$$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} p_n(l,k) + \sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} p_n(l,k) + \sum_{l=1}^{n-\Delta}\sum_{k=1}^{k^*(l)-1} p_n(l,k) \tag{4.30}$$

where $\sum_{k=1}^{0} p_n(l,k) = 0$. The lower boundary of Region 2 is $k^*(l) \ge 1$, a function of $n$ and $l$:

$$k^*(l) = \max\Big\{1,\ n+1 - \Big\lceil \frac{\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)}{\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \Big\rceil (n+1-l)\Big\} = \max\{1,\ n+1-G(n+1-l)\} \tag{4.31}$$
where we use G to denote the ceiling of the ratio of exponents. Note that when
$\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma) > \inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$, then $G = 1$ and Region 2 of Fig. 4.9
disappears; in other words, the middle term of (4.30) equals zero. This is the first case
considered. We now consider the cases when $G \ge 2$ (because of the ceiling function, G is a
positive integer).
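The boundary (4.31) is simple enough to tabulate. The helper below is our own sketch (`Ex_inf` and `Ey_inf` stand for the two infimum exponents):

```python
import math

def k_star(l, n, Ex_inf, Ey_inf):
    """Region-2 lower boundary (4.31): k*(l) = max(1, n+1 - G(n+1-l)),
    where G is the ceiling of the ratio of the two infimum exponents."""
    G = math.ceil(Ex_inf / Ey_inf)
    return max(1, n + 1 - G * (n + 1 - l))
```

When the y-exponent dominates, $G = 1$ and $k^*(l) = \max(1, l) = l$, so the middle sum in (4.30), which runs from $k = k^*(l)$ to $l-1$, is empty, recovering the first case.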
The first term of (4.30), i.e., Region 1 in Fig. 4.9 where $l \le k$, is bounded in the same
way as the first term of (4.25), giving

$$\sum_{l=1}^{n-\Delta}\sum_{k=l}^{n+1} p_n(l,k) \le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]}. \tag{4.32}$$
In Fig. 4.9, Region 2 is upper bounded by the 45-degree line and lower bounded by
$k^*(l)$. The second term of (4.30), corresponding to this region where $l \ge k$, satisfies

$$\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} p_n(l,k) \le 4\times\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} 2^{-(n-k+1)E_y(R_x,R_y,\frac{l-k}{n-k+1})}$$
$$= 4\times\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} 2^{-(n-l+1)\frac{n-k+1}{n-l+1}E_y(R_x,R_y,\frac{l-k}{n-k+1})} \tag{4.33}$$
$$\le 4\times\sum_{l=1}^{n-\Delta}\sum_{k=k^*(l)}^{l-1} 2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)} \tag{4.34}$$
$$= 4\times\sum_{l=1}^{n-\Delta}(l-k^*(l))2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)} \tag{4.35}$$

In (4.33) we note that $l \ge k$, so we can define $\gamma = \frac{l-k}{n-k+1}$ as in (4.34); then $\frac{n-k+1}{n-l+1} = \frac{1}{1-\gamma}$.
The third term of (4.30), i.e., the intersection of Region 3 and the “box” in Fig. 4.9
where $l \ge k$, can be bounded as

$$\sum_{l=1}^{n-\Delta}\sum_{k=1}^{k^*(l)-1} p_n(l,k) \le \sum_{l=1}^{n+1}\sum_{k=1}^{\min\{l,\,k^*(n-\Delta)-1\}} p_n(l,k) \tag{4.36}$$
$$= \sum_{k=1}^{k^*(n-\Delta)-1}\sum_{l=k}^{n+1} p_n(l,k) \tag{4.37}$$
$$\le 4\times\sum_{k=1}^{k^*(n-\Delta)-1}\sum_{l=k}^{n+1} 2^{-(n-k+1)E_y(R_x,R_y,\frac{l-k}{n-k+1})}$$
$$\le 4\times\sum_{k=1}^{k^*(n-\Delta)-1}\sum_{l=k}^{n+1} 2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)}$$
$$\le 4\times\sum_{k=1}^{k^*(n-\Delta)-1}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \tag{4.38}$$

In (4.36) we note that $l \le n-\Delta$, thus $k^*(n-\Delta)-1 \ge k^*(l)-1$; also $l \ge 1$, so $l \ge k^*(l)-1$.
This can be visualized in Fig 4.9 as extending the summation from the intersection of the
“box” and Region 3 to the whole region under the diagonal line and the horizontal line
$k = k^*(n-\Delta)-1$. In (4.37) we simply switch the order of the summation.
Finally, when $G \ge 2$, we substitute (4.32), (4.35), and (4.38) into (4.30) to give

$$\Pr[\widehat{x}_1^{n-\Delta}(n) \neq x_1^{n-\Delta}] \le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]}$$
$$\qquad + 4\times\sum_{l=1}^{n-\Delta}(l-k^*(l))2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)} \tag{4.39}$$
$$\qquad + 4\times\sum_{k=1}^{k^*(n-\Delta)-1}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)}$$
$$\le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]}$$
$$\qquad + 4\times\sum_{l=1}^{n-\Delta}(l-n-1+G(n+1-l))2^{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)}$$
$$\qquad + 4\times\sum_{k=1}^{n+1-G(\Delta+1)}(n-k+2)2^{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)} \tag{4.40}$$
$$\le C_2 2^{-\Delta[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\epsilon]} + (G-1)C_3 2^{-\Delta[\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)-\epsilon]} + C_4 2^{-[\Delta G\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)-\epsilon]}$$
$$\le C_5 2^{-\Delta\big[\min\{\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\}-\epsilon\big]}. \tag{4.41}$$

To get (4.40), we use the fact that $k^*(l) \ge n+1-G(n+1-l)$ from the definition of $k^*(l)$
in (4.31) to upper bound the second term. We exploit the definition of G to convert the
exponent in the third term to $\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)$. Finally, to get (4.41) we gather the
constants together, sum out over the decaying exponentials, and are limited by the smaller
of the two exponents.
Note: in the proof of Theorem 3, we regularly double count error events or add
small extra probabilities to make the summations simpler. It should be clear that
the error exponent is not affected. ■
4.3.2 Universal Decoding
Universal decoding rule
As discussed in Section 2.3.3, we do not use a pairwise minimum joint-entropy decoder
because the polynomial term in n would multiply the exponential decay in ∆. Analogous
to the sequential decoder used there, we use a “weighted suffix entropy” decoder. The
decoding starts by first identifying candidate sequence pairs as those that agree with the
encoding bit streams up to time $n$, i.e., $\tilde{x}_1^n \in \mathcal{B}_x(x_1^n)$, $\tilde{y}_1^n \in \mathcal{B}_y(y_1^n)$. For any one of the
$|\mathcal{B}_x(x_1^n)||\mathcal{B}_y(y_1^n)|$ sequence pairs in the candidate set, i.e., $(\tilde{x}_1^n, \tilde{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n)$, we
compute $(n+1)\times(n+1)$ weighted entropies:

$$H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = H(\tilde{x}_l^n, \tilde{y}_l^n), \quad l = k$$
$$H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = \frac{k-l}{n+1-l}H(\tilde{x}_l^{k-1}|\tilde{y}_l^{k-1}) + \frac{n+1-k}{n+1-l}H(\tilde{x}_k^n, \tilde{y}_k^n), \quad l < k$$
$$H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = \frac{l-k}{n+1-k}H(\tilde{y}_k^{l-1}|\tilde{x}_k^{l-1}) + \frac{n+1-l}{n+1-k}H(\tilde{x}_l^n, \tilde{y}_l^n), \quad l > k.$$
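The three cases above can be computed directly from empirical (plug-in) entropies of the indicated segments. The sketch below is our own helper (names and the plug-in estimator are our assumptions; indices are 1-based as in the text):

```python
import math
from collections import Counter

def _H(samples):
    """Empirical entropy in bits of a list of symbols."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def _H_cond(a, b):
    """Empirical conditional entropy H(a|b) = H(a,b) - H(b)."""
    return _H(list(zip(a, b))) - _H(list(b))

def suffix_entropy(l, k, xs, ys):
    """Weighted suffix entropy H_S(l, k, xs, ys), following the three cases."""
    n = len(xs)
    if l == k:
        return _H(list(zip(xs[l - 1:], ys[l - 1:])))
    if l < k:
        w = (k - l) / (n + 1 - l)
        return w * _H_cond(xs[l - 1:k - 1], ys[l - 1:k - 1]) + \
            (1 - w) * _H(list(zip(xs[k - 1:], ys[k - 1:])))
    w = (l - k) / (n + 1 - k)
    return w * _H_cond(ys[k - 1:l - 1], xs[k - 1:l - 1]) + \
        (1 - w) * _H(list(zip(xs[l - 1:], ys[l - 1:])))
```

A constant pair has zero weighted suffix entropy, while an alternating pair decoded jointly has one bit per symbol, as one would expect from the empirical entropies involved.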
We define the score of $(\tilde{x}_1^n, \tilde{y}_1^n)$ as the pair of integers $(i_x(\tilde{x}_1^n,\tilde{y}_1^n), i_y(\tilde{x}_1^n,\tilde{y}_1^n))$ s.t.

$$i_x(\tilde{x}_1^n,\tilde{y}_1^n) = \max\{i : H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) < H_S(l,k,\bar{x}_1^n,\bar{y}_1^n)\ \forall k = 1,2,...,n+1,\ \forall l = 1,2,...,i,$$
$$\qquad\qquad \forall (\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n)\} \tag{4.42}$$

$$i_y(\tilde{x}_1^n,\tilde{y}_1^n) = \max\{i : H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) < H_S(l,k,\bar{x}_1^n,\bar{y}_1^n)\ \forall l = 1,2,...,n+1,\ \forall k = 1,2,...,i,$$
$$\qquad\qquad \forall (\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n) \cap \mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n)\} \tag{4.43}$$

While $\mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n)$ is the same set as defined in (4.23), we repeat the definition here for
convenience:

$$\mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n) = \{(\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{X}^n\times\mathcal{Y}^n\ \text{s.t. } \bar{x}_1^{l-1} = \tilde{x}_1^{l-1},\ \bar{x}_l \neq \tilde{x}_l,\ \bar{y}_1^{k-1} = \tilde{y}_1^{k-1},\ \bar{y}_k \neq \tilde{y}_k\}.$$
The definition of $(i_x(\tilde{x}_1^n,\tilde{y}_1^n), i_y(\tilde{x}_1^n,\tilde{y}_1^n))$ can be visualized by the following procedure.
As shown in Fig. 4.10, for all $1 \le l, k \le n+1$, if there exists $(\bar{x}_1^n,\bar{y}_1^n) \in \mathcal{F}_n(l,k,\tilde{x}_1^n,\tilde{y}_1^n) \cap
\mathcal{B}_x(x_1^n)\times\mathcal{B}_y(y_1^n)$ s.t. $H_S(l,k,\tilde{x}_1^n,\tilde{y}_1^n) \ge H_S(l,k,\bar{x}_1^n,\bar{y}_1^n)$, then we mark $(l,k)$ on the plane
as shown in Fig. 4.10. Eventually we pick the maximum integer which is smaller than all
marked x-coordinates as $i_x(\tilde{x}_1^n,\tilde{y}_1^n)$ and the maximum integer which is smaller than all
marked y-coordinates as $i_y(\tilde{x}_1^n,\tilde{y}_1^n)$. The score of $(\tilde{x}_1^n,\tilde{y}_1^n)$ tells us the first branching
point (in either x or y) at which a “better” sequence pair, one with a smaller weighted
entropy, exists.
Define the set of winners as the sequences (not sequence pairs) with the maximum score:
$$\mathcal W_x^n = \big\{\bar x_1^n \in B_x(x_1^n) : \exists\, \bar y_1^n \in B_y(y_1^n) \text{ s.t. } i_x(\bar x_1^n, \bar y_1^n) \ge i_x(\tilde x_1^n, \tilde y_1^n),\ \forall (\tilde x_1^n, \tilde y_1^n) \in B_x(x_1^n)\times B_y(y_1^n)\big\}$$
$$\mathcal W_y^n = \big\{\bar y_1^n \in B_y(y_1^n) : \exists\, \bar x_1^n \in B_x(x_1^n) \text{ s.t. } i_y(\bar x_1^n, \bar y_1^n) \ge i_y(\tilde x_1^n, \tilde y_1^n),\ \forall (\tilde x_1^n, \tilde y_1^n) \in B_x(x_1^n)\times B_y(y_1^n)\big\}$$
Figure 4.10. 2D interpretation of the score $(i_x(\bar x_1^n,\bar y_1^n),\, i_y(\bar x_1^n,\bar y_1^n))$ of a sequence pair $(\bar x_1^n, \bar y_1^n)$. If there exists a sequence pair in $F_n(l,k,\bar x_1^n,\bar y_1^n)$ with smaller or equal weighted entropy, then $(l,k)$ is marked with a solid dot. The score $i_x(\bar x_1^n,\bar y_1^n)$ is the largest integer which is smaller than all the $x$-coordinates of the marked points; similarly for $i_y(\bar x_1^n,\bar y_1^n)$.
Then arbitrarily pick one sequence from $\mathcal W_x^n$ and one from $\mathcal W_y^n$ as the decision $(\hat x_1^n(n), \hat y_1^n(n))$ at decision time $n$.
Details of the proof of Theorem 4

Using the above universal decoding rule, we give an upper bound on the decoding error $\Pr[\hat x_{n-\Delta}(n) \ne x_{n-\Delta}]$, and hence derive an achievable error exponent. We bound the probability that there exists a sequence pair in $F_n(l,k,x_1^n,y_1^n) \cap \big(B_x(x_1^n)\times B_y(y_1^n)\big)$ with a smaller weighted suffix entropy as:
$$p_n(l,k) = \sum_{x_1^n}\sum_{y_1^n} p_{xy}(x_1^n, y_1^n)\,\Pr\big[\exists (\tilde x_1^n, \tilde y_1^n) \in \big(B_x(x_1^n)\times B_y(y_1^n)\big)\cap F_n(l,k,x_1^n,y_1^n) \text{ s.t. } H_S(l,k,\tilde x_1^n,\tilde y_1^n) \le H_S(l,k,x_1^n,y_1^n)\big]$$
Note that the $p_n(l,k)$ here differs from the $p_n(l,k)$ defined for ML decoding by replacing the condition $p_{xy}(x_1^n, y_1^n) \le p_{xy}(\tilde x_1^n, \tilde y_1^n)$ with $H_S(l,k,\tilde x_1^n,\tilde y_1^n) \le H_S(l,k,x_1^n,y_1^n)$.
The following lemma, analogous to (4.22) for ML decoding, tells us that the weighted suffix entropy decoding rule is a good one.

Lemma 10 Upper bound on the symbol-wise decoding error $\Pr[\hat x_{n-\Delta}(n) \ne x_{n-\Delta}]$:
$$\Pr[\hat x_{n-\Delta}(n) \ne x_{n-\Delta}] \le \Pr[\hat x_1^{n-\Delta}(n) \ne x_1^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} p_n(l,k)$$
Proof: The first inequality is trivial, so we only need to show the second. According to the decoding rule, $\hat x_1^{n-\Delta} \ne x_1^{n-\Delta}$ implies that there exists a sequence $\bar x_1^n \in \mathcal W_x^n$ s.t. $\bar x_1^{n-\Delta} \ne x_1^{n-\Delta}$. This means that there exists a sequence $\bar y_1^n \in B_y(y_1^n)$ s.t. $i_x(\bar x_1^n, \bar y_1^n) \ge i_x(x_1^n, y_1^n)$. Suppose that $(\bar x_1^n, \bar y_1^n) \in F_n(l,k,x_1^n,y_1^n)$; then $l \le n-\Delta$ because $\bar x_1^{n-\Delta} \ne x_1^{n-\Delta}$. By the definition of $i_x$, we know that $H_S(l,k,\bar x_1^n,\bar y_1^n) \le H_S(l,k,x_1^n,y_1^n)$. Using the union bound we get the desired inequality. $\square$
We only need to bound each individual error probability $p_n(l,k)$ to finish the proof.

Lemma 11 Upper bound on $p_n(l,k)$, $l \le k$: $\forall \varepsilon > 0$, $\exists K_1 < \infty$, s.t.
$$p_n(l,k) \le 2^{-(n-l+1)[E_x(R_x, R_y, \lambda) - \varepsilon]}$$
where $\lambda = (k-l)/(n-l+1) \in [0,1]$.

Proof: The proof bears some similarity to that of the universal decoding result for the point-to-point source coding problem in Proposition 4. The details of the proof are in Appendix F.2. $\square$

A similar derivation yields a bound on $p_n(l,k)$ for $l \ge k$. Combining Lemmas 10 and 11, and then following the same derivation as for ML decoding, yields Theorem 4. $\blacksquare$
4.4 Discussions
We derived achievable delay constrained error exponents for distributed source coding. The key technique is "divide and conquer": we divide the delay constrained error event into individual error events. For each individual error event, we apply the classical block coding analysis to get an individual error exponent for that specific event. Then, by a union bound argument, we determine the dominant error event. This "divide and conquer"
scheme works not only for the Slepian-Wolf source coding in this chapter but also for other information-theoretic problems illustrated in [13, 14]. While the encoder is the same sequential random binning shown in Section 2.3.3, the decoding is quite complicated due to the nature of the problem. As in Section 2.3.3, we derived the error exponent for both ML and universal decoding. To show the equivalence of the two error exponents, we apply Lagrange duality and tilted distributions in Appendix G. There are several other important results in Appendix G; the derivations are extremely laborious, but the reward is quite fulfilling. These results essentially follow the I-projection theory in the statistics literature [30] and were recently discussed in the context of channel coding error exponents [9].
We only showed that some positive delay constrained error exponent is achievable as long as the rate pair is in the interior of the classical Slepian-Wolf region in Figure A.3. In general, the delay constrained error exponent for distributed source coding is smaller than or equal to its block coding counterpart. This is different from what we observed in Chapters 2 and 3. To further understand the difference, we would need a tight upper bound on the error exponent. A simple example shows that the achievable exponent derived here is not tight in general: consider the special case where the two sources are independent. The random coding scheme in this chapter then gives the standard random coding error exponent for each single source, as shown in (2.13) in Section 2.3.3, and this error exponent is strictly smaller than that of the optimal coding scheme in Chapter 2, as shown in Proposition 8 in Section 2.5. The optimal error exponent problem could be an extremely difficult one. It would be interesting to give an upper bound on it; however, we do not have any general non-trivial upper bounds either. In Chapter 5, we will study an upper bound on the delay constrained error exponent for the source coding with decoder side-information problem. This is a special case of what we study in this chapter, since the decoder side-information can be treated as an encoder with rate higher than the logarithm of the alphabet size. However, this upper bound is not tight in general. These open questions are left for future research.
Chapter 5
Lossless Source Coding with
Decoder Side-Information
In this chapter, we study the delay constrained source coding with decoder side-
information problem. As a special case of the Slepian-Wolf problem studied in Chapter 4,
the sequential random binning scheme’s delay performance is derived as a corollary of those
results in Chapters 2 and 4. An upper bound on the delay constrained error exponent is
derived by using the feed-forward decoding scheme from the channel coding literature. The
results in this chapter are also summarized in our paper [18], especially the implications of
these results on compression of encrypted data [50].
5.1 Delay constrained source coding with decoder side-information
In [79], Slepian and Wolf studied distributed lossless source coding problems. One of the problems studied there, source coding with decoder-only side-information, is shown in Figure 5.1, where the encoder has access to the source $x$ only, but not the side-information $y$. It is clear that if the side-information $y$ is also available to the encoder, then to achieve arbitrarily small decoding error with fixed-length block coding, rate at the conditional entropy $H(x|y)$ is necessary and sufficient. Somewhat surprisingly, Slepian and Wolf showed that even without the encoder side-information, rate at the conditional entropy $H(x|y)$ is still
sufficient.

Figure 5.1. Lossless source coding with decoder side-information: the encoder observes the source $x_1, x_2, \dots$ and the decoder observes the side-information $y_1, y_2, \dots$, where $(x_i, y_i) \sim p_{xy}$.
We can also factor the joint probability to treat the source as a random variable $x$ and consider the side-information $y$ as the output of a discrete memoryless channel (DMC) $p_{y|x}$ with $x$ as input. This model is shown in Figure 5.2.
Figure 5.2. Lossless source coding with side-information: the side-information is the output of a DMC with the source $x$ as input.
We first review the error exponent result for source coding with decoder side-information in the fixed-length block coding setup in Section A.4 in the appendix. Then we formally define the delay constrained source coding with side-information problem. In Section 5.1.2 we give a lower bound and an upper bound on the error exponent for this problem. These two bounds do not match in general, which leaves room for future exploration.
5.1.1 Delay Constrained Source Coding with Side-Information
Rather than being known in advance, the source symbols stream into the encoder in real time. We assume that the source generates one pair of source symbols $(x, y)$ per second from the finite alphabet $\mathcal X \times \mathcal Y$. The $j$th source symbol $x_j$ is not known at the encoder until time $j$, and similarly for $y_j$ at the decoder. Rate $R$ operation means that the encoder sends 1 binary bit to the decoder every $\frac{1}{R}$ seconds. For obvious reasons (cf. Propositions 1 and 2), we focus on cases with $H(x|y) < R < \log_2|\mathcal X|$.
Figure 5.3. Delay constrained source coding with side-information: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
Definition 8 A sequential encoder-decoder pair $(\mathcal E, \mathcal D)$ is a sequence of maps $\{\mathcal E_j\}$, $j = 1, 2, \dots$ and $\{\mathcal D_j\}$, $j = 1, 2, \dots$. The outputs of $\mathcal E_j$ are the outputs of the encoder $\mathcal E$ from time $j-1$ to $j$:
$$\mathcal E_j : \mathcal X^j \longrightarrow \{0,1\}^{\lfloor jR\rfloor - \lfloor (j-1)R\rfloor}, \qquad \mathcal E_j(x_1^j) = b_{\lfloor (j-1)R\rfloor+1}^{\lfloor jR\rfloor}$$
The outputs of $\mathcal D_j$ are the decoding decisions on all the arrived source symbols by time $j$, based on the received binary bits up to time $j$ as well as the side-information:
$$\mathcal D_j : \{0,1\}^{\lfloor jR\rfloor} \times \mathcal Y^j \longrightarrow \mathcal X, \qquad \mathcal D_j(b_1^{\lfloor jR\rfloor}, y_1^j) = \hat x_{j-\Delta}(j)$$
where $\hat x_{j-\Delta}(j)$ is the estimate of $x_{j-\Delta}$ at time $j$, so the end-to-end delay is $\Delta$ seconds. A rate $R = \frac{1}{2}$ sequential source coding system is illustrated in Figure 5.3.
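The bit-budget bookkeeping in Definition 8 ($\lfloor jR\rfloor - \lfloor (j-1)R\rfloor$ new bits at time $j$, so exactly $\lfloor jR\rfloor$ bits sent after $j$ symbols) can be sketched as follows. The parity generation itself is abstracted by placeholder coin flips, since the actual binning rule is that of Definition 3; the function names are our choices:

```python
import math
import random

def bits_at_step(j, R):
    """Number of encoder output bits emitted between time j-1 and time j."""
    return math.floor(j * R) - math.floor((j - 1) * R)

def sequential_encode(source, R, seed=0):
    """Causal rate-R encoder: after j symbols, exactly floor(jR) bits have
    been emitted.  Each batch may depend on the whole prefix x_1^j; here
    the parity bits are placeholder coin flips from shared randomness."""
    rng = random.Random(seed)            # stand-in for common randomness
    stream = []
    for j, _symbol in enumerate(source, start=1):
        stream.extend(rng.randint(0, 1) for _ in range(bits_at_step(j, R)))
    return stream
```

For $R = \frac{1}{2}$ the encoder emits one bit every other second, matching Figure 5.3.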
For sequential source coding, it is important to study the symbol-by-symbol decoding error probability instead of the block coding error probability.

Definition 9 A family of rate $R$ sequential source codes $(\mathcal E, \mathcal D^\Delta)$ is said to achieve delay-reliability $E_{si}(R)$ if and only if for all $\varepsilon > 0$ there exists $K < \infty$, s.t. $\forall i, \Delta > 0$:
$$\Pr[x_i \ne \hat x_i(i+\Delta)] \le K 2^{-\Delta(E_{si}(R)-\varepsilon)}$$
Following this definition, we derive both a lower and an upper bound on the delay constrained error exponent for source coding with side-information.
5.1.2 Main results of Chapter 5: lower and upper bound on the error exponents

There are two parts to the main result. First, we give an achievable lower bound on the delay constrained error exponent $E_{si}(R)$. This part is realized by using the same sequential random binning scheme as in Definition 3, together with the maximum-likelihood decoder of Section 4.3.1 or the universal decoder of Section 4.3.2. The lower bound can be treated as a simple corollary of the more general theorems in the distributed source coding setup, Theorems 3 and 4. The second part is an upper bound on $E_{si}(R)$. We apply a modified version of the feed-forward decoder used by Pinsker [62] and recently clarified by Sahai [67]. This feed-forward decoder technique is a powerful tool in the upper bound analysis of delay constrained error exponents; using this technique, we derived an upper bound on the joint source-channel coding error exponent in [15].
A lower bound on $E_{si}(R)$

We state the relevant lower bound (achievability) result, which comes as a simple corollary of the more general result proved in Theorems 3 and 4.

Theorem 6 Delay constrained random source coding theorem: using a random sequential coding scheme and the ML decoding rule with side-information, for all $i, \Delta$:
$$\Pr[\hat x_i(i+\Delta) \ne x_i] \le K 2^{-\Delta E_{si}^{lower}(R)} \tag{5.1}$$
where $K$ is a constant and
$$E_{si}^{lower}(R) = E_{si,b}^{lower}(R) = \min_{q_{xy}}\big\{D(q_{xy}\|p_{xy}) + \max\{0,\ R - H(q_{x|y})\}\big\}$$
where $E_{si,b}^{lower}(R)$ is defined in Theorem 13. An equivalent expression is
$$E_{si}^{lower}(R) = E_{si,b}^{lower}(R) = \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
These two expressions can be shown to be equivalent following the same Lagrange multiplier argument as in [39] and Appendix G.

Similarly, for the universal decoding rule, for all $\varepsilon > 0$ there exists a finite constant $K$, s.t. for all $i$ and $\Delta$:
$$\Pr[\hat x_i(i+\Delta) \ne x_i] \le K 2^{-\Delta(E_{si}^{lower}(R)-\varepsilon)} \tag{5.2}$$
The encoder is the same random sequential encoding scheme described in Definition 3, which can be realized using an infinite constraint-length, time-varying random convolutional code. Common randomness between the encoder and the decoder is assumed. The decoder can use a maximum-likelihood rule or a minimum empirical joint entropy decoding rule, similar to those in the point-to-point source coding setup in Chapter 2. Details of the proof are in Section 5.5.
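The Gallager form of $E_{si}^{lower}(R)$ in Theorem 6 is a one-dimensional maximization over $\rho \in [0,1]$ and is easy to evaluate numerically. A grid-search sketch (the grid resolution and the dict representation of $p_{xy}$ are our choices):

```python
from math import log2

def E_si_lower(R, pxy, grid=2001):
    """max over rho in [0,1] of
       rho*R - log2 sum_y [sum_x p(x,y)^(1/(1+rho))]^(1+rho).
    pxy: dict mapping (x, y) -> probability."""
    ys = {y for (_, y) in pxy}
    best = 0.0
    for i in range(grid):
        rho = i / (grid - 1)
        e0 = log2(sum(
            sum(p ** (1.0 / (1 + rho)) for (x, yy), p in pxy.items() if yy == y)
            ** (1 + rho)
            for y in ys))
        best = max(best, rho * R - e0)
    return best
```

For independent side-information this collapses to the random coding exponent $E_r(R)$ of the source alone; e.g., for a uniform binary source with irrelevant side-information the exponent is $R - 1$ bits for $R > 1$ and $0$ for $R \le 1$.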
An upper bound on $E_{si}(R)$

We give an upper bound on the delay constrained error exponent for source coding with side-information as defined in Definition 9. This bound holds for any generic joint distribution $p_{xy}$. Some special cases will be discussed in Section 5.2.

Theorem 7 For the source coding with side-information problem in Figure 5.2, if the source is iid $\sim p_{xy}$ from a finite alphabet, then the error exponent $E_{si}(R)$ with fixed delay must satisfy $E_{si}(R) \le E_{si}^{upper}(R)$, where
$$E_{si}^{upper}(R) = \min\Big\{\inf_{q_{xy},\,\alpha\ge1:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_{xy}\|p_{xy}),\ \inf_{q_{xy},\,1\ge\alpha\ge0:\,H(q_{x|y})>(1+\alpha)R} \frac{1-\alpha}{\alpha}\,D(q_x\|p_x) + D(q_{xy}\|p_{xy})\Big\}$$
5.2 Two special cases

It is clear that the upper bound and the lower bound on the error exponent $E_{si}(R)$ are not identical. There seems to be a gap between the two bounds. But how big or how small can this gap be? In this section, we answer that question by examining two extreme cases.
5.2.1 Independent side-information
Consider the case where $x$ and $y$ are independent, i.e., there is effectively no side-information. Then clearly this problem of source coding with irrelevant "side-information" at the decoder is the same as the single-source coding problem discussed in Chapter 2. So the upper bound satisfies $E_{si}^{upper}(R) \ge E_s(R)$, where $E_s(R)$ is the delay constrained source coding error exponent defined in Theorem 1 for source $x$. So we only need to show that $E_{si}^{upper}(R) \le E_s(R)$ to establish the tightness of our upper bound for this extreme case where no side-information is present. The proof simply follows the following chain:
$$E_{si}^{upper}(R) \stackrel{(a)}{=} \min\Big\{\inf_{q_{xy},\,\alpha\ge1:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_{xy}\|p_{xy}),\ \inf_{q_{xy},\,1\ge\alpha\ge0:\,H(q_{x|y})>(1+\alpha)R} \frac{1-\alpha}{\alpha}\,D(q_x\|p_x) + D(q_{xy}\|p_{xy})\Big\}$$
$$\stackrel{(b)}{\le} \inf_{q_{xy},\,\alpha\ge0:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_{xy}\|p_{xy})$$
$$\stackrel{(c)}{\le} \inf_{q_x\times p_y,\,\alpha\ge1:\,H(q_{x|y})>(1+\alpha)R} \frac{1}{\alpha}\,D(q_x\times p_y\,\|\,p_x\times p_y)$$
$$\stackrel{(d)}{=} \inf_{q_x,\,\alpha\ge1:\,H(q_x)>(1+\alpha)R} \frac{1}{\alpha}\,D(q_x\|p_x) \stackrel{(e)}{=} E_s(R) \tag{5.3}$$
(a) is by definition. (b) holds because $D(q_{xy}\|p_{xy}) \ge D(q_x\|p_x)$. (c) is true because of the following two observations: first, $x$ and $y$ are independent, so $p_{xy} = p_x \times p_y$; second, we take the infimum over a subset of all $q_{xy}$, namely the distributions of the form $q_x \times p_y$, under which $x$ and $y$ are independent and the marginal $q_y = p_y$. (d) is true because under both $p$ and $q$, the marginals of $x$ and $y$ are independent with $y \sim p_y$. (e) is by definition.
The lower bound on the error exponent, $E_{si}^{lower}(R)$, has the following form for independent side-information:
$$E_{si}^{lower}(R) \stackrel{(a)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y\Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$\stackrel{(b)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y\Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\, p_y(y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$\stackrel{(c)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - \log\Big[\sum_y p_y(y)\Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$\stackrel{(d)}{=} \max_{\rho\in[0,1]}\Big\{\rho R - (1+\rho)\log\Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]\Big\} \stackrel{(e)}{=} E_r(R) \tag{5.4}$$
(a) is by definition. (b) holds because the side-information $y$ is independent of the source $x$. (c) and (d) are straightforward algebra. (e) is by the definition in (A.5).
From the above analysis, we see that the upper bound $E_{si}^{upper}(R)$ is tight in this case: it is achieved by the scheme of Chapter 2. Our lower bound coincides with the random coding error exponent for source $x$ alone from Section 2.3.3, which validates our results in Section 5.1.2. Comparing the upper bound in (5.3) and the lower bound in (5.4), we see that the lower bound is strictly below the upper bound, as shown in Proposition 8 in Chapter 2 and illustrated in Figure 2.3.
5.2.2 Delay constrained encryption of compressed data
The traditional view of encryption of a redundant source is in the block coding context, in which all the source symbols are compressed first and then encrypted¹. At the receiver end, decompression of the source comes after decryption. In keeping with the common theme of this thesis, we consider the end-to-end delay between the realization of the source and the decoding/decryption at the sink. The delay constrained compression-then-encryption system is illustrated in Figure 5.4. Without getting into the details of information-theoretic security, we briefly introduce the system. The compression and decompression parts are exactly the delay constrained source coding shown in Figure 2.1 in Chapter 2, where the source is iid $\sim p_s$ on a finite alphabet $\mathcal S$. The secret keys $y_1, \dots$ are iid Bernoulli(0.5) random variables, so the output of the encrypter $b \oplus y$ is independent of the output of the compressor $b$; hence the name "perfect secrecy". Here the $\oplus$ operator is sum mod two, or more generally, the addition operator in the finite field of size 2. Since the encryption and decryption are instantaneous, the overall system has a delay constrained error exponent $E_s(R)$ defined in Theorem 1, Chapter 2.

¹We consider type I Shannon-sense security (perfect secrecy) [73, 46].
Figure 5.4. Encryption of compressed streaming data with delay constraints: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
In the thought-provoking paper [50], Johnson shows that the same compression rate (the entropy rate of the source $s$) can be achieved by first encrypting the source and then compressing the encrypted data without knowing the key, and finally decompressing/decrypting the encoded data using the key as decoder side-information. Recently, practical systems have been built by implementing LDPC codes [36, 65] for source coding with side-information [71]. The delay constrained encryption-then-compression system is shown in Figure 5.5.
The source is iid $\sim p_s$ on a finite alphabet $\mathcal S$. To achieve perfect secrecy, the secret keys $y$ are iid uniform random variables on $\mathcal S$, and hence the encrypted data $x$ is also uniformly distributed on $\mathcal S$, where
$$x = s \oplus y \tag{5.5}$$
and the $\oplus$ operator is addition in the finite field of size $|\mathcal S|$. For a uniform source, no compression can be done. However, since $y$ is known to the decoder and $y$ is correlated with $x$, the source $x$ can be compressed to the conditional entropy $H(x|y)$, which is equal to $H(s)$ because of the relation among $x$, $y$, and $s$ in (5.5). The emphasis of this section is the fundamental information-theoretic model of the problem in Figure 5.5, as shown in Figure 5.6. A detailed discussion of the delay constrained encryption of compressed data problem is in [18].

Figure 5.5. Compression of encrypted streaming data with delay constraints: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
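As a numerical sanity check on the model in (5.5), the following sketch verifies that $x = s \oplus y$ is uniform (up to floating point) whenever the key $y$ is uniform, and that $H(x|y) = H(s)$; the alphabet size and the source pmf are arbitrary choices of ours:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

S = 3                                  # alphabet size |S|
ps = [0.7, 0.2, 0.1]                   # arbitrary source pmf on {0, 1, 2}
# joint pmf of (x, y) with x = s + y mod S and the key y uniform on S
pxy = {((s + y) % S, y): ps[s] / S for s in range(S) for y in range(S)}
px = [sum(p for (x, _), p in pxy.items() if x == xv) for xv in range(S)]

H_s = entropy(ps)
H_x_given_y = entropy(list(pxy.values())) - log2(S)   # H(x,y) - H(y)
```

Here `px` comes out uniform even though `ps` is skewed, while `H_x_given_y` matches `H_s`: this is exactly why rate $H(s)$ suffices even though $x$ by itself is incompressible.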
Figure 5.6. Information-theoretic model of compression of encrypted data: $s$ is the source and the secret key $y$ is uniformly distributed on $\mathcal S$. The encoder faces a uniformly distributed $x = s \oplus y$, and the decoder has side-information $y$ and recovers $\hat s = \hat x \ominus y$.
Notice that the estimates of $x$ and $s$ are related in the following way:
$$\hat s = \hat x \ominus y \tag{5.6}$$
so the estimation problems for $s$ and $x$ are equivalent. The lower bound and the upper bound on the delay constrained error exponent for $x$ are given in Theorem 6 and Theorem 7, respectively, for a general source $x$ and side-information $y \sim p_{xy}$. For the compression of encrypted data problem, as illustrated in Figure 5.6, we have the following corollary giving both a lower and an upper bound.
Corollary 1 Bounds on the delay constrained error exponent for compression of encrypted data: for the compression of encrypted data system in Figure 5.5, and hence its information-theoretic interpretation as a decoder side-information problem for source $x$ given side-information $y$ in Figure 5.6, the delay constrained error exponent $E_{si}(R)$ is bounded by the following two error exponents:
$$E_r(R, p_s) \le E_{si}(R) \le E_{s,b}(R, p_s) \tag{5.7}$$
where $E_r(R, p_s)$ is the random coding error exponent for source $p_s$ defined in (A.5), and $E_{s,b}(R, p_s)$ is the block coding error exponent defined in (A.4); both definitions are in Chapter 2.

It should be clear that this corollary holds for any source/side-information pair as in Figure 5.6, where $x = s \oplus y$ and $y$ is uniform on $\mathcal S$. The proof is in Appendix I.
5.3 Delay constrained Source Coding with Encoder Side-Information

In this section, we study the source coding problem for $x$ from a joint distribution $(x, y) \sim p_{xy}$, where the side-information $y$ is known at both the encoder and the decoder, as shown in Figure 5.7.
Figure 5.7. Lossless source coding with both encoder and decoder side-information: both terminals observe $y_1, y_2, \dots$, where $(x_i, y_i) \sim p_{xy}$.
We first review the error exponent result for the source coding with encoder side-information problem. As shown in Figure 5.7, the sources are iid random variables $x_1^n, y_1^n$ from a finite alphabet $\mathcal X \times \mathcal Y$. Without loss of generality, $p_x(x) > 0,\ \forall x \in \mathcal X$ and $p_y(y) > 0,\ \forall y \in \mathcal Y$. $x_1^n$ is the source known to the encoder, and $y_1^n$ is the side-information known to both the decoder and the encoder. A rate $R$ block source coding system for $n$ source symbols consists of an encoder-decoder pair $(\mathcal E_n, \mathcal D_n)$, where
$$\mathcal E_n : \mathcal X^n \times \mathcal Y^n \to \{0,1\}^{\lfloor nR\rfloor}, \qquad \mathcal E_n(x_1^n, y_1^n) = b_1^{\lfloor nR\rfloor}$$
$$\mathcal D_n : \{0,1\}^{\lfloor nR\rfloor} \times \mathcal Y^n \to \mathcal X^n, \qquad \mathcal D_n(b_1^{\lfloor nR\rfloor}, y_1^n) = \hat x_1^n$$
The error probability is $\Pr(\hat x_1^n \ne x_1^n) = \Pr(x_1^n \ne \mathcal D_n(\mathcal E_n(x_1^n, y_1^n), y_1^n))$. The exponent $E_{ei,b}(R)$ is achievable if there exists a family of $(\mathcal E_n, \mathcal D_n)$ s.t.
$$\lim_{n\to\infty} -\frac{1}{n}\log_2 \Pr(\hat x_1^n \ne x_1^n) = E_{ei,b}(R) \tag{5.8}$$
As a simple corollary of the relevant results of [29, 39], we have the following lemma.

Lemma 12 $E_{ei,b}(R) = E_{si,b}^{upper}(R)$, where $E_{si,b}^{upper}(R)$ is the upper bound for source coding with decoder-only side-information defined in Theorem 13:
$$E_{si,b}^{upper}(R) = \min_{q_{xy}:\,H(q_{x|y})\ge R} D(q_{xy}\|p_{xy}) = \sup_{\rho\ge0}\Big\{\rho R - \log\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big\}$$
Note: this error exponent is both achievable and tight in the usual block coding sense [29].
5.3.1 Delay constrained error exponent

Similar to the previous cases, we have the following notion of delay constrained source coding with both encoder and decoder side-information.

Definition 10 A sequential encoder-decoder pair $(\mathcal E, \mathcal D)$ is a sequence of maps $\{\mathcal E_j\}$, $j = 1, 2, \dots$ and $\{\mathcal D_j\}$, $j = 1, 2, \dots$. The outputs of $\mathcal E_j$ are the outputs of the encoder $\mathcal E$ from time $j-1$ to $j$:
$$\mathcal E_j : \mathcal X^j \times \mathcal Y^j \longrightarrow \{0,1\}^{\lfloor jR\rfloor-\lfloor(j-1)R\rfloor}, \qquad \mathcal E_j(x_1^j, y_1^j) = b_{\lfloor(j-1)R\rfloor+1}^{\lfloor jR\rfloor}$$
Figure 5.8. Delay constrained source coding with encoder side-information: rate $R = \frac{1}{2}$, delay $\Delta = 3$.
The outputs of $\mathcal D_j$ are the decoding decisions on all the arrived source symbols by time $j$, based on the received binary bits up to time $j$ as well as the side-information:
$$\mathcal D_j : \{0,1\}^{\lfloor jR\rfloor} \times \mathcal Y^j \longrightarrow \mathcal X, \qquad \mathcal D_j(b_1^{\lfloor jR\rfloor}, y_1^j) = \hat x_{j-\Delta}(j)$$
where $\hat x_{j-\Delta}(j)$ is the estimate of $x_{j-\Delta}$ at time $j$, so the end-to-end delay is $\Delta$ seconds. A rate $R = \frac{1}{2}$ sequential source coding system is illustrated in Figure 5.8.
The delay constrained error exponent is defined in Definition 11, parallel to the previous definitions of delay constrained error exponents.

Definition 11 A family of rate $R$ sequential source codes $(\mathcal E^\Delta, \mathcal D^\Delta)$ is said to achieve delay-reliability $E_{ei}(R)$ if and only if for all $\varepsilon > 0$ there exists $K < \infty$, s.t. $\forall i, \Delta > 0$:
$$\Pr(x_i \ne \hat x_i(i+\Delta)) \le K 2^{-\Delta(E_{ei}(R)-\varepsilon)}$$
Following the definition of the delay constrained source coding error exponent $E_{ei}(R)$ in Definition 11, we have the following result.

Theorem 8 Delay constrained error exponent with both encoder and decoder side-information:
$$E_{ei}(R) = \inf_{\alpha>0} \frac{1}{\alpha}\, E_{ei,b}((\alpha+1)R)$$
where $E_{ei,b}(R)$ is the block source coding error exponent defined in Lemma 12.
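Once the block exponent of Lemma 12 can be evaluated, the focusing operator of Theorem 8 is again a one-dimensional optimization. A grid-search sketch follows; the grids, the truncation of $\rho$ and $\alpha$, and the dict representation of $p_{xy}$ are our choices (the sup in Lemma 12 is over all $\rho \ge 0$ and is infinite for rates above $\log_2|\mathcal X|$, so the $\rho$ truncation matters only near that edge):

```python
from math import log2

def E_ei_block(R, pxy, rho_max=20.0, grid=401):
    """Gallager form of Lemma 12, sup over rho >= 0 truncated to [0, rho_max]."""
    ys = {y for (_, y) in pxy}
    best = 0.0
    for i in range(grid):
        rho = rho_max * i / (grid - 1)
        e0 = log2(sum(
            sum(p ** (1.0 / (1 + rho)) for (x, yy), p in pxy.items() if yy == y)
            ** (1 + rho)
            for y in ys))
        best = max(best, rho * R - e0)
    return best

def E_ei_delay(R, pxy):
    """Focusing operator: inf over alpha > 0 of E_ei_block((1+alpha)R)/alpha,
    evaluated on a finite alpha grid."""
    alphas = [0.1 * i for i in range(1, 201)]          # alpha in 0.1 .. 20.0
    return min(E_ei_block((1 + a) * R, pxy) / a for a in alphas)
```

For rates below $H(x|y)$ both quantities are zero; above it, the inner infimum trades a larger block exponent at rate $(1+\alpha)R$ against the $1/\alpha$ penalty.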
Similar to the lossless source coding in Chapter 2 and the lossy source coding in Chapter 3, the delay constrained source coding error exponent and the block coding error exponent are connected by the focusing operator. In Appendix J, we first show the achievability of $E_{ei}(R)$ by a simple variable-length universal code and a FIFO queue coding scheme, very similar to that for lossless source coding in Chapter 2. Then we show that $E_{ei}(R)$ is an upper bound on the delay constrained error exponent, by the same argument used in the proof for the lossless source coding error exponent in Chapter 2. Indeed, encoder side-information eliminates all the future randomness in the source and the side-information, so the coding system has the same nature as the lossless source coding system discussed in Chapter 2. Although they take different forms, it should be conceptually clear that $E_{ei}(R)$ and $E_s(R)$ share many characteristics. Without proof, we claim that the properties shown in Section 2.5 for $E_s(R)$ can also be found in $E_{ei}(R)$.
5.3.2 Price of ignorance

In the block coding setup, the presence or absence of encoder side-information does not change the error exponent in a dramatic way. As shown in [29], the difference is only that between the random coding error exponent $E_{si,b}^{lower}(R)$ and the error exponent $E_{si,b}^{upper}(R)$. These two are the same in the low rate regime; this is similar to the well-known channel coding error exponents, where the random coding and sphere packing bounds are the same in the high rate regime [41]. Furthermore, the gap between the cases with and without encoder side-information can be further reduced by using expurgation, as shown in [2].

However, for the delay constrained case, we show that without encoder side-information, the error exponent with only decoder side-information is in general strictly smaller than the error exponent when the encoder side-information is also present.

Corollary 2 Price of ignorance:
$$E_{ei}(R) > E_{si}^{upper}(R)$$
Proof: $D(q_{xy}\|p_{xy}) > D(q_x\|p_x)$ in general. $\blacksquare$
5.4 Numerical results

In this section, we show three examples to illustrate the nature of the upper and lower bounds on the delay constrained error exponent with decoder side-information.

5.4.1 Special case 1: independent side-information

As shown in Section 5.2.1, if the side-information $y$ is independent of the source $x$, the upper bound agrees with the delay constrained source coding error exponent $E_s(R)$ for $x$, and the lower bound agrees with the random coding error exponent $E_r(R)$ for source $x$. This shows that our bounding technique for the upper bound is tight in this case. To close the gap between the upper and lower bounds, the coding scheme should be the delay-optimal coding scheme of Chapter 2; the sequential random binning scheme is clearly suboptimal, because it is proved in Section 2.5 that the delay constrained error exponent $E_s(R)$ is strictly higher than the random coding error exponent $E_r(R)$. This is clearly shown in Figure 5.9.

The source is the same as in Section 2.2: source $x$ with alphabet size 3, $\mathcal X = \{A, B, C\}$, and distribution
$$p_x(A) = 0.65, \qquad p_x(B) = 0.175, \qquad p_x(C) = 0.175$$
The side-information is independent of the source, so its distribution does not matter; we arbitrarily let the marginal be $p_y = (0.420, 0.580)$.
5.4.2 Special case 2: compression of encrypted data

In Section 5.2.2, we showed in Corollary 1 that the delay constrained error exponent for compression of encrypted data from source $p_s$ is sandwiched between the block coding error exponent $E_{s,b}(R, p_s)$ and the random coding error exponent $E_r(R, p_s)$. For source $s$, the stream cipher is $x = s \oplus y$. For the binary case, this stream cipher is illustrated in Figure 5.10, where the source has $p_s(0) = 1-\epsilon$ and $p_s(1) = \epsilon$, and both the key $y$ and the output of the cipher $x$ are uniform on $\{0,1\}$. The bounds on the delay constrained error exponents are plotted in Figure 5.11, where $\epsilon = 0.1$. For this problem, the entropy of the source $H(s)$ is equal to the conditional entropy $H(x|y)$.
For a uniform source $x$ with side-information that is the output of a symmetric channel with $x$ as input, it can be shown that the upper bound and the lower bound agree in the low
Figure 5.9. Delay constrained error exponents for source coding with independent decoder side-information: upper bound $E_{si}^{upper}(R) = E_s(R)$, lower bound $E_{si}^{lower}(R) = E_r(R)$. The rate axis runs from $H(x|y)$ to $\log_2(3)$, where the upper bound grows to $\infty$.
Figure 5.10. A stream cipher $x = s \oplus y$ can be modeled as a discrete memoryless channel (here a binary symmetric channel with crossover probability $\epsilon$), where the key $y$ is uniform and independent of the source $s$. The key $y$ is the input, and the encryption $x$ is the output of the channel.
Figure 5.11. Delay constrained error exponents for uniform source coding with symmetric decoder side-information $x = y \oplus s$: upper bound $E_{si}^{upper}(R) = E_s(R, p_s)$, lower bound $E_{si}^{lower}(R) = E_r(R, p_s)$. These two bounds agree in the low rate regime.
rate regime, just as shown in Figure 5.11. Note: this is a more general problem than the compression of encrypted data problem. A uniform source with an erasure channel between the source and the side-information is illustrated in Figure 5.12; the bounds on the error exponents are shown in Figure 5.13.
Figure 5.12. Uniform source and side-information connected by a symmetric erasure channel with erasure probability $e$.
Figure 5.13. Delay constrained error exponents for uniform source coding with symmetric decoder side-information, where $x$ and $y$ are connected by an erasure channel: upper bound $E_{si}^{upper}(R) = E_r(R, p_s)$ in the low rate regime.
5.4.3 General cases

In this section, we show the upper bound and the lower bound for a general distribution of the source and side-information: the side-information is dependent on the source, and from the encoder's point of view the source is not uniform. Non-uniformity here refers to the joint distribution, including the marginal distributions of the source and the side-information. For example, the following distribution is not uniform although the marginals are uniform:
p_xy(x, y)      x = 1            x = 2            x = 3
y = 1           a                b                1/3 − a − b
y = 2           c                d                1/3 − c − d
y = 3           1/3 − a − c      1/3 − b − d      −1/3 + a + b + c + d

Table 5.1. A non-uniform source with uniform marginals
where $a, b, c, d, a+b, a+c, d+c, d+b \in [0, \frac{1}{3}]$, and the encoder can do more than just sequential random binning. An extreme case in which the encoder only needs to deal with $x \in \{2, 3\}$ is the following: let $a = \frac{1}{3}$ and $b = c = 0$; then the side-information at the decoder can be used to tell when $x = 1$. Without proof, we claim that for this marginal-uniform source, random coding is suboptimal. Now consider the following $2\times2$ source:
$$p_{xy} = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix} \tag{5.9}$$
In Figure 5.14, we plot the upper and lower bounds on the delay constrained error exponent with decoder-only side-information. To illustrate the "price of ignorance" phenomenon, we also plot the delay constrained error exponent with both encoder and decoder side-information, $E_{ei}(R)$. Like the lossless source coding with delay constraints error exponent $E_s(R)$, $E_{ei}(R)$ is related to its block coding error exponent through the focusing operator.
Figure 5.14. Delay constrained error exponents for a general source and decoder side-information: both the upper bound $E_{si}^{upper}(R)$ and the lower bound $E_{si}^{lower}(R)$ are plotted; the delay constrained error exponent with both encoder and decoder side-information, $E_{ei}(R)$, is plotted with dotted lines.
Note: we do not know how to parameterize the upper bound Eupper_si(R) as we did in Proposition 7 in Section 2. Instead, we minimize

Eupper_si(R) = min{ inf_{q_xy, α ≥ 1: H(q_x|y) > (1+α)R} (1/α) D(q_xy ‖ p_xy),
                    inf_{q_xy, 0 ≤ α ≤ 1: H(q_x|y) > (1+α)R} ((1−α)/α) D(q_x ‖ p_x) + D(q_xy ‖ p_xy) }

by brute force over the 5-dimensional space (q_xy, α). This yields the somewhat rough plot shown in Figure 5.14.
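This brute-force minimization is easy to sketch numerically. The following Python sketch is our own illustration, not the code that produced Figure 5.14: the random sampling of candidate joint types q_xy, the α grid, and the example rate are all arbitrary choices.

```python
import numpy as np

def kl(q, p):
    # KL divergence D(q || p) in bits; terms with q = 0 contribute 0.
    m = q > 0
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

def h_x_given_y(q):
    # Conditional entropy H(x|y) in bits for a joint pmf q[x, y].
    qy = q.sum(axis=0)
    h = 0.0
    for j in range(q.shape[1]):
        if qy[j] > 0:
            c = q[:, j] / qy[j]
            m = c > 0
            h -= qy[j] * float(np.sum(c[m] * np.log2(c[m])))
    return h

def upper_bound(p, R, samples=10000, seed=0):
    # Randomly sample candidate joint types q and scan alpha = Delta/n,
    # keeping only pairs satisfying the constraint H(q_{x|y}) > (1 + alpha) R.
    rng = np.random.default_rng(seed)
    alphas = np.geomspace(1e-3, 4.0, 80)
    best = np.inf
    for _ in range(samples):
        q = rng.dirichlet(np.ones(p.size)).reshape(p.shape)
        hxy = h_x_given_y(q)
        dxy = kl(q.ravel(), p.ravel())
        dx = kl(q.sum(axis=1), p.sum(axis=1))
        for a in alphas:
            if hxy <= (1 + a) * R:
                continue
            val = dxy / a if a >= 1 else (1 - a) / a * dx + dxy
            best = min(best, val)
    return best

p = np.array([[0.1, 0.2], [0.3, 0.4]])   # the source in (5.9)
R = h_x_given_y(p) + 0.05                # a rate slightly above H(x|y)
print(upper_bound(p, R))
```

Random search only approximates the infimum from above; a finer grid or a proper optimizer would smooth the curve at the cost of more computation.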
107
5.5 Proofs
The achievability part of the proof is a special case of that for Slepian-Wolf source coding shown in Chapter 4. The upper bound is derived by a feed-forward decoding scheme which was first developed in the channel coding context. It is no surprise that we can borrow channel coding techniques for source coding with decoder side-information, since the duality between the two has long been known. The upper bound and the achievable lower bound are not identical in general.
5.5.1 Proof of Theorem 6: random binning
We prove Theorem 6 by following the proofs for point-to-point source coding in Propo-
sitions 3 and 4. The encoder is the same sequential random binning encoder in Definition 3
with common randomness shared by the encoder and decoder. Recall that the key property of this random binning scheme is pairwise independence: formally, for all i, n and any two sequences that share the prefix x_1^i but have different suffixes x_{i+1}^n ≠ x̃_{i+1}^n,

Pr[E(x_1^i x_{i+1}^n) = E(x_1^i x̃_{i+1}^n)] = 2^(−(⌊nR⌋−⌊iR⌋)) ≤ 2 × 2^(−(n−i)R)    (5.10)
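The pairwise independence property (5.10) can be checked empirically with a toy simulation. This sketch is our own illustration, with the integer rate R = 1 bit per symbol so that ⌊nR⌋ − ⌊iR⌋ = n − i exactly: each new prefix draws a fresh fair parity bit, and we estimate the collision probability of two sequences sharing their first i symbols.

```python
import random

def sequential_bin(x, table, rng):
    # Sequential random binning at rate R = 1 bit/symbol: the bit emitted at
    # step k is a fresh fair coin flip indexed by the whole prefix x[:k], so
    # sequences agreeing up to time i share exactly their first i bits.
    bits = []
    for k in range(1, len(x) + 1):
        key = tuple(x[:k])
        if key not in table:
            table[key] = rng.getrandbits(1)
        bits.append(table[key])
    return tuple(bits)

def collision_rate(n=6, i=3, trials=20000, seed=1):
    # Monte Carlo estimate of Pr[same bin] for two sequences that share the
    # first i symbols and first differ at position i + 1.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        table = {}                      # fresh common randomness each trial
        prefix = [rng.randrange(2) for _ in range(i)]
        a = prefix + [rng.randrange(2) for _ in range(n - i)]
        b = prefix + [1 - a[i]] + [rng.randrange(2) for _ in range(n - i - 1)]
        hits += sequential_bin(a, table, rng) == sequential_bin(b, table, rng)
    return hits / trials

print(collision_rate())   # should be close to 2**-(6-3) = 0.125
```

Since the diverging suffixes index disjoint table entries, the collision probability is exactly 2^(−(n−i)) here, matching (5.10) with equality for integer rates.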
First we show (5.1) in Theorem 6.
ML decoding
ML decoding rule:
Denote by x̂_1^n(n) the estimate of the source sequence x_1^n at time n:

x̂_1^n(n) = argmax_{x̃_1^n ∈ B_x(x_1^n)} p_xy(x̃_1^n, y_1^n) = argmax_{x̃_1^n ∈ B_x(x_1^n)} ∏_{i=1}^n p_xy(x̃_i, y_i)    (5.11)
The ML decoding rule in (5.11) is very simple. At time n, the decoder simply picks the sequence x̂_1^n(n) with the highest joint likelihood with the side-information y_1^n among the sequences in the same bin as the true sequence x_1^n. The estimate of source symbol n − ∆ is then simply the (n − ∆)th symbol of x̂_1^n(n), denoted by x̂_{n−∆}(n).
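At toy block lengths the ML rule (5.11) can be carried out by exhaustive search, which may help build intuition. This Python sketch is our own illustration: the joint pmf, the block length, and the plain per-sequence random binning (standing in for the sequential binning of Definition 3) are all made-up.

```python
import itertools
import random

def ml_decode(bin_of, b, y, p_xy, n):
    # Among all length-n binary sequences whose bin index equals b, return
    # the one maximizing the joint likelihood prod_i p_xy(x_i, y_i).
    best, best_score = None, -1.0
    for cand in itertools.product((0, 1), repeat=n):
        if bin_of(cand) != b:
            continue
        score = 1.0
        for xi, yi in zip(cand, y):
            score *= p_xy[xi][yi]
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy joint pmf: y is x observed through a BSC(0.1); both marginals uniform.
p_xy = [[0.45, 0.05], [0.05, 0.45]]
n, rate_bits = 8, 5
rng = random.Random(0)
table = {}

def bin_of(x):
    # Random binning with common randomness: 5 random bin bits per sequence.
    if x not in table:
        table[x] = rng.getrandbits(rate_bits)
    return table[x]

x = tuple(rng.randrange(2) for _ in range(n))
y = tuple(xi if rng.random() < 0.9 else 1 - xi for xi in x)
x_hat = ml_decode(bin_of, bin_of(x), y, p_xy, n)
print(x_hat, x)   # the estimate is bin-consistent and at least as likely as x
```

By construction the returned estimate lies in the true sequence's bin and has joint likelihood at least that of the truth; whether it equals the truth depends on the bin rate and the side-information quality.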
Details of the proof:

The proof is quite similar to that of Proposition 3; we omit some of the details in this thesis. To cause a decoding error, there must be some false source sequence x̃_1^n that satisfies three conditions: (i) it must be in the same bin (share the same parities) as x_1^n, i.e., x̃_1^n ∈ B_x(x_1^n); (ii) it must be more likely than the true sequence given the same side-information y_1^n, i.e., p_xy(x̃_1^n, y_1^n) > p_xy(x_1^n, y_1^n); and (iii) x̃_l ≠ x_l for some l ≤ n − ∆.
The error probability again can be union bounded as:

Pr[x̂_{n−∆}(n) ≠ x_{n−∆}]
  ≤ Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆}]
  = Σ_{x_1^n, y_1^n} Pr[x̂_1^{n−∆}(n) ≠ x_1^{n−∆} | x_1^n] p_xy(x_1^n, y_1^n)
  ≤ Σ_{x_1^n, y_1^n} Σ_{l=1}^{n−∆} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. p_xy(x̃_1^n, y_1^n) ≥ p_xy(x_1^n, y_1^n)] p_xy(x_1^n, y_1^n)
  = Σ_{l=1}^{n−∆} Σ_{x_1^n, y_1^n} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. p_xy(x̃_1^n, y_1^n) ≥ p_xy(x_1^n, y_1^n)] p_xy(x_1^n, y_1^n)
  = Σ_{l=1}^{n−∆} p_n(l)    (5.12)
Recall the definition of F_n(l, x_1^n) in (2.20):

F_n(l, x_1^n) = { x̃_1^n ∈ X^n | x̃_1^{l−1} = x_1^{l−1}, x̃_l ≠ x_l }

and we define

p_n(l) = Σ_{x_1^n, y_1^n} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. p_xy(x̃_1^n, y_1^n) ≥ p_xy(x_1^n, y_1^n)] p_xy(x_1^n, y_1^n)    (5.13)
We now upper bound p_n(l) using a Chernoff bound argument similar to [39] and to the proof of Lemma 1. The details of the proof of the following Lemma 13 are in Appendix H.1.

Lemma 13 p_n(l) ≤ 2 × 2^(−(n−l+1) Elower_si(R)).

Using the above lemma and substituting the bound on p_n(l) into (5.12), we prove the ML decoding part, (5.1), of Theorem 6. ¥
Universal decoding (sequential minimum empirical joint entropy decoding)
In this section we prove (5.2) in Theorem 6.
Universal decoding rule:

x̂_l(n) = w[l]_l, where w[l]_1^n = argmin_{x̃_1^n ∈ B_x(x_1^n) s.t. x̃_1^{l−1} = x̂_1^{l−1}(n)} H(x̃_l^n, y_l^n)    (5.14)
We term this a sequential minimum joint empirical-entropy decoder; it closely follows the sequential minimum empirical-entropy decoder for point-to-point source coding in (2.30).
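The score used by this decoder is just the entropy of the empirical joint type of a candidate suffix paired with the side-information. A small sketch (ours; the example sequences are arbitrary) shows how a correlated pair beats an uncorrelated impostor under this score:

```python
from collections import Counter
from math import log2

def empirical_joint_entropy(xs, ys):
    # Entropy of the empirical joint type of the paired sequence
    # ((x_l, y_l), ..., (x_n, y_n)), in bits.
    counts = Counter(zip(xs, ys))
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in counts.values())

# The decoder in (5.14) prefers, among bin-consistent candidates agreeing
# with its past decisions, the suffix with the smallest joint empirical
# entropy against the side-information.
x_true = (0, 0, 1, 0, 1, 1, 0, 0)
y      = (0, 0, 1, 0, 1, 0, 0, 0)      # correlated side-information
x_fake = (1, 0, 1, 1, 0, 1, 0, 1)      # an uncorrelated impostor
print(empirical_joint_entropy(x_true, y) < empirical_joint_entropy(x_fake, y))   # prints True
```

Unlike the ML rule, this score needs no knowledge of p_xy, which is what makes the decoder universal.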
Details of the proof: With this decoder, errors can only occur if there is some sequence x̃_1^n such that (i) x̃_1^n ∈ B_x(x_1^n); (ii) x̃_1^{l−1} = x_1^{l−1} and x̃_l ≠ x_l for some l ≤ n − ∆; and (iii) the joint empirical entropy of x̃_l^n and y_l^n satisfies H(x̃_l^n, y_l^n) < H(x_l^n, y_l^n). Building on the common core of the achievability proof, (5.12), with universal decoding substituted for maximum likelihood, we get the following definition of p_n(l):

p_n(l) = Σ_{x_1^n, y_1^n} Pr[∃ x̃_1^n ∈ B_x(x_1^n) ∩ F_n(l, x_1^n) s.t. H(x̃_l^n, y_l^n) ≤ H(x_l^n, y_l^n)] p_xy(x_1^n, y_1^n)    (5.15)
The following lemma gives a bound on p_n(l). The proof is in Appendix H.2.

Lemma 14 For sequential minimum joint empirical-entropy decoding,

p_n(l) ≤ 2 × (n − l + 2)^(2|X||Y|) 2^(−(n−l+1) Elower_si(R)).
Lemma 14 and Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} p_n(l) imply that:

Pr[x̂_{n−∆}(n) ≠ x_{n−∆}] ≤ Σ_{l=1}^{n−∆} 2 (n − l + 2)^(2|X||Y|) 2^(−(n−l+1) Elower_si(R))
  ≤ Σ_{l=1}^{n−∆} K_1 2^(−(n−l+1)[Elower_si(R) − ε])
  ≤ K 2^(−∆[Elower_si(R) − ε])

where K and K_1 are finite constants. The above analysis follows the same argument as that in the proof of Proposition 4. This concludes the proof of the universal decoding part, (5.2), of Theorem 6. ¥
5.5.2 Proof of Theorem 7: Feed-forward decoding
The theorem is proved by applying a variation of the bounding technique used in [67] (and originating in [62]) for the fixed-delay channel coding problem. Lemmas 15-20 are the source coding counterparts of Lemmas 4.1-4.5 in [67]. The idea of the proof is to first build a feed-forward sequential source decoder which has access to the previous source symbols in addition to the encoded bits and the side-information. The second step is to construct a block source-coding scheme from the optimal feed-forward sequential decoder and to show
[Figure 5.15 appears here: a block diagram of the causal encoder, the two feed-forward decoders, and the delay-∆ feed-forward paths.]

Figure 5.15. A cutset illustration of the Markov chain x_1^n − (x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}) − x̂_1^n. Decoder 1 and decoder 2 are type I and type II delay ∆ rate R feed-forward decoders respectively. They are equivalent.
that if the side-information behaves atypically enough, then the decoding error probability will be large for at least one of the source symbols. The next step is to prove that atypicality of the side-information before that particular source symbol does not cause the error, because of the feed-forward information; thus the cause of the decoding error for that particular symbol is the atypical behavior of the future side-information only. The last step is to lower bound the probability of this atypical behavior and thereby upper bound the error exponent. The proof spans the next several subsections.
Feed-forward decoders
Definition 12 A delay ∆ rate R decoder D^{∆,R} with feed-forward is a decoder D_j^{∆,R} that also has access to the past source symbols x_1^{j−1} in addition to the encoded bits b_1^{⌊(j+∆)R⌋} and side-information y_1^{j+∆}.

Using this feed-forward decoder, the estimate of x_j at time j + ∆ is:

x̂_j(j + ∆) = D_j^{∆,R}(b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x_1^{j−1})    (5.16)
Lemma 15 For any rate R encoder E, the optimal delay ∆ rate R decoder D^{∆,R} with feed-forward only needs to depend on b_1^{⌊(j+∆)R⌋}, y_j^{j+∆}, x_1^{j−1}.

Proof: The source and side-information (x_i, y_i) form an iid random process, and the encoded bits b_1^{⌊(j+∆)R⌋} are functions of x_1^{j+∆}, so the Markov chain y_1^{j−1} − (x_1^{j−1}, b_1^{⌊(j+∆)R⌋}, y_j^{j+∆}) − x_j^{j+∆} holds. Conditioned on the past source symbols, the past side-information is completely irrelevant for estimation. ¤
Notice that for any finite alphabet set X, we can always define the group Z_|X| on X, where the operators − and + are subtraction and addition mod |X|. So we write the error sequence of the feed-forward decoder as x̃_i = x_i − x̂_i. Then we have the following property for the feed-forward decoders.
Lemma 16 Given a rate R encoder E, the optimal delay ∆ rate R decoder D^{∆,R} with feed-forward for symbol j only needs to depend on b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x̃_1^{j−1}.

Proof: Proceed by induction. The claim holds for j = 1 since there are no prior source symbols. Suppose that it holds for all j < k and consider j = k. By the induction hypothesis, the actions of all the prior decoders j < k can be simulated using (b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x̃_1^{j−1}), giving x̂_1^{k−1}. This in turn allows the recovery of x_1^{k−1}, since we also know x̃_1^{k−1}. Thus the decoder is equivalent. ¤
We call the feed-forward decoders in Lemmas 15 and 16 type I and type II delay ∆ rate R feed-forward decoders respectively. Lemmas 15 and 16 tell us that feed-forward decoders can be thought of in three ways: having access to all encoded bits, all side-information and all past source symbols, (b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x_1^{j−1}); having access to all encoded bits, a recent window of side-information and all past source symbols, (b_1^{⌊(j+∆)R⌋}, y_j^{j+∆}, x_1^{j−1}); or having access to all encoded bits, all side-information and all past decoding errors, (b_1^{⌊(j+∆)R⌋}, y_1^{j+∆}, x̃_1^{j−1}).
Constructing a block code

To encode a block of n source symbols, just run the rate R encoder E, terminated by running the encoder on random source symbols drawn according to the distribution p_x, with matching side-information on the other side. To decode the block, just use the delay ∆ rate R decoder D^{∆,R} with feed-forward, and then use the feed-forward error signals to correct any mistakes that might have occurred. As a block coding system, this hypothetical system never makes an error from end to end. As shown in Figure 5.15, the data processing inequality implies:

Lemma 17 If n is the block length, so the block rate is R(1 + ∆/n), then

H(x̃_1^n) ≥ −(n + ∆)R + nH(x|y)    (5.17)
Proof:

nH(x) =(a) H(x_1^n)
      = I(x_1^n; x̂_1^n)
      =(b) I(x_1^n; x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆})
      =(c) I(x_1^n; y_1^{n+∆}) + I(x_1^n; x̃_1^n | y_1^{n+∆}) + I(x_1^n; b_1^{⌊(n+∆)R⌋} | y_1^{n+∆}, x̃_1^n)
      ≤(d) nI(x; y) + H(x̃_1^n) + H(b_1^{⌊(n+∆)R⌋})
      ≤ nH(x) − nH(x|y) + H(x̃_1^n) + (n + ∆)R

(a) is true because the source is iid. (b) is true because of the data processing inequality applied to the Markov chain x_1^n − (x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}) − x̂_1^n, which gives I(x_1^n; x̂_1^n) ≤ I(x_1^n; x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}), together with the fact that I(x_1^n; x̂_1^n) = H(x_1^n) ≥ I(x_1^n; x̃_1^n, b_1^{⌊(n+∆)R⌋}, y_1^{n+∆}). Combining the two inequalities gives (b). (c) is the chain rule for mutual information. In (d), first notice that (x, y) is iid across time, thus I(x_1^n; y_1^{n+∆}) = I(x_1^n; y_1^n) = nI(x; y). Secondly, the entropy of a random variable is never less than its mutual information with another random variable, whether conditioned on a third random variable or not. The remaining steps are obvious. Rearranging the final inequality gives (5.17). ¤
Lower bound the symbol-wise error probability
Now suppose this block-code were to be run with the distribution qxy , s.t. H(qx |y ) >
(1+ ∆n )R, from time 1 to n, and were to be run with the distribution pxy from time n+1 to
n + ∆. Write the hybrid distribution as Qxy . Then the block coding scheme constructed in
the previous section will with probability 1 make a block error. Moreover, many individual
symbols will also be in error often:
Lemma 18 If the source and side-information is coming from qxy , then there exists
a δ > 0 so that for n large enough, the feed-forward decoder will make at leastH(qx|y )−n+∆
nR
2 log2 |X |−(H(qx|y )−n+∆n
R)n symbol errors with probability δ or above. δ satisfies
hδ + δ log2(|X | − 1) =12(H(qx |y )− n + ∆
nR),
where hδ = −δ log2 δ − (1− δ) log2(1− δ).
113
Proof: Lemma 17 implies:

Σ_{i=1}^n H(x̃_i) ≥ H(x̃_1^n) ≥ −(n + ∆)R + nH(q_x|y)    (5.18)

so the average entropy per error symbol x̃ is at least H(q_x|y) − ((n+∆)/n)R. Now suppose that H(x̃_i) ≥ (1/2)(H(q_x|y) − ((n+∆)/n)R) for A positions. By noticing that H(x̃_i) ≤ log2 |X|, we have

Σ_{i=1}^n H(x̃_i) ≤ A log2 |X| + (n − A)(1/2)(H(q_x|y) − ((n+∆)/n)R)

With (5.18), we derive the desired result:

A ≥ [ (H(q_x|y) − ((n+∆)/n)R) / (2 log2 |X| − (H(q_x|y) − ((n+∆)/n)R)) ] n    (5.19)

where 2 log2 |X| − (H(q_x|y) − ((n+∆)/n)R) ≥ 2 log2 |X| − H(q_x|y) ≥ 2 log2 |X| − log2 |X| > 0.

Now for the A positions 1 ≤ j_1 < j_2 < ... < j_A ≤ n, the individual entropies satisfy H(x̃_j) ≥ (1/2)(H(q_x|y) − ((n+∆)/n)R). By the property of the binary entropy function,

Pr[x̃_j ≠ x_0] = Pr[x̂_j ≠ x_j] ≥ δ,

where x_0 is the zero element in the finite group Z_|X|. ¤
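Since h_δ + δ log2(|X| − 1) is strictly increasing in δ on [0, 1 − 1/|X|], the δ guaranteed by Lemma 18 can be computed by bisection. A minimal sketch (ours; the gap value is an arbitrary example):

```python
from math import log2

def delta_from_gap(gap, alphabet_size):
    # Solve h(d) + d * log2(|X| - 1) = gap / 2 for d in (0, 1 - 1/|X|) by
    # bisection; the left-hand side increases from 0 to log2(|X|) there.
    def lhs(d):
        return -d * log2(d) - (1 - d) * log2(1 - d) + d * log2(alphabet_size - 1)
    lo, hi = 0.0, 1.0 - 1.0 / alphabet_size
    target = gap / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Binary alphabet with gap H(q_{x|y}) - ((n+Delta)/n) R = 0.5: solve h(d) = 0.25.
print(delta_from_gap(0.5, 2))
```

The larger the entropy gap, the larger the guaranteed per-symbol error probability δ, which is what drives the lower bound on p_xy(E_j*) below.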
We can pick j* = j_{A/2}. By Lemma 18, we know that

min{j*, n − j*} ≥ (1/2) [ (H(q_x|y) − ((n+∆)/n)R) / (2 log2 |X| − (H(q_x|y) − ((n+∆)/n)R)) ] n,

so if we fix ∆/n and let n go to infinity, then min{j*, n − j*} goes to infinity as well.
At this point, Lemmas 15 and 18 together imply that even if the source and side-information only behave like they came from the hybrid distribution Q_xy from time j* to j* + ∆, and the source behaves like it came from the distribution q_x from time 1 to j* − 1, the same minimum error probability δ still holds. Now define the "bad sequence" set E_{j*} as the set of source and side-information sequence pairs for which the type I delay ∆ rate R decoder makes a decoding error at j*. Formally,

E_{j*} = { (x⃗, ȳ) | x_{j*} ≠ D_{j*}^{∆,R}(E(x⃗), ȳ, x_1^{j*−1}) },

where to simplify the notation, we write x⃗ = x_1^{j*+∆}, x̄ = x_{j*}^{j*+∆} and ȳ = y_{j*}^{j*+∆}.
By Lemma 18, Q_xy(E_{j*}) ≥ δ. Notice that E_{j*} does not depend on the distribution of the source, but only on the encoder-decoder pair. Define J = min{n, j* + ∆}, and redefine x̄ = x_{j*}^J, ȳ = y_{j*}^J. Now we write the strongly typical set

A_J^ε(q_xy) = { (x⃗, ȳ) ∈ X^{j*+∆} × Y^{J−j*+1} | ∀x: r_x(x) ∈ (q_x(x) − ε, q_x(x) + ε);
  ∀(x, y) ∈ X × Y with q_xy(x, y) > 0: r_{x̄,ȳ}(x, y) ∈ (q_xy(x, y) − ε, q_xy(x, y) + ε);
  ∀(x, y) ∈ X × Y with q_xy(x, y) = 0: r_{x̄,ȳ}(x, y) = 0 }

where the empirical distribution of (x̄, ȳ) is denoted by r_{x̄,ȳ}(x, y) = n_{x,y}(x̄, ȳ)/(J − j* + 1), and the empirical distribution of x_1^{j*−1} by r_x(x) = n_x(x_1^{j*−1})/(j* − 1).
Lemma 19 Q_xy(E_{j*} ∩ A_J^ε(q_xy)) ≥ δ/2 for large n and ∆.

Proof: Fix ∆/n and let n go to infinity; then min{j*, n − j*} goes to infinity. By the definition of J, min{j*, J − j*} goes to infinity as well. By Lemma 13.6.1 in [26], we know that for all ε > 0, if J − j* and j* are large enough, then Q_xy(A_J^ε(q_xy)^C) ≤ δ/2. By Lemma 18, Q_xy(E_{j*}) ≥ δ. So

Q_xy(E_{j*} ∩ A_J^ε(q_xy)) ≥ Q_xy(E_{j*}) − Q_xy(A_J^ε(q_xy)^C) ≥ δ/2.  ¤
Lemma 20 For all ε < min_{x,y: p_xy(x,y)>0} p_xy(x, y) and all (x⃗, ȳ) ∈ A_J^ε(q_xy),

p_xy(x⃗, ȳ) / Q_xy(x⃗, ȳ) ≥ 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)

where G = max{ |X||Y| + Σ_{x,y: p_xy(x,y)>0} log2(q_xy(x,y)/p_xy(x,y) + 1), |X| + Σ_x log2(q_x(x)/p_x(x) + 1) }.

Proof: For (x⃗, ȳ) ∈ A_J^ε(q_xy), it follows from the definition of the strongly typical set by simple algebra that D(r_{x̄,ȳ}‖p_xy) ≤ D(q_xy‖p_xy) + Gε and D(r_x‖p_x) ≤ D(q_x‖p_x) + Gε. Then

p_xy(x⃗, ȳ)/Q_xy(x⃗, ȳ)
  = [p_x(x_1^{j*−1})/q_x(x_1^{j*−1})] [p_xy(x̄, ȳ)/q_xy(x̄, ȳ)] [p_xy(x_{J+1}^{j*+∆}, y_{J+1}^{j*+∆})/p_xy(x_{J+1}^{j*+∆}, y_{J+1}^{j*+∆})]
  = [2^(−(j*−1)(D(r_x‖p_x)+H(r_x))) / 2^(−(j*−1)(D(r_x‖q_x)+H(r_x)))] [2^(−(J−j*+1)(D(r_{x̄,ȳ}‖p_xy)+H(r_{x̄,ȳ}))) / 2^(−(J−j*+1)(D(r_{x̄,ȳ}‖q_xy)+H(r_{x̄,ȳ})))]
  ≥(a) 2^(−(J−j*+1)(D(q_xy‖p_xy)+Gε) − (j*−1)(D(q_x‖p_x)+Gε))
  = 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)

(a) is true by Equation 12.60 in [26]. ¤
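The identity invoked here, p^n(x_1^n) = 2^(−n(D(r‖p)+H(r))) with r the empirical type of x_1^n, is exact and easy to check numerically. A sketch (ours; the sequence and pmf are arbitrary):

```python
from collections import Counter
from math import log2

def iid_log_prob(seq, p):
    # log2 of the iid probability of seq under pmf p.
    return sum(log2(p[s]) for s in seq)

def type_exponent(seq, p):
    # -n * (D(r || p) + H(r)), where r is the empirical type of seq.
    n = len(seq)
    r = {s: c / n for s, c in Counter(seq).items()}
    d = sum(rs * log2(rs / p[s]) for s, rs in r.items())
    h = -sum(rs * log2(rs) for rs in r.values())
    return -n * (d + h)

seq = (0, 1, 1, 0, 2, 1, 0, 0)
p = {0: 0.5, 1: 0.3, 2: 0.2}
print(iid_log_prob(seq, p), type_exponent(seq, p))   # the two agree
```

This is why the change of measure between p and q in Lemma 20 reduces to a difference of divergences against the same empirical type.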
Lemma 21 For all ε < min_{x,y: p_xy(x,y)>0} p_xy(x, y) and large ∆, n:

p_xy(E_{j*}) ≥ (δ/2) 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)

Proof: Combining Lemmas 19 and 20:

p_xy(E_{j*}) ≥ p_xy(E_{j*} ∩ A_J^ε(q_xy))
  ≥ Q_xy(E_{j*} ∩ A_J^ε(q_xy)) 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)
  ≥ (δ/2) 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x) − JGε)  ¤
Finishing the proof of Theorem 7

Now we are finally ready to prove Theorem 7. Notice that as long as H(q_x|y) > ((n+∆)/n)R, we have δ > 0, letting ε go to 0 while ∆ and n go to infinity proportionally. We have:

Pr[x̂_{j*}(j* + ∆) ≠ x_{j*}] = p_xy(E_{j*}) ≥ K 2^(−(J−j*+1)D(q_xy‖p_xy) − (j*−1)D(q_x‖p_x)).

Notice that D(q_xy‖p_xy) ≥ D(q_x‖p_x) and J = min{n, j* + ∆}; then for all possible j* ∈ [1, n] we have the following. For n ≥ ∆:

(J − j* + 1)D(q_xy‖p_xy) + (j* − 1)D(q_x‖p_x) ≤ (∆ + 1)D(q_xy‖p_xy) + (n − ∆ − 1)D(q_x‖p_x)
  ≈ ∆ ( D(q_xy‖p_xy) + ((n − ∆)/∆) D(q_x‖p_x) )

For n < ∆:

(J − j* + 1)D(q_xy‖p_xy) + (j* − 1)D(q_x‖p_x) ≤ nD(q_xy‖p_xy) = ∆ ( (n/∆) D(q_xy‖p_xy) )

Write α = ∆/n; then the upper bound on the error exponent is the minimum of the above exponents over all α > 0, i.e.:

Eupper_si(R) = min{ inf_{q_xy, α ≥ 1: H(q_x|y) > (1+α)R} (1/α) D(q_xy‖p_xy),
                    inf_{q_xy, 0 ≤ α ≤ 1: H(q_x|y) > (1+α)R} ((1−α)/α) D(q_x‖p_x) + D(q_xy‖p_xy) }

This finalizes the proof. ¥
5.6 Discussions
In this chapter, we attempted to derive a tight upper bound on the delay constrained distributed source coding error exponent. For source coding with decoder-only side-information, we gave an upper bound by using a feed-forward decoding argument borrowed from the channel coding literature [62, 67]. The upper bound is shown to be tight in two special cases, namely independent side-information and the low rate regime in the compression of encrypted data problem. We gave a generic achievability result by sequential random binning. The random coding based lower bound agrees with the upper bound in the compression of encrypted data problem. In the independent side-information case, the random coding error exponent is strictly suboptimal. For general cases, there is a gap between the upper and lower bounds over the whole rate region. We believe the lower bound can be improved by using a variable length random binning scheme. This is a difficult problem and should be studied further.
We also studied delay constrained source coding with both encoder and decoder side-information. With encoder side-information, this problem resembles point-to-point lossless source coding, and thus the delay constrained error exponent obeys a focusing type bound. The exponent with both encoder and decoder side-information is strictly higher than the upper bound for the decoder-only case. This phenomenon is called the "price of ignorance" [18] and is not observed in block coding. This is another example showing that block length is not the same as delay.
Chapter 6
Future Work
In this thesis, we studied asymptotic performance bounds for several delay constrained streaming source coding problems where the source is assumed to be iid with a constant arrival rate. What are the performance bounds if the sources are not iid? What if the arrivals are random? What if the distortion is not defined on a symbol by symbol basis? What are the non-asymptotic performance bounds for these problems? Most importantly, the main theme of this thesis is to figure out the dominant error event on a symbol by symbol basis. For a specific source symbol with a finite end-to-end decoding delay constraint, what is the most likely atypical event that causes an error? This is a fundamental question that has not yet been answered in several cases.
6.1 Past, present and future
“This duality can be pursued further and is related to the duality between past and
future and the notions of control and knowledge. Thus we may have knowledge of the past
but cannot control it; we may control the future but have no knowledge of it.”
– Claude Shannon [74]
What is the past, what is the present and what is the future? In delay constrained streaming coding problems, for a source symbol that enters the encoder at time t, we define the past as the time prior to t, the present as time t, and the future as the time between t and t + ∆, where ∆ is the finite delay constraint. Time after t + ∆ is irrelevant. With this definition, past, present and future are defined relative to a particular time t, as shown in Figure 6.1.
[Figure 6.1 appears here: a time axis from 0 marking the past (before t), the present (t), the future (t to t + ∆) and the irrelevant times (after t + ∆).]

Figure 6.1. Past, present and future at time t
6.1.1 Dominant error event
For delay constrained streaming coding, what is the dominant error event (atypical event) for time t? For a particular coding scheme, we define the dominant error event as the event with the highest probability that causes a decoding error for source symbol t. For delay constrained lossless source coding in Chapter 2, we discovered that the dominant error event for optimal coding is the atypical behavior of the source in the past; the future does not matter! This is shown in Figure 6.2. The same goes for the delay constrained lossy source coding problem in Chapter 3. However, the dominant error event for the sequential random binning scheme is the future atypicality of the binning and/or the source behavior, as shown in Section 2.3.3. This illustrates that dominant error events are coding scheme dependent.
We summarize what we know about the dominant error events for the optimal coding schemes in Table 6.1. In the table, we put an X mark if we have completely characterized the nature of the dominant error event, and a ? mark if we only have a conjecture about it. Note: for source coding with decoder only side-information and
[Figure 6.2 appears here: the time axis of Figure 6.1 with the dominant error interval marked between t − ∆/α* and t.]

Figure 6.2. The dominant error event for delay constrained lossless source coding is the atypicality of the source between time t − ∆/α* and time t, where α* is the optimizer in (2.7)
                                                       Past   Future   Both
Point-to-point lossless (Chapter 2)                     X
Point-to-point lossy (Chapter 3)                        X
Slepian-Wolf coding (Chapter 4)                                 ?
Source coding with decoder only side-info (Chapter 5)                    ?
Source coding with both side-information (Chapter 5)    X
Channel coding without feedback [67]                            X
Erasure-like channel coding with feedback [67]                           X
Joint source channel coding [15]                                         ?
MAC [14], broadcast channel coding [13]                         ?

Table 6.1. Type of dominant error events for different coding problems
joint source channel coding, we derive the upper bound by using the feed-forward decoding scheme, and the dominant error event spans across past and future. However, the lower bounds (achievability) are derived by a suboptimal sequential random binning scheme whose dominant error event is in the future.
6.1.2 How to deal with both past and future?
In this thesis, we develop several tools to deal with different delay constrained streaming source coding problems. For the lower bounds on the error exponent, we use sequential random binning to deal with future dominant error events, and we see that this scheme is often optimal, as shown in [67]. We use variable length coding with a FIFO queue to deal with past dominant error events, as shown in Chapters 2 and 3. For the upper bounds, we use the feed-forward decoding scheme and a simple focusing type argument to translate the delay constrained error event into a block coding error event, and then derive the upper bounds. Obviously, more tools are needed to deal with more complex problems. At what point can we claim that we have developed all the needed tools for the delay constrained streaming coding problems in Figure 1.1?
For those problems whose dominant error event spans across past and future, we do not have a concrete idea of what the optimal coding scheme is. The only exception is the erasure-like channel with feedback problem in [67]; this channel coding problem is a special case of source coding with delay. For the more general cases, namely source coding with decoder side-information and joint source channel coding, we believe that a "variable length random binning" scheme should be studied. This is a difficult problem and more serious research needs to be done in this area.
6.2 Source model and common randomness
In this thesis, the sources are always modeled as iid random variables on a finite alphabet
set. This assumption simplifies the analysis and captures the main challenges in streaming
source coding with delay. Our analysis for lossless source coding in Chapter 2 can be easily
generalized to finite alphabet stationary ergodic processes. The infinite alphabet case [59] is
very challenging and our current universal proof technique does not apply. For lossy source
coding in Chapter 3, an important question is: what if the sources are continuous random variables instead of discrete and the distortion measures are L2 distances? Is a variable length vector quantizer with a FIFO queueing system still optimal? Another interesting case is when we make no assumptions about the statistics of the source at all. This is the individual
sequence problem [11]. It seems that our universal variable length code and FIFO queue
would work just fine for individual sequences for the same reason that Lempel-Ziv coding is
optimal for individual sequences. A rigorous proof is needed. In our study of lossy source
coding, we focus on the per symbol loss case. What if the loss function is defined as an
average over a period of time? What can we learn from the classical sliding-block source
coding literature [45, 44, 7, 77]?
Another very interesting problem is when multiple sources have to share the same
bandwidth and encoder/decoder. We recently studied the error exponent tradeoff in the
block coding setup. The delay constrained performance for multiple streaming sources is a
more challenging problem. Also, as mentioned in Section 1.2, random arrivals of the source
symbols pose another dimension of challenges. Other than several special distributions of
the inter-symbol random arrivals, we do not have a general result on the delay constrained
error exponent. For the above problems, techniques from queueing theory [42] might be
useful.
In the achievability proofs for delay constrained distributed source coding in Chapters 4 and 5 and for channel coding [13, 14], we use a sequential random coding scheme.
This assumes that the encoder(s) and the decoder(s) share some common randomness. The
amount of randomness is unbounded since our coding system has an infinite horizon. A
natural question is whether there exists an encoder-decoder pair that achieves the random coding error exponents without the presence of common randomness. For the block coding cases,
the existence is proved by a simple argument [26, 41]. However, this argument does not
apply to the delay constrained coding problems because we are facing infinite horizons.
6.3 Small but interesting problems
In Chapter 2, we showed several properties of the delay constrained source coding error exponent Es(R), including the positive derivative of the error exponent at the entropy rate, etc. One important question is whether the exponent Es(R) is convex ∪. We conjecture that the answer is yes, although the channel coding counterpart can be neither convex nor concave [67]. We also conjecture that the delay constrained lossy source coding error exponent ED(R) is convex ∪. This is a more difficult problem, because there is no obvious parametrization of ED(R); thus the only possible way to show the convexity of ED(R) is through the definition of ED(R), by proving a general result for all focusing type bounds. An important question is: what is a sufficient condition on F(R) such that the following focusing function E(R) is convex ∪?
E(R) = inf_{α>0} (1/α) F((1 + α)R)    (6.1)
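As a sanity check for such questions, the focusing operator is easy to evaluate numerically for candidate F. For example, with F(R) = R² the infimand is ((1+α)²/α)R², minimized at α = 1, so E(R) = 4R², which is again convex ∪. A sketch (ours; the grid and example F are arbitrary) confirming this:

```python
import numpy as np

def focusing(F, R):
    # E(R) = inf_{alpha > 0} (1/alpha) * F((1 + alpha) R), on a log grid.
    alphas = np.geomspace(1e-3, 1e3, 200001)
    return min(F((1 + a) * R) / a for a in alphas)

E = focusing(lambda r: r * r, 0.5)
print(E)   # analytically E(0.5) = 4 * 0.25 = 1, attained at alpha = 1
```

Such numerical experiments can suggest, though of course not prove, which conditions on F preserve convexity under the focusing operator.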
We do not have a parametrization result for the upper bound on the delay constrained source coding with decoder side-information error exponent in Theorem 7. A parametrization result like that in Proposition 7 could greatly simplify the calculation of the upper bound.

We used the feed-forward decoding scheme to upper bound the error exponents. All the problems we have studied using this technique are point-to-point coding problems with one encoder and one decoder. An important future direction is to generalize our current technique to distributed source coding and channel coding. This should be a reasonably easy problem for delay constrained Slepian-Wolf coding, by modifying the feed-forward diagram in Figure 5.15.
Lastly, we study the error exponents in the asymptotic regime, i.e., the error exponent tells how fast the error probability decays to zero when the delay is large, as shown in (1.2) with ≈ instead of =. In order to accurately bound the error probability in the short delay regime, we cannot ignore the usually ignored polynomial terms in the expressions for the error probabilities. This could be an extremely laborious problem that lacks mathematical neatness. We leave this problem to more practically minded researchers.
I do not expect my thesis to be error free. Please send your comments, suggestions, questions, worries, concerns and solutions to the open problems to [email protected]
Bibliography
[1] Rudolf Ahlswede and Imre Csiszar. Common randomness in information theory and
cryptography part I: Secret sharing. IEEE Transactions on Information Theory,
39:1121–1132, 1993.
[2] Rudolf Ahlswede and Gunter Dueck. Good codes can be produced by a few permuta-
tions. IEEE Transactions on Information Theory, 28:430–443, 1982.
[3] Venkat Anantharam and Sergio Verdu. Bits through queues. IEEE Transactions on
Information Theory, 42:4–18, 1996.
[4] Andrew Barron, Jorma Rissanen, and Bin Yu. The minimum description length prin-
ciple in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–
2760, 1998.
[5] Toby Berger. Rate Distortion Theory: A mathematical basis for data compression.
Prentice-Hall, 1971.
[6] Toby Berger and Jerry D. Gibson. Lossy source coding. IEEE Transactions on Infor-
mation Theory, 44:2693–2723, 1998.
[7] Toby Berger and Joseph KaYin Lau. On binary sliding block codes. IEEE Transactions
on Information Theory, 23:343–353, 1977.
[8] Patrick Billingsley. Probability and Measure. Wiley-Interscience, 1995.
[9] Shashi Borade and Lizhong Zheng. I-projection and the geometry of error exponents.
Allerton Conference, 2006.
[10] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
[11] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
[12] Cheng Chang, Stark Draper, and Anant Sahai. Sequential random binning for stream-
ing distributed source coding. submitted to IEEE Transactions on Information Theory,
2006.
[13] Cheng Chang and Anant Sahai. Sequential random coding error exponents for degraded
broadcast channels. Allerton Conference, 2005.
[14] Cheng Chang and Anant Sahai. Sequential random coding error exponents for multiple
access channels. WirelessCom 05 Symposium on Information Theory, 2005.
[15] Cheng Chang and Anant Sahai. Error exponents for joint source-channel coding with
delay-constraints. Allerton Conference, 2006.
[16] Cheng Chang and Anant Sahai. Upper bound on error exponents with delay for lossless
source coding with side-information. ISIT, 2006.
[17] Cheng Chang and Anant Sahai. Delay-constrained source coding for a peak distortion
measure. ISIT, 2007.
[18] Cheng Chang and Anant Sahai. The price of ignorance: the impact on side-information
for delay in lossless source coding. submitted to IEEE Transactions on Information
Theory, 2007.
[19] Cheng Chang and Anant Sahai. Universal quadratic lower bounds on source coding
error exponents. Conference on Information Sciences and Systems, 2007.
[20] Cheng Chang and Anant Sahai. The error exponent with delay for lossless source
coding. Information Theory Workshop, Punta del Este, Uruguay, March 2006.
[21] Cheng Chang and Anant Sahai. Universal fixed-length coding redundancy. Information
Theory Workshop, Lake Tahoe, CA, September 2007.
[22] Jun Chen, Dake He, Ashish Jagmohan, and Luis A. Lastras-Montano. On the duality
and difference between Slepian-Wolf coding and channel coding. Information Theory
Workshop, Lake Tahoe, CA, USA, September 2007.
[23] Gerard Cohen, Iiro Honkala, Simon Litsyn, and Antoine Lobstein. Covering Codes.
Elsevier, 1997.
[24] Thomas M. Cover. Broadcast channels. IEEE Transactions on Information Theory,
18:2–14, 1972.
[25] Thomas M. Cover and Abbas El Gamal. Capacity theorems for the relay channel.
IEEE Transactions on Information Theory, pages 572–584, 1979.
[26] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley
and Sons Inc., New York, 1991.
[27] Imre Csiszar. Joint source-channel error exponent. Problem of Control and Information
Theory, 9:315–328, 1980.
[28] Imre Csiszar. The method of types. IEEE Transactions on Information Theory,
44:2505–2523, 1998.
[29] Imre Csiszar and Janos Korner. Information Theory. Akademiai Kiado, Budapest,
1986.
[30] Imre Csiszar and Paul C. Shields. Information Theory and Statistics: A Tutorial.
http://www.nowpublishers.com/getpdf.aspx?doi=0100000004&product=CIT.
[31] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications.
Springer, 1998.
[32] Stark Draper, Cheng Chang, and Anant Sahai. Sequential random binning for stream-
ing distributed source coding. ISIT, 2005.
[33] Richard Durrett. Probability: Theory and Examples. Duxbury Press, 1995.
[34] David Forney. Convolutional codes III. sequential decoding. Information and Control,
25:267–297, 1974.
[35] Robert Gallager. Fixed composition arguments and lower bounds to error probability.
http://web.mit.edu/gallager/www/notes/notes5.pdf.
[36] Robert Gallager. Low Density Parity Check Codes, Sc.D. thesis. Massachusetts Institute of Technology, 1960.
[37] Robert Gallager. Capacity and coding for degraded broadcast channels. Problemy
Peredachi Informatsii, 10(3):3–14, 1974.
[38] Robert Gallager. Basic limits on protocol information in data communication networks.
IEEE Transactions on Information Theory, 22:385–398, 1976.
[39] Robert Gallager. Source coding with side information and universal coding. Technical
Report LIDS-P-937, Massachusetts Institute of Technology, 1976.
[40] Robert Gallager. A perspective on multiaccess channels. IEEE Transactions on Infor-
mation Theory, 31, 1985.
[41] Robert G. Gallager. Information Theory and Reliable Communication. John Wiley,
New York, NY, 1971.
[42] Ayalvadi Ganesh, Neil O'Connell, and Damon Wischik. Big Queues. Springer,
Berlin/Heidelberg, 2004.
[43] Michael Gastpar. To code or not to code, PhD thesis. Ecole Polytechnique Federale
de Lausanne, 2002.
[44] Robert M. Gray. Sliding-block source coding. IEEE Transactions on Information
Theory, 21:357–368, 1975.
[45] Robert M. Gray, David L. Neuhoff, and Donald S. Ornstein. Nonblock source coding
with a fidelity criterion. The Annals of Probability, 3(3):478–491, 1975.
[46] Martin Hellman. An extension of the Shannon theory approach to cryptography. IEEE
Transactions on Information Theory, 23(3):289–294, 1977.
[47] Michael Horstein. Sequential transmission using noiseless feedback. IEEE Transactions
on Information Theory, 9(3):136 – 143, 1963.
[48] Arding Hsu. Data management in delayed conferencing. Proceedings of the 10th Inter-
national Conference on Data Engineering, page 402, Feb 1994.
[49] Frederick Jelinek. Buffer overflow in variable length coding of fixed rate sources. IEEE
Transactions on Information Theory, 14:490–501, 1968.
[50] Mark Johnson, Prakash Ishwar, Vinod Prabhakaran, Dan Schonberg, and Kannan
Ramchandran. On compressing encrypted data. IEEE Transactions on Signal Pro-
cessing, 52(10):2992–3006, 2004.
[51] Aggelos Katsaggelos, Lisimachos Kondi, Fabian Meier, Jorn Ostermann, and Guido
Schuster. MPEG-4 and rate-distortion-based shape-coding techniques. Proceedings of the IEEE,
special issue on Multimedia Signal Processing, 86:1126–1154, 1998.
[52] Leonard Kleinrock. Queueing Systems. Wiley-Interscience, 1975.
[53] Amos Lapidoth and Prakash Narayan. Reliable communication under channel uncer-
tainty. IEEE Transactions on Signal Processing, 44:2148–2177, 1998.
[54] Michael E. Lukacs and David G. Boyer. A universal broadband multipoint telecon-
ferencing service for the 21st century. IEEE Communications Magazine, 33:36 – 43,
1995.
[55] Katalin Marton. Error exponent for source coding with a fidelity criterion. IEEE
Transactions on Information Theory, 20(2):197–199, 1974.
[56] Ueli M. Maurer. Secret key agreement by public discussion from common information.
IEEE Transactions on Information Theory, 39(3):733–742, 1993.
[57] Neri Merhav and Ioannis Kontoyiannis. Source coding exponents for zero-delay coding
with finite memory. IEEE Transactions on Information Theory, 49(3):609–625, 2003.
[58] David L. Neuhoff and R. Kent Gilbert. Causal source codes. IEEE Transactions on
Information Theory, 28(5):701–713, 1982.
[59] Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan
Zhang. Limit results on pattern entropy. IEEE Transactions on Information The-
ory, 52(7):2954–2964, 2006.
[60] Hari Palaiyanur and Anant Sahai. Sequential decoding using side-information. IEEE
Transactions on Information Theory, submitted, 2007.
[61] Larry L. Peterson and Bruce S. Davie. Computer Networks: A Systems Approach.
Morgan Kaufmann, 1999.
[62] Mark Pinsker. Bounds of the probability and of the number of correctable errors for
nonblock codes. Translation from Problemy Peredachi Informatsii, 3:44–55, 1967.
[63] Vinod Prabhakaran and Kannan Ramchandran. On secure distributed source coding.
Information Theory Workshop, Lake Tahoe, CA, USA, September 2007.
[64] S. Sandeep Pradhan, Jim Chou, and Kannan Ramchandran. Duality between source
coding and channel coding and its extension to the side information case. IEEE Trans-
actions on Information Theory, 49(5):1181– 1203, May 2003.
[65] Thomas J. Richardson and Rudiger L. Urbanke. The capacity of low-density parity-
check codes under message-passing decoding. IEEE Transactions on Information The-
ory, 47(2):599–618, 2001.
[66] Jorma Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[67] Anant Sahai. Why block length and delay are not the same thing. IEEE Transac-
tions on Information Theory, submitted. http://www.eecs.berkeley.edu/~sahai/
Papers/FocusingBound.pdf.
[68] Anant Sahai, Stark Draper, and Michael Gastpar. Boosting reliability over awgn net-
works with average power constraints and noiseless feedback. ISIT, 2005.
[69] Anant Sahai and Sanjoy K. Mitter. The necessity and sufficiency of anytime capacity
for stabilization of a linear system over a noisy communication link: Part I: scalar
systems. IEEE Transactions on Information Theory, 52(8):3369 – 3395, August 2006.
[70] J.P.M. Schalkwijk and Thomas Kailath. A coding scheme for additive noise channels
with feedback–I: No bandwidth constraint. IEEE Transactions on Information Theory,
12(2):172–182, 1966.
[71] Dan Schonberg. Practical Distributed Source Coding and Its Application to the Com-
pression of Encrypted Data, PhD thesis. University of California, Berkeley, Berkeley,
CA, 2007.
[72] Claude Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, 1948.
[73] Claude Shannon. Communication theory of secrecy systems. Bell System Technical
Journal, 28(4):656–715, 1949.
[74] Claude Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE
National Convention Record, 7(4):142–163, 1959.
[75] Claude Shannon, Robert Gallager, and Elwyn Berlekamp. Lower bounds to error
probability for coding on discrete memoryless channels I. Information and Control,
10:65–103, 1967.
[76] Claude Shannon, Robert Gallager, and Elwyn Berlekamp. Lower bounds to error
probability for coding on discrete memoryless channels II. Information and Control,
10:522–552, 1967.
[77] Paul Shields and David Neuhoff. Block and sliding-block source coding. IEEE Trans-
actions on Information Theory, 23:211 – 215, 1977.
[78] Wang Shuo. I’m your daddy. Tianjin People’s Press, 2007.
[79] David Slepian and Jack Wolf. Noiseless coding of correlated information sources. IEEE
Transactions on Information Theory, 19, 1973.
[80] David Tse. Variable-rate Lossy Compression and its Effects on Communication Net-
works, PhD thesis. Massachusetts Institute of Technology, 1994.
[81] David Tse, Robert Gallager, and John Tsitsiklis. Optimal buffer control for variable-
rate lossy compression. 31st Allerton Conference, 1993.
[82] Sergio Verdu and Steven W. McLaughlin. Information Theory: 50 Years of Discovery.
Wiley-IEEE Press, 1999.
[83] Terry Welch. A technique for high-performance data compression. IEEE Computer,
17(6):8–19, 1984.
[84] Wikipedia. Biography of Wang Shuo, http://en.wikipedia.org/wiki/Wang_Shuo.
[85] Hirosuke Yamamoto and Kohji Itoh. Asymptotic performance of a modified Schalkwijk-
Barron scheme for channels with noiseless feedback. IEEE Transactions on Information
Theory, 25(6):729–733, 1979.
[86] Zhusheng Zhang. Analysis— a new perspective. Peking University Press, 1995.
[87] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression.
IEEE Transactions on Information Theory, 23(3):337–343, 1977.
[88] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate
coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.
Appendix A
Review of fixed-length block source
coding
We review the classical fixed-length block source coding results in this appendix.
A.1 Lossless Source Coding and Error Exponent
[Figure: $x_1^n \to$ Encoder $\to b_1^{\lfloor nR \rfloor} \to$ Decoder $\to \hat{x}_1^n$]
Figure A.1. Block lossless source coding
Consider a discrete memoryless iid source with distribution $p_x$ defined on a finite alphabet
$\mathcal{X}$. A rate $R$ block source coding system for $n$ source symbols consists of an encoder-decoder
pair $(\mathcal{E}_n, \mathcal{D}_n)$, as shown in Figure A.1, where

$$\mathcal{E}_n : \mathcal{X}^n \to \{0,1\}^{\lfloor nR \rfloor}, \qquad \mathcal{E}_n(x_1^n) = b_1^{\lfloor nR \rfloor}$$
$$\mathcal{D}_n : \{0,1\}^{\lfloor nR \rfloor} \to \mathcal{X}^n, \qquad \mathcal{D}_n(b_1^{\lfloor nR \rfloor}) = \hat{x}_1^n$$

The probability of block decoding error is $\Pr[x_1^n \neq \hat{x}_1^n] = \Pr[x_1^n \neq \mathcal{D}_n(\mathcal{E}_n(x_1^n))]$.
In his seminal paper [72], Shannon proved that arbitrarily small error probabilities are
achievable by letting $n$ get big, as long as the encoder rate is larger than the entropy of the
source, $R > H(p_x)$, where $H(p_x) = \sum_{x \in \mathcal{X}} -p_x(x)\log p_x(x)$. Furthermore, it turns out that
the error probability goes to zero exponentially in $n$.

Theorem 9 (From [29]) For a discrete memoryless source $x \sim p_x$ and encoder rate $R < \log|\mathcal{X}|$:

$\forall \epsilon > 0$, $\exists K < \infty$, s.t. $\forall n \geq 0$, $\exists$ a block encoder-decoder pair $(\mathcal{E}_n, \mathcal{D}_n)$ such that

$$\Pr[x_1^n \neq \hat{x}_1^n] \leq K 2^{-n(E_{s,b}(R) - \epsilon)} \tag{A.1}$$

This result is asymptotically tight, in the sense that for any sequence of encoder-decoder
pairs $(\mathcal{E}_n, \mathcal{D}_n)$,

$$\limsup_{n \to \infty} -\frac{1}{n}\log \Pr[x_1^n \neq \hat{x}_1^n] = \limsup_{n \to \infty} -\frac{1}{n}\log \Pr[x_1^n \neq \mathcal{D}_n(\mathcal{E}_n(x_1^n))] \leq E_{s,b}(R) \tag{A.2}$$

where $E_{s,b}(R)$ is defined as the block source coding error exponent with the form:

$$E_{s,b}(R) = \min_{q : H(q) \geq R} D(q \| p_x) \tag{A.3}$$
Paralleling the definition of the Gallager function for channel coding [41], as mentioned
as an exercise in [29]:

$$E_{s,b}(R) = \sup_{\rho \geq 0}\big\{\rho R - E_0(\rho)\big\} \tag{A.4}$$

where $E_0(\rho) = (1+\rho)\log\big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\big]$.
In [29], it is shown that if the encoder randomly assigns a bin number in $\{1, 2, \ldots, 2^{\lfloor nR \rfloor}\}$
with equal probability to the source sequence, and the decoder performs a maximum likelihood
or minimum empirical entropy decoding rule over the source sequences in the same bin,
the random coding error exponent for block source coding is:

$$E_r(R) = \min_q \big\{D(q \| p_x) + |R - H(q)|^+\big\} = \sup_{\rho \in [0,1]}\Big\{\rho R - (1+\rho)\log\big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\big]\Big\} \tag{A.5}$$

where $|t|^+ = \max\{0, t\}$. This error exponent is the same as the block coding error exponent
(A.3) in the low rate regime and strictly lower than the block coding error exponent in the
high rate regime [29]. In Section 2.3.3, we show that this random coding error exponent is
also achievable in the delay constrained source coding setup for streaming data. However,
in Section 2.5.1 we will show that both the random coding error exponent and the block coding
error exponent are suboptimal everywhere for $R \in (H(p_x), \log|\mathcal{X}|)$ in the delay constrained
setup.
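The relation between (A.4) and (A.5) is easy to see numerically. The following Python sketch (our own illustration, not part of the dissertation; the binary source and the grid resolutions are arbitrary choices) evaluates both exponents on a grid of $\rho$ and confirms that they coincide at low rates, while the random coding exponent is strictly smaller at high rates, where the optimizing $\rho$ exceeds 1.

```python
import math

def E0(p, rho):
    # Gallager-style function for source coding: (1+rho) log2 sum_x p(x)^(1/(1+rho))
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def Esb(p, R, rho_max=50.0, steps=50000):
    # block source coding exponent (A.4): sup over rho >= 0, on a finite grid
    return max((rho_max * i / steps) * R - E0(p, rho_max * i / steps)
               for i in range(steps + 1))

def Er(p, R, steps=10000):
    # random binning exponent (A.5): same objective, rho restricted to [0, 1]
    return max((i / steps) * R - E0(p, i / steps) for i in range(steps + 1))

p = [0.9, 0.1]                               # example binary source
H = -sum(px * math.log2(px) for px in p)     # H(p), about 0.469 bits
low_gap = Esb(p, 0.6) - Er(p, 0.6)           # low rate: the exponents agree
high_gap = Esb(p, 0.95) - Er(p, 0.95)        # high rate: Er is strictly smaller
```

The comparison only depends on where the unconstrained optimizer of $\rho R - E_0(\rho)$ sits relative to $\rho = 1$.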
A.2 Lossy source coding
Now consider a discrete memoryless iid source with distribution $p_x$ defined on $\mathcal{X}$. A rate
$R$ block lossy source coding system for $N$ source symbols consists of an encoder-decoder
pair $(\mathcal{E}_N, \mathcal{D}_N)$, as shown in Figure A.2, where

$$\mathcal{E}_N : \mathcal{X}^N \to \{0,1\}^{\lfloor NR \rfloor}, \qquad \mathcal{E}_N(x_1^N) = b_1^{\lfloor NR \rfloor}$$
$$\mathcal{D}_N : \{0,1\}^{\lfloor NR \rfloor} \to \mathcal{Y}^N, \qquad \mathcal{D}_N(b_1^{\lfloor NR \rfloor}) = y_1^N$$
Instead of attempting to estimate the source symbols $x_1^N$ exactly as in Chapter 2, in lossy
source coding the design goal of the system is to reconstruct the source symbols $x_1^N$ within
an average distortion $D > 0$.
[Figure: $x_1^N \to$ Encoder $\to b_1^{\lfloor NR \rfloor} \to$ Decoder $\to y_1^N$]
Figure A.2. Block lossy source coding
A.2.1 Rate distortion function and error exponent for block coding under
average distortion
The key issue of block lossy source coding is to determine the minimum rate $R$ such
that the distortion constraint $d(x_1^N, y_1^N) \leq D$ is satisfied with probability close to 1. For
average distortion, the distortion between two sequences is the average of the distortions of
the individual symbols:

$$d(x_1^N, y_1^N) = \frac{1}{N}\sum_{i=1}^{N} d(x_i, y_i)$$
In [74], Shannon first proved the following rate distortion theorem:

Theorem 10 The rate-distortion function $R(D)$ for the average distortion measure is:

$$R(p_x, D) \triangleq \min_{W \in \mathcal{W}_D} I(p_x, W) \tag{A.6}$$

where $\mathcal{W}_D$ is the set of all transition matrices that satisfy the average distortion
constraint, i.e.

$$\mathcal{W}_D = \Big\{W : \sum_{x,y} p_x(x)W(y|x)d(x,y) \leq D\Big\}.$$

Operationally, this theorem says that for any $\epsilon > 0$ and $\delta > 0$, for block length $N$ big enough,
there exists a code of rate $R(p_x, D) + \epsilon$ such that the average distortion between the source
string $x_1^N$ and its reconstruction $y_1^N$ is no bigger than $D$ with probability at least $1-\delta$.
This problem is only interesting if the target average distortion $D$ is higher than $\underline{D}$ and
lower than $\overline{D}$, where

$$\underline{D} \triangleq \sum_{x \in \mathcal{X}} p_x(x)\min_{y \in \mathcal{Y}} d(x,y), \qquad \overline{D} \triangleq \min_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_x(x)d(x,y).$$

Below $\underline{D}$ even the best symbol-by-symbol reproduction cannot meet the constraint, while
$\overline{D}$ is already achievable at rate zero by always outputting a single fixed reproduction symbol.
Interested readers may read [6].
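The minimization in (A.6) can be evaluated with the standard Blahut-Arimoto algorithm, which is not discussed in the dissertation but is sketched here for illustration (the source, distortion matrix, Lagrange slope $s$, and iteration count are our own arbitrary choices). For a fixed slope $s$ the iteration alternates between the optimal test channel for the current output marginal and the marginal induced by that channel; for a binary source under Hamming distortion the result can be checked against the known closed form $R(D) = h(p) - h(D)$.

```python
import math

def ba_point(p, d, s, iters=20000):
    """Blahut-Arimoto iteration for one point of the rate-distortion curve of
    source p under distortion matrix d[x][y], at Lagrange slope s >= 0."""
    nx, ny = len(p), len(d[0])
    q = [1.0 / ny] * ny                           # reproduction marginal, init uniform
    for _ in range(iters):
        # optimal test channel for the current q: W(y|x) proportional to q(y) 2^(-s d(x,y))
        W = [[q[y] * 2.0 ** (-s * d[x][y]) for y in range(ny)] for x in range(nx)]
        W = [[w / sum(row) for w in row] for row in W]
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    D = sum(p[x] * W[x][y] * d[x][y] for x in range(nx) for y in range(ny))
    R = sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
            for x in range(nx) for y in range(ny) if W[x][y] > 0)
    return R, D

p = [0.75, 0.25]                 # binary source
d = [[0, 1], [1, 0]]             # Hamming distortion
R, D = ba_point(p, d, s=2.0)     # slope 2 corresponds to D = 1/(1+2^s) = 0.2
```

The convexity of $R(p_x, D)$ in $D$ mentioned below is what makes this single-slope parameterization trace the whole curve.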
For distortion constraint $D > \underline{D}$, we have the following fixed-to-variable length coding
result for the average distortion measure. To have $\Pr[d(x_1^N, y_1^N) > D] = 0$, we can implement a
universal variable-length prefix-free code with code length $l_D(x_1^N)$, where

$$l_D(x_1^N) = N\big(R(p_{x_1^N}, D) + \delta_N\big) \tag{A.7}$$

where $p_{x_1^N}$ is the empirical distribution of $x_1^N$, and $\delta_N$ goes to 0 as $N$ goes to infinity.
This is a simple corollary of the type covering lemma [29, 28], which is derived from the
Johnson-Stein-Lovász theorem [23].
It is widely known that $R(p_x, D)$ is convex $\cup$ in $D$ for fixed $p_x$ [5]. However, the rate
distortion function $R(p_x, D)$ for the average distortion measure is in general non-concave (non-$\cap$)
in the source distribution $p_x$ for fixed distortion constraint $D$, as pointed out in [55].

Now we present the large deviation properties of lossy source coding under average
distortion measures, from [55].
Theorem 11 Block coding error exponent under average distortion:

$$\liminf_{N \to \infty} -\frac{1}{N}\log_2 \Pr[d(x_1^N, y_1^N) > D] = E_D^{b,average}(R)$$
$$\text{where } E_D^{b,average}(R) = \min_{q_x : R(q_x, D) > R} D(q_x \| p_x) \tag{A.8}$$

where $y_1^N$ is the reconstruction of $x_1^N$ using an optimal rate $R$ code.
A.3 Block distributed source coding and error exponents
In the classic block-coding Slepian-Wolf paradigm [79, 29, 39] (illustrated in Figure 4.1),
full length-$N$ vectors¹ $x^N$ and $y^N$ are observed by their respective encoders before communi-
cation starts. In the block coding setup, a rate-$(R_x, R_y)$ length-$N$ block source code consists
of an encoder-decoder triplet $(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$, as we will define shortly in Definition 13.

Definition 13 A randomized length-$N$ rate-$(R_x, R_y)$ block encoder-decoder triplet
$(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$ is a set of maps²:

$$\mathcal{E}_N^x : \mathcal{X}^N \to \{0,1\}^{NR_x}, \quad \text{e.g., } \mathcal{E}_N^x(x^N) = a^{NR_x}$$
$$\mathcal{E}_N^y : \mathcal{Y}^N \to \{0,1\}^{NR_y}, \quad \text{e.g., } \mathcal{E}_N^y(y^N) = b^{NR_y}$$
$$\mathcal{D}_N : \{0,1\}^{NR_x} \times \{0,1\}^{NR_y} \to \mathcal{X}^N \times \mathcal{Y}^N, \quad \text{e.g., } \mathcal{D}_N(a^{NR_x}, b^{NR_y}) = (\hat{x}^N, \hat{y}^N)$$

where common randomness, similar to the point-to-point source coding case in Section 2.3.3,
shared between the encoders and the decoder is assumed. This allows us to randomize the
mappings independently of the source sequences.
The error probability typically considered in Slepian-Wolf coding is the joint error prob-
ability, $\Pr[(x^N, y^N) \neq (\hat{x}^N, \hat{y}^N)] = \Pr[(x^N, y^N) \neq \mathcal{D}_N(\mathcal{E}_N^x(x^N), \mathcal{E}_N^y(y^N))]$. This probability
is taken over the random source vectors as well as the randomized mappings. An error
exponent $E$ is said to be achievable if there exists a family of rate-$(R_x, R_y)$ encoders and
decoders $(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$, indexed by $N$, such that

$$\lim_{N \to \infty} -\frac{1}{N}\log \Pr[(x^N, y^N) \neq (\hat{x}^N, \hat{y}^N)] \geq E. \tag{A.9}$$

¹In the block coding part, we simply use $x^N$ instead of $x_1^N$ to denote the sequence of random variables
$x_1, \ldots, x_N$.
²For the sake of simplicity, we assume $NR_x$ and $NR_y$ are integers. It should be clear that this assumption
is insignificant in the asymptotic regime where $N$ is big and the integer effect can be ignored.
In this thesis, we study random source vectors $(x^N, y^N)$ that are iid across time but
may have dependencies at any given time:

$$p_{xy}(x^N, y^N) = \prod_{i=1}^{N} p_{xy}(x_i, y_i).$$

Without loss of generality, we assume the marginal distributions are nonzero, i.e. $p_x(x) > 0$
for all $x \in \mathcal{X}$, and $p_y(y) > 0$ for all $y \in \mathcal{Y}$.
For such iid sources, upper and lower bounds on the achievable error exponents are
derived in [39, 29]. These results are summarized in the following theorem.
Theorem 12 (Lower bound) Given a rate pair $(R_x, R_y)$ such that $R_x > H(x|y)$, $R_y >
H(y|x)$, $R_x + R_y > H(x,y)$. Then, for all

$$E < \min_{\bar{x},\bar{y}}\Big\{D(p_{\bar{x}\bar{y}} \| p_{xy}) + \big|\min[R_x + R_y - H(\bar{x},\bar{y}),\ R_x - H(\bar{x}|\bar{y}),\ R_y - H(\bar{y}|\bar{x})]\big|^+\Big\} \tag{A.10}$$

there exists a family of randomized encoder-decoder mappings as defined in Definition 13
such that (A.9) is satisfied. In (A.10) the function $|z|^+ = z$ if $z \geq 0$ and $|z|^+ = 0$ if $z < 0$.

(Upper bound) Given a rate pair $(R_x, R_y)$ such that $R_x > H(x|y)$, $R_y > H(y|x)$, $R_x +
R_y > H(x,y)$. Then, for all

$$E > \min\Big\{\min_{\bar{x},\bar{y} : R_x < H(\bar{x}|\bar{y})} D(p_{\bar{x}\bar{y}} \| p_{xy}),\ \min_{\bar{x},\bar{y} : R_y < H(\bar{y}|\bar{x})} D(p_{\bar{x}\bar{y}} \| p_{xy}),\ \min_{\bar{x},\bar{y} : R_x + R_y < H(\bar{x},\bar{y})} D(p_{\bar{x}\bar{y}} \| p_{xy})\Big\} \tag{A.11}$$

there does not exist a randomized encoder-decoder mapping as defined in Definition 13 such
that (A.9) is satisfied.

In both bounds $(\bar{x}, \bar{y})$ are arbitrary random variables with joint distribution $p_{\bar{x}\bar{y}}$.
Remark: As long as $(R_x, R_y)$ is in the interior of the achievable region, i.e., $R_x > H(x|y)$,
$R_y > H(y|x)$ and $R_x + R_y > H(x,y)$, the lower bound (A.10) is positive. The achievable
region is illustrated in Figure A.3. As shown in [29], the upper and lower bounds (A.11)
and (A.10) match when the rate pair $(R_x, R_y)$ is achievable and close to the boundary of
the region. This is analogous to the high rate regime in channel coding, or the low rate
regime in source coding, where the random coding bound (analogous to (A.10)) and the
sphere packing bound (analogous to (A.11)) agree.
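Membership of a rate pair in the achievable region is determined by three entropies computed directly from the joint distribution. The following sketch (our own illustration; the joint pmf and the tested rate pairs are arbitrary examples) makes the region of Figure A.3 concrete.

```python
import math

def H(probs):
    # entropy in bits of a probability vector
    return -sum(p * math.log2(p) for p in probs if p > 0)

def sw_entropies(pxy):
    # joint and conditional entropies from the joint pmf table pxy[x][y]
    flat = [p for row in pxy for p in row]
    px = [sum(row) for row in pxy]
    py = [sum(pxy[x][y] for x in range(len(pxy))) for y in range(len(pxy[0]))]
    Hxy = H(flat)
    return Hxy, Hxy - H(py), Hxy - H(px)      # H(x,y), H(x|y), H(y|x)

def achievable(Rx, Ry, pxy):
    # interior of the Slepian-Wolf region of Figure A.3
    Hxy, Hx_y, Hy_x = sw_entropies(pxy)
    return Rx > Hx_y and Ry > Hy_x and Rx + Ry > Hxy

pxy = [[0.4, 0.1],
       [0.1, 0.4]]     # H(x|y) = H(y|x) about 0.722, H(x,y) about 1.722 bits
```

Each of the three inequalities corresponds to one face of the region: the two conditional-entropy corners and the sum-rate boundary.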
Theorem 12 can also be used to generate bounds on the exponent for source coding with
decoder side-information (i.e., y observed at the decoder), and for source coding without
side information (i.e., y is independent of x and the decoder only needs to decode x).
Corollary 3 (Source coding with decoder side-information) Consider a Slepian-Wolf prob-
lem where $y$ is known by the decoder. Given a rate $R_x$ such that $R_x > H(x|y)$, then for
all

$$E < \min_{\bar{x},\bar{y}}\big\{D(p_{\bar{x}\bar{y}} \| p_{xy}) + |R_x - H(\bar{x}|\bar{y})|^+\big\}, \tag{A.12}$$

there exists a family of randomized encoder-decoder mappings as defined in Definition 13
such that (A.9) is satisfied.
The proof of Corollary 3 follows from Theorem 12 by letting $R_y$ be sufficiently large
($> \log|\mathcal{Y}|$). Similarly, by letting $y$ be independent of $x$ so that $H(x|y) = H(x)$, we get the
following random-coding bound for the point-to-point case of a single source $x$, which is the
random coding part of Theorem 9 in Section A.1.
Corollary 4 (Point-to-point) Consider a Slepian-Wolf problem where $y$ is independent of
$x$. Given a rate $R_x$ such that $R_x > H(x)$, for all

$$E < \min_{\bar{x}}\big\{D(p_{\bar{x}} \| p_x) + |R_x - H(\bar{x})|^+\big\} = E_r(R_x) \tag{A.13}$$

there exists a family of randomized encoder-decoder triplets as defined in Definition 13 such
that (A.9) is satisfied.
A.4 Review of block source coding with side-information
As shown in Figure 5.1, the sources are iid random variables $x_1^n, y_1^n$ from a finite alphabet
$\mathcal{X} \times \mathcal{Y}$ with distribution $p_{xy}$. Without loss of generality, we assume that $p_x(x) > 0$, $\forall x \in \mathcal{X}$,
and $p_y(y) > 0$, $\forall y \in \mathcal{Y}$. $x_1^n$ is the source known to the encoder and $y_1^n$ is the side-information
known only to the decoder. A rate $R$ block source coding system for $n$ source symbols
consists of an encoder-decoder pair $(\mathcal{E}_n, \mathcal{D}_n)$, where

$$\mathcal{E}_n : \mathcal{X}^n \to \{0,1\}^{\lfloor nR \rfloor}, \qquad \mathcal{E}_n(x_1^n) = b_1^{\lfloor nR \rfloor}$$
$$\mathcal{D}_n : \{0,1\}^{\lfloor nR \rfloor} \times \mathcal{Y}^n \to \mathcal{X}^n, \qquad \mathcal{D}_n(b_1^{\lfloor nR \rfloor}, y_1^n) = \hat{x}_1^n$$
[Figure: axes $R_x$ (up to $\log|\mathcal{X}|$) versus $R_y$ (up to $\log|\mathcal{Y}|$); the achievable region lies above the corner at $(H(x|y), H(y))$ and $(H(x), H(y|x))$, bounded by $R_x \geq H(x|y)$, $R_y \geq H(y|x)$, and the line $R_x + R_y = H(x,y)$.]
Figure A.3. Achievable region for Slepian-Wolf source coding
The error probability is $\Pr[x_1^n \neq \hat{x}_1^n] = \Pr[x_1^n \neq \mathcal{D}_n(\mathcal{E}_n(x_1^n), y_1^n)]$. The exponent $E_{si,b}(R)$
is achievable if $\exists$ a family of $(\mathcal{E}_n, \mathcal{D}_n)$, s.t.

$$\lim_{n \to \infty} -\frac{1}{n}\log_2 \Pr[x_1^n \neq \hat{x}_1^n] = E_{si,b}(R) \tag{A.14}$$
The relevant results of [29, 39] are summarized in the following theorem.

Theorem 13 $E_{si,b}^{lower}(R) \leq E_{si,b}(R) \leq E_{si,b}^{upper}(R)$, where

$$E_{si,b}^{lower}(R) = \min_{q_{xy}}\big\{D(q_{xy} \| p_{xy}) + |R - H(q_{x|y})|^+\big\}$$
$$E_{si,b}^{upper}(R) = \min_{q_{xy} : H(q_{x|y}) \geq R} D(q_{xy} \| p_{xy})$$

These two error exponents can also be expressed in a parameterized way:

$$E_{si,b}^{lower}(R) = \max_{\rho \in [0,1]}\Big\{\rho R - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
$$E_{si,b}^{upper}(R) = \max_{\rho \in [0,\infty)}\Big\{\rho R - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\}$$
It should be clear that both of these bounds are positive only if $R > H(p_{x|y})$. As shown
in [29], the two bounds are the same in the low rate regime. Furthermore, with encoder side-
information, similar to the case in Theorem 9, the block coding error exponent is $E_{si,b}^{upper}(R)$.
Thus in the low rate regime, the error exponents are the same with or without encoder side-
information. Due to the duality between channel coding and source coding with decoder
side-information, in the high rate regime one can derive an expurgated bound, as done by
Gallager [41] for channel coding, to tighten the random coding error exponent. Interested
readers may read Ahlswede's classical paper [2] and some recent papers [22, 64].
Appendix B
Tilted Entropy Coding and its
delay constrained performance
In this appendix, we introduce another way to achieve the delay constrained source
coding error exponent of Theorem 1. This coding scheme is identical to our universal
coding scheme introduced in Section 2.4.1, except that we use a non-universal fixed-to-
variable-length prefix-free source code based on a tilted entropy code. The rest of the system
is identical to that in Figure 2.11. This derivation first appeared in our paper [20] and is
a minor variation of the scheme analyzed in [49]. Just as in the universal coding scheme
in Section 2.4.1 and [49], the queueing behavior is what determines the delay constrained
error exponent. Rather than modeling the buffer as a random walk and analyzing the
stationary distribution of the process as in [49], here we give an alternate derivation by
applying Cramér's theorem.
B.1 Tilted entropy coding
We replace the universal optimal source code with the following tilted entropy code,
first introduced by Jelinek [49]. This code is a Shannon code¹ built for a particular tilted
distribution of $p_x$.

¹Code length proportional to the logarithm of the inverse of probability.

Definition 14 The $\lambda$-tilted entropy code is an instantaneous code $\mathcal{C}_{N\lambda}$ for $\lambda > -1$. It is a
mapping from $\mathcal{X}^N$ to a variable number of binary bits:

$$\mathcal{C}_{N\lambda}(x_1^N) = b_1^{l(x_1^N)}$$

where $l(x_1^N)$ is the codeword length for source sequence $x_1^N$. The first bit is always 1 and
the rest of the codeword is the Shannon codeword based on the $\lambda$-tilted distribution of $p_x$:

$$l(x_1^N) = 1 + \Bigg\lceil -\log_2 \frac{p_x(x_1^N)^{\frac{1}{1+\lambda}}}{\sum_{s_1^N \in \mathcal{X}^N} p_x(s_1^N)^{\frac{1}{1+\lambda}}} \Bigg\rceil \leq 2 - \sum_{i=1}^{N}\log_2 \frac{p_x(x_i)^{\frac{1}{1+\lambda}}}{\sum_{s \in \mathcal{X}} p_x(s)^{\frac{1}{1+\lambda}}} \tag{B.1}$$
From the definition of $l(\vec{x})$, we have the following fact:

$$\log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\lambda l(\vec{x})}\Big) \leq 2\lambda + N\log_2\Big[\Big(\sum_x p_x(x)^{1-\frac{\lambda}{1+\lambda}}\Big)\Big(\sum_x p_x(x)^{\frac{1}{1+\lambda}}\Big)^{\lambda}\Big]$$
$$= 2\lambda + N(1+\lambda)\log_2\Big[\sum_x p_x(x)^{\frac{1}{1+\lambda}}\Big] = 2\lambda + NE_0(\lambda) \tag{B.2}$$
This definition is valid for any $\lambda > -1$. As will be seen later in the proof, for a rate
$R$ delay constrained source coding system, the optimal $\lambda$ is $\lambda = \rho^*$, where $\rho^*$ is defined in
(2.57):

$$\rho^* R = E_0(\rho^*)$$

From (B.1), the longest code length $l_N^{\lambda}$ is

$$l_N^{\lambda} \leq 2 - N\log_2 \frac{p_{x\min}^{\frac{1}{1+\lambda}}}{\sum_{s \in \mathcal{X}} p_x(s)^{\frac{1}{1+\lambda}}} \tag{B.3}$$

where $p_{x\min} = \min_{x \in \mathcal{X}} p_x(x)$. The constant 2 is insignificant compared to $N$.
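Definition 14 can be instantiated directly for small $N$. The following sketch (our own illustration; the source, block length, and tilt are arbitrary choices) computes the lengths in (B.1), checks them against the per-symbol bound, and verifies the Kraft inequality, which guarantees that an instantaneous code with these lengths exists.

```python
import math
from itertools import product

def tilted_lengths(p, N, lam):
    """Codeword lengths of the lambda-tilted entropy code of Definition 14:
    one leading 1 plus a Shannon codeword for the tilted distribution."""
    z1 = sum(px ** (1.0 / (1 + lam)) for px in p)
    Z = z1 ** N                    # normalizer over all length-N sequences
    lengths = {}
    for seq in product(range(len(p)), repeat=N):
        prob = math.prod(p[s] for s in seq)
        q = prob ** (1.0 / (1 + lam)) / Z      # tilted probability of seq
        lengths[seq] = 1 + math.ceil(-math.log2(q))
    return lengths

p, N, lam = [0.9, 0.1], 6, 0.5
L = tilted_lengths(p, N, lam)
kraft = sum(2.0 ** -l for l in L.values())     # <= 1: a prefix-free code exists
```

The extra leading bit only costs a factor of 2 in the Kraft sum, so prefix-freeness is never in danger.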
This variable-length code is turned into a fixed rate $R$ code just as in Section 2.4.1. The
delay constrained source coding scheme is illustrated in Figure 2.11. At time $kN$, $k = 1, 2, \ldots$,
the encoder $\mathcal{E}$ uses the variable-length code $\mathcal{C}_{N\lambda}$ to encode the $k$th source block
$\vec{x}_k = x_{(k-1)N+1}^{kN}$ into a binary sequence $b(k)_1, b(k)_2, \ldots, b(k)_{l(\vec{x}_k)}$. This codeword
is pushed into a FIFO queue with infinite buffer size. The encoder drains a bit from the
queue every $\frac{1}{R}$ seconds. If the queue is empty, the encoder simply sends 0's to the decoder
until new bits are pushed into the queue.
The decoder knows the variable-length codebook, and the prefix-free nature² of the
Shannon code guarantees that everything can be decoded correctly.
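The buffer process driven by this scheme can be visualized with a toy simulation (entirely our own illustration: the parameters, horizon, and seed are arbitrary, and draining $N \cdot R$ bits per block interval is a fluid approximation of one bit every $\frac{1}{R}$ seconds). With the mean codeword length below $N \cdot R$, the buffer occupancy is a random walk with negative drift reflected at 0, which is exactly the picture analyzed in the next section.

```python
import math
import random

random.seed(2007)
p = [0.9, 0.1]
N, R, lam = 20, 0.7, 0.5         # block length, link rate (bits/symbol), tilt
z1 = sum(px ** (1.0 / (1 + lam)) for px in p)

def block_codeword_length():
    # length (B.1) of the tilted code for one random length-N source block
    seq = random.choices(range(len(p)), weights=p, k=N)
    logq = sum(math.log2(p[s] ** (1.0 / (1 + lam)) / z1) for s in seq)
    return 1 + math.ceil(-logq)

# Omega_k: bits left in the FIFO buffer at block boundaries. Each block
# interval adds one codeword and drains N*R bits; the max(0, .) is the
# reflecting barrier (the encoder sends filler 0's when the queue is empty).
omega, trace = 0.0, []
for k in range(20000):
    omega = max(0.0, omega + block_codeword_length() - N * R)
    trace.append(omega)
```

A missed deadline in the next section corresponds to `omega` exceeding roughly $\Delta R$ bits at the time the block of interest enters the queue.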
B.2 Error Events
The number of bits $\Omega_k$ in the encoder buffer at time $kN$ is a random walk process
with negative drift and a reflecting barrier. At time $(k+d)N$, where $dN = \Delta$, the decoder
can make an error in estimating $\vec{x}_k$ if and only if part of the variable-length codeword for source
block $\vec{x}_k$ is still in the encoder buffer.

Since the FIFO queue drains deterministically, this means that when the $k$th block's
codeword entered the queue, it was already doomed to miss its deadline of $dN = \Delta$.
Formally, for an error to occur, the number of bits in the buffer must satisfy $\Omega_k \geq \lfloor dNR \rfloor = \lfloor \Delta R \rfloor$.
Thus, meeting a specific end-to-end latency constraint over a fixed-rate noiseless link is
like the buffer overflow events analyzed in [49]. Define the random time $t_kN$ to be the
last time before time $kN$ when the queue was empty. A missed deadline occurs only if
$\sum_{i=t_k+1}^{k} l(\vec{x}_i) > (d + k - t_k)NR = (kN - t_kN + \Delta)R$.
For arbitrary $1 \leq t \leq k-1$, define the error event probability

$$P_N^{k,d}(t) = \Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) > (d + k - t)NR\Big) = \Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) > (kN - tN + \Delta)R\Big).$$

Using Cramér's theorem [31], we derive an upper bound on $P_N^{k,d}(t)$ that is tight in the
large deviations sense:

$$P_N^{k,d}(t) = \Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) \geq (d+k-t)NR\Big) = \Pr\Big(\frac{1}{k-t}\sum_{i=t+1}^{k} l(\vec{x}_i) \geq \frac{(d+k-t)NR}{k-t}\Big)$$
$$\leq (k-t)|\mathcal{X}|^N 2^{-E_N(\mathcal{C}_{N\lambda}, R, k-t, d)}$$
where, using the fact in (B.2) and noticing that the $2\lambda$ term in (B.2) is insignificant, we have
the large deviation exponent:

$$E_N(\mathcal{C}_{N\lambda}, R, k-t, d) \geq (k-t)\sup_{\rho \geq 0}\Big\{\rho\frac{(d+k-t)NR}{k-t} - \log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\rho l(\vec{x})}\Big)\Big\}$$
$$\geq (k-t)\Big[\lambda\frac{(d+k-t)NR}{k-t} - \log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\lambda l(\vec{x})}\Big)\Big]$$
$$\geq (k-t)N\Big(\lambda\frac{(d+k-t)R}{k-t} - \frac{2\lambda}{N} - E_0(\lambda)\Big)$$
$$= dN\lambda R + (k-t)N\Big[\lambda R - E_0(\lambda) - \frac{2\lambda}{N}\Big] \tag{B.4}$$

²The initial 1 is not really required since the decoder knows the rate at which source symbols are arriving
at the encoder. Thus, it knows when the queue is empty and does not need to even interpret the 0's it
receives.
B.3 Achievability of delay constrained error exponent Es(R)
We only need to show that for any $\epsilon > 0$, by appropriate choice of $N$ and $\lambda$, it is possible
to achieve an error exponent with delay of $E_s(R) - \epsilon$, i.e. for all $j, \Delta$,

$$\Pr[x_j \neq \hat{x}_j(j + \Delta)] \leq K2^{-\Delta(E_s(R) - \epsilon)}$$

From (2.57), we know that $E_s(R) = E_0(\rho^*)$, where $\rho^* R = E_0(\rho^*)$.

Proof: For the tilted entropy source coding scheme $\mathcal{C}_{N\lambda}$, the decoding error probability for the
$k$th source block at time $(k+d)N$ is³ controlled by the $P_N^{k,d}(t)$, where $t_kN$ is the last time
before $kN$ when the buffer was empty:

$$\Pr[x_j \neq \hat{x}_j(j + \Delta)] \leq \sum_{t=0}^{k-1}\Pr\Big(t_k = t,\ \sum_{i=t+1}^{k} l(\vec{x}_i) \geq (d+k-t)NR\Big)$$
$$\leq \sum_{t=0}^{k-1}\Pr\Big(\sum_{i=t+1}^{k} l(\vec{x}_i) \geq (d+k-t)NR\Big)$$
$$\leq \sum_{t=0}^{k-1}(k-t+1)|\mathcal{X}|^N 2^{-E_N(\mathcal{C}_{N\lambda}, R, k-t, d)}$$
$$= 2^{-dN\lambda R}\sum_{t=0}^{k-1}(k-t+1)|\mathcal{X}|^N 2^{-(k-t)N[\lambda R - E_0(\lambda) - \frac{2\lambda}{N}]}$$

The above holds for all $N, \lambda$. Pick $\lambda = \rho^* - \frac{\epsilon}{R}$. From Figure 2.12, we know
that $\lambda R - E_0(\lambda) > 0$. Choose $N > \frac{2\lambda}{\lambda R - E_0(\lambda)}$. Then define

$$K(\epsilon, R, N) = \sum_{i=0}^{\infty}(i+1)|\mathcal{X}|^N 2^{-iN[\lambda R - E_0(\lambda) - \frac{2\lambda}{N}]} < \infty$$

which is guaranteed to be finite since decaying exponentials dominate polynomials in sums.
Thus:

$$\Pr[x_j \neq \hat{x}_j(j + \Delta)] \leq K(\epsilon, R, N)2^{-dN\lambda R} = K(\epsilon, R, N)2^{-\Delta\lambda R} = K(\epsilon, R, N)2^{-\Delta(\rho^* R - \epsilon)}$$

where the constant $K(\epsilon, R, N)$ does not depend on the delay in question. Since $E_0(\rho^*) = \rho^* R$
and the encoder is not targeted to the delay $\Delta$, this scheme achieves the desired delay
constrained exponent as promised.

³Here we denote $j$ by $kN$, and $\Delta$ by $dN$.
Appendix C
Proof of the concavity of the rate
distortion function under the peak
distortion measure
In this appendix, we prove Lemma 5 which states that the rate distortion function
R(p,D) is concave ∩ in p.
Proof: To show that $R(p, D)$ is concave $\cap$ in $p$, it is enough to show that for any two
distributions $p_0$ and $p_1$ and for any $\lambda \in [0,1]$,

$$R(p_\lambda, D) \geq \lambda R(p_0, D) + (1-\lambda)R(p_1, D)$$

where $p_\lambda = \lambda p_0 + (1-\lambda)p_1$. Define:

$$W^* = \arg\min_{W \in \mathcal{W}_D} I(p_\lambda, W)$$

From the definition of $R(p, D)$ we know that

$$R(p_\lambda, D) = I(p_\lambda, W^*) \geq \lambda I(p_0, W^*) + (1-\lambda)I(p_1, W^*) \tag{C.1}$$
$$\geq \lambda\min_{W \in \mathcal{W}_D} I(p_0, W) + (1-\lambda)\min_{W \in \mathcal{W}_D} I(p_1, W) = \lambda R(p_0, D) + (1-\lambda)R(p_1, D)$$

(C.1) is true because $I(p, W)$ is concave $\cap$ in $p$ for fixed $W$ and $p_\lambda = \lambda p_0 + (1-\lambda)p_1$. The
rest follows by definition. □
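The key step (C.1) rests on the concavity of mutual information in the input distribution for a fixed channel, which is easy to spot-check numerically. The sketch below (our own illustration; the channel and the two input distributions are arbitrary choices) evaluates both sides of the inequality.

```python
import math

def mutual_info(p, W):
    # I(p, W) in bits, with q = pW the induced output distribution
    ny = len(W[0])
    q = [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(ny)]
    return sum(p[x] * W[x][y] * math.log2(W[x][y] / q[y])
               for x in range(len(p)) for y in range(ny) if W[x][y] > 0)

W = [[0.8, 0.2],
     [0.3, 0.7]]                      # a fixed test channel
p0, p1, lam = [0.9, 0.1], [0.2, 0.8], 0.4
plam = [lam * a + (1 - lam) * b for a, b in zip(p0, p1)]
lhs = mutual_info(plam, W)                                    # I(p_lam, W)
rhs = lam * mutual_info(p0, W) + (1 - lam) * mutual_info(p1, W)
```

Concavity holds for every channel and every pair of input distributions, so the check does not depend on the particular numbers chosen.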
Appendix D
Derivation of the upper bound on
the delay constrained lossy source
coding error exponent
In this appendix, we prove the converse of Theorem 2. The proof is similar to that of
the converse of the delay constrained lossless source coding error exponent in Theorem 1.
To bound the best possible error exponent with fixed delay, we consider a block coding
encoder/decoder pair that is constructed by the delay constrained encoder/decoder pair
and translate the block-coding error exponent for peak distortion in Lemma 6 to the delay
constrained error exponent. The arguments are analogous to the “focusing bound” deriva-
tion in [67] for channel coding with feedback and extremely similar to that of the lossless
source coding case in Theorem 1. We summarize the converse of Theorem 2 in the following
proposition.
Proposition 12 For fixed-rate encodings of discrete memoryless sources, it is not possible
to achieve a lossy source coding error exponent with fixed delay higher than

$$\inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R) \tag{D.1}$$

By the definition of the delay constrained lossy source coding error exponent in Definition 4,
the statement of this proposition is equivalent to the following statement:

For any $E > \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R)$, there exists a positive $\epsilon$, such that for any $K < \infty$,
there exist $i > 0$, $\Delta > 0$ with

$$\Pr[d(x_i, y_i(i + \Delta)) \geq D] > K2^{-\Delta(E - \epsilon)}$$
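For the lossless case of Theorem 1, the analogous infimum over $\alpha$ can be evaluated in closed form, which makes the structure of (D.1) easy to sanity-check numerically. The following sketch is entirely our own illustration: it uses the lossless block exponent (A.4) as a stand-in for $E_D^b$, and the source, rate, and grids are arbitrary choices. It compares the infimum over $\alpha$ with $E_0(\rho^*)$, where $\rho^* R = E_0(\rho^*)$ as in (2.57).

```python
import math

def E0(p, rho):
    return (1 + rho) * math.log2(sum(px ** (1.0 / (1 + rho)) for px in p))

def Esb(p, R, rho_max=20.0, steps=2000):
    # lossless block exponent (A.4) on a rho grid
    return max((rho_max * i / steps) * R - E0(p, rho_max * i / steps)
               for i in range(steps + 1))

p, R = [0.9, 0.1], 0.6          # source with H(p) about 0.469 < R < log2|X| = 1

# numerical right-hand side: inf over alpha of (1/alpha) Esb((1+alpha) R)
alphas = [0.02 * i for i in range(1, 251)]
E_focus = min(Esb(p, (1 + a) * R) / a for a in alphas)

# closed-form answer for the lossless case: E0(rho*) with rho* R = E0(rho*),
# found by bisection on the nontrivial root of E0(rho) - rho R
lo, hi = 0.01, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if E0(p, mid) - mid * R < 0 else (lo, mid)
E_delay = lo * R
```

The agreement reflects the focusing-bound geometry of [67]: each $\alpha$ corresponds to a candidate dominant error interval of relative length $\alpha$ before the deadline.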
Proof: We show the proposition by contradiction. Suppose that the delay-constrained
error exponent can be higher than $\inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R)$. Then according to Definition 4, there
exists a delay-constrained source coding system such that for some $E > \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R)$
and any positive real value $\epsilon$, there exists $K < \infty$, such that for all $i > 0$, $\Delta > 0$:

$$\Pr[d(x_i, y_i(i + \Delta)) > D] \leq K2^{-\Delta(E - \epsilon)}$$

So we choose $\epsilon > 0$ such that

$$E - \epsilon > \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b((\alpha+1)R) \tag{D.2}$$
Then consider a block coding scheme $(\mathcal{E}, \mathcal{D})$ built on the delay constrained lossy
source coding system. The encoder of the block coding system is the same as the delay-
constrained lossy source encoder, and the block decoder $\mathcal{D}$ works as follows:

$$y_1^i = (y_1(1 + \Delta), y_2(2 + \Delta), \ldots, y_i(i + \Delta))$$

Now the block decoding distortion of this coding system can be upper bounded as
follows, for any $i > 0$ and $\Delta > 0$:

$$\Pr[d(x_1^i, y_1^i) > D] = \Pr\big[\max_{1 \leq t \leq i} d(x_t, y_t(t + \Delta)) > D\big] \leq \sum_{t=1}^{i}\Pr[d(x_t, y_t(t + \Delta)) > D]$$
$$\leq \sum_{t=1}^{i} K2^{-\Delta(E - \epsilon)} = iK2^{-\Delta(E - \epsilon)} \tag{D.3}$$

where the first equality uses the peak distortion measure and the first inequality is the union
bound.
The block coding scheme $(\mathcal{E}, \mathcal{D})$ is a block source coding system for $i$ source symbols
using $\lfloor R(i+\Delta) \rfloor$ bits, hence has rate $\frac{\lfloor R(i+\Delta) \rfloor}{i} \leq \frac{i+\Delta}{i}R$. From the block coding result
in Lemma 6, we know that the lossy source coding error exponent $E_D^b(R)$ is monotonically
increasing in $R$, so $E_D^b\big(\frac{\lfloor R(i+\Delta) \rfloor}{i}\big) \leq E_D^b\big(\frac{R(i+\Delta)}{i}\big)$. Again from Lemma 6, we know that the
block coding error probability can be bounded in the following way:

$$\Pr[d(x_1^i, y_1^i) > D] > 2^{-i(E_D^b(\frac{R(i+\Delta)}{i}) + \epsilon_i)} \tag{D.4}$$

where $\lim_{i \to \infty}\epsilon_i = 0$.
Combining (D.3) and (D.4), we have:

$$2^{-i(E_D^b(\frac{R(i+\Delta)}{i}) + \epsilon_i)} < iK2^{-\Delta(E - \epsilon)}$$

Now let $\alpha = \frac{\Delta}{i}$, $\alpha > 0$; the above inequality becomes
$2^{-i(E_D^b(R(1+\alpha)) + \epsilon_i)} < 2^{-i\alpha(E - \epsilon - \theta_i)}$, and hence:

$$E - \epsilon < \frac{1}{\alpha}\big(E_D^b(R(1+\alpha)) + \epsilon_i\big) + \theta_i \tag{D.5}$$

where $\theta_i = \frac{\log_2(iK)}{i\alpha}$, so $\lim_{i \to \infty}\theta_i = 0$. The above inequality is true for all $i$ and $\alpha > 0$, with
$\lim_{i \to \infty}\theta_i = 0$ and $\lim_{i \to \infty}\epsilon_i = 0$. Taking all these into account, we have:

$$E - \epsilon \leq \inf_{\alpha > 0}\frac{1}{\alpha}E_D^b(R(1+\alpha)) \tag{D.6}$$

Now (D.6) contradicts the assumption in (D.2), thus the proposition is proved. ■
Appendix E
Bounding source atypicality under
a distortion measure
In this appendix, we prove Lemma 7 in Section 3.4.2.
Proof: We only need to show the case $r > R(p_x, D)$. By Cramér's theorem [31], for
all $\epsilon_1 > 0$, there exists $K_1$, such that

$$\Pr\Big[\sum_{i=1}^{n} l_D(\vec{x}_i) > nNr\Big] = \Pr\Big[\frac{1}{n}\sum_{i=1}^{n} l_D(\vec{x}_i) > Nr\Big] \leq K_1 2^{-n(\inf_{z > Nr} I(z) - \epsilon_1)}$$

where the rate function $I(z)$ is [31]:

$$I(z) = \sup_{\rho \geq 0}\Big\{\rho z - \log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\rho l_D(\vec{x})}\Big)\Big\} \tag{E.1}$$

It is clear that $I(z)$ is monotonically increasing and continuous in $z$. Thus

$$\inf_{z > Nr} I(z) = I(Nr) \tag{E.2}$$
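The Chernoff bound behind (E.1) holds for every $n$, not only asymptotically, so it can be checked exactly by enumeration for a small $n$. The sketch below (our own illustration; the toy length variable, $n$, threshold, and grids are arbitrary choices) compares the exact tail probability of an iid sum with $2^{-nI(z)}$.

```python
import math
from itertools import product

vals = [1.0, 2.0, 5.0]          # a toy "codeword length" random variable
probs = [0.7, 0.2, 0.1]
mean = sum(v * p for v, p in zip(vals, probs))   # = 1.6

def rate_fn(z, rho_max=20.0, steps=20000):
    # I(z) = sup_{rho >= 0} { rho z - log2 E[2^(rho X)] }, cf. (E.1)
    best = 0.0
    for i in range(steps + 1):
        rho = rho_max * i / steps
        mgf = sum(p * 2.0 ** (rho * v) for v, p in zip(vals, probs))
        best = max(best, rho * z - math.log2(mgf))
    return best

# for small n the exact tail probability can be enumerated and compared
# with the Chernoff bound 2^(-n I(z)), which holds for every n
n, z = 6, 2.5                   # threshold z above the mean
exact = sum(math.prod(probs[i] for i in seq)
            for seq in product(range(3), repeat=n)
            if sum(vals[i] for i in seq) >= n * z)
bound = 2.0 ** (-n * rate_fn(z))
```

The rate function vanishes at the mean and increases above it, which is the monotonicity used in (E.2).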
Using the upper bound on $l_D(\vec{x})$ in (3.6):

$$\log_2\Big(\sum_{\vec{x} \in \mathcal{X}^N} p_x(\vec{x})2^{\rho l_D(\vec{x})}\Big) \leq \log_2\Big(\sum_{q_x \in \mathcal{T}^N} 2^{-ND(q_x \| p_x)}2^{\rho N(\delta_N + R(q_x, D))}\Big)$$
$$\leq \log_2\Big(2^{N\epsilon_N}2^{-N\min_{q_x}\{D(q_x \| p_x) - \rho R(q_x, D) - \rho\delta_N\}}\Big)$$
$$= N\Big(-\min_{q_x}\{D(q_x \| p_x) - \rho R(q_x, D) - \rho\delta_N\} + \epsilon_N\Big)$$

where $\mathcal{T}^N$ is the set of types of $\mathcal{X}^N$, and $2^{N\epsilon_N}$ is the number of types in $\mathcal{X}^N$, with
$0 < \epsilon_N \leq \frac{|\mathcal{X}|\log_2(N+1)}{N}$, so $\epsilon_N$ goes to 0 as $N$ goes to infinity.
Substituting the above inequalities into (E.1):

$$I(Nr) \geq N\Big(\sup_{\rho \geq 0}\min_{q_x}\big\{\rho(r - R(q_x, D) - \delta_N) + D(q_x \| p_x)\big\} - \epsilon_N\Big) \tag{E.3}$$

Next we show that $I(Nr) \geq N(E_D^b(r) + \epsilon'_N)$ where $\epsilon'_N$ goes to 0 as $N$ goes to infinity. We
first show the existence of a saddle point of the function

$$f(q_x, \rho) = \rho(r - R(q_x, D) - \delta_N) + D(q_x \| p_x)$$
Obviously, for fixed qx , f(qx , ρ) is a linear function of ρ, thus concave ∩. Also for fixed
ρ ≥ 0, f(qx , ρ) is a convex ∪ function of qx , because both −R(qx , D) and D(qx‖px) are
convex ∪ in qx . Write
g(u) = minqx
supρ≥0
(f(qx , ρ) + ρu)
Showing that g(u) is finite around u = 0 establishes the existence of the saddle point as
shown in Exercise 5.25 [10].
minqx
supρ≥0
f(q, ρ) + ρu =(a) minqx
supρ≥0
ρ(r −R(qx , D)− δN + u) + D(qx‖px)
≤(b) minqx :R(qx ,D)≥r−δN+u
supρ≥0
ρ(r −R(qx , D)− δN + u) + D(qx‖px)
≤(c) minqx :R(qx ,D)≥r−δN+u
D(qx‖px)
<(d) ∞
(a) is by definition. (b) is true because R(px , D) < r < RD, thus for very small δN and u,
R(px , D) < r−δN +u < RD. Thus there exists a distribution qx , s.t. R(qx , D) ≥ r−δN +u.
(c) is because R(qx , D) ≥ r− δN +u and ρ ≥ 0. (d) is true because we might as well assume
that px(x) > 0 for all x ∈ X , and r − δN + u < RD. Thus we proved the existence of the
saddle point of f(q, ρ).
$$\sup_{\rho \ge 0} \min_{q_x} f(q_x, \rho) = \min_{q_x} \sup_{\rho \ge 0} f(q_x, \rho) \qquad (E.4)$$
Note that if $R(q_x, D) < r - \delta_N$, then $\rho$ can be chosen arbitrarily large to make $\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)$ arbitrarily large. Thus the $q_x$ minimizing $\sup_{\rho \ge 0}\{\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)\}$ satisfies $r - R(q_x, D) - \delta_N \le 0$. So
$$\min_{q_x}\sup_{\rho \ge 0}\big\{\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)\big\} \stackrel{(a)}{=} \min_{q_x: R(q_x,D) \ge r - \delta_N}\ \sup_{\rho \ge 0}\big\{\rho(r - R(q_x, D) - \delta_N) + D(q_x\|p_x)\big\}$$
$$\stackrel{(b)}{=} \min_{q_x: R(q_x,D) \ge r - \delta_N} D(q_x\|p_x) \stackrel{(c)}{=} E^b_D(r - \delta_N) \qquad (E.5)$$
(a) follows from the argument above. (b) is because $r - R(q_x, D) - \delta_N \le 0$ and $\rho \ge 0$, hence $\rho = 0$ maximizes $\rho(r - R(q_x, D) - \delta_N)$. (c) is by the definition in (3.5). By combining (E.3), (E.4) and (E.5), letting $N$ be sufficiently big so that $\delta_N$ is sufficiently small, and noticing that $E^b_D(r)$ is continuous in $r$, we get the desired bound in (3.7). $\blacksquare$
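The Chernoff-bound side of the Cramér argument above is easy to sanity-check numerically. The sketch below is an illustration, not part of the thesis: the cost distribution is an assumed toy example. It computes the rate function $I(z) = \sup_{\rho\ge 0}\{\rho z - \log_2 E[2^{\rho l}]\}$ on a grid of $\rho$ and verifies that $2^{-nI(z)}$ really upper-bounds the exact tail probability of an i.i.d. sum:

```python
import math

# A hypothetical i.i.d. "distortion cost" l with a small finite support
# (the values and probabilities below are illustrative assumptions).
support = [0.0, 1.0, 3.0]
probs   = [0.5, 0.3, 0.2]

def rate_function(z, rho_grid):
    """Grid approximation of I(z) = sup_{rho>=0} [rho*z - log2 E[2^{rho*l}]]."""
    best = 0.0  # rho = 0 always contributes 0
    for rho in rho_grid:
        mgf = sum(p * 2.0 ** (rho * l) for p, l in zip(probs, support))
        best = max(best, rho * z - math.log2(mgf))
    return best

def exact_tail(n, threshold):
    """Exact Pr[sum of n i.i.d. costs > threshold] by repeated convolution."""
    dist = {0.0: 1.0}
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for l, q in zip(support, probs):
                new[s + l] = new.get(s + l, 0.0) + p * q
        dist = new
    return sum(p for s, p in dist.items() if s > threshold)

n, z = 20, 1.5  # z above the mean E[l] = 0.9, so I(z) > 0
rho_grid = [0.05 * k for k in range(1, 101)]   # rho in (0, 5]
I_z = rate_function(z, rho_grid)
tail = exact_tail(n, n * z)
# Each grid point rho gives a valid Markov/Chernoff bound, so the best one does too:
assert tail <= 2.0 ** (-n * I_z)
```

Each grid point gives a valid Chernoff bound by Markov's inequality, so taking the best grid point preserves validity; refining the grid only tightens the approximation of $I(z)$.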
Appendix F
Bounding individual error events for distributed source coding
In this appendix, we prove Lemma 9 and Lemma 11. The proofs have some similarities to those for point-to-point lossless source coding in Propositions 3 and 4.
F.1 ML decoding: Proof of Lemma 9
In this section we give the proof of Lemma 9; this part has the same flavor as the proof of Proposition 4 in Section 2.3.3. The technical tool used here is the standard Chernoff bound argument, or "the ρ thing" in Gallager's book [41]. This technique was perfected by Gallager in the derivation of the error exponents for a series of problems, cf. the multiple-access channel in [37], the degraded broadcast channel in [40] and distributed source coding in [39].
The bound depends on whether $l \le k$ or $l > k$. Consider the case $l \le k$:
$$p_n(l,k) = \sum_{x_1^n, y_1^n} p_{xy}(x_1^n, y_1^n)\,\Pr\big[\exists\,(\tilde{x}_1^n, \tilde{y}_1^n) \in B_x(x_1^n)\times B_y(y_1^n)\cap F_n(l,k,x_1^n, y_1^n) \text{ s.t. } p_{xy}(\tilde{x}_1^n, \tilde{y}_1^n) \ge p_{xy}(x_1^n, y_1^n)\big]$$
$$\le \sum_{x_1^n, y_1^n}\min\Bigg[1, \sum_{\substack{(\tilde{x}_1^n, \tilde{y}_1^n)\in F_n(l,k,x_1^n,y_1^n):\\ p_{xy}(\tilde{x}_1^n,\tilde{y}_1^n)\ge p_{xy}(x_1^n,y_1^n)}} \Pr[\tilde{x}_1^n\in B_x(x_1^n),\ \tilde{y}_1^n\in B_y(y_1^n)]\Bigg] p_{xy}(x_1^n, y_1^n) \qquad (F.1)$$
$$\le \sum_{x_l^n, y_l^n}\min\Bigg[1, \sum_{\substack{(\tilde{x}_l^n, \tilde{y}_l^n)\text{ s.t. } \tilde{y}_l^{k-1}=y_l^{k-1},\\ p_{xy}(\tilde{x}_l^n,\tilde{y}_l^n)\ge p_{xy}(x_l^n,y_l^n)}} 4\times 2^{-(n-l+1)R_x-(n-k+1)R_y}\Bigg] p_{xy}(x_l^n, y_l^n) \qquad (F.2)$$
$$\le 4\times\sum_{x_l^n, y_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n, \tilde{y}_k^n} 2^{-(n-l+1)R_x-(n-k+1)R_y}\,1\big[p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n) \ge p_{xy}(x_l^n, y_l^n)\big]\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 4\times\sum_{x_l^n, y_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n, \tilde{y}_k^n} 2^{-(n-l+1)R_x-(n-k+1)R_y}\min\bigg[1, \frac{p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)}{p_{xy}(x_l^n, y_l^n)}\bigg]\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 4\times\sum_{x_l^n, y_l^n}\Bigg[\sum_{\tilde{x}_l^n, \tilde{y}_k^n} 2^{-(n-l+1)R_x-(n-k+1)R_y}\bigg[\frac{p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)}{p_{xy}(x_l^n, y_l^n)}\bigg]^{\frac{1}{1+\rho}}\Bigg]^\rho p_{xy}(x_l^n, y_l^n) \qquad (F.3)$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\sum_{x_l^n, y_l^n} p_{xy}(x_l^n, y_l^n)^{\frac{1}{1+\rho}}\Bigg[\sum_{\tilde{x}_l^n, \tilde{y}_k^n}\big[p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\,p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)\big]^{\frac{1}{1+\rho}}\Bigg]^\rho$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\sum_{y_l^{k-1}}\Bigg[\sum_{x_l^{k-1}} p_{xy}(x_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Bigg]\Bigg[\sum_{\tilde{x}_l^{k-1}} p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Bigg]^\rho\Bigg[\sum_{\tilde{x}_k^n, \tilde{y}_k^n} p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)^{\frac{1}{1+\rho}}\Bigg]^\rho\sum_{x_k^n, y_k^n} p_{xy}(x_k^n, y_k^n)^{\frac{1}{1+\rho}}$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\Bigg[\sum_{y_l^{k-1}}\Big[\sum_{x_l^{k-1}} p_{xy}(x_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Bigg]\Bigg[\sum_{x_k^n, y_k^n} p_{xy}(x_k^n, y_k^n)^{\frac{1}{1+\rho}}\Bigg]^{1+\rho}$$
$$= 4\times 2^{-(n-l+1)\rho R_x-(n-k+1)\rho R_y}\Bigg[\sum_y\Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Bigg]^{k-l}\Bigg[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigg]^{(1+\rho)(n-k+1)} \qquad (F.4)$$
$$= 4\times 2^{-(k-l)\big[\rho R_x - \log\big(\sum_y\big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big]^{1+\rho}\big)\big]}\; 2^{-(n-k+1)\big[\rho(R_x+R_y) - (1+\rho)\log\big(\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)\big]}$$
$$= 4\times 2^{-(k-l)E_{x|y}(R_x,\rho) - (n-k+1)E_{xy}(R_x,R_y,\rho)} \qquad (F.5)$$
$$= 4\times 2^{-(n-l+1)\big[\frac{k-l}{n-l+1}E_{x|y}(R_x,\rho) + \frac{n-k+1}{n-l+1}E_{xy}(R_x,R_y,\rho)\big]} \qquad (F.6)$$
$$\le 4\times 2^{-(n-l+1)\sup_{\rho\in[0,1]}\big[\frac{k-l}{n-l+1}E_{x|y}(R_x,\rho) + \frac{n-k+1}{n-l+1}E_{xy}(R_x,R_y,\rho)\big]} \qquad (F.7)$$
$$= 4\times 2^{-(n-l+1)E^{ML}_x\left(R_x,R_y,\frac{k-l}{n-l+1}\right)} = 4\times 2^{-(n-l+1)E_x\left(R_x,R_y,\frac{k-l}{n-l+1}\right)} \qquad (F.8)$$
In (F.1) we explicitly indicate the three conditions that a suffix pair $(\tilde{x}_l^n, \tilde{y}_l^n)$ must satisfy to result in a decoding error. In (F.2) we sum out over the common prefixes $(x_1^{l-1}, y_1^{l-1})$ and use the fact that the random binning is done independently at each encoder, so we can use the inequalities in (4.6) and (4.7). We get (F.3) by limiting $\rho$ to the interval $0 \le \rho \le 1$, as in (2.25). Getting (F.4) from (F.3) follows by a number of basic manipulations; in (F.4) we get the single-letter expression by again using the memorylessness of the sources. In (F.5) we use the definitions of $E_{x|y}$ and $E_{xy}$ from (4.9) in Theorem 3. Noting that the bound holds for all $\rho \in [0,1]$, optimizing over $\rho$ results in (F.7). Finally, using the definition in (4.8) and the remark following Theorem 5 that the maximum-likelihood and universal exponents are equal gives (F.8). The bound on $p_n(l,k)$ when $l > k$ is developed in the same fashion. $\blacksquare$
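The final optimization over $\rho$ in (F.7) is easy to carry out numerically. The sketch below is an illustration under assumed inputs (the 2x2 joint pmf, the rates and the weight $\gamma$ are not from the thesis): it evaluates $E_{x|y}(R_x,\rho)$ and $E_{xy}(R_x,R_y,\rho)$ as read off from (F.4)-(F.5) and maximizes their $\gamma$-weighted combination over a grid of $\rho \in [0,1]$:

```python
import math

# Illustrative joint source pmf on a 2x2 alphabet (an assumed example).
pxy = [[0.4, 0.1],
       [0.2, 0.3]]

def E_x_given_y(Rx, rho):
    # rho*Rx - log2 sum_y [ sum_x p(x,y)^{1/(1+rho)} ]^{1+rho}
    s = sum(sum(pxy[x][y] ** (1.0 / (1 + rho)) for x in range(2)) ** (1 + rho)
            for y in range(2))
    return rho * Rx - math.log2(s)

def E_xy(Rx, Ry, rho):
    # rho*(Rx+Ry) - (1+rho) log2 sum_{x,y} p(x,y)^{1/(1+rho)}
    s = sum(pxy[x][y] ** (1.0 / (1 + rho)) for x in range(2) for y in range(2))
    return rho * (Rx + Ry) - (1 + rho) * math.log2(s)

def EML(Rx, Ry, gamma):
    grid = [k / 200.0 for k in range(201)]      # rho in [0, 1]
    return max(gamma * E_x_given_y(Rx, rho) + (1 - gamma) * E_xy(Rx, Ry, rho)
               for rho in grid)

e = EML(Rx=0.9, Ry=1.2, gamma=0.5)
assert e > 0.0                        # rates exceed the entropy thresholds here
assert EML(1.1, 1.2, 0.5) >= e        # the exponent is nondecreasing in Rx
```

At $\rho = 0$ both exponents vanish, so the supremum is always nonnegative; it is strictly positive exactly when the weighted rate condition of Theorem 3 holds, which the chosen rates satisfy for this example pmf.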
F.2 Universal decoding: Proof of Lemma 11
In this section, we prove Lemma 11. We use the method of types developed by Imre Csiszár; this method is elegantly explained in [29] and [28]. The proof here is similar to that for the point-to-point case in the proof of Proposition 4. However, we need the concept of V-shells defined in [29]. Essentially, a V-shell is the conditional type set of a $y$ sequence given an $x$ sequence, where $x$ and $y$ are from the alphabets $\mathcal{X}$ and $\mathcal{Y}$. The technique of V-shells was originally used in the proofs of channel coding theorems in [29]. Due to the duality between channel coding and source coding with decoder side-information, it is no surprise that it proves a very powerful tool in distributed source coding problems. That being said, here is the proof.
The error probability $p_n(l,k)$ can be bounded starting from (F.2), with the condition
$$(k-l)H(\tilde{x}_l^{k-1}|y_l^{k-1}) + (n-k+1)H(\tilde{x}_k^n, \tilde{y}_k^n) \le (k-l)H(x_l^{k-1}|y_l^{k-1}) + (n-k+1)H(x_k^n, y_k^n)$$
substituted for $p_{xy}(\tilde{x}_l^n, \tilde{y}_l^n) \ge p_{xy}(x_l^n, y_l^n)$. We get
$$p_n(l,k) \le \sum_{P^{n-k}, P^{k-l}}\ \sum_{V^{n-k}, V^{k-l}}\ \sum_{\substack{y_l^{k-1}\in T_{P^{k-l}},\\ y_k^n\in T_{P^{n-k}}}}\ \sum_{\substack{x_l^{k-1}\in T_{V^{k-l}}(y_l^{k-1}),\\ x_k^n\in T_{V^{n-k}}(y_k^n)}} \min\Bigg[1, \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}^{n-k}, P^{k-l}, \tilde{V}^{n-k}, \tilde{V}^{k-l})\, \le\, S(P^{n-k}, P^{k-l}, V^{n-k}, V^{k-l})}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}}\ \sum_{\tilde{x}_l^{k-1}\in T_{\tilde{V}^{k-l}}(y_l^{k-1})}\ \sum_{\tilde{x}_k^n\in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)} 4\times 2^{-(n-l+1)R_x-(n-k+1)R_y}\Bigg] p_{xy}(x_1^n, y_1^n) \qquad (F.9)$$
In (F.9) we enumerate all the source sequences in a way that allows us to focus on the types of the important subsequences. We enumerate the possibly misleading candidate sequences in terms of their suffix types. We restrict the sum to those pairs $(\tilde{x}_1^n, \tilde{y}_1^n)$ that could lead to mistaken decoding, defining the compact notation $S(P^{n-k}, P^{k-l}, V^{n-k}, V^{k-l}) \triangleq (k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})$, which is the weighted suffix entropy condition rewritten in terms of types.

Note that the summations within the minimization in (F.9) do not depend on the arguments within these sums. Thus, we can bound this sum separately to get a bound on the number of possibly misleading source pairs $(\tilde{x}_1^n, \tilde{y}_1^n)$.
$$\sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}^{n-k}, P^{k-l}, \tilde{V}^{n-k}, \tilde{V}^{k-l})\, \le\, S(P^{n-k}, P^{k-l}, V^{n-k}, V^{k-l})}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}}\ \sum_{\tilde{x}_l^{k-1}\in T_{\tilde{V}^{k-l}}(y_l^{k-1})}\ \sum_{\tilde{x}_k^n\in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)} 1$$
$$\le \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}, \tilde{V})\, \le\, S(P, V)}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}} |T_{\tilde{V}^{k-l}}(y_l^{k-1})|\,|T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)| \qquad (F.10)$$
$$\le \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}, \tilde{V})\, \le\, S(P, V)}} |T_{\tilde{P}^{n-k}}|\, 2^{(k-l)H(\tilde{V}^{k-l}|P^{k-l})}\, 2^{(n-k+1)H(\tilde{V}^{n-k}|\tilde{P}^{n-k})} \qquad (F.11)$$
$$\le \sum_{\substack{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}\text{ s.t.}\\ S(\tilde{P}, \tilde{V})\, \le\, S(P, V)}} 2^{(k-l)H(\tilde{V}^{k-l}|P^{k-l}) + (n-k+1)H(\tilde{P}^{n-k}\times\tilde{V}^{n-k})} \qquad (F.12)$$
$$\le \sum_{\tilde{V}^{n-k}, \tilde{V}^{k-l}, \tilde{P}^{n-k}} 2^{(k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})} \qquad (F.13)$$
$$\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\, 2^{(k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})} \qquad (F.14)$$
In (F.10) we sum over all $\tilde{x}_l^{k-1} \in T_{\tilde{V}^{k-l}}(y_l^{k-1})$ and all $\tilde{x}_k^n \in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)$. In (F.11) we use standard bounds, e.g., $|T_{\tilde{V}^{k-l}}(y_l^{k-1})| \le 2^{(k-l)H(\tilde{V}^{k-l}|P^{k-l})}$ for $y_l^{k-1} \in T_{P^{k-l}}$; we also sum over all $\tilde{y}_k^n \in T_{\tilde{P}^{n-k}}$ in (F.11). By the definition of the decoding rule, $(\tilde{x}_1^n, \tilde{y}_1^n)$ can only lead to a decoding error if $(k-l)H(\tilde{V}^{k-l}|P^{k-l}) + (n-k+1)H(\tilde{P}^{n-k}\times\tilde{V}^{n-k}) \le (k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})$; this gives (F.13). In (F.14) we apply the polynomial bound on the number of types.
We substitute (F.14) into (F.9) and pull out the polynomial term, giving
$$p_n(l,k) \le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}}\ \sum_{\substack{y_l^{k-1}\in T_{P^{k-l}},\\ y_k^n\in T_{P^{n-k}}}}\ \sum_{\substack{x_l^{k-1}\in T_{V^{k-l}}(y_l^{k-1}),\\ x_k^n\in T_{V^{n-k}}(y_k^n)}} \min\Big[1,\ 4\times 2^{-(k-l)[R_x - H(V^{k-l}|P^{k-l})] - (n-k+1)[R_x+R_y-H(V^{n-k}\times P^{n-k})]}\Big] p_{xy}(x_l^n, y_l^n)$$
$$\le 4\,(n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}} 2^{\max\big[0,\ -(k-l)[R_x-H(V^{k-l}|P^{k-l})] - (n-k+1)[R_x+R_y-H(V^{n-k}\times P^{n-k})]\big]}\times 2^{-(k-l)D(V^{k-l}\times P^{k-l}\|p_{xy}) - (n-k+1)D(V^{n-k}\times P^{n-k}\|p_{xy})} \qquad (F.15)$$
$$\le 4\,(n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}} 2^{-(n-l+1)\big[\lambda D(V^{k-l}\times P^{k-l}\|p_{xy}) + \bar\lambda D(V^{n-k}\times P^{n-k}\|p_{xy}) + \big|\lambda[R_x-H(V^{k-l}|P^{k-l})] + \bar\lambda[R_x+R_y-H(V^{n-k}\times P^{n-k})]\big|^+\big]} \qquad (F.16)$$
$$\le 4\,(n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\sum_{P^{n-k},P^{k-l}}\sum_{V^{n-k},V^{k-l}} 2^{-(n-l+1)\inf_{\tilde{x},\tilde{y},\bar{x},\bar{y}}\big[\lambda D(p_{\tilde{x},\tilde{y}}\|p_{xy}) + \bar\lambda D(p_{\bar{x},\bar{y}}\|p_{xy}) + \big|\lambda[R_x-H(\tilde{x}|\tilde{y})] + \bar\lambda[R_x+R_y-H(\bar{x},\bar{y})]\big|^+\big]} \qquad (F.17)$$
$$\le 4\,(n-l+2)^{4|\mathcal{X}||\mathcal{Y}|}\, 2^{-(n-l+1)E_x(R_x,R_y,\lambda)} \le K_1 2^{-(n-l+1)[E_x(R_x,R_y,\lambda)-\epsilon]} \qquad (F.18)$$
In (F.15) we use the memorylessness of the source and exponential bounds on the probability of observing $(x_l^{k-1}, y_l^{k-1})$ and $(x_k^n, y_k^n)$. In (F.16) we pull $(n-l+1)$ out of all terms, noticing that $\lambda = (k-l)/(n-l+1) \in [0,1]$ and $\bar\lambda \triangleq 1-\lambda = (n-k+1)/(n-l+1)$. In (F.17) we minimize the exponent over all choices of distributions $p_{\tilde{x},\tilde{y}}$ and $p_{\bar{x},\bar{y}}$. In (F.18) we use the definition of the universal random coding exponent
$$E_x(R_x,R_y,\lambda) \triangleq \inf_{\tilde{x},\tilde{y},\bar{x},\bar{y}}\ \lambda D(p_{\tilde{x},\tilde{y}}\|p_{xy}) + \bar\lambda D(p_{\bar{x},\bar{y}}\|p_{xy}) + \big|\lambda[R_x-H(\tilde{x}|\tilde{y})] + \bar\lambda[R_x+R_y-H(\bar{x},\bar{y})]\big|^+$$
where $0\le\lambda\le 1$ and $\bar\lambda = 1-\lambda$. We also incorporate the number of conditional and marginal types into the polynomial bound, as well as the sum over $k$, and then push the polynomial into the exponent, since for any polynomial $F$ and any $E, \epsilon > 0$ there exists $C \in (0,\infty)$ s.t. $F(\Delta)2^{-\Delta E} \le C 2^{-\Delta(E-\epsilon)}$. $\blacksquare$
Appendix G
Equivalence of ML and universal error exponents and tilted distributions
In this appendix, we give a detailed proof of Lemma 8. This section also contains some very useful analysis of the source coding error exponents and their geometry, and of tilted distributions and their properties. The key mathematical tool is convex optimization. The fundamental lemmas in Section G.3, which cannot easily be found in the literature, are used throughout this thesis. For notational simplicity, we change the logarithm from base 2 in the main body of the thesis to base e in this section.
Our goal in Theorem 5 is to show that the maximum likelihood (ML) error exponent equals the universal error exponent. It is sufficient to show that for all $\gamma$,
$$E^{ML}_x(R_x, R_y, \gamma) = E^{UN}_x(R_x, R_y, \gamma)$$
where the ML error exponent is
$$E^{ML}_x(R_x, R_y, \gamma) = \sup_{\rho\in[0,1]}\ \gamma E_{x|y}(R_x, \rho) + (1-\gamma)E_{xy}(R_x, R_y, \rho)$$
$$= \sup_{\rho\in[0,1]}\ \rho R(\gamma) - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) - (1-\gamma)(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big) = \sup_{\rho\in[0,1]} E^{ML}_x(R_x, R_y, \gamma, \rho)$$
We write the function inside the sup as $E^{ML}_x(R_x, R_y, \gamma, \rho)$. The universal error exponent is
$$E^{UN}_x(R_x, R_y, \gamma) = \inf_{q_{xy}, o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \big|\gamma(R_x - H(q_{x|y})) + (1-\gamma)(R_x + R_y - H(o_{xy}))\big|^+$$
$$= \inf_{q_{xy}, o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \big|R(\gamma) - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})\big|^+$$
Here we define $R(\gamma) = \gamma R_x + (1-\gamma)(R_x + R_y) > \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$. For notational simplicity, we write $q_{xy}$ and $o_{xy}$ for two arbitrary joint distributions on $\mathcal{X}\times\mathcal{Y}$ instead of $p_{\tilde{x}\tilde{y}}$ and $p_{\bar{x}\bar{y}}$. We still write $p_{xy}$ for the distribution of the source.
Before the proof, we define a pair of distributions that we need.

Definition 15 (Tilted distribution $p^\rho_{xy}$ of $p_{xy}$) For all $\rho \in [-1, \infty)$:
$$p^\rho_{xy}(x,y) = \frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}$$
The entropy of the tilted distribution is written as $H(p^\rho_{xy})$. Obviously $p^0_{xy} = p_{xy}$.

Definition 16 ($x-y$ tilted distribution $\tilde{p}^\rho_{xy}$ of $p_{xy}$) For all $\rho \in [-1, \infty)$:
$$\tilde{p}^\rho_{xy}(x,y) = \frac{\big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\big]^{1+\rho}}{\sum_t\big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big]^{1+\rho}}\times\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}} = \frac{A(y,\rho)}{B(\rho)}\times\frac{C(x,y,\rho)}{D(y,\rho)}$$
where
$$A(y,\rho) = \Big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho} = D(y,\rho)^{1+\rho}$$
$$B(\rho) = \sum_t\Big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big]^{1+\rho} = \sum_y A(y,\rho)$$
$$C(x,y,\rho) = p_{xy}(x,y)^{\frac{1}{1+\rho}}$$
$$D(y,\rho) = \sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}} = \sum_x C(x,y,\rho)$$
The marginal distribution of $y$ is $\frac{A(y,\rho)}{B(\rho)}$. Obviously $\tilde{p}^0_{xy} = p_{xy}$. We write the conditional distribution of $x$ given $y$ under $\tilde{p}^\rho_{xy}$ as $\tilde{p}^\rho_{x|y}$, where $\tilde{p}^\rho_{x|y}(x,y) = \frac{C(x,y,\rho)}{D(y,\rho)}$, and the conditional entropy of $x$ given $y$ under $\tilde{p}^\rho_{xy}$ as $H(\tilde{p}^\rho_{x|y})$. Obviously $H(\tilde{p}^0_{x|y}) = H(p_{x|y})$. The conditional entropy of $x$ given $y = y$ for the $x-y$ tilted distribution is
$$H(\tilde{p}^\rho_{x|y=y}) = -\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)$$
We introduce $A(y,\rho)$, $B(\rho)$, $C(x,y,\rho)$ and $D(y,\rho)$ to simplify the notation. Some of their properties are shown in Lemma 25.
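Both tilted families are straightforward to compute. The sketch below is a numerical illustration, not part of the thesis: the 2x2 joint pmf is an assumed example. It builds $p^\rho_{xy}$ from Definition 15 and the $x-y$ tilted distribution from Definition 16 via $A, B, C, D$, and checks that $\rho = 0$ recovers $p_{xy}$ and that the tilted entropy grows with $\rho$ (cf. Lemma 22):

```python
import math

# A small assumed joint pmf p_xy on a 2x2 alphabet for illustration.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
X, Y = (0, 1), (0, 1)

def tilted(rho):
    """Standard tilted distribution p^rho_xy (Definition 15)."""
    w = {xy: p ** (1.0 / (1 + rho)) for xy, p in pxy.items()}
    z = sum(w.values())
    return {xy: v / z for xy, v in w.items()}

def xy_tilted(rho):
    """x-y tilted distribution (Definition 16), built from A, B, C, D."""
    C = {(x, y): pxy[(x, y)] ** (1.0 / (1 + rho)) for x in X for y in Y}
    D = {y: sum(C[(x, y)] for x in X) for y in Y}
    A = {y: D[y] ** (1 + rho) for y in Y}
    B = sum(A.values())
    return {(x, y): (A[y] / B) * (C[(x, y)] / D[y]) for x in X for y in Y}

def H(p):
    """Entropy in nats, matching this appendix's base-e convention."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# rho = 0 recovers p_xy for both tilts
for xy in pxy:
    assert abs(tilted(0.0)[xy] - pxy[xy]) < 1e-12
    assert abs(xy_tilted(0.0)[xy] - pxy[xy]) < 1e-12
# Both are genuine distributions, and H(p^rho) grows with rho
assert abs(sum(xy_tilted(0.7).values()) - 1.0) < 1e-12
hs = [H(tilted(r)) for r in (0.0, 0.5, 1.0, 2.0)]
assert all(a <= b + 1e-12 for a, b in zip(hs, hs[1:]))
```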
While tilted distributions commonly arise as optimal distributions in large deviations theory, it is useful to explain why we introduce these two particular tilted distributions. In the proof of Lemma 8, through a Lagrange multiplier argument, we will show that $\{p^\rho_{xy} : \rho \in [-1,\infty)\}$ is the family of distributions that minimize the Kullback-Leibler divergence to $p_{xy}$ subject to a fixed entropy, and that $\{\tilde{p}^\rho_{xy} : \rho \in [-1,\infty)\}$ is the family of distributions that minimize the Kullback-Leibler divergence to $p_{xy}$ subject to a fixed conditional entropy. Using this Lagrange multiplier argument, we parametrize the universal error exponent $E^{UN}_x(R_x, R_y, \gamma)$ in terms of $\rho$ and show the equivalence of the universal and maximum likelihood error exponents.
Now we are ready to prove Lemma 8: $E^{ML}_x(R_x, R_y, \gamma) = E^{UN}_x(R_x, R_y, \gamma)$.
Proof:
G.1 Case 1: $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) < R(\gamma) < \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$

First, from Lemma 31 and Lemma 32:
$$\frac{\partial E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho} = R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})$$
Then, using Lemma 22 and Lemma 26, we have:
$$\frac{\partial^2 E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho^2} \le 0$$
So $\rho$ maximizes $E^{ML}_x(R_x,R_y,\gamma,\rho)$ if and only if:
$$0 = \frac{\partial E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho} = R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}) \qquad (G.1)$$
Because $R(\gamma)$ lies in the interval $[\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}),\ \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})]$ and the entropy functions are monotonically increasing in $\rho$, we can find $\rho^* \in (0,1)$ s.t.
$$\gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) = R(\gamma)$$
Using Lemma 29 and Lemma 30 we get:
$$E^{ML}_x(R_x,R_y,\gamma) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \qquad (G.2)$$
where $\gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) = R(\gamma)$; $\rho^*$ is generally unique because both $H(\tilde{p}^\rho_{x|y})$ and $H(p^\rho_{xy})$ are strictly increasing in $\rho$.
Secondly,
$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{q_{xy},o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma) - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})|^+$$
$$= \inf_b\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \inf_{b \ge \gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+ \qquad (G.3)$$
The last equality is true because, for $b < \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) < R(\gamma)$:
$$\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+ \ \ge\ 0 + R(\gamma) - b \ >\ R(\gamma) - \gamma H(p_{x|y}) - (1-\gamma)H(p_{xy})$$
while the value $R(\gamma) - \gamma H(p_{x|y}) - (1-\gamma)H(p_{xy})$ is attained with zero divergence terms at $b = \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$ by choosing $q_{xy} = o_{xy} = p_{xy}$. Hence no $b$ below $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$ can achieve the outer infimum.
Fixing $b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$, the inner infimum in (G.3) is an optimization problem over $q_{xy}, o_{xy}$ with equality constraints $\sum_x\sum_y q_{xy}(x,y) = 1$, $\sum_x\sum_y o_{xy}(x,y) = 1$ and $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) = b$, and the obvious inequality constraints $0 \le q_{xy}(x,y) \le 1$, $0 \le o_{xy}(x,y) \le 1$ for all $x,y$. In the following formulation, we relax one equality constraint to an inequality constraint $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) \ge b$ to make the optimization problem convex. It turns out later that the optimal solution to the relaxed problem is also the optimal solution to the original problem because $b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$. The resulting optimization problem is:
$$\inf_{q_{xy}, o_{xy}}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy})$$
$$\text{s.t.}\quad \sum_x\sum_y q_{xy}(x,y) = 1, \qquad \sum_x\sum_y o_{xy}(x,y) = 1,$$
$$b - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy}) \le 0,$$
$$0 \le q_{xy}(x,y) \le 1\ \ \forall (x,y)\in\mathcal{X}\times\mathcal{Y}, \qquad 0 \le o_{xy}(x,y) \le 1\ \ \forall (x,y)\in\mathcal{X}\times\mathcal{Y} \qquad (G.4)$$
The above optimization problem is convex because the objective function and the inequality constraint functions are convex and the equality constraint functions are affine [10]. The Lagrangian for this convex optimization problem is:
$$L(q_{xy}, o_{xy}, \rho, \mu_1, \mu_2, \nu_1, \nu_2, \nu_3, \nu_4) = \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \mu_1\Big(\sum_x\sum_y q_{xy}(x,y) - 1\Big) + \mu_2\Big(\sum_x\sum_y o_{xy}(x,y) - 1\Big)$$
$$+\ \rho\big(b - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})\big) + \sum_x\sum_y\big[\nu_1(x,y)(-q_{xy}(x,y)) + \nu_2(x,y)(1 - q_{xy}(x,y)) + \nu_3(x,y)(-o_{xy}(x,y)) + \nu_4(x,y)(1 - o_{xy}(x,y))\big]$$
where $\rho, \mu_1, \mu_2$ are real numbers and $\nu_i \in \mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$, $i = 1,2,3,4$.

According to the KKT conditions for convex optimization [10], $(q_{xy}, o_{xy})$ minimizes the convex optimization problem in (G.4) if and only if the following conditions are simultaneously satisfied for some $\mu_1, \mu_2, \nu_1, \nu_2, \nu_3, \nu_4$ and $\rho$:
$$0 = \frac{\partial L(\cdot)}{\partial q_{xy}(x,y)} = \gamma\Big[-\log(p_{xy}(x,y)) + (1+\rho)\big(1 + \log(q_{xy}(x,y))\big) - \rho\log\Big(\sum_s q_{xy}(s,y)\Big)\Big] + \mu_1 - \nu_1(x,y) - \nu_2(x,y) \qquad (G.5)$$
$$0 = \frac{\partial L(\cdot)}{\partial o_{xy}(x,y)} = (1-\gamma)\Big[-\log(p_{xy}(x,y)) + (1+\rho)\big(1 + \log(o_{xy}(x,y))\big)\Big] + \mu_2 - \nu_3(x,y) - \nu_4(x,y) \qquad (G.6)$$
for all $x, y$, and
$$\sum_x\sum_y q_{xy}(x,y) = 1, \qquad \sum_x\sum_y o_{xy}(x,y) = 1,$$
$$\rho\big(\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) - b\big) = 0, \qquad \rho \ge 0,$$
$$\nu_1(x,y)(-q_{xy}(x,y)) = 0,\quad \nu_2(x,y)(1-q_{xy}(x,y)) = 0 \quad \forall x,y,$$
$$\nu_3(x,y)(-o_{xy}(x,y)) = 0,\quad \nu_4(x,y)(1-o_{xy}(x,y)) = 0 \quad \forall x,y,$$
$$\nu_i(x,y) \ge 0, \quad \forall x,y,\ i = 1,2,3,4 \qquad (G.7)$$
Solving the standard Lagrange multiplier equations (G.5), (G.6) and (G.7), we have:
$$q_{xy}(x,y) = \frac{\big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho_b}}\big]^{1+\rho_b}}{\sum_t\big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho_b}}\big]^{1+\rho_b}}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho_b}}}{\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho_b}}} = \tilde{p}^{\rho_b}_{xy}(x,y)$$
$$o_{xy}(x,y) = \frac{p_{xy}(x,y)^{\frac{1}{1+\rho_b}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho_b}}} = p^{\rho_b}_{xy}(x,y)$$
$$\nu_i(x,y) = 0 \quad \forall x,y,\ i = 1,2,3,4, \qquad \rho = \rho_b \qquad (G.8)$$
where $\rho_b$ satisfies
$$\gamma H(\tilde{p}^{\rho_b}_{x|y}) + (1-\gamma)H(p^{\rho_b}_{xy}) = b \ \ge\ \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$$
and thus $\rho_b \ge 0$, because both $H(\tilde{p}^\rho_{x|y})$ and $H(p^\rho_{xy})$ are monotonically increasing in $\rho$, as shown in Lemma 22 and Lemma 26.

Notice that all the KKT conditions are simultaneously satisfied, with the inequality constraint $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) \ge b$ met with equality. Thus the relaxed optimization problem has the same optimal solution as the original problem, as promised. The optimal $q_{xy}$ and $o_{xy}$ are the $x-y$ tilted distribution $\tilde{p}^{\rho_b}_{xy}$ and the standard tilted distribution $p^{\rho_b}_{xy}$ of $p_{xy}$, with the same parameter $\rho_b \ge 0$ chosen s.t.
$$\gamma H(\tilde{p}^{\rho_b}_{x|y}) + (1-\gamma)H(p^{\rho_b}_{xy}) = b$$
Now we have:
$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \gamma D(\tilde{p}^{\rho_b}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho_b}_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \min\Big[\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}),$$
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\Big] \qquad (G.9)$$
Notice that $H(p^\rho_{xy})$, $H(\tilde{p}^\rho_{x|y})$, $D(p^\rho_{xy}\|p_{xy})$ and $D(\tilde{p}^\rho_{xy}\|p_{xy})$ are all strictly increasing in $\rho > 0$, as shown in Lemma 22, Lemma 23, Lemma 26 and Lemma 27 later in this appendix. We have:
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \qquad (G.10)$$
where $R(\gamma) = \gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy})$. Applying the results in Lemma 28 and Lemma 24, we get:
$$\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})$$
$$= \Big[\gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\Big]\Big|_{\rho=\rho^*} = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \qquad (G.11)$$
This is true because for $\rho$ with $R(\gamma) \ge \gamma H(\tilde{p}^\rho_{x|y}) + (1-\gamma)H(p^\rho_{xy})$, we know $\rho \le 1$ because of the range of $R(\gamma)$: $R(\gamma) < \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$. Substituting (G.10) and (G.11) into (G.9), we get
$$E^{UN}_x(R_x,R_y,\gamma) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}), \quad\text{where } R(\gamma) = \gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) \qquad (G.12)$$
So for $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) \le R(\gamma) \le \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$, from (G.2) we have the desired property:
$$E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma)$$
G.2 Case 2: $R(\gamma) \ge \gamma H(\tilde{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$

In this case, for all $0 \le \rho \le 1$:
$$\frac{\partial E^{ML}_x(R_x,R_y,\gamma,\rho)}{\partial\rho} = R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}) \ \ge\ R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy}) \ \ge\ 0$$
So $\rho$ takes the value $1$ to maximize $E^{ML}_x(R_x,R_y,\gamma,\rho)$, thus
$$E^{ML}_x(R_x,R_y,\gamma) = R(\gamma) - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big)^2\Big) - 2(1-\gamma)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big) \qquad (G.13)$$
Using the same convex optimization technique as in case G.1, we note that $\rho^* \ge 1$ for $R(\gamma) = \gamma H(\tilde{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy})$. Then, applying Lemma 28 and Lemma 24, we have:
$$\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})$$
$$= \gamma D(\tilde{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy})$$
and
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})$$
$$= \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) = \gamma D(\tilde{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^{\rho^*}_{x|y}) - (1-\gamma)H(p^{\rho^*}_{xy})$$
$$\ge \gamma D(\tilde{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy})$$
where the last inequality holds because $\rho^* \ge 1$ and, by Lemmas 24 and 28, the expression is nondecreasing in $\rho$ for $\rho \ge 1$. Finally:
$$E^{UN}_x(R_x,R_y,\gamma) = \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\ \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \inf_{b\ge\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\ \gamma D(\tilde{p}^{\rho_b}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho_b}_{xy}\|p_{xy}) + |R(\gamma)-b|^+$$
$$= \min\Big[\inf_{\rho\ge0:\ R(\gamma)\ge\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}),$$
$$\inf_{\rho\ge0:\ R(\gamma)\le\gamma H(\tilde{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\ \gamma D(\tilde{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\Big]$$
$$= \gamma D(\tilde{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R(\gamma) - \gamma H(\tilde{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy})$$
$$= R(\gamma) - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big)^2\Big) - 2(1-\gamma)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big) \qquad (G.14)$$
The last equality follows by setting $\rho = 1$ in Lemma 29 and Lemma 30. Again $E^{ML}_x(R_x,R_y,\gamma) = E^{UN}_x(R_x,R_y,\gamma)$, and this finishes the proof. $\blacksquare$
G.3 Technical lemmas on tilted distributions

We now prove the technical lemmas used in the above proof of Lemma 8.

Lemma 22 $\dfrac{\partial H(p^\rho_{xy})}{\partial\rho} \ge 0$
Proof: From the definition of the tilted distribution we have the following observation:
$$\log(p^\rho_{xy}(x_1,y_1)) - \log(p^\rho_{xy}(x_2,y_2)) = \log\big(p_{xy}(x_1,y_1)^{\frac{1}{1+\rho}}\big) - \log\big(p_{xy}(x_2,y_2)^{\frac{1}{1+\rho}}\big)$$
Using this equality, we first derive the derivative of the tilted distribution. For all $x,y$:
$$\frac{\partial p^\rho_{xy}(x,y)}{\partial\rho} = \frac{-1}{(1+\rho)^2}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}\log(p_{xy}(x,y))\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)}{\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)^2} + \frac{1}{(1+\rho)^2}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\log(p_{xy}(s,t))\big)}{\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)^2}$$
$$= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\Big[\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big) - \sum_t\sum_s p^\rho_{xy}(s,t)\log\big(p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)\Big]$$
$$= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\Big[\log(p^\rho_{xy}(x,y)) - \sum_t\sum_s p^\rho_{xy}(s,t)\log(p^\rho_{xy}(s,t))\Big] = -\frac{p^\rho_{xy}(x,y)}{1+\rho}\big[\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big] \qquad (G.15)$$
Then:
$$\frac{\partial H(p^\rho_{xy})}{\partial\rho} = -\frac{\partial\sum_{x,y} p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))}{\partial\rho} = -\sum_{x,y}\big(1+\log(p^\rho_{xy}(x,y))\big)\frac{\partial p^\rho_{xy}(x,y)}{\partial\rho}$$
$$= \sum_{x,y}\big(1+\log(p^\rho_{xy}(x,y))\big)\frac{p^\rho_{xy}(x,y)}{1+\rho}\big(\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big)$$
$$= \frac{1}{1+\rho}\sum_{x,y} p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))\big(\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big)$$
$$= \frac{1}{1+\rho}\Big[\sum_{x,y}p^\rho_{xy}(x,y)\big(\log(p^\rho_{xy}(x,y))\big)^2 - H(p^\rho_{xy})^2\Big]$$
$$= \frac{1}{1+\rho}\Big[\sum_{x,y}p^\rho_{xy}(x,y)\big(\log(p^\rho_{xy}(x,y))\big)^2\sum_{x,y}p^\rho_{xy}(x,y) - H(p^\rho_{xy})^2\Big]$$
$$\stackrel{(a)}{\ge} \frac{1}{1+\rho}\Big[\Big(\sum_{x,y}p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))\Big)^2 - H(p^\rho_{xy})^2\Big] = 0 \qquad (G.16)$$
where (a) is true by the Cauchy-Schwarz inequality. $\blacksquare$
Lemma 23 $\dfrac{\partial D(p^\rho_{xy}\|p_{xy})}{\partial\rho} = \rho\dfrac{\partial H(p^\rho_{xy})}{\partial\rho}$

Proof: As shown in Lemma 29 and Lemma 31 respectively:
$$D(p^\rho_{xy}\|p_{xy}) = \rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{x,y}p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)$$
$$H(p^\rho_{xy}) = \frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho}$$
We have:
$$\frac{\partial D(p^\rho_{xy}\|p_{xy})}{\partial\rho} = H(p^\rho_{xy}) + \rho\frac{\partial H(p^\rho_{xy})}{\partial\rho} - \frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho} = \rho\frac{\partial H(p^\rho_{xy})}{\partial\rho} \qquad (G.17)$$
$\blacksquare$
Lemma 24 $\mathrm{sign}\Big(\dfrac{\partial[D(p^\rho_{xy}\|p_{xy}) - H(p^\rho_{xy})]}{\partial\rho}\Big) = \mathrm{sign}(\rho-1)$

Proof: Combining the results of the previous two lemmas, we have:
$$\frac{\partial\big[D(p^\rho_{xy}\|p_{xy}) - H(p^\rho_{xy})\big]}{\partial\rho} = (\rho-1)\frac{\partial H(p^\rho_{xy})}{\partial\rho}$$
and $\frac{\partial H(p^\rho_{xy})}{\partial\rho} \ge 0$ by Lemma 22, which gives the conclusion. $\blacksquare$
Lemma 25 (Properties of $\frac{\partial A(y,\rho)}{\partial\rho}$, $\frac{\partial B(\rho)}{\partial\rho}$, $\frac{\partial C(x,y,\rho)}{\partial\rho}$, $\frac{\partial D(y,\rho)}{\partial\rho}$ and $\frac{\partial H(\tilde{p}^\rho_{x|y=y})}{\partial\rho}$)

First,
$$\frac{\partial C(x,y,\rho)}{\partial\rho} = \frac{\partial p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\partial\rho} = -\frac{1}{1+\rho}\,p_{xy}(x,y)^{\frac{1}{1+\rho}}\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big) = -\frac{C(x,y,\rho)}{1+\rho}\log(C(x,y,\rho))$$
$$\frac{\partial D(y,\rho)}{\partial\rho} = \frac{\partial\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}}{\partial\rho} = -\frac{1}{1+\rho}\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\log\big(p_{xy}(s,y)^{\frac{1}{1+\rho}}\big) = -\frac{\sum_x C(x,y,\rho)\log(C(x,y,\rho))}{1+\rho} \qquad (G.18)$$
For a differentiable positive function $f(\rho)$,
$$\frac{\partial f(\rho)^{1+\rho}}{\partial\rho} = f(\rho)^{1+\rho}\log(f(\rho)) + (1+\rho)f(\rho)^{\rho}\frac{\partial f(\rho)}{\partial\rho}$$
So
$$\frac{\partial A(y,\rho)}{\partial\rho} = \frac{\partial D(y,\rho)^{1+\rho}}{\partial\rho} = D(y,\rho)^{1+\rho}\log(D(y,\rho)) + (1+\rho)D(y,\rho)^{\rho}\frac{\partial D(y,\rho)}{\partial\rho}$$
$$= D(y,\rho)^{1+\rho}\Big(\log(D(y,\rho)) - \sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log(C(x,y,\rho))\Big) = D(y,\rho)^{1+\rho}\Big(-\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big) = A(y,\rho)\,H(\tilde{p}^\rho_{x|y=y})$$
$$\frac{\partial B(\rho)}{\partial\rho} = \sum_y\frac{\partial A(y,\rho)}{\partial\rho} = \sum_y A(y,\rho)H(\tilde{p}^\rho_{x|y=y}) = B(\rho)\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y}) = B(\rho)\,H(\tilde{p}^\rho_{x|y})$$
And last:
$$\frac{\partial H(\tilde{p}^\rho_{x|y=y})}{\partial\rho} = -\sum_x\Bigg[\frac{\frac{\partial C(x,y,\rho)}{\partial\rho}}{D(y,\rho)} - \frac{C(x,y,\rho)\frac{\partial D(y,\rho)}{\partial\rho}}{D(y,\rho)^2}\Bigg]\Big[1+\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big]$$
$$= -\sum_x\Bigg[\frac{-\frac{C(x,y,\rho)}{1+\rho}\log(C(x,y,\rho))}{D(y,\rho)} + \frac{C(x,y,\rho)\sum_s\frac{C(s,y,\rho)\log(C(s,y,\rho))}{1+\rho}}{D(y,\rho)^2}\Bigg]\Big[1+\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big]$$
$$= \frac{1}{1+\rho}\sum_x\Big[\tilde{p}^\rho_{x|y}(x,y)\log(C(x,y,\rho)) - \tilde{p}^\rho_{x|y}(x,y)\sum_s\tilde{p}^\rho_{x|y}(s,y)\log(C(s,y,\rho))\Big]\big[1+\log(\tilde{p}^\rho_{x|y}(x,y))\big]$$
$$= \frac{1}{1+\rho}\sum_x\tilde{p}^\rho_{x|y}(x,y)\Big[\log(\tilde{p}^\rho_{x|y}(x,y)) - \sum_s\tilde{p}^\rho_{x|y}(s,y)\log(\tilde{p}^\rho_{x|y}(s,y))\Big]\big[1+\log(\tilde{p}^\rho_{x|y}(x,y))\big]$$
$$= \frac{1}{1+\rho}\sum_x\tilde{p}^\rho_{x|y}(x,y)\log(\tilde{p}^\rho_{x|y}(x,y))\Big[\log(\tilde{p}^\rho_{x|y}(x,y)) - \sum_s\tilde{p}^\rho_{x|y}(s,y)\log(\tilde{p}^\rho_{x|y}(s,y))\Big]$$
$$= \frac{1}{1+\rho}\sum_x\tilde{p}^\rho_{x|y}(x,y)\big(\log(\tilde{p}^\rho_{x|y}(x,y))\big)^2 - \frac{1}{1+\rho}\Big[\sum_x\tilde{p}^\rho_{x|y}(x,y)\log(\tilde{p}^\rho_{x|y}(x,y))\Big]^2 \ \ge\ 0 \qquad (G.19)$$
The inequality is true by the Cauchy-Schwarz inequality, noticing that $\sum_x\tilde{p}^\rho_{x|y}(x,y) = 1$. $\blacksquare$
These properties will be used again in the proofs of the following lemmas.
Lemma 26 $\dfrac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} \ge 0$

Proof:
$$\frac{\partial\frac{A(y,\rho)}{B(\rho)}}{\partial\rho} = \frac{1}{B(\rho)^2}\Big(\frac{\partial A(y,\rho)}{\partial\rho}B(\rho) - \frac{\partial B(\rho)}{\partial\rho}A(y,\rho)\Big) = \frac{1}{B(\rho)^2}\Big(A(y,\rho)H(\tilde{p}^\rho_{x|y=y})B(\rho) - H(\tilde{p}^\rho_{x|y})B(\rho)A(y,\rho)\Big) = \frac{A(y,\rho)}{B(\rho)}\big(H(\tilde{p}^\rho_{x|y=y}) - H(\tilde{p}^\rho_{x|y})\big)$$
Now,
$$\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} = \frac{\partial}{\partial\rho}\sum_y\frac{A(y,\rho)}{B(\rho)}\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\Big[-\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big] = \frac{\partial}{\partial\rho}\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})$$
$$= \sum_y\frac{A(y,\rho)}{B(\rho)}\frac{\partial H(\tilde{p}^\rho_{x|y=y})}{\partial\rho} + \sum_y\frac{\partial\frac{A(y,\rho)}{B(\rho)}}{\partial\rho}H(\tilde{p}^\rho_{x|y=y}) \ \ge\ \sum_y\frac{\partial\frac{A(y,\rho)}{B(\rho)}}{\partial\rho}H(\tilde{p}^\rho_{x|y=y})$$
$$= \sum_y\frac{A(y,\rho)}{B(\rho)}\big(H(\tilde{p}^\rho_{x|y=y}) - H(\tilde{p}^\rho_{x|y})\big)H(\tilde{p}^\rho_{x|y=y}) = \sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})^2 - H(\tilde{p}^\rho_{x|y})^2$$
$$= \Big(\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})^2\Big)\Big(\sum_y\frac{A(y,\rho)}{B(\rho)}\Big) - H(\tilde{p}^\rho_{x|y})^2 \ \stackrel{(a)}{\ge}\ \Big(\sum_y\frac{A(y,\rho)}{B(\rho)}H(\tilde{p}^\rho_{x|y=y})\Big)^2 - H(\tilde{p}^\rho_{x|y})^2 = 0 \qquad (G.20)$$
where (a) is again true by the Cauchy-Schwarz inequality. $\blacksquare$
Lemma 27 $\dfrac{\partial D(\tilde{p}^\rho_{xy}\|p_{xy})}{\partial\rho} = \rho\dfrac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho}$

Proof: As shown in Lemma 30 and Lemma 32 respectively:
$$D(\tilde{p}^\rho_{xy}\|p_{xy}) = \rho H(\tilde{p}^\rho_{x|y}) - \log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big)$$
$$H(\tilde{p}^\rho_{x|y}) = \frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho}$$
We have:
$$\frac{\partial D(\tilde{p}^\rho_{xy}\|p_{xy})}{\partial\rho} = H(\tilde{p}^\rho_{x|y}) + \rho\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} - \frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho} = \rho\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho} \qquad (G.21)$$
$\blacksquare$
Lemma 28 $\mathrm{sign}\Big(\dfrac{\partial[D(\tilde{p}^\rho_{xy}\|p_{xy}) - H(\tilde{p}^\rho_{x|y})]}{\partial\rho}\Big) = \mathrm{sign}(\rho-1)$

Proof: Using the previous lemma, we get:
$$\frac{\partial\big[D(\tilde{p}^\rho_{xy}\|p_{xy}) - H(\tilde{p}^\rho_{x|y})\big]}{\partial\rho} = (\rho-1)\frac{\partial H(\tilde{p}^\rho_{x|y})}{\partial\rho}$$
Then by Lemma 26, we get the conclusion. $\blacksquare$
Lemma 29
$$\rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big) = D(p^\rho_{xy}\|p_{xy})$$
Proof: Notice that $\log(p_{xy}(x,y)) = (1+\rho)\big[\log(p^\rho_{xy}(x,y)) + \log\big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)\big]$. We have:
$$D(p^\rho_{xy}\|p_{xy}) = -H(p^\rho_{xy}) - \sum_{x,y}p^\rho_{xy}(x,y)\log(p_{xy}(x,y))$$
$$= -H(p^\rho_{xy}) - \sum_{x,y}p^\rho_{xy}(x,y)(1+\rho)\Big[\log(p^\rho_{xy}(x,y)) + \log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big)\Big]$$
$$= -H(p^\rho_{xy}) + (1+\rho)H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) = \rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) \qquad (G.22)$$
$\blacksquare$
Lemma 30
$$\rho H(\tilde{p}^\rho_{x|y}) - \log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) = D(\tilde{p}^\rho_{xy}\|p_{xy})$$
Proof:
$$D(\tilde{p}^\rho_{xy}\|p_{xy}) = \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\log\Bigg(\frac{\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}}{p_{xy}(x,y)}\Bigg)$$
$$= \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\Big[\log\Big(\frac{A(y,\rho)}{B(\rho)}\Big) + \log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big) - \log(p_{xy}(x,y))\Big]$$
$$= -\log(B(\rho)) - H(\tilde{p}^\rho_{x|y}) + \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\big[\log(D(y,\rho)^{1+\rho}) - \log(C(x,y,\rho)^{1+\rho})\big]$$
$$= -\log(B(\rho)) - H(\tilde{p}^\rho_{x|y}) + (1+\rho)H(\tilde{p}^\rho_{x|y}) = -\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) + \rho H(\tilde{p}^\rho_{x|y})$$
where we used $\log(A(y,\rho)/B(\rho)) = \log(D(y,\rho)^{1+\rho}) - \log(B(\rho))$ and $\log(p_{xy}(x,y)) = \log(C(x,y,\rho)^{1+\rho})$. $\blacksquare$
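The closed-form identities of Lemma 29 and Lemma 30 are easy to confirm numerically. The check below uses an assumed 2x2 example pmf (not from the thesis) and natural logarithms, matching this appendix's convention, and evaluates both sides for several values of $\rho$:

```python
import math

# Numeric check of the two identities on an assumed 2x2 example pmf.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
X, Y = (0, 1), (0, 1)

def KL(q, p):
    return sum(q[k] * math.log(q[k] / p[k]) for k in q if q[k] > 0)

def H(q):
    return -sum(v * math.log(v) for v in q.values() if v > 0)

for rho in (0.3, 1.0, 2.5):
    # Lemma 29: standard tilted distribution
    w = {k: p ** (1.0 / (1 + rho)) for k, p in pxy.items()}
    z = sum(w.values())
    p_rho = {k: v / z for k, v in w.items()}
    lhs = rho * H(p_rho) - (1 + rho) * math.log(z)
    assert abs(lhs - KL(p_rho, pxy)) < 1e-9

    # Lemma 30: x-y tilted distribution, via A, B, C, D
    C = {(x, y): pxy[(x, y)] ** (1.0 / (1 + rho)) for x in X for y in Y}
    D = {y: sum(C[(x, y)] for x in X) for y in Y}
    A = {y: D[y] ** (1 + rho) for y in Y}
    B = sum(A.values())
    pt = {(x, y): (A[y] / B) * (C[(x, y)] / D[y]) for x in X for y in Y}
    H_cond = sum((A[y] / B) *
                 -sum((C[(x, y)] / D[y]) * math.log(C[(x, y)] / D[y]) for x in X)
                 for y in Y)
    lhs2 = rho * H_cond - math.log(B)
    assert abs(lhs2 - KL(pt, pxy)) < 1e-9
```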
Lemma 31
$$H(p^\rho_{xy}) = \frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho}$$
Proof:
$$\frac{\partial\Big[(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big]}{\partial\rho} = \log\Big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) - \sum_y\sum_x\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)$$
$$= -\sum_y\sum_x\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\log\Bigg(\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\Bigg) = H(p^\rho_{xy}) \qquad (G.23)$$
$\blacksquare$
Lemma 32
$$H(\tilde{p}^\rho_{x|y}) = \frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho}$$
Proof: Notice that $B(\rho) = \sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}$ and $\frac{\partial B(\rho)}{\partial\rho} = B(\rho)H(\tilde{p}^\rho_{x|y})$, as shown in Lemma 25. It is clear that:
$$\frac{\partial\log\Big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\Big)}{\partial\rho} = \frac{\partial\log(B(\rho))}{\partial\rho} = \frac{1}{B(\rho)}\frac{\partial B(\rho)}{\partial\rho} = H(\tilde{p}^\rho_{x|y}) \qquad (G.24)$$
$\blacksquare$
Appendix H
Bounding individual error events for source coding with side-information
In this appendix, we give the proofs of Lemma 13 and Lemma 14. The proofs here are very similar to those for the point-to-point source coding problem in Propositions 3 and 4.
H.1 ML decoding: Proof of Lemma 13
In this section, we prove Lemma 13. The argument resembles the proof of Lemma 9 in
Appendix F.1.
$$p_n(l) = \sum_{x_1^n, y_1^n}\Pr\big[\exists\,\tilde{x}_1^n\in B_x(x_1^n)\cap F_n(l, x_1^n) \text{ s.t. } p_{xy}(\tilde{x}_1^n, y_1^n) \ge p_{xy}(x_1^n, y_1^n)\big]\, p_{xy}(x_1^n, y_1^n)$$
$$\le \sum_{x_1^n,y_1^n}\min\Bigg[1, \sum_{\substack{\tilde{x}_1^n\in F_n(l,x_1^n)\text{ s.t.}\\ p_{xy}(\tilde{x}_1^n, y_1^n) \ge p_{xy}(x_1^n, y_1^n)}}\Pr[\tilde{x}_1^n\in B_x(x_1^n)]\Bigg] p_{xy}(x_1^n, y_1^n) \qquad (H.1)$$
$$\le \sum_{y_1^n}\sum_{x_1^{l-1}, x_l^n}\min\Bigg[1, \sum_{\substack{\tilde{x}_l^n\text{ s.t.}\\ p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)}} 2\times 2^{-(n-l+1)R}\Bigg] p_{xy}(x_1^{l-1}, y_1^{l-1})\, p_{xy}(x_l^n, y_l^n)$$
$$\le 2\times\sum_{y_l^n}\sum_{x_l^n}\min\Bigg[1, \sum_{\substack{\tilde{x}_l^n\text{ s.t.}\\ p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)}} 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$= 2\times\sum_{y_l^n}\sum_{x_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n} 1\big[p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)\big]\, 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 2\times\sum_{y_l^n}\sum_{x_l^n}\min\Bigg[1, \sum_{\tilde{x}_l^n}\min\Big[1, \frac{p_{xy}(\tilde{x}_l^n, y_l^n)}{p_{xy}(x_l^n, y_l^n)}\Big] 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n)$$
$$\le 2\times\sum_{y_l^n}\sum_{x_l^n}\Bigg[\sum_{\tilde{x}_l^n}\Big[\frac{p_{xy}(\tilde{x}_l^n, y_l^n)}{p_{xy}(x_l^n, y_l^n)}\Big]^{\frac{1}{1+\rho}}\, 2^{-(n-l+1)R}\Bigg]^\rho p_{xy}(x_l^n, y_l^n)$$
$$= 2\times\sum_{y_l^n}\sum_{x_l^n} p_{xy}(x_l^n, y_l^n)^{\frac{1}{1+\rho}}\Bigg[\sum_{\tilde{x}_l^n} p_{xy}(\tilde{x}_l^n, y_l^n)^{\frac{1}{1+\rho}}\Bigg]^\rho\, 2^{-(n-l+1)\rho R}$$
$$= 2\times\Bigg[\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^\rho\Bigg]^{n-l+1} 2^{-(n-l+1)\rho R}$$
$$= 2\times\Bigg[\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Bigg]^{n-l+1} 2^{-(n-l+1)\rho R} = 2\times 2^{-(n-l+1)\Big(\rho R - \log\big[\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\big]\Big)} \qquad (H.2)$$
for all $\rho\in[0,1]$. Every step of the proof follows the same argument as in the proof of Lemma 9. Finally, optimizing the exponent over $\rho\in[0,1]$ in (H.2) proves the lemma. $\blacksquare$
H.2 Universal decoding: Proof of Lemma 14
In this section, we give the details of the proof of Lemma 14. The argument resembles
that in the proof of Lemma 11 in Appendix F.2.
The error probability $p_n(l)$ can be bounded starting from (H.1), with the condition $H(\tilde{x}_l^n, y_l^n) \le H(x_l^n, y_l^n)$ substituted for $p_{xy}(\tilde{x}_l^n, y_l^n) \ge p_{xy}(x_l^n, y_l^n)$. We get
$$p_n(l) \le \sum_{P^{n-l}}\sum_{V^{n-l}}\ \sum_{y_l^n\in T_{P^{n-l}}}\ \sum_{x_l^n\in T_{V^{n-l}}(y_l^n)}\min\Bigg[1, \sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}}\ \sum_{\tilde{x}_l^n\in T_{\tilde{V}^{n-l}}(y_l^n)} 2\times 2^{-(n-l+1)R}\Bigg] p_{xy}(x_l^n, y_l^n) \qquad (H.3)$$
In (H.3) we enumerate all the source sequences in a way that allows us to focus on the types of the important subsequences. We enumerate the possibly misleading candidate sequences in terms of their suffix types. We restrict the sum to those sequences $\tilde{x}_l^n$ that could lead to mistaken decoding, using the notation $H(P^{n-l}\times V^{n-l})$ for the joint entropy of the type $P^{n-l}$ with the V-shell $V^{n-l}$.

Note that the summations within the minimization in (H.3) do not depend on the arguments within these sums. Thus, we can bound this sum separately to get a bound on the number of possibly misleading source sequences $\tilde{x}_l^n$.
$$\sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}}\ \sum_{\tilde{x}_l^n\in T_{\tilde{V}^{n-l}}(y_l^n)} 1 \ \le\ \sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}} |T_{\tilde{V}^{n-l}}(y_l^n)|$$
$$\le \sum_{\substack{\tilde{V}^{n-l}\text{ s.t.}\\ H(P^{n-l}\times\tilde{V}^{n-l})\ \le\ H(P^{n-l}\times V^{n-l})}} 2^{(n-l+1)H(\tilde{V}^{n-l}|P^{n-l})} \ \le\ \sum_{\tilde{V}^{n-l}} 2^{(n-l+1)H(V^{n-l}|P^{n-l})} \qquad (H.4)$$
$$\le (n-l+2)^{|\mathcal{X}||\mathcal{Y}|}\, 2^{(n-l+1)H(V^{n-l}|P^{n-l})} \qquad (H.5)$$
Every inequality follows the same argument as in the proof of Lemma 11. (H.4) is true because $H(P^{n-l}\times\tilde{V}^{n-l}) \le H(P^{n-l}\times V^{n-l})$ is equivalent to $H(\tilde{V}^{n-l}|P^{n-l}) \le H(V^{n-l}|P^{n-l})$.
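The two counting facts used here, polynomially many conditional types and $|T_V(y^n)| \le 2^{nH(V|P)}$, can be verified by brute force for a tiny alphabet. The sketch below is illustrative only; the fixed side sequence and the length $n$ are assumptions chosen to keep the enumeration small:

```python
import math
from itertools import product

# Brute-force check of the type-counting bounds for a binary example.
n = 10
y = (0, 0, 0, 0, 0, 0, 1, 1, 1, 1)   # a fixed side sequence; its type P is (0.6, 0.4)
ny = [y.count(0), y.count(1)]

def cond_type(x):
    """V(a|b): empirical conditional distribution of x given y."""
    counts = [[0, 0], [0, 0]]          # counts[b][a]
    for xi, yi in zip(x, y):
        counts[yi][xi] += 1
    return tuple(tuple(c / ny[b] for c in counts[b]) for b in range(2))

def H_cond(V):
    """H(V|P) in bits, with P the type of y."""
    return sum((ny[b] / n) *
               -sum(v * math.log2(v) for v in V[b] if v > 0) for b in range(2))

shells = {}
for x in product((0, 1), repeat=n):
    shells[cond_type(x)] = shells.get(cond_type(x), 0) + 1

# Polynomially many V-shells: at most (n+1)^{|X||Y|}
assert len(shells) <= (n + 1) ** 4
# Each shell satisfies |T_V(y)| <= 2^{n H(V|P)}
for V, size in shells.items():
    assert size <= 2.0 ** (n * H_cond(V))
```

The shell size here is a product of binomial coefficients, one per value of $y$, and the bound follows because each binomial coefficient is at most $2$ raised to the corresponding conditional entropy times the count of that $y$ value.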
We substitute (H.5) into (H.3) and pull out the polynomial term, giving
pn(l) ≤(n− l + 2)|X ||Y|∑
P n−l
∑
V n−l
∑
ynl ∈ T
P n−l
∑
xnl ∈ T
V n−l(ynl
)
min[1,
2× 2−(n−l+1)(R−H(V n−l|P n−l))]pxy (xn
l , ynl )
≤(n− l + 2)|X ||Y|∑
P n−l
∑
V n−l
2−(n−l+1)max0,R−H(V n−l|P n−l)
× 2−(n−l+1)D(P n−l×V n−l‖pxy )
≤2× (n− l + 2)2|X ||Y| × 2−(n−l+1) inf x,yD(px,y‖pxy )+|R−H(V n−l|P n−l)|+
=2× (n− l + 2)2|X ||Y| × 2−(n−l+1)Elowersi (R)
All steps follow the same argument as in the proof of Lemma 11. □
Appendix I
Proof of Corollary 1
I.1 Proof of the lower bound
From Theorem 6, we know that
$$\begin{aligned}
E_{si}(R) \ge E^{\mathrm{lower}}_{si}(R) &= \min_{q_{xy}} \bigl\{ D(q_{xy}\|p_{xy}) + |R - H(q_{x|y})|^+ \bigr\} &\mathrm{(I.1)}\\
&= \max_{\rho \in [0,1]} \Bigl\{ \rho R - \log\Bigl[\sum_y \Bigl[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr] \Bigr\} &\mathrm{(I.2)}
\end{aligned}$$
where the equivalence between (I.1) and (I.2) is shown by a Lagrange multiplier argument in [39] and detailed in Appendix G. Next we show that (I.2) is equal to the random source coding error exponent $E_r(R, p_s)$ defined in (A.5) with $p_x$ replaced by $p_s$.
$$\begin{aligned}
E^{\mathrm{lower}}_{si}(R) &\overset{(a)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(b)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x \Bigl(\frac{p_{x|y}(x|y)}{|\mathcal{S}|}\Bigr)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(c)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\frac{1}{|\mathcal{S}|} \sum_y \Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(d)}{=} \max_{\rho \in [0,1]} \rho R - \log\Bigl[\Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(e)}{=} E_r(R, p_s)
\end{aligned}$$
(a) is by definition. (b) holds because the marginal distribution of $y$ is uniform on $\mathcal{S}$, so $p_{xy}(x,y) = p_{x|y}(x|y)\, p_y(y) = p_{x|y}(x|y)\frac{1}{|\mathcal{S}|}$. (c) is true because $x = y \oplus s$. (d) is obvious. (e) is by the definition of the random coding error exponent in Equation (A.5). □
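The chain (a)-(e) can also be checked numerically. Below is a quick sketch for a toy binary case with $\mathcal{S} = \{0,1\}$, $y$ uniform, and $x = y \oplus s$; the noise distribution `p_s = (0.9, 0.1)` is an arbitrary illustrative choice:

```python
from math import log2

# Toy instance (illustrative, not from the text): S = {0,1}, p_s = (0.9, 0.1),
# y uniform on S, x = y XOR s, so p_xy(x, y) = p_s(x XOR y) / |S|.
p_s = [0.9, 0.1]
S = [0, 1]

def lhs(rho):
    """Gallager-style expression over the joint p_xy, as in step (a)."""
    total = 0.0
    for y in S:
        inner = sum((p_s[x ^ y] / len(S)) ** (1 / (1 + rho)) for x in S)
        total += inner ** (1 + rho)
    return log2(total)

def rhs(rho):
    """The same expression reduced to the noise distribution p_s, as in step (d)."""
    return (1 + rho) * log2(sum(p ** (1 / (1 + rho)) for p in p_s))

for rho in [0.0, 0.25, 0.5, 1.0]:
    assert abs(lhs(rho) - rhs(rho)) < 1e-12
print("identity (a)-(e) verified on a toy source")
```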
I.2 Proof of the upper bound
From Theorem 7, we know that
$$\begin{aligned}
E_{si}(R) &\overset{(a)}{\le} E^{\mathrm{upper}}_{si}(R)\\
&\overset{(b)}{=} \min\Bigl\{ \inf_{\substack{q_{xy},\ \alpha \ge 1:\\ H(q_{x|y}) > (1+\alpha)R}} \frac{1}{\alpha} D(q_{xy}\|p_{xy}),\ \inf_{\substack{q_{xy},\ 1 \ge \alpha \ge 0:\\ H(q_{x|y}) > (1+\alpha)R}} \frac{1-\alpha}{\alpha} D(q_x\|p_x) + D(q_{xy}\|p_{xy}) \Bigr\}\\
&\overset{(c)}{\le} \inf_{\substack{q_{xy},\ 1 \ge \alpha \ge 0:\\ H(q_{x|y}) > (1+\alpha)R}} \frac{1-\alpha}{\alpha} D(q_x\|p_x) + D(q_{xy}\|p_{xy})\\
&\overset{(d)}{\le} \inf_{\substack{q_{xy},\ 1 \ge \alpha \ge 0:\\ H(q_{x|y}) > (1+\alpha)R}} D(q_{xy}\|p_{xy})\\
&\overset{(e)}{\le} \inf_{q_{xy}:\ H(q_{x|y}) > R} D(q_{xy}\|p_{xy})\\
&\overset{(f)}{=} E^{\mathrm{upper}}_{si,b}(R) \qquad \mathrm{(I.3)}
\end{aligned}$$
where (a) and (b) are by Theorem 7, and (c) is obvious. (d) is true because $D(q_x\|p_x) \ge 0$. (e) is true because $\{(q_{xy}, 0) : H(q_{x|y}) > R\} \subseteq \{(q_{xy}, \alpha) : H(q_{x|y}) > (1+\alpha)R\}$. (f) is by the definition of the upper bound on the block coding error exponent in Theorem 13. Using the other definition of $E^{\mathrm{upper}}_{si,b}(R)$ in Theorem 13, we have:
$$\begin{aligned}
E^{\mathrm{upper}}_{si,b}(R) &\overset{(a)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(b)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\sum_y \Bigl[\sum_x \Bigl(\frac{p_{x|y}(x|y)}{|\mathcal{S}|}\Bigr)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(c)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\frac{1}{|\mathcal{S}|} \sum_y \Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(d)}{=} \max_{\rho \in [0,\infty)} \rho R - \log\Bigl[\Bigl[\sum_s p_s(s)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho}\Bigr]\\
&\overset{(e)}{=} E_{s,b}(R, p_s)
\end{aligned}$$
where (a)-(d) follow exactly the same arguments as those in Section I.1, and (e) is by the definition of the block coding error exponent in (A.4). □
Appendix J
Proof of Theorem 8
The proof here is almost identical to that of the delay-constrained lossless source coding error exponent $E_s(R)$ in Chapter 2. In the converse part, however, we use a different, genie-based argument [67]. Readers may find that proof interesting.
J.1 Achievability of Eei(R)
In this section, we introduce a universal coding scheme that achieves the delay-constrained error exponent in Theorem 8. The coding scheme depends only on the alphabet sizes $|\mathcal{X}|$ and $|\mathcal{Y}|$, not on the distribution of the source. We first describe our universal coding scheme.
A block length $N$ is chosen that is much smaller than the target end-to-end delays, while still being large enough; again we are interested in the performance with asymptotically large delays $\Delta$. For a discrete memoryless source $X$, side information $Y$, and large block length $N$, the best possible variable-length code is given in [29] and consists of two stages: first describing the joint type of the block $(\vec{x}_i, \vec{y}_i)$ using at most $O(|\mathcal{X}||\mathcal{Y}| \log_2 N)$ bits, and then describing which particular realization has occurred using a variable $N H(\vec{x}_i|\vec{y}_i)$ bits, where $\vec{x}_i$ is the $i$-th block of length $N$ and $H(\vec{x}_i|\vec{y}_i)$ is the empirical conditional entropy of the sequence $\vec{x}_i$ given $\vec{y}_i$. The overhead $O(|\mathcal{X}||\mathcal{Y}| \log_2 N)$ is asymptotically negligible, and the code is universal in nature. It is easy to verify that the average code length satisfies $\lim_{N \to \infty} \frac{E_{p_{xy}}[l(\vec{x}, \vec{y})]}{N} = H(p_{x|y})$. This code is obviously a prefix-free code. Write $l(\vec{x}_i, \vec{y}_i)$ as
[Figure J.1: A universal delay-constrained source coding system with both encoder and decoder side-information. Streaming data $X$ and streaming side information $Y$ enter a fixed-to-variable-length encoder; the output bits pass through a FIFO buffer into a rate-$R$ bit stream to the decoder.]
the length of the codeword for $(\vec{x}_i, \vec{y}_i)$, and write $2^{N\varepsilon_N}$ for the number of types in $\mathcal{X}^N \times \mathcal{Y}^N$; then
$$N H(\vec{x}_i|\vec{y}_i) \le l(\vec{x}_i, \vec{y}_i) \le N\bigl(H(\vec{x}_i|\vec{y}_i) + \varepsilon_N\bigr) \qquad \mathrm{(J.1)}$$
where $0 < \varepsilon_N \le |\mathcal{X}||\mathcal{Y}| \frac{\log_2(N+1)}{N}$ goes to 0 as $N$ goes to infinity. The binary sequence describing the source is fed into the FIFO buffer shown in Figure J.1. Notice that if the buffer is empty, the output of the buffer can be gibberish binary bits; the decoder simply discards these meaningless bits because it is aware that the buffer is empty.
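The two-stage code length described above (type description plus an index into the conditional type class) can be sketched as follows. The helper names and the toy sequences are hypothetical, and integer rounding is absorbed by a small slack term relative to (J.1):

```python
from math import log2, ceil
from collections import Counter

def empirical_cond_entropy(x, y):
    """Empirical conditional entropy H(x|y) in bits per symbol."""
    n = len(x)
    joint = Counter(zip(x, y))
    marg_y = Counter(y)
    return sum(c / n * log2(marg_y[b] / c) for (a, b), c in joint.items())

def code_length(x, y, X_size=2, Y_size=2):
    """Two-stage universal code length: joint-type index + index within the shell."""
    N = len(x)
    type_bits = ceil(X_size * Y_size * log2(N + 1))       # describe the joint type
    index_bits = ceil(N * empirical_cond_entropy(x, y))   # index within the type class
    return type_bits + index_bits

# Toy binary blocks of length N = 8 (illustrative choices).
x = [0, 1, 1, 0, 1, 0, 0, 0]
y = [0, 1, 0, 0, 1, 1, 0, 0]
N = len(x)
l = code_length(x, y)
H = empirical_cond_entropy(x, y)
eps_N = 2 * 2 * log2(N + 1) / N  # |X||Y| log2(N+1) / N

# Sanity check mirroring (J.1), with +2 bits of rounding slack from the two ceils.
assert N * H <= l <= N * (H + eps_N) + 2
print(l, H)
```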
Proposition 13 For the iid source $\sim p_{xy}$, using the universal causal code described above, for all $\varepsilon > 0$ there exists $K < \infty$ such that for all $t, \Delta$:
$$\Pr\bigl(\vec{x}_t \ne \hat{\vec{x}}_t((t+\Delta)N)\bigr) \le K 2^{-\Delta N (E_{ei}(R) - \varepsilon)}$$
where $\hat{\vec{x}}_t((t+\Delta)N)$ is the estimate of $\vec{x}_t$ at time $(t+\Delta)N$. Before the proof, we state the following lemma bounding the probability of atypical source behavior.
Lemma 33 (Source atypicality) For all $\varepsilon > 0$ and block length $N$ large enough, there exists $K < \infty$ such that for all $n$, if $r < \log_2 |\mathcal{X}|$:
$$\Pr\Bigl(\sum_{i=1}^{n} l(\vec{x}_i, \vec{y}_i) > nNr\Bigr) \le K 2^{-nN(E_{ei,b}(r) - \varepsilon)} \qquad \mathrm{(J.2)}$$
Proof: We only need to show the case $r > H(p_{x|y})$. By Cramér's theorem [31], for all $\varepsilon_1 > 0$ there exists $K_1$ such that
$$\Pr\Bigl(\sum_{i=1}^{n} l(\vec{x}_i, \vec{y}_i) > nNr\Bigr) = \Pr\Bigl(\frac{1}{n}\sum_{i=1}^{n} l(\vec{x}_i, \vec{y}_i) > Nr\Bigr) \le K_1 2^{-n(\inf_{z > Nr} I(z) - \varepsilon_1)} \qquad \mathrm{(J.3)}$$
where the rate function $I(z)$ is [31]:
$$I(z) = \sup_{\rho \in \mathbb{R}} \Bigl\{ \rho z - \log_2\Bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\Bigr) \Bigr\} \qquad \mathrm{(J.4)}$$
Write $I(z,\rho) = \rho z - \log_2\bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\bigr)$.
Obviously $I(z,0) = 0$, and since $z > Nr > N H(p_{x|y})$, for large $N$:
$$\frac{\partial I(z,\rho)}{\partial \rho}\Big|_{\rho=0} = z - \sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, l(\vec{x},\vec{y}) \ge 0$$
By Hölder's inequality, for all $\rho_1, \rho_2$, and for all $\theta \in (0,1)$:
$$\Bigl(\sum_i p_i 2^{\rho_1 l_i}\Bigr)^{\theta} \Bigl(\sum_i p_i 2^{\rho_2 l_i}\Bigr)^{1-\theta} \ge \sum_i \bigl(p_i^{\theta}\, 2^{\theta \rho_1 l_i}\bigr)\bigl(p_i^{1-\theta}\, 2^{(1-\theta)\rho_2 l_i}\bigr) = \sum_i p_i 2^{(\theta \rho_1 + (1-\theta)\rho_2) l_i}$$
This shows that $\log_2\bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\bigr)$ is a convex function of $\rho$, and thus $I(z,\rho)$ is a concave $\cap$ function of $\rho$ for fixed $z$. Then for all $z > 0$ and all $\rho < 0$, $I(z,\rho) < 0$, which means that the $\rho$ maximizing $I(z,\rho)$ is positive. This implies that $I(z)$ is monotonically increasing in $z$, and $I(z)$ is obviously continuous. Thus
$$\inf_{z > Nr} I(z) = I(Nr) \qquad \mathrm{(J.5)}$$
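The two facts used above, convexity of the log moment generating function in $\rho$ and monotonicity of $I(z)$ above the mean code length, can be checked numerically with a toy length distribution; the values below are illustrative, not from the text:

```python
from math import log2

# Toy codeword-length distribution (illustrative, not from the text).
lengths = [2, 3, 5, 8]
probs = [0.5, 0.25, 0.15, 0.1]
mean_len = sum(p * l for p, l in zip(probs, lengths))

def log_mgf(rho):
    """log2 of E[2^{rho * l}], the term subtracted inside I(z, rho)."""
    return log2(sum(p * 2 ** (rho * l) for p, l in zip(probs, lengths)))

def I(z):
    """Rate function I(z) = sup_rho (rho*z - log_mgf(rho)), via a grid search."""
    grid = [k / 100 for k in range(-500, 501)]
    return max(rho * z - log_mgf(rho) for rho in grid)

# Convexity of log_mgf in rho: nonnegative second differences on a grid.
h = 0.1
for k in range(-20, 20):
    rho = k * h
    assert log_mgf(rho - h) + log_mgf(rho + h) - 2 * log_mgf(rho) >= -1e-9

# I(z) is nondecreasing for z above the mean length, which is what makes
# inf_{z > Nr} I(z) = I(Nr) as in (J.5).
zs = [mean_len + 0.2 * k for k in range(1, 10)]
vals = [I(z) for z in zs]
assert all(v2 >= v1 - 1e-9 for v1, v2 in zip(vals, vals[1:]))
print("log-MGF convex; I(z) nondecreasing above the mean")
```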
Using the upper bound on $l(\vec{x}, \vec{y})$ in (J.1):
$$\begin{aligned}
\log_2\Bigl(\sum_{(\vec{x},\vec{y}) \in \mathcal{X}^N \times \mathcal{Y}^N} p_{xy}(\vec{x},\vec{y})\, 2^{\rho l(\vec{x},\vec{y})}\Bigr) &\le \log_2\Bigl(\sum_{q_{xy} \in \mathcal{T}^N} 2^{-N D(q_{xy}\|p_{xy})}\, 2^{\rho N(\varepsilon_N + H(q_{x|y}))}\Bigr)\\
&\le \log_2\Bigl(2^{N \varepsilon_N}\, 2^{-N \min_{q}\{D(q_{xy}\|p_{xy}) - \rho H(q_{x|y}) - \rho \varepsilon_N\}}\Bigr)\\
&= N\Bigl(-\min_{q}\{D(q_{xy}\|p_{xy}) - \rho H(q_{x|y}) - \rho \varepsilon_N\} + \varepsilon_N\Bigr)
\end{aligned}$$
where $0 < \varepsilon_N \le |\mathcal{X}||\mathcal{Y}| \frac{\log_2(N+1)}{N}$ goes to 0 as $N$ goes to infinity, and $\mathcal{T}^N$ is the set of all types on $\mathcal{X}^N \times \mathcal{Y}^N$.
Substituting the above inequalities into $I(Nr)$ defined in (J.4):
$$I(Nr) \ge N\Bigl(\sup_{\rho} \min_{q} \bigl\{\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})\bigr\} - \varepsilon_N\Bigr) \qquad \mathrm{(J.6)}$$
We next show that $I(Nr) \ge N(E_{ei,b}(r) - \varepsilon)$ where $\varepsilon$ goes to 0 as $N$ goes to infinity. This can be proved by a somewhat complicated Lagrange multiplier method also used in [12]. We give a new proof based on the existence of a saddle point of the min-max function. Define
$$f(q, \rho) = \rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})$$
Obviously, for fixed $q$, $f(q,\rho)$ is a linear function of $\rho$, thus concave $\cap$. Also, for fixed $\rho$, $f(q,\rho)$ is a convex function of $q$. Define $g(u) = \min_q \sup_\rho (f(q,\rho) + \rho u)$; it is enough [10] to show that $g(u)$ is finite in a neighborhood of $u = 0$ to establish the existence of the saddle point.
$$\begin{aligned}
g(u) &\overset{(a)}{=} \min_q \sup_\rho \bigl(f(q,\rho) + \rho u\bigr)\\
&\overset{(b)}{=} \min_q \sup_\rho \bigl(\rho(r - H(q_{x|y}) - \varepsilon_N + u) + D(q_{xy}\|p_{xy})\bigr)\\
&\overset{(c)}{\le} \min_{q:\ H(q_{x|y}) = r - \varepsilon_N + u} D(q_{xy}\|p_{xy})\\
&\overset{(d)}{<} \infty \qquad \mathrm{(J.7)}
\end{aligned}$$
(a) and (b) are by definition. (c) is true because $H(p_{x|y}) < r < \log_2 |\mathcal{X}|$; thus for very small $\varepsilon_N$ and $u$, $H(p_{x|y}) < r - \varepsilon_N + u < \log_2 |\mathcal{X}|$, so there exists a distribution $q$ with $H(q_{x|y}) = r - \varepsilon_N + u$. (d) is true because of the assumption on $p_{xy}$ that the marginal $p_x(x) > 0$ for all $x \in \mathcal{X}$. Thus we have proved the existence of a saddle point of $f(q, \rho)$.
$$\sup_{\rho} \min_{q} f(q,\rho) = \min_{q} \sup_{\rho} f(q,\rho) \qquad \mathrm{(J.8)}$$
Notice that if $H(q_{x|y}) \ne r - \varepsilon_N$, then $\rho$ can be chosen arbitrarily large or small to make $\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})$ arbitrarily large. Thus the $q$ minimizing $\sup_\rho \{\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})\}$ satisfies $r - H(q_{x|y}) - \varepsilon_N = 0$, so:
$$\begin{aligned}
\min_q \sup_\rho \bigl\{\rho(r - H(q_{x|y}) - \varepsilon_N) + D(q_{xy}\|p_{xy})\bigr\} &\overset{(a)}{=} \min_{q:\ H(q_{x|y}) = r - \varepsilon_N} D(q_{xy}\|p_{xy})\\
&\overset{(b)}{=} \min_{q:\ H(q_{x|y}) \ge r - \varepsilon_N} D(q_{xy}\|p_{xy})\\
&\overset{(c)}{=} E_{ei,b}(r - \varepsilon_N) \qquad \mathrm{(J.9)}
\end{aligned}$$
(a) follows the argument above. (b) is true because the distribution $q$ minimizing the KL divergence always lies on the boundary of the feasible set [26] when $p$ is not in the feasible set. (c) is by definition. Combining (J.6), (J.8) and (J.9), letting $N$ be sufficiently large so that $\varepsilon_N$ is sufficiently small, and noticing that $E_{ei,b}(r)$ is continuous in $r$, we get the desired bound in (J.2). □
Now we are ready to prove Proposition 13.
Proof: We upper bound the probability of decoding error on $\vec{x}_t$ at time $(t+\Delta)N$. At time $(t+\Delta)N$, the decoder cannot decode $\vec{x}_t$ with zero error probability iff the binary strings describing $\vec{x}_t$ have not all left the buffer. Since the encoding buffer is FIFO, this means that the number of outgoing bits from some time $t_1$ to $(t+\Delta)N$ is less than the number of bits in the buffer at time $t_1$ plus the number of incoming bits from time $t_1$ to time $tN$. Suppose that the buffer was last empty at time $tN - nN$, where $0 \le n \le t$. Given this condition, a decoding error can occur only if $\sum_{i=0}^{n-1} l(\vec{x}_{t-i}, \vec{y}_{t-i}) > (n+\Delta)NR$. Write $l_{\max}$ for the longest codeword length; $l_{\max} \le |\mathcal{X}||\mathcal{Y}| \log_2(N+1) + N \log_2 |\mathcal{X}|$. Then $\Pr\bigl(\sum_{i=0}^{n-1} l(\vec{x}_{t-i}, \vec{y}_{t-i}) > (n+\Delta)NR\bigr) > 0$ only if $n > \frac{(n+\Delta)NR}{l_{\max}} > \frac{\Delta NR}{l_{\max}} \triangleq \beta\Delta$.
$$\begin{aligned}
\Pr\bigl(\vec{x}_t \ne \hat{\vec{x}}_t((t+\Delta)N)\bigr) &\le \sum_{n=\beta\Delta}^{t} \Pr\Bigl(\sum_{i=0}^{n-1} l(\vec{x}_{t-i}, \vec{y}_{t-i}) > (n+\Delta)NR\Bigr) \qquad \mathrm{(J.10)}\\
&\overset{(a)}{\le} \sum_{n=\beta\Delta}^{t} K_1 2^{-nN\bigl(E_{ei,b}\bigl(\frac{(n+\Delta)NR}{nN}\bigr) - \varepsilon_1\bigr)}\\
&\overset{(b)}{\le} \sum_{n=\gamma\Delta}^{\infty} K_2 2^{-nN(E_{ei,b}(R) - \varepsilon_2)} + \sum_{n=\beta\Delta}^{\gamma\Delta} K_2 2^{-\Delta N\bigl(\min_{\alpha>1} \frac{E_{ei,b}(\alpha R)}{\alpha - 1} - \varepsilon_2\bigr)}\\
&\overset{(c)}{\le} K_3 2^{-\gamma\Delta N(E_{ei,b}(R) - \varepsilon_2)} + |\gamma\Delta - \beta\Delta|\, K_3 2^{-\Delta N(E_{ei}(R) - \varepsilon_2)}\\
&\overset{(d)}{\le} K 2^{-\Delta N(E_{ei}(R) - \varepsilon)}
\end{aligned}$$
where the $K_i$'s and $\varepsilon_i$'s are properly chosen real numbers. (a) is true because of Lemma 33. Define $\gamma = \frac{E_{ei}(R)}{E_{ei,b}(R)}$; in the first part of (b), we only need the fact that $E_{ei,b}(R)$ is nondecreasing in $R$. In the second part of (b), we write $\alpha = \frac{n+\Delta}{n}$ and take the $\alpha$ minimizing the error exponent. The first term of (c) comes from the sum of a geometric series. The second term of (c) is by the definition of $E_{ei}(R)$. (d) is by the definition of $\gamma$. ■
J.2 Converse
To bound the best possible error exponent with fixed delay, we consider a genie-aided encoder/decoder pair and translate the block-coding error exponent into a delay-constrained error exponent. The arguments are analogous to the "focusing bound" derivation in [67] for channel coding with feedback.
Proposition 14 For fixed-rate encodings of discrete memoryless sources, it is not possible to achieve a fixed-delay error exponent better than
$$\inf_{\alpha > 0} \frac{1}{\alpha} E_{ei,b}\bigl((\alpha+1)R\bigr) \qquad \mathrm{(J.11)}$$
Proof: For simplicity of exposition, we ignore integer effects arising from the finite nature of $\Delta$, $R$, etc. For every $\alpha > 0$ and delay $\Delta$, consider a code running over its fixed-rate noiseless channel until time $\frac{\Delta}{\alpha} + \Delta$. By this time, the decoder has committed to estimates for the source symbols up to time $i = \frac{\Delta}{\alpha}$. The total number of bits used during this period is $(\frac{\Delta}{\alpha} + \Delta)R$.
Now consider a genie that gives the encoder access to the first i source symbols at
the beginning of time, rather than forcing the encoder to get the source symbols one at a
time. Simultaneously, loosen the requirements on the decoder by only demanding correct
estimates for the first i source symbols by the time ∆α + ∆. In effect, the deadline for
decoding the past source symbols is extended to the deadline of the i-th symbol itself.
Any lower bound on the error probability of the new problem is clearly also a bound for the original problem. Furthermore, the new problem is just a fixed-length block-coding problem requiring the encoding of $i$ source symbols into $(\frac{\Delta}{\alpha} + \Delta)R$ bits. The rate per symbol is
$$\Bigl(\frac{\Delta}{\alpha} + \Delta\Bigr)R \cdot \frac{1}{i} = \Bigl(\frac{\Delta}{\alpha} + \Delta\Bigr)R \cdot \frac{\alpha}{\Delta} = (\alpha+1)R$$
Lemma 12 tells us that such a code has a probability of error that is at least exponential in $i\, E_{ei,b}((\alpha+1)R)$. Since $i = \frac{\Delta}{\alpha}$, this translates into an error exponent of at most $\frac{E_{ei,b}((\alpha+1)R)}{\alpha}$ with delay $\Delta$.
Since this is true for all $\alpha > 0$, we have a bound on the fixed-delay reliability function $E_{ei}(R)$:
$$E_{ei}(R) \le \inf_{\alpha > 0} \frac{1}{\alpha} E_{ei,b}\bigl((\alpha+1)R\bigr)$$
The minimizing $\alpha$ indicates how much of the past ($\frac{\Delta}{\alpha}$) is involved in the dominant error event. ■
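Note that this converse expression matches the exponent $\min_{\alpha>1} \frac{E_{ei,b}(\alpha R)}{\alpha-1}$ appearing in the achievability proof of Proposition 13 under the substitution $\alpha \to \alpha + 1$. A small numerical sketch with a toy convex block exponent; the function `E_b` below is an arbitrary stand-in for $E_{ei,b}$, not the thesis's actual exponent:

```python
# Toy check that the achievability exponent min_{alpha>1} E_b(alpha*R)/(alpha-1)
# and the converse exponent inf_{alpha>0} E_b((alpha+1)*R)/alpha coincide.
R = 0.6
H = 0.5  # stand-in for the conditional entropy threshold

def E_b(r):
    """Toy block error exponent: zero up to H, quadratic above (illustrative)."""
    return max(0.0, r - H) ** 2

alphas = [1 + k / 1000 for k in range(1, 5000)]                 # alpha > 1 grid
ach = min(E_b(a * R) / (a - 1) for a in alphas)                 # achievability form
conv = min(E_b((a + 1) * R) / a for a in (x - 1 for x in alphas))  # converse form

assert abs(ach - conv) < 1e-9
print(ach)
```

For this toy `E_b` the common value is about 0.24, attained at an interior $\alpha$, which is consistent with the remark that the minimizing $\alpha$ identifies the dominant error event.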