05 Data Compression 2
TRANSCRIPT
-
7/31/2019 05 Data Compression2
CS5058701
Coding and Information Theory
Slides 5 & 6: Data Compression 2
1
-
Kraft Inequality for Uniquely
Decodable Codes
2
Theorem 5.5.1. The codeword lengths of any uniquely decodable code must satisfy the Kraft inequality

    ∑_{i=1}^{m} q^{−l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.

Uniquely decodable codes do not offer any further choices for the codeword lengths compared with prefix codes.
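As a quick sanity check, the inequality is easy to evaluate numerically. A minimal sketch in Python (the function name `kraft_sum` is ours, not from the text):

```python
# Evaluate the Kraft sum  sum_i q**(-l_i)  for a list of codeword lengths.
def kraft_sum(lengths, q=2):
    """Return the sum of q**(-l) over the given codeword lengths."""
    return sum(q ** -l for l in lengths)

# Binary lengths (1, 2, 3, 3) meet the inequality with equality:
print(kraft_sum([1, 2, 3, 3]))    # 1.0
# Lengths (1, 1, 2) violate it, so no uniquely decodable binary code
# with these lengths can exist:
print(kraft_sum([1, 1, 2]))       # 1.25
```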
-
Huffman Codes
3
An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman. These codes are called Huffman codes. It turns out that any other code for the same alphabet cannot have a shorter expected length than the code constructed by the algorithm.

Huffman codes are introduced with a few examples.
-
Huffman Code Example 1
4
Example 1. Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively.

The optimal code for X is expected to have the longest codewords assigned to the symbols 4 and 5. Moreover, these lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code with shorter expected length.

In general, we can construct a code in which the two longest codewords differ only in the last bit.
-
Huffman Code Example 2
5
Example 1. (cont.) For this code we can combine the symbols 4 and 5 into a single source symbol with a probability assignment 0.3. We proceed in this way, combining the two least likely symbols into one symbol in each step, until we are left with only one symbol.

This procedure and the codewords obtained are shown in the first table on [Cov, p. 93]. The code has expected length 2.3 bits.
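The merging procedure can be sketched directly with a heap. This is our own minimal implementation (not the textbook's), which tracks only codeword lengths rather than the codewords themselves:

```python
import heapq

def huffman_lengths(probs):
    """Binary Huffman: repeatedly merge the two least likely symbols;
    every symbol inside a merged group gains one bit of codeword length."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, g1 = heapq.heappop(heap)
        p2, g2 = heapq.heappop(heap)
        for i in g1 + g2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, g1 + g2))
    return lengths

probs = [0.25, 0.25, 0.2, 0.15, 0.15]
L = huffman_lengths(probs)
print(L)                                               # [2, 2, 2, 3, 3]
print(round(sum(p * l for p, l in zip(probs, L)), 6))  # 2.3
```

The resulting expected length of 2.3 bits matches the value quoted from [Cov, p. 93].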
-
Huffman Code Example 3
6
Example 2. Consider a ternary code for the same random variable as in the previous example (X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively).

The codewords obtained are shown in the second table on [Cov, p. 93]. The code has an expected length of 1.5 ternary digits.
-
Huffman Code Example 4
7
Example 3. If q ≥ 3, we may not have a sufficient number of symbols to combine them q at a time. In such a case, we add dummy symbols to the end of the set of symbols. These dummy symbols have probability 0 and are inserted to fill the tree.

Note: Since the number of symbols is reduced by q − 1 in each step, we should add dummy symbols so that the total number of symbols is of the form 1 + k(q − 1) for some integer k.

The use of dummy symbols is illustrated in the table on [Cov, p. 94]. The code has an expected length of 1.7 ternary digits.
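The dummy-symbol padding is easy to express in code. The sketch below is our own generalization of the binary merging procedure to a q-ary alphabet; the six-symbol distribution in the demo is our own guess at an example yielding 1.7 ternary digits and is not necessarily the table from [Cov, p. 94]:

```python
import heapq

def qary_huffman_lengths(probs, q):
    """q-ary Huffman: pad with zero-probability dummy symbols so that the
    symbol count has the form 1 + k*(q - 1), then repeatedly merge the
    q least likely symbols."""
    n = len(probs)
    while (n - 1) % (q - 1) != 0:
        n += 1                                   # count including dummies
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heap += [(0.0, [len(probs) + j]) for j in range(n - len(probs))]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        groups = [heapq.heappop(heap) for _ in range(q)]
        merged = []
        for p, g in groups:
            for i in g:
                lengths[i] += 1                  # one more digit per merge
            merged += g
        heapq.heappush(heap, (sum(p for p, _ in groups), merged))
    return lengths[:len(probs)]                  # drop the dummy symbols

probs = [0.25, 0.25, 0.2, 0.1, 0.1, 0.1]   # 6 symbols -> one dummy for q = 3
L = qary_huffman_lengths(probs, 3)
print(round(sum(p * l for p, l in zip(probs, L)), 6))  # 1.7
```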
-
Huffman Codes vs. Shannon
Codes
8
If Shannon codes (which are suboptimal) are used, the codeword length for some particular symbol may be much worse than with Huffman codes.

Example. Consider X = {1, 2} with probabilities 0.9999 and 0.0001, respectively. Shannon coding then gives codewords of length 1 and 14 bits, respectively, whereas an optimal code has two words of length 1.
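The Shannon lengths l(x) = ⌈log 1/p(x)⌉ behind this example can be computed directly (a sketch; the name `shannon_lengths` is ours):

```python
import math

def shannon_lengths(probs):
    """Shannon code lengths: l(x) = ceil(log2(1 / p(x)))."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

# The two-symbol example: the rare symbol gets a 14-bit codeword,
# while an optimal (Huffman) code would use one bit for each symbol.
print(shannon_lengths([0.9999, 0.0001]))   # [1, 14]
```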
-
Huffman Codes vs. Shannon
Codes cont.
9
Occasionally, it is also possible that the Huffman codeword for a particular symbol is longer than the corresponding codeword of a Shannon code.

Example. For a random variable with distribution (1/3, 1/3, 1/4, 1/12), the Huffman coding procedure results in codeword lengths (2, 2, 2, 2) or (1, 2, 3, 3) (there are sometimes several optimal codes!), whereas the Shannon coding procedure leads to lengths (2, 2, 2, 4).

Note: The Huffman code is shorter on the average.
-
Fano Codes
10
Fano proposed a suboptimal procedure for constructing a source code. In his method we first order the probabilities in decreasing order. Then we choose k such that

    | ∑_{i=1}^{k} p_i − ∑_{i=k+1}^{m} p_i |

is minimized. This point divides the source symbols into two sets of almost equal probability. Assign 0 for the first bit of the upper set and 1 for the lower set. Repeat this process for each subset. This scheme, although not optimal, achieves L(C) ≤ H(X) + 2.
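The splitting rule can be sketched as follows (our own recursive implementation; the probabilities are assumed to be already sorted in decreasing order):

```python
def fano_split(probs):
    """Index k minimizing |sum(p_1..p_k) - sum(p_{k+1}..p_m)|."""
    total, run = sum(probs), 0.0
    best_k, best_gap = 1, float("inf")
    for k in range(1, len(probs)):
        run += probs[k - 1]
        gap = abs(run - (total - run))
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k

def fano_code(probs, prefix=""):
    """Assign 0 to the upper set and 1 to the lower set, recursively."""
    if len(probs) == 1:
        return [prefix or "0"]     # single-symbol edge case
    k = fano_split(probs)
    return (fano_code(probs[:k], prefix + "0") +
            fano_code(probs[k:], prefix + "1"))

print(fano_code([0.25, 0.25, 0.2, 0.15, 0.15]))
# ['00', '01', '10', '110', '111']  -> expected length 2.3 bits here
```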
-
Shannon-Fano-Elias Coding cont.
11
The codeword for x consists of the first l(x) binary decimals of F̄(x), where

    l(x) = ⌈log(1/p(x))⌉ + 1.

The expected length of this code is less than H(X) + 2. Two examples of the construction of such codes are shown in the tables on [Cov, p. 103].
-
Shannon-Fano-Elias Coding
12
Shannon-Fano-Elias coding is a simple constructive procedure to allot codewords. Let X = {1, 2, . . . , m} and assume that p(x) > 0 for all x. We define

    F(x) = ∑_{a ≤ x} p(a),

    F̄(x) = ∑_{a < x} p(a) + p(x)/2.
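Putting the definitions together, the codeword for x is the first l(x) = ⌈log 1/p(x)⌉ + 1 bits of the binary expansion of F̄(x). A sketch of this construction (the dyadic test distribution is the one we believe appears in the tables on [Cov, p. 103]):

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias: codeword = first l(x) bits of Fbar(x), where
    Fbar(x) = sum_{a < x} p(a) + p(x)/2 and l(x) = ceil(log2(1/p(x))) + 1."""
    codes, cum = [], 0.0
    for p in probs:
        fbar = cum + p / 2
        l = math.ceil(math.log2(1 / p)) + 1
        bits = ""
        for _ in range(l):          # take l binary decimals of Fbar
            fbar *= 2
            bit = int(fbar)
            bits += str(bit)
            fbar -= bit
        codes.append(bits)
        cum += p
    return codes

print(sfe_code([0.25, 0.5, 0.125, 0.125]))
# ['001', '10', '1101', '1111']
```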
-
Choice of Compression
Method
14
For small source alphabets, we must use long blocks of source symbols to get efficient coding. (For example, with a binary alphabet, if each symbol is coded separately, we must always use 1 bit per symbol and no compression is achieved.)

Huffman codes are optimal, but require the calculation of the probabilities of all source symbols and the construction of the corresponding complete code tree.

A (good) suboptimal code with computationally efficient algorithms for encoding and decoding is often desired. Arithmetic coding fulfills these criteria.
-
Universal Codes
15
If we do not know the behavior of the source in advance, or the behavior changes, a more sophisticated adaptive arithmetic coding algorithm can be used. Such a code is an example of a universal code. Universal codes are designed to work with an arbitrary source distribution.

A particularly interesting universal code is the Lempel-Ziv code, which will be considered at a later stage of this course.
-
Data Compression & Coin
Flips
16
When a random source is compressed into a sequence of bits so that the average length is minimized, the encoded sequence is essentially incompressible, and therefore has an entropy rate close to 1 bit per symbol. The bits of the encoded sequence are essentially fair coin flips.

Let us now go in the opposite direction: How many fair coin flips does it take to generate a random variable X drawn according to some specified probability mass function p?
-
Example: Generating Random
Variable
17
Suppose we wish to generate a random variable X ∈ {a, b, c} with distribution (1/2, 1/4, 1/4). The answer is obvious.

If the first bit (coin toss) is 0, let X = a. If the first two bits are 10, let X = b. If the first two bits are 11, let X = c.

The average number of fair bits required for generating this random variable is (1/2)·1 + (1/4)·2 + (1/4)·2 = 1.5 bits. This is also the entropy of the distribution.
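The decision rule above is a two-level binary tree; as a sketch (the function name is ours):

```python
def generate_abc(bits):
    """Map fair bits to X in {a, b, c} with distribution (1/2, 1/4, 1/4):
    bit 0 -> a; bits 10 -> b; bits 11 -> c."""
    if next(bits) == 0:
        return "a"
    return "b" if next(bits) == 0 else "c"

print(generate_abc(iter([0])))        # a
print(generate_abc(iter([1, 0])))     # b
print(generate_abc(iter([1, 1])))     # c

# Expected number of fair bits = (1/2)*1 + (1/4)*2 + (1/4)*2:
print(0.5 * 1 + 0.25 * 2 + 0.25 * 2)  # 1.5
```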
-
Algorithm for Generating
Random Variables
18
We map (possibly infinite!) strings of bits Z_1, Z_2, . . . to possible outcomes X by a binary tree, where the leaves are marked by output symbols X and the path to a leaf is given by the sequence of bits produced by the fair coin. For example, the tree for the distribution in the previous example, (1/2, 1/4, 1/4), is shown in [Cov, Fig. 5.8].

Theorem 5.12.1. For any algorithm generating X, the expected number T of fair bits used is at least the entropy, that is,

    E T ≥ H(X).
-
Algorithm for Generating
Random Variables cont.
19
If the distribution is not dyadic (that is, 2-adic), we first write the probabilities as the (possibly infinite) sum of dyadic probabilities, called atoms. (In fact, this means finding the binary expansions of the probabilities.)

In constructing the tree, the same approach as in proving the Kraft inequality can be used. An atom of the form 2^{−j} is associated with a leaf at depth j. All the leaves of the atoms of the probability of an output symbol are marked with that symbol.
-
Example: Generating Random
Variable cont.
20
Let X ∈ {a, b} with the distribution (2/3, 1/3). The binary expansions of the two probabilities are 0.101010. . . and 0.010101. . . , respectively. Hence the atoms are

    2/3 = 1/2 + 1/8 + 1/32 + · · · ,
    1/3 = 1/4 + 1/16 + 1/64 + · · · .
The corresponding binary tree is shown in [Cov, Fig. 5.9].
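Finding the atoms is just reading off the binary expansion; a sketch with exact rational arithmetic (assuming the expansion has at least the requested number of atoms):

```python
from fractions import Fraction

def dyadic_atoms(p, n_terms):
    """First n_terms dyadic atoms 2**(-j) in the binary expansion of p.
    Assumes p has at least n_terms nonzero binary digits."""
    atoms, rem, j = [], Fraction(p), 1
    while len(atoms) < n_terms:
        term = Fraction(1, 2 ** j)
        if rem >= term:               # binary digit j of p is 1
            atoms.append(term)
            rem -= term
        j += 1
    return atoms

print(dyadic_atoms(Fraction(2, 3), 3))  # 1/2, 1/8, 1/32
print(dyadic_atoms(Fraction(1, 3), 3))  # 1/4, 1/16, 1/64
```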
-
Bounding The Expected Depth
of The Tree
21
Theorem 5.12.3. The expected number T of fair bits required by the optimal algorithm to generate a random variable X lies between H(X) and H(X) + 2:

    H(X) ≤ E T < H(X) + 2.
-
Background for Universal
Source Coding
22
Huffman coding compresses an i.i.d. source with a known distribution p(x) to its entropy limit H(X). However, if the code is designed for another distribution q(x), a penalty of D(p ‖ q) is incurred. Huffman coding is sensitive to the assumed distribution.

What can be achieved if the true distribution p(x) is unknown? Is there a universal code with rate R that suffices to describe every i.i.d. source with entropy H(X) < R? Yes!
-
Fixed Rate Block Codes
23
A fixed rate block code of rate R for a source X_1, X_2, . . . , X_n which has an unknown distribution Q consists of two mappings, the encoder

    f_n : X^n → {1, 2, . . . , 2^{nR}},

and the decoder,

    φ_n : {1, 2, . . . , 2^{nR}} → X^n.
-
Universal Source Codes
24
The probability of error for the code with respect to the distribution Q is

    P_e^{(n)} = Q^n( (X_1, . . . , X_n) : φ_n(f_n(X_1, . . . , X_n)) ≠ (X_1, . . . , X_n) ).

A rate R block code for a source is called universal if the functions f_n and φ_n do not depend on the distribution Q and if P_e^{(n)} → 0 as n → ∞ when H(Q) < R.

Theorem 12.3.1. There exists a sequence of (n, 2^{nR}) universal source codes such that P_e^{(n)} → 0 as n → ∞ for every source Q such that H(Q) < R.
-
Universal Coding Schemes
25
One universal coding scheme is given in the proof of [Cov, Theorem 12.3.1]. That scheme, which is due to Csiszár and Körner, is universal over the set of i.i.d. distributions. We shall look in detail at another algorithm, the Lempel-Ziv algorithm, which is a variable rate universal code.

Q: If universal codes also reach the limit given by the entropy, why do we need Huffman and similar codes (which are specific to a probability distribution)?

A: Universal codes need longer block lengths for the same performance, and their encoders and decoders are more complex.
-
Lempel-Ziv Coding
26
Most universal compression algorithms used in the real world are based on algorithms developed by Lempel and Ziv, and we therefore talk about Lempel-Ziv (LZ) coding. LZ algorithms are good at compressing data that cannot be modeled simply, such as English text and computer source code (note that LZ also compresses sources other than i.i.d. sources).

Computer compression programs, such as compress, gzip, and WinZip, and the GIF format are based on LZ coding.
-
Lempel-Ziv Coding cont.
27
The following are the main variants of Lempel-Ziv coding.

LZ77 Also called sliding window Lempel-Ziv.
LZ78 Also called dictionary Lempel-Ziv. Described in the textbook, in [Cov, Sect. 12.10].
LZW Another variant, not described here.

With these algorithms, text with any alphabet size can be compressed. Common sizes are, for example, 2 (binary sequences) and 256 (computer files consisting of a sequence of bytes).
-
LZ78
28
In the description of the algorithm, we act on the string 1011010100010. The algorithm is as follows:

1. Parse the source into strings that have not appeared so far: 1, 0, 11, 01, 010, 00, 10.
2. Code a substring as (i, c), where i is the index of the prefix substring (starting from 1, with 0 denoting the empty string) and c is the value of the additional character: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0).

To express the location (in the example, an integer between 0 and 4), we need ⌈log(c(n) + 1)⌉ bits, where c(n) is the number of substrings parsed so far.
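The two parsing steps can be sketched as follows (our own implementation of the LZ78 parse):

```python
def lz78_parse(s):
    """LZ78: parse s into phrases not seen before and emit (i, c) pairs,
    where i indexes the longest previously seen prefix (0 = empty string)
    and c is the additional character."""
    dictionary = {"": 0}
    pairs, phrase = [], ""
    for c in s:
        if phrase + c in dictionary:
            phrase += c               # keep extending the current phrase
        else:
            pairs.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)
            phrase = ""
    if phrase:                        # input ended inside a known phrase
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

print(lz78_parse("1011010100010"))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
```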
-
LZ77
29
We again use the example string 1011010100010. In each step, we proceed as follows:

1. Find p, the relative position of the longest match (the length of which is denoted by l).
2. Output (p, l, c), where c is the first character that does not match.
3. Advance l + 1 positions.

For the example string, the successive steps output (0, 0, 1), (0, 0, 0), (2, 1, 1), (3, 2, 0), (2, 2, 0), . . . .
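A minimal sketch of this parse (unbounded window, greedy longest match preferring the closest occurrence; the function name is ours):

```python
def lz77_parse(s):
    """Emit (p, l, c) triples: p = distance back to the longest match,
    l = match length, c = first non-matching character; advance l + 1."""
    out, i, n = [], 0, len(s)
    while i < n:
        best_p, best_l = 0, 0
        for j in range(i):                 # candidate match start positions
            l = 0
            # reserve the last character of s so that c always exists
            while i + l < n - 1 and s[j + l] == s[i + l]:
                l += 1
            if l > best_l or (l == best_l and l > 0 and i - j < best_p):
                best_p, best_l = i - j, l
        out.append((best_p, best_l, s[i + best_l]))
        i += best_l + 1
    return out

print(lz77_parse("1011010100010")[:5])
# [(0, 0, '1'), (0, 0, '0'), (2, 1, '1'), (3, 2, '0'), (2, 2, '0')]
```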
-
Some Details of The
Algorithms
30
Several details have to be taken into account; we give one such example for each of the algorithms:

LZ77 To avoid large values for the relative positions of strings: do not go too far back (sliding window!).
LZ78 To avoid too large dictionaries: reduce the size in one of several possible ways, for example, throw the dictionary away when it reaches a certain size (GIF does this).

Traditionally, LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.
-
Run-Length Coding
31
A compression method that does not reach the entropy bound but is used in many applications, including fax machines, is run-length coding. In this method, the input sequence is compressed by identifying adjacent symbols of equal value and replacing them with a single symbol and a count.

Example. 111111110100000 → (1, 8), (0, 1), (1, 1), (0, 5).

Clearly, in the binary case, it suffices to give the first bit and then only the lengths of the runs.
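Run-length coding is only a few lines in any language; a sketch:

```python
def run_length_encode(s):
    """Replace each maximal run of equal symbols with a (symbol, count) pair."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((c, 1))               # start a new run
    return runs

print(run_length_encode("111111110100000"))
# [('1', 8), ('0', 1), ('1', 1), ('0', 5)]
```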
-
Optimality of Lempel-Ziv
Coding
32
Theorem 12.10.2. Let {X_i} be a stationary ergodic stochastic process. Let l(X_1, X_2, . . . , X_n) be the length of the Lempel-Ziv codeword associated with X_1, X_2, . . . , X_n. Then

    lim sup_{n→∞} (1/n) l(X_1, X_2, . . . , X_n) ≤ H(X)  with probability 1,

where H(X) is the entropy rate of the process.

From the examples given, it is obvious that these compression methods are efficient only for long input sequences.
-
Lossy Compression
33
So far, only lossless compression has been considered. In lossy compression, loss of information is allowed:

scalar quantization Take the set of possible messages S and reduce it to a smaller set S′ with a mapping f : S → S′. For example, least significant bits are dropped.
vector quantization Map a multidimensional space S into a smaller set S′ of messages.
transform coding Transform the input into a different form that can be more easily compressed (in a lossy or lossless way).

Lossy methods include JPEG (still images) and MPEG (video).
-
Rate Distortion Theory
34
The effects of quantization are studied in rate distortion theory, the basic problem of which can be stated as follows:

Q: Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate?
Q: (Equivalent) What is the minimum rate description required to achieve a particular distortion?

It turns out that, perhaps surprisingly, it is more efficient to describe two (even independent!) variables jointly than individually. Rate distortion theory can be applied to both discrete and continuous random variables.
-
Example: Quantization
35
We look at the basic problem of representing a single continuous random variable by a finite number of bits. Denote the random variable by X and the representation of X by X̂(X). With R bits, the function X̂ can take on 2^R values.

What is the optimum set of values for X̂, and what are the regions associated with each value of X̂?
-
Example: Quantization cont.
36
With X ∼ N(0, σ²) and a squared error distortion measure, we wish to find a function X̂ that takes on at most 2^R values (these are called reproduction points) and minimizes E(X − X̂(X))².

With one bit, obviously the bit should distinguish whether X < 0 or not. To minimize squared error, each reproduced symbol should be at the conditional mean of its region (see [Cov, Fig. 13.1]), and we have

    X̂(X) = −√(2/π) σ, if x < 0,
    X̂(X) = +√(2/π) σ, if x ≥ 0.
-
Example: Quantization cont.
37
With two or more bits to represent the sample, the situation gets far more complicated. The following facts state simple properties of optimal regions and reconstruction points.

Given a set of reconstruction points, the distortion is minimized by mapping a source random variable X to the representation X̂ that is closest to it. The set of regions defined by this mapping is called a Voronoi or Dirichlet partition defined by the reconstruction points.

The reconstruction points should minimize the conditional expected distortion over their respective assignment regions.
-
Example: Quantization cont.
The aforementioned properties enable algorithms to find good quantizers: the Lloyd algorithm (for real-valued random variables) and the generalized Lloyd algorithm (for vector-valued random variables). Starting from any set of reconstruction points, repeat the following steps:

1. Find the optimal set of reconstruction regions.
2. Find the optimal reconstruction points for these regions.

The expected distortion is decreased at each stage in this algorithm, so it converges to a local minimum of the distortion.
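The two alternating steps can be sketched on empirical data; this is our own scalar (1-D) version run on samples from N(0, 1), for which the optimal one-bit reconstruction points are ±√(2/π) ≈ ±0.798:

```python
import random

def lloyd_1d(samples, points, iters=20):
    """Lloyd's algorithm for a scalar quantizer: alternate between
    (1) nearest-point assignment regions and (2) region means."""
    points = list(points)
    for _ in range(iters):
        regions = [[] for _ in points]
        for x in samples:                # step 1: optimal regions
            i = min(range(len(points)), key=lambda k: (x - points[k]) ** 2)
            regions[i].append(x)
        for i, r in enumerate(regions):  # step 2: optimal points
            if r:
                points[i] = sum(r) / len(r)
    return sorted(points)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(lloyd_1d(samples, [-2.0, 2.0]))
# approximately [-0.80, 0.80]
```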