05 Data Compression 2
TRANSCRIPT
-
7/31/2019 05 Data Compression2
CS5058701
Coding and Information Theory
Slides 5 & 6: Data Compression 2
1
-
Kraft Inequality for Uniquely
Decodable Codes
2
Theorem 5.5.1. The codeword lengths of any uniquely decodable code must satisfy the Kraft inequality

    ∑_{i=1}^{m} q^{−l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.

Uniquely decodable codes do not offer any further choices for the codeword lengths compared with prefix codes.
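As a quick sanity check, the inequality is easy to evaluate numerically. A minimal sketch in Python (the function name `kraft_sum` is ours, not from the text):

```python
# Evaluate the Kraft sum  sum_i q**(-l_i)  for a list of codeword lengths.
def kraft_sum(lengths, q=2):
    """Return the sum of q**(-l) over the given codeword lengths."""
    return sum(q ** -l for l in lengths)

# Binary lengths (1, 2, 3, 3) meet the inequality with equality:
print(kraft_sum([1, 2, 3, 3]))    # 1.0
# Lengths (1, 1, 2) violate it, so no uniquely decodable binary code
# with these lengths can exist:
print(kraft_sum([1, 1, 2]))       # 1.25
```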
-
Huffman Codes
3
An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman. These codes are called Huffman codes. It turns out that any other code for the same alphabet cannot have a shorter expected length than the code constructed by the algorithm.

Huffman codes are introduced with a few examples.
-
Huffman Code Example 1
4
Example 1. Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively.

The optimal code for X is expected to have the longest codewords assigned to the symbols 4 and 5. Moreover, these lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code with shorter expected length.

In general, we can construct a code in which the two longest codewords differ only in the last bit.
-
Huffman Code Example 2
5
Example 1. (cont.) For this code we can combine the symbols 4 and 5 into a single source symbol with a probability assignment 0.3. We proceed in this way, combining the two least likely symbols into one symbol in each step, until we are left with only one symbol.

This procedure and the codewords obtained are shown in the first table on [Cov, p. 93]. The code has expected length 2.3 bits.
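The merging procedure can be sketched directly with a heap. This is our own minimal implementation (not the textbook's), which tracks only codeword lengths rather than the codewords themselves:

```python
import heapq

def huffman_lengths(probs):
    """Binary Huffman: repeatedly merge the two least likely symbols;
    every symbol inside a merged group gains one bit of codeword length."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, g1 = heapq.heappop(heap)
        p2, g2 = heapq.heappop(heap)
        for i in g1 + g2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, g1 + g2))
    return lengths

probs = [0.25, 0.25, 0.2, 0.15, 0.15]
L = huffman_lengths(probs)
print(L)                                               # [2, 2, 2, 3, 3]
print(round(sum(p * l for p, l in zip(probs, L)), 6))  # 2.3
```

The resulting expected length of 2.3 bits matches the value quoted from [Cov, p. 93].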
-
Huffman Code Example 3
6
Example 2. Consider a ternary code for the same random variable as in the previous example (X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively).

The codewords obtained are shown in the second table on [Cov, p. 93]. The code has an expected length of 1.5 ternary digits.
-
Huffman Code Example 4
7
Example 3. If q ≥ 3, we may not have a sufficient number of symbols to combine them q at a time. In such a case, we add dummy symbols to the end of the set of symbols. These dummy symbols have probability 0 and are inserted to fill the tree.

Note: Since the number of symbols is reduced by q − 1 in each step, we should add dummy symbols so that the total number of symbols is of the form 1 + k(q − 1) for some integer k.

The use of dummy symbols is illustrated in the table on [Cov, p. 94]. The code has an expected length of 1.7 ternary digits.
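The dummy-symbol padding is easy to express in code. The sketch below is our own generalization of the binary merging procedure to a q-ary alphabet; the six-symbol distribution in the demo is our own guess at an example yielding 1.7 ternary digits and is not necessarily the table from [Cov, p. 94]:

```python
import heapq

def qary_huffman_lengths(probs, q):
    """q-ary Huffman: pad with zero-probability dummy symbols so that the
    symbol count has the form 1 + k*(q - 1), then repeatedly merge the
    q least likely symbols."""
    n = len(probs)
    while (n - 1) % (q - 1) != 0:
        n += 1                                   # count including dummies
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heap += [(0.0, [len(probs) + j]) for j in range(n - len(probs))]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        groups = [heapq.heappop(heap) for _ in range(q)]
        merged = []
        for p, g in groups:
            for i in g:
                lengths[i] += 1                  # one more digit per merge
            merged += g
        heapq.heappush(heap, (sum(p for p, _ in groups), merged))
    return lengths[:len(probs)]                  # drop the dummy symbols

probs = [0.25, 0.25, 0.2, 0.1, 0.1, 0.1]   # 6 symbols -> one dummy for q = 3
L = qary_huffman_lengths(probs, 3)
print(round(sum(p * l for p, l in zip(probs, L)), 6))  # 1.7
```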
-
Huffman Codes vs. Shannon
Codes
8
If Shannon codes (which are suboptimal) are used, the codeword length for some particular symbol may be much worse than with Huffman codes.

Example. Consider X = {1, 2} with probabilities 0.9999 and 0.0001, respectively. Shannon coding then gives codewords of length 1 and 14 bits, respectively, whereas an optimal code has two words of length 1.
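The Shannon lengths l(x) = ⌈log 1/p(x)⌉ behind this example can be computed directly (a sketch; the name `shannon_lengths` is ours):

```python
import math

def shannon_lengths(probs):
    """Shannon code lengths: l(x) = ceil(log2(1 / p(x)))."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

# The two-symbol example: the rare symbol gets a 14-bit codeword,
# while an optimal (Huffman) code would use one bit for each symbol.
print(shannon_lengths([0.9999, 0.0001]))   # [1, 14]
```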
-
Huffman Codes vs. Shannon
Codes cont.
9
Occasionally, it is also possible that the Huffman codeword for a particular symbol is longer than the corresponding codeword of a Shannon code.

Example. For a random variable with distribution (1/3, 1/3, 1/4, 1/12), the Huffman coding procedure results in codeword lengths (2, 2, 2, 2) or (1, 2, 3, 3) (there are sometimes several optimal codes!), whereas the Shannon coding procedure leads to lengths (2, 2, 2, 4).

Note: The Huffman code is shorter on the average.
-
Fano Codes
10
Fano proposed a suboptimal procedure for constructing a source code. In his method we first order the probabilities in decreasing order. Then we choose k such that

    | ∑_{i=1}^{k} p_i − ∑_{i=k+1}^{m} p_i |

is minimized. This point divides the source symbols into two sets of almost equal probability. Assign 0 for the first bit of the upper set and 1 for the lower set. Repeat this process for each subset. This scheme, although not optimal, achieves L(C) ≤ H(X) + 2.
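The splitting rule can be sketched as follows (our own recursive implementation; the probabilities are assumed to be already sorted in decreasing order):

```python
def fano_split(probs):
    """Index k minimizing |sum(p_1..p_k) - sum(p_{k+1}..p_m)|."""
    total, run = sum(probs), 0.0
    best_k, best_gap = 1, float("inf")
    for k in range(1, len(probs)):
        run += probs[k - 1]
        gap = abs(run - (total - run))
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k

def fano_code(probs, prefix=""):
    """Assign 0 to the upper set and 1 to the lower set, recursively."""
    if len(probs) == 1:
        return [prefix or "0"]     # single-symbol edge case
    k = fano_split(probs)
    return (fano_code(probs[:k], prefix + "0") +
            fano_code(probs[k:], prefix + "1"))

print(fano_code([0.25, 0.25, 0.2, 0.15, 0.15]))
# ['00', '01', '10', '110', '111']  -> expected length 2.3 bits here
```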
-
Shannon-Fano-Elias Coding cont.
11
The codeword for x consists of the first l(x) binary decimals of F̄(x), where

    l(x) = ⌈log(1/p(x))⌉ + 1.

The expected length of this code is less than H(X) + 2. Two examples of the construction of such codes are shown in the tables on [Cov, p. 103].
-
Shannon-Fano-Elias Coding
12
Shannon-Fano-Elias coding is a simple constructive procedure to allot codewords. Let X = {1, 2, . . . , m} and assume that p(x) > 0 for all x. We define

    F(x) = ∑_{a ≤ x} p(a),

    F̄(x) = ∑_{a < x} p(a) + p(x)/2.
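Putting the definitions together, the codeword for x is the first l(x) = ⌈log 1/p(x)⌉ + 1 bits of the binary expansion of F̄(x). A sketch of this construction (the dyadic test distribution is the one we believe appears in the tables on [Cov, p. 103]):

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias: codeword = first l(x) bits of Fbar(x), where
    Fbar(x) = sum_{a < x} p(a) + p(x)/2 and l(x) = ceil(log2(1/p(x))) + 1."""
    codes, cum = [], 0.0
    for p in probs:
        fbar = cum + p / 2
        l = math.ceil(math.log2(1 / p)) + 1
        bits = ""
        for _ in range(l):          # take l binary decimals of Fbar
            fbar *= 2
            bit = int(fbar)
            bits += str(bit)
            fbar -= bit
        codes.append(bits)
        cum += p
    return codes

print(sfe_code([0.25, 0.5, 0.125, 0.125]))
# ['001', '10', '1101', '1111']
```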
-
Choice of Compression
Method
14
For small source alphabets, we must use long blocks of source symbols to get efficient coding. (For example, with a binary alphabet, if each symbol is coded separately, we must always use 1 bit per symbol and no compression is achieved.)

Huffman codes are optimal, but require the calculation of the probabilities of all source symbols and the construction of the corresponding complete code tree.

A (good) suboptimal code with computationally efficient algorithms for encoding and decoding is often desired. Arithmetic coding fulfills these criteria.
-
Universal Codes
15
If we do not know the behavior of the source in advance, or the behavior changes, a more sophisticated adaptive arithmetic coding algorithm can be used. Such a code is an example of a universal code. Universal codes are designed to work with an arbitrary source distribution.

A particularly interesting universal code is the Lempel-Ziv code, which will be considered at a later stage of this course.
-
Data Compression & Coin
Flips
16
When a random source is compressed into a sequence of bits so that the average length is minimized, the encoded sequence is essentially incompressible, and therefore has an entropy rate close to 1 bit per symbol. The bits of the encoded sequence are essentially fair coin flips.

Let us now go in the opposite direction: How many fair coin flips does it take to generate a random variable X drawn according to some specified probability mass function p?
-
Example: Generating Random
Variable
17
Suppose we wish to generate a random variable X ∈ {a, b, c} with distribution (1/2, 1/4, 1/4). The answer is obvious.

If the first bit (coin toss) is 0, let X = a. If the first two bits are 10, let X = b. If the first two bits are 11, let X = c.

The average number of fair bits required for generating this random variable is (1/2)·1 + (1/4)·2 + (1/4)·2 = 1.5 bits. This is also the entropy of the distribution.
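The decision rule above is a two-level binary tree; as a sketch (the function name is ours):

```python
def generate_abc(bits):
    """Map fair bits to X in {a, b, c} with distribution (1/2, 1/4, 1/4):
    bit 0 -> a; bits 10 -> b; bits 11 -> c."""
    if next(bits) == 0:
        return "a"
    return "b" if next(bits) == 0 else "c"

print(generate_abc(iter([0])))        # a
print(generate_abc(iter([1, 0])))     # b
print(generate_abc(iter([1, 1])))     # c

# Expected number of fair bits = (1/2)*1 + (1/4)*2 + (1/4)*2:
print(0.5 * 1 + 0.25 * 2 + 0.25 * 2)  # 1.5
```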
-
Algorithm for Generating
Random Variables
18
We map (possibly infinite!) strings of bits Z_1, Z_2, . . . to possible outcomes X by a binary tree, where the leaves are marked by output symbols X and the path to a leaf is given by the sequence of bits produced by the fair coin. For example, the tree for the distribution in the previous example, (1/2, 1/4, 1/4), is shown in [Cov, Fig. 5.8].

Theorem 5.12.1. For any algorithm generating X, the expected number T of fair bits used is at least the entropy, that is,

    E T ≥ H(X).
-
Algorithm for Generating
Random Variables cont.
19
If the distribution is not dyadic (that is, 2-adic), we first write the probabilities as the (possibly infinite) sum of dyadic probabilities, called atoms. (In fact, this means finding the binary expansions of the probabilities.)

In constructing the tree, the same approach as in proving the Kraft inequality can be used. An atom of the form 2^{−j} is associated with a leaf at depth j. All the leaves of the atoms of the probability of an output symbol are marked with that symbol.
-
Example: Generating Random
Variable cont.
20
Let X ∈ {a, b} with the distribution (2/3, 1/3). The binary expansions of the two probabilities are 0.101010. . . and 0.010101. . . , respectively. Hence the atoms are

    2/3 = 1/2 + 1/8 + 1/32 + · · · ,
    1/3 = 1/4 + 1/16 + 1/64 + · · · .
The corresponding binary tree is shown in [Cov, Fig. 5.9].
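Finding the atoms is just reading off the binary expansion; a sketch with exact rational arithmetic (assuming the expansion has at least the requested number of atoms):

```python
from fractions import Fraction

def dyadic_atoms(p, n_terms):
    """First n_terms dyadic atoms 2**(-j) in the binary expansion of p.
    Assumes p has at least n_terms nonzero binary digits."""
    atoms, rem, j = [], Fraction(p), 1
    while len(atoms) < n_terms:
        term = Fraction(1, 2 ** j)
        if rem >= term:               # binary digit j of p is 1
            atoms.append(term)
            rem -= term
        j += 1
    return atoms

print(dyadic_atoms(Fraction(2, 3), 3))  # 1/2, 1/8, 1/32
print(dyadic_atoms(Fraction(1, 3), 3))  # 1/4, 1/16, 1/64
```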
-
Bounding The Expected Depth
of The Tree
21
Theorem 5.12.3. The expected number T of fair bits required by the optimal algorithm to generate a random variable X lies between H(X) and H(X) + 2:

    H(X) ≤ E T < H(X) + 2.
-
Background for Universal
Source Coding
22
Huffman coding compresses an i.i.d. source with a known distribution p(x) to its entropy limit H(X). However, if the code is designed for another distribution q(x), a penalty of D(p ‖ q) is incurred. Huffman coding is sensitive to the assumed distribution.

What can be achieved if the true distribution p(x) is unknown? Is there a universal code with rate R that suffices to describe every i.i.d. source with entropy H(X) < R? Yes!
-
Fixed Rate Block Codes
23
A fixed rate block code of rate R for a source X_1, X_2, . . . , X_n which has an unknown distribution Q consists of two mappings, the encoder

    f_n : X^n → {1, 2, . . . , 2^{nR}},

and the decoder,

    φ_n : {1, 2, . . . , 2^{nR}} → X^n.
-
Universal Source Codes
24
The probability of error for the code with respect to the distribution Q is

    P_e^{(n)} = Q^n( (X_1, . . . , X_n) : φ_n(f_n(X_1, . . . , X_n)) ≠ (X_1, . . . , X_n) ).

A rate R block code for a source is called universal if the functions f_n and φ_n do not depend on the distribution Q and if P_e^{(n)} → 0 as n → ∞ when H(Q) < R.

Theorem 12.3.1. There exists a sequence of (n, 2^{nR}) universal source codes such that P_e^{(n)} → 0 as n → ∞ for every source Q such that H(Q) < R.
-
Universal Coding Schemes
25
One universal coding scheme is given in the proof of [Cov, Theorem 12.3.1]. That scheme, which is due to Csiszár and Körner, is universal over the set of i.i.d. distributions. We shall look in detail at another algorithm, the Lempel-Ziv algorithm, which is a variable rate universal code.

Q: If universal codes also reach the limit given by the entropy, why do we need Huffman and similar codes (which are specific to a probability distribution)?

A: Universal codes need longer block lengths for the same performance, and their encoders and decoders are more complex.
-
Lempel-Ziv Coding
26
Most universal compression algorithms used in the real world are based on algorithms developed by Lempel and Ziv, and we therefore talk about Lempel-Ziv (LZ) coding. LZ algorithms are good at compressing data that cannot be modeled simply, such as English text and computer source code (note that LZ also compresses sources other than i.i.d. sources).

Computer compression programs, such as compress, gzip, and WinZip, and the GIF format are based on LZ coding.
-
Lempel-Ziv Coding cont.
27
The following are the main variants of Lempel-Ziv coding.

LZ77 Also called sliding window Lempel-Ziv.
LZ78 Also called dictionary Lempel-Ziv. Described in the textbook, in [Cov, Sect. 12.10].
LZW Another variant, not described here.

With these algorithms, text with any alphabet size can be compressed. Common sizes are, for example, 2 (binary sequences) and 256 (computer files consisting of a sequence of bytes).
-
LZ78
28
In the description of the algorithm, we act on the string 1011010100010. The algorithm is as follows:

1. Parse the source into strings that have not appeared so far: 1, 0, 11, 01, 010, 00, 10.
2. Code a substring as (i, c), where i is the index of the prefix substring (starting from 1, with 0 denoting the empty string) and c is the value of the additional character: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0).

To express the location (in the example, an integer between 0 and 4), we need ⌈log(c(n) + 1)⌉ bits, where c(n) is the number of substrings parsed so far.
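The two parsing steps can be sketched as follows (our own implementation of the LZ78 parse):

```python
def lz78_parse(s):
    """LZ78: parse s into phrases not seen before and emit (i, c) pairs,
    where i indexes the longest previously seen prefix (0 = empty string)
    and c is the additional character."""
    dictionary = {"": 0}
    pairs, phrase = [], ""
    for c in s:
        if phrase + c in dictionary:
            phrase += c               # keep extending the current phrase
        else:
            pairs.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)
            phrase = ""
    if phrase:                        # input ended inside a known phrase
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

print(lz78_parse("1011010100010"))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
```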
-
LZ77
29
We again use the example string 1011010100010. In each step, we proceed as follows:

1. Find p, the relative position of the longest match (the length of which is denoted by l).
2. Output (p, l, c), where c is the first character that does not match.
3. Advance l + 1 positions.

For the example string, the successive steps output (0, 0, 1), (0, 0, 0), (2, 1, 1), (3, 2, 0), (2, 2, 0), . . . .
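A minimal sketch of this parse (unbounded window, greedy longest match preferring the closest occurrence; the function name is ours):

```python
def lz77_parse(s):
    """Emit (p, l, c) triples: p = distance back to the longest match,
    l = match length, c = first non-matching character; advance l + 1."""
    out, i, n = [], 0, len(s)
    while i < n:
        best_p, best_l = 0, 0
        for j in range(i):                 # candidate match start positions
            l = 0
            # reserve the last character of s so that c always exists
            while i + l < n - 1 and s[j + l] == s[i + l]:
                l += 1
            if l > best_l or (l == best_l and l > 0 and i - j < best_p):
                best_p, best_l = i - j, l
        out.append((best_p, best_l, s[i + best_l]))
        i += best_l + 1
    return out

print(lz77_parse("1011010100010")[:5])
# [(0, 0, '1'), (0, 0, '0'), (2, 1, '1'), (3, 2, '0'), (2, 2, '0')]
```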
-
Some Details of The
Algorithms
30
Several details have to be taken into account; we give one such example for each of the algorithms:

LZ77 To avoid large values for the relative positions of strings: do not go too far back (sliding window!).
LZ78 To avoid too large dictionaries: reduce the size in one of several possible ways, for example, throw the dictionary away when it reaches a certain size (GIF does this).

Traditionally, LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.
-
Run-Length Coding
31
A compression method that does not reach the entropy bound but is used in many applications, including fax machines, is run-length coding. In this method, the input sequence is compressed by identifying adjacent symbols of equal value and replacing them with a single symbol and a count.

Example. 111111110100000 → (1, 8), (0, 1), (1, 1), (0, 5).

Clearly, in the binary case, it suffices to give the first bit and then only the lengths of the runs.
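Run-length coding is only a few lines in any language; a sketch:

```python
def run_length_encode(s):
    """Replace each maximal run of equal symbols with a (symbol, count) pair."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((c, 1))               # start a new run
    return runs

print(run_length_encode("111111110100000"))
# [('1', 8), ('0', 1), ('1', 1), ('0', 5)]
```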
-
Optimality of Lempel-Ziv
Coding
32
Theorem 12.10.2. Let {X_i} be a stationary ergodic stochastic process. Let l(X_1, X_2, . . . , X_n) be the length of the Lempel-Ziv codeword associated with X_1, X_2, . . . , X_n. Then

    lim sup_{n→∞} (1/n) l(X_1, X_2, . . . , X_n) ≤ H(X)  with probability 1,

where H(X) is the entropy rate of the process.

From the examples given, it is obvious that these compression methods are efficient only for long input sequences.
-
Lossy Compression
33
So far, only lossless compression has been considered. In lossy compression, loss of information is allowed:

scalar quantization Take the set of possible messages S and reduce it to a smaller set S′ with a mapping f : S → S′. For example, least significant bits are dropped.
vector quantization Map a multidimensional space S into a smaller set S′ of messages.
transform coding Transform the input into a different form that can be more easily compressed (in a lossy or lossless way).

Lossy methods include JPEG (still images) and MPEG (video).
-
Rate Distortion Theory
34
The effects of quantization are studied in rate distortion theory, the basic problem of which can be stated as follows:

Q: Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate?
Q: (Equivalent) What is the minimum rate description required to achieve a particular distortion?

It turns out that, perhaps surprisingly, it is more efficient to describe two (even independent!) variables jointly than individually. Rate distortion theory can be applied to both discrete and continuous random variables.
-
Example: Quantization
35
We look at the basic problem of representing a single continuous random variable by a finite number of bits. Denote the random variable by X and the representation of X by X̂(X). With R bits, the function X̂ can take on 2^R values.

What is the optimum set of values for X̂, and what are the regions associated with each value of X̂?
-
Example: Quantization cont.
36
With X ∼ N(0, σ²) and a squared error distortion measure, we wish to find a function X̂ that takes on at most 2^R values (these are called reproduction points) and minimizes E(X − X̂(X))².

With one bit, obviously the bit should distinguish whether X < 0 or not. To minimize squared error, each reproduced symbol should be at the conditional mean of its region (see [Cov, Fig. 13.1]), and we have

    X̂(X) = −√(2/π) σ, if x < 0,
    X̂(X) = +√(2/π) σ, if x ≥ 0.
-
Example: Quantization cont.
37
With two or more bits to represent the sample, the situation gets far more complicated. The following facts state simple properties of optimal regions and reconstruction points.

Given a set of reconstruction points, the distortion is minimized by mapping a source random variable X to the representation X̂ that is closest to it. The set of regions defined by this mapping is called a Voronoi or Dirichlet partition defined by the reconstruction points.

The reconstruction points should minimize the conditional expected distortion over their respective assignment regions.
-
Example: Quantization cont.
The aforementioned properties enable algorithms to find good quantizers: the Lloyd algorithm (for real-valued random variables) and the generalized Lloyd algorithm (for vector-valued random variables). Starting from any set of reconstruction points, repeat the following steps:

1. Find the optimal set of reconstruction regions.
2. Find the optimal reconstruction points for these regions.

The expected distortion is decreased at each stage in this algorithm, so it converges to a local minimum of the distortion.
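The two alternating steps can be sketched on empirical data; this is our own scalar (1-D) version run on samples from N(0, 1), for which the optimal one-bit reconstruction points are ±√(2/π) ≈ ±0.798:

```python
import random

def lloyd_1d(samples, points, iters=20):
    """Lloyd's algorithm for a scalar quantizer: alternate between
    (1) nearest-point assignment regions and (2) region means."""
    points = list(points)
    for _ in range(iters):
        regions = [[] for _ in points]
        for x in samples:                # step 1: optimal regions
            i = min(range(len(points)), key=lambda k: (x - points[k]) ** 2)
            regions[i].append(x)
        for i, r in enumerate(regions):  # step 2: optimal points
            if r:
                points[i] = sum(r) / len(r)
    return sorted(points)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(lloyd_1d(samples, [-2.0, 2.0]))
# approximately [-0.80, 0.80]
```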