
  • CS5058701 Coding and Information Theory, Slides 5 & 6: Data Compression 2

  • Kraft Inequality for Uniquely Decodable Codes

    Theorem 5.5.1. The codeword lengths of any uniquely decodable code must satisfy the Kraft inequality

    \sum_{i=1}^{m} q^{-l_i} \le 1.

    Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.

    Uniquely decodable codes do not offer any further choices for the codeword lengths compared with prefix codes.
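    A quick numerical check of the inequality (an illustrative sketch, not part of the original slides; the function name is made up): compute the Kraft sum for a proposed set of q-ary codeword lengths and verify that it does not exceed 1.

        def kraft_sum(lengths, q=2):
            """Kraft sum for q-ary codeword lengths; a uniquely decodable
            code with these lengths can exist only if the sum is <= 1."""
            return sum(q ** (-l) for l in lengths)

        print(kraft_sum([1, 2, 3, 3]))   # 1.0   -> feasible (e.g. 0, 10, 110, 111)
        print(kraft_sum([1, 1, 2]))      # 1.25  -> no uniquely decodable code exists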

  • Huffman Codes

    An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman. These codes are called Huffman codes. It turns out that any other code from the same alphabet cannot have a shorter expected length than the code constructed by the algorithm.

    Huffman codes are introduced with a few examples.

  • Huffman Code Example 1

    Example 1. Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively.

    The optimal code for X is expected to have the longest codewords assigned to the symbols 4 and 5. Moreover, these lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code with shorter length. In general, we can construct a code in which the two longest codewords differ only in the last bit.

  • Huffman Code Example 2

    Example 1. (cont.) For this code we can combine the symbols 4 and 5 into a single source symbol, with a probability assignment 0.3. We proceed in this way, combining the two least likely symbols into one symbol in each step, until we are left with only one symbol.

    This procedure and the codewords obtained are shown in the first table on [Cov, p. 93]. The code has expected length 2.3 bits.
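    The merging procedure is easy to express in code. The following Python sketch (illustrative, not part of the original slides) builds a binary Huffman code with a priority queue and reproduces the expected length of 2.3 bits for the distribution of Example 1; the exact codewords may differ from the table in [Cov, p. 93], since ties can be broken in several ways.

        import heapq
        from itertools import count

        def huffman_code(probs):
            """Binary Huffman code: repeatedly merge the two least likely symbols."""
            tick = count()   # tie-breaker so the heap never compares dicts
            heap = [(p, next(tick), {i: ""}) for i, p in enumerate(probs)]
            heapq.heapify(heap)
            while len(heap) > 1:
                p0, _, c0 = heapq.heappop(heap)      # two smallest probabilities
                p1, _, c1 = heapq.heappop(heap)
                merged = {s: "0" + w for s, w in c0.items()}
                merged.update({s: "1" + w for s, w in c1.items()})
                heapq.heappush(heap, (p0 + p1, next(tick), merged))
            return heap[0][2]

        probs = [0.25, 0.25, 0.2, 0.15, 0.15]
        code = huffman_code(probs)
        print(code)                                                # one optimal code
        print(sum(p * len(code[i]) for i, p in enumerate(probs)))  # 2.3 bits (up to float rounding)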

  • Huffman Code Example 3

    Example 2. Consider a ternary code for the same random variable as in the previous example (X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively).

    The codewords obtained are shown in the second table on [Cov, p. 93]. The code has an expected length of 1.5 ternary digits.

  • Huffman Code Example 4

    Example 3. If q ≥ 3, we may not have a sufficient number of symbols so that we can combine them q at a time. In such a case, we add dummy symbols to the end of the set of symbols. These dummy symbols have probability 0 and are inserted to fill the tree.

    Note: Since the number of symbols is reduced by q - 1 in each step, we should add dummy symbols to get the total number to be of the form 1 + k(q - 1) for some integer k. The use of dummy symbols is illustrated in the table on [Cov, p. 94]. The code has an expected length of 1.7 ternary digits.
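    A sketch of the q-ary construction with dummy symbols (illustrative code, not from the slides; the function name is made up): pad the distribution with zero-probability symbols until the count has the form 1 + k(q - 1), then merge q symbols at a time. Applied to the distribution of Example 2 it reproduces the expected length of 1.5 ternary digits.

        import heapq
        from itertools import count

        def qary_huffman_lengths(probs, q=3):
            """Codeword lengths of a q-ary Huffman code, padding with
            zero-probability dummy symbols so every merge takes exactly q nodes."""
            m = len(probs)
            probs = list(probs)
            while (len(probs) - 1) % (q - 1) != 0:
                probs.append(0.0)                    # add dummy symbols
            tick = count()                           # tie-breaker for the heap
            heap = [(p, next(tick), [i]) for i, p in enumerate(probs)]
            heapq.heapify(heap)
            lengths = [0] * len(probs)
            while len(heap) > 1:
                total, members = 0.0, []
                for _ in range(q):                   # merge the q least likely nodes
                    p, _, syms = heapq.heappop(heap)
                    total += p
                    members += syms
                    for s in syms:
                        lengths[s] += 1              # each merge adds one more digit
                heapq.heappush(heap, (total, next(tick), members))
            return lengths[:m]                       # drop the dummy symbols again

        probs = [0.25, 0.25, 0.2, 0.15, 0.15]
        lengths = qary_huffman_lengths(probs, q=3)
        print(lengths)                                     # [1, 1, 2, 2, 2]
        print(sum(p * l for p, l in zip(probs, lengths)))  # 1.5 ternary digits (up to float rounding)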


  • Huffman Codes vs. Shannon Codes

    If Shannon codes (which are suboptimal) are used, the codeword length for some particular symbol may be much worse than with Huffman codes.

    Example. Consider X = {1, 2} with probabilities 0.9999 and 0.0001, respectively. Shannon coding then gives codewords of length 1 and 14 bits, respectively, whereas an optimal code has two words of length 1.

  • Huffman Codes vs. Shannon Codes cont.

    Occasionally, it is also possible that the Huffman codeword for a particular symbol is longer than the corresponding codeword of a Shannon code.

    Example. For a random variable with distribution (1/3, 1/3, 1/4, 1/12), the Huffman coding procedure results in codeword lengths (2, 2, 2, 2) or (1, 2, 3, 3) (there are sometimes several optimal codes!), whereas the Shannon coding procedure leads to lengths (2, 2, 2, 4).

    Note: The Huffman code is shorter on the average.

  • Fano Codes

    Fano proposed a suboptimal procedure for constructing a source code. In his method we first order the probabilities in decreasing order. Then we choose k such that

    \left| \sum_{i=1}^{k} p_i - \sum_{i=k+1}^{m} p_i \right|

    is minimized. This point divides the source symbols into two sets of almost equal probability. Assign 0 for the first bit of the upper set and 1 for the lower set. Repeat this process for each subset. This scheme, although not optimal, achieves L(C) ≤ H(X) + 2.
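    A minimal sketch of Fano's construction (illustrative Python, not from the slides; the function name is made up): sort the probabilities, split where the two halves are as balanced as possible, and recurse on each half. For the distribution of Example 1 it happens to give lengths (2, 2, 2, 3, 3) and expected length 2.3 bits.

        def fano_code(probs):
            """Fano's (suboptimal) construction: balance the two halves, recurse."""
            order = sorted(range(len(probs)), key=lambda i: -probs[i])
            sorted_ps = [probs[i] for i in order]
            code = {s: "" for s in order}

            def split(syms, ps):
                if len(syms) <= 1:
                    return
                total = sum(ps)
                # choose k minimizing |sum(ps[:k]) - sum(ps[k:])|
                k = min(range(1, len(ps)),
                        key=lambda k: abs(2 * sum(ps[:k]) - total))
                for i, s in enumerate(syms):
                    code[s] += "0" if i < k else "1"
                split(syms[:k], ps[:k])
                split(syms[k:], ps[k:])

            split(order, sorted_ps)
            return code

        print(fano_code([0.25, 0.25, 0.2, 0.15, 0.15]))
        # {0: '00', 1: '01', 2: '10', 3: '110', 4: '111'}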

  • Shannon-Fano-Elias Coding cont.

    The codeword for x consists of the l(x) first binary decimals of \bar{F}(x), where

    l(x) = \lceil \log(1/p(x)) \rceil + 1.

    The expected length of this code is less than H(X) + 2. Two examples of the construction of such codes are shown in the tables on [Cov, p. 103].

  • Shannon-Fano-Elias Coding

    Shannon-Fano-Elias coding is a simple constructive procedure to allot codewords. Let X = {1, 2, . . . , m} and assume that p(x) > 0 for all x. We define

    F(x) = \sum_{a \le x} p(a),

    \bar{F}(x) = \sum_{a < x} p(a) + p(x)/2.
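    The construction is short in code. The sketch below (illustrative, not from the slides) computes \bar{F}(x), takes l(x) = ⌈log 1/p(x)⌉ + 1, and outputs the first l(x) binary decimals of \bar{F}(x); the dyadic distribution (0.25, 0.5, 0.125, 0.125) is used here purely as an example.

        import math

        def sfe_code(p):
            """Shannon-Fano-Elias codewords for a distribution p over {1, ..., m}."""
            codewords, F = {}, 0.0
            for x, px in enumerate(p, start=1):
                fbar = F + px / 2                        # Fbar(x) = sum_{a<x} p(a) + p(x)/2
                l = math.ceil(math.log2(1 / px)) + 1     # l(x) = ceil(log 1/p(x)) + 1
                bits, frac = "", fbar
                for _ in range(l):                       # first l binary decimals of Fbar(x)
                    frac *= 2
                    bits += "1" if frac >= 1 else "0"
                    frac -= int(frac)
                codewords[x] = bits
                F += px
            return codewords

        print(sfe_code([0.25, 0.5, 0.125, 0.125]))
        # {1: '001', 2: '10', 3: '1101', 4: '1111'}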


  • Choice of Compression Method

    For small source alphabets, we must use long blocks of source symbols to get efficient coding. (For example, with a binary symbol, if each symbol is coded separately, we must always use 1 bit per symbol and no compression is achieved.)

    Huffman codes are optimal, but require the calculation of the probabilities of all source symbols and the construction of the corresponding complete code tree.

    A (good) suboptimal code with computationally efficient algorithms for encoding and decoding is often desired.

    Arithmetic coding fulfills these criteria.

  • Universal Codes

    If we do not know the behavior of the source in advance or the behavior changes, a more sophisticated adaptive arithmetic coding algorithm can be used. Such a code is an example of a universal code. Universal codes are designed to work with an arbitrary source distribution.

    A particularly interesting universal code is the Lempel-Ziv code, which will be considered at a later stage of this course.

  • Data Compression & Coin Flips

    When a random source is compressed into a sequence of bits so that the average length is minimized, the encoded sequence is essentially incompressible, and therefore has an entropy rate close to 1 bit per symbol.

    The bits of the encoded sequence are essentially fair coin flips.

    Let us now go in the opposite direction: How many fair coin flips does it take to generate a random variable X drawn according to some specified probability mass function p?

  • Example: Generating Random Variable

    Suppose we wish to generate a random variable X = {a, b, c} with distribution (1/2, 1/4, 1/4). The answer is obvious.

    If the first bit (coin toss) is 0, let X = a. If the first two bits are 10, let X = b. If the first two bits are 11, let X = c.

    The average number of fair bits required for generating this random variable is (1/2)·1 + (1/4)·2 + (1/4)·2 = 1.5 bits. This is also the entropy of the distribution.
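    As a sketch (illustrative code, not from the slides; the function name is made up), the rule above can be read directly off the bits of a fair coin, and the empirical average number of flips comes out close to 1.5.

        import random

        def generate_x(flip=lambda: random.getrandbits(1)):
            """Generate X in {a, b, c} with probabilities (1/2, 1/4, 1/4)
            from fair coin flips, also returning how many flips were used."""
            if flip() == 0:
                return "a", 1                                  # first bit 0 -> a
            return ("b", 2) if flip() == 0 else ("c", 2)       # 10 -> b, 11 -> c

        samples = [generate_x() for _ in range(100_000)]
        print(sum(bits for _, bits in samples) / len(samples))  # about 1.5 flips on average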

  • Algorithm for Generating Random Variables

    We map (possibly infinite!) strings of bits Z_1, Z_2, . . . to possible outcomes X by a binary tree, where the leaves are marked by output symbols X and the path to the leaves is given by the sequence of bits produced by the fair coin. For example, the tree for the distribution in the previous example, (1/2, 1/4, 1/4), is shown in [Cov, Fig. 5.8].

    Theorem 5.12.1. For any algorithm generating X, the expected number of fair bits T used is greater than or equal to the entropy, that is, ET ≥ H(X).

  • Algorithm for Generating Random Variables cont.

    If the distribution is not dyadic (that is, 2-adic), we first write the probabilities as the (possibly infinite) sum of dyadic probabilities, called atoms. (In fact, this means finding the binary expansions of the probabilities.)

    In constructing the tree, the same approach as in proving the Kraft inequality can be used. An atom of the form 2^{-j} is associated with a leaf at depth j. All the leaves corresponding to atoms of the probability of an output symbol are marked with that symbol.

  • Example: Generating Random Variable cont.

    Let X = {a, b} with the distribution (2/3, 1/3). The binary expansions of the two probabilities are 0.101010. . . and 0.010101. . . , respectively. Hence the atoms are

    2/3 = (1/2, 1/8, 1/32, . . .),    1/3 = (1/4, 1/16, 1/64, . . .).

    The corresponding binary tree is shown in [Cov, Fig. 5.9].
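    The atoms are just the binary expansion of each probability, read off digit by digit. A small sketch (illustrative, not from the slides) using exact fractions:

        from fractions import Fraction

        def dyadic_atoms(p, n_terms=6):
            """First n_terms dyadic atoms 2**(-j) in the binary expansion of p."""
            atoms, j = [], 0
            while len(atoms) < n_terms and p > 0:
                j += 1
                p *= 2
                if p >= 1:                      # the j-th binary digit of p is 1
                    atoms.append(Fraction(1, 2 ** j))
                    p -= 1
            return atoms

        print([str(a) for a in dyadic_atoms(Fraction(2, 3))])  # ['1/2', '1/8', '1/32', ...]
        print([str(a) for a in dyadic_atoms(Fraction(1, 3))])  # ['1/4', '1/16', '1/64', ...]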

  • Bounding The Expected Depth of The Tree

    Theorem 5.11.3. The expected number of fair bits required by the optimal algorithm to generate a random variable X lies between H(X) and H(X) + 2:

    H(X) ≤ ET ≤ H(X) + 2.

  • Background for Universal Source Coding

    Huffman coding compresses an i.i.d. source with a known distribution p(x) to its entropy limit H(X). However, if the code is designed for another distribution q(x), a penalty of D(p‖q) is incurred.

    Huffman coding is sensitive to the assumed distribution.

    What can be achieved if the true distribution p(x) is unknown? Is there a universal code with rate R that suffices to describe every i.i.d. source with entropy H(X) < R? Yes!

  • Fixed Rate Block Codes

    A fixed rate block code of rate R for a source X_1, X_2, . . . , X_n which has an unknown distribution Q consists of two mappings, the encoder

    f_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\},

    and the decoder,

    \phi_n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n.

  • Universal Source Codes

    The probability of error for the code with respect to the distribution Q is

    P_e^{(n)} = Q^n\{(X_1, \ldots, X_n) : \phi_n(f_n(X_1, \ldots, X_n)) \ne (X_1, \ldots, X_n)\}.

    A rate R block code for a source is called universal if the functions f_n and \phi_n do not depend on the distribution Q and if P_e^{(n)} → 0 as n → ∞ when H(Q) < R.

    Theorem 12.3.1. There exists a sequence of (n, 2^{nR}) universal source codes such that P_e^{(n)} → 0 as n → ∞ for every source Q such that H(Q) < R.

  • Universal Coding Schemes

    One universal coding scheme is given in the proof of [Cov, Theorem 12.3.1]. That scheme, which is due to Csiszár and Körner, is universal over the set of i.i.d. distributions. We shall look in detail at another algorithm, the Lempel-Ziv algorithm, which is a variable rate universal code.

    Q: If universal codes also reach the limit given by the entropy, why do we need Huffman and similar codes (which are specific to a probability distribution)?

    A: Universal codes need longer block lengths for the same performance, and their encoders and decoders are more complex.

  • Lempel-Ziv Coding

    Most universal compression algorithms used in the real world are based on algorithms developed by Lempel and Ziv, and we therefore talk about Lempel-Ziv (LZ) coding. LZ algorithms are good at compressing data that cannot be modeled simply, such as (note that LZ also compresses sources other than i.i.d. sources)

    English text, and
    computer source code.

    Computer compression programs, such as compress, gzip, and WinZip, and the GIF format are based on LZ coding.

  • Lempel-Ziv Coding cont.

    The following are the main variants of Lempel-Ziv coding.

    LZ77 Also called sliding window Lempel-Ziv.
    LZ78 Also called dictionary Lempel-Ziv. Described in the textbook, in [Cov, Sect. 12.10].
    LZW Another variant, not described here.

    With these algorithms, text with any alphabet size can be compressed. Common sizes are, for example, 2 (binary sequences) and 256 (computer files consisting of a sequence of bytes).

  • LZ78

    In the description of the algorithm, we act on the string 1011010100010. The algorithm is as follows:

    1. Parse the source into strings that have not appeared so far: 1, 0, 11, 01, 010, 00, 10.

    2. Code a substring as (i, c), where i is the index of the longest previously seen prefix (indices start from 1; 0 denotes the empty string) and c is the value of the additional character: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0).

    To express the location i (in the example, an integer between 0 and 4), we need ⌈log(c(n) + 1)⌉ bits, where c(n) is the number of parsed phrases.
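    A compact sketch of steps 1 and 2 together (illustrative code, not from the slides; the function name is made up); on the example string it reproduces exactly the pairs listed above.

        def lz78_parse(s):
            """LZ78 parsing: return (index, char) pairs, index 0 = empty phrase."""
            dictionary = {"": 0}          # phrase -> index
            pairs, phrase = [], ""
            for c in s:
                if phrase + c in dictionary:
                    phrase += c           # keep extending the current match
                else:
                    pairs.append((dictionary[phrase], c))
                    dictionary[phrase + c] = len(dictionary)
                    phrase = ""
            return pairs                  # (a trailing repeated phrase is ignored here)

        print(lz78_parse("1011010100010"))
        # [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]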

  • LZ77

    We again use the example string 1011010100010. In each step, we proceed as follows:

    1. Find p, the relative position of the longest match (the length of which is denoted by l).

    2. Output (p, l, c), where c is the first character that does not match.

    3. Advance l + 1 positions.

    Applied to the example string, the first outputs are (0, 0, 1), (0, 0, 0), (2, 1, 1), (3, 2, 0), (2, 2, 0), . . .
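    A brute-force sketch of this greedy parse (illustrative code, not from the slides, with no window bound and overlapping matches allowed); its first five triples agree with the example above.

        def lz77_parse(s):
            """Greedy LZ77 parse: emit (p, l, c) with p = distance back to the
            longest match, l = match length, c = first non-matching character."""
            out, i, n = [], 0, len(s)
            while i < n:
                best_p, best_l = 0, 0
                for p in range(1, i + 1):            # try every backward distance
                    l = 0
                    # allow the match to overlap position i, but always
                    # leave at least one literal character to emit as c
                    while i + l < n - 1 and s[i - p + l] == s[i + l]:
                        l += 1
                    if l > best_l:
                        best_p, best_l = p, l
                out.append((best_p, best_l, s[i + best_l]))
                i += best_l + 1                      # advance l + 1 positions
            return out

        print(lz77_parse("1011010100010"))
        # [(0,0,'1'), (0,0,'0'), (2,1,'1'), (3,2,'0'), (2,2,'0'), (4,2,'0')]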

  • Some Details of The Algorithms

    Several details have to be taken into account; we give one such example for each of the algorithms:

    LZ77 To avoid large values for the relative positions of strings: do not go too far back (sliding window!).
    LZ78 To avoid too large dictionaries: reduce the size in one of several possible ways, for example, throw the dictionary away when it reaches a certain size (GIF does this).

    Traditionally, LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

  • Run-Length Coding

    A compression method that does not reach the entropy bound but is used in many applications, including fax machines, is run-length coding. In this method, the input sequence is compressed by identifying adjacent symbols of equal value and replacing them with a single symbol and a count.

    Example. 111111110100000 → (1, 8), (0, 1), (1, 1), (0, 5). Clearly, in the binary case, it suffices to give the first bit and then only the lengths of the runs.
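    In Python the transformation is essentially a one-liner around itertools.groupby (an illustrative sketch, not from the slides):

        from itertools import groupby

        def run_length_encode(s):
            """Replace each run of equal symbols by (symbol, run length)."""
            return [(sym, len(list(run))) for sym, run in groupby(s)]

        print(run_length_encode("111111110100000"))
        # [('1', 8), ('0', 1), ('1', 1), ('0', 5)]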

  • Optimality of Lempel-Ziv Coding

    Theorem 12.10.2. Let {X_i} be a stationary ergodic stochastic process. Let l(X_1, X_2, . . . , X_n) be the length of the Lempel-Ziv codeword associated with X_1, X_2, . . . , X_n. Then

    \limsup_{n \to \infty} \frac{1}{n} l(X_1, X_2, \ldots, X_n) \le H(X) with probability 1,

    where H(X) is the entropy rate of the process. From the examples given, it is obvious that these compression methods are efficient only for long input sequences.

  • Lossy Compression

    So far, only lossless compression has been considered. In lossy compression, loss of information is allowed:

    scalar quantization Take the set of possible messages S and reduce it to a smaller set S' with a mapping f : S → S'. For example, least significant bits are dropped.

    vector quantization Map a multidimensional space S into a smaller set S' of messages.

    transform coding Transform the input into a different form that can be more easily compressed (in a lossy or lossless way).

    Lossy methods include JPEG (still images) and MPEG (video).

  • Rate Distortion Theory

    The effects of quantization are studied in rate distortion theory, the basic problem of which can be stated as follows:

    Q: Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate?

    Q: (Equivalent) What is the minimum rate description required to achieve a particular distortion?

    It turns out that, perhaps surprisingly, it is more efficient to describe two (even independent!) variables jointly than individually. Rate distortion theory can be applied to both discrete and continuous random variables.

  • Example: Quantization

    We look at the basic problem of representing a single continuous random variable by a finite number of bits. Denote the random variable by X and the representation of X by X̂(X). With R bits, the function X̂ can take on 2^R values.

    What is the optimum set of values for X̂ and the regions associated with each value of X̂?

  • Example: Quantization cont.

    With X ~ N(0, σ²) and a squared error distortion measure, we wish to find a function X̂ that takes on at most 2^R values (these are called reproduction points) and minimizes E(X - X̂(X))².

    With one bit, obviously the bit should distinguish whether X < 0 or not. To minimize squared error, each reproduced symbol should be at the conditional mean of its region (see [Cov, Fig. 13.1]), and we have

    X̂(x) = +√(2/π) σ if x ≥ 0, and -√(2/π) σ if x < 0.

  • Example: Quantization cont.

    With two or more bits to represent the sample, the situation gets far more complicated. The following facts state simple properties of optimal regions and reconstruction points.

    Given a set of reconstruction points, the distortion is minimized by mapping a source random variable X to the representation X̂ that is closest to it. The set of regions defined by this mapping is called a Voronoi or Dirichlet partition defined by the reconstruction points.

    The reconstruction points should minimize the conditional expected distortion over their respective assignment regions.

  • Example: Quantization cont.

    The aforementioned properties enable algorithms to find good quantizers: the Lloyd algorithm (for real-valued random variables) and the generalized Lloyd algorithm (for vector-valued random variables). Starting from any set of reconstruction points, repeat the following steps.

    1. Find the optimal set of reconstruction regions.
    2. Find the optimal reconstruction points for these regions.

    The expected distortion is decreased at each stage in this algorithm, so it converges to a local minimum of the distortion.
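    The two steps translate directly into a sample-based sketch (illustrative code, not from the slides; the data, function name, and parameters are made up). For a standard normal sample and two reconstruction points it converges to roughly ±√(2/π) ≈ ±0.80, matching the one-bit quantizer discussed earlier.

        import numpy as np

        def lloyd(samples, n_points, iters=50):
            """Lloyd's algorithm for a scalar quantizer under squared error:
            alternate nearest-point regions and conditional-mean points."""
            points = np.quantile(samples, np.linspace(0.1, 0.9, n_points))
            for _ in range(iters):
                # 1. Optimal regions: map each sample to its closest point.
                idx = np.argmin(np.abs(samples[:, None] - points[None, :]), axis=1)
                # 2. Optimal points: conditional mean over each region.
                for j in range(n_points):
                    if np.any(idx == j):
                        points[j] = samples[idx == j].mean()
            return np.sort(points)

        rng = np.random.default_rng(0)
        x = rng.standard_normal(200_000)
        print(lloyd(x, 2))   # approximately [-0.80, 0.80]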