
Page 1

Lecture 4: Lossless Compression (1)

Hongli Luo

Fall 2011

Page 2

Lecture 4: Lossless Compression (1)

Topics (Chapter 7)
• Introduction
• Basics of Information Theory
• Compression techniques
  • Lossless compression
  • Lossy compression

Page 3

4.1 Introduction

Compression: the process of coding that will effectively reduce the total number of bits needed to represent certain information.

Page 4

Introduction

Compression ratio = B0 / B1, where

B0 – number of bits before compression

B1 – number of bits after compression
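
A minimal sketch of this definition (illustrative, not part of the slides):

```python
def compression_ratio(b0: int, b1: int) -> float:
    """Compression ratio B0/B1: bits before vs. bits after compression."""
    return b0 / b1

# Example: 40 bits before, 10 bits after -> ratio 4.0
print(compression_ratio(40, 10))
```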

Page 5

Types of Compression

• Lossless compression
  • Does not lose information – the signal can be perfectly reconstructed after decompression
  • Produces a variable bit rate
  • Is not guaranteed to actually reduce the data size (depends on the characteristics of the data)
  • Example: WinZip
• Lossy compression
  • Loses some information – the signal is not perfectly reconstructed after decompression
  • Can produce any desired constant bit rate
  • Examples: JPEG, MPEG

Page 6

4.2 Basics of Information Theory

Model Information at the Source

• Model data at the source as a stream of symbols
  – This defines the "vocabulary" of the source.
• Each symbol in the vocabulary is represented by bits. If the vocabulary has N symbols, each symbol is represented with log2 N bits.
  • Text by ASCII code (8 bits/code): N = 2^8 = 256 symbols
  • Speech (16 bits/sample): N = 2^16 = 65,536 symbols
  • Color image (3 × 8 bits/pixel): N = 2^24 ≈ 17 × 10^6 symbols
  • 8×8 image blocks (8 × 64 = 512 bits/block): N = 2^512 ≈ 10^154 symbols
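
These vocabulary sizes follow directly from the bits-per-symbol counts; a quick illustrative check (not from the slides):

```python
import math

# Bits per symbol b gives a vocabulary of N = 2^b symbols.
for name, b in [("ASCII text", 8), ("speech sample", 16),
                ("color pixel", 24), ("8x8 image block", 512)]:
    N = 2 ** b
    print(f"{name}: N = 2^{b} ~ 10^{int(math.log10(N))}")
```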

Page 7

Lossless Compression

• Lossless compression techniques ensure no loss of data after compression/decompression.
• Coding: "translate" each symbol in the vocabulary into a binary codeword. Codewords may have different binary lengths.
• Example: you have 4 symbols (a, b, c, d). Each can be represented in binary using 2 bits, but coded using a different number of bits:

  a (00) -> 000
  b (01) -> 001
  c (10) -> 01
  d (11) -> 1

The goal of coding is to minimize the average symbol length.
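
For concreteness, a minimal sketch (illustrative, not from the slides) that encodes and decodes with the 4-symbol variable-length code above; the code is prefix-free, so decoding is unambiguous:

```python
CODE = {"a": "000", "b": "001", "c": "01", "d": "1"}  # the code from the example

def encode(symbols: str) -> str:
    return "".join(CODE[s] for s in symbols)

def decode(bits: str) -> str:
    # Prefix-free code: accumulate bits until they match exactly one codeword.
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

msg = "dddcba"
assert decode(encode(msg)) == msg
print(encode(msg))  # 11 bits, versus 12 bits at a fixed 2 bits/symbol
```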

Page 8

Average Symbol Length

• The vocabulary of the source has N symbols
• l(i) – binary length of the ith symbol
• m(i) – number of times symbol i has been emitted
• M – total number of symbols that the source emits (in every T seconds)
• Number of bits emitted in T seconds: Σi m(i) l(i)
• Probability P(i) of a symbol: the fraction of times it occurs in the transmission, defined as P(i) = m(i) / M

Page 9

Average Symbol Length

Average length per symbol (average symbol length):

l̄ = Σi P(i) l(i)  (bits/symbol)

Average bit rate:

R = M · l̄ / T  (bits/second)
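
A small sketch of both formulas (the probabilities and symbol rate are assumed for illustration, not taken from the slides):

```python
# Assumed probabilities for the earlier 4-symbol code (not from the slides).
P = {"a": 0.125, "b": 0.125, "c": 0.25, "d": 0.5}
L = {"a": 3, "b": 3, "c": 2, "d": 1}      # codeword lengths from the example

avg_len = sum(P[s] * L[s] for s in P)     # l̄ = Σ P(i) l(i) = 1.75 bits/symbol
M, T = 8000, 1.0                          # assumed: 8000 symbols per second
print(avg_len, M * avg_len / T)           # 1.75, 14000.0 bits/second
```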

Page 10

Minimum Average Symbol Length

• The goal of compression is to minimize the number of bits being transmitted, which is equivalent to minimizing the average symbol length.
• How to reduce the average symbol length:
  • Assign shorter codewords to symbols that appear more frequently
  • Assign longer codewords to symbols that appear less frequently

Page 11

Minimum Average Symbol Length

• What is the lower bound of the average symbol length? It is decided by the entropy.
• Shannon's Information Theorem (the source coding theorem): the average binary length of the encoded symbols is always greater than or equal to the entropy H of the source.

Page 12

Entropy

The entropy η of an information source with alphabet S = {s1, s2, …, sn} is:

η = H(S) = Σi pi log2(1/pi) = −Σi pi log2 pi

pi – probability that symbol si will occur in S.

log2(1/pi) indicates the amount of information (self-information as defined by Shannon) contained in si, which corresponds to the number of bits needed to encode si.
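
A minimal sketch of the entropy formula (illustrative, not from the slides), evaluated for a uniform and a skewed distribution:

```python
import math

def entropy(probs):
    """H(S) = Σ p_i log2(1/p_i); zero-probability symbols contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.25] * 4))                  # uniform: log2(4) = 2.0 bits
print(entropy([0.97, 0.01, 0.01, 0.01]))    # skewed: ~0.24 bits, much smaller
```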

Page 13

Entropy

• The entropy is a characteristic of a given source of symbols.
• Entropy is largest (equal to log2 N) when all symbols are equally probable, i.e., the chances that each symbol appears are similar and the symbols are uniformly distributed in the source.
• Entropy is small (but always ≥ 0) when some symbols are much more likely to appear than other symbols.

Page 14

Entropy and Code Length

• The entropy represents the average amount of information contained per symbol in the source S.
• The entropy specifies the lower bound for the average number of bits to code each symbol in S, i.e., η ≤ l̄, where l̄ is the average length (measured in bits) of the codewords produced by the encoder.
• Efficiency of the encoder: η / l̄
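
Putting the two quantities together (an illustrative sketch with an assumed distribution, not from the slides): for probabilities {1/2, 1/4, 1/8, 1/8} and a prefix code with lengths {1, 2, 3, 3}, the average length equals the entropy, so the efficiency is 1.0:

```python
import math

P = [0.5, 0.25, 0.125, 0.125]   # assumed symbol probabilities
L = [1, 2, 3, 3]                # codeword lengths of a matching prefix code

eta = sum(p * math.log2(1.0 / p) for p in P)   # entropy η = 1.75 bits/symbol
l_bar = sum(p * l for p, l in zip(P, L))       # average length l̄ = 1.75 bits
print(eta / l_bar)                             # encoder efficiency η / l̄ = 1.0
```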

Page 15

Distribution of Gray-Level Intensities

Fig. 7.2(a) shows the histogram of an image with a uniform distribution of gray-level intensities, i.e., pi = 1/256.

Hence, the entropy of this image is η = log2 256 = 8 bits per pixel.

– No compression is possible for this image (its average code length cannot go below the 8 bits already used)!

Page 16

4.3 Compression Techniques

Compression techniques are broadly classified into:

• Lossless compression
  • Run-length encoding
  • Variable-length coding (entropy coding):
    – Shannon-Fano algorithm
    – Huffman coding
    – Adaptive Huffman coding
  • Arithmetic coding
  • LZW
• Lossy compression

Page 17

Run-length Encoding

• A sequence of elements c1, c2, …, ci, … is mapped to runs (ci, li), where
  • ci = the symbol
  • li = the length of symbol ci's run
• For example, given the sequence of symbols {1,1,1,1,3,3,4,4,4,3,3,5,2,2,2}, the run-length encoding is (1,4), (3,2), (4,3), (3,2), (5,1), (2,3)
• To apply run-length encoding to a bi-level image (with only 1-bit black and white pixels):
  • Assume the starting run is of a particular color (either black or white)
  • Code the length of each run
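
A minimal run-length encoder (an illustrative sketch, not from the slides) that reproduces the example above:

```python
from itertools import groupby

def run_length_encode(seq):
    """Map a sequence c1, c2, ... to runs (ci, li)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(seq)]

seq = [1, 1, 1, 1, 3, 3, 4, 4, 4, 3, 3, 5, 2, 2, 2]
print(run_length_encode(seq))
# [(1, 4), (3, 2), (4, 3), (3, 2), (5, 1), (2, 3)]
```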

Page 18

Variable Length Coding

• VLC generates variable-length codewords from fixed-length symbols.
• VLC is one of the best-known entropy coding methods.
• Methods of VLC:
  • Shannon-Fano algorithm
  • Huffman coding
  • Adaptive Huffman coding

Page 19

Shannon-Fano Algorithm

A top-down approach. Steps:

1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.

An example: coding of "HELLO"

• Sort the symbols according to their frequencies: L, H, E, O

Page 20

Assign bit 0 to the left branches and 1 to the right branches of the coding tree.
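
A sketch of the Shannon-Fano procedure applied to "HELLO" (an illustrative implementation, not the slides' own code); balancing the counts at each split yields a 10-bit encoding, matching the totals on the next page:

```python
from collections import Counter

def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs sorted by descending count."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # Choose the split point that best balances the counts of the two parts.
    total = sum(count for _, count in symbols)
    running, best_k, best_diff = 0, 1, float("inf")
    for k in range(1, len(symbols)):
        running += symbols[k - 1][1]
        diff = abs(total - 2 * running)        # |left total - right total|
        if diff < best_diff:
            best_k, best_diff = k, diff
    codes = {s: "0" + c for s, c in shannon_fano(symbols[:best_k]).items()}
    codes.update({s: "1" + c for s, c in shannon_fano(symbols[best_k:]).items()})
    return codes

freq = Counter("HELLO").most_common()   # [('L', 2), ('H', 1), ('E', 1), ('O', 1)]
codes = shannon_fano(freq)              # {'L': '0', 'H': '10', 'E': '110', 'O': '111'}
print(sum(len(codes[ch]) for ch in "HELLO"))   # 10 coded bits in total
```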

Page 21

• Coded bits: 10 bits
• Raw datawords (8 bits per character × 5 characters): 40 bits
• Compression ratio: B0 / B1 = 40 / 10 = 4 (the coded data is 25% of the original size)