lossless compression basics
TRANSCRIPT
-
8/3/2019 Lossless Compression Basics
1/42
Data Compression
By
Ziya Arnavut
Department of Computer and Information Sciences
SUNY Fredonia
12-20-2011
-
A tremendous amount of data is communicated every day.
Example: the World-Wide-Web: many people surf the net.
People communicate over the internet using software such as Skype.
The transmission time is related to a) the amount (size) of the data and b) the channel's capacity.
Can we reduce the transmission time?
Of course.
a) Reduce the size of the data. How? Use a suitable compression technique.
b) Increase the channel capacity. For example: 100 MB → 1 GB.
c) Or, utilize both (a) and (b).
-
Can we reduce the size of data?
Yes, using compression techniques.
Two main data compression techniques:
1. Lossless (noiseless) techniques.
Examples: text, medical and remote-sensing imagery.
2. Lossy (noisy) techniques.
Examples: images, sound and some other multimedia applications.
Lossy Techniques
Data compressed by lossy techniques are not exactly recoverable.
In many applications, this feature helps to increase the channel throughput.
For example, JPEG images: the compressed image may have significant loss relative to the original, but this does not cause a problem, since humans can often comprehend things even when there is noise.
Hence, depending on the application, lossy techniques may be used to increase the channel throughput.
-
Original image. Size on disk: 2.25 MBytes.
85% JPEG-compressed image. Size on disk: 267,460 bytes.
-
A Lossy + Lossless technique:
Color-mapped (Palette) Images
To meet transmission, storage or, most often, display restrictions, there is sometimes a need to restrict the number of colors in an image. Since images are acquired with a high number of different colors, a color-reduction step is needed. This step is known as color quantization, and several techniques have been proposed.
A color-quantized image is a matrix of indices, where each index i corresponds to a triplet (Ri, Gi, Bi) in the color-map table of the image. Color-quantized images are also known as pseudo-color, color-mapped or palette images.
Index R G B
0 28 0 1
1 19 2 5
2 34 1 1
3 39 2 3
4 44 0 2
.. .. .. ..
254 193 211 223
255 206 212 222
-
Examples: Graphic Interchange Format (GIF) compressed images
Frymier image. Size: 1,238,678 bytes; GIF: 229,930 bytes.
Ben and Jerry. Size: 28,326 bytes; GIF: 4,387 bytes.
Yahoo image. Size: 28,110 bytes; GIF: 6,968 bytes.
-
Lossless Coding
A lossless compression scheme has two components:
1) Modeling
2) Coding
First, I will address coding.
Let A be an alphabet, a collection of distinct symbols.
Let S = s1 s2 ... sn be a sequence from an alphabet A. That is, S is a data string.
Example: From the English alphabet {a..z, A..Z, space}, "This is an example" is a data string.
-
Assigning binary sequences to individual alphabet elements is called encoding.
The set of binary sequences resulting from an encoding is called a code, C = {c1, c2, ..., cn}.
An element of a code is called a code-word (i.e., ci ∈ C).
For example, the ASCII code consists of 128 code-words.
Each code-word has 7 bits. An 8th bit is appended for parity checking, or other control purposes.
Example: A → 1000001, B → 1000010.
-
Fixed-Length Codes:
Example: ASCII codes.
Do we gain by using fixed-length codes? No! Why not?
Instead, use techniques similar to Morse telegraph codes. That is, assign short code-words (fewer bits) to the characters that appear more often, and longer code-words to the characters that appear less frequently.
This is called variable-length coding.
-
Question: Why does this work?
Most often, the frequency distribution of the letters in a data string is far from uniform. Example: in an English text, the most frequently occurring letter is e.
If a data source or string has a uniform distribution, variable-length coding techniques do not work.
For an independent data source S with probabilities of occurrence p(s1), ..., p(sm), the zero-order entropy is
H(S) = -Σi p(si) · log2 p(si) bps
-
The entropy of a source yields a lower bound on the encoding cost.
Two well-known variable-length coding techniques are Huffman and Arithmetic Coding. They can code a data string close to or equal to its entropy.
Example: Consider a data string with 64 characters from the alphabet {a, ..., z}:
S = aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee
The zero-order entropy of this string is 2.26 bps. Hence, at best we can code the data in 64 × 2.26 = 144.64 bits.
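The zero-order entropy of the example string can be checked with a short sketch (Python is used here for illustration):

```python
import math
from collections import Counter

def zero_order_entropy(s):
    """Zero-order entropy H(S) = -sum_i p(s_i) * log2 p(s_i), in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

S = ("aaaaaaaaaabbbbbbbccegiidffgii"
     "aaabaaaabbccccaaccaaaabbaaaaaaabeee")
H = zero_order_entropy(S)
print(len(S), round(H, 2))  # 64 symbols, about 2.26 bps
```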
-
Huffman Coding: merge the two least probable characters into one, and repeat this process until only one character remains.
For our string, here is the probability distribution:

Freq  si  P(si)
30    a   0.469
13    b   0.203
 8    c   0.125
 1    d   0.016
 4    e   0.063
 4    i   0.063
 2    g   0.031
 2    f   0.031
64        1.000

After sorting:

si  P(si)
a   0.469
b   0.203
c   0.125
e   0.063
i   0.063
g   0.031
f   0.031
d   0.016
-
Building the Huffman Tree (merging the two smallest probabilities at each step):

Step 0: a 0.469, b 0.203, c 0.125, e 0.063, i 0.063, g 0.031, f 0.031, d 0.016
Step 1 (f + d → 0.047): a 0.469, b 0.203, c 0.125, e 0.063, i 0.063, 0.047, g 0.031
Step 2 (g + 0.047 → 0.078): a 0.469, b 0.203, c 0.125, 0.078, e 0.063, i 0.063
Step 3 (e + i → 0.126): a 0.469, b 0.203, 0.126, c 0.125, 0.078
Step 4 (c + 0.078 → 0.203): a 0.469, b 0.203, 0.203, 0.126
Step 5 (0.203 + 0.126 → 0.329): a 0.469, 0.329, b 0.203
Step 6 (0.329 + b → 0.532): 0.532, a 0.469
Step 7 (a + 0.532 → 1.0): done
-
Huffman Tree & Code

si  P(si)  Code    Bits
a   0.469  1        30
b   0.203  01       26
c   0.125  0000     32
e   0.063  0010     16
i   0.063  0011     16
g   0.031  00011    10
f   0.031  000100   12
d   0.016  000101    6
Total # of bits:    148

Hence, entropy gives us a lower bound on the number of bits needed to encode a data string.
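A minimal sketch of the merge procedure, using Python's heapq module (the implementation details are mine). For the 64-character example string, any optimal code costs 148 bits in total, a few bits above the 144.64-bit entropy bound:

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: frequency} map by repeatedly
    merging the two least probable subtrees; each merge prepends '0' to
    one side's codes and '1' to the other's."""
    # Heap of (weight, tiebreaker, {symbol: code-so-far}) entries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

S = ("aaaaaaaaaabbbbbbbccegiidffgii"
     "aaabaaaabbccccaaccaaaabbaaaaaaabeee")
code = huffman_code(Counter(S))
total_bits = sum(len(code[s]) for s in S)  # 148 bits for this string
```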
-
Can we do better?
Consider attaching integer values to the symbols in S. Then apply the difference operator H:
HS = s1, s2 - s1, s3 - s2, ..., s64 - s63.
Note: S can be recovered easily.
The frequency distribution of the new sequence HS is:

letter     0   1  2  -1  -2  3  -5  -8
frequency  42  9  6   2   1  1   1   1
-
Using Huffman coding we may encode HS as follows:
0 → 0, 1 → 10, 2 → 110,
-1 → 1110, -2 → 111100,
3 → 111101, -8 → 111110,
and -5 → 111111. Total bits:
42·1 + 9·2 + ... + 1·6 + 1·6 + 1·6 = 110
A rate of r = 110/64 = 1.71875 bps.
The H operator is an example of a decorrelation step, which is used in modeling the data.
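The H operator and its inverse can be sketched as below. The mapping of letters to integers (a → 1, ..., i → 9) is an assumption for illustration, since the slides do not fix one:

```python
def delta_encode(s):
    """H operator: keep the first element, then successive differences."""
    return [s[0]] + [s[i] - s[i - 1] for i in range(1, len(s))]

def delta_decode(hs):
    """Invert the H operator by a running (prefix) sum."""
    out = [hs[0]]
    for d in hs[1:]:
        out.append(out[-1] + d)
    return out

S = "aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee"
vals = [ord(ch) - ord("a") + 1 for ch in S]  # assumed mapping a->1, ..., i->9
hs = delta_encode(vals)
# Runs of equal symbols become runs of zeros, which compress well.
```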
-
The aim of the decorrelation step is to remove redundancies from the data. Note that this yields a further gain of approximately 0.6 bps.
Hence, by considering relationships among the data elements, we can obtain better compression.
Research in lossless compression has focused on modeling the data source in order to exploit the correlation among data elements.
-
Arithmetic Coding
Suppose we have an alphabet (a, b, c), with probabilities of occurrence (0.7, 0.1, 0.2). Each symbol may be assigned to the following range based on its probability:

Sample Symbol Ranges
Symbol  Probability  Range
a       70%          [0.00, 0.70)
b       10%          [0.70, 0.80)
c       20%          [0.80, 1.00)
-
Encoding with an Arithmetic Coder
The pseudocode below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds.

lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range × upper bound of new symbol)
    lower bound = lower bound + (current range × lower bound of new symbol)
end while

Any value between the computed lower and upper probability bounds now encodes the input string.
-
Example: To encode "abc"
Encode 'a':
current range = 1 - 0 = 1
upper bound = 0 + (1 × 0.70) = 0.7
lower bound = 0 + (1 × 0.00) = 0.0
Encode 'b':
current range = 0.7 - 0.0 = 0.7
upper bound = 0.0 + (0.7 × 0.80) = 0.56
lower bound = 0.0 + (0.7 × 0.70) = 0.49
Encode 'c':
current range = 0.56 - 0.49 = 0.07
upper bound = 0.49 + (0.07 × 1.00) = 0.56
lower bound = 0.49 + (0.07 × 0.80) = 0.546
The string "abc" may be encoded by any value within the probability range [0.546, 0.56), for example 0.55.
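The encoding pseudocode can be written out as a runnable sketch that reproduces the worked example:

```python
def ar_encode(message, ranges):
    """Narrow the interval [low, high) once per symbol:
    new bounds = low + span * (symbol's range bounds)."""
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        sym_low, sym_high = ranges[sym]
        high = low + span * sym_high
        low = low + span * sym_low
    return low, high

# Ranges from the symbol-range table above.
ranges = {"a": (0.0, 0.7), "b": (0.7, 0.8), "c": (0.8, 1.0)}
low, high = ar_encode("abc", ranges)
# Any value in [low, high), e.g. 0.55, encodes "abc".
```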
-
Figure: Encoding the string "abc" with AC. The interval [0.0, 1.0) is split at 0.7 and 0.8; choosing a narrows it to [0.0, 0.7), split at 0.49 and 0.56; choosing b narrows it to [0.49, 0.56); choosing c narrows it to [0.546, 0.56).
-
Decoding Strings

encoded value = encoded input
while string is not fully decoded
    identify the symbol containing the encoded value within its range
    // remove the effects of the symbol from the encoded value
    current range = upper bound of symbol - lower bound of symbol
    encoded value = (encoded value - lower bound of symbol) / current range
end while
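A matching decoder sketch; note that the decoder must know the message length (or use a terminator symbol) to know when to stop:

```python
def ar_decode(value, ranges, n):
    """Decode n symbols: find the symbol whose range contains the value,
    then rescale the value to remove that symbol's effect."""
    out = []
    for _ in range(n):
        for sym, (lo, hi) in ranges.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)  # rescale into [0, 1)
                break
    return "".join(out)

ranges = {"a": (0.0, 0.7), "b": (0.7, 0.8), "c": (0.8, 1.0)}
decoded = ar_decode(0.55, ranges, 3)  # recovers "abc"
```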
-
Figure: Decoding 0.55. In [0.0, 1.0), 0.55 falls in a's range [0.0, 0.7); within that interval it falls in b's range [0.49, 0.56); within that it falls in c's range [0.546, 0.56). Hence 0.55 decodes to "abc".
-
Universal Lossless Compressors
Dictionary-Based Algorithm (Ziv-Lempel): Encoding Algorithm
1. Initialize the dictionary to contain all blocks of length one (D = {a, b}).
2. Search for the longest block W which has appeared in the dictionary.
3. Encode W by its index in the dictionary.
4. Add W followed by the first symbol of the next block to the dictionary.
5. Go to Step 2.
-
The following example illustrates how the
encoding is performed.
Data: a b b a a b b a a b a b b a a a a b a a b b a
0 1 1 0 2 4 2 6 5 5 7 3 0
Dictionary
Index Entity Index Entity
0 a 7 b a a
1 b 8 a b a
2 a b 9 a b b a
3 b b 10 a a a
4 b a 11 a a b
5 a a 12 b a a b
6 a b b 13 b b a
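Steps 1 to 5 can be sketched as follows (the helper name lz_encode is mine); running it on the example string reproduces the index sequence and dictionary shown above:

```python
def lz_encode(data, alphabet):
    """Greedy longest-match dictionary coder: emit the index of the
    longest known block W, then add W plus the next symbol to the dictionary."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}  # step 1
    out = []
    i = 0
    while i < len(data):
        # Step 2: extend the match while it stays in the dictionary.
        w = data[i]
        while i + len(w) < len(data) and w + data[i + len(w)] in dictionary:
            w += data[i + len(w)]
        out.append(dictionary[w])                  # step 3: emit index
        if i + len(w) < len(data):                 # step 4: extend dictionary
            dictionary[w + data[i + len(w)]] = len(dictionary)
        i += len(w)
    return out

codes = lz_encode("abbaabbaababbaaaabaabba", "ab")  # the slide's data string
```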
-
The size of the dictionary can grow infinitely large.
In practice, the dictionary size is limited. Once the limit is reached, no more entries are added.
For example, a dictionary of size 4096 corresponds to 12 bits per index.
Various implementations of the Ziv-Lempel algorithm exist. Gzip (GNU zip) is free software available on the internet.
-
Burrows-Wheeler Transformation
Let w = [3, 1, 3, 1, 2] be a data string. Construct

     3 1 3 1 2
     1 3 1 2 3
M =  3 1 2 3 1
     1 2 3 1 3
     2 3 1 3 1

by forming the successive rows of M, which are consecutive cyclic left-shifts of w.
-
By sorting the rows of M lexically we transform it to

     1 2 3 1 3
     1 3 1 2 3
M =  2 3 1 3 1
     3 1 2 3 1
     3 1 3 1 2

Let the last column of M be denoted by L.
-
Note that the original data string w is the 5th row of M.
Given the row index I = 5 of w in M and L = [3, 3, 1, 1, 2], we can recover w. How?
We know the last column L, and sorting L gives the first column of M:

     1 _ _ _ 3
     1 _ _ _ 3
M =  2 _ _ _ 1
     3 _ _ _ 1
     3 _ _ _ 2
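A naive sketch of both directions of the transformation (a practical implementation avoids materializing the whole matrix):

```python
def bwt(w):
    """Forward BWT: sort all cyclic left-shifts, return the last column
    and the (1-based, as on the slide) row index of the original string."""
    n = len(w)
    rows = sorted(w[k:] + w[:k] for k in range(n))
    last = [row[-1] for row in rows]
    return last, rows.index(w) + 1

def inverse_bwt(last, index):
    """Recover w: repeatedly prepend L as a new first column and re-sort."""
    n = len(last)
    table = [[] for _ in range(n)]
    for _ in range(n):
        table = sorted([x] + row for x, row in zip(last, table))
    return table[index - 1]

L, I = bwt([3, 1, 3, 1, 2])  # L = [3, 3, 1, 1, 2], I = 5
```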
-
Is the transformation enough?
Of course not!
The transformation collects similar elements nearby.
To achieve better compression we need to use some other technique, like the H operator.
In this case we can use Move-to-Front (Recency Rank) coding, or Inversion coding/transformation.
-
Move-To-Front Coding (Recency Ranking)
Introduced by Bentley et al. (1986), and independently discovered by Elias (1987).
Move-to-Front coding is an adaptive technique, used when the data have locality of reference.
When an MTF coder is implemented for an 8-bit data string, the identity permutation is constructed from the set {0, ..., 255}.
-
Example: Let {a, b, c, d} be our alphabet.
Let S = bbbaaaddddccc be our data string. The MTF encoding is performed as follows:

symbol:  b     b     b     a     a     a     d    ..
ranks:   0123  0123  0123  0123  0123  0123  0123 ..
table:   abcd  bacd  bacd  bacd  abcd  abcd  abcd ..
output:  1     0     0     1     0     0     3    ..

Output: 1001003000300.
-
MTF decoding on 1001003000300 is done as follows:

input:   1     0     0     1     0     0     3    ..
ranks:   0123  0123  0123  0123  0123  0123  0123 ..
table:   abcd  bacd  bacd  bacd  abcd  abcd  abcd ..
output:  b     b     b     a     a     a     d    ..

Output: bbbaaaddddccc
Why do we use MTF? If the data have locality of reference, the MTF-transformed data yield a better distribution for encoding.
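The encode and decode traces above can be sketched as:

```python
def mtf_encode(s, alphabet):
    """Emit each symbol's rank in the table, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in s:
        r = table.index(ch)
        out.append(r)
        table.insert(0, table.pop(r))  # move the symbol to the front
    return out

def mtf_decode(ranks, alphabet):
    """Look up each rank in the table, then apply the same move-to-front."""
    table = list(alphabet)
    out = []
    for r in ranks:
        out.append(table[r])
        table.insert(0, table.pop(r))
    return "".join(out)

ranks = mtf_encode("bbbaaaddddccc", "abcd")  # [1,0,0,1,0,0,3,0,0,0,3,0,0]
```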
-
Linear Index Permutation (LIT). For example, let w = [1, 3, 1, 3, 2].

Original: positions 1 2 3 4 5, w = [1, 3, 1, 3, 2]
l   = [1, 4, 2, 5, 3]
l-1 = [1, 3, 5, 2, 4], and w permuted by l-1 is [1, 1, 2, 3, 3]

Note that the inverse permutation of the LIT l is l-1 = [1, 3, 5, 2, 4]. l-1 is called the Canonical Sorting Permutation of w.
Also, the elements of w sorted in non-decreasing order by l-1 consist of m blocks of different sizes.
Sorted data can be encoded cheaply.
-
Hence, the problem is to encode the canonical sorting permutation.
Interval ranking: except for the first appearance of an element of the data string, each element yields a rank, which is simply the count of the number of elements between two successive occurrences of the same element.
Let H be the difference operator on a sequence; then it is easy to prove that for the first-order entropy
H(H(l-1)) ≤ H(Interval Rank)
-
Inversion Coding
Let π = [π1, π2, ..., πn] be an arbitrary permutation of an n-set S of positive integers. A Left Bigger (LB) inversion vector associated with π is the sequence [I1, I2, ..., In] of non-negative integers defined as follows:
Ik = |{ j : 1 ≤ j < k and πj > πk }|
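Under the Left-Bigger reading above (Ik counts the larger elements to the left of position k; the slide's definition is truncated, so this completion is an assumption), the inversion vector can be computed as:

```python
def lb_inversion_vector(pi):
    """I_k = |{ j : j < k and pi[j] > pi[k] }|: for each position,
    count the bigger elements that appear to its left."""
    return [sum(1 for j in range(k) if pi[j] > pi[k])
            for k in range(len(pi))]

# Example: the canonical sorting permutation from the LIT slide.
iv = lb_inversion_vector([1, 3, 5, 2, 4])  # [0, 0, 0, 2, 1]
```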
-
When the H operator (difference operator) is applied to an inversion vector, except for m-1 values (recall that there are m blocks), all the values will be positive (or negative).
The value |It - It-1 - 1| is the count of how many bigger (or smaller) elements exist between the last and the most recent occurrence of a symbol from the alphabet.
Hence, the decorrelation of inversion-vector elements yields a value which is called the inversion rank (distance).
-
Elias proved that Recency Ranking (MTF) yields better compression than Interval Ranking.
It is easy to prove that inversion ranking yields better compression than interval ranking. While it is theoretically hard to relate MTF and Inversion Coding, simulation results have shown that inversion coding yields better compression than MTF coding.
-
The Bzip2 compression scheme utilizes:
1) The BWT transformation
2) An MTF coder
3) Variable-length coding (a Huffman coder)
Currently, Bzip2 is one of the best universal compression schemes.
My contributions in this area:
Theoretical settings of the BWT (1997)
A new and faster transformation than the BWT, the Linear Order Transformation (1999)
Inversion coding for large data files (2004)
BWIC is available from www.cs.fredonia.edu/arnavut/research.html
It yields better compression than bzip2 on several different data files: for example, on large text files, pseudo-color images, audio files, and images.
-
Data File   Size      MTF   IC    BSC   BSWIC
Bib          111261   5.94  5.68  2.11  2.17
Book1        768768   5.12  4.84  2.61  2.52
Book2        610856   5.24  4.95  2.22  2.19
Geo          102400   6.03  6.16  4.83  4.97
News         377109   5.55  5.32  2.65  2.70
Obj1          21504   6.06  5.70  4.02  4.30
Obj2         246814   6.15  6.09  2.58  2.77
Paper1        53161   5.46  5.42  2.65  2.74
Paper2        82199   5.23  5.13  2.61  2.65
Pic          513216   1.09  1.03  0.84  0.81
Progc         39611   5.59  5.67  2.67  2.82
Progl         71646   4.93  4.96  1.88  1.91
Progp         49379   5.12  5.30  1.86  1.96
Trans         93695   5.55  5.49  1.63  1.77
Bible       4047392   5.04  4.56  1.71  1.62
Calag.tar   3276813   4.75  4.45  2.44  2.28
E.coli      4638690   2.25  2.14  2.21  2.10
World192    2473400   5.34  5.03  1.49  1.47
Avg.                  5.02  4.88  2.39  2.43
W. Avg.               4.23  3.96  2.04  1.96

Results in bps using an arithmetic coder.
-
THANK YOU!
Questions?
1/2/2012, FIT, December 2011