lossless compression basics
TRANSCRIPT
-
8/3/2019 Lossless Compression Basics
1/42
Data Compression
By
Ziya Arnavut
Department of Computer and Information Sciences
SUNY Fredonia
12-20-2011
-
A tremendous amount of data is communicated every day.
Example: the World-Wide-Web: many people surf the net.
People communicate over the internet using software such as Skype.
The transmission time is related to a) the amount (size) of the data and b) the channel's capacity.
Can we reduce the transmission time?
Of course.
a) Reduce the size of the data. How? Use a suitable compression technique.
b) Increase the channel capacity. For example: 100 MB → 1 GB.
c) Or, utilize both (a) and (b).
-
Can we reduce the size of data?
Yes, using compression techniques.
Two main data compression techniques:
1. Lossless (noiseless) techniques.
Examples: text, medical and remote-sensing imagery.
2. Lossy (noisy) techniques.
Examples: images, sound and some other multimedia applications.
Lossy Techniques
Data compressed by lossy techniques are not exactly recoverable.
In many applications, this feature helps to increase the channel throughput.
For example, JPEG images: the compressed image may have significant loss relative to the original, but this does not cause a problem, since humans can often comprehend things even when there is noise.
Hence, depending on the application, lossy techniques may be used to increase the channel throughput.
-
Original image. Size on disk: 2.25 MBytes.
85% JPEG-compressed image. Size on disk: 267,460 bytes.
-
A Lossy + Lossless technique:
Color-mapped (Palette) Images
To meet transmission, storage or, most often, display restrictions, there is sometimes a need to restrict the number of colors in an image. Since images are acquired with a high number of different colors, a color-reduction step is needed. This step is known as color quantization, and several techniques have been proposed.
A color-quantized image is a matrix of indices, where each index i corresponds to a triplet (Ri, Gi, Bi) in the color-map table of the image. Color-quantized images are also known as pseudo-color, color-mapped or palette images.
Index R G B
0 28 0 1
1 19 2 5
2 34 1 1
3 39 2 3
4 44 0 2
.. .. .. ..
254 193 211 223
255 206 212 222
-
Examples: Graphic Interchange Format (GIF) compressed images
Frymier image. Size: 1,238,678 bytes; GIF: 229,930 bytes.
Ben and Jerry. Size: 28,326 bytes; GIF: 4,387 bytes.
Yahoo image. Size: 28,110 bytes; GIF: 6,968 bytes.
-
Lossless Coding
A lossless compression scheme has two components:
1) Modeling
2) Coding
First, I will address coding.
Let A be an alphabet, a collection of distinct symbols.
Let S = s1 s2 ... sn be a sequence from an alphabet A. That is, S is a data string.
Example: From the English alphabet {a..z, A..Z, space}, "This is an example" is a data string.
-
Assigning binary sequences to individual alphabet elements is called encoding.
The set of binary sequences resulting from an encoding is called a code, C = {c1, c2, ..., cn}.
An element of a code is called a code-word (i.e., ci ∈ C).
For example, the ASCII code consists of 128 code-words.
Each code-word has 7 bits. An 8th bit is appended for parity checking, or other control purposes.
Example: A → 1000001, B → 1000010.
-
Fixed-Length Codes:
Example: ASCII codes.
Do we gain by using fixed-length codes? No! Why not?
Instead, use techniques similar to Morse telegraph codes. That is, assign short code-words (fewer bits) to the characters that appear more often, and longer code-words to the characters that appear less frequently.
This is called variable-length coding.
-
Question: Why does this work?
Most often, the frequency distribution of the letters in a data string is far from uniform. Example: in an English text, the most frequently occurring letter is e.
If a data source or string has a uniform distribution, variable-length coding techniques do not work.
For an independent data source S with probabilities of occurrence p(s1), ..., p(sm), the zero-order entropy is
H(S) = -Σi p(si) · log2 p(si) bps
-
The entropy of a source yields a lower bound on the encoding cost.
Two well-known variable-length coding techniques are Huffman and Arithmetic Coding. They can code a data string close to or equal to its entropy.
Example: Consider a data string with 64 characters from the alphabet {a, ..., z}:
S = aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee
The zero-order entropy of this string is 2.26 bps. Hence, at best we can code the data in 64 × 2.26 = 144.64 bits.
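The zero-order entropy of the example string can be checked with a short sketch (Python is used here for illustration):

```python
import math
from collections import Counter

def zero_order_entropy(s):
    """Zero-order entropy H(S) = -sum_i p(s_i) * log2 p(s_i), in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

S = ("aaaaaaaaaabbbbbbbccegiidffgii"
     "aaabaaaabbccccaaccaaaabbaaaaaaabeee")
H = zero_order_entropy(S)
print(len(S), round(H, 2))  # 64 symbols, about 2.26 bps
```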
-
Huffman Coding: merge the two least probable characters into one, and repeat this process until only one character remains.
For our string, here is the probability distribution:

Freq  si  P(si)
30    a   0.469
13    b   0.203
 8    c   0.125
 1    d   0.016
 4    e   0.063
 4    i   0.063
 2    g   0.031
 2    f   0.031
64        1.000

After sorting:

si  P(si)
a   0.469
b   0.203
c   0.125
e   0.063
i   0.063
g   0.031
f   0.031
d   0.016
-
Building the Huffman Tree (merging the two smallest probabilities at each step):

Step 0: a 0.469, b 0.203, c 0.125, e 0.063, i 0.063, g 0.031, f 0.031, d 0.016
Step 1 (f + d → 0.047): a 0.469, b 0.203, c 0.125, e 0.063, i 0.063, 0.047, g 0.031
Step 2 (g + 0.047 → 0.078): a 0.469, b 0.203, c 0.125, 0.078, e 0.063, i 0.063
Step 3 (e + i → 0.126): a 0.469, b 0.203, 0.126, c 0.125, 0.078
Step 4 (c + 0.078 → 0.203): a 0.469, b 0.203, 0.203, 0.126
Step 5 (0.203 + 0.126 → 0.329): a 0.469, 0.329, b 0.203
Step 6 (0.329 + b → 0.532): 0.532, a 0.469
Step 7 (a + 0.532 → 1.0): done
-
Huffman Tree & Code

si  P(si)  Code    Bits
a   0.469  1        30
b   0.203  01       26
c   0.125  0000     32
e   0.063  0010     16
i   0.063  0011     16
g   0.031  00011    10
f   0.031  000100   12
d   0.016  000101    6
Total # of bits:    148

Hence, entropy gives us a lower bound on the number of bits needed to encode a data string.
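A minimal sketch of the merge procedure, using Python's heapq module (the implementation details are mine). For the 64-character example string, any optimal code costs 148 bits in total, a few bits above the 144.64-bit entropy bound:

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: frequency} map by repeatedly
    merging the two least probable subtrees; each merge prepends '0' to
    one side's codes and '1' to the other's."""
    # Heap of (weight, tiebreaker, {symbol: code-so-far}) entries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

S = ("aaaaaaaaaabbbbbbbccegiidffgii"
     "aaabaaaabbccccaaccaaaabbaaaaaaabeee")
code = huffman_code(Counter(S))
total_bits = sum(len(code[s]) for s in S)  # 148 bits for this string
```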
-
Can we do better?
Consider attaching integer values to the symbols in S. Then apply the difference operator H:
HS = s1, s2 - s1, s3 - s2, ..., s64 - s63.
Note: S can be recovered easily.
The frequency distribution of the new sequence HS is:

letter     0   1  2  -1  -2  3  -5  -8
frequency  42  9  6   2   1  1   1   1
-
Using Huffman coding we may encode HS as follows:
0 → 0, 1 → 10, 2 → 110,
-1 → 1110, -2 → 111100,
3 → 111101, -8 → 111110,
and -5 → 111111. Total bits:
42·1 + 9·2 + ... + 1·6 + 1·6 + 1·6 = 110
A rate of r = 110/64 = 1.71875 bps.
The H operator is an example of a decorrelation step, which is used in modeling the data.
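The H operator and its inverse can be sketched as below. The mapping of letters to integers (a → 1, ..., i → 9) is an assumption for illustration, since the slides do not fix one:

```python
def delta_encode(s):
    """H operator: keep the first element, then successive differences."""
    return [s[0]] + [s[i] - s[i - 1] for i in range(1, len(s))]

def delta_decode(hs):
    """Invert the H operator by a running (prefix) sum."""
    out = [hs[0]]
    for d in hs[1:]:
        out.append(out[-1] + d)
    return out

S = "aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee"
vals = [ord(ch) - ord("a") + 1 for ch in S]  # assumed mapping a->1, ..., i->9
hs = delta_encode(vals)
# Runs of equal symbols become runs of zeros, which compress well.
```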
-
The aim of the decorrelation step is to remove redundancies from the data. Note that this yields a further gain of approximately 0.6 bps.
Hence, by considering relationships among the data elements, we can obtain better compression.
Research in lossless compression has focused on modeling the data source in order to exploit the correlation among data elements.
-
Arithmetic Coding
Suppose we have an alphabet (a, b, c), with probabilities of occurrence (0.7, 0.1, 0.2). Each symbol may be assigned to the following range based on its probability:

Sample Symbol Ranges
Symbol  Probability  Range
a       70%          [0.00, 0.70)
b       10%          [0.70, 0.80)
c       20%          [0.80, 1.00)
-
Encoding with an Arithmetic Coder
The pseudocode below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds.

lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range × upper bound of new symbol)
    lower bound = lower bound + (current range × lower bound of new symbol)
end while

Any value between the computed lower and upper probability bounds now encodes the input string.
-
Example: To encode "abc"
Encode 'a':
current range = 1 - 0 = 1
upper bound = 0 + (1 × 0.70) = 0.7
lower bound = 0 + (1 × 0.00) = 0.0
Encode 'b':
current range = 0.7 - 0.0 = 0.7
upper bound = 0.0 + (0.7 × 0.80) = 0.56
lower bound = 0.0 + (0.7 × 0.70) = 0.49
Encode 'c':
current range = 0.56 - 0.49 = 0.07
upper bound = 0.49 + (0.07 × 1.00) = 0.56
lower bound = 0.49 + (0.07 × 0.80) = 0.546
The string "abc" may be encoded by any value within the probability range [0.546, 0.56), for example 0.55.
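The encoding pseudocode can be written out as a runnable sketch that reproduces the worked example:

```python
def ar_encode(message, ranges):
    """Narrow the interval [low, high) once per symbol:
    new bounds = low + span * (symbol's range bounds)."""
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        sym_low, sym_high = ranges[sym]
        high = low + span * sym_high
        low = low + span * sym_low
    return low, high

# Ranges from the symbol-range table above.
ranges = {"a": (0.0, 0.7), "b": (0.7, 0.8), "c": (0.8, 1.0)}
low, high = ar_encode("abc", ranges)
# Any value in [low, high), e.g. 0.55, encodes "abc".
```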
-
Figure: Encoding the string "abc" with AC. The interval [0.0, 1.0) is split at 0.7 and 0.8; choosing a narrows it to [0.0, 0.7), split at 0.49 and 0.56; choosing b narrows it to [0.49, 0.56); choosing c narrows it to [0.546, 0.56).
-
Decoding Strings

encoded value = encoded input
while string is not fully decoded
    identify the symbol containing the encoded value within its range
    // remove the effects of the symbol from the encoded value
    current range = upper bound of symbol - lower bound of symbol
    encoded value = (encoded value - lower bound of symbol) / current range
end while
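A matching decoder sketch; note that the decoder must know the message length (or use a terminator symbol) to know when to stop:

```python
def ar_decode(value, ranges, n):
    """Decode n symbols: find the symbol whose range contains the value,
    then rescale the value to remove that symbol's effect."""
    out = []
    for _ in range(n):
        for sym, (lo, hi) in ranges.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)  # rescale into [0, 1)
                break
    return "".join(out)

ranges = {"a": (0.0, 0.7), "b": (0.7, 0.8), "c": (0.8, 1.0)}
decoded = ar_decode(0.55, ranges, 3)  # recovers "abc"
```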
-
Figure: Decoding 0.55. In [0.0, 1.0), 0.55 falls in a's range [0.0, 0.7); within that interval it falls in b's range [0.49, 0.56); within that it falls in c's range [0.546, 0.56). Hence 0.55 decodes to "abc".
-
Universal Lossless Compressors
Dictionary-Based Algorithm (Ziv-Lempel): Encoding Algorithm
1. Initialize the dictionary to contain all blocks of length one (D = {a, b}).
2. Search for the longest block W which has appeared in the dictionary.
3. Encode W by its index in the dictionary.
4. Add W followed by the first symbol of the next block to the dictionary.
5. Go to Step 2.
-
The following example illustrates how the
encoding is performed.
Data: a b b a a b b a a b a b b a a a a b a a b b a
0 1 1 0 2 4 2 6 5 5 7 3 0
Dictionary
Index Entity Index Entity
0 a 7 b a a
1 b 8 a b a
2 a b 9 a b b a
3 b b 10 a a a
4 b a 11 a a b
5 a a 12 b a a b
6 a b b 13 b b a
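Steps 1 to 5 can be sketched as follows (the helper name lz_encode is mine); running it on the example string reproduces the index sequence and dictionary shown above:

```python
def lz_encode(data, alphabet):
    """Greedy longest-match dictionary coder: emit the index of the
    longest known block W, then add W plus the next symbol to the dictionary."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}  # step 1
    out = []
    i = 0
    while i < len(data):
        # Step 2: extend the match while it stays in the dictionary.
        w = data[i]
        while i + len(w) < len(data) and w + data[i + len(w)] in dictionary:
            w += data[i + len(w)]
        out.append(dictionary[w])                  # step 3: emit index
        if i + len(w) < len(data):                 # step 4: extend dictionary
            dictionary[w + data[i + len(w)]] = len(dictionary)
        i += len(w)
    return out

codes = lz_encode("abbaabbaababbaaaabaabba", "ab")  # the slide's data string
```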
-
The size of the dictionary can grow infinitely large.
In practice, the dictionary size is limited. Once the limit is reached, no more entries are added.
For example, a dictionary of size 4096 corresponds to 12 bits per index.
Various implementations of the Ziv-Lempel algorithm exist. Gzip (GNU zip) is free software available on the internet.
-
Burrows-Wheeler Transformation
Let w = [3, 1, 3, 1, 2] be a data string. Construct

     3 1 3 1 2
     1 3 1 2 3
M =  3 1 2 3 1
     1 2 3 1 3
     2 3 1 3 1

by forming the successive rows of M, which are consecutive cyclic left-shifts of w.
-
By sorting the rows of M lexically we transform it to

     1 2 3 1 3
     1 3 1 2 3
M =  2 3 1 3 1
     3 1 2 3 1
     3 1 3 1 2

Let the last column of M be denoted by L.
-
Note that the original data string w is the 5th row of M.
Given the row index I = 5 of w in M and L = [3, 3, 1, 1, 2], we can recover w. How?
We know the last column L, and sorting L gives the first column of M:

     1 _ _ _ 3
     1 _ _ _ 3
M =  2 _ _ _ 1
     3 _ _ _ 1
     3 _ _ _ 2
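A naive sketch of both directions of the transformation (a practical implementation avoids materializing the whole matrix):

```python
def bwt(w):
    """Forward BWT: sort all cyclic left-shifts, return the last column
    and the (1-based, as on the slide) row index of the original string."""
    n = len(w)
    rows = sorted(w[k:] + w[:k] for k in range(n))
    last = [row[-1] for row in rows]
    return last, rows.index(w) + 1

def inverse_bwt(last, index):
    """Recover w: repeatedly prepend L as a new first column and re-sort."""
    n = len(last)
    table = [[] for _ in range(n)]
    for _ in range(n):
        table = sorted([x] + row for x, row in zip(last, table))
    return table[index - 1]

L, I = bwt([3, 1, 3, 1, 2])  # L = [3, 3, 1, 1, 2], I = 5
```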
-
Is the transformation enough?
Of course not!
The transformation collects similar elements nearby.
To achieve better compression we need to use some other technique, like the H operator.
In this case we can use Move-to-Front (Recency Rank) coding, or Inversion coding/transformation.
-
Move-To-Front Coding (Recency Ranking)
Introduced by Bentley et al. (1986), and independently discovered by Elias (1987).
Move-to-Front coding is an adaptive technique, used when the data have locality of reference.
When an MTF coder is implemented for an 8-bit data string, the identity permutation is constructed from the set {0, ..., 255}.
-
Example: Let {a, b, c, d} be our alphabet.
Let S = bbbaaaddddccc be our data string. The MTF encoding is performed as follows:

symbol:  b     b     b     a     a     a     d    ..
ranks:   0123  0123  0123  0123  0123  0123  0123 ..
table:   abcd  bacd  bacd  bacd  abcd  abcd  abcd ..
output:  1     0     0     1     0     0     3    ..

Output: 1001003000300.
-
MTF decoding on 1001003000300 is done as follows:

input:   1     0     0     1     0     0     3    ..
ranks:   0123  0123  0123  0123  0123  0123  0123 ..
table:   abcd  bacd  bacd  bacd  abcd  abcd  abcd ..
output:  b     b     b     a     a     a     d    ..

Output: bbbaaaddddccc
Why do we use MTF? If the data have locality of reference, the MTF-transformed data yield a better distribution for encoding.
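The encode and decode traces above can be sketched as:

```python
def mtf_encode(s, alphabet):
    """Emit each symbol's rank in the table, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in s:
        r = table.index(ch)
        out.append(r)
        table.insert(0, table.pop(r))  # move the symbol to the front
    return out

def mtf_decode(ranks, alphabet):
    """Look up each rank in the table, then apply the same move-to-front."""
    table = list(alphabet)
    out = []
    for r in ranks:
        out.append(table[r])
        table.insert(0, table.pop(r))
    return "".join(out)

ranks = mtf_encode("bbbaaaddddccc", "abcd")  # [1,0,0,1,0,0,3,0,0,0,3,0,0]
```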
-
Linear Index Permutation (LIT). For example, let w = [1, 3, 1, 3, 2].

Original: positions 1 2 3 4 5, w = [1, 3, 1, 3, 2]
l   = [1, 4, 2, 5, 3]
l-1 = [1, 3, 5, 2, 4], and w permuted by l-1 is [1, 1, 2, 3, 3]

Note that the inverse permutation of the LIT l is l-1 = [1, 3, 5, 2, 4]. l-1 is called the Canonical Sorting Permutation of w.
Also, the elements of w sorted in non-decreasing order by l-1 consist of m blocks of different sizes.
Sorted data can be encoded cheaply.
-
Hence, the problem is to encode the canonical sorting permutation.
Interval ranking: except for the first appearance of an element of the data string, each element yields a rank, which is simply the count of the number of elements between two successive occurrences of the same element.
Let H be the difference operator on a sequence; then it is easy to prove that for the first-order entropy
H(H(l-1)) ≤ H(Interval Rank)
-
Inversion Coding
Let π = [π1, π2, ..., πn] be an arbitrary permutation of an n-set S of positive integers. A Left Bigger (LB) inversion vector associated with π is the sequence [I1, I2, ..., In] of non-negative integers defined as follows:
Ik = |{ j : 1 ≤ j < k and πj > πk }|
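Under the Left-Bigger reading above (Ik counts the larger elements to the left of position k; the slide's definition is truncated, so this completion is an assumption), the inversion vector can be computed as:

```python
def lb_inversion_vector(pi):
    """I_k = |{ j : j < k and pi[j] > pi[k] }|: for each position,
    count the bigger elements that appear to its left."""
    return [sum(1 for j in range(k) if pi[j] > pi[k])
            for k in range(len(pi))]

# Example: the canonical sorting permutation from the LIT slide.
iv = lb_inversion_vector([1, 3, 5, 2, 4])  # [0, 0, 0, 2, 1]
```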
-
When the H operator (difference operator) is applied to an inversion vector, except for m-1 values (recall that there are m blocks), all the values will be positive (or negative).
The value |It - It-1 - 1| is the count of how many bigger (or smaller) elements exist between the last and the most recent occurrence of a symbol from the alphabet.
Hence, the decorrelation of inversion-vector elements yields a value which is called the inversion rank (distance).
-
Elias proved that Recency Ranking (MTF) yields better compression than Interval Ranking.
It is easy to prove that inversion ranking yields better compression than interval ranking. While it is theoretically hard to relate MTF and Inversion Coding, simulation results have shown that inversion coding yields better compression than MTF coding.
-
The Bzip2 compression scheme utilizes:
1) The BWT transformation
2) An MTF coder
3) Variable-length coding (a Huffman coder)
Currently, Bzip2 is one of the best universal compression schemes.
My contributions in this area:
Theoretical settings of the BWT (1997)
A new and faster transformation than the BWT, the Linear Order Transformation (1999)
Inversion coding for large data files (2004)
BWIC is available from www.cs.fredonia.edu/arnavut/research.html
It yields better compression than bzip2 on several different data files: for example, on large text files, pseudo-color images, audio files, and images.
-
Data File   Size      MTF   IC    BSC   BSWIC
Bib          111261   5.94  5.68  2.11  2.17
Book1        768768   5.12  4.84  2.61  2.52
Book2        610856   5.24  4.95  2.22  2.19
Geo          102400   6.03  6.16  4.83  4.97
News         377109   5.55  5.32  2.65  2.70
Obj1          21504   6.06  5.70  4.02  4.30
Obj2         246814   6.15  6.09  2.58  2.77
Paper1        53161   5.46  5.42  2.65  2.74
Paper2        82199   5.23  5.13  2.61  2.65
Pic          513216   1.09  1.03  0.84  0.81
Progc         39611   5.59  5.67  2.67  2.82
Progl         71646   4.93  4.96  1.88  1.91
Progp         49379   5.12  5.30  1.86  1.96
Trans         93695   5.55  5.49  1.63  1.77
Bible       4047392   5.04  4.56  1.71  1.62
Calag.tar   3276813   4.75  4.45  2.44  2.28
E.coli      4638690   2.25  2.14  2.21  2.10
World192    2473400   5.34  5.03  1.49  1.47
Avg.                  5.02  4.88  2.39  2.43
W. Avg.               4.23  3.96  2.04  1.96

Results in bps using an arithmetic coder.
-
THANK YOU!
Questions?
1/2/2012, FIT, December 2011