Representation of Strings: Background and Huffman Encoding (transcript)
Representation of Strings
Background
Huffman Encoding
Representing Strings
How much space do we need? Assume we represent every character. How many bits do we need to represent each character? That depends on |Σ|, the size of the alphabet.
Bits to encode a character
Two-character alphabet {A, B}: one bit per character (0 = A, 1 = B).
Four-character alphabet {A, B, C, D}: two bits per character (00 = A, 01 = B, 10 = C, 11 = D).
Six-character alphabet {A, B, C, D, E, F}: three bits per character (000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F; 110 and 111 unused).
More generally
The bit sequence representing a character is called the encoding of the character.
There are 2^n different bit sequences of length n, so ceil(lg |Σ|) bits are required to represent each character in Σ.
If we use the same number of bits for each character, then the length of the encoding of a word w is |w| * ceil(lg |Σ|).
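As a small sketch of the fixed-length scheme above: assign each character an index and write it with ceil(lg |Σ|) bits (the particular character-to-index assignment here is an arbitrary illustrative choice; any fixed assignment works).

```python
import math

def fixed_length_encode(word, alphabet):
    """Encode each character with ceil(lg |alphabet|) bits."""
    bits = max(1, math.ceil(math.log2(len(alphabet))))
    # fixed assignment: characters in sorted order get indices 0, 1, 2, ...
    code = {ch: format(i, f"0{bits}b") for i, ch in enumerate(sorted(alphabet))}
    return "".join(code[ch] for ch in word)

# six-character alphabet: ceil(lg 6) = 3 bits per character,
# so a 4-character word costs 4 * 3 = 12 bits
fixed_length_encode("FACE", {"A", "B", "C", "D", "E", "F"})  # "101000010100"
```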
Can we do better??
If |Σ| is very small, we might use run-length encoding.
Taking a step back …
Why do we need compression? Consider the rate of creation of image and video data:

- image data from a digital camera: today 1k by 1.5k pixels is common = 1.5 Mbytes
- 2k by 3k pixels are needed to equal a 35mm slide = 6 Mbytes
- video, even at the low resolution of 512 by 512 pixels, 3 bytes per pixel, 30 frames/second ...
Compression basics: video data rate

- 23.6 Mbytes/second (512 x 512 pixels x 3 bytes x 30 frames/second)
- 2 hours of video = 169 gigabytes
- MPEG-1 compresses 23.6 Mbytes/second down to 187 kbytes per second, i.e., 169 gigabytes down to 1.3 gigabytes
compression is essential for both storage and transmission of data
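The arithmetic behind these figures can be checked directly (assuming 1 kbyte = 1024 bytes for the MPEG-1 rate):

```python
# 512 x 512 pixels, 3 bytes per pixel, 30 frames per second
bytes_per_second = 512 * 512 * 3 * 30        # 23592960, i.e. ~23.6 Mbytes/second

# two hours of uncompressed video
two_hours = bytes_per_second * 2 * 60 * 60   # ~169.9e9 bytes, i.e. ~169 gigabytes

# MPEG-1 at 187 kbytes/second over the same two hours
mpeg1 = 187 * 1024 * 2 * 60 * 60             # ~1.38e9 bytes, i.e. ~1.3 gigabytes
```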
Compression basics
Compression is very widely used: JPEG and GIF for single images, MPEG-1, 2, 3, 4 for video sequences, zip for computer data, MP3 for sound.

It is based on two fundamental principles: spatial coherence (similarity with spatial neighbors) and temporal coherence (similarity with temporal neighbors).
Basics of compression
- character = basic data unit in the input stream; may represent a byte, a bit, etc.
- strings = sequences of characters
- encoding = compression; decoding = decompression
- codeword = data element used to represent input characters or character strings
- codetable = list of codewords
Codeword
The encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce. The decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce.

[Diagram: input data stream -> Encoder -> data storage or transmission -> Decoder -> output data stream]
Codetable
Clearly the encoder must pass the encoded data to the decoder as a series of codewords. It must also pass the codetable, which can be passed explicitly or implicitly; that is, we either

- pass it across,
- agree on it beforehand (hard-wired), or
- recreate it from the codewords (clever!).
Basic definitions
- compression ratio = size of original data / size of compressed data; basically, the higher the compression ratio, the better
- lossless compression: output data is exactly the same as the input data; essential for encoding computer-processed data
- lossy compression: output data is not the same as the input data; acceptable for data that is only viewed or heard
Lossless versus lossy
The human visual system is less sensitive to high-frequency losses and to losses in color, so lossy compression is acceptable for visual data. The degree of loss is usually a parameter of the compression algorithm, giving a tradeoff of loss versus compression: higher compression => more loss; lower compression => less loss.
Symmetric versus asymmetric
- symmetric: encoding time == decoding time; essential for real-time applications (e.g., video or audio on demand)
- asymmetric: encoding time >> decoding time; OK for write-once, read-many situations
Entropy encoding
Compression that does not take into account what is being compressed; normally it is also lossless. The most common types of entropy encoding are run-length encoding, Huffman encoding, modified Huffman (fax...), and Lempel-Ziv.
Source encoding
Takes into account the type of data (e.g., visual). Normally lossy, but can also be lossless. The most common types in use are JPEG and GIF for single images, MPEG for sequences of images (video), and MP3 for sound sequences. Source encoding often uses entropy encoding as a subroutine.
Run length encoding
One of the simplest and earliest types of compression. It takes account of repeating data (called runs); a run is represented by a count along with the original data, e.g., AAAABB => 4A2B.

Do you run-length encode a single character? No; instead, use a special prefix character to mark the start of a run.
Run length encoding
Runs are represented as <prefix char><repeat count><run char>. The prefix char itself becomes <prefix char>1<prefix char>. We want a prefix char that is not too common in the data. An early example of its use is the MacPaint file format. Run-length encoding is lossless and has fixed-length codewords.
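A minimal sketch of this scheme. The prefix character `#`, the single-digit counts (so runs cap at 9), and the threshold of 4 before a run is worth encoding are all illustrative choices, not part of any particular file format:

```python
PREFIX = "#"  # assumed prefix character; pick one that is rare in the data

def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        ch, run = text[i], 1
        while i + run < len(text) and text[i + run] == ch and run < 9:
            run += 1
        if ch == PREFIX:
            out.append(f"{PREFIX}{run}{PREFIX}")  # prefix char is always escaped
        elif run >= 4:
            out.append(f"{PREFIX}{run}{ch}")      # <prefix><count><char>
        else:
            out.append(ch * run)                  # short runs stay literal
        i += run
    return "".join(out)

def rle_decode(data):
    out, i = [], 0
    while i < len(data):
        if data[i] == PREFIX:
            out.append(data[i + 2] * int(data[i + 1]))  # expand <prefix><count><char>
            i += 3
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

rle_encode("AAAABB")  # "#4ABB"
```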
MacPaint File Format
Run length encoding
Works best for images with a solid background; a good example of such an image is a cartoon. It does not work as well for natural images and does not work well for English text. However, it is almost always a part of a larger compression system.
What if …
the string we encode doesn't use all the letters in the alphabet? Then ceil(log2(|set of characters used|)) bits per character suffice. But then we also need to store / transmit the mapping from encodings to characters ... and the set of characters used is typically close to the size of the alphabet.
Huffman Encoding:
Assumes encoding on a per-character basis.

Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings; this requires assigning longer codes to rarely used characters.
Huffman Encoding
Problem: when decoding, we need to know how many bits to read off for each character.

Solution: choose an encoding that ensures that no character's encoding is a prefix of any other character's encoding. An encoding tree has this property.
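The prefix property is what lets a decoder commit to a character the moment its bits match a codeword. A sketch, using a hypothetical prefix-free code for illustration:

```python
def decode_prefix_free(bits, codetable):
    """Decode a bit string under a prefix-free code: because no codeword is a
    prefix of another, the first match can be committed to immediately."""
    inverse = {code: ch for ch, code in codetable.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # prefix property: safe to commit now
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("dangling bits: input ended mid-codeword")
    return "".join(out)

# hypothetical prefix-free code
codes = {"E": "1", "A": "000", "T": "001", "R": "010", "N": "011"}
decode_prefix_free("1000011", codes)  # "1"->E, "000"->A, "011"->N, i.e. "EAN"
```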
Huffman encoding
- assume we know the frequency of each character in the input stream
- then encode each character as a variable-length bit string, with the length inversely proportional to the character's frequency
- variable-length codewords have been used before; an early example is Morse code
- Huffman produced an algorithm for assigning codewords optimally
Huffman encoding
- input = probabilities of occurrence of each input character (frequencies of occurrence)
- output = a binary tree: each leaf node is an input character, each branch is a zero or one bit, and the codeword for a leaf is the concatenation of the bits on the path from the root to the leaf
- codewords are variable-length bit strings, giving a very good (in fact optimal) compression ratio
Huffman encoding
Basic algorithm:
- Mark all characters as free tree nodes.
- While there is more than one free node:
  - Take the two nodes with the lowest frequency of occurrence.
  - Create a new tree node with these nodes as children and with frequency equal to the sum of their frequencies.
  - Remove the two children from the free node list and add the new parent to it.
- The last remaining free node is the root of the binary tree used for encoding/decoding.
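The algorithm above can be sketched with a priority queue (a common implementation choice; ties between equal frequencies are broken arbitrarily, so the exact codewords can differ between implementations while the total cost stays optimal):

```python
import heapq
from itertools import count

def build_huffman(freqs):
    """Build a Huffman code table from {char: frequency} by repeatedly
    merging the two free nodes of lowest frequency."""
    tick = count()  # tie-breaker so heapq never compares tree nodes directly
    heap = [(f, next(tick), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lowest-frequency free nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    root = heap[0][2]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: branch 0 / 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: root-to-leaf path is the codeword
            codes[node] = prefix or "0"
        return codes
    return walk(root, "")

# the frequencies from the tree-building example later in these slides
build_huffman({"A": 3, "T": 4, "R": 4, "E": 5})
```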
A Huffman Encoding Tree
[Encoding tree: the root (21) branches 0 to an internal node (12) and 1 to the leaf E (9); node (12) branches 0 to a node (5) with leaves A (3) and T (2), and 1 to a node (7) with leaves R (3) and N (4).]
[The same encoding tree, with the codewords read off the root-to-leaf paths:]

A 000
T 001
R 010
N 011
E 1
Weighted path length
A 000
T 001
R 010
N 011
E 1
Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) +
Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
= (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
= 9 + 6 + 9 + 12 + 9 = 45
Claim (proof in text) : no other encoding can result in a shorter weighted path length
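To make the computation concrete, here is a small sketch that recomputes the weighted path length from the code table above, with the frequencies f(A)=3, f(T)=2, f(R)=3, f(N)=4, f(E)=9 read off the tree on the preceding slides:

```python
codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
freqs = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}   # from the tree above

# weighted path length = sum of (codeword length * frequency)
weighted_path = sum(len(codes[c]) * freqs[c] for c in codes)   # 45

# compare with a fixed-length code: 5 characters need ceil(lg 5) = 3 bits each,
# and there are 21 characters in total, so 3 * 21 = 63 bits
fixed = 3 * sum(freqs.values())   # 63
```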
Building the Huffman Tree

Start with the frequencies A = 3, T = 4, R = 4, E = 5 as free nodes.

1. Merge the two lowest-frequency free nodes, A (3) and T (4), into a parent of frequency 7.
2. The free nodes are now R (4), E (5), and the node of frequency 7.
3. Merge the two lowest, R (4) and E (5), into a parent of frequency 9.
4. The free nodes are now 7 and 9; merge them into the root, of frequency 16.
5. Label each left branch 0 and each right branch 1; reading the root-to-leaf paths gives the codewords A = 00, T = 01, R = 10, E = 11.
Huffman example
A series of colors in an 8 by 8 screen. The colors are red, green, cyan, blue, magenta, yellow, and black. The sequence is:

rkkkkkkk gggmcbrr kkkrrkkk bbbmybbr kkrrrrgg gggggggr kkbcccrr grrrrgrr
Another Huffman example
Color        Frequency
Black (K)    19
Red (R)      17
Green (G)    16
Blue (B)     5
Cyan (C)     4
Magenta (M)  2
Yellow (Y)   1
Another Huffman Example
Another Huffman example, cont’d
Huffman example, cont’d
Red = 00 Blue = 111 Magenta = 11010
Black = 01 Cyan = 1100 Yellow = 11011
Green = 10
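With this code table and the frequency table from the earlier slide, we can check that the code is prefix-free and compare its cost against a fixed-length code (7 colors would need ceil(lg 7) = 3 bits per pixel):

```python
# codewords from this slide, frequencies from the earlier table
codes = {"R": "00", "K": "01", "G": "10", "B": "111",
         "C": "1100", "M": "11010", "Y": "11011"}
freqs = {"K": 19, "R": 17, "G": 16, "B": 5, "C": 4, "M": 2, "Y": 1}

# prefix property: no codeword is a prefix of another
assert not any(a != b and b.startswith(a)
               for a in codes.values() for b in codes.values())

huffman_bits = sum(len(codes[c]) * freqs[c] for c in codes)   # 150 bits
fixed_bits = 3 * sum(freqs.values())                          # 3 bits x 64 pixels = 192
```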
Fixed versus variable length codewords
Run-length codewords are fixed length; Huffman codewords are variable length, with length inversely proportional to frequency. All variable-length compression schemes have the prefix property: one code cannot be the prefix of another. The binary tree structure guarantees that this is the case (a leaf node is a leaf node!).
Huffman encoding
Advantages: maximum compression ratio, assuming correct probabilities of occurrence; easy to implement and fast.

Disadvantages: needs two passes for both encoder and decoder, one to create the frequency distribution and one to encode/decode the data. This can be avoided by sending the tree along with the data (takes time) or by using unchanging frequencies.
Modified Huffman encoding
If we know the frequencies of occurrence, then Huffman works very well. Consider the case of a fax: mostly long white spaces with short bursts of black. Do the following: run-length encode each string of bits on a line, Huffman encode these run-length codewords, and use a predefined frequency distribution. The combination: run length first, then Huffman.
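A toy sketch of this idea. The codeword table here is entirely hypothetical; real Group 3 fax uses standardized run-length code tables built from measured frequency distributions:

```python
from itertools import groupby

# hypothetical predefined Huffman codewords for (color, run length) pairs
CODE = {("w", 4): "1011", ("w", 2): "0111",
        ("b", 2): "11", ("b", 1): "010"}

def fax_encode(line):
    """Run-length a scan line of 'w'/'b' pixels, then emit the predefined
    Huffman codeword for each (color, run length) pair."""
    runs = [(c, len(list(g))) for c, g in groupby(line)]
    return "".join(CODE[(c, n)] for c, n in runs)

fax_encode("wwwwbbww")  # runs: 4 white, 2 black, 2 white -> "1011" "11" "0111"
```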
Beyond Huffman Coding …
1977 – Lempel & Ziv, Israeli information theorists, develop a dictionary-based compression method (LZ77)
1978 – they develop another dictionary-based compression method (LZ78)
… coming soon ….