Lecture 10: Data Compression


Page 1: Lecture 10: data compression

LECTURE 10: DATA COMPRESSION
EVI INDRIASARI MANSOR
Email: [email protected]
Tel ext: 1741

Page 2: Lecture 10: data compression

Outline

Basics of Data Compression
Text & Numeric Compression
Image Compression
Audio Compression
Video Compression
Data Security Through Encryption

Page 3: Lecture 10: data compression

Basics of Data Compression

Digital compression concepts
Compression techniques are used to replace a file with another that is smaller
Compressed data requires less storage and can be transmitted at a faster rate
Decompression techniques expand the compressed file to recover the original data – either exactly or in facsimile
A pair of compression / decompression techniques that work together is called a codec for short

Page 4: Lecture 10: data compression

Motivations
Basically, data compression deals with reducing the number of bits used to represent a certain amount of data by removing redundancy
Motivations:
1. Compressed data is smaller and requires less (physical) storage [hence allowing more things to be stored at a lower cost]
2. Smaller amounts of data can be processed faster
3. Smaller data can be transmitted faster
4. Computing resource requirements (such as memory) can be minimized

Page 5: Lecture 10: data compression

Types of data compression
1. Logical compression
Generally for databases. E.g. instead of allocating a large field size for ‘faculty name’ in a university database, a reference number can be used instead (something like the Color Lookup Table scheme)
2. Physical compression
Deals with removal of redundancy in data. This chapter will deal only with this type of compression

Page 6: Lecture 10: data compression

Basics of Data Compression (cont)

Compress / Decompress – CoDec
Main function of a CODEC: to reduce the redundancy in data (CODEC can also stand for COde / DECode)
How? By replacing definable patterns with shorter sequences of symbols

[Diagram: uncompressed data → compression / coder → compressed data → decompression / decoder; the coder and decoder together form the CODEC]

Page 7: Lecture 10: data compression

LOSSY compression
1. What is it? (again…)
The compressed (and then decompressed) data is not an exact match of the source
Enables better compression performance BUT… is only suitable when data loss during compression has no significant effect
E.g. when compressing an image into JPEG or TIFF. Some information loss is imperceptible to the human visual system…

Page 8: Lecture 10: data compression

LOSSLESS compression
1. What is it? (again…)
Source data can be reconstructed (decompressed) exactly from the compressed data
Required, for example, when the data being sent needs to be precise and the slightest loss would be detrimental
E.g. processing telemetry data from a satellite… or compressing text files

Page 9: Lecture 10: data compression

Concept of Models and Coding

A simple ‘formula’ to remember: DATA COMPRESSION = MODELING + CODING
This scheme of modeling and coding is used to transform an input stream of symbols into output codes!
A MODEL is a collection of data and rules:
Fixed model – predefined rules are used for compression
Adaptive model – adjustments are made to suit the pattern of the data at run-time; normally capable of better compression performance…
A CODER implements the algorithm that transforms the input data into output data (based on rules/information provided by the MODEL)

Page 10: Lecture 10: data compression

Lossless Data Compression and Applications

1. Here… we’ll have a look at some specific algorithms:
Substitution & Dictionary Methods – Null Suppression, Run-Length Encoding
Statistical Method – Huffman Coding

Page 11: Lecture 10: data compression

Substitution & Dictionary Methods

NULL SUPPRESSION
Scans a stream of data for sequences of the NULL character
These nulls are replaced with a special pair of characters consisting of:
1. An indicator character (Ic); and
2. A count
Example: XYZØØØØØMCKKW (where the Ø are nulls)
The encoded output will be: XYZIc5MCKKW (savings from 13 bytes to 10 bytes; a sketch follows below)
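A minimal Python sketch of this scheme (here '0' stands in for the null character Ø and '|' for the indicator character Ic; both stand-ins are assumptions for illustration, and the count is kept to a single digit):

```python
NULL = "0"   # stand-in for the null character (shown as Ø above)
IC = "|"     # stand-in for the indicator character Ic; must not occur in the data

def null_suppress(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == NULL:
            j = i
            while j < len(data) and data[j] == NULL:
                j += 1               # measure the run of nulls
            run = j - i
            # the 2-byte (Ic, count) pair only pays off for runs of 3+
            out.append(f"{IC}{run}" if run >= 3 else NULL * run)
            i = j
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

print(null_suppress("XYZ00000MCKKW"))  # XYZ|5MCKKW: 13 bytes down to 10
```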

Page 12: Lecture 10: data compression

Substitution & Dictionary Methods
LIMITATIONS
Needs 3 or more consecutive nulls… or else expansion might result instead of compression
An appropriate Ic must be defined – one which does not occur in the data stream

Page 13: Lecture 10: data compression

Substitution & Dictionary Methods
Run-Length Encoding (RLE)
A generalized null suppression technique
Identifies repeated characters in the data stream
Format: Ic – repeating character – count
Example: ABBCCDDDDDDDDDEEFGGGGG
will be encoded into… ABBCCIcD9EEFIcG5
Savings from 22 bytes to 14 bytes (see the sketch below)
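A minimal Python sketch of the Ic – character – count variant above (the '|' marker and the single-character count are illustrative assumptions):

```python
IC = "|"  # stand-in for the indicator character Ic, assumed absent from the data

def rle_encode(data: str) -> str:
    """Runs of identical characters become Ic + character + count."""
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                   # measure the run
        run = j - i
        # the 3-byte code only pays off for runs longer than 3
        out.append(f"{IC}{data[i]}{run}" if run > 3 else data[i] * run)
        i = j
    return "".join(out)

coded = rle_encode("ABBCCDDDDDDDDDEEFGGGGG")
print(coded, len(coded))  # ABBCC|D9EEF|G5 14  (down from 22 bytes)
```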

Page 14: Lecture 10: data compression

Run-Length Encoding (RLE)
Other variants of the RLE
• Some data contain a sequence of identical bytes
• The RLE technique replaces these runs of data using a marker and a counter that indicates the number of occurrences
For instance:
Uncompressed data: AAAAOOOOOOOOBBBBCCCDCC
Compressed data: A#4 O#8 B#4 C C C D C C
The # acts as the marker, followed by a number indicating the number of occurrences
This example shows that each run of code is compressed to 3 bytes (i.e. A and # and 4 = 3 bytes)
[Steinmetz and Nahrstedt, 2002]
• RLE can also be coded using 2 bytes (http://www.eee.bham.ac.uk/WoolleySI/All7/run_1.htm)
• The first byte indicates the number of occurrences, while the second indicates the data
For instance:
Uncompressed data: AAAAOOOOOOOOBBBBCCCDCC
Compressed data: 4A 8O 4B C C C D C C

Page 15: Lecture 10: data compression

Run-Length Encoding (RLE) – (2)
Other variants of the RLE
• As a result of this, RLE manages to compress the data down a bit:
The original data = 22 bytes (AAAAOOOOOOOOBBBBCCCDCC)
RLE compresses it down to 12 bytes (4A 8O 4B C C C D C C)
• RLE compresses more efficiently if the run of strings is long
e.g. AAAAAAAAAAAAAAAAAAAA becomes 20A
Instead of 20 bytes… the storage is brought down to just 2 bytes (1 byte for ‘20’ and 1 byte for ‘A’)

Measuring Compression Ratio
• Basically, the RLE compression ratio can be measured with the formula: (original size / compressed size) : 1
• For the above example… the compression ratio is 22/12 : 1, which is almost 2:1

Page 16: Lecture 10: data compression

Run-Length Encoding (RLE) – For repetitive data sources
• Consider this: 1, 3, 4, 1, 3, 4, 1, 3, 4, 1, 3, 4
• RLE: 4(1,3,4) – translates to 4 occurrences of 1, 3 and 4

Run-Length Encoding (RLE) – Compressed by Differencing
• Consider this: 1, 2, 4, 5, 7, 8, 10
• RLE can also take the differences between adjacent values and encode them (see the sketch below)
• In this case… for example… 1 and 2 = 1; 2 and 4 = 2; 4 and 5 = 1… and so on
• The respective compressed differences would be 1, 2, 1, 2, 1, 2
• Further compression: 3(1,2)
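A minimal sketch of the differencing step (plain Python; the further run-length coding of the deltas is left out):

```python
def difference_encode(values):
    """Keep the first value in full; encode every later value as the
    difference from its left neighbour."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

print(difference_encode([1, 2, 4, 5, 7, 8, 10]))
# [1, 1, 2, 1, 2, 1, 2] -> the first value, then the deltas 3(1,2)
```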

Page 17: Lecture 10: data compression

Data Compression
• Must be sure that there are significantly long runs of repeating data, so that compression is achieved instead of EXPANSION!!!
For instance: ROTI CENAI YUGOSLAV – 17 bytes (ignoring spaces)
RLE: 2(A), 1(C), 1(E), 1(G), 2(I), 1(L), 1(N), 2(O), 1(R), 1(S), 1(T), 1(U), 1(V), 1(Y) – 28 bytes
Use RLE only when you know which variant you are using, and on what kinds of data!!!

Page 18: Lecture 10: data compression


Page 19: Lecture 10: data compression

Statistical Methods
Huffman Codes
A form of statistical encoding that exploits the overall distribution or frequency of symbols in a source
Produces an optimal coding for a passage-based source by assigning the fewest bits to encode each symbol given the probability of its occurrence
e.g. if a passage-based content has a lot of the character “e”, then it would make sense to replace it with the smallest sequence of bits possible. Other characters can use their normal representation
Refer to the HUFFMAN tree

Page 20: Lecture 10: data compression

Data Compression
Huffman Coding
• This technique is based on the probabilistic distribution of symbols or characters
• Characters with the most occurrences are assigned the shortest codes
• The code length increases as the frequency of occurrence decreases
• Huffman codes are determined by successively constructing a binary tree. The leaves of the tree represent the characters to be coded
• Characters are arranged in descending order of probability
• The tree is built further by repeatedly adding the two lowest probabilities and re-sorting
• This process goes on until the sum of the probabilities of the last two symbols is 1
• Once this process is complete, a Huffman binary tree can be generated
• The resultant code words are then formed by tracing the tree path from the root node to the end-nodes after assigning 0s and 1s to the branches (this assignment is arbitrary… not according to any order, so different assignments yield different, but equally optimal, Huffman codes)
– If we do not obtain a probability of 1 for the last two symbols, most likely there is a mistake in the process. This probability of 1, which forms the last symbol, is the root of the binary tree

Page 21: Lecture 10: data compression

Data Compression
Huffman Coding – (2)
• An illustration is as follows
Let’s say you have this particular probabilistic distribution:
A = 0.10; B = 0.35; C = 0.16; D = 0.2; E = 0.19
1. The characters are listed in order of decreasing probability:
B = 0.35; D = 0.2; E = 0.19; C = 0.16; A = 0.10
2. The TWO chars. with the LOWEST probs. are combined:
A = 0.10 and C = 0.16 → AC = 0.26
3. Re-sort… and the new list is:
B = 0.35; AC = 0.26; D = 0.2; E = 0.19
4. Then repeat what was done in step 2 (take the two lowest probs. and combine them):
D = 0.2 and E = 0.19 → DE = 0.39
5. Re-sort the list again and we get:
DE = 0.39; B = 0.35; AC = 0.26

Page 22: Lecture 10: data compression

Data Compression
Huffman Coding – (3)
6. Again… take the lowest two probs. and repeat the process:
B = 0.35 and AC = 0.26 → BAC = 0.61
7. Re-sort… and you get the new list:
BAC = 0.61; DE = 0.39
8. Finally, BAC and DE are combined… and you get BACDE = 1.0
• From all the combinations of probabilistic values that you’ve done… a binary tree is constructed
• Each edge from node to sub-node is assigned either a 1 or a 0

Page 23: Lecture 10: data compression

Data Compression
Huffman Coding – (4)

Resultant Binary Tree
[Tree: the root P(BACDE) = 1.0 branches into P(BAC) = 0.61 and P(DE) = 0.39; P(BAC) branches into P(B) = 0.35 and P(AC) = 0.26; P(AC) branches into P(C) = 0.16 and P(A) = 0.10; P(DE) branches into P(D) = 0.2 and P(E) = 0.19; each branch is labelled 0 or 1]

Huffman Code for each Character

Character   Probability   Code word
A           0.10          011
B           0.35          00
C           0.16          010
D           0.20          10
E           0.19          11
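A minimal Python sketch of this construction, using a heap to repeatedly merge the two lowest probabilities. Since the 0/1 branch assignment is arbitrary, the exact codewords may differ from the table above, but the code lengths agree:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Repeatedly merge the two lowest-probability nodes into one."""
    tick = count()  # tie-breaker so the heap never compares the dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)              # two lowest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}          # 0 on one branch
        merged.update({s: "1" + w for s, w in c2.items()})    # 1 on the other
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

print(huffman_codes({"A": 0.10, "B": 0.35, "C": 0.16, "D": 0.20, "E": 0.19}))
# B, D and E get 2-bit codes; A and C get 3-bit codes, as in the table
```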

Page 24: Lecture 10: data compression

Text and Numeric Compression (cont)
Huffman Code
Encoding “this is an example of a huffman tree”
[Image: the corresponding Huffman tree]

Page 25: Lecture 10: data compression

Text and Numeric Compression (cont)
3) LZW compression (Lempel–Ziv–Welch)
Based on recognizing common string patterns
Basic strategy: replace strings in a file with bit codes, rather than replacing individual characters with bit codes
Greater compression rate than both previous methods (see the sketch below)
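A minimal sketch of the LZW encoder (the textbook formulation; the test string is the classic TOBEORNOT example, not from the slides):

```python
def lzw_encode(data: str):
    """Grow a dictionary of seen strings; emit codes, not characters."""
    table = {chr(i): i for i in range(256)}  # start with all single bytes
    w, out = "", []
    for ch in data:
        if w + ch in table:
            w += ch                      # keep extending the current match
        else:
            out.append(table[w])         # emit code for the longest known string
            table[w + ch] = len(table)   # and learn the new, longer string
            w = ch
    if w:
        out.append(table[w])
    return out

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
# [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]
```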

Page 26: Lecture 10: data compression

LZW

Page 27: Lecture 10: data compression

Lossy Data Compression and Applications

Here, we will be looking at:
JPEG (image)
Motion JPEG (video)
MPEG (video)
MP3 (audio)

Page 28: Lecture 10: data compression

Joint Photographic Experts Group
JPEG for short
Extensions: .jpg, .jpeg, .jpe, .jif, .jfif, .jfi
A lossy algorithm where the reconstructed image has less information than the original
However, you won’t miss the ‘missing’ information that much, since:
The human visual system pays less attention to colour information as opposed to brightness information
The human visual system mostly does not notice the details in parts of an image that are “busy” or “high-frequency”
Therefore, JPEG compression is suitable for images with smooth variations of tone and color (i.e. such images will compress well!)

Page 29: Lecture 10: data compression

Joint Photographic Experts Group
High frequency vs. low frequency
[Images: a high-frequency (“busy”) example and a low-frequency (smooth) example]

Page 30: Lecture 10: data compression

Joint Photographic Experts Group
How the JPEG algorithm works
1. An image is divided into 8×8 pixel blocks
2. The Discrete Cosine Transform (DCT) of each block is calculated. This converts the image from the spatial domain to the frequency domain – resulting in DCT coefficients
3. A quantization process rounds off the coefficients (according to some quantization matrix which determines the quality of the resulting image) – it’s also in this step that you can produce LOSSLESS JPEG
4. A lossless compression technique is used to encode the coefficients of the 8×8 blocks (e.g. RLE)
5. For decompression… the process is reversed

Page 31: Lecture 10: data compression

Joint Photographic Experts Group
1. The 8×8 blocks
Original values are in [0, 255]. The resulting image matrix g is obtained after shifting (subtracting 128 from each of the elements)

Page 32: Lecture 10: data compression

Joint Photographic Experts Group
2. DCT is performed on the 8×8 block (sub-image)
The scary formula looks like this:

$$G_{u,v} = \tfrac{1}{4}\,\alpha(u)\,\alpha(v)\sum_{x=0}^{7}\sum_{y=0}^{7} g_{x,y}\,\cos\!\left[\tfrac{(2x+1)u\pi}{16}\right]\cos\!\left[\tfrac{(2y+1)v\pi}{16}\right],\qquad \alpha(w)=\begin{cases}1/\sqrt{2} & w=0\\ 1 & \text{otherwise}\end{cases}$$

resulting in 64 coefficients

Page 33: Lecture 10: data compression

Joint Photographic Experts Group
2. DCT (continued)
Notice that the upper-left corner (i.e. the DC coefficient) is quite big in magnitude. The upper-left entries are the lower-frequency components (the ones that we are more sensitive to)
The lower-right entries are the higher-frequency parts (the ones that we are not that sensitive to)

Page 34: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization
Compression is done here…
A bunch of numbers falling within a certain range will be assigned a specific value
Therefore, the quantization table/matrix defines just this… an 8×8 matrix of step sizes (or quantums) – (NOTE: if ALL the values in the quantization table are 1, this is when JPEG becomes LOSSLESS)
This process takes advantage of the human visual system’s ability to see small differences in brightness over a relatively large area… which means that we are good at making sense of low-frequency images
But we are bad at differentiating exact brightness variations over small areas…

Page 35: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
Therefore, the amount of information in the high-frequency components can be reduced (or even removed)
Done by dividing each component in the frequency domain (i.e. matrix G produced by the DCT) by a constant for that component, and then rounding to the nearest integer

Page 36: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
An example quantization table/matrix as specified in the original JPEG standard (the luminance table from Annex K):

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

Page 37: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
The formula: $B_{j,k} = \operatorname{round}\!\left(G_{j,k} / Q_{j,k}\right)$ for $j,k = 0,\dots,7$
G is the unquantized DCT coefficients; Q the quantization matrix from the previous slide; and the result B the quantized DCT matrix
In short, each element of G is divided by the corresponding component of Q (hence the indexes j and k) and rounded to the nearest integer (see the sketch below)
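A minimal numpy/scipy sketch of steps 1–3 (the smooth test block is an arbitrary illustration; Q is the luminance table from two slides back):

```python
import numpy as np
from scipy.fftpack import dct

# Luminance quantization matrix from the original JPEG standard (Annex K)
Q = np.array([[16, 11, 10, 16,  24,  40,  51,  61],
              [12, 12, 14, 19,  26,  58,  60,  55],
              [14, 13, 16, 24,  40,  57,  69,  56],
              [14, 17, 22, 29,  51,  87,  80,  62],
              [18, 22, 37, 56,  68, 109, 103,  77],
              [24, 35, 55, 64,  81, 104, 113,  92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103,  99]])

def jpeg_quantize(block):
    """block: 8x8 array of pixel values in [0, 255]."""
    g = block.astype(float) - 128                                # step 1: shift
    G = dct(dct(g, axis=0, norm='ortho'), axis=1, norm='ortho')  # step 2: 2-D DCT
    return np.round(G / Q).astype(int)                           # step 3: B = round(G/Q)

smooth = 100 + 8 * np.add.outer(np.arange(8), np.arange(8))  # a smooth gradient block
print(jpeg_quantize(smooth))  # the high-frequency entries quantize to zero
```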

Page 38: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
The quantized DCT matrix B is as follows:

−26  −3  −6   2   2  −1   0   0
  0  −2  −4   1   1   0   0   0
 −3   1   5  −1  −1   0   0   0
 −3   1   2  −1   0   0   0   0
  1   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0

Notice that the higher-frequency components (which we are not sensitive to) are rounded to ZERO, and the rest become small +/− integers
These require less space to store…

Page 39: Lecture 10: data compression

Joint Photographic Experts Group
3. Zig-zagging and Lossless Compression
The matrix B is then arranged and coded in a zig-zag manner… i.e. B(0,0), B(0,1), B(1,0), B(2,0), B(1,1), B(0,2), B(0,3), B(1,2) and so on:

−26, −3, 0, −3, −2, −6, 2, −4, 1, −3, 1, 1, 5, 1, 2, −1, 1, −1, 2, 0, 0, 0, 0, 0, −1, −1, 0, 0, … (all remaining entries are 0)

All of these values are stored in a vector (i.e. a one-dimensional array) and then coded using DPCM, Huffman and run-length encoding (a sketch of the zig-zag ordering follows below)
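A small sketch that generates this zig-zag visiting order (even diagonals run bottom-left to top-right, odd diagonals the reverse):

```python
def zigzag_order(n=8):
    """(row, col) pairs in JPEG zig-zag order for an n x n block."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],                 # which diagonal
                                  rc[0] if (rc[0] + rc[1]) % 2   # odd: row ascending
                                  else rc[1]))                   # even: col ascending

print(zigzag_order()[:8])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2)]
```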

Page 40: Lecture 10: data compression

The JPEG compression/decompression process


http://en.wikipedia.org/wiki/File:JPEG_process.svg

Page 41: Lecture 10: data compression

Joint Photographic Experts Group
So how can JPEG become lossless again???
If you refer to the 1st slide of the Quantization step (step 3), it has something to do with the quantization matrix having all values of 1!!!
* So please do the maths. Oh, and you don’t need to do the rounding either…

Page 42: Lecture 10: data compression

Audio Compression
The choice of sampling rates (frequency and amplitude) is very important in handling the size of an audio file
Higher sampling rates mean higher fidelity, but cost more in storage space and transmission time
A widely used method is ADPCM (Adaptive Differential Pulse Code Modulation)

Page 43: Lecture 10: data compression

Audio Compression (cont)
Adaptive Differential Pulse Code Modulation (ADPCM)
Pulse code modulation (PCM) is a basic method for quantizing audio information
Differential PCM compresses the number of bits needed to represent the data by storing the first PCM sample in its entirety and all succeeding samples as differences from the previous one
Adaptive DPCM (encoder) takes this scheme and divides the values of the DPCM samples by an appropriate coefficient to produce a smaller value to store

Page 44: Lecture 10: data compression

Audio Compression (cont)
Adaptive Differential Pulse Code Modulation (ADPCM)
In playback, the decoder multiplies the compressed data by that coefficient to reproduce the proper differential value (see the sketch below)
Works very well with speech, but is less effective for music
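A deliberately simplified sketch of the DPCM/ADPCM idea: it uses a fixed division coefficient, whereas real ADPCM adapts the coefficient to the signal and predicts from the reconstructed samples:

```python
def dpcm_encode(samples):
    """Differential PCM: first sample in full, then differences."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def adpcm_encode(samples, coeff=4):
    """Divide each difference by a coefficient so it fits in fewer bits.
    Lossy: the rounding discards information."""
    diffs = dpcm_encode(samples)
    return [diffs[0]] + [round(d / coeff) for d in diffs[1:]]

def adpcm_decode(codes, coeff=4):
    out = [codes[0]]
    for c in codes[1:]:
        out.append(out[-1] + c * coeff)  # multiply back, then accumulate
    return out

pcm = [1000, 1010, 1021, 1017, 1004]
print(adpcm_decode(adpcm_encode(pcm)))  # close to, but not exactly, the input
```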

Page 45: Lecture 10: data compression

Audio Compression (cont)
Perceptual Noise Shaping
The approach used by the MP3 audio format
The MP3 format helps reduce the number of bytes in a song without hurting the quality of the song’s sound
The goal of the MP3 format: compress a CD-quality song by a factor of 10 to 14 without noticeably affecting the CD-quality sound
With MP3, a 32-megabyte (MB) song on a CD compresses down to about 3 MB!!!
The MP3 format uses characteristics of the human ear to design the compression algorithm

Page 46: Lecture 10: data compression

Audio Compression (cont)

Perceptual Noise Shaping (cont)

Page 47: Lecture 10: data compression

LECTURE 11: DATA COMPRESSION (2)
EVI INDRIASARI MANSOR
Email: [email protected]
Tel ext: 1741

Page 48: Lecture 10: data compression

Outline

Basics of Data Compression
Text & Numeric Compression
Image Compression
Audio Compression
Video Compression
Data Security Through Encryption

Page 49: Lecture 10: data compression

Learning Outcomes

Differentiate between the lossless and the lossy data compression process


Page 50: Lecture 10: data compression

Video Compression
Transmitting standard full-screen color imagery as video at 30 fps requires a data rate of nearly 28 MB per second – video compression is absolutely essential!!!
One idea is to reduce the frame rate (from 30 fps to 15 fps), but this sacrifices a lot of the video’s motion
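To see where the “nearly 28 MB per second” figure above comes from (assuming a 640×480 frame at 24-bit color): 640 × 480 pixels × 3 bytes/pixel × 30 frames/s = 27,648,000 bytes/s ≈ 27.6 MB per second.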

Page 51: Lecture 10: data compression

Video Compression (cont)
Intraframe (spatial) compression
Reduces the redundant information contained within a single image or frame
It is not sufficient for achieving the kinds of data rates essential for transmitting video in practical applications

Page 52: Lecture 10: data compression

Video Compression (cont)
Interframe (temporal) compression
The idea is that much of the data in video images is repeated frame after frame
This technique eliminates the redundancy of information between frames
Must identify the key frame (master frame)
Key frame: the basis for deciding how much motion or how many changes take place in succeeding frames

Page 53: Lecture 10: data compression

Video Compression (cont)
Interframe (temporal) compression
Assumes that the background remains (sky, road, and grass) but only the car is moving
The first frame is stored as the key frame, and it has enough information to be reconstructed independently

Page 54: Lecture 10: data compression

Video Compression (cont)
MPEG (Moving Picture Experts Group) Compression
Prediction approach (predicted pictures = P pictures; intra pictures = I pictures; bi-directional pictures = B pictures)
Some compressed frames are the difference results of predictions based on past frames used as a reference, while others are based on both past and future frames from the sequence
[Diagram: a sequence of I (intra), P (predicted) and B (bi-directional) pictures]

Page 55: Lecture 10: data compression

Video Compression (cont)
Spatial vs. temporal compression

Page 56: Lecture 10: data compression

Data Security Through Encryption
Encryption and data security
Cryptography is the art and science of keeping messages secret
Encryption techniques convert data into a secret code for transmission
The process of retrieving the original message at the receiver is called decryption

[Diagram: original plaintext “Second bridge on monday” → encryption → ciphertext “rmnsroklxswrewtgdln” (transmit this) → decryption → recovered plaintext “Second bridge on monday”]

Page 57: Lecture 10: data compression

Data Security Through Encryption (cont)
Encryption keys
Keys are essential information – usually numerical parameter(s) – needed for encryption and/or decryption algorithms
Encryption keys are used to encode plaintext as ciphertext
Decryption keys are used to decode ciphertext and recover the original plaintext
Decryption keys are sometimes discovered by brute force methods employing computers to search large potential key spaces

Page 58: Lecture 10: data compression

Data Security Through Encryption (cont)
Symmetric or Secret Key Ciphers
Secret key ciphers use a secret key (or set of keys) for both encryption and decryption (see the sketch below)
The secret key must be transferred securely in order for secret key methods to be secure
The Data Encryption Standard (DES) is a US government sponsored secret key cipher. DES uses a 56-bit key
The International Data Encryption Algorithm (IDEA) has been proposed to replace DES. It uses a 128-bit key
Longer keys make it more difficult to discover the secret key by brute force
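A toy sketch of the symmetric idea: the same key both encrypts and decrypts. This is a simple XOR cipher for illustration only and shares nothing internally with DES or IDEA:

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte with the (repeating) key; XOR is its own inverse,
    so the SAME call both encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"1741"                                # the shared secret key
ciphertext = xor_cipher(b"Second bridge on monday", key)
print(xor_cipher(ciphertext, key))           # b'Second bridge on monday'
```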

Page 59: Lecture 10: data compression

Data Security Through Encryption (cont)
Asymmetric or Public Key Ciphers
The first practical public key algorithm was published by Rivest, Shamir, and Adleman in 1978 and is known as RSA (for their last names)
Public key ciphers employ an algorithm with two keys – a public key and a private key
A sender looks up the recipient’s public key and uses it to encode a message (see the sketch below)
The recipient then decodes the message with his or her private key (this private key is necessary to decode the message)
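A toy RSA sketch with tiny primes (the classic 61/53 textbook numbers; real keys use primes of 1024+ bits, so this offers no actual security):

```python
p, q = 61, 53                 # toy primes
n = p * q                     # 3233: the modulus, part of both keys
phi = (p - 1) * (q - 1)       # 3120
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # 2753: private exponent (pow(..., -1, m) needs Python 3.8+)

m = 65                        # the message, encoded as a number < n
c = pow(m, e, n)              # sender encrypts with the PUBLIC key -> 2790
print(pow(c, d, n))           # receiver decrypts with the PRIVATE key -> 65
```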

Page 60: Lecture 10: data compression

Data Security Through Encryption (cont)
Asymmetric or Public Key Ciphers Illustrated
[Diagram: (1) the receiver publishes the public key and (2) locks up the private key; (3) the sender encrypts the original message using the receiver’s public key; (4) the ciphertext is transmitted; (5) the receiver decrypts the ciphertext using the private key and recovers the message]

Page 61: Lecture 10: data compression

Data Security Through Encryption (cont)
More on Public Key Methods
No attempt is made to keep secret the actual encryption and decryption algorithms of public key methods – security depends only on the recipient knowing his or her private key
Public key ciphers are more secure than secret key ciphers, but are not as efficient since they require longer keys and more computing in the encryption and decryption processes

Page 62: Lecture 10: data compression

Data Security Through Encryption (cont)
More on Public Key Methods (cont)
For the sake of efficiency, sometimes secret key encryption is used and the secret key is communicated employing public key methods
The combination of a secret-key-encoded message and a public-key-encoded value of the secret key is called a digital envelope

Page 63: Lecture 10: data compression

Data Security Through Encryption (cont)
Authentication
The process used to verify the identity of a respondent is called authentication
Authentication is very important for electronic commerce and other network transactions
Authentication exploits the symmetry of public and private keys

Page 64: Lecture 10: data compression

Data Security Through Encryption (cont)
Authentication (cont)
To authenticate that a person is who they say they are (see the sketch below):
Send that person a nonsense message and ask them to encode it with their private key and return it to you
When the message is returned, if the person is who they claim to be, you should be able to recover your nonsense message using their public key (which presumably you know)
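A sketch of this challenge–response using the toy RSA numbers from the earlier sketch; encoding with the private key and decoding with the public key is just the encryption scheme run in reverse:

```python
n, e, d = 3233, 17, 2753               # the toy RSA key pair from before

nonce = 1234                           # your nonsense message
response = pow(nonce, d, n)            # respondent encodes it with the PRIVATE key
print(pow(response, e, n) == nonce)    # decode with the PUBLIC key -> True
```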

Page 65: Lecture 10: data compression

Summary
Compressing data means reducing the effective size of a data file for storage or transmission
Particular paired compression/decompression methods are called codecs
Codecs that cannot reproduce the original file exactly are called lossy methods; those that reproduce the original exactly are called lossless methods
Text and numbers usually use lossless methods
Image, video and sound codecs are usually lossy
Encryption techniques are used to encode messages for secure transmission

Page 66: Lecture 10: data compression

Summary (cont)
The two primary encryption/decryption methods are:
Secret key (symmetric key) ciphers
Public key (asymmetric key) ciphers
Public key ciphers are more secure, but secret key ciphers are more efficient
Public key encryption is used for authentication over computer networks