Lecture 10: Data Compression


Page 1: Lecture 10: data compression

LECTURE 10: DATA COMPRESSION
EVI INDRIASARI MANSOR
Email: [email protected]
Tel ext: 1741

Page 2: Lecture 10: data compression

Outline

Basics of Data Compression
Text & Numeric Compression
Image Compression
Audio Compression
Video Compression
Data Security Through Encryption

Page 3: Lecture 10: data compression

Basics of Data Compression

Digital compression concepts
Compression techniques are used to replace a file with another that is smaller
Compressed data requires less storage and can be transmitted at a faster rate
Decompression techniques expand the compressed file to recover the original data – either exactly or in facsimile
A pair of compression / decompression techniques that work together is called a codec for short

Page 4: Lecture 10: data compression

Motivations
Basically, data compression deals with reducing the number of bits used to represent a certain amount of data by removing redundancy
Motivations:
1. Compressed data is smaller and requires less (physical) storage [hence allowing more things to be stored at a lower cost]
2. Smaller amounts of data can be processed faster
3. Smaller data can be transmitted faster
4. Computing resource requirements (such as memory) can be minimized

Page 5: Lecture 10: data compression

Types of data compression
1. Logical compression
Generally for databases. E.g. instead of allocating a large field size for ‘faculty name’ in a university database, a reference number can be used instead (something like the Color Lookup Table scheme)
2. Physical compression
Deals with removal of redundancy in data. This chapter will deal only with this type of compression

Page 6: Lecture 10: data compression

Basics of Data Compression (cont)

Compress / Decompress – CoDec
Main function of a CODEC: to reduce the redundancy in data (CODEC can also stand for COde / DECode)
How? By replacing definable patterns with shorter sequences of symbols

[Diagram: uncompressed data → compression / coder → compressed data → decompression / decoder; the coder and decoder together form the CODEC]

Page 7: Lecture 10: data compression

LOSSY compression
1. What is it? (again…)
The compressed (and then decompressed) data is not an exact match of the source
Enables better compression performance BUT… is only suitable when data loss during compression has no significant effect
E.g. when compressing an image into JPEG or TIFF. Some information loss is imperceptible to the human visual system…

Page 8: Lecture 10: data compression

LOSSLESS compression
1. What is it? (again…)
Source data can be reconstructed (decompressed) exactly from the compressed data
Required, for example, when the data being sent needs to be precise and the slightest loss would be detrimental
E.g. processing telemetry data from a satellite… or compressing text files

Page 9: Lecture 10: data compression

Concept of Models and Coding

A simple ‘formula’ to remember: DATA COMPRESSION = MODELING + CODING
This scheme of modeling and coding is used to transform an input stream of symbols into output codes!
A MODEL is a collection of data and rules:
Fixed model – predefined rules are used for compression
Adaptive model – adjustments are made to suit the pattern of the data at run-time; normally capable of better compression performance…
A CODER implements the algorithm that transforms the input data into output data (based on rules/information provided by the MODEL)

Page 10: Lecture 10: data compression

Lossless Data Compression and Applications

1. Here… we’ll have a look at some specific algorithms:
Substitution & Dictionary Methods – Null Suppression, Run-Length Encoding
Statistical Method – Huffman Coding

Page 11: Lecture 10: data compression

Substitution & Dictionary Methods

NULL SUPPRESSION
Scans a stream of data for sequences of the NULL character
These nulls are replaced with a special pair of characters consisting of:
1. An indicator character (Ic); and
2. A count
Example: XYZØØØØØMCKKW (where the Ø are nulls)
The encoded output will be: XYZIc5MCKKW (savings from 13 bytes to 10 bytes; a sketch follows below)
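A minimal Python sketch of this scheme (here '0' stands in for the null character Ø and '|' for the indicator character Ic; both stand-ins are assumptions for illustration, and the count is kept to a single digit):

```python
NULL = "0"   # stand-in for the null character (shown as Ø above)
IC = "|"     # stand-in for the indicator character Ic; must not occur in the data

def null_suppress(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == NULL:
            j = i
            while j < len(data) and data[j] == NULL:
                j += 1               # measure the run of nulls
            run = j - i
            # the 2-byte (Ic, count) pair only pays off for runs of 3+
            out.append(f"{IC}{run}" if run >= 3 else NULL * run)
            i = j
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

print(null_suppress("XYZ00000MCKKW"))  # XYZ|5MCKKW: 13 bytes down to 10
```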

Page 12: Lecture 10: data compression

Substitution & Dictionary Methods
LIMITATIONS
Needs 3 or more consecutive nulls… or else expansion might result instead of compression
An appropriate Ic must be defined – one which does not occur in the data stream

Page 13: Lecture 10: data compression

Substitution & Dictionary Methods
Run-Length Encoding (RLE)
A generalized null suppression technique
Identifies repeated characters in the data stream
Format: Ic – repeating character – count
Example: ABBCCDDDDDDDDDEEFGGGGG
will be encoded into… ABBCCIcD9EEFIcG5
Savings from 22 bytes to 14 bytes (see the sketch below)
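A minimal Python sketch of the Ic – character – count variant above (the '|' marker and the single-character count are illustrative assumptions):

```python
IC = "|"  # stand-in for the indicator character Ic, assumed absent from the data

def rle_encode(data: str) -> str:
    """Runs of identical characters become Ic + character + count."""
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                   # measure the run
        run = j - i
        # the 3-byte code only pays off for runs longer than 3
        out.append(f"{IC}{data[i]}{run}" if run > 3 else data[i] * run)
        i = j
    return "".join(out)

coded = rle_encode("ABBCCDDDDDDDDDEEFGGGGG")
print(coded, len(coded))  # ABBCC|D9EEF|G5 14  (down from 22 bytes)
```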

Page 14: Lecture 10: data compression

Run-Length Encoding (RLE)
Other variants of the RLE
• Some data contain a sequence of identical bytes
• The RLE technique replaces these runs of data using a marker and a counter that indicates the number of occurrences
For instance:
Uncompressed data: AAAAOOOOOOOOBBBBCCCDCC
Compressed data: A#4 O#8 B#4 C C C D C C
The # acts as the marker, followed by a number indicating the number of occurrences
This example shows that each run of code is compressed to 3 bytes (i.e. A and # and 4 = 3 bytes)
[Steinmetz and Nahrstedt, 2002]
• RLE can also be coded using 2 bytes (http://www.eee.bham.ac.uk/WoolleySI/All7/run_1.htm)
• The first byte indicates the number of occurrences, while the second indicates the data
For instance:
Uncompressed data: AAAAOOOOOOOOBBBBCCCDCC
Compressed data: 4A 8O 4B C C C D C C

Page 15: Lecture 10: data compression

Run-Length Encoding (RLE) – (2)
Other variants of the RLE
• As a result of this, RLE manages to compress the data down a bit:
The original data = 22 bytes (AAAAOOOOOOOOBBBBCCCDCC)
RLE compresses it down to 12 bytes (4A 8O 4B C C C D C C)
• RLE compresses more efficiently if the run of strings is long
e.g. AAAAAAAAAAAAAAAAAAAA becomes 20A
Instead of 20 bytes… the storage is brought down to just 2 bytes (1 byte for ‘20’ and 1 byte for ‘A’)

Measuring Compression Ratio
• Basically, the RLE compression ratio can be measured with the formula: (original size / compressed size) : 1
• For the above example… the compression ratio is 22/12 : 1, which is almost 2:1

Page 16: Lecture 10: data compression

Run-Length Encoding (RLE) – For repetitive data sources
• Consider this: 1, 3, 4, 1, 3, 4, 1, 3, 4, 1, 3, 4
• RLE: 4(1,3,4) – translates to 4 occurrences of 1, 3 and 4

Run-Length Encoding (RLE) – Compressed by Differencing
• Consider this: 1, 2, 4, 5, 7, 8, 10
• RLE can also take the differences between adjacent values and encode them (see the sketch below)
• In this case… for example… 1 and 2 = 1; 2 and 4 = 2; 4 and 5 = 1… and so on
• The respective compressed differences would be 1, 2, 1, 2, 1, 2
• Further compression: 3(1,2)
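A minimal sketch of the differencing step (plain Python; the further run-length coding of the deltas is left out):

```python
def difference_encode(values):
    """Keep the first value in full; encode every later value as the
    difference from its left neighbour."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

print(difference_encode([1, 2, 4, 5, 7, 8, 10]))
# [1, 1, 2, 1, 2, 1, 2] -> the first value, then the deltas 3(1,2)
```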

Page 17: Lecture 10: data compression

Data Compression
• Must be sure that there are significantly long runs of repeating data, so that compression is achieved instead of EXPANSION!!!
For instance: ROTI CENAI YUGOSLAV – 17 bytes (ignoring spaces)
RLE: 2(A), 1(C), 1(E), 1(G), 2(I), 1(L), 1(N), 2(O), 1(R), 1(S), 1(T), 1(U), 1(V), 1(Y) – 28 bytes
Use RLE only when you know which variant you are using, and on what kinds of data!!!

Page 18: Lecture 10: data compression


Page 19: Lecture 10: data compression

Statistical Methods
Huffman Codes
A form of statistical encoding that exploits the overall distribution or frequency of symbols in a source
Produces an optimal coding for a passage-based source by assigning the fewest bits to encode each symbol given the probability of its occurrence
e.g. if a passage-based content has a lot of the character “e”, then it would make sense to replace it with the smallest sequence of bits possible. Other characters can use their normal representation
Refer to the HUFFMAN tree

Page 20: Lecture 10: data compression

Data Compression
Huffman Coding
• This technique is based on the probabilistic distribution of symbols or characters
• Characters with the most occurrences are assigned the shortest codes
• The code length increases as the frequency of occurrence decreases
• Huffman codes are determined by successively constructing a binary tree. The leaves of the tree represent the characters to be coded
• Characters are arranged in descending order of probability
• The tree is built further by repeatedly adding the two lowest probabilities and re-sorting
• This process goes on until the sum of the probabilities of the last two symbols is 1
• Once this process is complete, a Huffman binary tree can be generated
• The resultant code words are then formed by tracing the tree path from the root node to the end-nodes after assigning 0s and 1s to the branches (this assignment is arbitrary… not according to any order, so different assignments yield different, but equally optimal, Huffman codes)
– If we do not obtain a probability of 1 for the last two symbols, most likely there is a mistake in the process. This probability of 1, which forms the last symbol, is the root of the binary tree

Page 21: Lecture 10: data compression

Data Compression
Huffman Coding – (2)
• An illustration is as follows
Let’s say you have this particular probabilistic distribution:
A = 0.10; B = 0.35; C = 0.16; D = 0.2; E = 0.19
1. The characters are listed in order of decreasing probability:
B = 0.35; D = 0.2; E = 0.19; C = 0.16; A = 0.10
2. The TWO chars. with the LOWEST probs. are combined:
A = 0.10 and C = 0.16 → AC = 0.26
3. Re-sort… and the new list is:
B = 0.35; AC = 0.26; D = 0.2; E = 0.19
4. Then repeat what was done in step 2 (take the two lowest probs. and combine them):
D = 0.2 and E = 0.19 → DE = 0.39
5. Re-sort the list again and we get:
DE = 0.39; B = 0.35; AC = 0.26

Page 22: Lecture 10: data compression

Data Compression
Huffman Coding – (3)
6. Again… take the lowest two probs. and repeat the process:
B = 0.35 and AC = 0.26 → BAC = 0.61
7. Re-sort… and you get the new list:
BAC = 0.61; DE = 0.39
8. Finally, BAC and DE are combined… and you get BACDE = 1.0
• From all the combinations of probabilistic values that you’ve done… a binary tree is constructed
• Each edge from node to sub-node is assigned either a 1 or a 0

Page 23: Lecture 10: data compression

Data Compression
Huffman Coding – (4)

Resultant Binary Tree
[Tree: the root P(BACDE) = 1.0 branches into P(BAC) = 0.61 and P(DE) = 0.39; P(BAC) branches into P(B) = 0.35 and P(AC) = 0.26; P(AC) branches into P(C) = 0.16 and P(A) = 0.10; P(DE) branches into P(D) = 0.2 and P(E) = 0.19; each branch is labelled 0 or 1]

Huffman Code for each Character

Character   Probability   Code word
A           0.10          011
B           0.35          00
C           0.16          010
D           0.20          10
E           0.19          11
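A minimal Python sketch of this construction, using a heap to repeatedly merge the two lowest probabilities. Since the 0/1 branch assignment is arbitrary, the exact codewords may differ from the table above, but the code lengths agree:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Repeatedly merge the two lowest-probability nodes into one."""
    tick = count()  # tie-breaker so the heap never compares the dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)              # two lowest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}          # 0 on one branch
        merged.update({s: "1" + w for s, w in c2.items()})    # 1 on the other
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

print(huffman_codes({"A": 0.10, "B": 0.35, "C": 0.16, "D": 0.20, "E": 0.19}))
# B, D and E get 2-bit codes; A and C get 3-bit codes, as in the table
```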

Page 24: Lecture 10: data compression

Text and Numeric Compression (cont)
Huffman Code
Encoding “this is an example of a huffman tree”
[Image: the corresponding Huffman tree]

Page 25: Lecture 10: data compression

Text and Numeric Compression (cont)
3) LZW compression (Lempel–Ziv–Welch)
Based on recognizing common string patterns
Basic strategy: replace strings in a file with bit codes, rather than replacing individual characters with bit codes
Greater compression rate than both previous methods (see the sketch below)
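A minimal sketch of the LZW encoder (the textbook formulation; the test string is the classic TOBEORNOT example, not from the slides):

```python
def lzw_encode(data: str):
    """Grow a dictionary of seen strings; emit codes, not characters."""
    table = {chr(i): i for i in range(256)}  # start with all single bytes
    w, out = "", []
    for ch in data:
        if w + ch in table:
            w += ch                      # keep extending the current match
        else:
            out.append(table[w])         # emit code for the longest known string
            table[w + ch] = len(table)   # and learn the new, longer string
            w = ch
    if w:
        out.append(table[w])
    return out

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
# [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]
```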

Page 26: Lecture 10: data compression

LZW

Page 27: Lecture 10: data compression

Lossy Data Compression and Applications

Here, we will be looking at:
JPEG (image)
Motion JPEG (video)
MPEG (video)
MP3 (audio)

Page 28: Lecture 10: data compression

Joint Photographic Experts Group
JPEG for short
Extensions: .jpg, .jpeg, .jpe, .jif, .jfif, .jfi
A lossy algorithm where the reconstructed image has less information than the original
However, you won’t miss the ‘missing’ information that much, since:
The human visual system pays less attention to colour information as opposed to brightness information
The human visual system mostly does not notice the details in parts of an image that are “busy” or “high-frequency”
Therefore, JPEG compression is suitable for images with smooth variations of tone and color (i.e. such images will compress well!)

Page 29: Lecture 10: data compression

Joint Photographic Experts Group
High frequency vs. low frequency
[Images: a high-frequency (“busy”) example and a low-frequency (smooth) example]

Page 30: Lecture 10: data compression

Joint Photographic Experts Group
How the JPEG algorithm works
1. An image is divided into 8×8 pixel blocks
2. The Discrete Cosine Transform (DCT) of each block is calculated. This converts the image from the spatial domain to the frequency domain – resulting in DCT coefficients
3. A quantization process rounds off the coefficients (according to some quantization matrix which determines the quality of the resulting image) – it’s also in this step that you can produce LOSSLESS JPEG
4. A lossless compression technique is used to encode the coefficients of the 8×8 blocks (e.g. RLE)
5. For decompression… the process is reversed

Page 31: Lecture 10: data compression

Joint Photographic Experts Group
1. The 8×8 blocks
Original values are in [0, 255]. The resulting image matrix g is obtained after shifting (subtracting 128 from each of the elements)

Page 32: Lecture 10: data compression

Joint Photographic Experts Group
2. DCT is performed on the 8×8 block (sub-image)
The scary formula looks like this:

$$G_{u,v} = \tfrac{1}{4}\,\alpha(u)\,\alpha(v)\sum_{x=0}^{7}\sum_{y=0}^{7} g_{x,y}\,\cos\!\left[\tfrac{(2x+1)u\pi}{16}\right]\cos\!\left[\tfrac{(2y+1)v\pi}{16}\right],\qquad \alpha(w)=\begin{cases}1/\sqrt{2} & w=0\\ 1 & \text{otherwise}\end{cases}$$

resulting in 64 coefficients

Page 33: Lecture 10: data compression

Joint Photographic Experts Group
2. DCT (continued)
Notice that the upper-left corner (i.e. the DC coefficient) is quite big in magnitude. The upper-left entries are the lower-frequency components (the ones that we are more sensitive to)
The lower-right entries are the higher-frequency parts (the ones that we are not that sensitive to)

Page 34: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization
Compression is done here…
A bunch of numbers falling within a certain range will be assigned a specific value
Therefore, the quantization table/matrix defines just this… an 8×8 matrix of step sizes (or quantums) – (NOTE: if ALL the values in the quantization table are 1, this is when JPEG becomes LOSSLESS)
This process takes advantage of the human visual system’s ability to see small differences in brightness over a relatively large area… which means that we are good at making sense of low-frequency images
But we are bad at differentiating exact brightness variations over small areas…

Page 35: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
Therefore, the amount of information in the high-frequency components can be reduced (or even removed)
Done by dividing each component in the frequency domain (i.e. matrix G produced by the DCT) by a constant for that component, and then rounding to the nearest integer

Page 36: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
An example quantization table/matrix as specified in the original JPEG standard (the luminance table from Annex K):

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

Page 37: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
The formula: $B_{j,k} = \operatorname{round}\!\left(G_{j,k} / Q_{j,k}\right)$ for $j,k = 0,\dots,7$
G is the unquantized DCT coefficients; Q the quantization matrix from the previous slide; and the result B the quantized DCT matrix
In short, each element of G is divided by the corresponding component of Q (hence the indexes j and k) and rounded to the nearest integer (see the sketch below)
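A minimal numpy/scipy sketch of steps 1–3 (the smooth test block is an arbitrary illustration; Q is the luminance table from two slides back):

```python
import numpy as np
from scipy.fftpack import dct

# Luminance quantization matrix from the original JPEG standard (Annex K)
Q = np.array([[16, 11, 10, 16,  24,  40,  51,  61],
              [12, 12, 14, 19,  26,  58,  60,  55],
              [14, 13, 16, 24,  40,  57,  69,  56],
              [14, 17, 22, 29,  51,  87,  80,  62],
              [18, 22, 37, 56,  68, 109, 103,  77],
              [24, 35, 55, 64,  81, 104, 113,  92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103,  99]])

def jpeg_quantize(block):
    """block: 8x8 array of pixel values in [0, 255]."""
    g = block.astype(float) - 128                                # step 1: shift
    G = dct(dct(g, axis=0, norm='ortho'), axis=1, norm='ortho')  # step 2: 2-D DCT
    return np.round(G / Q).astype(int)                           # step 3: B = round(G/Q)

smooth = 100 + 8 * np.add.outer(np.arange(8), np.arange(8))  # a smooth gradient block
print(jpeg_quantize(smooth))  # the high-frequency entries quantize to zero
```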

Page 38: Lecture 10: data compression

Joint Photographic Experts Group
3. Quantization (continued)
The quantized DCT matrix B is as follows:

−26  −3  −6   2   2  −1   0   0
  0  −2  −4   1   1   0   0   0
 −3   1   5  −1  −1   0   0   0
 −3   1   2  −1   0   0   0   0
  1   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0

Notice that the higher-frequency components (which we are not sensitive to) are rounded to ZERO, and the rest become small +/− integers
These require less space to store…

Page 39: Lecture 10: data compression

Joint Photographic Experts Group
3. Zig-zagging and Lossless Compression
The matrix B is then arranged and coded in a zig-zag manner… i.e. B(0,0), B(0,1), B(1,0), B(2,0), B(1,1), B(0,2), B(0,3), B(1,2) and so on:

−26, −3, 0, −3, −2, −6, 2, −4, 1, −3, 1, 1, 5, 1, 2, −1, 1, −1, 2, 0, 0, 0, 0, 0, −1, −1, 0, 0, … (all remaining entries are 0)

All of these values are stored in a vector (i.e. a one-dimensional array) and then coded using DPCM, Huffman and run-length encoding (a sketch of the zig-zag ordering follows below)
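A small sketch that generates this zig-zag visiting order (even diagonals run bottom-left to top-right, odd diagonals the reverse):

```python
def zigzag_order(n=8):
    """(row, col) pairs in JPEG zig-zag order for an n x n block."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],                 # which diagonal
                                  rc[0] if (rc[0] + rc[1]) % 2   # odd: row ascending
                                  else rc[1]))                   # even: col ascending

print(zigzag_order()[:8])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2)]
```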

Page 40: Lecture 10: data compression

The JPEG compression/decompression process


http://en.wikipedia.org/wiki/File:JPEG_process.svg

Page 41: Lecture 10: data compression

Joint Photographic Experts Group
So how can JPEG become lossless again???
If you refer to the 1st slide of the Quantization step (step 3), it has something to do with the quantization matrix having all values of 1!!!
* So please do the maths. Oh, and you don’t need to do the rounding either…

Page 42: Lecture 10: data compression

Audio Compression
The choice of sampling rates (frequency and amplitude) is very important in handling the size of an audio file
Higher sampling rates mean higher fidelity, but cost more in storage space and transmission time
A widely used method is ADPCM (Adaptive Differential Pulse Code Modulation)

Page 43: Lecture 10: data compression

Audio Compression (cont)
Adaptive Differential Pulse Code Modulation (ADPCM)
Pulse code modulation (PCM) is a basic method for quantizing audio information
Differential PCM compresses the number of bits needed to represent the data by storing the first PCM sample in its entirety and all succeeding samples as differences from the previous one
Adaptive DPCM (encoder) takes this scheme and divides the values of the DPCM samples by an appropriate coefficient to produce a smaller value to store

Page 44: Lecture 10: data compression

Audio Compression (cont)
Adaptive Differential Pulse Code Modulation (ADPCM)
In playback, the decoder multiplies the compressed data by that coefficient to reproduce the proper differential value (see the sketch below)
Works very well with speech, but is less effective for music
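A deliberately simplified sketch of the DPCM/ADPCM idea: it uses a fixed division coefficient, whereas real ADPCM adapts the coefficient to the signal and predicts from the reconstructed samples:

```python
def dpcm_encode(samples):
    """Differential PCM: first sample in full, then differences."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def adpcm_encode(samples, coeff=4):
    """Divide each difference by a coefficient so it fits in fewer bits.
    Lossy: the rounding discards information."""
    diffs = dpcm_encode(samples)
    return [diffs[0]] + [round(d / coeff) for d in diffs[1:]]

def adpcm_decode(codes, coeff=4):
    out = [codes[0]]
    for c in codes[1:]:
        out.append(out[-1] + c * coeff)  # multiply back, then accumulate
    return out

pcm = [1000, 1010, 1021, 1017, 1004]
print(adpcm_decode(adpcm_encode(pcm)))  # close to, but not exactly, the input
```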

Page 45: Lecture 10: data compression

Audio Compression (cont)
Perceptual Noise Shaping
The approach used by the MP3 audio format
The MP3 format helps reduce the number of bytes in a song without hurting the quality of the song’s sound
The goal of the MP3 format: compress a CD-quality song by a factor of 10 to 14 without noticeably affecting the CD-quality sound
With MP3, a 32-megabyte (MB) song on a CD compresses down to about 3 MB!!!
The MP3 format uses characteristics of the human ear to design the compression algorithm

Page 46: Lecture 10: data compression

Audio Compression (cont)

Perceptual Noise Shaping (cont)

Page 47: Lecture 10: data compression

LECTURE 11: DATA COMPRESSION (2)
EVI INDRIASARI MANSOR
Email: [email protected]
Tel ext: 1741

Page 48: Lecture 10: data compression

Outline

Basics of Data Compression
Text & Numeric Compression
Image Compression
Audio Compression
Video Compression
Data Security Through Encryption

Page 49: Lecture 10: data compression

Learning Outcomes

Differentiate between the lossless and the lossy data compression process


Page 50: Lecture 10: data compression

Video Compression
Transmitting standard full-screen color imagery as video at 30 fps requires a data rate of nearly 28 MB per second – video compression is absolutely essential!!!
One idea is to reduce the frame rate (from 30 fps to 15 fps), but this sacrifices a lot of the video’s motion
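To see where the “nearly 28 MB per second” figure above comes from (assuming a 640×480 frame at 24-bit color): 640 × 480 pixels × 3 bytes/pixel × 30 frames/s = 27,648,000 bytes/s ≈ 27.6 MB per second.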

Page 51: Lecture 10: data compression

Video Compression (cont)
Intraframe (spatial) compression
Reduces the redundant information contained within a single image or frame
It is not sufficient for achieving the kinds of data rates essential for transmitting video in practical applications

Page 52: Lecture 10: data compression

Video Compression (cont)
Interframe (temporal) compression
The idea is that much of the data in video images is repeated frame after frame
This technique eliminates the redundancy of information between frames
Must identify the key frame (master frame)
Key frame: the basis for deciding how much motion or how many changes take place in succeeding frames

Page 53: Lecture 10: data compression

Video Compression (cont)
Interframe (temporal) compression
Assumes that the background remains (sky, road, and grass) but only the car is moving
The first frame is stored as the key frame, and it has enough information to be reconstructed independently

Page 54: Lecture 10: data compression

Video Compression (cont)
MPEG (Moving Picture Experts Group) Compression
Prediction approach (predicted pictures = P pictures; intra pictures = I pictures; bi-directional pictures = B pictures)
Some compressed frames are the difference results of predictions based on past frames used as a reference, while others are based on both past and future frames from the sequence
[Diagram: a sequence of I (intra), P (predicted) and B (bi-directional) pictures]

Page 55: Lecture 10: data compression

Video Compression (cont)
Spatial vs. temporal compression

Page 56: Lecture 10: data compression

Data Security Through Encryption
Encryption and data security
Cryptography is the art and science of keeping messages secret
Encryption techniques convert data into a secret code for transmission
The process of retrieving the original message at the receiver is called decryption

[Diagram: original plaintext “Second bridge on monday” → encryption → ciphertext “rmnsroklxswrewtgdln” (transmit this) → decryption → recovered plaintext “Second bridge on monday”]

Page 57: Lecture 10: data compression

Data Security Through Encryption (cont)
Encryption keys
Keys are essential information – usually numerical parameter(s) – needed for encryption and/or decryption algorithms
Encryption keys are used to encode plaintext as ciphertext
Decryption keys are used to decode ciphertext and recover the original plaintext
Decryption keys are sometimes discovered by brute force methods employing computers to search large potential key spaces

Page 58: Lecture 10: data compression

Data Security Through Encryption (cont)
Symmetric or Secret Key Ciphers
Secret key ciphers use a secret key (or set of keys) for both encryption and decryption (see the sketch below)
The secret key must be transferred securely in order for secret key methods to be secure
The Data Encryption Standard (DES) is a US government sponsored secret key cipher. DES uses a 56-bit key
The International Data Encryption Algorithm (IDEA) has been proposed to replace DES. It uses a 128-bit key
Longer keys make it more difficult to discover the secret key by brute force
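A toy sketch of the symmetric idea: the same key both encrypts and decrypts. This is a simple XOR cipher for illustration only and shares nothing internally with DES or IDEA:

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte with the (repeating) key; XOR is its own inverse,
    so the SAME call both encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"1741"                                # the shared secret key
ciphertext = xor_cipher(b"Second bridge on monday", key)
print(xor_cipher(ciphertext, key))           # b'Second bridge on monday'
```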

Page 59: Lecture 10: data compression

Data Security Through Encryption (cont)
Asymmetric or Public Key Ciphers
The first practical public key algorithm was published by Rivest, Shamir, and Adleman in 1978 and is known as RSA (for their last names)
Public key ciphers employ an algorithm with two keys – a public key and a private key
A sender looks up the recipient’s public key and uses it to encode a message (see the sketch below)
The recipient then decodes the message with his or her private key (this private key is necessary to decode the message)
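A toy RSA sketch with tiny primes (the classic 61/53 textbook numbers; real keys use primes of 1024+ bits, so this offers no actual security):

```python
p, q = 61, 53                 # toy primes
n = p * q                     # 3233: the modulus, part of both keys
phi = (p - 1) * (q - 1)       # 3120
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # 2753: private exponent (pow(..., -1, m) needs Python 3.8+)

m = 65                        # the message, encoded as a number < n
c = pow(m, e, n)              # sender encrypts with the PUBLIC key -> 2790
print(pow(c, d, n))           # receiver decrypts with the PRIVATE key -> 65
```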

Page 60: Lecture 10: data compression

Data Security Through Encryption (cont)
Asymmetric or Public Key Ciphers Illustrated
[Diagram: (1) the receiver publishes the public key and (2) locks up the private key; (3) the sender encrypts the original message using the receiver’s public key; (4) the ciphertext is transmitted; (5) the receiver decrypts the ciphertext using the private key and recovers the message]

Page 61: Lecture 10: data compression

Data Security Through Encryption (cont)
More on Public Key Methods
No attempt is made to keep secret the actual encryption and decryption algorithms of public key methods – security depends only on the recipient knowing his or her private key
Public key ciphers are more secure than secret key ciphers, but are not as efficient since they require longer keys and more computing in the encryption and decryption processes

Page 62: Lecture 10: data compression

Data Security Through Encryption (cont)
More on Public Key Methods (cont)
For the sake of efficiency, sometimes secret key encryption is used and the secret key is communicated employing public key methods
The combination of a secret-key-encoded message and a public-key-encoded value of the secret key is called a digital envelope

Page 63: Lecture 10: data compression

Data Security Through Encryption (cont)
Authentication
The process used to verify the identity of a respondent is called authentication
Authentication is very important for electronic commerce and other network transactions
Authentication exploits the symmetry of public and private keys

Page 64: Lecture 10: data compression

Data Security Through Encryption (cont)
Authentication (cont)
To authenticate that a person is who they say they are (see the sketch below):
Send that person a nonsense message and ask them to encode it with their private key and return it to you
When the message is returned, if the person is who they claim to be, you should be able to recover your nonsense message using their public key (which presumably you know)
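A sketch of this challenge–response using the toy RSA numbers from the earlier sketch; encoding with the private key and decoding with the public key is just the encryption scheme run in reverse:

```python
n, e, d = 3233, 17, 2753               # the toy RSA key pair from before

nonce = 1234                           # your nonsense message
response = pow(nonce, d, n)            # respondent encodes it with the PRIVATE key
print(pow(response, e, n) == nonce)    # decode with the PUBLIC key -> True
```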

Page 65: Lecture 10: data compression

Summary
Compressing data means reducing the effective size of a data file for storage or transmission
Particular paired compression/decompression methods are called codecs
Codecs that cannot reproduce the original file exactly are called lossy methods; those that reproduce the original exactly are called lossless methods
Text and numbers usually use lossless methods
Image, video and sound codecs are usually lossy
Encryption techniques are used to encode messages for secure transmission

Page 66: Lecture 10: data compression

Summary (cont)
The two primary encryption/decryption methods are:
Secret key (symmetric key) ciphers
Public key (asymmetric key) ciphers
Public key ciphers are more secure, but secret key ciphers are more efficient
Public key encryption is used for authentication over computer networks