Post on 31-Dec-2015


Data Compression

By, Keerthi Gundapaneni

Introduction

• Data compression is a very effective means to save storage space and network bandwidth.

• A large number of compression schemes currently on the market are based on character encoding or on the detection of repetitive strings.

• Many of these schemes reduce English text to 2.3-2.5 bits per character.

• Database performance depends strongly on the amount of available memory.

• It is therefore important to use the available memory as efficiently as possible.

Current Schemes

• Text compression schemes based on letter frequency (pioneered by Huffman).

• Schemes based on string matching.

• Schemes based on fast implementations of algorithms, parallel algorithms and VLSI implementations.

• Many database systems use prefix- and postfix-truncation to save space and increase the fan-out of nodes, e.g. Starburst.
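The prefix-truncation idea in the last bullet can be sketched as follows. This is only a minimal illustration, assuming the keys of one node are strings that share a leading prefix; the function names `truncate_prefix` and `restore_keys` and the sample keys are hypothetical, and the sketch does not cover postfix-truncation or the way any particular system such as Starburst lays out its nodes.

```python
import os


def truncate_prefix(keys):
    """Store the prefix shared by all keys once, plus the per-key remainders."""
    prefix = os.path.commonprefix(keys)
    return prefix, [key[len(prefix):] for key in keys]


def restore_keys(prefix, suffixes):
    """Rebuild the original keys from the shared prefix and the remainders."""
    return [prefix + suffix for suffix in suffixes]


# Hypothetical keys stored in one node.
keys = ["compress", "compression", "compressor", "computer"]
prefix, suffixes = truncate_prefix(keys)

stored_chars = len(prefix) + sum(len(s) for s in suffixes)
original_chars = sum(len(k) for k in keys)
print(prefix, suffixes, stored_chars, original_chars)
```

Because each key takes less room, more keys fit into a node of fixed size, which is where the higher fan-out comes from.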

Using various schemes

• The compression rate of a dataset depends on the attribute types and value distributions.

• It is difficult to compress binary floating point numbers but relatively easy to compress English text by a factor of 2 or 3.

• Optimal performance can only be obtained by judicious decisions about which attributes to compress and which compression method to use.
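The dependence on attribute type can be observed with a small experiment. The sketch below is an assumption-laden illustration, not a benchmark: it uses Python's `zlib` (a general-purpose string-matching compressor, not a scheme named in the slides), a short sample paragraph, and pseudo-random doubles. The exact ratios will vary with the input.

```python
import random
import struct
import zlib


def compression_ratio(raw: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw, 9))


# Sample English text (an arbitrary paragraph chosen for this sketch).
english = (
    "Data compression is a very effective means to save storage space "
    "and network bandwidth. A large number of compression schemes are "
    "based on character encoding or on the detection of repetitive "
    "strings. Database performance depends strongly on the amount of "
    "available memory, so it is important to use the available memory "
    "as efficiently as possible. The compression rate of a dataset "
    "depends on the attribute types and on the value distributions. "
    "More data fits into each disk page, track and cylinder, and "
    "compressed data can be transferred faster to and from disk."
).encode("ascii")

# 1000 pseudo-random doubles in binary form (seeded for repeatability).
rng = random.Random(0)
floats = struct.pack("<1000d", *(rng.random() for _ in range(1000)))

print("English text ratio:", round(compression_ratio(english), 2))
print("Binary floats ratio:", round(compression_ratio(floats), 2))
```

A comparison like this, run per attribute on representative data, is one way to inform the decision about which attributes are worth compressing.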

Advantages of Compression

• Reduce disk space required.

• Seek distance and seek times are reduced.

• More data fits into each disk page, track and cylinder, allowing more intelligent clustering of related objects into physically near locations.

• Unused disk space can be used for shadowing to increase reliability.

• Compressed data can be transferred faster to and from disk.

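• Data compression increases the effective disk bandwidth, not by raising physical transfer rates but by raising the information density of the transferred data.

• Because each transfer carries more information, the load on the I/O system decreases, which relieves the I/O bottleneck.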

• Faster transfer rates across the network.

• Retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os.

• Log records can become shorter.

Types of compression

• For a given table of “parts”, the attribute “color” can be replaced by a small integer. The encoding is saved in a separate relation, and the larger table is joined with the relatively small encoding table for queries that require string-valued output of the color attribute. Since such encoding tables are typically small, e.g. a few kilobytes, efficient hash-based algorithms can be used for the join.
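The encoding-table approach can be sketched in a few lines. This is a minimal illustration under stated assumptions: the rows, the column names `part_no` and `name`, and the functions `encode_column` and `hash_join` are hypothetical, and in-memory Python dictionaries stand in for the relations of a real database.

```python
def encode_column(rows, column):
    """Replace the string values of `column` by small integer codes.

    Returns the encoded rows and a separate encoding table.
    """
    codes = {}  # string value -> integer code
    encoded = []
    for row in rows:
        code = codes.setdefault(row[column], len(codes))
        encoded.append({**row, column: code})
    encoding_table = [{"code": c, column: v} for v, c in codes.items()]
    return encoded, encoding_table


def hash_join(encoded_rows, encoding_table, column):
    """Hash join: build on the small encoding table, probe with the large one."""
    lookup = {entry["code"]: entry[column] for entry in encoding_table}  # build
    return [{**row, column: lookup[row[column]]} for row in encoded_rows]  # probe


# Hypothetical "parts" table.
parts = [
    {"part_no": 1, "name": "bolt", "color": "red"},
    {"part_no": 2, "name": "nut", "color": "green"},
    {"part_no": 3, "name": "screw", "color": "red"},
    {"part_no": 4, "name": "washer", "color": "blue"},
]

encoded_parts, color_table = encode_column(parts, "color")
print(encoded_parts)
print(color_table)
print(hash_join(encoded_parts, color_table, "color") == parts)
```

Queries that only filter or group on color can work on the integer codes directly; the join is needed only when the string value itself has to be returned.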

Huffman code example

Symbol:    A   B   C   D   E
Frequency: 24  12  10  8   8

Total: 186 bits (62 symbols at 3 bits per code word)

Results

Symbol  Frequency  Code  Code Length  Total Length
A       24         0     1            24
B       12         100   3            36
C       10         101   3            30
D       8          110   3            24
E       8          111   3            24

Initial: 186 bits (3-bit code). Final: 138 bits (Huffman code).
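The totals in this example can be recomputed with a short script. The following is a minimal sketch of the standard Huffman construction (repeatedly merging the two least frequent subtrees); the function name `huffman_code_lengths` is an assumption of this sketch. It derives only the code lengths, since the exact bit patterns depend on how 0 and 1 are assigned at each merge.

```python
import heapq
from itertools import count


def huffman_code_lengths(frequencies):
    """Return the Huffman code length of each symbol."""
    tie = count()  # tie-breaker so the heap never compares symbol lists
    heap = [(freq, next(tie), [symbol]) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(frequencies, 0)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)  # least frequent subtree
        f2, _, s2 = heapq.heappop(heap)  # second least frequent subtree
        for symbol in s1 + s2:
            lengths[symbol] += 1  # every merge adds one bit to these codes
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths


frequencies = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
lengths = huffman_code_lengths(frequencies)

fixed_bits = 3 * sum(frequencies.values())
huffman_bits = sum(frequencies[s] * lengths[s] for s in frequencies)
print(lengths)
print(fixed_bits, huffman_bits)
```

For the frequencies above, the script gives A a 1-bit code and B through E 3-bit codes, matching the table.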

References:

• Seeck, Roger (2008). Binary Essence. Retrieved April 17, 2008, from About BinaryEssence Web site: http://www.binaryessence.com/dct/en000081.htm

• Graefe, G., & Shapiro, L. (1991). Data Compression and Database Performance. ACM/IEEE-CS Symp., 1, 1-10.
