Information Theory and Pattern-based Data Compression
José Galaviz Casas
Facultad de Ciencias, UNAM
february 3, 2004 Information Theory, J. Galaviz 2
Contents
• Introduction, fundamental concepts.
• Huffman codes and extensions of a source.
• Pattern-based Data Compression (PbDC).
• The problems for PbDC.
• Trying to solve. Heuristics.
• Conclusions and further research.
Information source
• Is a “thing” that produces infinite sequences of symbols from some finite alphabet Σ.
• The theoretical model proposed by Shannon is an ergodic Markov chain.
• Markov chain: a stochastic process in which the state reached at the i-th time step depends only on the n previous states; n is the order of the Markov chain.
Ergodic source
• A Markov chain is ergodic if the probability distribution over the set of states becomes stable in the limit. If p_n(i,j) denotes the probability of going from state i to state j in n steps of an ergodic Markov chain, then p_n(i,j) tends, as n grows, to a limit that does not depend on the source state i.
• Almost every sample is a representative sample.
• There exists only one set of interconnected states.
• No periodic states.
Information
• Let p(s) be the probability that symbol s will be produced by some information source S.
• The information (in bits) of s is defined as:
$$I(s) = \log_2 \frac{1}{p(s)} = -\log_2 p(s)$$
The meaning
• A measure of “surprise”.
• Better: the number of “yes/no” questions needed to determine that s has occurred.
Entropy
• Is the expected value of symbol information.
• Important: entropy is a property of the source, not of a particular sample. It is defined with probabilities, which assumes an infinite amount of data.
$$H(S) = \sum_{s_i \in S} p(s_i)\, I(s_i) = -\sum_{s_i \in S} p(s_i) \log_2 p(s_i)$$
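The two definitions can be checked numerically. The following is a minimal sketch, assuming a source whose symbol probabilities are known; the function names `information` and `entropy` are chosen here for illustration and do not come from the slides. The second call uses the two-symbol source with P(A) = 0.6875 and P(B) = 0.3125 that appears later in the talk.

```python
import math

def information(p: float) -> float:
    """Information, in bits, of a symbol that occurs with probability p."""
    return -math.log2(p)

def entropy(probabilities) -> float:
    """Expected symbol information, in bits per symbol."""
    # Symbols with probability 0 never occur and contribute nothing.
    return sum(p * information(p) for p in probabilities if p > 0)

print(information(0.5))                      # one yes/no question
print(round(entropy([0.6875, 0.3125]), 4))   # two-symbol source from the talk
```

A rarer symbol carries more information, and the entropy weights each symbol's information by how often the source produces it.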
Data Compression
• Given: a finite sample of data produced by an unknown information source (unknown in the sense that we do not know the statistical model of the source).
• Goal: to express the same information contained in the sample with less data.
• Exactly the same: lossless. Almost the same: lossy.
• We will focus on lossless data compression.
Huffman encoding
• Is based on a statistical model of the sample to be compressed.
• The codeword length of a symbol is inversely related to its frequency: frequent symbols get short codewords.
• Target: minimize the average codeword length (AveLen).
Example
• f(A) = 10
• f(B) = 15
• f(C) = 10
• f(D) = 15
• f(E) = 25
• f(F) = 40
Huffman codes
• A = 000
• B = 100
• C = 001
• D = 101
• E = 01
• F = 11
• AveLen = 280/115 ≈ 2.43 bits/word vs. 3 bits with a fixed-length code
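As a rough illustration, the construction can be sketched in Python with a binary heap. The function name `huffman_code_lengths` and the tie-breaking counter are choices made for this sketch, not part of the slides. For the frequencies of the example it yields codeword lengths of 3 bits for A, B, C, D and 2 bits for E, F, hence an average of 280/115 ≈ 2.43 bits per symbol.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built from freqs."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {s: 0}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"A": 10, "B": 15, "C": 10, "D": 15, "E": 25, "F": 40}
lengths = huffman_code_lengths(freqs)
ave_len = sum(freqs[s] * lengths[s] for s in freqs) / sum(freqs.values())
print(lengths)
print(round(ave_len, 4))
```

Only the lengths are computed, since the average length does not depend on which branch is labelled 0 and which 1.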
Extensions of a source
• Suppose a source S with alphabet Σ = {A, B}.
• P(A) = 0.6875, P(B) = 0.3125
• Since there are only two symbols, the Huffman algorithm encodes every sample of such a source using one bit per symbol in Σ (1 BPS).
• Entropy: H(S) = 0.8960
2nd extension
• P(AA) = 0.4727, WordLen(AA) = 1
• P(AB) = 0.2148, WordLen(AB) = 2
• P(BA) = 0.2148, WordLen(BA) = 3
• P(BB) = 0.0977, WordLen(BB) = 3
• AveLen = 1.8398, BPS2(Σ) = 0.9199
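3rd extension

| Str. | Prob. | WordLen | Str. | Prob. | WordLen |
|------|-------|---------|------|-------|---------|
| AAA  | 0.325 | 2       | BAA  | 0.148 | 3       |
| AAB  | 0.148 | 2       | BAB  | 0.067 | 4       |
| ABA  | 0.148 | 3       | BBA  | 0.067 | 4       |
| ABB  | 0.067 | 4       | BBB  | 0.031 | 4       |

AveLen = 2.759, BPS3(Σ) = 0.9197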
And so on...
• 4th extension:
  – AveLen = 3.64138794
  – BPS4(Σ) = 0.91034699

$$\lim_{n \to \infty} \mathrm{BPS}_n(\Sigma) = H(S)$$
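The figures for the extensions can be reproduced with a short sketch. It assumes a memoryless source, so that the probability of a block is the product of its symbol probabilities (this is how the slide probabilities such as P(AA) = 0.4727 arise). The helper names are hypothetical, and exact fractions are used (0.6875 = 11/16, 0.3125 = 5/16) so that rounding does not affect the result.

```python
import heapq
from fractions import Fraction
from itertools import product

def huffman_average_length(probs):
    """Average codeword length of a Huffman code for the given probabilities.

    The average length equals the sum of the weights of all internal nodes
    created while merging, so the tree itself is never built.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def bits_per_symbol(symbol_probs, n):
    """Huffman bits per original symbol for the n-th extension of the source."""
    block_probs = []
    for block in product(symbol_probs.values(), repeat=n):
        p = Fraction(1)
        for q in block:
            p *= q
        block_probs.append(p)
    return huffman_average_length(block_probs) / n

source = {"A": Fraction(11, 16), "B": Fraction(5, 16)}
for n in (1, 2, 3, 4):
    print(n, float(bits_per_symbol(source, n)))
```

The printed values decrease towards the entropy of about 0.896 bits per symbol, at the price of an alphabet that doubles in size with every extension.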
In practice
• Suppose a sample of our previous source S:
A A A B A A A A B A A B A A B B
f(A) = 11
f(B) = 5
There are only two symbols, therefore Huffman assigns A = 0, B = 1: 16 bits to express the sample.
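Thinking in extensions

| Digram | Freq. | Huffman code | 3-gram | Freq. | Huffman code | 4-gram | Freq. | Huffman code |
|--------|-------|--------------|--------|-------|--------------|--------|-------|--------------|
| AA     | 4     | 0            | AAB    | 3     | 0            | AAAB   | 1     | 00           |
| AB     | 2     | 10           | AAA    | 1     | 10           | AAAA   | 1     | 01           |
| BA     | 1     | 111          | BAA    | 1     | 111          | BAAB   | 1     | 10           |
| BB     | 1     | 110          | B##    | 1     | 110          | AABB   | 1     | 11           |
| Total  |       | 14 bits      |        |       | 11 bits      |        |       | 8 bits       |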
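The totals in the table can be recomputed with a few lines of Python. This is only a sketch: the function names are invented here, the last n-gram is padded with `#` as in the table, and the cost of storing the code table itself is ignored, as it is on the slide.

```python
import heapq
from collections import Counter

def huffman_total_bits(frequencies):
    """Total encoded length, in bits, of a Huffman code for these frequencies.

    Every merge adds one bit to each occurrence below it, so the total is
    the sum of the merged weights.
    """
    heap = list(frequencies)
    heapq.heapify(heap)
    bits = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        bits += merged
        heapq.heappush(heap, merged)
    return bits

def ngram_bits(sample, n, pad="#"):
    """Split sample into n-grams (padding the last one) and Huffman-encode them."""
    padded = sample + pad * (-len(sample) % n)
    grams = [padded[i:i + n] for i in range(0, len(padded), n)]
    return huffman_total_bits(Counter(grams).values())

sample = "AAABAAAABAABAABB"
for n in (1, 2, 3, 4):
    print(n, ngram_bits(sample, n))
```

For n = 1 the sketch gives the 16 bits of the plain two-symbol code, and for n = 2, 3, 4 it gives the 14, 11 and 8 bits of the table.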
Longer strings are better
• The 4-gram sample cannot be compressed any further, since each of the 4 metasymbols (strings of 4 symbols) found appears with the same frequency.
• Let Σ´ = {AAAA, AAAB, BAAB, AABB} be the alphabet of some information source S´ that produces the symbols in Σ´ equiprobably.
• The sample could have been produced by the maximum-entropy source S´.
Dictionary-based methods
• Build a dictionary of frequent strings.
• Each time a string in the dictionary appears in the sample, replace it with a dictionary reference, which is shorter.
• Every frequent string is stored only once (in the dictionary).
Example

AL QUE INGRATO ME DEJA, BUSCO AMANTE;
AL QUE AMANTE ME SIGUE, DEJO INGRATA;
CONSTANTE ADORO A QUIEN MI AMOR MALTRATA;
MALTRATO A QUIEN MI AMOR BUSCA CONSTANTE

1. AL_QUE_
2. INGRAT
3. _ME_
4. _AMANTE
5. _A_QUIEN_MI_AMOR_
6. MALTRAT
7. CONSTANTE
8. DEJ
9. BUSC
Result
12O38A, 9O 4; 143SIGUE, 8O 2A; 7 ADORO56A; 6O59A 7
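A dictionary substitution of this kind can be sketched as follows. The greedy longest-match rule and the token representation (integers for dictionary references, one-character strings for literals) are assumptions of this sketch, so the token sequence it produces need not coincide character for character with the result shown on the slide.

```python
TEXT = ("AL QUE INGRATO ME DEJA, BUSCO AMANTE; "
        "AL QUE AMANTE ME SIGUE, DEJO INGRATA; "
        "CONSTANTE ADORO A QUIEN MI AMOR MALTRATA; "
        "MALTRATO A QUIEN MI AMOR BUSCA CONSTANTE")

# Dictionary entries from the slide; "_" stands for a blank.
DICTIONARY = ["AL_QUE_", "INGRAT", "_ME_", "_AMANTE", "_A_QUIEN_MI_AMOR_",
              "MALTRAT", "CONSTANTE", "DEJ", "BUSC"]
ENTRIES = [entry.replace("_", " ") for entry in DICTIONARY]

def encode(text):
    """Greedy longest-match substitution.

    Integers are 1-based dictionary references, strings are literal symbols.
    """
    by_length = sorted(range(len(ENTRIES)), key=lambda k: -len(ENTRIES[k]))
    out, i = [], 0
    while i < len(text):
        for k in by_length:
            if text.startswith(ENTRIES[k], i):
                out.append(k + 1)
                i += len(ENTRIES[k])
                break
        else:  # no dictionary entry starts here: copy one literal symbol
            out.append(text[i])
            i += 1
    return out

def decode(tokens):
    """Expand references back into their dictionary strings."""
    return "".join(ENTRIES[t - 1] if isinstance(t, int) else t for t in tokens)

encoded = encode(TEXT)
print(encoded[:12])
print(len(TEXT), "symbols ->", len(encoded), "tokens")
```

Decoding the token list restores the original text exactly, which is what makes the scheme lossless.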
Another possibility
AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE;_ AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA;_ CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA;_ MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE
• Build a dictionary of frequent patterns, not necessarily made of consecutive symbols (strings).
The compression process
• Given: a finite sample of consecutive symbols produced by some source S whose statistical properties can only be estimated from the sample.
• Find a set of frequent patterns such that the sample can be expressed briefly using references to these patterns.
• Encode the sample using the set of patterns (the dictionary), and encode the dictionary itself using some other method.
Example
Finding patterns, a naïve algorithm
Algorithm complexity
• The naïve algorithm is very expensive.
• We need to find coincidence patterns, then coincidence patterns among the coincidence patterns previously found, then...
• The number of intersections between coincidence patterns grows exponentially in the number of patterns found (which is O(sample size)).
There are better algorithms but...
• Not very much better.
• The best reported algorithms have complexity O ( n 2 n ). [Vilo 02]
• The patterns we are looking for are of type P3: “Patterns with wildcards of unrestricted length”.
The algorithms for pattern discovery
• Are based on well-known string matching techniques supported by a special data structure called the “suffix tree”.
• There are several algorithms for suffix tree construction (n stands for the string size):
  – The worst is O(n^3).
  – The two best methods (Weiner and Ukkonen) are linear in n and build the tree “on the fly”.
Suffix tree
Suffix tree for the string ATCAGTGCAATGC
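To give an idea of what such a structure stores, here is a naive sketch for the same string. It builds an uncompressed suffix trie by inserting every suffix one symbol at a time, which takes quadratic time and space; it is not the linear-time construction of Weiner or Ukkonen, and the dictionary-based node layout is an assumption of this sketch.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie.

    Each node counts how many suffixes pass through it, which is the number
    of times the string spelled by the path from the root occurs in text.
    """
    root = {"count": 0, "children": {}}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def longest_repeated_substring(text):
    """Longest substring occurring at least twice (deepest node with count >= 2)."""
    best = ""
    stack = [(build_suffix_trie(text), "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node["children"].items():
            if child["count"] >= 2:
                stack.append((child, prefix + ch))
    return best

print(longest_repeated_substring("ATCAGTGCAATGC"))
```

Repeated strings show up as paths shared by several suffixes, which is why this family of structures is the usual starting point for finding frequent strings.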
Some possibility?
• Generalize the suffix tree concept to include patterns rather than strings: a “tree of suffix patterns”.
• It cannot be constructed “on the fly”, since we need to remember an arbitrary number of previous symbols.
• We need to perform: “Find the longest common pattern in a set of strings”.
• We call this problem the MAXIMUMCOMMONPATTERN problem, or MCP.
MAXIMUMCOMMONPATTERN
• We have recently proved that this problem is NP-Complete. That is: no deterministic polynomial-time algorithm to solve it is currently known. If such an algorithm were found, then every other problem in this category (the hardest problems in NP) could also be solved in polynomial time, and we would have P = NP (the fundamental open question in computational complexity theory).
Finding patterns (option 1)
Finding patterns (option 2)
Finding patterns (option 3)
Several options
• Option 1: 12 metasymbols
• Option 2: 14 metasymbols
• Option 3: 10 metasymbols
• Option 3 gives the shortest expression of the sample, considering only the data in the sample and ignoring the dictionary size.
There is a right choice but...
• The right choice is not easy to make.
• There is a trade-off between pattern size and pattern frequency.
• The inclusion of a pattern in the dictionary must be amortized by its use.
How difficult is the right choice?
• Suppose we have a set of frequent patterns P. Each pattern has its frequency and its size.
• We need to choose the subset P´ ⊆ P that maximizes the compression ratio:
$$G(P') = 1 - \frac{T(P')}{|M|}$$
• Here |M| is the original sample size, and T(P´) is the size of the sample after compression, with the dictionary included.
• T(P´) = D(P´) + E(P´)
OPTIMALPATTERNSUBSET
• We call the selection of the best subset of patterns the OPTIMALPATTERNSUBSET problem.
• We have proved that this problem is also NP-Complete.
But here we have some resources
• We can approximate the best subset with a heuristic algorithm.
• We select the patterns with the greatest coverage (the number of symbols in the sample that lie inside appearances of the pattern).
• Then we iteratively refine the solution with hill climbers that make local changes.
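The two steps above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than the author's method: patterns are restricted to plain strings, sizes are counted in tokens, D(P´) is taken to be the number of symbols in the dictionary plus one separator per entry, E(P´) is the number of tokens left after greedy substitution, and a "local change" means adding or removing a single pattern.

```python
def encoded_size(sample, patterns):
    """E(P'): tokens left after greedy longest-match substitution."""
    ordered = sorted(patterns, key=len, reverse=True)
    i = tokens = 0
    while i < len(sample):
        match = next((p for p in ordered if sample.startswith(p, i)), None)
        i += len(match) if match else 1
        tokens += 1
    return tokens

def gain(sample, patterns):
    """G(P') = 1 - T(P')/|M| with T(P') = D(P') + E(P')."""
    dictionary_size = sum(len(p) + 1 for p in patterns)  # assumed cost model
    return 1 - (dictionary_size + encoded_size(sample, patterns)) / len(sample)

def coverage(sample, pattern):
    """Symbols of the sample inside non-overlapping appearances of pattern."""
    return sample.count(pattern) * len(pattern)

def choose_patterns(sample, candidates, k=3):
    """Start from the k patterns with greatest coverage, then hill-climb."""
    ranked = sorted(candidates, key=lambda p: coverage(sample, p), reverse=True)
    chosen = set(ranked[:k])
    improved = True
    while improved:
        improved = False
        for p in candidates:
            neighbour = chosen ^ {p}  # local change: toggle one pattern
            if gain(sample, neighbour) > gain(sample, chosen):
                chosen, improved = neighbour, True
    return chosen

sample = ("AL QUE INGRATO ME DEJA, BUSCO AMANTE; AL QUE AMANTE ME SIGUE, "
          "DEJO INGRATA; CONSTANTE ADORO A QUIEN MI AMOR MALTRATA; "
          "MALTRATO A QUIEN MI AMOR BUSCA CONSTANTE")
candidates = ["AL QUE ", "INGRAT", " ME ", " AMANTE", " A QUIEN MI AMOR ",
              "MALTRAT", "CONSTANTE", "DEJ", "BUSC"]
chosen = choose_patterns(sample, candidates)
print(sorted(chosen), round(gain(sample, chosen), 3))
```

The climber stops at a subset that no single addition or removal improves, which is a local optimum and not necessarily the best subset.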
Conclusions
• Pattern-based data compression is the most general approach to the compression problem based on statistical models of the data to be compressed. Every other technique in this class can be considered a particular case.
• Unfortunately, the sub-tasks involved in the compression process are mostly NP-Complete problems.
Further research
• We need to develop approximation algorithms or heuristics in order to solve the pattern discovery problem efficiently.