

Text Algorithms (6EAP)

Compression

Jaak Vilo, 2010 fall

MTAT.03.190 Text Algorithms, Jaak Vilo

Problem  

• Compress: text, images, video, sound, …
• Reduce space, efficient communication, etc.
  – data deduplication
• Exact compression/decompression
• Lossy compression

• Managing Gigabytes: Compressing and Indexing Documents and Images
• Ian H. Witten, Alistair Moffat, Timothy C. Bell
• Hardcover: 519 pages. Publisher: Morgan Kaufmann; 2nd Revised edition (11 May 1999). Language: English. ISBN-10: 1558605703

Links  

• http://datacompression.info/
• http://en.wikipedia.org/wiki/Data_compression
• Data Compression, by Debra A. Lelewer and Daniel S. Hirschberg
  – http://www.ics.uci.edu/~dan/pubs/DataCompression.html
• Compression FAQ: http://www.faqs.org/faqs/compression-faq/
• Information Theory Primer With an Appendix on Logarithms, by Tom Schneider: http://www.lecb.ncifcrf.gov/~toms/paper/primer/
• http://www.cbloom.com/algs/index.html

Problem

• Information transmission
• Information storage
• The data sizes are huge and growing
  – fax: 1.5 × 10^6 bits/page
  – photo: 2M pixels × 24 bit = 6 MB
  – X-ray image: ~100 MB?
  – microarray scanned image: 30-100 MB
  – tissue microarray: hundreds of images, each tens of MB
  – Large Hadron Collider (CERN): the device will produce a few peta (10^15) bytes of stored data in a year
  – TV (PAL): 2.7 × 10^8 bit/s
  – CD sound, super audio, DVD, ...
  – human genome: 3.2 Gbase; 30× sequencing => 100 Gbase + quality info (+ raw data)
  – 1000 genomes, all individual genomes …

What it's about?

• Elimination of redundancy
• Being able to predict…
• Compression and decompression
  – represent data in a more compact way
  – decompression: restore the original form
• Lossy and lossless compression
  – lossless: restore an exact copy
  – lossy: restore almost the same information
    • useful when 100% accuracy is not needed: voice, image, movies, ...
    • decompression is deterministic (the loss happens in the compression phase)
    • can achieve much more effective results


Methods covered:

• Code words (Huffman coding)
• Run-length encoding
• Arithmetic coding
• Lempel-Ziv family (compress, gzip, zip, pkzip, ...)
• Burrows-Wheeler family (bzip2)
• Other methods, including images
• Kolmogorov complexity
• Search from compressed texts

Model

[Figure: Data → Encoder (with Model) → Compressed data → Decoder (with the same Model) → Data]

• Let pS be the probability of message S
• The information content can be represented in terms of bits
• I(S) = -log( pS ) bits
• If p = 1 then the information content is 0 (no new information)
  – If Pr[s] = 1 then I(s) = 0.
  – In other words, I(death) = I(taxes) = 0
• I( heads or tails ) = 1 bit -- if the coin is fair
• Entropy H is the average information content over all messages S:
  – H = ∑S pS · I(S) = -∑S pS log( pS ) bits

http://en.wikipedia.org/wiki/Information_entropy
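As a minimal illustration (not from the slides), the empirical entropy of the example string used later in this lecture can be computed directly from symbol counts; it gives the lower bound, in bits per symbol, for any code that encodes symbols independently:

    import math
    from collections import Counter

    def entropy(text):
        """Empirical entropy H = -sum p(s) * log2 p(s), in bits per symbol."""
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
    print(round(entropy(S), 3), "bits/symbol")
    print(round(entropy(S) * len(S), 1), "bits in total")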

• Shannon's experiments with human predictors show an information rate of between 0.6 and 1.3 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.


• No compression can on average achieve better compression than the entropy
• Entropy depends on the model (or choice of symbols)
• Let M = { m1, .., mn } be the set of symbols of model A and let p(mi) be the probability of symbol mi
• The entropy of model A is H(M) = -∑i=1..n p(mi) · log( p(mi) ) bits
• Let the message S = s1, .., sk, with every symbol si in the model M. The information content of S under model A is -∑i=1..k log p(si) bits
• Every symbol has to have a probability, otherwise it cannot be coded if it is present in the data

http://marknelson.us/2006/08/24/the-hutter-prize/#comment-293

• The data compression world is all abuzz about Marcus Hutter's recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros - the prize amount is prorated based on how well you beat the current record.)

• This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he's putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney's paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak's submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)

• So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in a stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.

http://prize.hutter1.net/

Model

[Figure repeated: Data → Encoder (with Model) → Compressed data → Decoder (with the same Model) → Data]

Static or adaptive

• A static model does not change during the compression
• An adaptive model can be updated during the process
• Symbols not in the message cannot have 0 probability
• A semi-adaptive model works in 2 stages, off-line:
  first create the code table, then encode the message with the code table

How to compare compression techniques?

• Ratio (t/p); t: original message length, p: compressed message length
• In texts: bits per symbol
• The time and memory used for compression
• The time and memory used for decompression
• Error tolerance (e.g. self-correcting code)

Shorter code words…

• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• Alphabet of 8 symbols
• Length = 40 symbols
• Equal-length code words, 3 bits each:
  a 000, b 001, c 010, d 011, e 100, f 101, g 110, space 111
• S compressed: 3 × 40 = 120 bits


Run-length encoding

• http://michael.dipperstein.com/rle/index.html
• The string
  "aaaabbcdeeeeefghhhij"
  may be replaced with
  "a4b2c1d1e5f1g1h3i1j1".
• This is not shorter, because a 1-letter repeat takes more characters...
• "a3b1cde4fgh2ij"
• Now we need to know which characters are followed by a run length.
• E.g. use escape symbols.
• Or, use the symbol itself: if it is repeated, then it must be followed by the run length:
  "aa2bb0cdee3fghh1ij"
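A small sketch (mine, not from the slides) of the last scheme: a doubled symbol signals that a count of additional repeats follows. It assumes the input itself contains no digits; a real coder would escape them.

    from itertools import groupby

    def rle_encode(s):
        out = []
        for ch, grp in groupby(s):
            run = len(list(grp))
            # single symbols are copied; longer runs become: symbol, symbol, extra count
            out.append(ch if run == 1 else ch + ch + str(run - 2))
        return ''.join(out)

    def rle_decode(code):
        out, i = [], 0
        while i < len(code):
            if i + 1 < len(code) and code[i] == code[i + 1]:
                j = i + 2
                while j < len(code) and code[j].isdigit():
                    j += 1
                out.append(code[i] * (2 + int(code[i + 2:j])))
                i = j
            else:
                out.append(code[i])
                i += 1
        return ''.join(out)

    s = "aaaabbcdeeeeefghhhij"
    print(rle_encode(s))                    # aa2bb0cdee3fghh1ij
    assert rle_decode(rle_encode(s)) == s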

Alphabetically ordered word-lists

resume retail retain retard retire

0resume 2tail 5n 4rd 3ire
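A quick sketch of this front coding of a sorted word list (my illustration): each word is stored as the length of the prefix it shares with the previous word, followed by the differing suffix.

    import os

    def front_encode(words):
        out, prev = [], ""
        for w in words:
            k = len(os.path.commonprefix([prev, w]))   # shared prefix with previous word
            out.append(f"{k}{w[k:]}")
            prev = w
        return out

    def front_decode(codes):
        words, prev = [], ""
        for c in codes:
            i = 0
            while i < len(c) and c[i].isdigit():
                i += 1
            w = prev[:int(c[:i])] + c[i:]
            words.append(w)
            prev = w
        return words

    ws = ["resume", "retail", "retain", "retard", "retire"]
    print(front_encode(ws))        # ['0resume', '2tail', '5n', '4rd', '3ire']
    assert front_decode(front_encode(ws)) == ws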

Coding  techniques  

•  Coding  refers  to  techniques  used  to  encode  tokens  or  symbols.    

• Two of the best known coding algorithms are Huffman Coding and Arithmetic Coding.

• Coding algorithms are effective at compressing data when they use fewer bits for high-probability symbols and more bits for low-probability symbols.

Variable-length encoders

• How to use codes of variable length?
• The decoder needs to know how long each code word is
• Prefix-free code: no code can be a prefix of another code
• Calculate the frequencies and probabilities of the symbols:
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

         freq   ratio   p(s)
  a        2    2/40    0.05
  b        3    3/40    0.075
  c        4    4/40    0.1
  d        5    5/40    0.125
  space    5    5/40    0.125
  e        6    6/40    0.15
  f        7    7/40    0.175
  g        8    8/40    0.2

Shannon-Fano algorithm

• Input: probabilities of symbols
• Output: code words in a prefix-free code

1. Sort symbols by frequency
2. Divide into two groups of as nearly equal total probability as possible
3. The first group gets prefix 0, the other prefix 1
4. Repeat recursively within each group until only 1 symbol remains
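A compact recursive sketch of the procedure (my own illustration, not code from the lecture); the split point is chosen to balance the two groups' total probability as closely as possible:

    def shannon_fano(probs):
        """probs: dict symbol -> probability; returns dict symbol -> code string."""
        codes = {s: "" for s in probs}

        def split(symbols):
            if len(symbols) <= 1:
                return
            total = sum(probs[s] for s in symbols)
            acc, best_i, best_diff = 0.0, 1, float("inf")
            for i, s in enumerate(symbols[:-1], start=1):
                acc += probs[s]
                diff = abs(acc - (total - acc))     # imbalance if we split after i symbols
                if diff < best_diff:
                    best_i, best_diff = i, diff
            for i, s in enumerate(symbols):
                codes[s] += "0" if i < best_i else "1"
            split(symbols[:best_i])
            split(symbols[best_i:])

        split(sorted(probs, key=probs.get, reverse=True))
        return codes

    p = {'g': 0.2, 'f': 0.175, 'e': 0.15, 'd': 0.125, ' ': 0.125,
         'c': 0.1, 'b': 0.075, 'a': 0.05}
    print(shannon_fano(p))   # reproduces the code table a couple of slides below: g=00, ..., a=1111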


Example 1

Code:
  a  1/2   0
  b  1/4   10
  c  1/8   110
  d  1/16  1110
  e  1/32  11110
  f  1/32  11111

Shannon-Fano

S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

         p(s)    code
  g      0.2     00
  f      0.175   010
  e      0.15    011
  d      0.125   100
  space  0.125   101
  c      0.1     110
  b      0.075   1110
  a      0.05    1111

[Split tree: 1 → 0.525 (g,f,e) / 0.475 (d,space,c,b,a); 0.525 → 0.2 (g) / 0.325 (f,e); 0.325 → 0.175 (f) / 0.15 (e); …]

Shannon-Fano

• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• S compressed is 117 bits:
  2·4 + 3·4 + 4·3 + 5·3 + 5·3 + 6·3 + 7·3 + 8·2 = 117

• Shannon-Fano is not always optimal
• Sometimes two groups of equal probability cannot be achieved

• Usually better than H+1 bits per symbol, where H is the entropy.

Huffman code

• Works the opposite way.
• Start from the two least probable symbols and separate them with 0 and 1 (suffix)
• Add their probabilities to form a "new symbol" with the combined probability
• Prepend new bits in front of the old ones.

Huffman example

  Char    Freq   Code
  space   7      111
  a       4      010
  e       4      000
  f       3      1101
  h       2      1010
  i       2      1000
  m       2      0111
  n       2      0010
  s       2      1011
  t       2      0110
  l       1      11001
  o       1      00110
  p       1      10011
  r       1      11000
  u       1      00111
  x       1      10010

  "this is an example of a huffman tree"
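A short bottom-up sketch using a binary heap (an illustration of the procedure described above; with different tie-breaking it may output codes that differ from the table yet have the same total length):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # each heap entry: [total frequency, tie-breaker, {symbol: code-so-far}]
        heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            f0, _, c0 = heapq.heappop(heap)      # the two least probable "symbols"
            f1, _, c1 = heapq.heappop(heap)
            for s in c0: c0[s] = "0" + c0[s]     # prepend a bit to every code in the group
            for s in c1: c1[s] = "1" + c1[s]
            heapq.heappush(heap, [f0 + f1, i, {**c0, **c1}])
            i += 1
        return heap[0][2]

    text = "this is an example of a huffman tree"
    codes = huffman_codes(text)
    print(codes)
    print(sum(len(codes[ch]) for ch in text), "bits")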


• Huffman coding is optimal when the frequencies of input characters are powers of two. Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties

• (as of November 2001/July 2006, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions).
  http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding

Properties of Huffman coding

• Error tolerance is quite good
• In case of the loss, addition or change of a single bit, the differences remain local to the place of the error
• The error usually remains quite local (proof?)
• It has been shown that the code is optimal
• It can be shown that the average result is H + p + 0.086, where H is the entropy and p is the probability of the most probable symbol

Move to Front

• Move to Front (MTF), Least Recently Used (LRU)
• Keep a list of the last k symbols of S
• Code:
  – use the symbol's position in the list as its code
  – if it is in the codebook, move it to the front
  – if it is not in the codebook, insert it at the front and remove the last entry
• cf. the handling of memory paging
• Other heuristics ...
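A minimal sketch of move-to-front coding over a fixed alphabet (my illustration; the variant above with a bounded list of the k most recent symbols plus an escape code is a straightforward extension):

    def mtf_encode(text, alphabet):
        table, out = list(alphabet), []
        for ch in text:
            i = table.index(ch)               # current position in the list is the code
            out.append(i)
            table.insert(0, table.pop(i))     # move the symbol to the front
        return out

    def mtf_decode(codes, alphabet):
        table, out = list(alphabet), []
        for i in codes:
            ch = table[i]
            out.append(ch)
            table.insert(0, table.pop(i))
        return ''.join(out)

    alpha = "abcdefghijklmnopqrstuvwxyz "
    enc = mtf_encode("banana", alpha)
    print(enc)                                # repeated symbols turn into small numbers
    assert mtf_decode(enc, alpha) == "banana"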

Arithmetic (en)coding

• Arithmetic coding is a method for lossless data compression.
• It is a form of entropy encoding, but where other entropy encoding techniques separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0.
• Huffman coding is optimal for character encoding (one character, one code word) and simple to program. Arithmetic coding is better still, since it can allocate fractional bits, but it is more complicated.
• Wikipedia: http://en.wikipedia.org/wiki/Arithmetic_encoding
• The entire message is a single floating-point number, a fraction n where 0.0 ≤ n < 1.0.
• Every symbol gets a probability based on the model
• The probabilities represent non-intersecting intervals
• Every text is such an interval

Let P(A)=0.1, P(B)=0.4, P(C)=0.5
  A  [0, 0.1)      AA [0, 0.01)      AB [0.01, 0.05)    AC [0.05, 0.1)
  B  [0.1, 0.5)    BA [0.1, 0.14)    BB [0.14, 0.3)     BC [0.3, 0.5)
  C  [0.5, 1)      CA [0.5, 0.55)    CB [0.55, 0.75)    CC [0.75, 1)
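An idealized sketch of the interval narrowing for the model above, using exact rational arithmetic so there are no precision issues (real coders use the integer, blockwise variant mentioned on the next slide):

    from fractions import Fraction as F

    probs = {'A': F(1, 10), 'B': F(4, 10), 'C': F(5, 10)}

    def cumulative(probs):
        lo, cum = F(0), {}
        for s, p in probs.items():
            cum[s] = (lo, lo + p)
            lo += p
        return cum

    def encode_interval(msg, probs):
        cum, lo, width = cumulative(probs), F(0), F(1)
        for s in msg:
            c_lo, c_hi = cum[s]
            lo, width = lo + width * c_lo, width * (c_hi - c_lo)
        return lo, lo + width            # any number in [lo, hi) identifies msg

    def decode(x, n, probs):
        cum, out = cumulative(probs), []
        for _ in range(n):
            for s, (c_lo, c_hi) in cum.items():
                if c_lo <= x < c_hi:
                    out.append(s)
                    x = (x - c_lo) / (c_hi - c_lo)   # rescale back into [0, 1)
                    break
        return ''.join(out)

    lo, hi = encode_interval("BC", probs)
    print(lo, hi)                        # [3/10, 1/2), matching the BC row of the table
    assert decode(lo, 2, probs) == "BC"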


• Add an EOF symbol.
• Problem: would need infinite-precision arithmetic
• Alternative: work blockwise, using integer arithmetic
• Works if the smallest pi is not too small
• Best ratio
• Problems: the speed, and error tolerance (a small change has a catastrophic effect)

• Invented by Jorma Rissanen (then at IBM)
• Arithmetic coding revisited, by Alistair Moffat, Radford M. Neal, Ian H. Witten - http://portal.acm.org/citation.cfm?id=874788

• Models for arithmetic coding:
• HMM - Hidden Markov Models
• ...
• Context methods: Abrahamson dependency model
• Use the context to the maximum, to predict the next symbol
• PPM - Prediction by Partial Matching
• Several contexts, choose the best
• Variations

Dictionary-based compression

• Dictionary (symbol table), list of codes
• If not in the dictionary, use an escape
• The usual heuristic searches for the longest repeat
• With a fixed table one can search for the optimal code
• With an adaptive dictionary the optimal coding is NP-complete
• Quite good for English-language texts, for example

Lempel-Ziv family, LZ, LZ-77

• Use the dictionary to memorise the previously compressed parts
• LZ-77
• Sliding window of previous text, and the text to be compressed:

  /bbaaabbbaaffacbbaa…./...abbbaabab...

• Lookahead - the longest prefix that begins within the moving window is encoded with [position, length]
• In the example, [5,6]
• Fast! (Commercial software, e.g. PKZIP, Stacker, DoubleSpace, …)
• Several alternative codes for the same string (alternative substrings will match)
• Lempel-Ziv compression from McGill Univ.:
  http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/

Original LZ77

• Triples [position, length, next char]
• If the output is [a,b,c], advance by b+1 positions
• For each part of the triple, the number of bits reserved depends on the window length:
  ⌈ log(n-f) ⌉ + ⌈ log(f) ⌉ + 8, where n is the window size and f is the lookahead size
• Example: abbbbabbc → [0,0,a] [0,0,b] [1,3,a] [3,2,c]
• In the example, the match actually overlaps with the lookahead window
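An illustrative encoder/decoder for the triple scheme (a sketch only: no bit packing, and the nearest match is preferred so it reproduces the example above):

    def lz77_encode(s, window=16, lookahead=8):
        i, out = 0, []
        while i < len(s):
            best_len, best_pos = 0, 0
            for j in range(i - 1, max(0, i - window) - 1, -1):   # nearest positions first
                length = 0
                # a match may run past i, i.e. overlap the lookahead region
                while (length < lookahead - 1 and i + length < len(s) - 1
                       and s[j + length] == s[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_pos = length, i - j           # distance back into the window
            out.append((best_pos, best_len, s[i + best_len]))    # [position, length, next char]
            i += best_len + 1
        return out

    def lz77_decode(triples):
        s = []
        for pos, length, ch in triples:
            for _ in range(length):
                s.append(s[-pos])        # copying one char at a time handles overlapping matches
            s.append(ch)
        return ''.join(s)

    code = lz77_encode("abbbbabbc")
    print(code)                          # [(0, 0, 'a'), (0, 0, 'b'), (1, 3, 'a'), (3, 2, 'c')]
    assert lz77_decode(code) == "abbbbabbc"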

LZ-78

• Dictionary
• Store strings from the processed part of the message
• The next phrase is the longest match from the dictionary against the text still to be processed
• LZ78 (Ziv and Lempel, 1978)
• Initially the dictionary is empty, with index 0
• Code [i,c] refers to the dictionary (word u at position i); c is the next symbol
• Add uc to the dictionary
• Example: ababcabc → [0,a][0,b][1,b][0,c][3,c]
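A short sketch of the LZ78 scheme (mine; index 0 denotes the empty word, and codes are [dictionary index, next char] pairs):

    def lz78_encode(s):
        dic = {"": 0}                    # word -> index; 0 is the empty word
        out, w = [], ""
        for ch in s:
            if w + ch in dic:
                w += ch                  # keep extending the current match
            else:
                out.append((dic[w], ch))
                dic[w + ch] = len(dic)   # add the new word u+c to the dictionary
                w = ""
        if w:                            # the message ended exactly on a dictionary word
            out.append((dic[w[:-1]], w[-1]))
        return out

    def lz78_decode(pairs):
        words = [""]
        for i, ch in pairs:
            words.append(words[i] + ch)
        return ''.join(words[1:])

    code = lz78_encode("ababcabc")
    print(code)                          # [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (3, 'c')]
    assert lz78_decode(code) == "ababcabc"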


LZW

• The code consists of indices only!
• Initially the dictionary contains every symbol of the alphabet
• Update the dictionary like LZ78
• In decoding there is a danger: see abcabca
  – if abc is in the dictionary,
  – abca is added to the dictionary
  – the next phrase is abca, so that code is output
  – but when decoding, after abc it is not yet known that abca is in the dictionary
• Solution: if a dictionary entry is used immediately after its creation, its first and last characters have to match

• Many representations for the dictionary:
  list, hash, sorted list, combination, binary tree, trie, suffix tree, ...
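A sketch of LZW over a small alphabet, including the decoder's special case discussed above: when a code arrives that is not yet in the decoder's dictionary, the phrase must be the previous phrase extended by its own first character (so its first and last characters match).

    def lzw_encode(s, alphabet):
        dic = {ch: i for i, ch in enumerate(alphabet)}
        out, w = [], ""
        for ch in s:
            if w + ch in dic:
                w += ch
            else:
                out.append(dic[w])
                dic[w + ch] = len(dic)
                w = ch
        if w:
            out.append(dic[w])
        return out

    def lzw_decode(codes, alphabet):
        dic = {i: ch for i, ch in enumerate(alphabet)}
        prev = dic[codes[0]]
        out = [prev]
        for code in codes[1:]:
            entry = dic[code] if code in dic else prev + prev[0]   # the tricky case
            out.append(entry)
            dic[len(dic)] = prev + entry[0]
            prev = entry
        return ''.join(out)

    alpha = "ab"
    codes = lzw_encode("abababa", alpha)
    print(codes)                          # the last code is used right after being created
    assert lzw_decode(codes, alpha) == "abababa"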

LZJ

• Coding: search for the longest prefix.
• Code: the address of the trie node
• From the root of the trie there is a transition on every symbol (like in LZW).
• If out of memory, remove the nodes/branches that have been used only once
• In practice, h=6 and the dictionary has 8192 nodes

LZFG

• An effective LZ method
• Derived from LZJ
• Create a suffix tree for the window
• Code: node address plus the number of characters along the edge.
• Internal and leaf nodes get different codes
• Small matches are coded directly... (?)

Burrows-Wheeler

• See the FAQ: http://www.faqs.org/faqs/compression-faq/part2/section-9.html
• The method described in the original paper is really a composite of three different algorithms:
  – the block sorting main engine (a lossless, very slightly expansive preprocessor),
  – the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder) and
  – a simple statistical compressor (first-order Huffman is mentioned as a candidate) eventually doing the compression.

• Of these three methods only the first two are discussed here, as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first-order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher-order probabilities of the input block (thus making them more even, whitening them) into slack in the lower-order statistics. This effect is what is seen in the histogram of the resulting symbol data.

• Please read the article by Mark Nelson:
• Data Compression with the Burrows-Wheeler Transform, Mark Nelson, Dr. Dobb's Journal, September 1996. http://marknelson.us/1996/09/01/bwt/

Burrows-Wheeler Transform (BWT)


CODE:    t: hat acts like this:<13><10><1

t: hat buffer to the constructor

t: hat corrupted the heap, or wo

W: hat goes up must come down<13

t: hat happens, it isn't likely

w: hat if you want to dynamicall

t: hat indicates an error.<13><1

t: hat it removes arguments from

t: hat looks like this:<13><10><

t: hat looks something like this

t: hat looks something like this

t: hat once I detect the mangled

   

Example

• Decode: errktreteoe.e

• Hint: '.' is the last character and alphabetically the first…
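A small sketch of the forward and inverse transform (mine) using an explicit end-of-string sentinel and the naive O(n^2 log n) rotation sort; the decode exercise above instead relies on knowing the last character and the alphabetical order, and bzip2-style implementations use suffix sorting:

    def bwt(s, eos="$"):
        s += eos
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return ''.join(r[-1] for r in rotations)     # last column of the sorted rotations

    def inverse_bwt(last, eos="$"):
        n = len(last)
        table = [""] * n
        for _ in range(n):                           # repeatedly prepend the last column and sort
            table = sorted(last[i] + table[i] for i in range(n))
        row = next(r for r in table if r.endswith(eos))
        return row[:-1]                              # drop the sentinel

    code = bwt("banana")
    print(code)                                      # annb$aa
    assert inverse_bwt(code) == "banana"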


Syntactic compression

• Context-free grammar for representing the syntax tree
• Usually for source code
• Assumption: the program is syntactically correct
• Comments
• Features, constants - group by group

Image  compression  

•  Many  images,  photos,  sound,  video,  ...    

Fax  group  3  

• Fax/group 3
• Black/white, 0/1 code
• Run-length: 000111001000 → 3,3,2,1,3
• Variable-length codes for representing run lengths.

• Joint Photographic Experts Group, JPEG 2000: http://www.jpeg.org/jpeg2000/

• Image Compression -- JPEG, from W.B. Pennebaker, J.L. Mitchell, "The JPEG Still Image Data Compression Standard", Van Nostrand Reinhold, 1993.

• Color image, 8 or 12 bits per pixel per color.
• Four modes: Sequential Mode, Lossless Mode, Progressive Mode, Hierarchical Mode
• DCT (Discrete Cosine Transform)

• from http://www.utdallas.edu/~aria/mcl/post/
• Lossy signal compression works on the basis of transmitting the "important" signal content, while omitting other parts (quantization). To perform this quantization effectively, a linear de-correlating transform is applied to the signal prior to quantization. All existing image and video coding standards use this approach. The most commonly used transform is the Discrete Cosine Transform (DCT), used in JPEG, MPEG-1, MPEG-2, H.261 and H.263 and its descendants. For a detailed discussion of the theory behind quantization and justification of the usage of linear transforms, see reference [1] below.

• A brief overview of JPEG compression is as follows. The JPEG encoder partitions the image into 8x8 blocks of pixels. To each of these blocks it applies a 2-dimensional DCT. The transform matrix is normalized (element-wise) by an 8x8 quantization matrix and then rounded to the nearest integer. This operation is equivalent to applying different uniform quantizers to different frequency bands of the image. The high-frequency image content can be quantized more coarsely than the low-frequency content, due to two factors.
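A rough numeric sketch of that block step (my illustration, not the standard's reference code): an orthonormal 8x8 DCT-II, element-wise division by a quantization matrix and rounding; the matrix Q below is the example luminance table from the JPEG standard, used here only for illustration.

    import numpy as np

    def dct_matrix(n=8):
        """Orthonormal DCT-II basis C, so that coefficients = C @ block @ C.T."""
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        C[0, :] = np.sqrt(1.0 / n)
        return C

    Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
                  [12, 12, 14, 19, 26, 58, 60, 55],
                  [14, 13, 16, 24, 40, 57, 69, 56],
                  [14, 17, 22, 29, 51, 87, 80, 62],
                  [18, 22, 37, 56, 68, 109, 103, 77],
                  [24, 35, 55, 64, 81, 104, 113, 92],
                  [49, 64, 78, 87, 103, 121, 120, 101],
                  [72, 92, 95, 98, 112, 100, 103, 99]])

    C = dct_matrix()
    block = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float) - 128
    coeffs = C @ block @ C.T                  # 2-D DCT of the level-shifted block
    quantized = np.round(coeffs / Q)          # high frequencies get coarser steps
    restored = C.T @ (quantized * Q) @ C + 128
    print(int(np.count_nonzero(quantized)), "non-zero coefficients out of 64")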

•  L9_Compression/lena/    

Lena  


Vector quantization

• Vector quantization
• A dictionary method
• 2-dimensional blocks

Discrete cosine transform

• A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images (where small high-frequency components can be discarded), to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical in these applications: for compression, it turns out that cosine functions are much more efficient (as explained below, fewer are needed to approximate a typical signal), whereas for differential equations the cosines express a particular choice of boundary conditions.

• http://en.wikipedia.org/wiki/Discrete_cosine_transform

[Figure] 2D DCT (type II) compared to the DFT. For both transforms, the magnitude of the spectrum is on the left and the histogram on the right; both spectra are cropped to 1/4 to zoom in on the behaviour at the lower frequencies. The DCT concentrates most of the power in the lower frequencies.

• In particular, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are common.

• Digital Image Processing: http://www-ee.uta.edu/dip/

Block Diagram of JPEG Baseline
[Figure from Wallace's JPEG tutorial (1993); slides from ENEE631 Digital Image Processing (Spring'04), Lec13 - Transform Coding & JPEG]

475 x 330 x 3 = 157 KB, luminance
[Figure from Liu's EE330 (Princeton)]


RGB Components
[Figure from Liu's EE330 (Princeton)]

Y U V (Y Cb Cr) Components
Assign more bits to Y, fewer bits to Cb and Cr
[Figure from Liu's EE330 (Princeton)]

JPEG Compression (Q=75%)
45 KB, compression ratio ~ 4:1
[Figures from Liu's EE330 (Princeton)]

JPEG Compression (Q=75% & 30%)
45 KB vs 22 KB
[Figure from Liu's EE330 (Princeton)]

Uncompressed (100 KB), JPEG 75% (18 KB), JPEG 50% (12 KB), JPEG 30% (9 KB), JPEG 10% (5 KB)
[UMCP ENEE408G slides, created by M. Wu & R. Liu, © 2002]


1.4-billion-pixel digital camera

• Monday, November 24, 2008: http://www.technologyreview.com/computing/21705/page1/
• Giant Camera Tracks Asteroids
• The camera will offer sharper, broader views of the sky.
• The focal plane of each camera contains an almost complete 64 x 64 array of CCD devices, each containing approximately 600 x 600 pixels, for a total of about 1.4 gigapixels. The CCDs themselves employ the innovative technology called "orthogonal transfer", which is described below.

Diagram showing how an OTA chip is made up of 64 OTCCD cells, each of which has 600 x 600 pixels

Fractal compression

• Fractal Compression group at Waterloo: http://links.uwaterloo.ca/fractals.home.html
• A "Hitchhiker's Guide to Fractal Compression" For Beginners: ftp://links.uwaterloo.ca/pub/Fractals/Papers/Waterloo/vr95.pdf

• Encode using fractals.
• Search for regions that, with a simple transformation, can be made similar to each other.
• Compression ratio 20-80

• Moving Pictures Experts Group ( http://www.chiariglione.org/mpeg/ )
• MPEG Compression: http://www.cs.cf.ac.uk/Dave/Multimedia/node255.html
• The screen is divided into 256 blocks, in which the changes and movements are tracked
• Only differences from the previous frame are encoded
• Compression ratio 50-100

• When one compresses files, can we still use fast search techniques without decompressing first?
• Sometimes, yes
• e.g. Udi Manber has developed a method

• Approximate Matching of Run-Length Compressed Strings, Veli Mäkinen, Gonzalo Navarro, Esko Ukkonen

• We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

Kolmogorov (or algorithmic) complexity

• Kolmogorov, Chaitin, ...
• What is the compressed version of the sequence '1234567891011121314151617181920212223242526...'?
• Every symbol appears almost equally frequently; it looks almost "random" by entropy
• for i=1 to n do print i;
• The algorithmic complexity (or Kolmogorov complexity) of a string S is the length of the shortest program that reproduces S, often denoted K(S)
• Conditional complexity: K(S|T) - reproduce S given T.
• http://en.wikipedia.org/wiki/Algorithmic_information_theory

• Algorithmic information theory is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or Kolmogorov complexity) of a string as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to Gödel's incompleteness theorem and Turing's halting problem.
• The field was developed by Andrey Kolmogorov and Gregory Chaitin starting in the late 1960s.


Kolmogorov complexity: size of circle in bits...

[Figure: Data → Encoder (with Model) → Compressed data → Decoder (with the same Model) → Data]

G. J. Chaitin

• http://www.cs.umaine.edu/~chaitin/

• Distance using K:
• d(S,T) = ( K(S|T) + K(T|S) ) / ( K(S) + K(T) )
• We cannot calculate K, but we can approximate it
• e.g. by compression: LZ, BWT, etc.
• d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) )
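A sketch of approximating this distance with an off-the-shelf compressor standing in for K (zlib here; any LZ- or BWT-based compressor could be plugged in):

    import zlib

    def C(data: bytes) -> int:
        """Compressed size in bytes, a crude stand-in for Kolmogorov complexity K."""
        return len(zlib.compress(data, 9))

    def d(s: bytes, t: bytes) -> float:
        # d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) ): near 1 for similar data, near 2 for unrelated data
        return (C(s + t) + C(t + s)) / (C(s) + C(t))

    x = b"the quick brown fox jumps over the lazy dog " * 20
    y = b"the quick brown fox jumps over the lazy cat " * 20
    z = bytes(range(256)) * 4
    print(round(d(x, y), 3), round(d(x, z), 3))   # x and y come out far more alike than x and z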

• Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain, David Parry (PPT), in OSWIR 2005, the 2005 workshop on Open Source Web Information Retrieval

• Informatsioonikaugus ("Information distance") by Mart Sõmermaa, Fall 2003 (in the Data Mining Research seminar): http://www.egeen.ee/u/vilo/edu/2003-04/DM_seminar_2003_II/Raport/P06/main.pdf