

Text Algorithms (6EAP)

Compression

Jaak Vilo, 2010 fall

MTAT.03.190 Text Algorithms, Jaak Vilo

Problem  

• Compress: text, images, video, sound, …
• Reduce space, efficient communication, etc.
  – data deduplication
• Exact compression/decompression
• Lossy compression

• Managing Gigabytes: Compressing and Indexing Documents and Images
• Ian H. Witten, Alistair Moffat, Timothy C. Bell
• Hardcover: 519 pages. Publisher: Morgan Kaufmann; 2nd Revised edition (11 May 1999). Language: English. ISBN-10: 1558605703

Links  

• http://datacompression.info/
• http://en.wikipedia.org/wiki/Data_compression
• Data Compression, by Debra A. Lelewer and Daniel S. Hirschberg
  – http://www.ics.uci.edu/~dan/pubs/DataCompression.html
• Compression FAQ: http://www.faqs.org/faqs/compression-faq/
• Information Theory Primer With an Appendix on Logarithms, by Tom Schneider: http://www.lecb.ncifcrf.gov/~toms/paper/primer/
• http://www.cbloom.com/algs/index.html

Problem

• Information transmission
• Information storage
• The data sizes are huge and growing
  – fax: 1.5 × 10^6 bits/page
  – photo: 2M pixels × 24 bit = 6 MB
  – X-ray image: ~100 MB?
  – microarray scanned image: 30-100 MB
  – tissue microarray: hundreds of images, each tens of MB
  – Large Hadron Collider (CERN): the device will produce a few peta (10^15) bytes of stored data in a year
  – TV (PAL): 2.7 × 10^8 bit/s
  – CD sound, super audio, DVD, ...
  – human genome: 3.2 Gbase; 30× sequencing => 100 Gbase + quality info (+ raw data)
  – 1000 genomes, all individual genomes …

What it's about?

• Elimination of redundancy
• Being able to predict…
• Compression and decompression
  – represent data in a more compact way
  – decompression: restore the original form
• Lossy and lossless compression
  – lossless: restore an exact copy
  – lossy: restore almost the same information
    • useful when 100% accuracy is not needed: voice, image, movies, ...
    • decompression is deterministic (the loss happens in the compression phase)
    • can achieve much more effective results


Methods covered:

• Code words (Huffman coding)
• Run-length encoding
• Arithmetic coding
• Lempel-Ziv family (compress, gzip, zip, pkzip, ...)
• Burrows-Wheeler family (bzip2)
• Other methods, including images
• Kolmogorov complexity
• Search from compressed texts

Model

[Figure: Data → Encoder (with Model) → Compressed data → Decoder (with the same Model) → Data]

• Let pS be the probability of message S
• The information content can be represented in terms of bits
• I(S) = -log( pS ) bits
• If p = 1 then the information content is 0 (no new information)
  – If Pr[s] = 1 then I(s) = 0.
  – In other words, I(death) = I(taxes) = 0
• I( heads or tails ) = 1 bit -- if the coin is fair
• Entropy H is the average information content over all messages S:
  – H = ∑S pS · I(S) = -∑S pS log( pS ) bits

http://en.wikipedia.org/wiki/Information_entropy
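As a minimal illustration (not from the slides), the empirical entropy of the example string used later in this lecture can be computed directly from symbol counts; it gives the lower bound, in bits per symbol, for any code that encodes symbols independently:

    import math
    from collections import Counter

    def entropy(text):
        """Empirical entropy H = -sum p(s) * log2 p(s), in bits per symbol."""
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
    print(round(entropy(S), 3), "bits/symbol")
    print(round(entropy(S) * len(S), 1), "bits in total")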

• Shannon's experiments with human predictors show an information rate of between 0.6 and 1.3 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.


• No compression can on average achieve better compression than the entropy
• Entropy depends on the model (or choice of symbols)
• Let M = { m1, .., mn } be the set of symbols of model A and let p(mi) be the probability of symbol mi
• The entropy of model A is H(M) = -∑i=1..n p(mi) · log( p(mi) ) bits
• Let the message S = s1, .., sk, with every symbol si in the model M. The information content of S under model A is -∑i=1..k log p(si) bits
• Every symbol has to have a probability, otherwise it cannot be coded if it is present in the data

http://marknelson.us/2006/08/24/the-hutter-prize/#comment-293

• The data compression world is all abuzz about Marcus Hutter's recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros - the prize amount is prorated based on how well you beat the current record.)

• This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he's putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney's paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak's submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)

• So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in a stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.

http://prize.hutter1.net/

Model

[Figure repeated: Data → Encoder (with Model) → Compressed data → Decoder (with the same Model) → Data]

Static or adaptive

• A static model does not change during the compression
• An adaptive model can be updated during the process
• Symbols not in the message cannot have 0 probability
• A semi-adaptive model works in 2 stages, off-line:
  first create the code table, then encode the message with the code table

How to compare compression techniques?

• Ratio (t/p); t: original message length, p: compressed message length
• In texts: bits per symbol
• The time and memory used for compression
• The time and memory used for decompression
• Error tolerance (e.g. self-correcting code)

Shorter code words…

• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• Alphabet of 8 symbols
• Length = 40 symbols
• Equal-length code words, 3 bits each:
  a 000, b 001, c 010, d 011, e 100, f 101, g 110, space 111
• S compressed: 3 × 40 = 120 bits


Run-length encoding

• http://michael.dipperstein.com/rle/index.html
• The string
  "aaaabbcdeeeeefghhhij"
  may be replaced with
  "a4b2c1d1e5f1g1h3i1j1".
• This is not shorter, because a 1-letter repeat takes more characters...
• "a3b1cde4fgh2ij"
• Now we need to know which characters are followed by a run length.
• E.g. use escape symbols.
• Or, use the symbol itself: if it is repeated, then it must be followed by the run length:
  "aa2bb0cdee3fghh1ij"
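A small sketch (mine, not from the slides) of the last scheme: a doubled symbol signals that a count of additional repeats follows. It assumes the input itself contains no digits; a real coder would escape them.

    from itertools import groupby

    def rle_encode(s):
        out = []
        for ch, grp in groupby(s):
            run = len(list(grp))
            # single symbols are copied; longer runs become: symbol, symbol, extra count
            out.append(ch if run == 1 else ch + ch + str(run - 2))
        return ''.join(out)

    def rle_decode(code):
        out, i = [], 0
        while i < len(code):
            if i + 1 < len(code) and code[i] == code[i + 1]:
                j = i + 2
                while j < len(code) and code[j].isdigit():
                    j += 1
                out.append(code[i] * (2 + int(code[i + 2:j])))
                i = j
            else:
                out.append(code[i])
                i += 1
        return ''.join(out)

    s = "aaaabbcdeeeeefghhhij"
    print(rle_encode(s))                    # aa2bb0cdee3fghh1ij
    assert rle_decode(rle_encode(s)) == s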

Alphabetically ordered word-lists

resume retail retain retard retire

0resume 2tail 5n 4rd 3ire
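A quick sketch of this front coding of a sorted word list (my illustration): each word is stored as the length of the prefix it shares with the previous word, followed by the differing suffix.

    import os

    def front_encode(words):
        out, prev = [], ""
        for w in words:
            k = len(os.path.commonprefix([prev, w]))   # shared prefix with previous word
            out.append(f"{k}{w[k:]}")
            prev = w
        return out

    def front_decode(codes):
        words, prev = [], ""
        for c in codes:
            i = 0
            while i < len(c) and c[i].isdigit():
                i += 1
            w = prev[:int(c[:i])] + c[i:]
            words.append(w)
            prev = w
        return words

    ws = ["resume", "retail", "retain", "retard", "retire"]
    print(front_encode(ws))        # ['0resume', '2tail', '5n', '4rd', '3ire']
    assert front_decode(front_encode(ws)) == ws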

Coding  techniques  

•  Coding  refers  to  techniques  used  to  encode  tokens  or  symbols.    

• Two of the best known coding algorithms are Huffman Coding and Arithmetic Coding.

• Coding algorithms are effective at compressing data when they use fewer bits for high-probability symbols and more bits for low-probability symbols.

Variable-length encoders

• How to use codes of variable length?
• The decoder needs to know how long each code word is
• Prefix-free code: no code can be a prefix of another code
• Calculate the frequencies and probabilities of the symbols:
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

         freq   ratio   p(s)
  a        2    2/40    0.05
  b        3    3/40    0.075
  c        4    4/40    0.1
  d        5    5/40    0.125
  space    5    5/40    0.125
  e        6    6/40    0.15
  f        7    7/40    0.175
  g        8    8/40    0.2

Shannon-Fano algorithm

• Input: probabilities of symbols
• Output: code words in a prefix-free code

1. Sort symbols by frequency
2. Divide into two groups of as nearly equal total probability as possible
3. The first group gets prefix 0, the other prefix 1
4. Repeat recursively within each group until only 1 symbol remains
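A compact recursive sketch of the procedure (my own illustration, not code from the lecture); the split point is chosen to balance the two groups' total probability as closely as possible:

    def shannon_fano(probs):
        """probs: dict symbol -> probability; returns dict symbol -> code string."""
        codes = {s: "" for s in probs}

        def split(symbols):
            if len(symbols) <= 1:
                return
            total = sum(probs[s] for s in symbols)
            acc, best_i, best_diff = 0.0, 1, float("inf")
            for i, s in enumerate(symbols[:-1], start=1):
                acc += probs[s]
                diff = abs(acc - (total - acc))     # imbalance if we split after i symbols
                if diff < best_diff:
                    best_i, best_diff = i, diff
            for i, s in enumerate(symbols):
                codes[s] += "0" if i < best_i else "1"
            split(symbols[:best_i])
            split(symbols[best_i:])

        split(sorted(probs, key=probs.get, reverse=True))
        return codes

    p = {'g': 0.2, 'f': 0.175, 'e': 0.15, 'd': 0.125, ' ': 0.125,
         'c': 0.1, 'b': 0.075, 'a': 0.05}
    print(shannon_fano(p))   # reproduces the code table a couple of slides below: g=00, ..., a=1111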


Example 1

Code:
  a  1/2   0
  b  1/4   10
  c  1/8   110
  d  1/16  1110
  e  1/32  11110
  f  1/32  11111

Shannon-Fano

S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

         p(s)    code
  g      0.2     00
  f      0.175   010
  e      0.15    011
  d      0.125   100
  space  0.125   101
  c      0.1     110
  b      0.075   1110
  a      0.05    1111

[Split tree: 1 → 0.525 (g,f,e) / 0.475 (d,space,c,b,a); 0.525 → 0.2 (g) / 0.325 (f,e); 0.325 → 0.175 (f) / 0.15 (e); …]

Shannon-Fano

• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• S compressed is 117 bits:
  2·4 + 3·4 + 4·3 + 5·3 + 5·3 + 6·3 + 7·3 + 8·2 = 117

• Shannon-Fano is not always optimal
• Sometimes two groups of equal probability cannot be achieved

• Usually better than H+1 bits per symbol, where H is the entropy.

Huffman code

• Works the opposite way.
• Start from the two least probable symbols and separate them with 0 and 1 (suffix)
• Add their probabilities to form a "new symbol" with the combined probability
• Prepend new bits in front of the old ones.

Huffman example

  Char    Freq   Code
  space   7      111
  a       4      010
  e       4      000
  f       3      1101
  h       2      1010
  i       2      1000
  m       2      0111
  n       2      0010
  s       2      1011
  t       2      0110
  l       1      11001
  o       1      00110
  p       1      10011
  r       1      11000
  u       1      00111
  x       1      10010

  "this is an example of a huffman tree"
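A short bottom-up sketch using a binary heap (an illustration of the procedure described above; with different tie-breaking it may output codes that differ from the table yet have the same total length):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # each heap entry: [total frequency, tie-breaker, {symbol: code-so-far}]
        heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            f0, _, c0 = heapq.heappop(heap)      # the two least probable "symbols"
            f1, _, c1 = heapq.heappop(heap)
            for s in c0: c0[s] = "0" + c0[s]     # prepend a bit to every code in the group
            for s in c1: c1[s] = "1" + c1[s]
            heapq.heappush(heap, [f0 + f1, i, {**c0, **c1}])
            i += 1
        return heap[0][2]

    text = "this is an example of a huffman tree"
    codes = huffman_codes(text)
    print(codes)
    print(sum(len(codes[ch]) for ch in text), "bits")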


• Huffman coding is optimal when the frequencies of input characters are powers of two. Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties

• (as of November 2001/July 2006, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions).
  http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding

Properties of Huffman coding

• Error tolerance is quite good
• In case of the loss, addition or change of a single bit, the differences remain local to the place of the error
• The error usually remains quite local (proof?)
• It has been shown that the code is optimal
• It can be shown that the average result is H + p + 0.086, where H is the entropy and p is the probability of the most probable symbol

Move to Front

• Move to Front (MTF), Least Recently Used (LRU)
• Keep a list of the last k symbols of S
• Code:
  – use the symbol's position in the list as its code
  – if it is in the codebook, move it to the front
  – if it is not in the codebook, insert it at the front and remove the last entry
• cf. the handling of memory paging
• Other heuristics ...
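A minimal sketch of move-to-front coding over a fixed alphabet (my illustration; the variant above with a bounded list of the k most recent symbols plus an escape code is a straightforward extension):

    def mtf_encode(text, alphabet):
        table, out = list(alphabet), []
        for ch in text:
            i = table.index(ch)               # current position in the list is the code
            out.append(i)
            table.insert(0, table.pop(i))     # move the symbol to the front
        return out

    def mtf_decode(codes, alphabet):
        table, out = list(alphabet), []
        for i in codes:
            ch = table[i]
            out.append(ch)
            table.insert(0, table.pop(i))
        return ''.join(out)

    alpha = "abcdefghijklmnopqrstuvwxyz "
    enc = mtf_encode("banana", alpha)
    print(enc)                                # repeated symbols turn into small numbers
    assert mtf_decode(enc, alpha) == "banana"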

Arithmetic (en)coding

• Arithmetic coding is a method for lossless data compression.
• It is a form of entropy encoding, but where other entropy encoding techniques separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0.
• Huffman coding is optimal for character encoding (one character, one code word) and simple to program. Arithmetic coding is better still, since it can allocate fractional bits, but it is more complicated.
• Wikipedia: http://en.wikipedia.org/wiki/Arithmetic_encoding
• The entire message is a single floating-point number, a fraction n where 0.0 ≤ n < 1.0.
• Every symbol gets a probability based on the model
• The probabilities represent non-intersecting intervals
• Every text is such an interval

Let P(A)=0.1, P(B)=0.4, P(C)=0.5
  A  [0, 0.1)      AA [0, 0.01)      AB [0.01, 0.05)    AC [0.05, 0.1)
  B  [0.1, 0.5)    BA [0.1, 0.14)    BB [0.14, 0.3)     BC [0.3, 0.5)
  C  [0.5, 1)      CA [0.5, 0.55)    CB [0.55, 0.75)    CC [0.75, 1)
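An idealized sketch of the interval narrowing for the model above, using exact rational arithmetic so there are no precision issues (real coders use the integer, blockwise variant mentioned on the next slide):

    from fractions import Fraction as F

    probs = {'A': F(1, 10), 'B': F(4, 10), 'C': F(5, 10)}

    def cumulative(probs):
        lo, cum = F(0), {}
        for s, p in probs.items():
            cum[s] = (lo, lo + p)
            lo += p
        return cum

    def encode_interval(msg, probs):
        cum, lo, width = cumulative(probs), F(0), F(1)
        for s in msg:
            c_lo, c_hi = cum[s]
            lo, width = lo + width * c_lo, width * (c_hi - c_lo)
        return lo, lo + width            # any number in [lo, hi) identifies msg

    def decode(x, n, probs):
        cum, out = cumulative(probs), []
        for _ in range(n):
            for s, (c_lo, c_hi) in cum.items():
                if c_lo <= x < c_hi:
                    out.append(s)
                    x = (x - c_lo) / (c_hi - c_lo)   # rescale back into [0, 1)
                    break
        return ''.join(out)

    lo, hi = encode_interval("BC", probs)
    print(lo, hi)                        # [3/10, 1/2), matching the BC row of the table
    assert decode(lo, 2, probs) == "BC"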


• Add an EOF symbol.
• Problem: would need infinite-precision arithmetic
• Alternative: work blockwise, using integer arithmetic
• Works if the smallest pi is not too small
• Best ratio
• Problems: the speed, and error tolerance (a small change has a catastrophic effect)

• Invented by Jorma Rissanen (then at IBM)
• Arithmetic coding revisited, by Alistair Moffat, Radford M. Neal, Ian H. Witten - http://portal.acm.org/citation.cfm?id=874788

• Models for arithmetic coding:
• HMM - Hidden Markov Models
• ...
• Context methods: Abrahamson dependency model
• Use the context to the maximum, to predict the next symbol
• PPM - Prediction by Partial Matching
• Several contexts, choose the best
• Variations

Dictionary-based compression

• Dictionary (symbol table), list of codes
• If not in the dictionary, use an escape
• The usual heuristic searches for the longest repeat
• With a fixed table one can search for the optimal code
• With an adaptive dictionary the optimal coding is NP-complete
• Quite good for English-language texts, for example

Lempel-Ziv family, LZ, LZ-77

• Use the dictionary to memorise the previously compressed parts
• LZ-77
• Sliding window of previous text, and the text to be compressed:

  /bbaaabbbaaffacbbaa…./...abbbaabab...

• Lookahead - the longest prefix that begins within the moving window is encoded with [position, length]
• In the example, [5,6]
• Fast! (Commercial software, e.g. PKZIP, Stacker, DoubleSpace, …)
• Several alternative codes for the same string (alternative substrings will match)
• Lempel-Ziv compression from McGill Univ.:
  http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/

Original LZ77

• Triples [position, length, next char]
• If the output is [a,b,c], advance by b+1 positions
• For each part of the triple, the number of bits reserved depends on the window length:
  ⌈ log(n-f) ⌉ + ⌈ log(f) ⌉ + 8, where n is the window size and f is the lookahead size
• Example: abbbbabbc → [0,0,a] [0,0,b] [1,3,a] [3,2,c]
• In the example, the match actually overlaps with the lookahead window
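An illustrative encoder/decoder for the triple scheme (a sketch only: no bit packing, and the nearest match is preferred so it reproduces the example above):

    def lz77_encode(s, window=16, lookahead=8):
        i, out = 0, []
        while i < len(s):
            best_len, best_pos = 0, 0
            for j in range(i - 1, max(0, i - window) - 1, -1):   # nearest positions first
                length = 0
                # a match may run past i, i.e. overlap the lookahead region
                while (length < lookahead - 1 and i + length < len(s) - 1
                       and s[j + length] == s[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_pos = length, i - j           # distance back into the window
            out.append((best_pos, best_len, s[i + best_len]))    # [position, length, next char]
            i += best_len + 1
        return out

    def lz77_decode(triples):
        s = []
        for pos, length, ch in triples:
            for _ in range(length):
                s.append(s[-pos])        # copying one char at a time handles overlapping matches
            s.append(ch)
        return ''.join(s)

    code = lz77_encode("abbbbabbc")
    print(code)                          # [(0, 0, 'a'), (0, 0, 'b'), (1, 3, 'a'), (3, 2, 'c')]
    assert lz77_decode(code) == "abbbbabbc"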

LZ-78

• Dictionary
• Store strings from the processed part of the message
• The next phrase is the longest match from the dictionary against the text still to be processed
• LZ78 (Ziv and Lempel, 1978)
• Initially the dictionary is empty, with index 0
• Code [i,c] refers to the dictionary (word u at position i); c is the next symbol
• Add uc to the dictionary
• Example: ababcabc → [0,a][0,b][1,b][0,c][3,c]
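A short sketch of the LZ78 scheme (mine; index 0 denotes the empty word, and codes are [dictionary index, next char] pairs):

    def lz78_encode(s):
        dic = {"": 0}                    # word -> index; 0 is the empty word
        out, w = [], ""
        for ch in s:
            if w + ch in dic:
                w += ch                  # keep extending the current match
            else:
                out.append((dic[w], ch))
                dic[w + ch] = len(dic)   # add the new word u+c to the dictionary
                w = ""
        if w:                            # the message ended exactly on a dictionary word
            out.append((dic[w[:-1]], w[-1]))
        return out

    def lz78_decode(pairs):
        words = [""]
        for i, ch in pairs:
            words.append(words[i] + ch)
        return ''.join(words[1:])

    code = lz78_encode("ababcabc")
    print(code)                          # [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (3, 'c')]
    assert lz78_decode(code) == "ababcabc"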


LZW

• The code consists of indices only!
• Initially the dictionary contains every symbol of the alphabet
• Update the dictionary like LZ78
• In decoding there is a danger: see abcabca
  – if abc is in the dictionary,
  – abca is added to the dictionary
  – the next phrase is abca, so that code is output
  – but when decoding, after abc it is not yet known that abca is in the dictionary
• Solution: if a dictionary entry is used immediately after its creation, its first and last characters have to match

• Many representations for the dictionary:
  list, hash, sorted list, combination, binary tree, trie, suffix tree, ...
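A sketch of LZW over a small alphabet, including the decoder's special case discussed above: when a code arrives that is not yet in the decoder's dictionary, the phrase must be the previous phrase extended by its own first character (so its first and last characters match).

    def lzw_encode(s, alphabet):
        dic = {ch: i for i, ch in enumerate(alphabet)}
        out, w = [], ""
        for ch in s:
            if w + ch in dic:
                w += ch
            else:
                out.append(dic[w])
                dic[w + ch] = len(dic)
                w = ch
        if w:
            out.append(dic[w])
        return out

    def lzw_decode(codes, alphabet):
        dic = {i: ch for i, ch in enumerate(alphabet)}
        prev = dic[codes[0]]
        out = [prev]
        for code in codes[1:]:
            entry = dic[code] if code in dic else prev + prev[0]   # the tricky case
            out.append(entry)
            dic[len(dic)] = prev + entry[0]
            prev = entry
        return ''.join(out)

    alpha = "ab"
    codes = lzw_encode("abababa", alpha)
    print(codes)                          # the last code is used right after being created
    assert lzw_decode(codes, alpha) == "abababa"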

LZJ

• Coding: search for the longest prefix.
• Code: the address of the trie node
• From the root of the trie there is a transition on every symbol (like in LZW).
• If out of memory, remove the nodes/branches that have been used only once
• In practice, h=6 and the dictionary has 8192 nodes

LZFG

• An effective LZ method
• Derived from LZJ
• Create a suffix tree for the window
• Code: node address plus the number of characters along the edge.
• Internal and leaf nodes get different codes
• Small matches are coded directly... (?)

Burrows-Wheeler

• See the FAQ: http://www.faqs.org/faqs/compression-faq/part2/section-9.html
• The method described in the original paper is really a composite of three different algorithms:
  – the block sorting main engine (a lossless, very slightly expansive preprocessor),
  – the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder) and
  – a simple statistical compressor (first-order Huffman is mentioned as a candidate) eventually doing the compression.

• Of these three methods only the first two are discussed here, as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first-order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher-order probabilities of the input block (thus making them more even, whitening them) into slack in the lower-order statistics. This effect is what is seen in the histogram of the resulting symbol data.

• Please read the article by Mark Nelson:
• Data Compression with the Burrows-Wheeler Transform, Mark Nelson, Dr. Dobb's Journal, September 1996. http://marknelson.us/1996/09/01/bwt/

Burrows-Wheeler Transform (BWT)


CODE:    t: hat acts like this:<13><10><1

t: hat buffer to the constructor

t: hat corrupted the heap, or wo

W: hat goes up must come down<13

t: hat happens, it isn't likely

w: hat if you want to dynamicall

t: hat indicates an error.<13><1

t: hat it removes arguments from

t: hat looks like this:<13><10><

t: hat looks something like this

t: hat looks something like this

t: hat once I detect the mangled

   

Example

• Decode: errktreteoe.e

• Hint: '.' is the last character and alphabetically the first…
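A small sketch of the forward and inverse transform (mine) using an explicit end-of-string sentinel and the naive O(n^2 log n) rotation sort; the decode exercise above instead relies on knowing the last character and the alphabetical order, and bzip2-style implementations use suffix sorting:

    def bwt(s, eos="$"):
        s += eos
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return ''.join(r[-1] for r in rotations)     # last column of the sorted rotations

    def inverse_bwt(last, eos="$"):
        n = len(last)
        table = [""] * n
        for _ in range(n):                           # repeatedly prepend the last column and sort
            table = sorted(last[i] + table[i] for i in range(n))
        row = next(r for r in table if r.endswith(eos))
        return row[:-1]                              # drop the sentinel

    code = bwt("banana")
    print(code)                                      # annb$aa
    assert inverse_bwt(code) == "banana"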


Syntactic compression

• Context-free grammar for representing the syntax tree
• Usually for source code
• Assumption: the program is syntactically correct
• Comments
• Features, constants - group by group

Image  compression  

•  Many  images,  photos,  sound,  video,  ...    

Fax  group  3  

• Fax/group 3
• Black/white, 0/1 code
• Run-length: 000111001000 → 3,3,2,1,3
• Variable-length codes for representing run lengths.

• Joint Photographic Experts Group, JPEG 2000: http://www.jpeg.org/jpeg2000/

• Image Compression -- JPEG, from W.B. Pennebaker, J.L. Mitchell, "The JPEG Still Image Data Compression Standard", Van Nostrand Reinhold, 1993.

• Color image, 8 or 12 bits per pixel per color.
• Four modes: Sequential Mode, Lossless Mode, Progressive Mode, Hierarchical Mode
• DCT (Discrete Cosine Transform)

• from http://www.utdallas.edu/~aria/mcl/post/
• Lossy signal compression works on the basis of transmitting the "important" signal content, while omitting other parts (quantization). To perform this quantization effectively, a linear de-correlating transform is applied to the signal prior to quantization. All existing image and video coding standards use this approach. The most commonly used transform is the Discrete Cosine Transform (DCT), used in JPEG, MPEG-1, MPEG-2, H.261 and H.263 and its descendants. For a detailed discussion of the theory behind quantization and justification of the usage of linear transforms, see reference [1] below.

• A brief overview of JPEG compression is as follows. The JPEG encoder partitions the image into 8x8 blocks of pixels. To each of these blocks it applies a 2-dimensional DCT. The transform matrix is normalized (element-wise) by an 8x8 quantization matrix and then rounded to the nearest integer. This operation is equivalent to applying different uniform quantizers to different frequency bands of the image. The high-frequency image content can be quantized more coarsely than the low-frequency content, due to two factors.
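A rough numeric sketch of that block step (my illustration, not the standard's reference code): an orthonormal 8x8 DCT-II, element-wise division by a quantization matrix and rounding; the matrix Q below is the example luminance table from the JPEG standard, used here only for illustration.

    import numpy as np

    def dct_matrix(n=8):
        """Orthonormal DCT-II basis C, so that coefficients = C @ block @ C.T."""
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        C[0, :] = np.sqrt(1.0 / n)
        return C

    Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
                  [12, 12, 14, 19, 26, 58, 60, 55],
                  [14, 13, 16, 24, 40, 57, 69, 56],
                  [14, 17, 22, 29, 51, 87, 80, 62],
                  [18, 22, 37, 56, 68, 109, 103, 77],
                  [24, 35, 55, 64, 81, 104, 113, 92],
                  [49, 64, 78, 87, 103, 121, 120, 101],
                  [72, 92, 95, 98, 112, 100, 103, 99]])

    C = dct_matrix()
    block = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float) - 128
    coeffs = C @ block @ C.T                  # 2-D DCT of the level-shifted block
    quantized = np.round(coeffs / Q)          # high frequencies get coarser steps
    restored = C.T @ (quantized * Q) @ C + 128
    print(int(np.count_nonzero(quantized)), "non-zero coefficients out of 64")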

•  L9_Compression/lena/    

Lena  


Vector quantization

• Vector quantization
• A dictionary method
• 2-dimensional blocks

Discrete cosine transform

• A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images (where small high-frequency components can be discarded), to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical in these applications: for compression, it turns out that cosine functions are much more efficient (as explained below, fewer are needed to approximate a typical signal), whereas for differential equations the cosines express a particular choice of boundary conditions.

• http://en.wikipedia.org/wiki/Discrete_cosine_transform

[Figure] 2D DCT (type II) compared to the DFT. For both transforms, the magnitude of the spectrum is on the left and the histogram on the right; both spectra are cropped to 1/4 to zoom in on the behaviour at the lower frequencies. The DCT concentrates most of the power in the lower frequencies.

• In particular, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are common.

• Digital Image Processing: http://www-ee.uta.edu/dip/

Block Diagram of JPEG Baseline
[Figure from Wallace's JPEG tutorial (1993); slides from ENEE631 Digital Image Processing (Spring'04), Lec13 - Transform Coding & JPEG]

475 x 330 x 3 = 157 KB, luminance
[Figure from Liu's EE330 (Princeton)]


RGB Components
[Figure from Liu's EE330 (Princeton)]

Y U V (Y Cb Cr) Components
Assign more bits to Y, fewer bits to Cb and Cr
[Figure from Liu's EE330 (Princeton)]

JPEG Compression (Q=75%)
45 KB, compression ratio ~ 4:1
[Figures from Liu's EE330 (Princeton)]

JPEG Compression (Q=75% & 30%)
45 KB vs 22 KB
[Figure from Liu's EE330 (Princeton)]

Uncompressed (100 KB), JPEG 75% (18 KB), JPEG 50% (12 KB), JPEG 30% (9 KB), JPEG 10% (5 KB)
[UMCP ENEE408G slides, created by M. Wu & R. Liu, © 2002]


1.4-billion-pixel digital camera

• Monday, November 24, 2008: http://www.technologyreview.com/computing/21705/page1/
• Giant Camera Tracks Asteroids
• The camera will offer sharper, broader views of the sky.
• The focal plane of each camera contains an almost complete 64 x 64 array of CCD devices, each containing approximately 600 x 600 pixels, for a total of about 1.4 gigapixels. The CCDs themselves employ the innovative technology called "orthogonal transfer", which is described below.

Diagram showing how an OTA chip is made up of 64 OTCCD cells, each of which has 600 x 600 pixels

Fractal compression

• Fractal Compression group at Waterloo: http://links.uwaterloo.ca/fractals.home.html
• A "Hitchhiker's Guide to Fractal Compression" For Beginners: ftp://links.uwaterloo.ca/pub/Fractals/Papers/Waterloo/vr95.pdf

• Encode using fractals.
• Search for regions that, with a simple transformation, can be made similar to each other.
• Compression ratio 20-80

• Moving Pictures Experts Group ( http://www.chiariglione.org/mpeg/ )
• MPEG Compression: http://www.cs.cf.ac.uk/Dave/Multimedia/node255.html
• The screen is divided into 256 blocks, in which the changes and movements are tracked
• Only differences from the previous frame are encoded
• Compression ratio 50-100

• When one compresses files, can we still use fast search techniques without decompressing first?
• Sometimes, yes
• e.g. Udi Manber has developed a method

• Approximate Matching of Run-Length Compressed Strings, Veli Mäkinen, Gonzalo Navarro, Esko Ukkonen

• We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

Kolmogorov (or algorithmic) complexity

• Kolmogorov, Chaitin, ...
• What is the compressed version of the sequence '1234567891011121314151617181920212223242526...'?
• Every symbol appears almost equally frequently; it looks almost "random" by entropy
• for i=1 to n do print i;
• The algorithmic complexity (or Kolmogorov complexity) of a string S is the length of the shortest program that reproduces S, often denoted K(S)
• Conditional complexity: K(S|T) - reproduce S given T.
• http://en.wikipedia.org/wiki/Algorithmic_information_theory

• Algorithmic information theory is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or Kolmogorov complexity) of a string as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to Gödel's incompleteness theorem and Turing's halting problem.
• The field was developed by Andrey Kolmogorov and Gregory Chaitin starting in the late 1960s.


Kolmogorov complexity: size of circle in bits...

[Figure: Data → Encoder (with Model) → Compressed data → Decoder (with the same Model) → Data]

G. J. Chaitin

• http://www.cs.umaine.edu/~chaitin/

• Distance using K:
• d(S,T) = ( K(S|T) + K(T|S) ) / ( K(S) + K(T) )
• We cannot calculate K, but we can approximate it
• e.g. by compression: LZ, BWT, etc.
• d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) )
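A sketch of approximating this distance with an off-the-shelf compressor standing in for K (zlib here; any LZ- or BWT-based compressor could be plugged in):

    import zlib

    def C(data: bytes) -> int:
        """Compressed size in bytes, a crude stand-in for Kolmogorov complexity K."""
        return len(zlib.compress(data, 9))

    def d(s: bytes, t: bytes) -> float:
        # d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) ): near 1 for similar data, near 2 for unrelated data
        return (C(s + t) + C(t + s)) / (C(s) + C(t))

    x = b"the quick brown fox jumps over the lazy dog " * 20
    y = b"the quick brown fox jumps over the lazy cat " * 20
    z = bytes(range(256)) * 4
    print(round(d(x, y), 3), round(d(x, z), 3))   # x and y come out far more alike than x and z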

• Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain, David Parry (PPT), in OSWIR 2005, the 2005 workshop on Open Source Web Information Retrieval

• Informatsioonikaugus ("Information distance") by Mart Sõmermaa, Fall 2003 (in the Data Mining Research seminar): http://www.egeen.ee/u/vilo/edu/2003-04/DM_seminar_2003_II/Raport/P06/main.pdf