From Kolmogorov and Shannon to Bioinformatics
and Grid Computing
Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo
Aim: give a flavour of fundamental novel discoveries about indexing and compression: a string, and any compact encoding of it, is the best index for itself
Give a flavour of some fundamental novel discoveries about distance functions and classification, particularly relevant for Bioinformatics
On the way, mention uses of: suffix trees, suffix arrays, the Burrows-Wheeler Transform, Move-to-Front…
In 30 min., an incredibly long journey: from Kolmogorov and Shannon to Grid Computing
References: available on-line
What do we mean by “Indexing” ?
Types of data: raw sequences of characters or bytes (DNA sequences, audio-video files, executables)
Types of query: character-based queries, arbitrary substrings
Indexing approaches:
• Full-text indexes » Suffix Array, Suffix Tree, …
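A minimal sketch of the full-text-index idea above, using a naive suffix array (the names `suffix_array` and `occurrences` are illustrative; production indexes use linear-time construction):

```python
# Sketch of a full-text index: sort the starting positions of all
# suffixes of the text, then binary-search that order to locate
# every occurrence of an arbitrary substring.
def suffix_array(text):
    # Naive O(n^2 log n) construction; real indexes build SA in O(n).
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, p):
    # Suffixes prefixed by p occupy one contiguous SA interval.
    lo, hi = 0, len(sa)
    while lo < hi:                              # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                              # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])                 # 0-based start positions

text = "mississippi#"
sa = suffix_array(text)
print(occurrences(text, sa, "ssi"))  # → [2, 5]
```

Sorting suffixes rather than storing them explicitly is what keeps the index within a constant factor of the text size.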
What do we mean by “Compression” ?
Any algorithm that squeezes data: lossless, lossy
Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServer x330 machines: same performance as a PC with double the memory, but at half the cost
Moral: it is more economical to store data in compressed form than uncompressed
» CPU speed nowadays makes (de)compression “costless” !!
What do we mean by “Classification” ?
Any tool that can group “related” objects together, e.g. the unaligned mitochondrial genomes in the NCBI classification
Compression and Indexing: Two sides of the same coin !
Do we witness a paradoxical situation ?
An index injects redundant data, in order to speed up pattern searches
Compression removes redundancy, in order to squeeze the space occupancy
NO, new results prove a mutual reinforcement behaviour !
Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)
Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)
• Classification is the “third side” of the coin: Kolmogorov Complexity, Information Theory, Compression and Indexing
Our journey, today...
Two tracks: Index design (Weiner ’73) and Compressor design (Shannon ’48)
Suffix Array (1990); Burrows-Wheeler Transform (1994)
Compressed Index: space close to gzip, bzip; query time close to O(|P|)
Compression Booster: a tool to transform a poor compressor into a better compression algorithm
Universal Distances and Classification (Kolmogorov)
We investigate: indexing ideas → compressor design
First Lap…in record time!!!
Booster
Key Idea 1: Suffix Tree [Weiner 73, McCreight 76, Ukkonen 92]
String: mississippi#
[Figure: the suffix tree of mississippi#, with edges labelled by substrings (i, s, ppi#, ssi, ssippi#, …) and each leaf numbered 1–12 with the starting position of its suffix]
Key Idea 2: Burrows-Wheeler Compression (1994)
Let us be given a string s = mississippi#
[Figure: the cyclic rotations of s; sorting the rows lexicographically and taking the last column yields bwt(s) = ipssm#pissii]
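The transform on the slide can be reproduced in a few lines; a minimal sketch, assuming s ends with a unique sentinel character '#':

```python
# Burrows-Wheeler transform: sort all cyclic rotations of s and
# read off the last column of the sorted rotation matrix.
def bwt(s):
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("mississippi#"))  # → "ipssm#pissii"
```

Because the sentinel is unique and smallest, sorting rotations and sorting suffixes give the same order, which is the bridge to the suffix array used later.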
Burrows and Wheeler Compression
Why it works:
BWT creates a locally homogeneous string:
abaababa → bbbaaaaa
MTF transforms it into a globally homogeneous sequence of integers:
bbbaaaaa → 00010000
The final string is “easy” to compress
Experimentally: compressibility is proportional to the % of zeros
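The Move-to-Front step can be sketched as follows (the initial table order "ba" is chosen only to reproduce the slide's example; MTF merely requires some agreed initial ordering):

```python
# Move-to-Front: emit each symbol's current position in a table,
# then move that symbol to the front, so locally repeated symbols
# become runs of zeros.
def mtf(s, alphabet):
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out

print(mtf("bbbaaaaa", "ba"))  # → [0, 0, 0, 1, 0, 0, 0, 0]
```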
Boosting [Ferragina, Giancarlo, Manzini, Sciortino, 03, 04, 05]
The technique takes a poor compressor A and turns it into a compressor Aboost with better performance guarantees
A is used as a black-box: s → Booster (running A) → c’
The better A is, the better Aboost is; the more compressible s is, the better Aboost is
Qualitatively, it can be shown:
c’ is shorter than c (the output of A alone on s), if s is compressible
Time(Aboost) = Time(A), i.e. no slowdown
We investigated: indexing ideas → compressor design
Let’s now turn to the other direction: compression ideas → index design
Second Lap…Even faster
Compressed Indexes
Suffix Array vs. BW-transform
[Figure: the lexicographically sorted rotations of mississippi#; the starting positions, read top to bottom, give the suffix array SA = 12 11 8 5 2 1 10 9 7 4 6 3, and the last column gives L = bwt(s) = ipssm#pissii]
L includes SA and T. Can we search within L ?
A compressed index [Ferragina-Manzini, IEEE FOCS 2000]
In practice, the index is very appealing: space close to the best known compressors, i.e. bzip; query time of a few milliseconds on hundreds of MBs
The theoretical result:
Query complexity: O(p + occ log N) time
Space occupancy: O(N Hk(T)) + o(N) bits, where Hk(T) is the k-th order empirical entropy of T
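How counting queries can be answered on L alone can be sketched with backward search in the FM-index style (a didactic version: `rank` is computed naively here, whereas the real index answers rank queries within the compressed space bound):

```python
# FM-index style backward search: count the occurrences of a
# pattern using only L = bwt(s), the C[] array and rank queries.
def backward_search(L, pattern):
    # C[c] = number of characters in L that are smaller than c
    C = {c: sum(1 for x in L if x < c) for c in set(L)}
    def rank(c, i):                 # occurrences of c in L[:i]
        return L[:i].count(c)       # naive; real indexes use o(N) bits
    lo, hi = 0, len(L)              # current SA interval [lo, hi)
    for c in reversed(pattern):     # extend the match one char leftward
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo                  # number of occurrences in s

L = "ipssm#pissii"                  # bwt("mississippi#")
print(backward_search(L, "si"))     # → 2
```

Each of the |P| iterations narrows the suffix-array interval of suffixes prefixed by the current pattern suffix, which is where the O(p)-flavoured query time comes from.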
Universal Distances and Classification
Third Lap…
Large Data Sets
Classification of Sequences on a Genome-wide Scale
Distances based on alignments are either not applicable
or too slow
Fast and reliable alignment-free methods are badly needed
Classification of Proteins, both for Function and Structure: lagging behind sequence data
Proteins and Their String Representations
Amino acid sequence (FASTA format);
Atomic coordinates (Atom lines);
Topologic Models (Top Diagrams)
Kolmogorov Complexity
The Kolmogorov Complexity K(x) of a string x is defined as the length of the shortest binary program that produces x.
The conditional Kolmogorov Complexity K(x|y) represents the minimum amount of information required to generate x by an effective computation when y is given as an input to the computation.
The Kolmogorov Complexity K(x,y) of a pair of objects x and y is the length of the shortest binary program that produces x and y and a way to tell them apart.
Universal Similarity metric (USM)
Problem: USM(x,y) is based on Kolmogorov Complexity, which is non-computable in the Turing sense.
Solution: K(x) can be approximated via data compression, using its relationship with Shannon Information Theory.
USM is a methodology rather than a single formula quantifying the similarity of two strings.
Approximations of USM
K(x) can be approximated by C(x), K(x,y) by C(xy), and K(x|y*) by C(xy) - C(x), where C(z) denotes the compressed size of z under a chosen compressor. We obtain three approximations to USM: UCD, NCD and CD. For instance, NCD(x,y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}.
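The NCD variant, for example, can be approximated with any off-the-shelf compressor; a sketch using zlib (the compressor choice and the test strings are illustrative, not those of the paper):

```python
import random
import zlib

def C(s):
    # C(s): compressed size of s; any lossless compressor can stand in
    return len(zlib.compress(s))

def ncd(x, y):
    # Normalized Compression Distance (Cilibrasi-Vitanyi)
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

x = b"ACGTACGTACGT" * 50                              # regular sequence
y = b"ACGTACGTACGT" * 50                              # identical twin
z = bytes(random.Random(0).choices(b"ACGT", k=600))   # unrelated noise
print(ncd(x, y) < ncd(x, z))                          # → True
```

Related strings share compressible structure, so concatenating them adds little to the compressed size and the distance stays small.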
Experiments [Ferragina, Giancarlo, Greco, Manzini, Valiente, 2007]
Experimental setup: five benchmark datasets of proteins (several alternative representations); a benchmark dataset of genomic sequences (complete unaligned mitochondrial genomes); twenty-five compression algorithms; three dissimilarity functions based on USM.
Two sets of experiments to compare USM with both alignment-based and alignment-free methods: via ROC analysis; via UPGMA and NJ.
An example: unaligned complete mitochondrial DNA genomes
Results and Conclusions
Useful guidelines for use of the USM methodology in biological investigation:
Which compressor to use; which among UCD, NCD and CD to use; which data representation is best; etc…
Software
Kolmogorov Library: http://www.math.unipa.it/~raffaele/kolmogorov/
Sequential processing is too slow even for relatively small data sets: e.g., classifying 278 files (1.5 MB) takes 12 hours on a state-of-the-art PC… half an hour on the Grid
Soon Available as a Grid-aware Web Service on COMETA Portal
Advertisement 2: 20th Edition of the Lipari International Summer School for Computer Scientists
TOPIC: Algorithms, Science and Engineering
See Lipari School Website