On Data Mining, Compression, and Kolmogorov Complexity

On Data Mining, Compression, and Kolmogorov Complexity. C. Faloutsos and V. Megalooikonomou, Data Mining and Knowledge Discovery, 2007.


Page 1: On Data Mining, Compression, and Kolmogorov Complexity

On Data Mining, Compression, and Kolmogorov Complexity.

C. Faloutsos and V. Megalooikonomou, Data Mining and Knowledge Discovery, 2007

Page 2: On Data Mining, Compression, and Kolmogorov Complexity

Problems:

- Can data mining be formalized into a logical system like the relational algebra?
- Can data mining be automated? That is, can we remove the need for a human to decide upon a data mining technique?
- Why do so many different approaches to clustering exist (implicitly: are all of these approaches necessary)? Partitioning, hierarchical, density-based, grid-based, model-based, …

Page 3: On Data Mining, Compression, and Kolmogorov Complexity

Motivation: Can data mining be formalized?

Implications of an affirmative answer: an SQL-like language for data mining could be created, and logical axioms and set-theoretic concepts could be used to reason about data mining techniques.

Can data mining be automated? Implications: a universally optimal approach to data mining could be determined without human intervention.

Why do so many clustering approaches exist? Are some redundant? Can we eliminate some of them?

Page 4: On Data Mining, Compression, and Kolmogorov Complexity

“Data Mining”? What exactly is “data mining”, anyway?

Lecture notes: the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.

Paper: supervised or unsupervised concept learning.

- Supervised: classification, regression.
- Unsupervised: clustering, pattern discovery.

Other tasks (forecasting, outlier detection, …) are related to, or dependent upon, concept learning.

Page 5: On Data Mining, Compression, and Kolmogorov Complexity

Intuitions?

Can concept learning be automated? In general, can we always find an optimal model?

Some ideas:

- MDL principle: the best model is the one that yields the best compression.
- Kolmogorov complexity: the length of the shortest description of a string in a fixed description language (equivalently, the shortest program that makes a fixed Universal Turing Machine generate the string). E.g., "1 2 4 8 16 32" is less complex than "1 4 2 7 6 3". A computable stand-in is sketched below.
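
A quick, hedged illustration (mine, not the paper's): a real compressor such as zlib gives a computable upper bound on description length, so we can at least observe that a patterned string compresses far better than a pseudorandom one of the same length.

```python
import random
import zlib

random.seed(42)
patterned = b"1 2 4 8 16 32 " * 100  # obvious regularity
random_ish = bytes(random.randrange(256) for _ in range(len(patterned)))

print(len(patterned), "->", len(zlib.compress(patterned, 9)))    # shrinks a lot
print(len(random_ish), "->", len(zlib.compress(random_ish, 9)))  # barely shrinks
```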

So data mining is essentially equivalent to data compression! Specifically, all we need to do is find the Kolmogorov complexity. There’s only one problem…

Page 6: On Data Mining, Compression, and Kolmogorov Complexity

Kolmogorov complexity is undecidable!

Page 7: On Data Mining, Compression, and Kolmogorov Complexity

A Simple Proof: Let K(x) denote the Kolmogorov complexity of string x.

To compute K(x), we would have to execute every program, select those that halt and output x, and among these choose the shortest one.

The negative solution to the halting problem shows that the step "select those that halt" is undecidable.

Thus Kolmogorov complexity is also undecidable. (A resource-bounded caricature of this search is sketched below.)
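
The following is my own toy construction, not the paper's: programs are strings over a small Brainfuck-style alphabet, and an explicit step budget stands in for the halting decision we cannot actually make. Dropping the budget (and the length cap) is exactly what undecidability forbids.

```python
import itertools

def run(prog, budget=10_000):
    """Interpret prog (which takes no input); return its output as bytes,
    or None if prog is malformed or exceeds the step budget."""
    stack, match = [], {}
    for i, ch in enumerate(prog):  # pre-match brackets; reject unbalanced code
        if ch == '[':
            stack.append(i)
        elif ch == ']':
            if not stack:
                return None
            j = stack.pop()
            match[i], match[j] = j, i
    if stack:
        return None
    tape, ptr, out, pc, steps = [0] * 1000, 0, [], 0, 0
    while pc < len(prog):
        steps += 1
        if steps > budget:
            return None  # we cannot know whether prog would ever halt
        ch = prog[pc]
        if ch == '>':
            ptr = (ptr + 1) % 1000
        elif ch == '<':
            ptr = (ptr - 1) % 1000
        elif ch == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == '.':
            out.append(tape[ptr])
        elif ch == '[' and tape[ptr] == 0:
            pc = match[pc]
        elif ch == ']' and tape[ptr] != 0:
            pc = match[pc]
        pc += 1
    return bytes(out)

def k_upper_bound(x, max_len=6):
    """Length of the shortest toy program (up to max_len) that prints x.
    Removing budget and max_len would compute K(x) for this machine,
    which would require solving the halting problem."""
    for n in range(1, max_len + 1):
        for prog in itertools.product('><+-.[]', repeat=n):
            if run(''.join(prog)) == x:
                return n
    return None

print(k_upper_bound(bytes([1, 2])))  # 4, via the program '+.+.'
```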

Page 8: On Data Mining, Compression, and Kolmogorov Complexity

Answers:

- Can data mining be formalized into a logical system like the relational algebra? No.
- Can data mining be automated? No.
- Why do so many different approaches to clustering exist? Because we can't automatically choose the best one every time!

All of these results follow from the undecidability of Kolmogorov complexity.

Page 9: On Data Mining, Compression, and Kolmogorov Complexity

Formal Definition of Kolmogorov Complexity

Conditional Kolmogorov complexity:

K_U(x|y) = min{ |p| : p ∈ {0,1}* and U(p, y) = x }

where p is a description of x (such as a program) and the Universal Turing Machine U outputs x when given p together with the auxiliary information y. In other words, K_U(x|y) is the length of the shortest program that makes the fixed UTM produce the desired output. (A compressor-based proxy is sketched below.)

Unconditional Kolmogorov complexity:

K(x) = K(x|ε), where ε is the empty string. In other words, no auxiliary information is provided.
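
K(x|y) itself is uncomputable, but a real compressor C yields the common practical proxy C(y + x) - C(y): the extra bytes that x costs once the compressor has seen y. This is the idea behind compression-based distances; a minimal sketch, with helper names of my own choosing:

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length: a computable stand-in for description length."""
    return len(zlib.compress(s, 9))

def cond_k_proxy(x: bytes, y: bytes) -> int:
    """Approximate K(x|y) as C(y + x) - C(y)."""
    return c(y + x) - c(y)

x = b"abcabcabcabcabcabcabcabc"
print(cond_k_proxy(x, x))    # small: x is almost free given a copy of itself
print(cond_k_proxy(x, b""))  # larger: a proxy for the unconditional K(x)
```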

Kolmogorov incompressibility: defined as the condition K(x) ≥ |x|.

Recall from the SLIQ presentation: Cost(M, D) = Cost(D|M) + Cost(M). Cost(M) is not always bounded by |x|, so K(x) > |x| is possible.

Incompressible strings are random, but the converse does not always hold (though it usually does). The digits of π, for example, look random in the statistical sense, yet the string is not incompressible: it is an example of an infinite-length string with finite (and low) complexity.

Incompressibility can be used as a proof technique, per "Kolmogorov Incompressibility Method in Formal Proofs: A Critical Survey" by V. Megalooikonomou.

Page 10: On Data Mining, Compression, and Kolmogorov Complexity

Some theorems

The Lempel-Ziv encoding length may be used as an upper bound on Kolmogorov complexity:

K(x|n) ≤ l_LZ(x) + c

where n is the length of the string x, l_LZ(x) is the length of the Lempel-Ziv encoding of x, and c is a constant.

We may also relate the complexity of a random sequence to its entropy: asymptotically, the per-symbol Lempel-Ziv length matches the per-symbol complexity, and the expected per-symbol complexity converges to the source entropy. (A numerical illustration follows.)

lim_{n→∞} (1/n) l_LZ(x) = lim_{n→∞} (1/n) K(x|n)

(1/n) E[K(Xⁿ|n)] → H(X) as n → ∞

where Xⁿ denotes a sequence of n symbols from the source X.
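
A numerical illustration of the second statement, under my own toy setup: for an i.i.d. biased-coin source, the per-symbol rate of zlib (an LZ-family compressor) should sit above the entropy H(X) and track it.

```python
import math
import random
import zlib

random.seed(0)
p, n = 0.9, 200_000  # biased coin
x = bytes(random.random() < p for _ in range(n))  # one 0/1 symbol per byte

h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
rate = 8 * len(zlib.compress(x, 9)) / n
print(f"H(X)      = {h:.3f} bits/symbol")
print(f"zlib rate = {rate:.3f} bits/symbol")  # above H(X), as expected
```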

Page 11: On Data Mining, Compression, and Kolmogorov Complexity

Why is complexity important?

Classification:

- Usually uses a measure of homogeneity.
- Entropy gain: we already demonstrated that this is related to Kolmogorov complexity.
- Gini index: a special case of the generalized (Rényi) entropies. (Both measures are sketched in code after this list.)

Clustering:

- The optimal number of clusters (k) is difficult to determine.
- Parameter-free approaches penalize complexity using the Akaike Information Criterion, the Bayesian Information Criterion, or the Minimum Description Length principle. All three relate to Kolmogorov complexity.

Distance functions:

- Many common distance functions use similar principles, but do not explicitly use Kolmogorov complexity (due to its undecidability).
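
A small sketch of the two homogeneity measures named above (the helper names are mine, not the paper's):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((k / n) ** 2 for k in Counter(labels).values())

print(entropy("aabb"), gini("aabb"))  # 1.0 0.5: maximally mixed, two classes
print(entropy("aaaa"), gini("aaaa"))  # -0.0 0.0: a pure node
```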

Page 12: On Data Mining, Compression, and Kolmogorov Complexity

Relevant and Nontrivial Arguments

Do these arguments work for lossy compression? Yes: lossy schemes can be made lossless by encoding the difference between the model and reality (a sketch follows below).

What if the simplest model isn't the best? The simplicity of the model could be deceptive: the model itself may be simple, but the number of parameters required to fit the data to it (or the number of digits in a single parameter) may result in high overall complexity.
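
A hedged sketch of the lossy-to-lossless argument (the byte-level scheme is an illustrative assumption, not from the paper): store the model's prediction plus the exact residuals, and the decoder adds them back.

```python
import zlib

def encode(data: bytes, pred: bytes) -> bytes:
    """Compress the residuals between reality and the model's prediction."""
    residuals = bytes((d - m) % 256 for d, m in zip(data, pred))
    return zlib.compress(residuals, 9)  # Cost(D|M): cheap if the model is good

def decode(blob: bytes, pred: bytes) -> bytes:
    """Recover the data exactly by adding the residuals back."""
    residuals = zlib.decompress(blob)
    return bytes((r + m) % 256 for r, m in zip(residuals, pred))

data = bytes(range(256))                       # "reality"
pred = bytes(i - (i % 5) for i in range(256))  # an imperfect (lossy) model
blob = encode(data, pred)
assert decode(blob, pred) == data              # lossless despite the lossy model
print(len(blob), "residual bytes")
```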

Page 13: On Data Mining, Compression, and Kolmogorov Complexity

Positive aspects

Thinking about data mining as compression will aid the discovery of new parameter-free methods of classification, clustering, and distance-function design.

Data mining will remain an art: even if we do stumble upon the optimal domain-independent model, we cannot prove that it is optimal.

Comparing two models, however, is easy: simply compare the number of bytes each one needs to describe the data.

Page 14: On Data Mining, Compression, and Kolmogorov Complexity

Experiments: Comparing models

Example: metabolic rate vs. mass represented using (a) a piecewise-flat approximation, as in CART; (b) a piecewise-linear approximation; and (c) a linear approximation on a log-log scale.

Kleiber's law: R ∝ M^(3/4). With better models, such as (c), we may achieve better compression. However, this may require domain knowledge: a data miner without knowledge of Kleiber's law may be tempted to use (b). (A toy version of this comparison is sketched below.)
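
A toy MDL-style version of this comparison on synthetic Kleiber-like data (entirely illustrative; the scoring scheme and constants are my assumptions): score each model by Cost(M) + Cost(D|M), counting parameter bytes plus zlib-compressed residual bytes.

```python
import math
import random
import zlib

random.seed(1)
mass = [random.uniform(1, 1000) for _ in range(500)]
rate = [m ** 0.75 * random.uniform(0.99, 1.01) for m in mass]  # Kleiber-like

def cost(params, residuals):
    """Cost(M) + Cost(D|M): parameter bytes plus compressed residual bytes."""
    param_bytes = 8 * len(params)  # assume 8 bytes per parameter
    packed = b"".join(int(r * 100).to_bytes(4, "big", signed=True)
                      for r in residuals)
    return param_bytes + len(zlib.compress(packed, 9))

# Model (c): a power law fitted by least squares in log-log space.
lx = [math.log(m) for m in mass]
ly = [math.log(r) for r in rate]
n = len(lx)
beta = ((n * sum(a * b for a, b in zip(lx, ly)) - sum(lx) * sum(ly))
        / (n * sum(a * a for a in lx) - sum(lx) ** 2))
alpha = math.exp((sum(ly) - beta * sum(lx)) / n)
resid_c = [r - alpha * m ** beta for m, r in zip(mass, rate)]

# Model (a): a piecewise-flat approximation (CART-like), 10 bins of width 100.
bins = [[] for _ in range(10)]
for m, r in zip(mass, rate):
    bins[min(9, int(m // 100))].append(r)
means = [sum(b) / len(b) for b in bins]
resid_a = [r - means[min(9, int(m // 100))] for m, r in zip(mass, rate)]

print("piecewise-flat cost:", cost(means, resid_a))
print("power-law cost:     ", cost([alpha, beta], resid_c))  # smaller here
```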

Page 15: On Data Mining, Compression, and Kolmogorov Complexity

Conclusion

- Data mining can be viewed as compression.
- Optimal compression is undecidable: it is equivalent to finding the Kolmogorov complexity, which is itself undecidable.
- Consequently, data mining will always remain an art.
- Unprecedented amounts of data and computing power will continue to challenge and inspire data miners.
- The compression-based view leads to many parameter-free data mining techniques.

Page 16: On Data Mining, Compression, and Kolmogorov Complexity

References

1. C. Faloutsos and V. Megalooikonomou, "On Data Mining, Compression, and Kolmogorov Complexity," Data Mining and Knowledge Discovery, Tenth Anniversary Issue, 2007.

2. V. Megalooikonomou, "Kolmogorov Incompressibility Method in Formal Proofs: A Critical Survey," Technical Report TR CS-97-01, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Jan. 1997.

Page 17: On Data Mining, Compression, and Kolmogorov Complexity

Thanks!

Any questions?