parameter-free spatial data mining using mdl. s. papadimitriou, a. gionis, p. tsaparas, r.a....

Parameter-Free Spatial Data Mining Using MDL.

S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos.

International Conference on Data Mining 2005

Problems: Finding patterns of spatial correlation and feature co-

occurrence. Automatically

That is, parameter-free. Simultaneously

For example: Spatial locations on a grid. Features correspond to species present in specific cells. Each pair of cell and species is 0 or 1, depending on species

present in that cell. Feature co-occurrence:

Cohabitation of species. Spatial correlation:

Natural habitats for species.

Motivation: Many applications

Biodiversity Data As we just demonstrated.

Geographical Data Presence of facilities on city blocks.

Environmental Data Occurrence of events (storms, drought, fire, etc.) in

various locations. Historical and Linguistic Data

Occurrence of words in different languages/countries, historical events in a set of locations.

Existing methods either: Detect one pattern, but not both, or Require user-input parameters.

Background Minimum Description Length (MDL):

Let L(D|M) denote the code length required to represent data D given (using) model M. Let L(M) be the complexity required to describe the model itself.

The total code length is then: L(D, M) = L(D|M) + L(M)

This was used in SLIQ and is the intuitive notion behind the connection between data mining and data compression. The best model minimizes L(D, M), resulting in optimal

compression. Choosing the best model is a problem in its own right. This will be explored further in the next paper I present.

Background Quadtree Compression

Quadtrees: Used to index and reason about contiguous variable size

grid regions (among other applications, mostly spatial). Used for 2D data; kD analogue is a kD-tree.

“Full Quadtree”: All nodes have either 0 or 4 children. Thus, all internal nodes correspond to a partitioning of a

rectangular region into 4 subregions. Each quadtree’s structure corresponds to a unique partitioning.

Transmission: If we only care about the structure (spatial partitioning), we

can transmit a 0 for internal nodes and a 1 for leaves in depth-first order.

If we transmit the values as well, the cost is the number of leaves times the entropy of the leaf value distribution.

Example

Quadtree Encoding Let T be a quadtree with m leaf nodes, of which mp

have value p. The total codelength is:

If we know the distribution of the leaf values, we can calculate this in constant time.

Updating the tree requires O(log n) time in the worst case, as part of the tree may require pruning.

13

4),...,,()(

21

m

m

m

m

m

m

mmHTL

k

Binary Matrices / Bi-groupings: Bi-grouping:

Simultaneous grouping of m rows and n columns into k and l disjoint row and column groups.

Let D denote an m x n binary matrix. The cost of transmitting D is given as follows:

Recall the MDL Principle: L(D) = L(D|M) + L(M).

Let {Qx, Qy} be a bi-grouping. Lemma (we will skip the proof):

The codelength for transmitting an m-to-k mapping Qx where mp symbols are mapped to the value p is approximately:

),...,,()|(21

m

m

m

m

m

mmHkQL

kx

),,,(),,,|()( lkQQLlkQQDLDL YXYX

Methodology Exploiting spatial locality:

Bi-grouping as presented is nonspatial! To make it spatial, assign a non-uniform prior to possible

groupings. That is, adjacent cells are more likely to belong to the same group.

Row groups correspond to spatial groupings. “Neighborhoods” “Habitats” Row groupings should demonstrate spatial coherence.

Column groups correspond to “families”. “Mountain birds” “Sea birds”

Intuition Alternately group rows and columns iteratively until the

total cost L(D) stops decreasing. Finding the global optimum is very expensive.

So our approach will use a greedy search for local optima.

Algorithms INNER:

Group given the number of row and column groups.

Start with an arbitrary bi-grouping of matrix D into k row groups and l column groups.

do {Letfor each row i from 1 to n

1 ≤ p ≤ k such that the “cost gain”:

is maximized.Repeat for columns, producing the bi-groupingt += 2

} while (L(D) is decreasing)

),( 00YX QQ

tX

tX QQ 1

piQtX )(1

))|(),,,|(())|(),,,|(( 11 kQLkQQDLkQLkQQDL tX

tY

tX

tX

tY

tX

ll

),( 21 tY

tX QQ

Algorithms OUTER:

Finds the number of row and column groups.

Start with k0 = l0 = 1.

Split the row group p* with the maximum per-row entropy, holding the columns fixed.

Move each row in p* to a new group kT+1 iff doing so would decrease the per-row entropy of p*, resulting in a grouping

Assign group to the result of INNERIf the cost does not decrease, return Otherwise, increment t and repeat.Finally, perform this again for the columns.

1tXQ

),( 11 tY

tX QQ ),( 1 t

Y

t

X QQ

),,,( 1 t

Y

t

X

tt QQk l

Complexity INNER is linear with respect to nonzero elements

in D. Let nnz denote those elements. Let k be the number of row groupings and l be the

number of column groupings. Row swaps are performed in the quadtree and take

O(log m) time each, where m is the number of cells. Let T be the iterations required to minimize the cost. O(nnz * (k + l + log m) * T)

OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz. Let n be the number of row splits. O((k + l)2nnz + (k + l) n log m)

Experiments NoisyRegions

Three features (“species”) on a 32x32 grid. So D has 32x32 = 1024 rows. And 3 columns.

3% of each cell, chosen at random, has a wrong species, also randomly chosen.

The spatial and non-spatial groupings are shown to the right. Recall: Bi-grouping is not spatial by default.

Spatial grouping reduces the total codelength.

The approach is not quite perfect due to the heuristic nature of the algorithm.

Experiments Birds

219 Finnish bird species over 3813 10x10km habitats.

Species are the features, habitats are cells. So our matrix is 3813x219.

The spatial grouping is clearly more coherent. Spatial grouping reveals Boreal zones:

South Boreal: Light Blue and Green. Mid Boreal: Yellow. North Boreal: Red.

Outliers are (correctly) grouped alone. Species with specialized habitats. Or those reintroduced into the wild.

Other approaches Clustering

k-means Variants using different estimates of central tendency:

k-medoids, k-harmonic means, spherical k-means, … Variants determining k based on some criteria:

X-means, G-means, … BIRCH CURE DENCLUE LIMBO

Also information-theoretic. Approaches either lossy, parametric, or aren’t

easily adaptable to spatial data.

Room for improvement: Complexity

O(n * log m) cost for reevaluating the quadtree codelength. O(log m) worst-case time for each reevaluation/row swap * n

swaps. However, the average-case complexity is probably much better. If we know something about the data distribution, we might be

able to reduce this. Faster convergence

Fewer iterations, reducing the scaling factor T. Rather than stopping only when there is no decrease in cost,

perhaps stop when we fall below a threshold? (Introduces a parameter)

Accuracy The search will only find local optima, leading to errors. We can employ some approaches used in annealing or genetic

algorithms to attempt to find the global optimum. Randomly restarting in the search space, for example. Stochastic gradient descent – similar to what we’re already doing,

actually.

Conclusion Simultaneous and automatic grouping of

spatial correlation and feature co-habitation. Easy to exploit spatial locality. Parameter-free. Utilizes MDL:

Minimizes the sum of the model cost and the data cost given the model.

Efficient. Almost linear with the number of entries in the

matrix.

References1. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A.

Vaisanen, H. Mannila, C. Faloutsos, "Parameter-Free Spatial Data Mining Using MDL", ICDM, Houston, TX, U.S.A., November 27-30, 2005.

2. M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.

Thanks!

Any questions?

parameter-free spatial data mining using mdl. s. papadimitriou, a. gionis, p. tsaparas, r.a....

Documents