Turning Clusters into Patterns: Rectangle-based Discriminative Data Description
Byron J. Gao and Martin Ester
IEEE ICDM 2006
Adviser: Koh Jia-Ling
Speaker: Liu Yu-Jiun
Date: 2006/11/8
2
Introduction
The goal of data mining is to discover useful knowledge.
Clustering presents clusters as sets of points.
The clusters should be interpreted as human-comprehensible patterns.
Previous work considered only the length of the patterns and described the cluster C directly.
3
SOR description
Sum of Rectangles (SOR) is the canonical format for cluster descriptions.
SOR±: either SOR or SOR⁻ for C.
Black: cluster C (R1 and R2)
Red: other cluster (R1’)
Green: Bc
SOR description: R1 + R2
SOR⁻ description: Bc – R1’
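The two description formats on this slide can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Rect` class, the function names, and the toy coordinates are all assumptions made for the example. It treats an SOR description as a union of axis-parallel rectangles and an SOR⁻ description as the bounding box Bc minus the rectangles of the other cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-parallel rectangle given by per-dimension lower and upper bounds."""
    lo: tuple
    hi: tuple

    def contains(self, p):
        # Bounds are treated as inclusive (an assumption of this sketch).
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


def covers_sor(rects, p):
    """SOR: a point is covered if it lies in any rectangle of the sum."""
    return any(r.contains(p) for r in rects)


def covers_sor_minus(bc, other_rects, p):
    """SOR-: a point is covered if it lies in Bc but in none of the subtracted rectangles."""
    return bc.contains(p) and not covers_sor(other_rects, p)


# Toy 2-D layout, made up for illustration.
r1 = Rect((0, 0), (2, 2))          # part of cluster C
r2 = Rect((3, 0), (5, 2))          # part of cluster C
bc = Rect((0, 0), (5, 2))          # bounding box of C
r1_prime = Rect((2.5, 0), (2.9, 2))  # rectangle around the other cluster

inside_c = (1, 1)
inside_other = (2.7, 1)
print(covers_sor([r1, r2], inside_c), covers_sor_minus(bc, [r1_prime], inside_c))
print(covers_sor([r1, r2], inside_other), covers_sor_minus(bc, [r1_prime], inside_other))
```

In this toy layout both descriptions agree on the two sample points, but SOR needs two rectangles while SOR⁻ subtracts only one from Bc.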
4
Notations
$E_{SOR}(C) = R_1 + R_2$
5
Example
[Figure: rectangles R1, R2, R3, R4, R5 and R1’, R2’, R3’]
$E_{SOR}(C) = R_1 + R_2 + R_3 + R_4 + R_5$
$E_{SOR^-}(C) = B_C - (R_1' + R_2' + R_3')$
6
Problems
Maximum Description Accuracy (MDA)
Minimum Description Length (MDL)
A novel description format: the kSOR± description
7
Accuracy Formula
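$\text{recall} = \frac{|C \cap E|}{|C|}$
$\text{precision} = \frac{|C \cap E|}{|E|}$
$f = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
Here E denotes the set of points covered by the description of C.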
Two additional measures:
1. Recall at fixed precision. (fix precision = 1)
2. Precision at fixed recall. (fix recall = 1)
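The accuracy measures above can be computed directly from two point sets. The sketch below assumes points are identified by hashable ids; the function name `accuracy_measures` and the sample sets are hypothetical and chosen only for illustration.

```python
def accuracy_measures(cluster, covered):
    """Return (recall, precision, f) for a description.

    cluster: ids of the points in cluster C
    covered: ids of the points covered by the description E
    """
    cluster, covered = set(cluster), set(covered)
    hit = len(cluster & covered)
    recall = hit / len(cluster) if cluster else 0.0
    precision = hit / len(covered) if covered else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f


# C has four points; the description covers three points, two of them in C.
recall, precision, f = accuracy_measures({1, 2, 3, 4}, {3, 4, 5})
print(recall, precision, f)
```

Fixing precision = 1 corresponds to a description that covers no point outside C, and fixing recall = 1 to one that covers every point of C.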
8
Three Heuristic Algorithms
Learn2Cover: addresses MDL, approximating max length.
Length of rectangle.
DesTree: addresses MDA, approximating the Pareto front.
FindClans: transforms the output from DesTree into a shorter final description.
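To make the notion of a Pareto front concrete, the sketch below filters a list of candidate descriptions, each summarized as a (length, accuracy) pair, down to those not dominated by any other candidate. It illustrates the general concept only and is not how DesTree computes its approximation; the function name and the candidate values are assumptions.

```python
def pareto_front(candidates):
    """Keep the (length, accuracy) pairs that no other pair dominates.

    A pair dominates another if it is no longer and no less accurate,
    and strictly better in at least one of the two.
    """
    front = []
    for i, (len_i, acc_i) in enumerate(candidates):
        dominated = any(
            len_j <= len_i and acc_j >= acc_i and (len_j < len_i or acc_j > acc_i)
            for j, (len_j, acc_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((len_i, acc_i))
    return sorted(front)


# Made-up candidates: (description length, accuracy).
candidates = [(1, 0.60), (2, 0.80), (3, 0.75), (4, 0.95), (4, 0.90)]
front = pareto_front(candidates)
print(front)
```

Each point on the front is a different trade-off between a short description and an accurate one.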
9
Learn2Cover
xo is the next point from Bc in the sorted order.
10
Cost of Learn2Cover
l_j(R): the length of rectangle R along dimension Dj.
R’: the expanded R in covering xo.
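The two quantities defined on this slide can be sketched as follows. The cost formula itself is not in the transcript, so `expansion_cost` below is a hypothetical stand-in (the total growth in side length over all dimensions), not the cost used by Learn2Cover. Rectangles are assumed to be (lower corner, upper corner) tuples.

```python
def length(rect, j):
    """l_j(R): the length of rectangle R along dimension j."""
    lo, hi = rect
    return hi[j] - lo[j]


def expand(rect, x):
    """R': the smallest rectangle containing both R and the point x."""
    lo, hi = rect
    new_lo = tuple(min(l, v) for l, v in zip(lo, x))
    new_hi = tuple(max(h, v) for h, v in zip(hi, x))
    return new_lo, new_hi


def expansion_cost(rect, x):
    """Hypothetical cost: total growth in side length when R is expanded to cover x."""
    grown = expand(rect, x)
    return sum(length(grown, j) - length(rect, j) for j in range(len(x)))


R = ((0.0, 0.0), (2.0, 1.0))
xo = (3.0, 0.5)
R_prime = expand(R, xo)
cost = expansion_cost(R, xo)
print(R_prime, cost)
```

A greedy covering step could then pick, among the existing rectangles, the one whose expansion to cover xo is cheapest.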
11
DesTree
DesTree takes the output of Learn2Cover, R or R⁻, as input.
The tree is built bottom-up.
Child nodes are merged into parent nodes until a single node is left.
Each node represents a rectangle.
The higher in the tree we cut, the shorter the description and the lower the accuracy.
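The bottom-up construction can be sketched as repeated pairwise merging, where every intermediate level is one possible cut. The merge criterion here (smallest bounding box by total side length) is an assumption made for the example; the transcript does not state which pair DesTree merges. The function names and the leaf rectangles are likewise hypothetical.

```python
from itertools import combinations


def bounding_box(a, b):
    """Smallest rectangle enclosing rectangles a and b."""
    (alo, ahi), (blo, bhi) = a, b
    lo = tuple(min(p, q) for p, q in zip(alo, blo))
    hi = tuple(max(p, q) for p, q in zip(ahi, bhi))
    return lo, hi


def total_side_length(rect):
    lo, hi = rect
    return sum(h - l for l, h in zip(lo, hi))


def build_cuts(rects):
    """Merge rectangles bottom-up and return every cut, from the leaves to the root."""
    level = list(rects)
    cuts = [list(level)]
    while len(level) > 1:
        # Assumed criterion: merge the pair whose bounding box is smallest.
        i, j = min(
            combinations(range(len(level)), 2),
            key=lambda ij: total_side_length(bounding_box(level[ij[0]], level[ij[1]])),
        )
        merged = bounding_box(level[i], level[j])
        level = [r for k, r in enumerate(level) if k not in (i, j)] + [merged]
        cuts.append(list(level))
    return cuts


leaves = [((0, 0), (1, 1)), ((2, 0), (3, 1)), ((8, 8), (9, 9))]
cuts = build_cuts(leaves)
for cut in cuts:
    print(len(cut), cut)
```

Each later cut has one rectangle fewer than the previous one, so the description gets shorter, while the merged rectangles grow and may cover more points outside the cluster.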
12
merge
13
FindClans
FindClans takes a cut from DesTree as input and outputs a kSOR± description.
14
Algorithm -- FindClans
15
Experiments
Comparison with CART and BP.
Real datasets from the UCI repository, where data records with the same class label were treated as a cluster.
16
Comparisons with CART
Concerns both MDA and MDL.
17
DesTree vs. CART
[Figure: accuracy vs. length]
18
Comparisons with BP
BP addresses the MDL problem only.
Synthetic datasets were used.
Length reduction of 20% to 50%.
Learn2Cover runs without violation checking, so it is faster than BP.
19
Conclusions
kSOR± provides enhanced expressive power.
MDA allows trading accuracy for interpretability.
A paradigm for query-based “second-generation” database mining systems.