
Turning Clusters into Patterns: Rectangle-based Discriminative Data Description

Byron J. Gao and Martin Ester

IEEE ICDM 2006

Adviser: Koh Jia-Ling

Speaker: Liu Yu-Jiun

Date: 2006/11/8

Introduction

The goal of data mining is to discover useful knowledge.

Clustering algorithms present clusters as sets of points.

The clusters should be interpreted as human-comprehensible patterns.

Previous work considered only the length of the patterns and described the cluster C directly.

SOR description

Sum of Rectangles ($SOR$) is the canonical format for cluster descriptions.

$SOR^{\pm}$: either $SOR$ for C or $SOR^{-}$ for C.

Figure legend:

Black: cluster C (R1 and R2)

Red: other cluster (R1')

Green: $B_C$

$SOR$ description: R1 + R2

$SOR^{-}$ description: $B_C$ - R1'

Expressive power: $SOR \subseteq SOR^{\pm} \subseteq kSOR^{\pm}$
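The two descriptions in the figure can be read as membership tests over axis-parallel rectangles. The following is a minimal sketch, not the authors' code: the `Rect` encoding, the helper names, and all coordinates are assumptions made up for illustration.

```python
from typing import List, Sequence, Tuple

# A rectangle is a list of (low, high) intervals, one per dimension (assumed encoding).
Rect = List[Tuple[float, float]]


def in_rect(point: Sequence[float], rect: Rect) -> bool:
    """True if the point lies inside the axis-parallel rectangle (bounds inclusive)."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, rect))


def in_sor(point: Sequence[float], rects: List[Rect]) -> bool:
    """SOR: the point is covered by at least one rectangle of the sum."""
    return any(in_rect(point, r) for r in rects)


def in_sor_minus(point: Sequence[float], bounding_box: Rect, others: List[Rect]) -> bool:
    """SOR-minus: inside B_C but outside every rectangle that covers the other points."""
    return in_rect(point, bounding_box) and not in_sor(point, others)


# Hypothetical coordinates standing in for R1, R2, R1' and B_C.
r1 = [(0.0, 2.0), (0.0, 2.0)]
r2 = [(2.0, 4.0), (2.0, 4.0)]
r1_prime = [(2.5, 4.0), (0.0, 1.5)]
b_c = [(0.0, 4.0), (0.0, 4.0)]

print(in_sor((1.0, 1.0), [r1, r2]))                # point inside R1
print(in_sor_minus((3.0, 1.0), b_c, [r1_prime]))   # point inside R1', so excluded
```

Describing C by subtraction can be shorter than a direct sum when the other points inside $B_C$ are easier to cover than C itself.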

Notations

$E_{SOR}(C) = R_1 + R_2$

Example

Figure: rectangles R1, R2, R3, R4, R5 and R1', R2', R3'.

$E_{SOR}(C) = R_1 + R_2 + R_3 + R_4 + R_5$

$E_{SOR^{-}}(C) = B_C - (R_1' + R_2' + R_3')$

Problems

Maximum Description Accuracy (MDA)

Minimum Description Length (MDL)

A novel description format: the $kSOR^{\pm}$ description

Accuracy Formula

$recall = \frac{|C \cap E|}{|C|}$

$precision = \frac{|C \cap E|}{|E|}$

$f = \frac{2 \cdot recall \cdot precision}{recall + precision}$

Two additional measures:

1. Recall at fixed precision (precision fixed at 1).

2. Precision at fixed recall (recall fixed at 1).
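As a small worked example of these measures, the sketch below computes recall, precision, and f from two sets of point identifiers. The function name and the sample sets are hypothetical; `covered` stands for the set of points that fall inside the description E.

```python
from typing import Set, Tuple


def accuracy(cluster: Set[int], covered: Set[int]) -> Tuple[float, float, float]:
    """Return (recall, precision, f) for a description covering `covered` point ids."""
    hit = len(cluster & covered)
    recall = hit / len(cluster) if cluster else 0.0
    precision = hit / len(covered) if covered else 0.0
    total = recall + precision
    f = 2 * recall * precision / total if total > 0 else 0.0
    return recall, precision, f


# Hypothetical data: cluster C has four points, the description covers three points.
cluster_c = {1, 2, 3, 4}
covered_e = {3, 4, 5}
recall, precision, f = accuracy(cluster_c, covered_e)
print(recall, precision, f)
```

Here two of the four cluster points are covered, so recall is 1/2, precision is 2/3, and f is 4/7.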

Three Heuristic Algorithms

Learn2Cover: addresses the MDL problem, approximating max length.

Length of rectangle.

DesTree: addresses the MDA problem, approximating the Pareto front.

FindClans: transforms the output of DesTree into a shorter final description.

Learn2Cover

$x_o$ is the next point from $B_C$ in the sorted order.

Cost of Learn2Cover

$l_j(R)$: the length of rectangle R along dimension $D_j$.

R': the expanded R after covering $x_o$.
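The two quantities defined above can be illustrated in a few lines. This is only a sketch: the `Rect` encoding is an assumption, and `growth` is a stand-in measure chosen for the example, not the cost function used by Learn2Cover, which is not reproduced in these slides.

```python
from typing import List, Sequence, Tuple

Rect = List[Tuple[float, float]]  # one (low, high) interval per dimension (assumed encoding)


def length(rect: Rect, j: int) -> float:
    """l_j(R): the length of rectangle R along dimension D_j."""
    lo, hi = rect[j]
    return hi - lo


def expand(rect: Rect, x_o: Sequence[float]) -> Rect:
    """R': the smallest rectangle that contains both R and the point x_o."""
    return [(min(lo, x), max(hi, x)) for (lo, hi), x in zip(rect, x_o)]


def growth(rect: Rect, x_o: Sequence[float]) -> float:
    """Assumed stand-in cost: total increase in side length from R to R'."""
    expanded = expand(rect, x_o)
    return sum(length(expanded, j) - length(rect, j) for j in range(len(rect)))


r = [(0.0, 2.0), (1.0, 3.0)]
x_o = (3.0, 2.0)
print(expand(r, x_o))   # R' after covering x_o
print(growth(r, x_o))   # illustrative cost only
```

In this example only the first dimension has to grow, from length 2 to length 3, so the stand-in cost is 1.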

DesTree

DesTree takes the output of Learn2Cover, $R$ or $R^{-}$, as input.

It builds the tree bottom-up, merging child nodes into parent nodes until a single node is left.

Each node represents a rectangle.

The higher in the tree we cut, the shorter the description and the lower the accuracy.
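The bottom-up construction can be sketched as repeated pairwise merging of rectangles. The merge rule below (pick the pair whose enclosing rectangle has the smallest volume) is an assumption for illustration; the slides do not state the criterion DesTree actually uses, and the leaf rectangles are made-up data.

```python
from typing import List, Tuple

Rect = List[Tuple[float, float]]  # one (low, high) interval per dimension (assumed encoding)


def bounding(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b (the parent node of a merge)."""
    return [(min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def volume(r: Rect) -> float:
    """Product of the side lengths of the rectangle."""
    v = 1.0
    for lo, hi in r:
        v *= hi - lo
    return v


def build_levels(leaves: List[Rect]) -> List[List[Rect]]:
    """Merge bottom-up until one node is left; return the node set after each merge.

    Assumed merge rule: pick the pair whose parent rectangle has the smallest volume.
    """
    nodes = list(leaves)
    levels = [list(nodes)]
    while len(nodes) > 1:
        pairs = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))]
        i, j = min(pairs, key=lambda p: volume(bounding(nodes[p[0]], nodes[p[1]])))
        parent = bounding(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [parent]
        levels.append(list(nodes))
    return levels


# Hypothetical leaf rectangles, standing in for the output of Learn2Cover.
leaves = [[(0.0, 1.0), (0.0, 1.0)], [(1.0, 2.0), (0.0, 1.0)], [(5.0, 6.0), (5.0, 6.0)]]
for cut in build_levels(leaves):
    print(len(cut), cut)
```

Each entry of the returned list corresponds to one possible cut: later entries contain fewer rectangles, so the description is shorter, but the merged rectangles enclose more empty or foreign space, so accuracy drops.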

Figure: merge.

FindClans

FindClans takes as input a cut from DesTree and outputs a $kSOR^{\pm}$ description.


Algorithm -- FindClans

Experiments

The algorithms are compared with CART and BP.

Real datasets from the UCI repository are used; data records with the same class label are treated as a cluster.

Comparison with CART

Both the MDA and MDL problems are considered.

Figure: DesTree vs. CART, accuracy versus length.

Comparison with BP

BP addresses the MDL problem only.

Synthetic datasets are used.

Learn2Cover gains a 20% to 50% length reduction.

Learn2Cover runs without violation checking, so it is faster than BP.

Conclusions

$kSOR^{\pm}$ provides enhanced expressive power.

MDA allows trading accuracy for interpretability.

A paradigm for query-based "second-generation" database mining systems.
