Turning Clusters into Patterns: Rectangle-based Discriminative Data Description

Byron J. Gao and Martin Ester

IEEE ICDM 2006

Adviser: Koh Jia-Ling

Speaker: Liu Yu-Jiun

Date: 2006/11/8

Introduction

The goal of data mining is to discover useful knowledge.

Clusters are usually presented as sets of points.

Clusters should be interpreted as human-comprehensible patterns.

Past work considered only the length of the patterns and described the cluster C directly.

SOR description

Sum of Rectangles (SOR) is the canonical format for cluster descriptions.

SOR±: either SOR+ or SOR- for C.

[Figure: Black: cluster C (R1 and R2). Red: other cluster (R1'). Green: $B_C$.]

SOR+ description: R1 + R2

SOR- description: $B_C$ - R1'

Formats: SOR, SOR±, kSOR±
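The two description styles on this slide can be read as point-membership tests. The following is a minimal sketch, assuming axis-parallel rectangles, inclusive bounds, and that $B_C$ is a rectangle enclosing C. The function names and all coordinates are hypothetical and are not taken from the slides.

```python
from typing import List, Sequence, Tuple

# A rectangle is a list of (low, high) intervals, one per dimension.
Rect = List[Tuple[float, float]]

def in_rect(p: Sequence[float], r: Rect) -> bool:
    """True if point p lies inside rectangle r (bounds inclusive)."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, r))

def covered_by_sum(p: Sequence[float], rects: List[Rect]) -> bool:
    """SOR+ style: p is covered if any rectangle of the sum contains it."""
    return any(in_rect(p, r) for r in rects)

def covered_by_difference(p: Sequence[float], box: Rect, excluded: List[Rect]) -> bool:
    """SOR- style: p is covered if it is in the box but in no excluded rectangle."""
    return in_rect(p, box) and not covered_by_sum(p, excluded)

# Hypothetical 2-D layout loosely mirroring the figure labels.
R1 = [(0, 2), (0, 2)]
R2 = [(3, 5), (3, 5)]
B_C = [(0, 5), (0, 5)]
R1_prime = [(3, 5), (0, 2)]

in_plus = covered_by_sum((1, 1), [R1, R2])                  # inside R1
out_plus = covered_by_sum((4, 1), [R1, R2])                 # inside R1' only
in_minus = covered_by_difference((1, 1), B_C, [R1_prime])   # in B_C, not in R1'
out_minus = covered_by_difference((4, 1), B_C, [R1_prime])  # removed by R1'
```

In this layout both styles agree on the two sample points, but the SOR- form needs only one subtracted rectangle where the SOR+ form needs two.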

Notations

$E_{SOR^{+}}(C) = R_1 + R_2$

Example

[Figure: rectangles R1, R2, R3, R4, R5 and R1', R2', R3']

$E_{SOR^{+}}(C) = R_1 + R_2 + R_3 + R_4 + R_5$

$E_{SOR^{-}}(C) = B_C - (R_1' + R_2' + R_3')$

Problems

Maximum Description Accuracy (MDA)

Minimum Description Length (MDL)

A novel description: kSOR± description

Accuracy Formula

$recall = \frac{|C \cap E|}{|C|}$

$precision = \frac{|C \cap E|}{|E|}$

$f = \frac{2 \cdot recall \cdot precision}{recall + precision}$

Two additional measures:

1. Recall at fixed precision. (fix precision = 1)

2. Precision at fixed recall. (fix recall = 1)
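The accuracy measures can be checked on a toy case. This is a minimal sketch, assuming C is the set of points in the cluster and E is the set of points covered by the description. The function name `description_accuracy` and the point ids are hypothetical.

```python
def description_accuracy(cluster, covered):
    """Return (recall, precision, f) of a description for a cluster.

    cluster: ids of the points in cluster C
    covered: ids of the points covered by the description E (assumed meaning)
    """
    cluster, covered = set(cluster), set(covered)
    hits = len(cluster & covered)  # |C ∩ E|
    recall = hits / len(cluster) if cluster else 0.0
    precision = hits / len(covered) if covered else 0.0
    total = recall + precision
    f = 2 * recall * precision / total if total > 0 else 0.0
    return recall, precision, f

# Toy data: the cluster holds points 1..4, the description covers points 3..5.
recall, precision, f = description_accuracy({1, 2, 3, 4}, {3, 4, 5})
# recall = 2/4, precision = 2/3, f = 4/7
```

Fixing precision at 1 means the description may cover no point outside C, and fixing recall at 1 means it must cover every point of C.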

Three Heuristic Algorithms

Learn2Cover: addresses the MDL problem, approximating the max length.

Length of rectangle.

DesTree: addresses the MDA problem, approximating the Pareto front.

FindClans: transforms the output of DesTree into a shorter final description.

Learn2Cover

$x_o$ is the next point from $B_C$ in the sorted order.

Cost of Learn2Cover

$l_j(R)$: the length of rectangle $R$ along dimension $D_j$.

$R'$: $R$ expanded to cover $x_o$.
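The two symbols above can be made concrete with a small sketch. It only illustrates $l_j(R)$ and the expanded rectangle $R'$. The cost formula itself is not present in this transcript, so none is assumed here. The function names and the sample values are hypothetical.

```python
def side_length(rect, j):
    """l_j(R): length of rectangle rect along dimension j."""
    lo, hi = rect[j]
    return hi - lo

def expand_to_cover(rect, x):
    """R': the smallest rectangle that contains both rect and point x."""
    return [(min(lo, xj), max(hi, xj)) for (lo, hi), xj in zip(rect, x)]

# Hypothetical 2-D rectangle and next point x_o.
R = [(0.0, 2.0), (1.0, 3.0)]
x_o = (3.0, 2.0)

R_expanded = expand_to_cover(R, x_o)
# Per-dimension growth l_j(R') - l_j(R); only the first dimension has to grow.
growth = [side_length(R_expanded, j) - side_length(R, j) for j in range(len(R))]
```

Any cost built from these quantities would compare the side lengths of $R$ and $R'$ dimension by dimension.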

DesTree

DesTree takes the output of Learn2Cover, $R^{+}$ or $R^{-}$, as input.

The tree is built bottom-up.

Child nodes are merged into parent nodes until a single node is left.

Each node represents a rectangle.

The higher in the tree we cut, the shorter the description and the lower the accuracy.
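The bottom-up construction can be sketched as repeated pairwise merging. This is a minimal sketch, not the authors' procedure: the slides do not state which pair is merged at each step, so the sketch assumes the pair whose bounding rectangle has the smallest volume. All function names and coordinates are hypothetical.

```python
from itertools import combinations

def bounding(a, b):
    """Smallest rectangle containing rectangles a and b."""
    return [(min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]

def volume(rect):
    """Product of the side lengths of rect."""
    v = 1.0
    for lo, hi in rect:
        v *= hi - lo
    return v

def build_cuts(rects):
    """Merge rectangles bottom-up; return every cut from the leaves to the root."""
    level = list(rects)
    cuts = [level]
    while len(level) > 1:
        # Assumed criterion: merge the pair with the smallest bounding rectangle.
        i, j = min(combinations(range(len(level)), 2),
                   key=lambda ij: volume(bounding(level[ij[0]], level[ij[1]])))
        parent = bounding(level[i], level[j])
        level = [r for k, r in enumerate(level) if k not in (i, j)] + [parent]
        cuts.append(level)
    return cuts

# Hypothetical leaves: two adjacent rectangles and one far away.
leaves = [[(0, 1), (0, 1)], [(1, 2), (0, 1)], [(8, 9), (8, 9)]]
cuts = build_cuts(leaves)
sizes = [len(cut) for cut in cuts]  # description length shrinks at each higher cut
```

Each higher cut uses fewer rectangles, and the root rectangle also covers the empty space between the leaves, which is where accuracy is lost.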

[Figure: DesTree merge step]

FindClans

FindClans takes a cut from DesTree as input and outputs a kSOR± description.

Algorithm -- FindClans

Experiments

Compared with CART and BP.

Real datasets from the UCI repository, where data records with the same class label were treated as a cluster.

Comparisons with CART

Both the MDA and MDL problems are considered.

DesTree vs. CART

[Figure: DesTree vs. CART, axes: accuracy and length]

Comparisons with BP

BP addresses the MDL problem only.

Synthetic datasets were used.

A length reduction of 20% to 50% is gained.

Learn2Cover runs without violation checking, so it is faster than BP.

Conclusions

kSOR± provides enhanced expressive power.

MDA allows trading accuracy for interpretability.

A paradigm for query-based "second-generation" database mining systems.