IEEE Transactions on Knowledge and Data Engineering, TKDE (2009)

Post on 23-Feb-2016

DESCRIPTION

Information-theoretic distance measures for clustering validation: Generalization and normalization. Presenter: Lin, Shu-Han. Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE (2009).

TRANSCRIPT

Intelligent Database Systems Lab

N.Y.U.S.T.I. M.

Information-theoretic distance measures for clustering validation:

Generalization and normalization

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE (2009)

Presenter: Lin, Shu-Han
Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi

Outline

Motivation
Objective
Methodology
Experiments
Conclusion
Comments

Motivation

External criteria for clustering validation: information-theoretic distance measures are used to compare the clustering output with the “true” partition.

Clustering ability of algorithms: compare different clustering algorithms on a given dataset.

Clustering difficulty of datasets: compare different datasets under a given algorithm.

     A    B    C
1   30    0    1
2    2   20    0
3    0    2   15

σ: the “true” partition
π: the clustering output

Objectives

The dimension, size, and sparseness of the data, and the scales of the attributes, differ from dataset to dataset, so the range of a distance measure differs as well. To compare fairly, the distance must be normalized.

A    B    C
120  120  120

A   B   C   D   E   F   G
12  23  30  24  5   90  20
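To see why the range of an entropy-based measure depends on the dataset, the two rows of numbers above can be compared directly. This is only a sketch: it assumes the numbers are class sizes of two datasets (the slide does not label them), and the names `balanced`, `skewed`, and `shannon_entropy` are illustrative.

```python
import math

def shannon_entropy(sizes):
    """Shannon entropy (in bits) of the distribution given by raw sizes."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

# ASSUMPTION: the numbers on the slide are class sizes of two datasets.
balanced = [120, 120, 120]             # classes A-C
skewed = [12, 23, 30, 24, 5, 90, 20]   # classes A-G

h_balanced = shannon_entropy(balanced)
h_skewed = shannon_entropy(skewed)

# The entropy of each class distribution and its upper bound log2(k)
# differ between the two datasets, so raw values are not comparable.
print("balanced:", h_balanced, "upper bound:", math.log2(len(balanced)))
print("skewed:  ", h_skewed, "upper bound:", math.log2(len(skewed)))
```

Under this assumption the two datasets have different entropies and different upper bounds, which is the motivation for dividing a distance by its maximum reachable value before comparing across datasets.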

Methodology – Conditional Entropy

The equality C1=C2 yields the Shannon entropy

π: group label
σ: class label
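For the Shannon case, the conditional entropy can be computed from the 3×3 contingency table shown on the Motivation slide. The sketch below makes two assumptions that the transcript does not confirm: rows 1-3 are the groups of π and columns A-C are the classes of σ. The function names are illustrative.

```python
import math

# Contingency table from the Motivation slide.
# ASSUMPTION: rows 1-3 are groups of π, columns A-C are classes of σ.
table = [
    [30, 0, 1],
    [2, 20, 0],
    [0, 2, 15],
]

def conditional_entropy(counts):
    """Shannon conditional entropy H(columns | rows), in bits."""
    n = sum(sum(row) for row in counts)
    h = 0.0
    for row in counts:
        row_total = sum(row)
        for c in row:
            if c > 0:
                # -p(row, col) * log2 p(col | row)
                h -= (c / n) * math.log2(c / row_total)
    return h

def transpose(counts):
    """Swap the roles of rows and columns."""
    return [list(col) for col in zip(*counts)]

h_sigma_given_pi = conditional_entropy(table)             # H(σ | π)
h_pi_given_sigma = conditional_entropy(transpose(table))  # H(π | σ)
print(h_sigma_given_pi, h_pi_given_sigma)
```

A table with all mass on the diagonal gives a conditional entropy of zero in both directions, which matches the intuition that a perfect clustering leaves no uncertainty about the class label.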

Methodology – Quasi-Distance

Minimum reachable: d(π, σ) reaches its minimum over both π and σ iff π = σ

Symmetry: d(π, σ) = d(σ, π)

Triangle law: d(π, σ) + d(σ, τ) ≥ d(π, τ)

     A    B    C
1   30    0    1
2    2   20    0
3    0    2   15

σ: the “true” partition
π: the clustering output
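The three properties above can be checked numerically on small partitions. The sketch below assumes that the Shannon instance of the distance is d(π, σ) = H(π|σ) + H(σ|π); the transcript lost the slide's formula, so this form is an assumption rather than the authors' stated definition. The partitions `pi`, `sigma`, and `tau` are made-up label lists.

```python
import math
from collections import Counter

def cond_entropy(a, b):
    """Shannon conditional entropy H(a | b), in bits, for two label lists."""
    n = len(a)
    joint = Counter(zip(a, b))   # counts of (a-label, b-label) pairs
    marg_b = Counter(b)          # counts of b-labels
    return -sum((c / n) * math.log2(c / marg_b[y])
                for (_, y), c in joint.items())

def distance(a, b):
    """ASSUMED Shannon instance of the quasi-distance: H(a|b) + H(b|a)."""
    return cond_entropy(a, b) + cond_entropy(b, a)

# Hypothetical partitions of six objects, written as label lists.
pi = [0, 0, 0, 1, 1, 1]
sigma = [0, 0, 1, 1, 2, 2]
tau = [0, 1, 0, 1, 0, 1]

print("minimum reachable:", distance(pi, pi))
print("symmetry:", distance(pi, sigma), distance(sigma, pi))
print("triangle law:",
      distance(pi, sigma) + distance(sigma, tau), ">=", distance(pi, tau))
```

Checking a handful of partitions does not prove the properties; it only illustrates what each one says.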

Methodology – Normalization Issue

A    B    C
120  120  120

A   B   C   D   E   F   G
12  23  30  24  5   90  20

How to get it?

Methodology – Computation of

Generate a π0 ∈ PART(A) such that

σ: n

The worst result of π (m groups)

There is a difference between and

Experiments

Shannon Entropy

Pal Entropy

Gini Index

Goodman-Kruskal
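For orientation, the four measures can be written as functions of a probability vector. The transcript does not give the formulas, so the forms below are assumptions based on commonly stated definitions; in particular, the Pal entropy normalization (subtracting 1) and the Goodman-Kruskal form (1 minus the largest probability) should be checked against the paper.

```python
import math

def shannon(p):
    """Shannon entropy in bits: -sum p_i log2 p_i."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def pal(p):
    """ASSUMED form of Pal entropy: sum p_i * e^(1 - p_i) - 1."""
    return sum(x * math.exp(1 - x) for x in p) - 1

def gini(p):
    """Gini index: 1 - sum p_i^2."""
    return 1 - sum(x * x for x in p)

def goodman_kruskal(p):
    """ASSUMED form of the Goodman-Kruskal measure: 1 - max p_i."""
    return 1 - max(p)

measures = {
    "Shannon Entropy": shannon,
    "Pal Entropy": pal,
    "Gini Index": gini,
    "Goodman-Kruskal": goodman_kruskal,
}

# Hypothetical distributions: one uniform, one concentrated on a single class.
uniform = [0.5, 0.5]
certain = [1.0]

for name, f in measures.items():
    print(name, f(uniform), f(certain))
```

All four functions return zero for a distribution concentrated on one class and a positive value otherwise, but their maximum values differ, which is why each needs its own normalization.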

Conclusions

Quasi-distance: an external measure for clustering validation
  Symmetry
  Triangle law
  Minimum reachable

Normalization: uses the maximum value of a distance measure
  Makes it possible to compare the clustering performance of an algorithm on different datasets
  The normalized distance measures outperform the original distance measures
  The normalized Shannon distance has the best performance among the 4 observed distance measures

Comments

Advantage: the idea is intuitive; a theoretical analysis is provided.

Drawback: the paper should describe why the authors think quasi-distance is better than DCV.

Application: the same uses as DCV?
