Clustering Sequential Data: Research Paper Review

Presented by Glynis Hawley

April 28, 2003

On the Optimal Clustering of Sequential Data by Cheng-Ru Lin and Ming-Syan Chen, Electrical Engineering Department National Taiwan University, Taipei, Taiwan

Second SIAM International Conference on Data Mining April 11-13, 2002

http://www.siam.org/meetings/sdm02/proceedings/sdm02-09.pdf

Agenda

Introduction: What is sequential clustering?

Problem definition for algorithm design

Optimal Algorithm: SCOPT

Greedy Algorithm: SCGD

Conclusion

Sequential Clustering Problem

Attributes and sequence of objects are both important.

Objects within a cluster form a continuous region.

An object within one cluster may be closer to the centroid of a different cluster than it is to its own centroid.
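The last point can be illustrated with a tiny one-dimensional example. The values and the split position below are made up for illustration and do not come from the paper; the sketch only shows that, once clusters must be contiguous in the sequence, an object can sit nearer to another cluster's centroid than to its own.

```python
# Hypothetical 1-D sequence; the order of the list is the sequence order.
values = [1.0, 1.2, 5.0, 1.1, 5.1, 5.3, 4.9]

# Two contiguous clusters given as half-open index ranges (assumed split).
clusters = [(0, 4), (4, 7)]

# Centroid of each cluster is the mean of its values.
centroids = [sum(values[lo:hi]) / (hi - lo) for lo, hi in clusters]


def assigned_cluster(index: int) -> int:
    """Cluster that contains the object, determined by sequence position."""
    for cluster_id, (lo, hi) in enumerate(clusters):
        if lo <= index < hi:
            return cluster_id
    raise IndexError(index)


def nearest_cluster(index: int) -> int:
    """Cluster whose centroid is closest to the object's value."""
    return min(range(len(centroids)),
               key=lambda cid: (values[index] - centroids[cid]) ** 2)
```

Here the object at index 2 (value 5.0) belongs to the first cluster because of its position, even though its value is much closer to the second cluster's centroid.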

Conventional Clustering vs. Sequential Clustering

[Figure: two scatter plots of objects numbered 1 to 15 on X and Y axes, one titled "Conventional Clustering" and one titled "Sequential Clustering".]

Application Areas

Analysis of motion patterns of objects.

– Cellular phones.

Analysis of status logs of running machines.

Problem Definition

Partitioning problem

– n sequential objects into k clusters

Dissimilarity measurement

– Squared Euclidean distance

Cluster quality

– Cost measurement: penalizes clusters for amount of dissimilarity of objects

Best solution minimizes the sum of the costs of all clusters

Cost Definition

Cost of a cluster: summation over all m objects of the squared Euclidean distance of the object from the cluster centroid.

$$Cost(Cl) = \sum_{i=1}^{m} D_E^2(o_i, c)$$
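The cost definition can be written as a few lines of code. This is a minimal sketch that assumes objects are plain tuples of floats and that the centroid is the coordinate-wise mean; the function names are illustrative and not taken from the paper.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(cluster: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of the objects in the cluster."""
    m = len(cluster)
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / m for d in range(dim)]


def squared_euclidean(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two objects."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_cost(cluster: Sequence[Point]) -> float:
    """Sum over all m objects of the squared distance to the centroid."""
    c = centroid(cluster)
    return sum(squared_euclidean(o, c) for o in cluster)
```

For example, the two objects (0, 0) and (2, 0) have centroid (1, 0), so each contributes a squared distance of 1 and the cluster cost is 2.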

Sequential Clustering Algorithms

Optimal Sequential Clustering Algorithm

– SCOPT

Greedy Sequential Clustering Algorithm

– SCGD

Algorithm SCOPT

Determines optimal k-partition of a set of sequential objects.

Uses the property of optimal substructure.

– Systematically solves all possible sub-problems.

– Stores results to be used in later steps.
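The optimal-substructure idea can be sketched as a standard dynamic program over prefixes of the sequence. This is an illustrative reconstruction, not the authors' pseudocode: the function name `scopt_sketch`, the table layout, and the use of prefix sums to evaluate a segment's cost are all assumptions made here.

```python
from typing import List, Sequence, Tuple


def scopt_sketch(points: Sequence[Sequence[float]],
                 k: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Minimum total cost of splitting `points` into k contiguous clusters.

    Returns the cost and the clusters as half-open index ranges.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    dim = len(points[0])

    # Prefix sums of the coordinates and of the squared norms.
    pref_sum = [[0.0] * dim]
    pref_sq = [0.0]
    for p in points:
        pref_sum.append([s + x for s, x in zip(pref_sum[-1], p)])
        pref_sq.append(pref_sq[-1] + sum(x * x for x in p))

    def seg_cost(i: int, j: int) -> float:
        # Cost of points[i:j]: sum ||x||^2 - ||sum x||^2 / m.
        m = j - i
        sq = pref_sq[j] - pref_sq[i]
        norm = sum((a - b) ** 2 for a, b in zip(pref_sum[j], pref_sum[i]))
        return sq - norm / m

    inf = float("inf")
    # best[c][j]: minimum cost of splitting the first j points into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # cut[c][j]: start index of the last cluster in that best split.
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + seg_cost(i, j)
                if cand < best[c][j]:
                    best[c][j] = cand
                    cut[c][j] = i

    # Walk the stored cuts backwards to recover the cluster boundaries.
    bounds: List[Tuple[int, int]] = []
    j = n
    for c in range(k, 0, -1):
        i = cut[c][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[k][n], bounds
```

The two tables each hold (k + 1) × (n + 1) entries, and the three nested loops visit on the order of kn² sub-problems, with each segment cost taking time proportional to the number of attributes.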

Complexity of Algorithm SCOPT

Time: O(kn^2)

Space: O(kn)

Algorithm SCGD

Initially, arbitrarily insert separators to divide the n objects into k clusters.

1 2 3 | 4 5 6 | 7 8 9

Algorithm SCGD (Cont.)

Reposition the separators by “moves” and “jumps” to reduce the cost of the clusters.

[Figure: two rows of the objects 1 to 9, with arrows labeled "move" and "jump" showing separators being repositioned.]

The best possible move or jump is determined by calculating the cost reductions of all possible moves and jumps.

Algorithm SCGD (Cont.)

Continue repositioning separators until no further cost reductions are possible.

Complexity

– Time: O(nl/k + n), linear

– Space: O(k)

Quality of clusters increases with n and with average cluster size.
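The greedy procedure can be sketched roughly as follows. The slides do not define "move" and "jump" precisely, so this sketch assumes that a move shifts a separator to a neighbouring free position and a jump relocates it to any other free position; both are covered by trying every free position for every separator. It works on one-dimensional values and recomputes the full cost for each candidate, so it does not reproduce the linear running time quoted above. The names `total_cost` and `scgd_sketch` are hypothetical.

```python
from typing import List, Sequence, Tuple


def total_cost(values: Sequence[float], cuts: Sequence[int]) -> float:
    """Sum of squared deviations from the mean over each contiguous cluster."""
    bounds = [0, *cuts, len(values)]
    cost = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        segment = values[lo:hi]
        mean = sum(segment) / len(segment)
        cost += sum((v - mean) ** 2 for v in segment)
    return cost


def scgd_sketch(values: Sequence[float], k: int) -> Tuple[float, List[int]]:
    """Greedy repositioning of k - 1 separators until no change lowers the cost.

    A cut at position p separates values[p - 1] from values[p].
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Arbitrary starting point (assumed here: roughly even spacing).
    cuts = [(i * n) // k for i in range(1, k)]
    current = total_cost(values, cuts)
    while True:
        best_cost, best_cuts = current, None
        occupied = set(cuts)
        for idx in range(len(cuts)):
            others = cuts[:idx] + cuts[idx + 1:]
            # Adjacent free positions play the role of "moves",
            # all other free positions the role of "jumps".
            for new_pos in range(1, n):
                if new_pos in occupied:
                    continue
                candidate = sorted(others + [new_pos])
                cost = total_cost(values, candidate)
                if cost < best_cost - 1e-12:
                    best_cost, best_cuts = cost, candidate
        if best_cuts is None:
            # No single repositioning reduces the cost any further.
            return current, cuts
        current, cuts = best_cost, best_cuts
```

Because each accepted step strictly lowers the cost, the loop terminates, but the result is only a local optimum with respect to single-separator changes and may differ from the SCOPT answer.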

Conclusion

Sequential clustering requires that the sequence of data points be considered as well as the similarity of attributes.

Algorithms:

– SCOPT and SCGD

– SCGD approaches SCOPT in terms of quality of clusters when average cluster sizes are large.