Clustering Sequential Data: Research Paper Review
Presented by Glynis Hawley
April 28, 2003
"On the Optimal Clustering of Sequential Data" by Cheng-Ru Lin and Ming-Syan Chen, Electrical Engineering Department, National Taiwan University, Taipei, Taiwan
Second SIAM International Conference on Data Mining, April 11-13, 2002
http://www.siam.org/meetings/sdm02/proceedings/sdm02-09.pdf
Agenda
Introduction: What is sequential clustering?
Problem definition for algorithm design
Optimal Algorithm: SCOPT
Greedy Algorithm: SCGD
Conclusion
Sequential Clustering Problem
Attributes and sequence of objects are both important.
Objects within a cluster form a continuous region.
An object within one cluster may be closer to the centroid of a different cluster than it is to its own centroid.
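The last point can be illustrated with a small sketch. The one-dimensional values and the split position below are made up for illustration and do not come from the paper. The third object sits inside the first contiguous cluster, yet it is much closer to the centroid of the second cluster.

```python
def centroid(values):
    """Mean of a list of 1-D attribute values."""
    return sum(values) / len(values)

# Hypothetical sequence of 1-D objects, in sequence order.
sequence = [1.0, 2.0, 9.0, 1.5, 2.0, 9.0, 10.0, 9.5]

# Two contiguous (sequential) clusters: objects 0-4 and objects 5-7.
clusters = [sequence[:5], sequence[5:]]
centroids = [centroid(c) for c in clusters]

# Object at index 2 belongs to the first cluster because of its position.
obj = sequence[2]
dist_own = abs(obj - centroids[0])    # distance to its own centroid
dist_other = abs(obj - centroids[1])  # distance to the other centroid

print(centroids, dist_own, dist_other)
```

A conventional clustering method would likely assign this object to the second cluster, which would break the continuous region.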
Conventional Clustering vs. Sequential Clustering
[Figure: two X-Y plots of the same 15 numbered objects, one panel titled "Conventional Clustering" and one titled "Sequential Clustering".]
Application Areas
Analysis of motion patterns of objects.
– Cellular phones.
Analysis of status logs of running machines.
Problem Definition
Partitioning problem
– n sequential objects into k clusters
Dissimilarity measurement
– Squared Euclidean distance
Cluster quality
– Cost measurement: penalizes clusters for amount of dissimilarity of objects
Best solution minimizes the sum of the costs of all clusters
Cost Definition
Cost of a cluster: summation over all m objects of the squared Euclidean distance of the object from the cluster centroid.
$$\mathrm{Cost}(Cl) = \sum_{i=1}^{m} D_E^2(o_i, c)$$
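As a minimal sketch of this cost definition, the function below computes the centroid of a cluster and sums the squared Euclidean distances to it. The function name `cluster_cost` and the tuple representation of objects are assumptions made for illustration.

```python
def cluster_cost(objects):
    """Sum of squared Euclidean distances from each object to the centroid.

    `objects` is a non-empty list of equal-length numeric tuples
    (a hypothetical representation, not taken from the paper).
    """
    m = len(objects)
    dims = len(objects[0])
    # Centroid: per-dimension mean of the m objects.
    c = [sum(o[d] for o in objects) / m for d in range(dims)]
    # Squared Euclidean distance of each object to the centroid, summed.
    return sum(
        sum((o[d] - c[d]) ** 2 for d in range(dims))
        for o in objects
    )

# Three 2-D objects with centroid (1, 1): distances squared are 2, 2 and 4.
print(cluster_cost([(0, 0), (2, 0), (1, 3)]))
```

The cost of a whole partition is then the sum of `cluster_cost` over its clusters.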
Sequential Clustering Algorithms
Optimal Sequential Clustering Algorithm
– SCOPT
Greedy Sequential Clustering Algorithm
– SCGD
Algorithm SCOPT
Determines optimal k-partition of a set of sequential objects.
Uses the property of optimal substructure.
– Systematically solves all possible sub-problems.
– Stores results to be used in later steps.
Complexity of Algorithm SCOPT
Time: O(kn^2)
Space: O(kn)
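The optimal-substructure idea can be sketched as a dynamic program. This is an illustrative reconstruction and not the authors' code: the name `scopt_sketch`, the restriction to one-dimensional values, and the use of prefix sums to get each segment cost in constant time are all assumptions. The table `best[c][j]` stores the minimum cost of splitting the first j objects into c contiguous clusters, which gives O(kn^2) time and O(kn) space.

```python
def scopt_sketch(values, k):
    """Minimum-cost split of a 1-D sequence into k contiguous clusters."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums of values and squared values.
    s, q = [0.0], [0.0]
    for v in values:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def seg_cost(i, j):
        # Cost of values[i:j]: sum of squares minus (sum)^2 / size.
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # last cluster is values[i:j]
                cand = best[c - 1][i] + seg_cost(i, j)
                if cand < best[c][j]:
                    best[c][j] = cand
                    cut[c][j] = i

    # Walk the stored cuts backwards to recover cluster boundaries.
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[k][n], bounds

cost, bounds = scopt_sketch([1, 2, 3, 10, 11, 12, 20, 21, 22], 3)
print(cost, bounds)
```

Each pair in `bounds` is a half-open index range, so `(0, 3)` means the first three objects.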
Algorithm SCGD
Initially, arbitrarily insert separators to divide the n objects into k clusters.
1 2 3 | 4 5 6 | 7 8 9
Reposition the separators by “moves” and “jumps” to reduce the cost of the clusters.
[Figure: the sequence 1 2 3 4 5 6 7 8 9, shown twice with the separators repositioned.]
The best possible move or jump is determined by calculating the cost reductions of all possible moves and jumps.
Algorithm SCGD (Cont.)
[Figure: examples of moves and jumps.]
Continue repositioning separators until no further cost reductions are possible.
Complexity
– Time: O(nl / k + n), linear
– Space: O(k)
Quality of clusters increases with n and with average cluster size.
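The greedy loop can be sketched roughly as follows. This is a simplified illustration and not the authors' algorithm: it assumes one-dimensional values, treats a move and a jump uniformly as relocating one separator to any free position, and recomputes the full cost for every candidate, so it does not achieve the time or space bounds quoted above. The names `total_cost` and `scgd_sketch` are hypothetical.

```python
def total_cost(values, cuts):
    """Sum of cluster costs for separators placed before the indices in `cuts`."""
    bounds = [0] + cuts + [len(values)]
    cost = 0.0
    for i, j in zip(bounds, bounds[1:]):
        seg = values[i:j]
        mean = sum(seg) / len(seg)
        cost += sum((v - mean) ** 2 for v in seg)
    return cost

def scgd_sketch(values, cuts):
    """Greedily relocate one separator at a time until no relocation helps."""
    n = len(values)
    cuts = sorted(cuts)
    current = total_cost(values, cuts)
    while True:
        best_cost, best_cuts = current, None
        for idx in range(len(cuts)):
            others = cuts[:idx] + cuts[idx + 1:]
            # Try every free position for this separator (assumption:
            # covers both one-step moves and longer jumps).
            for pos in range(1, n):
                if pos in cuts:
                    continue
                cand = sorted(others + [pos])
                c = total_cost(values, cand)
                if c < best_cost - 1e-12:
                    best_cost, best_cuts = c, cand
        if best_cuts is None:  # no cost reduction possible
            return current, cuts
        current, cuts = best_cost, best_cuts

# Start from an arbitrary (poor) placement of two separators.
cost, cuts = scgd_sketch([1, 2, 3, 10, 11, 12, 20, 21, 22], [1, 2])
print(cost, cuts)
```

On this small input the sketch ends with separators before indices 3 and 6, the same partition as the optimal one; in general a greedy search of this kind can stop at a local minimum.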
Conclusion
Sequential clustering requires that the sequence of data points be considered as well as the similarity of attributes.
Algorithms:
– SCOPT and SCGD
– SCGD approaches SCOPT in terms of quality of clusters when average cluster sizes are large.