Clustering by Pattern Similarity in Large Data Sets
Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu
IBM T. J. Watson Research Center
Presented by Edmond Wu
DB-Seminar Slide 2
Talk Outline
Introduction
Related Work
pCluster Model
Performance Analysis
Conclusion
DB-Seminar Slide 3
Motivation
Why is discovering clusters based on pattern similarity interesting and important?
DNA micro-array analysis
E-commerce: Recommendation systems & target marketing
DB-Seminar Slide 4
Background Knowledge
Clustering: the process of grouping a set of objects into classes of similar objects.
Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset.
Pattern similarity: objects exhibit a coherent pattern on a subset of dimensions. (The objects are not required to have close values on any attribute.)
DB-Seminar Slide 5
Example of similar patterns on a subset of dimensions
DB-Seminar Slide 6
Challenges
Identifying subspace clusters in high-dimensional data sets is difficult.
Traditional distance functions cannot capture the pattern similarity among the objects.
DB-Seminar Slide 7
How to detect shifting patterns?
Given N attributes a1, …, aN
Define a derived attribute Aij = ai - aj for every pair of attributes ai, aj.
Thus, the problem is equivalent to mining subspace clusters on the objects with the derived set of attributes.
Drawback: the converted dataset will have N(N-1)/2 dimensions, which is intractable even for a small N.
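The derived-attribute idea can be illustrated with a minimal Python sketch. The helper names `derived_attributes` and `derived_dimension_count` are made up for this illustration; the two objects are the shifting-pattern rows used later in the bicluster example. Two objects that differ only by a constant shift end up with identical derived vectors, while the number of derived dimensions grows quadratically.

```python
from itertools import combinations


def derived_attributes(row):
    """Return the pairwise differences a_i - a_j (i < j) of one object."""
    return [row[i] - row[j] for i, j in combinations(range(len(row)), 2)]


def derived_dimension_count(n):
    """Number of derived attributes for n original attributes: n(n-1)/2."""
    return n * (n - 1) // 2


# Two objects following a shifting pattern: o2 is o1 shifted by 4.
o1 = [1, 2, 3]
o2 = [5, 6, 7]

d1 = derived_attributes(o1)
d2 = derived_attributes(o2)

# The shift cancels out, so both objects coincide in the derived space.
print(d1 == d2)

# The derived space is much larger than the original one.
print(derived_dimension_count(100))
```

With only 100 original attributes the derived dataset already has 4950 dimensions, which is the drawback stated above.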
DB-Seminar Slide 8
Related Work
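Bicluster Model (Cheng et al.):
AIJ: submatrix of a DNA array, with the following mean squared residue score H(I,J):
$$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2$$
where $d_{iJ}$, $d_{Ij}$, and $d_{IJ}$ are the row mean, the column mean, and the overall mean of the submatrix.
δ-bicluster: AIJ is called a δ-bicluster if H(I,J) ≤ δ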
DB-Seminar Slide 9
Bicluster Model (Example)
(1) Shifting pattern: H(I,J) = 0
     a1  a2  a3
O1    1   2   3
O2    5   6   7

(2) Scaling pattern: H(I,J) = 2/3
     a1  a2  a3
O1    1   2   4
O2    2   4   8

(3) Not similar pattern: H(I,J) = 8
     a1  a2  a3
O1    2   4  12
O2    4   6   2

(4) Submatrix of (2): H(I,J) = 2.25 > 2/3
     a1  a3
O1    1   4
O2    2   8

If we set δ = 2, (3) and (4) are not δ-biclusters.
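As a rough illustration of how such scores are computed, here is a minimal Python sketch of the mean squared residue. It assumes the standard Cheng and Church definition (squared residues averaged over all cells of the submatrix); the function name `mean_squared_residue` is chosen for this example. Absolute scores depend on the exact normalization, so the sketch only checks the values for matrices (1) and (3) and the ordering between matrix (2) and its submatrix (4).

```python
def mean_squared_residue(m):
    """Mean squared residue H(I, J) of a submatrix given as a list of rows."""
    n_rows, n_cols = len(m), len(m[0])
    row_mean = [sum(r) / n_cols for r in m]
    col_mean = [sum(r[j] for r in m) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue of one cell: what remains after removing the row,
            # column, and overall effects.
            residue = m[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


shifting = [[1, 2, 3], [5, 6, 7]]      # matrix (1)
scaling = [[1, 2, 4], [2, 4, 8]]       # matrix (2)
dissimilar = [[2, 4, 12], [4, 6, 2]]   # matrix (3)
scaling_sub = [[1, 4], [2, 8]]         # matrix (4): columns a1, a3 of (2)

print(mean_squared_residue(shifting))    # a pure shift leaves no residue
print(mean_squared_residue(dissimilar))
# A submatrix can score worse than the matrix it was cut from.
print(mean_squared_residue(scaling_sub) > mean_squared_residue(scaling))
```

The last line is the point of example (4): the residue score is not monotone under taking submatrices.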
DB-Seminar Slide 10
Drawbacks of Bicluster Model
A submatrix of a δ-bicluster is not necessarily a δ-bicluster.
Not guaranteed to find all qualified clusters (the randomized greedy algorithm provides only an approximate answer).
Cannot exclude outliers from a bicluster.
Difficult to design an efficient algorithm.
DB-Seminar Slide 11
Bicluster Model (Example)
The bicluster shown in Figure (a) contains an obvious outlier, yet it still has a fairly small mean squared residue (4.238).
If we try to get rid of such outliers by reducing the δ threshold, we will also exclude many biclusters that do exhibit similar patterns.
DB-Seminar Slide 12
The pCluster Model
pScore of a 2 × 2 matrix:
O: subset of objects in the database
T: subset of attributes; (O, T): submatrix of the dataset
δ: user-specified clustering threshold
dxa: value of object x on attribute a
Given x, y ∈ O and a, b ∈ T:
$$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$
DB-Seminar Slide 13
The pCluster Model (Cont.)
pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is confined by δ, a user-specified threshold.
Pair (O, T) forms a δ-pCluster if for any 2 × 2 submatrix X in (O, T), we have pScore(X) ≤ δ for some δ ≥ 0.
DB-Seminar Slide 14
The pCluster Model (Example)
In Figure (a): objects 2 and 3 and attributes {b, c} form a 2 × 2 submatrix X with d2b = 12, d2c = 15, d3b = 40, d3c = 43, so pScore(X) = |(12 - 15) - (40 - 43)| = 0.
Objects 1, 2, 3 and attributes {b, c, h, j, e} form a pCluster (δ = 0).
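The pScore and the δ-pCluster test can be sketched in a few lines of Python. The functions `pscore` and `is_delta_pcluster` are illustrative names, not part of any library, and the check below is a brute-force test over every 2 × 2 submatrix rather than the mining algorithm itself. Only the four values for objects 2 and 3 come from the example; the third object is a hypothetical one added to show a failing case.

```python
from itertools import combinations


def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 x 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(data, objects, attrs, delta):
    """Brute-force check that every 2 x 2 submatrix has pScore <= delta.

    data maps an object id to a dict of attribute -> value.
    """
    for x, y in combinations(objects, 2):
        for a, b in combinations(attrs, 2):
            if pscore(data[x][a], data[x][b], data[y][a], data[y][b]) > delta:
                return False
    return True


data = {
    2: {"b": 12, "c": 15},       # values from the example
    3: {"b": 40, "c": 43},       # values from the example
    "hyp": {"b": 20, "c": 30},   # hypothetical object, not in the figure
}

print(pscore(12, 15, 40, 43))
print(is_delta_pcluster(data, [2, 3], ["b", "c"], delta=0))
print(is_delta_pcluster(data, [2, 3, "hyp"], ["b", "c"], delta=0))
```

The brute-force test costs one pScore evaluation per pair of objects and pair of attributes, which is fine for checking a candidate but not for searching all candidates.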
DB-Seminar Slide 15
The pCluster Model (Cont.)
Compact property of pCluster:
Let (O, T) be a δ-pCluster. Any of its submatrices (O', T') is also a δ-pCluster (this follows directly from the definition of a pCluster);
The volume of a pCluster: |O|×|T|;
Definition of pCluster is symmetric:
|(dxa- dxb) - (dya- dyb)|
= |(dxa- dya) - (dxb- dyb)|
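The symmetry follows from expanding both sides into the same four-term sum, as the short derivation below shows (using the same symbols as the pScore definition):

```latex
\begin{align*}
|(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|
  &= |d_{xa} - d_{xb} - d_{ya} + d_{yb}| \\
  &= |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|
\end{align*}
```

Comparing two objects across a pair of attributes therefore gives the same score as comparing two attributes across a pair of objects, so rows and columns play interchangeable roles in the definition.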
DB-Seminar Slide 16
Problem Statement
Task: to find all pairs (O, T) such that (O, T) is a δ-pCluster according to its definition, and |O| ≥ nr, |T| ≥ nc.
Parameters:
D: dataset
δ: cluster threshold
nc: minimum number of columns
nr: minimum number of rows
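To make the task concrete, the following Python sketch enumerates every qualifying pair (O, T) by exhaustive search. This is only a reference baseline under the stated definition, not the algorithm presented in the talk: it is exponential in both the number of objects and columns. The function names and the toy matrix are hypothetical, and the sketch assumes nr ≥ 2 and nc ≥ 2 so that every candidate contains at least one 2 × 2 submatrix.

```python
from itertools import combinations


def is_pcluster(data, objs, cols, delta):
    """True if every 2 x 2 submatrix of (objs, cols) has pScore <= delta."""
    return all(
        abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b])) <= delta
        for x, y in combinations(objs, 2)
        for a, b in combinations(cols, 2)
    )


def brute_force_pclusters(data, delta, nr, nc):
    """Enumerate all (objects, columns) index tuples that form a delta-pCluster."""
    n_objs, n_cols = len(data), len(data[0])
    found = []
    for r in range(nr, n_objs + 1):
        for objs in combinations(range(n_objs), r):
            for c in range(nc, n_cols + 1):
                for cols in combinations(range(n_cols), c):
                    if is_pcluster(data, objs, cols, delta):
                        found.append((objs, cols))
    return found


# Hypothetical toy dataset: the first three columns follow a shifting
# pattern across all rows, the fourth column does not.
toy = [
    [1, 2, 3, 9],
    [5, 6, 7, 0],
    [2, 3, 4, 4],
]

clusters = brute_force_pclusters(toy, delta=0, nr=2, nc=3)
for objs, cols in clusters:
    print(objs, cols)
```

On this toy input every subset of at least two rows clusters on columns 0 to 2, and no cluster includes column 3.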
DB-Seminar Slide 17
The Algorithm
Definition of MDS: assume c = (O, T) is a δ-pCluster. Column set T is a Maximum Dimension Set (MDS) of c if there does not exist T' ⊃ T such that (O, T') is also a δ-pCluster.
Objects can form pClusters on multiple MDSs. The algorithm is depth-first, meaning that it generates only pClusters that cluster on MDSs.
DB-Seminar Slide 18
Pair-wise Clustering
Pairwise Clustering Principle:
Given objects x and y, and a dimension set T, x and y form a δ-pCluster on T iff the difference between the largest and smallest value in S(x, y, T) = {dxa - dya | a ∈ T} is at most δ.
In other words, ({x, y}, T) is a pCluster if the following is true:
$$\max_{a, b \in T} f(a, b) \le \delta, \qquad f(a, b) = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$$
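The principle replaces a check over all pairs of dimensions with a single max-minus-min computation. The Python sketch below compares the two formulations on a small pair of objects; the function names and the object values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pairwise_differences(x, y, dims):
    """S(x, y, T): the values d_xa - d_ya for every dimension a in T."""
    return [x[a] - y[a] for a in dims]


def forms_pcluster_by_range(x, y, dims, delta):
    """Pairwise clustering principle: compare the spread of S with delta."""
    s = pairwise_differences(x, y, dims)
    return max(s) - min(s) <= delta


def forms_pcluster_by_pscore(x, y, dims, delta):
    """Direct definition: every pair of dimensions must have pScore <= delta."""
    return all(
        abs((x[a] - y[a]) - (x[b] - y[b])) <= delta
        for a, b in combinations(dims, 2)
    )


# Hypothetical objects over three dimensions.
x = {"a": 3, "b": 7, "c": 4}
y = {"a": 1, "b": 6, "c": 1}
dims = ["a", "b", "c"]

print(pairwise_differences(x, y, dims))
for delta in (1, 2):
    print(delta,
          forms_pcluster_by_range(x, y, dims, delta),
          forms_pcluster_by_pscore(x, y, dims, delta))
```

Both tests agree because the largest pScore over all dimension pairs is exactly the gap between the largest and smallest difference.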
DB-Seminar Slide 19
Pair-wise Clustering (Example)
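Sorted sequence of S(x, y, T) = s1, …, sk, …, sn
Objects x and y form a δ-pCluster on the dimensions of a contiguous run si, …, sk of the sorted sequence if
$$s_k - s_i \le \delta, \qquad 1 \le i < k \le n$$
Three MDSs were found: {e,g,c}, {a,d,b,h}, {h,f}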
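One way to find the MDSs of an object pair is to sort the dimensions by their difference and slide a window over the sorted sequence. The Python sketch below follows that idea; `pairwise_mds` is an illustrative name, and the values of x and y are hypothetical, chosen so that with δ = 2 the result has the same shape as the three MDSs listed above (the original figure's data is not reproduced here).

```python
def pairwise_mds(x, y, delta, nc=2):
    """Maximal dimension sets on which objects x and y form a delta-pCluster.

    x and y map a dimension name to a value; sets smaller than nc are dropped.
    """
    # Sort dimensions by the difference d_xa - d_ya.
    order = sorted(x, key=lambda a: x[a] - y[a])
    s = [x[a] - y[a] for a in order]
    n = len(s)

    result = []
    end = 0        # exclusive right edge of the current window
    last_end = 0   # right edge reached by the previous windows
    for start in range(n):
        end = max(end, start + 1)
        # Extend the window while its spread stays within delta.
        while end < n and s[end] - s[start] <= delta:
            end += 1
        # The window is maximal only if it reaches further right than
        # every window that started earlier.
        if end > last_end:
            if end - start >= nc:
                result.append(set(order[start:end]))
            last_end = end
    return result


# Hypothetical values; differences are
# e: -3, g: -2, c: -1, a: 6, d: 6, b: 7, h: 8, f: 10.
x = {"a": 10, "b": 11, "c": 4, "d": 6, "e": 1, "f": 10, "g": 1, "h": 8}
y = {"a": 4, "b": 4, "c": 5, "d": 0, "e": 4, "f": 0, "g": 3, "h": 0}

for mds in pairwise_mds(x, y, delta=2):
    print(sorted(mds))
```

After the sort, the scan is linear in the number of dimensions because both window edges only move to the right.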
DB-Seminar Slide 20
MDS Pruning
MDS Pruning Principle:
Let Txy be an MDS for objects x, y, and a ∈ Txy. For any O and T, a necessary condition of ({x, y} ∪ O, {a} ∪ T) being a δ-pCluster is ∀ b ∈ T, ∃ Oab ⊇ {x, y}, where Oab denotes a column-pair MDS, that is, a maximal set of objects that form a δ-pCluster on columns a and b.
The pruning criterion can be stated as follows:
For any dimension a in an MDS Txy, count the number of Oab that contain {x, y}. If the number of such Oab is less than nc - 1, remove a from Txy. Furthermore, if the removal of a makes |Txy| < nc, we remove Txy as well.
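The pruning criterion can be sketched as a single pass over one object-pair MDS. This Python example is an interpretation, not the authors' implementation: it assumes the column-pair MDSs are available as a dict from a column pair to a list of object sets, and it counts, for each dimension a, the number of partner columns b for which some Oab contains both objects. The function name and all data are hypothetical.

```python
def prune_object_pair_mds(pair, t_xy, col_pair_mds, nc):
    """Prune one object-pair MDS; return the kept dimensions or None.

    pair:         the two objects (x, y)
    t_xy:         set of dimensions in the MDS of x and y
    col_pair_mds: dict mapping frozenset({a, b}) to a list of object sets O_ab
    nc:           minimum number of columns in a qualifying pCluster
    """
    x, y = pair
    kept = set()
    for a in t_xy:
        # Count partner columns b whose column-pair MDSs cover {x, y}.
        support = sum(
            1
            for cols, object_sets in col_pair_mds.items()
            if a in cols and any({x, y} <= o for o in object_sets)
        )
        if support >= nc - 1:
            kept.add(a)
    # Drop the whole MDS if too few dimensions survive.
    return kept if len(kept) >= nc else None


# Hypothetical column-pair MDSs in which objects 0 and 1 co-occur everywhere.
well_supported = {
    frozenset({"a", "b"}): [{0, 1, 2}],
    frozenset({"a", "c"}): [{0, 1}],
    frozenset({"b", "c"}): [{0, 1, 3}],
}
# Hypothetical variant where objects 0 and 1 co-occur only on (a, b).
poorly_supported = {
    frozenset({"a", "b"}): [{0, 1, 2}],
    frozenset({"a", "c"}): [{0, 2}],
    frozenset({"b", "c"}): [{1, 3}],
}

print(prune_object_pair_mds((0, 1), {"a", "b", "c"}, well_supported, nc=3))
print(prune_object_pair_mds((0, 1), {"a", "b", "c"}, poorly_supported, nc=3))
```

In the second case no dimension has support from two partner columns, so every dimension is removed and the MDS is discarded.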
DB-Seminar Slide 21
MDS Pruning (Example)
DB-Seminar Slide 22
The Main Algorithm
First step: Scan the dataset to find column-pair MDSs and object-pair MDSs.
Second step: Prune object-pair MDSs and column-pair MDSs in turn until no further pruning is possible.
Third step: Insert the remaining object-pair MDSs into a prefix tree. (Each node represents a cluster of objects; each edge represents the column selected.)
DB-Seminar Slide 23
Construct a prefix tree
Fix an order on the columns, e.g., a, b, c, …
Insert each 2-object pCluster (O, T) into the prefix tree.
Perform a post-order traversal of the prefix tree.
Prune nodes with |O| < nr. (Add the objects in O to nodes whose column set T' satisfies T' ⊂ T and |T'| = |T| - 1.)
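The insertion step can be pictured with a small Python sketch of a prefix tree keyed by the ordered column sequence. This covers only the insertion of 2-object pClusters; the post-order traversal and pruning are not implemented here. The `Node` class, the `insert` function, and the sample clusters are hypothetical names and data for this illustration.

```python
class Node:
    """Prefix-tree node: edges are columns, the node holds a set of objects."""

    def __init__(self):
        self.children = {}    # column -> Node
        self.objects = set()  # objects clustered on the path's column set


def insert(root, objects, columns):
    """Insert a pCluster (objects, columns), following columns in sorted order."""
    node = root
    for col in sorted(columns):
        node = node.children.setdefault(col, Node())
    node.objects |= set(objects)
    return node


root = Node()
# Hypothetical 2-object pClusters after pruning.
insert(root, {0, 1}, {"a", "b", "c"})
insert(root, {1, 2}, {"c", "a", "b"})   # same column set, different order
insert(root, {0, 3}, {"a", "b"})

abc = root.children["a"].children["b"].children["c"]
ab = root.children["a"].children["b"]
print(sorted(abc.objects))
print(sorted(ab.objects))
```

Because columns are always followed in the same order, clusters that share a column set land on the same node, and clusters on a prefix of that set land on an ancestor.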
DB-Seminar Slide 24
Construct a prefix tree (Example)
DB-Seminar Slide 25
Algorithm Complexity
The main algorithm for mining pClusters has time complexity
$$O(M^2 N \log N + N^2 M \log M)$$
where M is the number of columns and N is the number of objects.
The worst case:
$$O(k M^2 N^2)$$
However, the complexity can be greatly reduced by the MDS pruning process.
DB-Seminar Slide 26
Experiments
Datasets:
Synthetic datasets (parameters: different nr, nc, and number of embedded perfect pClusters with δ = 0)
Gene expression data (yeast microarray)
MovieLens dataset (E-commerce)
DB-Seminar Slide 27
Performance Analysis
Response time vs. data size
DB-Seminar Slide 28
Performance Analysis (Cont.)
Sensitivity to mining parameters: δ, nc, and nr
DB-Seminar Slide 29
Performance Analysis (Cont.)
Comparison of the pCluster algorithm with an alternative approach based on the subspace clustering algorithm CLIQUE.
DB-Seminar Slide 30
Performance Analysis (Cont.)
The pruning process is essential in the pCluster algorithm.
Without pruning, the pCluster algorithm cannot scale beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.
DB-Seminar Slide 31
Conclusion
pCluster Model: captures the closeness of objects and pattern similarity among the objects in subsets of dimensions.
Advantages:
- Discovers all the qualified pClusters.
- The depth-first clustering algorithm avoids generating clusters which are part of other clusters.
- More efficient than existing algorithms.
- Resilient to outliers.
DB-Seminar Slide 32
References
Y. Cheng and G. Church. Biclustering of expression data. In Proc. of 8th International Conference on Intelligent System for Molecular Biology, 2000.
S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, 2000. In http://arep.med.harvard.edu/biclustering/yeast.matrix.
R. C. Agarwal, C. C. Aggarwal, and V. Parsad. Depth first generation of long patterns. In SIGKDD, 2000.
J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002.
Thanks!!