clustering theory applications and algorithms
8/10/2019 Clustering Theory Applications and Algorithms
1/9
Contents
List of Figures
List of Tables
List of Algorithms
Preface
I Clustering, Data, and Similarity Measures
1 Data Clustering
1.1 Definition of Data Clustering
1.2 The Vocabulary of Clustering
1.2.1 Records and Attributes
1.2.2 Distances and Similarities
1.2.3 Clusters, Centers, and Modes
1.2.4 Hard Clustering and Fuzzy Clustering
1.2.5 Validity Indices
1.3 Clustering Processes
1.4 Dealing with Missing Values
1.5 Resources for Clustering
1.5.1 Surveys and Reviews on Clustering
1.5.2 Books on Clustering
1.5.3 Journals
1.5.4 Conference Proceedings
1.5.5 Data Sets
1.6 Summary
2 Data Types
2.1 Categorical Data
2.2 Binary Data
2.3 Transaction Data
2.4 Symbolic Data
2.5 Time Series
2.6 Summary
3 Scale Conversion
3.1 Introduction
3.1.1 Interval to Ordinal
3.1.2 Interval to Nominal
3.1.3 Ordinal to Nominal
3.1.4 Nominal to Ordinal
3.1.5 Ordinal to Interval
3.1.6 Other Conversions
3.2 Categorization of Numerical Data
3.2.1 Direct Categorization
3.2.2 Cluster-based Categorization
3.2.3 Automatic Categorization
3.3 Summary
4 Data Standardization and Transformation
4.1 Data Standardization
4.2 Data Transformation
4.2.1 Principal Component Analysis
4.2.2 SVD
4.2.3 The Karhunen-Loève Transformation
4.3 Summary
5 Data Visualization
5.1 Sammon's Mapping
5.2 MDS
5.3 SOM
5.4 Class-preserving Projections
5.5 Parallel Coordinates
5.6 Tree Maps
5.7 Categorical Data Visualization
5.8 Other Visualization Techniques
5.9 Summary
6 Similarity and Dissimilarity Measures
6.1 Preliminaries
6.1.1 Proximity Matrix
6.1.2 Proximity Graph
6.1.3 Scatter Matrix
6.1.4 Covariance Matrix
6.2 Measures for Numerical Data
6.2.1 Euclidean Distance
6.2.2 Manhattan Distance
6.2.3 Maximum Distance
6.2.4 Minkowski Distance
6.2.5 Mahalanobis Distance
6.2.6 Average Distance
6.2.7 Other Distances
6.3 Measures for Categorical Data
6.3.1 The Simple Matching Distance
6.3.2 Other Matching Coefficients
6.4 Measures for Binary Data
6.5 Measures for Mixed-type Data
6.5.1 A General Similarity Coefficient
6.5.2 A General Distance Coefficient
6.5.3 A Generalized Minkowski Distance
6.6 Measures for Time Series Data
6.6.1 The Minkowski Distance
6.6.2 Time Series Preprocessing
6.6.3 Dynamic Time Warping
6.6.4 Measures Based on Longest Common Subsequences
6.6.5 Measures Based on Probabilistic Models
6.6.6 Measures Based on Landmark Models
6.6.7 Evaluation
6.7 Other Measures
6.7.1 The Cosine Similarity Measure
6.7.2 A Link-based Similarity Measure
6.7.3 Support
6.8 Similarity and Dissimilarity Measures between Clusters
6.8.1 The Mean-based Distance
6.8.2 The Nearest Neighbor Distance
6.8.3 The Farthest Neighbor Distance
6.8.4 The Average Neighbor Distance
6.8.5 Lance-Williams Formula
6.9 Similarity and Dissimilarity between Variables
6.9.1 Pearson's Correlation Coefficients
6.9.2 Measures Based on the Chi-square Statistic
6.9.3 Measures Based on Optimal Class Prediction
6.9.4 Group-based Distance
6.10 Summary
II Clustering Algorithms
7 Hierarchical Clustering Techniques
7.1 Representations of Hierarchical Clusterings
7.1.1 n-tree
7.1.2 Dendrogram
7.1.3 Banner
7.1.4 Pointer Representation
7.1.5 Packed Representation
7.1.6 Icicle Plot
7.1.7 Other Representations
7.2 Agglomerative Hierarchical Methods
7.2.1 The Single-link Method
7.2.2 The Complete Link Method
7.2.3 The Group Average Method
7.2.4 The Weighted Group Average Method
7.2.5 The Centroid Method
7.2.6 The Median Method
7.2.7 Ward's Method
7.2.8 Other Agglomerative Methods
7.3 Divisive Hierarchical Methods
7.4 Several Hierarchical Algorithms
7.4.1 SLINK
7.4.2 Single-link Algorithms Based on Minimum Spanning Trees
7.4.3 CLINK
7.4.4 BIRCH
7.4.5 CURE
7.4.6 DIANA
7.4.7 DISMEA
7.4.8 Edwards and Cavalli-Sforza Method
7.5 Summary
8 Fuzzy Clustering Algorithms
8.1 Fuzzy Sets
8.2 Fuzzy Relations
8.3 Fuzzy k-means
8.4 Fuzzy k-modes
8.5 The c-means Method
8.6 Summary
9 Center-based Clustering Algorithms
9.1 The k-means Algorithm
9.2 Variations of the k-means Algorithm
9.2.1 The Continuous k-means Algorithm
9.2.2 The Compare-means Algorithm
9.2.3 The Sort-means Algorithm
9.2.4 Acceleration of the k-means Algorithm with the kd-tree
9.2.5 Other Acceleration Methods
9.3 The Trimmed k-means Algorithm
9.4 The x-means Algorithm
9.5 The k-harmonic Means Algorithm
9.6 The Mean Shift Algorithm
9.7 MEC
9.8 The k-modes Algorithm (Huang)
9.8.1 Initial Modes Selection
9.9 The k-modes Algorithm (Chaturvedi et al.)
9.10 The k-probabilities Algorithm
9.11 The k-prototypes Algorithm
9.12 Summary
10 Search-based Clustering Algorithms
10.1 Genetic Algorithms
10.2 The Tabu Search Method
10.3 Variable Neighborhood Search for Clustering
10.4 Al-Sultan's Method
10.5 Tabu Search-based Categorical Clustering Algorithm
10.6 J-means
10.7 GKA
10.8 The Global k-means Algorithm
10.9 The Genetic k-modes Algorithm
10.9.1 The Selection Operator
10.9.2 The Mutation Operator
10.9.3 The k-modes Operator
10.10 The Genetic Fuzzy k-modes Algorithm
10.10.1 String Representation
10.10.2 Initialization Process
10.10.3 Selection Process
10.10.4 Crossover Process
10.10.5 Mutation Process
10.10.6 Termination Criterion
10.11 SARS
10.12 Summary
11 Graph-based Clustering Algorithms
11.1 Chameleon
11.2 CACTUS
11.3 A Dynamic System-based Approach
11.4 ROCK
11.5 Summary
12 Grid-based Clustering Algorithms
12.1 STING
12.2 OptiGrid
12.3 GRIDCLUS
12.4 GDILC
12.5 WaveCluster
12.6 Summary
13 Density-based Clustering Algorithms
13.1 DBSCAN
13.2 BRIDGE
13.3 DBCLASD
13.4 DENCLUE
13.5 CUBN
13.6 Summary
14 Model-based Clustering Algorithms
14.1 Introduction
14.2 Gaussian Clustering Models
14.3 Model-based Agglomerative Hierarchical Clustering
14.4 The EM Algorithm
14.5 Model-based Clustering
14.6 COOLCAT
14.7 STUCCO
14.8 Summary
15 Subspace Clustering
15.1 CLIQUE
15.2 PROCLUS
15.3 ORCLUS
15.4 ENCLUS
15.5 FINDIT
15.6 MAFIA
15.7 DOC
15.8 CLTree
15.9 PART
15.10 SUBCAD
15.11 Fuzzy Subspace Clustering
15.12 Mean Shift for Subspace Clustering
15.13 Summary
16 Miscellaneous Algorithms
16.1 Time Series Clustering Algorithms
16.2 Streaming Algorithms
16.2.1 LSEARCH
16.2.2 Other Streaming Algorithms
16.3 Transaction Data Clustering Algorithms
16.3.1 LargeItem
16.3.2 CLOPE
16.3.3 OAK
16.4 Summary
17 Evaluation of Clustering Algorithms
17.1 Introduction
17.1.1 Hypothesis Testing
17.1.2 External Criteria
17.1.3 Internal Criteria
17.1.4 Relative Criteria
17.2 Evaluation of Partitional Clustering
17.2.1 Modified Hubert's Γ Statistic
17.2.2 The Davies-Bouldin Index
17.2.3 Dunn's Index
17.2.4 The SD Validity Index
17.2.5 The S_Dbw Validity Index
17.2.6 The RMSSTD Index
17.2.7 The RS Index
17.2.8 The Calinski-Harabasz Index
17.2.9 Rand's Index
17.2.10 Average of Compactness
17.2.11 Distances between Partitions
17.3 Evaluation of Hierarchical Clustering
17.3.1 Testing Absence of Structure
17.3.2 Testing Hierarchical Structures
17.4 Validity Indices for Fuzzy Clustering
17.4.1 The Partition Coefficient Index
17.4.2 The Partition Entropy Index
17.4.3 The Fukuyama-Sugeno Index
17.4.4 Validity Based on Fuzzy Similarity
17.4.5 A Compact and Separate Fuzzy Validity Criterion
17.4.6 A Partition Separation Index
17.4.7 An Index Based on the Mini-max Filter Concept and Fuzzy Theory
17.5 Summary
III Applications of Clustering
18 Clustering Gene Expression Data
18.1 Background
18.2 Applications of Gene Expression Data Clustering
18.3 Types of Gene Expression Data Clustering
18.4 Some Guidelines for Gene Expression Clustering
18.5 Similarity Measures for Gene Expression Data
18.5.1 Euclidean Distance
18.5.2 Pearson's Correlation Coefficient
18.6 A Case Study
18.6.1 C++ Code
18.6.2 Results
18.7 Summary
IV MATLAB and C++ for Clustering
19 Data Clustering in MATLAB
19.1 Read and Write Data Files
19.2 Handle Categorical Data
19.3 M-files, MEX-files, and MAT-files
19.3.1 M-files
19.3.2 MEX-files
19.3.3 MAT-files
19.4 Speed up MATLAB
19.5 Some Clustering Functions
19.5.1 Hierarchical Clustering
19.5.2 k-means Clustering
19.6 Summary
20 Clustering in C/C++
20.1 The STL
20.1.1 The vector Class
20.1.2 The list Class
20.2 C/C++ Program Compilation
20.3 Data Structure and Implementation
20.3.1 Data Matrices and Centers
20.3.2 Clustering Results
20.3.3 The Quick Sort Algorithm
20.4 Summary
A Some Clustering Algorithms
B The kd-tree Data Structure
C MATLAB Codes
C.1 The MATLAB Code for Generating Subspace Clusters
C.2 The MATLAB Code for the k-modes Algorithm
C.3 The MATLAB Code for the MSSC Algorithm
D C++ Codes
D.1 The C++ Code for Converting Categorical Values to Integers
D.2 The C++ Code for the FSC Algorithm
Bibliography
Subject Index
Author Index