machine learning, stor 565 [.1in] clustering: overview and
TRANSCRIPT
![Page 1: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/1.jpg)
Machine Learning, STOR 565
Clustering: Overview and Basic Methods
Andrew Nobel
February, 2021
![Page 2: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/2.jpg)
Overview
Task: Divide a set of objects (e.g. data points) into a small number of disjointgroups such that objects in the same group are close together, and objects indifferent groups are far apart.
![Page 3: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/3.jpg)
Example (http://rosettacode.org)
![Page 4: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/4.jpg)
Example (https://apandre.files.wordpress.com)
![Page 5: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/5.jpg)
General Setting
Given: Objects x1, . . . , xn in feature space X
I Dissimilarity or distance d(xi, xj) between pairs of objects
Goal: Divide x1, . . . , xn into disjoint groups C1, . . . , Ck, called clusters, s.t.
I Objects in same cluster are close together
I Objects in different clusters are far apart
I Number of clusters k is small
Distinction: Clustering is complete if it partitions X and incomplete if itpartitions only x1, . . . , xn.
![Page 6: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/6.jpg)
Clustering: Areas of Application
Genomics, Biology
Data Compression
Psychology
Computer Science
Social and Political Science
![Page 7: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/7.jpg)
Feature Vectors
Objects x ∈ X typically represented by a feature vector
x = (x1, . . . , xp)t
where xi is a numerical/categorical measurement of interest:
I xi ∈ R numerical feature
I xi ∈ {a, b, . . .} categorical feature
![Page 8: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/8.jpg)
Examples
Medicine
I Object = patient
I Feature xi = outcome of a diagnostic test on patient
Microarrays (Genomics)
I Object = tissue sample
I Feature xi = measured expression level of gene i in that sample
Data Mining
I Object = consumer
I Features xi = type, location, or amount of recent purchases
![Page 9: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/9.jpg)
Dissimilarities/Distances Between Feature Vectors
Euclidean d(u,v) =√∑
i(ui − vi)2
Manhattan d(u,v) =∑
i |ui − vi|
Correlation d(u,v) = 1− corr(u, v)
Hamming d(u,v) =∑
i I{ui 6= vi}
Mixtures of these
![Page 10: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/10.jpg)
Basic Steps in Clustering
Objects x1, . . . ,xn
⇓
Selection and Extraction of Features
⇓
Dissimilarity matrix D = {d(xi,xj) : 1 ≤ i, j ≤ n}
⇓
Clustering Algorithm
⇓
Partition π = {C1, . . . , Ck} of x1, . . . ,xn.
![Page 11: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/11.jpg)
Some Clustering Methods
Hierarchical: Candidate divisions of data described by a binary tree
I Agglomerative (bottom-up)
I Divisive (top-down)
Iterative: Search for local minimum of simple cost function
I k-means and variants
I partitioning around medioids, self organizing maps
Model-based: Fit feature vectors with a finite mixture model
Spectral: Threshold eigenvectors of Laplacian of Dissimilarity Matrix
![Page 12: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/12.jpg)
Features of Clusters
Features of clusters can affect the performance of different procedures, e.g.,whether clusters are
I Spherical or elliptical in shape
I Similar in overall variance/spread
I Similar in size (number of points)
![Page 13: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/13.jpg)
The k-Means Algorithm
![Page 14: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/14.jpg)
The k-Means Algorithm
Setting: Objects x1, . . . ,xn ∈ Rp are vectors. Seek k clusters
Approach: Focus on cluster centers
I Find good cluster centers c1, . . . , ck ∈ Rp
I Let cluster Cj = vectors xi closer to cj than other centers cl
Optimization: Select centers to minimize sum of squares (SoS) cost function
Cost(c1, . . . , ck) =
n∑i=1
min1≤j≤k
||xi − cj ||2
Problem: Exact solution of optimization problem not feasible. Resort toiterative methods that find local minimum
![Page 15: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/15.jpg)
Ingredient 1: Centroids
Definition: The centroid of vectors v1, . . . ,vm ∈ Rp is their (vector) average
c =1
m
m∑i=1
vi
I Centroid c is the center of mass of the point configuration v1, . . . ,vm
I Centroid c is an optimal representative for v1, . . . ,vm, in the sense that
m∑i=1
||vi − c||2 ≤m∑i=1
||vi − v||2
for every vector v ∈ Rp
![Page 16: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/16.jpg)
Ingredient 2: Nearest Neighbor Partitions
Idea: Given centers c1, . . . , ck ∈ Rp one can partition Rp into correspondingcells A1, . . . , Ak where
Aj = {x : ||x− cj || ≤ ||x− cs|| all l 6= j}
contains vectors that are closer to center cj than any other center cs (wherewe break ties by index)
Definition: Cells {A1, . . . , Ak} called the nearest neighbor or Voronoipartition of Rp generated by centers c1, . . . , ck
Note: Aj =⋂
s6=j{x : ||x− cj || ≤ ||x− cl||} is an intersection of half-spaces
![Page 17: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/17.jpg)
The k-Means Algorithm
Initialize: Centers C0 = {a1, . . . ,ak}
Iterate: For m = 1, 2, . . . do:
I Let πm be the nearest neighbor partition of the centers Cm−1.
I Let Cm be the centroids of the vectors in each cell of πm
Stop: When Cost(Cm) is close to Cost(Cm+1)
Key Fact: Cost function decreases at each iteration of algorithm. Recall
Cost(c1, . . . , ck) =n∑
i=1
min1≤j≤k
||xi − cj ||2
![Page 18: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/18.jpg)
k-means (Yu-Zhong Chen, ResearchGate)
![Page 19: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/19.jpg)
The k-Means Algorithm
In practice
I Choose multiple initial sets of representative vectors C0 = {c1, . . . , ck}
I Run the iterative k-means procedure
I Choose the partition associated with the smallest final cost
Example: http://onmyphd.com/?p=k-means.clustering.
K-Means tends to perform best when clusters are spherical, similar invariance and size
![Page 20: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/20.jpg)
3-means (onmyphd.com)
![Page 21: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/21.jpg)
2-Means (onmyphd.com)
![Page 22: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/22.jpg)
Agglomerative Clustering
![Page 23: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/23.jpg)
Binary Trees
1. Distinguished node called the root with zero or two children but no parent
2. Every other node has one parent and zero or two children
I Nodes with no children are called leaves
I Nodes with two children are called internal
Note: Tree usually drawn upside-down, with root node at the top
![Page 24: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/24.jpg)
Agglomerative Clustering
Stage 0: Assign each object xi to its own cluster
Stage k:
I Find the two closest clusters at stage k − 1
I Combine them into a single cluster
Stop: When all objects xi belong to a single cluster
Output: Binary tree T called a dendrogram
Note: Closeness of clusters C,C′ can be measured in different ways
![Page 25: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/25.jpg)
Distances Between Clusters
Single Linkageds(C,C
′) = minxi∈C, xj∈C′
d(xi, xj)
Average Linkage
da(C,C′) =
1
|C| |C′|∑
xi∈C, xj∈C′
d(xi, xj)
Total Linkagedt(C,C
′) = maxxi∈C, xj∈C′
d(xi, xj)
![Page 26: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/26.jpg)
Dendrogram
Binary tree associated with the agglomerative clustering procedure: it is agraphical record of the clustering process
Initialize: Each singleton cluster {xi} corresponds to a node at height 0
Update: If two clusters C,C′ are combined, their respective nodes are joinedto a parent node at height d(C,C′)
Each node of dendrogram corresponds to a set of objects. Objectsassociated with two nodes are merged when forming their parent
I Leaves correspond to individual objects
I The root corresponds to all objects
![Page 27: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/27.jpg)
Cities by Distance (blogs.sas.com)
![Page 28: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/28.jpg)
Salmon by Genetic Similarity
![Page 29: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/29.jpg)
Dendrogram, cont.
Note: Dendrogram T represents many possible clusterings, one for each(rooted) subtree.
Methods for selecting a clustering/subtree
I Ad hoc selection (by eye)
I “Cutting” dendrogram at fixed level
I Penalized pruning
Visualization of clustering structure
I Order objects in the same way as the leaves of the dendrogram
I Caveat: many orderings possible
![Page 30: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/30.jpg)
Cars Data
I Samples: 32 unique cars
I Variables: 11 descriptive variables, including gas mileage, horsepower,number of cylinders, etc.
I Freely available in R: data(mtcars)
![Page 31: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/31.jpg)
Mas
erat
i Bor
aFo
rd P
ante
ra L
Dus
ter 3
60C
amar
o Z2
8C
hrys
ler I
mpe
rial
Cad
illac
Fle
etw
ood
Linc
oln
Con
tinen
tal
Hor
net S
porta
bout
Pon
tiac
Fire
bird
Mer
c 45
0SLC
Mer
c 45
0SE
Mer
c 45
0SL
Dod
ge C
halle
nger
AM
C J
avel
inH
orne
t 4 D
rive
Valiant
Ferr
ari D
ino
Hon
da C
ivic
Toyo
ta C
orol
laFi
at 1
28Fi
at X
1-9
Mer
c 24
0DM
azda
RX
4M
azda
RX
4 W
agM
erc
280
Mer
c 28
0CLo
tus
Eur
opa
Mer
c 23
0D
atsu
n 71
0V
olvo
142
ETo
yota
Cor
ona
Por
sche
914
-2
020
4060
80
Single Linkage Clustering on Cars data
hclust (*, "single")dist(mtcars)
Distance
![Page 32: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/32.jpg)
Ferr
ari D
ino
Hon
da C
ivic
Toyo
ta C
orol
laFi
at 1
28Fi
at X
1-9
Maz
da R
X4
Maz
da R
X4
Wag
Mer
c 28
0M
erc
280C
Mer
c 24
0DLo
tus
Eur
opa
Mer
c 23
0V
olvo
142
ED
atsu
n 71
0To
yota
Cor
ona
Por
sche
914
-2M
aser
ati B
ora
Hor
net 4
Driv
eValiant
Mer
c 45
0SLC
Mer
c 45
0SE
Mer
c 45
0SL
Dod
ge C
halle
nger
AM
C J
avel
inC
hrys
ler I
mpe
rial
Cad
illac
Fle
etw
ood
Linc
oln
Con
tinen
tal
Ford
Pan
tera
LD
uste
r 360
Cam
aro
Z28
Hor
net S
porta
bout
Pon
tiac
Fire
bird
050
100
150
200
250
Average Linkage Clustering on Cars data
hclust (*, "average")dist(mtcars)
Distance
![Page 33: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/33.jpg)
TCGA Data
Gene expression data from The Cancer Genome Atlas (TCGA)
I Samples
I 95 Luminal A breast tumors
I 122 Basal breast tumors
I Variables: 2000 randomly selected genes
![Page 34: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/34.jpg)
TCGA Data
I Clustered samples (breast tumor subtype)I Colors: Luminal A and Basal
![Page 35: Machine Learning, STOR 565 [.1in] Clustering: Overview and](https://reader031.vdocuments.site/reader031/viewer/2022013005/61ccffb6941ab3559f21359a/html5/thumbnails/35.jpg)
Important Questions
I What is the right number of clusters?
I What is right measure of distance?
I What is the best clustering method for the data?
I How robust is an observed clustering to small perturbations of the data?
I What significance can be assigned to the clusters?