Lecture Slides for
INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
CHAPTER 7: Clustering
Semiparametric Density Estimation
Parametric: Assume a single model for p(x | Ci) (Chapters 4 and 5)
Semiparametric: p(x | Ci) is a mixture of densities
Multiple possible explanations/prototypes:
Different handwriting styles, accents in speech
Nonparametric: No model; data speaks for itself (Chapter 8)
Mixture Densities
$$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$$
where $G_i$ are the components/groups/clusters, $P(G_i)$ the mixture proportions (priors), and $p(x \mid G_i)$ the component densities.
Gaussian mixture, where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$, with parameters $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$
estimated from an unlabeled sample $X = \{x^t\}_t$ (unsupervised learning).
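As a concrete illustration (not from the slides), here is a minimal NumPy/SciPy sketch that evaluates this mixture density for a hypothetical two-component Gaussian mixture; the component parameters are made-up values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component Gaussian mixture in 2D (illustration values only)
priors = [0.6, 0.4]                                      # P(G_i), sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]     # mu_i
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]   # Sigma_i

def mixture_density(x):
    """p(x) = sum_i p(x | G_i) P(G_i)."""
    return sum(P * multivariate_normal(mean=m, cov=S).pdf(x)
               for P, m, S in zip(priors, means, covs))

print(mixture_density(np.array([1.0, 1.0])))
```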
Classes vs. Clusters

Supervised: $X = \{x^t, r^t\}_t$
Classes $C_i$, $i = 1, \dots, K$
where $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$
$\Phi = \{P(C_i), \mu_i, \Sigma_i\}_{i=1}^{K}$
$$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$$
Parameter estimates from the labeled sample:
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad
m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t} \qquad
S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}$$

Unsupervised: $X = \{x^t\}_t$
Clusters $G_i$, $i = 1, \dots, k$
where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$
$\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$
$$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$$
Labels $r_i^t$?
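For the supervised column, these are ordinary per-class sample statistics. A minimal NumPy sketch (illustrative, not from the slides), assuming X is an N x d data matrix and r is an N x K one-hot label matrix:

```python
import numpy as np

def class_estimates(X, r):
    """Per-class priors, means, and covariances from labeled data.
    X: (N, d) data matrix; r: (N, K) one-hot labels r_i^t."""
    N, d = X.shape
    K = r.shape[1]
    priors = r.sum(axis=0) / N                       # P̂(C_i) = sum_t r_i^t / N
    means = (r.T @ X) / r.sum(axis=0)[:, None]       # m_i = sum_t r_i^t x^t / sum_t r_i^t
    covs = np.empty((K, d, d))
    for i in range(K):
        diff = X - means[i]                          # x^t - m_i
        covs[i] = (r[:, i, None] * diff).T @ diff / r[:, i].sum()
    return priors, means, covs
```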
k-Means Clustering
Find k reference vectors (prototypes/codebook vectors/codewords) which best represent the data
Reference vectors $m_j$, $j = 1, \dots, k$
Use nearest (most similar) reference:
$$\|x^t - m_i\| = \min_j \|x^t - m_j\|$$
Reconstruction error:
$$E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|^2
\qquad b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$$
Encoding/Decoding
$$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$$
k-means Clustering
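The original slide presents the k-means pseudocode as a figure; below is a minimal NumPy sketch of the same loop (illustrative: the random initialization from the sample and the simple convergence test are my choices, not the slide's), alternating hard assignments b_i^t with centroid updates.

```python
import numpy as np

def k_means(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """Plain k-means: X is (N, d); returns centroids m (k, d) and hard labels."""
    m = X[rng.choice(len(X), size=k, replace=False)]   # initialize m_i from random samples
    for _ in range(n_iter):
        # Assignment step: each x^t goes to its nearest reference vector
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)   # (N, k) squared distances
        labels = d2.argmin(axis=1)                                 # b_i^t as an index
        # Update step: move each m_i to the mean of its assigned instances
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):
            break
        m = new_m
    return m, labels
```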
Expectation-Maximization (EM)
Log likelihood with a mixture model
Assume hidden variables $z$ which, when known, make optimization much simpler.
Complete likelihood, $\mathcal{L}_C(\Phi \mid X, Z)$, in terms of $x$ and $z$
Incomplete likelihood, $\mathcal{L}(\Phi \mid X)$, in terms of $x$
$$\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$$
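A minimal sketch (not from the slides) of this incomplete log likelihood for a Gaussian mixture, with SciPy supplying the component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def incomplete_log_likelihood(X, priors, means, covs):
    """L(Phi | X) = sum_t log sum_i p(x^t | G_i) P(G_i).
    X: (N, d) sample; priors, means, covs: per-component parameters."""
    comp = np.stack([P * multivariate_normal(mean=m, cov=S).pdf(X)
                     for P, m, S in zip(priors, means, covs)], axis=1)   # (N, k)
    return np.log(comp.sum(axis=1)).sum()
```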
E- and M-steps
Iterate the two steps:
1. E-step: Estimate $z$ given $X$ and current $\Phi$
2. M-step: Find new $\Phi'$ given $z$, $X$, and old $\Phi$
E-step: $Q(\Phi \mid \Phi^l) = E\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$
M-step: $\Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$
An increase in $Q$ increases the incomplete likelihood:
$$\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^{l} \mid X)$$
EM in Gaussian Mixtures
$z_i^t = 1$ if $x^t$ belongs to $G_i$, 0 otherwise (the labels $r_i^t$ of supervised learning); assume $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$
E-step:
$$E[z_i^t \mid X, \Phi^l] = \frac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)} \equiv h_i^t$$
M-step:
$$P(G_i) = \frac{\sum_t h_i^t}{N} \qquad
m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t} \qquad
S_i^{l+1} = \frac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}$$
Use the estimated soft labels $h_i^t$ in place of the unknown labels $r_i^t$.
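Putting the two steps together, here is a compact NumPy/SciPy sketch of EM for a Gaussian mixture (illustrative only: the random initialization and the small ridge added to the covariances are my choices; a careful implementation would also work in log space):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, n_iter=50, rng=np.random.default_rng(0)):
    """EM for a k-component Gaussian mixture. X: (N, d) data matrix."""
    N, d = X.shape
    priors = np.full(k, 1.0 / k)                          # P(G_i)
    means = X[rng.choice(N, size=k, replace=False)]       # m_i (k-means is used in practice)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)  # S_i
    for _ in range(n_iter):
        # E-step: h_i^t = P(G_i) p(x^t | G_i) / sum_j P(G_j) p(x^t | G_j)
        h = np.stack([priors[i] * multivariate_normal(means[i], covs[i]).pdf(X)
                      for i in range(k)], axis=1)         # (N, k)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, covariances with the soft labels h
        Nk = h.sum(axis=0)                                # effective counts
        priors = Nk / N
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return priors, means, covs, h
```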
[Figure: data and fitted Gaussian mixture; the contour where P(G1 | x) = h1 = 0.5]
Mixtures of Latent Variable Models
Regularize clusters
1. Assume shared/diagonal covariance matrices (see the sketch below)
2. Use PCA/FA to decrease dimensionality: Mixtures of PCA/FA
Can use EM to learn Vi (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999)
$$p(x^t \mid G_i) = \mathcal{N}(m_i,\; V_i V_i^T + \psi_i)$$
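For option 1 (shared or diagonal covariances), scikit-learn's GaussianMixture exposes the restriction directly through its covariance_type parameter; a minimal sketch on placeholder data (mixtures of PCA/FA themselves are not covered by this API):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 10))   # placeholder data

# "diag" restricts each S_i to a diagonal matrix; "tied" shares one full S across clusters
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)
print(gmm.weights_, gmm.means_.shape, gmm.covariances_.shape)
```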
After Clustering
Dimensionality reduction methods find correlations between features and group features.
Clustering methods find similarities between instances and group instances.
Allows knowledge extraction through the number of clusters, prior probabilities, and cluster parameters, i.e., center and range of features.
Example: CRM (customer relationship management), customer segmentation
Clustering as Preprocessing
Estimated group labels $h_j$ (soft) or $b_j$ (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor.
Local representation (only one $b_j$ is 1, all others are 0; only a few $h_j$ are nonzero) vs.
distributed representation (after PCA, all $z_j$ are nonzero)
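A minimal scikit-learn sketch of this idea on placeholder data (illustrative pipeline, not from the slides): fit a Gaussian mixture, use the soft memberships h_j as a new k-dimensional representation, then learn a discriminant on it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                    # placeholder inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # placeholder labels

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
H = gmm.predict_proba(X)                         # soft labels h_j: new (N, k) representation
clf = LogisticRegression().fit(H, y)             # learn the discriminant in the new space
print(clf.score(H, y))
```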
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised).
If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
$$p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})$$
$$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$$
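A sketch of this setup with scikit-learn (illustrative; k_per_class and the helper names are my own): fit one Gaussian mixture per class, then classify through the class posterior with Bayes' rule.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture_of_mixtures(X, y, k_per_class=3):
    """One Gaussian mixture per class; returns class labels, priors P(C_i), and fitted mixtures."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])             # P(C_i)
    gmms = [GaussianMixture(n_components=k_per_class, random_state=0).fit(X[y == c])
            for c in classes]
    return classes, priors, gmms

def predict(X, classes, priors, gmms):
    """argmax_i P(C_i) p(x | C_i), where p(x | C_i) is itself a mixture."""
    log_post = np.stack([np.log(P) + g.score_samples(X)               # log P(C_i) + log p(x | C_i)
                         for P, g in zip(priors, gmms)], axis=1)
    return classes[log_post.argmax(axis=1)]
```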
Hierarchical Clustering
Cluster based on similarities/distances
Distance measure between instances $x^r$ and $x^s$:
Minkowski ($L_p$) (Euclidean for $p = 2$):
$$d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} |x_j^r - x_j^s|^p \right]^{1/p}$$
City-block distance:
$$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|$$
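A short NumPy sketch of the two distances (illustrative values):

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """Minkowski (L_p) distance; p=2 gives Euclidean, p=1 gives city-block."""
    return (np.abs(xr - xs) ** p).sum() ** (1.0 / p)

def city_block(xr, xs):
    return np.abs(xr - xs).sum()

xr, xs = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(xr, xs), minkowski(xr, xs, p=1), city_block(xr, xs))
```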
Agglomerative Clustering
Start with N groups, each with one instance, and merge the two closest groups at each iteration.
Distance between two groups $G_i$ and $G_j$:
Single-link:
$$d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$$
Complete-link:
$$d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$$
Average-link, centroid
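A minimal SciPy sketch on placeholder data (illustrative): compute the linkage with single or complete linkage, cut it into a chosen number of groups, and draw the dendrogram shown on the next slide.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))     # placeholder instances

Z = linkage(X, method="single")                       # or "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")       # cut the tree into 3 groups
dendrogram(Z)                                         # the dendrogram of the next slide
plt.show()
```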
Dendrogram
Example: Single-Link Clustering
Choosing k
Defined by the application, e.g., image quantization
Plot data (after PCA) and check for clusters
Incremental (leader-cluster) algorithm: add clusters one at a time until an “elbow” appears (in reconstruction error, log likelihood, or intergroup distances)
Manual check for meaning
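A minimal scikit-learn sketch of the elbow inspection on placeholder data (illustrative): plot the k-means reconstruction error against k and look for the bend.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))    # placeholder data

ks = range(1, 11)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), errors, marker="o")                 # look for the "elbow" in this curve
plt.xlabel("k")
plt.ylabel("reconstruction error")
plt.show()
```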