Cluster Analysis
8/20/2019 Cluster Analysis
http://slidepdf.com/reader/full/cluster-anlysis 1/30
Data Mining
Cluster Analysis
Lecture Notes 10
Probabilistic Model-Based Clustering
Probabilistic Model-Based Clustering
• Clustering methods discussed so far
  – Every data object is assigned to exactly one cluster
• Some applications may need fuzzy or soft cluster assignment
  – Ex. An e-game could belong to both entertainment and software
• Methods:
  – fuzzy clusters
  – probabilistic model-based clusters
Fuzzy Clustering
• Given a set of objects, X = {x1, …, xn}, a fuzzy set S is a subset of X that allows each object in X to have a membership degree between 0 and 1.
• A fuzzy set, S, can be modeled as a function:
  F_S : X → [0, 1]
• Example: Popularity of cameras is defined as a fuzzy mapping

  pop(o) = 1        if 1000 or more units of o are sold
           i/1000   if i (i < 1000) units of o are sold

  Camera  Sales (units)
  A       50
  B       1320
  C       860
  D       270

  – Then, A(0.05), B(1), C(0.86), D(0.27)
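The fuzzy popularity mapping above can be sketched in a few lines of Python (the sales figures are taken from the table):

```python
def pop(units_sold):
    """Fuzzy membership degree of a camera in the set 'popular cameras':
    1 if 1000 or more units are sold, otherwise units_sold / 1000."""
    return 1.0 if units_sold >= 1000 else units_sold / 1000.0

sales = {"A": 50, "B": 1320, "C": 860, "D": 270}
degrees = {camera: pop(units) for camera, units in sales.items()}
# A -> 0.05, B -> 1.0, C -> 0.86, D -> 0.27
```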
Fuzzy Clustering
• k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [w_ij] (w_ij is the membership degree of an object o_i in fuzzy cluster C_j)
  – for each object o_i and cluster C_j, 0 ≤ w_ij ≤ 1 (fuzzy set)
  – for each object o_i, Σ_{j=1}^{k} w_ij = 1, equal participation in the clustering
  – for each cluster C_j, 0 < Σ_{i=1}^{n} w_ij < n, ensures there is no empty cluster
• For an object o_i, SSE:

  SSE(o_i) = Σ_{j=1}^{k} w_ij^p · dist(o_i, c_j)²,

  – where the parameter p controls the influence of the degrees of membership
• For a cluster C_j, SSE:

  SSE(C_j) = Σ_{i=1}^{n} w_ij^p · dist(o_i, c_j)².

• Measure how well a clustering fits the data:

  SSE(C) = Σ_{i=1}^{n} Σ_{j=1}^{k} w_ij^p · dist(o_i, c_j)².
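The total-SSE formula can be sketched directly; the 1-d objects, centers, and membership weights below are made up for illustration:

```python
def fuzzy_sse(objects, centers, W, p=2):
    """SSE(C): sum over all objects i and clusters j of
    w_ij^p times the squared distance from o_i to center c_j."""
    total = 0.0
    for i, o in enumerate(objects):
        for j, c in enumerate(centers):
            total += (W[i][j] ** p) * (o - c) ** 2
    return total

objects = [1.0, 2.0, 9.0]
centers = [1.5, 9.0]
W = [[1.0, 0.0],   # each row sums to 1 (equal participation)
     [0.8, 0.2],
     [0.0, 1.0]]
total = fuzzy_sse(objects, centers, W)
```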
Fuzzy Clustering - Example
• Example: Suppose the store has six reviews
• Let cluster features be
  – C1: digital camera and lens
  – C2: computer
• For review R_i and cluster C_j, w_ij is defined as

  w_ij = |R_i ∩ C_j| / |R_i ∩ (C1 ∪ C2)| = |R_i ∩ C_j| / |R_i ∩ {digital camera, lens, computer}|.

  Review ID  Keywords
  R1         digital camera, lens
  R2         digital camera
  R3         lens
  R4         digital camera, lens, computer
  R5         computer, CPU
  R6         computer, computer game

  The partition matrix:

      | 1    0   |
      | 1    0   |
  M = | 1    0   |
      | 2/3  1/3 |
      | 0    1   |
      | 0    1   |

  Here review R4 belongs to clusters C1 and C2 with membership degrees 2/3 and 1/3, respectively.
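The partition matrix above can be reproduced with set operations (keyword sets copied from the table):

```python
from fractions import Fraction

C1 = {"digital camera", "lens"}
C2 = {"computer"}
reviews = {
    "R1": {"digital camera", "lens"},
    "R2": {"digital camera"},
    "R3": {"lens"},
    "R4": {"digital camera", "lens", "computer"},
    "R5": {"computer", "CPU"},
    "R6": {"computer", "computer game"},
}

def membership(R, Cj):
    """w_ij = |R ∩ C_j| / |R ∩ (C1 ∪ C2)|, kept as an exact fraction."""
    return Fraction(len(R & Cj), len(R & (C1 | C2)))

M = [(membership(R, C1), membership(R, C2)) for R in reviews.values()]
# the row for R4 is (2/3, 1/3)
```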
Model-Based Clustering
• A set C of k probabilistic clusters C1, …, Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.
• Probability of an object o generated by cluster C_j is

  P(o | C_j) = ω_j f_j(o).

• Probability of o generated by the set of clusters C is

  P(o | C) = Σ_{j=1}^{k} ω_j f_j(o).

• Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have

  P(D | C) = Π_{i=1}^{n} P(o_i | C) = Π_{i=1}^{n} Σ_{j=1}^{k} ω_j f_j(o_i).
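The two probabilities above can be sketched directly. The densities f_j here are illustrative 1-d Gaussians; the weights and parameters are made up:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-d Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def p_obj_given_cluster(o, omega_j, f_j):
    """P(o | C_j) = omega_j * f_j(o)."""
    return omega_j * f_j(o)

def p_obj_given_C(o, omegas, fs):
    """P(o | C) = sum over j of omega_j * f_j(o)."""
    return sum(w * f(o) for w, f in zip(omegas, fs))

fs = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 5.0, 2.0)]
omegas = [0.6, 0.4]
p = p_obj_given_C(1.0, omegas, fs)
```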
Model-Based Clustering
• Task: Find a set C of k probabilistic clusters such that P(D|C) is maximized.
• However, maximizing P(D|C) is often intractable since the probability density function of a cluster can take an arbitrarily complicated form
• To make it computationally feasible (as a compromise), assume the probability density functions are some parameterized distributions
• Given O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and P_j(o_i | θ_j), the probability that o_i is generated from the j-th distribution using parameter θ_j, we have

  P(o_i | Θ) = Σ_{j=1}^{k} ω_j P_j(o_i | θ_j),

  P(O | Θ) = Π_{i=1}^{n} Σ_{j=1}^{k} ω_j P_j(o_i | θ_j).
Univariate Gaussian Mixture Model
• Assume the probability density function of each cluster follows a 1-d Gaussian distribution. Suppose that there are k clusters.
• The probability density function of cluster C_j is centered at μ_j with standard deviation σ_j; with θ_j = (μ_j, σ_j), we have

  P(o_i | θ_j) = (1 / (√(2π) σ_j)) e^{−(o_i − μ_j)² / (2σ_j²)}.

• Assuming that each cluster has the same probability, that is ω1 = ω2 = ··· = ωk = 1/k, we have

  P(o_i | Θ) = (1/k) Σ_{j=1}^{k} (1 / (√(2π) σ_j)) e^{−(o_i − μ_j)² / (2σ_j²)},

  P(O | Θ) = Π_{i=1}^{n} (1/k) Σ_{j=1}^{k} (1 / (√(2π) σ_j)) e^{−(o_i − μ_j)² / (2σ_j²)}.
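The equal-weight mixture likelihood above can be sketched as follows; the cluster parameters and data points are invented, and the log is taken only to avoid numeric underflow in the product:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-d Gaussian density centered at mu with standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def p_point(o, params):
    """P(o | Theta) with equal weights 1/k; params is a list of (mu_j, sigma_j)."""
    k = len(params)
    return sum(gaussian_pdf(o, mu, s) for mu, s in params) / k

def log_likelihood(O, params):
    """log P(O | Theta) = sum over i of log P(o_i | Theta)."""
    return sum(math.log(p_point(o, params)) for o in O)

params = [(0.0, 1.0), (5.0, 1.0)]   # illustrative k = 2 clusters
ll = log_likelihood([0.1, 0.2, 4.9, 5.3], params)
```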
The EM (Expectation Maximization) Algorithm
• The k-means algorithm has two steps at each iteration:
  – Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: an object is expected to belong to the closest cluster
  – Maximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of squared distances from the objects assigned to this cluster to the new center is minimized
• The EM algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
  – E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters
  – M-step finds the new clustering or parameters that minimize the sum of squared error (SSE) or maximize the expected likelihood
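The two k-means steps described above can be sketched as one iteration in 1-d (the data and initial centers are hypothetical):

```python
def kmeans_step(objects, centers):
    """One E-step (hard assignment to the nearest center) followed by one
    M-step (each center moves to the mean of its assigned objects)."""
    # E-step: assign each object to the closest center
    assign = [min(range(len(centers)), key=lambda j: (o - centers[j]) ** 2)
              for o in objects]
    # M-step: recompute each center as the mean of its members
    new_centers = []
    for j, c in enumerate(centers):
        members = [o for o, a in zip(objects, assign) if a == j]
        new_centers.append(sum(members) / len(members) if members else c)
    return assign, new_centers

objects = [1.0, 2.0, 9.0, 10.0]
assign, centers = kmeans_step(objects, [0.0, 8.0])
# objects 1, 2 join the first center; 9, 10 join the second
```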
Example: Fuzzy Clustering Using the EM Algorithm
• Initially, let c1 = a and c2 = b
• 1st E-step: assign each object o to clusters c1 and c2 with membership weights computed from its distances to the current centers
• 1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)
• Iteratively repeat the two steps until the cluster centers converge or the change is small enough
Computing Mixture Models with EM
• Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} such that P(O|Θ) is maximized, where θ_j = (μ_j, σ_j) are the mean and standard deviation of the j-th univariate Gaussian distribution
• We initially assign random values to the parameters θ_j, then iteratively conduct the E- and M-steps until convergence or a sufficiently small change
• At the E-step, for each object o_i, calculate the probability that o_i belongs to each distribution:

  P(θ_j | o_i, Θ) = P(o_i | θ_j) / Σ_{l=1}^{k} P(o_i | θ_l).

• At the M-step, adjust the parameters θ_j = (μ_j, σ_j) so that the expected likelihood P(O|Θ) is maximized:

  μ_j = Σ_{i=1}^{n} o_i P(θ_j | o_i, Θ) / Σ_{i=1}^{n} P(θ_j | o_i, Θ),

  σ_j = √( Σ_{i=1}^{n} P(θ_j | o_i, Θ) (o_i − μ_j)² / Σ_{i=1}^{n} P(θ_j | o_i, Θ) ).
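The E- and M-step formulas above can be sketched end to end, assuming equal mixture weights 1/k as in the earlier slide; the data set and initial parameters below are made up:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-d Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gmm_1d(O, params, iters=50):
    """EM for a 1-d Gaussian mixture with equal weights.
    params is a list of (mu_j, sigma_j); returns the fitted list."""
    k = len(params)
    for _ in range(iters):
        # E-step: P(theta_j | o_i, Theta) for every object and cluster
        resp = []
        for o in O:
            dens = [gaussian_pdf(o, mu, s) for mu, s in params]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted mean and weighted standard deviation per cluster
        new_params = []
        for j in range(k):
            wsum = sum(r[j] for r in resp)
            mu = sum(r[j] * o for r, o in zip(resp, O)) / wsum
            var = sum(r[j] * (o - mu) ** 2 for r, o in zip(resp, O)) / wsum
            new_params.append((mu, math.sqrt(var)))
        params = new_params
    return params

O = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
fitted = em_gmm_1d(O, [(0.0, 1.0), (6.0, 1.0)])
# the two fitted means converge near 1.0 and 5.0
```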
Advantages and Disadvantages of Mixture Models
• Strength
  – Mixture models are more general than partitioning and fuzzy clustering
  – Clusters can be characterized by a small number of parameters
  – The results may satisfy the statistical assumptions of the generative models
• Weakness
  – May converge to a local optimum (overcome: run multiple times with random initialization)
  – Computationally expensive if the number of distributions is large, or the data set contains very few observed data points
  – Needs large data sets
  – Hard to estimate the number of clusters
Notes about slides
• The following slides are adapted from the “Introduction to Data Mining” book. They describe EM in a simpler way.
• There may be some notation differences from the previous ones.
Model-based clustering
• In order to understand our data, we will assume that there is a generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data.
  – Models of different complexity can be defined, but we will assume that our model is a distribution from which data points are sampled
  – Example: the data is the height of all people in Greece
• In most cases, a single distribution is not good enough to describe all data points: different parts of the data follow a different distribution
  – Example: the data is the height of all people in Greece and China
  – We need a mixture model
  – Different distributions correspond to different clusters in the data.
Gaussian Distribution
• Example: the data is the height of all people in Greece
  – Experience has shown that this data follows a Gaussian (Normal) distribution
  – Reminder: Normal distribution:

    P(x) = (1 / (√(2π) σ)) e^{−(x − μ)² / (2σ²)}

  – μ = mean, σ = standard deviation
Gaussian Model
• What is a model?
  – A Gaussian distribution is fully defined by the mean μ and the standard deviation σ
  – We define our model as the pair of parameters θ = (μ, σ)
• This is a general principle: a model is defined as a vector of parameters θ
• Fitting the model: We want to find the normal distribution that best fits our data
  – Find the best values for μ and σ
  – But what does best fit mean?
Maximum Likelihood Estimation (MLE)
• Suppose that we have a vector X = (x1, …, xn) of values
• And we want to fit a Gaussian N(μ, σ) model to the data
• Probability of observing point x_i:

  P(x_i) = (1 / (√(2π) σ)) e^{−(x_i − μ)² / (2σ²)}

• Probability of observing all points (assume independence):

  P(X) = Π_{i=1}^{n} P(x_i)

• We want to find the parameters θ = (μ, σ) that maximize the probability P(X|θ)
Maximum Likelihood Estimation (MLE)
• Note: these are also the most likely parameters given the data
• If we have no prior information about θ, or X, then maximizing P(X|θ) is the same as maximizing P(θ|X)
A Gaussian Distribution
Mixture of Gaussians
• Suppose that you have the heights of people from Greece and China. In this case the data is the result of the mixture of two Gaussians
  – One for Greek people, and one for Chinese people
  – Identifying for each value which Gaussian is most likely to have generated it will give us a clustering.
Mixture model
• A value x_i is generated according to the following process:
  – First select the nationality
    » With probability π_G select Greek, with probability π_C select Chinese (π_G + π_C = 1)
  – Given the nationality, generate the point from the corresponding Gaussian
    » P(x_i | θ_G) ~ N(μ_G, σ_G) if Greek
    » P(x_i | θ_C) ~ N(μ_C, σ_C) if Chinese
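The generative process above can be sketched directly; the parameters π_G, μ_G, σ_G, μ_C, σ_C below are invented for illustration:

```python
import random

def sample_mixture(n, pi_g, mu_g, sigma_g, mu_c, sigma_c, seed=0):
    """Generate n values: pick a nationality with probability (pi_g, 1 - pi_g),
    then draw the value from the corresponding Gaussian."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < pi_g:                     # select Greek
            values.append(rng.gauss(mu_g, sigma_g))
        else:                                       # select Chinese
            values.append(rng.gauss(mu_c, sigma_c))
    return values

data = sample_mixture(1000, pi_g=0.5, mu_g=1.80, sigma_g=0.07,
                      mu_c=1.70, sigma_c=0.07)
```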
Mixture Model
• Our model has the following parameters: Θ = (π_G, π_C, μ_G, μ_C, σ_G, σ_C)
  – Mixture probabilities: π_G, π_C
  – Distribution parameters: μ_G, σ_G, μ_C, σ_C
• For value x_i, we have:

  P(x_i | Θ) = π_G P(x_i | θ_G) + π_C P(x_i | θ_C)

• For all values X = (x1, …, xn):

  P(X | Θ) = Π_{i=1}^{n} P(x_i | Θ)

• We want to estimate the parameters that maximize the likelihood of the data
Mixture Models
• Once we have the parameters Θ = (π_G, π_C, μ_G, μ_C, σ_G, σ_C) we can estimate the membership probabilities P(G|x_i) and P(C|x_i) for each point x_i:

  P(G | x_i) = π_G P(x_i | θ_G) / (π_G P(x_i | θ_G) + π_C P(x_i | θ_C))

  – This is the probability that point x_i belongs to the Greek or the Chinese population (cluster)
EM (Expectation Maximization) Algorithm
• Initialize the values of the parameters in Θ to some random values
• Repeat until convergence
  – E-Step: Given the parameters Θ, estimate the membership probabilities P(G|x_i) and P(C|x_i)
  – M-Step: Compute the parameter values that (in expectation) maximize the data likelihood (MLE estimates; π_G and π_C become the fractions of the population assigned to G and C)
Relationship to K-means
• E-Step: Assignment of points to clusters
  – K-means: hard assignment; EM: soft assignment
• M-Step: Computation of centroids
  – K-means assumes a common fixed variance (spherical clusters)
  – EM: can change the variance for different clusters or different dimensions (ellipsoid clusters)
• If the variance is fixed, then both minimize the same error function
EM Clustering
EM Clustering
Mixture Model and K-means Clustering