
Page 1

First topic: clustering and pattern recognition

Marc Sobel

Page 2

Examples of Patterns

Pattern discovery and association

Statistics show connections between the shape of an adult's face and his or her character. There is also evidence that the outline of a child's face is related to alcohol abuse during pregnancy.

Page 3

Examples of Patterns

Crystal patterns at atomic and molecular levels

Their structures are represented by 3D graphs and can be described by a deterministic grammar or formal language.

Page 4

Examples of Patterns with clusters

We may understand patterns of brain activity and find relationships between brain activities, cognition, and behaviors.

Patterns of brain activities:

Page 5

What is a pattern? (see Grenander for a full theory)

In plain language, a pattern is a set of instances which share some regularities, and are similar to each other in the set. A pattern should occur repeatedly. A pattern is observable, sometimes partially, by some sensors with noise and distortions.

How do we define “regularity”? How do we define “similarity”? How do we define “likelihood” for the repetition of a pattern? How do we model the sensors?

One feature of patterns is that they result from similar groups of data called clusters. How can we identify these clusters?

Page 6

Hard K-means algorithm (Greedy)

• Start with points x1,…,xn to be clustered and mean (cluster) centers m1[0],…,mk[0] (at time t=0).

Iterate the following steps:

1) Assign x’s to the means according to their minimum distance from them. Let Zi=l iff xi gets assigned to class l.

(i=1,…,n; l=1,…,k).

2) Update the means according to:

$$m_j[t+1] = \frac{\sum_{i:\, Z_i = j} x_i}{\#\{i : Z_i = j\}} \qquad (j = 1, \ldots, k)$$
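A minimal sketch of this greedy two-step iteration in Python/NumPy (the function name and fixed iteration count are my own choices, not from the slides):

```python
import numpy as np

def hard_kmeans(x, m, n_iters=100):
    """Greedy hard k-means. x: (n, d) points; m: (k, d) initial cluster centers."""
    m = np.array(m, dtype=float)
    for _ in range(n_iters):
        # 1) Assignment: Z_i = l iff x_i is closest to center m_l.
        dists = np.linalg.norm(x[:, None, :] - m[None, :, :], axis=2)  # (n, k)
        z = dists.argmin(axis=1)
        # 2) Update: each center becomes the mean of the points assigned to it.
        for j in range(len(m)):
            if np.any(z == j):
                m[j] = x[z == j].mean(axis=0)
    return m, z
```

In practice one stops when the assignments z no longer change rather than after a fixed number of iterations.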

Page 7

Greedy versus nongreedy algorithms

• Greedy algorithms optimize an objective function at each step. The objective function for the k-means algorithm is:

$$O(x, m) = \sum_{i=1}^{n} \left\| x_i - m_{j[i]} \right\|^2$$

• where m_{j[i]} denotes the cluster center nearest to the point x_i (i = 1, …, n).

• Greedy algorithms are useful but (without external support) are subject to many problems, like overfitting, selecting ‘local’ in place of global optima, etc.

Page 8

Problems with hard k-means

• 1. Convergence depends on the starting point.

• 2. k-means is a hard assignment algorithm, which means that both important points and outliers play a similar role in assignment.

• 3. Components with different sizes induce a strong bias on the classification.

• 4. The distance used plays an enormous role in the kind of clusters which result (e.g., with the Minkowski distance d(xi, xj) = ‖xi − xj‖^α, the exponent α has an effect; see the sketch below). Possible project.
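To see the effect of α mentioned in item 4, one can compare Minkowski distances directly (a toy illustration; the helper below is mine, and the values in the comment are what it prints):

```python
import numpy as np

def minkowski(xi, xj, alpha):
    """Minkowski distance ||xi - xj||_alpha."""
    return np.sum(np.abs(xi - xj) ** alpha) ** (1.0 / alpha)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
for alpha in (1, 2, 10):
    # Prints 7.0, 5.0, ~4.02: larger alpha is dominated by the largest coordinate gap.
    print(alpha, minkowski(a, b, alpha))
```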

Page 9

Example: Dependence on initial condition

Page 10

The effects of size dissimilarity on the k-means algorithm

Page 11

Soft k-means Version 1: An improvement?

• In soft k-means, we assign points to clusters with certain probabilities or weights rather than in the usual hard manner:

$$w_{i,j} = \frac{\exp\left(-\beta\, d(x_i, m_j)^2\right)}{\sum_{l=1}^{k} \exp\left(-\beta\, d(x_i, m_l)^2\right)}$$

• Here the parameter β is either known, estimated prior to implementation, or iteratively estimated.

Page 12

Soft k-means

• We update the means by:

$$m_j = \frac{\sum_{i=1}^{n} w_{i,j}\, x_i}{\sum_{i=1}^{n} w_{i,j}} \qquad (j = 1, \ldots, k).$$

• This way, points which lie between clusters get assigned partially to ‘both of them’ and hence play a dual role.
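A sketch of one iteration, combining the assignment weights from the previous page with this mean update, assuming squared Euclidean distance (all names are mine):

```python
import numpy as np

def soft_kmeans_v1(x, m, beta, n_iters=100):
    """Soft k-means version 1. x: (n, d) points; m: (k, d) centers; beta: stiffness."""
    m = np.array(m, dtype=float)
    for _ in range(n_iters):
        # Weights: w_ij proportional to exp(-beta * d(x_i, m_j)^2), normalized over j.
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        w = np.exp(-beta * d2)
        w /= w.sum(axis=1, keepdims=True)
        # Update: each center is the weighted mean of all points.
        m = (w[:, :, None] * x[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
    return m, w
```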

Page 13

Soft k-means Version 1 (continued)

• Typically, the parameter β (called the stiffness) plays a significant role. As β goes to infinity, the algorithm tends to hard k-means. As β tends to 0, the algorithm tends to assign points randomly to clusters. If β tends to minus infinity, the algorithm assigns points to clusters far away from them. (Possible Project)

Page 14

Stiffness Assignment

• Typically, because β is an information-type parameter (the bigger it is, the more information is used about the points) and since 1/σ² also measures the amount of information the data provide, we assign β = 1/σ². Possible Project: What impact does the use of different stiffness parameters have on clustering a particular data set?

Page 15

The effects of using a stiffness for different values of β when it is assigned its ‘information’ value

Page 16

Possible Projects

• 1. What happens to the clusters (under soft clustering version 1 with information assignment) when we start with data which is a mixture of two Gaussians? A mixture of Gaussians just means that, with a certain probability, a data point comes from one Gaussian and, with one minus that probability, from another Gaussian (see the sketch after this list).

• 2. What happens to the clusters when we have data which consists of two independent Gaussians?
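For the first project, a mixture of two Gaussians can be simulated along these lines (the means, standard deviations, and mixing probability below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.3  # sample size and mixing probability, chosen arbitrarily

# With probability p draw from N(-2, 0.5^2); otherwise draw from N(3, 1^2).
component = rng.random(n) < p
data = np.where(component,
                rng.normal(-2.0, 0.5, n),
                rng.normal(3.0, 1.0, n))
```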

Page 17

Gaussian Distributions

One-dimensional Gaussian distribution:

$$g(x \mid m, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$

m is the mean and σ is the standard deviation.

Multidimensional Gaussian distribution:

$$g(\mathbf{x} \mid \mathbf{m}, \Sigma) = \frac{1}{(2\pi)^{I/2}\, |\Sigma|^{1/2}}\, \exp\left(-\frac{1}{2}\, (\mathbf{x} - \mathbf{m})'\, \Sigma^{-1}\, (\mathbf{x} - \mathbf{m})\right)$$

Σ is the covariance matrix and I is the dimension.
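Both densities are straightforward to evaluate, e.g. with NumPy (a sketch; the function names are mine):

```python
import numpy as np

def g1(x, m, sigma):
    """One-dimensional Gaussian density g(x | m, sigma)."""
    return np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gd(x, m, cov):
    """Multidimensional Gaussian density g(x | m, Sigma)."""
    d = len(m)
    diff = x - m
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm
```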

Page 18

Independent Gaussians

• In the case of independent spherical Gaussians with common σ, the covariance matrix is:

$$\Sigma = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$

Page 19

Soft k-means Version 2

• In soft k-means version 2, we assign different stiffness values and different proportions to each separate cluster. We now use the notation σ instead of β.

$$w_{i,j} = \frac{\pi_j\, \frac{1}{(2\pi\sigma_j^2)^{I/2}}\, \exp\left(-\frac{d(x_i, m_j)^2}{2\sigma_j^2}\right)}{\sum_{t=1}^{k} \pi_t\, \frac{1}{(2\pi\sigma_t^2)^{I/2}}\, \exp\left(-\frac{d(x_i, m_t)^2}{2\sigma_t^2}\right)};$$

$$m_j = \frac{\sum_{i=1}^{n} w_{i,j}\, x_i}{\sum_{i=1}^{n} w_{i,j}}; \qquad \sigma_j^2 = \frac{\sum_{i=1}^{n} w_{i,j}\, d(x_i, m_j)^2}{I\, \sum_{i=1}^{n} w_{i,j}}; \qquad \pi_j = \frac{\sum_{i=1}^{n} w_{i,j}}{n}$$
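A sketch of one version 2 iteration under the reconstruction above, with per-cluster variances and proportions (I is the dimension; all names are mine):

```python
import numpy as np

def soft_kmeans_v2(x, m, sigma2, pi, n_iters=100):
    """x: (n, I) points; m: (k, I) centers; sigma2: (k,) variances; pi: (k,) proportions."""
    n, I = x.shape
    m = np.array(m, dtype=float)
    for _ in range(n_iters):
        # Weights: w_ij proportional to pi_j * (2*pi*sigma2_j)^(-I/2) * exp(-d2_ij / (2*sigma2_j)).
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)         # (n, k)
        w = pi * (2 * np.pi * sigma2) ** (-I / 2) * np.exp(-d2 / (2 * sigma2))
        w /= w.sum(axis=1, keepdims=True)
        # Updates: weighted means, then per-cluster variances and proportions.
        wsum = w.sum(axis=0)                                            # (k,)
        m = (w[:, :, None] * x[:, None, :]).sum(axis=0) / wsum[:, None]
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)         # distances to updated means
        sigma2 = (w * d2).sum(axis=0) / (I * wsum)
        pi = wsum / n
    return m, sigma2, pi, w
```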

Page 20

Similarity of soft k-means with EM algorithm

• Now assume a data set consisting of Gaussian variables x1,…,xn with means among the set {m1,…,mk} and standard deviations in the set {σ1,…,σk}. We write the log likelihood as:

$$\log L = \sum_{i=1}^{n} \log g\left(x_i \mid m_{z(i)}, \sigma_{z(i)}\right)$$

• Maximum likelihood estimators maximize this over the parameters. We just differentiate with respect to each μ and σ and set the result to 0.

Page 21

EM algorithm continued

• The critical equations for the μ’s and σ’s are:

$$\hat{m}_j = \frac{\sum_{i:\, Z_i = j} x_i}{\#\{i : Z_i = j\}}; \qquad \hat{\sigma}_j^2 = \frac{\sum_{i:\, Z_i = j} (x_i - \hat{m}_j)^2}{\#\{i : Z_i = j\}}$$

• But, we don’t know what the Z’s are. Therefore, we substitute estimates for them.

• In place of a hard assignment, we substitute the probabilities:

$$P(Z_i = j \mid x_i, \mathbf{m}, \boldsymbol{\sigma}) = \frac{g(x_i \mid m_j, \sigma_j)\, P(Z_i = j)}{\sum_{t=1,\ldots,k} g(x_i \mid m_t, \sigma_t)\, P(Z_i = t)}$$

Page 22

Bayes Rule

• Above, we have used Bayes rule:

• For events A1,…,Ad whose probabilities sum to 1, and an event B:

$$P(A_i \mid B) = \frac{P(B \text{ and } A_i)}{P(B)}; \qquad P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{t=1,\ldots,d} P(B \mid A_t)\, P(A_t)}$$
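A small numeric instance of the rule (the events and probabilities are invented purely for illustration):

```python
# Two events A1, A2 with P(A1)=0.4, P(A2)=0.6, and P(B|A1)=0.9, P(B|A2)=0.2.
p_a = [0.4, 0.6]
p_b_given_a = [0.9, 0.2]

# Total probability: P(B) = sum_t P(B|A_t) P(A_t) = 0.48.
p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))

# Posteriors P(A_i|B) = P(B|A_i) P(A_i) / P(B) = [0.75, 0.25].
posterior = [pb * pa / p_b for pb, pa in zip(p_b_given_a, p_a)]
```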

Page 23

Bayes Rule for Discrete variables

• For random variables, Bayes rule becomes the following. For discrete random variables X and Y:

$$P(Y = l \mid X) = \frac{P(Y = l \text{ and } X)}{P(X)}; \qquad P(Y = l \mid X) = \frac{P(X \mid Y = l)\, P(Y = l)}{\sum_{t=1,\ldots,d} P(X \mid Y = t)\, P(Y = t)}$$

Page 24

Bayes Rule for Continuous variables

• For random variables, Bayes rule becomes the following. For continuous random variables X and Y:

$$f(y \mid x) = \frac{f(x, y)}{f(x)}; \qquad f(y \mid x) = \frac{f(x \mid y)\, f(y)}{\int f(x \mid u)\, f(u)\, du}; \qquad f(x) = \int f(x \mid u)\, f(u)\, du$$

Page 25

EM (recontinued)

• Finally, we substitute π_f for the probability P(Z_i = f) (i = 1, …, n; f = 1, …, k). This adds additional parameters, which can be updated via:

$$\hat{\pi}_f = \frac{\sum_{i=1,\ldots,n} P(Z_i = f \mid x_i, \mathbf{m}, \boldsymbol{\sigma})}{\sum_{t=1,\ldots,k} \sum_{i=1,\ldots,n} P(Z_i = t \mid x_i, \mathbf{m}, \boldsymbol{\sigma})}$$

• This is the assignment step in the soft k-means algorithm; in EM it is called the E step.

Page 26

EM algorithm concluded

• Substituting, we get:

$$\hat{m}_j = \frac{\sum_{i=1,\ldots,n} x_i\, P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}{\sum_{i=1,\ldots,n} P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}; \qquad \hat{\sigma}_j^2 = \frac{\sum_{i=1,\ldots,n} (x_i - \hat{m}_j)^2\, P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}{\sum_{i=1,\ldots,n} P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}$$

• This is the update step in the soft k-means algorithm. In EM it is called the M step.
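Putting the E and M steps together for a one-dimensional Gaussian mixture gives a complete sketch of the algorithm as described above (all names are mine):

```python
import numpy as np

def em_gaussian_mixture(x, m, sigma, pi, n_iters=100):
    """EM for a 1-D Gaussian mixture. x: (n,) data; m, sigma, pi: (k,) parameters."""
    m, sigma, pi = (np.array(a, dtype=float) for a in (m, sigma, pi))
    for _ in range(n_iters):
        # E step: responsibilities P(Z_i = j | x_i, m, sigma), proportional to pi_j * g(x_i | m_j, sigma_j).
        dens = np.exp(-(x[:, None] - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)   # (n, k)
        # M step: weighted updates for the means, standard deviations, and proportions.
        rsum = r.sum(axis=0)                # (k,)
        m = (r * x[:, None]).sum(axis=0) / rsum
        sigma = np.sqrt((r * (x[:, None] - m) ** 2).sum(axis=0) / rsum)
        pi = rsum / len(x)
    return m, sigma, pi
```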

Page 27

Assuming a single stiffness parameter

• We get:

$$P(Z_i = j \mid x_i, \mathbf{m}, \sigma) = \frac{P(Z_i = j)\, \exp\left(-\frac{(x_i - m_j)^2}{2\sigma^2}\right)}{\sum_{t=1,\ldots,k} P(Z_i = t)\, \exp\left(-\frac{(x_i - m_t)^2}{2\sigma^2}\right)}$$

• Now, a little bit of algebra shows that we are back at the soft k-means formulation, with σ² = 1/β.

Page 28

Does EM work?

• As presented above: no!

• The problem is typically that:

• 1. Assuming a single stiffness (standard deviation) means that we will not properly capture large/small components.

• 2. Assuming multiple standard deviations, the resulting σ’s get mis-estimated because they are very sensitive to mis-estimation of the component means.

Page 29

Possible Projects

• Show by simulating a mixture distribution with small and large components that both soft k-means versions 1 and 2 fail to work under certain settings. (Hint: suppose you put m = an isolated point.)

• What happens if you have ‘non-aligned’ components (i.e., they are stretched at a different angle from the other components)? What will the EM algorithm described above do? (See the sketch below.)
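For the second project, ‘non-aligned’ components can be simulated by stretching a spherical Gaussian and rotating it (the angles, scales, and means below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def rotated_gaussian(n, mean, scales, angle):
    """Sample n 2-D points from a Gaussian stretched by `scales` and rotated by `angle`."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return rng.normal(size=(n, 2)) * scales @ R.T + mean

# Two components stretched at different angles.
a = rotated_gaussian(200, [0.0, 0.0], [3.0, 0.5], 0.0)
b = rotated_gaussian(200, [5.0, 5.0], [3.0, 0.5], np.pi / 3)
data = np.vstack([a, b])
```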