
Page 1

First topic: clustering and pattern recognition

Marc Sobel

Page 2

Examples of Patterns

Pattern discovery and association

Statistics show connections between the shape of an adult's face and his or her character. There is also evidence that the outline of a child's face is related to alcohol abuse during pregnancy.

Page 3

Examples of Patterns

Crystal patterns at atomic and molecular levels

Their structures are represented by 3D graphs and can be described by a deterministic grammar or formal language.

Page 4

Examples of Patterns with clusters

We may understand patterns of brain activity and find relationships between brain activities, cognition, and behaviors.

Patterns of brain activities:

Page 5

What is a pattern? (see Grenander for a full theory)

In plain language, a pattern is a set of instances which share some regularities, and are similar to each other in the set. A pattern should occur repeatedly. A pattern is observable, sometimes partially, by some sensors with noise and distortions.

How do we define “regularity”? How do we define “similarity”? How do we define “likelihood” for the repetition of a pattern? How do we model the sensors?

One feature of patterns is that they result from similar groups of data called clusters. How can we identify these clusters?

Page 6

Hard K-means algorithm (Greedy)

• Start with points x1,…,xn to be clustered and mean (cluster) centers m1[0],…,mk[0] (at time t=0).

Iterate the following steps:

1) Assign x’s to the means according to their minimum distance from them. Let Zi=l iff xi gets assigned to class l.

(i=1,…,n; l=1,…,k).

2) Update the means according to:

$$m_j[t+1] = \frac{\sum_{i:\, Z_i = j} x_i}{\#\{i : Z_i = j\}} \qquad (j = 1, \ldots, k)$$
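A minimal sketch of this greedy two-step iteration in Python/NumPy (the function name and fixed iteration count are my own choices, not from the slides):

```python
import numpy as np

def hard_kmeans(x, m, n_iters=100):
    """Greedy hard k-means. x: (n, d) points; m: (k, d) initial cluster centers."""
    m = np.array(m, dtype=float)
    for _ in range(n_iters):
        # 1) Assignment: Z_i = l iff x_i is closest to center m_l.
        dists = np.linalg.norm(x[:, None, :] - m[None, :, :], axis=2)  # (n, k)
        z = dists.argmin(axis=1)
        # 2) Update: each center becomes the mean of the points assigned to it.
        for j in range(len(m)):
            if np.any(z == j):
                m[j] = x[z == j].mean(axis=0)
    return m, z
```

In practice one stops when the assignments z no longer change rather than after a fixed number of iterations.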

Page 7

Greedy versus nongreedy algorithms

• Greedy algorithms optimize an objective function at each step. The objective function for the k-means algorithm is:

$$O(x, m) = \sum_{i=1}^{n} \left\| x_i - m_{j[i]} \right\|^2$$

• where m_{j[i]} denotes the cluster center nearest to the point x_i (i = 1, …, n).

• Greedy algorithms are useful but (without external support) are subject to many problems, like overfitting, selecting ‘local’ in place of global optima, etc.

Page 8

Problems with hard k-means

• 1. Convergence depends on the starting point.

• 2. k-means is a hard assignment algorithm, which means that both important points and outliers play a similar role in assignment.

• 3. Components with different sizes induce a strong bias on the classification.

• 4. The distance used plays an enormous role in the kind of clusters which result (e.g., with the Minkowski distance d(xi, xj) = ‖xi − xj‖^α, the exponent α has an effect; see the sketch below). Possible project.
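To see the effect of α mentioned in item 4, one can compare Minkowski distances directly (a toy illustration; the helper below is mine, and the values in the comment are what it prints):

```python
import numpy as np

def minkowski(xi, xj, alpha):
    """Minkowski distance ||xi - xj||_alpha."""
    return np.sum(np.abs(xi - xj) ** alpha) ** (1.0 / alpha)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
for alpha in (1, 2, 10):
    # Prints 7.0, 5.0, ~4.02: larger alpha is dominated by the largest coordinate gap.
    print(alpha, minkowski(a, b, alpha))
```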

Page 9

Example: Dependence on initial condition

Page 10

The effects of size dissimilarity on the k-means algorithm

Page 11

Soft k-means Version 1: An improvement?

• In soft k-means, we assign points to clusters with certain probabilities or weights rather than in the usual hard manner:

$$w_{i,j} = \frac{\exp\left(-\beta\, d(x_i, m_j)^2\right)}{\sum_{l=1}^{k} \exp\left(-\beta\, d(x_i, m_l)^2\right)}$$

• Here the parameter β is either known, estimated prior to implementation, or iteratively estimated.

Page 12

Soft k-means

• We update the means by:

$$m_j = \frac{\sum_{i=1}^{n} w_{i,j}\, x_i}{\sum_{i=1}^{n} w_{i,j}} \qquad (j = 1, \ldots, k).$$

• This way, points which lie between clusters get assigned partially to ‘both of them’ and hence play a dual role.
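A sketch of one iteration, combining the assignment weights from the previous page with this mean update, assuming squared Euclidean distance (all names are mine):

```python
import numpy as np

def soft_kmeans_v1(x, m, beta, n_iters=100):
    """Soft k-means version 1. x: (n, d) points; m: (k, d) centers; beta: stiffness."""
    m = np.array(m, dtype=float)
    for _ in range(n_iters):
        # Weights: w_ij proportional to exp(-beta * d(x_i, m_j)^2), normalized over j.
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        w = np.exp(-beta * d2)
        w /= w.sum(axis=1, keepdims=True)
        # Update: each center is the weighted mean of all points.
        m = (w[:, :, None] * x[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
    return m, w
```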

Page 13

Soft k-means Version 1 (continued)

• Typically, the parameter β (called the stiffness) plays a significant role. As β goes to infinity, the algorithm tends to hard k-means. As β tends to 0, the algorithm tends to assign points randomly to clusters. If β tends to minus infinity, the algorithm assigns points to clusters far away from them. (Possible Project)

Page 14

Stiffness Assignment

• Typically, because β is an information-type parameter (the bigger it is, the more information is used about the points) and since 1/σ² also measures the amount of information the data provide, we assign β = 1/σ². Possible Project: What impact does the use of different stiffness parameters have on clustering a particular data set?

Page 15

The effects of using a stiffness for different values of β when it is assigned its ‘information’ value

Page 16

Possible Projects

• 1. What happens to the clusters (under soft clustering version 1 with information assignment) when we start with data which is a mixture of two Gaussians? A mixture of Gaussians just means that, with a certain probability, a data point comes from one Gaussian and, with one minus that probability, from another Gaussian (see the sketch after this list).

• 2. What happens to the clusters when we have data which consists of two independent Gaussians?
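For the first project, a mixture of two Gaussians can be simulated along these lines (the means, standard deviations, and mixing probability below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.3  # sample size and mixing probability, chosen arbitrarily

# With probability p draw from N(-2, 0.5^2); otherwise draw from N(3, 1^2).
component = rng.random(n) < p
data = np.where(component,
                rng.normal(-2.0, 0.5, n),
                rng.normal(3.0, 1.0, n))
```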

Page 17

Gaussian Distributions

One-dimensional Gaussian distribution:

$$g(x \mid m, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$

m is the mean and σ is the standard deviation.

Multidimensional Gaussian distribution:

$$g(\mathbf{x} \mid \mathbf{m}, \Sigma) = \frac{1}{(2\pi)^{I/2}\, |\Sigma|^{1/2}}\, \exp\left(-\frac{1}{2}\, (\mathbf{x} - \mathbf{m})'\, \Sigma^{-1}\, (\mathbf{x} - \mathbf{m})\right)$$

Σ is the covariance matrix and I is the dimension.
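Both densities are straightforward to evaluate, e.g. with NumPy (a sketch; the function names are mine):

```python
import numpy as np

def g1(x, m, sigma):
    """One-dimensional Gaussian density g(x | m, sigma)."""
    return np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gd(x, m, cov):
    """Multidimensional Gaussian density g(x | m, Sigma)."""
    d = len(m)
    diff = x - m
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm
```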

Page 18

Independent Gaussians

• In the case of independent spherical Gaussians with common σ, the covariance matrix is:

$$\Sigma = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$

Page 19

Soft k-means Version 2

• In soft k-means version 2, we assign different stiffness values and different proportions to each separate cluster. We now use the notation σ instead of β.

$$w_{i,j} = \frac{\pi_j\, \frac{1}{(2\pi\sigma_j^2)^{I/2}}\, \exp\left(-\frac{d(x_i, m_j)^2}{2\sigma_j^2}\right)}{\sum_{t=1}^{k} \pi_t\, \frac{1}{(2\pi\sigma_t^2)^{I/2}}\, \exp\left(-\frac{d(x_i, m_t)^2}{2\sigma_t^2}\right)};$$

$$m_j = \frac{\sum_{i=1}^{n} w_{i,j}\, x_i}{\sum_{i=1}^{n} w_{i,j}}; \qquad \sigma_j^2 = \frac{\sum_{i=1}^{n} w_{i,j}\, d(x_i, m_j)^2}{I\, \sum_{i=1}^{n} w_{i,j}}; \qquad \pi_j = \frac{\sum_{i=1}^{n} w_{i,j}}{n}$$
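A sketch of one version 2 iteration under the reconstruction above, with per-cluster variances and proportions (I is the dimension; all names are mine):

```python
import numpy as np

def soft_kmeans_v2(x, m, sigma2, pi, n_iters=100):
    """x: (n, I) points; m: (k, I) centers; sigma2: (k,) variances; pi: (k,) proportions."""
    n, I = x.shape
    m = np.array(m, dtype=float)
    for _ in range(n_iters):
        # Weights: w_ij proportional to pi_j * (2*pi*sigma2_j)^(-I/2) * exp(-d2_ij / (2*sigma2_j)).
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)         # (n, k)
        w = pi * (2 * np.pi * sigma2) ** (-I / 2) * np.exp(-d2 / (2 * sigma2))
        w /= w.sum(axis=1, keepdims=True)
        # Updates: weighted means, then per-cluster variances and proportions.
        wsum = w.sum(axis=0)                                            # (k,)
        m = (w[:, :, None] * x[:, None, :]).sum(axis=0) / wsum[:, None]
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)         # distances to updated means
        sigma2 = (w * d2).sum(axis=0) / (I * wsum)
        pi = wsum / n
    return m, sigma2, pi, w
```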

Page 20

Similarity of soft k-means with EM algorithm

• Now assume a data set consisting of Gaussian variables x1,…,xn with means among the set {m1,…,mk} and standard deviations in the set {σ1,…,σk}. We write the log likelihood as:

$$\log L = \sum_{i=1}^{n} \log g\left(x_i \mid m_{z(i)}, \sigma_{z(i)}\right)$$

• Maximum likelihood estimators maximize this over the parameters. We just differentiate with respect to each μ and σ and set the result to 0.

Page 21

EM algorithm continued

• The critical equations for the μ’s and σ’s are:

$$\hat{m}_j = \frac{\sum_{i:\, Z_i = j} x_i}{\#\{i : Z_i = j\}}; \qquad \hat{\sigma}_j^2 = \frac{\sum_{i:\, Z_i = j} (x_i - \hat{m}_j)^2}{\#\{i : Z_i = j\}}$$

• But, we don’t know what the Z’s are. Therefore, we substitute estimates for them.

• In place of a hard assignment, we substitute the probabilities:

$$P(Z_i = j \mid x_i, \mathbf{m}, \boldsymbol{\sigma}) = \frac{g(x_i \mid m_j, \sigma_j)\, P(Z_i = j)}{\sum_{t=1,\ldots,k} g(x_i \mid m_t, \sigma_t)\, P(Z_i = t)}$$

Page 22

Bayes Rule

• Above, we have used Bayes rule:

• For events A1,…,Ad whose probabilities sum to 1, and an event B:

$$P(A_i \mid B) = \frac{P(B \text{ and } A_i)}{P(B)}; \qquad P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{t=1,\ldots,d} P(B \mid A_t)\, P(A_t)}$$
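A small numeric instance of the rule (the events and probabilities are invented purely for illustration):

```python
# Two events A1, A2 with P(A1)=0.4, P(A2)=0.6, and P(B|A1)=0.9, P(B|A2)=0.2.
p_a = [0.4, 0.6]
p_b_given_a = [0.9, 0.2]

# Total probability: P(B) = sum_t P(B|A_t) P(A_t) = 0.48.
p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))

# Posteriors P(A_i|B) = P(B|A_i) P(A_i) / P(B) = [0.75, 0.25].
posterior = [pb * pa / p_b for pb, pa in zip(p_b_given_a, p_a)]
```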

Page 23

Bayes Rule for Discrete variables

• For random variables, Bayes rule becomes the following. For discrete random variables X and Y:

$$P(Y = l \mid X) = \frac{P(Y = l \text{ and } X)}{P(X)}; \qquad P(Y = l \mid X) = \frac{P(X \mid Y = l)\, P(Y = l)}{\sum_{t=1,\ldots,d} P(X \mid Y = t)\, P(Y = t)}$$

Page 24

Bayes Rule for Continuous variables

• For random variables, Bayes rule becomes the following. For continuous random variables X and Y:

$$f(y \mid x) = \frac{f(x, y)}{f(x)}; \qquad f(y \mid x) = \frac{f(x \mid y)\, f(y)}{\int f(x \mid u)\, f(u)\, du}; \qquad f(x) = \int f(x \mid u)\, f(u)\, du$$

Page 25

EM (recontinued)

• Finally, we substitute π_f for the probability P(Z_i = f) (i = 1, …, n; f = 1, …, k). This adds additional parameters, which can be updated via:

$$\hat{\pi}_f = \frac{\sum_{i=1,\ldots,n} P(Z_i = f \mid x_i, \mathbf{m}, \boldsymbol{\sigma})}{\sum_{t=1,\ldots,k} \sum_{i=1,\ldots,n} P(Z_i = t \mid x_i, \mathbf{m}, \boldsymbol{\sigma})}$$

• This is the assignment step in the soft k-means algorithm; in EM it is called the E step.

Page 26

EM algorithm concluded

• Substituting, we get:

$$\hat{m}_j = \frac{\sum_{i=1,\ldots,n} x_i\, P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}{\sum_{i=1,\ldots,n} P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}; \qquad \hat{\sigma}_j^2 = \frac{\sum_{i=1,\ldots,n} (x_i - \hat{m}_j)^2\, P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}{\sum_{i=1,\ldots,n} P(Z_i = j \mid \mathbf{x}, \mathbf{m}, \boldsymbol{\sigma})}$$

• This is the update step in the soft k-means algorithm. In EM it is called the M step.
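Putting the E and M steps together for a one-dimensional Gaussian mixture gives a complete sketch of the algorithm as described above (all names are mine):

```python
import numpy as np

def em_gaussian_mixture(x, m, sigma, pi, n_iters=100):
    """EM for a 1-D Gaussian mixture. x: (n,) data; m, sigma, pi: (k,) parameters."""
    m, sigma, pi = (np.array(a, dtype=float) for a in (m, sigma, pi))
    for _ in range(n_iters):
        # E step: responsibilities P(Z_i = j | x_i, m, sigma), proportional to pi_j * g(x_i | m_j, sigma_j).
        dens = np.exp(-(x[:, None] - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)   # (n, k)
        # M step: weighted updates for the means, standard deviations, and proportions.
        rsum = r.sum(axis=0)                # (k,)
        m = (r * x[:, None]).sum(axis=0) / rsum
        sigma = np.sqrt((r * (x[:, None] - m) ** 2).sum(axis=0) / rsum)
        pi = rsum / len(x)
    return m, sigma, pi
```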

Page 27

Assuming a single stiffness parameter

• We get:

$$P(Z_i = j \mid x_i, \mathbf{m}, \sigma) = \frac{P(Z_i = j)\, \exp\left(-\frac{(x_i - m_j)^2}{2\sigma^2}\right)}{\sum_{t=1,\ldots,k} P(Z_i = t)\, \exp\left(-\frac{(x_i - m_t)^2}{2\sigma^2}\right)}$$

• Now, a little bit of algebra shows that we are back at the soft k-means formulation, with σ² = 1/β.

Page 28

Does EM work?

• As presented above: no!

• The problem is typically that:

• 1. Assuming a single stiffness (standard deviation) means that we will not properly capture large/small components.

• 2. Assuming multiple standard deviations, the resulting σ’s get mis-estimated because they are very sensitive to mis-estimation of the component means.

Page 29

Possible Projects

• Show by simulating a mixture distribution with small and large components that both soft k-means versions 1 and 2 fail to work under certain settings. (Hint: suppose you put m = an isolated point.)

• What happens if you have ‘non-aligned’ components (i.e., they are stretched at a different angle from the other components)? What will the EM algorithm described above do? (See the sketch below.)
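For the second project, ‘non-aligned’ components can be simulated by stretching a spherical Gaussian and rotating it (the angles, scales, and means below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def rotated_gaussian(n, mean, scales, angle):
    """Sample n 2-D points from a Gaussian stretched by `scales` and rotated by `angle`."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return rng.normal(size=(n, 2)) * scales @ R.T + mean

# Two components stretched at different angles.
a = rotated_gaussian(200, [0.0, 0.0], [3.0, 0.5], 0.0)
b = rotated_gaussian(200, [5.0, 5.0], [3.0, 0.5], np.pi / 3)
data = np.vstack([a, b])
```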