
Page 1

Topic 1

Clustering Basics

CS898

Page 2

Overview

Basics (K-means)

• variance clustering

• generalizations (parametric & non-parametric)

Kernel K-means

Probabilistic K-means,

• entropy clustering

Normalized Cut

Density biases

Spectral methods, bound optimization

Page 3

In the beginning there was…

Basic K-means objective function (squared L2 norm):

$$E(S, m) \;=\; \sum_{k=1}^{K} \sum_{p \in S_k} \|f_p - m_k\|^2$$

input: features f_p; output: K subsets S_k of the points; extra parameters: the K means m_k

In this talk, "K-means" refers mostly to this or related objectives (not to the iterative Lloyd's algorithm, 1957)
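The objective above can be sketched in a few lines of NumPy. The data, K, and iteration count below are illustrative, and Lloyd's algorithm (shown here for contrast with the objective itself) only finds a local minimum:

```python
import numpy as np

def kmeans_energy(F, labels, means):
    # objective E(S, m) = sum_k sum_{p in S_k} ||f_p - m_k||^2
    return sum(np.sum((F[labels == k] - means[k]) ** 2)
               for k in range(len(means)))

def lloyd(F, K, iters=20, seed=0):
    # Lloyd's algorithm: alternate nearest-mean assignment and mean update;
    # a local minimizer of the objective above, not a global one
    rng = np.random.default_rng(seed)
    means = F[rng.choice(len(F), K, replace=False)]
    for _ in range(iters):
        dists = ((F[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        means = np.stack([F[labels == k].mean(0) if (labels == k).any()
                          else means[k] for k in range(K)])
    return labels, means

# two well-separated 2D blobs (toy data)
rng = np.random.default_rng(1)
F = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, means = lloyd(F, K=2)
energy = kmeans_energy(F, labels, means)
```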

Page 4

Basic K-means examples:

RGB features

color quantization

RGBXY features

superpixels

XY features only

Voronoi cells

compared to RGB only, XY adds spatial "compactness" (quasi-regularization)

Page 5

Apply K-means to RGBXY features

Basic K-means examples:

Superpixels

[SLIC superpixels, Achanta et al., PAMI 2011]
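A hypothetical sketch of forming RGBXY features for superpixel-style clustering. The spatial weight below is my own illustrative knob playing the role of SLIC's compactness parameter, and the toy image is not from the slide:

```python
import numpy as np

def rgbxy_features(img, spatial_weight=1.0):
    # stack color and (weighted) pixel coordinates: a larger weight gives
    # more compact, Voronoi-like superpixels; weight 0 is plain color
    # clustering (quantization)
    h, w, _ = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    xy = np.stack([yy, xx], -1).reshape(-1, 2) * spatial_weight
    return np.hstack([img.reshape(-1, 3), xy])

# toy 8x8 "image": left half dark, right half bright
img = np.zeros((8, 8, 3))
img[:, 4:] = 1.0
F = rgbxy_features(img, spatial_weight=0.1)
```

These 5-dimensional features would then be fed to any K-means routine.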

Page 6

K-means as non-parametric clustering

$$E(S) \;=\; \sum_{k=1}^{K} \sum_{p \in S_k} \|f_p - \mu_{S_k}\|^2 \;=\; \sum_{k=1}^{K} \sum_{p,q \in S_k} \frac{\|f_p - f_q\|^2}{2\,|S_k|}$$

equivalent (easy to check): these are the two standard formulas for sample variance; just plug in

$$\mu_{S_k} \;=\; \frac{1}{|S_k|} \sum_{q \in S_k} f_q$$

The pairwise form has no parameters μ_k.
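The equivalence of the two variance formulas can be checked numerically on a single cluster; the data here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(30, 4))          # one cluster S_k of features f_p
mu = S.mean(0)                        # mu_k = (1/|S_k|) sum_q f_q

# parametric form: sum_p ||f_p - mu_k||^2
parametric = np.sum((S - mu) ** 2)

# non-parametric (pairwise) form: sum_{p,q} ||f_p - f_q||^2 / (2 |S_k|)
D = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
pairwise = D.sum() / (2 * len(S))
```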

Page 7

K-means as variance clustering criterion

$$E(S) \;=\; \sum_{k=1}^{K} |S_k| \cdot \mathrm{var}(S_k)$$

Both objectives on the previous page can be written this way

=> K-means is good for "compact blobs"

Page 8

K-means – common extensions

• Parametric methods with an arbitrary distortion measure ||·||_d (distortion clustering):

$$E(S, \theta) \;=\; \sum_{k=1}^{K} \sum_{p \in S_k} \|f_p - \theta_k\|_d$$

Examples of ||·||_d: quadratic (K-means), absolute (K-medians), truncated (K-modes)

• Parametric methods with arbitrary likelihoods P(·|θ) (probabilistic K-means) [Kearns, Mansour & Ng, UAI'97]:

$$E(S, \theta) \;=\; -\sum_{k=1}^{K} \sum_{p \in S_k} \log P(f_p \mid \theta_k)$$

Examples of P(·|θ): Gaussian, gamma, exponential, Gibbs, etc. E.g. for a Gaussian, P ~ exp(−||f − μ||²/2σ²), which recovers the distortion ~ ||f_p − μ_k||². Could be juxtaposed with GMM/EM as hard clustering via ML parameter fitting.

• Non-parametric (pairwise) methods with any kernel or affinity measure k(x, y) (kernel K-means, average association, average distortion, normalized cut): replace dot products by an arbitrary kernel k
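As one distortion-clustering instance, the absolute (L1) distortion makes the optimal per-cluster model the coordinate-wise median (K-medians). The data and the fixed labeling below are illustrative:

```python
import numpy as np

def kmedians_step(F, labels, K):
    # model update for absolute (L1) distortion: the coordinate-wise median
    # minimizes sum_p |f_p - theta| within each cluster
    return np.stack([np.median(F[labels == k], axis=0) for k in range(K)])

rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))])
labels = np.r_[np.zeros(40, int), np.ones(40, int)]

centers = kmedians_step(F, labels, K=2)
# total L1 distortion at the medians vs. at the means
l1_median = sum(np.abs(F[labels == k] - centers[k]).sum() for k in range(2))
l1_mean = sum(np.abs(F[labels == k] - F[labels == k].mean(0)).sum()
              for k in range(2))
```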

Page 9

Probabilistic K-means Example:

Elliptic K-means

for a Normal (Gaussian) distribution, the distortion is the (squared) Mahalanobis distance

Examples:

a) Z, a normal random vector with mean m and covariance Σ

b) X = AZ + m, for an arbitrary vector m and matrix A: the distribution of X
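A sketch of the (squared) Mahalanobis distortion for a sample X = AZ + m. The particular A and m are illustrative; the sample mean of the squared distance should be near the dimension (a chi-squared fact), used here only as a sanity check:

```python
import numpy as np

def mahalanobis_sq(F, mu, Sigma):
    # squared Mahalanobis distance (f - mu)' Sigma^{-1} (f - mu),
    # the distortion used by elliptic K-means (Gaussian log-likelihood
    # up to log-det and constant terms)
    d = F - mu
    return np.einsum('ni,ij,nj->n', d, np.linalg.inv(Sigma), d)

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.0, 0.5]])
m = np.array([1.0, -1.0])
Z = rng.normal(size=(5000, 2))        # standard normal vectors
X = Z @ A.T + m                       # X = A Z + m, normal with cov A A'
Sigma = A @ A.T
d2 = mahalanobis_sq(X, m, Sigma)
```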

Page 10

Probabilistic K-means Example:

Elliptic K-means

Basic K-means

(squared) Mahalanobis distance for a Normal (Gaussian) distribution

Page 11

Probabilistic K-means Example:

Elliptic K-means

Elliptic K-means

(squared) Mahalanobis distance for a Normal (Gaussian) distribution

Page 12

Probabilistic K-means Example:

Entropy Clustering

$$E(S, \theta) \;=\; -\sum_{k=1}^{K} \sum_{p \in S_k} \log P(f_p \mid \theta_k)$$

The inner sum is a Monte-Carlo estimation formula for the cross entropy between the data in S_k and the model P(·|θ_k). Using "optimal" distributions θ_k that minimize the cross entropy, we get the entropy clustering criterion:

$$E(S) \;=\; \sum_{k=1}^{K} |S_k| \, H(S_k)$$

This requires a sufficiently descriptive (complex) class of probability models that can fit the data well.

Page 13

Probabilistic K-means:

summary

- model fitting (to data)
- log-likelihood (model) parameter estimation
- complex data requires complex models

Basic K-means works only for compact clusters (blobs) that are linearly separable.

Page 14

from complex models

towards complex embeddings

Page 15

From basic K-means to kernel K-means

(high-dimensional embedding story)

Example:

data can become linearly separable after some non-linear embedding

(typically in high dimensional space)

for some (non-linear) embedding function

Page 16

From basic K-means to kernel K-means (high-dimensional embedding story)

Assume for now that such an embedding is given, via some (non-linear) embedding function.

(explicit) K-means procedure (update at time t+1), in an equivalent matrix formulation: the embedded points form a dim(H) × |Ω| embedding matrix; S_k^t denotes cluster k at iteration t, with its |Ω|-dimensional indicator vector.

Page 17

equivalent formulation:

for some (non-linear) embedding function

Gram matrix

dot products

- cluster k at iteration t

(explicit) K-means procedure:

(update at time t+1)

From basic K-means to kernel K-means

(high-dimensional embedding story)

Assume for now that such embedding is given

Page 18

for some (non-linear) embedding function

(implicit) kernel K-means procedure:

(update at time t+1)

Requires only the kernel matrix K; no need to know the explicit embedding Φ

Gram matrix

- cluster k at iteration t

(explicit) K-means procedure:

(update at time t+1)

From basic K-means to kernel K-means

(high-dimensional embedding story)

equivalent

Assume for now that such embedding is given
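The implicit procedure rests on the standard expansion of point-to-center distances in the embedding space, which needs only the Gram matrix. With a linear kernel this must reproduce the explicit K-means distances, which the sketch below checks on toy data:

```python
import numpy as np

def kernel_dists(Kmat, labels, K):
    # ||phi_p - mu_k||^2 = K_pp - 2 sum_{q in S_k} K_pq / |S_k|
    #                      + sum_{q,r in S_k} K_qr / |S_k|^2
    # needs only the kernel (Gram) matrix, never the embedding phi itself
    n = len(Kmat)
    D = np.empty((n, K))
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        D[:, k] = (np.diag(Kmat)
                   - 2 * Kmat[:, idx].mean(1)
                   + Kmat[np.ix_(idx, idx)].mean())
    return D

rng = np.random.default_rng(0)
F = rng.normal(size=(20, 3))
labels = np.r_[np.zeros(10, int), np.ones(10, int)]

# linear kernel: the implicit distances must equal explicit K-means distances
Kmat = F @ F.T
D_implicit = kernel_dists(Kmat, labels, 2)
mus = np.stack([F[labels == k].mean(0) for k in range(2)])
D_explicit = ((F[:, None, :] - mus[None]) ** 2).sum(-1)
```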

Page 19

Kernel trick: start with (any ?) kernel K

If we start from given pairwise affinities (kernel matrix K), it may still be useful to think about the embedding implicitly defined by the kernel (via the decomposition K = Φ′Φ).

(Mercer theorem: any p.s.d. kernel can be decomposed that way)

Q: why even worry about embedding Φ when using kernel K-means procedure?

A: (HINT) Think about convergence. What do we minimize via kernel K-means procedure?

Kernel Trick: p.s.d. kernels K are a standard way to (implicitly) define some high-dimensional embedding Φ (corresponding to the decomposition K = Φ′Φ)

Q: what is dimension of each Φp ?

Example: Gaussian kernel
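For n data points, an explicit embedding for a given p.s.d. kernel matrix can always be built by eigendecomposition (Mercer's theorem restricted to the sample). This also hints at the dimension question: restricted to n points, n dimensions suffice, even though the Gaussian kernel's underlying feature space is infinite-dimensional. A sketch on toy data:

```python
import numpy as np

def mercer_embedding(Kmat):
    # eigendecomposition K = V diag(w) V' gives Phi = diag(sqrt(w)) V',
    # so that K_pq = <Phi_p, Phi_q>; requires w >= 0, i.e. a p.s.d. kernel
    w, V = np.linalg.eigh(Kmat)
    w = np.clip(w, 0, None)           # clip tiny negative round-off
    return (V * np.sqrt(w)).T         # column p is the embedded point Phi_p

rng = np.random.default_rng(0)
F = rng.normal(size=(15, 2))
sq = ((F[:, None] - F[None]) ** 2).sum(-1)
Kmat = np.exp(-sq / 2.0)              # Gaussian kernel matrix (p.s.d.)
Phi = mercer_embedding(Kmat)
```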

Page 20

Kernel-induced embedding:

- isometry

and the corresponding kernel-induced metric:

the kernel defines an inner product (in the original feature space)

The kernel-defined Euclidean embedding is isometric to the original features with the kernel-induced metric.

NOTE:

Page 21

Kernel-induced embedding:

- non-linear separation of original features

high-dimensional isometric embedding induced by kernel K can make clusters linearly separable

original feature space with kernel-induced metric

kernel-induced Euclidean embedding

Intuition for such “magic” behind commonly used kernels (e.g. Gaussian)?

Page 22

From basic K-means to kernel K-means

(robust metric story)

kernel K-means objective:

remember

Page 23

robust metric focuses on local distortion (deemphasizes larger distances)

Examples:

• basic (linear) kernel: squared Euclidean distance ||f_p − f_q||², the distance in standard K-means

• Gaussian kernel: the distance in Gaussian-kernel K-means is ||φ(f_p) − φ(f_q)||² = 2 − 2k(f_p, f_q), which grows from 0 with ||f_p − f_q|| and saturates at 2 for large distances

From basic K-means to kernel K-means

(robust metric story)

kernel K-means objective:

and kernel-induced metric:

remember

Page 24

robust metric focuses on local distortion (deemphasizes larger distances)


From basic K-means to kernel K-means

(robust metric story)

(figure: two clusters S1, S2 and the kernel bandwidth σ)

Page 25

On importance of

positive-semi-definite (p.s.d.) kernels K

- Given any (e.g. non-p.s.d.) kernel, a "diagonal shift" K → K + δI allows one to formulate an equivalent kernel clustering objective with a p.s.d. kernel (for a sufficiently large scalar δ). It is easy to verify the equivalence of the kernel K-means objectives for any scalar δ (they differ by the constant δK), while the kernel K-means procedure is modified by the shift.

- (Mercer theorem) p.s.d. guarantees the existence of an explicit Euclidean embedding Φ such that K = Φ′Φ. This allows one to prove that the implicit kernel K-means procedure converges, due to its equivalence to the convergent explicit K-means procedure for some Φ.
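The "differs by a constant" claim behind the diagonal shift can be verified directly: for any labeling, shifting A by δI raises the average-association value by exactly δK. A numeric spot-check with two arbitrary labelings:

```python
import numpy as np

def avg_assoc(A, labels, K):
    # average association objective: sum_k  S_k' A S_k / |S_k|
    return sum(A[np.ix_(labels == k, labels == k)].sum()
               / (labels == k).sum() for k in range(K))

rng = np.random.default_rng(0)
B = rng.normal(size=(12, 12))
A = (B + B.T) / 2                     # symmetric but possibly non-p.s.d.
delta = 10.0                          # "diagonal shift" A -> A + delta*I

labels_a = np.arange(12) % 3          # two arbitrary 3-way clusterings
labels_b = np.r_[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
gap_a = (avg_assoc(A + delta * np.eye(12), labels_a, 3)
         - avg_assoc(A, labels_a, 3))
gap_b = (avg_assoc(A + delta * np.eye(12), labels_b, 3)
         - avg_assoc(A, labels_b, 3))
```

Since S_k′(δI)S_k / |S_k| = δ for every cluster, the gap is δK regardless of the labeling, so the shifted objective ranks all clusterings identically.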

Page 26

Weak kernel K-means

versus

?


Page 28

Weak kernel K-means

versus

?

Due to isometry, each mean m in the original feature space corresponds to some μ = φ(m) in the embedding space; these give the same solution S, where the two objectives are equal since the embedding is isometric.

The opposite is not true.

Page 29

Weak kernel K-means

The implicit search space for μ (the higher-dimensional embedding space H) is larger than the search space for m (the original feature space); the two objectives agree (due to isometry) whenever μ = φ(m).

Page 30

Weighted K-means and

Weighted kernel K-means

(unary) distortion between a point and a model

(pair-wise) distortion between two points

K=2

Page 31

unary and pair-wise distortion clustering(general weighted case)

pKM

probabilistic K-means(ML model fitting)

kKM

kernel K-means(pairwise clustering)

basic K-means

p.d. kernel distance

weak kernel clustering

(unary) Hilbertian distortion

normalized cuts

K-modes (mean-shift)

GMM fitting

entropy clustering

elliptic K-means

gamma fitting

Gaussian kernel K-means

spectral ratio cuts

distortion

Gibbs fitting

average cut

average association

average distortion

complex models (model parameter fitting) vs. complex embeddings

Page 32

More on (non-parametric) Kernel Clustering

• kernel K-means, average association, Normalized Cuts, …
• density biases: isolation of modes or sparse subsets
• bound optimization


Page 34

kernel K-means: non-parametric (kernel) clustering

fp

fq

A_pq = k(f_p, f_q)

- objective

explicit features f_p are unnecessary

Page 35

kernel K-means: non-parametric (kernel) clustering

only need affinity (or kernel) matrix

A = [Apq ]

p

q

Ω - set of all points(graph nodes)

if necessary, an "embedding" Φ s.t. Φ′Φ = A can be found for p.s.d. A (via eigen decomposition), as suggested by the MERCER THEOREM

Page 36

kernel K-means: non-parametric (kernel) clustering

Ω - set of all points(graph nodes)

S1

S3

S2

Page 37

kernel K-means or average association: non-parametric (kernel) clustering

S1

S3

S2

“self-association” of cluster Sk

Page 38

kernel K-means or average association: non-parametric (kernel) clustering

S1

S3

S2

in matrix notation: the self-association of cluster S_k is S_k′ A S_k, where S_k is its 0/1 indicator vector and ′ means transpose
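The matrix form is just a compact way of writing the double sum over cluster members; a quick check with a 0/1 indicator vector on arbitrary affinities:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 10))
A = (A + A.T) / 2                      # symmetric affinity matrix

members = np.array([0, 2, 5, 7])       # cluster S_k
s = np.zeros(10)
s[members] = 1                         # 0/1 indicator vector of S_k

loop = sum(A[p, q] for p in members for q in members)  # sum_{p,q in S_k} A_pq
matrix = s @ A @ s                                     # S_k' A S_k
assoc = matrix / (s @ s)               # self-association per point, |S_k| = s's
```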


Page 40

kernel K-means or average association: non-parametric (kernel) clustering

e.g. for Gaussian kernel

kernel K-means reduces to basic K-means for large σ

Why?

local "compactness" vs. global "compactness"

Page 41

?

Basic K-means vs Kernel Clustering

Basic K-means

Kernel Clustering

compact blobs in RGB space

segments in RGBXY space (not blobs!)

color quantization

compact blobs in RGBXY space

[Achanta et al., PAMI 2011]

super-pixels

[Shi&Malik 2000]

segmentation

Page 42

kernel K-means or average association: non-parametric (kernel) clustering

kernel K-means

e.g. for Gaussian kernel

density mode isolation [Marin et al., PAMI 2019]

inhomogeneous data density

“tight” clusters [Shi & Malik, PAMI 2000]

for “small” kernels (empirically observed)

by reduction to continuous Gini criterion

mode bias in discrete Gini criterion [Breiman, Machine Learning 1996]

discrete valued data

RGB features (no XY!)

Page 43

kernel K-means or average association: non-parametric (kernel) clustering

properties for kernel bandwidths σ (from 0 to ∞):

• r-small σ: Breiman's bias [Marin et al., PAMI 2019]

• σ near the data-range diameter: reduces to basic K-means, with "linear separation" and "equi-cardinality" bias [Ng et al., UAI'96]

Page 44

kernel K-means or average association: non-parametric (kernel) clustering

properties for kernel bandwidths σ (from 0 to ∞): between r-small σ and the data-range diameter, there may be no good (unbiased) solution

Page 45

kernel K-means or average association: non-parametric (kernel) clustering

A solution: density equalization[Marin et al. PAMI 2019]

Theorem (Density Law, basic form): adaptive bandwidths for data in R^N implicitly transform the data density ρ. No fixed bandwidth will generate this result.

Page 46

Simple density equalization example

Average association with adaptive bandwidths σ_p, as in the Density Law above, using the standard KNN density estimate (the number of nearest neighbors over the volume of the ball containing them, so σ_p is set from the KNN ball radius).

NOTE: same as a heuristic by Zelnik & Perona [NIPS 2004] for another clustering objective.
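A sketch of the adaptive-bandwidth construction. The blob data, the neighbor count, and the 2D ball-volume density formula below are illustrative assumptions:

```python
import numpy as np

def knn_radius(F, Kn):
    # distance from each point to its Kn-th nearest neighbor
    # (column 0 of the sorted distances is the point itself)
    D = np.sqrt(((F[:, None] - F[None]) ** 2).sum(-1))
    return np.sort(D, axis=1)[:, Kn]

rng = np.random.default_rng(0)
F = np.r_[rng.normal(0, 0.2, (100, 2)),   # dense blob
          rng.normal(4, 1.0, (30, 2))]    # sparse blob

Kn = 7
R = knn_radius(F, Kn)
sigma = R                                  # adaptive bandwidth sigma_p ~ R_p^K
# KNN density estimate in 2D: rho_p = Kn / (N * volume of the KNN ball)
rho = Kn / (len(F) * np.pi * R ** 2)
```

Dense regions get small bandwidths and sparse regions get large ones, which is what equalizes the implicit density.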

Page 47

S1

S3

S2

Other kernel (graph) clustering objectives

so far we have only looked at the "self-association" of S_k

Page 48

Other kernel (graph) clustering objectives

“cut” for Sk

S1

S3

S2

S1

S3

S2

so far we have only looked at the "self-association" of S_k

Page 49

Average Cut

Other kernel (graph) clustering objectives

• Ratio Cut, Cheeger cut, isoperimetric number, conductance

• spectral graph theory, electrical flows, random walks

S1

S3

S2

Page 50

Other kernel (graph) clustering objectives

Average Cut


Page 52

Other kernel (graph) clustering objectives

“node degree”

normalization

Average Cut

Normalized Cut

Page 53

Other kernel (graph) clustering objectives

= Normalized Average Association

normalization

Average Cut

Normalized Cut
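Both objectives can be computed from the affinity matrix via node degrees. The sketch below also checks the identity NC = K − (normalized average association) on arbitrary data:

```python
import numpy as np

def ncut(A, labels, K):
    # NC(S) = sum_k cut(S_k, complement) / assoc(S_k, Omega), where
    # cut = volume - self-association and volume = S_k' d with d = A 1
    d = A.sum(1)                           # node degrees
    total = 0.0
    for k in range(K):
        s = (labels == k).astype(float)
        assoc_k = s @ A @ s                # within-cluster association
        vol_k = s @ d                      # assoc(S_k, Omega)
        total += (vol_k - assoc_k) / vol_k
    return total

rng = np.random.default_rng(0)
A = rng.random((8, 8))
A = (A + A.T) / 2                          # symmetric non-negative affinities
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
K = 2
nc = ncut(A, labels, K)

# normalized average association: sum_k (S_k' A S_k) / (d' S_k)
d = A.sum(1)
naa = 0.0
for k in range(K):
    s = (labels == k).astype(float)
    naa += (s @ A @ s) / (s @ d)
```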

Page 54

Summary of

common kernel clustering objectives

Average Cut

Normalized Cut = Normalized Average Association

normalization

Average Association(as discussed earlier)

Average Cut: bias to sparse subsets. Average Association: bias to the density mode.

Page 55

Normalized Cut (NC) [Shi & Malik, 2000]

small bandwidth (2.47): NC cuts isolated points

large bandwidth (2.48): lack of non-linear separability

no fixed bandwidth will generate the good result here; NC still has a bias to sparse subsets (the opposite of the density mode)

Page 56

Normalized Cut (NC) [Zelnik-Perona, 2004]

with density equalization(via locally adaptive bandwidths)

Question: Average Association (kernel K-means) with the locally adaptive kernel (bandwidth 2R_p^K) gives a similar result for such σ_p … why?

Page 57

Average Cut, Normalized Cut, Average Association

Equivalence (after density equalization)

After density equalization [Marin et al. 2019], Average Association (kernel K-means), Average Cut (Cheeger sets), and Normalized Cut all become equivalent; Normalized Cut equals Normalized Average Association up to a constant. For simplicity, assume a "KNN kernel" (locally adaptive, bandwidth 2R_p^K).

Page 58

Optimization

• block-coordinate descent (Lloyd’s algorithm)

• spectral relaxation

• bound optimization

Page 59

Spectral Relaxation (quick overview)

In the context of kernel K-means (average association)

normalized cluster indicators (L2 norm is 1)

Page 60

Spectral Relaxation (quick overview)

In the context of kernel K-means (average association)

Z: an |Ω| × K matrix whose columns are the normalized cluster indicators; each term of the objective is the k-th element on the diagonal of the K × K matrix Z′AZ

Page 61

Spectral Relaxation (quick overview)

In the context of kernel K-means (average association)

The original optimization problem (NP hard) is relaxed: integrality of the indicators S_k is dropped, leaving optimization over a unit sphere, a relaxed problem with a closed-form solution.

Page 62

Spectral Relaxation (quick overview)

optimization over a unit sphere.

relaxed problem (closed form solution)

This is (a generalization of) the Rayleigh quotient problem. Closed-form solution: Z_1, Z_2, …, Z_K are the (unit) eigenvectors of matrix A corresponding to its K largest eigenvalues.
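A small sketch of the relaxation: on a block-structured affinity matrix, the top-K eigenvectors solve the relaxed (trace / Rayleigh quotient) problem, and any feasible normalized-indicator solution is bounded by the relaxed optimum (the Ky Fan inequality):

```python
import numpy as np

rng = np.random.default_rng(0)
# block-structured affinity: two strongly self-associated groups
A = 0.05 * rng.random((10, 10))
A[:5, :5] += 1.0
A[5:, 5:] += 1.0
A = (A + A.T) / 2

w, V = np.linalg.eigh(A)               # ascending eigenvalues
Z = V[:, -2:]                          # unit eigenvectors of 2 largest
relaxed_opt = w[-2:].sum()             # max of tr(Z'AZ) over orthonormal Z

# a feasible solution: normalized indicators of the true blocks
s1 = np.r_[np.ones(5), np.zeros(5)] / np.sqrt(5)
s2 = np.r_[np.zeros(5), np.ones(5)] / np.sqrt(5)
feasible = s1 @ A @ s1 + s2 @ A @ s2
```

The gap between `feasible` and `relaxed_opt` is what discretization heuristics (e.g. K-means on the eigenvector rows) try to close.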

Page 63

Kernel clustering

via

bound optimization

Page 64

Lloyd's algorithm as bound optimization

(explicit) K-means procedure (update at time t+1): Lloyd's algorithm corresponds to a block coordinate descent for the K-means objective E(S, m). Remember the equivalent objective E(S) obtained by minimizing out the parameter m.


Page 66

Lloyd's algorithm as bound optimization

(explicit) K-means procedure (update at time t+1): Lloyd's algorithm corresponds to a block coordinate descent for the K-means objective. Remember the equivalent objective obtained by minimizing out the parameter m; for fixed means, the objective is a linear function w.r.t. S.

Page 67

Lloyd's algorithm as bound optimization

(explicit) K-means procedure (update at time t+1): E(S, m_{S_t}) is a bound for E(S), since E(S, m_{S_t}) ≥ E(S) and E(S_t, m_{S_t}) = E(S_t). First, find S_{t+1} optimal for the bound E(S, m_{S_t}) over binary indicators; then compute the new means m_{S_{t+1}}, which define the next bound E(S, m_{S_{t+1}}).

Page 68

Bound optimization, in general: at step t, minimize a bound A_t(S) ≥ E(S) that touches E(S) at S_t; moving to the bound's minimizer S_{t+1} and constructing the next bound A_{t+1}(S) gives a guaranteed energy decrease E(S_{t+1}) ≤ E(S_t).

Page 69

Kernel bound

Lemma 1 (concavity)

Lemma 1 (concavity): the function e : R^|Ω| → R given by e(S_k) = −S_k′ A S_k / (1′S_k) is concave over the region S_k > 0 for a p.s.d. affinity matrix A := [A_pq].

A bound a_t(S) is then given by the first-order Taylor expansion of e at S^t (a concave function lies below its tangent).

NOTE: optimizing this unary bound for KC (alone) is equivalent to iterative kernel K-means a la [Lloyd '57] (the intuition came from the observation that Lloyd's algorithm is unary bound optimization).
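Taking the concave function in Lemma 1 to be e(s) = −s′As / (1′s) (my reading of the garbled slide, consistent with the average-association objective), the first-order Taylor expansion at any s_t is an upper bound, since a concave function lies below its tangent plane. A numeric spot-check:

```python
import numpy as np

def e(s, A):
    # e(S_k) = - S_k' A S_k / (1' S_k): concave over s > 0 for p.s.d. A
    return -(s @ A @ s) / s.sum()

def taylor_bound(s, s0, A):
    # first-order Taylor expansion of e at s0; an upper bound on e(s)
    # wherever e is concave (tangent plane lies above the function)
    g = -(2 * (A @ s0) * s0.sum() - (s0 @ A @ s0)) / s0.sum() ** 2
    return e(s0, A) + g @ (s - s0)

rng = np.random.default_rng(0)
B = rng.normal(size=(9, 9))
A = B @ B.T                            # p.s.d. affinity matrix
s0 = rng.random(9) + 0.1               # expansion point, s0 > 0

trials = [rng.random(9) + 0.1 for _ in range(50)]
ok = all(taylor_bound(s, s0, A) + 1e-9 >= e(s, A) for s in trials)
```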

Page 70

Kernel bound

Page 71

(approximate)

Spectral bound

Main idea:

• standard eigen analysis (PCA) gives a low-dimensional embedding, or (equivalently) a low-rank matrix approximating the kernel matrix, minimizing the Frobenius error

• use the linear bound from Lemma 1 for this approximate kernel matrix

NOTE: optimizing such a unary bound for KC (alone) is equivalent to iterative K-means [Lloyd's algorithm] over the embedding, a la the discretization heuristic after spectral relaxation methods

Page 72

(approximate)

Spectral bound

Empirical motivation for low dimensional spectral approximation:

NOTE: optimizing this unary bound alone for KC (without regularization) is similar to discretization heuristic (K-means) for spectral relaxation [Shi&Malik, 2000]

(plots: exact KC energy vs. approximate KC energy, in low and in high dimensions, over the progression of the iterative (kernel) K-means algorithm [Lloyd])