
Prototype Methods: K-Means


Jia Li

Department of Statistics, The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/∼jiali


Prototype Methods

- Essentially model-free.
- Not useful for understanding the nature of the relationship between the features and the class outcome.
- They can be very effective as black-box prediction engines.
- Training data: {(x_1, g_1), (x_2, g_2), ..., (x_N, g_N)}. The class labels g_i ∈ {1, 2, ..., K}.
- Represent the training data by a set of points in feature space, also called prototypes.
- Each prototype has an associated class label, and classification of a query point x is made to the class of the closest prototype.
- Methods differ according to how the number and the positions of the prototypes are decided.


K-means

- Assume there are M prototypes, denoted by Z = {z_1, z_2, ..., z_M}.
- Each training sample is assigned to one of the prototypes. Denote the assignment function by A(·). Then A(x_i) = j means the ith training sample is assigned to the jth prototype.

- Goal: minimize the total mean squared error between the training samples and their representative prototypes, that is, the trace of the pooled within-cluster covariance matrix:

  \[
  \arg\min_{Z, A} \sum_{i=1}^{N} \| x_i - z_{A(x_i)} \|^2
  \]


- Denote the objective function by

  \[
  L(Z, A) = \sum_{i=1}^{N} \| x_i - z_{A(x_i)} \|^2 .
  \]

- Intuition: training samples are tightly clustered around the prototypes. Hence, the prototypes serve as a compact representation of the training data.


Necessary Conditions

- If Z is fixed, the optimal assignment function A(·) should follow the nearest-neighbor rule, that is,

  \[
  A(x_i) = \arg\min_{j \in \{1, 2, ..., M\}} \| x_i - z_j \| .
  \]

- If A(·) is fixed, the prototype z_j should be the average (centroid) of all the samples assigned to the jth prototype:

  \[
  z_j = \frac{\sum_{i : A(x_i) = j} x_i}{N_j} ,
  \]

  where N_j is the number of samples assigned to prototype j.


The Algorithm

- Based on the necessary conditions, the k-means algorithm alternates between two steps (a code sketch follows this list):
  - For a fixed set of centroids (prototypes), optimize A(·) by assigning each sample to its closest centroid under Euclidean distance.
  - Update each centroid to the average of all the samples assigned to it.
- The algorithm converges because the objective function is non-increasing after each iteration.
- It usually converges quickly.
- Stopping criterion: stop when the ratio of the decrease in the objective function to the objective function itself falls below a threshold.
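The two alternating steps map directly onto code. Below is a minimal NumPy sketch of the iteration, including the relative-decrease stopping criterion from the last bullet; the function name `kmeans` and the default tolerance are illustrative, not from the original slides.

```python
import numpy as np

def kmeans(X, Z, tol=1e-6, max_iter=100):
    """Minimal k-means sketch. X: (N, p) data; Z: (M, p) initial prototypes."""
    Z = Z.astype(float).copy()
    prev_obj = np.inf
    for _ in range(max_iter):
        # Assignment step: A(x_i) = index of the closest prototype.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        A = dists.argmin(axis=1)
        # Update step: each prototype becomes the centroid of its cluster.
        for j in range(len(Z)):
            members = X[A == j]
            if len(members) > 0:   # leave an empty cluster's centroid unchanged
                Z[j] = members.mean(axis=0)
        # Objective L(Z, A) and relative-decrease stopping criterion.
        obj = ((X - Z[A]) ** 2).sum()
        if (prev_obj - obj) / max(obj, 1e-12) < tol:
            break
        prev_obj = obj
    return Z, A, obj
```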


Example

- Training set: {1.2, 5.6, 3.7, 0.6, 0.1, 2.6}.
- Apply the k-means algorithm with 2 centroids, {z_1, z_2}.
- Initialization: randomly pick z_1 = 2, z_2 = 5.

  fixed                   update
  z_1 = 2                 {1.2, 0.6, 0.1, 2.6}
  z_2 = 5                 {5.6, 3.7}
  {1.2, 0.6, 0.1, 2.6}    z_1 = 1.125
  {5.6, 3.7}              z_2 = 4.65
  z_1 = 1.125             {1.2, 0.6, 0.1, 2.6}
  z_2 = 4.65              {5.6, 3.7}

  The two prototypes are z_1 = 1.125, z_2 = 4.65. The objective function is L(Z, A) = 5.3125.


- Initialization: randomly pick z_1 = 0.8, z_2 = 3.8.

  fixed                   update
  z_1 = 0.8               {1.2, 0.6, 0.1}
  z_2 = 3.8               {5.6, 3.7, 2.6}
  {1.2, 0.6, 0.1}         z_1 = 0.633
  {5.6, 3.7, 2.6}         z_2 = 3.967
  z_1 = 0.633             {1.2, 0.6, 0.1}
  z_2 = 3.967             {5.6, 3.7, 2.6}

  The two prototypes are z_1 = 0.633, z_2 = 3.967. The objective function is L(Z, A) = 5.2133.

- Starting from different initial values, the k-means algorithm may converge to different local optima.
- It can be shown that {z_1 = 0.633, z_2 = 3.967} is the globally optimal solution.
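As a quick check, running the `kmeans` sketch from earlier on this one-dimensional training set reproduces both local optima; the snippet below assumes that function is in scope.

```python
import numpy as np
# Assumes the kmeans() sketch defined earlier is in scope.

X = np.array([1.2, 5.6, 3.7, 0.6, 0.1, 2.6]).reshape(-1, 1)
for init in ([2.0, 5.0], [0.8, 3.8]):
    Z0 = np.array(init).reshape(-1, 1)
    Z, A, obj = kmeans(X, Z0)
    # -> approximately [1.125, 4.65] with L = 5.3125,
    #    then [0.633, 3.967] with L = 5.2133
    print(Z.ravel(), round(obj, 4))
```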


Classification by K-means

- The primary application of k-means is clustering, i.e., unsupervised classification.
- It can be adapted to supervised classification as follows (a sketch follows this list):
  - Apply k-means clustering to the training data in each class separately, using R prototypes per class.
  - Assign a class label to each of the K × R prototypes.
  - Classify a new feature vector x to the class of the closest prototype.
- This approach to using k-means for classification is referred to as Scheme 1.
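A minimal sketch of Scheme 1, again assuming the `kmeans` function from earlier; the helper names `fit_scheme1` and `classify` are illustrative, and drawing initial prototypes at random from each class is one simple option.

```python
import numpy as np
# Assumes the kmeans() sketch defined earlier is in scope.

def fit_scheme1(X, g, R, rng=np.random.default_rng()):
    """Run k-means within each class; return all prototypes and their class labels."""
    protos, labels = [], []
    for c in np.unique(g):
        Xc = X[g == c]                                  # training data of class c
        init = Xc[rng.choice(len(Xc), R, replace=False)]  # assumes >= R samples per class
        Z, _, _ = kmeans(Xc, init)
        protos.append(Z)
        labels.extend([c] * R)
    return np.vstack(protos), np.array(labels)

def classify(x, protos, labels):
    """Classify a query x to the class of the closest prototype."""
    return labels[np.linalg.norm(protos - x, axis=1).argmin()]
```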


Another Approach

- Another approach to classification by k-means (sketched below):
  - Apply k-means clustering to the entire training data set, using M prototypes.
  - For each prototype, count the number of samples from each class that are assigned to it. Associate the prototype with the class that has the highest count.
  - Classify a new feature vector x to the class of the closest prototype.

- This alternative approach is referred to as Scheme 2.
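Scheme 2 differs from Scheme 1 only in how the prototypes are fit and labeled: k-means runs once on the pooled data, and each prototype takes the class with the highest count among its assigned samples. A sketch under the same assumptions as before:

```python
import numpy as np
# Assumes kmeans() and classify() from the earlier sketches.

def fit_scheme2(X, g, M, rng=np.random.default_rng()):
    """Run k-means on all the data; label each prototype by majority vote."""
    init = X[rng.choice(len(X), M, replace=False)]
    Z, A, _ = kmeans(X, init)
    labels = np.empty(M, dtype=g.dtype)
    for j in range(M):
        # Assumes every prototype receives at least one sample.
        classes, counts = np.unique(g[A == j], return_counts=True)
        labels[j] = classes[counts.argmax()]   # class with the highest count
    return Z, labels
```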


Simulation

- The two classes both follow normal distributions with a common covariance matrix.
- The common covariance matrix is

  \[
  \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} .
  \]

- The means of the two classes are

  \[
  \mu_1 = \begin{pmatrix} 0.0 \\ 0.0 \end{pmatrix} , \qquad
  \mu_2 = \begin{pmatrix} 1.5 \\ 1.5 \end{pmatrix} .
  \]

- The prior probabilities of the two classes are π_1 = 0.3, π_2 = 0.7.
- A training and a test data set are generated, each with 2000 samples.


- The optimal decision boundary between the two classes is given by LDA, since the two class-conditional densities of the input are both normal with a common covariance matrix.

- The optimal decision rule (i.e., the Bayes rule) is

  \[
  G(X) =
  \begin{cases}
  1 & X_1 + X_2 \le 0.9351 \\
  2 & \text{otherwise}
  \end{cases}
  \]
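The constant 0.9351 can be recovered from the parameters above. For equal covariances the LDA boundary is linear; a short derivation (not on the original slide, but standard) sets the two posterior densities equal:

\[
(\mu_2 - \mu_1)^\top \Sigma^{-1} X
= \tfrac{1}{2}\left(\mu_2^\top \Sigma^{-1} \mu_2 - \mu_1^\top \Sigma^{-1} \mu_1\right)
+ \log\frac{\pi_1}{\pi_2} .
\]

With Σ = I, μ_1 = (0, 0)^⊤, μ_2 = (1.5, 1.5)^⊤, π_1 = 0.3, and π_2 = 0.7, this becomes

\[
1.5\,(X_1 + X_2) = \tfrac{1}{2}(4.5) + \log\frac{0.3}{0.7}
\approx 2.25 - 0.8473 = 1.4027 ,
\qquad\text{i.e.}\qquad X_1 + X_2 \approx 0.9351 .
\]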

- The error rate computed using the test data set and the optimal decision rule is 11.75%.


[Figure: scatter plot of the training data set. Red star: Class 1. Blue circle: Class 2.]


K-means with Scheme 1

- For each class, use R = 6 prototypes.
- The 6 prototypes for each class are shown. The black solid line is the boundary between the two classes given by k-means; the green dashed line is the optimal decision boundary.
- The error rate based on the test data is 20.15%.


K-means with Scheme 2

- For the entire training data set, use M = 12 prototypes.
- By counting the number of samples from each class that fall into each prototype, 4 prototypes are labeled as Class 1 and the other 8 as Class 2.
- The prototypes for each class are shown below. The black solid line is the boundary between the two classes given by k-means; the green dashed line is the optimal decision boundary.
- The error rate based on the test data is 13.15%.

[Figure: prototypes of the two classes and the k-means classification boundary.]

Compare the Two Schemes

- Scheme 1 works when there is a small amount of overlap between the classes.
- Scheme 2 is more robust when there is a considerable amount of overlap.
- For Scheme 2, there is no need to specify the number of prototypes for each class separately. The number of prototypes assigned to each class is adjusted automatically; classes with higher priors tend to occupy more prototypes.


- Scheme 2 has a better statistical interpretation: it attempts to estimate Pr(G = j | X) by a piecewise-constant function.
  - Partition the feature space into cells.
  - Assume Pr(G = j | X) is constant for X within a cell.
  - Estimate Pr(G = j | X) by the empirical class frequencies computed from all the training samples that fall into the cell.


Initialization

- Randomly pick prototypes to start the k-means iteration.
- Different initial prototypes may lead to different local optima.
- Try several sets of initial prototypes, then compare the objective function values at convergence and choose the best solution.
- When selecting initial prototypes at random, make sure no prototype lies outside the range of the data.


- Initialization in the above simulation (sketched below):
  - Generate M random vectors with independent dimensions. For each dimension, the feature is uniformly distributed in [−1, 1].
  - Linearly transform the jth feature, Z_j, j = 1, 2, ..., p, of each prototype (a vector) by Z_j s_j + m_j, where s_j is the sample standard deviation of dimension j and m_j is the sample mean of dimension j, both computed using the training data.
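A sketch of this initialization, assuming NumPy; the helper name `init_prototypes` is illustrative.

```python
import numpy as np

def init_prototypes(X, M, rng=np.random.default_rng()):
    """Draw M prototypes uniform on [-1, 1]^p, then rescale per dimension."""
    Z = rng.uniform(-1.0, 1.0, size=(M, X.shape[1]))
    s = X.std(axis=0)    # sample standard deviation s_j of each dimension
    m = X.mean(axis=0)   # sample mean m_j of each dimension
    return Z * s + m     # Z_j * s_j + m_j for each dimension j
```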


Linde-Buzo-Gray (LBG) Algorithm

- An algorithm developed in vector quantization for the purpose of data compression.
- Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, pp. 84-95, Jan. 1980.


- The algorithm (a code sketch follows the steps):

  1. Find the centroid z_1^{(1)} of the entire data set.
  2. Set k = 1, l = 1.
  3. If k < M, split the current centroids by adding small offsets.
     - If M − k ≥ k, split all the centroids; otherwise, split only M − k of them.
     - Denote the number of centroids split by k̃ = min(k, M − k).
     - For example, to split z_1^{(1)} into two centroids, let z_1^{(2)} = z_1^{(1)} and z_2^{(2)} = z_1^{(1)} + ε, where ε has a small norm and a random direction.
  4. Set k ← k + k̃ and l ← l + 1.
  5. Use {z_1^{(l)}, z_2^{(l)}, ..., z_k^{(l)}} as initial prototypes. Apply the k-means iteration to update these prototypes.
  6. If k < M, go back to step 3; otherwise, stop.
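A minimal sketch of the LBG growing loop, reusing the `kmeans` function from earlier; the offset scale `eps` and the Gaussian direction of ε are illustrative choices.

```python
import numpy as np
# Assumes the kmeans() sketch defined earlier is in scope.

def lbg(X, M, eps=1e-2, rng=np.random.default_rng()):
    """Grow from 1 to M prototypes by splitting, refining with k-means each time."""
    Z = X.mean(axis=0, keepdims=True)        # step 1: centroid of the whole data set
    while len(Z) < M:
        n_split = min(len(Z), M - len(Z))    # number split: k~ = min(k, M - k)
        eps_vec = eps * rng.standard_normal((n_split, X.shape[1]))
        Z = np.vstack([Z, Z[:n_split] + eps_vec])   # split n_split centroids
        Z, _, _ = kmeans(X, Z)               # k-means on all current prototypes
    return Z
```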


Tree-structured Vector Quantization (TSVQ) for Clustering

1. Apply 2-centroid k-means to the entire data set.

2. The data are assigned to the 2 centroids.

3. For the data assigned to each centroid, apply 2-centroid k-means to them separately.

4. Repeat the above step (a recursive sketch follows).
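A recursive sketch of TSVQ under the same assumptions as before (the `kmeans` function from earlier); stopping at a fixed tree depth is one simple choice, and the dictionary-based node layout is illustrative.

```python
import numpy as np
# Assumes the kmeans() sketch defined earlier is in scope.

def tsvq(X, depth, rng=np.random.default_rng()):
    """Recursively split the data with 2-centroid k-means; return a binary tree."""
    node = {"centroid": X.mean(axis=0)}
    if depth == 0 or len(X) < 2:
        return node                          # leaf cell
    init = X[rng.choice(len(X), 2, replace=False)]
    _, A, _ = kmeans(X, init)
    # Assumes each split yields two non-empty cells.
    node["children"] = [tsvq(X[A == j], depth - 1, rng) for j in (0, 1)]
    return node

def encode(x, node):
    """Route a query down the tree: about 2 * log2(M) distance computations."""
    while "children" in node:
        dists = [np.linalg.norm(x - c["centroid"]) for c in node["children"]]
        node = node["children"][int(np.argmin(dists))]
    return node["centroid"]
```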


- Comparison with LBG:
  - For LBG, after the initial prototypes are formed by splitting, k-means is applied to the overall data set. The final result is M prototypes.
  - For TSVQ, data partitioned into different centroids at the same level never affect each other in the future growth of the tree. The final result is a tree structure.
- Fast searching:
  - For k-means, deciding which cell a query x goes to requires computing M distances (one per prototype).
  - For tree-structured clustering, deciding which cell a query x goes to requires only 2 log_2(M) distance computations.


- Comments on tree-structured clustering:
  - It is structurally more constrained, but on the other hand it provides more insight into the patterns in the data.
  - It is greedy in the sense of optimizing sequentially at each step; an early bad decision propagates its effect down the tree.
  - It provides more algorithmic flexibility.


Choose the Number of Prototypes

- Cross-validation (a data-driven approach; a sketch follows this list):
  - For each candidate number of prototypes, compute the classification error rate using cross-validation.
  - Select the number of prototypes that yields the minimum cross-validated error rate.
- Cross-validation is often rather effective.
- Rule of thumb: on average, every cell should contain at least 5 to 10 samples.
- Other model selection approaches can also be used.
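A sketch of the cross-validated selection loop, assuming the Scheme 2 helpers from the earlier sketches; the fold-assignment trick and all names are illustrative.

```python
import numpy as np
# Assumes fit_scheme2() and classify() from the earlier sketches.

def cv_error(X, g, M, n_folds=2, rng=np.random.default_rng()):
    """Estimate the classification error for M prototypes by n-fold cross-validation."""
    folds = rng.permutation(len(X)) % n_folds   # random, roughly equal folds
    errs = []
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        Z, labels = fit_scheme2(X[tr], g[tr], M)
        pred = np.array([classify(x, Z, labels) for x in X[te]])
        errs.append((pred != g[te]).mean())
    return float(np.mean(errs))

# Select the number of prototypes with the smallest cross-validated error, e.g.:
# best_M = min(range(1, 21), key=lambda M: cv_error(X, g, M))
```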


Example

- Diabetes data set. The two principal components are used.
- For the number of prototypes M = 1, 2, ..., 20, apply k-means for classification.
- Two-fold cross-validation is used to compute the error rates.
- The error rate vs. M is plotted below. For M ≥ 8 the performance is similar; the minimum error rate, 27.34%, is achieved at M = 13.

[Figure: cross-validated error rate vs. the number of prototypes M.]

The prototypes assigned to the two classes and the classification boundary are shown below. 9 prototypes are assigned to Class 1 (without diabetes) and 4 to Class 2 (with diabetes).
