
Page 1:

Feature Selection in k-Median Clustering

Olvi Mangasarian and Edward Wild

University of Wisconsin - Madison

Page 2:

Principal Objective

Find a reduced number of input-space features such that clustering in the reduced space closely replicates the clustering in the full-dimensional space

Page 3:

Basic Idea

Based on rigorous optimization theory, make a simple but fundamental modification to one of the two steps of the k-median algorithm

In each cluster, find a point closest in the 1-norm to all points in that cluster and to the median of ALL data points

The proposed approach can lead to a feature reduction as high as 64%, with clustering within 4% of that obtained with the original set of features

As the weight given to the data median increases, more features are deleted from the problem

Page 4:

FSKM Example

Start with median at origin

Apply k-median algorithm

As weight of data median increases, features are removed from the problem

Page 5:

Outline of Talk

Ordinary k-median algorithm

Two steps of the algorithm

Feature Selecting k-Median (FSKM) Algorithm

Overall optimization objective

Basic idea

Mathematical optimization formulation

Algorithm statement

Numerical examples

Conclusion & outlook

Page 6:

Ordinary k-Median Algorithm

Given m data points in n-dimensional input feature space

Find k cluster centers with the following property: the sum of the 1-norm distances between each data point and the closest cluster center is minimized

Minimizing this sum of pointwise minima of linear functions is a concave minimization problem and is NP-hard

However, the two-step k-median algorithm terminates in a finite number of steps at a point satisfying the minimum principle necessary optimality condition
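In symbols, with data points $a^1, \dots, a^m \in \mathbb{R}^n$ and cluster centers $c^1, \dots, c^k$, the problem described above can be written as the standard k-median objective (the notation here is ours, chosen to match the description on this slide):

$$\min_{c^1,\dots,c^k \in \mathbb{R}^n} \; \sum_{i=1}^{m} \; \min_{l=1,\dots,k} \; \|a^i - c^l\|_1$$

The inner minimum over cluster centers is what makes the objective nonconvex, hence the alternating two-step algorithm on the next slide.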

Page 7:

Two-Step k-Median Algorithm

(0) Start with k initial cluster centers

(1) Assign each data point to a 1-norm closest cluster center

(2) For each cluster compute a new cluster center that is 1-norm closest to all points in the cluster (median of cluster)

(3) Stop if all cluster centers are unchanged; else go to (1)

Algorithm terminates in a finite number of steps at a point satisfying the minimum principle necessary optimality conditions
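A minimal Python sketch of the two-step iteration above; the function name, random initialization, and iteration cap are illustrative choices, not taken from the slides:

```python
import numpy as np

def k_median(A, k, max_iter=100, seed=0):
    """Two-step k-median: assign points to the 1-norm closest centers,
    then recompute each center as the componentwise median of its cluster."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # (0) start with k initial cluster centers chosen from the data
    centers = A[rng.choice(m, size=k, replace=False)].copy()
    for _ in range(max_iter):
        # (1) assign each point to the 1-norm closest cluster center
        dists = np.abs(A[:, None, :] - centers[None, :, :]).sum(axis=2)  # m x k
        labels = dists.argmin(axis=1)
        # (2) recompute each center as the point 1-norm closest to its cluster,
        #     i.e. the componentwise median of the assigned points
        new_centers = centers.copy()
        for l in range(k):
            pts = A[labels == l]
            if len(pts) > 0:
                new_centers[l] = np.median(pts, axis=0)
        # (3) stop when all cluster centers are unchanged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```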

Page 8:

Key Change in Step (2) of k-Median Algorithm

Steps (0), (1), and (3) are unchanged; step (2) becomes:

(2) For each cluster, compute a new cluster center that minimizes the sum of 1-norm distances to all points in the cluster plus a weighted 1-norm distance to the median of all data points

The weight of the 1-norm distance to the dataset median determines the number of features deleted:

For a zero weight, no features are suppressed

For a sufficiently large weight, all features are suppressed

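A sketch of the modified step (2) objective for cluster $l$ with assigned points $A_l$, writing $\bar{a}$ for the median of all data points and $\lambda \ge 0$ for the median weight (the exact scaling of the weight in the underlying paper may differ from this simplified form):

$$c^l \in \arg\min_{c \in \mathbb{R}^n} \; \sum_{a^i \in A_l} \|a^i - c\|_1 \; + \; \lambda \, \|c - \bar{a}\|_1$$

If the data is shifted so that $\bar{a} = 0$, the second term becomes a 1-norm penalty $\lambda \|c\|_1$ that drives individual components of the cluster center to zero; a feature is deleted once its component is zero in every cluster center.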

Page 9:

FSKM Theory

Page 10:

Subgradients

A vector $\partial f(x)$ is a subgradient of a convex function $f$ at $x$ if $f(y) - f(x) \ge \partial f(x)'(y - x)$ for all $x, y \in \mathbb{R}^n$

Consider $\|x\|_1$ for $x \in \mathbb{R}^1$ (i.e., $|x|$):

If $x < 0$, the subgradient of $\|x\|_1$ is $-1$

If $x > 0$, the subgradient of $\|x\|_1$ is $1$

If $x = 0$, the subgradient of $\|x\|_1$ is any value in $[-1, 1]$

Page 11:

FSKM Theory (Continued)

Page 12:

Zeroing Cluster Features
(Based on Necessary and Sufficient Optimality Conditions for Nondifferentiable Convex Optimization)

That is, $c_j = 0$ whenever the median weight is large enough relative to the cluster's data in feature $j$, as sketched below
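Because the objective above separates across components when $\bar{a} = 0$, a zeroing condition follows directly from the scalar subgradient facts on the Subgradients slide. One sufficient form, derived here for illustration (the paper's exact condition may be stated differently), is: for cluster $l$ and feature $j$,

$$c_j = 0 \ \text{minimizes} \ \sum_{a^i \in A_l} |a^i_j - c_j| + \lambda |c_j| \quad \text{whenever} \quad \lambda \ \ge \ \Big| \sum_{a^i \in A_l} \operatorname{sign}(a^i_j) \Big|$$

since $\lambda$ can then absorb the sum of signs and place $0$ in the subdifferential of the scalar objective at $c_j = 0$.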

Page 13:

FSKM Algorithm

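A minimal Python sketch of the modified step (2) center computation that produces this zeroing behavior; the helper name and the brute-force search over breakpoints are illustrative, assuming the data median has been shifted to the origin:

```python
import numpy as np

def fskm_center(pts, lam):
    """Modified step (2): for each feature j, minimize
    sum_i |pts[i, j] - c_j| + lam * |c_j| over the scalar c_j,
    assuming the median of all data points is at the origin.
    The objective is piecewise-linear and convex, so its minimum
    is attained at a breakpoint: one of the data values or 0."""
    n = pts.shape[1]
    center = np.zeros(n)
    for j in range(n):
        # 0 is listed first so that ties favor deleting the feature
        candidates = np.append(0.0, pts[:, j])
        cost = lambda c: np.abs(pts[:, j] - c).sum() + lam * abs(c)
        center[j] = min(candidates, key=cost)
    return center
```

Running ordinary k-median and substituting fskm_center for its step (2) at increasing values of lam reproduces the behavior described earlier: more components of every cluster center become zero, and a feature is dropped once its component is zero in all centers.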

Page 14:

FSKM Example (Revisited)

Start with median at origin

Apply k-median algorithm

Compute the per-feature quantities for each cluster: cluster 1 gives 1 for x and 5 for y; cluster 2 gives 0 for x and 4 for y

Maximum over clusters: 1 for x, 5 for y

For a median weight of 1, feature x is removed from the problem

[Figure: the two clusters, labeled 1 and 2, plotted in the (x, y) plane]

Page 15:

Numerical Testing


FSKM tested on five publicly available labeled datasets

Labels were used only to test effectiveness of FSKM

Data is first clustered using k-median, then FSKM is applied to delete one feature at a time

Without using data labels, “error” in FSKM clustering with reduced features is obtained by comparison with the “gold standard” clustering with the full set of features

FSKM clustering error curve obtained without labels is compared with classification error curve obtained using data labels
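One plausible way to compute the label-free clustering "error" described above is the fraction of points whose cluster assignment differs from the gold-standard clustering under the best matching of cluster indices; the permutation matching below is our assumption about how the comparison is made, not a detail taken from the slides:

```python
from itertools import permutations
import numpy as np

def clustering_disagreement(gold_labels, test_labels, k):
    """Fraction of points placed in different clusters, minimized over
    all relabelings of the test clustering's k cluster indices.
    Enumerating permutations is fine for the small k used here."""
    gold = np.asarray(gold_labels)
    test = np.asarray(test_labels)
    best = 1.0
    for perm in permutations(range(k)):
        relabeled = np.array([perm[label] for label in test])
        best = min(best, float(np.mean(relabeled != gold)))
    return best
```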

Page 16:

3-Class Wine Dataset: 178 Points in 13-dimensional Space

Page 17:

Remarks

Curves close together

Largest increase in error as last few features are removed

Reduced 13 features to 4: clustering error < 4%; classification error decreased by 0.56 percentage points

Page 18:

2-Class Votes Dataset: 435 Points in 16-dimensional Space

Page 19:

Remarks

Curves have similar shape

Largest increase in error as last few features are removed

Reduced 16 features to 3: clustering error < 10%; classification error increased by 1.84 percentage points

Page 20:

2-Class WDBC Dataset (Wisconsin Diagnostic Breast Cancer): 569 Points in 30-dimensional Space

Page 21:

Remarks

Curves have similar shape for 14 and fewer features

First 3 features removed cause no change to either error curve

Reduced 30 features to 7: clustering error < 10%; classification error increased by 3.69 percentage points

Page 22:

2-Class Star/Galaxy-Bright Dataset: 2462 Points in 14-dimensional Space

Page 23:

Remarks

Clustering error increases gradually as number of features is reduced

Some features may be obstructing classification

Reduced 14 features to 4: clustering error < 10%; classification error decreased by 1.42 percentage points

Page 24:

2-Class Cleveland Heart Dataset: 297 Points in 13-dimensional Space

Page 25:

Remarks

Largest increase in both curves going from 13 to 9 features

Most features useful?

Reduced 13 features to 8: clustering error < 17%; classification error increased by 7.74 percentage points

Page 26:

Conclusion

FSKM is a fast method for selecting relevant features while maintaining clusters similar to those in the original full-dimensional space

Features selected by FSKM without labels may be useful for labeled data classification as well

FSKM eliminates the costly combinatorial search for an appropriately reduced set of features needed for clustering in lower-dimensional spaces (e.g., 14-choose-6 = 3003 k-median runs to find the best 6 of 14 features for the Star/Galaxy-Bright dataset, compared to 9 k-median runs required by FSKM)
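For reference, the search count quoted above is simply a binomial coefficient:

$$\binom{14}{6} = \frac{14!}{6!\,8!} = 3003$$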

Page 27:

Outlook

Feature & data selection for support vector machines

Sparse kernel approximation methods

Gene expression selection

Incorporation of prior knowledge into learning

Optimization-based clustering may be useful in other machine learning applications

Minimalist supervised & unsupervised learning

Select minimal knowledge for best model

Page 28:

Web Pages (Containing Paper & Talk)

www.cs.wisc.edu/~olvi

www.cs.wisc.edu/~wildt