
Chapter 4

Rough Fuzzy c-Means Subspace Clustering

In this chapter, we propose a novel adaptation of the rough fuzzy c-means algorithm for high dimensional data by modifying its objective function. The proposed algorithm automatically detects the relevant cluster dimensions of a high dimensional data set. Since the weights assigned to attributes are specific to each cluster, an efficient subspace clustering scheme is generated. We also discuss the convergence of the proposed algorithm. The remainder of this chapter is organised as follows. Section 4.1 introduces rough set theory. Section 4.2, on related work, describes how classical clustering methods have been adapted to suit the requirements of high dimensional data. In section 4.3, we extend the rough fuzzy c-means algorithm for subspace clustering in the form of the Rough Fuzzy c-Means Subspace (RFCMS) algorithm. Section 4.4 discusses the convergence of the proposed algorithm. Section 4.5 presents the results of applying the RFCMS algorithm to several UCI data sets, and finally section 4.6 summarizes the chapter.


4.1 Introduction

    Pawlak introduced rough set theory as a new framework for dealing with

    imperfect knowledge [Pawlak, 1991]. Rough set theory provides a method-

    ology for addressing the problem of relevant feature selection, by selecting

    a set of information rich features from a data set that retains the seman-

    tics of the original data and requires no human inputs unlike statistical ap-

    proaches [Jensen, 1999]. It is often possible to arrive at a minimal feature set

    (called reduct in rough set theory) that can be used for data analysis tasks

    such as classification and clustering [Lingras and West, 2004], [Mitra et al.,

    2006]. When feature selection approaches based on rough sets are combined

    with an intelligent classification system like those based on fuzzy systems or

    neural networks, they retain the descriptive power of the overall classifier and

    result in simplified system structure which enhances the understandability

    of the resultant system [Shen, 2007].

Following Rutkowski, we describe the notion of rough sets used to model uncertainty in information systems [Rutkowski, 2008]. Formally, an information system is a pair (U, A), where U is a non-empty finite set of objects and A is a non-empty finite set of attributes such that each attribute a has an associated value set V_a, i.e. a : U → V_a for every a ∈ A. A Decision System DS is defined as a pair (U, A ∪ {d}), where d ∉ A is called the decision attribute and the elements of A are called condition attributes. For an attribute set B ⊆ A, the set of objects in the information system indiscernible w.r.t. B is described by the indiscernibility relation INDIS(B) defined as: INDIS(B) = {(x1, x2) ∈ U² | a(x1) = a(x2) ∀ a ∈ B}. The objects x1 and x2 are indiscernible from each other by the attributes from B if (x1, x2) ∈ INDIS(B). The equivalence classes of the B-indiscernibility relation are denoted by [x]_B. If X ⊆ U, then X can be approximated using B by constructing three approximations, namely, the B-lower approximation B̲X = {x | [x]_B ⊆ X}, the B-upper approximation B̄X = {x | [x]_B ∩ X ≠ ∅},


and the B-boundary region B̄X − B̲X of X. Evidently, the boundary region consists of all objects in the upper approximation but not in the lower approximation of X. Bazan et al. discuss various techniques for rough set reduct generation and argue that classical reducts, being static, may not

    reduct generation and argue that the classical reducts being static may not

    be stable in randomly chosen samples of a given decision table [Bazan et al.,

    2000]. To deal with such situations they focus on reducts that are stable over

    different subsets of samples chosen from a given decision table. Such reducts

    are called dynamic reducts. They compute reducts using an order based

    genetic algorithm and subsequently extract dynamic reducts which are used

    to generate classification rules. Each rule set is associated with a measure

    called the rule strength which is used later to resolve conflicts when several

    rules are applicable. Slezak generalized the concept of reduct by introduc-

    ing the notion of association reducts corresponding to both association rules

    and rough set reducts [Slezak, 2005]. He defined association reduct as a pair

    (A, B) of disjoint subsets of attributes such that all data supported patterns

    involving A approximately determine those involving B. He developed an

    information theory based algorithm to compute association reducts. As the

    algorithm needs to examine all association reducts, it has exponential time

    requirements. In order to alleviate this hardship, Slezak targeted significantly

    smaller ensembles of dependencies providing reasonably rich knowledge, and

    developed an order based genetic algorithm to achieve this [Slezak, 2009].

    Shen and Jensen proposed the concept of retainer as an approximation of a

    reduct [Richard and Qiang, 2001]. The authors suggest a heuristic to com-

    pute the retainer and demonstrate its usefulness for the classification task.

For clustering a textual database consisting of N documents with a vocabulary of size V, Li et al. developed an algorithm based on approximate reducts that works in time O(VN) [Li et al., 2006].
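To make the approximation operators defined earlier in this section concrete, the following small sketch computes B-indiscernibility classes and the lower and upper approximations of a target set. The toy information system, the attribute subset B, and the set X below are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict

# Toy information system: objects described by condition attributes a1, a2.
objects = {
    "x1": {"a1": 0, "a2": "low"},
    "x2": {"a1": 0, "a2": "low"},
    "x3": {"a1": 1, "a2": "high"},
    "x4": {"a1": 1, "a2": "low"},
}

def equivalence_classes(objs, B):
    """Group objects that are indiscernible w.r.t. the attribute subset B."""
    classes = defaultdict(set)
    for name, attrs in objs.items():
        key = tuple(attrs[a] for a in B)
        classes[key].add(name)
    return list(classes.values())

def approximations(objs, B, X):
    """Return the (lower, upper) approximations of X using attributes B."""
    lower, upper = set(), set()
    for eq in equivalence_classes(objs, B):
        if eq <= X:        # [x]_B is contained in X
            lower |= eq
        if eq & X:         # [x]_B intersects X
            upper |= eq
    return lower, upper

X = {"x1", "x3"}                        # target concept
low, up = approximations(objects, ["a1", "a2"], X)
print(low, up, up - low)                # lower, upper, boundary region
```

For this toy system the lower approximation, upper approximation and boundary region are {x3}, {x1, x2, x3} and {x1, x2} respectively.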


4.2 Related Work

    Rough sets have been widely used for classification and clustering [Lingras

    and West, 2004], [Mitra et al., 2006], [Pawlak, 1991]. The classical k-means

    algorithm has been extended to rough k-means algorithm by Lingras et al.

    [Lingras and West, 2004]. In rough k-means algorithm, a cluster in the lower

    approximation, called the core cluster, is surrounded by a buffer or boundary

    set having objects with unclear membership status [Lingras and West, 2004].

A data point in the lower approximation surely belongs to a cluster, whereas the membership of an object that lies only in the upper approximation is uncertain. The signature

    of each cluster is represented by its center, lower and upper approximation.

    If lower and upper approximations are equal then buffer set is empty and

    the data objects are crisply assigned to the cluster. The rough k-means

    algorithm follows an iterative process, wherein cluster centers are updated

    until convergence criterion is met. Asharaf et al. have extended rough k-

    means algorithm in such a way that it does not require prior specification of

the number of clusters [Asharaf and Murty, 2004]. They have proposed a

    two phase algorithm. It identifies a set of leaders which act as prototypes in

    the first phase. Subsequently a set of supporting leaders are identified, which

    can act as leaders, provided they yield better partitioning. The evolutionary

    rough k-medoids algorithm [Peters et al., 2008] is based on the family of

    rough clustering algorithms and the classical k-medoids algorithm [Kaufman

and Rousseeuw, 1990]. Malyszko and Stepaniuk have extended rough k-means

    clustering to rough entropy clustering [Malyszko and Stepaniuk, 2009]. It is

    an iterative process: firstly a predefined number of weight pairs are selected,

    for each weight pair a new offspring clustering is determined, rough entropy

    is computed, and the partition which gives highest rough entropy is selected.

    Liu et al. have proposed a feature selection method ISODATA-RFE for

high dimensional gene expression datasets [Liu et al., 2012]. The Bhattacharyya distance is used to rank the features of the training set, and features with a low Bhattacharyya distance are removed from the feature set. For separating different classes, the fuzzy ISODATA algorithm is used to calculate a sensitivity index of

    each feature. A recursive feature elimination method is applied to feature

    set for removing unimportant features. It generates multiple nested candi-

    date feature subsets. Finally, the feature subset with least error is selected

    for use in classification and clustering algorithms. Own and Abraham have

proposed a new weighted rough set framework based classification scheme for neonatal jaundice [Own and Abraham, 2012]. The weighted information table

    is built by applying class equal sample weighting. While samples in ma-

    jority class have smaller weight, the samples in minority class have larger

    weight. A weighted reduction algorithm MLEM2 exploits the significance

    of the attributes to extract a set of diagnosis rules from decision system

    of NeoNatal Jaundice database. Deng et al. have proposed an enhanced

    entropy weighting subspace clustering algorithm for high dimensional gene

    expression data [Deng et al., 2011]. Its objective function integrates the

    fuzzy within cluster compactness and between cluster information simulta-

    neously. [Cordeiro de Amorim and Mirkin, 2012] have extended the weighted

K-means algorithm proposed by Huang et al. They have replaced the Euclidean distance metric by the Minkowski metric for measuring distances, as the Euclidean

    distance cannot capture the relationship between scales of the feature values

    and feature weights. Bai et al. have proposed a novel weighting algorithm

    for categorical data [Bai et al., 2011]. The algorithm computes two weights

    for each dimension in each cluster. These weight values are used to identify

    the subsets of attributes which can categorize different clusters.

    Rough set theory has been applied in conjunction with fuzzy set theory in

    several domains such as fuzzy rule extraction, reasoning with uncertainty,

    fuzzy modelling, and feature selection [Maji and Pal, 2010]. The classical

    fuzzy c-means algorithm has been used in conjunction with rough sets to

    develop rough fuzzy c-means (RFCM) algorithm [Mitra and Banka, 2007].

    The concept of membership in FCM enables efficient handling of overlapping


partitions, while rough sets are aimed at modelling uncertainty in data.

    Such hybrid techniques provide a strong paradigm for uncertainty handling in

    various application domains such as pattern recognition, image processing,

    mining stock prices, vocabulary for information retrieval, fuzzy clustering,

    dimensionality reduction, data mining and knowledge discovery [Maji and

    Paul, 2011], [Maji and Pal, 2010]. Maji and Pal proposed an algorithm

    RFCMdd for selecting the most informative bio-basis (medoids), where each

    partition is represented by a medoid computed as weighted average of the

crisp lower approximation and fuzzy boundary [Maji and Pal, 2007b]. Maji

    introduced a quantitative measure of similarity among genes based on fuzzy

    rough sets to develop fuzzy-rough supervised attribute clustering (FRSAC)

    algorithm [Maji, 2011].

4.3 Rough Fuzzy c-Means Subspace Clustering

In this section, we propose an algorithm based on the rough fuzzy c-means algorithm for subspace clustering.

4.3.1 Rough c-Means

The rough c-means algorithm [Lingras and West, 2004] extends the concept of c-means by considering each cluster as an interval or rough set, where the lower and upper approximations B̲X and B̄X are characteristics of the rough set X. A rough set has the following properties:

(i) An object x_j can belong to at most one lower approximation.

(ii) If x_j ∈ B̲X of cluster X, then x_j ∈ B̄X also.

(iii) If x_j does not belong to any lower approximation, then it belongs to two or more upper approximations, i.e. overlap between clusters is possible.


The iterative steps of the rough c-means algorithm are as follows:

Algorithm 2 Rough c-Means Algorithm

1. Choose initial means z_i, 1 ≤ i ≤ k, for the k clusters.

2. Assign each data point x_j, 1 ≤ j ≤ n, to the lower approximation B̲U_i or to the upper approximations B̄U_i, B̄U_{i'} of the cluster pair U_i, U_{i'} by computing the difference d_{i'j} − d_{ij}, where d_{ij} is the distance of the j-th data point x_j from the i-th centroid z_i of cluster U_i.

3. Let d_{ij} be the minimum and d_{i'j} the next to minimum. If d_{i'j} − d_{ij} is less than some threshold, then x_j ∈ B̄U_i and x_j ∈ B̄U_{i'} and x_j cannot be a member of any lower approximation; otherwise x_j ∈ B̲U_i, where the distance d_{ij} is minimum over the k clusters.

4. Compute the new mean z_i for each cluster as

\[
z_i =
\begin{cases}
\dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i = \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
w_{low}\, \dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} + w_{up}\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i \neq \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
\dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} & \text{otherwise,}
\end{cases}
\]

where the parameters w_low and w_up represent the relative importance of the lower and upper approximations respectively. Thus, RCM generates three types of clusters, with objects (i) in both the lower and upper approximations, (ii) only in the lower approximation, and (iii) only in the upper approximation.

5. Repeat Steps 2-4 until convergence, i.e., until there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: w_up = 1 − w_low, 0.5 < w_low < 1, and 0 < threshold < 0.5.
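The assignment and update steps above can be sketched in a few lines. The sketch below is illustrative only: it assumes Euclidean distances and the parameter ranges given in the note, and it does not reproduce the implementation used later in the experiments.

```python
import numpy as np

def rough_cmeans_step(X, Z, threshold=0.2, w_low=0.7):
    """One assignment + centroid update of rough c-means (sketch).

    X: (n, d) data matrix, Z: (k, d) current means."""
    n, k = X.shape[0], Z.shape[0]
    w_up = 1.0 - w_low
    dist = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)  # d_ij
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for j in range(n):
        order = np.argsort(dist[j])
        i, i2 = order[0], order[1]          # nearest and second-nearest cluster
        upper[i].add(j)
        if dist[j, i2] - dist[j, i] < threshold:
            upper[i2].add(j)                # ambiguous: only upper approximations
        else:
            lower[i].add(j)                 # confident: lower approximation
    Z_new = np.empty_like(Z)
    for i in range(k):
        low, bnd = list(lower[i]), list(upper[i] - lower[i])
        if low and bnd:
            Z_new[i] = w_low * X[low].mean(axis=0) + w_up * X[bnd].mean(axis=0)
        elif bnd:                            # empty lower approximation
            Z_new[i] = X[bnd].mean(axis=0)
        elif low:                            # empty boundary region
            Z_new[i] = X[low].mean(axis=0)
        else:
            Z_new[i] = Z[i]                  # currently empty cluster: keep old mean
    return Z_new, lower, upper
```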


4.3.2 Rough-Fuzzy c-Means

The rough-fuzzy c-means algorithm [Mitra et al., 2006] incorporates a weighted distance in terms of the fuzzy membership value u_ij of a data point x_j to a cluster mean z_i, instead of the absolute individual distance d_ij of the j-th data point from the i-th cluster center. The iterative steps of the algorithm are as follows:

Algorithm 3 Rough Fuzzy c-Means Algorithm

1. Choose initial means z_i, 1 ≤ i ≤ k, for the k clusters.

2. Compute u_ij by eq. 3.9 for the k clusters and n data objects.

3. Assign each data point x_j to the lower approximation B̲U_i or to the upper approximations B̄U_i, B̄U_{i'} of the cluster pair U_i, U_{i'} by computing the difference in its memberships u_ij − u_{i'j}.

4. Let u_ij be the maximum and u_{i'j} the next to maximum. If u_ij − u_{i'j} is less than some threshold, then x_j ∈ B̄U_i and x_j ∈ B̄U_{i'} and x_j cannot be a member of any lower approximation; otherwise x_j ∈ B̲U_i, where the membership u_ij is maximum over the k clusters.

5. Compute the new mean z_i for each cluster as

\[
z_i =
\begin{cases}
\dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}\, x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}} & \text{if } \underline{B}U_i = \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
w_{low}\, \dfrac{\sum_{x_j \in \underline{B}U_i} u_{ij}\, x_j}{\sum_{x_j \in \underline{B}U_i} u_{ij}} + w_{up}\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}\, x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}} & \text{if } \underline{B}U_i \neq \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
\dfrac{\sum_{x_j \in \underline{B}U_i} u_{ij}\, x_j}{\sum_{x_j \in \underline{B}U_i} u_{ij}} & \text{otherwise.}
\end{cases}
\]

6. Repeat Steps 2-5 until convergence, i.e., until there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: w_up = 1 − w_low, 0.5 < w_low < 1, and 0 < threshold < 0.5.
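The only change with respect to the rough c-means assignment is that the decision is now based on the difference between the two largest memberships rather than the two smallest distances. A minimal sketch of steps 3 and 4, assuming the membership matrix u (of shape n × k) has already been computed by eq. 3.9, is given below.

```python
import numpy as np

def rfcm_assign(u, threshold=0.2):
    """Assign objects to lower/upper regions from memberships u[j, i] (sketch)."""
    n, k = u.shape
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for j in range(n):
        order = np.argsort(u[j])[::-1]      # clusters by decreasing membership
        i, i2 = order[0], order[1]
        upper[i].add(j)
        if u[j, i] - u[j, i2] < threshold:
            upper[i2].add(j)                # ambiguous object: boundary only
        else:
            lower[i].add(j)
    return lower, upper
```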


4.3.3 Rough Fuzzy c-Means Subspace Clustering Algorithm

The proposed algorithm, called Rough Fuzzy c-Means Subspace (RFCMS), has been developed by hybridizing the concept of fuzzy membership for objects (in clusters) and for dimensions (where the fuzzy membership serves as the weight of a dimension) with rough set based approximations of clusters.

Objective Function. Let B̲U_i, B̄U_i and B̄U_i − B̲U_i denote the lower approximation, upper approximation, and boundary region of the i-th cluster U_i respectively. In [Lingras and West, 2004], the classical objective function of the c-means algorithm has been modified in the rough framework by incorporating the lower and upper approximations of the clusters. We have extended the objective function of the rough fuzzy c-means algorithm [Mitra et al., 2006] by incorporating the weights of the dimensions as relevant to different clusters. We associate with the i-th cluster the weight vector ω_i, which represents the relative relevance of the different attributes for the i-th cluster. Thus, in the matrix W = [ω_ir]_{k×d}, ω_ir denotes the contribution of the r-th dimension to the i-th cluster. The contributions from all dimensions add up to 1 for each cluster:

\[
\sum_{r=1}^{d} \omega_{ir} = 1, \quad 1 \leq i \leq k, \qquad (4.1)
\]
\[
\omega_{ir} \in [0, 1], \quad 1 \leq i \leq k, \; 1 \leq r \leq d. \qquad (4.2)
\]

The proposed RFCMS algorithm partitions the data set into k clusters by minimizing the following objective function J_RFCMS:

\[
J_{RFCMS} =
\begin{cases}
aA + bB & \text{if } \underline{B}U \neq \emptyset \ \wedge \ \overline{B}U - \underline{B}U \neq \emptyset, \\
A & \text{if } \underline{B}U \neq \emptyset \ \wedge \ \overline{B}U - \underline{B}U = \emptyset, \\
B & \text{otherwise,}
\end{cases}
\]


where

\[
A = \sum_{i=1}^{k} \sum_{x_j \in \underline{B}U_i} \sum_{r=1}^{d} \mu_{ij}^{\tau}\, \omega_{ir}^{\eta}\, d^2_{ijr},
\qquad
B = \sum_{i=1}^{k} \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \sum_{r=1}^{d} \mu_{ij}^{\tau}\, \omega_{ir}^{\eta}\, d^2_{ijr}. \qquad (4.3)
\]

In the above formulation, A and B correspond to the lower and upper approximations. The parameters a and b control the contribution of the lower and upper approximation of a cluster, and

\[
d^2_{ijr} = (x_{jr} - z_{ir})^2 \qquad (4.4)
\]

is the distance between the i-th cluster center and the j-th data object along the r-th dimension. The parameters τ ∈ (1, ∞) and η ∈ (1, ∞) are weighting components; they control the fuzzification of μ_ij and ω_ir respectively.

Solving 4.3 w.r.t. μ_ij and ω_ir, we get:

\[
\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} \omega_{ir}^{\eta}\, d^2_{ijr}}{\sum_{r=1}^{d} \omega_{lr}^{\eta}\, d^2_{ljr}} \right]^{\frac{1}{\tau-1}}} \qquad (4.5)
\]

\[
\omega_{ir} = \frac{1}{\sum_{l=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijr}}{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijl}} \right]^{\frac{1}{\eta-1}}} \qquad (4.6)
\]

The weights of the dimensions are computed using eq. 4.6 as in [Kumar and Puri, 2009].
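The two update rules can be vectorized directly. In the sketch below, tau and eta stand for the weighting components τ and η of eq. 4.3, and a small constant eps guards against zero distances, which the derivation implicitly assumes away; this is an illustrative sketch, not the implementation used in the experiments.

```python
import numpy as np

def update_memberships(X, Z, W, tau=2.0, eta=2.0, eps=1e-12):
    """Eq. 4.5: mu[i, j] from weighted per-dimension distances (sketch)."""
    d2 = (X[None, :, :] - Z[:, None, :]) ** 2          # (k, n, d): d^2_ijr
    D = np.einsum('ir,ijr->ij', W ** eta, d2) + eps    # (k, n): sum_r w_ir^eta d^2_ijr
    t = D ** (-1.0 / (tau - 1.0))
    return t / t.sum(axis=0, keepdims=True)            # normalize over clusters

def update_weights(X, Z, mu, tau=2.0, eta=2.0, eps=1e-12):
    """Eq. 4.6: w[i, r] from membership-weighted distances (sketch)."""
    d2 = (X[None, :, :] - Z[:, None, :]) ** 2          # (k, n, d)
    E = np.einsum('ij,ijr->ir', mu ** tau, d2) + eps   # (k, d): sum_j mu_ij^tau d^2_ijr
    t = E ** (-1.0 / (eta - 1.0))
    return t / t.sum(axis=1, keepdims=True)            # normalize over dimensions
```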

Cluster Center. The cluster centers are computed as:

\[
z_{ir} =
\begin{cases}
\dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}} & \text{if } \underline{B}U_i = \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
a\, \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}} + b\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}} & \text{if } \underline{B}U_i \neq \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
\dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}} & \text{otherwise.}
\end{cases}
\qquad (4.7)
\]

As the objects lying in the lower approximation definitely belong to the cluster, they are assigned a higher weight than the objects lying in the boundary region. For the case a = 1, the cluster center may get stuck in a local optimum because the cluster cannot see the objects lying in its boundary region and, therefore, may not be able to move towards the best cluster center. In order to maintain a greater degree of freedom to move, the values of the parameters a and b are set as 0 < b < a < 1 such that a + b = 1 [Maji and Pal, 2007a]. Like FCM [Bezdek et al., 1987] and Yan's fuzzy curve tracing algorithm [Yan, 2004], the proposed RFCMS algorithm converges, at least along a subsequence, to a local optimum solution.

    The iterative steps of the algorithm are as follows:

Algorithm 4 Rough Fuzzy c-Means Subspace Clustering Algorithm

1. Choose initial cluster centers z_i, 1 ≤ i ≤ k, for the k clusters.

2. Compute μ_ij by eq. 4.5 for the k clusters and n data objects.

3. Let μ_ij be the maximum and μ_{i'j} the next to maximum for an object x_j. If μ_ij − μ_{i'j} is less than some threshold, then x_j ∈ B̄U_i and x_j ∈ B̄U_{i'} and x_j cannot be a member of any lower approximation; otherwise x_j ∈ B̲U_i, where the membership μ_ij is maximum over the k clusters.

4. Compute ω_ir by eq. 4.6 for the k clusters and d dimensions.

5. Compute the new cluster centers z_i for each cluster, as in eq. 4.7.

6. Repeat steps 2-5 until convergence, i.e., until there are no more new assignments, or the limit on the maximum number of iterations is reached.

Note: a = 1 − b, 0.5 < a < 1, and 0 < threshold < 0.5.
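Continuing the sketch of eqs. 4.5 and 4.6, the cluster-center update of eq. 4.7 can be written as follows. Here lower and upper are the index sets produced by step 3, a and b are the weights of the lower approximation and boundary region, and the membership exponent tau follows the reconstruction of eq. 4.7 above; the code is a sketch under those assumptions.

```python
import numpy as np

def update_centers(X, mu, lower, upper, Z_old, a=0.7, tau=2.0):
    """Eq. 4.7: per-dimension centers from lower and boundary regions (sketch)."""
    b = 1.0 - a
    k, d = Z_old.shape
    Z = Z_old.copy()
    for i in range(k):
        low = sorted(lower[i])
        bnd = sorted(upper[i] - lower[i])

        def wmean(idx):                      # membership-weighted mean over idx
            w = mu[i, idx] ** tau
            return (w[:, None] * X[idx]).sum(axis=0) / w.sum()

        if low and bnd:
            Z[i] = a * wmean(low) + b * wmean(bnd)
        elif bnd:                            # empty lower approximation
            Z[i] = wmean(bnd)
        elif low:                            # empty boundary region
            Z[i] = wmean(low)
    return Z
```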


4.4 Convergence

In this section, we discuss the convergence of the proposed algorithm along with its proof. Along the same lines as the global convergence property of the FCM algorithm, the global convergence property of RFCMS states that for any data set and initialization parameters, an iteration sequence of the RFCMS algorithm either (i) converges to a local minimum, or (ii) there exists a subsequence of the iteration sequence that converges to a stationary point. Theorems 4.1, 4.2 and 4.3 below show that the necessary and sufficient conditions hold for U, W, and Z respectively.

Theorem 4.1 Let φ : M_kn → ℝ, φ(U) = J_RFCMS(U, W, Z), where W and Z are kept fixed. Then U is a strict local minimum of φ if and only if μ_ij is computed by eq. 4.5.

To obtain the necessary condition, the objective function is minimized subject to the constraint eq. 2.11, Σ_{i=1}^{k} μ_ij = 1, 1 ≤ j ≤ n, using Lagrange multipliers λ_j and the substitutions μ_ij = S_ij² and ω_ir = P_ir². Setting the first order partial derivative of the Lagrangian with respect to S_ij to zero and assuming that S_ij ≠ 0, 1 ≤ j ≤ n, 1 ≤ i ≤ k, we get:

\[
\tau \sum_{r=1}^{d} S_{ij}^{2\tau-2} P_{ir}^{2\eta} d^2_{ijr} + \lambda_j = 0
\]
\[
\text{or} \quad \lambda_j = -\tau \sum_{r=1}^{d} S_{ij}^{2\tau-2} P_{ir}^{2\eta} d^2_{ijr}
\]
\[
\text{or} \quad S_{ij}^{2\tau-2} = \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}}
\]
\[
\text{or} \quad S_{ij}^{2} = \left[ \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}} \right]^{\frac{1}{\tau-1}}
\]
\[
\mu_{ij} = \left[ \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}} \right]^{\frac{1}{\tau-1}} \qquad (4.10)
\]

Using the constraint eq. 2.11 in eq. 4.10, we get:

\[
\sum_{i=1}^{k} \mu_{ij} = \sum_{i=1}^{k} \left[ \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}} \right]^{\frac{1}{\tau-1}} = 1
\]

Substituting the value of λ_j in eq. 4.10, we obtain:

\[
\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} \omega_{ir}^{\eta}\, d^2_{ijr}}{\sum_{r=1}^{d} \omega_{lr}^{\eta}\, d^2_{ljr}} \right]^{\frac{1}{\tau-1}}} \qquad (4.11)
\]

Now, to prove the sufficiency condition, we compute the second order partial derivative:

\[
\frac{\partial^2 J_{RFCMS}}{\partial S_{ij} \partial S_{i'j'}} =
\begin{cases}
2\tau(2\tau-1) \sum_{r=1}^{d} S_{ij}^{2\tau-2} P_{ir}^{2\eta} d^2_{ijr} + 2\lambda_j & \text{if } i = i' \wedge j = j', \\
0 & \text{otherwise}
\end{cases}
\]
\[
= 2\tau(2\tau-1) \sum_{r=1}^{d} \mu_{ij}^{\tau-1} P_{ir}^{2\eta} d^2_{ijr} + 2\lambda_j \qquad (4.12)
\]
\[
= 2\tau(2\tau-1)\, \mu_{ij}^{\tau-1}\, \tilde{d}^2_{ij} + 2\lambda_j \qquad (4.13)
\]

where

\[
\tilde{d}^2_{ij} = \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}.
\]

Substituting the values of μ_ij and λ_j in 4.13, we get:

\[
2\tau(2\tau-1) \left[ \sum_{l=1}^{k} \left( \frac{1}{\tilde{d}^2_{lj}} \right)^{\frac{1}{\tau-1}} \right]^{-(\tau-1)} - 2\tau \left[ \sum_{l=1}^{k} \left( \frac{1}{\tilde{d}^2_{lj}} \right)^{\frac{1}{\tau-1}} \right]^{-(\tau-1)}
\]
\[
= \big( 2\tau(2\tau-1) - 2\tau \big) \left[ \sum_{l=1}^{k} \left( \frac{1}{\tilde{d}^2_{lj}} \right)^{\frac{1}{\tau-1}} \right]^{-(\tau-1)} \qquad (4.14)
\]
\[
= 4\tau(\tau-1) \left[ \sum_{l=1}^{k} \left( \tilde{d}^2_{lj} \right)^{-\frac{1}{\tau-1}} \right]^{-(\tau-1)} \qquad (4.15)
\]

Letting

\[
a_j = \left[ \sum_{l=1}^{k} \left( \tilde{d}^2_{lj} \right)^{-\frac{1}{\tau-1}} \right]^{-(\tau-1)}, \quad 1 \leq j \leq n,
\]
\[
\frac{\partial^2 J_{RFCMS}}{\partial S_{ij} \partial S_{ij}} = \gamma_j \quad \text{where} \quad \gamma_j = 4\tau(\tau-1)\, a_j, \quad 1 \leq j \leq n. \qquad (4.16)
\]

Hence the Hessian matrix of U, which is a diagonal matrix, has n distinct eigenvalues, each of multiplicity k. With the assumptions τ > 1, η > 1 and d̃²_lj > 0 ∀ l, j, it follows that γ_j > 0 ∀ j. Thus the Hessian matrix of U is positive definite and hence the sufficiency condition is proved.

Theorem 4.2 Let ψ : M_kd → ℝ, ψ(W) = J_RFCMS(U, W, Z), where U and Z are kept fixed. Then W is a strict local minimum of ψ if and only if ω_ir is computed by eq. 4.6.

Since ω_ir = P_ir², we get:

\[
\omega_{ir} = \left[ \frac{-\lambda_i}{\eta \sum_{j=1}^{n} S_{ij}^{2\tau} d^2_{ijr}} \right]^{\frac{1}{\eta-1}} \qquad (4.18)
\]

Using the constraint eq. 3.4, we get:

\[
\sum_{r=1}^{d} \omega_{ir} = \sum_{r=1}^{d} \left[ \frac{-\lambda_i}{\eta \sum_{j=1}^{n} S_{ij}^{2\tau} d^2_{ijr}} \right]^{\frac{1}{\eta-1}} = 1
\]

Substituting the value of λ_i in eq. 4.18, we obtain:

\[
\omega_{ir} = \frac{1}{\sum_{l=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijr}}{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijl}} \right]^{\frac{1}{\eta-1}}} \qquad (4.19)
\]

Now, to prove the sufficiency condition, we compute the second order partial derivative:

\[
\frac{\partial^2 J_{RFCMS}}{\partial P_{ir} \partial P_{i'r'}} =
\begin{cases}
2\eta(2\eta-1) \sum_{j=1}^{n} P_{ir}^{2\eta-2} S_{ij}^{2\tau} d^2_{ijr} + 2\lambda_i & \text{if } i = i' \wedge r = r', \\
0 & \text{otherwise}
\end{cases}
\]
\[
= 2\eta(2\eta-1) \sum_{j=1}^{n} \omega_{ir}^{\eta-1} S_{ij}^{2\tau} d^2_{ijr} + 2\lambda_i \qquad (4.20)
\]
\[
= 2\eta(2\eta-1)\, \omega_{ir}^{\eta-1}\, \hat{d}^2_{ir} + 2\lambda_i \qquad (4.22)
\]

where

\[
\hat{d}^2_{ir} = \sum_{j=1}^{n} S_{ij}^{2\tau} d^2_{ijr} \qquad (4.23)
\]

Substituting the values of ω_ir and λ_i in 4.22, we get:

\[
2\eta(2\eta-1) \left[ \sum_{l=1}^{d} \left( \frac{1}{\hat{d}^2_{il}} \right)^{\frac{1}{\eta-1}} \right]^{-(\eta-1)} - 2\eta \left[ \sum_{l=1}^{d} \left( \frac{1}{\hat{d}^2_{il}} \right)^{\frac{1}{\eta-1}} \right]^{-(\eta-1)}
\]
\[
= \big( 2\eta(2\eta-1) - 2\eta \big) \left[ \sum_{l=1}^{d} \left( \frac{1}{\hat{d}^2_{il}} \right)^{\frac{1}{\eta-1}} \right]^{-(\eta-1)}
\]
\[
= 4\eta(\eta-1) \left[ \sum_{l=1}^{d} \left( \hat{d}^2_{il} \right)^{-\frac{1}{\eta-1}} \right]^{-(\eta-1)}
\]

Letting

\[
b_i = \left[ \sum_{l=1}^{d} \left( \hat{d}^2_{il} \right)^{-\frac{1}{\eta-1}} \right]^{-(\eta-1)}, \quad 1 \leq i \leq k,
\]
\[
\frac{\partial^2 J_{RFCMS}}{\partial P_{ir} \partial P_{ir}} = \gamma_i \quad \text{where} \quad \gamma_i = 4\eta(\eta-1)\, b_i, \quad 1 \leq i \leq k. \qquad (4.24)
\]

Hence the Hessian matrix of W, which is a diagonal matrix, has k distinct eigenvalues, each of multiplicity d. With the assumptions τ > 1, η > 1 and d̂²_il > 0 ∀ i, l, it follows that γ_i > 0 ∀ i. Thus the Hessian matrix of W is positive definite and hence the sufficiency condition is proved.

Theorem 4.3 Let ζ : ℝ^{k×d} → ℝ, ζ(Z) = J_RFCMS(U, W, Z), where U and W are kept fixed. Then Z is a local minimum of ζ only if the cluster centers z_ir are computed by eq. 4.7.

The component of the cluster center contributed by the boundary region is

\[
z_{ir}^{upper\ approx} = \frac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}} \qquad (4.27)
\]

Since an object cannot belong to both the lower approximation and the boundary region, the convergence of a cluster center depends on both the lower approximation and the upper approximation components of the cluster center. Eqs. 4.26 and 4.27 can be written as:

\[
|\underline{B}U_i|\; z_{ir}^{lower\ approx} = \sum_{x_j \in \underline{B}U_i} x_{jr} \qquad (4.28)
\]
\[
\Big( \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau} \Big)\, z_{ir}^{upper\ approx} = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr} \qquad (4.29)
\]

Eqs. 4.28 and 4.29 represent a linear set of equations. In order to prove convergence, we treat eqs. 4.26 and 4.27 as Gauss-Seidel iterations for solving this set of equations, with μ_ij considered to be fixed. The sufficient condition of the Gauss-Seidel algorithm for assuring convergence is that the matrix representing each iteration is diagonally dominant. The matrices corresponding to eqs. 4.26 and 4.27 are:

\[
A =
\begin{pmatrix}
|\underline{B}U_1| & 0 & \dots & 0 \\
0 & |\underline{B}U_2| & \dots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \dots & |\underline{B}U_k|
\end{pmatrix}
\qquad
B =
\begin{pmatrix}
\xi_1 & 0 & \dots & 0 \\
0 & \xi_2 & \dots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \dots & \xi_k
\end{pmatrix}
\]

where

\[
\xi_i = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}.
\]

The sufficient condition for the matrices A and B to be diagonally dominant is |B̲U_i| > 0 and ξ_i > 0 respectively.

Also, following the convergence theorem proposed by Bezdek et al. for FCM [Bezdek et al., 1987], the analysis of [Maji and Pal, 2007a], and the convergence analysis of Yan's fuzzy curve tracing algorithm [Yan, 2004], the matrices A and B are the Hessians of the objective components A and B w.r.t. z_ir lower approx and z_ir upper approx respectively, with all positive eigenvalues, which proves that these matrices are diagonally dominant. Thus, by Theorems 4.1, 4.2 and 4.3, the proposed algorithm RFCMS converges, at least along a subsequence, to a local optimum solution.

4.5 Experiments

In this section, we present the comparative performance of the proposed subspace clustering algorithm RFCMS with FCM, RCM, RFCM, DOC, and PROCLUS, using UCI data sets [uci, ]. While FCM, RCM and RFCM are full dimensional clustering algorithms, PROCLUS and DOC are subspace clustering algorithms tailored for high-dimensional applications. We used the MATLAB version of FCM, the OpenSubspace Weka [osw, ] implementations of DOC and PROCLUS, and implemented the RCM, RFCM, and RFCMS algorithms in MATLAB. In all the experiments with the FCM, RCM, RFCM and RFCMS algorithms, the stopping criterion parameter was set to 10^{-3} and the maximum number of iterations was restricted to 100. However, in all the experiments we conducted, the algorithms always converged before the limit on the number of iterations was reached. The normed difference between successive iterations of the matrix Z is compared with this threshold parameter to define the convergence criterion. Based on experimentation, we set the values of the parameters a = 0.85 and b = 0.25 for the RCM, RFCM and RFCMS algorithms. The parameters for the DOC algorithm were used as mentioned in [Procopiuc et al., 2002]. The number of clusters k was set equal to the number of classes given in each data set, as indicated in Table 4.1. We have evaluated the effect of the fuzzification parameters τ and η of the RFCMS algorithm and of the fuzzification parameter m of the FCM and RFCM algorithms. We evaluated the performance of all the algorithms w.r.t. quality and validity measures. The sets of relevant dimensions computed by each of the subspace clustering algorithms RFCMS, DOC and PROCLUS are shown for all the data sets.

Data Sets       Instances  Attributes  Classes
Alzheimer       45         8           3
Breast Cancer   569        30          2
Spambase        4601       57          2
Wine            178        13          3
Diabetes        768        8           2
Magic           19020      10          2

Table 4.1: Data Sets

4.5.1 Data Sets

We experimented with the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets from the UCI data repository [uci, ]. These data sets are heterogeneous in terms of size, number of clusters, and distribution of classes, and have no missing values. General characteristics of the data sets are summarized in Table 4.1.

4.5.2 Effect of Fuzzification Parameters

For the RFCMS algorithm, the best combination of the fuzzification parameters τ and η was determined by varying the values of τ and η in the range 2-10 independently of each other. This was done for each data set. Similarly, the best value of the fuzzification parameter m for the FCM and RFCM algorithms was determined by varying the value of m. Table 4.2 shows the complete list of fuzzification parameters we found for different data sets as a result of fine-tuning.
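The tuning described above amounts to a grid search over the two fuzzifiers. A minimal sketch is given below; rfcms and accuracy are placeholders for the clustering routine of section 4.3.3 and a validity measure of section 4.5.3, not actual library functions.

```python
import itertools

def tune_fuzzifiers(X, labels, rfcms, accuracy, grid=range(2, 11)):
    """Grid-search the fuzzifiers tau, eta in 2..10 as in section 4.5.2 (sketch)."""
    best = (None, None, -1.0)
    for tau, eta in itertools.product(grid, grid):
        pred = rfcms(X, tau=tau, eta=eta)    # placeholder: one clustering run
        score = accuracy(labels, pred)       # placeholder: validity measure
        if score > best[2]:
            best = (tau, eta, score)
    return best                              # (best tau, best eta, best score)
```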


Data Sets       RFCMS        FCM   RFCM
                τ      η     m     m
Alzheimer       2      2     6     4
Breast Cancer   4      10    6     6
Spambase        3      10    10    6
Wine            3      9     9     2
Diabetes        2      2     2     2
Magic           2      2     2     2

Table 4.2: Fuzzifier Values: RFCMS, FCM, and RFCM

    Data Sets RFCMS FCM RCM RFCM PROCLUS DOC

Alzheimer 0.7556 0.8000 0.6889 0.7333 0.0750 0.2813

    Breast Cancer 0.9192 0.8282 0.8541 0.8682 0.8336 0.0887

    Spambase 0.7457 0.6568 0.6433 0.6568 0.5885 0.7062

    Wine 0.9101 0.7079 0.6854 0.6966 0.5427 0.2743

    Diabetes 0.6510 0.6589 0.6589 0.6589 0.5248 0.6910

    Magic 0.6931 0.6961 0.6961 0.7294 0.2813 0.4817

    Table 4.3: Accuracy: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

    4.5.3 Cluster Validity

Table 4.3 shows the accuracy results for all the algorithms and data sets. The RFCMS algorithm has the highest accuracy for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest accuracy for the Alzheimer data set, the RFCM algorithm achieves the highest accuracy for the Magic data set, and the DOC algorithm achieves the highest accuracy for the Diabetes data set. In Tables 4.4, 4.5, 4.6, and 4.7, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms. The RFCMS algorithm achieves the highest recall and specificity for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest recall and specificity for the Alzheimer data set, the RFCM algorithm achieves the highest recall and specificity for the Magic data set, and the DOC algorithm achieves the highest recall and specificity for the Diabetes data set.

Data Sets RFCMS FCM RCM RFCM PROCLUS DOC
Alzheimer 0.7470 0.7976 0.6921 0.7367 0.0953 0.4193

    Breast Cancer 0.8944 0.8241 0.8052 0.8241 0.8906 0.1527

    Spambase 0.7740 0.5798 0.5550 0.5767 0.4485 0.6543

    Wine 0.9249 0.7030 0.6765 0.6904 0.5488 0.2702

    Diabetes 0.5000 0.5943 0.5943 0.5943 0.4902 0.6488

    Magic 0.6236 0.5722 0.5722 0.7982 0.4913 0.3787

    Table 4.4: Recall: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

    Data Sets RFCMS FCM RCM RFCM PROCLUS DOC

Alzheimer 0.8769 0.9003 0.8465 0.8684 0.5155 0.4193

    Breast Cancer 0.8949 0.8241 0.8052 0.8241 0.8906 0.1527

    Spambase 0.7740 0.5798 0.5550 0.5767 0.4485 0.6543

    Wine 0.9559 0.8565 0.8446 0.8508 0.8380 0.6327

    Diabetes 0.5000 0.5943 0.5943 0.5943 0.4902 0.6488

    Magic 0.6236 0.5722 0.5722 0.7982 0.4913 0.3787

    Table 4.5: Specificity: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

The RFCMS algorithm has the highest precision for the Breast Cancer, Spambase, Diabetes, Magic and Wine data sets, while the FCM algorithm achieves the highest precision for the Alzheimer data set. The RFCMS algorithm achieves the highest F1-measure for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest F1-measure for the Alzheimer data set, and the RFCM algorithm achieves the highest F1-measure for the Magic data set. The FCM, RCM and RFCM algorithms achieve the highest F1-measure for the Diabetes data set. In summary, it can be seen that no algorithm is a clear winner w.r.t. all measures for all the data sets.
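For reference, a common way to compute such external validity measures is to map each cluster to its majority class and then evaluate the resulting labelling. The sketch below shows this for accuracy; the majority-class mapping is an assumption about the scoring procedure, since the exact evaluation code is not reproduced in this chapter.

```python
import numpy as np

def majority_map_accuracy(true, pred):
    """Map each cluster to its majority class, then report accuracy (sketch)."""
    true, pred = np.asarray(true), np.asarray(pred)
    mapped = np.empty_like(true)
    for c in np.unique(pred):
        members = (pred == c)
        values, counts = np.unique(true[members], return_counts=True)
        mapped[members] = values[np.argmax(counts)]   # majority class of cluster c
    return float(np.mean(mapped == true))
```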


Data Sets RFCMS FCM RCM RFCM PROCLUS DOC
Alzheimer 0.7407 0.9716 0.7008 0.7463 0.0769 0.1806

    Breast Cancer 0.9371 0.9104 0.9026 0.9104 0.7825 0.1332

    Spambase 0.7677 0.6810 0.6982 0.6938 0.4994 0.5981

    Wine 0.9202 0.7301 0.7084 0.7211 0.5104 0.1778

    Diabetes 0.6510 0.6120 0.6120 0.6120 0.4897 0.6028

    Magic 0.7982 0.7870 0.7870 0.7958 0.1806 0.4054

    Table 4.6: Precision: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

    Data Sets RFCMS FCM RCM RFCM PROCLUS DOC

Alzheimer 0.7439 0.7946 0.6964 0.7415 0.0851 0.2525

    Breast Cancer 0.9153 0.8651 0.8511 0.8651 0.8330 0.1423

    Spambase 0.7708 0.6263 0.6184 0.6299 0.4726 0.6250

    Wine 0.9225 0.7163 0.6921 0.7054 0.5289 0.2145

    Diabetes 0.5656 0.7062 0.7062 0.7062 0.4899 0.6249

    Magic 0.7002 0.6626 0.6626 0.7970 0.2641 0.3916

    Table 4.7: F1-measure: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

4.5.4 Subspaces Generated

The proposed algorithm RFCMS is an objective function based subspace clustering algorithm. For such algorithms, the fewer the number of dimensions, the smaller the error or scatter among the objects of a cluster. We have compared the RFCMS, DOC and PROCLUS algorithms in terms of the number of dimensions found.

Tables 4.8, 4.9, 4.10, 4.11, 4.12 and 4.13 show the sets of dimensions found for the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets by the RFCMS, PROCLUS and DOC algorithms. For all the data sets mentioned above, the RFCMS algorithm finds subspaces with fewer dimensions.
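Since RFCMS produces soft dimension weights rather than a hard subspace, a reported set of dimensions can be obtained by thresholding the weight matrix W. The sketch below keeps, for each cluster, the dimensions whose weight exceeds the uniform value 1/d; this is one plausible convention and not necessarily the exact rule used to produce Tables 4.8-4.13.

```python
import numpy as np

def relevant_dimensions(W, factor=1.0):
    """Keep dimensions whose weight exceeds factor * (1/d) in each cluster (sketch)."""
    k, d = W.shape
    cutoff = factor / d
    return [list(np.flatnonzero(W[i] > cutoff) + 1) for i in range(k)]  # 1-based dims
```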


Cluster No. RFCMS PROCLUS DOC

    1 4 4,6,7 1,2,3,4,5,6,7

    2 4, 5, 7 4,5,6 1,2,3,4,5,6,7

    3 4, 5, 6 4,5,6 1,2,3,4,5,6,7

Table 4.8: Dimensions: RFCMS, PROCLUS and DOC for Alzheimer

    Cluster No. RFCMS PROCLUS DOC

    1 10, 15, 20 1-3, 5-13 1-3, 5-13

    15-24, 26-30 15-23, 25-30

    2 10, 15, 20 1,2 1-3,5-13

    15-23, 25-30

    Table 4.9: Dimensions: RFCMS, PROCLUS and DOC for Breast Cancer

    Cluster No. RFCMS PROCLUS DOC

    1 28, 29, 32, 34, 38, 44, 47 1-54 1-56

    2 45, 46, 47, 51, 52 40, 49 1-56

    Table 4.10: Dimensions: RFCMS, PROCLUS and DOC for Spambase

    Cluster No. RFCMS PROCLUS DOC

    1 3, 8, 11 1,2,3,6,7,8,9,11,12 1-12

    2 3, 8, 11 1,3,6,7,8,9,11,12 1-12

    3 3, 7, 8, 9, 11 1,2 1-12

    Table 4.11: Dimensions: RFCMS, PROCLUS and DOC for Wine

    Cluster No. RFCMS PROCLUS DOC

    1 1,6,7 1,6-8 1, 6-8

    2 1,7 1,4,5,7 1, 6-8

    Table 4.12: Dimensions: RFCMS, PROCLUS and DOC for Diabetes


Cluster No. RFCMS PROCLUS DOC

    1 4,5 3,4,5,8,9 2-6,8,9

    2 4,5 1,2,3,4,5 1-5,8

    Table 4.13: Dimensions: RFCMS, PROCLUS and DOC for Magic

4.5.5 Experiments on Biological Datasets

In this section, we present the comparative performance of the proposed projected clustering algorithm RFCMS with the EWKM, FWKM and LAC algorithms on biological data sets. The RFCMS, EWKM, FWKM and LAC algorithms are all subspace clustering algorithms tailored for high-dimensional applications. We used the Weka implementations of EWKM, FWKM and LAC [Peng and Zhang, 2011]. The parameters for the EWKM, FWKM and LAC algorithms were used as mentioned in [Jing et al., 2007], [Jing et al., 2005] and [Domeniconi et al., 2007]. We have evaluated the effect of the fuzzification parameters τ and η of the RFCMS algorithm. We evaluated the performance of all the algorithms w.r.t. validity measures. The sets of relevant dimensions computed by the RFCMS algorithm are shown for all the data sets.

4.5.5.1 Data Sets

We experimented with the Colon, Embryonal Tumours, Prostate and Leukemia data sets [bio, ]. These data sets are heterogeneous in terms of size and have no missing values. We have chosen data sets which are pre-classified, as this helps in evaluating the results of applying clustering algorithms. General characteristics of the data sets are summarized in Table 4.14.

4.5.5.2 Effect of Fuzzification Parameters

For the RFCMS algorithm, the best combination of the fuzzification parameters τ and η was determined by varying the values of τ and η in the range 2-5 independently of each other. This was done for each data set. Table 4.15 shows the complete list of fuzzification parameters we found for different data sets as a result of fine-tuning.

Data Sets             Instances  Attributes  Classes
Colon Cancer          62         2001        2
Embryonal Tumours     60         7130        2
Leukemia              38         7130        2
Prostate              21         12601       2

Table 4.14: Data Sets

Data Sets             τ    η
Colon Cancer          2    4
Embryonal Tumours     3    5
Leukemia              3    4
Prostate              2    2

Table 4.15: Fuzzifier Values

4.5.5.3 Cluster Validity

Table 4.16 shows the accuracy results for all the algorithms and data sets. The RFCMS algorithm achieves the highest accuracy for the Colon and Leukemia data sets. The EWKM, FWKM and LAC algorithms achieve the highest accuracy for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest accuracy for the Prostate data set. However, the accuracy of RFCMS is comparable to that of the FWKM algorithm for both the Embryonal Tumours and Prostate data sets.

Data Sets             RFCMS    EWKM    FWKM    LAC
Colon Cancer          0.58065  0.5322  0.5322  0.5438
Embryonal Tumours     0.5833   0.6666  0.6666  0.6666
Leukemia              0.8421   0.5526  0.5526  0.5263
Prostate              0.6195   0.6190  0.6666  0.6190

Table 4.16: Accuracy: RFCMS, EWKM, FWKM and LAC


Data Sets RFCMS EWKM FWKM LAC
Colon Cancer 0.5318 0.53636 0.53636 0.51364
Embryonal Tumours 0.63553 0.47619 0.47619 0.47619

    Leukemia 0.83502 0.4697 0.4697 0.45118

    Prostate 0.5240 0.47596 0.41346 0.47596

    Table 4.17: Specificity: RFCMS, EWKM, FWKM and LAC

In Tables 4.17, 4.18, 4.19, and 4.20, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms.

The RFCMS algorithm achieves the highest recall for the Leukemia data set, the EWKM, FWKM and LAC algorithms achieve the highest recall for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest recall for the Prostate data set. The RFCMS algorithm achieves the highest specificity for the Embryonal Tumours, Prostate and Leukemia data sets. The RFCMS algorithm has the highest precision for the Colon and Leukemia data sets, the EWKM, FWKM and LAC algorithms achieve the highest precision for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest precision for the Prostate data set. On the F1-measure, the RFCMS algorithm scores highest for the Leukemia data set, while the LAC algorithm scores highest for the Colon data set, the FWKM algorithm for the Prostate data set, and the EWKM, FWKM and LAC algorithms for the Embryonal Tumours data set.

4.5.5.4 Subspaces Generated

Figures 4.1 to 4.12 show the sets of dimensions found for the Colon, Embryonal Tumours, Prostate and Leukemia data sets by the RFCMS, EWKM and LAC algorithms. The RFCMS algorithm finds fewer dimensions than the EWKM and LAC algorithms.


Data Sets RFCMS EWKM FWKM LAC
Colon Cancer 0.53182 0.5322 0.5322 0.5483
Embryonal Tumours 0.63553 0.6666 0.6666 0.6666

    Leukemia 0.83502 0.5526 0.5526 0.5263

    Prostate 0.52404 0.6190 0.6666 0.6190

    Table 4.18: Recall: RFCMS, EWKM, FWKM and LAC

    Data Sets RFCMS EWKM FWKM LAC

    Colon Cancer 0.5333 0.5057 0.5057 0.5288

Embryonal Tumours 0.63415 0.7796 0.7796 0.7796

    Leukemia 0.8061 0.5642 0.5642 0.5263

    Prostate 0.5657 0.5814 0.6666 0.5814

    Table 4.19: Precision: RFCMS, EWKM, FWKM and LAC

For the Embryonal Tumours data set, the EWKM and LAC algorithms fail to distinguish the relevance of the dimensions for cluster 2, whereas the RFCMS algorithm distinguishes the relevant and non-relevant dimensions for cluster 2. For the Prostate data set, the RFCMS algorithm finds fewer dimensions than the EWKM and LAC algorithms. For the Leukemia data set, the results of the RFCMS, EWKM, and LAC algorithms are comparable.

    Data Sets RFCMS EWKM FWKM LAC

    Colon Cancer 0.5325 0.5322 0.5322 0.5483

Embryonal Tumours 0.63415 0.6666 0.6666 0.6666

    Leukemia 0.83502 0.5526 0.5526 0.5263

    Prostate 0.5240 0.6190 0.6666 0.6190

    Table 4.20: F1-measure: RFCMS, EWKM, FWKM and LAC


Figure 4.1: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.2: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.3: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.4: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.5: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.6: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.7: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.8: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.9: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.10: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

Figure 4.11: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

Figure 4.12: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

4.6 Summary

In this chapter, we have proposed a novel subspace clustering algorithm which employs a combination of rough set and fuzzy set theory. The Rough Fuzzy c-Means Subspace (RFCMS) algorithm is an extension of the rough fuzzy c-means algorithm which incorporates fuzzy memberships of data points and of dimensions in each cluster. In each iteration, the cluster centers are updated and each data point is assigned to the lower approximation or the upper approximation of a cluster. This process is repeated until the convergence criterion is met. We have also discussed the convergence of the proposed algorithm. The results of applying the proposed approach to UCI data sets show that the proposed algorithm scores over its competitors in terms of several validity measures. The proposed algorithm can be used in conjunction with density based algorithms to automatically detect the number of clusters.
