
Chapter 4

Rough Fuzzy c-Means Subspace Clustering

In this chapter, we propose a novel adaptation of the rough fuzzy c-means algorithm for high dimensional data by modifying its objective function. The proposed algorithm automatically detects the relevant cluster dimensions of a high dimensional data set. Since the weights assigned to attributes are specific to each cluster, an efficient subspace clustering scheme is generated. We also discuss the convergence of the proposed algorithm. The remainder of this chapter is organised as follows. Section 4.1 introduces rough set theory. Section 4.2, on related work, describes how classical clustering methods have been adapted to suit the requirements of high dimensional data. In section 4.3, we extend the rough fuzzy c-means algorithm for subspace clustering in the form of the Rough Fuzzy c-Means Subspace (RFCMS) algorithm. Section 4.4 discusses the convergence of the proposed algorithm. Section 4.5 presents the results of applying the RFCMS algorithm to several UCI data sets, and finally section 4.6 summarizes the chapter.


4.1 Introduction

    Pawlak introduced rough set theory as a new framework for dealing with

    imperfect knowledge [Pawlak, 1991]. Rough set theory provides a method-

    ology for addressing the problem of relevant feature selection, by selecting

    a set of information rich features from a data set that retains the seman-

    tics of the original data and requires no human inputs unlike statistical ap-

    proaches [Jensen, 1999]. It is often possible to arrive at a minimal feature set

    (called reduct in rough set theory) that can be used for data analysis tasks

    such as classification and clustering [Lingras and West, 2004], [Mitra et al.,

    2006]. When feature selection approaches based on rough sets are combined

    with an intelligent classification system like those based on fuzzy systems or

    neural networks, they retain the descriptive power of the overall classifier and

    result in simplified system structure which enhances the understandability

    of the resultant system [Shen, 2007].

Following Rutkowski, we describe the notion of rough sets used to model uncertainty in information systems [Rutkowski, 2008]. Formally, an information system is a pair (U, A), where U is a non-empty finite set of objects and A is a non-empty finite set of attributes such that each attribute a has an associated value set V_a, i.e. a : U → V_a for every a ∈ A. A Decision System DS is defined as a pair (U, A ∪ {d}), where d ∉ A is called the decision attribute and the elements of A are called condition attributes. For an attribute set B ⊆ A, the set of objects in the information system indiscernible w.r.t. B is described by the indiscernibility relation INDIS(B) defined as: INDIS(B) = {(x1, x2) ∈ U² | a(x1) = a(x2) ∀ a ∈ B}. The objects x1 and x2 are indiscernible from each other by the attributes from B if (x1, x2) ∈ INDIS(B). The equivalence classes of the B-indiscernibility relation are denoted by [x]_B. If X ⊆ U, then X can be approximated using B by constructing three approximations, namely, the B-lower approximation B̲X = {x | [x]_B ⊆ X}, the B-upper approximation B̄X = {x | [x]_B ∩ X ≠ ∅},


and the B-boundary region B̄X − B̲X of X. Evidently, the boundary region consists of all objects in the upper approximation but not in the lower approximation of X. Bazan et al. discuss various techniques for rough set reduct generation and argue that classical reducts, being static, may not

    reduct generation and argue that the classical reducts being static may not

    be stable in randomly chosen samples of a given decision table [Bazan et al.,

    2000]. To deal with such situations they focus on reducts that are stable over

    different subsets of samples chosen from a given decision table. Such reducts

    are called dynamic reducts. They compute reducts using an order based

    genetic algorithm and subsequently extract dynamic reducts which are used

    to generate classification rules. Each rule set is associated with a measure

    called the rule strength which is used later to resolve conflicts when several

    rules are applicable. Slezak generalized the concept of reduct by introduc-

    ing the notion of association reducts corresponding to both association rules

    and rough set reducts [Slezak, 2005]. He defined association reduct as a pair

    (A, B) of disjoint subsets of attributes such that all data supported patterns

    involving A approximately determine those involving B. He developed an

    information theory based algorithm to compute association reducts. As the

    algorithm needs to examine all association reducts, it has exponential time

    requirements. In order to alleviate this hardship, Slezak targeted significantly

    smaller ensembles of dependencies providing reasonably rich knowledge, and

    developed an order based genetic algorithm to achieve this [Slezak, 2009].

    Shen and Jensen proposed the concept of retainer as an approximation of a

    reduct [Richard and Qiang, 2001]. The authors suggest a heuristic to com-

    pute the retainer and demonstrate its usefulness for the classification task.

For clustering a textual database consisting of N documents with a vocabulary of size V, Li et al. developed an algorithm based on approximate reducts that works in time O(VN) [Li et al., 2006].
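To make the approximation operators defined earlier in this section concrete, the following small sketch computes B-indiscernibility classes and the lower and upper approximations of a target set. The toy information system, the attribute subset B, and the set X below are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict

# Toy information system: objects described by condition attributes a1, a2.
objects = {
    "x1": {"a1": 0, "a2": "low"},
    "x2": {"a1": 0, "a2": "low"},
    "x3": {"a1": 1, "a2": "high"},
    "x4": {"a1": 1, "a2": "low"},
}

def equivalence_classes(objs, B):
    """Group objects that are indiscernible w.r.t. the attribute subset B."""
    classes = defaultdict(set)
    for name, attrs in objs.items():
        key = tuple(attrs[a] for a in B)
        classes[key].add(name)
    return list(classes.values())

def approximations(objs, B, X):
    """Return the (lower, upper) approximations of X using attributes B."""
    lower, upper = set(), set()
    for eq in equivalence_classes(objs, B):
        if eq <= X:        # [x]_B is contained in X
            lower |= eq
        if eq & X:         # [x]_B intersects X
            upper |= eq
    return lower, upper

X = {"x1", "x3"}                        # target concept
low, up = approximations(objects, ["a1", "a2"], X)
print(low, up, up - low)                # lower, upper, boundary region
```

For this toy system the lower approximation, upper approximation and boundary region are {x3}, {x1, x2, x3} and {x1, x2} respectively.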


4.2 Related Work

    Rough sets have been widely used for classification and clustering [Lingras

    and West, 2004], [Mitra et al., 2006], [Pawlak, 1991]. The classical k-means

    algorithm has been extended to rough k-means algorithm by Lingras et al.

    [Lingras and West, 2004]. In rough k-means algorithm, a cluster in the lower

    approximation, called the core cluster, is surrounded by a buffer or boundary

    set having objects with unclear membership status [Lingras and West, 2004].

A data point in the lower approximation surely belongs to a cluster, whereas the membership of an object that lies only in the upper approximation is uncertain. The signature

    of each cluster is represented by its center, lower and upper approximation.

    If lower and upper approximations are equal then buffer set is empty and

    the data objects are crisply assigned to the cluster. The rough k-means

    algorithm follows an iterative process, wherein cluster centers are updated

    until convergence criterion is met. Asharaf et al. have extended rough k-

    means algorithm in such a way that it does not require prior specification of

the number of clusters [Asharaf and Murty, 2004]. They have proposed a

    two phase algorithm. It identifies a set of leaders which act as prototypes in

    the first phase. Subsequently a set of supporting leaders are identified, which

    can act as leaders, provided they yield better partitioning. The evolutionary

    rough k-medoids algorithm [Peters et al., 2008] is based on the family of

    rough clustering algorithms and the classical k-medoids algorithm [Kaufman

and Rousseeuw, 1990]. Malyszko and Stepaniuk have extended rough k-means

    clustering to rough entropy clustering [Malyszko and Stepaniuk, 2009]. It is

    an iterative process: firstly a predefined number of weight pairs are selected,

    for each weight pair a new offspring clustering is determined, rough entropy

    is computed, and the partition which gives highest rough entropy is selected.

    Liu et al. have proposed a feature selection method ISODATA-RFE for

high dimensional gene expression datasets [Liu et al., 2012]. The Bhattacharyya distance is used to rank the features of the training set, and features with a low Bhattacharyya distance are removed from the feature set. For separating different classes, the fuzzy ISODATA algorithm is used to calculate a sensitivity index of

    each feature. A recursive feature elimination method is applied to feature

    set for removing unimportant features. It generates multiple nested candi-

    date feature subsets. Finally, the feature subset with least error is selected

    for use in classification and clustering algorithms. Own and Abraham have

proposed a new weighted rough set framework based classification scheme for neonatal jaundice [Own and Abraham, 2012]. The weighted information table

    is built by applying class equal sample weighting. While samples in ma-

    jority class have smaller weight, the samples in minority class have larger

    weight. A weighted reduction algorithm MLEM2 exploits the significance

    of the attributes to extract a set of diagnosis rules from decision system

    of NeoNatal Jaundice database. Deng et al. have proposed an enhanced

    entropy weighting subspace clustering algorithm for high dimensional gene

    expression data [Deng et al., 2011]. Its objective function integrates the

    fuzzy within cluster compactness and between cluster information simulta-

    neously. [Cordeiro de Amorim and Mirkin, 2012] have extended the weighted

K-means algorithm proposed by Huang et al. They have replaced the Euclidean distance metric by the Minkowski metric for measuring distances, as the Euclidean

    distance cannot capture the relationship between scales of the feature values

    and feature weights. Bai et al. have proposed a novel weighting algorithm

    for categorical data [Bai et al., 2011]. The algorithm computes two weights

    for each dimension in each cluster. These weight values are used to identify

    the subsets of attributes which can categorize different clusters.

    Rough set theory has been applied in conjunction with fuzzy set theory in

    several domains such as fuzzy rule extraction, reasoning with uncertainty,

    fuzzy modelling, and feature selection [Maji and Pal, 2010]. The classical

    fuzzy c-means algorithm has been used in conjunction with rough sets to

    develop rough fuzzy c-means (RFCM) algorithm [Mitra and Banka, 2007].

    The concept of membership in FCM enables efficient handling of overlapping


partitions, while rough sets are aimed at modelling uncertainty in data.

    Such hybrid techniques provide a strong paradigm for uncertainty handling in

    various application domains such as pattern recognition, image processing,

    mining stock prices, vocabulary for information retrieval, fuzzy clustering,

    dimensionality reduction, data mining and knowledge discovery [Maji and

    Paul, 2011], [Maji and Pal, 2010]. Maji and Pal proposed an algorithm

    RFCMdd for selecting the most informative bio-basis (medoids), where each

    partition is represented by a medoid computed as weighted average of the

crisp lower approximation and fuzzy boundary [Maji and Pal, 2007b]. Maji

    introduced a quantitative measure of similarity among genes based on fuzzy

    rough sets to develop fuzzy-rough supervised attribute clustering (FRSAC)

    algorithm [Maji, 2011].

4.3 Rough Fuzzy c-Means Subspace Clustering

In this section, we propose an algorithm based on the rough fuzzy c-means algorithm for subspace clustering.

4.3.1 Rough c-Means

The rough c-means algorithm [Lingras and West, 2004] extends the concept of c-means by considering each cluster as an interval or rough set, where the lower and upper approximations B̲X and B̄X are characteristics of the rough set X. A rough set has the following properties:

(i) An object x_j can belong to at most one lower approximation.

(ii) If x_j ∈ B̲X of cluster X, then x_j ∈ B̄X also.

(iii) If x_j does not belong to any lower approximation, then it belongs to two or more upper approximations, i.e. overlap between clusters is possible.


The iterative steps of the rough c-means algorithm are as follows:

Algorithm 2 Rough c-Means Algorithm

1. Choose initial means z_i, 1 ≤ i ≤ k, for the k clusters.

2. Assign each data point x_j, 1 ≤ j ≤ n, to the lower approximation B̲U_i or to the upper approximations B̄U_i, B̄U_{i'} of the cluster pair U_i, U_{i'} by computing the difference d_{i'j} − d_{ij}, where d_{ij} is the distance of the j-th data point x_j from the i-th centroid z_i of cluster U_i.

3. Let d_{ij} be the minimum and d_{i'j} the next to minimum. If d_{i'j} − d_{ij} is less than some threshold, then x_j ∈ B̄U_i and x_j ∈ B̄U_{i'} and x_j cannot be a member of any lower approximation; otherwise x_j ∈ B̲U_i, where the distance d_{ij} is minimum over the k clusters.

4. Compute the new mean z_i for each cluster as

\[
z_i =
\begin{cases}
\dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i = \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
w_{low}\, \dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} + w_{up}\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i \neq \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
\dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} & \text{otherwise,}
\end{cases}
\]

where the parameters w_low and w_up represent the relative importance of the lower and upper approximations respectively. Thus, RCM generates three types of clusters, with objects (i) in both the lower and upper approximations, (ii) only in the lower approximation, and (iii) only in the upper approximation.

5. Repeat Steps 2-4 until convergence, i.e., until there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: w_up = 1 − w_low, 0.5 < w_low < 1, and 0 < threshold < 0.5.
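The assignment and update steps above can be sketched in a few lines. The sketch below is illustrative only: it assumes Euclidean distances and the parameter ranges given in the note, and it does not reproduce the implementation used later in the experiments.

```python
import numpy as np

def rough_cmeans_step(X, Z, threshold=0.2, w_low=0.7):
    """One assignment + centroid update of rough c-means (sketch).

    X: (n, d) data matrix, Z: (k, d) current means."""
    n, k = X.shape[0], Z.shape[0]
    w_up = 1.0 - w_low
    dist = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)  # d_ij
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for j in range(n):
        order = np.argsort(dist[j])
        i, i2 = order[0], order[1]          # nearest and second-nearest cluster
        upper[i].add(j)
        if dist[j, i2] - dist[j, i] < threshold:
            upper[i2].add(j)                # ambiguous: only upper approximations
        else:
            lower[i].add(j)                 # confident: lower approximation
    Z_new = np.empty_like(Z)
    for i in range(k):
        low, bnd = list(lower[i]), list(upper[i] - lower[i])
        if low and bnd:
            Z_new[i] = w_low * X[low].mean(axis=0) + w_up * X[bnd].mean(axis=0)
        elif bnd:                            # empty lower approximation
            Z_new[i] = X[bnd].mean(axis=0)
        elif low:                            # empty boundary region
            Z_new[i] = X[low].mean(axis=0)
        else:
            Z_new[i] = Z[i]                  # currently empty cluster: keep old mean
    return Z_new, lower, upper
```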


4.3.2 Rough-Fuzzy c-Means

The rough-fuzzy c-means algorithm [Mitra et al., 2006] incorporates a weighted distance in terms of the fuzzy membership value u_ij of a data point x_j to a cluster mean z_i, instead of the absolute individual distance d_ij of the j-th data point from the i-th cluster center. The iterative steps of the algorithm are as follows:

Algorithm 3 Rough Fuzzy c-Means Algorithm

1. Choose initial means z_i, 1 ≤ i ≤ k, for the k clusters.

2. Compute u_ij by eq. 3.9 for the k clusters and n data objects.

3. Assign each data point x_j to the lower approximation B̲U_i or to the upper approximations B̄U_i, B̄U_{i'} of the cluster pair U_i, U_{i'} by computing the difference in its memberships u_ij − u_{i'j}.

4. Let u_ij be the maximum and u_{i'j} the next to maximum. If u_ij − u_{i'j} is less than some threshold, then x_j ∈ B̄U_i and x_j ∈ B̄U_{i'} and x_j cannot be a member of any lower approximation; otherwise x_j ∈ B̲U_i, where the membership u_ij is maximum over the k clusters.

5. Compute the new mean z_i for each cluster as

\[
z_i =
\begin{cases}
\dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}\, x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}} & \text{if } \underline{B}U_i = \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
w_{low}\, \dfrac{\sum_{x_j \in \underline{B}U_i} u_{ij}\, x_j}{\sum_{x_j \in \underline{B}U_i} u_{ij}} + w_{up}\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}\, x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}} & \text{if } \underline{B}U_i \neq \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
\dfrac{\sum_{x_j \in \underline{B}U_i} u_{ij}\, x_j}{\sum_{x_j \in \underline{B}U_i} u_{ij}} & \text{otherwise.}
\end{cases}
\]

6. Repeat Steps 2-5 until convergence, i.e., until there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: w_up = 1 − w_low, 0.5 < w_low < 1, and 0 < threshold < 0.5.
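The only change with respect to the rough c-means assignment is that the decision is now based on the difference between the two largest memberships rather than the two smallest distances. A minimal sketch of steps 3 and 4, assuming the membership matrix u (of shape n × k) has already been computed by eq. 3.9, is given below.

```python
import numpy as np

def rfcm_assign(u, threshold=0.2):
    """Assign objects to lower/upper regions from memberships u[j, i] (sketch)."""
    n, k = u.shape
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for j in range(n):
        order = np.argsort(u[j])[::-1]      # clusters by decreasing membership
        i, i2 = order[0], order[1]
        upper[i].add(j)
        if u[j, i] - u[j, i2] < threshold:
            upper[i2].add(j)                # ambiguous object: boundary only
        else:
            lower[i].add(j)
    return lower, upper
```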


4.3.3 Rough Fuzzy c-Means Subspace Clustering Algorithm

The proposed algorithm, called Rough Fuzzy c-Means Subspace (RFCMS), has been developed by hybridizing the concept of fuzzy membership for objects (in clusters) and for dimensions (where the fuzzy membership serves as the weight of a dimension) with rough set based approximations of clusters.

Objective Function. Let B̲U_i, B̄U_i and B̄U_i − B̲U_i denote the lower approximation, upper approximation, and boundary region of the i-th cluster U_i respectively. In [Lingras and West, 2004], the classical objective function of the c-means algorithm has been modified in the rough framework by incorporating the lower and upper approximations of the clusters. We have extended the objective function of the rough fuzzy c-means algorithm [Mitra et al., 2006] by incorporating the weights of the dimensions as relevant to different clusters. We associate with the i-th cluster the weight vector ω_i, which represents the relative relevance of the different attributes for the i-th cluster. Thus, in the matrix W = [ω_ir]_{k×d}, ω_ir denotes the contribution of the r-th dimension to the i-th cluster. The contributions from all dimensions add up to 1 for each cluster:

\[
\sum_{r=1}^{d} \omega_{ir} = 1, \quad 1 \leq i \leq k, \qquad (4.1)
\]
\[
\omega_{ir} \in [0, 1], \quad 1 \leq i \leq k, \; 1 \leq r \leq d. \qquad (4.2)
\]

The proposed RFCMS algorithm partitions the data set into k clusters by minimizing the following objective function J_RFCMS:

\[
J_{RFCMS} =
\begin{cases}
aA + bB & \text{if } \underline{B}U \neq \emptyset \ \wedge \ \overline{B}U - \underline{B}U \neq \emptyset, \\
A & \text{if } \underline{B}U \neq \emptyset \ \wedge \ \overline{B}U - \underline{B}U = \emptyset, \\
B & \text{otherwise,}
\end{cases}
\]


where

\[
A = \sum_{i=1}^{k} \sum_{x_j \in \underline{B}U_i} \sum_{r=1}^{d} \mu_{ij}^{\tau}\, \omega_{ir}^{\eta}\, d^2_{ijr},
\qquad
B = \sum_{i=1}^{k} \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \sum_{r=1}^{d} \mu_{ij}^{\tau}\, \omega_{ir}^{\eta}\, d^2_{ijr}. \qquad (4.3)
\]

In the above formulation, A and B correspond to the lower and upper approximations. The parameters a and b control the contribution of the lower and upper approximation of a cluster, and

\[
d^2_{ijr} = (x_{jr} - z_{ir})^2 \qquad (4.4)
\]

is the distance between the i-th cluster center and the j-th data object along the r-th dimension. The parameters τ ∈ (1, ∞) and η ∈ (1, ∞) are weighting components; they control the fuzzification of μ_ij and ω_ir respectively.

Solving 4.3 w.r.t. μ_ij and ω_ir, we get:

\[
\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} \omega_{ir}^{\eta}\, d^2_{ijr}}{\sum_{r=1}^{d} \omega_{lr}^{\eta}\, d^2_{ljr}} \right]^{\frac{1}{\tau-1}}} \qquad (4.5)
\]

\[
\omega_{ir} = \frac{1}{\sum_{l=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijr}}{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijl}} \right]^{\frac{1}{\eta-1}}} \qquad (4.6)
\]

The weights of the dimensions are computed using eq. 4.6 as in [Kumar and Puri, 2009].
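The two update rules can be vectorized directly. In the sketch below, tau and eta stand for the weighting components τ and η of eq. 4.3, and a small constant eps guards against zero distances, which the derivation implicitly assumes away; this is an illustrative sketch, not the implementation used in the experiments.

```python
import numpy as np

def update_memberships(X, Z, W, tau=2.0, eta=2.0, eps=1e-12):
    """Eq. 4.5: mu[i, j] from weighted per-dimension distances (sketch)."""
    d2 = (X[None, :, :] - Z[:, None, :]) ** 2          # (k, n, d): d^2_ijr
    D = np.einsum('ir,ijr->ij', W ** eta, d2) + eps    # (k, n): sum_r w_ir^eta d^2_ijr
    t = D ** (-1.0 / (tau - 1.0))
    return t / t.sum(axis=0, keepdims=True)            # normalize over clusters

def update_weights(X, Z, mu, tau=2.0, eta=2.0, eps=1e-12):
    """Eq. 4.6: w[i, r] from membership-weighted distances (sketch)."""
    d2 = (X[None, :, :] - Z[:, None, :]) ** 2          # (k, n, d)
    E = np.einsum('ij,ijr->ir', mu ** tau, d2) + eps   # (k, d): sum_j mu_ij^tau d^2_ijr
    t = E ** (-1.0 / (eta - 1.0))
    return t / t.sum(axis=1, keepdims=True)            # normalize over dimensions
```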

Cluster Center. The cluster centers are computed as:

\[
z_{ir} =
\begin{cases}
\dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}} & \text{if } \underline{B}U_i = \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
a\, \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}} + b\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}} & \text{if } \underline{B}U_i \neq \emptyset \ \wedge \ \overline{B}U_i - \underline{B}U_i \neq \emptyset, \\[2ex]
\dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\tau}} & \text{otherwise.}
\end{cases}
\qquad (4.7)
\]

As the objects lying in the lower approximation definitely belong to the cluster, they are assigned a higher weight than the objects lying in the boundary region. For the case a = 1, the cluster center may get stuck in a local optimum because the cluster cannot see the objects lying in its boundary region and, therefore, may not be able to move towards the best cluster center. In order to maintain a greater degree of freedom to move, the values of the parameters a and b are set as 0 < b < a < 1 such that a + b = 1 [Maji and Pal, 2007a]. Like FCM [Bezdek et al., 1987] and Yan's fuzzy curve tracing algorithm [Yan, 2004], the proposed RFCMS algorithm converges, at least along a subsequence, to a local optimum solution.

    The iterative steps of the algorithm are as follows:

Algorithm 4 Rough Fuzzy c-Means Subspace Clustering Algorithm

1. Choose initial cluster centers z_i, 1 ≤ i ≤ k, for the k clusters.

2. Compute μ_ij by eq. 4.5 for the k clusters and n data objects.

3. Let μ_ij be the maximum and μ_{i'j} the next to maximum for an object x_j. If μ_ij − μ_{i'j} is less than some threshold, then x_j ∈ B̄U_i and x_j ∈ B̄U_{i'} and x_j cannot be a member of any lower approximation; otherwise x_j ∈ B̲U_i, where the membership μ_ij is maximum over the k clusters.

4. Compute ω_ir by eq. 4.6 for the k clusters and d dimensions.

5. Compute the new cluster centers z_i for each cluster, as in eq. 4.7.

6. Repeat steps 2-5 until convergence, i.e., until there are no more new assignments, or the limit on the maximum number of iterations is reached.

Note: a = 1 − b, 0.5 < a < 1, and 0 < threshold < 0.5.
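Continuing the sketch of eqs. 4.5 and 4.6, the cluster-center update of eq. 4.7 can be written as follows. Here lower and upper are the index sets produced by step 3, a and b are the weights of the lower approximation and boundary region, and the membership exponent tau follows the reconstruction of eq. 4.7 above; the code is a sketch under those assumptions.

```python
import numpy as np

def update_centers(X, mu, lower, upper, Z_old, a=0.7, tau=2.0):
    """Eq. 4.7: per-dimension centers from lower and boundary regions (sketch)."""
    b = 1.0 - a
    k, d = Z_old.shape
    Z = Z_old.copy()
    for i in range(k):
        low = sorted(lower[i])
        bnd = sorted(upper[i] - lower[i])

        def wmean(idx):                      # membership-weighted mean over idx
            w = mu[i, idx] ** tau
            return (w[:, None] * X[idx]).sum(axis=0) / w.sum()

        if low and bnd:
            Z[i] = a * wmean(low) + b * wmean(bnd)
        elif bnd:                            # empty lower approximation
            Z[i] = wmean(bnd)
        elif low:                            # empty boundary region
            Z[i] = wmean(low)
    return Z
```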


4.4 Convergence

In this section, we discuss the convergence of the proposed algorithm along with its proof. Along the same lines as the global convergence property of the FCM algorithm, the global convergence property of RFCMS states that for any data set and initialization parameters, an iteration sequence of the RFCMS algorithm either (i) converges to a local minimum, or (ii) there exists a subsequence of the iteration sequence that converges to a stationary point. Theorems 4.1, 4.2 and 4.3 below show that the necessary and sufficient conditions hold for U, W, and Z respectively.

Theorem 4.1 Let φ : M_kn → ℝ, φ(U) = J_RFCMS(U, W, Z), where W and Z are kept fixed. Then U is a strict local minimum of φ if and only if μ_ij is computed by eq. 4.5.

To obtain the necessary condition, the objective function is minimized subject to the constraint eq. 2.11, Σ_{i=1}^{k} μ_ij = 1, 1 ≤ j ≤ n, using Lagrange multipliers λ_j and the substitutions μ_ij = S_ij² and ω_ir = P_ir². Setting the first order partial derivative of the Lagrangian with respect to S_ij to zero and assuming that S_ij ≠ 0, 1 ≤ j ≤ n, 1 ≤ i ≤ k, we get:

\[
\tau \sum_{r=1}^{d} S_{ij}^{2\tau-2} P_{ir}^{2\eta} d^2_{ijr} + \lambda_j = 0
\]
\[
\text{or} \quad \lambda_j = -\tau \sum_{r=1}^{d} S_{ij}^{2\tau-2} P_{ir}^{2\eta} d^2_{ijr}
\]
\[
\text{or} \quad S_{ij}^{2\tau-2} = \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}}
\]
\[
\text{or} \quad S_{ij}^{2} = \left[ \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}} \right]^{\frac{1}{\tau-1}}
\]
\[
\mu_{ij} = \left[ \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}} \right]^{\frac{1}{\tau-1}} \qquad (4.10)
\]

Using the constraint eq. 2.11 in eq. 4.10, we get:

\[
\sum_{i=1}^{k} \mu_{ij} = \sum_{i=1}^{k} \left[ \frac{-\lambda_j}{\tau \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}} \right]^{\frac{1}{\tau-1}} = 1
\]

Substituting the value of λ_j in eq. 4.10, we obtain:

\[
\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[ \dfrac{\sum_{r=1}^{d} \omega_{ir}^{\eta}\, d^2_{ijr}}{\sum_{r=1}^{d} \omega_{lr}^{\eta}\, d^2_{ljr}} \right]^{\frac{1}{\tau-1}}} \qquad (4.11)
\]

Now, to prove the sufficiency condition, we compute the second order partial derivative:

\[
\frac{\partial^2 J_{RFCMS}}{\partial S_{ij} \partial S_{i'j'}} =
\begin{cases}
2\tau(2\tau-1) \sum_{r=1}^{d} S_{ij}^{2\tau-2} P_{ir}^{2\eta} d^2_{ijr} + 2\lambda_j & \text{if } i = i' \wedge j = j', \\
0 & \text{otherwise}
\end{cases}
\]
\[
= 2\tau(2\tau-1) \sum_{r=1}^{d} \mu_{ij}^{\tau-1} P_{ir}^{2\eta} d^2_{ijr} + 2\lambda_j \qquad (4.12)
\]
\[
= 2\tau(2\tau-1)\, \mu_{ij}^{\tau-1}\, \tilde{d}^2_{ij} + 2\lambda_j \qquad (4.13)
\]

where

\[
\tilde{d}^2_{ij} = \sum_{r=1}^{d} P_{ir}^{2\eta} d^2_{ijr}.
\]

Substituting the values of μ_ij and λ_j in 4.13, we get:

\[
2\tau(2\tau-1) \left[ \sum_{l=1}^{k} \left( \frac{1}{\tilde{d}^2_{lj}} \right)^{\frac{1}{\tau-1}} \right]^{-(\tau-1)} - 2\tau \left[ \sum_{l=1}^{k} \left( \frac{1}{\tilde{d}^2_{lj}} \right)^{\frac{1}{\tau-1}} \right]^{-(\tau-1)}
\]
\[
= \big( 2\tau(2\tau-1) - 2\tau \big) \left[ \sum_{l=1}^{k} \left( \frac{1}{\tilde{d}^2_{lj}} \right)^{\frac{1}{\tau-1}} \right]^{-(\tau-1)} \qquad (4.14)
\]
\[
= 4\tau(\tau-1) \left[ \sum_{l=1}^{k} \left( \tilde{d}^2_{lj} \right)^{-\frac{1}{\tau-1}} \right]^{-(\tau-1)} \qquad (4.15)
\]

Letting

\[
a_j = \left[ \sum_{l=1}^{k} \left( \tilde{d}^2_{lj} \right)^{-\frac{1}{\tau-1}} \right]^{-(\tau-1)}, \quad 1 \leq j \leq n,
\]
\[
\frac{\partial^2 J_{RFCMS}}{\partial S_{ij} \partial S_{ij}} = \gamma_j \quad \text{where} \quad \gamma_j = 4\tau(\tau-1)\, a_j, \quad 1 \leq j \leq n. \qquad (4.16)
\]

Hence the Hessian matrix of U, which is a diagonal matrix, has n distinct eigenvalues, each of multiplicity k. With the assumptions τ > 1, η > 1 and d̃²_lj > 0 ∀ l, j, it follows that γ_j > 0 ∀ j. Thus the Hessian matrix of U is positive definite and hence the sufficiency condition is proved.

Theorem 4.2 Let ψ : M_kd → ℝ, ψ(W) = J_RFCMS(U, W, Z), where U and Z are kept fixed. Then W is a strict local minimum of ψ if and only if ω_ir is computed by eq. 4.6.

Since ω_ir = P_ir², we get:

\[
\omega_{ir} = \left[ \frac{-\lambda_i}{\eta \sum_{j=1}^{n} S_{ij}^{2\tau} d^2_{ijr}} \right]^{\frac{1}{\eta-1}} \qquad (4.18)
\]

Using the constraint eq. 3.4, we get:

\[
\sum_{r=1}^{d} \omega_{ir} = \sum_{r=1}^{d} \left[ \frac{-\lambda_i}{\eta \sum_{j=1}^{n} S_{ij}^{2\tau} d^2_{ijr}} \right]^{\frac{1}{\eta-1}} = 1
\]

Substituting the value of λ_i in eq. 4.18, we obtain:

\[
\omega_{ir} = \frac{1}{\sum_{l=1}^{d} \left[ \dfrac{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijr}}{\sum_{j=1}^{n} \mu_{ij}^{\tau}\, d^2_{ijl}} \right]^{\frac{1}{\eta-1}}} \qquad (4.19)
\]

Now, to prove the sufficiency condition, we compute the second order partial derivative:

\[
\frac{\partial^2 J_{RFCMS}}{\partial P_{ir} \partial P_{i'r'}} =
\begin{cases}
2\eta(2\eta-1) \sum_{j=1}^{n} P_{ir}^{2\eta-2} S_{ij}^{2\tau} d^2_{ijr} + 2\lambda_i & \text{if } i = i' \wedge r = r', \\
0 & \text{otherwise}
\end{cases}
\]
\[
= 2\eta(2\eta-1) \sum_{j=1}^{n} \omega_{ir}^{\eta-1} S_{ij}^{2\tau} d^2_{ijr} + 2\lambda_i \qquad (4.20)
\]
\[
= 2\eta(2\eta-1)\, \omega_{ir}^{\eta-1}\, \hat{d}^2_{ir} + 2\lambda_i \qquad (4.22)
\]

where

\[
\hat{d}^2_{ir} = \sum_{j=1}^{n} S_{ij}^{2\tau} d^2_{ijr} \qquad (4.23)
\]

Substituting the values of ω_ir and λ_i in 4.22, we get:

\[
2\eta(2\eta-1) \left[ \sum_{l=1}^{d} \left( \frac{1}{\hat{d}^2_{il}} \right)^{\frac{1}{\eta-1}} \right]^{-(\eta-1)} - 2\eta \left[ \sum_{l=1}^{d} \left( \frac{1}{\hat{d}^2_{il}} \right)^{\frac{1}{\eta-1}} \right]^{-(\eta-1)}
\]
\[
= \big( 2\eta(2\eta-1) - 2\eta \big) \left[ \sum_{l=1}^{d} \left( \frac{1}{\hat{d}^2_{il}} \right)^{\frac{1}{\eta-1}} \right]^{-(\eta-1)}
\]
\[
= 4\eta(\eta-1) \left[ \sum_{l=1}^{d} \left( \hat{d}^2_{il} \right)^{-\frac{1}{\eta-1}} \right]^{-(\eta-1)}
\]

Letting

\[
b_i = \left[ \sum_{l=1}^{d} \left( \hat{d}^2_{il} \right)^{-\frac{1}{\eta-1}} \right]^{-(\eta-1)}, \quad 1 \leq i \leq k,
\]
\[
\frac{\partial^2 J_{RFCMS}}{\partial P_{ir} \partial P_{ir}} = \gamma_i \quad \text{where} \quad \gamma_i = 4\eta(\eta-1)\, b_i, \quad 1 \leq i \leq k. \qquad (4.24)
\]

Hence the Hessian matrix of W, which is a diagonal matrix, has k distinct eigenvalues, each of multiplicity d. With the assumptions τ > 1, η > 1 and d̂²_il > 0 ∀ i, l, it follows that γ_i > 0 ∀ i. Thus the Hessian matrix of W is positive definite and hence the sufficiency condition is proved.

Theorem 4.3 Let ζ : ℝ^{k×d} → ℝ, ζ(Z) = J_RFCMS(U, W, Z), where U and W are kept fixed. Then Z is a local minimum of ζ only if the cluster centers z_ir are computed by eq. 4.7.

The component of the cluster center contributed by the boundary region is

\[
z_{ir}^{upper\ approx} = \frac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}} \qquad (4.27)
\]

Since an object cannot belong to both the lower approximation and the boundary region, the convergence of a cluster center depends on both the lower approximation and the upper approximation components of the cluster center. Eqs. 4.26 and 4.27 can be written as:

\[
|\underline{B}U_i|\; z_{ir}^{lower\ approx} = \sum_{x_j \in \underline{B}U_i} x_{jr} \qquad (4.28)
\]
\[
\Big( \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau} \Big)\, z_{ir}^{upper\ approx} = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}\, x_{jr} \qquad (4.29)
\]

Eqs. 4.28 and 4.29 represent a linear set of equations. In order to prove convergence, we treat eqs. 4.26 and 4.27 as Gauss-Seidel iterations for solving this set of equations, with μ_ij considered to be fixed. The sufficient condition of the Gauss-Seidel algorithm for assuring convergence is that the matrix representing each iteration is diagonally dominant. The matrices corresponding to eqs. 4.26 and 4.27 are:

\[
A =
\begin{pmatrix}
|\underline{B}U_1| & 0 & \dots & 0 \\
0 & |\underline{B}U_2| & \dots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \dots & |\underline{B}U_k|
\end{pmatrix}
\qquad
B =
\begin{pmatrix}
\xi_1 & 0 & \dots & 0 \\
0 & \xi_2 & \dots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \dots & \xi_k
\end{pmatrix}
\]

where

\[
\xi_i = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\tau}.
\]

The sufficient condition for the matrices A and B to be diagonally dominant is |B̲U_i| > 0 and ξ_i > 0 respectively.

Also, following the convergence theorem proposed by Bezdek et al. for FCM [Bezdek et al., 1987], the analysis of [Maji and Pal, 2007a], and the convergence analysis of Yan's fuzzy curve tracing algorithm [Yan, 2004], the matrices A and B are the Hessians of the objective components A and B w.r.t. z_ir lower approx and z_ir upper approx respectively, with all positive eigenvalues, which proves that these matrices are diagonally dominant. Thus, by Theorems 4.1, 4.2 and 4.3, the proposed algorithm RFCMS converges, at least along a subsequence, to a local optimum solution.

4.5 Experiments

In this section, we present the comparative performance of the proposed subspace clustering algorithm RFCMS with FCM, RCM, RFCM, DOC, and PROCLUS, using UCI data sets [uci, ]. While FCM, RCM and RFCM are full dimensional clustering algorithms, PROCLUS and DOC are subspace clustering algorithms tailored for high-dimensional applications. We used the MATLAB version of FCM, the OpenSubspace Weka [osw, ] implementations of DOC and PROCLUS, and implemented the RCM, RFCM, and RFCMS algorithms in MATLAB. In all the experiments with the FCM, RCM, RFCM and RFCMS algorithms, the stopping criterion parameter was set to 10^{-3} and the maximum number of iterations was restricted to 100. However, in all the experiments we conducted, the algorithms always converged before the limit on the number of iterations was reached. The normed difference between successive iterations of the matrix Z is compared with this threshold parameter to define the convergence criterion. Based on experimentation, we set the values of the parameters a = 0.85 and b = 0.25 for the RCM, RFCM and RFCMS algorithms. The parameters for the DOC algorithm were used as mentioned in [Procopiuc et al., 2002]. The number of clusters k was set equal to the number of classes given in each data set, as indicated in Table 4.1. We have evaluated the effect of the fuzzification parameters τ and η of the RFCMS algorithm and of the fuzzification parameter m of the FCM and RFCM algorithms. We evaluated the performance of all the algorithms w.r.t. quality and validity measures. The sets of relevant dimensions computed by each of the subspace clustering algorithms RFCMS, DOC and PROCLUS are shown for all the data sets.

Data Sets       Instances  Attributes  Classes
Alzheimer       45         8           3
Breast Cancer   569        30          2
Spambase        4601       57          2
Wine            178        13          3
Diabetes        768        8           2
Magic           19020      10          2

Table 4.1: Data Sets

4.5.1 Data Sets

We experimented with the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets from the UCI data repository [uci, ]. These data sets are heterogeneous in terms of size, number of clusters, and distribution of classes, and have no missing values. General characteristics of the data sets are summarized in Table 4.1.

4.5.2 Effect of Fuzzification Parameters

For the RFCMS algorithm, the best combination of the fuzzification parameters τ and η was determined by varying the values of τ and η in the range 2-10 independently of each other. This was done for each data set. Similarly, the best value of the fuzzification parameter m for the FCM and RFCM algorithms was determined by varying the value of m. Table 4.2 shows the complete list of fuzzification parameters we found for different data sets as a result of fine-tuning.
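The tuning described above amounts to a grid search over the two fuzzifiers. A minimal sketch is given below; rfcms and accuracy are placeholders for the clustering routine of section 4.3.3 and a validity measure of section 4.5.3, not actual library functions.

```python
import itertools

def tune_fuzzifiers(X, labels, rfcms, accuracy, grid=range(2, 11)):
    """Grid-search the fuzzifiers tau, eta in 2..10 as in section 4.5.2 (sketch)."""
    best = (None, None, -1.0)
    for tau, eta in itertools.product(grid, grid):
        pred = rfcms(X, tau=tau, eta=eta)    # placeholder: one clustering run
        score = accuracy(labels, pred)       # placeholder: validity measure
        if score > best[2]:
            best = (tau, eta, score)
    return best                              # (best tau, best eta, best score)
```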


Data Sets       RFCMS        FCM   RFCM
                τ      η     m     m
Alzheimer       2      2     6     4
Breast Cancer   4      10    6     6
Spambase        3      10    10    6
Wine            3      9     9     2
Diabetes        2      2     2     2
Magic           2      2     2     2

Table 4.2: Fuzzifier Values: RFCMS, FCM, and RFCM

    Data Sets RFCMS FCM RCM RFCM PROCLUS DOC

Alzheimer 0.7556 0.8000 0.6889 0.7333 0.0750 0.2813

    Breast Cancer 0.9192 0.8282 0.8541 0.8682 0.8336 0.0887

    Spambase 0.7457 0.6568 0.6433 0.6568 0.5885 0.7062

    Wine 0.9101 0.7079 0.6854 0.6966 0.5427 0.2743

    Diabetes 0.6510 0.6589 0.6589 0.6589 0.5248 0.6910

    Magic 0.6931 0.6961 0.6961 0.7294 0.2813 0.4817

    Table 4.3: Accuracy: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

    4.5.3 Cluster Validity

Table 4.3 shows the accuracy results for all the algorithms and data sets. The RFCMS algorithm has the highest accuracy for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest accuracy for the Alzheimer data set, the RFCM algorithm achieves the highest accuracy for the Magic data set, and the DOC algorithm achieves the highest accuracy for the Diabetes data set. In Tables 4.4, 4.5, 4.6, and 4.7, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms. The RFCMS algorithm achieves the highest recall and specificity for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest recall and specificity for the Alzheimer data set, the RFCM algorithm achieves the highest recall and specificity for the Magic data set, and the DOC algorithm achieves the highest recall and specificity for the Diabetes data set.

Data Sets RFCMS FCM RCM RFCM PROCLUS DOC
Alzheimer 0.7470 0.7976 0.6921 0.7367 0.0953 0.4193

    Breast Cancer 0.8944 0.8241 0.8052 0.8241 0.8906 0.1527

    Spambase 0.7740 0.5798 0.5550 0.5767 0.4485 0.6543

    Wine 0.9249 0.7030 0.6765 0.6904 0.5488 0.2702

    Diabetes 0.5000 0.5943 0.5943 0.5943 0.4902 0.6488

    Magic 0.6236 0.5722 0.5722 0.7982 0.4913 0.3787

    Table 4.4: Recall: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

    Data Sets RFCMS FCM RCM RFCM PROCLUS DOC

Alzheimer 0.8769 0.9003 0.8465 0.8684 0.5155 0.4193

    Breast Cancer 0.8949 0.8241 0.8052 0.8241 0.8906 0.1527

    Spambase 0.7740 0.5798 0.5550 0.5767 0.4485 0.6543

    Wine 0.9559 0.8565 0.8446 0.8508 0.8380 0.6327

    Diabetes 0.5000 0.5943 0.5943 0.5943 0.4902 0.6488

    Magic 0.6236 0.5722 0.5722 0.7982 0.4913 0.3787

    Table 4.5: Specificity: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

The RFCMS algorithm has the highest precision for the Breast Cancer, Spambase, Diabetes, Magic and Wine data sets, while the FCM algorithm achieves the highest precision for the Alzheimer data set. The RFCMS algorithm achieves the highest F1-measure for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest F1-measure for the Alzheimer data set, and the RFCM algorithm achieves the highest F1-measure for the Magic data set. The FCM, RCM and RFCM algorithms achieve the highest F1-measure for the Diabetes data set. In summary, it can be seen that no algorithm is a clear winner w.r.t. all measures for all the data sets.
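For reference, a common way to compute such external validity measures is to map each cluster to its majority class and then evaluate the resulting labelling. The sketch below shows this for accuracy; the majority-class mapping is an assumption about the scoring procedure, since the exact evaluation code is not reproduced in this chapter.

```python
import numpy as np

def majority_map_accuracy(true, pred):
    """Map each cluster to its majority class, then report accuracy (sketch)."""
    true, pred = np.asarray(true), np.asarray(pred)
    mapped = np.empty_like(true)
    for c in np.unique(pred):
        members = (pred == c)
        values, counts = np.unique(true[members], return_counts=True)
        mapped[members] = values[np.argmax(counts)]   # majority class of cluster c
    return float(np.mean(mapped == true))
```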


Data Sets RFCMS FCM RCM RFCM PROCLUS DOC
Alzheimer 0.7407 0.9716 0.7008 0.7463 0.0769 0.1806

    Breast Cancer 0.9371 0.9104 0.9026 0.9104 0.7825 0.1332

    Spambase 0.7677 0.6810 0.6982 0.6938 0.4994 0.5981

    Wine 0.9202 0.7301 0.7084 0.7211 0.5104 0.1778

    Diabetes 0.6510 0.6120 0.6120 0.6120 0.4897 0.6028

    Magic 0.7982 0.7870 0.7870 0.7958 0.1806 0.4054

    Table 4.6: Precision: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

    Data Sets RFCMS FCM RCM RFCM PROCLUS DOC

Alzheimer 0.7439 0.7946 0.6964 0.7415 0.0851 0.2525

    Breast Cancer 0.9153 0.8651 0.8511 0.8651 0.8330 0.1423

    Spambase 0.7708 0.6263 0.6184 0.6299 0.4726 0.6250

    Wine 0.9225 0.7163 0.6921 0.7054 0.5289 0.2145

    Diabetes 0.5656 0.7062 0.7062 0.7062 0.4899 0.6249

    Magic 0.7002 0.6626 0.6626 0.7970 0.2641 0.3916

    Table 4.7: F1-measure: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

4.5.4 Subspaces Generated

The proposed algorithm RFCMS is an objective function based subspace clustering algorithm. For such algorithms, the fewer the number of dimensions, the smaller the error or scatter among the objects of a cluster. We have compared the RFCMS, DOC and PROCLUS algorithms in terms of the number of dimensions found.

Tables 4.8, 4.9, 4.10, 4.11, 4.12 and 4.13 show the sets of dimensions found for the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets by the RFCMS, PROCLUS and DOC algorithms. For all the data sets mentioned above, the RFCMS algorithm finds subspaces with fewer dimensions.
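Since RFCMS produces soft dimension weights rather than a hard subspace, a reported set of dimensions can be obtained by thresholding the weight matrix W. The sketch below keeps, for each cluster, the dimensions whose weight exceeds the uniform value 1/d; this is one plausible convention and not necessarily the exact rule used to produce Tables 4.8-4.13.

```python
import numpy as np

def relevant_dimensions(W, factor=1.0):
    """Keep dimensions whose weight exceeds factor * (1/d) in each cluster (sketch)."""
    k, d = W.shape
    cutoff = factor / d
    return [list(np.flatnonzero(W[i] > cutoff) + 1) for i in range(k)]  # 1-based dims
```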


Cluster No. RFCMS PROCLUS DOC

    1 4 4,6,7 1,2,3,4,5,6,7

    2 4, 5, 7 4,5,6 1,2,3,4,5,6,7

    3 4, 5, 6 4,5,6 1,2,3,4,5,6,7

Table 4.8: Dimensions: RFCMS, PROCLUS and DOC for Alzheimer

    Cluster No. RFCMS PROCLUS DOC

    1 10, 15, 20 1-3, 5-13 1-3, 5-13

    15-24, 26-30 15-23, 25-30

    2 10, 15, 20 1,2 1-3,5-13

    15-23, 25-30

    Table 4.9: Dimensions: RFCMS, PROCLUS and DOC for Breast Cancer

    Cluster No. RFCMS PROCLUS DOC

    1 28, 29, 32, 34, 38, 44, 47 1-54 1-56

    2 45, 46, 47, 51, 52 40, 49 1-56

    Table 4.10: Dimensions: RFCMS, PROCLUS and DOC for Spambase

    Cluster No. RFCMS PROCLUS DOC

    1 3, 8, 11 1,2,3,6,7,8,9,11,12 1-12

    2 3, 8, 11 1,3,6,7,8,9,11,12 1-12

    3 3, 7, 8, 9, 11 1,2 1-12

    Table 4.11: Dimensions: RFCMS, PROCLUS and DOC for Wine

    Cluster No. RFCMS PROCLUS DOC

    1 1,6,7 1,6-8 1, 6-8

    2 1,7 1,4,5,7 1, 6-8

    Table 4.12: Dimensions: RFCMS, PROCLUS and DOC for Diabetes


Cluster No. RFCMS PROCLUS DOC

    1 4,5 3,4,5,8,9 2-6,8,9

    2 4,5 1,2,3,4,5 1-5,8

    Table 4.13: Dimensions: RFCMS, PROCLUS and DOC for Magic

4.5.5 Experiments on Biological Datasets

In this section, we present the comparative performance of the proposed projected clustering algorithm RFCMS with the EWKM, FWKM and LAC algorithms on biological data sets. The RFCMS, EWKM, FWKM and LAC algorithms are all subspace clustering algorithms tailored for high-dimensional applications. We used the Weka implementations of EWKM, FWKM and LAC [Peng and Zhang, 2011]. The parameters for the EWKM, FWKM and LAC algorithms were used as mentioned in [Jing et al., 2007], [Jing et al., 2005] and [Domeniconi et al., 2007]. We have evaluated the effect of the fuzzification parameters τ and η of the RFCMS algorithm. We evaluated the performance of all the algorithms w.r.t. validity measures. The sets of relevant dimensions computed by the RFCMS algorithm are shown for all the data sets.

4.5.5.1 Data Sets

We experimented with the Colon, Embryonal Tumours, Prostate and Leukemia data sets [bio, ]. These data sets are heterogeneous in terms of size and have no missing values. We have chosen data sets which are pre-classified, as this helps in evaluating the results of applying clustering algorithms. General characteristics of the data sets are summarized in Table 4.14.

4.5.5.2 Effect of Fuzzification Parameters

For the RFCMS algorithm, the best combination of the fuzzification parameters τ and η was determined by varying the values of τ and η in the range 2-5 independently of each other. This was done for each data set. Table 4.15 shows the complete list of fuzzification parameters we found for different data sets as a result of fine-tuning.

Data Sets             Instances  Attributes  Classes
Colon Cancer          62         2001        2
Embryonal Tumours     60         7130        2
Leukemia              38         7130        2
Prostate              21         12601       2

Table 4.14: Data Sets

Data Sets             τ    η
Colon Cancer          2    4
Embryonal Tumours     3    5
Leukemia              3    4
Prostate              2    2

Table 4.15: Fuzzifier Values

4.5.5.3 Cluster Validity

Table 4.16 shows the accuracy results for all the algorithms and data sets. The RFCMS algorithm achieves the highest accuracy for the Colon and Leukemia data sets. The EWKM, FWKM and LAC algorithms achieve the highest accuracy for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest accuracy for the Prostate data set. However, the accuracy of RFCMS is comparable to that of the FWKM algorithm for both the Embryonal Tumours and Prostate data sets.

Data Sets             RFCMS    EWKM    FWKM    LAC
Colon Cancer          0.58065  0.5322  0.5322  0.5438
Embryonal Tumours     0.5833   0.6666  0.6666  0.6666
Leukemia              0.8421   0.5526  0.5526  0.5263
Prostate              0.6195   0.6190  0.6666  0.6190

Table 4.16: Accuracy: RFCMS, EWKM, FWKM and LAC


Data Sets RFCMS EWKM FWKM LAC
Colon Cancer 0.5318 0.53636 0.53636 0.51364
Embryonal Tumours 0.63553 0.47619 0.47619 0.47619

    Leukemia 0.83502 0.4697 0.4697 0.45118

    Prostate 0.5240 0.47596 0.41346 0.47596

    Table 4.17: Specificity: RFCMS, EWKM, FWKM and LAC

In Tables 4.17, 4.18, 4.19, and 4.20, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms.

The RFCMS algorithm achieves the highest recall for the Leukemia data set, the EWKM, FWKM and LAC algorithms achieve the highest recall for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest recall for the Prostate data set. The RFCMS algorithm achieves the highest specificity for the Embryonal Tumours, Prostate and Leukemia data sets. The RFCMS algorithm has the highest precision for the Colon and Leukemia data sets, the EWKM, FWKM and LAC algorithms achieve the highest precision for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest precision for the Prostate data set. On the F1-measure, the RFCMS algorithm scores highest for the Leukemia data set, while the LAC algorithm scores highest for the Colon data set, the FWKM algorithm for the Prostate data set, and the EWKM, FWKM and LAC algorithms for the Embryonal Tumours data set.

4.5.5.4 Subspaces Generated

Figures 4.1 to 4.12 show the sets of dimensions found for the Colon, Embryonal Tumours, Prostate and Leukemia data sets by the RFCMS, EWKM and LAC algorithms. The RFCMS algorithm finds fewer dimensions than the EWKM and LAC algorithms.


Data Sets RFCMS EWKM FWKM LAC
Colon Cancer 0.53182 0.5322 0.5322 0.5483
Embryonal Tumours 0.63553 0.6666 0.6666 0.6666

    Leukemia 0.83502 0.5526 0.5526 0.5263

    Prostate 0.52404 0.6190 0.6666 0.6190

    Table 4.18: Recall: RFCMS, EWKM, FWKM and LAC

    Data Sets RFCMS EWKM FWKM LAC

    Colon Cancer 0.5333 0.5057 0.5057 0.5288

Embryonal Tumours 0.63415 0.7796 0.7796 0.7796

    Leukemia 0.8061 0.5642 0.5642 0.5263

    Prostate 0.5657 0.5814 0.6666 0.5814

    Table 4.19: Precision: RFCMS, EWKM, FWKM and LAC

For the Embryonal Tumours data set, the EWKM and LAC algorithms fail to distinguish the relevance of the dimensions for cluster 2, whereas the RFCMS algorithm distinguishes the relevant and non-relevant dimensions for cluster 2. For the Prostate data set, the RFCMS algorithm finds fewer dimensions than the EWKM and LAC algorithms. For the Leukemia data set, the results of the RFCMS, EWKM, and LAC algorithms are comparable.

    Data Sets RFCMS EWKM FWKM LAC

    Colon Cancer 0.5325 0.5322 0.5322 0.5483

Embryonal Tumours 0.63415 0.6666 0.6666 0.6666

    Leukemia 0.83502 0.5526 0.5526 0.5263

    Prostate 0.5240 0.6190 0.6666 0.6190

    Table 4.20: F1-measure: RFCMS, EWKM, FWKM and LAC


Figure 4.1: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.2: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.3: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.4: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.5: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.6: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.7: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.8: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.9: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.10: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

Figure 4.11: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

Figure 4.12: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

4.6 Summary

In this chapter, we have proposed a novel subspace clustering algorithm which employs a combination of rough set and fuzzy set theory. The Rough Fuzzy c-Means Subspace (RFCMS) algorithm is an extension of the rough fuzzy c-means algorithm which incorporates fuzzy memberships of data points and of dimensions in each cluster. In each iteration, the cluster centers are updated and each data point is assigned to the lower approximation or the upper approximation of a cluster. This process is repeated until the convergence criterion is met. We have also discussed the convergence of the proposed algorithm. The results of applying the proposed approach to UCI data sets show that the proposed algorithm scores over its competitors in terms of several validity measures. The proposed algorithm can be used in conjunction with density based algorithms to automatically detect the number of clusters.
