

An Exponential Monte-Carlo algorithm for feature selection problems

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.cie.2013.10.009

This manuscript was processed by Area Editor Ibrahim H. Osman.
* Corresponding author. Tel.: +60 389216667.

E-mail addresses: [email protected] (S. Abdullah), [email protected] (N.R. Sabar), [email protected] (M.Z. Ahmad Nazri), [email protected] (M. Ayob).

Salwani Abdullah a,*, Nasser R. Sabar b, Mohd Zakree Ahmad Nazri a, Masri Ayob a

a Data Mining and Optimization Research Group (DMO), Center for Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
b The University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor, Malaysia

Article info

Article history:
Received 25 December 2010
Received in revised form 9 October 2013
Accepted 29 October 2013
Available online 12 November 2013

Keywords: Feature selection; Exponential Monte-Carlo; Local search

Abstract

Feature selection problems (FS) can be defined as the process of eliminating redundant features while avoiding information loss. Due to the fact that FS is an NP-hard problem, heuristic and meta-heuristic approaches have been widely used by researchers. In this work, we propose an Exponential Monte-Carlo algorithm (EMC-FS) for the feature selection problem. EMC-FS is a meta-heuristic approach which is quite similar to a simulated annealing algorithm. The difference is that no cooling schedule is required. Improved solutions are accepted, and worse solutions are adaptively accepted based on the quality of the trial solution, the search time and the number of consecutive non-improving iterations. We have evaluated our approach against the latest methodologies in the literature on standard benchmark problems. The quality of the obtained subset of features has also been evaluated in terms of the number of generated rules (descriptive patterns) and classification accuracy. Our research demonstrates that our approach produces some of the best known results.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Recently, there has been a great deal of attention paid to feature selection in data mining. Feature selection can be defined as the problem of finding a minimal subset of features while avoiding information loss (Pawlak, 1982, 1991). Removing redundant and misleading features can improve the performance and efficiency of a learning process (Pawlak, 1991). It is known that finding the smallest subset of features is an NP-hard problem (Pawlak, 1982). The optimal subset of features is determined by both relevancy and redundancy aspects. A feature is said to be relevant if a decision depends on it; if no decision depends on the feature, it is not relevant. However, a feature can also be considered to be redundant if it is highly correlated with other features (Pawlak, 1991). Hence, the aim is to search for features that are strongly correlated with the decision feature. Finding an optimal subset of features varies from one problem to another depending on the problem complexity.

During the last decade, a number of approaches have been utilised to solve feature selection problems. These approaches can usually be classified as either random or heuristic based methods. In random search based methods, the main idea is to randomly generate subsets of features until the optimal subset is found or a predefined termination criterion is reached. The optimal subset has fewer features than the original one, but carries the same information. However, despite being simple to implement, random search based methods are impractical when dealing with a huge dataset, and the quality of the generated solutions is unsatisfactory.

On the other hand, heuristic and meta-heuristic approaches have been successfully applied to feature selection problems. These can be classified into local search methods and population based methods. Examples of population based methods are: genetic algorithms (Wroblewski, 1995; Jensen & Shen, 2003), ant colony (Jensen & Shen, 2003; Ke, Feng, & Ren, 2008), and scatter search (Jue, Hedar, Guihuan, & Shouyang, 2009). Examples of local search methods are: simulated annealing (Jensen & Shen, 2004), tabu search (Hedar, Wang, & Fukushima, 2008), variable neighbourhood search (Arajy & Abdullah, 2010), an iterative algorithm with composite neighbourhood structure (Jihad & Abdullah, 2010), the great deluge algorithm (Abdullah & Jaddi, 2010), nonlinear great deluge (Jaddi & Abdullah, 2013a), and constructive hyper-heuristics (Abdullah, Sabar, Ahmad Nazri, Turabieh, & McCollum, 2010). Hybrid approaches have also been tested on feature selection problems, such as the hybridization between fuzzy logic and the record-to-record travel algorithm (Mafarja & Abdullah, 2013a), a hybrid genetic algorithm with great deluge (Jaddi & Abdullah, 2013b), and a memetic algorithm (Mafarja & Abdullah, 2013b). Other approaches and surveys can be found in Jensen and Shen (2004), Zhang, Qiu, and Wu (2007) and Skowron and Grzymala-Busse (1994).

In this work, we propose an Exponential Monte-Carlo algorithm for solving feature selection problems (EMC-FS). EMC-FS is similar to the simulated annealing (SA) algorithm but employs a different mechanism to escape from local optima. It belongs to the class of non-monotonic SA algorithms that were introduced in Osman (1993) and Osman and Christofides (1994) but uses a different mechanism to accept worse solutions. We select EMC to solve feature selection problems due to its ability to control the intensification/diversification trade-off faced by most local search algorithms; it has fewer parameters that need to be tuned in advance and has been shown to be an effective method for solving hard optimization problems (Abdullah, Burke, & McCollum, 2005, 2007; Ayob & Kendall, 2003; Sabar, Ayob, & Kendall, 2009).

The proposed method has been tested on UCI datasets (Blake & Merz, 1998), and we used rough set theory to evaluate the obtained subset of features (Pawlak, 1982, 1991). Furthermore, in contrast to available feature selection methods that only report the number of generated features, we also evaluate the quality of the generated subset of features in terms of the number of generated rules (descriptive patterns) and the classification accuracy.

2. Problem description

In this section, we describe the feature selection problem, the solution representation and the objective function.

2.1. Feature selection problems

The feature selection (FS) problem is a pre-processing task in data mining and has been intensively studied by researchers due to its critical effect on the learning process. Given a set of features, the primary goal of feature selection is to select, among the possible subsets of features, the smallest subset in such a way that the information of the selected subset is the same as that of the original set of features and can generate a better accuracy (Jensen & Shen, 2003; Pawlak, 1982, 1991). In particular, FS can be represented by a pair (A, c), where A represents the original set of features (the search space of all possible solutions) and c is the objective function which evaluates how good the selected subset is. The problem is then to find the best subset of features s ⊆ A in such a way that the generated subset s has a smaller number of features compared to the original set A. The goal of a search method is to search through all possible subsets of features and determine the most informative subset.
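Stated compactly, and using the (A, c) notation above, the search problem can be written as the following program (a sketch of the formulation implied by the text; the equality constraint expresses the requirement that the selected subset carries the same information as the full feature set):

```latex
\min_{s \subseteq A} \; |s| \quad \text{subject to} \quad c(s) = c(A)
```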

2.2. Solution representation

In this work, a solution is represented as a one-dimensional vector. The size of the vector is equal to the number of features in the original dataset. Each cell in the vector holds either "1" or "0". The value "1" indicates that the corresponding feature is selected, while "0" means the corresponding feature is not selected.
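For illustration, a hypothetical candidate solution for a dataset with eight features could be encoded as follows (a minimal sketch; variable names are ours):

```python
# Binary encoding of a candidate subset for a dataset with 8 features:
# features 0, 2 and 5 are selected, the rest are not.
sol = [1, 0, 1, 0, 0, 1, 0, 0]

# Number of selected features, the "#" measure used later in the paper.
num_selected = sum(sol)   # -> 3
```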

2.3. The objective function

The objective function, c, evaluates how good the selected subset of features is compared to the previous one. In this work, the subset of features generated by the search method is accepted if its objective function value is better than that of the previous one, or if both lead to the same objective function value but the generated one has a smaller number of features. We use the dependency degree of rough set theory as the objective function to evaluate the generated subset of features (Pawlak, 1982, 1991). The dependency degree calculates data dependencies and returns a value between zero and one. The generated subset of features is called informative if the value returned by the dependency degree is equal to one (a maximization problem). That is, the algorithm keeps generating a new subset of features by adding or deleting features from a given subset until the value returned by the dependency degree is equal to one. In particular, given two solutions (two subsets of features), i.e., the current solution, Sol, and the trial solution, Sol*, the trial solution Sol* is accepted if there is an improvement in the objective function value (i.e., if c(Sol*) > c(Sol)). If the objective function values of both solutions are the same (i.e., c(Sol*) = c(Sol)), then the solution with the lowest number of features (denoted as #) is accepted. In this work, rough set theory is used to discover data dependencies and EMC-FS is used to search the space of all available subsets of features. More details about rough set theory for feature selection problems can be found in Jensen and Shen (2003), Ke et al. (2008) and Pawlak (1982, 1991).
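The sketch below illustrates, under our reading of the rough set literature cited above, how the dependency degree and the acceptance comparison could be computed; it is not the authors' MATLAB implementation, and all names are illustrative.

```python
from collections import defaultdict

def dependency_degree(data, decisions, sol):
    """Rough-set dependency degree of the decision on the selected features.

    data      : list of object rows (condition feature values)
    decisions : list of decision values, one per object
    sol       : binary selection vector (1 = feature selected)
    Returns a value in [0, 1]; a value of 1 means the subset is fully informative.
    """
    selected = [i for i, bit in enumerate(sol) if bit == 1]
    groups = defaultdict(list)
    for row, d in zip(data, decisions):
        groups[tuple(row[i] for i in selected)].append(d)
    # Objects whose selected-feature values uniquely determine the decision
    # form the positive region.
    pos = sum(len(ds) for ds in groups.values() if len(set(ds)) == 1)
    return pos / len(data)

def is_better(c_trial, n_trial, c_cur, n_cur):
    """Prefer a higher dependency degree; break ties by fewer features (#)."""
    return c_trial > c_cur or (c_trial == c_cur and n_trial < n_cur)
```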

3. Exponential Monte-Carlo algorithm for feature selection (EMC-FS)

In this work, we propose EMC-FS to deal with the feature selection problem. The EMC-FS algorithm adapted in this work aims to investigate the impact of an algorithm that depends on fewer parameters when solving the feature selection problem, compared to other available approaches that have several parameters to be tuned in advance. The following subsections cover the initial solution generation method, the neighbourhood operator, and the EMC-FS algorithm.

3.1. Initial solution method and the neighbourhood operator

The initial solution is constructed randomly, where each cell in the vector is assigned a value of "1" or "0" at random. We use a systematic neighbourhood operator that generates a neighbouring solution by starting from the first element of the vector and using a flip strategy to change each entry in turn, deciding whether to accept or reject the change. If the value of the selected cell is "1", it is changed to "0"; this means that one feature has been deleted from the current solution. If the value of the selected cell is "0", it is changed to "1", which means that one feature has been added to the current solution.
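A minimal sketch of the random initialisation and the flip operator described above (illustrative Python; the original implementation was in MATLAB):

```python
import random

def random_solution(n_features):
    """Random initial solution: each feature is selected or dropped with equal probability."""
    return [random.randint(0, 1) for _ in range(n_features)]

def flip(sol, i):
    """Systematic flip operator: toggle feature i (add it if absent, delete it if present)."""
    neighbour = sol[:]              # leave the current solution untouched
    neighbour[i] = 1 - neighbour[i]
    return neighbour
```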

3.2. The algorithm: EMC-FS

The EMC algorithm was introduced by Ayob and Kendall (2003). EMC is similar to the acceptance criterion of a simulated annealing algorithm, but no cooling schedule is required. The algorithm always accepts a better solution. A worse solution may be accepted with a probability that depends on the following three parameters: the quality of the solution (represented by the dependency degree), the number of iterations, and the number of consecutive non-improving iterations (we consider this third parameter as a period during which the search is trapped in a local optimum).

The acceptance probability is computed as e^(−θ/k), where θ = δ × t and k = q. Here, δ is the difference between the objective function values of the current and trial solutions, i.e., δ = c(Sol) − c(Sol*), t is an iteration counter, and q is a control parameter that counts consecutive non-improving iterations. The probability of accepting a worse solution decreases as the number of iterations t increases. However, if there is no improvement for a number of consecutive iterations, then the probability of accepting a worse solution increases according to the objective function value of the trial solution and the number of iterations. A worse solution is more likely to be accepted if δ is small or q is large. This is a diversification factor, where the search diversifies when it is trapped in a local optimum.
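In code, the acceptance test for a worse trial solution could look like the following sketch (names are ours; δ, t and q are the quantities defined above):

```python
import math
import random

def accept_worse(delta, t, q):
    """EMC acceptance test for a worse trial solution.

    delta : c(Sol) - c(Sol*), the drop in dependency degree (positive for worse moves)
    t     : iteration counter (intensification factor)
    q     : consecutive non-improvement counter (diversification factor)
    """
    theta = delta * t
    return random.random() < math.exp(-theta / q)
```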


The q parameter can be considered the most sensitive one because it controls the acceptance of worse solutions, which in turn controls the diversification and intensification process. A larger value of q may lead to accepting a very low quality solution, which might move the search to a non-promising area. On the other hand, a lower value of q may lead to accepting a solution that is, in terms of quality, not far from the current one, in which case, after a certain number of iterations, the search process might return to a previously visited search area. In this work, the initial value of q is set to 1 (q = 1) as recommended in Ayob and Kendall (2003), and it is increased by one (q = q + 1) after a certain number of consecutive non-improving iterations. Once a worse solution is accepted, q is reset to 1.

In EMC the intensification factor is controlled by the parameter t. The larger the value of t, the more unlikely it is that a worse solution is accepted. In this method, t is treated as an intensification factor and q is treated as a diversification factor. The main idea of EMC-FS is to ensure that we only accept a worse solution after most of the neighbourhoods of the current solution have been explored and none of them has been found to be better than the current solution. In other words, when there is no improvement for a number of consecutive iterations, the probability of accepting a worse solution increases according to the dependency degree of the trial solution and the number of iterations.

Fig. 1 illustrates the pseudo-code of the EMC-FS method. The algorithm starts with an initial solution (Sol). We initialise the maximum number of non-improving iterations (Max_no-improvement), the controller parameter (q), the iteration counter (t) and the non-improving counter (no-improvement). The objective value (quality of the solution) and the number of features of the initial solution are then calculated (c(Sol), #(Sol)). In the do-while loop, a trial solution (Sol*) is generated using the systematic flip neighbourhood operator. The objective value of the trial solution, c(Sol*), is then calculated and compared with the objective function value of the current solution, c(Sol). If there is an improvement in the objective function value, i.e., c(Sol*) > c(Sol), the trial solution Sol* is accepted and the current solution is updated (Sol ← Sol*). We update the best solution if the objective function value of the trial solution is better than that of the best solution, i.e., (Solbest ← Sol*; c(Solbest) ← c(Sol*) if c(Sol*) > c(Solbest)), and set the value of q to 1 (q = 1). We also accept a trial solution that has the same objective function value as the current solution if the number of features of the trial solution is less than the number of features of the current solution (#(Sol*) < #(Sol)). We do this because it might prevent the search algorithm from being trapped in a local optimum. If the solution has not been accepted, we invoke the acceptance criterion to decide whether to accept or reject the trial solution based on the acceptance probability function. If we reach the end of the vector and no trial solution (Sol*) has been accepted, we increase the q parameter by 1 (q = q + 1) and start a new iteration. In particular, the parameter q is initially set to 1 and is increased by 1 every time no trial solution is accepted after systematically searching the whole vector. Once a trial solution is accepted, q is reset to 1. This procedure is repeated until the termination criterion is met.
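Since the pseudo-code of Fig. 1 is described only in prose here, the following sketch reassembles it from the description above, reusing the helper sketches given earlier (random_solution, flip, dependency_degree). It reflects our reading of the text, not the authors' exact MATLAB code; in particular, the tie-break on the number of features when updating the best solution is our addition.

```python
import math
import random

def emc_fs(data, decisions, n_features, t_max=100, seed=None):
    """EMC-FS main loop as described in Section 3.2 and Fig. 1 (illustrative sketch)."""
    rng = random.Random(seed)
    sol = random_solution(n_features)
    c_sol, n_sol = dependency_degree(data, decisions, sol), sum(sol)
    best, c_best, n_best = sol, c_sol, n_sol
    q = 1                                    # consecutive non-improvement counter
    for t in range(1, t_max + 1):            # t also acts as the intensification factor
        accepted = False
        for i in range(n_features):          # systematic flip over the whole vector
            trial = flip(sol, i)
            c_trial, n_trial = dependency_degree(data, decisions, trial), sum(trial)
            improving = c_trial > c_sol or (c_trial == c_sol and n_trial < n_sol)
            worse_ok = (not improving and
                        rng.random() < math.exp(-((c_sol - c_trial) * t) / q))
            if improving or worse_ok:
                sol, c_sol, n_sol = trial, c_trial, n_trial
                accepted, q = True, 1        # reset q once a trial solution is accepted
                # Best-solution update (tie-break on subset size is our addition).
                if c_sol > c_best or (c_sol == c_best and n_sol < n_best):
                    best, c_best, n_best = sol, c_sol, n_sol
        if not accepted:
            q += 1                           # diversify after a sweep with no accepted trial
        if c_best == 1.0:                    # a fully informative subset has been found
            break
    return best
```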

4. Numerical experiments

EMC-FS was programmed in MATLAB on a PC with an AMD Athlon 1.92 GHz processor and 512 MB of RAM running Windows XP 2002. The 13 well-known UCI datasets (Blake & Merz, 1998) shown in Table 1 are used to test the performance of EMC-FS. For each dataset, the algorithm was run 20 times. EMC-FS has the following three parameters: (i) the maximum number of non-improvement iterations (Max_no-improvement), which is set equal to the number of features in the given instance; (ii) q, a counter of consecutive non-improving solutions, which is set to 1 (q = 1) as recommended by Ayob and Kendall (2003) and Sabar et al. (2009); and (iii) t, the number of iterations, which represents the stopping condition of EMC-FS. In this work, the parameter t starts at 1 and increases as the search progresses. In order to ensure a fair comparison between EMC-FS and existing feature selection methods, we set t to 100 (t = 100) as in Jensen and Shen (2003), Jue et al. (2009) and Hedar et al. (2008).

In this work, two experimental tests have been carried out: (i) computing the minimal number of features (Section 4.1), where the main aim is to evaluate the effectiveness of EMC-FS in producing a small number of features compared to state-of-the-art methods; and (ii) generating the number of rules (descriptive patterns) and the classification accuracy based on the obtained subset of features (Section 4.2), where the aim is to verify that the generated subset of features is informative from the classification perspective. In the second set of experiments, the well-known classification toolkit ROSETTA is used to analyse the data within the framework of rough set theory for rule building and classification accuracy.

4.1. Number of minimal features (subset)

We compare our approach with other feature selection methods that are available in the literature. Table 2 shows the comparison of the minimal number of features (the smallest subset of features) between EMC-FS and the other methods. The numbers in parentheses represent the number of runs in which the algorithm achieved that number of features; a number of features without a parenthesised count means that the method obtained this minimal number of features in all runs. We evaluate our results based on the following two categories: (i) comparison with single solution-based methods and (ii) comparison with population-based methods. We categorise TSAR and SimRSAR as single solution-based methods (third and fourth columns in Table 2), and AntRSAR, GenRSAR, ACOAR and SSAR as population-based methods (fifth, sixth, seventh and eighth columns in Table 2). We used the same number of runs as the other methods, except for SimRSAR, which used 30, 30 and 10 runs for the Heart, Vote and Derm2 datasets, respectively. The evaluation criterion is based on the number of minimal features.

The comparison with single solution-based methods shows that EMC-FS is able to obtain better results on 8 datasets (#4, #6, #7, #9, #10, #11, #12 and #13) compared to TSAR and on 7 datasets (#4, #5, #6, #10, #11, #12 and #13) compared to SimRSAR (see Table 2). In addition, EMC-FS obtained the same results as TSAR and SimRSAR on 4 datasets (#1, #2, #3 and #8). When compared to population-based methods, EMC-FS outperforms GenRSAR on all datasets (#1 to #13), and is better than AntRSAR, ACOAR and SSAR on 6 (#4, #6, #8, #10, #11 and #12), 4 (#4, #6, #11 and #12) and 6 (#4, #6, #7, #9, #11 and #12) datasets, respectively. Furthermore, EMC-FS obtained the same results as AntRSAR, ACOAR and SSAR on 7 (#1, #2, #3, #5, #7, #9 and #13), 9 (#1, #2, #3, #5, #7, #8, #9, #10 and #13) and 7 (#1, #2, #3, #5, #8, #10 and #13) datasets, respectively (see Table 2).

The results presented in Table 2 clearly show that, on many datasets, EMC-FS outperformed both single solution-based and population-based methods. We believe this is because EMC-FS is able to create a balance between intensification and diversification strategies, which are controlled by the counters t and q throughout the search process. Furthermore, EMC-FS has fewer parameters that need to be tuned compared to the other methods, which have several parameters requiring tuning in advance: TSAR has 7 parameters, GenRSAR has 3 parameters, SimRSAR has 2 parameters, SSAR has 5 parameters, AntRSAR has 2 parameters and ACOAR has 8 parameters.


Fig. 1. The pseudo code for EMC-FS.

Table 1. Datasets used in the experiments.

Datasets | No. of features | No. of objects
M-of-N   | 13 | 1000
Exactly  | 13 | 1000
Exactly2 | 13 | 1000
Heart    | 13 | 294
Vote     | 16 | 300
Credit   | 20 | 1000
Mushroom | 22 | 8124
LED      | 24 | 2000
Letters  | 25 | 26
Derm     | 34 | 366
Derm2    | 34 | 358
WQ       | 38 | 521
Lung     | 56 | 32



In Table 3, we compare the computational time of EMC-FS against that of ACOAR. It should be noted that only ACOAR reported the computation time for each dataset. As can be seen from Table 3, the computational time of EMC-FS is lower than that of ACOAR on all tested datasets except Lung, and the overall average time (last row in Table 3) of EMC-FS (1.97 s) is lower than that of ACOAR (2.59 s), indicating that EMC-FS is faster than ACOAR.

Among the compared algorithms, only Jensen and Shen (2003) reported the overall average time across all tested datasets, which does not give a clear indication of the computational effort. However, if we consider the average time of our algorithm over all instances, our average time is 1.97 s, which is less than that of the SimRSAR, AntRSAR and GenRSAR methods of Jensen and Shen (2003). It should be noted that differences in the computer resources used by other researchers make time comparisons with other approaches unfair, if not impossible.


Table 2. Results of EMC-FS compared to existing methods in terms of the minimal number of features over 20 runs.

#  Dataset  | EMC-FS       | TSAR (Hedar et al., 2008) | SimRSAR (Jensen, 2003) | AntRSAR (Jensen, 2003) | GenRSAR (Jensen, 2003) | ACOAR (Ke et al., 2008) | SSAR (Jue et al., 2009)
1  M-of-N   | 6            | 6                  | 6                | 6                  | 6(6) 7(12)      | 6                  | 6
2  Exactly  | 6            | 6                  | 6                | 6                  | 6(10) 7(10)     | 6                  | 6
3  Exactly2 | 10           | 10                 | 10               | 10                 | 10(9) 11(11)    | 10                 | 10
4  Heart    | 5(3) 6(17)   | 6                  | 6(29) 7(1)       | 6(18) 7(2)         | 6(18) 7(2)      | 6                  | 6
5  Vote     | 8            | 8                  | 8(15) 9(15)      | 8                  | 8(2) 9(18)      | 8                  | 8
6  Credit   | 8            | 8(13) 9(5) 10(2)   | 8(18) 9(1) 11(1) | 8(12) 9(4) 10(4)   | 10(6) 11(14)    | 8(16) 9(4)         | 8(9) 9(8) 10(3)
7  Mushroom | 4            | 4(17) 5(3)         | 4                | 4                  | 5(1) 6(5) 7(14) | 4                  | 4(12) 5(8)
8  LED      | 5            | 5                  | 5                | 5(12) 6(4) 7(3)    | 6(1) 7(3) 8(16) | 5                  | 5
9  Letters  | 8            | 8(17) 9(3)         | 8                | 8                  | 8(8) 9(12)      | 8                  | 8(5) 9(15)
10 Derm     | 6            | 6(14) 7(6)         | 6(12) 7(8)       | 6(17) 7(3)         | 10(6) 11(14)    | 6                  | 6
11 Derm2    | 8(19) 9(1)   | 8(2) 9(14) 10(4)   | 8(3) 9(7)        | 8(3) 9(17)         | 10(4) 11(16)    | 8(4) 9(16)         | 8(2) 9(18)
12 WQ       | 12(17) 14(3) | 12(1) 13(13) 14(6) | 13(16) 14(4)     | 12(2) 13(7) 14(11) | 16              | 12(4) 13(12) 14(4) | 13(4) 14(16)
13 Lung     | 4            | 4(6) 5(13) 6(1)    | 4(7) 5(12) 6(1)  | 4                  | 6(8) 7(12)      | 4                  | 4

Table 3. The average runtimes over all runs (in seconds) of EMC-FS compared to ACOAR.

#  Dataset  | EMC-FS | ACOAR
1  M-of-N   | 0.14   | 0.31
2  Exactly  | 0.16   | 0.38
3  Exactly2 | 0.15   | 0.34
4  Heart    | 0.11   | 0.21
5  Vote     | 0.14   | 0.23
6  Credit   | 0.95   | 1.78
7  Mushroom | 1.57   | 2.01
8  LED      | 0.95   | 1.37
9  Letters  | 0.11   | 0.21
10 Derm     | 0.42   | 0.80
11 Derm2    | 3.24   | 5.72
12 WQ       | 16.88  | 20.26
13 Lung     | 0.80   | 0.17
Average     | 1.97   | 2.59



To determine whether there is a significant difference between EMC-FS and the best-performing algorithms in the literature, we further validate our results by conducting a statistical analysis using a Wilcoxon rank test with a 0.05 critical level. The p-values obtained from the Wilcoxon test for EMC-FS and the compared methods are presented in Table 4. In this table, the symbol "S+" means EMC-FS is statistically significantly better than the compared method (p-value < 0.05), "≈" means EMC-FS is marginally better than the compared method (p-value < 0.1), "S−" means the compared method is better than EMC-FS (p-value > 0.05) and "=" means both EMC-FS and the compared method obtained the same result.

Table 4. The p-value results of EMC-FS compared to other algorithms.

#  EMC-FS vs. | TSAR | SimRSAR | AntRSAR | GenRSAR | ACOAR | SSAR
1  M-of-N     | =    | =       | =       | S+      | =     | =
2  Exactly    | =    | =       | =       | S+      | =     | =
3  Exactly2   | =    | =       | =       | S+      | =     | =
4  Heart      | ≈    | S+      | S+      | S+      | ≈     | ≈
5  Vote       | =    | S+      | =       | S+      | =     | =
6  Credit     | S+   | S−      | S+      | S+      | S+    | S+
7  Mushroom   | S+   | =       | =       | S+      | =     | S+
8  LED        | =    | =       | S+      | S+      | =     | =
9  Letters    | S+   | =       | =       | S+      | =     | S+
10 Derm       | S+   | S+      | ≈       | S+      | =     | =
11 Derm2      | S+   | S+      | S+      | S+      | S+    | S+
12 WQ         | S+   | S+      | S+      | S+      | S+    | S+
13 Lung       | S+   | S+      | =       | S+      | =     | =

Note: "S+" means EMC-FS is statistically significantly better than the compared method (p-value < 0.05), "≈" means EMC-FS is marginally better than the compared method (p-value < 0.1), "S−" means the compared method is better than EMC-FS (p-value > 0.05) and "=" means both EMC-FS and the compared method obtained the same result.


From Table 4 we can draw the following conclusions:

• EMC-FS is significantly better than TSAR on 7 datasets (#6, #7, #9, #10, #11, #12 and #13), marginally better on 1 dataset (#4) and obtained the same result on 5 datasets (#1, #2, #3, #5 and #8).

• EMC-FS is significantly better than SimRSAR on 6 datasets (#4, #5, #10, #11, #12 and #13), obtained the same result on 6 datasets (#1, #2, #3, #7, #8 and #9) and is not better than SimRSAR on 1 dataset (#6).

• EMC-FS is significantly better than AntRSAR on 5 datasets (#4, #6, #8, #11 and #12), marginally better on 1 dataset (#10) and obtained the same result on 7 datasets (#1, #2, #3, #5, #7, #9 and #13).

• EMC-FS is significantly better than GenRSAR on all tested datasets (#1 to #13).

• EMC-FS is significantly better than ACOAR on 3 datasets (#6, #11 and #12), marginally better on 1 dataset (#4) and obtained the same result on 9 datasets (#1, #2, #3, #5, #7, #8, #9, #10 and #13).

• EMC-FS is significantly better than SSAR on 5 datasets (#6, #7, #9, #11 and #12), marginally better on 1 dataset (#4) and obtained the same result on 7 datasets (#1, #2, #3, #5, #8, #10 and #13).

Based on the p-values reported in Table 4, EMC-FS is able to produce very good results across all datasets, and better than the best known results in some cases. This is consistent with the No Free Lunch Theorem (Wolpert & Macready, 1997), which postulates that a generic solution method that can beat all algorithms does not exist. What we would like to stress is that EMC-FS managed to produce good results and outperformed existing techniques on at least one or a few datasets. Overall, the results indicate that the proposed EMC-FS is an effective method for feature selection problems.

4.2. Number of rules and classification accuracy

We carried out further experiments to generate the number of rules and the classification accuracy for all the datasets based on the obtained subsets of features reported in Section 4.1. These experiments are carried out using the ROSETTA software, which is a tool for analysing data within the framework of rough set theory and is available at http://www.lcb.uu.se/tools/rosetta/.


Table 5. Parameter settings for GA and Johnson's algorithm within the ROSETTA toolkit.

No | Parameter                                                              | GA    | Johnson's
1  | Crossover probability                                                  | 0.3   | –
2  | Mutation probability                                                   | 0.05  | –
3  | Inversion probability                                                  | 0.05  | –
4  | Population size                                                        | 70    | –
5  | No. of crossover points                                                | 1     | –
6  | Stopping criteria (no. of generations to wait for fitness to improve)  | 30    | –
7  | No. of mutations on an individual                                      | 1     | –
8  | No. of transpositions for inversion                                    | 1     | –
9  | Random number generator seed                                           | 12345 | –
10 | Boundary region thinning                                               | –     | 0.1
11 | Hitting fraction                                                       | –     | 0.95

Table 6. Comparison results on the number of rules and classification accuracy generated by EMC-FS and other methods. For each method, the first value is the average number of rules and the second is the accuracy (%).

Dataset  | EMC-FS (rules / acc. %) | Genetic algorithm (rules / acc. %) | Johnson's algorithm (rules / acc. %) | Holte's 1R algorithm (rules / acc. %)
M-of-N   | 64 / 100        | 8658 / 95     | 45 / 99          | 26 / 63
Exactly  | 64 / 100        | 28350 / 73    | 271 / 94         | 26 / 68
Exactly2 | 606 / 57        | 27167 / 74    | 441 / 79         | 26 / 73
Heart    | 233 / 18        | 1984 / 81     | 133 / 68         | 50 / 64
Vote     | 136 / 68        | 2517 / 95     | 54 / 95          | 48 / 90
Credit   | 817 / 14        | 47450 / 72    | 560 / 67         | 81 / 69
Mushroom | 53 / 99         | 8872 / 100    | 85 / 100         | 112 / 89
LED      | 10 / 100        | 41908 / 98    | 316 / 99         | 48 / 63
Letters  | 23 / 0          | 1139 / 0      | 20 / 0           | 50 / 0
Derm     | 115 / 55        | 7485 / 94     | 92 / 89          | 129 / 48
Derm2    | 302 / 11        | 8378 / 92     | 98 / 88          | 138 / 48
WQ       | 459 / 4         | 48687 / 69    | 329 / 58         | 94 / 51
Lung     | 25 / 33         | 1387 / 71     | 12 / 69          | 156 / 73
Average  | 223.6154 / 50.69231 | 17998.62 / 78 | 188.9231 / 77.30769 | 75.69231 / 61.46154

Note: the number of rules is to be minimised (the lower, the better); the classification accuracy is to be maximised (the higher, the better).


The results of EMC-FS are compared to the genetic algorithm (GA), Johnson's algorithm and Holte's 1R algorithm that are available in ROSETTA. Standard Voting within the ROSETTA toolkit has been used as the classifier. The parameter settings for GA and Johnson's algorithm are given in Table 5. Note that Standard Voting and Holte's algorithm are based on IF-THEN rules (no parameter setting is involved).

Fig. 2. Comparison of the number of rules generated by EMC-FS and GA.

Note that the comparison methods described in Section 4.1 above did not consider the number of rules and the classification accuracy as additional performance measures. We performed 10-fold cross-validation to evaluate the classification accuracy, as suggested by Ambroise and McLachlan (2002), Han (2006) and Borra and Ciaccio (2010). Each dataset is randomly divided into 10 subsamples, where 1 subsample is treated as a validation set (i.e., 10% as the validation set) and 9 subsamples are treated as training sets (i.e., 90% as training sets). The results from the 10-fold cross-validation are averaged to produce a single measure called the predicted accuracy. This process was performed on the 13 datasets. The results obtained are shown in Table 6. The best results are highlighted in bold.
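The cross-validation procedure described in the preceding paragraph can be sketched as follows (a generic illustration; the classifier used in the paper is ROSETTA's Standard Voting, represented here by a placeholder callback):

```python
import random

def ten_fold_accuracy(objects, labels, train_and_score, seed=0):
    """10-fold cross-validation: split the data into 10 random folds, train on 9,
    validate on 1, and average the resulting accuracies (the predicted accuracy)."""
    idx = list(range(len(objects)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::10] for k in range(10)]          # 10 roughly equal subsamples
    scores = []
    for k in range(10):
        held_out = set(folds[k])
        train_x = [objects[i] for i in idx if i not in held_out]
        train_y = [labels[i] for i in idx if i not in held_out]
        test_x = [objects[i] for i in folds[k]]
        test_y = [labels[i] for i in folds[k]]
        # train_and_score is a placeholder classifier returning an accuracy in [0, 1]
        scores.append(train_and_score(train_x, train_y, test_x, test_y))
    return sum(scores) / len(scores)
```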

We first compare EMC-FS with the other methods (the genetic algorithm (GA), Johnson's algorithm and Holte's 1R) based on the number of generated rules (the lower, the better). From Table 6, in terms of the number of generated rules, EMC-FS produces the two best results, namely on the Mushroom and LED datasets, where 53 and 10 rules are generated, respectively. Considering individual comparisons, we can draw the following conclusions (the evaluation criterion is the number of generated rules, where lower is better):

• In all of the cases, EMC-FS is better than GA (see Fig. 2 and Table 6).

• When compared to Johnson's algorithm, EMC-FS is better on three datasets (i.e., Exactly, Mushroom and LED); see Fig. 3 and Table 6.

• EMC-FS is better than Holte's 1R algorithm on 5 out of 13 datasets (i.e., Mushroom, LED, Letters, Derm and Lung); see Fig. 3 and Table 6.

Based on the overall averages presented in the last row of Table 6, we can see that EMC-FS obtained the third best result.


Fig. 3. Comparison of the number of rules generated by EMC-FS, Johnson's algorithm and Holte's 1R.


It should be noted that the results of EMC-FS are based on a subset of features, whilst the compared methods (the genetic algorithm (GA), Johnson's algorithm and Holte's 1R) use all features in the given datasets. We can see that, with just the subset of features generated by EMC-FS, the results of EMC-FS are better than those of GA on all datasets in terms of the number of rules and very competitive, if not better on some datasets, in terms of classification accuracy when compared to Johnson's and Holte's 1R algorithms. Such results clearly demonstrate the benefit of EMC-FS in reducing the number of features while avoiding information loss.

We now compare EMC-FS with the other methods (the genetic algorithm (GA), Johnson's algorithm and Holte's 1R) from the classification accuracy perspective (the higher, the better). Examining the classification accuracy shown in Table 6 and Fig. 4, we can see that EMC-FS achieves 100% classification accuracy on three datasets (M-of-N, Exactly and LED) and 99% classification accuracy on the Mushroom dataset. The accuracy of EMC-FS is between 50% and 70% on three datasets, and on five of the datasets the accuracy is between 4% and 35% (we exclude the 0% accuracy on the Letters dataset). In comparison with the other three algorithms, EMC-FS is comparable to Holte's 1R algorithm, but worse than GA and Johnson's algorithm on eight datasets. Again, we would like to stress that the classification accuracy of EMC-FS is calculated using a subset of features, whilst all the compared methods use all available features in the given dataset. Therefore, for a few datasets, the classification accuracy achieved by EMC-FS is slightly worse than that of the other methods.

Fig. 4. Comparison of the classification accuracy.

From Table 6, we observe that the distribution of the classification accuracy across all datasets using EMC-FS is between 4% and 100%. This reflects a large range in classification accuracy and indicates that EMC-FS (with respect to the classification accuracy) behaves differently on different datasets. We believe this is due to the complexity of the datasets themselves (in terms of the number of features and objects).

We would like to stress that the ROSETTA software is used to verify the classification accuracy of the subset of features obtained by EMC-FS when compared to other methods (the genetic algorithm (GA), Johnson's algorithm and Holte's 1R algorithm) that use the original set of features. Thus, one can clearly see that EMC-FS managed to reduce the number of features for each dataset and obtained competitive, if not better on some datasets, classification accuracy compared to methods that use all available features. From the learning and computation time perspectives, EMC-FS is more beneficial than the other methods because it generates a small number of features without information loss, which can help make the prediction process more accurate and faster.

Another important point is that all the methods in the comparison (Section 4.1) only report the minimal number of obtained features without any further validation or testing in terms of the generated rules (descriptive patterns) and the classification accuracy. This raises the question of how good those results are in terms of generated rules (descriptive patterns) and classification accuracy, which are the main measures for evaluating the obtained subset of features.

5. Conclusion

The feature selection problem has been studied in this work. An Exponential Monte-Carlo algorithm (EMC-FS) has been proposed to solve the problem. EMC-FS always accepts an improved solution but accepts worse solutions probabilistically. Numerical experiments on 13 well-known datasets have been presented to show the effectiveness of EMC-FS in producing the smallest subset of features when compared to state-of-the-art approaches. In addition, the results showed that EMC-FS is very competitive in terms of the number of generated rules and classification accuracy when compared to GA, Johnson's algorithm and Holte's 1R algorithm. In future work, we intend to tackle other data mining problems using the proposed algorithm. Applying an adaptive mechanism to control the parameters of the proposed algorithm would be an interesting subject. We believe that better solutions can be obtained if the parameters employed here are adaptively modified, for example, by adaptively changing the number of consecutive non-improvement solutions based on the quality of the current solution. We are also interested in hybridizing the proposed algorithm with other population based algorithms such as genetic algorithms and scatter search.

References

Abdullah, S., & Jaddi, N. S. (2010). Great deluge algorithm for rough set attribute reduction. In Y. Zhang, A. Cuzzocrea, J. Ma, K.-I. Chung, T. Arslan, & X. Song (Eds.), Communications in computer and information science, 1. Database theory and application, bio-science and bio-technology (Vol. 118, pp. 189–197). Berlin Heidelberg New York: Springer. ISBN: 978-3-642-17621-0.

Abdullah, S., Burke, E. K., & McCollum, B. (2005). An investigation of variable neighbourhood search for the course timetabling problem. In The 2nd multidisciplinary international conference on scheduling: Theory and applications (MISTA) (pp. 413–427). New York, July 18–21, 2005.

Abdullah, S., Burke, E. K., & McCollum, B. (2007). Using a randomised iterative improvement algorithm with composite neighbourhood structures for the university course timetabling problem. In K. F. Doerner, M. Gendreau, P. Greistorfer, W. J. Gutjahr, R. F. Hartl, & M. Reimann (Eds.), Metaheuristics – Progress in complex systems optimization. Computer Science Interfaces Book Series (Vol. 39, pp. 153–169). Springer Operations Research. ISBN-13: 978-0-387-71919-1.


Abdullah, S., Sabar, N. R., Ahmad Nazri, M. Z., Turabieh, H., & McCollum, B. (2010). A constructive hyper-heuristics for rough set attribute reduction. ISDA, 2010, 1032–1035.

Ambroise, C., & McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562–6566.

Arajy, Y. Z., & Abdullah, S. (2010). Hybrid variable neighbourhood search algorithm for attribute reduction in rough set theory. ISDA, 2010, 1015–1020.

Ayob, M., & Kendall, G. (2003). A Monte Carlo hyper-heuristic to optimise component placement sequencing for multi head placement machine. In Proc. of the international conference on intelligent technologies, InTech'03, Chiang Mai (pp. 17–19).

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine: University of California.

Borra, S., & Ciaccio, A. D. (2010). Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics and Data Analysis, 54, 2976–2989.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. In J. Gray (Series Ed.), The Morgan Kaufmann series in data management systems (2nd ed.). Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6.

Hedar, A.-R., Wang, J., & Fukushima, M. (2008). Tabu search for attribute reduction in rough set theory. Soft Computing, 12(9), 909–918.

Jaddi, N. S., & Abdullah, S. (2013a). Nonlinear great deluge algorithm for rough set attribute reduction. Journal of Information Science and Engineering, 29, 49–62.

Jaddi, N. S., & Abdullah, S. (2013b). Hybrid of genetic algorithm and great deluge for rough set attribute reduction. Turkish Journal of Electrical Engineering and Computer Sciences. http://dx.doi.org/10.3906/elk-1202-113.

Jensen, R., & Shen, Q. (2003). Finding rough set reducts with ant colony optimization. In Proceedings of the 2003 UK workshop on computational intelligence (pp. 15–22).

Jensen, R., & Shen, Q. (2004). Semantics-preserving dimensionality reduction: Rough and fuzzy-rough-based approaches. IEEE Transactions on Knowledge and Data Engineering, 16(12), 1457–1471.

Jihad, S. K., & Abdullah, S. (2010). Investigating composite neighbourhood structure for attribute reduction in rough set theory. ISDA, 2010, 1183–1188.

Jue, W., Hedar, A. R., Guihuan, Z., & Shouyang, W. (2009). Scatter search for rough set attribute reduction. Computational Sciences and Optimization 2009, CSO 2009, 1, 531–535.

Ke, L., Feng, Z., & Ren, Z. (2008). An efficient ant colony optimization approach to attribute reduction in rough set theory. Pattern Recognition Letters, 29(9), 1351–1357.

Mafarja, M., & Abdullah, S. (2013a). Fuzzy record to record travel algorithm in solving rough set attribute reduction. International Journal of Systems Science. http://dx.doi.org/10.1080/00207721.2013.791000, pp. 1–10.

Mafarja, M., & Abdullah, S. (2013b). Investigating memetic algorithm in solving rough set attribute reduction. International Journal of Computer Applications in Technology, 48(3), 195–201.

Osman, I. H. (1993). Metastrategy simulated annealing and tabu search algorithms for the vehicle routing problem. Annals of Operations Research, 41, 421–451.

Osman, I. H., & Christofides, N. (1994). Capacitated clustering problems by hybrid simulated annealing and tabu search. International Transactions in Operational Research, 1, 317–336.

Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer Academic Publishers.

Sabar, N. R., Ayob, M., & Kendall, G. (2009). Tabu exponential Monte-Carlo with counter heuristic for examination timetabling. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Scheduling (CISched 2009) (pp. 90–94). Nashville, Tennessee, USA.

Skowron, A., & Grzymala-Busse, J. (1994). From rough set theory to evidence theory. In Advances in the Dempster–Shafer theory of evidence (pp. 193–236). John Wiley & Sons, Inc.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

Wroblewski, J. (1995). Finding minimal reducts using genetic algorithms. In Proc. second ann. joint conf. information sciences (pp. 186–189).

Zhang, W., Qiu, G., & Wu, W. (2007). A general approach to attribute reduction in rough set theory. Science in China Series F: Information Sciences, 50(2), 188–197.