
SPECIAL ISSUE

On principal component analysis for high-dimensional XCSR

Mohammad Behdad • Tim French • Luigi Barone • Mohammed Bennamoun

Received: 7 January 2012 / Accepted: 8 May 2012 / Published online: 22 May 2012

© Springer-Verlag 2012

Abstract  XCSR is an accuracy-based learning classifier system which can handle classification problems with real-valued features. However, as the number of features increases, high classification accuracy comes at the cost of more resources: larger population sizes and longer computational running times. In this paper we investigate PCA-XCSR (a sequential application of PCA and XCSR) in three environments with different characteristics: a discrete and imbalanced environment (KDD'99 network intrusion), a continuous and highly symmetric environment (MiniBooNE), and a highly discrete, highly imbalanced environment (Census/Income (KDD)). These experiments show that in the three different environments, PCA-XCSR, in addition to reducing the computational resources and time requirements of XCSR by approximately 50 %, is able to consistently maintain its high accuracy. It also reduces the population size required by XCSR. Finally, we suggest heuristics for selecting the number of principal components to use with PCA-XCSR.

Keywords  Genetics-based machine learning · Learning classifier systems · XCSR · Dimensionality reduction · Principal components analysis

1 Introduction

First proposed by John Holland [12] in 1976, a learning classifier system (LCS) is a machine learning technique that combines reinforcement learning, evolutionary computing, and other heuristics to produce an adaptive system capable of learning in dynamic environments. It uses a population of condition-action-prediction rules called classifiers to encode appropriate actions for different input states. Learning classifier systems (LCSs) are good at solving complex adaptive problems and have been successfully used in data mining applications [4, 7, 21]. However, when they are used for high-dimensional problems, they slow down significantly [13]. XCS [24] is an accuracy-based LCS in which it is not the classifiers which produce the greatest rewards that have the best chance of propagating their genes into subsequent generations, but rather the classifiers which predict the reward more accurately, independent of the magnitude of the reward.

KDD'99 [11], a popular network intrusion data set, is one of the three high-dimensional data sets experimented with in this research. Each instance of this set describes a network connection using 41 features describing properties of the connection (such as how many bytes are transferred, the number of file accesses, or the number of times a login was attempted). Many of these connections are legitimate, but a number are malicious (such as attempts to gain root access to a system). The classifier must try to determine whether a connection is benign, or whether it corresponds to one of several different attack types. This data set has some interesting characteristics: it is large in both the number of instances and the number of dimensions, and its dimensions are of different types, including categorical, Boolean and numeric.

M. Behdad (✉) · T. French · L. Barone · M. Bennamoun
School of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia
e-mail: [email protected]

Evol. Intel. (2012) 5:129–138. DOI 10.1007/s12065-012-0075-6

In order to be able to use XCSR on the KDD'99 data set, the data must first be normalised, so that XCSR can handle categorical features and so that, for numerical features, operators such as mutation and covering are not biased towards any one particular feature. The normalisation process increases the number of features from 41 to 118 (see Sect. 4.2 for details). One important property of the normalised data is that most of its features have value 0. In other words, we are dealing with sparse data with a high number of dimensions. Hence, there is a high chance that some of the features in the data are correlated or redundant. In order to exploit this characteristic, we can use a dimensionality reduction technique to "compress" the problem and reduce the search space for the learning algorithm.

Principal component analysis (PCA) is a linear dimensionality transformation method, capable of reducing the number of dimensions of a data set by transforming the problem into a new search space. In this paper, we investigate whether using PCA as a preprocessing step can reduce the time and computational resources needed by XCSR in such high-dimensional problems while maintaining its classification accuracy. We call this combination of PCA preprocessing and XCSR, PCA-XCSR.

To see if the behaviour of PCA-XCSR on the KDD'99 data set carries over to different environments, two other environments with characteristics different from those of the KDD'99 environment are also experimented with. We also present heuristics for selecting the number of principal components to use with PCA-XCSR.

In the next section, we discuss the related works and background techniques. Then, in Sect. 3, the PCA-XCSR method is explained. In Sect. 4, the data sets used in the experiments, the experimental procedure and their results are discussed. Finally, Sect. 5 concludes this work.

2 Related works

Of the three data sets we experiment with in this work, only the KDD'99 network intrusion data set has been tackled by researchers using the combination of a machine learning algorithm and PCA, and none combine an LCS with PCA. Depren et al. [9] propose an Intrusion Detection System (IDS) architecture with an anomaly detection module which in turn uses a Self-Organizing Map (SOM) structure. The SOM is initialised both randomly and using PCA; finding that the results do not differ, they adopt the simpler random initialisation. The work in [10] suggests a model for network intrusion detection that attempts to reduce the number of dimensions in the KDD'99 network intrusion detection data set. Its model combines principal component analysis (PCA), neural networks and multi-layer perceptrons. PCA is used to extract important data and to reduce high-dimensional data vectors [10]. In the case of the KDD'99 data set, PCA reduces the number of dimensions from 41 to 12, and the experimental results show a high accuracy of 93.21 %.

Shyu et al. [18] propose a PCA-based outlier detection approach for intrusion detection. They use PCA to reduce the dimensionality without sacrificing valuable information; PCA also separates the important components from the minor ones. The distance measure they use is based on both the major components that account for 50 % of the total variation and the minor components, for instance the ones with eigen-values less than 0.20. Their experiments with the KDD'99 data set are based on detecting normal instances against attack ones. They show that their method outperforms many other approaches.

The work by Bouzida et al. [6] uses PCA for network intrusion detection. It applies PCA to two different machine learning methods, namely decision trees and nearest neighbour, in order to reduce the quantity of data without loss of information. Their experiments on the KDD'99 data set show no decrease in accuracy but a major drop in processing time.

Abedini et al. [2] are motivated by the degradation of the performance of LCS when it deals with data sets of high dimensionality and relatively few instances. They propose CoXCS, a coevolutionary multi-population XCS in which isolated sub-populations evolve sets of classifiers based on a partitioning of the feature space in the data. In their follow-up paper [3], they use alternative feature space partitioning techniques in a multiple-population island-based parallel XCS, where each of the isolated populations evolves rules based on a subset of the features. They analyse their method on the Boolean multiplexer problem and report better performance and generalisation using the multiple-population XCS in comparison to the single-population XCS, although the effectiveness of the model depends on the feature space partitioning strategy used.

In our previous work [5], XCSR is applied to the KDD'99 network intrusion data set. The results are good compared to the winners of that competition: the accuracy is within 3.5 % of that of the winner. During that research we noticed a problem: the long running times that XCSR required in order to do the classification. In the current research we investigate and analyse the effect of using PCA as a preprocessing step for XCSR and apply it to three real-world environments with different characteristics. At the end, we provide some observations, guidelines and general heuristics about its performance and mechanisms for getting the best out of it.

2.1 LCS

Learning classifier systems use a population of condition-action-prediction rules called classifiers to encode appropriate actions for different input states: if a condition is true, then performing the corresponding action is expected to lead to receiving the anticipated reward. The decision of which rule to use depends on the type of LCS used. Some use the reward (the prediction in the classifier triple) the system expects to receive from the environment as a result of performing the chosen action. In other methods, this decision is based on the accuracy of the classifiers matching the current input and their predicted reward. In essence, an LCS attempts to enhance its understanding of the environment by improving its classifiers over time.

2.1.1 XCS

XCS [24] is an accuracy-based LCS in which it is not the classifiers that produce the greatest rewards that have the best chance of propagating their genes into subsequent generations, but rather the classifiers that predict the reward most accurately, independent of the magnitude of the reward. As a result, XCS is said to cover the state space more efficiently [19]. When a state is perceived by an LCS, the rule base is scanned and any rule whose condition matches the current state is added as a member of the current match set M. In XCS (specifically in exploit phases), once M has been formed, the action with the highest fitness-weighted prediction is chosen. All members of M that propose the same action as the chosen rule then form the action set A. The accuracy of each member of the action set is then updated once the reward from the environment is determined. Another important difference of XCS from its predecessors is that the GA is applied to A (or niches) rather than to the whole classifier population. This narrows the evolutionary competition to classifiers which can be applied in similar scenarios, rather than all the classifiers in the LCS [19].

2.1.2 XCSR

XCSR [25] is a variation of XCS [22] which can handle inputs with real-valued features. In an Unordered Bound Representation (UBR) [20] encoding, the condition part of a classifier consists of a set of intervals. Each takes the form (pi, qi), where either pi or qi can be the maximum or minimum bound. When XCSR receives an instance, if the values of the features of the instance fall within the ranges of a classifier's condition, then that classifier is used in the classification process.
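As an illustration, here is a minimal sketch of a UBR-style condition and the matching test. It is only a sketch under our own assumptions: the classifier fields and default values are illustrative and do not come from the paper or from the underlying XCS implementation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Classifier:
    condition: List[Tuple[float, float]]  # UBR intervals (pi, qi); either bound may be the lower one
    action: int                           # the advocated class
    prediction: float = 10.0              # expected reward (illustrative default)
    error: float = 0.0
    fitness: float = 0.01

def matches(cl: Classifier, x: List[float]) -> bool:
    """A classifier matches when every feature value lies inside its interval."""
    for (p, q), value in zip(cl.condition, x):
        lo, hi = min(p, q), max(p, q)     # bounds are unordered, so sort them first
        if not (lo <= value <= hi):
            return False
    return True

# Example: a two-feature classifier covering [-0.5, 0.3] x [0.1, 0.9]
cl = Classifier(condition=[(0.3, -0.5), (0.1, 0.9)], action=1)
print(matches(cl, [0.0, 0.5]))   # True
print(matches(cl, [0.6, 0.5]))   # False: first feature is outside [-0.5, 0.3]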

2.2 Principal component analysis

Principal component analysis is a feature transformation technique whose purpose is to find the most important information through dimensionality reduction. It analyses a set of original variables and represents them as a set of new orthogonal variables called principal components. These principal components are calculated using linear combinations of the original variables. The principal components are listed in the order of importance. That is, the first one will have the highest variance; the second one, which is orthogonal to the first, has the second largest variance; and so on [15].

The values of the new variables are calculated by projecting the original variables onto the principal components. Each principal component has an eigen-value, which is calculated by summing the squared factor scores for that component. Dimensionality reduction is achieved by using only a subset of the principal components (the first n) instead of using all of them. For a comprehensive introduction to PCA, the reader is referred to [1].
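To make the above concrete, the following is a minimal PCA sketch in Python/NumPy under the standard textbook formulation: centre the data, eigen-decompose the covariance matrix, order the components by eigen-value and project onto the first n. The paper itself uses MATLAB's PCA routine (see Sect. 4.2); this is only an illustrative equivalent.

import numpy as np

def pca_fit(X):
    """Return the data mean, the eigen-values (descending) and the matching eigen-vectors."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigen-values
    order = np.argsort(eigvals)[::-1]        # reorder so the first component has the largest variance
    return mean, eigvals[order], eigvecs[:, order]

def pca_transform(X, mean, eigvecs, n):
    """Project the data onto the first n principal components."""
    return (X - mean) @ eigvecs[:, :n]

# Example on random data: reduce 10 features to 3 components
X = np.random.rand(100, 10)
mean, eigvals, eigvecs = pca_fit(X)
print(pca_transform(X, mean, eigvecs, n=3).shape)   # (100, 3)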

3 PCA-XCSR

As was evident from our earlier work [5] and others [2, 3], the performance of learning classifier systems degrades when they deal with problems of high dimensionality. In that work [5], when XCSR was used on data sets with a high number of dimensions or features, its running time became relatively long. Hence, the main purpose of this research is to improve the speed of XCSR on such high-dimensional data sets (Fig. 1).

Fig. 1  A high-level view of PCA-XCSR

In our method, PCA is used as a pre-processing step to compact the data set through dimensionality reduction (Fig. 2). First, the data is normalised so that XCSR can handle every type of feature in the data correctly. Then, it is compacted (transformed) by PCA before being given to XCSR to learn from. XCSR hence learns in a smaller search space: the space that results from selecting only a fixed number of dimensions (the n best principal components in the transformed space). The basis for doing this is that, as these dimensions add the most variance to the input space, they will most accurately predict the class of each input. We call this combination of PCA preprocessing and XCSR, PCA-XCSR. This presents a number of potential advantages:

1. By examining only the most significant principal components, we reduce the dimensionality of the search space.
2. As the principal components are linear combinations of dimensions, the classifier rules may naturally exploit existing dependencies among features rather than having to search out such dependencies.
3. By retaining only the components with the highest eigen-values, we eliminate noise caused by less relevant components.
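The second advantage can be illustrated with a short, self-contained sketch on synthetic data: an interval condition on a single principal component is, in the original space, a band constraint on a linear combination of the normalised features, so one rule can cover a dependency that spans several raw features. The data and the interval below are made up for illustration only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.7 * X[:, 0] + 0.3 * X[:, 1]       # introduce a dependency between raw features

mean = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
w = eigvecs[:, np.argmax(eigvals)]             # direction of the first principal component

z = (X - mean) @ w                             # the single PCA feature seen by PCA-XCSR
lo, hi = -0.5, 0.5                             # an interval condition on that component
matched = (z >= lo) & (z <= hi)                # equivalently: lo <= w . (x - mean) <= hi
print(int(matched.sum()), "instances matched by one interval over PC1")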

4 Experiments

In order to evaluate the utility of PCA-XCSR, we compare its performance with that of standard XCSR on data sets from three different environments: the KDD'99 network intrusion data set [11], the MiniBooNE particle identification data set [17], and the Census-Income (KDD) data set [14]. Our main interest is exploring the trade-off between accuracy and time for different numbers of principal components used by the learning classifier system. It should be noted that we are not interested in getting the best results for a particular data set; we are instead interested in observing the behaviour of PCA-XCSR.

4.1 Data

To see if the behaviour of PCA-XCSR is consistent on different high-dimensional data sets, three data sets from different environments are used to observe the behaviour of PCA-XCSR and standard XCSR. Since we were looking for high-dimensional data sets in this research, each data set has at least 40 features (which in some cases increase to more than 400 when normalised). Also, as an LCS is used, each data set has many instances (here, at least 40,000) for training and testing purposes.

Fig. 2  Accuracy and running time for standard XCSR vs PCA-XCSR in three different environments. Error bars, which are very tiny most of the time, show the standard error of the mean over 10 samples. Figure is best seen in colour


The first data set is the KDD'99 network intrusion data set [11]. Each element of this set describes a network connection using 41 features describing properties of the connection (such as how many bytes are transferred, the number of file accesses, or the number of times a login was attempted). Its 41 features increase to 108 after normalisation (see Sect. 4.2 for details). Many of the instances in the data set represent legitimate operations, but a number are malicious (such as attempts to gain root access to a system). The classifier must try to determine whether an instance is benign, or whether it corresponds to one of several different attack types. 10,000 of these instances are randomly selected for training and 40,000 for testing. This data set has some interesting characteristics: it is large in both the number of instances and the number of dimensions, and its dimensions are of different types, including categorical, Boolean and numeric.

The second data set is the MiniBooNE particle identification data set. This data set [17] is abstracted from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background). The whole data set contains 130,065 instances. 40,000 of these instances are randomly selected for training and 10,000 for testing. Each instance has 50 continuous features.

The third data set is the Census-Income (KDD) data set [14]. It contains census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The set contains 40 features: 7 continuous and 33 categorical. These 40 features increase to 408 after normalisation. The label for each feature vector indicates whether the income of the person is above or below $50,000. The features are thus a mix of continuous and categorical types. The number of instances in the training file is 199,523 and there are 99,762 instances in the testing file.

4.2 Procedure

To start, the data in the training file is first normalised. This involves scaling all quantitative features to lie between -1 and 1, and representing each categorical feature with independent scalar features. Thus, if there are 6 possible values for a categorical feature, that feature is replaced with 6 features. If, in an instance, the original feature (before normalisation) has the third possible value, then after normalisation the third of these features will have value 1 and the remaining five will have value -1. If there are categorical features in the data set, this process results in an increase in the number of features, but also ensures the data is very sparse.
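A minimal sketch of this step is given below. The categorical expansion follows the description above exactly; the quantitative scaling is only a stand-in (a plain standardisation), since the exact formula is not given in the text beyond the facts that it depends on the feature mean and standard deviation (see Sect. 4.5) and that most values land in [-1, 1].

import numpy as np

def normalise(rows, numeric_cols, categorical_values):
    """rows: list of dicts; categorical_values maps a column name to its list of possible values."""
    means = {c: np.mean([r[c] for r in rows]) for c in numeric_cols}
    stds = {c: np.std([r[c] for r in rows]) or 1.0 for c in numeric_cols}

    out = []
    for r in rows:
        vec = [(r[c] - means[c]) / stds[c] for c in numeric_cols]    # stand-in quantitative scaling
        for c, values in categorical_values.items():
            # one feature per possible value: 1 for the observed value, -1 for the rest
            vec.extend(1.0 if r[c] == v else -1.0 for v in values)
        out.append(vec)
    return np.array(out)

# Example: one numeric feature plus one 3-valued categorical feature -> 4 features per instance
rows = [{"duration": 0, "protocol": "tcp"},
        {"duration": 10, "protocol": "udp"},
        {"duration": 5, "protocol": "icmp"}]
print(normalise(rows, ["duration"], {"protocol": ["tcp", "udp", "icmp"]}))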

The underlying learning classifier system is based on Butz's implementation of XCS [8]. For the representation of intervals, we use the Unordered Bound Representation (UBR) [20] without limiting the ranges to lie between -1 and 1, as outliers fall outside this range after normalisation. The values of the important XCSR parameters, if not stated otherwise, are as follows: α = 0.1, β = 0.2, ν = 5, ε₀ = 0.01, θ_GA = 25, and θ_sub = 12. In order to use the LCS's subsumption pressure towards generalisation, both types of subsumption (GA and action set) are on. In these experiments, the system does not learn during the test phase.

The MATLAB Statistics Toolbox is used for the PCA preprocessing step. First, its PCA routine is used to calculate the eigen-vectors and eigen-values for the training data set. Then, each instance is normalised and transformed using the first n principal components. The experiments below vary n, with the results reporting the trade-off in performance (accuracy) versus running time. All experiments are repeated 10 times and the results averaged. To keep the conditions consistent (especially the time each experiment takes), all experiments are run on the same machine (an Intel Core2 Duo 2.50 GHz machine with 4 GB RAM) under the same load conditions. Reported times include both training and testing time and, for the PCA-XCSR trials, the time needed to perform the initial PCA pre-processing.
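For readers without MATLAB, the same preprocessing and timing scheme can be sketched in Python with scikit-learn. This is only an illustrative equivalent, and `train_xcsr` and `evaluate` are hypothetical placeholders standing in for the learning classifier system, not functions of any real library.

import time
from sklearn.decomposition import PCA

def run_trial(X_train, y_train, X_test, y_test, n, train_xcsr, evaluate):
    start = time.perf_counter()
    pca = PCA(n_components=n).fit(X_train)          # eigen-vectors/values from the training data only
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)
    model = train_xcsr(Z_train, y_train)            # hypothetical XCSR training step
    accuracy = evaluate(model, Z_test, y_test)      # hypothetical test-phase evaluation
    elapsed = time.perf_counter() - start           # includes the PCA preprocessing time
    return accuracy, elapsed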

4.3 Effects of the number of dimensions on performance

In our first set of experiments, different numbers of principal components are analysed to see their effect on both time and accuracy. For each environment, we first find the optimum population size for XCSR (the maximum number of classifiers that XCSR can maintain) by trying different population sizes. These numbers are: 16,000 for KDD'99, 4,000 for Census/Income and 6,400 for the MiniBooNE environment. These numbers are used for both XCSR and PCA-XCSR. Then, XCSR is used in all three environments and the results recorded. Finally, we vary the number of principal components (dimensions) and record the results.

Figure 2 summarises the experiments in the three environments in terms of accuracy and time. The graphs also show error bars, which indicate the standard error of the mean over 10 samples. The standard error of the mean (SEM) is the standard deviation of the sampling distribution of the mean. It is calculated using the formula SEM = s/√n, in which s is the standard deviation and n is the sample size.

Looking at the three graphs in Fig. 2, we observe several insights into the behaviour of PCA-XCSR:

1. Initially, increasing the number of principal components leads to better accuracy of PCA-XCSR.
2. The increase in accuracy does not go on forever. It stops after an optimum number of principal components, which is different for each environment: it is 5 for KDD'99, 1 for Census/Income (KDD) and 4 for the MiniBooNE environment.
3. After the first few principal components, adding more not only does not improve the accuracy of PCA-XCSR, but deteriorates it.
4. Provided that the number of principal components is correctly selected, the accuracy that PCA-XCSR achieves always surpasses the accuracy of XCSR. Figure 2 illustrates this observation on our three real-world data sets.
5. PCA-XCSR achieves these high accuracies using a very small number of principal components: fewer than 10 in our three environments.
6. Because of the low number of dimensions, PCA-XCSR achieves such high accuracy in relatively shorter times. For instance, in the KDD'99 environment, PCA-XCSR reaches 99.1 % accuracy in 28 s, while XCSR requires 233 s to reach 98.5 % accuracy. This is a major advantage of PCA-XCSR; it needs much less time to process data than XCSR, especially when the population size is large.

It should be noted that PCA-XCSR does not need the large number of classifiers (population size) that XCSR needs to achieve such high accuracy. For instance, in the KDD'99 environment, instead of 16,000 classifiers, PCA-XCSR can use 4,000 classifiers and still achieve 99.1 % accuracy. This is discussed in more detail next.

4.4 Effect of population size

Figure 3 summarises another set of experiments on the KDD'99 environment in which different population sizes ranging from 500 to 16,000 are used. The figure shows the accuracy of both PCA-XCSR (using different numbers of principal components, from 1 to 20) and XCSR. The graphs also show error bars, which indicate the standard error of the mean over 10 samples. Adding principal components beyond a certain number not only does not improve the performance, but also deteriorates it. We believe that these extra dimensions, in addition to adding noise to the problem, increase the population size needed, because when the population size is larger, a much smaller performance deterioration is observed. These experiments illustrate that PCA-XCSR is able to achieve high accuracies using small population sizes. In the KDD'99 environment, with a population size of 1,000, PCA-XCSR can reach 98.7 % accuracy using only 5 principal components, while standard XCSR reaches 96.7 % accuracy. We can see that increasing the population size beyond 4,000 allows PCA-XCSR to process more principal components without a degradation in performance. However, accuracy is not improved significantly: the best accuracy achieved by PCA-XCSR with population sizes from 1,000 to 16,000 is between 98.7 and 99.1 %. It should be noted, though, that the time each trial needs naturally increases as the population size grows.

4.5 Effect of covering parameters

In XCSR, covering is an important mechanism for creating new classifiers. When an instance is received, if no classifiers in the population are triggered by the instance, the covering operator creates new classifiers matching it. In our XCSR, the condition part of a covering classifier is created as follows: each position in the classifier has two bounds, bound1 = value - (CR/CRD) and bound2 = value + (CR/CRD), in which value is the value of the corresponding position (dimension) in the current instance, CR stands for "Covering Range", and CRD stands for "Covering Range Divisor". As mentioned before, each instance is first normalised to a range before being processed; CR is the magnitude of that range. For the range [-1, 1] which we use, CR is 2. Here is an example: assume that the first value of the feature vector of an instance (input) is 0.2. Then, with CRD = 2, the first pair of bounds in the condition will be -0.8 and 1.2; but if CRD = 20, the corresponding bounds will be 0.1 and 0.3. Therefore, the smaller the CRD, the more general the condition of the newly created classifier will be. The default value for CRD in our XCSR is 2.
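The covering rule above is simple enough to state as a few lines of code; this sketch only reproduces the interval construction and the worked example, not the rest of the covering operator (action choice, initial prediction and so on).

def covering_condition(instance, cr=2.0, crd=2.0):
    """One interval per dimension: value -/+ (CR / CRD)."""
    half_width = cr / crd
    return [(value - half_width, value + half_width) for value in instance]

# Worked example from the text: first feature value 0.2, CR = 2 for the range [-1, 1]
print(covering_condition([0.2], crd=2.0))    # approximately [(-0.8, 1.2)]: very general
print(covering_condition([0.2], crd=20.0))   # approximately [(0.1, 0.3)]: much more specific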

Wada et al. [23] show that XCSR is very sensitive to covering parameter settings. They state that when the conditions of new classifiers have a generalisation level smaller than a certain value, the performance of XCSR deteriorates. Our next set of experiments examines and compares the sensitivity of XCSR and PCA-XCSR to the level of generalisation of the classifiers created by covering (which is tuned in our code using CRD). On the KDD'99 data set, a small population size (500) and a large population size (16,000) were selected and both XCSR and PCA-XCSR were run on them, each with both a small CRD value (2) and a large one (20). The results of these experiments are shown in Fig. 4.

As can be seen in Fig. 4, XCSR is sensitive to the CRD value, and its sensitivity is more notable for smaller population sizes. With a population size of 500, when CRD is increased from 2 to 20, the accuracy of XCSR drops by more than 10 % (from 96.51 to 85.13 %). For larger population sizes such as 16,000, this effect still exists but to a lesser extent (more than 2 %). However, according to Fig. 4, PCA-XCSR is robust to these changes: its accuracy changes by less than 0.5 % even for small population sizes such as 500. Therefore, it seems that PCA-XCSR has one less parameter to worry about: finding the best generalisation level for covering (CRD). It should be noted that experiments with the two other environments yield similar results.

This capability of PCA-XCSR is of more significance in problems which may have concept drift. Concept drift happens when the statistical properties of the environment change in unpredictable ways. For instance, in a network intrusion detection problem, one of the features of the instances is the connection speed. As technology advances, the average speed and its standard deviation may change. This will affect the normalisation procedure, which depends on these two values. Therefore, the previous parameter settings of the covering operator may not be the right ones any more. Hence, the robustness of PCA-XCSR to the covering operator's parameter settings becomes very important in such situations.

4.6 Heuristics for selecting the number of principal components

When using PCA-XCSR, what is the optimal choice for the number of principal components? Optimal is subjective as it refers to the trade-off between time and accuracy. However, our results show that maximum accuracy is normally achieved using very few components. Therefore, our interest is primarily in determining how many components will give the maximum accuracy for a given population.

Fig. 3  Accuracy of XCSR vs PCA-XCSR using different population sizes in the KDD'99 environment. Error bars, which are very tiny most of the time, show the standard error of the mean over 10 samples

The method we used in our experiments is the following: perform several trials on the training data, incrementing the number of principal components by one each time until the performance starts dropping; then choose the smallest number of principal components which gives the highest performance. This method is used in much research that uses PCA, such as [10]. The optimum numbers of components in the three environments are as follows: for KDD'99, five components gave 99.1 %; for Census/Income, one component gave 93.8 %; and for MiniBooNE, four components gave 85.8 % accuracy.
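In code, this selection procedure amounts to a short loop; the sketch below assumes a hypothetical run_trial(n) callable that returns the training accuracy obtained with n principal components.

def choose_n_components(run_trial, max_components):
    """Increment n until accuracy drops, then keep the smallest n with the best accuracy."""
    accuracies = []
    for n in range(1, max_components + 1):
        accuracies.append(run_trial(n))
        if len(accuracies) >= 2 and accuracies[-1] < accuracies[-2]:
            break                                    # performance started dropping
    return accuracies.index(max(accuracies)) + 1     # components are numbered from 1

# Example with a made-up accuracy profile that peaks at n = 5 (cf. KDD'99)
profile = {1: 0.95, 2: 0.96, 3: 0.975, 4: 0.98, 5: 0.991, 6: 0.985}
print(choose_n_components(lambda n: profile[n], max_components=10))   # 5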

It should be noted that we also made an observation regarding the population size: the optimal number of components for an environment is independent of the population size of PCA-XCSR (although this does not hold for unreasonably small population sizes).

4.6.1 Eigen-values

When using PCA, the ranking of the principal components is based on the value of their eigen-values (introduced in Sect. 2.2), so using only the first few principal components means using most of the information. As the first n principal components capture the most variance, the choice of n is subjective, and different criteria have been suggested by different researchers: a general rule of thumb is to use only the principal components whose eigen-values are larger than 1. However, this rule does not work in all domains [16]. Figure 6 shows the eigen-values of the first 12 principal components for the three test environments, along with some values that can be used as guidelines to set the number of principal components.

Fig. 4  Effect of covering parameter settings on the performance of XCSR versus PCA-XCSR

Fig. 5  A problem with two features (x and y) and two classes (A and B). Its dimensions are transformed into PC1 and PC2 after PCA. It is an example of a problem in which the variance of a feature does not correspond to its importance in the classification

Heuristic 1  Use all the principal components whose eigen-values are greater than 1. Looking at Fig. 6, we see that using this heuristic, PCA-XCSR achieves relatively good results: 98.4 % accuracy for KDD'99, 92 % for Census/Income, and 81 % for MiniBooNE.

Heuristic 2  Use n principal components, where n is the index of the first principal component for which the cumulative eigen-value contribution exceeds 50 %. This heuristic amounts to using more than 50 % of the information, or variance, in the data. Again, looking at Fig. 6, we see that PCA-XCSR can achieve high accuracies using this heuristic.
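Both heuristics are easy to apply once the eigen-values are known; the sketch below uses made-up eigen-values (the real per-environment values are the ones shown in Fig. 6).

import numpy as np

def heuristic_1(eigvals):
    """Number of principal components whose eigen-value is greater than 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

def heuristic_2(eigvals, threshold=0.5):
    """Smallest n whose cumulative eigen-value contribution exceeds the threshold."""
    contrib = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(contrib, threshold) + 1)

eigvals = [5.2, 2.1, 1.4, 0.9, 0.6, 0.3, 0.2, 0.1]   # made-up values, sorted descending
print(heuristic_1(eigvals))   # 3  (three eigen-values exceed 1)
print(heuristic_2(eigvals))   # 2  (the first two components cover more than 50 % of the total)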

It is worth noting that Fig. 6 shows a different pattern in the MiniBooNE eigen-values. All of the variability (information) is in the first three principal components, and mostly in the first, whereas we usually see a gradual decrease in the eigen-values. This may be the reason for the different performance profile of MiniBooNE in Fig. 2.

It is not always the case that the variance of the values of features, as captured by the eigen-values, is helpful in the classification task. Sometimes a feature with a smaller variance (and therefore a smaller eigen-value) is more important in determining the class of instances than a feature with a large variance. Consider the problem depicted in Fig. 5, in which instances belong to two classes, A and B. PCA does a good job of transforming the space and making the rectangular shapes parallel to the new axes, PC1 and PC2. However, relying on the variance means giving more weight to the PC1 dimension, which is irrelevant in this problem and will mislead the classifier. In such situations it is important to have a preprocessing step which uses the classes (labels) of instances. This is a topic of its own and a good direction for future research (Fig. 6).

Fig. 6  The first 12 principal components of the three environments with each one's corresponding PCA-XCSR accuracy, eigen-value, and cumulative eigen-value contribution to the total eigen-value

5 Conclusion and future works

XCSR's good performance in high-dimensional environments comes at the cost of high computational and time requirements. We show in this research that PCA-XCSR, a combination of PCA and XCSR, is a good alternative in such environments. We analyse PCA-XCSR in three environments with different characteristics and confirm that it is capable of achieving high accuracies using fewer computational and time resources. We show that, in contrast to XCSR, PCA-XCSR is robust to the parameter settings of the covering operator, which is of high significance in environments with concept drift. Finally, two heuristics are presented for choosing the smallest number of principal components with which PCA-XCSR can achieve high accuracies.

Future work includes finding heuristics which also take into account the population size used by XCSR, the eigen-value distribution of the principal components and the classes of instances (in a preprocessing step). Heuristics for finding the minimum population size needed for each environment, based on the training-set size and the number of dimensions, are another important question. In addition, PCA is a linear transformation method, and for non-linear environments (consider a sphere) it will not be a good choice to complement XCSR; investigating non-linear dimensionality reduction methods is another research avenue. The last thing to look at would be designing a dynamic PCA-XCSR, a version of PCA-XCSR that would be able to increase or decrease the number of principal components automatically as it is being used.

Acknowledgments  The first author would like to acknowledge the financial support provided by the Robert and Maude Gledden Scholarship.

References

1. Abdi H, Williams LJ (2010) Principal component analysis. Wiley Interdiscip Rev Comput Stat 2(4):433–459
2. Abedini M, Kirley M (2009) CoXCS: a coevolutionary learning classifier based on feature space partitioning. In: Nicholson A, Li X (eds) AI 2009: advances in artificial intelligence, lecture notes in computer science, vol 5866. Springer, Berlin, pp 360–369
3. Abedini M, Kirley M (2010) A multiple population XCS: evolving condition-action rules based on feature space partitions. In: Evolutionary computation (CEC), 2010 IEEE congress, pp 1–8
4. Bacardit J, Stout M, Hirst J, Krasnogor N (2008) Data mining in proteomics with learning classifier systems. In: Bull L, Bernado-Mansilla E, Holmes J (eds) Learning classifier systems in data mining, studies in computational intelligence, vol 125. Springer, Berlin, pp 17–46
5. Behdad M, Barone L, French T, Bennamoun M (2010) An investigation of real-valued accuracy-based learning classifier systems for electronic fraud detection. In: Proceedings of the 12th annual conference companion on genetic and evolutionary computation, pp 1893–1900
6. Bouzida Y, Cuppens F, Cuppens-Boulahia N, Gombault S (2004) Efficient intrusion detection using principal component analysis. In: Proceedings of the 3eme Conference sur la Securite et Architectures Reseaux (SAR)
7. Bull L, Bernado-Mansilla E, Holmes J (2008) Learning classifier systems in data mining. Studies in computational intelligence. Springer, Berlin
8. Butz MV, Wilson SW (2000) An algorithmic description of XCS. Technical report 2000017, Illinois Genetic Algorithms Laboratory
9. Depren O, Topallar M, Anarim E, Ciliz MK (2005) An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks. Expert Syst Appl 29(4):713–722
10. Golovko V, Vaitsekhovich L, Kochurko P, Rubanau U (2007) Dimensionality reduction and attack recognition using neural network approaches. In: Neural networks, 2007. IJCNN 2007. International joint conference, pp 2734–2739
11. Hettich S, Bay SD (1999) KDD'99 network intrusion detection data set. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/]
12. Holland JH (1976) Adaptation. Prog Theor Biol 4:263–293
13. Kirley M, Abedini M (2009) CoXCS: a coevolutionary learning classifier based on feature space partitioning. In: Lecture notes in computer science, vol 5866, pp 360–369
14. Lane T, Kohavi R (2010) Census-Income (KDD) data set. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/]
15. Moore B (1981) Principal component analysis in linear systems: controllability, observability, and model reduction. IEEE Trans Autom Control 26:17–32
16. O'Toole AJ, Abdi H, Deffenbacher KA, Valentin D (1993) Low-dimensional representation of faces in higher dimensions of the face space. J Opt Soc Am 10(3):405–411
17. Roe B (2010) MiniBooNE particle identification data set. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/]
18. Shyu M, Chen S, Sarinnapakorn K, Chang L (2003) A novel anomaly detection scheme based on principal component classifier. In: Proceedings of the IEEE foundations and new directions of data mining workshop, pp 172–179
19. Sigaud O, Wilson SW (2007) Learning classifier systems: a survey. Soft Comput 11(11):1065–1078
20. Stone C, Bull L (2003) For real! XCS with continuous-valued inputs. Evol Comput 11(3):299–336
21. Tsai WC, Chen AP (2011) Using the XCS classifier system for portfolio allocation of MSCI index component stocks. Expert Syst Appl 38(1):151–154
22. Urbanowicz RJ, Moore JH (2009) Learning classifier systems: a complete introduction, review, and roadmap. J Artif Evol Appl 2009:1:1–1:25
23. Wada A, Takadama K, Shimohara K, Katai O (2007) Analyzing parameter sensitivity and classifier representations for real-valued XCS. In: Learning classifier systems, lecture notes in computer science, vol 4399. Springer, Berlin, pp 1–16
24. Wilson SW (1995) Classifier fitness based on accuracy. Evol Comput 3(2):149–175
25. Wilson SW (2000) Get real! XCS with continuous-valued inputs. In: Learning classifier systems, from foundations to applications. Springer, Berlin, pp 209–222
