SPECIAL ISSUE
On principal component analysis for high-dimensional XCSR
Mohammad Behdad • Tim French •
Luigi Barone • Mohammed Bennamoun
Received: 7 January 2012 / Accepted: 8 May 2012 / Published online: 22 May 2012
© Springer-Verlag 2012
Abstract XCSR is an accuracy-based learning classifier
system which can handle classification problems with real-valued features. However, as the number of features
increases, a high classification accuracy comes at the cost
of more resources: larger population sizes and longer
computational running times. In this paper we investigate
PCA-XCSR (a sequential application of PCA and XCSR)
in three environments with different characteristics: a dis-
crete and imbalanced environment (KDD’99 network
intrusion), a continuous and highly symmetric environment
(MiniBooNE), and a highly discrete, highly imbalanced
environment (Census/Income (KDD)). These experiments
show that in the three different environments, PCA-XCSR,
in addition to being able to reduce the computational
resources and time requirements of XCSR by approxi-
mately 50 %, is able to consistently maintain its high
accuracy. It also reduces the population size required by XCSR. Finally, we suggest heuristics for selecting the number of principal components to use with PCA-XCSR.
Keywords Genetics-based machine learning · Learning classifier systems · XCSR · Dimensionality reduction · Principal components analysis
1 Introduction
First proposed by John Holland [12] in 1976, a learning
classifier system (LCS) is a machine learning technique
that combines reinforcement learning, evolutionary com-
puting, and other heuristics to produce an adaptive system
capable of learning in dynamic environments. It uses a
population of condition-action-prediction rules called
classifiers to encode appropriate actions for different input
states. Learning classifier systems (LCSs) are good at
solving complex adaptive problems and have been suc-
cessfully used in data mining applications [4, 7, 21].
However, when they are used for high-dimensional prob-
lems, they slow down significantly [13]. XCS [24] is an
accuracy-based LCS in which it is not the classifiers which
produce the greatest rewards that have the best chance of
propagating their genes into subsequent generations, but
rather the classifiers which predict the reward more accu-
rately, independent of the magnitude of the reward.
KDD’99 [11], a popular network intrusion data-set, is
one of the three high-dimensional data sets experimented
with in this research. Each instance of this set describes a
network connection using 41 features describing properties
about a connection (such as how many bytes are transferred, the number of file accesses, or the number of times a
login was attempted). Many of these connections are
legitimate, but a number are malicious (such as trying to
gain root access to a system). The classifier must try to
determine whether a connection is benign, or whether it
corresponds to one of several different attack types. This
data set has some interesting characteristics: it is large in
both the number of instances and number of dimensions,
and its dimensions are of different types, including categorical, Boolean, and numeric.
In order to use XCSR on the KDD'99 data set, the data must first be normalised, both so that XCSR can handle categorical features and so that, for numerical features, operators such as mutation and covering are not biased towards any one particular feature. The normalisation
process increases the number of features from 41 to 118
(see Sect. 4.2 for details). One important property of the
normalised data is that most of its features have value 0. In
other words, we are dealing with sparse data with a high
number of dimensions. Hence, there is a high chance that
some of the features in the data are correlated or redundant.
In order to exploit this characteristic, we can use a
dimensionality reduction technique to ‘‘compress’’ the
problem and reduce the search space for the learning
algorithm.
Principal Component Analysis (PCA) is a linear
dimensionality transformation method, capable of reducing
the number of dimensions of a data set by transforming the
problem into a new search space. In this paper, we inves-
tigate if using PCA as a preprocessing step can reduce the
time and computational resources needed by XCSR in such
high-dimensional problems while maintaining its classifi-
cation accuracy. We call this combination of PCA-pre-
processing and XCSR, PCA-XCSR.
To see if the behaviour of PCA-XCSR on the KDD'99 data set carries over to other environments, two environments with characteristics different from those of KDD'99 are also experimented with. We also present
heuristics to select the number of principal components to
use with PCA-XCSR.
In the next section, we discuss the related work and
background techniques. Then, in Sect. 3, the PCA-XCSR
method is explained. In Sect. 4, the data sets used in the
experiments, the experimental procedure and their results
are discussed. Finally, Sect. 5 concludes this work.
2 Related work
Of the three data sets we experiment with in this work, only the KDD'99 network intrusion data set has been tackled by researchers using the combination of a machine learning algorithm and PCA, and none have combined an LCS with PCA. Depren et al. [9] propose an Intrusion Detection
System (IDS) architecture which has an anomaly detection
module which in turn uses a Self-Organizing Map (SOM)
structure. They initialise the SOM both randomly and using PCA, and find that the results do not differ; they therefore adopt the simpler random initialisation. The work in [10] suggests a model for
network intrusion detection. It attempts to reduce the
number of dimensions in KDD’99 network intrusion
detection data set. Its model combines principal component
analysis (PCA), neural networks and multi-layer percep-
trons. PCA is used to extract important data and to reduce
high dimensional data vectors [10]. For the KDD'99 data set, PCA reduces the number of dimensions from 41 to 12, and the experimental results show a high accuracy of 93.21 %.
Shyu et al. [18] propose a PCA-based outlier detection
approach for intrusion detection. They use PCA to reduce
the dimensionality without sacrificing valuable informa-
tion. PCA also distinguishes the important components from the minor ones. The distance measure they use is based on both
the major components that account for 50 % of the total
variation and the minor components—for instance the ones
with eigen-values less than 0.20. Their experiments with
the KDD'99 data set are based on distinguishing normal instances from attack ones. They show that their method outperforms many other approaches.
The work by Bouzida et al. [6] uses PCA for network
intrusion detection. It applies PCA to two different
machine learning methods, namely decision trees and
nearest neighbour, in order to reduce the quantity of data without loss of information. Their experiments on
the KDD’99 data set show no decrease in the accuracy but
a major drop in the processing time.
Abedini et al. [2] are motivated by the degradation of
the performance of LCS when it deals with data sets of
high dimensionality and relatively few instances. They
propose CoXCS as a coevolutionary multi-population
XCS. Isolated sub-populations evolve a set of classifiers
based on a partitioning of the feature space in the data. In
their followup paper [3], they use alternative feature space
partitioning techniques in a multiple population island-
based parallel XCS. Here, each of the isolated populations
evolve rules based on a subset of the features. They analyse
their method on the Boolean logic multiplexer problem and
they report a better performance and generalisation using
multiple population XCS in comparison to the single
population XCS, although the effectiveness of the model
depends on the feature space partitioning strategy used.
In our previous work [5], XCSR is applied to the
KDD’99 network intrusion data set. The results are good
compared to the winners of that competition—the accuracy
is within 3.5 % of that of the winner. During that research we noticed a problem: the long running times XCSR required to perform the classification. In the current
research we investigate and analyse the effect of using
PCA as a preprocessing step for XCSR and apply it on
three real-world environments with different characteris-
tics. At the end, we will provide some observations,
guidelines and general heuristics about its performance and
mechanisms for getting the best out of it.
2.1 LCS
Learning classifier systems use a population of condition-
action-prediction rules called classifiers to encode appro-
priate actions for different input states: if a condition is
true, then performing the corresponding action is expected
to lead to receiving the anticipated reward. The decision of
which rule to use depends on the type of LCS used. Some use the reward (the prediction in the classifier triple) that the system expects to receive from the environment as a result of performing the chosen action. In some methods, this
decision is based on the accuracy of the classifiers
matching the current input and their predicted reward. In
essence, an LCS attempts to enhance its understanding of
the environment through improving its classifiers over
time.
2.1.1 XCS
XCS [24] is an accuracy-based LCS in which it is not the
classifiers which produce the greatest rewards that have the
best chance of propagating their genes into subsequent
generations, but rather the classifiers which predict the
reward more accurately, independent of the magnitude of
the reward. As a result, XCS is said to cover the state space
more efficiently [19]. When a state is perceived by an LCS,
the rule-base is scanned and any rule whose condition
matches the current state is added as a member of the
current match set M. In XCS—specifically in exploit pha-
ses—once M has been formed, an action which has the
highest weighted fitness prediction is chosen. All members
of M that propose the same action as the chosen rule then
form the action set A. The accuracy of all members of the
action set are then updated when the reward from the
environment is determined. Another important difference
of XCS over its predecessors is that the GA is applied to
A (or niches) rather than the whole classifier population.
This narrows the evolutionary competition to between
classifiers which can be applied in similar scenarios rather
than all the classifiers in the LCS [19].
2.1.2 XCSR
XCSR [25] is a variation of XCS [22] which can handle
inputs having real-valued features. In an Unordered Bound Representation (UBR) [20] encoding, the condition part of a classifier consists of a set of intervals. Each takes the form
of (pi, qi) where either pi or qi can be the maximum or
minimum bound. When XCSR receives an instance, if the
values of the features of the instance are in the range of a
classifier’s condition, then that classifier is used in the
classification process.
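To make this concrete, the following minimal Python sketch shows how a UBR-encoded condition might be tested against an instance and how the match set could be formed. The Classifier structure and function names are our own illustration, not the paper's (or Butz's) implementation.

```python
# An illustrative sketch of UBR interval matching in XCSR.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Classifier:
    condition: List[Tuple[float, float]]  # one (p_i, q_i) interval per feature
    action: int
    prediction: float = 10.0

def matches(cl: Classifier, instance: List[float]) -> bool:
    """True iff every feature value lies inside the corresponding interval.
    With UBR, either endpoint may be the lower or upper bound, so the
    pair is normalised with min/max before the membership test."""
    for (p, q), x in zip(cl.condition, instance):
        if not (min(p, q) <= x <= max(p, q)):
            return False
    return True

def match_set(population: List[Classifier], instance: List[float]) -> List[Classifier]:
    """The match set M: every classifier whose condition covers the instance."""
    return [cl for cl in population if matches(cl, instance)]
```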
2.2 Principal component analysis
Principal component analysis is a feature transformation
technique whose purpose is to find the most important
information through dimensionality reduction. It analyses a
set of original variables and represents them as a set of new
orthogonal variables called principal components. These
principal components are calculated using linear combi-
nations of the original variables. The principal components
are listed in the order of importance. That is, the first one
will have the highest variance, the second one, which is
orthogonal to the first, has the second largest variance, and
so on [15].
The values of the new variables are calculated by pro-
jecting the original variables onto the principal compo-
nents. Each principal component has an eigen-value which
is calculated by summing the squared factor scores for that
component. Dimensionality reduction is achieved by using
only a subset of the principal components (the first n)
instead of using all of them. For a comprehensive intro-
duction to PCA, the reader is referred to [1].
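As a concrete illustration of the procedure just described, here is a minimal numpy sketch of PCA via eigendecomposition of the covariance matrix. It is a didactic sketch only; the authors use the MATLAB Statistics Toolbox for this step (see Sect. 4.2).

```python
# A minimal numpy sketch of PCA: centre the data, eigendecompose its
# covariance matrix, and sort components by descending eigen-value.
import numpy as np

def pca(X: np.ndarray):
    """X: (n_samples, n_features). Returns (eigenvalues, components),
    both ordered from largest to smallest variance."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]       # descending variance
    return eigvals[order], eigvecs[:, order]

def project(X: np.ndarray, components: np.ndarray, n: int) -> np.ndarray:
    """Project centred data onto the first n principal components."""
    return (X - X.mean(axis=0)) @ components[:, :n]
```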
3 PCA-XCSR
As was evident from our earlier work [5] and others [2, 3],
the performance of learning classifier systems degrades
when they are dealing with problems of high dimension-
ality. In that work [5], when XCSR was used on data sets with a high number of dimensions or features, its running time became relatively slow. Hence,
the main purpose of this research is to improve the speed of
XCSR in such high-dimensional data sets (Fig. 1).
In our method, PCA is used as a pre-processing step to
compact the data set through dimensionality reduction
(Fig. 2). First, the data is normalised so that XCSR can
handle every type of feature in the data correctly. Then, it
is compacted (transformed) by PCA before being given to
XCSR to learn from. XCSR hence learns in a smaller
search space—the space that results from selecting only a
fixed number of dimensions (the n best principal compo-
nents in the transformed space). The basis for doing this is
that as these dimensions add the most variance to the input
space, they will most accurately predict the class of each input.

Fig. 1 A high-level view of PCA-XCSR

We call this combination of PCA-preprocessing and
XCSR, PCA-XCSR. This presents a number of potential
advantages:
1. By examining only the most significant principal
components, we reduce the dimensionality of the
search space.
2. As the principal components are linear combinations
of dimensions, the classifier rules may naturally
exploit existing dependencies among features rather
than having to search out such dependencies.
3. By retaining only the components with the highest
eigen-values, we eliminate noise caused by less
relevant components.
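The pipeline of Fig. 1 and the list above can be summarised in a short sketch. It assumes numpy arrays and reuses the pca() sketch from Sect. 2.2; the train_xcsr and evaluate_xcsr callables are hypothetical placeholders for an XCSR implementation, not the authors' code.

```python
# A sketch of the PCA-XCSR flow: fit PCA on training data, project both
# sets onto n components, then train and evaluate XCSR in that space.
def pca_xcsr(train_X, train_y, test_X, test_y, n, train_xcsr, evaluate_xcsr):
    # 1. Fit PCA on the (already normalised) training data only.
    eigvals, components = pca(train_X)      # pca() from the Sect. 2.2 sketch
    mean = train_X.mean(axis=0)
    # 2. Project both sets onto the first n principal components.
    train_Z = (train_X - mean) @ components[:, :n]
    test_Z = (test_X - mean) @ components[:, :n]
    # 3. Learn and evaluate in the reduced search space.
    model = train_xcsr(train_Z, train_y)
    return evaluate_xcsr(model, test_Z, test_y)
```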
4 Experiments
In order to evaluate the utility of PCA-XCSR, we compare
its performance with the performance of standard XCSR on
data sets from three different environments: KDD’99 net-
work intrusion data set [11], MiniBooNE particle identifi-
cation data set [17], and Census-Income (KDD) data set
[14]. Our main interest is exploring the trade-off between
accuracy and time, for different numbers of principal
components used by the learning classifier system. It
should be noted that we are not interested in getting the
best results for a particular data set; we are instead inter-
ested in observing the behaviour of PCA-XCSR.
4.1 Data
To see if the behaviour of PCA-XCSR is consistent across different high-dimensional data sets, three data sets from different environments are used to compare PCA-XCSR and standard XCSR. Since we were looking
for high-dimensional data sets in this research, each data
set has at least 40 features (which in some cases are
increased to more than 400 when normalised). Also, as an LCS is used, each data set has many instances—here, at least 40,000—for training and testing purposes.
Fig. 2 Accuracy and running time for standard XCSR vs PCA-XCSR in three different environments. Error bars, which are very small in most cases, show the standard error of the mean over 10 samples. Figure is best viewed in colour.
The first data set is KDD’99 network intrusion data set
[11]. Each element of this set describes a network instance using 41 features (such as how many bytes are transferred, the number of file accesses, or the number of times a login was attempted). Its
41 features of data increase to 108 after normalisation (see
Sect. 4.2 for details). Many of the instances in the data set
represent legitimate operations, but a number are malicious
(such as trying to gain root access to a system). The clas-
sifier must try to determine whether an instance is benign,
or whether it corresponds to one of several different attack
types. 10,000 of these instances are randomly selected for
training and 40,000 for testing. This data set has some
interesting characteristics: it is large in both the number of
instances and number of dimensions, and its dimensions
are of different types, including categorical, Boolean, and numeric.
The second data set is the MiniBooNE particle identi-
fication data set. This data set [17] is abstracted from the
MiniBooNE experiment and is used to distinguish electron
neutrinos (signal) from muon neutrinos (background). The
whole data set contains 130,065 instances. 40,000 of these
instances are randomly selected for training and 10,000 for
testing. Each instance has 50 continuous features.
The third data set is Census-Income (KDD) data set
[14]. It contains census data extracted from the 1994 and
1995 current population surveys conducted by the U.S.
Census Bureau. The set contains 40 features: 7 continuous
and 33 categorical, which increase to 408 after normalisation. The label for each feature vector indicates whether the person's income is above or below $50,000. The training file contains 199,523 instances and the testing file contains 99,762.
4.2 Procedure
To start, the data in the training file is first normalised.
This involves scaling all quantitative features to lie
between -1 and 1, and representing all categorical types
with independent scalar features. Therefore, if there are 6
possible values for a type, this feature will be replaced with
6 features. In an instance, if the original feature (before
normalisation) has the third possible value, after normali-
sation the third feature will have value 1 and the remaining
five features will have value -1. If there are categorical
features in the data set, this process results in an increase in
the number of features, but also ensures the data is very
sparse.
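A minimal sketch of this normalisation scheme follows; the helper names are ours, and a real implementation would derive the scaling bounds and category lists from the training data.

```python
# An illustrative sketch (hypothetical helper names) of the
# normalisation described above.
def normalise_quantitative(values):
    """Scale a list of quantitative feature values to [-1, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0              # guard against constant features
    return [2.0 * (v - lo) / span - 1.0 for v in values]

def encode_categorical(value, possible_values):
    """Replace one categorical feature with len(possible_values) scalar
    features: 1 for the active value, -1 for all the others."""
    return [1.0 if value == pv else -1.0 for pv in possible_values]

# The paper's example: the third of six possible values.
print(encode_categorical('c', ['a', 'b', 'c', 'd', 'e', 'f']))
# -> [-1.0, -1.0, 1.0, -1.0, -1.0, -1.0]
```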
The underlying learning classifier system is based on
Butz’s implementation of XCS [8]. For representation of
intervals, we use the Unordered Bound Representation
(UBR) [20] without limiting the ranges to be between -1
and 1—as outliers fall outside this range after normalisa-
tion. The values of the important XCSR parameters, if not stated otherwise, are as follows: α = 0.1, β = 0.2, ν = 5, ε₀ = 0.01, θ_GA = 25, and θ_sub = 12. In order to use
LCS’s subsumption pressure towards generalisation, both
types of subsumption (GA and action set) are on. In these
experiments, the system does not learn during the test
phase.
The MATLAB Statistics Toolbox is used for the PCA pre-
processing step. Firstly, its PCA routine is used to calculate
the eigen-vectors and eigen-values for the training data set.
Then, each instance is normalised and transformed using the first n principal components. Experiments below vary
n, with results reporting the trade-off in performance
(accuracy) versus running time. All experiments are repeated
10 times and the results averaged. To keep the consistency of
the conditions (especially the time each experiment takes),
experiments are all run on the same machine (an Intel Core2
Duo 2.50 GHz machine with 4 GB RAM) under the same
load conditions. Reported times include both training and
testing time and for the PCA-XCSR trials, the time needed to
perform the initial PCA pre-processing.
4.3 Effects of the number of dimensions
on performance
In our first set of experiments, different numbers of prin-
cipal components are analysed to see their effect on both
time and accuracy. For each environment, first we find the
optimum population size for XCSR—the maximum num-
ber of classifiers that XCSR can maintain—by trying dif-
ferent population sizes for each environment. These
numbers are: 16,000 for KDD'99, 4,000 for Census/Income, and 6,400 for MiniBooNE. These numbers
will be used for both XCSR and PCA-XCSR. Then, XCSR
is used in all three different environments and the results
recorded. Finally, we vary the number of principal com-
ponents (dimensions) and the results are recorded.
Figure 2 shows the summary of the experiments in the
three environments in terms of accuracy and time. The
graphs also show the error bars which indicate the standard
error of the mean over 10 samples. Standard error of the
mean (SEM) is the standard deviation of the sampling
distribution of the mean. It is calculated using the formula:
SEM = s/√n, in which s is the standard deviation and n is the sample size.
Looking at the three graphs in Fig. 2, we observe several
insights into the behaviour of PCA-XCSR:
1. Initially, increasing the number of principal compo-
nents leads to better accuracy of PCA-XCSR.
2. The increase in the accuracy does not go on forever. It
stops after adding an optimum number of principal
components, which is different for each environ-
ment—it is 5 for KDD’99, 1 for Census/Income
(KDD) and 4 for MiniBooNE environment.
3. After the first few principal components, adding more not only fails to improve the accuracy of PCA-XCSR, but actually degrades it.
4. Provided that the number of principal components is
correctly selected, the accuracy that PCA-XCSR can
achieve always surpasses the accuracy of XCSR.
Figure 2 illustrates this observation when tested on our
three real-world data sets.
5. PCA-XCSR achieves these high accuracies using a very small number of principal components—fewer than 10 in our three environments.
6. Because of the low number of dimensions, PCA-XCSR achieves such high accuracy in much shorter times—for instance, in the KDD'99 environment, PCA-XCSR reaches 99.1 % accuracy in 28 s while XCSR requires 233 s to reach 98.5 % accuracy. This is a major advantage of PCA-XCSR; it needs far less time to process data than XCSR, especially when the population size is large.
It should be noted that PCA-XCSR does not need the
large number of classifiers (population size) that XCSR
needs to achieve such high accuracy. For instance, in
KDD’99 environment, instead of 16,000 classifiers, PCA-
XCSR can use 4,000 classifiers and achieve 99.1 % accuracy. This is discussed in more detail next.
4.4 Effect of population size
Figure 3 shows the summary of another set of experiments on the KDD'99 environment in which different population sizes ranging from 500 to 16,000 are used. The figure shows the accuracy of both PCA-XCSR (using different numbers of principal components, from 1 to 20) and XCSR. The graphs also show error bars which indicate the standard error of the mean over 10 samples. Beyond a certain number of principal components, adding more not only fails to improve performance, but actively degrades it. We believe that these extra dimensions, in addition to adding noise to the problem, increase the population size needed, because when the population size is larger, a much smaller performance deterioration is observed. These experiments illustrate that PCA-XCSR is
able to achieve high accuracies using small population
sizes. In the KDD’99 environment, with a population size
of 1,000, PCA-XCSR can get 98.7 % accuracy, using only
5 principal components, while standard XCSR can get
96.7 % accuracy. We can see that increasing the popula-
tion size beyond 4,000 allows PCA-XCSR to process
more principal components without a degradation in
performance. However, accuracy is not improved signifi-
cantly. For instance, the best accuracy achieved by PCA-
XCSR with population sizes 1,000 to 16,000 is between
98.7 and 99.1 %. But, it should be noted that the time that
each trial needs naturally increases as population size
grows.
4.5 Effect of covering parameters
In XCSR, covering is an important mechanism in creating
new classifiers. When an instance is received, if no clas-
sifiers in the population are triggered by the instance, the
covering operator will create new classifiers matching it. In
our XCSR, the condition part of covering is created this
way: each position in the classifier has two bounds, bound1 = value - (CR/CRD) and bound2 = value + (CR/CRD), in which value is the value of the corresponding position (dimension) in the current instance, CR stands for ''Covering Range'', and CRD stands for ''Covering Range
Divisor’’. As mentioned before, each instance before being
processed is first normalised to a range. CR is the magni-
tude of that range. For the range [-1, 1] which we use, CR
is 2. Here is an example: Assume that the first value of the
feature vector of an instance (input) is 0.2. Then having
CRD = 2, the first pair of the bounds in the condition will
be -0.8 and 1.2. But, if CRD = 20, then the corresponding
bounds will be 0.1 and 0.3. Therefore, the smaller the
CRD, the more general the condition of the newly created
classifier will be. The default value for CRD in our XCSR
is 2.
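The bound computation is simple enough to sketch directly; the function below is illustrative rather than the authors' code, with CR = 2 for data normalised to [-1, 1].

```python
# A sketch of covering-condition creation as described above.
def covering_condition(instance, cr=2.0, crd=2.0):
    """Return one (bound1, bound2) interval per feature value."""
    half_width = cr / crd
    return [(v - half_width, v + half_width) for v in instance]

# The example from the text, for a feature value of 0.2:
print(covering_condition([0.2], crd=2.0))   # [(-0.8, 1.2)]  -> very general
print(covering_condition([0.2], crd=20.0))  # approx. [(0.1, 0.3)] -> specific
```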
Wada et al. [23] show that XCSR is very sensitive to
covering parameter settings. They state that when condi-
tions of new classifiers have generalisation level smaller
than a certain value, the performance of XCSR deterio-
rates. Our next set of experiments was done to examine
and compare the sensitivity of XCSR and PCA-XCSR to
the level of generalisation of the classifiers created by
covering (which is tuned in our code using CRD). In the
KDD’99 data set, a small population size (500) and a large
population size (16,000) were selected and both XCSR and
PCA-XCSR were run on them, each with both a small CRD
value (2) and a large one (20). The results of these
experiments are shown in Fig. 4.
As can be seen in Fig. 4, XCSR is sensitive to the CRD value and its sensitivity is more notable for smaller population sizes. Using population size 500, when CRD is increased from 2 to 20, the accuracy of XCSR drops by more than 10 %—from 96.51 to 85.13 %. For larger population sizes such as 16,000, this effect still exists but to a lesser extent (more than 2 %). However, according to Fig. 4, PCA-XCSR is robust to these changes: its accuracy changes by less than 0.5 % even for small population sizes such as 500.
Therefore, it seems that PCA-XCSR has one less parameter
to worry about: finding the best generalisation level for
covering (CRD). It should be noted that experiments with
two other environments yield similar results.
This capability of PCA-XCSR is even more significant in problems which may exhibit concept drift. Concept drift
happens when the statistical properties of the environment
change in unpredictable ways. For instance, in a network
intrusion detection problem, one of the features of the
instances is the connection speed. As technology advances,
the average speed and its standard deviation may change.
This will affect the normalisation procedure which depends
on these two values. Therefore, the previous parameter
settings of the covering operator may not be the right ones
any more. Hence, the robustness of PCA-XCSR to the
covering operator’s parameter settings becomes very
important in such situations.
4.6 Heuristics for selecting the number of principal
components
When using PCA-XCSR, what is the optimal choice for the
number of principal components? Optimal is subjective as
it refers to the trade-off between time and accuracy.
However, our results show that maximum accuracy is
normally achieved using very few components. Therefore,
our interest is primarily in determining how many com-
ponents will give the maximum accuracy for a given
population.
The method we used in our experiments is the follow-
ing: Perform several trials on the training data, where the
number of principal components is incremented by one
each time until the performance starts dropping. Then, the
smallest number of principal components which gives the highest performance is chosen.

Fig. 3 Accuracy of XCSR vs PCA-XCSR using different population sizes in the KDD'99 environment. Error bars, which are very small in most cases, show the standard error of the mean over 10 samples.

This method is used in many studies that use PCA, such as [10]. These are
the optimum number of components in the three environ-
ments: For KDD’99 five components gave 99.1 %, for
Census/Income one component gave 93.8 %, and for
MiniBooNE four components gave 85.8 % accuracy.
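This incremental procedure can be sketched as follows, assuming a hypothetical run_trial(n) callback that trains and evaluates PCA-XCSR on the training data with n principal components.

```python
# A sketch of the incremental selection of n described above.
def select_n_components(run_trial, max_n=50):
    """Increase n until accuracy starts dropping; return the smallest n
    with the highest accuracy seen so far."""
    best_n, best_acc = 1, run_trial(1)
    for n in range(2, max_n + 1):
        acc = run_trial(n)
        if acc < best_acc:          # performance has started dropping
            break
        if acc > best_acc:          # ties keep the smaller n
            best_n, best_acc = n, acc
    return best_n, best_acc
```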
It should be noted that we made an observation regarding the population size: the optimal number of components for an environment is independent of the population size used by PCA-XCSR (except for unreasonably small population sizes).
4.6.1 Eigen-values
When using PCA, the ranking of the principal components is based on their eigen-values (introduced in Sect. 2.2), so using only the first few principal components means retaining most of the information. As the first n principal components capture the greatest amount of variance, the choice of n is subjective, and different criteria have been suggested by different researchers: a general rule of thumb is to use only the principal components whose eigen-values are larger than 1. However, this rule does not work in all domains [16]. Figure 6 shows the eigen-values
of the first 12 principal components for the three test
environments along with some values that can be used as
guidelines to set the number of principal components.
Fig. 4 Effect of covering parameter settings on the performance of XCSR versus PCA-XCSR

Fig. 5 A problem with two features (x and y) and two classes (A and B). Its dimensions are transformed into PC1 and PC2 after PCA. It is an example of a problem in which the variance of a feature does not correspond to its importance in the classification.
Heuristic 1 Use all the principal components whose eigen-values are greater than 1. Looking at Fig. 6, we find that using this heuristic, PCA-XCSR achieves relatively good results—KDD'99: 98.4 %, Census/Income: 92 %, and MiniBooNE: 81 % accuracy.
Heuristic 2 Use n principal components, where n is the
first principal component for which the cumulative eigen-
value contribution is more than 50 %. This heuristic suggests using more than 50 % of the information—or variance—in the data. Again, by looking at Fig. 6, we see
that PCA-XCSR can achieve high accuracies using this
heuristic.
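Given the eigen-values sorted in descending order (as returned, for example, by the PCA sketch in Sect. 2.2), both heuristics are straightforward to express; the sketch below is ours, not the authors' code.

```python
# Sketches of Heuristic 1 and Heuristic 2, given descending eigen-values.
import numpy as np

def heuristic_1(eigvals):
    """Keep every principal component whose eigen-value exceeds 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

def heuristic_2(eigvals, threshold=0.5):
    """Keep the first n components whose cumulative eigen-value
    contribution exceeds the threshold (50 % of the total variance)."""
    fractions = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(fractions, threshold) + 1)
```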
It is worth noting that Fig. 6 shows a different pattern in
the values of the MiniBooNE eigen-values. All the vari-
ability/information is in the first three principal components
(and mostly the first principal component), while we usually
see a gradual decrease in the eigen-values. This may be the reason for the different performance profile of MiniBooNE in Fig. 2.
It is not always the case that the variance of the values in
features, shown by eigen-values, is helpful in the classifi-
cation task. Sometimes a feature with a smaller variance
(and therefore smaller eigen-value) is more important in
determining the class of instances than a feature with a
large variation. Consider the problem depicted in Fig. 5 in
which instances belong to two classes, namely A and B.
PCA can do a good job in transforming the space and
making the rectangular shapes parallel to the new axes,
PC1 and PC2. Now, relying on the variance means giving
more weight to the PC1 dimension which is irrelevant in
this problem and will mislead the classifier. In such situations it is important to have a preprocessing step which uses the classes (labels) of instances. This is a topic of its own and a good direction for future research.

Fig. 6 The first 12 principal components of the three environments with each one's corresponding PCA-XCSR accuracy, eigen-value, and cumulative eigen-value contribution to the total eigen-value
5 Conclusion and future work
XCSR’s good performance in high-dimensional environ-
ments comes at the cost of high computational and time
requirements. We show in this research that PCA-XCSR, a combination of PCA and XCSR, is a good alternative in such environments. We analyse PCA-XCSR in three environments with different characteristics and confirm that it is capable of achieving high accuracy using fewer computational and time resources. We show that, in contrast to XCSR, PCA-XCSR is robust to the parameter settings of the covering operator, which is of high significance in environments with concept drift. Finally, two heuristics are presented for choosing the smallest number of principal components with which PCA-XCSR can achieve high accuracy.
Future work includes finding heuristics which also take into account the population size used by XCSR, the eigen-value distribution of the principal components, and the classes of instances (in a preprocessing step). Another important question is how to find the minimum population size needed for each environment, based on the training set size and the number of dimensions. On the other
hand, PCA is a linear transformation method and for non-
linear environments (consider a sphere), it will not be a
good choice to complement XCSR. Investigating non-lin-
ear dimensionality reduction methods is another research
avenue. The last thing to look at would be designing a
dynamic PCA-XCSR. This version of PCA-XCSR would
be able to increase or decrease the number of principal
components automatically as it is being used.
Acknowledgments The first author would like to acknowledge the
financial support provided by the Robert and Maude Gledden
Scholarship.
References
1. Abdi H, Williams LJ (2010) Principal component analysis. Wiley
Interdis Rev Comput Stat 2(4):433–459
2. Abedini M, Kirley M (2009) CoXCS: a coevolutionary learning
classifier based on feature space partitioning. In: Nicholson A, Li
X (eds) AI 2009: advances in artificial intelligence, lecture notes
in computer science, vol 5866. Springer, Berlin, pp 360–369
3. Abedini M, Kirley M (2010) A multiple population XCS:
evolving condition-action rules based on feature space partitions.
In: Evolutionary computation (CEC), 2010 IEEE congress,
pp 1–8
4. Bacardit J, Stout M, Hirst J, Krasnogor N (2008) Data mining
in proteomics with learning classifier systems. In: Bull L, Bernado-Mansilla E, Holmes J (eds) Learning classifier systems in
data mining, studies in computational intelligence, vol 125.
Springer, Berlin, pp 17–46
5. Behdad M, Barone L, French T, Bennamoun M (2010) An
investigation of real-valued accuracy-based learning classifier
systems for electronic fraud detection. In: Proceedings of the 12th
annual conference companion on Genetic and evolutionary
computation, pp 1893–1900
6. Bouzida Y, Cuppens F, Cuppens-Boulahia N, Gombault S (2004)
Efficient intrusion detection using principal component analysis.
In: Proceedings of the 3eme Conference sur la Securite et
Architectures Reseaux (SAR)
7. Bull L, Bernado-Mansilla E, Holmes J (2008) Learning classifier
systems in data mining. Studies in computational intelligence.
Springer, Berlin
8. Butz MV, Wilson SW (2000) An algorithmic description of XCS. Technical report 2000017, Illinois Genetic Algorithms Laboratory
9. Depren O, Topallar M, Anarim E, Ciliz MK (2005) An intelligent
intrusion detection system (IDS) for anomaly and misuse detec-
tion in computer networks. Exp Syst Appl 29(4):713–722
10. Golovko V, Vaitsekhovich L, Kochurko P, Rubanau U (2007)
Dimensionality reduction and attack recognition using neural
network approaches. In: Neural networks, 2007. IJCNN 2007.
International joint conference, pp 2734–2739
11. Hettich S, Bay SD (1999) KDD’99 network intrusion detection
data set. UCI Machine Learning Repository [http://archive.ics.
uci.edu/ml/]
12. Holland JH (1976) Adaptation. Prog Theor Biol 4:263–293
13. Kirley M, Abedini M (2009) CoXCS: a CoEvolutionary learning
classifier based on feature space partitioning. In: Lecture notes in
computer science, vol 5866, pp 360–369
14. Lane T, Kohavi R (2010) Census-Income (KDD) data set. UCI
machine learning repository [http://archive.ics.uci.edu/ml/]
15. Moore B (1981) Principal component analysis in linear systems:
controllability, observability, and model reduction. IEEE Trans
Autom Control 26:17–32
16. O’Toole AJ, Abdi H, Deffenbacher KA, Valentin D (1993) Low-
dimensional representation of faces in higher dimensions of the
face space. J Opt Soc Am 10(3):405–411
17. Roe B (2010) MiniBooNE particle identification data set. UCI
machine learning repository [http://archive.ics.uci.edu/ml/]
18. Shyu M, Chen S, Sarinnapakorn K, Chang L (2003) A novel
anomaly detection scheme based on principal component classi-
fier. In: Proceedings of the IEEE foundations and new directions
of data mining workshop, pp 172–179
19. Sigaud O, Wilson SW (2007) Learning classifier systems: a survey. Soft Comput 11(11):1065–1078
20. Stone C, Bull L (2003) For real! XCS with continuous-valued
inputs. Evol Comput 11(3):299–336
21. Tsai WC, Chen AP (2011) Using the XCS classifier system for
portfolio allocation of MSCI index component stocks. Exp Syst
Appl 38(1):151–154
22. Urbanowicz RJ, Moore JH (2009) Learning classifier systems: a
complete introduction, review, and roadmap. J Artif Evol App
2009:1:1–1:25
23. Wada A, Takadama K, Shimohara K, Katai O (2007) Analyzing
parameter sensitivity and classifier representations for real-valued
XCS. In: Learning classifier systems, lecture notes in computer
science, vol 4399. Springer, Berlin, pp 1–16
24. Wilson SW (1995) Classifier fitness based on accuracy. Evol
Comput 3(2):149–175
25. Wilson SW (2000) Get real! XCS with continuous-valued inputs.
In: Learning classifier systems, from foundations to applications.
Springer, Berlin, pp 209–222