
Applied Mathematics and Computation 186 (2007) 1441–1466

www.elsevier.com/locate/amc

Genetic algorithms for the structural optimisation of learned polynomial expressions

G. Potgieter, A.P. Engelbrecht *

Department of Computer Science, School of Information Technology, Roperstreet, University of Pretoria, Pretoria 2000, South Africa

Abstract

This paper presents a hybrid genetic algorithm approach to construct optimal polynomial expressions to characterise a function described by a set of data points. The algorithm learns structurally optimal polynomial expressions (polynomial expressions where both the architecture and the error function have been minimised over a dataset), through the use of specialised mutation and crossover operators. The algorithm also optimises the learning process by using an efficient, fast data clustering algorithm to reduce the training pattern search space. Experimental results are compared with results obtained from a neural network. These results indicate that this genetic algorithm technique is substantially faster than the neural network, and produces comparable accuracy.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Polynomial approximation; Genetic algorithms; Neural networks; Data clustering; Structure optimisation

1. Introduction

The study of function approximation can be divided into two classes of problems. One class deals with a function being explicitly stated, where the objective is to find a computationally simpler type of function, such as a polynomial, that can be used to approximate a given function. The other class deals with finding the best function to represent a given set of data points (or patterns). The latter class of function approximation plays a very important role in the prediction of continuous-valued outcomes, e.g., subscriber growth forecasts, time series modelling, etc. This paper concentrates on methods to construct functions that accurately represent a series of data points, at minimal processing cost.

Traditional methods to perform this function approximation include the frequently used discrete least-squares method and Taylor polynomials. Other methods include neural networks and some evolutionary algorithm paradigms.

This paper develops a hybrid genetic algorithm (GASOLPE) approach to evolve structurally optimal polynomial expressions in order to accurately describe a given data set. A fast clustering algorithm is used to reduce the pattern space and thereby reduce the training time of the algorithm.

0096-3003/$ - see front matter © 2006 Elsevier Inc. All rights reserved.

doi:10.1016/j.amc.2006.07.164

* Corresponding author. E-mail address: [email protected] (A.P. Engelbrecht).

Highly specialised mutation and crossover operators are used to directly optimise the structure of polynomial expressions, and to exploit similarities between the various polynomial expressions in the search space. Least-squares approximation is used within the genetic algorithm to find the coefficients of the terms of evolved polynomials.

The remainder of this paper is organised as follows. Section 2 presents an overview of various function approximation techniques. The implementation of the genetic algorithm polynomial approximator is presented in detail in Section 3. The section presents the clustering algorithm used to cluster the training data, and introduces the representation, specialised mutation and crossover operators, and the hall-of-fame, all of which ensure the structural optimality of the evolved polynomials. The experimental procedure, data sets and results are discussed in Section 4. Finally, Section 5 presents the summarised findings and envisioned future developments to the method.

2. Theory of function approximation

As was mentioned earlier, function approximation can be divided into two classes of problems. One class deals with the simplification of a defined function in order to determine approximate values for that function, and the other class deals with finding a function that best describes a set of data points. This section discusses classical methods to perform the latter class of function approximation, such as discrete least-squares approximation and Taylor polynomials. This section will also discuss methods from the field of computational intelligence [1], such as neural networks [2,3], genetic algorithms [4,5] and genetic programming [6].

2.1. Discrete least-squares approximation

The following is a brief adaptation of Fraleigh and Beauregard [7], and Burden and Faires [8]. The method of least-squares involves determining the best linear approximation to an arbitrary set of m data points {(a_1, b_1), ..., (a_m, b_m)}, by minimising the least-squares error:

$\epsilon = \sum_{i=1}^{m} (b_i - b_i')^2,$

where $b_i'$ represents the predicted output of some arbitrary function, e.g.,

$b_i' = r_0 + r_1 a_i + \cdots + r_{n-1} a_i^{n-1} + r_n a_i^n = \sum_{j=0}^{n} r_j a_i^j,$

where n + 1 represents the maximum number of terms. The coefficients r_0 through r_n of the polynomial function mentioned above can be determined by solving the linear system:

$b \approx A r, \quad (1)$

where $r^T = [r_0 \; r_1 \; \cdots \; r_n]$. Since this overdetermined linear system generally has no exact solution, the least-squares solution is obtained by solving:

$(A^T A) r = A^T b. \quad (2)$

Matrix A is of the form:

$A = \begin{bmatrix} 1 & a_1 & a_1^2 & \cdots & a_1^n \\ 1 & a_2 & a_2^2 & \cdots & a_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & a_m & a_m^2 & \cdots & a_m^n \end{bmatrix} \quad (3)$

and vector $b^T = [b_1 \; b_2 \; \cdots \; b_m]$.

Obviously, the above method requires a decision as to what function to use to calculate the least-squares fit. Many types of functions can be fitted, e.g., polynomial, exponential, logarithmic, etc. In the simple polynomial case, however, at least a decision needs to be taken as to the value of n. Because a least-squares fit is empirically obtained from a set of data points, the interpolation characteristics of such a fit are reasonably good. However, for the same reason, a least-squares fit has poor extrapolation properties, particularly when an extrapolated point lies far away from the data set. Care should be taken in selecting n, since a too high order may overfit data points, resulting in bad generalisation.
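As an illustration of the normal-equations route described above, the following is a minimal numpy sketch; the function name and the example data are ours, not from the paper, and the sketch assumes a well-conditioned design matrix.

```python
import numpy as np

def least_squares_poly(a, b, n):
    """Fit an order-n polynomial to points (a_i, b_i) by solving the
    normal equations (A^T A) r = A^T b of Eq. (2)."""
    A = np.vander(a, n + 1, increasing=True)   # rows [1, a_i, a_i^2, ..., a_i^n]
    r = np.linalg.solve(A.T @ A, A.T @ b)      # coefficients r_0 ... r_n
    return r

# illustrative usage: recover a quadratic from lightly perturbed samples
a = np.linspace(0.0, 1.0, 50)
b = 1.0 + 2.0 * a - 3.0 * a**2 + 0.01 * np.random.randn(50)
print(least_squares_poly(a, b, 2))
```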

2.2. Taylor polynomials

The following is a brief adaptation of Haggarty [9]. If f is an n-times differentiable function at a, then the Taylor polynomial of degree n for f at a is defined by:

$T_{n,a}f(x) = f(a) + \frac{f'(a)}{1!}(x - a) + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n.$

The importance of Taylor polynomials is that they only involve simple addition and multiplication operations. Moreover, given x to any specified degree of accuracy, it is straightforward to evaluate such polynomial expressions to a comparable degree of accuracy.

Taylor's theorem provides an important result: Let f be (n + 1)-times continuously differentiable on an open interval containing the points a and b. Then the difference between f and T_{n,a}f at b is given by:

$f(b) - T_{n,a}f(b) = \frac{(b - a)^{n+1}}{(n + 1)!} f^{(n+1)}(c),$

for some c between a and b. The error in approximating f(x) by the polynomial T_{n,a}f(x) is the term to the right of the equality in the above. By integrating the above, the error over an interval between the Taylor polynomial T_{n,a}f(x) and any function f can be determined to any desired degree of accuracy. This, in turn, means that Taylor polynomials can be used to approximate any n-times differentiable, continuous function.
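A small worked sketch of the definitions above, using the exponential function (whose derivatives are all exp, so the Lagrange remainder bound is easy to state); the function name and the chosen point are illustrative.

```python
import math

def taylor_exp(x, a, n):
    """Degree-n Taylor polynomial of exp about a:
    T_{n,a}f(x) = sum_{j=0}^{n} f^(j)(a)/j! * (x - a)^j, with f^(j) = exp."""
    return sum(math.exp(a) / math.factorial(j) * (x - a) ** j for j in range(n + 1))

x, a, n = 1.5, 0.0, 6
approx = taylor_exp(x, a, n)
# remainder bound: |f(x) - T_{n,a}f(x)| <= max|f^(n+1)(c)| / (n+1)! * |x - a|^(n+1)
bound = math.exp(max(a, x)) / math.factorial(n + 1) * abs(x - a) ** (n + 1)
print(approx, abs(math.exp(x) - approx), bound)
```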

2.3. Neural networks

Neural networks have been proved to be universal approximators [10,11]. This means that a neural network can approximate any non-linear mapping to a desired degree of accuracy, provided that enough hidden units are provided in the hidden layer. Many successful applications of neural networks can be found in the literature [12–14], including function approximation. Benefits of neural networks include robustness to noise, which directly translates into good generalisation ability.

A neural network is an interconnection of artificial neurons, arranged into layers. Each artificial neuron receives signals from the environment or other artificial neurons, accumulates these signals, and then emits a signal according to its activation function. These output signals either interact with other artificial neurons, or alter the environment. Each neuron computes the weighted sum of its inputs and its internal bias:

$net = \sum_{f} x_f w_f - \theta,$

where the neuron fires based on its activation function, e.g.,

$f(net) = \frac{1}{1 + e^{-\lambda(net)}},$

in the case of the sigmoid activation function. Each weight $w_f$ and each bias $\theta$ in a neural network is learnt using a supervised learning rule such as gradient descent [15], or scaled conjugate gradient [16].
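A minimal sketch of the single-neuron computation just described; the function and parameter names are ours, and the weights, bias and steepness value are arbitrary illustrations.

```python
import numpy as np

def neuron_output(x, w, theta, lam=1.0):
    """Single artificial neuron: weighted sum of the inputs minus the bias,
    passed through the sigmoid activation f(net) = 1 / (1 + exp(-lam * net))."""
    net = np.dot(x, w) - theta
    return 1.0 / (1.0 + np.exp(-lam * net))

print(neuron_output(np.array([0.2, -0.4, 0.7]), np.array([0.5, 0.1, -0.3]), theta=0.05))
```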

While neural networks are successful in performing function approximation, they do suffer from a number of problems:

• The training of neural networks is computationally time consuming, especially when a large number of data points are used and for large architectures.
• Finding the optimal architecture is crucial to ensure optimal interpolation (generalisation) performance [17]. However, architecture selection further adds to the complexity of training.
• Depending on the training algorithm used, neural networks are susceptible to local minima.
• Neural networks are sensitive to initial conditions and values of training parameters.
• While neural networks do have extrapolation capabilities, these deteriorate the further extrapolation points lie from the training set.

The interested reader is encouraged to read Zurada [2] and Bishop [3].

2.4. Genetic algorithms

Genetic algorithms [4,5] are inspired by natural evolution as described by Darwin [18]. A genetic algorithm consists of a population of individuals. The characteristics of each individual are expressed using a genotype. Operators, such as mutation and crossover, are used to create further generations of individuals by adjusting the genotype of each individual. A function known as the fitness function is used to reward those individuals whose phenotypic behaviour corresponds to better solutions in the problem domain, by allowing those individuals to reproduce or survive to the next generation. A suitable encoding scheme is needed to encode the genotype of each individual. Such an encoding scheme is usually a binary string, but can be a vector of floating-point values, nominal values, etc.

Genetic algorithms have been applied to a range of diverse problems such as music [19], data mining [20] and design [21]. A very important aspect in designing a genetic algorithm is the definition of the fitness function. For function approximation, a suitable fitness function is fairly obvious: the genetic algorithm attempts to minimise the mean squared error between the target outputs and the predicted outputs (or some other suitable variant).

Wilson describes a genetic algorithm that performs a piecewise-linear approximation to any arbitrary function [22]. His findings yielded arbitrarily close approximations that were efficiently distributed over a function's domain.

Kowar used a genetic function approximation (GFA) algorithm to accomplish objectives similar to that of statistical experimental design [23]. He applied GFA to the 24 design examples to determine whether the model found using experimental design techniques could be determined using GFA. Kowar found that the GFA algorithm produces a population of model equations that contains the same regression equation as is derived using statistical experimental design analysis, but also allows the experimentalist to increase the quality and quantity of information derived. He named this application the Genetic Function Approximation Experimental Design (GFAXD) method.

Yao and Liu present a general framework for the design of artificial neural networks using evolutionary algorithms [24]. Their system, called EPNet, implemented a number of mutation and crossover operators that evolved the connections, neurons and the weights of a neural network, thereby training and evolving the structure of a neural network. The algorithm was benchmarked on the two spirals problem. They argued that EPNet offered a competitive alternative to conventional constructive and pruning algorithms.

2.5. Genetic programs

A genetic program [6] is nothing more than a genetic algorithm that has a tree-based chromosome representation. Such a representation can be used to represent programs and mathematical equations. A few genetic programs already exist that apply to function approximation. Angeline discusses a model that uses genetic programming to select a system of equations that are optimised in a neural network-like fashion, in order to predict chaotic time series [25]. His goal, specifically, was to evolve task-specific activation functions for neural network-like systems of equations, i.e., activation functions that were not sigmoidal in nature.

Nikolaev and Iba discuss a genetic program that uses Chebyshev polynomials as building blocks for a tree-structured polynomial expression [26]. Their findings indicate that their tree-structured polynomial representation produced superior results on several benchmark and real-world time series prediction problems.

Symbolic regression is a means of evolving functions to regress a set of data points. The functions are represented as trees, where the non-terminal symbols are selected from a function set and the terminal symbols are selected from a terminal set. The function set consists of a set of functions and operators, e.g., {sin, cos, +, /}, and the terminal set consists of a set of constants and attributes, e.g., {1.5, x, 0.5}. Abbass et al. [27] present a concise overview of symbolic regression techniques using genetic programming.

3. Evolving structurally optimal polynomials

This section discusses the implementation specifics of the genetic algorithm (GASOLPE), which heavily borrows from the ideas presented in Section 2. Essentially, the algorithm is a three-stage process that consists of:

1. a fast, rough k-means clustering algorithm,
2. a genetic algorithm, and
3. a hall-of-fame.

Each of these components will be discussed and elaborated on in full.

3.1. K-means clustering

One of the primary problems with most computational intelligence paradigms is the need to iterate over each training pattern in order to calculate an error metric (in the case of a neural network) or to calculate the fitness (in the case of an evolutionary algorithm) of an individual. GASOLPE uses clustering in order to try to break the aforementioned restriction. The idea is to draw a stratified random sample [28], of size s, from k clusters of training patterns, where each cluster represents a stratum of homogeneous training patterns (according to some characteristic inherent in the data), instead of using all of the available training patterns. The stratified random sample is drawn proportionally from each of the k clusters, i.e.,

$s = \sum_{w=1}^{k} \frac{|C_w|}{S},$

where $|C_w|$ is the (stratum) size of cluster w. Obviously, a proportional sample would be meaningless unless each stratum were homogeneous with respect to some characteristic.
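The text above does not spell out how the proportional allocation is realised in code. The following is a hedged sketch that interprets it as drawing a fixed fraction of each stratum (the FunctionPercentageSampleSize parameter of Table 2); the function name, the rounding rule and the one-pattern-per-stratum floor are our assumptions.

```python
import random

def stratified_sample(clusters, fraction):
    """Proportional stratified sampling sketch: from each cluster (stratum) C_w,
    draw roughly |C_w| * fraction patterns without replacement (assumption)."""
    sample = []
    for cluster in clusters:
        n_w = max(1, round(len(cluster) * fraction))   # at least one pattern per stratum
        sample.extend(random.sample(cluster, min(n_w, len(cluster))))
    return sample
```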

Alsabti et al. discuss an efficient direct k-means clustering algorithm that uses a tree structure to represent the nearest neighbour to each individual pattern in an efficient way [29]. The GASOLPE method makes use of this k-means clustering algorithm. For the purposes of the GASOLPE method, however, the tree structure has been sacrificed for a simpler heuristic. The direct k-means clustering method follows the following pseudo-code algorithm:

1. Initialise k centroids $(w_1, \ldots, w_k)$ such that each centroid is initialised to one input vector, $w_d = q_l$, $d \in \{1, \ldots, k\}$, $l \in \{1, \ldots, |S|\}$, where each cluster $C_d$ is associated with the centroid $w_d$.
2. For each input vector $q_l$, where $l \in \{1, \ldots, |S|\}$:
   (a) Find the nearest centroid $w_{d^*}$, i.e., if
       $\|q_l - w_{d^*}\| \le \|q_l - w_d\|, \quad (d = 1, \ldots, k) \wedge (d \ne d^*),$
       then
       $C_{d^*} = C_{d^*} \cup \{q_l\}.$
3. For each cluster $C_d$, where $d \in \{1, \ldots, k\}$:
   (a) Update the cluster centroid $w_d$ to be the centroid of all the samples currently in $C_d$ using
       $w_d = \frac{\sum_{q_l \in C_d} q_l}{|C_d|}.$
4. Compute the error function
   $E = \sum_{d=1}^{k} \sum_{q_l \in C_d} \|q_l - w_d\|^2.$
5. Return to 2 until E does not change significantly or cluster membership does not change.

The simpler heuristic used by the GASOLPE method requires only minor adjustments (steps 1 and 2a) to the above algorithm, and includes a centroid standard deviation measure for each cluster:

1. Initialise k centroids $(w_1, \ldots, w_k)$ such that each centroid is initialised to one input vector, $w_d = q_l$, $d \in \{1, \ldots, k\}$, $l \in \{1, \ldots, |S|\}$, and initialise k centroid deviation vectors $(\sigma_1, \ldots, \sigma_k)$ such that $\sigma_d = 0$, $d \in \{1, \ldots, k\}$, where each cluster $C_d$ is associated with the centroid $w_d$ and the centroid deviation vector $\sigma_d$.
2. For each input vector $q_l$, where $l \in \{1, \ldots, |S|\}$:
   (a) If $(q_l \in C_x, x \in \{1, \ldots, k\}) \wedge ((q_l < w_x - \sigma_x) \vee (q_l > w_x + \sigma_x))$, then find the nearest centroid $w_{d^*}$, i.e., if
       $\|q_l - w_{d^*}\| \le \|q_l - w_d\|, \quad (d = 1, \ldots, k) \wedge (d \ne d^*),$
       then
       $C_{d^*} = C_{d^*} \cup \{q_l\}.$
3. For each cluster $C_d$, where $d \in \{1, \ldots, k\}$:
   (a) Update the cluster centroid $w_d$ to be the centroid of all the samples currently in $C_d$ so that
       $w_d = \frac{\sum_{q_l \in C_d} q_l}{|C_d|}.$
   (b) Update the cluster centroid deviation vector $\sigma_d$ to be the standard deviation of all the samples currently in $C_d$ so that
       $\sigma_d = \sqrt{\frac{\sum_{q_l \in C_d} q_l^2 - \frac{\left(\sum_{q_l \in C_d} q_l\right)^2}{|C_d|}}{|C_d| - 1}}.$
4. Compute the error function
   $E = \sum_{d=1}^{k} \sum_{q_l \in C_d} \|q_l - w_d\|^2.$
5. Return to 2 until E does not change significantly or cluster membership does not change.

Essentially, the centroid $w_d$ and the centroid standard deviation $\sigma_d$ in the above algorithm allow the k-means clustering algorithm to fit k hyper-cubes over an n-dimensional pattern space. Any pattern not within a hyper-cube becomes eligible for selection in step 2a of the algorithm. This pattern selection strategy drastically reduces the number of comparisons that need to be made for each training iteration, resulting in improved performance over the normal direct k-means clustering algorithm.
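The following is a minimal numpy sketch of the modified heuristic described above. The function name, the random centroid initialisation and the per-dimension handling of the deviation test are our assumptions; the paper does not prescribe these details.

```python
import numpy as np

def kmeans_with_deviation(patterns, k, epochs, seed=0):
    """Rough sketch of the modified direct k-means: a pattern is only considered
    for reassignment when it falls outside the hyper-cube
    [w_d - sigma_d, w_d + sigma_d] of its current cluster (step 2a above)."""
    rng = np.random.default_rng(seed)
    patterns = np.asarray(patterns, dtype=float)
    centroids = patterns[rng.choice(len(patterns), size=k, replace=False)].copy()
    deviations = np.zeros_like(centroids)
    assign = np.full(len(patterns), -1)
    for _ in range(epochs):
        for l, q in enumerate(patterns):
            d = assign[l]
            inside = (d >= 0 and np.all(q >= centroids[d] - deviations[d])
                      and np.all(q <= centroids[d] + deviations[d]))
            if not inside:  # eligible for reassignment: find the nearest centroid
                assign[l] = int(np.argmin(np.linalg.norm(centroids - q, axis=1)))
        for d in range(k):  # update centroid and deviation of each cluster
            members = patterns[assign == d]
            if len(members) > 0:
                centroids[d] = members.mean(axis=0)
                deviations[d] = members.std(axis=0, ddof=1) if len(members) > 1 else 0.0
    return centroids, assign
```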

3.2. Genetic algorithm

The following section discusses the core algorithms employed by the genetic algorithm component of the GASOLPE method. Section 3.2.1 introduces the reader to the complexity of the technique used by the genetic algorithm. Section 3.2.2 discusses the representation of each individual in the genetic algorithm. Section 3.2.3 illustrates how an individual is initialised. Sections 3.2.4 and 3.2.5 focus on the mutation and crossover operators employed by the genetic algorithm. Section 3.2.6 discusses the fitness function used by the genetic algorithm. Finally, Section 3.2.7 presents the algorithm to guide the optimisation process.

3.2.1. Introduction

In Sections 2.1 and 2.2, Taylor polynomials and discrete least-squares approximation were discussed in terms of their relevance to function approximation. The definition of the linear function presented in Sections 2.1 and 2.2 is extended from:

$b_i' = \sum_{j=0}^{n} r_j a_i^j,$

to

$b_i' = \sum_{\substack{s=0, \\ \sum_{j=1}^{m} \lambda_j = s}}^{n} \left( r_{(\lambda_1, \lambda_2, \ldots, \lambda_m)} \prod_{q=1}^{m} a_{i,q}^{\lambda_q} \right), \quad (4)$

where m is the dimensionality of the input space and n is the maximum polynomial order. This definition allows the representation of functions such as:

$b_i' = r_{(0,0)} + r_{(1,0)} a_{i,1} + r_{(0,1)} a_{i,2} + r_{(1,1)} a_{i,1} a_{i,2} + r_{(2,0)} a_{i,1}^2 + r_{(0,2)} a_{i,2}^2,$

for m = 2 and n = 2. If the values of the coefficients $r_{(\lambda_1, \lambda_2, \ldots, \lambda_m)}$ are efficiently determined using the least-squares approximation (from Eq. (4)), then all the genetic algorithm is required to do is to algorithmically determine the optimal approximating polynomial structure. The definition of optimality is bimodal: to obtain the smallest polynomial structure, and the best possible function approximation. A simplistic approach would be to generate every possible combination of function, and test the predicted result of such a function against the dataset.

Firstly, the total number of unique terms t generated by the extended form in Eq. (4) is determined: select, with repetition, p inputs from a set of m inputs. This problem is similar to determining the number of n-multi-sets of size p, where $p \in \{0, \ldots, n\}$. There are:

$t = \sum_{p=0}^{n} \binom{p + m - 1}{p}$

such multi-sets (terms) [30]. By applying induction and Pascal's formula we can simplify the above to

$t = \binom{m + n}{n}. \quad (5)$

To determine the number of function choices, u, select, without repetition, q terms from the set of t terms, as follows:

$u = \sum_{q=0}^{t} \binom{t}{q}.$

Using the Binomial theorem, the above is simplified to

$u = 2^t. \quad (6)$

From Eqs. (5) and (6), for a 10-dimensional input space with a maximum polynomial order of three there are 286 possible terms and $2^{286}$ function choices. To iteratively calculate and test each function choice against a set of S training patterns is thus computationally demanding. Genetic algorithms, however, can be used to determine solutions to such difficult problems.
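The counting argument of Eqs. (5) and (6) can be checked with a couple of lines of Python (the variable names are ours):

```python
from math import comb

m, n = 10, 3                                  # input dimensionality and maximum polynomial order
t = comb(m + n, n)                            # Eq. (5): number of unique terms (286 for m=10, n=3)
print(t, f"2**{t} candidate term subsets")    # Eq. (6): 2^t possible function choices
```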

3.2.2. Representation

The representation used by the algorithm is fairly simple and is, in fact, a representation of Eq. (4). Each individual is made up of a set I of unique, term-coefficient mappings, e.g.,

$I = \{(t_0 \to r_0), \ldots, (t_{p-1} \to r_{p-1})\},$

where p is the maximum set size (maximum number of terms) and $r_n$, $n \in \{0, \ldots, p - 1\}$, is a real-valued coefficient. Each term $t_n$ is made up of a set T of unique, variable-order mappings, e.g.,

$T = \{(a_{i,1} \to \lambda_1), \ldots, (a_{i,m} \to \lambda_m)\},$

where m is the number of inputs (variables), $a_{i,s}$, $s \in \{1, \ldots, m\}$, is an integer representing an input and $\lambda_s$ is a natural-valued order. In practice, these two sets are maintained as two variable-length, sorted arrays, which allow only unique insertion.
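One hypothetical, minimal way to mirror these two sets in Python is shown below; the paper uses sorted arrays, whereas this sketch uses a dictionary keyed by frozen sets, so the data-structure choice and all names are our assumptions.

```python
# An individual maps terms to real coefficients; each term maps an input index
# to its natural-valued order, mirroring the sets I and T above.
Term = frozenset                       # set of (input_index, order) pairs, unique per input
individual = {
    Term(()): 0.0,                     # constant term
    Term({(1, 1)}): 0.0,               # r * a_{i,1}
    Term({(1, 1), (3, 2)}): 0.0,       # r * a_{i,1} * a_{i,3}^2
}

def evaluate_term(term, pattern):
    """Product of the selected inputs raised to their orders for one pattern,
    where pattern maps input index to value, e.g. {1: 0.5, 3: -2.0}."""
    result = 1.0
    for var, order in term:
        result *= pattern[var] ** order
    return result

print(evaluate_term(Term({(1, 1), (3, 2)}), {1: 0.5, 3: -2.0}))   # 0.5 * (-2.0)^2 = 2.0
```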

3.2.3. Initialisation

Each individual in a population is initialised by randomly selecting variable-order pairs, in order to build a term up to a maximum polynomial order, $e \in \{0, \ldots, n\}$. This process is repeated until the number of terms is equal to the maximum number of terms p. The initialisation of an individual is fully described by the following pseudo-code algorithm (a code sketch follows the pseudo-code):

1. Set $I_x = \{\}$.
2. While $|I_x| < p$ do:
   (a) Set $T_n = \{\}$.
   (b) Select $e \in \{0, \ldots, n\}$ uniformly from the maximum polynomial order n.
   (c) While $e > 0$ do:
       (i) Select $1 \le f \le e$ uniformly from the available orders e.
       (ii) Select $1 \le g \le m$ uniformly from the set of m inputs.
       (iii) If $|T_n| < |T_n \cup \{(g \to f)\}|$ then $e := e - f$, i.e., decrease the number of available orders e.
       (iv) Set $T_n = T_n \cup \{(g \to f)\}$.
   (d) Set $I_x = I_x \cup \{(T_n \to 0)\}$ as shown in Fig. 1.
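A minimal Python sketch of the initialisation pseudo-code, assuming the hypothetical dict-of-frozensets representation sketched in Section 3.2.2; the extra guard that stops the inner loop once every input already carries an order is our addition.

```python
import random

def init_individual(m, n, p):
    """Random initialisation following the pseudo-code above: p terms, each a set of
    unique variable-order mappings whose orders sum to at most n; coefficients start at 0."""
    individual = {}
    while len(individual) < p:
        term, e = {}, random.randint(0, n)     # step (b): total order budget for this term
        while e > 0 and len(term) < m:         # step (c), with a guard against exhausting inputs
            f = random.randint(1, e)           # (i) order to spend
            g = random.randint(1, m)           # (ii) input to attach it to
            if g not in term:                  # (iii)/(iv): only unique insertions consume budget
                term[g] = f
                e -= f
        individual[frozenset(term.items())] = 0.0   # step (d): coefficient initialised to 0
    return individual

print(init_individual(m=4, n=3, p=5))
```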

3.2.4. Mutation operators

The mutation operators serve to inject new genetic material into a population of individuals, thereby ensuring that a larger part of the search space is covered. Four mutation operators are used by the genetic algorithm: shrink, expand, perturb and reinitialise. These operators have been developed with the main objective of optimising the structure of the polynomials represented by individuals. A code sketch of these operators is given after the list below.

• Shrink operator: The shrink operator is fairly simple to implement and consists of removing, arbitrarily, one of the term-coefficient pairs from the set $I_x$. The pseudo-code for the shrink operator is as follows:
  1. Select $T_n \in I_x$ uniformly from the set of terms $I_x$.
  2. Set $I_x = I_x \setminus \{T_n\}$ as shown in Fig. 2.

• Expand operator: The expand operator adds a new random term-coefficient pair to the set $I_x$. The pseudo-code for the expand operator is as follows:

Fig. 1. Illustration of GASOLPE chromosome initialisation for an individual Ix.

Fig. 2. Illustration of the GASOLPE shrink operator for an individual Ix.

  1. If $|I_x| < p$ then:
     (a) Set $T_n = \{\}$.
     (b) Select $e \in \{0, \ldots, n\}$ uniformly from the maximum polynomial order n.
     (c) While $e > 0$ do:
         (i) Select $f \in \{1, \ldots, e\}$ uniformly from the available orders e.
         (ii) Select $g \in \{1, \ldots, m\}$ uniformly from the set of m inputs.
         (iii) If $|T_n| < |T_n \cup \{(g \to f)\}|$ then $e := e - f$, i.e., decrease the number of available orders e.
         (iv) Set $T_n = T_n \cup \{(g \to f)\}$.
     (d) Set $I_x = I_x \cup \{(T_n \to 0)\}$ as shown in Fig. 3.

• Perturb operator: The perturb operator is fairly complicated and requires the algorithm to select a term from the individual, and adjust one of the variable-order mappings. This adjustment can either add, remove or adjust an order in a variable-order mapping and is applied uniformly (with equal probability). The pseudo-code for the perturb operator is as follows:
  1. Select $T_n \in I_x$ uniformly from the set of terms $I_x$.
  2. Calculate the number of orders available:
     $e := p - \sum_{m=1}^{|T_n|} \lambda_{I_x,m}.$
  3. Select $g \in \{1, \ldots, m\}$ uniformly from the set of m inputs.
  4. Select $h \in U(0, 1)$ as a uniformly distributed random number.
  5. If $h < 0.333$ then:
     (a) Set $T_n = T_n \setminus \{(g \to \lambda)\}$ for any $\lambda$, i.e., remove the gth variable-order mapping from set $T_n$ as shown by the crossed-out section in Fig. 4.

Fig. 3. Illustration of the GASOLPE expand operator for an individual Ix.

Fig. 4. Illustration of the GASOLPE perturb operator for an individual Ix.

  6. Else if $h < 0.666$ then:
     (a) Select $f \in \{1, \ldots, e\}$ uniformly from the available orders e.
     (b) Set $T_n = T_n \cup \{(g \to f)\}$ as shown by the large box in Fig. 4.
  7. Else:
     (a) Set $T_n = T_n \setminus \{(g \to \lambda)\}$ for any $\lambda$, i.e., remove the gth variable-order mapping from set $T_n$.
     (b) Set $e := e + \lambda$, i.e., increase the number of available orders e.
     (c) Select $f \in \{1, \ldots, e\}$ uniformly from the available orders e as shown by the small box in Fig. 4.
     (d) Set $T_n = T_n \cup \{(g \to f)\}$.

• Finally, the reinitialise operator is just a re-invocation of the initialisation operator of Section 3.2.3.
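A hedged Python sketch of the four operators, again assuming the hypothetical dict-of-frozensets representation from Section 3.2.2. The helper name, the handling of the remaining order budget in perturb (capped at n rather than the paper's p), and the way coefficients are carried over are our simplifications, not the paper's exact procedure.

```python
import random

def _random_term(m, n):
    """Build one random term: unique variable-order mappings whose orders sum to at most n."""
    term, e = {}, random.randint(0, n)
    while e > 0 and len(term) < m:
        f, g = random.randint(1, e), random.randint(1, m)
        if g not in term:
            term[g] = f
            e -= f
    return frozenset(term.items())

def shrink(individual):
    """Shrink: remove one randomly chosen term-coefficient pair."""
    if individual:
        del individual[random.choice(list(individual))]

def expand(individual, m, n, p):
    """Expand: add one new random term-coefficient pair if there is room."""
    if len(individual) < p:
        individual.setdefault(_random_term(m, n), 0.0)

def perturb(individual, m, n):
    """Perturb: remove, add, or replace one variable-order mapping of a random term."""
    if not individual:
        return
    key = random.choice(list(individual))
    term, coeff = dict(key), individual.pop(key)
    g, h = random.randint(1, m), random.random()
    if h < 0.333:
        term.pop(g, None)                          # remove the mapping for input g
    else:
        if h >= 0.666:
            term.pop(g, None)                      # replace: drop first, then re-add
        budget = max(1, n - sum(term.values()))    # remaining order budget (assumption)
        term[g] = random.randint(1, budget)
    individual[frozenset(term.items())] = coeff

def reinitialise(individual, m, n, p):
    """Reinitialise: rebuild the individual from scratch, as in Section 3.2.3."""
    individual.clear()
    while len(individual) < p:
        individual.setdefault(_random_term(m, n), 0.0)
```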

3.2.5. Crossover operator

The crossover operator serves to retain genetic material from one generation of individuals to the next; thus the crossover operator narrows the search space around a particular solution. The crossover operator used by the genetic algorithm selects a subset of two individuals in order to construct a new chromosome. Term-coefficient mappings are selected to construct the new chromosome at random, with a higher probability of selection given to term-coefficient mappings that are prevalent in both individuals. A ratio of 80:20 was used (shown below), because, on average, the new individual generated by these parameters was found to be roughly the same length as its longer parent. The pseudo-code algorithm for the crossover operator is as follows (a code sketch follows the pseudo-code):

1. Let $I_a = \{\}$ be a new term-coefficient set (individual) as shown in Fig. 5.
2. Let $I_b \in P$ be any term-coefficient set in the population of individuals P as shown in Fig. 5.
3. Let $I_c \in P$ be any term-coefficient set in the population of individuals P as shown in Fig. 5.
4. Let $A = I_b \cap I_c$ be the intersection of term-coefficient mappings.
5. Let $B = (I_b \setminus I_c) \cup (I_c \setminus I_b)$ be the union of the exclusions.
6. Set $e := 1$.
7. While $e < |A|$ and $|I_a| < p$ do:
   (a) Select $h \in U(0, 1)$ as a uniformly distributed random number.
   (b) If $h < 0.8$ then $I_a = I_a \cup \{A_e\}$.
   (c) Set $e := e + 1$.
8. Set $e := 1$.
9. While $e < |B|$ and $|I_a| < p$ do:
   (a) Select $h \in U(0, 1)$ as a uniformly distributed random number.
   (b) If $h < 0.2$ then $I_a = I_a \cup \{B_e\}$.
   (c) Set $e := e + 1$.
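A compact sketch of the crossover rule, using the same hypothetical dict-of-frozensets representation; the 0.8/0.2 defaults follow the ratio described above, while the function and parameter names are ours and offspring coefficients are simply reset to zero (they are re-fitted by least squares anyway).

```python
import random

def crossover(parent_b, parent_c, p, p_common=0.8, p_unique=0.2):
    """Build a child by sampling shared term mappings with high probability (0.8)
    and non-shared ones with low probability (0.2), up to p terms."""
    child = {}
    shared = set(parent_b) & set(parent_c)        # intersection A
    unique = set(parent_b) ^ set(parent_c)        # union of the exclusions B
    for term in shared:
        if len(child) < p and random.random() < p_common:
            child[term] = 0.0
    for term in unique:
        if len(child) < p and random.random() < p_unique:
            child[term] = 0.0
    return child
```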

Fig. 5. Illustration of the GASOLPE crossover operator for individuals Ia, Ib and Ic.

3.2.6. Fitness function

The fitness function is an important aspect of a genetic algorithm, in that it serves to direct the algorithm toward optimal solutions. The fitness function used by the genetic algorithm is similar to the adjusted coefficient of determination [28]. This contrasts with Potgieter and Engelbrecht [31], where the mean-squared error term used failed to penalise the complexity of a given solution. The fitness function is defined as:

$R_a^2 = 1 - \frac{\sum_{i=1}^{N} (b_i - b'_{I_x,i})^2}{\sum_{i=1}^{N} (b_i - \bar{b})^2} \cdot \frac{s - 1}{s - k}, \quad (7)$

where N is the sample size, $b_i$ is the actual output of pattern i, $b'_{I_x,i}$ is the predicted output of individual $I_x$ for pattern i, and the model complexity k is calculated as follows:

$k = \sum_{n=1}^{|I_x|} \sum_{s=1}^{|T_n|} \lambda_{n,s}, \quad (8)$

where $I_x$ is an individual in the set $G_{GA}$ of individuals, $T_n$ is a term of $I_x$ and $\lambda_{n,s}$ is the sth order of term $T_n$. This fitness function penalises the complexity of an individual $I_x$ by penalising the number of multiplications needed to calculate the predicted output of that individual, i.e., the number of terms and their order.

In order to calculate the fitness of an individual, however, the algorithm requires the coefficients in the set $I_x$ to be calculated. Matrix A (from Eq. (3)) is populated with the combination of terms represented by each term-coefficient mapping, e.g., if the term is $a_0 a_1^2$, the algorithm multiplies out each of the input attributes for a particular pattern. Matrix A is thus populated from left to right with terms from each pattern in the sample space (where the patterns proceed from top to bottom). The vector b is made up of the target output for each pattern. After reducing the linear system shown by Eq. (2), vector r represents the coefficients of each of the term-coefficient mappings.
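The sketch below ties Eqs. (7) and (8) together: it builds the term-wise design matrix, fits the coefficients by least squares, and returns the adjusted-R²-style score. It assumes the hypothetical dict-of-frozensets representation sketched earlier, patterns supplied as dictionaries from input index to value, targets as a numpy array, and more sampled patterns than the total order (s > k); all names are ours.

```python
import numpy as np

def fitness(individual, patterns, targets):
    """Sketch of Eq. (7): least-squares coefficients on the sampled patterns,
    scored by adjusted R^2 with the complexity penalty k of Eq. (8)."""
    terms = list(individual)
    # design matrix: one column per term, one row per sampled pattern
    A = np.array([[np.prod([x[var] ** order for var, order in term]) for term in terms]
                  for x in patterns])
    r, *_ = np.linalg.lstsq(A, targets, rcond=None)        # coefficients, as in Eq. (2)
    predicted = A @ r
    s = len(patterns)
    k = sum(order for term in terms for _, order in term)  # model complexity, Eq. (8)
    sse = np.sum((targets - predicted) ** 2)
    sst = np.sum((targets - targets.mean()) ** 2)
    score = 1.0 - (sse / sst) * (s - 1) / (s - k)           # assumes s > k
    return score, dict(zip(terms, r))
```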

3.2.7. The genetic algorithm optimisation process

The optimisation algorithm for the genetic algorithm is now presented:

1. Let g = 0 be the generation counter.
2. Initialise a population $P_g$ of N individuals, i.e.,
   $P_g = \{P_{g,n} \,|\, n = 1, \ldots, N\}.$
3. While g < G, where G is the total number of generations, do:
   (a) Sample $C' \subseteq \bigcup_{d=1}^{k} C_d$, where $C_d$ is a cluster (stratum) of patterns.
   (b) Determine the coefficients of each of the term-coefficient mappings in an individual $P_{g,n}$ by reducing $b \approx Ar$.
   (c) Evaluate the fitness $F_{GA}(P_{g,n})$ of each individual in population $P_g$ using the patterns in $C'$.
   (d) Let $P'_g \subseteq P_g$ be the top x% of the individuals to be involved in elitism.
   (e) Install the members of $P'_g$ into $P_{g+1}$.
   (f) Let $P''_g \subseteq P_g$ be the top y% of the individuals to be involved in crossover.
   (g) Perform crossover:
       (i) Select two individuals $P''_{g,n_1}$ and $P''_{g,n_2}$.
       (ii) Produce offspring $P_{g+1,n_3}$ from $P''_{g,n_1}$ and $P''_{g,n_2}$.
   (h) Perform mutation:
       (i) Select an individual $P_{g+1,n_1}$.
       (ii) Mutate $P_{g+1,n_1}$.
   (i) Evolve the next generation: $g := g + 1$.

3.3. Hall-of-fame

The hall-of-fame concept [32] works in the same way as the hall-of-fame in classic arcade games, where the player that achieved a better score than any of the players in the hall-of-fame takes his/her rightful place (by entering his/her initials) and knocks the worst score off the list. For GASOLPE, the hall-of-fame is essentially a set of unique, individual solutions, ranked according to their fitness value. After every generation the best individual is given the opportunity to enter the hall-of-fame. Entry into the hall-of-fame is determined as follows:

• If the best individual of a generation is structurally equivalent to an individual in the hall-of-fame, the fitness values of the individual in the hall-of-fame and the best individual are compared. The individual with the best fitness then takes (or keeps) that place in the hall-of-fame.
• Otherwise, if the best individual of a generation is not structurally equivalent to any individual in the hall-of-fame, the best individual is inserted relative to its fitness (or possibly not at all, if its fitness is worse than that of every individual in the hall-of-fame).

The hall-of-fame is not an elitism method; the individuals in the hall-of-fame do not further participate in the evolutionary process. The hall-of-fame simply keeps track of the best solutions for any given architecture. In the end the solution taken is not necessarily the best solution of the last generation, but the best over all generations. Ultimately, the purpose of the hall-of-fame is to ensure that the best, general solution is selected as the solution to the optimisation process.
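A hedged sketch of the entry rules described above; the tuple layout, the size limit, and the use of the term set alone as the notion of structural equivalence are our assumptions.

```python
def try_enter_hall_of_fame(hall, candidate, fitness_value, size=10):
    """Replace a structurally equivalent entry if the newcomer is fitter,
    otherwise insert the newcomer by fitness rank (dropping the worst if full)."""
    structure = frozenset(candidate)                      # terms only, ignoring coefficients
    for i, (entry_structure, entry, entry_fitness) in enumerate(hall):
        if entry_structure == structure:                  # structurally equivalent entry found
            if fitness_value > entry_fitness:
                hall[i] = (structure, dict(candidate), fitness_value)
            return
    hall.append((structure, dict(candidate), fitness_value))
    hall.sort(key=lambda item: item[2], reverse=True)     # rank by fitness
    del hall[size:]                                       # keep only the best entries
```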

Because the genetic algorithm of Section 3.2 works with only a sample of the available patterns, certain function fits may not represent the true nature of the data set, particularly when such a data set is extremely noisy, because of the new sample used at each generation. For example, with a particularly poor sample selection from a noisy data set, the genetic algorithm may decide that a straight line is the optimal fit for the data set, when, in fact, a cubic function would have performed better on the whole. Following that, the genetic algorithm may decide that such a fit was the best fit seen so far in the optimisation process and will decide to retain that solution, ultimately leading to sub-optimal convergence. The hall-of-fame prevents this scenario from happening, because all best solutions compete for a place in the hall-of-fame. At the end of the optimisation process, the solutions in the hall-of-fame are tested against a validation set (a subset of the patterns withheld from training), to determine the ultimate solution.

4. Results

The following section discusses the experimental procedure and results of the GASOLPE method, applied to various datasets generated from a range of generating functions. Section 4.1 presents the generating functions of the various datasets, as well as the experimental procedure and program initialisation. Section 4.2 presents the experimental results for the functions listed in Section 4.1 and discusses the findings.

4.1. Procedure

4.1.1. Functions

Table 1 presents a range of functions (f1–f5) used to test the algorithm. These functions are all continuous over their domains and have been injected with noise to illustrate the characteristics of the GASOLPE method on noisy datasets. Additionally, the method has been tested on a number of interesting chaotic time series problems.

Table 1
Function definitions

Name  Function
f1    f(x_0) = sin(x_0) + U(−1, 1); x_0 ∈ [0, 2π]
f2    f(x_0) = sin(x_0) + cos(x_0) + U(−1, 1); x_0 ∈ [0, 2π]
f3    f(x_0) = x_0^5 − 5x_0^3 + 4x_0 + U(−1, 1); x_0 ∈ [−2, 2]
f4    f(x_0, x_1) = sin(x_0) + sin(x_1) + U(−1, 1); {x_0, x_1} ∈ [0, 2π]
f5    f(x_0, x_1) = x_0^5 − 5x_0^3 + 4x_0 + x_1^5 − 5x_1^3 + 4x_1 + U(−1, 1); x_0, x_1 ∈ [−2, 2]

• Logistic map: The first, and most basic, chaotic time series problem is the logistic map, whose generating function can be described in the following manner:

  $dx/dt = a \cdot x_n (1 - x_n),$

  where a = 4 and x_0 = 0.2.

• Henon map: The next chaotic time series problem is the Henon map, which can be described in the following manner:

  $dx/dt = 1 - a \cdot x_n^2 + b \cdot y_n,$
  $dy/dt = dx/dt,$

  where a = 1.4, b = 0.3, x(0) = U(−1, 1) and y(0) = U(−1, 1).

• Rossler attractor: The next chaotic time series problem is the Rossler attractor, whose generating function is as follows:

  $dx/dt = -y_n - z_n,$
  $dy/dt = x_n - a \cdot y_n,$
  $dz/dt = b + z_n (x_n - c),$

  where a = 0.2, b = 0.2, c = 5.7, x_0 = 1.0, y_0 = 0 and z_0 = 0. This function should be generated using the Runge–Kutta order 4 method [8].

• Lorenz attractor: The last chaotic time series problem is the Lorenz attractor, whose generating function is as follows:

  $dx/dt = \sigma (y_n - x_n),$
  $dy/dt = \rho \cdot x_n - y_n - x_n z_n,$
  $dz/dt = x_n y_n - b \cdot z_n,$

  where σ = 10, ρ = 28, b = 8/3, x_0 = 1.0, y_0 = 0 and z_0 = 0. Once again, this function should be generated using the Runge–Kutta order 4 method [8].
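As an illustration of how such a dataset can be produced, the sketch below generates the logistic-map series and splits it as described in Section 4.1.2 (10000 training, 1000 validation, 1000 generalisation patterns). The paper writes the generators in dx/dt notation; treating the logistic map as the recurrence x_{n+1} = a·x_n(1 − x_n) and pairing x_n with x_{n+1} as input and target are our assumptions, and the random shuffling used in the experiments is omitted here for brevity.

```python
import numpy as np

def logistic_series(a=4.0, x0=0.2, length=12000):
    """Generate a logistic-map time series; each pattern pairs the current value
    x_n (input) with x_{n+1} (target output)."""
    x = np.empty(length + 1)
    x[0] = x0
    for i in range(length):
        x[i + 1] = a * x[i] * (1.0 - x[i])
    return x[:-1].reshape(-1, 1), x[1:]        # inputs, targets

X, y = logistic_series()
train = (X[:10000], y[:10000])                 # training set
val = (X[10000:11000], y[10000:11000])         # validation set
test = (X[11000:], y[11000:])                  # generalisation set
```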

4.1.2. Experimental procedure

Each of the generating functions listed in Section 4.1.1 was used to create a corresponding dataset consisting of 12000 patterns. Each pattern consisted of the inputs and outputs for the specific generating function, e.g., for the Rossler attractor, each pattern consisted of the three input components and one output component. Each dataset was scaled to create another dataset, which consisted of scaled input components (to the range [−1, 1]), to be used by a neural network for function approximation.

The neural network implementation used for comparison with the GASOLPE method was trained until the maximum number of epochs was exceeded, using stochastic gradient descent. Only the hidden layer of the neural network used sigmoidal activation functions; the other layers used linear activation functions. The use of linear activation functions in specifically the output layer negates the need for scaling neural network target outputs. The neural network was initialised with a learning rate of 0.05, a momentum of 0.9 and a maximum number of epochs of 3000. The neural network also made use of k-means clustering to draw stratified random samples in an identical manner to the GASOLPE method (in contrast to Potgieter and Engelbrecht [31]). Consequently, the number of clusters was set to 15, the number of cluster epochs was set to 10 and the percentage sample size was set to 0.01. For fair comparisons, the neural network used the same clustering procedure as used by GASOLPE to draw stratified samples from the original data sets.

The GASOLPE method was initialised using the values shown in Table 2. These values were selected based on numerous experimental runs which will not be discussed further in this paper, with some exceptions. To ensure a reducible overdetermined linear system (see Section 2.1), the number of clusters should be selected to be at least one more than the maximum number of function components. It is also necessary to select the sample size such that at least one pattern can be drawn from each cluster, to ensure an adequate sample spread of patterns over the function domain. An increased sample size will also ensure that the overdetermined linear system is solvable. Additionally, there is a function cutoff heuristic that prevents terms from appearing in a polynomial expression when that term's coefficients tend to 0. This cutoff results in a reduced set of terms and is thus used to prune each individual solution.

Table 2
Genetic algorithm initialisation

Variable                       Value
Clusters                       15
ClusterEpochs                  10
FunctionMutationRate           0.1
FunctionCrossoverRate          0.2
FunctionGenerations            100
FunctionIndividuals            30
FunctionPercentageSampleSize   0.01
FunctionMaximumComponents      20
FunctionElite                  0.1
FunctionCutOff                 0.001

Clusters represents the number of clusters used by the k-means clustering algorithm, ClusterEpochs represents the number of epochs the clusterer is run for, FunctionPercentageSampleSize represents the sample size selected from each stratum, FunctionMaximumComponents represents the maximum number of terms allowed per expression and FunctionCutOff represents the minimum coefficient value before pruning occurs. All other parameters are self explanatory.

Both the neural network and the GASOLPE method made use of three distinct sets of patterns. Each dataset mentioned earlier was split up into a training set of 10000 patterns, a validation set of 1000 patterns and a generalisation set of 1000 patterns. The purpose of the training set is to train the two methods; the fitness function of the genetic algorithm and the forward- and back-propagation phases of the neural network use the training set as the driving force of their algorithms. The validation set is used to validate the interpolation ability of the genetic algorithm; the genetic algorithm uses the validation set to select the best individual from the hall-of-fame. The generalisation set is used to compare the two algorithms on unseen data patterns, i.e., generalisation ability.

The results for each of the functions listed in Section 4.1.1 were obtained by running 100 simulations of the corresponding datasets. Note, though, that before each simulation was run, the data patterns were shuffled randomly among the three sets (training, validation and generalisation) and presented in this form to both the neural network and the GASOLPE method.

4.2. Experimental results

This section discusses the experimental results of both the neural network and the genetic algorithm presented in this paper. The section is organised in three subsections, namely: noiseless datasets, noisy datasets and polynomial structure.

Before the datasets and results are described, the choices for the maximum polynomial order and the hidden units parameters are discussed. Any function with n turning points can be reasonably described by a polynomial of degree n + 1. Consider a continuous function f with n turning points, (p_1, ..., p_n). Then the derivative of f must necessarily be 0 at those turning points. A polynomial expression with f(x) = 0 at all points (p_1, ..., p_n) has the factorised form:

$f(x) = (x - p_1)(x - p_2) \cdots (x - p_n)$

and has the simplified form of:

$f(x) = r_n x^n + r_{n-1} x^{n-1} + \cdots + r_0 x^0.$

Integration of the simplified form yields:

$f(x) = \frac{r_n}{n+1} x^{n+1} + \frac{r_{n-1}}{n} x^n + \cdots + r_0 x + C,$

which has degree n + 1. Thus, the maximum polynomial order of the GASOLPE method should always be selected as one more than the number of turning points of the original function. Similarly, the number of hidden units used by a neural network should also be chosen to be one more than the number of turning points of the original function [17].

Note that the mean squared error comparisons of the results are roughly twice as large as that of Potgieter and Engelbrecht [31]. This is due to a slight miscalculation of the mean squared error in that paper; however, the comparisons are still valid because the same miscalculation was present in both the neural network and genetic algorithm results.

4.2.1. Noiseless data sets

Table 3 summarises the results of the noiseless application of the Henon map, Logistic map, Rossler attractor and the Lorenz attractor, respectively. All experiments utilising these functions used a genetic algorithm (GA) with a maximum polynomial order of 3 and a neural network (NN) with a hidden layer size of 3, i.e., there were 3 hidden units. All other initialisation values were set according to Table 2 and as specified in Section 4.1.2.

For each of the experiments shown in Table 3, the GA performed significantly better, on average, than the neural network. This improvement was both in terms of training and generalisation accuracy and in terms of the average simulation completion time for each simulation run.

Table 3
Comparison of GASOLPE and NN on noiseless data

Function              Method  TMSE      σTMSE     GMSE      σGMSE     t̄       σt
Henon                 GA      0.000000  0.000000  0.000000  0.000000  0.8433  0.035020
                      NN      0.000975  0.002375  0.000965  0.002355  1.3034  0.028753
Logistic              GA      0.000000  0.000000  0.000000  0.000000  0.8524  0.033609
                      NN      0.001315  0.012948  0.001301  0.012810  1.1439  0.054899
Rossler x component   GA      0.000000  0.000000  0.000000  0.000000  0.9397  0.025799
                      NN      0.047543  0.053032  0.047551  0.053607  1.4370  0.040390
Rossler y component   GA      0.000000  0.000000  0.000000  0.000000  0.8390  0.021438
                      NN      0.017081  0.031806  0.017089  0.032378  1.4287  0.036060
Rossler z component   GA      0.000000  0.000000  0.000002  0.000000  0.8572  0.026594
                      NN      0.136534  0.128612  0.123652  0.116032  1.4274  0.028235
Lorenz x component    GA      0.000433  0.000000  0.000443  0.000000  0.7997  0.034420
                      NN      0.220568  0.266238  0.225746  0.270200  1.4492  0.046399
Lorenz y component    GA      0.000906  0.000000  0.000900  0.000000  0.9097  0.036054
                      NN      1.688550  1.336680  1.699200  1.279130  1.4328  0.031272
Lorenz z component    GA      0.000393  0.000000  0.000446  0.000000  0.9235  0.033826
                      NN      2.798120  1.266880  2.855730  1.340020  1.4728  0.073238

TMSE = mean squared error on training set, GMSE = mean squared error on test set (generalisation), σ indicates the standard deviation, t̄ is the average simulation completion time in seconds.

What is interesting to note is that the GA performed better than the NN on a range of chaotic time series problems. Figs. 9–17 show the function plots of most of the functions used in this section.

Fig. 6. Function f1 actual versus GA and NN predicted.

Fig. 7. Function f2 actual versus GA and NN predicted.

4.2.2. Noisy data sets

Table 4 represents the experimental results for the GASOLPE method (GA) and the neural network (NN) as applied to a number of noisy data sets.

Fig. 8. Function f3 actual versus GA and NN predicted.

Fig. 9. Henon map.

Fig. 10. Rossler attractor.

• f1: Function f1 represents the results of a noisy application of sin(x) over a domain of one period. The NN was trained using 3 hidden units in the hidden layer and the GA was trained with a maximum polynomial order of 3. All other initialisation parameters were set as shown by Table 2 and specified in Section 4.1.2. The GA performed slightly better than the NN in terms of training and generalisation error, and performed better than the NN in terms of the average simulation completion time. A plot of the best GA and NN output, evaluated according to the generalisation ability of the end state of each simulation run, against the plot of the generalisation data set is shown in Fig. 6.
• f2: Function f2 represents the results of a noisy application of sin(x) + cos(x) over a domain of one period. The NN was, once again, trained using 3 hidden units and the GA was trained with a maximum polynomial order of 3. The GA performed slightly better than the NN in terms of accuracy. The GA performed better than the NN in terms of the average simulation completion time. Fig. 7 shows a plot of the best NN and GA output for the function f2 against a plot of the generalisation data set.
• f3: Function f3 represents the results of a noisy application of a fifth order polynomial over the interval [−2, 2]. The GA used a maximum polynomial order of 5 and the NN used 5 hidden units. The GA performed better than the NN, both in terms of accuracy and the average simulation completion time. A plot of the best GA and NN output against the plot of the generalisation data set is shown in Fig. 8.

Fig. 11. Rossler attractor: x component.

Fig. 12. Rossler attractor: y component.

• f4: Function f4 represents the results of a 2-dimensional application of function f1. The GA used a maximum polynomial order of 3 and the NN used 6 hidden units. The GA performed significantly better than the NN in terms of training and generalisation accuracy, and in terms of the average simulation completion time.

Fig. 13. Rossler attractor: z component.

Fig. 14. Lorenz attractor.

• f5: Function f5 represents the results of a 2-dimensional application of function f3. The GA used a maximum polynomial order of 5 and the NN used 10 hidden units. The GA performed significantly better than the NN both in terms of training and generalisation accuracy, and in terms of the average simulation completion time.
• Henon: The Henon function represents the results of a noisy application of the Henon map. Uniformly distributed noise in the range [−1, 1] was injected into the output component of the data set. The GA used a maximum polynomial order of 3 and the NN used 3 hidden units. The GA performed slightly better than the NN in terms of training and generalisation accuracy, and performed significantly better than the NN in terms of the average simulation completion time. Fig. 9 shows a plot of the Henon map.

Fig. 15. Lorenz attractor: x component.

Fig. 16. Lorenz attractor: y component.

• Lorenz: The Lorenz function represents the results of a noisy application of the Lorenz attractor in all three components. Uniformly distributed noise in the range [−10, 10] was injected into each of the input and output components of the data set. The GA used a maximum polynomial order of 3 and the NN used 3 hidden units.

Fig. 17. Lorenz attractor: z component.

For all three experiments, the GA performed significantly better than the NN both in terms of training and generalisation accuracy, and in terms of the average simulation completion time. Figs. 14–17 show the plots for the Lorenz attractor.

Table 4
Comparison of GASOLPE and NN on noisy data

Function              Method  TMSE       σTMSE     GMSE       σGMSE     t̄       σt
f1                    GA      0.343135   0.001644  0.338052   0.004972  0.8425  0.018983
                      NN      0.399733   0.047117  0.396936   0.048930  1.0657  0.041636
f2                    GA      0.372829   0.002661  0.374673   0.006059  0.8310  0.022042
                      NN      0.390366   0.042499  0.391309   0.043960  1.0501  0.030666
f3                    GA      0.339414   0.001765  0.338496   0.013471  0.9471  0.035228
                      NN      0.560401   0.109980  0.560173   0.115488  1.1866  0.022483
f4                    GA      0.345282   0.001698  0.348297   0.011319  1.1514  0.058982
                      NN      0.557837   0.115652  0.562788   0.110518  1.5374  0.024479
f5                    GA      0.332791   0.001935  0.332758   0.010562  1.2607  0.076028
                      NN      2.076470   1.103230  2.100930   1.136890  1.9936  0.033920
Henon                 GA      0.335182   0.001160  0.335557   0.007941  1.0130  0.069187
                      NN      0.424972   0.055029  0.428270   0.057967  1.3438  0.038945
Lorenz x component    GA      33.993600  0.201502  34.480000  0.299403  0.9763  0.119618
                      NN      49.060600  5.866370  48.843200  6.138620  1.4642  0.032417
Lorenz y component    GA      33.591800  0.986713  35.040100  0.896421  1.0156  0.134091
                      NN      53.807800  7.505340  54.353500  8.162100  1.4531  0.039637
Lorenz z component    GA      33.541200  0.107222  33.185500  0.144716  0.9670  0.133398
                      NN      53.640600  8.451190  54.019200  8.339930  1.4749  0.038231

TMSE = mean squared error on training set, GMSE = mean squared error on test set (generalisation), σ indicates the standard deviation, t̄ is the average simulation completion time in seconds.

Table 5
Comparison of GASOLPE and NN on noiseless data, without clustering and sampling

Function              Method  TMSE       σTMSE     GMSE       σGMSE     t̄           σt
Henon                 GA      0.000000   0.000000  0.000000   0.000000  75.189200   1.059180
                      NN      0.001306   0.002678  0.001281   0.002537  103.410000  2.069960
Logistic              GA      0.000000   0.000000  0.000000   0.000000  70.842800   0.474810
                      NN      0.008278   0.032965  0.008341   0.033250  99.671100   2.836440
Rossler x component   GA      0.000000   0.000000  0.000000   0.000000  118.690000  2.946300
                      NN      1.346130   0.913826  1.348310   0.889236  113.220000  1.345250
Rossler y component   GA      0.000000   0.000000  0.000000   0.000000  106.810000  2.381300
                      NN      0.730697   0.440817  0.728198   0.435515  120.980000  1.463220
Rossler z component   GA      0.000000   0.000000  0.000003   0.000000  116.240000  3.015520
                      NN      0.134398   0.111063  0.130536   0.113929  121.390000  2.773960
Lorenz x component    GA      0.000433   0.000000  0.000443   0.000000  106.631000  2.921870
                      NN      9.597510   7.581300  9.660480   7.652430  114.790000  2.495630
Lorenz y component    GA      0.000906   0.000000  0.000900   0.000000  119.370000  2.676700
                      NN      12.817000  5.639590  12.840600  5.590290  112.680000  1.398990
Lorenz z component    GA      0.000393   0.000000  0.000440   0.000000  129.080000  2.619580
                      NN      8.855620   3.420960  8.779560   3.387780  126.500000  2.618890

TMSE = mean squared error on training set, GMSE = mean squared error on test set (generalisation), σ indicates the standard deviation, t̄ is the average simulation completion time in seconds.

Table 6
Comparison of GASOLPE and NN on noisy data, without clustering and sampling

Function            Method   TMSE        σ_TMSE      GMSE        σ_GMSE      t̄            σ_t
f1                  GA       0.343135    0.001644    0.338052    0.004972    63.259800    0.189102
                    NN       0.457727    0.126863    0.454270    0.127160    93.374900    1.309790
f2                  GA       0.372829    0.002661    0.437673    0.006059    63.236800    0.704665
                    NN       0.471971    0.179688    0.471387    0.180965    93.992300    3.552180
f3                  GA       0.339423    0.001758    0.338495    0.013470    72.282700    5.174860
                    NN       0.653251    0.240320    0.650769    0.238727    120.640000   2.231640
f4                  GA       0.345285    0.001701    0.348266    0.011292    174.700000   1.176370
                    NN       0.523719    0.133442    0.529929    0.136431    153.310000   2.423350
f5                  GA       0.332855    0.002283    0.332639    0.010724    214.550000   23.608100
                    NN       1.669340    0.703825    1.672670    0.721260    220.960000   4.817730
Henon               GA       0.334961    0.001134    0.335139    0.007257    89.221000    10.055000
                    NN       0.496410    0.164857    0.501595    0.168396    105.640000   1.630020
Lorenz x component  GA       33.986500   0.098196    34.473400   0.148774    118.310000   2.997290
                    NN       55.112000   16.153000   55.319600   16.765000   165.730000   12.708000
Lorenz y component  GA       33.487500   0.091685    34.937200   0.075824    163.330000   3.787420
                    NN       63.600700   20.333200   63.978500   20.674900   171.860000   13.531900
Lorenz z component  GA       33.505600   0.055774    33.160800   0.060393    130.360000   7.952540
                    NN       60.051600   15.461700   61.183300   15.624000   171.770000   12.309900

TMSE = mean squared error on the training set, GMSE = mean squared error on the test set (generalisation), σ indicates the standard deviation, t̄ is the average simulation completion time in seconds.


4.2.3. Effect of stratified samples and clustering

This section repeats the previous experiments, but without the clustering process and sampling method. This is done for both GASOLPE and the neural network, in order to show that the clustering and sampling process does have a benefit. Tables 5 and 6 summarise the results for the noiseless and noisy data respectively. It is clear from these results that exclusion of the clustering and sampling process has a significant impact on computational speed. The tables also show that the reduction in the number of training patterns has no impact on the accuracy of the GASOLPE method. This is, in most cases, not the case for the neural network, where differences in accuracy were observed for 76% of the cases.
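For a concrete picture of the pattern-reduction step being switched off here, the following sketch (not the authors' implementation) clusters the training patterns with a standard k-means algorithm and then draws a stratified sample, keeping a fixed fraction of every cluster. The cluster count, sampling fraction and use of scikit-learn's KMeans are illustrative assumptions; the paper's own clustering algorithm is a faster, specialised variant.

    import numpy as np
    from sklearn.cluster import KMeans

    def stratified_sample(patterns, n_clusters=10, fraction=0.1, seed=0):
        # Assign every training pattern to a cluster, then keep `fraction`
        # of the members of each cluster (at least one per cluster).
        rng = np.random.default_rng(seed)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(patterns)
        keep = []
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            size = max(1, int(round(fraction * members.size)))
            keep.extend(rng.choice(members, size=size, replace=False))
        return patterns[np.sort(np.array(keep))]

    # Example: reduce 10 000 two-dimensional patterns to roughly 1 000.
    data = np.random.default_rng(1).normal(size=(10000, 2))
    print(data.shape, "->", stratified_sample(data).shape)

Because every cluster contributes to the sample, the reduced set still covers the input space, which is why the accuracy of GASOLPE is largely unaffected while the training time drops sharply.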

Table 7
Comparison of GA and Lagrange polynomials

Function                       Type  Approximation                                                                        MSE
y = sin(x), x ∈ [0, 2π]        L3    y = 1.860735x − 0.888436x² + 0.094266x³                                              0.019702
                               L4    y = 1.697653x − 0.810569x² + 0.086004x³                                              0.009750
                               L5    y = 0.832295x + 0.343104x² − 0.429250x³ + 0.093785x⁴ − 0.005971x⁵                    0.000115
                               GA    y = 0.0159533 + 0.861138x + 0.286288x² − 0.401674x³ + 0.0886412x⁴ − 0.00564312x⁵     0.000018
y = e^x, x ∈ [0, 2π]           L3    y = 1 + 1.058613x + 0.295655x² + 0.364013x³                                          0.000013
                               L4    y = 1 + 0.992442x + 0.540839x² + 0.094520x³ + 0.090298x⁴                             0.000000
                               L5    y = 1 + 1.000785x + 0.494131x² + 0.182240x³ + 0.023142x⁴ + 0.017975x⁵                0.000000
                               GA    y = 0.999965 + 1.00096x + 0.49377x² + 0.182663x³ + 0.0228933x⁴ + 0.0180254x⁵         0.000000
y = ln(x), x ∈ [0, 3]          L3    y = −1.150728 + 1.436152x − 0.313740x² + 0.028317x³                                  0.000043
                               L4    y = −1.378877 + 1.907025x − 0.639256x² + 0.120200x³ − 0.009092x⁴                     0.000003
                               L5    y = −1.565489 + 2.374494x − 1.073576x² + 0.309847x³ − 0.048390x⁴ + 0.003114x⁵        0.000000
                               GA    y = −1.56923 + 2.39301x − 1.09562x² + 0.320748x³ − 0.0508251x⁴ + 0.00331785x⁵        0.000000
y = √x, x ∈ [0, 3]             L3    y = 1.456030x − 0.537598x² + 0.081568x³                                              0.004592
                               L4    y = 1.809139x − 1.140235x² + 0.394751x³ − 0.050513x⁴                                 0.002036
                               L5    y = 2.129687x − 1.965082x² + 1.123084x³ − 0.316723x⁴ + 0.034403x⁵                    0.001100
                               GA    y = 0.142346 + 1.71346x − 1.53925x² + 0.928645x³ − 0.277349x⁴ + 0.0317172x⁵          0.000072
y = 1/(1 + e^−x), x ∈ [−3, 3]  L3    y = 0.5x + 0.241084x² − 0.010025x³                                                   0.000176
                               L4    y = 0.5x + 0.232002x² − 0.009016x³                                                   0.000079
                               L5    y = 0.5x + 0.249619x² − 0.018252x³ + 0.000809x⁵                                      0.000008
                               GA    y = 0.5 + 0.24639x − 0.0166246x³ + 0.000681502x⁵                                     0.000001

L3–L5 denote Lagrange interpolating polynomials of degree 3–5; GA denotes the GASOLPE expression.



4.2.4. Polynomial structure

Using Lagrange interpolating polynomials (an extension of Taylor polynomials), a curve can be approximated to any required degree of accuracy [8]. Table 7 shows the comparison between the Lagrange polynomials of degree 3–5 and the GASOLPE method (initialised using Table 2 and a maximum polynomial order of 5).

In all cases the polynomial expression generated by the GASOLPE method is superior to the Lagrange polynomials in terms of the mean squared error. Notice that in all cases the optimal number of terms was chosen to participate in a GASOLPE expression. The results illustrate that the GASOLPE method does indeed find the optimal polynomial approximation to a function.
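The comparison in Table 7 can be reproduced in spirit with the short sketch below (not the authors' code). It builds a degree-n interpolating polynomial through n + 1 equally spaced nodes as a stand-in for the Lagrange polynomials, and a least-squares polynomial of the same degree as a rough stand-in for the evolved expression (GASOLPE additionally selects which terms participate), then measures both on a dense grid. Node placement and sample counts are assumptions for illustration only.

    import numpy as np

    def interpolating_poly(f, a, b, degree):
        # Degree-n polynomial through n + 1 equally spaced nodes on [a, b]:
        # with exactly n + 1 points, polyfit returns the interpolant itself.
        nodes = np.linspace(a, b, degree + 1)
        return np.polyfit(nodes, f(nodes), degree)

    def least_squares_poly(f, a, b, degree, samples=500):
        # Degree-n least-squares fit to many samples on [a, b].
        x = np.linspace(a, b, samples)
        return np.polyfit(x, f(x), degree)

    def mse(coeffs, f, a, b, samples=1000):
        x = np.linspace(a, b, samples)
        return float(np.mean((np.polyval(coeffs, x) - f(x)) ** 2))

    a, b, degree = 0.0, 2.0 * np.pi, 5
    print("interpolant MSE:  ", mse(interpolating_poly(np.sin, a, b, degree), np.sin, a, b))
    print("least-squares MSE:", mse(least_squares_poly(np.sin, a, b, degree), np.sin, a, b))

On y = sin(x) over [0, 2π] the least-squares fit typically attains a smaller mean squared error than the interpolant of the same degree, which is the same qualitative picture as the L5 versus GA rows of Table 7.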

5. Conclusion

This paper presented and discussed a hybrid genetic algorithm approach to evolve structurally optimal polynomial expressions to represent a given data set. The genetic algorithm was shown to be significantly faster than a neural network approach, and produced comparable generalisation accuracy for all of the functions used in this paper (which included chaotic time series). The success of the genetic algorithm approach is mainly due to the specialised mutation and crossover operators, and can also be attributed to the fast k-means clustering algorithm, both of which lead to a significant reduction of the search space.

Although the genetic algorithm discussed in this paper appears to be fairly effective in terms of both accuracy and speed, it has one serious drawback: Taylor polynomials are poor predictors of periodic data. In order to predict periodic data over an interval with Taylor polynomials, it is


necessary to increase the order of the Taylor polynomial. However, such a prediction deteriorates rapidly when the Taylor polynomial predictor is used to extrapolate outside the aforementioned interval. This problem can be solved in one of two ways: build an expression that utilises a periodic function such as cosine, or use only linear predictors at the ends of the approximation interval.

The use of cosine as a periodic function would require a substantial rework of the data structure employed to house term-coefficient pairs: the data structure would have to be changed from a list to a tree, which would in turn require the operators to be changed. The use of linear predictors at the ends of the approximation interval is fairly simple to implement: construct an expression that represents a hyper-plane in the attribute space, e.g., for two inputs (x0, x1) and one output z, construct the expression z = r0x0 + r1x1 + c and use the linear system described in this paper to solve for r0, r1 and c. This hyper-plane can then be used in conjunction with an interval measure to extrapolate to any unseen data outside the training interval.
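As a rough illustration of that suggestion (a sketch under assumed toy data, not the authors' implementation), the hyper-plane z = r0x0 + r1x1 + c can be fitted by ordinary least squares, standing in here for the linear system referred to above, and then used to extrapolate points that an interval test flags as lying outside the training region.

    import numpy as np

    def fit_hyperplane(X, z):
        # Solve z ≈ r0*x0 + r1*x1 + c in the least-squares sense.
        A = np.hstack([X, np.ones((X.shape[0], 1))])    # columns: x0, x1, 1
        coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)  # returns [r0, r1, c]
        return coeffs

    def predict(coeffs, X):
        return X @ coeffs[:-1] + coeffs[-1]

    # Toy data: z = 2*x0 - x1 + 0.5 plus a little noise, sampled on [-1, 1]^2.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(200, 2))
    z = 2.0 * X[:, 0] - X[:, 1] + 0.5 + rng.normal(scale=0.01, size=200)

    coeffs = fit_hyperplane(X, z)
    print("r0, r1, c:", coeffs)
    print("extrapolated z at (1.5, -2.0):", predict(coeffs, np.array([[1.5, -2.0]])))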

Further developments will include integration of the polynomial approximator with a rule extraction algorithm for the mining of continuous classes.

References

[1] A.P. Engelbrecht, Computational Intelligence: An Introduction, Wiley and Sons, 2002.
[2] J.M. Zurada, Introduction to Artificial Neural Systems, PWS Publishing Company, 1992.
[3] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1992.
[4] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[5] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[6] J. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992.
[7] J.B. Fraleigh, R.A. Beauregard, Linear Algebra, third ed., Addison-Wesley Publishing Company, 1995.
[8] R.L. Burden, J.D. Faires, Numerical Analysis, sixth ed., Brooks/Cole Publishing Company, 1997.
[9] R. Haggarty, Fundamentals of Mathematical Analysis, second ed., Addison-Wesley Publishing Company, 1993.
[10] K. Hornik, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[11] K. Hornik, M. Stinchcombe, H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks, Neural Networks 3 (1990) 551–560.
[12] G.W. Irwin, K. Warwick, K.J. Hunt (Eds.), Neural Network Applications in Control, Institution of Electrical Engineers, 1995.
[13] F. Fogelman-Soulie, P. Gallinari (Eds.), Industrial Applications of Neural Networks, World Scientific, 1998.
[14] P.J.G. Lisboa, B. Edisbury, A. Vellido (Eds.), Business Applications of Neural Networks: The State-Of-The-Art of Real-World Applications, World Scientific, 2000.
[15] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, Ph.D. thesis, Harvard University, 1974.
[16] M. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks 6 (1993) 525–553.
[17] A.P. Engelbrecht, Sensitivity Analysis of Multilayer Feedforward Neural Networks, Ph.D. thesis, University of Stellenbosch, 1999.
[18] C. Darwin, On the Origin of Species, John Murray, London, 1859.
[19] J. Biles, Genjam: a genetic algorithm for generating jazz solos, in: Proceedings of ICMC 1994, The Computer Music Association, 1994.
[20] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, in: J.R. Koza, K. Deb, M. Dorigo, D.B. Fogel, M. Garzon, H. Iba, R.L. Riolo (Eds.), Proceedings of Genetic Programming 1997, Second Annual Conference, Stanford University, CA, USA, Morgan Kaufmann, 1997, p. 380.
[21] E. Alba, J. Aldana, J. Troya, Genetic algorithms as heuristics for optimizing ANN design, in: R. Albrecht, C. Reeves, N. Steele (Eds.), Proceedings of the International Conference on Artificial Neural Nets and Genetic Algorithms, Springer-Verlag, 1993, pp. 683–690.
[22] S.W. Wilson, Function approximation with a classifier system, in: Proceedings of the Genetic and Evolutionary Computation Conference, San Francisco, California, USA, Morgan Kaufmann Publishers, 2001, pp. 974–981.
[23] T. Kowar, Genetic function approximation experimental design (GFAXD): a new method for experimental design, Journal of Chemical Information and Computer Sciences (1998) 858–866.
[24] X. Yao, Y. Liu, Towards designing artificial neural networks by evolution, Applied Mathematics and Computation 91 (1) (1998) 83–90.
[25] P.J. Angeline, Evolving predictors for chaotic time series, in: S. Rogers, D. Fogel, J. Bezdek, B. Bosacchi (Eds.), Proceedings of SPIE (Volume 3390): Application and Science of Computational Intelligence, Bellingham, WA, 1998, pp. 170–180.
[26] N. Nikolaev, H. Iba, Genetic programming using Chebichev polynomials, in: Proceedings of the Genetic and Evolutionary Computation Conference, San Francisco, California, USA, Morgan Kaufmann Publishers, 2001, pp. 89–96.
[27] H. Abass, R. Saker, C. Newton (Eds.), Data Mining: A Heuristic Approach, Idea Publishing Group, 2002.
[28] A.G.W. Steyn, C.F. Smit, S.H.C. du Toit, C. Strasheim, Modern Statistics in Practice, J.L. van Schaik, 1996.
[29] K. Alsabti, S. Ranka, V. Singh, An efficient parallel algorithm for high dimensional similarity join, in: Proceedings of IPPS 11th International Parallel Processing Symposium, IEEE Computer Society Press, 1998.
[30] S.S. Epp, Discrete Mathematics with Applications, second ed., Brooks/Cole Publishing Company, 1995.
[31] G. Potgieter, A.P. Engelbrecht, Structural optimization of learned polynomial expressions using genetic algorithms, in: Proceedings of the Fourth Asia–Pacific Conference on Simulated Evolution and Learning, vol. 2, Singapore, 2002, pp. 605–609.
[32] C. Rosin, R. Belew, New methods for competitive coevolution, Evolutionary Computation 5 (1) (1997) 1–29.