

Evolving, Training and Designing Neural Network Ensembles

Xin Yao (http://www.cs.bham.ac.uk/~xin)
CERCIA and Natural Computation Group
School of Computer Science
The University of Birmingham
Edgbaston, Birmingham B15 2TT, UK
Email: [email protected]


An Overview of This Talk

1. Motivations

(a) Background

(b) From Evolutionary Learning to Ensemble Learning

2. Basic Ideas

(a) Speciated Evolutionary Learning

(b) Negative Correlation Learning

3. Algorithms

(a) Evolutionary and Constructive Learning of Ensembles

(b) Multi-objective Ensemble Learning

(c) Online Ensemble Learning

4. Conclusions


Introduction

• An ensemble is a collection of learning systems. In this talk, I will only consider artificial neural networks (ANNs), although rule-based systems, decision trees and other learners can also be used.

• Many studies have shown that ensembles usually perform better than any single individual (under some mild conditions).

• Evolving ANNs naturally maintains a population of evolutionary ANNs (EANNs), which fits the idea and motivation behind ensembles nicely.


Why Evolving

1. Learning and evolution are two fundamental forms of adaptation. It is interesting to study the integration of the two.

2. Simulated evolution makes few assumptions about what is being evolved. It can be introduced into an ANN at different levels, including weight training, architecture adaptation and learning rule adaptation[a].

[a] X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE, 87(9):1423-1447, September 1999.


Digression: Evolutionary Computation

1. It is the study of computational systems that use ideas and draw inspiration from natural evolution.

2. For example, one of the most often used inspirations is survival of the fittest.

3. Evolutionary computation (EC) can be used in optimisation[a], machine learning and creative design[b]. It is more than just a genetic algorithm.

4. There has been significant growth in EC theory in recent years, especially in computational time complexity analysis[c].

[a] X. Yao, Y. Liu and G. Lin, “Evolutionary programming made faster,” IEEE Transactions on Evolutionary Computation, 3(2):82-102, July 1999.
[b] Y. Li, C. Hu and X. Yao, “Innovative Batik Design with an Interactive Evolutionary Art System,” Journal of Computer Science and Technology, 24(6):1035-1047, November 2009.
[c] J. He and X. Yao, “Towards an Analytic Framework for Analysing the Computation Time of Evolutionary Algorithms,” Artificial Intelligence, 145(1-2):59-97, April 2003.


Current Practice in Evolutionary Learning

[Figure 1: A general framework for Pitt-style evolutionary learning. A population of individuals (learning systems, e.g., ANNs or rule-based systems) cycles through fitness evaluation and selection, then “genetic” operators such as crossover and mutation; the best individual is taken as the final output.]


Fitness Evaluation

1. Based on the training error.

2. Based on the training error and complexity (regularisation), i.e.,

$$\frac{1}{\text{fitness}} \propto \text{error} + \alpha \cdot \text{complexity}$$
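Read literally, option 2 inverts a weighted sum. A minimal sketch follows; the reciprocal form, the default α and the epsilon guard are illustrative assumptions, not from the talk:

```python
# Minimal sketch of the regularised fitness above: since
# 1/fitness ∝ error + alpha * complexity, fitness can be taken as the
# reciprocal. alpha and the epsilon guard are illustrative choices.
def fitness(error: float, complexity: float, alpha: float = 0.1) -> float:
    return 1.0 / (error + alpha * complexity + 1e-12)  # epsilon avoids division by zero
```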


Evolutionary Learning and Optimisation

• Learning has often been formulated as an optimisation problem.

• However, learning is different from optimisation[a].

1. In optimisation, the fitness function reflects exactly what is needed. The optimal value is always better than the second best.

2. In learning, there is no way to quantify generalisation exactly. A system with the minimum training error may not be the one with the best generalisation ability. So why select the “best” individual in a population as the final output?

[a] X. Yao, Y. Liu and P. Darwen, “How to make best use of evolutionary learning,” Complexity International: An Electronic Journal of Complex Systems Research (ISSN 1320-0682), Vol. 3, July 1996.


Survival-of-the-Fittest Is Questionable Here

1. The “best” ANN with the smallest training error in a population may not be the one with the best generalisation.

2. It is a population that is evolving, not just a single ANN. A population contains more information than (or at least as much information as) any single individual.


Population as Ensemble

Simplest strategy:

• Keep every member of the population and form an ensemble output as the final solution.

• Don’t throw anyone away.

• Don’t put all your eggs in one basket.


How To Evolve?

1. We use EPNet[a] to evolve a population of ANNs, following common practice. This step is the same as in other work.

2. However, in the last generation, we do not pick the best individual. We retain the entire population so that we can form a combined output.

[a] X. Yao and Y. Liu, “A new evolutionary system for evolving artificial neural networks,” IEEE Transactions on Neural Networks, 8(3):694-713, May 1997.


How To Combine Individuals?

We studied four simple methods for combining members of an ensemble[a]; the two simplest are sketched after the list. The purpose here is not to find the best method, but to investigate whether ensembles offer any advantages.

1. Majority Voting

2. Linear Combination (LC) Based on ANN’s Fitness

3. LC Through the Recursive Least Square (RLS) Algorithm

4. LC Over a Subset of a Population

[a] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 28(3):417-425, June 1998.
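For illustration, here is a minimal sketch of the two simplest schemes above, majority voting and a fitness-weighted linear combination; the array shapes and the weight normalisation are assumptions of this sketch, and the RLS and subset variants are omitted:

```python
import numpy as np

def majority_vote(preds: np.ndarray) -> np.ndarray:
    """preds: (M, N) integer class labels from M ensemble members on N cases."""
    return np.array([np.bincount(col).argmax() for col in preds.T])

def fitness_weighted_output(outputs: np.ndarray, fitnesses: np.ndarray) -> np.ndarray:
    """outputs: (M, N) real-valued member outputs; weights proportional to fitness."""
    w = np.asarray(fitnesses, dtype=float)
    w = w / w.sum()                      # normalise weights to sum to one
    return w @ outputs                   # (N,) combined ensemble output

# toy usage: 3 members, 4 cases
print(majority_vote(np.array([[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]])))  # [0 1 1 0]
```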


Two Heads Are Better Than One: Example I

Data set   Method     Testing Error Rate
Card       EPNet      0.100
           Ensemble   0.093
Diabetes   EPNet      0.232
           Ensemble   0.226
Heart      EPNet      0.154
           Ensemble   0.151

Table 1: Comparison between the best individual (EPNet) and the ensemble output formed by the RLS combination method.


Hmmm ... Encouraging ...

... but a little naive and simplistic?


Designing Better Heads

1. Having a population of identical heads (individuals) gives us no additional advantage.

2. We want a diverse set of individuals in the population (ensemble).

Diversity[a]? How[b]?

• Speciation by fitness sharing is one technique that encourages the automatic formation of species. Different species are good at different things. They are diverse.

[a] E. K. Tang, P. N. Suganthan and X. Yao, “An Analysis of Diversity Measures,” Machine Learning, 65:247-271, 2006.
[b] G. Brown, J. L. Wyatt, R. Harris and X. Yao, “Diversity Creation Methods: A Survey and Categorisation,” Information Fusion, 6(1):5-20, January 2005.


Two Heads Are Better Than One: Example II

The idea of evolving a diverse set of individuals for later combination is not limited to ANNs. Here is an example of a rule-based system for playing the two-player iterated prisoner’s dilemma (2IPD) game[a].

1. An evolutionary algorithm was used to evolve strategies for playing the 2IPD.

2. Implicit fitness sharing was used to form different species (specialists) in a population.

3. A gating algorithm was used to combine the individuals in a population.

[a] P. J. Darwen and X. Yao, “Speciation as automatic categorical modularization,” IEEE Transactions on Evolutionary Computation, 1(2):101-108, 1997.


Experimental Results

Strategy   Wins (%)   Ties (%)   Avg Own Score   Avg Other’s Score
best.sr    0.360      0.059      1.322           1.513
gate.sr    0.643      0.059      1.520           1.234

Table 2: Results against new opponents for the 2IPD with remembered history l = 4. The results were averaged over 30 runs.


Sounds Interesting ...

• There are a number of things used here: neural networks, evolutionary algorithms, fitness sharing, ...

• Which ones are most important?

• What is the essence of such population-based learning?


NN Ensembles: Negative Correlation Learning

It turns out that the evolutionary part is not essential, but the diversity is[a].

1. Making individuals different (diversity):

$$E_i = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{2}\left(F_i(n) - d(n)\right)^2 + \lambda\, p_i(n)\right)$$

where

$$p_i(n) = \left(F_i(n) - d(n)\right)\sum_{j\neq i}\left(F_j(n) - d(n)\right)$$

and $F(n)$ is the ensemble output.

2. All individuals are learnt simultaneously using the same set of training data (see the gradient sketch below).

[a] Y. Liu and X. Yao, “Negatively correlated neural networks can produce best ensembles,” Australian Journal of Intelligent Information Processing Systems, 4(3/4):176-185, 1997.
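Differentiating $E_i$ with respect to $F_i(n)$, holding the other members fixed, gives the per-pattern error signal $(F_i(n) - d(n)) + \lambda \sum_{j\neq i}(F_j(n) - d(n))$, which each network feeds into its own backpropagation pass. A minimal NumPy sketch of that signal (the function name and shapes are assumptions):

```python
import numpy as np

def ncl_error_signals(outputs: np.ndarray, target: float, lam: float) -> np.ndarray:
    """dE_i/dF_i for each of the M ensemble members on one pattern:
    (F_i - d) + lam * sum_{j != i} (F_j - d)."""
    errs = outputs - target                    # F_i(n) - d(n), shape (M,)
    return errs + lam * (errs.sum() - errs)    # subtract own term to sum over j != i

# toy usage: four members, one pattern; each member then applies standard BP
# with its own signal, so all are trained simultaneously and interactively
signals = ncl_error_signals(np.array([0.2, 0.6, 0.4, 0.9]), target=0.5, lam=0.375)
```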


Ensemble Learning

We consider estimating $g(x) = E[d|x]$ by forming a simple average of the outputs of a set of individual networks, all trained on the same training data set $D$:

$$F(x, D) = \frac{1}{M}\sum_{i=1}^{M} F_i(x, D) \qquad (1)$$

where $F_i(x, D)$ is the actual response of network $i$ and $M$ is the number of neural network estimators.


Bias-Variance-Covariance Trade-off

Taking expectations with respect to the training set $D$, the expected mean-squared error of the combined system can be written in terms of the individual network outputs:

$$
\begin{aligned}
E_D\left[\left(E[d|x] - F(x,D)\right)^2\right]
&= \left(E_D[F(x,D)] - E[d|x]\right)^2 \\
&\quad + E_D\left[\frac{1}{M^2}\sum_{i=1}^{M}\left(F_i(x,D) - E_D[F_i(x,D)]\right)^2\right] \\
&\quad + E_D\left[\frac{1}{M^2}\sum_{i=1}^{M}\sum_{j\neq i}\left(F_i(x,D) - E_D[F_i(x,D)]\right)\left(F_j(x,D) - E_D[F_j(x,D)]\right)\right] \qquad (2)
\end{aligned}
$$

The expectation operator $E_D$ represents the average over all the patterns in the training set $D$.
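A small simulation can check the decomposition numerically; the correlated-estimator model below is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, target = 4, 200_000, 1.0
# hypothetical estimators: a shared noise term makes the M members correlated
shared = rng.normal(0.0, 0.3, (T, 1))
outputs = target + 0.1 + rng.normal(0.0, 0.5, (T, M)) + shared  # F_i over T "training sets"

F = outputs.mean(axis=1)                 # ensemble average F(x, D)
mse = np.mean((target - F) ** 2)         # left-hand side of (2)

bias2 = (F.mean() - target) ** 2         # squared bias term
C = np.cov(outputs.T, bias=True)         # (M, M) covariance matrix of the members
var = np.trace(C) / M**2                 # (1/M^2) * sum_i Var(F_i)
cov = (C.sum() - np.trace(C)) / M**2     # (1/M^2) * sum_{i != j} Cov(F_i, F_j)

print(mse, bias2 + var + cov)            # the two agree up to sampling noise
```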


All It Says, Roughly Speaking, Is ...

1. MSE = Bias + Variance + Covariance

2. Negative correlation learning (NCL) tries to minimise the covariance[a].

3. For more theoretical justification, see [b].

[a] Y. Liu and X. Yao, “Ensemble learning via negative correlation,” Neural Networks, 12(10):1399-1404, December 1999.
[b] E. K. Tang, P. N. Suganthan and X. Yao, “An Analysis of Diversity Measures,” Machine Learning, 65:247-271, 2006.


How to Choose the Correlation Penalty

The purpose of minimising $p_i$ is to negatively correlate each individual’s error with the errors of the rest of the ensemble.

• For regression problems, the function $p_i$ can be chosen as

$$p_i(n) = \left(F_i(n) - d(n)\right)\sum_{j\neq i}\left(F_j(n) - d(n)\right) \qquad (3)$$

for noise-free data, or

$$p_i(n) = \left(F_i(n) - F(n)\right)\sum_{j\neq i}\left(F_j(n) - F(n)\right) \qquad (4)$$

for noisy data.

• For classification problems, $p_i$ can be chosen as

$$p_i(n) = \left(F_i(n) - 0.5\right)\sum_{j\neq i}\left(F_j(n) - 0.5\right) \qquad (5)$$


An Illustrative Example

• The Australian credit card assessment problem was used.

• The whole data set was randomly partitioned into a training set (518 cases) and a testing set (172 cases).

• The ensemble used in our experiment consisted of four strictly layered feedforward neural networks. All individual networks had the same architecture: three layers, with 5 nodes in the hidden layer. The learning rate η in BP was 0.1, and λ was 0.375. These parameters were chosen after limited preliminary experiments; they are not meant to be optimal.


Experimental Results: Error Rates

           Training set         Test set
# epochs   500       1000       500       1000
Mean       0.1093    0.0846     0.1177    0.1163
SD         0.0092    0.0088     0.0182    0.0159
Min        0.0927    0.0676     0.0698    0.0756
Max        0.1255    0.1004     0.1628    0.1454

Table 3: Error rates for the Australian credit card assessment problem. The results were averaged over 25 runs.


Comparison with Other Work

Algorithm     TER      Algorithm   TER
NCNN          0.116    Logdisc     0.141
EPNet         0.115    CART        0.145
Evo-En-RLS    0.131    RBF         0.145
Cal5          0.137    CASTLE      0.148
ITrule        0.141    NaiveBay    0.151
DIPOL92       0.141    IndCART     0.152

Table 4: Comparison among the negative correlation NN (NCNN), EPNet, an evolutionary ensemble learning algorithm (Evo-En-RLS), and others in terms of the average testing error rate (TER).


Does NCL Work As Expected?

Ω1 = 146 Ω2 = 148 Ω3 = 149 Ω4 = 151

Ω12 = 137 Ω13 = 137 Ω14 = 141 Ω23 = 140

Ω24 = 141 Ω34 = 139 Ω123 = 132 Ω124 = 133

Ω134 = 133 Ω234 = 134 Ω1234 = 128

Table 5: The sizes of the correct response sets of individual networks on the testing set, and their intersections, for the Australian credit card assessment problem.


Mackey-Glass Time Series Prediction Problem

Method        Testing RMS
              ∆t = 6    ∆t = 84
NCNN          0.01      0.03
EPNet         0.02      0.06
BP            0.02      0.05
CC Learning   0.06      0.32

Table 6: The “Testing RMS” in the table refers to the normalised root-mean-square error on the testing set.


Comparison between NCNN and ME

Adding noise.

Method   E_mse
         σ² = 0.1   σ² = 0.2
NCNN     0.012      0.023
ME       0.018      0.038

Table 7: Comparison between NCNN and the mixtures-of-experts (ME) architectures in terms of the integrated mean-squared error on the testing set, for the moderate-noise and large-noise cases.


“Visualisation”: Three Approaches to Ensemble Learning

• Independent Training

• Sequential Training

• Simultaneous Training


Two Classes of Gaussian-Distributed Patterns

[Figure 2: (a) Scatter plot of Class 1. (b) Scatter plot of Class 2. (c) Combined scatter plot of both classes. The circle represents the optimum Bayes solution.]


The Independent Training Approach

In order to create different neural networks, the independent training approach trains a set of neural networks independently by

• varying initial random weights

• varying the architectures

• varying the learning algorithm used

• varying the data

• ...


The Sequential Training Approach

In order to decorrelate the individual neural networks, the sequential training approach trains a set of networks in a particular order, as in the boosting algorithm (sketched after this list):

• Train the first neural network with N1 randomly chosen patterns.

• Select N2 patterns on which the first neural network would have a 50% error rate. Train the second neural network with the selected patterns.

• Select N3 patterns on which the first two trained neural networks disagree. Train the third neural network with the selected patterns.
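A schematic sketch of this three-network boosting-by-filtering procedure; `train_fn` is an assumed helper that trains a base classifier and returns a predict function, and the 50% error-rate filter is approximated by sampling equal numbers of wrong and right cases:

```python
import numpy as np

def boost_by_filtering(X, y, train_fn, n1, n2, n3, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng

    # Network 1: trained on n1 randomly chosen patterns
    i1 = rng.choice(len(X), size=n1, replace=False)
    net1 = train_fn(X[i1], y[i1])

    # Network 2: patterns on which net1 would have a 50% error rate,
    # approximated by an equal mix of misclassified and correct cases
    wrong = np.flatnonzero(net1(X) != y)
    right = np.flatnonzero(net1(X) == y)
    i2 = np.concatenate([rng.choice(wrong, n2 // 2), rng.choice(right, n2 // 2)])
    net2 = train_fn(X[i2], y[i2])

    # Network 3: patterns on which the first two networks disagree
    i3 = np.flatnonzero(net1(X) != net2(X))[:n3]
    net3 = train_fn(X[i3], y[i3])
    return net1, net2, net3
```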


The Simultaneous Training Approach

• The mixtures-of-experts (ME) architecture consists of two types of networks, i.e., a gating network and a number of expert networks. Each expert network makes an individual decision on the region it covers. The gating network weights the outputs of the expert networks to provide an overall best decision.

• Negative correlation learning (NCL): no gating network is needed in NCL. The idea of NCL is to introduce a correlation penalty term into the error function of each individual network, so that the individual networks can be trained simultaneously and interactively.


Decision Boundaries by Independent Training

[Figure 3: Decision boundaries of (a) Network 1, (b) Network 2, (c) Network 3, and (d) the ensemble, each shown against the Bayesian decision boundary. The circle represents the optimum Bayes solution.]


Decision Boundaries by the Boosting Algorithm

[Figure 4: Decision boundaries of (a) Network 1, (b) Network 2, (c) Network 3, and (d) the ensemble, each shown against the Bayesian decision boundary. The circle represents the optimum Bayes solution.]


Decision Boundaries by NCL

[Figure 5: Decision boundaries of (a) Network 1, (b) Network 2, (c) Network 3, and (d) the ensemble, each shown against the Bayesian decision boundary. The circle represents the optimum Bayes solution.]


Comparisons

Boosting Training
Network 1   Network 2   Network 3   Ensemble
81.11       75.26       73.09       81.03

Negative Correlation Learning (NCL)
Network 1   Network 2   Network 3   Ensemble
80.71       80.55       80.97       81.41

Independent Training (λ = 0 in NCL)
Network 1   Network 2   Network 3   Ensemble
81.13       80.48       81.13       80.99


Discussions

• The independently trained neural networks tended to generate similar decision boundaries because of the lack of interaction among the individual networks during learning.

• The boosting algorithm performed well, but was hindered by its data filtering process, which generated highly unbalanced training data. For example, the ensemble performance actually got worse than that of network 1.

• No process of filtering data is needed in NCL. The performance of NCL (81.41) is very close to the theoretical optimum (81.51).


Evolving ANN Ensembles: Evolution Is Still Useful

No need to predefine the number of ANNs in an ensemble. We can evolve ensembles[a].

1. Evolve ensembles through hybridisation with negative correlation learning, so that a population of species is formed.

2. The number of species is determined automatically.

3. Cluster the NNs in the population (with species). These clusters are then used to construct NN ensembles.

[a] Y. Liu, X. Yao and T. Higuchi, “Evolutionary Ensembles with Negative Correlation Learning,” IEEE Transactions on Evolutionary Computation, 4(4):380-387, November 2000.


Fitness Sharing and Fitness Evaluation

• Implicit fitness sharing is used, based on the idea of “covering” the same training case by shared individuals. The procedure for calculating shared fitness is carried out case-by-case over the training set.

• For each training case, if there are p > 0 individuals that correctly classify it, then each of these p individuals receives a 1/p fitness reward, and the rest of the individuals in the population receive zero fitness reward. The fitness reward is summed over all training cases. This method is expected to generate a smoother shared fitness landscape. (A sketch follows.)
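A minimal sketch of this shared-fitness computation, assuming correctness is available as a boolean matrix (the matrix layout is an assumption of the sketch):

```python
import numpy as np

def implicit_shared_fitness(correct: np.ndarray) -> np.ndarray:
    """correct: (P, N) boolean; correct[i, n] is True if individual i
    classifies training case n correctly. Each case's unit reward is
    split equally among the p individuals that get it right."""
    fitness = np.zeros(correct.shape[0])
    for n in range(correct.shape[1]):
        winners = np.flatnonzero(correct[:, n])
        if winners.size > 0:                     # p > 0 individuals cover this case
            fitness[winners] += 1.0 / winners.size
    return fitness
```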


Constructive Neural Network Ensemble (CNNE)

• The previous evolutionary approach does not tell us how to design each individual ANN in an ensemble. It only determines the number of ANNs in an ensemble automatically.

• CNNE[a] uses constructive training, in association with negative correlation, to determine both ANN and ensemble architectures.

• It constructs an ANN first, until a stopping criterion is met. Then a new minimal ANN is added, and so on, until the stopping criteria for the ensemble construction are satisfied (see the sketch below).

[a] Md. Monirul Islam, X. Yao and K. Murase, “A constructive algorithm for training cooperative neural network ensembles,” IEEE Transactions on Neural Networks, 14(4):820-834, July 2003.
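A very rough schematic of this constructive loop, not the published CNNE algorithm; all four callbacks and the `add_hidden_node` method are assumed placeholders:

```python
def cnne_sketch(make_minimal_ann, train_with_ncl, ann_converged, ensemble_converged):
    """Grow the current ANN node by node; when it stops improving,
    add a new minimal ANN, until the ensemble-level criteria are met."""
    ensemble = [make_minimal_ann()]
    while not ensemble_converged(ensemble):
        train_with_ncl(ensemble)                 # negative correlation training
        if ann_converged(ensemble[-1]):
            ensemble.append(make_minimal_ann())  # grow the ensemble
        else:
            ensemble[-1].add_hidden_node()       # grow the current ANN
    return ensemble
```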


Multi-Objective Ensemble Learning

1. A replay of the evolutionary learning formulation (what we said before):

$$\frac{1}{\text{fitness}} \propto \text{error} + \alpha \cdot \text{complexity}$$

2. That’s clearly a two-objective problem. Why not formulate this multi-objective problem as a ‘proper’ multi-objective learning problem?

3. The Pareto front (in practice, the non-dominated set found by a multi-objective evolutionary algorithm) can then be treated as a diverse set of individuals for ensembling[a] (sketched below).

4. Instead of only two objectives, one can have as many objectives as needed.

[a] A. Chandra and X. Yao, “Ensemble learning using multi-objective evolutionary algorithms,” Journal of Mathematical Modelling and Algorithms, 5(4):417-445, December 2006.
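A minimal sketch of extracting the non-dominated set from a population scored on K objectives, lower being better on each; the quadratic scan is an illustrative choice, not the algorithm from the paper:

```python
import numpy as np

def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """objectives: (P, K), e.g., columns = [training error, complexity].
    Returns indices of non-dominated individuals: the ensemble members."""
    P = objectives.shape[0]
    nondominated = np.ones(P, dtype=bool)
    for i in range(P):
        for j in range(P):
            if i != j and np.all(objectives[j] <= objectives[i]) \
                      and np.any(objectives[j] < objectives[i]):
                nondominated[i] = False          # j dominates i
                break
    return np.flatnonzero(nondominated)
```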


Too Many Individuals In An Ensemble?

Ensemble Pruning

1. We could use an evolutionary algorithm to find a near-optimal subset[a].

2. We could also use expectation propagation[b].

[a] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 28(3):417-425, June 1998.
[b] H. Chen, P. Tino and X. Yao, “Predictive Ensemble Pruning by Expectation Propagation,” IEEE Transactions on Knowledge and Data Engineering, 21(7):999-1013, July 2009.


Online Learning Using Ensemble Approaches

1. The population nature of ensembles suggests that ensemble learning has great potential in online learning[a].

2. Indeed, a detailed analysis has shown that ensembles can deal with concept drift much better than a single individual[b]. Diversity plays an important role here as well.

[a] K. Tang, M. Lin, F. L. Minku and X. Yao, “Selective Negative Correlation Learning Approach to Incremental Learning,” Neurocomputing, 72(13-15):2796-2805, August 2009.
[b] L. L. Minku, A. White and X. Yao, “The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift,” IEEE Transactions on Knowledge and Data Engineering, published online on 6 July 2009.


Can NCL Be Improved Further?

1. We can add a regularisation term explicitly to the ensemble error function and develop a new regularised negative correlation learning (RNCL) algorithm, which has a rigorous Bayesian interpretation[a].

2. Multi-objective evolutionary algorithms can be used to optimise the error, correlation and regularisation[b].

[a] H. Chen and X. Yao, “Regularized Negative Correlation Learning for Neural Network Ensembles,” IEEE Transactions on Neural Networks, 20(12):1962-1979, December 2009.
[b] H. Chen and X. Yao, “Multiobjective Neural Network Ensembles based on Regularized Negative Correlation Learning,” IEEE Transactions on Knowledge and Data Engineering, accepted in September 2009. http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.26


Concluding Remarks

1. Combining evolutionary computation with ensemble learning is beneficial and has great potential, since both emphasise populations.

2. Negative correlation learning (NCL) tries to minimise the covariance term. It has been used in different contexts, including evolutionary learning, multi-objective learning, online learning, etc.

3. Diversity is a key issue in ensemble learning.

4. Ensemble learning is a huge topic. This talk touches upon only a very small part of it, primarily from the evolutionary computation perspective. There are more technical details in the literature comparing negative correlation learning (NCL) with others, e.g., boosting, bagging, mixtures-of-experts, etc.

5. There are studies analysing NCL theoretically, pointing out why and when it works (or not) for regression and classification problems, respectively.


Future Challenges

There are many challenges, e.g.,

• online ensemble learning,

• class imbalance learning using ensembles,

• coping with scalability by using ensembles as automatic problem decomposition approaches[a][b].

[a] P. J. Darwen and X. Yao, “Speciation as automatic categorical modularization,” IEEE Transactions on Evolutionary Computation, 1(2):101-108, 1997.
[b] V. Khare, X. Yao and B. Sendhoff, “Multi-network evolutionary systems and automatic problem decomposition,” International Journal of General Systems, 35(3):259-274, June 2006.