
Recent Advances in Radial Basis Function Networks

Mark J. L. Orr

Institute for Adaptive and Neural Computation

Division of Informatics, Edinburgh University

Edinburgh EH_ _LW, Scotland, UK

June 1999

Abstract

In 1996 an Introduction to Radial Basis Function Networks was published on the web, along with a package of Matlab functions. The emphasis was on the linear character of RBF networks and two techniques borrowed from statistics: forward selection and ridge regression.

This document is an update on developments between 1996 and 1999 and is associated with a second version of the Matlab package. Improvements have been made to the forward selection and ridge regression methods, and a new method, which is a cross between regression trees and RBF networks, has been developed.

Contact: mjo@anc.ed.ac.uk. Downloads: the earlier introduction (www.anc.ed.ac.uk/~mjo/papers/intro.ps) and its Matlab package (www.anc.ed.ac.uk/~mjo/software/rbf.zip); this document (www.anc.ed.ac.uk/~mjo/papers/recad.ps) and the second version of the Matlab package (www.anc.ed.ac.uk/~mjo/software/rbf2.zip).



Contents

1 Introduction
1.1 MacKay's Hermite Polynomial
1.2 Friedman's Simulated Circuit

2 Maximum Marginal Likelihood
2.1 Introduction
2.2 Review
2.3 The EM Algorithm
2.4 The DM Algorithm
2.5 Conclusions

3 Optimising the Size of RBFs
3.1 Introduction
3.2 Review
3.3 Efficient Re-estimation of $\lambda$
3.4 Avoiding Local Minima
3.5 The Optimal RBF Size
3.6 Trial Values in Other Contexts
3.7 Conclusions

4 Regression Trees and RBF Networks
4.1 Introduction
4.2 The Basic Idea
4.3 Generating the Regression Tree
4.4 From Hyperrectangles to RBFs
4.5 Selecting the Subset of RBFs
4.6 The Best Parameter Values
4.7 Demonstrations
4.8 Conclusions

5 Appendix
A Applying the EM Algorithm
B The Eigensystem of $H H^\top$


1 Introduction

In 1996 an introduction to radial basis function (RBF) networks was published on the web [16], along with an associated Matlab software package [17]. The approach taken stressed the linear character of RBF networks, which traditionally have only a single hidden layer, and borrowed techniques from statistics, such as forward selection and ridge regression, as strategies for controlling model complexity, the main challenge facing all methods of nonparametric regression.

That was three years ago. Since then, some improvements have been made, a new algorithm devised, and the package of Matlab functions is now in its second version [19]. This document describes the theory of the new developments and will be of interest to practitioners using the new software package and to theorists enhancing existing methods or developing new ones.

Section 2 describes what happens when the expectation-maximisation algorithm is applied to RBF networks. Section 3 describes a simple procedure for optimising the RBF widths, particularly for ridge regression. Finally, section 4 describes the new algorithm, which uses a regression tree to generate the centres and sizes of a set of candidate RBFs and to help select a subset of these for the network. Two simulated data sets, used for demonstration, are described below.

1.1 MacKay's Hermite Polynomial

The first data set is from [10] and is based on a one-dimensional Hermite polynomial,

$$y = 1 + \left(1 - x + 2x^2\right) e^{-x^2}.$$

Input values are sampled randomly over the range $-4 \le x \le 4$ and Gaussian noise is added to the outputs (figure 1.1).

Figure 1.1: Sample Hermite data (stars) and the actual function (curve), plotted as $y$ against $x$.
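As a concrete illustration, here is a minimal sketch of how such a training set might be generated; the sample count, noise level and random seed are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def hermite(x):
    """Target function y = 1 + (1 - x + 2x^2) * exp(-x^2)."""
    return 1.0 + (1.0 - x + 2.0 * x**2) * np.exp(-x**2)

def make_hermite_data(p=100, noise_std=0.1, seed=0):
    """Sample p inputs uniformly in [-4, 4] and add Gaussian output noise.
    p and noise_std are illustrative choices, not the paper's values."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0, size=p)
    y = hermite(x) + rng.normal(0.0, noise_std, size=p)
    return x, y

x, y = make_hermite_data()
```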


1.2 Friedman's Simulated Circuit

This second data set simulates an alternating current circuit with four parameters: resistance ($R$ ohms), angular frequency ($\omega$ radians per second), inductance ($L$ henries) and capacitance ($C$ farads), in the ranges

$$0 \le R \le 100\,, \qquad 40\pi \le \omega \le 560\pi\,, \qquad 0 \le L \le 1\,, \qquad 1 \times 10^{-6} \le C \le 11 \times 10^{-6}\,.$$

Random samples of the four parameters in these ranges were used to generate corresponding values of the impedance,

$$Z = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^2}\,,$$

to which Gaussian noise was added. This resulted in a training set with four-dimensional inputs $x = [R\ \omega\ L\ C]^\top$ and a scalar output $y = Z$. The problem originates from [6]. Before applying any learning algorithms to this data, the original inputs, with their very different dynamic ranges, are rescaled to the range $[-1, 1]$ in each component.
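A minimal sketch of a generator for this data set follows. The number of samples and the noise level are assumptions for illustration, the rescaling uses the sample extremes, and the parameter ranges are those reconstructed above.

```python
import numpy as np

def friedman_impedance(R, w, L, C):
    """Series-circuit impedance Z = sqrt(R^2 + (wL - 1/(wC))^2)."""
    return np.sqrt(R**2 + (w * L - 1.0 / (w * C))**2)

def make_circuit_data(p=200, noise_std=100.0, seed=0):
    """Sample the four circuit parameters uniformly in their ranges, compute Z
    and add Gaussian noise; p and noise_std are illustrative choices only."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 100.0, p)
    w = rng.uniform(40 * np.pi, 560 * np.pi, p)
    L = rng.uniform(0.0, 1.0, p)
    C = rng.uniform(1e-6, 11e-6, p)
    X = np.column_stack([R, w, L, C])
    y = friedman_impedance(R, w, L, C) + rng.normal(0.0, noise_std, p)
    # rescale each input component to [-1, 1], as described in the text
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = 2.0 * (X - lo) / (hi - lo) - 1.0
    return X, y

X, y = make_circuit_data()
```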


2 Maximum Marginal Likelihood

2.1 Introduction

The expectation-maximisation (EM) algorithm [5, 13] performs maximum likelihood estimation for problems in which some of the variables are unobserved. Recently it has been successfully applied to density estimation [2] and probabilistic principal components [22, 23], for example. This section discusses the application of EM to RBF networks.

First we review the probability model of a linear neural network and arrive at an expression for the marginal likelihood of the data. It is this likelihood which we ultimately want to maximise. Then we show the result of applying the EM algorithm: a pair of re-estimation formulae for the model parameters. However, it turns out that a similar pair of re-estimation formulae can be derived by a simpler method, and that they converge more rapidly than the EM versions. Finally, we draw some conclusions.

2.2 Review

The model estimated by a linear neural network from noisy samples $\{(x_i, y_i)\}_{i=1}^p$ can be written

$$f(x) = \sum_{j=1}^m w_j\, h_j(x)\,, \tag{2.1}$$

where the $\{h_j\}_{j=1}^m$ are fixed basis functions and the $\{w_j\}_{j=1}^m$ are unknown weights (to be estimated). The vector of residual errors between model and data is

$$e = y - H w\,,$$

where $H$ is the design matrix and has elements $H_{ij} = h_j(x_i)$. In a Bayesian approach to analysing the estimation process, the a priori probability of the weights $w$ can be modelled as a Gaussian of variance $\varsigma^2$,

$$p(w) \propto \varsigma^{-m} \exp\!\left(-\frac{w^\top w}{2\,\varsigma^2}\right). \tag{2.2}$$

The conditional probability of the data $y$ given the weights $w$ can also be modelled as a Gaussian, with variance $\sigma^2$, to account for the noise included in the outputs of the training set, $\{y_i\}_{i=1}^p$:

$$p(y \mid w) \propto \sigma^{-p} \exp\!\left(-\frac{e^\top e}{2\,\sigma^2}\right). \tag{2.3}$$

The joint probability of data and weights is the product of $p(w)$ with $p(y \mid w)$ and can be represented as an equivalent cost function by taking logarithms, multiplying by $-2$ and dropping constant terms, to obtain

$$E(y, w) = p \ln \sigma^2 + m \ln \varsigma^2 + \frac{e^\top e}{\sigma^2} + \frac{w^\top w}{\varsigma^2}\,. \tag{2.4}$$


The conditional probability of the weights $w$ given the data $y$ is found using Bayes' rule, again involves the product of (2.2) with (2.3), and is another Gaussian,

$$p(w \mid y) \propto p(y \mid w)\, p(w) \propto |W|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, (w - \hat{w})^\top W^{-1} (w - \hat{w})\right), \tag{2.5}$$

where

$$\hat{w} = A^{-1} H^\top y\,, \qquad W = \sigma^2 A^{-1}\,, \qquad A = H^\top H + \lambda\, I_m\,, \qquad \lambda = \frac{\sigma^2}{\varsigma^2}\,. \tag{2.6}$$

Finally, the marginal likelihood of the data is

$$p(y) = \int p(y \mid w)\, p(w)\, dw \propto \sigma^{-p}\, |P|^{1/2} \exp\!\left(-\frac{y^\top P y}{2\,\sigma^2}\right), \tag{2.7}$$

where

$$P = I_p - H A^{-1} H^\top\,.$$

Note that there is an equivalent cost function for $p(y)$, obtained by taking logarithms, multiplying by $-2$ and dropping the constant terms,

$$E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2}\,. \tag{2.8}$$

2.3 The EM Algorithm

The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an expectation (E) step, which finds the distribution of the unobserved variables, and a maximisation (M) step, which re-estimates the parameters of the model to be those with the maximum likelihood for the observed and missing data combined.

In the context of a linear neural network it is possible to consider the training set $\{(x_i, y_i)\}_{i=1}^p$ as the observed data, the weights $\{w_j\}_{j=1}^m$ as the missing data, and the variance of the noise $\sigma^2$ and the a priori variance of the weights $\varsigma^2$ as the model parameters.

In the E-step, the expectation of the conditional probability of the missing data (2.5) is taken and substituted, in the M-step, into the joint probability of the combined data, or its equivalent cost function (2.4), which is then optimised with respect to the model parameters $\sigma^2$ and $\varsigma^2$. These two steps are guaranteed to increase the marginal probability of the observed data and, when iterated, converge to a local maximum.


Detailed analysis (see appendix A) results in a pair of re-estimation formulae for the parameters $\sigma^2$ and $\varsigma^2$:

$$\sigma^2 \leftarrow \frac{\hat{e}^\top \hat{e} + \gamma\, \sigma^2}{p}\,, \tag{2.9}$$

$$\varsigma^2 \leftarrow \frac{\hat{w}^\top \hat{w} + (m - \gamma)\, \varsigma^2}{m}\,, \tag{2.10}$$

where

$$\hat{e} = y - H \hat{w}\,, \qquad \gamma = m - \lambda\, \mathrm{tr}\, A^{-1}\,.$$

Initial guesses are substituted into the right hand sides, which produce new guesses. The process is repeated until a local minimum of (2.8) is reached.

Note that equation (2.10) was derived in [11] by a free energy approach. It has been shown that free energy and the EM algorithm are intimately connected [13].

Figure 2.1 illustrates with the Hermite data described in section 1.1. A centre of fixed radius was created for each training set input. The figure plots logarithmic contours of (2.8) and the sequence of $\sigma^2$ and $\varsigma^2$ values re-estimated by (2.9, 2.10).

Figure 2.1: Optimisation of $\sigma^2$ and $\varsigma^2$ by EM; axes are $\log\sigma^2$ and $\log\varsigma^2$.

2.4 The DM Algorithm

An alternative approach to minimising (2.8) is simply to differentiate it and set the results to zero. This is easily done and results in the pair of re-estimation formulae

$$\sigma^2 \leftarrow \frac{\hat{e}^\top \hat{e}}{p - \gamma}\,, \tag{2.11}$$

$$\varsigma^2 \leftarrow \frac{\hat{w}^\top \hat{w}}{\gamma}\,. \tag{2.12}$$


I call this method the "DM algorithm", after David MacKay who first derived these equations [10]. Its disadvantage is the absence of any guarantee that the iterations converge, unlike their EM counterparts (2.9, 2.10), which are known to increase the marginal likelihood (or leave it the same if a fixed point has been reached). Any fixed point of DM is also a fixed point of EM, and vice versa, but if there are multiple fixed points there is no guarantee that both methods will converge to the same one, even when starting from the same guess.

Figure 2.2 plots the sequence of re-estimated values using (2.11, 2.12) for the same training set, RBF network and initial values of $\sigma^2$ and $\varsigma^2$ used for figure 2.1. It is apparent that convergence is faster for DM than for EM in this example, taking far fewer iterations. In fact, our empirical observation is that DM always converges considerably faster than EM if they start from the same guess and converge to the same local minimum. Furthermore, DM has never failed to converge.

Figure 2.2: Optimisation of $\sigma^2$ and $\varsigma^2$ by DM; axes are $\log\sigma^2$ and $\log\varsigma^2$.
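To make the two update rules concrete, here is a minimal numpy sketch of one re-estimation loop. The function name, convergence tolerance and initial guesses are illustrative assumptions, and the design matrix H is assumed to have been built already from whatever basis functions are in use.

```python
import numpy as np

def reestimate_variances(H, y, sigma2=1.0, zeta2=1.0, method="DM",
                         max_iter=100, tol=1e-8):
    """Re-estimate the noise variance sigma2 and prior weight variance zeta2.
    method="EM" uses updates (2.9, 2.10); method="DM" uses (2.11, 2.12)."""
    p, m = H.shape
    for _ in range(max_iter):
        lam = sigma2 / zeta2                       # regularisation parameter (2.6)
        A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = A_inv @ H.T @ y                    # posterior mean weights
        e_hat = y - H @ w_hat                      # residuals
        gamma = m - lam * np.trace(A_inv)          # effective number of parameters
        if method == "EM":
            new_sigma2 = (e_hat @ e_hat + gamma * sigma2) / p
            new_zeta2 = (w_hat @ w_hat + (m - gamma) * zeta2) / m
        else:                                      # DM (MacKay) updates
            new_sigma2 = e_hat @ e_hat / (p - gamma)
            new_zeta2 = w_hat @ w_hat / gamma
        converged = (abs(new_sigma2 - sigma2) < tol and
                     abs(new_zeta2 - zeta2) < tol)
        sigma2, zeta2 = new_sigma2, new_zeta2
        if converged:
            break
    return sigma2, zeta2, w_hat
```

Both variants can be run from the same starting point to reproduce comparisons like those in figures 2.1 and 2.2.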

2.5 Conclusions

We started by applying the EM algorithm to RBF networks using the weight-decay (ridge regression) style of penalised likelihood and ended with a pair of re-estimation formulae for the noise variance $\sigma^2$ and the prior weight variance $\varsigma^2$. However, these turned out to be less efficient than a similar pair of formulae which had been known in the literature for some time.

The rbf_rr_2 method in the Matlab software package [19] has an option to use maximum marginal likelihood (MML) as the model selection criterion (instead of GCV or BIC, for example). When this option is selected the regularisation parameter $\lambda$ (2.6) is re-estimated using, by default, the DM equations (2.11, 2.12). Another option can be set so that the EM versions (2.9, 2.10) are used instead.


3 Optimising the Size of RBFs

3.1 Introduction

In previous work [15, 16] we concentrated on methods for optimising the regularisation parameter, $\lambda$, of an RBF network. However, another key parameter is the size of the RBFs, and until now no methods have been provided for its optimisation. This section describes a simple scheme to find an overall scale size for the RBFs in a network.

We first review the basic concepts, already covered elsewhere [16], and then describe an improved version of the re-estimation formula for the regularisation parameter which is considerably more efficient and allows multiple initial guesses for $\lambda$ to be optimised in an effort to avoid getting trapped in local minima (the details are given in appendix B). We then describe a method for choosing the best overall size for the RBFs from a number of trial values, which is rendered tractable by the efficient optimisation of $\lambda$. Finally, we make some concluding remarks.

3.2 Review

In a linear model with fixed basis functions $\{h_j\}_{j=1}^m$ and weights $\{w_j\}_{j=1}^m$,

$$f(x) = \sum_{j=1}^m w_j\, h_j(x)\,, \tag{3.1}$$

the model complexity can be controlled by the addition of a penalty term to the sum of squared errors over the training set, $\{(x_i, y_i)\}_{i=1}^p$. When this combined error,

$$E = \sum_{i=1}^p \left(y_i - f(x_i)\right)^2 + \lambda \sum_{j=1}^m w_j^2\,,$$

is optimised, large components in the weight vector $w$ are inhibited. This kind of penalty is known as ridge regression or weight-decay, and the parameter $\lambda$, which controls the amount of penalty, is known as the regularisation parameter. While the nominal number of free parameters is $m$ (the weights), the effective number is less, due to the penalty term, and is given [12] by

$$\gamma = m - \lambda\, \mathrm{tr}\, A^{-1}\,, \tag{3.2}$$

$$A = H^\top H + \lambda\, I_m\,, \tag{3.3}$$

where $H$ is the design matrix with elements $H_{ij} = h_j(x_i)$. The expression for $\gamma$ is monotonic in $\lambda$, so model complexity can be decreased (or increased) by raising (or lowering) the value of $\lambda$.

The parameter $\lambda$ has a Bayesian interpretation: it is the ratio of $\sigma^2$, the variance of the noise corrupting the training set outputs, to $\varsigma^2$, the a priori variance of the weights (see section 2). If the value of $\lambda$ is known then the optimal weight vector is

$$\hat{w} = A^{-1} H^\top y\,. \tag{3.4}$$


However, neither $\sigma^2$ nor $\varsigma^2$ may be available in a practical situation, so it is usually necessary to establish an effective value for $\lambda$ in parallel with optimising the weights. This may be done with a model selection criterion such as BIC (Bayesian information criterion), GCV (generalised cross-validation) or MML (maximum marginal likelihood, see section 2), and in particular with one or more re-estimation formulae. For GCV the single formula is

$$\lambda \leftarrow \frac{\hat{e}^\top \hat{e}\;\, \mathrm{tr}\!\left(A^{-1} - \lambda A^{-2}\right)}{\hat{w}^\top A^{-1} \hat{w}\; (p - \gamma)}\,, \tag{3.5}$$

where

$$\hat{e} = y - H \hat{w}\,.$$

An initial guess for $\lambda$ is used to evaluate the right hand side of (3.5), which produces a new guess. The resulting sequence of re-estimated values converges to a local minimum of GCV. Each iteration requires the inverse of the $m$-by-$m$ matrix $A$ and therefore costs of order $m^3$ floating point operations.

3.3 Efficient Re-estimation of $\lambda$

The optimisation of $\lambda$ by iteration of the re-estimation formula is burdened by the necessity of computing an expensive matrix inverse every iteration. However, by a reformulation of the individual terms of the equation using the eigenvalues and eigenvectors of $H H^\top$, it is possible to perform most of the work during the first iteration and reuse the results in subsequent ones. Thus the amount of computation required to complete an optimisation which takes $q$ steps to converge is reduced to roughly a fraction $1/q$ of what it would otherwise be. Unfortunately, the technique only works for a single global regularisation parameter [15], not for multiple parameters applying to different groups of weights or to individual weights [14].

Suppose the eigenvalues and eigenvectors of $H H^\top$ are $\{\lambda_i\}_{i=1}^p$ and $\{u_i\}_{i=1}^p$, and that the projections of $y$ onto the eigenvectors are $\hat{y}_i = y^\top u_i$. Then, as shown in appendix B, the four terms involved in the re-estimation formula (3.5) are

$$p - \gamma = \sum_{i=1}^p \frac{\lambda}{\lambda_i + \lambda}\,, \tag{3.6}$$

$$\mathrm{tr}\!\left(A^{-1} - \lambda A^{-2}\right) = \sum_{i=1}^p \frac{\lambda_i}{(\lambda_i + \lambda)^2}\,, \tag{3.7}$$

$$\hat{e}^\top \hat{e} = \sum_{i=1}^p \frac{\lambda^2\, \hat{y}_i^2}{(\lambda_i + \lambda)^2}\,, \tag{3.8}$$

$$\hat{w}^\top A^{-1} \hat{w} = \sum_{i=1}^p \frac{\lambda_i\, \hat{y}_i^2}{(\lambda_i + \lambda)^3}\,. \tag{3.9}$$

If $\lambda$ is re-estimated by computing (3.6)-(3.9), instead of explicitly calculating the inverse in (3.5), then the computational cost of each iteration is only of order $p$ instead of $m^3$.


The overhead of initially calculating the eigensystem, which is of order $p^3$, has to be taken into account but is only incurred once. For problems in which $p$ is not much bigger than $m$ this represents a significant saving in computation time and makes it feasible to optimise multiple guesses for the initial value of $\lambda$ to decrease the chances of getting caught in a local minimum.
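The following numpy sketch implements the eigensystem-based re-estimation (3.6)-(3.9) and, anticipating the next subsection, reuses the one-off eigendecomposition to optimise several initial guesses for lambda. The function names, tolerances and trial guesses are illustrative assumptions rather than the package's actual interface.

```python
import numpy as np

def eigensystem(H, y):
    """One-off O(p^3) work: eigendecomposition of H H^T and projection of y onto it."""
    lam_i, U = np.linalg.eigh(H @ H.T)    # eigenvalues lam_i, eigenvectors (columns of U)
    return lam_i, U.T @ y                 # y_hat_i = u_i^T y

def reestimate_lambda(lam_i, y_hat, lam0, n_iter=100, tol=1e-9):
    """Iterate the GCV re-estimation (3.5) using only the cheap O(p) sums (3.6)-(3.9)."""
    lam = lam0
    for _ in range(n_iter):
        d = lam_i + lam
        new_lam = (np.sum(lam**2 * y_hat**2 / d**2)      # e'e                    (3.8)
                   * np.sum(lam_i / d**2)                # tr(A^-1 - lam A^-2)    (3.7)
                   / (np.sum(lam_i * y_hat**2 / d**3)    # w'A^-1 w               (3.9)
                      * np.sum(lam / d)))                # p - gamma              (3.6)
        converged = abs(new_lam - lam) <= tol * lam
        lam = new_lam
        if converged:
            break
    d = lam_i + lam
    gcv = lam_i.size * np.sum(lam**2 * y_hat**2 / d**2) / np.sum(lam / d)**2
    return lam, gcv                        # GCV = p e'e / (p - gamma)^2

def best_lambda(H, y, guesses=(1e-6, 1e-4, 1e-2, 1.0)):
    """Optimise lambda from several starting guesses (cf. section 3.4) and keep the
    solution with the lowest GCV; the guesses here are arbitrary examples."""
    lam_i, y_hat = eigensystem(H, y)
    return min((reestimate_lambda(lam_i, y_hat, g) for g in guesses),
               key=lambda t: t[1])
```

Only eigensystem() is cubic; each additional starting guess costs a handful of O(p) passes, which is why multiple trial values are cheap once the decomposition is in hand.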

3.4 Avoiding Local Minima

If the initial guess for $\lambda$ is close to a local minimum of GCV (or whatever model selection criterion is employed) then re-estimation using (3.5) is likely to get trapped there. We illustrate using Friedman's data set as described in section 1.2, with an RBF network of Gaussian centres coincident with the inputs of the training set and of fixed radius.

The solid curve in figure 3.1 shows the variation of GCV with $\lambda$. The open circles show a sequence of re-estimated $\lambda$ values with their corresponding GCV scores. The sequence converged (at the closed circle) to a local minimum, and the global minimum, at another value of $\lambda$, was missed.

Figure 3.1: The variation of GCV with $\lambda$ for Friedman's problem ($\log$GCV against $\log\lambda$) and a sequence of re-estimations whose initial guess leads to a local minimum.

Compare figure 3.1 with figure 3.2, where the only change was to use a different initial guess for $\lambda$. This time the guess is sufficiently close to the global minimum that the re-estimations are attracted towards it. Note that the set of eigenvalues and eigenvectors used to compute the sequences in figures 3.1 and 3.2 are identical. Since the calculation of the eigensystem dominates the other computational costs, it is almost as expensive to optimise one trial value as it is to optimise several. Thus, to avoid falling into a local minimum, several trial values spread over a wide range can be optimised and the solution with the lowest GCV selected as the overall winner. This value can then be used to determine the weights (3.4) and ultimately the predictions (3.1) of the network.


Figure 3.2: Same as figure 3.1 except that the initial guess for $\lambda$ is closer to the global minimum.

3.5 The Optimal RBF Size

For Gaussian radial functions of fixed width $r$ the transfer functions of the hidden units are

$$h_j(x) = \exp\!\left(-\frac{(x - c_j)^\top (x - c_j)}{r^2}\right).$$

Unfortunately, there is no re-estimation formula for $r$, as there is for $\lambda$, even in this simple case where the same scale is used for each RBF and each component of the input [16]. To properly optimise the value of $r$ would thus require the use of a nonlinear optimisation algorithm and would have to incorporate the optimisation of $\lambda$ (since the optimal value of $\lambda$ changes as $r$ changes).

An alternative, if rather crude, approach is to test a number of trial values for $r$. For each value an optimal $\lambda$ is calculated (by using the re-estimation method above) and the model selection score noted. When all the values have been checked, the one associated with the lowest score wins. The computational cost of this procedure is dominated, once again, by the cost of computing the eigenvalues and eigenvectors of $H H^\top$, and these have to be calculated separately for each value of $r$.

While this procedure is less computationally demanding than a full nonlinear optimisation of $r$ and $\lambda$, its drawback is that it is only capable of identifying the best value of $r$ from a finite number of alternatives. On the other hand, given that the value of $\lambda$ is fully optimised and that the model selection criteria are heuristic (in other words, approximate) in nature, it is arguable that a more precise location for the optimal value of $r$ is unlikely to have much practical significance.
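A minimal sketch of this trial-value loop, under the assumption of Gaussian RBFs centred on the training inputs, is given below. The helper names are illustrative; the default trial radii mirror the values used in the demonstration that follows, and lambda is optimised with the eigensystem-based re-estimation of section 3.3.

```python
import numpy as np

def design_matrix(X, centres, r):
    """Gaussian design matrix H_ij = exp(-||x_i - c_j||^2 / r^2)."""
    d2 = ((X[:, None, :] - centres[None, :, :])**2).sum(axis=2)
    return np.exp(-d2 / r**2)

def optimise_lambda_gcv(H, y, lam=1e-2, n_iter=50):
    """Re-estimate lambda by (3.5) using the eigensystem terms (3.6)-(3.9),
    and return the optimised lambda together with its GCV score."""
    p = H.shape[0]
    lam_i, U = np.linalg.eigh(H @ H.T)
    y_hat = U.T @ y
    for _ in range(n_iter):
        d = lam_i + lam
        lam = (np.sum(lam**2 * y_hat**2 / d**2) * np.sum(lam_i / d**2)
               / (np.sum(lam_i * y_hat**2 / d**3) * np.sum(lam / d)))
    d = lam_i + lam
    gcv = p * np.sum(lam**2 * y_hat**2 / d**2) / np.sum(lam / d)**2
    return lam, gcv

def best_radius(X, y, trial_radii=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6)):
    """Try each trial radius, optimise lambda for it, and keep the lowest GCV."""
    results = []
    for r in trial_radii:
        H = design_matrix(X, X, r)        # one centre per training input
        lam, gcv = optimise_lambda_gcv(H, y)
        results.append((gcv, r, lam))
    gcv, r, lam = min(results)
    return r, lam, gcv
```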

We illustrate the method on the Hermite data described in section 1.1. Once again we use each training set input as an RBF centre. We tried seven different trial values for $r$: 0.4, 0.6, 0.8, 1.0, 1.2, 1.4 and 1.6. For each trial value we plot, in figure 3.3, the variation of GCV with $\lambda$ (the curves), as well as the optimal $\lambda$ (the closed circles) found by re-estimation as described above.


Figure 3.3: The Hermite data set with four sizes of RBFs ($\log$GCV against $\log\lambda$; the curve labels give the trial radii, 0.4-1.6).

The radius value which led to the lowest GCV score was one of the intermediate trial values, with its corresponding optimal regularisation parameter marked by the closed circle.

Initially, as $r$ increases from its lowest trial value, the GCV score at the optimum $\lambda$ decreases. Eventually it reaches its lowest value at an intermediate radius. Above that there is not much increase in optimised GCV, although the optimal $\lambda$ decreases rapidly.

3.6 Trial Values in Other Contexts

The use of trial values is limited to cases where there is a small number of parameters to optimise, such as the single parameter $r$. If there are several parameters with trial values then the number of different combinations to evaluate can easily become prohibitively large. In RBF networks where there is a separate scale parameter for each dimension, so that the transfer functions are, for example in the case of Gaussians,

$$h_j(x) = \exp\!\left(-\sum_{k=1}^n \frac{(x_k - c_{jk})^2}{r_{jk}^2}\right),$$

there would be $t^{mn}$ combinations to check, where $t$ is the number of trial values for each $r_{jk}$, $m$ is the number of basis functions and $n$ the number of dimensions. However, it is possible to test trial values for an overall scale size $\alpha$ if some other mechanism can be used to generate the scales $r_{jk}$. Here, the transfer functions are

$$h_j(x) = \exp\!\left(-\sum_{k=1}^n \frac{(x_k - c_{jk})^2}{\alpha^2\, r_{jk}^2}\right).$$

This is the approach taken for the method of section 4, where a regression tree determines the values of $r_{jk}$ but the overall scale size $\alpha$ is optimised by testing trial values.
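As a small worked example of the second form of transfer function, the sketch below evaluates Gaussian RBFs with per-dimension radii scaled by a single overall factor; the symbol alpha follows the reconstruction above, and the array shapes are assumptions.

```python
import numpy as np

def design_matrix_aniso(X, centres, radii, alpha):
    """H_ij = exp(-sum_k (x_ik - c_jk)^2 / (alpha^2 * r_jk^2)).

    X: (p, n) inputs; centres: (m, n); radii: (m, n) per-dimension scales r_jk."""
    diff = X[:, None, :] - centres[None, :, :]           # (p, m, n)
    scaled = (diff / (alpha * radii[None, :, :]))**2     # divide by alpha * r_jk
    return np.exp(-scaled.sum(axis=2))                   # (p, m) design matrix
```

With this in place, alpha can be treated exactly like $r$ in the previous subsection: build the design matrix for each trial alpha and compare the optimised model selection scores.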


3.7 Conclusions

We have shown how trial values for the overall size of the RBFs can be compared using a model selection criterion. In the case of ridge regression, an efficient method for optimising the regularisation parameter helps reduce the computational burden of training a separate network for each trial value. However, the same technique can also be used with other methods of complexity control, including those in which there is no regularisation.

In the Matlab software package [19] each method can be configured with a set of trial values for the overall RBF scale. The best value is chosen and used to generate the RBF network which the Matlab function returns.


4 Regression Trees and RBF Networks

4.1 Introduction

This section is about a novel method for nonparametric regression involving a combination of regression trees and RBF networks [8]. The basic idea of a regression tree is to recursively partition the input space in two and approximate the function in each half by the average output value of the samples it contains [4]. Each split is parallel to one of the axes, so it can be expressed by an inequality involving one of the input components (e.g. $x_k > b$). The input space is thus divided into hyperrectangles organised into a binary tree, where each branch is determined by the dimension ($k$) and boundary ($b$) which together minimise the residual error between model and data.

A benefit of regression trees is the information provided in the split statistics about the relevance of each input variable: the components which carry the most information about the output tend to be split earliest and most often. A weakness of regression trees is the discontinuous model caused by the output value jumping across the boundary between two hyperrectangles. There is also the problem of deciding when to stop growing the tree (or, equivalently, how much to prune it after it has fully grown), which is the familiar bias-variance dilemma faced by all methods of nonparametric regression [7]. The use of radial basis functions in conjunction with regression trees can help to solve both these problems.

Below we outline the basic method of combining RBFs and regression trees as it appeared originally, describe our version of this idea and why we think it is an improvement, and finally show some results and summarise our conclusions.

4.2 The Basic Idea

The combination of trees and RBF networks was first suggested by [9] in the context of classification rather than regression (though the two cases are very similar). Further elaboration of the idea appeared in [8]. Essentially, each terminal node of the classification tree contributes one hidden unit to the RBF network, the centre and radius of which are determined by the position and size of the corresponding hyperrectangle. Thus the tree sets the number, positions and sizes of all RBFs in the network. Model complexity is controlled by two parameters: the pruning confidence $-c$ of C4.5 [20] (the software package used by [8] to generate classification trees), which determines the amount of tree pruning, and a scaling parameter which fixes the size of RBFs relative to hyperrectangles.

Our major reservation about the approach taken by [8] is the treatment of model complexity. In the case of the scaling parameter, the author claimed it had little effect on prediction accuracy, but this is not in accord with our previous experience of RBF networks. As for the amount of pruning, he demonstrated its effect on prediction accuracy yet used a fixed value in his benchmark tests. Moreover, there was no discussion of how to control scaling and pruning to optimise model complexity for a given data set.


Our method is a variation on Kubat's with the following alterations.

1. We address the model complexity issue by using the nodes of the regression tree not to fix the RBF network but rather to generate a set of candidate RBFs from which the final network can be selected. Thus the burden of controlling model complexity shifts from tree generation to RBF selection.

2. The regression tree from which the RBFs are produced can also be used to order selections, such that certain candidate RBFs are allowed to enter the model before others. We describe one way to achieve such an ordering and demonstrate that it produces more accurate models than plain forward selection.

3. We show that, contrary to the conclusions of [8], the method is typically quite sensitive to the scaling parameter $\alpha$, and discuss its optimisation by the use of multiple trial values.

4.3 Generating the Regression Tree

The first stage of our method (and Kubat's) is to generate a regression tree. The root node of the tree is the smallest hyperrectangle which contains all the training set inputs, $\{x_i\}_{i=1}^p$. Its size $s_k$ (the half-width) and centre $c_k$ in each dimension $k$ are

$$s_k = \frac{1}{2}\left(\max_{i \in S} x_{ik} - \min_{i \in S} x_{ik}\right),$$

$$c_k = \frac{1}{2}\left(\max_{i \in S} x_{ik} + \min_{i \in S} x_{ik}\right),$$

where $S = \{1, 2, \ldots, p\}$ is the set of training set indices. A split of the root node divides the training samples into left and right subsets, $S_L$ and $S_R$, on either side of a boundary $b$ in one of the dimensions $k$, such that

$$S_L = \{i : x_{ik} \le b\}\,, \qquad S_R = \{i : x_{ik} > b\}\,.$$

The mean output value on either side of the split is

$$\bar{y}_L = \frac{1}{p_L} \sum_{i \in S_L} y_i\,, \qquad \bar{y}_R = \frac{1}{p_R} \sum_{i \in S_R} y_i\,,$$

where $p_L$ and $p_R$ are the number of samples in each subset. The residual square error between model and data is then

$$E(k, b) = \frac{1}{p}\left(\sum_{i \in S_L} \left(y_i - \bar{y}_L\right)^2 + \sum_{i \in S_R} \left(y_i - \bar{y}_R\right)^2\right).$$


The split which minimises $E(k, b)$ over all possible choices of $k$ and $b$ is used to create the children of the root node and is easily found by discrete search over the $n$ dimensions and $p$ cases. The children of the root node are split recursively in the same manner, and the process terminates when a node cannot be split without creating a child containing fewer samples than a given minimum, $p_{\min}$, which is a parameter of the method. Compared to their parent node, the child centres are shifted and their sizes reduced in the $k$-th dimension.

Since the size of the regression tree does not determine the model complexity, there is no need to perform the final pruning step normally associated with recursive splitting methods [4, 6, 20].
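A minimal sketch of this tree-growing step follows. The node representation, the function names, and the choice of candidate boundaries (midway between consecutive sorted input values) are implementation assumptions rather than details taken from the text.

```python
import numpy as np

class Node:
    """A hyperrectangle node: centre and half-width per dimension, plus children."""
    def __init__(self, idx, centre, size):
        self.idx, self.centre, self.size = idx, centre, size
        self.left = self.right = None

def best_split(X, y, idx, p_min):
    """Find the (dimension, boundary) minimising E(k, b) among splits that leave at
    least p_min samples on each side; return None if no such split exists.
    (The constant 1/p factor in E(k, b) is omitted; it does not affect the argmin.)"""
    best = None
    for k in range(X.shape[1]):
        xs = np.unique(X[idx, k])
        for b in (xs[:-1] + xs[1:]) / 2.0:        # candidate boundaries
            left, right = idx[X[idx, k] <= b], idx[X[idx, k] > b]
            if len(left) < p_min or len(right) < p_min:
                continue
            err = (np.sum((y[left] - y[left].mean())**2) +
                   np.sum((y[right] - y[right].mean())**2))
            if best is None or err < best[0]:
                best = (err, k, b, left, right)
    return best

def grow_tree(X, y, idx=None, p_min=5):
    """Recursively split nodes until no split can respect the p_min constraint."""
    if idx is None:
        idx = np.arange(len(y))
    lo, hi = X[idx].min(axis=0), X[idx].max(axis=0)
    node = Node(idx, centre=(hi + lo) / 2.0, size=(hi - lo) / 2.0)
    split = best_split(X, y, idx, p_min)
    if split is not None:
        _, k, b, left, right = split
        node.left = grow_tree(X, y, left, p_min)
        node.right = grow_tree(X, y, right, p_min)
    return node
```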

4.4 From Hyperrectangles to RBFs

The regression tree contains a root node, some nonterminal nodes (having children) and some terminal nodes (having no children). Each node is associated with a hyperrectangle of input space having a centre $c$ and size $s$ as described above. The node corresponding to the largest hyperrectangle is the root node, and the node sizes decrease down the tree as the hyperrectangles are divided into smaller and smaller pieces. To translate a hyperrectangle into a Gaussian RBF we use its centre $c$ as the RBF centre and its size $s$ scaled by a parameter $\alpha$ as the RBF radius, $r = \alpha\, s$. The scalar $\alpha$ has the same value for all nodes and is another parameter of the method (in addition to $p_{\min}$). Our $\alpha$ is not quite the same as Kubat's scaling parameter (they are related by an inverse and a factor of $\sqrt{2}$) but plays exactly the same role.
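Continuing the sketch above, the conversion from tree nodes to candidate Gaussian RBFs takes only a few lines. It assumes node objects with centre and size attributes (as in the hypothetical Node class of the previous sketch), and the resulting centres and radii can be fed to the anisotropic design matrix of section 3.6; r = alpha * s follows the text.

```python
import numpy as np

def collect_nodes(node, out=None):
    """Walk the tree and gather every node's centre and size (root first)."""
    if out is None:
        out = []
    out.append((node.centre, node.size))
    if node.left is not None:
        collect_nodes(node.left, out)
        collect_nodes(node.right, out)
    return out

def nodes_to_rbfs(nodes, alpha):
    """Each hyperrectangle becomes a Gaussian RBF with centre c and radii r = alpha * s.
    (In practice a degenerate zero size in some dimension would need a small floor.)"""
    centres = np.array([c for c, _ in nodes])
    radii = alpha * np.array([s for _, s in nodes])
    return centres, radii
```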

4.5 Selecting the Subset of RBFs

After the tree nodes are translated into RBFs, the next step of our method is to select a subset of them for inclusion in the model. This is in contrast to the method of [8], where all RBFs from terminal nodes were included in the model, which was thus heavily dependent on the extent of tree pruning to control model complexity. Selection can be performed either with a standard method such as forward selection [15, 16] or in a novel way, by employing the tree to guide the order in which candidate RBFs are considered.

In the standard methods for subset selection, the RBFs generated from the regression tree are treated as an unstructured collection with no distinction between RBFs corresponding to different nodes in the tree. However, intuition suggests that the best order in which to consider RBFs for inclusion in the model is large ones first and small ones last, so as to synthesise coarse structure before fine details. This, in turn, suggests searching for RBF candidates by traversing the tree from the largest hyperrectangle (and RBF) at the root to the smallest hyperrectangles (and RBFs) at the terminal nodes. Thus the first decision should be whether to include the root node in the model, the second whether to include any of the children of the root node, and so on, until the terminal nodes are reached.

The scheme we eventually developed for selecting RBFs goes somewhat beyond this simple picture and was influenced by two other considerations. The first concerns a classic problem with forward selection, namely that one regressor can block the selection of other, more explanatory regressors which would have been chosen in preference had they been considered first.


In our case there was a danger that a parent RBF could block its own children. To avoid this situation, when considering whether to add the children of a node which had already been selected, we also considered the effect of deleting the parent. Thus our method has a measure of backward elimination as well as forward selection. This is reminiscent of the selection schemes developed for the MARS [6] and MAPS [1] algorithms.

A second reason for departing from a simple breadth-first search is that the size of a hyperrectangle (in terms of volume) on one level is not guaranteed to be smaller than the size of all the hyperrectangles in the level above (only its parent), so it is not easy to achieve a strict largest-to-smallest ordering. In view of this, we abandoned any attempt at a strict ordering and instead devised a search algorithm which dynamically adjusts the set of selectable RBFs by replacing selected RBFs with their children.

The algorithm depends on the concept of an active list of nodes. At any given moment during the selection process only these nodes and their children are considered for inclusion in or exclusion from the model. Every time RBFs are added to or subtracted from the model the active list expands by having a node replaced by its children. Eventually the active list becomes coincident with the terminal nodes and the search is terminated. In detail, the steps of the algorithm are as follows.

1. Initialise the active list with the root node and the model with the root node's RBF.

2. For all nonterminal nodes on the active list, consider the effect (on the model selection criterion) of adding both or just one of the children's RBFs (three possible modifications to the model). If the parent's RBF is already in the model, also consider the effect of first removing it before adding one or both children's RBFs, or of just removing it (a further four possible modifications).

3. The total number of possible adjustments to the model is therefore somewhere between three and seven times the number of active nonterminal nodes, depending on how many of their RBFs are already in the model. From all these possibilities choose the one which most decreases the model selection criterion. Update the current model and remove the node involved from the active list, replacing it with its children. If none of the modifications decrease the selection criterion, then choose one of the active nodes at random and replace it by its children, but leave the model unaltered.

4. Return to step 2 and repeat until all the active nodes are terminal nodes.

Once the selection process has terminated, the network weights can be calculated in the usual way by solving the normal equation,

$$w = \left(H^\top H\right)^{-1} H^\top y\,,$$

where $H$ is the design matrix of the selected RBFs. There is no need for a regularisation term, as appears in equations (2.6) and (3.4) for example, because model complexity is limited by the selection process.
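The tree-guided active-list search is somewhat involved, so the sketch below shows only the simpler baseline mentioned above: plain forward selection from the candidate pool using BIC as the selection criterion, followed by the unregularised normal-equation solve. The function names, the particular BIC form used and the stopping rule are assumptions, not the package's implementation.

```python
import numpy as np

def bic(H_sel, y):
    """BIC for an unregularised linear fit on the selected columns:
    p*ln(e'e/p) + k*ln(p), with k the number of selected RBFs (an assumed form)."""
    p, k = H_sel.shape
    w, *_ = np.linalg.lstsq(H_sel, y, rcond=None)   # normal-equation solution
    e = y - H_sel @ w
    return p * np.log(e @ e / p) + k * np.log(p), w

def forward_select(H, y, max_units=None):
    """Greedy forward selection over the candidate design matrix H (p x m):
    keep adding the column that lowers BIC most, stop when no column helps."""
    p, m = H.shape
    selected, best_score, best_w = [], np.inf, None
    while len(selected) < (max_units or m):
        scores = []
        for j in range(m):
            if j in selected:
                continue
            score, w = bic(H[:, selected + [j]], y)
            scores.append((score, j, w))
        if not scores:
            break
        score, j, w = min(scores, key=lambda t: t[0])
        if score >= best_score:
            break                    # no remaining candidate improves BIC
        selected.append(j)
        best_score, best_w = score, w
    return selected, best_w
```

The tree-guided variant described above differs only in which candidate modifications are offered at each step (the active list and its children, with optional removal of an already-selected parent), not in how each modification is scored.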


4.6 The Best Parameter Values

Our method has three main parameters: the model selection criterion; $p_{\min}$, which controls the depth of the regression tree; and $\alpha$, which determines the relative size between hyperrectangles and RBFs.

For the model selection criterion we found that the more conservative BIC, which tends to produce more parsimonious models, rarely performed worse than GCV and often did significantly better. This is in line with the experiences of other practitioners of algorithms based on subset selection, such as [6], who modified GCV to make it more conservative, and [1], who also found BIC gave better results than GCV.

For $p_{\min}$ and $\alpha$ we use the simple method of comparing the model selection scores of a number of trial values, as for the RBF widths in section 3. This means growing several trees (one for each trial value of $p_{\min}$) and then, for each tree, selecting models from several sets of RBFs (one for each value of $\alpha$). The cost is extra computation: the more trial values there are, the longer the algorithm takes to search through them. However, the basic algorithm is not unduly expensive and, if the number of trial values for each parameter is kept fairly low, the computation time is acceptable.

4.7 Demonstrations

Figure 4.1 shows the prediction of a pure regression tree for a sample of Hermite data (section 1.1). For clarity, the samples themselves are not shown, just the target function and the prediction. Of course, the model is discontinuous, and each horizontal section corresponds to one terminal node of the tree.

Figure 4.1: A pure regression tree prediction on the Hermite data (target and prediction against $x$).


The tree which produced the prediction shown in figure 4.1 was grown until further splitting would have violated the minimum number of samples allowed per node ($p_{\min}$). There was no pruning or any other sophisticated form of complexity control, so this kind of tree is not suitable for practical use as a prediction method. However, in our method the tree is only used to create RBF candidates; model complexity is controlled by a separate process which selects a subset of RBFs for the network.

Figure 4.2 shows the prediction of the combined method on the same data set used in figure 4.1 after a subset of RBFs was selected from the pool of candidates generated by the tree nodes. Now the model is continuous and its complexity is well matched to the data.

Figure 4.2: The combined method on the Hermite data (target and prediction against $x$).

As a last demonstration we turn our attention to Friedman's data set (section 1.2). In experiments with the MARS algorithm [6], Friedman estimated the accuracy of his method by replicating data sets and computing the mean and standard deviation of the scaled sum-square-error; his best results corresponded to the most favourable values for the parameters of the MARS method.

To compare our algorithm with MARS, and also to test the effect of using multiple trial values for our method's parameters, $p_{\min}$ and $\alpha$, we conducted a similar experiment. Before we started, we tried some different settings for the trial values and identified one which gave good results on test data. Then, for each of the replications, we applied the method twice. In the first run we used the trial values we had discovered earlier. In the second run we used only a single "best" value for each parameter, the average of the trial values, forcing this value to be used for every replicated data set. The results are shown in table 1.

It is apparent that the results are practically identical to MARS when the full sets of trial values are used but significantly inferior when only single "best" values are used.


Table 1: Results on replications of Friedman's data set: the scaled error (mean plus or minus standard deviation) for the run using the full sets of trial values of $p_{\min}$ and $\alpha$, and for the run using single fixed values.

In another test using replicated data sets we compared the two alternative methods of selecting the RBFs from the candidates generated by the tree: standard forward selection, or the method described in section 4.5 which uses the tree to guide the order in which candidates are considered. This was the only difference between the two runs; the model parameters were the same as in the first row of table 1. The performance of tree-guided selection was as in table 1, but forward selection was significantly worse.

4.8 Conclusions

We have described a method for nonparametric regression based on combining regression trees and radial basis function networks. The method is similar to [8] and has the same advantages (a continuous model and automatic relevance determination) but also some significant improvements. The main enhancement is the addition of an automatic method for the control of model complexity through the selection of RBFs. We have also developed a novel procedure for selecting the RBFs based on the structure of the tree.

We have presented evidence that the method is comparable in performance to the well known MARS algorithm and that some of its novel features (trial parameter values, tree-guided selection) are actually beneficial. More detailed evaluations with DELVE [21] data sets are in preparation and preliminary results support these conclusions.

The Matlab software package [19] has two implementations of the method. One function, rbf_rt_1, uses tree-guided selection, while the other, rbf_rt_2, uses forward selection. The operation of each function is described, with examples, in a comprehensive manual.


5 Appendix

A Applying the EM Algorithm

We want to maximise the marginal probability of the observed data (2.7) by substituting expectations of the conditional probability of the unobserved data (2.5) into the cost function for the joint probability of the combined data (2.4), and minimising this with respect to the parameters $\sigma^2$ (the noise variance) and $\varsigma^2$ (the a priori weight variance).

From (2.5), $\langle w \rangle = \hat{w}$ and $\langle (w - \hat{w})(w - \hat{w})^\top \rangle = W = \sigma^2 A^{-1}$. The expectation of $w^\top w$ is then

$$\langle w^\top w \rangle = \mathrm{tr}\, \langle w w^\top \rangle = \hat{w}^\top \hat{w} + \mathrm{tr}\, \langle w w^\top - \hat{w} \hat{w}^\top \rangle = \hat{w}^\top \hat{w} + \mathrm{tr}\, \langle (w - \hat{w})(w - \hat{w})^\top \rangle = \hat{w}^\top \hat{w} + \sigma^2\, \mathrm{tr}\, A^{-1} = \hat{w}^\top \hat{w} + \varsigma^2 (m - \gamma)\,. \tag{A.1}$$

The last step follows from $\gamma = m - \lambda\, \mathrm{tr}\, A^{-1}$ (the effective number of parameters) and $\lambda = \sigma^2/\varsigma^2$ (the regularisation parameter). Similarly,

$$\langle e^\top e \rangle = \mathrm{tr}\, \langle e e^\top \rangle = \hat{e}^\top \hat{e} + \mathrm{tr}\, \langle e e^\top - \hat{e} \hat{e}^\top \rangle = \hat{e}^\top \hat{e} + \mathrm{tr}\!\left(H \langle w w^\top - \hat{w} \hat{w}^\top \rangle H^\top\right) = \hat{e}^\top \hat{e} + \sigma^2\, \mathrm{tr}\!\left(H A^{-1} H^\top\right) = \hat{e}^\top \hat{e} + \sigma^2 \gamma\,, \tag{A.2}$$

since $e = y - H w$ is linear in $w$ and $\mathrm{tr}\!\left(H A^{-1} H^\top\right)$ is another expression for the effective number of parameters $\gamma$.

Equations (A.1, A.2) summarise the expectation of the conditional probability for $w$ and can be substituted into the joint probability of the combined data, or the equivalent cost function (2.4), so that the resulting expression can be optimised with respect to $\sigma^2$ and $\varsigma^2$. Note that in (A.1, A.2) these parameters are held constant at their old values; only the explicit occurrences of $\sigma^2$ and $\varsigma^2$ in (2.4) are varied in the optimisation.

After differentiating (2.4) with respect to $\sigma^2$ and $\varsigma^2$, equating the results to zero, and finally substituting the expectations (A.1, A.2), we get the re-estimation formulae

$$\sigma^2 \leftarrow \frac{\hat{e}^\top \hat{e} + \gamma\, \sigma^2}{p}\,, \qquad \varsigma^2 \leftarrow \frac{\hat{w}^\top \hat{w} + (m - \gamma)\, \varsigma^2}{m}\,.$$


B The Eigensystem of $H H^\top$

We want to derive expressions for each of the terms in (3.5) using the eigenvalues and eigenvectors of $H H^\top$. We start with a singular value decomposition of the design matrix, $H = U S V^\top$, where $U = [u_1\ u_2\ \cdots\ u_p] \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{m \times m}$ are orthogonal and $S \in \mathbb{R}^{p \times m}$,

$$S = \begin{bmatrix} \sqrt{\lambda_1} & & \\ & \ddots & \\ & & \sqrt{\lambda_m} \\ & \mathbf{0} & \end{bmatrix},$$

contains the singular values, $\{\sqrt{\lambda_i}\}$. Note that, due to the orthogonality of $V$,

$$H H^\top = U S S^\top U^\top = \sum_{i=1}^p \lambda_i\, u_i\, u_i^\top\,,$$

so the $\lambda_i$ are the eigenvalues, and the $u_i$ the eigenvectors, of the matrix $H H^\top$. The eigenvalues are non-negative and, we assume, ordered from largest to smallest, so that if $p > m$ then $\lambda_i = 0$ for $i > m$. The eigenvectors are orthonormal ($u_i^\top u_{i'} = \delta_{i i'}$).

We want to derive expressions for the terms in (3.5) using just these eigenvalues and eigenvectors. As a preliminary step, we derive some more basic relations. First, the matrix inverse involved in each re-estimation is

$$A^{-1} = \left(H^\top H + \lambda\, I_m\right)^{-1} = \left(V S^\top S V^\top + \lambda\, V V^\top\right)^{-1} = V \left(S^\top S + \lambda\, I_m\right)^{-1} V^\top\,. \tag{B.1}$$

Note that the second step would have been impossible if the regularisation term, $\lambda\, I_m$, had not been proportional to the identity matrix, which is where the analysis breaks down in the case of multiple regularisation parameters. Secondly, the optimal weight vector is

$$\hat{w} = A^{-1} H^\top y = V \left(S^\top S + \lambda\, I_m\right)^{-1} S^\top U^\top y = V \left(S^\top S + \lambda\, I_m\right)^{-1} S^\top \hat{y}\,, \tag{B.2}$$

where $\hat{y} = U^\top y$ is the vector of projections of $y$ onto the eigenbasis $U$.


Thirdly, from (B.1) we can further derive

$$\gamma = m - \lambda\, \mathrm{tr}\, A^{-1} = m - \lambda\, \mathrm{tr}\!\left[V \left(S^\top S + \lambda I_m\right)^{-1} V^\top\right] = m - \lambda\, \mathrm{tr}\!\left(S^\top S + \lambda I_m\right)^{-1} = m - \sum_{j=1}^m \frac{\lambda}{\lambda_j + \lambda} = \sum_{j=1}^m \frac{\lambda_j}{\lambda_j + \lambda} = \sum_{i=1}^p \frac{\lambda_i}{\lambda_i + \lambda}\,. \tag{B.3}$$

Here we have assumed $p \ge m$, so the last step follows (for $\lambda > 0$) because if $p > m$ then the last $p - m$ eigenvalues are zero. However, the conclusion is also true if $p < m$, since in that case the last $m - p$ singular values are annihilated in the product $S^\top S$.

Fourthly, and last of the preliminary calculations, the vector of residual errors is

$$\hat{e} = y - H \hat{w} = \left[I_p - U S \left(S^\top S + \lambda I_m\right)^{-1} S^\top U^\top\right] y = U \left[I_p - S \left(S^\top S + \lambda I_m\right)^{-1} S^\top\right] \hat{y}\,. \tag{B.4}$$

Now we are ready to tackle the terms in (3.5). From (B.3) we have

$$p - \gamma = p - \sum_{i=1}^p \frac{\lambda_i}{\lambda_i + \lambda} = \sum_{i=1}^p \frac{\lambda}{\lambda_i + \lambda}\,. \tag{B.5}$$

From (B.1), and a set of steps similar to the derivation of (B.3), it follows that

$$\mathrm{tr}\, A^{-1} - \lambda\, \mathrm{tr}\, A^{-2} = \sum_{j=1}^m \frac{1}{\lambda_j + \lambda} - \sum_{j=1}^m \frac{\lambda}{(\lambda_j + \lambda)^2} = \sum_{j=1}^m \frac{\lambda_j}{(\lambda_j + \lambda)^2} = \sum_{i=1}^p \frac{\lambda_i}{(\lambda_i + \lambda)^2}\,. \tag{B.6}$$


The last step follows in a similar way to the last step of (B.3). Next we tackle the term $\hat{w}^\top A^{-1} \hat{w}$. From (B.1) and (B.2) we get

$$\hat{w}^\top A^{-1} \hat{w} = \hat{y}^\top S \left(S^\top S + \lambda I_m\right)^{-3} S^\top \hat{y} = \sum_{i=1}^p \frac{\lambda_i\, \hat{y}_i^2}{(\lambda_i + \lambda)^3}\,. \tag{B.7}$$

The sum of squared residual errors is, from (B.4),

$$\hat{e}^\top \hat{e} = \hat{y}^\top \left[I_p - S \left(S^\top S + \lambda I_m\right)^{-1} S^\top\right]^2 \hat{y} = \sum_{j=1}^m \frac{\lambda^2\, \hat{y}_j^2}{(\lambda_j + \lambda)^2} + \sum_{i=m+1}^p \hat{y}_i^2 = \sum_{i=1}^p \frac{\lambda^2\, \hat{y}_i^2}{(\lambda_i + \lambda)^2}\,. \tag{B.8}$$

For this derivation we assumed that $p \ge m$ but, for reasons similar to those stated for the derivation of (B.3), the result is also true for $p < m$.

Equations (B.5)-(B.8) express each of the four terms in (3.5) using the eigenvalues and eigenvectors of $H H^\top$, which was our main goal in this appendix. Other useful expressions involving the eigensystem of $H H^\top$ are

$$\ln |P| = \sum_{i=1}^p \ln\!\left(\frac{\sigma^2}{\varsigma^2 \lambda_i + \sigma^2}\right) = p \ln \sigma^2 - \sum_{i=1}^p \ln\!\left(\varsigma^2 \lambda_i + \sigma^2\right),$$

$$y^\top P y = \sum_{i=1}^p \frac{\sigma^2\, \hat{y}_i^2}{\varsigma^2 \lambda_i + \sigma^2}\,,$$

where $P = I_p - H A^{-1} H^\top$, $\sigma^2$ is the noise variance and $\varsigma^2$ is the a priori variance of the weights (see section 2). For example, if these expressions are substituted in equation (2.8) for the cost function associated with the marginal likelihood of the data, the two $p \ln \sigma^2$ terms cancel, leaving

$$E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2} = \sum_{i=1}^p \ln\!\left(\varsigma^2 \lambda_i + \sigma^2\right) + \sum_{i=1}^p \frac{\hat{y}_i^2}{\varsigma^2 \lambda_i + \sigma^2}\,.$$
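These identities are easy to check numerically. The short sketch below builds a random design matrix and verifies (B.3) and (B.5)-(B.8), plus the $\ln|P|$ and $y^\top P y$ expressions, against their direct matrix forms; the random data, sizes and parameter values are of course arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, lam = 30, 8, 0.1                      # lam is the regularisation parameter
H = rng.normal(size=(p, m))
y = rng.normal(size=p)

# direct matrix quantities
A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
w_hat = A_inv @ H.T @ y
e_hat = y - H @ w_hat
gamma = m - lam * np.trace(A_inv)
P = np.eye(p) - H @ A_inv @ H.T

# eigensystem of H H^T and projections of y
lam_i, U = np.linalg.eigh(H @ H.T)
y_hat = U.T @ y
d = lam_i + lam

assert np.isclose(gamma, np.sum(lam_i / d))                                # (B.3)
assert np.isclose(p - gamma, np.sum(lam / d))                              # (B.5)
assert np.isclose(np.trace(A_inv) - lam * np.trace(A_inv @ A_inv),
                  np.sum(lam_i / d**2))                                    # (B.6)
assert np.isclose(w_hat @ A_inv @ w_hat, np.sum(lam_i * y_hat**2 / d**3))  # (B.7)
assert np.isclose(e_hat @ e_hat, np.sum(lam**2 * y_hat**2 / d**2))         # (B.8)

# marginal-likelihood pieces, with sigma2 and zeta2 chosen so that lam = sigma2/zeta2
sigma2, zeta2 = 0.05, 0.5
assert np.isclose(sigma2 / zeta2, lam)
assert np.isclose(np.linalg.slogdet(P)[1],
                  np.sum(np.log(sigma2 / (zeta2 * lam_i + sigma2))))       # ln|P|
assert np.isclose(y @ P @ y,
                  np.sum(sigma2 * y_hat**2 / (zeta2 * lam_i + sigma2)))    # y'Py
```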


References

[1] A.R. Barron and X. Xiao. Discussion of "Multivariate adaptive regression splines" by J.H. Friedman. Annals of Statistics, 19(1), 1991.

[2] C.M. Bishop, M. Svensén, and C.K.I. Williams. EM optimization of latent-variable density models. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.

[3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39(1):1-38, 1977.

[6] J.H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19(1):1-141, 1991.

[7] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[8] M. Kubat. Decision trees can initialize radial-basis function networks. IEEE Transactions on Neural Networks, 9(5):813-821, 1998.

[9] M. Kubat and I. Ivanova. Initialization of RBF networks with decision trees. In Proc. of the Belgian-Dutch Conf. on Machine Learning, BENELEARN'95, 1995.

[10] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[11] D.J.C. MacKay. Comparison of approximate methods of handling hyperparameters. Accepted for publication by Neural Computation, 1999.

[12] J.E. Moody. The effective number of parameters: an analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo, CA, 1992.

[13] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1998.

[14] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.

[15] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.


[16] M.J.L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/papers/intro.ps.

[17] M.J.L. Orr. Matlab routines for subset selection and ridge regression in linear neural networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/software/rbf.zip.

[18] M.J.L. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.

[19] M.J.L. Orr. Matlab functions for radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1999. Download from www.anc.ed.ac.uk/~mjo/software/rbf2.zip.

[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[21] C.E. Rasmussen, R.M. Neal, G.E. Hinton, D. van Camp, Z. Ghahramani, M. Revow, R. Kustra, and R. Tibshirani. The DELVE Manual, 1996. http://www.cs.utoronto.ca/~delve/.

[22] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. Technical report, Neural Computing Research Group, Aston University, UK, 1997.

[23] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Technical report, Neural Computing Research Group, Aston University, UK, 1997.