
Recent Advances in Radial Basis Function Networks

Mark J. L. Orr

Institute for Adaptive and Neural Computation

Division of Informatics, Edinburgh University

Edinburgh EH_ _LW, Scotland, UK

June 1999

Abstract

In 1996 an Introduction to Radial Basis Function Networks was published on the web, along with a package of Matlab functions. The emphasis was on the linear character of RBF networks and two techniques borrowed from statistics: forward selection and ridge regression.

This document is an update on developments between 1996 and 1999 and is associated with a second version of the Matlab package. Improvements have been made to the forward selection and ridge regression methods, and a new method, which is a cross between regression trees and RBF networks, has been developed.

Contact: mjo@anc.ed.ac.uk. Downloads: the earlier introduction (www.anc.ed.ac.uk/~mjo/papers/intro.ps) and its Matlab package (www.anc.ed.ac.uk/~mjo/software/rbf.zip); this document (www.anc.ed.ac.uk/~mjo/papers/recad.ps) and the second version of the Matlab package (www.anc.ed.ac.uk/~mjo/software/rbf2.zip).



Contents

1 Introduction
1.1 MacKay's Hermite Polynomial
1.2 Friedman's Simulated Circuit

2 Maximum Marginal Likelihood
2.1 Introduction
2.2 Review
2.3 The EM Algorithm
2.4 The DM Algorithm
2.5 Conclusions

3 Optimising the Size of RBFs
3.1 Introduction
3.2 Review
3.3 Efficient Re-estimation of $\lambda$
3.4 Avoiding Local Minima
3.5 The Optimal RBF Size
3.6 Trial Values in Other Contexts
3.7 Conclusions

4 Regression Trees and RBF Networks
4.1 Introduction
4.2 The Basic Idea
4.3 Generating the Regression Tree
4.4 From Hyperrectangles to RBFs
4.5 Selecting the Subset of RBFs
4.6 The Best Parameter Values
4.7 Demonstrations
4.8 Conclusions

5 Appendix
A Applying the EM Algorithm
B The Eigensystem of $H H^\top$


1 Introduction

In 1996 an introduction to radial basis function (RBF) networks was published on the web [16], along with an associated Matlab software package [17]. The approach taken stressed the linear character of RBF networks, which traditionally have only a single hidden layer, and borrowed techniques from statistics, such as forward selection and ridge regression, as strategies for controlling model complexity, the main challenge facing all methods of nonparametric regression.

That was three years ago. Since then, some improvements have been made, a new algorithm devised, and the package of Matlab functions is now in its second version [19]. This document describes the theory of the new developments and will be of interest to practitioners using the new software package and to theorists enhancing existing methods or developing new ones.

Section 2 describes what happens when the expectation-maximisation algorithm is applied to RBF networks. Section 3 describes a simple procedure for optimising the RBF widths, particularly for ridge regression. Finally, section 4 describes the new algorithm, which uses a regression tree to generate the centres and sizes of a set of candidate RBFs and to help select a subset of these for the network. Two simulated data sets, used for demonstration, are described below.

1.1 MacKay's Hermite Polynomial

The first data set is from [10] and is based on a one-dimensional Hermite polynomial,

$$y = 1 + \left(1 - x + 2x^2\right) e^{-x^2}.$$

Input values are sampled randomly over the range $-4 \le x \le 4$ and Gaussian noise is added to the outputs (figure 1.1).

Figure 1.1: Sample Hermite data (stars) and the actual function (curve), plotted as $y$ against $x$.
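As a concrete illustration, here is a minimal sketch of how such a training set might be generated; the sample count, noise level and random seed are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def hermite(x):
    """Target function y = 1 + (1 - x + 2x^2) * exp(-x^2)."""
    return 1.0 + (1.0 - x + 2.0 * x**2) * np.exp(-x**2)

def make_hermite_data(p=100, noise_std=0.1, seed=0):
    """Sample p inputs uniformly in [-4, 4] and add Gaussian output noise.
    p and noise_std are illustrative choices, not the paper's values."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0, size=p)
    y = hermite(x) + rng.normal(0.0, noise_std, size=p)
    return x, y

x, y = make_hermite_data()
```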


1.2 Friedman's Simulated Circuit

This second data set simulates an alternating current circuit with four parameters: resistance ($R$ ohms), angular frequency ($\omega$ radians per second), inductance ($L$ henries) and capacitance ($C$ farads), in the ranges

$$0 \le R \le 100\,, \qquad 40\pi \le \omega \le 560\pi\,, \qquad 0 \le L \le 1\,, \qquad 1 \times 10^{-6} \le C \le 11 \times 10^{-6}\,.$$

Random samples of the four parameters in these ranges were used to generate corresponding values of the impedance,

$$Z = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^2}\,,$$

to which Gaussian noise was added. This resulted in a training set with four-dimensional inputs $x = [R\ \omega\ L\ C]^\top$ and a scalar output $y = Z$. The problem originates from [6]. Before applying any learning algorithms to this data, the original inputs, with their very different dynamic ranges, are rescaled to the range $[-1, 1]$ in each component.
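A minimal sketch of a generator for this data set follows. The number of samples and the noise level are assumptions for illustration, the rescaling uses the sample extremes, and the parameter ranges are those reconstructed above.

```python
import numpy as np

def friedman_impedance(R, w, L, C):
    """Series-circuit impedance Z = sqrt(R^2 + (wL - 1/(wC))^2)."""
    return np.sqrt(R**2 + (w * L - 1.0 / (w * C))**2)

def make_circuit_data(p=200, noise_std=100.0, seed=0):
    """Sample the four circuit parameters uniformly in their ranges, compute Z
    and add Gaussian noise; p and noise_std are illustrative choices only."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 100.0, p)
    w = rng.uniform(40 * np.pi, 560 * np.pi, p)
    L = rng.uniform(0.0, 1.0, p)
    C = rng.uniform(1e-6, 11e-6, p)
    X = np.column_stack([R, w, L, C])
    y = friedman_impedance(R, w, L, C) + rng.normal(0.0, noise_std, p)
    # rescale each input component to [-1, 1], as described in the text
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = 2.0 * (X - lo) / (hi - lo) - 1.0
    return X, y

X, y = make_circuit_data()
```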


2 Maximum Marginal Likelihood

2.1 Introduction

The expectation-maximisation (EM) algorithm [5, 13] performs maximum likelihood estimation for problems in which some of the variables are unobserved. Recently it has been successfully applied to density estimation [2] and probabilistic principal components [22, 23], for example. This section discusses the application of EM to RBF networks.

First we review the probability model of a linear neural network and arrive at an expression for the marginal likelihood of the data. It is this likelihood which we ultimately want to maximise. Then we show the result of applying the EM algorithm: a pair of re-estimation formulae for the model parameters. However, it turns out that a similar pair of re-estimation formulae can be derived by a simpler method, and that they converge more rapidly than the EM versions. Finally, we draw some conclusions.

2.2 Review

The model estimated by a linear neural network from noisy samples $\{(x_i, y_i)\}_{i=1}^p$ can be written

$$f(x) = \sum_{j=1}^m w_j\, h_j(x)\,, \tag{2.1}$$

where the $\{h_j\}_{j=1}^m$ are fixed basis functions and the $\{w_j\}_{j=1}^m$ are unknown weights (to be estimated). The vector of residual errors between model and data is

$$e = y - H w\,,$$

where $H$ is the design matrix and has elements $H_{ij} = h_j(x_i)$. In a Bayesian approach to analysing the estimation process, the a priori probability of the weights $w$ can be modelled as a Gaussian of variance $\varsigma^2$,

$$p(w) \propto \varsigma^{-m} \exp\!\left(-\frac{w^\top w}{2\,\varsigma^2}\right). \tag{2.2}$$

The conditional probability of the data $y$ given the weights $w$ can also be modelled as a Gaussian, with variance $\sigma^2$, to account for the noise included in the outputs of the training set, $\{y_i\}_{i=1}^p$:

$$p(y \mid w) \propto \sigma^{-p} \exp\!\left(-\frac{e^\top e}{2\,\sigma^2}\right). \tag{2.3}$$

The joint probability of data and weights is the product of $p(w)$ with $p(y \mid w)$ and can be represented as an equivalent cost function by taking logarithms, multiplying by $-2$ and dropping constant terms, to obtain

$$E(y, w) = p \ln \sigma^2 + m \ln \varsigma^2 + \frac{e^\top e}{\sigma^2} + \frac{w^\top w}{\varsigma^2}\,. \tag{2.4}$$


The conditional probability of the weights $w$ given the data $y$ is found using Bayes' rule, again involves the product of (2.2) with (2.3), and is another Gaussian,

$$p(w \mid y) \propto p(y \mid w)\, p(w) \propto |W|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, (w - \hat{w})^\top W^{-1} (w - \hat{w})\right), \tag{2.5}$$

where

$$\hat{w} = A^{-1} H^\top y\,, \qquad W = \sigma^2 A^{-1}\,, \qquad A = H^\top H + \lambda\, I_m\,, \qquad \lambda = \frac{\sigma^2}{\varsigma^2}\,. \tag{2.6}$$

Finally, the marginal likelihood of the data is

$$p(y) = \int p(y \mid w)\, p(w)\, dw \propto \sigma^{-p}\, |P|^{1/2} \exp\!\left(-\frac{y^\top P y}{2\,\sigma^2}\right), \tag{2.7}$$

where

$$P = I_p - H A^{-1} H^\top\,.$$

Note that there is an equivalent cost function for $p(y)$, obtained by taking logarithms, multiplying by $-2$ and dropping the constant terms,

$$E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2}\,. \tag{2.8}$$

2.3 The EM Algorithm

The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an expectation (E) step, which finds the distribution of the unobserved variables, and a maximisation (M) step, which re-estimates the parameters of the model to be those with the maximum likelihood for the observed and missing data combined.

In the context of a linear neural network it is possible to consider the training set $\{(x_i, y_i)\}_{i=1}^p$ as the observed data, the weights $\{w_j\}_{j=1}^m$ as the missing data, and the variance of the noise $\sigma^2$ and the a priori variance of the weights $\varsigma^2$ as the model parameters.

In the E-step, the expectation of the conditional probability of the missing data (2.5) is taken and substituted, in the M-step, into the joint probability of the combined data, or its equivalent cost function (2.4), which is then optimised with respect to the model parameters $\sigma^2$ and $\varsigma^2$. These two steps are guaranteed to increase the marginal probability of the observed data and, when iterated, converge to a local maximum.


Detailed analysis (see appendix A) results in a pair of re-estimation formulae for the parameters $\sigma^2$ and $\varsigma^2$:

$$\sigma^2 \leftarrow \frac{\hat{e}^\top \hat{e} + \gamma\, \sigma^2}{p}\,, \tag{2.9}$$

$$\varsigma^2 \leftarrow \frac{\hat{w}^\top \hat{w} + (m - \gamma)\, \varsigma^2}{m}\,, \tag{2.10}$$

where

$$\hat{e} = y - H \hat{w}\,, \qquad \gamma = m - \lambda\, \mathrm{tr}\, A^{-1}\,.$$

Initial guesses are substituted into the right hand sides, which produce new guesses. The process is repeated until a local minimum of (2.8) is reached.

Note that equation (2.10) was derived in [11] by a free energy approach. It has been shown that free energy and the EM algorithm are intimately connected [13].

Figure 2.1 illustrates with the Hermite data described in section 1.1. A centre of fixed radius was created for each training set input. The figure plots logarithmic contours of (2.8) and the sequence of $\sigma^2$ and $\varsigma^2$ values re-estimated by (2.9, 2.10).

Figure 2.1: Optimisation of $\sigma^2$ and $\varsigma^2$ by EM; axes are $\log\sigma^2$ and $\log\varsigma^2$.

2.4 The DM Algorithm

An alternative approach to minimising (2.8) is simply to differentiate it and set the results to zero. This is easily done and results in the pair of re-estimation formulae

$$\sigma^2 \leftarrow \frac{\hat{e}^\top \hat{e}}{p - \gamma}\,, \tag{2.11}$$

$$\varsigma^2 \leftarrow \frac{\hat{w}^\top \hat{w}}{\gamma}\,. \tag{2.12}$$


I call this method the "DM algorithm", after David MacKay who first derived these equations [10]. Its disadvantage is the absence of any guarantee that the iterations converge, unlike their EM counterparts (2.9, 2.10), which are known to increase the marginal likelihood (or leave it the same if a fixed point has been reached). Any fixed point of DM is also a fixed point of EM, and vice versa, but if there are multiple fixed points there is no guarantee that both methods will converge to the same one, even when starting from the same guess.

Figure 2.2 plots the sequence of re-estimated values using (2.11, 2.12) for the same training set, RBF network and initial values of $\sigma^2$ and $\varsigma^2$ used for figure 2.1. It is apparent that convergence is faster for DM than for EM in this example, taking far fewer iterations. In fact, our empirical observation is that DM always converges considerably faster than EM if they start from the same guess and converge to the same local minimum. Furthermore, DM has never failed to converge.

Figure 2.2: Optimisation of $\sigma^2$ and $\varsigma^2$ by DM; axes are $\log\sigma^2$ and $\log\varsigma^2$.
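To make the two update rules concrete, here is a minimal numpy sketch of one re-estimation loop. The function name, convergence tolerance and initial guesses are illustrative assumptions, and the design matrix H is assumed to have been built already from whatever basis functions are in use.

```python
import numpy as np

def reestimate_variances(H, y, sigma2=1.0, zeta2=1.0, method="DM",
                         max_iter=100, tol=1e-8):
    """Re-estimate the noise variance sigma2 and prior weight variance zeta2.
    method="EM" uses updates (2.9, 2.10); method="DM" uses (2.11, 2.12)."""
    p, m = H.shape
    for _ in range(max_iter):
        lam = sigma2 / zeta2                       # regularisation parameter (2.6)
        A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = A_inv @ H.T @ y                    # posterior mean weights
        e_hat = y - H @ w_hat                      # residuals
        gamma = m - lam * np.trace(A_inv)          # effective number of parameters
        if method == "EM":
            new_sigma2 = (e_hat @ e_hat + gamma * sigma2) / p
            new_zeta2 = (w_hat @ w_hat + (m - gamma) * zeta2) / m
        else:                                      # DM (MacKay) updates
            new_sigma2 = e_hat @ e_hat / (p - gamma)
            new_zeta2 = w_hat @ w_hat / gamma
        converged = (abs(new_sigma2 - sigma2) < tol and
                     abs(new_zeta2 - zeta2) < tol)
        sigma2, zeta2 = new_sigma2, new_zeta2
        if converged:
            break
    return sigma2, zeta2, w_hat
```

Both variants can be run from the same starting point to reproduce comparisons like those in figures 2.1 and 2.2.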

2.5 Conclusions

We started by applying the EM algorithm to RBF networks using the weight-decay (ridge regression) style of penalised likelihood and ended with a pair of re-estimation formulae for the noise variance $\sigma^2$ and the prior weight variance $\varsigma^2$. However, these turned out to be less efficient than a similar pair of formulae which had been known in the literature for some time.

The rbf_rr_2 method in the Matlab software package [19] has an option to use maximum marginal likelihood (MML) as the model selection criterion (instead of GCV or BIC, for example). When this option is selected the regularisation parameter $\lambda$ (2.6) is re-estimated using, by default, the DM equations (2.11, 2.12). Another option can be set so that the EM versions (2.9, 2.10) are used instead.


3 Optimising the Size of RBFs

3.1 Introduction

In previous work [15, 16] we concentrated on methods for optimising the regularisation parameter, $\lambda$, of an RBF network. However, another key parameter is the size of the RBFs, and until now no methods have been provided for its optimisation. This section describes a simple scheme to find an overall scale size for the RBFs in a network.

We first review the basic concepts, already covered elsewhere [16], and then describe an improved version of the re-estimation formula for the regularisation parameter which is considerably more efficient and allows multiple initial guesses for $\lambda$ to be optimised in an effort to avoid getting trapped in local minima (the details are given in appendix B). We then describe a method for choosing the best overall size for the RBFs from a number of trial values, which is rendered tractable by the efficient optimisation of $\lambda$. Finally, we make some concluding remarks.

3.2 Review

In a linear model with fixed basis functions $\{h_j\}_{j=1}^m$ and weights $\{w_j\}_{j=1}^m$,

$$f(x) = \sum_{j=1}^m w_j\, h_j(x)\,, \tag{3.1}$$

the model complexity can be controlled by the addition of a penalty term to the sum of squared errors over the training set, $\{(x_i, y_i)\}_{i=1}^p$. When this combined error,

$$E = \sum_{i=1}^p \left(y_i - f(x_i)\right)^2 + \lambda \sum_{j=1}^m w_j^2\,,$$

is optimised, large components in the weight vector $w$ are inhibited. This kind of penalty is known as ridge regression or weight-decay, and the parameter $\lambda$, which controls the amount of penalty, is known as the regularisation parameter. While the nominal number of free parameters is $m$ (the weights), the effective number is less, due to the penalty term, and is given [12] by

$$\gamma = m - \lambda\, \mathrm{tr}\, A^{-1}\,, \tag{3.2}$$

$$A = H^\top H + \lambda\, I_m\,, \tag{3.3}$$

where $H$ is the design matrix with elements $H_{ij} = h_j(x_i)$. The expression for $\gamma$ is monotonic in $\lambda$, so model complexity can be decreased (or increased) by raising (or lowering) the value of $\lambda$.

The parameter $\lambda$ has a Bayesian interpretation: it is the ratio of $\sigma^2$, the variance of the noise corrupting the training set outputs, to $\varsigma^2$, the a priori variance of the weights (see section 2). If the value of $\lambda$ is known then the optimal weight vector is

$$\hat{w} = A^{-1} H^\top y\,. \tag{3.4}$$


However, neither $\sigma^2$ nor $\varsigma^2$ may be available in a practical situation, so it is usually necessary to establish an effective value for $\lambda$ in parallel with optimising the weights. This may be done with a model selection criterion such as BIC (Bayesian information criterion), GCV (generalised cross-validation) or MML (maximum marginal likelihood, see section 2), and in particular with one or more re-estimation formulae. For GCV the single formula is

$$\lambda \leftarrow \frac{\hat{e}^\top \hat{e}\;\, \mathrm{tr}\!\left(A^{-1} - \lambda A^{-2}\right)}{\hat{w}^\top A^{-1} \hat{w}\; (p - \gamma)}\,, \tag{3.5}$$

where

$$\hat{e} = y - H \hat{w}\,.$$

An initial guess for $\lambda$ is used to evaluate the right hand side of (3.5), which produces a new guess. The resulting sequence of re-estimated values converges to a local minimum of GCV. Each iteration requires the inverse of the $m$-by-$m$ matrix $A$ and therefore costs of order $m^3$ floating point operations.

3.3 Efficient Re-estimation of $\lambda$

The optimisation of $\lambda$ by iteration of the re-estimation formula is burdened by the necessity of computing an expensive matrix inverse every iteration. However, by a reformulation of the individual terms of the equation using the eigenvalues and eigenvectors of $H H^\top$, it is possible to perform most of the work during the first iteration and reuse the results in subsequent ones. Thus the amount of computation required to complete an optimisation which takes $q$ steps to converge is reduced to roughly a fraction $1/q$ of what it would otherwise be. Unfortunately, the technique only works for a single global regularisation parameter [15], not for multiple parameters applying to different groups of weights or to individual weights [14].

Suppose the eigenvalues and eigenvectors of $H H^\top$ are $\{\lambda_i\}_{i=1}^p$ and $\{u_i\}_{i=1}^p$, and that the projections of $y$ onto the eigenvectors are $\hat{y}_i = y^\top u_i$. Then, as shown in appendix B, the four terms involved in the re-estimation formula (3.5) are

$$p - \gamma = \sum_{i=1}^p \frac{\lambda}{\lambda_i + \lambda}\,, \tag{3.6}$$

$$\mathrm{tr}\!\left(A^{-1} - \lambda A^{-2}\right) = \sum_{i=1}^p \frac{\lambda_i}{(\lambda_i + \lambda)^2}\,, \tag{3.7}$$

$$\hat{e}^\top \hat{e} = \sum_{i=1}^p \frac{\lambda^2\, \hat{y}_i^2}{(\lambda_i + \lambda)^2}\,, \tag{3.8}$$

$$\hat{w}^\top A^{-1} \hat{w} = \sum_{i=1}^p \frac{\lambda_i\, \hat{y}_i^2}{(\lambda_i + \lambda)^3}\,. \tag{3.9}$$

If $\lambda$ is re-estimated by computing (3.6)-(3.9), instead of explicitly calculating the inverse in (3.5), then the computational cost of each iteration is only of order $p$ instead of $m^3$.


The overhead of initially calculating the eigensystem, which is of order $p^3$, has to be taken into account but is only incurred once. For problems in which $p$ is not much bigger than $m$ this represents a significant saving in computation time and makes it feasible to optimise multiple guesses for the initial value of $\lambda$ to decrease the chances of getting caught in a local minimum.
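The following numpy sketch implements the eigensystem-based re-estimation (3.6)-(3.9) and, anticipating the next subsection, reuses the one-off eigendecomposition to optimise several initial guesses for lambda. The function names, tolerances and trial guesses are illustrative assumptions rather than the package's actual interface.

```python
import numpy as np

def eigensystem(H, y):
    """One-off O(p^3) work: eigendecomposition of H H^T and projection of y onto it."""
    lam_i, U = np.linalg.eigh(H @ H.T)    # eigenvalues lam_i, eigenvectors (columns of U)
    return lam_i, U.T @ y                 # y_hat_i = u_i^T y

def reestimate_lambda(lam_i, y_hat, lam0, n_iter=100, tol=1e-9):
    """Iterate the GCV re-estimation (3.5) using only the cheap O(p) sums (3.6)-(3.9)."""
    lam = lam0
    for _ in range(n_iter):
        d = lam_i + lam
        new_lam = (np.sum(lam**2 * y_hat**2 / d**2)      # e'e                    (3.8)
                   * np.sum(lam_i / d**2)                # tr(A^-1 - lam A^-2)    (3.7)
                   / (np.sum(lam_i * y_hat**2 / d**3)    # w'A^-1 w               (3.9)
                      * np.sum(lam / d)))                # p - gamma              (3.6)
        converged = abs(new_lam - lam) <= tol * lam
        lam = new_lam
        if converged:
            break
    d = lam_i + lam
    gcv = lam_i.size * np.sum(lam**2 * y_hat**2 / d**2) / np.sum(lam / d)**2
    return lam, gcv                        # GCV = p e'e / (p - gamma)^2

def best_lambda(H, y, guesses=(1e-6, 1e-4, 1e-2, 1.0)):
    """Optimise lambda from several starting guesses (cf. section 3.4) and keep the
    solution with the lowest GCV; the guesses here are arbitrary examples."""
    lam_i, y_hat = eigensystem(H, y)
    return min((reestimate_lambda(lam_i, y_hat, g) for g in guesses),
               key=lambda t: t[1])
```

Only eigensystem() is cubic; each additional starting guess costs a handful of O(p) passes, which is why multiple trial values are cheap once the decomposition is in hand.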

3.4 Avoiding Local Minima

If the initial guess for $\lambda$ is close to a local minimum of GCV (or whatever model selection criterion is employed) then re-estimation using (3.5) is likely to get trapped there. We illustrate using Friedman's data set as described in section 1.2, with an RBF network of Gaussian centres coincident with the inputs of the training set and of fixed radius.

The solid curve in figure 3.1 shows the variation of GCV with $\lambda$. The open circles show a sequence of re-estimated $\lambda$ values with their corresponding GCV scores. The sequence converged (at the closed circle) to a local minimum, and the global minimum, at another value of $\lambda$, was missed.

Figure 3.1: The variation of GCV with $\lambda$ for Friedman's problem ($\log$GCV against $\log\lambda$) and a sequence of re-estimations whose initial guess leads to a local minimum.

Compare figure 3.1 with figure 3.2, where the only change was to use a different initial guess for $\lambda$. This time the guess is sufficiently close to the global minimum that the re-estimations are attracted towards it. Note that the set of eigenvalues and eigenvectors used to compute the sequences in figures 3.1 and 3.2 are identical. Since the calculation of the eigensystem dominates the other computational costs, it is almost as expensive to optimise one trial value as it is to optimise several. Thus, to avoid falling into a local minimum, several trial values spread over a wide range can be optimised and the solution with the lowest GCV selected as the overall winner. This value can then be used to determine the weights (3.4) and ultimately the predictions (3.1) of the network.


Figure 3.2: Same as figure 3.1 except that the initial guess for $\lambda$ is closer to the global minimum.

3.5 The Optimal RBF Size

For Gaussian radial functions of fixed width $r$ the transfer functions of the hidden units are

$$h_j(x) = \exp\!\left(-\frac{(x - c_j)^\top (x - c_j)}{r^2}\right).$$

Unfortunately, there is no re-estimation formula for $r$, as there is for $\lambda$, even in this simple case where the same scale is used for each RBF and each component of the input [16]. To properly optimise the value of $r$ would thus require the use of a nonlinear optimisation algorithm and would have to incorporate the optimisation of $\lambda$ (since the optimal value of $\lambda$ changes as $r$ changes).

An alternative, if rather crude, approach is to test a number of trial values for $r$. For each value an optimal $\lambda$ is calculated (by using the re-estimation method above) and the model selection score noted. When all the values have been checked, the one associated with the lowest score wins. The computational cost of this procedure is dominated, once again, by the cost of computing the eigenvalues and eigenvectors of $H H^\top$, and these have to be calculated separately for each value of $r$.

While this procedure is less computationally demanding than a full nonlinear optimisation of $r$ and $\lambda$, its drawback is that it is only capable of identifying the best value of $r$ from a finite number of alternatives. On the other hand, given that the value of $\lambda$ is fully optimised and that the model selection criteria are heuristic (in other words, approximate) in nature, it is arguable that a more precise location for the optimal value of $r$ is unlikely to have much practical significance.
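A minimal sketch of this trial-value loop, under the assumption of Gaussian RBFs centred on the training inputs, is given below. The helper names are illustrative; the default trial radii mirror the values used in the demonstration that follows, and lambda is optimised with the eigensystem-based re-estimation of section 3.3.

```python
import numpy as np

def design_matrix(X, centres, r):
    """Gaussian design matrix H_ij = exp(-||x_i - c_j||^2 / r^2)."""
    d2 = ((X[:, None, :] - centres[None, :, :])**2).sum(axis=2)
    return np.exp(-d2 / r**2)

def optimise_lambda_gcv(H, y, lam=1e-2, n_iter=50):
    """Re-estimate lambda by (3.5) using the eigensystem terms (3.6)-(3.9),
    and return the optimised lambda together with its GCV score."""
    p = H.shape[0]
    lam_i, U = np.linalg.eigh(H @ H.T)
    y_hat = U.T @ y
    for _ in range(n_iter):
        d = lam_i + lam
        lam = (np.sum(lam**2 * y_hat**2 / d**2) * np.sum(lam_i / d**2)
               / (np.sum(lam_i * y_hat**2 / d**3) * np.sum(lam / d)))
    d = lam_i + lam
    gcv = p * np.sum(lam**2 * y_hat**2 / d**2) / np.sum(lam / d)**2
    return lam, gcv

def best_radius(X, y, trial_radii=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6)):
    """Try each trial radius, optimise lambda for it, and keep the lowest GCV."""
    results = []
    for r in trial_radii:
        H = design_matrix(X, X, r)        # one centre per training input
        lam, gcv = optimise_lambda_gcv(H, y)
        results.append((gcv, r, lam))
    gcv, r, lam = min(results)
    return r, lam, gcv
```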

We illustrate the method on the Hermite data described in section 1.1. Once again we use each training set input as an RBF centre. We tried seven different trial values for $r$: 0.4, 0.6, 0.8, 1.0, 1.2, 1.4 and 1.6. For each trial value we plot, in figure 3.3, the variation of GCV with $\lambda$ (the curves), as well as the optimal $\lambda$ (the closed circles) found by re-estimation as described above.


Figure 3.3: The Hermite data set with four sizes of RBFs ($\log$GCV against $\log\lambda$; the curve labels give the trial radii, 0.4-1.6).

The radius value which led to the lowest GCV score was one of the intermediate trial values, with its corresponding optimal regularisation parameter marked by the closed circle.

Initially, as $r$ increases from its lowest trial value, the GCV score at the optimum $\lambda$ decreases. Eventually it reaches its lowest value at an intermediate radius. Above that there is not much increase in optimised GCV, although the optimal $\lambda$ decreases rapidly.

3.6 Trial Values in Other Contexts

The use of trial values is limited to cases where there is a small number of parameters to optimise, such as the single parameter $r$. If there are several parameters with trial values then the number of different combinations to evaluate can easily become prohibitively large. In RBF networks where there is a separate scale parameter for each dimension, so that the transfer functions are, for example in the case of Gaussians,

$$h_j(x) = \exp\!\left(-\sum_{k=1}^n \frac{(x_k - c_{jk})^2}{r_{jk}^2}\right),$$

there would be $t^{mn}$ combinations to check, where $t$ is the number of trial values for each $r_{jk}$, $m$ is the number of basis functions and $n$ the number of dimensions. However, it is possible to test trial values for an overall scale size $\alpha$ if some other mechanism can be used to generate the scales $r_{jk}$. Here, the transfer functions are

$$h_j(x) = \exp\!\left(-\sum_{k=1}^n \frac{(x_k - c_{jk})^2}{\alpha^2\, r_{jk}^2}\right).$$

This is the approach taken for the method of section 4, where a regression tree determines the values of $r_{jk}$ but the overall scale size $\alpha$ is optimised by testing trial values.
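As a small worked example of the second form of transfer function, the sketch below evaluates Gaussian RBFs with per-dimension radii scaled by a single overall factor; the symbol alpha follows the reconstruction above, and the array shapes are assumptions.

```python
import numpy as np

def design_matrix_aniso(X, centres, radii, alpha):
    """H_ij = exp(-sum_k (x_ik - c_jk)^2 / (alpha^2 * r_jk^2)).

    X: (p, n) inputs; centres: (m, n); radii: (m, n) per-dimension scales r_jk."""
    diff = X[:, None, :] - centres[None, :, :]           # (p, m, n)
    scaled = (diff / (alpha * radii[None, :, :]))**2     # divide by alpha * r_jk
    return np.exp(-scaled.sum(axis=2))                   # (p, m) design matrix
```

With this in place, alpha can be treated exactly like $r$ in the previous subsection: build the design matrix for each trial alpha and compare the optimised model selection scores.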


3.7 Conclusions

We have shown how trial values for the overall size of the RBFs can be compared using a model selection criterion. In the case of ridge regression, an efficient method for optimising the regularisation parameter helps reduce the computational burden of training a separate network for each trial value. However, the same technique can also be used with other methods of complexity control, including those in which there is no regularisation.

In the Matlab software package [19] each method can be configured with a set of trial values for the overall RBF scale. The best value is chosen and used to generate the RBF network which the Matlab function returns.


4 Regression Trees and RBF Networks

4.1 Introduction

This section is about a novel method for nonparametric regression involving a combination of regression trees and RBF networks [8]. The basic idea of a regression tree is to recursively partition the input space in two and approximate the function in each half by the average output value of the samples it contains [4]. Each split is parallel to one of the axes, so it can be expressed by an inequality involving one of the input components (e.g. $x_k > b$). The input space is thus divided into hyperrectangles organised into a binary tree, where each branch is determined by the dimension ($k$) and boundary ($b$) which together minimise the residual error between model and data.

A benefit of regression trees is the information provided in the split statistics about the relevance of each input variable: the components which carry the most information about the output tend to be split earliest and most often. A weakness of regression trees is the discontinuous model caused by the output value jumping across the boundary between two hyperrectangles. There is also the problem of deciding when to stop growing the tree (or, equivalently, how much to prune it after it has fully grown), which is the familiar bias-variance dilemma faced by all methods of nonparametric regression [7]. The use of radial basis functions in conjunction with regression trees can help to solve both these problems.

Below we outline the basic method of combining RBFs and regression trees as it appeared originally, describe our version of this idea and why we think it is an improvement, and finally show some results and summarise our conclusions.

4.2 The Basic Idea

The combination of trees and RBF networks was first suggested by [9] in the context of classification rather than regression (though the two cases are very similar). Further elaboration of the idea appeared in [8]. Essentially, each terminal node of the classification tree contributes one hidden unit to the RBF network, the centre and radius of which are determined by the position and size of the corresponding hyperrectangle. Thus the tree sets the number, positions and sizes of all RBFs in the network. Model complexity is controlled by two parameters: the pruning confidence $-c$ of C4.5 [20] (the software package used by [8] to generate classification trees), which determines the amount of tree pruning, and a scaling parameter which fixes the size of RBFs relative to hyperrectangles.

Our major reservation about the approach taken by [8] is the treatment of model complexity. In the case of the scaling parameter, the author claimed it had little effect on prediction accuracy, but this is not in accord with our previous experience of RBF networks. As for the amount of pruning, he demonstrated its effect on prediction accuracy yet used a fixed value in his benchmark tests. Moreover, there was no discussion of how to control scaling and pruning to optimise model complexity for a given data set.


Our method is a variation on Kubat's with the following alterations.

1. We address the model complexity issue by using the nodes of the regression tree not to fix the RBF network but rather to generate a set of candidate RBFs from which the final network can be selected. Thus the burden of controlling model complexity shifts from tree generation to RBF selection.

2. The regression tree from which the RBFs are produced can also be used to order selections, such that certain candidate RBFs are allowed to enter the model before others. We describe one way to achieve such an ordering and demonstrate that it produces more accurate models than plain forward selection.

3. We show that, contrary to the conclusions of [8], the method is typically quite sensitive to the scaling parameter $\alpha$, and discuss its optimisation by the use of multiple trial values.

4.3 Generating the Regression Tree

The first stage of our method (and Kubat's) is to generate a regression tree. The root node of the tree is the smallest hyperrectangle which contains all the training set inputs, $\{x_i\}_{i=1}^p$. Its size $s_k$ (the half-width) and centre $c_k$ in each dimension $k$ are

$$s_k = \frac{1}{2}\left(\max_{i \in S} x_{ik} - \min_{i \in S} x_{ik}\right),$$

$$c_k = \frac{1}{2}\left(\max_{i \in S} x_{ik} + \min_{i \in S} x_{ik}\right),$$

where $S = \{1, 2, \ldots, p\}$ is the set of training set indices. A split of the root node divides the training samples into left and right subsets, $S_L$ and $S_R$, on either side of a boundary $b$ in one of the dimensions $k$, such that

$$S_L = \{i : x_{ik} \le b\}\,, \qquad S_R = \{i : x_{ik} > b\}\,.$$

The mean output value on either side of the split is

$$\bar{y}_L = \frac{1}{p_L} \sum_{i \in S_L} y_i\,, \qquad \bar{y}_R = \frac{1}{p_R} \sum_{i \in S_R} y_i\,,$$

where $p_L$ and $p_R$ are the number of samples in each subset. The residual square error between model and data is then

$$E(k, b) = \frac{1}{p}\left(\sum_{i \in S_L} \left(y_i - \bar{y}_L\right)^2 + \sum_{i \in S_R} \left(y_i - \bar{y}_R\right)^2\right).$$


The split which minimises $E(k, b)$ over all possible choices of $k$ and $b$ is used to create the children of the root node and is easily found by discrete search over the $n$ dimensions and $p$ cases. The children of the root node are split recursively in the same manner, and the process terminates when a node cannot be split without creating a child containing fewer samples than a given minimum, $p_{\min}$, which is a parameter of the method. Compared to their parent node, the child centres are shifted and their sizes reduced in the $k$-th dimension.

Since the size of the regression tree does not determine the model complexity, there is no need to perform the final pruning step normally associated with recursive splitting methods [4, 6, 20].
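A minimal sketch of this tree-growing step follows. The node representation, the function names, and the choice of candidate boundaries (midway between consecutive sorted input values) are implementation assumptions rather than details taken from the text.

```python
import numpy as np

class Node:
    """A hyperrectangle node: centre and half-width per dimension, plus children."""
    def __init__(self, idx, centre, size):
        self.idx, self.centre, self.size = idx, centre, size
        self.left = self.right = None

def best_split(X, y, idx, p_min):
    """Find the (dimension, boundary) minimising E(k, b) among splits that leave at
    least p_min samples on each side; return None if no such split exists.
    (The constant 1/p factor in E(k, b) is omitted; it does not affect the argmin.)"""
    best = None
    for k in range(X.shape[1]):
        xs = np.unique(X[idx, k])
        for b in (xs[:-1] + xs[1:]) / 2.0:        # candidate boundaries
            left, right = idx[X[idx, k] <= b], idx[X[idx, k] > b]
            if len(left) < p_min or len(right) < p_min:
                continue
            err = (np.sum((y[left] - y[left].mean())**2) +
                   np.sum((y[right] - y[right].mean())**2))
            if best is None or err < best[0]:
                best = (err, k, b, left, right)
    return best

def grow_tree(X, y, idx=None, p_min=5):
    """Recursively split nodes until no split can respect the p_min constraint."""
    if idx is None:
        idx = np.arange(len(y))
    lo, hi = X[idx].min(axis=0), X[idx].max(axis=0)
    node = Node(idx, centre=(hi + lo) / 2.0, size=(hi - lo) / 2.0)
    split = best_split(X, y, idx, p_min)
    if split is not None:
        _, k, b, left, right = split
        node.left = grow_tree(X, y, left, p_min)
        node.right = grow_tree(X, y, right, p_min)
    return node
```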

4.4 From Hyperrectangles to RBFs

The regression tree contains a root node, some nonterminal nodes (having children) and some terminal nodes (having no children). Each node is associated with a hyperrectangle of input space having a centre $c$ and size $s$ as described above. The node corresponding to the largest hyperrectangle is the root node, and the node sizes decrease down the tree as the hyperrectangles are divided into smaller and smaller pieces. To translate a hyperrectangle into a Gaussian RBF we use its centre $c$ as the RBF centre and its size $s$ scaled by a parameter $\alpha$ as the RBF radius, $r = \alpha\, s$. The scalar $\alpha$ has the same value for all nodes and is another parameter of the method (in addition to $p_{\min}$). Our $\alpha$ is not quite the same as Kubat's scaling parameter (they are related by an inverse and a factor of $\sqrt{2}$) but plays exactly the same role.
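Continuing the sketch above, the conversion from tree nodes to candidate Gaussian RBFs takes only a few lines. It assumes node objects with centre and size attributes (as in the hypothetical Node class of the previous sketch), and the resulting centres and radii can be fed to the anisotropic design matrix of section 3.6; r = alpha * s follows the text.

```python
import numpy as np

def collect_nodes(node, out=None):
    """Walk the tree and gather every node's centre and size (root first)."""
    if out is None:
        out = []
    out.append((node.centre, node.size))
    if node.left is not None:
        collect_nodes(node.left, out)
        collect_nodes(node.right, out)
    return out

def nodes_to_rbfs(nodes, alpha):
    """Each hyperrectangle becomes a Gaussian RBF with centre c and radii r = alpha * s.
    (In practice a degenerate zero size in some dimension would need a small floor.)"""
    centres = np.array([c for c, _ in nodes])
    radii = alpha * np.array([s for _, s in nodes])
    return centres, radii
```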

4.5 Selecting the Subset of RBFs

After the tree nodes are translated into RBFs, the next step of our method is to select a subset of them for inclusion in the model. This is in contrast to the method of [8], where all RBFs from terminal nodes were included in the model, which was thus heavily dependent on the extent of tree pruning to control model complexity. Selection can be performed either with a standard method such as forward selection [15, 16] or in a novel way, by employing the tree to guide the order in which candidate RBFs are considered.

In the standard methods for subset selection, the RBFs generated from the regression tree are treated as an unstructured collection with no distinction between RBFs corresponding to different nodes in the tree. However, intuition suggests that the best order in which to consider RBFs for inclusion in the model is large ones first and small ones last, so as to synthesise coarse structure before fine details. This, in turn, suggests searching for RBF candidates by traversing the tree from the largest hyperrectangle (and RBF) at the root to the smallest hyperrectangles (and RBFs) at the terminal nodes. Thus the first decision should be whether to include the root node in the model, the second whether to include any of the children of the root node, and so on, until the terminal nodes are reached.

The scheme we eventually developed for selecting RBFs goes somewhat beyond this simple picture and was influenced by two other considerations. The first concerns a classic problem with forward selection, namely that one regressor can block the selection of other, more explanatory regressors which would have been chosen in preference had they been considered first.


In our case there was a danger that a parent RBF could block its own children. To avoid this situation, when considering whether to add the children of a node which had already been selected, we also considered the effect of deleting the parent. Thus our method has a measure of backward elimination as well as forward selection. This is reminiscent of the selection schemes developed for the MARS [6] and MAPS [1] algorithms.

A second reason for departing from a simple breadth-first search is that the size of a hyperrectangle (in terms of volume) on one level is not guaranteed to be smaller than the size of all the hyperrectangles in the level above (only its parent), so it is not easy to achieve a strict largest-to-smallest ordering. In view of this, we abandoned any attempt at a strict ordering and instead devised a search algorithm which dynamically adjusts the set of selectable RBFs by replacing selected RBFs with their children.

The algorithm depends on the concept of an active list of nodes. At any given moment during the selection process only these nodes and their children are considered for inclusion in or exclusion from the model. Every time RBFs are added to or subtracted from the model the active list expands by having a node replaced by its children. Eventually the active list becomes coincident with the terminal nodes and the search is terminated. In detail, the steps of the algorithm are as follows.

1. Initialise the active list with the root node and the model with the root node's RBF.

2. For all nonterminal nodes on the active list, consider the effect (on the model selection criterion) of adding both or just one of the children's RBFs (three possible modifications to the model). If the parent's RBF is already in the model, also consider the effect of first removing it before adding one or both children's RBFs, or of just removing it (a further four possible modifications).

3. The total number of possible adjustments to the model is therefore somewhere between three and seven times the number of active nonterminal nodes, depending on how many of their RBFs are already in the model. From all these possibilities choose the one which most decreases the model selection criterion. Update the current model and remove the node involved from the active list, replacing it with its children. If none of the modifications decrease the selection criterion, then choose one of the active nodes at random and replace it by its children, but leave the model unaltered.

4. Return to step 2 and repeat until all the active nodes are terminal nodes.

Once the selection process has terminated, the network weights can be calculated in the usual way by solving the normal equation,

$$w = \left(H^\top H\right)^{-1} H^\top y\,,$$

where $H$ is the design matrix of the selected RBFs. There is no need for a regularisation term, as appears in equations (2.6) and (3.4) for example, because model complexity is limited by the selection process.
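The tree-guided active-list search is somewhat involved, so the sketch below shows only the simpler baseline mentioned above: plain forward selection from the candidate pool using BIC as the selection criterion, followed by the unregularised normal-equation solve. The function names, the particular BIC form used and the stopping rule are assumptions, not the package's implementation.

```python
import numpy as np

def bic(H_sel, y):
    """BIC for an unregularised linear fit on the selected columns:
    p*ln(e'e/p) + k*ln(p), with k the number of selected RBFs (an assumed form)."""
    p, k = H_sel.shape
    w, *_ = np.linalg.lstsq(H_sel, y, rcond=None)   # normal-equation solution
    e = y - H_sel @ w
    return p * np.log(e @ e / p) + k * np.log(p), w

def forward_select(H, y, max_units=None):
    """Greedy forward selection over the candidate design matrix H (p x m):
    keep adding the column that lowers BIC most, stop when no column helps."""
    p, m = H.shape
    selected, best_score, best_w = [], np.inf, None
    while len(selected) < (max_units or m):
        scores = []
        for j in range(m):
            if j in selected:
                continue
            score, w = bic(H[:, selected + [j]], y)
            scores.append((score, j, w))
        if not scores:
            break
        score, j, w = min(scores, key=lambda t: t[0])
        if score >= best_score:
            break                    # no remaining candidate improves BIC
        selected.append(j)
        best_score, best_w = score, w
    return selected, best_w
```

The tree-guided variant described above differs only in which candidate modifications are offered at each step (the active list and its children, with optional removal of an already-selected parent), not in how each modification is scored.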


4.6 The Best Parameter Values

Our method has three main parameters: the model selection criterion; $p_{\min}$, which controls the depth of the regression tree; and $\alpha$, which determines the relative size between hyperrectangles and RBFs.

For the model selection criterion we found that the more conservative BIC, which tends to produce more parsimonious models, rarely performed worse than GCV and often did significantly better. This is in line with the experiences of other practitioners of algorithms based on subset selection, such as [6], who modified GCV to make it more conservative, and [1], who also found BIC gave better results than GCV.

For $p_{\min}$ and $\alpha$ we use the simple method of comparing the model selection scores of a number of trial values, as for the RBF widths in section 3. This means growing several trees (one for each trial value of $p_{\min}$) and then, for each tree, selecting models from several sets of RBFs (one for each value of $\alpha$). The cost is extra computation: the more trial values there are, the longer the algorithm takes to search through them. However, the basic algorithm is not unduly expensive and, if the number of trial values for each parameter is kept fairly low, the computation time is acceptable.

4.7 Demonstrations

Figure 4.1 shows the prediction of a pure regression tree for a sample of Hermite data (section 1.1). For clarity, the samples themselves are not shown, just the target function and the prediction. Of course, the model is discontinuous, and each horizontal section corresponds to one terminal node of the tree.

Figure 4.1: A pure regression tree prediction on the Hermite data (target and prediction against $x$).


The tree which produced the prediction shown in figure 4.1 was grown until further splitting would have violated the minimum number of samples allowed per node ($p_{\min}$). There was no pruning or any other sophisticated form of complexity control, so this kind of tree is not suitable for practical use as a prediction method. However, in our method the tree is only used to create RBF candidates; model complexity is controlled by a separate process which selects a subset of RBFs for the network.

Figure 4.2 shows the prediction of the combined method on the same data set used in figure 4.1 after a subset of RBFs was selected from the pool of candidates generated by the tree nodes. Now the model is continuous and its complexity is well matched to the data.

Figure 4.2: The combined method on the Hermite data (target and prediction against $x$).

As a last demonstration we turn our attention to Friedman's data set (section 1.2). In experiments with the MARS algorithm [6], Friedman estimated the accuracy of his method by replicating data sets and computing the mean and standard deviation of the scaled sum-square-error; his best results corresponded to the most favourable values for the parameters of the MARS method.

To compare our algorithm with MARS, and also to test the effect of using multiple trial values for our method's parameters, $p_{\min}$ and $\alpha$, we conducted a similar experiment. Before we started, we tried some different settings for the trial values and identified one which gave good results on test data. Then, for each of the replications, we applied the method twice. In the first run we used the trial values we had discovered earlier. In the second run we used only a single "best" value for each parameter, the average of the trial values, forcing this value to be used for every replicated data set. The results are shown in table 1.

It is apparent that the results are practically identical to MARS when the full sets of trial values are used but significantly inferior when only single "best" values are used.


Table 1: Results on replications of Friedman's data set: the scaled error (mean plus or minus standard deviation) for the run using the full sets of trial values of $p_{\min}$ and $\alpha$, and for the run using single fixed values.

In another test using replicated data sets we compared the two alternative methods of selecting the RBFs from the candidates generated by the tree: standard forward selection, or the method described in section 4.5 which uses the tree to guide the order in which candidates are considered. This was the only difference between the two runs; the model parameters were the same as in the first row of table 1. The performance of tree-guided selection was as in table 1, but forward selection was significantly worse.

4.8 Conclusions

We have described a method for nonparametric regression based on combining regression trees and radial basis function networks. The method is similar to [8] and has the same advantages (a continuous model and automatic relevance determination) but also some significant improvements. The main enhancement is the addition of an automatic method for the control of model complexity through the selection of RBFs. We have also developed a novel procedure for selecting the RBFs based on the structure of the tree.

We have presented evidence that the method is comparable in performance to the well known MARS algorithm and that some of its novel features (trial parameter values, tree-guided selection) are actually beneficial. More detailed evaluations with DELVE [21] data sets are in preparation and preliminary results support these conclusions.

The Matlab software package [19] has two implementations of the method. One function, rbf_rt_1, uses tree-guided selection, while the other, rbf_rt_2, uses forward selection. The operation of each function is described, with examples, in a comprehensive manual.


5 Appendix

A Applying the EM Algorithm

We want to maximise the marginal probability of the observed data (2.7) by substituting expectations of the conditional probability of the unobserved data (2.5) into the cost function for the joint probability of the combined data (2.4), and minimising this with respect to the parameters $\sigma^2$ (the noise variance) and $\varsigma^2$ (the a priori weight variance).

From (2.5), $\langle w \rangle = \hat{w}$ and $\langle (w - \hat{w})(w - \hat{w})^\top \rangle = W = \sigma^2 A^{-1}$. The expectation of $w^\top w$ is then

$$\langle w^\top w \rangle = \mathrm{tr}\, \langle w w^\top \rangle = \hat{w}^\top \hat{w} + \mathrm{tr}\, \langle w w^\top - \hat{w} \hat{w}^\top \rangle = \hat{w}^\top \hat{w} + \mathrm{tr}\, \langle (w - \hat{w})(w - \hat{w})^\top \rangle = \hat{w}^\top \hat{w} + \sigma^2\, \mathrm{tr}\, A^{-1} = \hat{w}^\top \hat{w} + \varsigma^2 (m - \gamma)\,. \tag{A.1}$$

The last step follows from $\gamma = m - \lambda\, \mathrm{tr}\, A^{-1}$ (the effective number of parameters) and $\lambda = \sigma^2/\varsigma^2$ (the regularisation parameter). Similarly,

$$\langle e^\top e \rangle = \mathrm{tr}\, \langle e e^\top \rangle = \hat{e}^\top \hat{e} + \mathrm{tr}\, \langle e e^\top - \hat{e} \hat{e}^\top \rangle = \hat{e}^\top \hat{e} + \mathrm{tr}\!\left(H \langle w w^\top - \hat{w} \hat{w}^\top \rangle H^\top\right) = \hat{e}^\top \hat{e} + \sigma^2\, \mathrm{tr}\!\left(H A^{-1} H^\top\right) = \hat{e}^\top \hat{e} + \sigma^2 \gamma\,, \tag{A.2}$$

since $e = y - H w$ is linear in $w$ and $\mathrm{tr}\!\left(H A^{-1} H^\top\right)$ is another expression for the effective number of parameters $\gamma$.

Equations (A.1, A.2) summarise the expectation of the conditional probability for $w$ and can be substituted into the joint probability of the combined data, or the equivalent cost function (2.4), so that the resulting expression can be optimised with respect to $\sigma^2$ and $\varsigma^2$. Note that in (A.1, A.2) these parameters are held constant at their old values; only the explicit occurrences of $\sigma^2$ and $\varsigma^2$ in (2.4) are varied in the optimisation.

After differentiating (2.4) with respect to $\sigma^2$ and $\varsigma^2$, equating the results to zero, and finally substituting the expectations (A.1, A.2), we get the re-estimation formulae

$$\sigma^2 \leftarrow \frac{\hat{e}^\top \hat{e} + \gamma\, \sigma^2}{p}\,, \qquad \varsigma^2 \leftarrow \frac{\hat{w}^\top \hat{w} + (m - \gamma)\, \varsigma^2}{m}\,.$$


B The Eigensystem of $H H^\top$

We want to derive expressions for each of the terms in (3.5) using the eigenvalues and eigenvectors of $H H^\top$. We start with a singular value decomposition of the design matrix, $H = U S V^\top$, where $U = [u_1\ u_2\ \cdots\ u_p] \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{m \times m}$ are orthogonal and $S \in \mathbb{R}^{p \times m}$,

$$S = \begin{bmatrix} \sqrt{\lambda_1} & & \\ & \ddots & \\ & & \sqrt{\lambda_m} \\ & \mathbf{0} & \end{bmatrix},$$

contains the singular values, $\{\sqrt{\lambda_i}\}$. Note that, due to the orthogonality of $V$,

$$H H^\top = U S S^\top U^\top = \sum_{i=1}^p \lambda_i\, u_i\, u_i^\top\,,$$

so the $\lambda_i$ are the eigenvalues, and the $u_i$ the eigenvectors, of the matrix $H H^\top$. The eigenvalues are non-negative and, we assume, ordered from largest to smallest, so that if $p > m$ then $\lambda_i = 0$ for $i > m$. The eigenvectors are orthonormal ($u_i^\top u_{i'} = \delta_{i i'}$).

We want to derive expressions for the terms in (3.5) using just these eigenvalues and eigenvectors. As a preliminary step, we derive some more basic relations. First, the matrix inverse involved in each re-estimation is

$$A^{-1} = \left(H^\top H + \lambda\, I_m\right)^{-1} = \left(V S^\top S V^\top + \lambda\, V V^\top\right)^{-1} = V \left(S^\top S + \lambda\, I_m\right)^{-1} V^\top\,. \tag{B.1}$$

Note that the second step would have been impossible if the regularisation term, $\lambda\, I_m$, had not been proportional to the identity matrix, which is where the analysis breaks down in the case of multiple regularisation parameters. Secondly, the optimal weight vector is

$$\hat{w} = A^{-1} H^\top y = V \left(S^\top S + \lambda\, I_m\right)^{-1} S^\top U^\top y = V \left(S^\top S + \lambda\, I_m\right)^{-1} S^\top \hat{y}\,, \tag{B.2}$$

where $\hat{y} = U^\top y$ is the vector of projections of $y$ onto the eigenbasis $U$.


Thirdly, from (B.1) we can further derive

$$\gamma = m - \lambda\, \mathrm{tr}\, A^{-1} = m - \lambda\, \mathrm{tr}\!\left[V \left(S^\top S + \lambda I_m\right)^{-1} V^\top\right] = m - \lambda\, \mathrm{tr}\!\left(S^\top S + \lambda I_m\right)^{-1} = m - \sum_{j=1}^m \frac{\lambda}{\lambda_j + \lambda} = \sum_{j=1}^m \frac{\lambda_j}{\lambda_j + \lambda} = \sum_{i=1}^p \frac{\lambda_i}{\lambda_i + \lambda}\,. \tag{B.3}$$

Here we have assumed $p \ge m$, so the last step follows (for $\lambda > 0$) because if $p > m$ then the last $p - m$ eigenvalues are zero. However, the conclusion is also true if $p < m$, since in that case the last $m - p$ singular values are annihilated in the product $S^\top S$.

Fourthly, and last of the preliminary calculations, the vector of residual errors is

$$\hat{e} = y - H \hat{w} = \left[I_p - U S \left(S^\top S + \lambda I_m\right)^{-1} S^\top U^\top\right] y = U \left[I_p - S \left(S^\top S + \lambda I_m\right)^{-1} S^\top\right] \hat{y}\,. \tag{B.4}$$

Now we are ready to tackle the terms in (3.5). From (B.3) we have

$$p - \gamma = p - \sum_{i=1}^p \frac{\lambda_i}{\lambda_i + \lambda} = \sum_{i=1}^p \frac{\lambda}{\lambda_i + \lambda}\,. \tag{B.5}$$

From (B.1), and a set of steps similar to the derivation of (B.3), it follows that

$$\mathrm{tr}\, A^{-1} - \lambda\, \mathrm{tr}\, A^{-2} = \sum_{j=1}^m \frac{1}{\lambda_j + \lambda} - \sum_{j=1}^m \frac{\lambda}{(\lambda_j + \lambda)^2} = \sum_{j=1}^m \frac{\lambda_j}{(\lambda_j + \lambda)^2} = \sum_{i=1}^p \frac{\lambda_i}{(\lambda_i + \lambda)^2}\,. \tag{B.6}$$


The last step follows in a similar way to the last step of (B.3). Next we tackle the term $\hat{w}^\top A^{-1} \hat{w}$. From (B.1) and (B.2) we get

$$\hat{w}^\top A^{-1} \hat{w} = \hat{y}^\top S \left(S^\top S + \lambda I_m\right)^{-3} S^\top \hat{y} = \sum_{i=1}^p \frac{\lambda_i\, \hat{y}_i^2}{(\lambda_i + \lambda)^3}\,. \tag{B.7}$$

The sum of squared residual errors is, from (B.4),

$$\hat{e}^\top \hat{e} = \hat{y}^\top \left[I_p - S \left(S^\top S + \lambda I_m\right)^{-1} S^\top\right]^2 \hat{y} = \sum_{j=1}^m \frac{\lambda^2\, \hat{y}_j^2}{(\lambda_j + \lambda)^2} + \sum_{i=m+1}^p \hat{y}_i^2 = \sum_{i=1}^p \frac{\lambda^2\, \hat{y}_i^2}{(\lambda_i + \lambda)^2}\,. \tag{B.8}$$

For this derivation we assumed that $p \ge m$ but, for reasons similar to those stated for the derivation of (B.3), the result is also true for $p < m$.

Equations (B.5)-(B.8) express each of the four terms in (3.5) using the eigenvalues and eigenvectors of $H H^\top$, which was our main goal in this appendix. Other useful expressions involving the eigensystem of $H H^\top$ are

$$\ln |P| = \sum_{i=1}^p \ln\!\left(\frac{\sigma^2}{\varsigma^2 \lambda_i + \sigma^2}\right) = p \ln \sigma^2 - \sum_{i=1}^p \ln\!\left(\varsigma^2 \lambda_i + \sigma^2\right),$$

$$y^\top P y = \sum_{i=1}^p \frac{\sigma^2\, \hat{y}_i^2}{\varsigma^2 \lambda_i + \sigma^2}\,,$$

where $P = I_p - H A^{-1} H^\top$, $\sigma^2$ is the noise variance and $\varsigma^2$ is the a priori variance of the weights (see section 2). For example, if these expressions are substituted in equation (2.8) for the cost function associated with the marginal likelihood of the data, the two $p \ln \sigma^2$ terms cancel, leaving

$$E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2} = \sum_{i=1}^p \ln\!\left(\varsigma^2 \lambda_i + \sigma^2\right) + \sum_{i=1}^p \frac{\hat{y}_i^2}{\varsigma^2 \lambda_i + \sigma^2}\,.$$
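These identities are easy to check numerically. The short sketch below builds a random design matrix and verifies (B.3) and (B.5)-(B.8), plus the $\ln|P|$ and $y^\top P y$ expressions, against their direct matrix forms; the random data, sizes and parameter values are of course arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, lam = 30, 8, 0.1                      # lam is the regularisation parameter
H = rng.normal(size=(p, m))
y = rng.normal(size=p)

# direct matrix quantities
A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
w_hat = A_inv @ H.T @ y
e_hat = y - H @ w_hat
gamma = m - lam * np.trace(A_inv)
P = np.eye(p) - H @ A_inv @ H.T

# eigensystem of H H^T and projections of y
lam_i, U = np.linalg.eigh(H @ H.T)
y_hat = U.T @ y
d = lam_i + lam

assert np.isclose(gamma, np.sum(lam_i / d))                                # (B.3)
assert np.isclose(p - gamma, np.sum(lam / d))                              # (B.5)
assert np.isclose(np.trace(A_inv) - lam * np.trace(A_inv @ A_inv),
                  np.sum(lam_i / d**2))                                    # (B.6)
assert np.isclose(w_hat @ A_inv @ w_hat, np.sum(lam_i * y_hat**2 / d**3))  # (B.7)
assert np.isclose(e_hat @ e_hat, np.sum(lam**2 * y_hat**2 / d**2))         # (B.8)

# marginal-likelihood pieces, with sigma2 and zeta2 chosen so that lam = sigma2/zeta2
sigma2, zeta2 = 0.05, 0.5
assert np.isclose(sigma2 / zeta2, lam)
assert np.isclose(np.linalg.slogdet(P)[1],
                  np.sum(np.log(sigma2 / (zeta2 * lam_i + sigma2))))       # ln|P|
assert np.isclose(y @ P @ y,
                  np.sum(sigma2 * y_hat**2 / (zeta2 * lam_i + sigma2)))    # y'Py
```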


References

[1] A.R. Barron and X. Xiao. Discussion of "Multivariate adaptive regression splines" by J.H. Friedman. Annals of Statistics, 19(1), 1991.

[2] C.M. Bishop, M. Svensén, and C.K.I. Williams. EM optimization of latent-variable density models. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.

[3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39(1):1-38, 1977.

[6] J.H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19(1):1-141, 1991.

[7] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[8] M. Kubat. Decision trees can initialize radial-basis function networks. IEEE Transactions on Neural Networks, 9(5):813-821, 1998.

[9] M. Kubat and I. Ivanova. Initialization of RBF networks with decision trees. In Proc. of the Belgian-Dutch Conf. on Machine Learning, BENELEARN'95, 1995.

[10] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[11] D.J.C. MacKay. Comparison of approximate methods of handling hyperparameters. Accepted for publication by Neural Computation, 1999.

[12] J.E. Moody. The effective number of parameters: an analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo, CA, 1992.

[13] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1998.

[14] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.

[15] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.


[16] M.J.L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/papers/intro.ps.

[17] M.J.L. Orr. Matlab routines for subset selection and ridge regression in linear neural networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/software/rbf.zip.

[18] M.J.L. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.

[19] M.J.L. Orr. Matlab functions for radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1999. Download from www.anc.ed.ac.uk/~mjo/software/rbf2.zip.

[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[21] C.E. Rasmussen, R.M. Neal, G.E. Hinton, D. van Camp, Z. Ghahramani, M. Revow, R. Kustra, and R. Tibshirani. The DELVE Manual, 1996. http://www.cs.utoronto.ca/~delve/.

[22] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. Technical report, Neural Computing Research Group, Aston University, UK, 1997.

[23] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Technical report, Neural Computing Research Group, Aston University, UK, 1997.