Download - Introduction to Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Content Overview The Models of Function Approximator The Radial Basis Function Networks RBFN’s for Function Approximation The Projection Matrix Learning the Kernels Bias-Variance Dilemma The Effective Number of Parameters Model Selection Incremental Operations


Overview

Typical Applications of NN

Pattern Classification

Function Approximation

Time-Series Forecasting

( )l f x l C N

( )fy x mY R y

mX R x

nX R x

1 2 3( ) ( , , , )t t tft x x x x

Function Approximation

mX R f nY R

:f X YUnknown

mX R nY R

ˆ :f X YApproximatorf

NeuralNetwork

Supervised Learning

UnknownFunction

ixiy

ˆ iy1,2,i +

+

ie

Neural Networks as Universal Approximators Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.

– Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.

Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators.– Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Ra

dial-Basis-Function Networks," Neural Computation, 3(2), 246-257.– Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-F

unction Networks," Neural Computation, 5(2), 305-316.

Statistics vs. Neural Networks

Statistics Neural Networksmodel networkestimation learningregression supervised learninginterpolation generalizationobservations training setparameters (synaptic) weightsindependent variables inputsdependent variables outputsridge regression weight decay


The Model ofFunction Approximator

Linear Models

1

(( ) )ii

i

m

f w

x xFixed BasisFunctions

Weights

Linear Models

Feature VectorsInputs

HiddenUnits

OutputUnits

• Decomposition• Feature Extraction• Transformation

Linearly weighted output

1

(( ) )ii

i

m

f w

x x

y

1 2 m

x1 x2 xn

w1 w2 wm

x =

Linear Models

Feature VectorsInputs

HiddenUnits

OutputUnits

• Decomposition• Feature Extraction• Transformation

Linearly weighted output

y

1 2 m

x1 x2 xn

w1 w2 wm

x =

Can you say some bases?1

(( ) )ii

i

m

f w

x x

Example Linear Models

Polynomial

Fourier Series

( ) ii

iwf xx

0exp 2( )k

k jf w k xx

Are they orthogonal bases?

( ) , 0,1,2,ii ix x

0( ) exp 2 0,1,, 2, k x j k x k

1

(( ) )ii

i

m

f w

x x

y

1 2 m

x1 x2 xn

w1 w2 wm

x =

Single-Layer Perceptrons as Universal Aproximators

HiddenUnits

With sufficient number of sigmoidal units, it can be a universal approximator.

1

(( ) )ii

i

m

f w

x x

y

1 2 m

x1 x2 xn

w1 w2 wm

x =

Radial Basis Function Networks as Universal Aproximators

HiddenUnits

With sufficient number of radial-basis-function units, it can also be a universal approximator.

1

(( ) )ii

i

m

f w

x x

Non-Linear Models

1

(( ) )ii

i

m

f w

x xAdjusted by the

Learning process

Weights

1

(( ) )ii

i

m

f w

x x


The Radial Basis Function Networks

Radial Basis Functions1

(( ) )ii

i

m

f w

x x

Center

Distance Measure Shape

Three parameters for a radial function:

xi

r = ||x xi||

i(x)= (||x xi||)

Typical Radial Functions

Gaussian

Hardy Multiquadratic

Inverse Multiquadratic

222 0 and

r

r e r

2 2 0 and r r c c c r

2 2 0 and r c r c c r

Gaussian Basis Function (=0.5,1.0,1.5)

222 0 and

r

r e r

0.5 1.0

1.5

Inverse Multiquadratic

00.10.20.30.40.50.60.70.80.9

1

-10 -5 0 5 10

2 2 0 and r c r c c r

c=1c=2

c=3c=4

c=5

+ +

Most General RBF

1( ) ( )i iT

i μ x μx x

+1μ

2μ 3μ

Basis {i: i =1,2,…} is `near’ orthogonal.

Properties of RBF’sOn-Center, Off Surround

Analogies with localized receptive fields found in several biological structures, e.g.,– visual cortex;– ganglion cells

The Topology of RBF

Feature Vectorsx1 x2 xn

y1 ym

Inputs

HiddenUnits

OutputUnits

Projection

Interpolation

As a function approximator ( )y f x

The Topology of RBF

Feature Vectorsx1 x2 xn

y1 ym

Inputs

HiddenUnits

OutputUnits

Subclasses

Classes

As a pattern classifier.


RBFN’s forFunction Approximation

The idea

x

y Unknown Function to Approximate

TrainingData

The idea

x

y Unknown Function to Approximate

TrainingData

Basis Functions (Kernels)

The idea

x

y


Function Learned

1

)) ((m

iiiwy f

xx

The idea

x

y


Function Learned

NontrainingSample

1

)) ((m

iiiwy f

xx

The idea

x

y Function Learned

NontrainingSample

1

)) ((m

iiiwy f

xx

Radial Basis Function Networks as Universal Aproximators

1

( ) ( )m

ii

iy wf

x x

x1 x2 xn

w1 w2 wm

x =

Training set ( )

1

( ),pk

k

ky

xT

Goal (( ))k kfy x for all k

2

( )

1

( )min kp

k

k

SSE fy

x

2

)(

1

) (

1

p mk

ik

ki

i

y w

x

Learn the Optimal Weight Vector

w1 w2 wm

x1 x2 xnx =

Training set ( )

1

( ),pk

k

ky

xT


1

( ) ( )m

ii

iy wf

x x

2

( )

1

( )min kp

k

k

SSE fy

x

2

)(

1

) (

1

p mk

ik

ki

i

y w

x

2

( )

1

( )min kp

k

k

SSE fy

x

2

)(

1

) (

1

p mk

ik

ki

i

y w

x

Regularization

Training set ( )

1

( ),pk

k

ky

xT


2

1

m

iiiw

2

1

m

iiiw

0i

If regularization is unneeded, set 0i

C


2 2( )

1

( )

1

p mk

ki

ii

k wfyC

xMinimize

1

( ) ( )m

ii

iwf

x x

(( )

) ( )

1

2 2kp

jk

j

kk

j j

wfC f

w wy

x

x

( ) ( ) ( )

1

2 2p

k kj

kjj

k f wy

x x

0

( ) * ( ) ( )

1 1

)* (p p

k k kj j

k kjj

kwf y

x x x

*

1

*( ) ( )i

m

ii

wf

x x


(1) ( ), ,Tp

j j j φ x x

* * (1) * ( ), ,Tpf ff x xDefine

(1) ( ), ,Tpy yy

* *j

T Tj jjw φ f φ y 1, ,j m

( ) * ( ) ( )

1 1

)* (p p

k k kj j

k kjj

kwf y

x x x

1

( ) ( )m

ii

iwf

x x *

1

*( ) ( )i

m

ii

wf

x x


*1 1

*2 2

*1

*

*

1

2 2

*

T T

T T

T Tm mmm

w

w

w

φ f φ yφ f φ y

φ f φ y

* *j

T Tj jjw φ f φ y 1, ,j m

1 2, , , mΦ φ φ φ

1

2

m

Λ

* * *1 2* , , ,

T

mw w ww

Define

**T T wΦ Λf Φ y


**T T wΦ Λf Φ y

* (1)

* (2)*

* ( )p

f

f

f

x

xf

x

*

1

*( ) ( )i

m

ii

wf

x x

(1)

1

(2)

1

(

*

*

1

*

)

k

k

m

kk

m

kk

mP

kk

k

w

w

w

x

x

x

(1) (1) (1)*1 21

(2) (2) (2) *1 2 2

*( ) ( ) ( )

1 2

m

m

p p p mm

ww

w

x x x

x x x

x x x

*Φw

* *T T ΛwΦ wΦ Φ y


* *T T ΛwΦ wΦ Φ y

*T T wΦ ΛΦ Φ y

1* T T Λw Φ Φ Φ y

Variance Matrix

1 TA Φ y

1 :ADesign Matrix:Φ

Summary

1* T T Λw Φ Φ Φ y 1 TA Φ y

(1)

(2)

( )

j

jj

pj

x

xφ

x

(1)

(2)

( )p

yy

y

y

1 2, , , mΦ φ φ φ

*1*2

*

*

m

ww

w

w

1

2

m

Λ

1

)) ((m

iiiwy f

xx

Training set ( )

1

( ),pk

k

ky

xT


The Projection Matrix

The Empirical-Error Vector

1* Tw A Φ y

*1w *

2w *mw

( )kx ( )1

kx ( )2

kx ( )knx( )kx ( )

1kx ( )

2kx ( )k

nx

UnknownFunction

y ** f Φw

1φ 2φ mφ

The Empirical-Error Vector

1* Tw A Φ y

*1w *

2w *mw

( )kx ( )1

kx ( )2

kx ( )knx( )kx ( )

1kx ( )

2kx ( )k

nx

UnknownFunction

y ** f Φw

1φ 2φ mφError Vector

* e y f * y Φw 1 T ΦAy Φ y 1 T

p AI Φ Φ yPy

Sum-Squared-Error

e Py1 T

p P I Φ ΦA

T A Φ Φ Λ

1 2, , , mΦ φ φ φError Vector

T Py Py2Ty P y

2( ) * ( )

1

pk k

k

SSE y f

x

If =0, the RBFN’s learning algorithm is to minimize SSE (MSE).

The Projection Matrix

e Py1 T

p P I Φ ΦA

T A Φ Φ Λ

1 2, , , mΦ φ φ φError Vector

2TSSE y P y

1 2( , , , )mspane φ φ φΛ 0

( )P Py Pee Py

Ty Py


Learning the Kernels

RBFN’s as Universal Approximators

x1 x2 xn

y1 yl

12 m

w11 w12w1m

wl1 wl2wlml

Training set ) ( )(

1,

pk

k

k

yxT

2

2( ) exp2

jj

j

x μx

Kernels

What to Learn?

Weights wij’s Centers j’s of j’s Widths j’s of j’s Number of j’s Model Selection

x1 x2 xn

y1 yl

12 m

w11 w12w1m

wl1 wl2wlml

One-Stage Learning

( ) ( ) ( )1

k k kij i i jw y f x x

( ) ( )

1

lk k

i ij jj

f w

x x

( ) ( ) ( )2 2

1

ljk k k

j j ij i iij

w y f

x μμ x x

2

( ) ( ) ( )3 3

1

ljk k k

j j ij i iij

w y f

x μx x

One-Stage Learning ( ) ( )

1

lk k

i ij jj

f w

x xThe simultaneous updates of all three sets of parameters may be suitable for non-stationary environments or on-line setting.

( ) ( ) ( )1

k k kij i i jw y f x x

( ) ( ) ( )2 2

1

ljk k k

j j ij i iij

w y f

x μμ x x

2

( ) ( ) ( )3 3

1

ljk k k

j j ij i iij

w y f

x μx x

x1 x2 xn

y1 yl

12 m

w11 w12w1m

wl1 wl2wlml

Two-Stage Training

Step 1

Step 2

Determines Centers j’s of j’s. Widths j’s of j’s. Number of j’s.

Determines wij’s.E.g., using batch-learning.

Train the Kernels

+

+

+

+

+

Unsupervised Training

Subset Selection– Random Subset Selection– Forward Selection– Backward Elimination

Clustering Algorithms– KMEANS– LVQ

Mixture Models– GMM

Methods

Subset Selection

Random Subset Selection Randomly choosing a subset of points

from training set

Sensitive to the initially chosen points.

Using some adaptive techniques to tune– Centers– Widths– #points

Clustering Algorithms

Partition the data points into K clusters.


Is such a partition satisfactory?

Clustering AlgorithmsHow about this?

+

++

+


1

2

3

4


Bias-Variance Dilemma

Goal Revisit

Minimize Prediction Error• Ultimate Goal Generalization

• Goal of Our Learning Procedure

Minimize Empirical Error

Badness of Fit Underfitting

– A model (e.g., network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting.– Produces excessive bias in the outputs.

Overfitting– A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting.– Produces excessive variance in the outputs.

Underfitting/Overfitting Avoidance

Model selection Jittering Early stopping Weight decay

– Regularization– Ridge Regression

Bayesian learning Combining networks

Best Way to Avoid Overfitting

Use lots of training data, e.g.,– 30 times as many training cases as there are weights in the network.– for noise-free data, 5 times as many training cases as weights may be sufficient.

Don’t arbitrarily reduce the number of weights for fear of underfitting.

Badness of FitUnderfit Overfit


Underfit Overfit

Large bias Small biasSmall variance Large variance

However, it's not really a dilemma.

Underfit Overfit

Large bias Small biasSmall variance Large variance

Bias-Variance DilemmaMore on overfitting Easily lead to predictions that are far beyond the range of the training data. Produce wild predictions in multilayer perceptrons even with noise-free data.

Bias-Variance DilemmaIt's not really a dilemma.

fit moreoverfitmoreunderfit

Bias

Variance


Sets of functions E.g., depend on # hidden nodes used.

The true modelnoise

bias

Solution obtained with training set 2.



biasbias

The mean of the bias=?The variance of the bias=?



The true modelnoise

The mean of the bias=?The variance of the bias=?

biasVariance

Model Selection


The true modelnoise

biasVariance

Reduce the effective number of parameters.Reduce the number of hidden nodes.



The true modelnoise

bias ( )g x( ) ( )y g x x

( )f x

Goal: 2min ( ) ( )E y f x x

Bias-Variance DilemmaGoal: 2min ( ) ( )E y f x x

2( ) ( )E y f x x 2( ) ( )E g f x x

2 2( ) ( ) 2 ( ) ( )E g f g f x x x x

2 2( ) ( ) 2 ( ) ( )E g f E g f E x x x x

Goal: 2min ( ) ( )E g f x x 2min ( ) ( )E y f x x

0 constant

Bias-Variance Dilemma 2( ) ( )E g f x x 2[ ( )] [( ) () )( ]E f E fE g f x xx x

2[ ]( )) (E fE g x x 2[ ]( )) (E fE f x x

[ ( )] [ ( )) ( ]2 ( )E g fE f E f x xx x

2 2[ ( )] [ ( )]( ) ( ) [ ( )] [2 ( ) ( ) ( )]E f E f E f EE g g f ff x x x xx x x x

( ) ( )[ ( )] [ ( )]E E f Eg f f x xxx

( ) ( )E g f x x [ )]( ()EE fg x x [ ( ()] )E E f f x x [ ( )] [ ( )]E f E fE x x

( ) ( )E g E f x x [ )]( () EE fg x x [ ( )] ( )EE f f x x [ ( )] [ ( )]E f E f x x 0

2 22( ) ( ) ( ) ( )E y f E E g f x x x x

0


2 22( ) ( ) ( ) ( )E y f E E g f x x x x

2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f x x xx


noise bias2 variance

Cannot be minimized

Minimize both bias2 and variance

Model Complexity vs. Bias-Variance

2 22( ) ( ) ( ) ( )E y f E E g f x x x x

2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f x x xx



ModelComplexity(Capacity)


2 22( ) ( ) ( ) ( )E y f E E g f x x x x

2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f x x xx



ModelComplexity(Capacity)

Example (Polynomial Fits)22exp( 16 )y x x

Example (Polynomial Fits)

Example (Polynomial Fits)

Degree 1 Degree 5

Degree 10 Degree 15


The Effective Number of Parameters

Variance Estimation

Mean

Variance 22

1

1ˆp

ii

xp

In general, not available.

Variance Estimation

Mean1

1ˆp

ii

x xp

Variance 22 2

1

ˆ1

1 ˆp

ii

s xp

Loss 1 degree of freedom

Simple Linear Regression

0 1y x 2~ (0, )N

01

yx

y

x

Simple Linear Regression

01

yx

y

x

01ˆ

ˆ

y

x

0 1y x 2~ (0, )N

21

ˆp

i ii

SSE y y

Minimize

Mean Squared Error (MSE)

01

yx

y

x

01ˆ

ˆ

y

x

0 1y x 2~ (0, )N

21

ˆp

i ii

SSE y y

Minimize

2

2ˆ SSEMSE

p

Loss 2 degrees

of freedom

Variance Estimation

2ˆ SSEMSEmp

Loss m degrees of freedom

m: #parameters of the model

The Number of Parameters

w1 w2 wm

x1 x2 xnx =

1

( ) ( )m

ii

iy wf

x x

m#degrees of freedom:

The Effective Number of Parameters ()

w1 w2 wm

x1 x2 xnx =

1

( ) ( )m

ii

iy wf

x x The projection Matrix1 T

p P I Φ ΦA

T A Φ Φ Λ

Λ 0( )p trace P m

The Effective Number of Parameters ()

The projection Matrix1 T

p P I Φ ΦA

T A Φ Φ Λ

Λ 0

1( ) Tptrace trace AP I Φ Φ

1 Tp trace ΦA Φ

( ) ( ) ( )trace trace trace A B A BFacts:( ) ( )trace traceAB BA

1 TTp trace

Φ Φ ΦΦ

Pf)

1T Tp trace

Φ ΦΦ Φ

mp trace I

p m ( )p trace P m

Regularization

((

1

)2

)p

k

k

k fy

x2

1

m

iiiw

Empirial

ErModel spenalror t

sy

Co t

2

( )

1 1

( )p m

ki

k

k

iiwy

x

Penalize models withlarge weightsSSE

( )p trace PThe effective number of parameters:

Regularization

((

1

)2

)p

k

k

k fy

x2

1

m

iiiw

Empirial

ErModel spenalror t

sy

Co t

2

( )

1 1

( )p m

ki

k

k

iiwy

x

Penalize models withlarge weightsSSE Without penalty

(i=0), there are m degrees of freedom

to minimize SSE (Cost).

The effective number of

parameters =m.

The effective number of parameters:( )p trace P

Regularization

((

1

)2

)p

k

k

k fy

x2

1

m

iiiw

Empirial

ErModel spenalror t

sy

Co t

2

( )

1 1

( )p m

ki

k

k

iiwy

x


With penalty (i>0), the liberty to

minimize SSE will be reduced.

The effective number of

parameters <m.


Variance Estimation

2ˆ MSE

Loss degrees of freedom


SSEp

Variance Estimation

2ˆ MSE


( )SSE

trace

P


Model Selection

Model Selection Goal

– Choose the fittest model Criteria

– Least prediction error Main Tools (Estimate Model Fitness)

– Cross validation– Projection matrix

Methods– Weight decay (Ridge regression)– Pruning and Growing RBFN’s

Empirical Error vs. Model Fitness

Minimize Prediction Error• Ultimate Goal Generalization

• Goal of Our Learning Procedure

Minimize Empirical Error

Minimize Prediction Error

(MSE)

Estimating Prediction ErrorWhen you have plenty of data use

independent test sets– E.g., use the same training set to train

different models, and choose the best model by comparing on the test set.

When data is scarce, use– Cross-Validation– Bootstrap

Cross Validation Simplest and most widely used method

for estimating prediction error.

Partition the original set into several different ways and to compute an average score over the different partitions, e.g.,– K-fold Cross-Validation– Leave-One-Out Cross-Validation– Generalize Cross-Validation

K-Fold CVSplit the set, say, D of available input-output patterns into k mutually exclusive subsets, say D1, D2, …, Dk.Train and test the learning algorithm k times, each time it is trained on D\Di and tested on Di.

K-Fold CV

Available Data

K-Fold CV

Available DataD1 D2 D3. . . Dk

D1 D2 D3. . . Dk

D1 D2 D3. . . Dk

D1 D2 D3. . . Dk

D1 D2 D3. . . Dk

Test Set

Training Set

Estimate 2

Leave-One-Out CVSplit the p available input-output

patterns into a training set of size p1 and a test set of size 1.

Average the squared error on the left-out pattern over the p possible ways of partition.

A special case of k-fold CV.

Error Variance Predicted by LOO


22

1

1ˆ ( )p

LOO i i ii

y fp

x

( , ) : 1, 2, ,k kD y k p x

\ ( , )i i iD D y x

Available input-output patterns.

1, ,i p Training sets of LOO.

if Function learned using Di as training set.

The estimate for the variance of prediction error using LOO:

Error-square for the left-out element.



22

1

1ˆ ( )p

LOO i i ii

y fp

x

( , ) : 1, 2, ,k kD y k p x

\ ( , )i i iD D y x






Given a model, the function with least empirical error for Di.

As an index of model’s fitness.We want to find a model also minimize

this.



22

1

1ˆ ( )p

LOO i i ii

y fp

x

( , ) : 1, 2, ,k kD y k p x

\ ( , )i i iD D y x






How to estimate?Are there any efficient ways?


22

1

1ˆ ( )p

LOO i i ii

y fp

x


2 2(1ˆ ˆ ( )) ˆTLOO di

pag y PP Py

Generalized Cross-Validation

2 2(1ˆ ˆ ( )) ˆTLOO di

pag y PP Py

( )( ) ptracediag

p

PP I

2

22

ˆ ˆˆ( )

T

GCVp

trace

y P yP

2

2

ˆ ˆ( )

Tpp

y P y

More Criteria Based on CV2

22

ˆ ˆˆ( )

T

GCVpp

y P y GCV(Generalized CV)

22 ˆ ˆˆ

T

UEV p

y P y UEV

(Unbiased estimate of variance)

22 ˆ ˆˆ

T

FPEpp p

y P y FPE(Final Prediction Error)

Akaike’sInformation Criterion

BIC(Bayesian Information Criterio) 22 ln( ) 1 ˆ ˆˆ

T

BIC

p pp p

y P y


22

ˆ ˆˆ( )

T

GCVpp

y P y

22 ˆ ˆˆ

T

UEV p

y P y

22 ˆ ˆˆ

T

FPEpp p

y P y

22 ln( ) 1 ˆ ˆˆ

T

BIC

p pp p

y P y

2ÛEVp

p

2ÛEVp

p

2ln( ) 1ÛEV

p pp

2 2 2 2ˆ ˆ ˆ ÛEV FPE GCV BIC


22

ˆ ˆˆ( )

T

GCVpp

y P y

22 ˆ ˆˆ

T

UEV p

y P y

22 ˆ ˆˆ

T

FPEpp p

y P y

22 ln( ) 1 ˆ ˆˆ

T

BIC

p pp p

y P y

2ÛEVp

p

2ÛEVp

p

2ln( ) 1ÛEV

p pp

2 2 2 2ˆ ˆ ˆ ÛEV FPE GCV BIC

?P

Regularization

((

1

)2

)p

k

k

k fy

x2

1

m

iiiw

Empirial

ErModel spenalror t

sy

Co t

2

( )

1 1

( )p m

ki

k

k

iiwy

x


Standard Ridge Regression,

,i i

Regularization

((

1

)2

)p

k

k

k fy

x2

1

m

iiiw

Empirial

ErModel spenalror t

sy

Co t

2

( )

1 1

( )p m

ki

k

k

iiwy

x


2

1

m

iiw

Standard Ridge Regression,

,i i

Solution Review

2(( ) )

1 1

2

1

min p m m

ki

k

ki i

i i

C w wy

x

1* Tw A Φ yT

m Φ Φ IA1 T

p P I Φ ΦA

Used to compute model selection criteria

Example22( ) (1 2 ) xy x x x e

50p

2~ (0,0.2 )N

Width of RBF r = 0.5

Example2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC

22( ) (1 2 ) xy x x x e

50p

2~ (0,0.2 )N


Example2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC

22( ) (1 2 ) xy x x x e

50p

2~ (0,0.2 )N


How the determine the

optimal regularization

parameter effectively?

Optimizing the Regularization Parameter

Re-Estimation Formula

2 1 2

1

ˆˆ

( )

T

T

trace

trace

A

w

Ay P y

A Pw

Local Ridge Regression

Re-Estimation Formula

2 1 2

1

ˆˆ

( )

T

T

trace

trace

A

w

Ay P y

A Pw

Examplesin(12 )y x 2~ (0,0.1 )N

50m p

— Width of RBF

( )p trace P

0.05r


50m p

— Width of RBF0.05r

There are two local-minima. 2 1 2

1

ˆˆ

( )

T

T

trace

trace

A

w

Ay P y

A Pw

Using the about re-estimation formula, it will be stuck at the nearest local minimum.

That is, the solution depends on the initial setting.


50m p



1

ˆˆ

( )

T

T

trace

trace

A

w

Ay P y

A Pw

ˆ(0) 0.01

lim 0.1ˆ( )t

t


50m p



1

ˆˆ

( )

T

T

trace

trace

A

w

Ay P y

A Pw

5ˆ(0) 10

4ˆ( )lim 2.1 10t

t


50m p


RMSE: Root Mean Squared ErrorIn real case, it is not available.


2

(( ) )

1 1

2

1

min p m m

ki

k

ki i

i i

C w wy

x

Standard Ridge Regression

2

(( ) )

1 1

2

1

min p m m

i ik

ik i

ik

i

C w wy

x



2

(( ) )

1 1

2

1

min p m m

ki

k

ki i

i i

C w wy

x


2

(( ) )

1 1

2

1

min p m m

i ik

ik i

ik

i

C w wy

x


j implies that j() can be removed.

The Solutions1* Tw A Φ y

Tm Φ Φ IA

1 Tp

P I Φ ΦAUsed to compute model selection criteria

T A Φ Φ Λ

TA Φ Φ Linear Regression



Optimizing the Regularization Parameters

Incremental OperationP: The current projection Matrix.

Pj: The projection Matrix obtained by removing j(). T

j j j jj T

j j j j

P φ φ P

P Pφ P φ

2

22

ˆ ˆˆ( )

T

GCVp

trace

y P yP


2

22

ˆ ˆˆ( )

T

GCVp

trace

y P yP

Solve 2ˆ

0GCV

j

Subject to 0j

Tj j j j

j Tj j j j

P φ φ PP P

φ P φ


2Tja y P y2T Tj j j jby P φ y P φ2 2( )T T

j j j j jc φ P φ y P φ

( )jtrace P2T

j j j φ P φ

if if and ˆ

0 if and if

c bTj j j b a

c bTjj j j b a

c b c bT Tj j j j j jb a b a

a ba ba b

φ P φφ P φ

φ P φ φ P φ

Solve 2ˆ

0GCV

j

Subject to 0j


2Tja y P y2T Tj j j jby P φ y P φ2 2( )T T

j j j j jc φ P φ y P φ

( )jtrace P2T

j j j φ P φ

if if and ˆ

0 if and if

c bTj j j b a

c bTjj j j b a

c b c bT Tj j j j j jb a b a

a ba ba b

φ P φφ P φ

φ P φ φ P φ

Solve 2ˆ

0GCV

j

Subject to 0j Remove j()

The Algorithm

Initialize i’s. – e.g., performing standard ridge regression.

Repeat the following until GCV converges:– Randomly select j and compute– Perform local ridge regression– If GCV reduce & remove j()

ˆj

ˆj

References

Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI).

Mark J. L. Orr (April 1996), Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html.

http://www.anc.ed.ac.uk/~mjo/intro/intro.html

http://www.anc.ed.ac.uk/~mjo/intro/intro.html


Incremental Operations

Download - Introduction to Radial Basis Function Networks

Top Related