Optimizing number of hidden neurons in neural networks

Janusz A. Starzyk, School of Electrical Engineering and Computer Science, Ohio University, Athens, Ohio, U.S.A.
IASTED International Conference on Artificial Intelligence and Applications, Innsbruck, Austria, February 2007


Page 1: Optimizing number of hidden neurons in neural networks


Optimizing number of hidden neurons in neural networks

Janusz A. Starzyk
School of Electrical Engineering and Computer Science
Ohio University, Athens, Ohio, U.S.A.

IASTED International Conference on Artificial Intelligence and Applications, Innsbruck, Austria, February 2007

Page 2: Optimizing number of hidden neurons in neural networks


Outline

Neural networks – multi-layer perceptron
Overfitting problem
Signal-to-noise ratio figure (SNRF)
Optimization using signal-to-noise ratio figure
Experimental results
Conclusions

Page 3: Optimizing number of hidden neurons in neural networks


Neural networks – multi-layer perceptron (MLP)

Hidden layer: $y_1 = W_1 x$, $z_1 = f(y_1)$
Output layer: $y_2 = W_2 z_1$, $z_2 = f(y_2)$

Inputs x → Outputs z
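The layer equations can be sketched numerically; a minimal forward pass with a tanh activation and random weights (illustrative choices, not the talk's configuration):

```python
import numpy as np

def mlp_forward(x, W1, W2, f=np.tanh):
    """Two-layer MLP: y1 = W1 x, z1 = f(y1); y2 = W2 z1, z2 = f(y2)."""
    z1 = f(W1 @ x)   # hidden layer output
    z2 = f(W2 @ z1)  # network output
    return z2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))  # 2 inputs -> 4 hidden neurons
W2 = rng.normal(size=(1, 4))  # 4 hidden neurons -> 1 output
z = mlp_forward(np.array([0.5, -1.0]), W1, W2)
print(z.shape)  # (1,)
```

The number of rows of W1 is the number of hidden neurons, which is exactly the quantity the rest of the talk optimizes.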

Page 4: Optimizing number of hidden neurons in neural networks


Neural networks – multi-layer perceptron (MLP)

Efficient mapping from inputs to outputs

Powerful universal function approximation

Number of inputs and outputs determined by the data

Number of hidden neurons: determines the fitting accuracy and is therefore critical.

[Figure: MLP block mapping inputs to outputs; plot of training data and the MLP's function approximation]

Page 5: Optimizing number of hidden neurons in neural networks


Overfitting problem

Generalization: the trained model's ability to predict well on new data (x') not seen during training.

Overfitting: the model overestimates the function complexity, which degrades generalization capability.

Bias/variance dilemma: excessive hidden neurons lead to overfitting.

Training data (x, y) → MLP training → Model; new data x' → Model → prediction y'

[Figure: training data, desired function, overfitted function, testing set, and desired vs. predicted values for new data]

Page 6: Optimizing number of hidden neurons in neural networks


Overfitting problem

Avoid overfitting: cross-validation & early stopping

All available data (x, y) is split into training data (x, y) and testing data (x', y').

[Figure: training error e_train and testing error e_test vs. number of hidden neurons; the optimum number lies at the minimum of e_test]

Stopping criterion: e_test starts to increase, or e_train and e_test start to diverge.
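The cross-validation stopping rule can be sketched as a scan over the testing-error curve; the error values below are illustrative, not taken from the slides:

```python
def stopping_point(e_test):
    """Index at which the testing error starts to increase:
    the classic cross-validation / early-stopping criterion."""
    for k in range(1, len(e_test)):
        if e_test[k] > e_test[k - 1]:
            return k - 1          # last size before e_test turned up
    return len(e_test) - 1

# Illustrative error curves vs. number of hidden neurons:
# e_train keeps falling while e_test turns up past the optimum.
e_train = [0.9, 0.5, 0.3, 0.20, 0.15, 0.12]
e_test  = [1.0, 0.6, 0.4, 0.35, 0.45, 0.60]
print(stopping_point(e_test))  # 3
```

The slide's objection applies directly: this rule needs a held-out test set, and the next slides ask whether that data could be kept for training instead.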

Page 7: Optimizing number of hidden neurons in neural networks


Overfitting problem

How to divide available data?

All available data (x, y) is split into training data (x, y) and testing data (x', y').

[Figure: fitting error (e_train, e_test) vs. number of hidden neurons, with the optimum number marked]

When should training stop? The data set aside for testing is wasted for training. And can the test error reliably track the generalization error?

Page 8: Optimizing number of hidden neurons in neural networks


Overfitting problem

Desired:

•A quantitative measure of the unlearned useful information in e_train

•Automatic recognition of overfitting

[Figures: three fitting examples on training and testing data: an overfitted function, a cubic fit, and a well-fitted function, each shown with desired and predicted values on new data]

Page 9: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF)

Sampled data = function value + noise, so the error signal has an approximation error component and a noise component.
The noise part should not be learned; the useful signal should be reduced.
Assumptions: the underlying function is continuous and the noise is white Gaussian noise (WGN).
Signal-to-noise ratio figure (SNRF) = signal energy / noise energy.
Compare SNRF_e with SNRF_WGN.
When should learning stop? Continue if useful signal is left unlearned; stop if noise dominates the error signal.

Page 10: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF)– one-dimensional case

[Figure: training data with a quadratic approximating function, and the resulting error signal]

$e_i = s_i + n_i, \quad i = 1, 2, \dots, N$

The error is an approximation error component plus a noise component. How can the level of these two components be measured?

Page 11: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF) – one-dimensional case

With $e_i = s_i + n_i$, the error energy decomposes as $E_e = E_s + E_n$.

Neighboring samples of the useful signal are highly correlated, while WGN samples are uncorrelated: $C(n_i, n_{i+1}) \approx 0$. The energies can therefore be estimated from sample correlations:

$E_e = C(e_i, e_i) = \sum_{i=1}^{N} e_i^2$

$E_s = C(e_i, e_{i+1})$

$E_n = E_e - E_s = C(e_i, e_i) - C(e_i, e_{i+1})$

Page 12: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF) – one-dimensional case

$SNRF_e = \dfrac{E_s}{E_n} = \dfrac{C(e_i, e_{i+1})}{C(e_i, e_i) - C(e_i, e_{i+1})}$

For pure WGN:

$SNRF_{WGN} = \dfrac{C(n_i, n_{i+1})}{C(n_i, n_i) - C(n_i, n_{i+1})}$

$\mu_{SNRF\_WGN}(N) \approx 0, \quad \sigma_{SNRF\_WGN}(N) = \dfrac{1}{\sqrt{N}}$
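A minimal sketch of computing SNRF_e from an error signal, following the correlation estimates above (it uses the first N-1 neighbor pairs; the exact boundary handling is an assumption of this sketch):

```python
import numpy as np

def snrf_1d(e):
    """SNRF_e = C(e_i, e_{i+1}) / (C(e_i, e_i) - C(e_i, e_{i+1}))."""
    e = np.asarray(e, dtype=float)
    c_self = np.dot(e, e)           # C(e_i, e_i) = sum of e_i^2
    c_next = np.dot(e[:-1], e[1:])  # C(e_i, e_{i+1}): neighbor correlation
    return c_next / (c_self - c_next)

rng = np.random.default_rng(1)
n = rng.normal(size=2**16)             # pure WGN: SNRF near 0, spread ~ 1/sqrt(N)
s = np.sin(np.linspace(0, 20, 2**16))  # smooth signal: SNRF large
print(snrf_1d(n), snrf_1d(s))
```

A smooth residual has strongly correlated neighbors, driving the ratio up; a white-noise residual has nearly uncorrelated neighbors, driving it toward zero.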

Page 13: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF) – one-dimensional case

[Figure: histogram of SNRF for WGN with 2^16 samples; mean ≈ 0, standard deviation ≈ 0.0039]

Hypothesis test at the 5% significance level:

$th_{SNRF\_WGN}(N) = \mu_{SNRF\_WGN}(N) + 1.7\,\sigma_{SNRF\_WGN}(N) \approx \dfrac{1.7}{\sqrt{N}}$
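The threshold can be checked empirically; this sketch draws many WGN error signals, confirms the spread of SNRF_WGN is about 1/sqrt(N), and verifies that roughly 5% of draws exceed the 1.7/sqrt(N) threshold (all names here are local to the sketch):

```python
import numpy as np

def snrf_1d(e):
    c_self = np.dot(e, e)
    c_next = np.dot(e[:-1], e[1:])
    return c_next / (c_self - c_next)

rng = np.random.default_rng(0)
N = 4096
samples = [snrf_1d(rng.normal(size=N)) for _ in range(500)]

std = float(np.std(samples))
threshold = 1.7 / np.sqrt(N)   # th = mu + 1.7 * sigma, with mu ~ 0
exceed = float(np.mean(np.array(samples) > threshold))
print(std, 1 / np.sqrt(N), exceed)
```

The measured spread should track 1/sqrt(N) = 1/64 ≈ 0.0156, and only a few percent of the WGN draws should fall above the one-sided 1.7-sigma threshold.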

Page 14: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF) – multi-dimensional case

Signal and noise levels are estimated within a neighborhood of each sample. For sample p with error $e_p$ and its M nearest neighbors $e_{pi}$:

$E_{sp} = e_p \sum_{i=1}^{M} w_{pi}\, e_{pi}, \quad p = 1, 2, \dots, N$

with inverse-distance weights

$w_{pi} = \dfrac{1/d_{pi}}{\sum_{i=1}^{M} 1/d_{pi}}, \quad i = 1, 2, \dots, M$

where $d_{pi}$ is the distance from sample p to its i-th nearest neighbor.

Page 15: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF) – multi-dimensional case

Summing over all samples:

$E_s = \sum_{p=1}^{N} E_{sp} = \sum_{p=1}^{N} \sum_{i=1}^{M} w_{pi}\, e_p\, e_{pi}$

$E_n = E_e - E_s = \sum_{i=1}^{N} e_i^2 - \sum_{p=1}^{N} \sum_{i=1}^{M} w_{pi}\, e_p\, e_{pi}$

$SNRF_e = \dfrac{E_s}{E_n} = \dfrac{\sum_{p=1}^{N} \sum_{i=1}^{M} w_{pi}\, e_p\, e_{pi}}{\sum_{i=1}^{N} e_i^2 - \sum_{p=1}^{N} \sum_{i=1}^{M} w_{pi}\, e_p\, e_{pi}}$

Page 16: Optimizing number of hidden neurons in neural networks


Signal-to-noise ratio figure (SNRF) – multi-dimensional case

For pure WGN the same expression gives

$\mu_{SNRF\_WGN}(N) \approx 0, \quad \sigma_{SNRF\_WGN}(N) = \dfrac{\sqrt{2}}{\sqrt{N}}$

$th_{SNRF\_WGN}(N) = \mu_{SNRF\_WGN}(N) + 1.2\,\sigma_{SNRF\_WGN}(N)$

With M = 1, the multi-dimensional threshold $1.2\sqrt{2}/\sqrt{N} \approx 1.7/\sqrt{N}$ approximately equals the one-dimensional threshold.
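A brute-force sketch of the multi-dimensional SNRF_e with M nearest neighbors and the inverse-distance weights defined above (the test signals are stand-ins, not data from the talk):

```python
import numpy as np

def snrf_md(X, e, M=1):
    """Multi-dimensional SNRF: signal energy estimated from each
    sample's M nearest neighbors with inverse-distance weights."""
    X = np.asarray(X, dtype=float)
    e = np.asarray(e, dtype=float)
    N = len(e)
    E_s = 0.0
    for p in range(N):
        d = np.linalg.norm(X - X[p], axis=1)
        d[p] = np.inf                        # exclude the sample itself
        idx = np.argsort(d)[:M]              # M nearest neighbors
        w = (1 / d[idx]) / np.sum(1 / d[idx])
        E_s += e[p] * np.sum(w * e[idx])     # E_sp = e_p * sum_i w_pi * e_pi
    E_e = np.dot(e, e)
    return E_s / (E_e - E_s)

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
smooth = np.sin(3 * X[:, 0]) * np.cos(3 * X[:, 1])  # spatially correlated residual
noise = rng.normal(size=400)                        # WGN residual
print(snrf_md(X, smooth), snrf_md(X, noise))
```

As in the one-dimensional case, a spatially smooth residual yields a large ratio while a WGN residual stays near zero, within about sqrt(2/N) of it.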

Page 17: Optimizing number of hidden neurons in neural networks


Optimization using SNRF

When noise dominates the error signal, little useful information is left unlearned and learning should stop: SNRF_e < threshold of SNRF_WGN.

Procedure: start with a small network; train the MLP and obtain e_train; compare SNRF_e with the SNRF_WGN threshold; if SNRF_e is still above the threshold, add hidden neurons and repeat.

Stopping criterion: SNRF_e < th_SNRF_WGN.
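The loop can be sketched as follows; `train_mlp` here is a simulated stand-in for training an MLP with k hidden neurons and returning its training-error signal, so the example runs without a real network:

```python
import numpy as np

def snrf_1d(e):
    c_self = np.dot(e, e)
    c_next = np.dot(e[:-1], e[1:])
    return c_next / (c_self - c_next)

def optimize_hidden_neurons(train_mlp, N, k_max=50):
    """Grow the hidden layer until SNRF_e drops below the WGN threshold."""
    threshold = 1.7 / np.sqrt(N)   # th_SNRF_WGN(N)
    for k in range(1, k_max + 1):
        e_train = train_mlp(k)     # error signal on the training data
        if snrf_1d(e_train) < threshold:
            return k               # noise dominates the error: stop here
    return k_max

# Simulated training: the unlearned signal shrinks as neurons are added,
# leaving essentially pure noise beyond 10 neurons (illustrative only).
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 1024)
def train_mlp(k):
    residual = np.sin(x) * max(0.0, 1 - k / 10)
    return residual + 0.1 * rng.normal(size=x.size)

k_opt = optimize_hidden_neurons(train_mlp, x.size)
print(k_opt)
```

Note that the loop never consults a test set: the stopping decision is read off the training-error signal alone, which is the point of the method.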

Page 18: Optimizing number of hidden neurons in neural networks


Optimization using SNRF

Set the structure of MLP Train the MLP with back-propagation iteration

etrain

Compare SNRFe & SNRFWGN

Keep training with more iterations

Applied in optimizing number of iterations in back-propagation training to avoid overfitting

(overtraining)

Page 19: Optimizing number of hidden neurons in neural networks


Experimental results

Optimizing number of iterations

[Figures: testing performance (testing data vs. approximated values) after 10 and after 200 iterations, fitting noise-corrupted 0.4 sin(x) + 0.5]

Page 20: Optimizing number of hidden neurons in neural networks


Optimization using SNRF

Optimizing order of polynomial

[Figures: training data, testing data and desired function; training error, testing error, generalization error and SNRF with its stopping threshold vs. order of the fitting polynomial]

Page 21: Optimizing number of hidden neurons in neural networks


Experimental results

Optimizing the number of hidden neurons: two-dimensional function

[Figures: training data from the two-dimensional function; SNRF of the error signal with its threshold, and training/testing MSE, vs. number of hidden neurons]

Page 22: Optimizing number of hidden neurons in neural networks


Experimental results

[Figures: difference between the desired function and the approximating function using 25 neurons vs. 35 neurons]

Page 23: Optimizing number of hidden neurons in neural networks


Experimental results

Mackey-Glass time-series database: an MLP predicts the following sample from every 7 consecutive samples.
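The sliding-window setup can be sketched as follows; the sine series is a stand-in for the actual Mackey-Glass data:

```python
import numpy as np

def make_windows(series, width=7):
    """Each input row holds `width` consecutive samples;
    the target is the sample that follows the window."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X, y

t = np.arange(100)
X, y = make_windows(np.sin(0.3 * t), width=7)
print(X.shape, y.shape)  # (93, 7) (93,)
```

Each (window, next-sample) pair then serves as one training example for the MLP, whose hidden-layer size is selected with the SNRF criterion.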

[Figures: (a) training and testing MSE, and (b) SNRF of the error signal with its threshold, vs. number of hidden neurons]

Page 24: Optimizing number of hidden neurons in neural networks


Experimental results

[Figures: error signal obtained in OAA, and its autocorrelation]

The remaining error signal shows WGN characteristics: its autocorrelation is negligible at nonzero lags.

Page 25: Optimizing number of hidden neurons in neural networks


Experimental results

Puma robot arm dynamics database: an MLP maps 8 inputs (positions, velocities, torques) to angular acceleration.

[Figures: SNRF of the error signal with its threshold, and training/testing MSE with a 6th-degree polynomial fit, vs. number of hidden neurons]

Page 26: Optimizing number of hidden neurons in neural networks


Conclusions

A quantitative criterion based on SNRF to optimize the number of hidden neurons in an MLP.
Overfitting is detected from the training error alone; no separate test set is required.
The criterion is simple, easy to apply, efficient and effective.
It extends to optimizing other parameters of neural networks and to other fitting problems.