Nature Inspired Learning: Classification and Prediction Algorithms

Page 1: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature Inspired Learning: Classification and Prediction Algorithms

Šarūnas Raudys

Computational Intelligence Group

Department of Informatics

Vilnius University, Lithuania

e-mail: [email protected]

Juodkrante, 2009-05-22

Page 2: Nature  Inspired  Learning: Classification and Prediction Algorithms

[Figure: four 2D scatter plots of two-class data]

Nature inspired learning

Statics: accuracy, and the relations between sample size and complexity.

Dynamics: in addition, learning rapidity becomes a very important issue.

W = S^{-1} (M1 - M2) versus the perceptron.

Page 3: Nature  Inspired  Learning: Classification and Prediction Algorithms


Page 4: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning

A non-linear single layer perceptron (SLP): a main element in ANN theory.

[Figure: SLP schematic. Inputs x1, x2, …, xp feed a weighted sum, followed by a nonlinearity, giving output y.]

Page 5: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning

TRAINING THE SINGLE LAYER PERCEPTRON

[Figure: a plot of 300 bivariate vectors (dots and pluses) sampled from two Gaussian pattern classes, and the linear decision boundary moving from START to FINISH.]

OUTLINE. Three tasks:
- CLASSIFICATION,
- CLUSTERIZATION, if target2 = target1,
- minimization of deviations (regression).

Page 6: Nature  Inspired  Learning: Classification and Prediction Algorithms

CLASSIFICATION

The two-category case. (I will also speak about the multi-category case.)

From START to FINISH:

1. The cost function and training of the SLP used for classification.
2. When to stop training?
3. Seven types of classifiers obtained while training the SLP:
   1. Euclidean distance (only the means),
   2. regularized,
   3. Fisher, or
   4. Fisher with pseudo-inversion of the covariance matrix,
   5. robust,
   6. minimal empirical error,
   7. support vector (maximal margin).

How to train the SLP in the best way?

Page 7: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning

Training the non-linear SLP

[Figure: SLP schematic. Inputs X = (x1, x2, …, xp) feed a weighted sum, then a nonlinearity, giving output y; the training data are N vectors (x1, x2, …, xp) with desired outputs y.]

output o = f(V^T X + v0),

where f(net) is a non-linear activation function, e.g. the standard sigmoid f(net) = 1/(1 + e^{-net}) = f_sigmoid(net), and v0, V^T = (v1, v2, ..., vp) are the weights of the DF (discriminant function).
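A minimal sketch of this SLP output in Python/NumPy, assuming the sigmoid activation above (the function name is illustrative):

```python
import numpy as np

def slp_output(X, V, v0):
    """Non-linear SLP: the weighted sum net = V^T x + v0 for each row of X,
    passed through the sigmoid f(net) = 1 / (1 + exp(-net))."""
    net = X @ V + v0                    # weighted sum of the p inputs
    return 1.0 / (1.0 + np.exp(-net))   # output o in (0, 1)
```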

Page 8: Nature  Inspired  Learning: Classification and Prediction Algorithms

TRAINING THE SINGLE LAYER PERCEPTRON BASED CLASSIFIER

Training data: N vectors (x1, x2, …, xp) with the desired outputs y.

o = f(V^T X + v0), where f(net) is a non-linear activation function, and v0, V^T = (v1, v2, ..., vp) are the weights.

Cost function (Amari, 1967; Tsypkin, 1966):

C = (1/N) Σ_j (y_j - f(V^T X_j + v0))²,

where y_j is the training signal (the desired output).

Training: V_{t+1} = V_t - η × gradient, where η is the learning step parameter.

Training moves the weights from V(0) towards V(FINISH), the minimum of the training cost function; the true (unknown) minimum lies elsewhere, hence an optimal stopping rule is needed.
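A sketch of this batch ("total gradient") training, with a held-out set standing in for the optimal stopping rule; eta, max_epochs and the zero start are illustrative choices, not prescriptions from the talk:

```python
import numpy as np

def train_slp(X, y, X_val, y_val, eta=0.1, max_epochs=1000):
    """Batch gradient-descent training of the SLP on the sum-of-squares
    cost, keeping the weights that minimize the validation cost."""
    V, v0 = np.zeros(X.shape[1]), 0.0
    best = (np.inf, V.copy(), v0)
    for _ in range(max_epochs):
        o = 1.0 / (1.0 + np.exp(-(X @ V + v0)))   # sigmoid outputs
        g = (o - y) * o * (1.0 - o)               # chain rule through the sigmoid
        V -= eta * 2.0 / len(y) * X.T @ g         # V_{t+1} = V_t - eta * gradient
        v0 -= eta * 2.0 / len(y) * g.sum()
        o_val = 1.0 / (1.0 + np.exp(-(X_val @ V + v0)))
        c_val = np.mean((y_val - o_val) ** 2)     # validation cost
        if c_val < best[0]:
            best = (c_val, V.copy(), v0)          # early-stopping candidate
    return best[1], best[2]
```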

Page 9: Nature  Inspired  Learning: Classification and Prediction Algorithms

Training the non-linear single layer perceptron

V_{t+1} = V_t - η × gradient

[Figure: the cost-function landscape estimated from the training data versus the true landscape; their minima differ. Training heads towards the Finish of the training-data landscape, so optimal stopping near V_ideal is required.]

Page 10: Nature  Inspired  Learning: Classification and Prediction Algorithms

V_{t+1} = V_t - η × gradient. Early stopping versus late stopping.

V_opt = α_opt V_start + (1 - α_opt) V_finish,

where α_opt is determined by the variances σ²_start and σ²_finish (Raudys & Amari, 1998).

[Figure: a general principle. Accuracy as a function of the stopping moment; the majority, who stopped too late, are here, past the optimum.]

Page 11: Nature  Inspired  Learning: Classification and Prediction Algorithms

Where to use early stopping? Knowledge discovery in very large databases.

Nature inspired learning

Data Set 1 → Data Set 2 → Data Set 3 → …

In order to save the previous information, train on each new data set; however, stop training early!

Page 12: Nature  Inspired  Learning: Classification and Prediction Algorithms

Standard sum of squares cost function = standard regression

C = (1/N) Σ_j (y_j - f(V^T X_j + v0))².

We assume that the data are normalized: the means of X and y are 0 and the standard deviations are 1, so the covariances S_XX and S_Xy are correlations.

Let the correlations between the input variables x1, x2, …, xp be zero. Then the components of the vector V will be proportional to the correlations between x1, x2, …, xp and y. We may obtain such a regression after the first iteration of the gradient descent training algorithm V_{t+1} = V_t - η × gradient.
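A quick numerical check of this claim, as a sketch under the stated normalization (zero means, unit variances, uncorrelated inputs); a linear output is used for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10000, 5
X = rng.standard_normal((N, p))               # ~uncorrelated, zero mean, unit variance
y = X @ np.array([1.0, -0.5, 0.2, 0.4, 0.8]) + 0.3 * rng.standard_normal(N)
y = (y - y.mean()) / y.std()                  # normalized target

eta, V = 0.5, np.zeros(p)                     # training starts from zero weights
grad = -2.0 / N * X.T @ (y - X @ V)           # gradient of C at V = 0
V = V - eta * grad                            # first batch iteration: V = 2*eta*S_Xy

corr = (X * y[:, None]).mean(axis=0)          # sample correlations between x_i and y
print(V / corr)                               # ~2*eta everywhere: V is proportional
```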

Page 13: Nature  Inspired  Learning: Classification and Prediction Algorithms


SLP AS SIX REGRESSIONS

START

Page 14: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning. Robust regression

[Figure: the square loss (y_j - V^T X_j)² and a saturating "robust" loss, plotted as functions of the residual y_j - V^T X_j.]

In order to obtain robust regression, instead of the square function we have to use a "robust function".

Š. Raudys (2000). Evolution and generalization of a single neurone. III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks, 13(3/4), pp. 507-523.
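The slides do not specify the exact "robust function"; as an illustrative stand-in, a Welsch-type saturating loss behaves like the square for small residuals but flattens for large ones, so outliers barely influence the fit:

```python
import numpy as np

def square_loss(r):
    return r ** 2

def robust_loss(r, c=1.0):
    """Welsch-type saturating loss, a stand-in for the 'robust function':
    ~r^2 for small residuals, but saturates at c^2 for large ones, so
    large residuals (the likely outliers) barely influence the fit."""
    return c ** 2 * (1.0 - np.exp(-(r / c) ** 2))

r = np.linspace(-10, 10, 201)
print(square_loss(r).max(), robust_loss(r).max())   # 100.0 versus ~1.0
```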

Page 15: Nature  Inspired  Learning: Classification and Prediction Algorithms

[Figure: two recorded signals, the mother's and the fetus's ("baby") ECG; below, the result, the extracted fetus signal.]

A real-world problem: the use of robust regression in order to distinguish the very weak baby signal from the mother's ECG.

Robust regression pays attention to the smallest deviations, not to the largest ones, which are considered to be the outliers.

Page 16: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning. Standard and regularized regression

Use "statistical methods" to perform diverse whitening data transformations, where the input variables x1, x2, …, xp are decorrelated and scaled in order to have the same variances. Then, while training the perceptron in the transformed feature space, we can obtain the standard regression after the very first iteration.

X_new = T X_old, T = Λ^{-1/2} Φ^T, where S_XX = Φ Λ Φ^T is the singular value decomposition of the covariance matrix S_XX. The data are normalized (zero means, unit standard deviations), and V_start = 0.

If S_XX is replaced by S_XX + λI, we obtain regularized regression. Moreover, we can equalize the eigenvalues and speed up the training process: speeding up the calculations (convergence).
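A sketch of this whitening transformation, with the eigendecomposition playing the role of S_XX = Φ Λ Φ^T and lam giving the regularized variant (names are illustrative):

```python
import numpy as np

def whiten(X, lam=0.0):
    """Whitening transform X_new = X_old T with T = Phi Lambda^{-1/2}:
    the inputs are decorrelated and scaled to unit variance. With lam > 0
    the covariance is replaced by S_XX + lam*I, the regularized variant."""
    Xc = X - X.mean(axis=0)                       # zero means, as assumed
    S = np.cov(Xc, rowvar=False)                  # sample covariance S_XX
    eigval, Phi = np.linalg.eigh(S + lam * np.eye(S.shape[1]))
    T = Phi / np.sqrt(eigval)                     # Phi Lambda^{-1/2}
    return Xc @ T, T
```

After the transform, np.cov of the result is close to the identity matrix (exactly so as lam → 0), which is what lets the perceptron reach the standard regression after the very first iteration, as claimed above.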

Page 17: Nature  Inspired  Learning: Classification and Prediction Algorithms

SLP AS SEVEN STATISTICAL CLASSIFIERS

[Diagram: at START, the simplest classifier, with small weights; as training proceeds, the weights grow large and the classifier becomes more complex.]

Page 18: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning

Conditions to obtain the Euclidean distance classifier just after the first iteration:
E1) the centre M = (M1 + M2)/2 is moved to the zero point,
E2) training begins from zero weights,
E3) the target t2 = -t1 N1/N2,
E4) total gradient training (batch mode) is used.

When we train further, we have regularized discriminant analysis (RDA):

V_{t+1} = ( 2/(η(t-1)) I + S )^{-1} (M1 - M2),

where the regularization parameter λ = 2/(η(t-1)) → 0 with an increase in the number of training iterations, giving the Fisher classifier, or the Fisher classifier with a pseudoinverse of the covariance matrix.
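A sketch of this family of classifiers the SLP passes through, parameterized by the regularization parameter lam (function names are illustrative): a very large lam approximates the Euclidean distance classifier (direction M1 - M2), lam = 0 gives the Fisher classifier, with a pseudoinverse when S is singular:

```python
import numpy as np

def classifier_weights(X1, X2, lam=0.0):
    """Weight vector V = (S + lam*I)^{-1} (M1 - M2): RDA for lam > 0,
    Fisher for lam = 0 (pseudoinverse if S is singular), and nearly the
    Euclidean distance classifier for very large lam."""
    M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled sample covariance matrix S of the two classes
    S = ((len(X1) - 1) * np.cov(X1, rowvar=False) +
         (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    return np.linalg.pinv(S + lam * np.eye(S.shape[0])) @ (M1 - M2)
```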

Page 19: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning. Standard approach.

Use the diversity of "statistical methods and multivariate models" in order to obtain an efficient estimate of the covariance matrix. Then perform whitening data transformations, where the input variables are decorrelated and scaled in order to have the same variances.

While training the perceptron in the transformed feature space, we can obtain the Euclidean distance classifier after the first iteration. In the original feature space it corresponds to the Fisher classifier, or to a modification of the Fisher classifier (depending on the method used to estimate the covariance matrix).

Untransformed data → the Fisher classifier. Transformed data → the Euclidean classifier. The Euclidean classifier in the transformed space equals the Fisher classifier in the original space.
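A small numerical check of this equivalence, as a sketch: the whitening transform is built from the pooled covariance estimate, and the Euclidean-distance weights found in the whitened space are mapped back to the original space:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = [[2.0, 1.0], [1.0, 2.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, 200)
X2 = rng.multivariate_normal([2.0, 1.0], cov, 200)
M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
S = (199 * np.cov(X1, rowvar=False) + 199 * np.cov(X2, rowvar=False)) / 398

eigval, Phi = np.linalg.eigh(S)              # whitening from the pooled estimate
T = Phi / np.sqrt(eigval)                    # X_new = X_old T

V_fisher = np.linalg.solve(S, M1 - M2)       # Fisher weights, original space
V_edc_back = T @ (T.T @ (M1 - M2))           # EDC in whitened space, mapped back
print(np.allclose(V_fisher, V_edc_back))     # True: T T^T = S^{-1}
```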

Page 20: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning

Generalisation errors: EDC, Fisher and quadratic classifiers

Table 1. Learning quantity, the ratio κ of the expected classification error to the asymptotic error ε∞, of the Euclidean distance (EDC), the Fisher (LDF) and the quadratic (QDF) classifiers versus N, the training set size, for dimensionality n = 50 and five values of the distance δ (from Raudys and Pikelis, 1980). Within each classifier, the five columns correspond to:

δ:    1.68  2.56  3.76  4.65  5.50
ε∞:   0.2   0.1   0.03  0.01  0.003

   N    EDC                             Fisher LDF                      QDF
   8    1.82  2.34  3.09  3.66  4.22    -                               -
  12    1.70  2.03  2.41  2.65  2.87    -                               -
  20    1.54  1.70  1.84  1.92  1.99    -                               -
  30    1.43  1.50  1.55  1.58  1.61    2.05  3.39  8.40  19.7  52.0    -
  50*   1.30  1.32  1.33  1.34  1.35    1.62  2.15  3.61  5.95  10.6    2.21  3.25  7.87  18.3  40.6
 100    1.18  1.17  1.16  1.16  1.17    1.33  1.51  1.93  2.47  3.27    2.13  3.12  7.10  13.1  25.1
 250    1.08  1.07  1.06  1.06  1.06    1.14  1.19  1.31  1.44  1.61    1.81  2.35  3.23  4.03  5.05
 500    1.04  1.03  1.03  1.03  1.03    1.07  1.09  1.15  1.20  1.27    1.58  1.78  2.01  2.18  2.35
1000    1.02  1.02  1.02  1.02  1.02    1.04  1.05  1.07  1.10  1.13    1.37  1.42  1.47  1.51  1.56
2500    1.01  1.01  1.01  1.01  1.01    1.01  1.02  1.03  1.04  1.05    1.18  1.16  1.18  1.18  1.20

*) N = 80 for the QDF

Page 21: Nature  Inspired  Learning: Classification and Prediction Algorithms

A real-world problem: 196-dimensional data (handwritten character recognition). Dozens of ways are used to estimate the covariance matrix and perform the whitening data transformation. This is "additional information" (if correct) that can be useful in SLP training.

S. Raudys, M. Iwamura. Structures of covariance matrix in handwritten character recognition. Lecture Notes in Computer Science, 3138, pp. 725-733, 2004.

S. Raudys, A. Saudargiene. First-order tree-type dependence between variables and classification performance. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-23(2), pp. 233-239, 2001.

Page 22: Nature  Inspired  Learning: Classification and Prediction Algorithms

Covariance matrices are different.

[Figure: decision boundaries of the EDC, LDF (Fisher), QDF and the Anderson-Bahadur (AB) linear DF; the AB and Fisher boundaries are different.]

If we started with the AB decision boundary, not with the Fisher one, it would be better. Hence, we have proposed a special method of input data transformation.

S. Raudys (2004). Integration of statistical and neural methods to design classifiers in case of unequal covariance matrices. Lecture Notes in Artificial Intelligence, Vol. 3238, pp. 270-280. Springer-Verlag.

Page 23: Nature  Inspired  Learning: Classification and Prediction Algorithms

Non-linear discrimination. Similarity features. LNCS 3686, pp. 136-145, 2005.

[Figure: 100+100 2D two-class training vectors (pluses and circles) and the decision boundaries of kernel discriminant analysis (a), the SV classifier (b), and the SLP trained in a 200D dissimilarity feature space (c). Panel (d), the learning curve: the generalization error of the SLP classifier as a function of the number of training epochs, with the optimal stopping moment marked.]
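A sketch of one way to build such a dissimilarity feature space: each vector is described by its distances (or Gaussian similarities, if gamma is set) to the training prototypes; with the 100+100 training vectors as prototypes, this gives the 200D space of panel (c). The kernel form and gamma are assumptions, not taken from the paper:

```python
import numpy as np

def dissimilarity_features(X, prototypes, gamma=None):
    """Map each vector to its squared distances to the training prototypes;
    with gamma set, Gaussian similarities exp(-gamma * d^2) are used instead.
    Training an SLP on these len(prototypes) new features yields a
    non-linear boundary in the original space."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2 if gamma is None else np.exp(-gamma * d2)
```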

Page 24: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning. A noise injection

A "coloured" noise is used to form a pseudo-validation set: we add noise in the directions of the closest training vectors, so we almost do not distort the "geometry of the data".

In this technique we use "additional information": the space between neighboring points in a multidimensional feature space is not empty; it is filled by vectors of the same class.

The pseudo-validation data set is used to realize early stopping.
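A sketch of such a coloured-noise injection: each training vector is shifted part of the way towards one of its nearest same-class neighbours, so the added noise follows the local geometry of the data. The parameters k, scale and copies are illustrative:

```python
import numpy as np

def pseudo_validation_set(X, k=2, scale=0.5, copies=1, rng=None):
    """'Coloured' noise injection: shift each vector of X (one class at a
    time) part of the way towards a randomly chosen one of its k nearest
    neighbours, almost preserving the geometry of the data."""
    if rng is None:
        rng = np.random.default_rng()
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                 # exclude self-distances
    nn = np.argsort(d2, axis=1)[:, :k]           # k nearest neighbours of each vector
    out = []
    for _ in range(copies):
        j = nn[np.arange(len(X)), rng.integers(0, k, len(X))]
        alpha = scale * rng.random((len(X), 1))  # random step towards the neighbour
        out.append(X + alpha * (X[j] - X))
    return np.vstack(out)
```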

Page 25: Nature  Inspired  Learning: Classification and Prediction Algorithms

Nature inspired learning. Multi-category cases

[Figure: three classes (1, 2, 3) in the plane, separated by the pairwise boundaries 1(1/2), 2(1/3) and 3(2/3); the pairwise decisions partition the plane into regions A, B, C around the point O.]

Pair-wise classifiers: optimally stopped (+ noise injection) SLPs and H-T fusion. We need to obtain a classifier (SLP) of optimal complexity: early stopping.
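A sketch of the pairwise scheme, with a plain vote count standing in for the H-T fusion; each pairwise SLP (V, v0) is assumed to be already trained with optimal stopping and to give a positive output for the first class of its pair:

```python
import numpy as np

def pairwise_predict(x, pairwise_slps):
    """pairwise_slps maps a class pair (i, j) to the weights (V, v0) of the
    SLP separating i from j; every pairwise decision casts one vote, and the
    class with most votes wins (a simple stand-in for the H-T fusion)."""
    votes = {}
    for (i, j), (V, v0) in pairwise_slps.items():
        winner = i if x @ V + v0 > 0 else j   # positive output: class i
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```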

Page 26: Nature  Inspired  Learning: Classification and Prediction Algorithms

Learning rapidity. Two pattern recognition (PR) tasks

[Figure: two 2D two-class PR tasks; after the first task is learned, the perceptron has to learn the second one.]

The time to learn the second task is restricted, say, to 300 training epochs.

Parameters that affect learning rapidity:
- η, the learning step, and the weights growth;
- s = target1 - target2, the difference between the targets;
- regularization: a) a weight decay term, b) a noise injection into the input vectors, c) a corruption of the targets;
- the starting weights: their scaling, W_start → k × W_start, also controls learning rapidity.

So η, s, and the magnitude of the starting weights all matter.

Page 27: Nature  Inspired  Learning: Classification and Prediction Algorithms

Optimal values of learning parameters

[Figure: the number of epochs needed to learn the changed task, plotted against the difference between the targets s, against the learning step η (log scale), and against the weights magnitude; each of s, η and the weights magnitude has a clearly optimal value.]

Page 28: Nature  Inspired  Learning: Classification and Prediction Algorithms

Collective learning. A lengthy sequence of diverse PR tasks

[Figure: a lengthy sequence of two-class PR tasks obtained by rotating the data; the rotation angle plotted against the recognition-task changes.]

The angle and/or the time between two changes vary all the time.

Page 29: Nature  Inspired  Learning: Classification and Prediction Algorithms

The multi-agent system composed of adaptive agents: single layer perceptrons

In order to survive, the agents should learn rapidly. Unsuccessful agents are replaced by newborn ones. Inside the group, the agents help each other. In a case of emergency, they help the weakest groups. Genetic learning is combined with adaptive learning.

A moral: a single agent (SLP) cannot learn a very long sequence of PR tasks successfully.

Page 30: Nature  Inspired  Learning: Classification and Prediction Algorithms

A power of the PR task changes and the parameter s as a function of time

[Figure: (top) the rotation angle "theta max" versus the PR task changes, i.e. the power of the changes; (bottom) the stimulation s = t1 - t2 versus the PR task changes.]

I tried to learn: s, "emotions", "altruism", the noise intensity, the length of the learning set, etc.

s follows the variation of the power of the changes.

Page 31: Nature  Inspired  Learning: Classification and Prediction Algorithms

Integrating Statistical Methods and Neural Networks. Nature inspired learning

The theory for the equal covariance matrix case.

The theory for unequal covariance matrices and the multi-category cases:
- LNCS 4432, pp. 1-10, 2007;
- LNCS 4472, pp. 62-71, 2007;
- LNCS 4142, pp. 47-56, 2006;
- LNAI 3238, pp. 270-280, 2004.

Regression: Neural Networks, 13(3/4), pp. 507-523, 2000; JMLR; ICNC'08.

Page 32: Nature  Inspired  Learning: Classification and Prediction Algorithms


Page 33: Nature  Inspired  Learning: Classification and Prediction Algorithms
