csss2010 20100803-kanevski-lecture2
TRANSCRIPT
Prof. M. Kanevski 1
Machine Learning Algorithms: Theory,
Applications and Software Tools
Lecture 2 Basics of ANN: MLP
Prof. Mikhail Kanevski
Institute of Geomatics and Analysis of Risk,
University of Lausanne
Prof. M. Kanevski 2
Contents
• Introduction to artificial neural networks
• Multilayer perceptron
• Case studies
Prof. M. Kanevski 3
Basics of ANN
Artificial neural networks are analytical systems that address problems whose solutions have not been
explicitly formulated.
In this way they contrast with classical computers and computer programs, which are designed to solve problems whose solutions - although they may be extremely complex - have been made explicit.
Prof. M. Kanevski 4
Basics of ANN
• We can program or train neural networks to store, recognise, and associatively retrieve patterns;
• to filter noise from measurement data;
• to control ill-defined problems;
in summary:
• to estimate sampled functions when we do not know the form of the functions.
Prof. M. Kanevski 5
Basics of ANN
Unlike statistical estimators, they estimate a function without a mathematical model of how outputs
depend on inputs.
Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data.
Prof. M. Kanevski 6
Basics of ANN
The major applications of ANN:
• Feature recognition (pattern classification), speech recognition
• Signal processing
• Time-series prediction
• Function approximation and regression, classification
• Data mining
• Intelligent control
• Associative memories
• Optimisation
• And many others
Prof. M. Kanevski 7
Basics of ANN. Simple biological neuron
Prof. M. Kanevski 8
Basics of ANN. Simple model of the neuron
Prof. M. Kanevski 9
Examples of transfer functions.
$$f(x) = \frac{1}{1 + \exp(-x)}$$

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
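Both transfer functions are easy to evaluate numerically. Below is a minimal Python sketch (NumPy assumed; not part of the lecture material) of the logistic sigmoid and the hyperbolic tangent defined above:

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function: f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_exp(x):
    """Hyperbolic tangent written through exponentials, as on the slide."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x))        # values squashed into (0, 1)
print(tanh_via_exp(x))   # values in (-1, 1); identical to np.tanh(x)
```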
Prof. M. Kanevski 10
Basics of ANN
The main parts of ANN:
• Neurones
(nodes, cells, units, processing
elements)
• Network topology
(connections between neurones)
Prof. M. Kanevski 11
Basics of ANN
In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and the pattern of their connections define the network topology.
Prof. M. Kanevski 12
Multilayer perceptron
Prof. M. Kanevski 13
Basics of ANN. ANN learning/training
Supervised learning is the most common form of training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected. Samples from this subset are presented to the network one by one. For each sample, the result obtained by the network, O[Input(i)], is compared with the desired Output(i). After presenting the entire training subset, the weights are updated. This updating is done in such a way that a measure of the error between the network's outputs and the desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used.
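As an illustration of the vocabulary above (training subset, epoch, epoch size), here is a small Python sketch, not taken from the lecture, that accumulates errors over a subset and updates the parameters once per pass; the toy linear "network", learning rate and epoch size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)            # Input(i)
T = 2.0 * X + 0.1 * rng.normal(size=100)        # desired Output(i)

w, b = 0.0, 0.0                                  # parameters of a toy linear model
eta, epoch_size = 0.1, 20

for epoch in range(200):
    subset = rng.choice(len(X), size=epoch_size, replace=False)
    grad_w = grad_b = 0.0
    for i in subset:                             # present samples one by one
        error = (w * X[i] + b) - T[i]            # network output vs. desired output
        grad_w += error * X[i]
        grad_b += error
    # one pass through the subset plus one weight update = one epoch
    w -= eta * grad_w / epoch_size
    b -= eta * grad_b / epoch_size

print(w, b)   # w should approach 2.0, b should approach 0.0
```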
Prof. M. Kanevski 14
Basics of ANN. ANN supervised learning.
[Diagram: Examples → Neural network → Response → Evaluation of Response (by the Teacher) → Learning Algorithm → Modifications to Network]
Prof. M. Kanevski 15
Basics of ANN. Feedforward ANN
If there are no feedback or lateral connections, we have a feedforward ANN. The most frequently used model is the so-called multilayer perceptron. The term feedforward means that information flows in only one direction - from the input to the output.
Prof. M. Kanevski 16
ANN Multi-layer Perceptron (MLP)
• Depends only on the data
and its inner structure
• Is able to learn from data
and generalise
• Good at modelling non-
linearities
• Robust to noise and
outliers
[ANN = artificial neurons + connection weights]
Prof. M. Kanevski 17
Basics of ANN
All the knowledge of an ANN is stored in the synaptic weights between its units.
Prof. M. Kanevski 18
The Universality Property
• A two layer feed-forward neural network
with step activation functions can implement any Boolean function,
provided that the number of hidden
neurons H is sufficiently large.
Prof. M. Kanevski 19
MLP modelling
$$F_1(t, \mathbf{w}) = w_1^{out} f(w_1 t + b_1) + b^{out},$$

$$F_2(t, \mathbf{w}) = w_1^{out} f(w_1 t + b_1) + w_2^{out} f(w_2 t + b_2) + b^{out},$$

$$F_3(t, \mathbf{w}) = w_1^{out} f(w_1 t + b_1) + w_2^{out} f(w_2 t + b_2) + w_3^{out} f(w_3 t + b_3) + b^{out}.$$
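Read row by row, these expressions are the output of a one-input MLP with 1, 2 and 3 hidden neurons respectively: each hidden unit contributes w_k^out f(w_k t + b_k), plus an output bias b^out. A short Python illustration (the weights below are arbitrary, chosen only to show the computation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_output(t, w, b, w_out, b_out):
    """F(t, w) = sum_k w_out[k] * f(w[k] * t + b[k]) + b_out  (H hidden units)."""
    hidden = sigmoid(np.outer(t, w) + b)   # shape: (len(t), H)
    return hidden @ w_out + b_out

t = np.linspace(0.0, 1.0, 5)
w     = np.array([1.5, -2.0, 0.7])         # input-to-hidden weights (H = 3)
b     = np.array([0.1,  0.3, -0.2])        # hidden biases
w_out = np.array([0.8, -0.5,  1.2])        # hidden-to-output weights
print(mlp_output(t, w, b, w_out, b_out=0.05))
```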
Prof. M. Kanevski 20
Backpropagation training
Prof. M. Kanevski 21
Error function depends on network’s weights (W)
$$E_l(W) = \frac{1}{n} \sum_{j=0}^{n-1} \left\{ T_{lj} - Z_{lj}^{out}(W) \right\}^2$$
Prof. M. Kanevski 22
MLP training algorithms
Optimisation algorithms used for MLP training:
• Stochastic
− Annealing
− Genetic algorithm
• Gradient
− Conjugate gradients (slow 1st order gradient algorithm)
− Levenberg-Marquardt (fast 2nd order gradient algorithm)
− BFGS formula – quasi Newton
− Steepest Descent
− RProp – resilient propagation
− BackProp – back propagation
Prof. M. Kanevski 23
Feedforward ANN: Multilayer
perceptron. Backprop algorithm
• The possibilities and capabilities of multi-layer perceptrons stem from the non-linearities used within the nodes. An MLP can learn with a supervised learning rule - the backpropagation algorithm. The Backward Error Propagation algorithm for ANN learning/training caused a breakthrough in the application of multilayer perceptrons.
• The backpropagation algorithm is a supervised learning algorithm. It is an iterative gradient algorithm designed to minimise the error measure between the actual output of the neural network and the desired output. We have to optimise a very non-linear system consisting of a large number of highly correlated variables.
Prof. M. Kanevski 24
Basics of ANN. Backpropagation Algorithm
The backpropagation algorithm follows the next algorithmic steps:
• 1. Initialize weights. Usually it is recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters].
• 2. Present inputs and desired outputs. The vectors (Input_l, Output_l = t_l) are presented to the network.
• 3. Calculate the actual output of the ANN.
Prof. M. Kanevski 25
Basics of ANN. Backpropagation Algorithm
• 4. Calculate error measure and update the weights. Use a recursive algorithm starting at the
output neurons (nodes) and working back to the first hidden layer - it is this backward propagation of output errors that inspired the name for this training algorithm. Update the weights W by
Prof. M. Kanevski 26
We want to know how to modify
weights in order to decrease the
error function
$$w_{ij}(t+1) - w_{ij}(t) \propto -\frac{\partial E(t)}{\partial w_{ij}(t)}$$
Prof. M. Kanevski 27
Basics of ANN. Backpropagation Algorithm
$$w_{ij}^{m}(n+1) = w_{ij}^{m}(n) + \eta\, \delta_i^{m}\, Z_j^{(m-1)}$$
where n is the iteration step, η is the learning rate (0 < η ≤ 1), Z_j^(m-1) is the output of the j-th neurone in layer (m-1), and the error δ_i^m for the output layer is defined by the following equation.
Prof. M. Kanevski 28
Basics of ANN. Backpropagation Algorithm
$$\delta_i^{out} = Z_i^{out}\,(1 - Z_i^{out})\,(T_i - Z_i^{out})$$

and, for the hidden layers, the error is propagated backwards:

$$\delta_i^{(h-1)} = Z_i^{(h-1)}\,\left(1 - Z_i^{(h-1)}\right)\sum_j \delta_j^{h}\, w_{ij}^{h}$$
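Taken together, the update rule and the two delta equations give the complete backpropagation loop for a one-hidden-layer MLP with sigmoid units. The NumPy sketch below is an illustrative implementation of those formulas (it is not the lecture's software; the XOR data, network size, learning rate and number of epochs are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(X, T, W1, b1, W2, b2, eta=0.5):
    """One epoch of online backpropagation for a 1-hidden-layer MLP."""
    for x, t in zip(X, T):
        z_h = sigmoid(W1 @ x + b1)                 # hidden outputs Z^h
        z_o = sigmoid(W2 @ z_h + b2)               # network outputs Z^out
        d_o = z_o * (1.0 - z_o) * (t - z_o)        # delta for the output layer
        d_h = z_h * (1.0 - z_h) * (W2.T @ d_o)     # delta propagated back to the hidden layer
        W2 += eta * np.outer(d_o, z_h); b2 += eta * d_o   # w <- w + eta * delta * Z
        W1 += eta * np.outer(d_h, x);   b1 += eta * d_h
    return W1, b1, W2, b2

# Tiny illustration: learn XOR with 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
W1, b1 = rng.normal(scale=0.5, size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(1, 3)), np.zeros(1)
for _ in range(5000):   # may need a re-initialisation if training gets stuck in a local minimum
    W1, b1, W2, b2 = backprop_epoch(X, T, W1, b1, W2, b2)
print(sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]))   # should approach 0, 1, 1, 0
```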
Prof. M. Kanevski 29
Basics of ANNBackpropagation Algorithm
Other error measures (such as maximum absolute error and
median squared error) have even greater advantages in
many situations. For example, median squared error is useful because unlike the mean the median is a robust
statistic - its value is insensitive to occasional large errors
in the training data. Unfortunately, practical techniques for
implementing these more desirable error measures do not
yet exist. Thus, most neural networks today are tied to
mean squared error measurements.
Prof. M. Kanevski 30
Basics of ANN. Backpropagation Algorithm
More general error functions can be written by taking into account the importance of the samples presented to the network (weighting, declustering, economic criteria, etc.):
$$E_l(W) = \sum_{j=0}^{n-1} \left\{ T_{lj} - Z_{lj}^{out}(W) \right\}^2 \omega_{lj}$$
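A one-line implementation of such a weighted error, with per-sample weights ω (the values below are hypothetical, e.g. declustering weights), might look like this in Python:

```python
import numpy as np

def weighted_sse(T, Z, omega):
    """Weighted sum-of-squares error: sum_j omega_j * (T_j - Z_j)^2."""
    T, Z, omega = (np.asarray(a, dtype=float) for a in (T, Z, omega))
    return float(np.sum(omega * (T - Z) ** 2))

# Example: the third sample is considered twice as important as the first one.
print(weighted_sse(T=[1.0, 2.0, 3.0], Z=[0.9, 2.5, 2.0], omega=[1.0, 0.5, 2.0]))
```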
Prof. M. Kanevski 31
Gradient descent
[Figure: error function J(w) versus weight w, showing the minimum and the direction of the gradient J'(w)]
Prof. M. Kanevski 32
Gradient descent
[Figure: error function J(w) versus weight w, with the minimum marked]
Prof. M. Kanevski 33
In reality the situation with the error function and the corresponding optimization problem is much more complicated: there are multiple local minima!
Prof. M. Kanevski 34
Gradient descent
Local minima
Prof. M. Kanevski 35
SA (Simulated Annealing): Illustration
Prof. M. Kanevski 36
How important are local
minima? (Duda et al. 2001)
In computational practice, we do not want our network to be caught in a local minimum having high training error because this usually indicates that key features of the problem have not been learned by the network.
In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net
Prof. M. Kanevski 37
How important are local
minima? (Duda et al. 2001)
In many problems, convergence to a non-global minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge to the global minimum in order to achieve acceptable performance.
Prof. M. Kanevski 38
In short
The presence of multiple minima does not
necessarily present difficulties in training
nets, and a few simple heuristics can often
overcome such problems (see next slide)
Prof. M. Kanevski 39
Practical techniques for
improving backpropagation
• Activation function (sigmoid, hyperbolic tangent,..)
• Scaling inputs
• Training with noise (noise injection)
• Initializing weights (simulated annealing)
• Regularization (weight decay) - see the sketch after this list
• Number of hidden layers
• Learning parameters (rates, momentum,..)
• Cost function
• ………………………………….
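As a small illustration of two items from this list, the sketch below (my own assumptions, not from the lecture) adds an L2 weight-decay penalty to the sum-of-squares error and perturbs the inputs with Gaussian noise for "training with noise":

```python
import numpy as np

def error_with_weight_decay(T, Z, weights, alpha=1e-3):
    """Sum-of-squares error plus L2 weight decay: E + alpha * sum(w^2)."""
    sse = np.sum((np.asarray(T) - np.asarray(Z)) ** 2)
    penalty = alpha * sum(np.sum(w ** 2) for w in weights)
    return sse + penalty

def inject_noise(X, sigma=0.05, seed=0):
    """Training with noise: add small Gaussian perturbations to the inputs."""
    rng = np.random.default_rng(seed)
    return np.asarray(X, dtype=float) + rng.normal(scale=sigma, size=np.shape(X))

print(error_with_weight_decay([1.0, 0.0], [0.9, 0.2], weights=[np.array([0.5, -0.3])]))
print(inject_noise(np.array([[0.1, 0.2], [0.3, 0.4]])))
```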
Prof. M. Kanevski 40
Interpretation of network’s
outputs
Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form
$$E = \lim_{N \to \infty} \frac{1}{2N} \sum_{n=1}^{N} \sum_{k} \left\{ y_k(x^n; w) - t_k^n \right\}^2$$

$$= \frac{1}{2} \sum_{k} \iint \left\{ y_k(x; w) - t_k \right\}^2 p(t_k, x)\, dt_k\, dx$$
Prof. M. Kanevski 41
Interpretation of network’s
outputs
the network mapping is given by the conditional average of the target data, i.e. the regression of t_k conditioned on x:

$$y_k(x; w^*) = \langle t_k \,|\, x \rangle$$
Prof. M. Kanevski 42
DEMO
Prof. M. Kanevski 43
MLP and number of layers
• The problem with an MLP using a single hidden layer is that the neurons tend to interact with each other globally. In complex situations, this interaction makes it difficult to improve the approximation at one point without worsening it at some other point.
• On the other hand, with two hidden layers, the approximation process becomes more manageable.
Prof. M. Kanevski 44
Two hidden layers! (Haykin)
1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions.
2. Global features are extracted in the second layer. Specifically, a neuron in the second hidden layer combines the outputs of neurons in the first hidden layer operating on a particular region of the input space and thereby learns the global features for that region and outputs zero elsewhere.
Prof. M. Kanevski 45
Data Preprocessing
• Machine learning algorithms are data-
driven methods.
• The quality and quantity of data are essential for training and generalization.
[Diagram: Input data → Pre-processing → MLA → Post-processing → Results]
Prof. M. Kanevski 46
Types of pre-processing:
1. Linear and nonlinear transformations
e.g. input scaling/normalisation, Z-score transform,
square root transform, N-score transform, etc.
2. Dimensionality reduction
3. Incorporate prior knowledge
Invariants, hints,…
4. Feature extraction
linear/nonlinear combination of input variables
5. Feature selection: decide which features to use
Prof. M. Kanevski 47
Dimensionality reduction
• Two approaches are available to perform
dimensionality reduction:
• Feature extraction: creating a subset of new
features by combinations of the existing features
• Feature selection: choosing a subset of all the features (the most informative ones)
Prof. M. Kanevski 48
Feature selection/extraction
Prof. M. Kanevski 49
Feature selection
• Reducing the feature space by throwing
out some of the features (covariates)
– Also called variable selection
• Motivating idea: try to find a simple,
“parsimonious” model (Occam’s razor!)
Prof. M. Kanevski 50
Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006
Prof. M. Kanevski 51
Dimensionality Reduction
Clearly we lose some information, but this can be helpful due to the curse of dimensionality.
We need some way of deciding which dimensions to keep (a small PCA sketch follows the list below):
1. Random choice
2. Principal components analysis (PCA)
3. Independent components analysis (ICA)
4. Self-organised maps (SOM)
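Option 2 (PCA) is the most common choice; a compact sketch via the SVD of the centred data matrix (the synthetic data are used purely for illustration):

```python
import numpy as np

def pca_project(X, n_components):
    """Project data onto its first principal components (via SVD of centred X)."""
    Xc = X - X.mean(axis=0)                          # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                  # scores in the reduced space

# Example: 5 correlated features reduced to 2 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
print(pca_project(X, 2).shape)   # (200, 2)
```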
Prof. M. Kanevski 52
Data transform
• Y = aZ+b
• Y = Log(Z)
• Y = Ind(Z, Zs)
• Normalisation (Z-score): Y = (Z - Zm)/σ
• Box-Cox nonlinear transform:

$$Y(\lambda) = \frac{Z^{\lambda} - 1}{\lambda} \quad \text{if } \lambda > 0, \qquad Y(\lambda = 0) = \ln(Z)$$
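A minimal Python sketch of the Z-score and Box-Cox transforms listed above (my own illustration; the sample values are arbitrary):

```python
import numpy as np

def zscore(Z):
    """Z-score normalisation: Y = (Z - Zm) / sigma."""
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean()) / Z.std()

def box_cox(Z, lam):
    """Box-Cox transform: (Z**lam - 1)/lam for lam != 0, and ln(Z) for lam = 0."""
    Z = np.asarray(Z, dtype=float)
    return np.log(Z) if lam == 0 else (Z ** lam - 1.0) / lam

Z = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
print(zscore(Z))
print(box_cox(Z, lam=0.5))
print(box_cox(Z, lam=0))     # identical to np.log(Z)
```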
Prof. M. Kanevski 53
Model Selection & Model Evaluation
Prof. M. Kanevski 54
Guillaume d'Occam (1285 - 1349)
“Pluralitas non est ponenda sine
necessitate”
Occam’s razor:
“The simpler explanation of the phenomena is more likely to be correct”
Prof. M. Kanevski 55
Model Assessment and Model Selection:
Two separate goals
Prof. M. Kanevski 56
Model Selection:
Estimating the performance of different
models in order to choose the
(approximate) best one
Model Assessment:
Having chosen a final model, estimating its
prediction error (generalization error) on
new data
Prof. M. Kanevski 57
If we are in a data-rich situation, the best solution is to split the data randomly (?):
[Diagram: Raw Data → Train: 50%, Validation: 25%, Test: 25%]
Prof. M. Kanevski 58
Interpretation
• The training set is used to fit the models
• The validation set is used to estimate prediction error for model selection (tuning hyperparameters)
• The test set is used for assessment of the generalization error of the final chosen model
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
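A simple way to produce such a random 50/25/25 split (a sketch with the proportions assumed from the diagram above):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Randomly split (X, y) into 50% train, 25% validation and 25% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = len(X) // 2, len(X) // 4
    i_train, i_val, i_test = np.split(idx, [n_train, n_train + n_val])
    return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])

X, y = np.arange(40).reshape(20, 2), np.arange(20)
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 10 5 5
```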
Prof. M. Kanevski 59
Bias and Variance.
Model’s complexity
[Figure: two fits of the same data illustrating (b) overfitting and (c) underfitting]
Prof. M. Kanevski 60
One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples.
This means that the learned function fits the training data very closely; however, it does not generalise well, i.e. it cannot model sufficiently well unseen data from the same task.
Solution: Balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error.
Prof. M. Kanevski 61
Bias-Variance Dilemma
Assume that

$$Y = f(X) + \varepsilon, \qquad \text{where } E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma_{\varepsilon}^2$$
Prof. M. Kanevski 62
We can derive an expression for the expected prediction error of a
regression at an input point X=x0
using squared-error loss:
Prof. M. Kanevski 63
$$Err(x_0) = E\left[(Y - \hat{f}(x_0))^2 \,|\, X = x_0\right]$$
$$= \sigma_{\varepsilon}^2 + \left[E\hat{f}(x_0) - f(x_0)\right]^2 + E\left[\hat{f}(x_0) - E\hat{f}(x_0)\right]^2$$
$$= \sigma_{\varepsilon}^2 + Bias^2\left(\hat{f}(x_0)\right) + Var\left(\hat{f}(x_0)\right)$$
$$= \text{Irreducible Error} + Bias^2 + Variance$$
Prof. M. Kanevski 64
• The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless σε² = 0.
• The second term is the squared bias, the amount by which the average of our estimate differs from the true mean.
• The last term is the variance, the expected squared deviation of f̂(x0) around its mean.
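The decomposition can be checked numerically by Monte Carlo simulation. The sketch below (my own illustration, not from the lecture) uses a deliberately simple and biased estimator, the sample mean of Y, so that all three terms are visible:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2.0 * np.pi * x)          # true regression function f(X)
sigma_eps, x0, n, runs = 0.3, 0.25, 30, 5000

estimates = np.empty(runs)
for r in range(runs):
    X = rng.uniform(0.0, 1.0, n)
    Y = f(X) + rng.normal(scale=sigma_eps, size=n)   # Y = f(X) + eps
    estimates[r] = Y.mean()                          # f_hat(x0): a constant fit

bias2 = (estimates.mean() - f(x0)) ** 2              # squared bias at x0
variance = estimates.var()                           # variance of the estimator
print(sigma_eps ** 2, bias2, variance)               # the three terms of Err(x0)
print(sigma_eps ** 2 + bias2 + variance)             # their sum: the expected error
```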
Prof. M. Kanevski 65
Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001
Prof. M. Kanevski 66
Prof. M. Kanevski 67
• A neural network is only as good as the
training data!
• Poor training data inevitably leads to an
unreliable and unpredictable network.
• Exploratory Data Analysis and data
preprocessing are extremely important!!!
Prof. M. Kanevski 68
MLP modelling. Case Studies.
Original (10 000 points) Training (900 points)
Prof. M. Kanevski 69
MLP modeling
Original MLP prediction
Train RMSE 1.97
Ro 0.69
Which result do you prefer?
Prof. M. Kanevski 70
MLP modeling
Original MLP prediction
Train RMSE 1.61
Ro 0.80
Which result do you prefer?
Prof. M. Kanevski 71
MLP modeling
Original MLP prediction
Train RMSE 1.67
Ro 0.79
Which result do you prefer?
Prof. M. Kanevski 72
MLP modeling
Original MLP prediction
Train RMSE 1.10
Ro 0.92
Which result do you prefer?
Prof. M. Kanevski 73
MLP modeling
Original MLP prediction
Train RMSE 0.83
Ro 0.95
Which result do you prefer?
Prof. M. Kanevski 74
MLP modeling
Original MLP prediction
Train RMSE 0.55
Ro 0.98
Which result do you prefer?
Prof. M. Kanevski 75
MLP modeling
Training statistics
[Charts: Train RMSE and Ro versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20)]
Model 20-20 is the best?
Prof. M. Kanevski 76
MLP modeling. Training statistics

MLP     RMSE   Ro
5       1.97   0.69
10      1.61   0.80
5-5     1.67   0.79
10-10   1.10   0.92
15-15   0.83   0.95
20-20   0.55   0.98
Prof. M. Kanevski 77
MLP modeling. Training & Validation statistics
[Charts: Training and Validation RMSE and Ro versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20)]
Prof. M. Kanevski 78
Prof. M. Kanevski 79
MLP modeling. Validation statistics

MLP     RMSE   Ro
5       2.01   0.68
10      1.66   0.80
5-5     1.70   0.79
10-10   1.25   0.89
15-15   1.24   0.89
20-20   1.39   0.88
Prof. M. Kanevski 80
ANNEX model: Artificial Neural Networks with External drift
Environmental data mapping
Prof. M. Kanevski 81
Traditional application of ANN to spatial predictions
• Data are available at measurement points: F(xi,yi), for i = 1,…,N
• ANN solution: x,y - 2 inputs, F - output
  - select ANN architecture
  - train with available data
  - after training, use it to predict
• Problem: predict F(x,y) at the points without measurements, usually on a regular grid
Prof. M. Kanevski 82
If there is additional information (available at both training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN.
ANNEX is similar to the “Kriging with External Drift” model:
Inputs: x, y, + f_ext(x,y)
Prof. M. Kanevski 83
Examples of external information
• Cheap information on secondary variable
• Physical model of the phenomena
• Remotely sensed images
• GIS data
• DEM data
Prof. M. Kanevski 84
Kriging with external drift
Kriging with external drift is the model where the trend is limited to
E{F(x,y)} = m(x,y) = λ0 + λ1 f_ext(x,y)   (1)
where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated.
In general, kriging with an external drift is a simple and efficient algorithm to incorporate a secondary variable in the estimation of the primary variable.
Prof. M. Kanevski 85
ANNEX model
What relationship between the primary and the external information should there be in the case of ANNEX?
Prof. M. Kanevski 86
ANNEX model
What does external “related” information bring? (How to measure: correlation between variables?)
• Improved accuracy of prediction?
• Reduced uncertainty of prediction?
An important problem is related to the question of the quality of additional data: there is a dilemma between introducing new information and/or new noise.
Prof. M. Kanevski 87
Case study: Kazakh Priaralie, monitoring network
1 400 000 km² - 400 monitoring stations
Prof. M. Kanevski 88
Datasets
GIS DEM model
Average long-term temperatures of air in June (°C)
Prof. M. Kanevski 89
Correlation
Air temperature vs. Altitude
Prof. M. Kanevski 90
Train and Test datasets
[Maps: Train | Test]
Prof. M. Kanevski 91
ANN and ANNEX models

Model                        Correlation  RMSE  MAE   MRE
2-7-5-1                      0.917        2.57  1.96  -0.02
3-3-1                        0.989        0.96  0.73  -0.01
3-5-1                        0.99         0.9   0.7   -0.007
3-7-1                        0.991        0.85  0.66  -0.004
3-8-1                        0.991        0.84  0.68  -0.001
3-9-1                        0.991        0.88  0.69  -0.01
3-10-1                       0.99         0.92  0.74  -0.01
Kriging with external drift  0.984        1.19  0.91  -0.03
Prof. M. Kanevski 92
Scatter plots
[Panels: Kriging | Cokriging | Drift Kriging | ANNEX]
Prof. M. Kanevski 93
Mapping results
[Maps: Kriging | Cokriging | Drift Kriging | ANNEX]
Prof. M. Kanevski 94
Modelling noisy “altitude” effect (100 %)
[Maps: Before | After]
Prof. M. Kanevski 95
Scatter plots between variables (noisy 100 % altitude)
[Panels: Train | Test]
Prof. M. Kanevski 96
Mapping noise results
ANNEX: Air temperature (°C)
Prof. M. Kanevski 97
Noise results

Model                                            Correlation  RMSE  MAE    MRE
Kriging                                          0.874        3.13  2.04   -0.06
Kriging – external drift                         0.984        1.19  0.91   -0.03
3-7-1                                            0.991        0.85  0.66   -0.004
3-8-1                                            0.991        0.84  0.68   -0.001
3-8-1 (100% noise)                               0.839        3.54  2.37   -0.13
3-7-1 (10% noise), Test 1                        0.939        2.32  -1.49  -0.003
Kriging – external drift (10% noise), Test 1     0.941        2.23  1.54   -0.06
3-7-1 (10% noise), Test 2                        0.899        2.81  1.52   -0.08
Kriging – external drift (10% noise), Test 2     0.903        2.81  1.59   -0.103
Prof. M. Kanevski 98
MLP: real case study
Wind fields in Switzerland
Prof. M. Kanevski 99
(pp 168-172 of the book)
Monitoring network:
111 stations in Switzerland
(80 training + 31 for validation)
Mapping of daily:
• Mean speed
• Maximum gust
• Average direction
Modeling of wind fields with MLP
using regularization technique
Prof. M. Kanevski 100
Monitoring network: 111 stations in Switzerland (80 training + 31 for validation)
Mapping of daily:
• Mean speed
• Maximum gust
• Average direction
Input information:
X,Y geographical coordinates
DEM (resolution 500 m)
23 DEM-based « geo-features »
Total: 26 features
Modeling of wind fields with MLP and regularization technique
Model: MLP 26-20-20-3
Prof. M. Kanevski 101
Model:
MLP 26-20-20-3
Training:
• Random initialization
• 500 iterations of the
RPROP algorithm
Training of the MLP
Prof. M. Kanevski 102
Results: naïve approach
Prof. M. Kanevski 103
Results: Noise injection regularization
Prof. M. Kanevski 104
Results: summary
Noise injection regularization
Without regularization (overfitting)
Prof. M. Kanevski 105
Conclusion
• MLP is a nonlinear universal tool for learning from and modeling data. It is an excellent exploratory tool.
• Its application demands deep expert knowledge and experience.