csss2010 20100803-kanevski-lecture2
TRANSCRIPT
Prof. M. Kanevski 1
Machine Learning Algorithms: Theory,
Applications and Software Tools
Lecture 2 Basics of ANN: MLP
Prof. Mikhail Kanevski
Institute of Geomatics and Analysis of Risk,
University of Lausanne
Prof. M. Kanevski 2
Contents
• Introduction to artificial neural networks
• Multilayer perceptron
• Case studies
Prof. M. Kanevski 3
Basics of ANN
Artificial neural networks are analytical systems that address problems whose solutions have not been
explicitly formulated.
In this way they contrast with classical computers and computer programs, which are designed to solve problems whose solutions - although they may be extremely complex - have been made explicit.
Prof. M. Kanevski 4
Basics of ANN
• We can program or train neural networks to store, recognise, and associatively retrieve patterns;
• to filter noise from measurement data;
• to control ill-defined problems;
in summary:
• to estimate sampled functions when we do not know the form of the functions.
Prof. M. Kanevski 5
Basics of ANN
Unlike statistical estimators, they estimate a function without a mathematical model of how outputs
depend on inputs.
Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data.
Prof. M. Kanevski 6
Basics of ANN
The major applications of ANN:
• Feature recognition (pattern classification), speech recognition
• Signal processing
• Time-series prediction
• Function approximation and regression, classification
• Data mining
• Intelligent control
• Associative memories
• Optimisation
• And many others
Prof. M. Kanevski 7
Basics of ANN. Simple biological neuron
Prof. M. Kanevski 8
Basics of ANN. Simple model of the neuron
Prof. M. Kanevski 9
Examples of transfer functions.
$$f(x) = \frac{1}{1 + \exp(-x)}$$

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
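Both transfer functions are easy to evaluate numerically. Below is a minimal Python sketch (NumPy assumed; not part of the lecture material) of the logistic sigmoid and the hyperbolic tangent defined above:

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function: f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_exp(x):
    """Hyperbolic tangent written through exponentials, as on the slide."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x))        # values squashed into (0, 1)
print(tanh_via_exp(x))   # values in (-1, 1); identical to np.tanh(x)
```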
Prof. M. Kanevski 10
Basics of ANN
The main parts of ANN:
• Neurones
(nodes, cells, units, processing
elements)
• Network topology
(connections between neurones)
Prof. M. Kanevski 11
Basics of ANN
In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and the pattern of their connections define the network topology.
Prof. M. Kanevski 12
Multilayer perceptron
Prof. M. Kanevski 13
Basics of ANN. ANN learning/training
Supervised learning is the most common form of training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected. Samples from this subset are presented to the network one by one. For each sample, the result obtained by the network, O[Input(i)], is compared with the desired Output(i). After presenting the entire training subset, the weights are updated. This updating is done in such a way that a measure of the error between the network's outputs and the desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used.
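As an illustration of the vocabulary above (training subset, epoch, epoch size), here is a small Python sketch, not taken from the lecture, that accumulates errors over a subset and updates the parameters once per pass; the toy linear "network", learning rate and epoch size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)            # Input(i)
T = 2.0 * X + 0.1 * rng.normal(size=100)        # desired Output(i)

w, b = 0.0, 0.0                                  # parameters of a toy linear model
eta, epoch_size = 0.1, 20

for epoch in range(200):
    subset = rng.choice(len(X), size=epoch_size, replace=False)
    grad_w = grad_b = 0.0
    for i in subset:                             # present samples one by one
        error = (w * X[i] + b) - T[i]            # network output vs. desired output
        grad_w += error * X[i]
        grad_b += error
    # one pass through the subset plus one weight update = one epoch
    w -= eta * grad_w / epoch_size
    b -= eta * grad_b / epoch_size

print(w, b)   # w should approach 2.0, b should approach 0.0
```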
Prof. M. Kanevski 14
Basics of ANN. ANN supervised learning.
[Diagram: Examples → Neural network → Response → Evaluation of Response (by the Teacher) → Learning Algorithm → Modifications to Network]
Prof. M. Kanevski 15
Basics of ANN. Feedforward ANN
If there are no feedback or lateral connections, we have a feedforward ANN. The most frequently used model is the so-called multilayer perceptron. The term feedforward means that information flows in only one direction - from the input to the output.
Prof. M. Kanevski 16
ANN Multi-layer Perceptron (MLP)
• Depends only on the data
and its inner structure
• Is able to learn from data
and generalise
• Good at modelling non-
linearities
• Robust to noise and
outliers
[ANN = artificial neurons + connection weights]
Prof. M. Kanevski 17
Basics of ANN
All the knowledge of an ANN is stored in the synaptic weights between its units.
Prof. M. Kanevski 18
The Universality Property
• A two layer feed-forward neural network
with step activation functions can implement any Boolean function,
provided that the number of hidden
neurons H is sufficiently large.
Prof. M. Kanevski 19
MLP modelling
$$F_1(t, \mathbf{w}) = w_1^{out} f(w_1 t + b_1) + b^{out},$$

$$F_2(t, \mathbf{w}) = w_1^{out} f(w_1 t + b_1) + w_2^{out} f(w_2 t + b_2) + b^{out},$$

$$F_3(t, \mathbf{w}) = w_1^{out} f(w_1 t + b_1) + w_2^{out} f(w_2 t + b_2) + w_3^{out} f(w_3 t + b_3) + b^{out}.$$
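Read row by row, these expressions are the output of a one-input MLP with 1, 2 and 3 hidden neurons respectively: each hidden unit contributes w_k^out f(w_k t + b_k), plus an output bias b^out. A short Python illustration (the weights below are arbitrary, chosen only to show the computation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_output(t, w, b, w_out, b_out):
    """F(t, w) = sum_k w_out[k] * f(w[k] * t + b[k]) + b_out  (H hidden units)."""
    hidden = sigmoid(np.outer(t, w) + b)   # shape: (len(t), H)
    return hidden @ w_out + b_out

t = np.linspace(0.0, 1.0, 5)
w     = np.array([1.5, -2.0, 0.7])         # input-to-hidden weights (H = 3)
b     = np.array([0.1,  0.3, -0.2])        # hidden biases
w_out = np.array([0.8, -0.5,  1.2])        # hidden-to-output weights
print(mlp_output(t, w, b, w_out, b_out=0.05))
```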
Prof. M. Kanevski 20
Backpropagation training
Prof. M. Kanevski 21
Error function depends on network’s weights (W)
$$E_l(W) = \frac{1}{n} \sum_{j=0}^{n-1} \left\{ T_{lj} - Z_{lj}^{out}(W) \right\}^2$$
Prof. M. Kanevski 22
MLP training algorithms
Optimisation algorithms used for MLP training:
• Stochastic
− Annealing
− Genetic algorithm
• Gradient
− Conjugate gradients (slow 1st order gradient algorithm)
− Levenberg-Marquardt (fast 2nd order gradient algorithm)
− BFGS formula – quasi Newton
− Steepest Descent
− RProp – resilient propagation
− BackProp – back propagation
Prof. M. Kanevski 23
Feedforward ANN: Multilayer
perceptron. Backprop algorithm
• The possibilities and capabilities of multi-layer perceptrons stem from the non-linearities used within the nodes. An MLP can learn with a supervised learning rule - the backpropagation algorithm. The Backward Error Propagation algorithm for ANN learning/training caused a breakthrough in the application of multilayer perceptrons.
• The backpropagation algorithm is a supervised learning algorithm. It is an iterative gradient algorithm designed to minimise the error measure between the actual output of the neural network and the desired output. We have to optimise a very non-linear system consisting of a large number of highly correlated variables.
Prof. M. Kanevski 24
Basics of ANN. Backpropagation Algorithm
The backpropagation algorithm follows the next algorithmic steps:
• 1. Initialize weights. Usually it is recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters].
• 2. Present inputs and desired outputs. The vectors (Input_l, Output_l = t_l) are presented to the network.
• 3. Calculate the actual output of the ANN.
Prof. M. Kanevski 25
Basics of ANN. Backpropagation Algorithm
• 4. Calculate error measure and update the weights. Use a recursive algorithm starting at the
output neurons (nodes) and working back to the first hidden layer - it is this backward propagation of output errors that inspired the name for this training algorithm. Update the weights W by
Prof. M. Kanevski 26
We want to know how to modify
weights in order to decrease the
error function
$$w_{ij}(t+1) - w_{ij}(t) \propto -\frac{\partial E(t)}{\partial w_{ij}(t)}$$
Prof. M. Kanevski 27
Basics of ANN. Backpropagation Algorithm
$$w_{ij}^{m}(n+1) = w_{ij}^{m}(n) + \eta\, \delta_i^{m}\, Z_j^{(m-1)}$$
where n is the iteration step, η is the learning rate (0 < η ≤ 1), Z_j^(m-1) is the output of the j-th neurone in layer (m-1), and the error δ_i^m for the output layer is defined by the following equation.
Prof. M. Kanevski 28
Basics of ANN. Backpropagation Algorithm
$$\delta_i^{out} = Z_i^{out}\,(1 - Z_i^{out})\,(T_i - Z_i^{out})$$

and, for the hidden layers, the error is propagated backwards:

$$\delta_i^{(h-1)} = Z_i^{(h-1)}\,\left(1 - Z_i^{(h-1)}\right)\sum_j \delta_j^{h}\, w_{ij}^{h}$$
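Taken together, the update rule and the two delta equations give the complete backpropagation loop for a one-hidden-layer MLP with sigmoid units. The NumPy sketch below is an illustrative implementation of those formulas (it is not the lecture's software; the XOR data, network size, learning rate and number of epochs are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(X, T, W1, b1, W2, b2, eta=0.5):
    """One epoch of online backpropagation for a 1-hidden-layer MLP."""
    for x, t in zip(X, T):
        z_h = sigmoid(W1 @ x + b1)                 # hidden outputs Z^h
        z_o = sigmoid(W2 @ z_h + b2)               # network outputs Z^out
        d_o = z_o * (1.0 - z_o) * (t - z_o)        # delta for the output layer
        d_h = z_h * (1.0 - z_h) * (W2.T @ d_o)     # delta propagated back to the hidden layer
        W2 += eta * np.outer(d_o, z_h); b2 += eta * d_o   # w <- w + eta * delta * Z
        W1 += eta * np.outer(d_h, x);   b1 += eta * d_h
    return W1, b1, W2, b2

# Tiny illustration: learn XOR with 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
W1, b1 = rng.normal(scale=0.5, size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(1, 3)), np.zeros(1)
for _ in range(5000):   # may need a re-initialisation if training gets stuck in a local minimum
    W1, b1, W2, b2 = backprop_epoch(X, T, W1, b1, W2, b2)
print(sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]))   # should approach 0, 1, 1, 0
```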
Prof. M. Kanevski 29
Basics of ANNBackpropagation Algorithm
Other error measures (such as maximum absolute error and
median squared error) have even greater advantages in
many situations. For example, median squared error is useful because unlike the mean the median is a robust
statistic - its value is insensitive to occasional large errors
in the training data. Unfortunately, practical techniques for
implementing these more desirable error measures do not
yet exist. Thus, most neural networks today are tied to
mean squared error measurements.
Prof. M. Kanevski 30
Basics of ANN. Backpropagation Algorithm
More general error functions can be written by taking into account the importance of the samples presented to the network (weighting, declustering, economic criteria, etc.):
$$E_l(W) = \sum_{j=0}^{n-1} \left\{ T_{lj} - Z_{lj}^{out}(W) \right\}^2 \omega_{lj}$$
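A one-line implementation of such a weighted error, with per-sample weights ω (the values below are hypothetical, e.g. declustering weights), might look like this in Python:

```python
import numpy as np

def weighted_sse(T, Z, omega):
    """Weighted sum-of-squares error: sum_j omega_j * (T_j - Z_j)^2."""
    T, Z, omega = (np.asarray(a, dtype=float) for a in (T, Z, omega))
    return float(np.sum(omega * (T - Z) ** 2))

# Example: the third sample is considered twice as important as the first one.
print(weighted_sse(T=[1.0, 2.0, 3.0], Z=[0.9, 2.5, 2.0], omega=[1.0, 0.5, 2.0]))
```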
Prof. M. Kanevski 31
Gradient descent
[Figure: error function J(w) versus weight w, showing the minimum and the direction of the gradient J'(w)]
Prof. M. Kanevski 32
Gradient descent
[Figure: error function J(w) versus weight w, with the minimum marked]
Prof. M. Kanevski 33
In reality the situation with the error function and the corresponding optimization problem is much more complicated: there are multiple local minima!
Prof. M. Kanevski 34
Gradient descent
Local minima
Prof. M. Kanevski 35
SA (Simulated Annealing): Illustration
Prof. M. Kanevski 36
How important are local
minima? (Duda et al. 2001)
In computational practice, we do not want our network to be caught in a local minimum having high training error because this usually indicates that key features of the problem have not been learned by the network.
In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net
Prof. M. Kanevski 37
How important are local
minima? (Duda et al. 2001)
In many problems, convergence to a non-global minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge to the global minimum in order to achieve acceptable performance.
Prof. M. Kanevski 38
In short
The presence of multiple minima does not
necessarily present difficulties in training
nets, and a few simple heuristics can often
overcome such problems (see next slide)
Prof. M. Kanevski 39
Practical techniques for
improving backpropagation
• Activation function (sigmoid, hyperbolic tangent,..)
• Scaling inputs
• Training with noise (noise injection)
• Initializing weights (simulated annealing)
• Regularization (weight decay) - see the sketch after this list
• Number of hidden layers
• Learning parameters (rates, momentum,..)
• Cost function
• ………………………………….
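As a small illustration of two items from this list, the sketch below (my own assumptions, not from the lecture) adds an L2 weight-decay penalty to the sum-of-squares error and perturbs the inputs with Gaussian noise for "training with noise":

```python
import numpy as np

def error_with_weight_decay(T, Z, weights, alpha=1e-3):
    """Sum-of-squares error plus L2 weight decay: E + alpha * sum(w^2)."""
    sse = np.sum((np.asarray(T) - np.asarray(Z)) ** 2)
    penalty = alpha * sum(np.sum(w ** 2) for w in weights)
    return sse + penalty

def inject_noise(X, sigma=0.05, seed=0):
    """Training with noise: add small Gaussian perturbations to the inputs."""
    rng = np.random.default_rng(seed)
    return np.asarray(X, dtype=float) + rng.normal(scale=sigma, size=np.shape(X))

print(error_with_weight_decay([1.0, 0.0], [0.9, 0.2], weights=[np.array([0.5, -0.3])]))
print(inject_noise(np.array([[0.1, 0.2], [0.3, 0.4]])))
```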
Prof. M. Kanevski 40
Interpretation of network’s
outputs
Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form
$$E = \lim_{N \to \infty} \frac{1}{2N} \sum_{n=1}^{N} \sum_{k} \left\{ y_k(x^n; w) - t_k^n \right\}^2$$

$$= \frac{1}{2} \sum_{k} \iint \left\{ y_k(x; w) - t_k \right\}^2 p(t_k, x)\, dt_k\, dx$$
Prof. M. Kanevski 41
Interpretation of network’s
outputs
the network mapping is given by the conditional average of the target data, i.e. the regression of t_k conditioned on x:

$$y_k(x; w^*) = \langle t_k \,|\, x \rangle$$
Prof. M. Kanevski 42
DEMO
Prof. M. Kanevski 43
MLP and number of layers
• The problem with an MLP using a single hidden layer is that the neurons tend to interact with each other globally. In complex situations, this interaction makes it difficult to improve the approximation at one point without worsening it at some other point.
• On the other hand, with two hidden layers, the approximation process becomes more manageable.
Prof. M. Kanevski 44
Two hidden layers! (Haykin)
1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions.
2. Global features are extracted in the second layer. Specifically, a neuron in the second hidden layer combines the outputs of neurons in the first hidden layer operating on a particular region of the input space and thereby learns the global features for that region and outputs zero elsewhere.
Prof. M. Kanevski 45
Data Preprocessing
• Machine learning algorithms are data-
driven methods.
• The quality and quantity of data are essential for training and generalization.
[Diagram: Input data → Pre-processing → MLA → Post-processing → Results]
Prof. M. Kanevski 46
Types of pre-processing:
1. Linear and nonlinear transformations
e.g. input scaling/normalisation, Z-score transform,
square root transform, N-score transform, etc.
2. Dimensionality reduction
3. Incorporate prior knowledge
Invariants, hints,…
4. Feature extraction
linear/nonlinear combination of input variables
5. Feature selection: decide which features to use
Prof. M. Kanevski 47
Dimensionality reduction
• Two approaches are available to perform
dimensionality reduction:
• Feature extraction: creating a subset of new
features by combinations of the existing features
• Feature selection: choosing a subset of all the features (the most informative ones)
Prof. M. Kanevski 48
Feature selection/extraction
Prof. M. Kanevski 49
Feature selection
• Reducing the feature space by throwing
out some of the features (covariates)
– Also called variable selection
• Motivating idea: try to find a simple,
“parsimonious” model (Occam’s razor!)
Prof. M. Kanevski 50
Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006
Prof. M. Kanevski 51
Dimensionality Reduction
Clearly we lose some information, but this can be helpful due to the curse of dimensionality.
We need some way of deciding which dimensions to keep (a small PCA sketch follows the list below):
1. Random choice
2. Principal components analysis (PCA)
3. Independent components analysis (ICA)
4. Self-organised maps (SOM)
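Option 2 (PCA) is the most common choice; a compact sketch via the SVD of the centred data matrix (the synthetic data are used purely for illustration):

```python
import numpy as np

def pca_project(X, n_components):
    """Project data onto its first principal components (via SVD of centred X)."""
    Xc = X - X.mean(axis=0)                          # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                  # scores in the reduced space

# Example: 5 correlated features reduced to 2 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
print(pca_project(X, 2).shape)   # (200, 2)
```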
Prof. M. Kanevski 52
Data transform
• Y = aZ+b
• Y = Log(Z)
• Y = Ind(Z, Zs)
• Normalisation (Z-score): Y = (Z - Zm)/σ
• Box-Cox nonlinear transform:

$$Y(\lambda) = \frac{Z^{\lambda} - 1}{\lambda} \quad \text{if } \lambda > 0, \qquad Y(\lambda = 0) = \ln(Z)$$
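A minimal Python sketch of the Z-score and Box-Cox transforms listed above (my own illustration; the sample values are arbitrary):

```python
import numpy as np

def zscore(Z):
    """Z-score normalisation: Y = (Z - Zm) / sigma."""
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean()) / Z.std()

def box_cox(Z, lam):
    """Box-Cox transform: (Z**lam - 1)/lam for lam != 0, and ln(Z) for lam = 0."""
    Z = np.asarray(Z, dtype=float)
    return np.log(Z) if lam == 0 else (Z ** lam - 1.0) / lam

Z = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
print(zscore(Z))
print(box_cox(Z, lam=0.5))
print(box_cox(Z, lam=0))     # identical to np.log(Z)
```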
Prof. M. Kanevski 53
Model Selection & Model Evaluation
Prof. M. Kanevski 54
Guillaume d'Occam (1285 - 1349)
“Pluralitas non est ponenda sine
necessitate”
Occam’s razor:
“The simpler explanation of the phenomena is more likely to be correct”
Prof. M. Kanevski 55
Model Assessment and Model Selection:
Two separate goals
Prof. M. Kanevski 56
Model Selection:
Estimating the performance of different
models in order to choose the
(approximate) best one
Model Assessment:
Having chosen a final model, estimating its
prediction error (generalization error) on
new data
Prof. M. Kanevski 57
If we are in a data-rich situation, the best solution is to split the data randomly (?):
[Diagram: Raw Data → Train: 50%, Validation: 25%, Test: 25%]
Prof. M. Kanevski 58
Interpretation
• The training set is used to fit the models
• The validation set is used to estimate prediction error for model selection (tuning hyperparameters)
• The test set is used for assessment of the generalization error of the final chosen model
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
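A simple way to produce such a random 50/25/25 split (a sketch with the proportions assumed from the diagram above):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Randomly split (X, y) into 50% train, 25% validation and 25% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = len(X) // 2, len(X) // 4
    i_train, i_val, i_test = np.split(idx, [n_train, n_train + n_val])
    return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])

X, y = np.arange(40).reshape(20, 2), np.arange(20)
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 10 5 5
```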
Prof. M. Kanevski 59
Bias and Variance.
Model’s complexity
[Figure: two fits of the same data illustrating (b) overfitting and (c) underfitting]
Prof. M. Kanevski 60
One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples.
This means that the learned function fits the training data very closely; however, it does not generalise well, i.e. it cannot model sufficiently well unseen data from the same task.
Solution: Balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error.
Prof. M. Kanevski 61
Bias-Variance Dilemma
Assume that

$$Y = f(X) + \varepsilon, \qquad \text{where } E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma_{\varepsilon}^2$$
Prof. M. Kanevski 62
We can derive an expression for the expected prediction error of a
regression at an input point X=x0
using squared-error loss:
Prof. M. Kanevski 63
$$Err(x_0) = E\left[(Y - \hat{f}(x_0))^2 \,|\, X = x_0\right]$$
$$= \sigma_{\varepsilon}^2 + \left[E\hat{f}(x_0) - f(x_0)\right]^2 + E\left[\hat{f}(x_0) - E\hat{f}(x_0)\right]^2$$
$$= \sigma_{\varepsilon}^2 + Bias^2\left(\hat{f}(x_0)\right) + Var\left(\hat{f}(x_0)\right)$$
$$= \text{Irreducible Error} + Bias^2 + Variance$$
Prof. M. Kanevski 64
• The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless σε² = 0.
• The second term is the squared bias, the amount by which the average of our estimate differs from the true mean.
• The last term is the variance, the expected squared deviation of f̂(x0) around its mean.
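The decomposition can be checked numerically by Monte Carlo simulation. The sketch below (my own illustration, not from the lecture) uses a deliberately simple and biased estimator, the sample mean of Y, so that all three terms are visible:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2.0 * np.pi * x)          # true regression function f(X)
sigma_eps, x0, n, runs = 0.3, 0.25, 30, 5000

estimates = np.empty(runs)
for r in range(runs):
    X = rng.uniform(0.0, 1.0, n)
    Y = f(X) + rng.normal(scale=sigma_eps, size=n)   # Y = f(X) + eps
    estimates[r] = Y.mean()                          # f_hat(x0): a constant fit

bias2 = (estimates.mean() - f(x0)) ** 2              # squared bias at x0
variance = estimates.var()                           # variance of the estimator
print(sigma_eps ** 2, bias2, variance)               # the three terms of Err(x0)
print(sigma_eps ** 2 + bias2 + variance)             # their sum: the expected error
```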
Prof. M. Kanevski 65
Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001
Prof. M. Kanevski 66
Prof. M. Kanevski 67
• A neural network is only as good as the
training data!
• Poor training data inevitably leads to an
unreliable and unpredictable network.
• Exploratory Data Analysis and data
preprocessing are extremely important!!!
Prof. M. Kanevski 68
MLP modelling. Case Studies.
Original (10 000 points) Training (900 points)
Prof. M. Kanevski 69
MLP modeling
Original MLP prediction
Train RMSE 1.97
Ro 0.69
Which result do you prefer?
Prof. M. Kanevski 70
MLP modeling
Original MLP prediction
Train RMSE 1.61
Ro 0.80
Which result do you prefer?
Prof. M. Kanevski 71
MLP modeling
Original MLP prediction
Train RMSE 1.67
Ro 0.79
Which result do you prefer?
Prof. M. Kanevski 72
MLP modeling
Original MLP prediction
Train RMSE 1.10
Ro 0.92
Which result do you prefer?
Prof. M. Kanevski 73
MLP modeling
Original MLP prediction
Train RMSE 0.83
Ro 0.95
Which result do you prefer?
Prof. M. Kanevski 74
MLP modeling
Original MLP prediction
Train RMSE 0.55
Ro 0.98
Which result do you prefer?
Prof. M. Kanevski 75
MLP modeling
Training statistics
[Charts: Train RMSE and Ro versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20)]
Model 20-20 is the best?
Prof. M. Kanevski 76
MLP modeling. Training statistics

MLP     RMSE   Ro
5       1.97   0.69
10      1.61   0.80
5-5     1.67   0.79
10-10   1.10   0.92
15-15   0.83   0.95
20-20   0.55   0.98
Prof. M. Kanevski 77
MLP modeling. Training & Validation statistics
[Charts: Training and Validation RMSE and Ro versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20)]
Prof. M. Kanevski 78
Prof. M. Kanevski 79
MLP modeling. Validation statistics

MLP     RMSE   Ro
5       2.01   0.68
10      1.66   0.80
5-5     1.70   0.79
10-10   1.25   0.89
15-15   1.24   0.89
20-20   1.39   0.88
Prof. M. Kanevski 80
ANNEX model: Artificial Neural Networks with External drift
Environmental data mapping
Prof. M. Kanevski 81
Traditional application of ANN to spatial predictions
• Data are available at measurement points: F(xi,yi), for i = 1,…,N
• ANN solution: x,y - 2 inputs, F - output
  - select ANN architecture
  - train with available data
  - after training, use it to predict
• Problem: predict F(x,y) at the points without measurements, usually on a regular grid
Prof. M. Kanevski 82
If there is additional information (available at both training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN.
ANNEX is similar to the “Kriging with External Drift” model:
Inputs: x, y, + f_ext(x,y)
Prof. M. Kanevski 83
Examples of external information
• Cheap information on secondary variable
• Physical model of the phenomena
• Remotely sensed images
• GIS data
• DEM data
Prof. M. Kanevski 84
Kriging with external drift
Kriging with external drift is the model where the trend is limited to
E{F(x,y)} = m(x,y) = λ0 + λ1 f_ext(x,y)   (1)
where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated.
In general, kriging with an external drift is a simple and efficient algorithm to incorporate a secondary variable in the estimation of the primary variable.
Prof. M. Kanevski 85
ANNEX model
What relationship between the primary and the external information should there be in the case of ANNEX?
Prof. M. Kanevski 86
ANNEX model
What does external “related” information bring? (How to measure: correlation between variables?)
• Improved accuracy of prediction?
• Reduced uncertainty of prediction?
An important problem is related to the question of the quality of additional data: there is a dilemma between introducing new information and/or new noise.
Prof. M. Kanevski 87
Case study: Kazakh Priaralie, monitoring network
1 400 000 km² - 400 monitoring stations
Prof. M. Kanevski 88
Datasets
GIS DEM model
Average long-term temperatures of air in June (°C)
Prof. M. Kanevski 89
Correlation
Air temperature vs. Altitude
Prof. M. Kanevski 90
Train and Test datasets
[Maps: Train | Test]
Prof. M. Kanevski 91
ANN and ANNEX models

Model                        Correlation  RMSE  MAE   MRE
2-7-5-1                      0.917        2.57  1.96  -0.02
3-3-1                        0.989        0.96  0.73  -0.01
3-5-1                        0.99         0.9   0.7   -0.007
3-7-1                        0.991        0.85  0.66  -0.004
3-8-1                        0.991        0.84  0.68  -0.001
3-9-1                        0.991        0.88  0.69  -0.01
3-10-1                       0.99         0.92  0.74  -0.01
Kriging with external drift  0.984        1.19  0.91  -0.03
Prof. M. Kanevski 92
Scatter plots
[Panels: Kriging | Cokriging | Drift Kriging | ANNEX]
Prof. M. Kanevski 93
Mapping results
[Maps: Kriging | Cokriging | Drift Kriging | ANNEX]
Prof. M. Kanevski 94
Modelling noisy “altitude” effect (100 %)
[Maps: Before | After]
Prof. M. Kanevski 95
Scatter plots between variables (noisy 100 % altitude)
[Panels: Train | Test]
Prof. M. Kanevski 96
Mapping noise results
ANNEX: Air temperature (°C)
Prof. M. Kanevski 97
Noise results

Model                                            Correlation  RMSE  MAE    MRE
Kriging                                          0.874        3.13  2.04   -0.06
Kriging – external drift                         0.984        1.19  0.91   -0.03
3-7-1                                            0.991        0.85  0.66   -0.004
3-8-1                                            0.991        0.84  0.68   -0.001
3-8-1 (100% noise)                               0.839        3.54  2.37   -0.13
3-7-1 (10% noise), Test 1                        0.939        2.32  -1.49  -0.003
Kriging – external drift (10% noise), Test 1     0.941        2.23  1.54   -0.06
3-7-1 (10% noise), Test 2                        0.899        2.81  1.52   -0.08
Kriging – external drift (10% noise), Test 2     0.903        2.81  1.59   -0.103
Prof. M. Kanevski 98
MLP: real case study
Wind fields in Switzerland
Prof. M. Kanevski 99
(pp 168-172 of the book)
Monitoring network:
111 stations in Switzerland
(80 training + 31 for validation)
Mapping of daily:
• Mean speed
• Maximum gust
• Average direction
Modeling of wind fields with MLP
using regularization technique
Prof. M. Kanevski 100
Monitoring network: 111 stations in Switzerland (80 training + 31 for validation)
Mapping of daily:
• Mean speed
• Maximum gust
• Average direction
Input information:
X,Y geographical coordinates
DEM (resolution 500 m)
23 DEM-based « geo-features »
Total: 26 features
Modeling of wind fields with MLP and regularization technique
Model: MLP 26-20-20-3
Prof. M. Kanevski 101
Model:
MLP 26-20-20-3
Training:
• Random initialization
• 500 iterations of the
RPROP algorithm
Training of the MLP
Prof. M. Kanevski 102
Results: naïve approach
Prof. M. Kanevski 103
Results: Noise injection regularization
Prof. M. Kanevski 104
Results: summary
Noise injection regularization
Without regularization (overfitting)
Prof. M. Kanevski 105
Conclusion
• MLP is a nonlinear universal tool for learning from and modeling data. It is an excellent exploratory tool.
• Its application demands deep expert knowledge and experience.