TRANSCRIPT
CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering
Department of Cybernetics
P. Pošík © 2020 Artificial Intelligence
Non-linear models. Basis expansion. Overfitting. Regularization.
Petr Pošík
Czech Technical University in Prague
Faculty of Electrical Engineering
Dept. of Cybernetics
When a linear model is not enough. . .
Question
When a linear model is not enough, . . .
A we are doomed. There are no other methods than linear ones.
B we can easily use non-linear models. They are as easy to train as the linear methods.
C we must use non-linear models, but their training is much more demanding.
D we can use a special type of non-linear models which can be trained in the same way as the linear methods.
Basis expansion
a.k.a. feature space straightening.
Why?
■ A linear decision boundary (or linear regression model) may not be flexible enough to perform accurate classification (regression).
■ The algorithms for fitting linear models can be used to fit (a certain type of) non-linear models!
How?
■ Let’s define a new multidimensional image space F.
■ Feature vectors x are transformed into this image space F (new features are derived) using mapping Φ:
x → z = Φ(x),
x = (x1, x2, . . . , xD) → z = (Φ1(x), Φ2(x), . . . , ΦG(x)),
while usually D ≪ G.
■ In the image space, a linear model is trained. However, this is equivalent to training a non-linear model in the original space.
fG(z) = w1 z1 + w2 z2 + . . . + wG zG + w0
f(x) = fG(Φ(x)) = w1 Φ1(x) + w2 Φ2(x) + . . . + wG ΦG(x) + w0
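To make the mechanics concrete, here is a minimal Python sketch of the idea, assuming numpy and an illustrative polynomial mapping Φ (the function name phi, the data, and the degree are hypothetical choices, not from the lecture):

import numpy as np

# Hypothetical mapping Phi: x -> z = (x, x^2, ..., x^degree).
def phi(x, degree=3):
    return np.column_stack([x**d for d in range(1, degree + 1)])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 3, size=50)
y = x**2 + rng.normal(0, 1, size=50)         # noisy non-linear target

Z = phi(x)                                    # transform into the image space F
Z1 = np.column_stack([np.ones(len(Z)), Z])    # prepend a column of ones for w0
w, *_ = np.linalg.lstsq(Z1, y, rcond=None)    # ordinary least squares, linear in F

y_hat = Z1 @ w   # a linear model in F is a non-linear (here cubic) model in x

Note that the training step is plain linear least squares; all the non-linearity lives in Φ.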
Two coordinate systems
Feature space: x = (x1, x2, . . . , xD)
Image space:   z = (z1, z2, . . . , zG), e.g.
    z1 = log x1
    z2 = x1^2 x3
    z3 = e^x2
    . . .
Linear model trained in the image space:
    fG(z) = w1 z1 + w2 z2 + w3 z3 + . . . + w0
Equivalent non-linear model in the feature space:
    f(x) = w1 log x1 + w2 x1^2 x3 + w3 e^x2 + . . . + w0
Two coordinate systems: simple graphical example
[Figure: 1D feature space (axis x) vs. 2D image space (axes x, x^2). The 1D data are transformed into the high-dimensional image space via z = (x, x^2); a linear model is trained in the image space; mapped back, it is a non-linear model in the feature space.]
Basis expansion: remarks
Advantages:
■ Universal, generally usable method.
Disadvantages:
■ We must define which new features shall form the high-dimensional space F.
■ The examples must actually be transformed into the high-dimensional space F.
■ When too many derived features are used, the resulting models are prone to overfitting (see next slides).
For certain types of algorithms, there is a way to perform the basis expansion without actually carrying out the mapping! (See the next lecture.)
How to evaluate a predictive model?
Question
[Figure: a scatter of training data; x ranges from -2 to 4, y from -4 to 10.]
Which of the following models will have the smallest training error if fitted to the above data?
A y = w0 + w1x
B y = w0 + w1x + w2x^2
C y = w0 + w1x + w2x^2 + w3x^3
D y = w0 + w1x + w2x^2 + w3x^3 + w4x^4
Model evaluation
Fundamental question: What is a good measure of “model quality” from the machine-learning standpoint?
■ We have various measures of model error:
■ For regression tasks: MSE, MAE, . . .
■ For classification tasks: misclassification rate, measures based on confusion matrix, . . .
■ Some of them can be regarded as finite approximations of the Bayes risk.
■ Are these functions good approximations when measured on the same data that were used for training?
[Figure: the same data fitted by two models, f(x) = x and f(x) = x^3 − 3x^2 + 3x.]
Using MSE only, both models are equivalent!!!
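The equivalence is easy to verify: f(x) = x^3 − 3x^2 + 3x equals x at x = 0, 1, and 2 (their difference factors as x(x−1)(x−2)). So if the training inputs sit at those points on the line y = x (an assumption made here for illustration; the figure's exact data are not given), both models have zero training MSE. A quick Python check:

import numpy as np

x = np.array([0.0, 1.0, 2.0])   # assumed training inputs on the line y = x
y = x

f1 = x                          # f(x) = x
f2 = x**3 - 3*x**2 + 3*x        # f(x) = x^3 - 3x^2 + 3x

print(np.mean((f1 - y)**2), np.mean((f2 - y)**2))   # both print 0.0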
[Figure: the same data fitted by f(x) = −0.09 + 0.99x and f(x) = 0.00 − 0.31x + 1.67x^2 − 0.51x^3.]
Using MSE only, the cubic model is better than linear!!!
A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.
Validation on testing data
Example: Polynomial regression with varying degree:
X ∼ U(−1, 3)
Y ∼ X^2 + N(0, 1)
[Figure: six panels showing the training data, the fitted polynomial, and the testing data (x in [-2, 4], y in [-4, 10]):
Polynom deg.: 0, tr. err.: 8.319, test. err.: 6.901
Polynom deg.: 1, tr. err.: 2.013, test. err.: 2.841
Polynom deg.: 2, tr. err.: 0.647, test. err.: 0.925
Polynom deg.: 3, tr. err.: 0.645, test. err.: 0.919
Polynom deg.: 5, tr. err.: 0.611, test. err.: 0.979
Polynom deg.: 9, tr. err.: 0.545, test. err.: 1.067]
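The experiment is easy to reproduce in Python with the generator stated above (the sample sizes and the seed are arbitrary here, so the exact errors will differ from the figure):

import numpy as np

rng = np.random.default_rng(2)

def sample(n):                    # X ~ U(-1, 3), Y ~ X^2 + N(0, 1)
    x = rng.uniform(-1, 3, n)
    return x, x**2 + rng.normal(0, 1, n)

x_tr, y_tr = sample(30)
x_te, y_te = sample(30)

for deg in (0, 1, 2, 3, 5, 9):
    c = np.polyfit(x_tr, y_tr, deg)                   # fit on training data
    tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)   # training MSE
    te = np.mean((np.polyval(c, x_te) - y_te) ** 2)   # testing MSE
    print(f"deg {deg}: tr. err. {tr:.3f}, test. err. {te:.3f}")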
Training and testing error
[Figure: MSE vs. polynom degree (0 to 10) for the training error and the testing error.]
■ The training error decreases with increasing model flexibility.
■ The testing error is minimal for a certain degree of model flexibility.
Overfitting
Definition of overfitting:
■ Let M be the space of candidate models.
■ Let m1 ∈ M and m2 ∈ M be 2 different models from this space.
■ Let ErrTr(m) be an error of the model m measured on the training dataset (training error).
■ Let ErrTst(m) be an error of the model m measured on the testing dataset (testing error).
■ We say that m2 is overfitted if there is another m1 for which
ErrTr(m2) < ErrTr(m1) ∧ ErrTst(m2) > ErrTst(m1).
[Figure: model error vs. model flexibility for training and testing data; m1 sits near the minimum of the testing error, m2 in the region where the testing error grows again.]
■ “When overfitted, the model works well for the training data, but fails for new (testing) data.” E.g., in the polynomial example above, the degree-9 model (tr. err. 0.545, test. err. 1.067) is overfitted relative to the degree-3 model (tr. err. 0.645, test. err. 0.919).
■ Overfitting is a general phenomenon affecting all kinds of inductive learning of models with tunable flexibility.
We want models and learning algorithms with a good generalization ability, i.e.
■ we want models that encode only the relationships valid in the whole domain, not those that learned the specifics of the training data, i.e.
■ we want algorithms able to find only the relationships valid in the whole domain and ignore the specifics of the training data.
Bias vs Variance
[Figure: three panels from the polynomial example.
High bias, model not flexible enough (underfit): polynom deg. 1, tr. err. 2.013, test. err. 2.841.
“Just right” (good fit): polynom deg. 2, tr. err. 0.647, test. err. 0.925.
High variance, model flexibility too high (overfit): polynom deg. 9, tr. err. 0.545, test. err. 1.067.]
High bias problem:
■ ErrTr(h) is high
■ ErrTst(h) ≈ ErrTr(h)
High variance problem:
■ ErrTr(h) is low
■ ErrTst(h) ≫ ErrTr(h)
[Figure: model error vs. model flexibility for training and testing data.]
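As a rough rule of thumb, the two regimes can be told apart from the two errors alone. A Python sketch (the thresholds high and ratio are arbitrary assumptions, chosen here so that the polynomial example above is classified as in the figure):

def diagnose(err_tr, err_tst, high=1.0, ratio=1.5):
    if err_tr > high:                 # both errors high, ErrTst ~ ErrTr
        return "high bias (underfit)"
    if err_tst > ratio * err_tr:      # ErrTst >> ErrTr
        return "high variance (overfit)"
    return "reasonable fit"

print(diagnose(2.013, 2.841))   # deg 1: high bias (underfit)
print(diagnose(0.647, 0.925))   # deg 2: reasonable fit
print(diagnose(0.545, 1.067))   # deg 9: high variance (overfit)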
Crossvalidation
How to estimate the true error of a model on new, unseen data?
Simple crossvalidation:
■ Split the data into training and testing subsets.
■ Train the model on training data.
■ Evaluate the model error on testing data.
K-fold crossvalidation:
■ Split the data into k folds (k is usually 5 or 10).
■ In each iteration:
■ Use k − 1 folds to train the model.
■ Use 1 fold to test the model, i.e. measure error.
Iter. 1: Training | Training | Testing
Iter. 2: Training | Testing | Training
. . .
Iter. k: Testing | Training | Training
■ Aggregate (average) the k error measurements to get the final error estimate.
■ Train the model on the whole data set.
Leave-one-out (LOO) crossvalidation:
■ k = |T|, i.e. the number of folds is equal to the training set size.
■ Time consuming for large |T|.
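A minimal Python sketch of k-fold crossvalidation (the helper names k_fold_cv, fit, and error are hypothetical; the data reuse the quadratic example from earlier):

import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """fit(X, y) -> model; error(model, X, y) -> scalar. Both user-supplied."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)            # split the data into k folds
    errs = []
    for i in range(k):
        test = folds[i]                        # 1 fold to test the model
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])        # k - 1 folds to train
        errs.append(error(model, X[test], y[test]))
    return np.mean(errs)                       # aggregate the k measurements

rng = np.random.default_rng(3)
X = rng.uniform(-1, 3, 40)
y = X**2 + rng.normal(0, 1, 40)
fit = lambda X, y: np.polyfit(X, y, 2)
err = lambda c, X, y: np.mean((np.polyval(c, X) - y) ** 2)
print(k_fold_cv(X, y, fit, err))   # estimate of the true error; then retrain on all data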
How to determine a suitable model flexibility
Simply test models of varying complexities and choose the one with the best testing error, right?
■ The testing data are used here to tune a meta-parameter of the model.
■ The testing data are used to train (a part of) the model, thus essentially become part of training data.
■ The error on testing data is no longer an unbiased estimate of model error; it underestimates it.
■ A new, separate data set is needed to estimate the model error.
Using simple crossvalidation:
1. Training data: use ca. 50 % of the data for model building.
2. Validation data: use ca. 25 % of the data to search for the suitable model flexibility.
3. Train the chosen model on training + validation data.
4. Testing data: use ca. 25 % of the data for the final estimate of the model error.
Using k-fold crossvalidation:
1. Training data: use ca. 75 % of the data to find and train a suitable model using crossvalidation.
2. Testing data: use ca. 25 % of the data for the final estimate of the model error.
The ratios are not set in stone; there are other possibilities, e.g. 60:20:20, etc.
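A Python sketch of the whole procedure with simple crossvalidation, assuming polynomial models and MSE (the helper names, the data, and the seed are illustrative; the ratios follow the 50:25:25 list above):

import numpy as np

def split(X, y, ratios=(0.5, 0.25, 0.25), seed=0):
    """Split data into training/validation/testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tr = int(ratios[0] * len(X))
    n_val = int(ratios[1] * len(X))
    tr, val, te = np.split(idx, [n_tr, n_tr + n_val])
    return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])

rng = np.random.default_rng(4)
X = rng.uniform(-1, 3, 80)
y = X**2 + rng.normal(0, 1, 80)
train, val, test = split(X, y)
mse = lambda c, data: np.mean((np.polyval(c, data[0]) - data[1]) ** 2)

# 1.-2. Choose the degree (model flexibility) with the best validation error.
best = min(range(10), key=lambda deg: mse(np.polyfit(*train, deg), val))
# 3. Retrain the chosen model on training + validation data.
c = np.polyfit(np.concatenate([train[0], val[0]]),
               np.concatenate([train[1], val[1]]), best)
# 4. Final estimate of the model error on the held-out testing data.
print(best, mse(c, test))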
How to prevent overfitting?
1. Feature selection: Reduce the number of features.
■ Manually select which features to keep.
■ Try to identify a suitable subset of features during the learning phase (many feature selection methods exist; none is perfect).
2. Regularization:
■ Keep all the features, but reduce the magnitude of their weights w.
■ Works well if we have a lot of features, each of which contributes a bit to predicting y.
Regularization
Question
Which of the following methods will NOT help with overfitting:
A Choosing only a subset of input variables.
B Increasing the number of training data.
C Penalizing the size of model parameters.
D Basis expansion, i.e., deriving new features from the old ones.
Ridge regularization (a.k.a. Tikhonov regularization)
Ridge regularization penalizes the size of the model coefficients:
■ Modification of the optimization criterion:
J(w) = (1/|T|) Σ_{i=1}^{|T|} (y^(i) − h_w(x^(i)))^2 + α Σ_{d=1}^{D} w_d^2.
■ The solution is given by a modified normal equation:
w* = (X^T X + αI)^{−1} X^T y
■ As α → 0, w_ridge → w_OLS.
■ As α → ∞, w_ridge → 0.
OLS = ordinary least squares, i.e. simple multiple linear regression.
Training and testing errors as functions of the regularization parameter:
[Figure: MSE vs. regularization factor (10^-8 to 10^8) for the training error and the testing error.]
The values of coefficients (weights w) as functions of the regularization parameter:
[Figure: coefficient size vs. regularization factor (10^-8 to 10^8).]
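A minimal numpy sketch of the modified normal equation (the data and alpha values are illustrative; note that the slide's criterion leaves w0 unpenalized, while this sketch penalizes all weights including the bias column for brevity):

import numpy as np

def ridge_fit(X, y, alpha):
    # w* = (X^T X + alpha I)^-1 X^T y, computed via a linear solve
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(5)
x = rng.uniform(-1, 3, 30)
y = x**2 + rng.normal(0, 1, 30)
X = np.column_stack([x**d for d in range(10)])   # bias column + degrees 1..9

for alpha in (1e-8, 1e-2, 1e2):
    print(alpha, np.round(ridge_fit(X, y, alpha), 2))   # weights shrink as alpha grows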
Lasso regularization
Lasso regularization penalizes the size of the model coefficients:
■ Modification of the optimization criterion:
J(w) = (1/|T|) Σ_{i=1}^{|T|} (y^(i) − h_w(x^(i)))^2 + α Σ_{d=1}^{D} |w_d|.
■ As α → ∞, lasso regularization decreases the number of non-zero coefficients, effectively also performing feature selection and creating sparse models.
Training and testing errors as functions of the regularization parameter:
[Figure: MSE vs. regularization factor (10^-8 to 10^2) for the training error and the testing error.]
The values of coefficients as functions of the regularization parameter:
[Figure: coefficient size vs. regularization factor (10^-8 to 10^2).]
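Unlike ridge, lasso has no closed-form solution, so a library solver is typically used. A sketch with scikit-learn (assuming it is available; alpha is arbitrary) contrasting the sparsity of lasso with ridge:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
x = rng.uniform(-1, 3, 30)
y = x**2 + rng.normal(0, 1, 30)
X = np.column_stack([x**d for d in range(1, 10)])   # polynomial features, deg 1..9

lasso = Lasso(alpha=0.1, max_iter=100_000).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso:", np.round(lasso.coef_, 3))   # several coefficients exactly 0 (sparse)
print("ridge:", np.round(ridge.coef_, 3))   # small but generally non-zero coefficients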
Summary
Competencies
After this lecture, a student shall be able to . . .
1. explain the reason for doing basis expansion (feature space straightening), and describe its principle;
2. show the effect of basis expansion with a linear model on a simple example for both classification and regression settings;
3. implement user-defined basis expansions in a programming language;
4. list advantages and disadvantages of basis expansion;
5. explain why the error measured on the training data is not a good estimate of the expected error of the model for new data, and whether it under- or overestimates the true error;
6. explain basic methods to get an unbiased estimate of the true model error (testing data, k-fold crossvalidation, LOO crossvalidation);
7. describe the general form of the dependency of training and testing errors on the model complexity/flexibility/capacity;
8. define overfitting;
9. discuss high bias and high variance problems of models;
10. explain how to proceed if a suitable model complexity must be chosen as part of the training process;
11. list 2 basic methods for overfitting prevention;
12. describe the principles of ridge (Tikhonov) and lasso regularizations and their effects on the model parameters.