COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18 Lecture 3: Classification with Logistic Regression · Advanced optimization techniques · Underfitting & Overfitting · Model selection (Training, Validation & Test sets)


Page 1:

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18

Lecture 3:

• Classification with Logistic Regression

• Advanced optimization techniques

• Underfitting & Overfitting

• Model selection (Training- & Validation- & Test set)

Page 2:

CLASSIFICATION WITH LOGISTIC REGRESSION

Page 3:

Logistic Regression

• Despite the name: classification, not regression

• Classification = recognition (assigning inputs to discrete classes)

Page 4:

Logistic Regression

• “The” default classification model

• Binary classification (extensions to multi-class later in the course)

• Simple classification algorithm

• Convex cost: a unique local optimum (which is therefore global)

• Trained with gradient descent

• No more parameters than with linear regression

• Interpretable parameters

• Fast evaluation of the hypothesis for making predictions

Page 5:

LOGISTIC REGRESSION: Hypothesis

Page 6:

Example (step function hypothesis)

i | Tumour size (mm), x | Malignant? y
1 | 2.3 | 0 (N)
2 | 5.1 | 1 (Y)
3 | 1.4 | 0 (N)
4 | 6.3 | 1 (Y)
5 | 5.3 | 1 (Y)
… | … | …

(“labelled data”: inputs x with labels y ∈ {0, 1})

[Plot: tumour size (x) against the label (0 = benign, 1 = malignant); a step-function hypothesis jumps from 0 to 1 at the decision boundary.]

Page 7:

Example (logistic function hypothesis)

Hypothesis: the tumour is malignant with probability p
Classification: predict 0 if p < 0.5, predict 1 if p ≥ 0.5

i | Tumor size (mm), x | Malignant? y
1 | 2.3 | 0 (N)
2 | 5.1 | 1 (Y)
3 | 1.4 | 0 (N)
4 | 6.3 | 1 (Y)
5 | 5.3 | 1 (Y)
… | … | …

(“labelled data”)

[Plot: logistic curve rising from 0 to 1 over tumour size, with the decision boundary where p = 0.5; a new query point (“?”) gets p = 0.2 and is assigned class 0.]
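As a minimal sketch of this hypothesis-plus-threshold rule (the parameter values below are made up purely for illustration; they place the decision boundary near x = 3.5 mm):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta0, theta1):
    """Probability that the tumour is malignant, and the predicted class."""
    p = sigmoid(theta0 + theta1 * x)
    return p, 1 if p >= 0.5 else 0

# Illustrative (made-up) parameters; boundary where theta0 + theta1*x = 0
theta0, theta1 = -7.0, 2.0

p, c = predict(2.3, theta0, theta1)   # small tumour: low probability, class 0
q, d = predict(6.3, theta0, theta1)   # large tumour: high probability, class 1
```

Points near the boundary get probabilities close to 0.5; points far from it get probabilities close to 0 or 1.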

Page 8:

Example (logistic function hypothesis)

Hypothesis: the tumour is malignant with probability p
Classification: predict 0 if p < 0.5, predict 1 if p ≥ 0.5

(same labelled data as before)

[Plot: logistic curve over tumour size; this query point lies far from the decision boundary and gets p = 0.001, class 0.]

Page 9:

Logistic (Sigmoid) function

• σ(z) = 1 / (1 + e⁻ᶻ)

• Advantages over the step function for classification:

  • Differentiable (enables gradient descent)

  • Contains additional information (how certain is the prediction?)
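The differentiability claim is easy to verify: the sigmoid has the closed-form derivative σ′(z) = σ(z)(1 − σ(z)), which a finite-difference check confirms:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Closed-form derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a numerical (central finite-difference) derivative
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```

This cheap derivative is exactly what gradient descent needs at every update.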

Page 10:

Logistic regression hypothesis (one input)

[Plots: top row, the linear part z = θ₀ + θ₁x over x ∈ [−4, 4] for three parameter settings; bottom row, the resulting hypothesis σ(z) ∈ [0, 1]. Changing the parameters shifts and steepens the sigmoid.]

Page 11:

Classification with multiple inputs

i | Tumor size (mm), x1 | Age, x2 | Malignant? y
1 | 2.3 | 25 | 0 (N)
2 | 5.1 | 62 | 1 (Y)
3 | 1.4 | 47 | 0 (N)
4 | 6.3 | 39 | 1 (Y)
5 | 5.3 | 72 | 1 (Y)
… | … | … | …

[Plot: the two classes scattered in the tumor size (x1) / age (x2) plane.]

Page 12:

Multiple inputs and logistic hypothesis

1. Reduce the point in the high-dimensional space to a scalar z

2. Apply the logistic function

(same labelled data as before)

[Plot: the tumor size (x1) / age (x2) plane with a decision boundary; a query point (“?”) gets p = 0.8, class 1.]

Page 13:

Classification with multiple inputs

1. Reduce the point in the high-dimensional space to a scalar z

2. Apply the logistic function

(same labelled data as before)

[Plot: the tumor size (x1) / age (x2) plane; this query point lies far from the decision boundary and gets p = 0.999, class 1.]

Page 14:

Logistic regression hypothesis

1. Reduce the high-dimensional input to a scalar: z = θᵀx

2. Apply the logistic function: hθ(x) = σ(z) = 1 / (1 + e⁻ᶻ)

3. Interpret the output as the probability p(y = 1 | x) and predict the class:

Class = 1 if hθ(x) ≥ 0.5, else 0
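The three steps translate directly into code. A minimal plain-Python sketch (the θ values are made up for illustration, with the input vector assumed to carry the constant feature x[0] = 1):

```python
import math

def hypothesis(theta, x):
    """Logistic regression hypothesis for an input vector x.

    Step 1: reduce the input to a scalar z = theta^T x.
    Step 2: squash z with the logistic function.
    x is assumed to include the constant feature x[0] = 1.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(theta, x):
    """Step 3: interpret the output as p(y=1|x) and threshold at 0.5."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0

# Illustrative (made-up) parameters for inputs [1, tumour size, age]
theta = [-10.0, 1.5, 0.05]
x_small = [1.0, 2.3, 25.0]   # z = -10 + 3.45 + 1.25 = -5.3
x_large = [1.0, 6.3, 39.0]   # z = -10 + 9.45 + 1.95 =  1.4
```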

Page 15:

LOGISTIC REGRESSION: Cost function

Page 16:

Logistic regression cost function

• How well does the hypothesis fit the data?

[Plot: three example points in the tumor size (x1) / age (x2) plane:]

• Prediction: 0.1, actual value y: 0

• Prediction: 0.98, actual value y: 1

• Prediction: 0.6, actual value y: 0

Page 17:

Logistic regression cost function

• Probabilistic model: y is 1 with probability hθ(x) = σ(θᵀx)

[Plot: the tumor size (x1) / age (x2) plane.]

Page 18:

Logistic regression cost function

• How well does the hypothesis fit the data?

• „Cost“ for predicting probability p when the real value is y:

  cost(p, y) = −y log(p) − (1 − y) log(1 − p)

• Mean over all training examples:

  J(θ) = (1/m) Σᵢ cost(hθ(xᵢ), yᵢ)

[Plots: the per-example cost over p ∈ [0, 1]: −log(p) for y = 1 and −log(1 − p) for y = 0; confident wrong predictions are penalized heavily.]
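A minimal sketch of this cross-entropy cost, evaluated on the three example predictions from the previous slide:

```python
import math

def cross_entropy_cost(predictions, labels):
    """Mean logistic regression cost over all training examples:
    J = -(1/m) * sum( y*log(p) + (1-y)*log(1-p) )
    """
    m = len(labels)
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / m

# Confident, correct predictions give a small cost ...
good = cross_entropy_cost([0.1, 0.98], [0, 1])
# ... while a fairly confident wrong prediction (p=0.6 for y=0) costs much more
bad = cross_entropy_cost([0.6], [0])
```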

Page 19:

Multiple inputs and logistic hypothesis

• How well does the hypothesis fit the data?

  • Prediction: 0.1, actual value y: 0

  • Prediction: 0.98, actual value y: 1

  • Prediction: 0.6, actual value y: 0

[Plots: the per-example cost curves −log(p) for y = 1 and −log(1 − p) for y = 0.]

Page 20:

Comparison cost functions

Linear regression | Logistic regression

[Plots: the squared-error cost as a function of the prediction error (linear regression) next to the two logistic cost curves −log(p) and −log(1 − p) (logistic regression).]

Mean over all training examples: J(θ) = (1/m) Σᵢ cost(hθ(xᵢ), yᵢ)

Page 21:

Why not mean squared error (MSE) again?

• MSE with the logistic hypothesis is non-convex (many local minima)

• The logistic regression cost is convex (unique minimum)

• The cost function can be derived from statistical principles („maximum likelihood“)

[Plots: a non-convex function with several local minima next to a convex function with a single minimum.]

Page 22:

LOGISTIC REGRESSION: Explaining the cost function with probabilistic models

(more information in second half of the semester)

Page 23:

Cost functions and probabilistic models

• The “likelihood” of the data:

  p(y = y₁, …, yₙ, X = x₁, …, xₙ | θ)

  i.e., the probability of the given dataset as a function of the parameters θ.

  Or, if the output is assumed to depend on the input:

  p(y = y₁, …, yₙ | X = x₁, …, xₙ; θ)

• We want to maximize the likelihood of the data

• We usually maximize the log-likelihood instead (or minimize the negative log-likelihood)

• Because the logarithm:

  • is monotonically increasing (so the maximizer is unchanged)

  • turns products into sums

  • is more numerically stable for small numbers (like probabilities)

Page 24:

Cost functions and probabilistic models (linear regression)

• Model the deviation around the prediction

[Plot: body height against knee height with a fitted regression line; the data scatter around the line.]

Page 25:

Cost functions and probabilistic models (linear regression)

• Model the deviation around the prediction as Gaussian noise:

  p(yᵢ | xᵢ; θ) is a Gaussian with mean hθ(xᵢ) and standard deviation σ

• The probability of each data point is thus given by this Gaussian density

• Maximization of the log-likelihood is then equivalent to minimization of the MSE

Page 26:

Cost function and probabilistic models (for logistic regression)

• Probabilistic model: y is 1 with probability hθ(x) = σ(θᵀx)

[Plot: the tumor size (x1) / age (x2) plane.]

Page 27:

Cost function and probabilistic models (for logistic regression)

Probabilistic model: y is 1 with probability p(C₁ | x) = hθ(x) = σ(xᵀθ)
(y is a Bernoulli random variable: it represents a coin toss with an unfair coin)

The parameters should maximize the log-likelihood of the data:

  maxθ log p(y | X; θ)

If the data points are independent, i.e. p(yᵢ, yⱼ | X; θ) = p(yᵢ | X; θ) p(yⱼ | X; θ), this becomes:

  maxθ Σᵢ log p(yᵢ | X; θ)

Separating positive and negative examples:

  maxθ Σ_{i: yᵢ=1} log p(yᵢ = 1 | X; θ) + Σ_{i: yᵢ=0} log p(yᵢ = 0 | X; θ)

  with p(yᵢ = 1 | X; θ) = σ(xᵢᵀθ) and p(yᵢ = 0 | X; θ) = 1 − σ(xᵢᵀθ)

Page 28:

Comparison of cost functions

Linear regression: modeling the errors as Gaussian noise
Logistic regression: modeling the class as a Bernoulli variable

[Plots: the squared-error cost next to the two logistic cost curves −log(p) and −log(1 − p).]

Mean over all training examples: J(θ) = (1/m) Σᵢ cost(hθ(xᵢ), yᵢ)

Page 29:

LOGISTIC REGRESSION: Learning from data

Page 30:

Minimizing the cost via gradient descent

• Gradient descent: θⱼ ← θⱼ − α ∂J(θ)/∂θⱼ (simultaneous update for j = 0…n)

• Gradient of the logistic regression cost:

  ∂J(θ)/∂θⱼ = (1/m) Σᵢ (hθ(xᵢ) − yᵢ) xᵢⱼ   („error“ × „input“)

  (for j = 0: xᵢ₀ = 1)
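As a minimal plain-Python sketch of this update rule (learning rate and iteration count are made-up choices), run on the one-feature tumour data from the earlier table:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient descent update for all theta_j.

    grad_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij   ("error" times "input")
    Rows of X are assumed to start with the constant feature x_0 = 1.
    """
    m, n = len(X), len(theta)
    errors = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) - yi
              for x, yi in zip(X, y)]
    grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
    return [t - alpha * g for t, g in zip(theta, grad)]

# Toy data: one feature (tumour size), labels from the lecture table
X = [[1, 2.3], [1, 5.1], [1, 1.4], [1, 6.3], [1, 5.3]]
y = [0, 1, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
```

After training, the learned boundary separates the small (benign) from the large (malignant) tumours.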

Page 31:

Linear features

[Plot: the tumor size (x1) / age (x2) plane with a linear decision boundary.]

Page 32:

Non-linear features

[Plot: the tumor size (x1) / age (x2) plane with a non-linear decision boundary.]

Page 33:

Decision boundaries

[Plots: a linear decision boundary and a non-linear decision boundary in the tumor size (x1) / age (x2) plane.]

The decision boundary is a property of the hypothesis, not of the data!

Page 34:

Linear vs. Logistic Regression

Linear Regression

• Regression

• Hypothesis: hθ(x) = θᵀx

• Cost for one training example: (hθ(x) − y)²

• Gradient: „error“ × „input“

• Analytical solution exists

Logistic Regression

• Binary classification (!)

• Hypothesis: hθ(x) = σ(θᵀx)

• Cost for one training example: −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

• Gradient: „error“ × „input“ (same form)

• No analytical solution!

Page 35:

EVALUATION OF HYPOTHESIS: Training and Test set

Page 36:

Training and Test set

• Training set: used by the learning algorithm to fit parameters and find a hypothesis.

• Test set: independent data set, used after learning to estimate the performance of the hypothesis on new (unseen) test examples.

Regression example: e.g., 80% randomly chosen examples from the dataset are training examples, the remaining 20% are test examples. The subsets must be disjoint!

[Plot: data points in x/y with training and test examples marked.]
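A minimal sketch of such a random 80/20 split (the fixed seed is an illustrative choice for reproducibility):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly split a dataset into disjoint training and test sets.

    E.g., with test_fraction=0.2: 80% training examples, 20% test examples.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(round(len(data) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [d for i, d in enumerate(data) if i not in test_idx]
    test = [d for i, d in enumerate(data) if i in test_idx]
    return train, test

data = list(range(100))
train, test = train_test_split(data)
```

The two subsets are disjoint by construction and together cover the whole dataset.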

Page 37:

Training and Test set workflow

Training set → Learning algorithm → „Hypothesis“ h → Training error (cost)

Test set → Testing of h → Test error (cost)

Page 38:

Linear regression training vs. test error

[Plots: a polynomial of degree 14 fitted to the training data (MSEtrain = 0.01) and evaluated on the test data (MSEtest = 2.02); the hypothesis passes close to every training point but deviates wildly from the test points.]

Page 39:

Classification Training / Test set

[Plots: training set and test set scatter plots with the class labels „1“ and „0“.]

Page 40:

Logistic regression training vs. test error

[Plots: a classifier with polynomial features up to power 10; the decision boundary wraps tightly around the training points. Training error: 0.05, test error: 2.71.]

Page 41:

UNDERFITTING AND OVERFITTING

Page 42:

Polynomial regression under-/overfitting

[Plot: polynomial fit of degree 1 to the training and test data; MSEtrain = 0.22, MSEtest = 0.29.]

Page 43:

Polynomial regression under-/overfitting

[Plot: polynomial fit of degree 5; MSEtrain = 0.01, MSEtest = 0.01.]

Page 44:

Polynomial regression under-/overfitting

[Plot: polynomial fit of degree 15; MSEtrain = 0.00, MSEtest = 3.17.]

Page 45:

Polynomial regression under-/overfitting

underfitting: polynomial fit of degree 1 (MSEtrain = 0.22, MSEtest = 0.29)

„just right“: polynomial fit of degree 5 (MSEtrain = 0.01, MSEtest = 0.01)

overfitting: polynomial fit of degree 15 (MSEtrain = 0.00, MSEtest = 3.17)

[Plots: the three fits side by side on the training and test data.]

Page 46:

Logistic regression with polynomial terms

[Plots: training set and test set for a two-class problem.]

Page 47:

Logistic regression with polynomial terms

Terms up to power 1: training error 0.33, test error 0.32 [Plots: linear decision boundary]

Terms up to power 2: training error 0.19, test error 0.19 [Plots: curved decision boundary]

Page 48:

Logistic regression with polynomial terms

Terms up to power 20: training error 0.17, test error 0.44 [Plots: highly irregular decision boundary]

Page 49:

underfitting: terms up to power 1 (training error 0.33, test error 0.32)

„just right“: terms up to power 2 (training error 0.19, test error 0.19)

overfitting: terms up to power 20 (training error 0.17, test error 0.44)

[Plots: the three decision boundaries side by side.]

Page 50:

Under-/ and Overfitting

[Plot: training error (cost) and test error (cost) as a function of model complexity (e.g., degree of polynomial terms). The training error decreases monotonically; the test error first decreases, then rises again. Low complexity: underfitting; high complexity: overfitting; the test-error minimum is „just right“.]

Page 51:

Under- and Overfitting

• Underfitting:

  • Model is too simple (often: too few parameters)

  • High training error, high test error

• Overfitting:

  • Model is too complex (often: too many parameters relative to the number of training examples)

  • Low training error, high test error

• In between:

  • Model has the „right“ complexity

  • Moderate training error

  • Lowest test error

[Plot: training and test error (cost) over model complexity.]

Page 52:

How to deal with overfitting

• Use model selection to automatically select the right model complexity

• Use regularization to keep parameters small (later lecture)

• Collect more data (often not possible or inefficient)

• Manually throw out features which are unlikely to contribute (often hard to guess which ones; you may throw out the wrong ones)

• Find better features with less noise that are more predictive of the output (often not possible or inefficient)

Page 53:

MODEL SELECTION: Training, Validation and Test sets

Page 54:

Model selection

• Selection of the learning algorithm and its „hyperparameters“ (model complexity) that are most suitable for a given learning problem

[Plot: training and test error (cost) over model complexity.]

Page 55:

Idea

• Try out different learning algorithms/variants:

  • Vary the degree of the polynomial

  • Try different sets of features

  • …

• Select the variant with the best predictive performance

[Plot: training error and „prediction“ error (cost) over model complexity.]

Page 56:

Training, Validation, Test set

• Training set: used by learning algorithm to fit parameters and find a hypothesis for each learning algorithm/variant.

• Validation set: used to estimate predictive performance of each learning algorithm/variant. The hypothesis with lowest validation error (cost) is selected.

• Test set: independent data set, used after learning and model selection to estimate the performance of the final (selected) hypothesis on new (unseen) test examples.

E.g. 60/20/20% randomly chosen examples from dataset. Must be disjoint subsets!
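A minimal sketch of such a 60/20/20 split (the fixed seed is an illustrative choice for reproducibility):

```python
import random

def train_val_test_split(data, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split a dataset into disjoint training, validation and test sets
    (e.g., 60/20/20% randomly chosen examples)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
```

The validation set drives the model-selection choice; the test set is touched only once, at the very end.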

Page 57:

Training/Validation/Test set workflow

Hyperparameter

setting B

Training Set

Validation setModel

selection

Hyperparameter

setting A

Hypothesis hA Hypothesis hB

Hyperparameter

setting C

Hypothesis hC

For example:

degree of

polynomial=1, 5, and 15

Test set Testing Test error/cost

Selected

Hypothesis h

Learning Algorithm with:

Page 58:

Training/Validation/Test set workflow

Learning

algorithm B

Training Set

Validation setModel

selection

Learning

algorithm A

Hypothesis hA Hypothesis hB

Learning

algorithm C

Hypothesis hC

Test set Testing Test error/cost

For example:

Linear regression,

Polynomial regression,

Artificial Neural Network

Selected

Hypothesis h

Page 59:

GRADIENT DESCENT TRICKS AND MORE ADVANCED OPTIMIZATION TECHNIQUES: For linear regression, logistic regression, …

Page 60:

Minimizing the cost via gradient descent

• Gradient descent: θⱼ ← θⱼ − α ∂J(θ)/∂θⱼ (simultaneous update for j = 0…n)

• Gradient of the logistic regression cost:

  ∂J(θ)/∂θⱼ = (1/m) Σᵢ (hθ(xᵢ) − yᵢ) xᵢⱼ   („error“ × „input“)

  (for j = 0: xᵢ₀ = 1)

Page 61:

GD trick #1: feature scaling

• Feature scaling and mean normalization:

  • Bring all features into a similar range

  • E.g., shift and scale each feature to have mean 0 and variance 1:

    xⱼ ← (xⱼ − μⱼ) / σⱼ   (μⱼ: mean of the unscaled feature, σⱼ: standard deviation of the unscaled feature)

  • Do not apply this to the constant feature x₀ = 1!

  • Typically leads to much faster convergence (especially when using non-linear features)
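A minimal sketch of this standardization, applied to the tumour size and age columns from the earlier table (population mean/std used for illustration):

```python
def standardize(columns):
    """Scale each feature column to mean 0 and (population) variance 1.

    The constant feature x_0 = 1 must NOT be scaled (its std is 0).
    Returns the scaled columns plus the (mean, std) pairs, which must be
    reused to scale future inputs in exactly the same way.
    """
    scaled, stats = [], []
    for col in columns:
        mu = sum(col) / len(col)
        var = sum((v - mu) ** 2 for v in col) / len(col)
        sigma = var ** 0.5
        scaled.append([(v - mu) / sigma for v in col])
        stats.append((mu, sigma))
    return scaled, stats

(sizes_scaled, ages_scaled), stats = standardize([[2.3, 5.1, 1.4, 6.3, 5.3],
                                                  [25, 62, 47, 39, 72]])
```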

Page 62:

GD trick #2: monitoring convergence

• Diagnose typical issues with gradient descent by monitoring the cost J over iterations:

[Plots: three cost-vs-iteration curves; one annotation marks iteration 300.]

• … slow convergence (increase the learning rate?)

• … oscillations (decrease the learning rate)

• … divergence (decrease the learning rate)

Page 63:

GD trick #3: adaptive learning rate

• At each iteration:

  • Compare the cost function value before and after the gradient descent update

  • If the cost increased:

    • Reject the update (go back to the previous parameters)

    • Multiply the learning rate by 0.7 (for example)

  • If the cost decreased:

    • Multiply the learning rate by 1.02 (for example)

[Plots: cost J over iterations for the problematic runs from the previous slide, now with the adaptive learning rate.]

Often eliminates slow convergence and divergence issues.
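The rule above can be sketched in a few lines. This is a toy illustration on a made-up convex quadratic, not the lecture's exact implementation; the starting learning rate of 1.0 is deliberately too large so the adaptation is visible:

```python
def adaptive_gd(cost, grad, theta, alpha=1.0, iters=100):
    """Gradient descent with an adaptive learning rate:
    reject the update and shrink alpha (x0.7) if the cost increased,
    keep it and grow alpha (x1.02) if the cost decreased."""
    c = cost(theta)
    for _ in range(iters):
        candidate = [t - alpha * g for t, g in zip(theta, grad(theta))]
        c_new = cost(candidate)
        if c_new > c:
            alpha *= 0.7          # reject update, go back, decrease rate
        else:
            theta, c = candidate, c_new
            alpha *= 1.02         # accept update, increase rate
    return theta, c

# Toy convex objective: (x - 3)^2 + (y + 1)^2, minimum at (3, -1)
cost = lambda t: (t[0] - 3) ** 2 + (t[1] + 1) ** 2
grad = lambda t: [2 * (t[0] - 3), 2 * (t[1] + 1)]
theta, final_cost = adaptive_gd(cost, grad, [0.0, 0.0])
```

Even though the initial rate diverges for a plain fixed-rate run, the shrink-on-increase rule pulls it back into a convergent range.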

Page 64:

Variants of Gradient Descent

Source: http://sebastianruder.com/optimizing-gradient-descent/index.html

• SGD: Stochastic Gradient Descent

• Momentum: with momentum term

• RMSProp: adaptive learning rate

Page 65:

More advanced optimization methods

• Gradient methods = order 1; Newton methods = order 2

• Need the Hessian matrix or approximations to it

• Avoid choosing a learning rate

• Examples: conjugate gradient, BFGS, L-BFGS, …

• Tricky to implement (numerical stability, etc.): use available toolbox/library implementations, e.g. scipy.optimize.minimize

• Only use them when fighting for performance
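A minimal usage sketch of the library route mentioned above (a made-up convex quadratic stands in for a real cost function; BFGS is one of the quasi-Newton methods named on the slide):

```python
import numpy as np
from scipy.optimize import minimize

# Toy convex objective with minimum at (3, -1); no learning rate needed.
def cost(theta):
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

# BFGS builds an approximation of the Hessian from successive gradients.
result = minimize(cost, x0=np.zeros(2), method="BFGS")
```

For a real problem you would pass your own cost (and, ideally, its gradient via the `jac` argument) instead of this toy function.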

Page 66:

SUMMARY: And questions

Page 67:

Some questions…

• Logistic regression is a method for … regression/classification?

• What is the hypothesis for logistic regression?

• What's the cost function used for logistic regression?

• Is the cost function convex or non-convex?

• What is under-/overfitting?

• What is model selection?

• What are training, validation and test sets?

• How does model selection work (procedure)?

• What does “adaptive learning rate” mean in the context of gradient descent?

Page 68:

What is next?

• Neural Networks (Guillaume Bellec):

• Perceptron

• Feedforward Neural Network

• Backpropagation