computational intelligence sew · why linear regression? • simplest machine learning algorithm...

COMPUTATIONAL

INTELLIGENCE SEW(INTRODUCTION TO MACHINE LEARNING) SS18

Lecture 2:

• Linear Regression

• Gradient Descent

• Non-linear basis functions

Practical of Friday rescheduled to Monday

11:00 to 12:00 and

12:00 to 13:00

News group:

tu-graz.lv.ew

LINEAR REGRESSION

MOTIVATION

Why Linear Regression?

• Simplest machine learning algorithm for regression• Widely used in biological, behavioural and social sciences to describe

and to extract relationships between variables from data

• Prediction of real-valued outputs

• Easy to implement, fast to execute

• Benchmark algorithm for comparison with more complex algorithms

• Introduction to notation and concepts that we will need again later in

the course• Data format, vector & matrix notation

• Learning from data by minimizing a cost function

• Gradient descent

• Non-linear features and basis functions• Preparation for neural networks

Applications of (linear) regression

• Brain computer interfaces

• https://www.youtube.com/watch?v=Ae6En8-eaww

• Neuroprosthetic control

• https://www.youtube.com/watch?v=X_AI4MiY6L4

https://www.youtube.com/watch?v=Ae6En8-eaww

https://www.youtube.com/watch?v=X_AI4MiY6L4

LINEAR REGRESSION

WITH ONE INPUT

A regression problem• We want to learn to predict a person’s height based on his/her

knee height and/or arm span

• This is useful for patients who are bed bound or in a wheelchair

and cannot stand to take an accurate measurement of their height

Knee

Height

[cm]

Arm

span

[cm]

Height

[cm]

50 166 171

56 172 175

52 174 168

… … …

Linear regression with one input

…

Learning algorithm

„Hypothesis“

hx

Training set

Hypothesis

Parameters

Test input

Prediction

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

?

?

Example Data

Knee

height

[cm]

Arm

span

[cm]

Height

[cm]

50 166 171

56 172 175

52 174 168

… … …

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

160 165 170 175 180 185 190170

175

180

185

190

armspan

body h

eig

ht

m=30 data points

Example Data

4550

5560

160

180

200170

175

180

185

190

knee heightarmspan

body h

eig

ht

Knee

Height

[cm]

Arm

span

[cm]

Height

[cm]

50 166 171

56 172 175

52 174 168

… … …

Linear regression with one input

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

Knee

Height

[cm]

Height

[cm]

50 171

56 175

52 168

… …

HypothesisParameters ?

Which hypothesis is better?

In what sense is it better?

Formalization of problem

• Given m training examples

• Goal: learn parameters

such that

for all training examples i=1…30.

…

Knee

Height

[cm]

Height

[cm]

50 171

56 175

52 168

… …

m=30 data points

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

Least Squares Objective

• Minimize Error

0.6

150


• Minimize Error

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

10.77

0.6

150

cost function mean squared error

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht


• Minimize Error

5.94

0.75

140

cost function mean squared error

Cost function illustrated

Properties of cost function:

• Quadratic function

• „Bowl“-shaped

• Unique local and global

minimum (under

„regular“ conditions)

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

10.77

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

5.94

Minimizing the cost

• Two ways to find the parameters

minimizing

• Gradient descent

• Direct analytical solution

(setting derivatives = 0)

Recall: Functions of multiple variables

• Example:

• Partial derivatives

• Gradients vectors formed with the partial derivatives (fundamental in lecture 2)

• Chain rule (fundamental for neural networks in lecture 4)

• Function of multiple variable with high dimensional values

• Jacobian matrix formed with the partial derivatives

GRADIENT DESCENT

Descending in the steepest directionGradient descent on some arbitrary cost function …

learning rate („eta“)

Gradient descent algorithm

• Repeat until convergence

(simultaneously updating

and )

partial derivative of

with respect to

negative gradient =

descent

Gradient is orthogonal to contour

lines

-2-1

01

2 -2

-1

0

1

20

0.5

1

1.5

2

2.5

3

3.5

4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

A contour line

is a line along which

= const

Potential issues with gradient descent

• May get stuck in local minima

• Learning rate too small: slow

convergence

• Learning rate too large: oscillations,

divergence

too small too large

LINEAR REGRESSION

WITH GRADIENT

DESCENT(ONE INPUT)

Application of gradient descent

• Linear regression cost • Gradient descent

(simultaneous update)

(simultaneous

update)

“error” “input”

”learning rate”

Predicting height from knee height

• Optimal fit to training data

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

0.8

137.4

LINEAR REGRESSIONMORE GENERAL FORMULATION: MULTIPLE FEATURES

Multiple inputs (features)

• Notation:

… number of training examples

… number of features

… input features of i‘th training example (vector-valued)

…. value of feature j in i‘th training example

Knee

Height

x1

Arm

span

x2

Age

x3

Height

y

50 166 32 171

56 172 17 175

52 174 62 168

… … … …

= 3

=

56

172

17

= 17

Linear hypothesis

• Hypothesis (one input):

• Hypothesis (multiple input features):

• More compact notation:

Example: h(x) = 50 + 0.5*kneeheight + 0.3*armspan + 0.1*age

Introduce

Why? Notation convenience!

Multiple inputs (features) revisited

• Notation:

… number of training examples

… number of features

… input features of i‘th training example (vector-valued)

…. value of feature j in i‘th training example

x0

Knee

Height

x1

Arm

span

x2

Age

x3

Height

y

1 50 166 32 171

1 56 172 17 175

1 52 174 62 168

1 … … … …

= 3

=

1

56

172

17

= 17

= 1

Matrix and vector notation

x0

Knee

Height

x1

Arm

span

x2

Age

x3

Height

y

1 50 166 32 171

1 56 172 17 175

1 52 174 62 168

(n+1) ˟ 1 m ˟ (n+1) m ˟ 1

design matrixfeatures of i‘th training example output/target vector

Matrix and vector notation

x0

Knee

Height

x1

Arm

span

x2

Age

x3

Height

y

1 50 166 32 171

1 56 172 17 175

1 52 174 62 168

𝐻 𝜽 = 𝑋𝜽

LINEAR REGRESSION

WITH GRADIENT

DESCENT(GENERAL FORMULATION)

Linear regression problem statement

• Hypothesis:

• Cost function:

Goal is to find parameters which minimize the cost

high-dimensional quadratic

(„bowl“-shaped) function

Gradient descent (multiple features)

(simultaneous

update for

j=0…n)

For j = 0: define for convenience

with one input feature:

with n input features:

(simultaneous

update)

“error”

“error”

“input”

“input””learning rate”

”learning rate”

LINEAR REGRESSION

ANALYTICAL SOLUTION

Analytical solution

… design matrix

… output/target vector

• Set all partial derivatives of cost

function = 0

• Solving system of linear

equations yields:

Moore-Penrose Pseudoinverse of

• Note: This analytical solution requires that columns of are linearly

independent („regular“ conditions)

Example: analytical solution applied

to problem with one input

Knee

Height

[cm]

Height

[cm]

50 171

56 175

52 168

… …

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

Example: analytical solution applied

to problem with one input

Knee

Height

[cm]

Height

[cm]

50 171

56 175

52 168

… … 30 ˟ 2 30 ˟ 1

2 ˟ 2

2 ˟ 2

2 ˟ 1

Predicting height from knee height

45 50 55 60170

175

180

185

190

knee height

body h

eig

ht

0.8

137.4

Gradient descent Analytical solution

• Need to choose learning

rate

• Iterative algorithm (needs

many iterations to

converge)

• Works well even when

number of input features

is large

• No need to choose

• Direct solution (no

iteration)

• Slow if is too large

(inverting n x n matrix)

NON-LINEAR FEATURES(NON-LINEAR BASIS FUNCTIONS)

Non-linear trends in data

x y

0.01 -0.27

-1.22 2.63

0.17 -0.13

… …

-4 -3 -2 -1 0 1 2 3-2

0

2

4

6

8

10

12

14

16

-4 -3 -2 -1 0 1 2 3-2

0

2

4

6

8

10

12

14

16

• How can we learn non-linear hypotheses?

?

? ? ?

Linear fit to this “non-linear” data

x y

0.01 -0.27

-1.22 2.63

0.17 -0.13

… …

standard design matrix

Hypothesis:

Optimal parameters:

Linear fit to this “non-linear” data

-4 -3 -2 -1 0 1 2 3-2

0

2

4

6

8

10

12

14

16

Non-linear (quadratic) fit

x y

0.01 -0.27

-1.22 2.63

0.17 -0.13

… …

design matrix with

non-linear features

Hypothesis:

Optimal parameters:

Non-linear (quadratic) fit

-4 -3 -2 -1 0 1 2 3-2

0

2

4

6

8

10

12

14

16

Non-linear (sinusoid) fit

x y

0.01 -0.27

-1.22 2.63

0.17 -0.13

… …

design matrix with

non-linear features

Hypothesis:

Optimal parameters:

Non-linear (sinusoidal) fit

-4 -3 -2 -1 0 1 2 3-2

0

2

4

6

8

10

12

14

16

Non-linear input features (in general)

• Feature 2 for each training example i is computed by applying a

non-linear basis function:

• Allows to learn a variety of non-linear functions with the same technique(s):• Analytical or gradient descent

all features of

1st training example

feature 2 of all training examples

Polynomial regression• Features are powers of x

n = degree of polynome

to be learned

n=0 n=1

n=3 n=9

What happened here?

Next lecture…

Radial basis functions

• „Gaussian“-shaped RBFs:• Each basis function j has a center in the input space

• The width of the basis functions is determined by

-6 -4 -2 0 2 4 6 80

0.2

0.4

0.6

0.8

1

x

-6 -4 -2 0 2 4 6 80

0.2

0.4

0.6

0.8

1

x

Radial basis functions

• „Gaussian“-shaped RBFs:• Each basis function j has a center in the input space

• The width of the basis functions is determined by

Fitting a single RBF to data

-4 -2 0 2 4 6-2

0

2

4

6

8

10

12

14

16

RBF with

-4 -2 0 2 4 6-2

0

2

4

6

8

10

12

14

16

-4 -2 0 2 4 60

0.2

0.4

0.6

0.8

1

-4 -2 0 2 4 6-15

-10

-5

0

Fitting RBFs to data

-4 -2 0 2 4 6-2

0

2

4

6

8

10

12

14

16

-4 -2 0 2 4 6-2

0

2

4

6

8

10

12

14

16

RBFs with

SUMMARY (QUESTIONS)

Some questions…

• Hypothesis for linear regression = ?

• Cost function for linear regression = ?

• How many local minima may the cost function for lin. reg. have (under

regular conditions)?

• Name two ways to minimize the cost function?

• General gradient descent formula?

• How is Linear regression with gradient descent solved?

• What issues can arise during gradient descent?

• What is the design matrix? What are its dimensions?

• Analytical solution for linear regression = ?• What are the components of the solution?

• Pros and Cons of gradient descent vs. analytical solution?

• How can one learn non-linear hypotheses with linear regression?

• What is polynomial regression?

• What are radial basis functions?

What is next?

• Classification with Logistic Regression

• Gradient descent tricks & more advanced optimization techniques

• Underfitting & Overfitting

• Model selection (Training, Validation and test set)

computational intelligence sew · why linear regression? • simplest machine learning algorithm...

Documents