EECS 349 Linear Regression - Interactive Audio Lab

Page 1

Machine Learning

Topic 4: Linear Regression Models

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349

(Contains ideas and a few images from Wikipedia and books by Alpaydin, Duda/Hart/Stork, and Bishop. Updated Fall 2015.)

Page 2

Regression Learning Task

There is a set of possible examples: X = {x1, ..., xn}

Each example is a vector of k real-valued attributes: xi = <xi1, ..., xik>

There is a target function that maps X onto some real value Y: f : X → Y

The DATA is a set of tuples <example, response value>: {<x1, y1>, ..., <xn, yn>}

Find a hypothesis h such that ∀x, h(x) ≈ f(x)

Page 3

Why use a linear regression model?

• Easily understood

• Interpretable

• Well studied by statisticians – many variations and diagnostic measures

• Computationally efficient

Page 4

Linear Regression Model


Assumption: The observed response (dependent) variable, y, is the true function, f(x), plus additive Gaussian noise, ε, with zero mean:

y = f(x) + ε

Assumption: The expected value of the response variable y is a linear combination of the k independent attributes/features.

Page 5

The Hypothesis Space

Given the assumptions on the previous slide, our hypothesis space is the set of linear functions (hyperplanes)

The goal is to learn a k+1 dimensional vector of weights that define a hyperplane minimizing an error criterion.


h(x) = w0 + w1x1 + w2x2 + ... + wkxk

w = <w0, w1, ..., wk>

(w0 is the offset from the origin. You always need w0.)

Page 6

Simple Linear Regression

• x has 1 attribute (a predictor variable)
• The hypothesis function is a line:

y = h(x) = w0 + w1x

Example:

(Figure: a line fit to data; x on the horizontal axis, y on the vertical axis.)

Page 7

The Error Criterion

Typically we estimate parameters by minimizing the sum of squared residuals (RSS), also known as the Sum of Squared Errors (SSE):

RSS = Σ_{i=1}^{n} (yi − h(xi))²

where n is the number of training examples and yi − h(xi) is the residual for example i.
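To make the criterion concrete, here is a minimal sketch (not from the slides; the data values and weights are invented) that computes the RSS of a candidate line on a toy dataset.

```python
import numpy as np

# Invented toy data: n = 5 training examples, one attribute each.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([0.1, 0.9, 0.6, -0.4, -0.8])

def rss(w0, w1, x, y):
    """Sum of squared residuals between observed y and the line's predictions."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# A candidate simple linear hypothesis h(x) = w0 + w1*x.
print(rss(0.5, -1.2, x, y))
```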

Page 8

Multiple (Multivariate*) Linear Regression

• Many attributes: x1, ..., xk
• The hypothesis h(x) is a hyperplane:

h(x) = w0 + w1x1 + w2x2 + ... + wkxk

*NOTE: In statistical literature, multivariate linear regression is regression with multiple outputs, and the case of multiple input variables is simply "multiple linear regression."

(Figure: a plane fit to data over two attributes, x1 and x2.)

Page 9

Formatting the data

Create a new 0th dimension with value 1 and prepend it to the beginning of every example vector. This placeholder corresponds to the offset w0.

One training example: xi = <1, xi,1, xi,2, ..., xi,k>

Format the data as a matrix of examples X and a vector of response values y:

X = [ 1  x1,1  ...  x1,k
      1  x2,1  ...  x2,k
      ...
      1  xn,1  ...  xn,k ]

y = [ y1, y2, ..., yn ]ᵀ

Page 10

There is a closed-form solution!

Our goal is to find the weights of a function

h(x) = w0 + w1x1 + w2x2 + ... + wkxk

…that minimizes the sum of squared residuals:

RSS = Σ_{i=1}^{n} (yi − h(xi))²

It turns out that there is a closed-form solution to this problem:

w = (XᵀX)⁻¹Xᵀy

Just plug your training data into the above formula and the best hyperplane comes out!
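To make the formula concrete, here is a minimal NumPy sketch (the data are invented) of fitting a hyperplane with the closed-form solution above; in practice numpy.linalg.solve or numpy.linalg.lstsq is usually preferred over forming the explicit inverse.

```python
import numpy as np

# Invented toy data: n = 6 examples, k = 2 attributes.
X_raw = np.array([[0.1, 1.0],
                  [0.4, 0.8],
                  [0.5, 0.3],
                  [0.7, 0.9],
                  [0.8, 0.1],
                  [1.0, 0.5]])
y = np.array([1.2, 1.1, 0.4, 1.3, 0.3, 0.9])

# Prepend the constant-1 column so that w[0] plays the role of the offset w0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Closed-form least-squares weights: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

y_hat = X @ w                      # predictions of the learned hyperplane
print("weights:", w)
print("RSS:", np.sum((y - y_hat) ** 2))
```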

Page 11

RSS in vector/matrix notation

RSS(w) = Σ_{i=1}^{n} (yi − h(xi))²

       = Σ_{i=1}^{n} (yi − w0 − Σ_{j=1}^{k} xij wj)²

       = (y − Xw)ᵀ(y − Xw)

Page 12

Deriving the formula to find w

RSS(w) = (y − Xw)ᵀ(y − Xw)

∂RSS/∂w = −2Xᵀ(y − Xw)

Setting the derivative to zero and solving for w:

0 = −2Xᵀ(y − Xw)
0 = Xᵀ(y − Xw)
0 = Xᵀy − XᵀXw

XᵀXw = Xᵀy
w = (XᵀX)⁻¹Xᵀy
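As a quick numerical sanity check on this derivation (a sketch with randomly generated data), the gradient −2Xᵀ(y − Xw) should vanish at the closed-form solution, and numpy.linalg.lstsq should return the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])  # constant column + 3 attributes
y = rng.normal(size=20)

w = np.linalg.inv(X.T @ X) @ X.T @ y       # closed-form solution
grad = -2 * X.T @ (y - X @ w)              # dRSS/dw evaluated at the solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(grad, 0.0, atol=1e-8))   # True: the gradient is (numerically) zero
print(np.allclose(w, w_lstsq))             # True: agrees with the least-squares solver
```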

Page 13

What if (XᵀX) is singular?

• We said there was a closed-form solution: w = (XᵀX)⁻¹Xᵀy

• This presupposes the matrix (XᵀX) is invertible (non-singular), so that we can find (XᵀX)⁻¹

• If two columns of X are exactly linearly related, and thus not independent, then (XᵀX) is NOT invertible

• What then?

Page 14

Your Friend: Dimensionality Reduction

• We need to make every column of X linearly independent.

• The easy way: add random noise to X.

• The (probably) better way: do dimensionality reduction to get rid of those redundant columns.
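To illustrate the problem and one simple repair (this sketch uses invented data, and dropping a redundant column is only a crude stand-in for a fuller dimensionality-reduction step such as PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x1 = rng.normal(size=n)
x2 = 2.0 * x1                               # exactly linearly related to x1
X = np.column_stack([np.ones(n), x1, x2])
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))       # 2, not 3: X^T X is singular, inv() would fail

# Repair: keep only linearly independent columns before solving.
X_reduced = X[:, :2]                        # drop the redundant column x2
w = np.linalg.inv(X_reduced.T @ X_reduced) @ X_reduced.T @ y
print(w)
```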

Page 15

Making polynomial regression

You’re familiar with linear regression where the input has k dimensions.

We can use this same machinery to do polynomial regression from a one-dimensional input. Start with the linear model

h(x) = w0 + w1x1 + w2x2 + ... + wkxk

and turn it into the polynomial

h(x) = w0 + w1x + w2x² + ... + wkxᵏ

Page 16

Making polynomial regression

Given a scalar example z, we can make a k+1 dimensional example x:

x = <z⁰, z¹, z², ..., zᵏ>

The ith element of x is the power zⁱ.

h(x) = w0 + w1z + w2z² + ... + wkzᵏ

Page 17

Making polynomial regression

h(x) = w0 + w1x1 + w2x2 + ... + wkxk
     = w0 + w1z + w2z² + ... + wkzᵏ

Since xk ≡ zᵏ, we can interpret the output of the regression as a polynomial function of z.
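To make the feature map concrete, here is a small sketch (the function name and weights are my own, for illustration) that turns a scalar z into the k+1 dimensional vector <z⁰, z¹, ..., zᵏ> described above.

```python
import numpy as np

def poly_features(z, k):
    """Map a scalar z to the vector <z^0, z^1, ..., z^k>."""
    return np.array([z ** i for i in range(k + 1)])

x = poly_features(0.5, k=3)
print(x)                               # [1.    0.5   0.25  0.125]

w = np.array([0.2, -1.0, 0.0, 4.0])    # made-up weights w0..w3
print(w @ x)                           # = w0 + w1*z + w2*z^2 + w3*z^3 at z = 0.5
```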

Page 18

Polynomial Regression

• Model the relationship between the response variable and the attributes/predictor variables as a kth-order polynomial. While this can model non-linear functions, it is still linear with respect to the coefficients.

h(x) = w0 + w1z + w2z² + w3z³

(Figure: example fit to data; x on the horizontal axis, t on the vertical axis.)

Page 19

Polynomial Regression

Parameter estimation (analytically minimizing sum of squared residuals):

(Note: there is only 1 attribute z for each training example. The superscripts are powers, since we're doing polynomial regression.)

Each row of X is one training example:

X = [ 1  z1¹  ...  z1ᵏ
      1  z2¹  ...  z2ᵏ
      ...
      1  zn¹  ...  znᵏ ]

y = [ y1, y2, ..., yn ]ᵀ

w = (XᵀX)⁻¹Xᵀy
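Putting the pieces together, here is a hedged end-to-end sketch of polynomial regression on invented data: build the matrix of powers of z shown above (numpy.vander is just a convenience for constructing it) and apply the same closed-form formula.

```python
import numpy as np

rng = np.random.default_rng(2)
z = np.linspace(0, 1, 10)                                    # 10 scalar training inputs
y = np.sin(2 * np.pi * z) + 0.1 * rng.normal(size=z.size)    # noisy targets (invented)

k = 3                                           # order of the polynomial
X = np.vander(z, N=k + 1, increasing=True)      # columns: z^0, z^1, ..., z^k
w = np.linalg.inv(X.T @ X) @ X.T @ y            # w = (X^T X)^{-1} X^T y

z_new = 0.25
print(np.vander([z_new], N=k + 1, increasing=True) @ w)      # prediction h(z_new)
```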

Page 20

Tuning Model Complexity: Example

What is your hypothesis for ?

(Figure: training data; x on the horizontal axis, t on the vertical axis.)

Page 21

Tuning Model Complexity: Example

What is your hypothesis for ?

k = 0 (kth-order polynomial regression)

(Figure: the M = 0 fit to the training data; x on the horizontal axis, t on the vertical axis.)

Page 22

Tuning Model Complexity: Example

What is your hypothesis for ?

k = 1 (kth-order polynomial regression)

(Figure: the M = 1 fit to the training data; x on the horizontal axis, t on the vertical axis.)

Page 23

Tuning Model Complexity: Example

What is your hypothesis for ?

k = 3 (kth-order polynomial regression)

(Figure: the M = 3 fit to the training data; x on the horizontal axis, t on the vertical axis.)

Page 24

Tuning Model Complexity: Example

What is your hypothesis for ?

k = 9 (kth-order polynomial regression)

(Figure: the M = 9 fit to the training data; x on the horizontal axis, t on the vertical axis.)

Page 25

Tuning Model Complexity: Example

What is your hypothesis for ?

k = 9 (kth-order polynomial regression)

(Figure: the M = 9 fit to the training data; x on the horizontal axis, t on the vertical axis.)

Page 26

Tuning Model Complexity: Example

What happens if we fit to more data?

k = 9, m = 15 training examples (kth-order polynomial regression)

(Figure: the M = 9 fit with N = 15 data points; x on the horizontal axis, t on the vertical axis.)

Page 27

Tuning Model Complexity: Example

What happens if we fit to more data?

k = 9, m = 100 training examples (kth-order polynomial regression)

(Figure: the M = 9 fit with N = 100 data points; x on the horizontal axis, t on the vertical axis.)
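One way to see the effect these slides illustrate is to fit polynomials of increasing order and compare training RSS with RSS on held-out data from the same source; the sketch below uses invented sinusoidal data in the spirit of these figures.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(m):
    z = rng.uniform(0, 1, size=m)
    return z, np.sin(2 * np.pi * z) + 0.2 * rng.normal(size=m)

z_train, y_train = make_data(10)
z_test, y_test = make_data(100)                 # held-out data from the same source

for k in (0, 1, 3, 9):
    X_train = np.vander(z_train, N=k + 1, increasing=True)
    X_test = np.vander(z_test, N=k + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)    # least-squares fit
    rss_train = np.sum((y_train - X_train @ w) ** 2)
    rss_test = np.sum((y_test - X_test @ w) ** 2)
    print(f"k={k}: train RSS {rss_train:.3f}, held-out RSS {rss_test:.3f}")
```

Typically the training RSS keeps falling as k grows while the held-out RSS eventually rises, which is the overfitting behavior shown in the figures.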

Page 28

Bias and Variance of an Estimator

• Let X be a sample from a population specified by a true parameter θ

• Let d = d(X) be an estimator for θ

mean square error = variance + bias²:

E[(d − θ)²] = E[(d − E[d])²] + (E[d] − θ)²
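For reference, the decomposition the slide names can be written out as follows (a standard result, not taken from the slide itself), with θ the true parameter and d = d(X) the estimator.

```latex
\underbrace{E\!\left[(d-\theta)^2\right]}_{\text{mean square error}}
  = E\!\left[(d - E[d] + E[d] - \theta)^2\right]
  = \underbrace{E\!\left[(d - E[d])^2\right]}_{\text{variance}}
  + \underbrace{\left(E[d]-\theta\right)^2}_{\text{bias}^2}
% the cross term 2\,(E[d]-\theta)\,E[d - E[d]] vanishes because E[d - E[d]] = 0
```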

Page 29

Bias and Variance

As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data).

(Figure: bias and variance plotted against the order of a polynomial fit to some data.)

Page 30

Bias and Variance of Hypothesis Fn

• Bias: measures how much h(x) is wrong, disregarding the effect of varying samples. (This is the statistical bias of an estimator. This is NOT the same as inductive bias, which is the set of assumptions that your learner is making.)

• Variance: measures how much h(x) fluctuates around its expected value as the sample varies.

NOTE: These concepts are general machine learning concepts, not specific to linear regression.

Page 31

Coefficient of Determination

• The coefficient of determination, or R², indicates how well data points fit a line or curve. We'd like R² to be close to 1.

E_RSS = Σ_{i=1}^{n} (yi − h(xi))² / Σ_{i=1}^{n} (yi − ȳ)², where ȳ is the sample mean

R² = 1 − E_RSS
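A minimal sketch (invented numbers) of computing this quantity and the corresponding R² = 1 − E_RSS:

```python
import numpy as np

y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])        # observed responses (made up)
y_hat = np.array([1.1, 2.0, 3.0, 4.0, 5.0])    # predictions h(x_i) from some fitted model

e_rss = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(1.0 - e_rss)                              # R^2, close to 1 for a good fit
```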

Page 32

Don’t just rely on numbers, visualize!

For all 4 sets: the same mean and variance for x, (almost) the same mean and variance for y, and the same regression line and correlation between x and y (and therefore the same R²).

Page 33

Summary of Linear Regression Models

• Easily understood
• Interpretable
• Well studied by statisticians
• Computationally efficient
• Can handle non-linear situations if formulated properly
• Bias/variance tradeoff (occurs in all machine learning)
• Visualize!!
• GLMs

Page 34

Appendix

(Stuff I couldn’t cover in class)

Page 35

Bias and Variance


high bias, low variance

Page 36

Bias and Variance


high bias, high variance

Page 37

Bias and Variance


low bias, high variance

Page 38

Bias and Variance


low bias, low variance

Page 39

Bias and Variance

• Bias: measures how much h(x) is wrong, disregarding the effect of varying samples. High bias → underfitting.

• Variance: measures how much h(x) fluctuates around its expected value as the sample varies. High variance → overfitting.

There's a trade-off between bias and variance.

Page 40

Ways to Avoid Overfitting

• Simpler model – e.g., fewer parameters

• Regularization – penalize complexity in the objective function (see the ridge regression sketch after this list)

• Fewer features

• Dimensionality reduction of features (e.g. PCA)

• More data…
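As a concrete example of the regularization bullet above, here is a hedged sketch of ridge regression (an L2 penalty on the weights; it is not covered on these slides). The closed form becomes w = (XᵀX + λI)⁻¹Xᵀy, which also sidesteps the singular-XᵀX problem from earlier.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: minimize RSS + lam * ||w||^2.
    Closed form: w = (X^T X + lam*I)^{-1} X^T y.
    (For simplicity this also penalizes the offset w0.)"""
    return np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y

rng = np.random.default_rng(4)
z = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * z) + 0.2 * rng.normal(size=10)
X = np.vander(z, N=10, increasing=True)          # a deliberately flexible 9th-order model

w_plain = np.linalg.lstsq(X, y, rcond=None)[0]
w_ridge = ridge_fit(X, y, lam=1e-3)
print(np.abs(w_plain).max(), np.abs(w_ridge).max())   # the penalty shrinks the weights
```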

Page 41

Model Selection

• Cross-validation: Measure generalization accuracy by testing on data unused during training

• Regularization: penalize complex models. E′ = error on data + λ · model complexity. Examples: Akaike's information criterion (AIC), Bayesian information criterion (BIC)

• Minimum description length (MDL): Kolmogorov complexity, shortest description of data

• Structural risk minimization (SRM)

Page 42

Generalized Linear Models

• Models shown have assumed that the response variable follows a Gaussian distribution around the mean

• Can be generalized to response variables that take on any exponential family distribution (Generalized Linear Models - GLMs)
