
Page 1:

Linear Regression

Jia-Bin Huang

Virginia Tech, Spring 2019
ECE-5424G / CS-5824

Page 2:

Page 3:

Page 4:

Administrative

• Office hours
  • Chen Gao
  • Shih-Yang Su

• Feedback (Thanks!)
  • Notation?
  • More descriptive slides?
  • Video/audio recording?
  • TA hours (uniformly spread over the week)?

Page 5:

Recap: Machine learning algorithms

           | Supervised Learning | Unsupervised Learning
Discrete   | Classification      | Clustering
Continuous | Regression          | Dimensionality reduction

Page 6:

Recap: Nearest neighbor classifier

• Training data

$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(N)}, y^{(N)})$

• Learning

Do nothing.

• Testing

$h(x) = y^{(k)}$, where $k = \operatorname{argmin}_i D(x, x^{(i)})$
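A minimal NumPy sketch of this rule (the toy data and the Euclidean choice of $D$ are illustrative assumptions, not from the slides):

```python
import numpy as np

def nn_predict(x, X_train, y_train):
    """1-NN: return the label of the training point closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # D(x, x^(i)) for all i
    k = np.argmin(dists)                         # index of the nearest neighbor
    return y_train[k]

# Toy data: two 2D clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
y_train = np.array([0, 0, 1, 1])
print(nn_predict(np.array([1.8, 2.2]), X_train, y_train))  # -> 1
```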

Page 7:

Recap: Instance/Memory-based Learning

1. A distance metric
   • Continuous? Discrete? PDF? Gene data? Learn the metric?

2. How many nearby neighbors to look at?
   • 1? 3? 5? 15?

3. A weighting function (optional)
   • Closer neighbors matter more

4. How to fit with the local points?
   • Kernel regression

Slide credit: Carlos Guestrin

Page 8:

Validation set

• Split the training set: hold out a "fake" test set and use it to tune hyper-parameters

Slide credit: CS231 @ Stanford

Page 9:

Cross-validation

• 5-fold cross-validation: split the training data into 5 equal folds

• Use 4 of them for training and 1 for validation, rotating so each fold serves as the validation set once

Slide credit: CS231 @ Stanford
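A minimal sketch of this procedure, here tuning the k of a k-NN classifier; the dataset, fold split, and candidate values of k are all illustrative assumptions:

```python
import numpy as np

def knn_predict(x, X_tr, y_tr, k):
    """k-NN prediction by majority vote among the k nearest training points."""
    idx = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return np.bincount(y_tr[idx]).argmax()

def cv_accuracy(X, y, k, n_folds=5):
    """Average validation accuracy of k-NN over n_folds folds."""
    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accs = []
    for fold in folds:
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False                           # train on the other 4 folds
        preds = np.array([knn_predict(x, X[mask], y[mask], k) for x in X[fold]])
        accs.append(np.mean(preds == y[fold]))       # validate on the held-out fold
    return np.mean(accs)

# Pick the hyper-parameter k with the best cross-validated accuracy.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
best_k = max([1, 3, 5, 15], key=lambda k: cv_accuracy(X, y, k))
print(best_k)
```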

Page 10:

Things to remember

• Supervised learning
  • Training/testing data; classification/regression; hypothesis

• k-NN
  • Simplest learning algorithm
  • With sufficient data, this "strawman" approach is very hard to beat

• Kernel regression/classification
  • Set k to n (the number of data points) and choose a kernel width
  • Smoother than k-NN

• Problems with k-NN
  • Curse of dimensionality
  • Not robust to irrelevant features
  • Slow NN search: must store the (very large) dataset for prediction

Page 11:

Today’s plan: Linear Regression

• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation

Page 12:

Linear Regression

• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation

Page 13:

[Diagram: Training set → Learning Algorithm → Hypothesis $h$; size of house $x$ → $h$ → estimated price $y$]

Regression: real-valued output

Page 14:

House pricing prediction

[Plot: Price ($) in 1000's (100–400) vs. Size in feet² (500–2500), showing the training points]

Page 15:

• Notation:
  • $m$ = number of training examples
  • $x$ = input variable / features
  • $y$ = output variable / target variable
  • $(x, y)$ = one training example
  • $(x^{(i)}, y^{(i)})$ = $i$-th training example

Training set:

Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

$m = 47$

Examples:

$x^{(1)} = 2104$, $x^{(2)} = 1416$, $y^{(1)} = 460$

Slide credit: Andrew Ng

Page 16:

Model representation

$y = h_\theta(x) = \theta_0 + \theta_1 x$

Shorthand: $h(x)$

[Diagram: Training set → Learning Algorithm → Hypothesis $h$; size of house $x$ → $h$ → estimated price $y$]

[Plot: Price ($) in 1000's vs. Size in feet², with a fitted line]

Univariate linear regression.

Slide credit: Andrew Ng

Page 17:

Linear Regression

• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation

Page 18:

Training set:

Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

$m = 47$

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

$\theta_0, \theta_1$: parameters/weights

How do we choose the $\theta_i$'s?

Slide credit: Andrew Ng

Page 19:

$h_\theta(x) = \theta_0 + \theta_1 x$

[Three plots of $y$ vs. $x$ illustrating different parameter choices:]

• $\theta_0 = 1.5$, $\theta_1 = 0$: horizontal line at height 1.5
• $\theta_0 = 0$, $\theta_1 = 0.5$: line through the origin with slope 0.5
• $\theta_0 = 1$, $\theta_1 = 0.5$: line with intercept 1 and slope 0.5

Slide credit: Andrew Ng

Page 20:

Cost function

• Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$

[Plot: Price ($) in 1000's vs. Size in feet², with a candidate line through the data]

$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

Cost function:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Goal:

$\min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, i.e., $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Slide credit: Andrew Ng
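The cost above translates directly to NumPy; a minimal sketch, using a few rows from the table earlier as toy data:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum_i (h(x^(i)) - y^(i))^2."""
    m = len(x)
    h = theta0 + theta1 * x              # predictions h_theta(x^(i))
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # sizes (feet^2)
y = np.array([460.0, 232.0, 315.0, 178.0])     # prices ($1000s)
print(cost(0.0, 0.2, x, y))                    # cost of the line h(x) = 0.2 x
```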

Page 21:

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

• Parameters: $\theta_0, \theta_1$

• Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Simplified (fix $\theta_0 = 0$):

• Hypothesis: $h_\theta(x) = \theta_1 x$

• Parameter: $\theta_1$

• Cost function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Goal: $\min_{\theta_1} J(\theta_1)$

Slide credit: Andrew Ng

Page 22:

[Left plot: $h_\theta(x)$ as a function of $x$, through the training points. Right plot: $J(\theta_1)$ as a function of $\theta_1$.]

Slide credit: Andrew Ng

Pages 23–26:

[The same pair of plots repeated for different values of $\theta_1$: each choice of $\theta_1$ gives one line $h_\theta(x)$ on the left and one point on the cost curve $J(\theta_1)$ on the right, tracing the curve out step by step.]

Slide credit: Andrew Ng

Page 27:

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

• Parameters: $\theta_0, \theta_1$

• Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Slide credit: Andrew Ng

Page 28:

Cost function

Slide credit: Andrew Ng

Page 29:

How do we find good $\theta_0, \theta_1$ that minimize $J(\theta_0, \theta_1)$?

Slide credit: Andrew Ng

Page 30:

Linear Regression

• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation

Page 31:

Gradient descent

Have some function $J(\theta_0, \theta_1)$
Want $\operatorname{argmin}_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Outline:

• Start with some $\theta_0, \theta_1$
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum

Slide credit: Andrew Ng

Page 32:

Slide credit: Andrew Ng

Page 33:

Gradient descent

Repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$  (for $j = 0$ and $j = 1$)

}

$\alpha$: learning rate (step size)

$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: partial derivative (rate of change)

Slide credit: Andrew Ng

Page 34:

Gradient descent

Incorrect:

temp0 $:= \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$\theta_0 :=$ temp0
temp1 $:= \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$  (evaluated at the already-updated $\theta_0$)
$\theta_1 :=$ temp1

Correct: simultaneous update

temp0 $:= \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
temp1 $:= \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0 :=$ temp0
$\theta_1 :=$ temp1

Slide credit: Andrew Ng
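The same contrast in code; a sketch using a toy quadratic cost, with `dJ_dt0`/`dJ_dt1` as stand-ins for the partial derivatives of $J$:

```python
# Toy cost J(t0, t1) = t0^2 + t0*t1 + t1^2 and its partial derivatives.
def dJ_dt0(t0, t1): return 2 * t0 + t1
def dJ_dt1(t0, t1): return t0 + 2 * t1

alpha, theta0, theta1 = 0.1, 1.0, 2.0

# Correct: evaluate BOTH partials at the old (theta0, theta1), then assign.
temp0 = theta0 - alpha * dJ_dt0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dt1(theta0, theta1)
theta0, theta1 = temp0, temp1

# Incorrect version: assigning theta0 before computing dJ_dt1 would make the
# second partial see the NEW theta0, so the two coordinates would no longer
# be updated from the same point.
print(theta0, theta1)
```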

Page 35:

$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1)$

[Plot: $J(\theta_1)$ vs. $\theta_1$. To the right of the minimum, $\frac{\partial}{\partial \theta_1} J(\theta_1) > 0$, so the update decreases $\theta_1$; to the left, $\frac{\partial}{\partial \theta_1} J(\theta_1) < 0$, so the update increases $\theta_1$. Either way, gradient descent moves toward the minimum.]

Slide credit: Andrew Ng

Page 36:

Learning rate

Page 37:

Gradient descent for linear regression

Repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$  (for $j = 0$ and $j = 1$)

}

• Linear regression model:

$h_\theta(x) = \theta_0 + \theta_1 x$

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Slide credit: Andrew Ng

Page 38:

Computing partial derivative

• $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

• $j = 0$:  $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

• $j = 1$:  $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

Slide credit: Andrew Ng

Page 39:

Gradient descent for linear regression

Repeat until convergence {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

}

Update $\theta_0$ and $\theta_1$ simultaneously.

Slide credit: Andrew Ng
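A minimal sketch of these updates end to end; the toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        err = theta0 + theta1 * x - y       # h_theta(x^(i)) - y^(i)
        grad0 = err.mean()                  # (1/m) * sum of err
        grad1 = (err * x).mean()            # (1/m) * sum of err * x^(i)
        # Simultaneous update: both gradients computed before assignment.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))   # approaches (0, 1): the line y = x
```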

Page 40:

Batch gradient descent

• “Batch”: Each step of gradient descent uses all the training examples

Repeat until convergence {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

}

$m$: number of training examples

Slide credit: Andrew Ng

Page 41:

Linear Regression

• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation

Page 42:

Training dataset

Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

$h_\theta(x) = \theta_0 + \theta_1 x$

Slide credit: Andrew Ng

Page 43:

Multiple features (input variables)

Size in feet² ($x_1$) | Number of bedrooms ($x_2$) | Number of floors ($x_3$) | Age of home in years ($x_4$) | Price ($) in 1000's (y)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …

Notation:

• $n$ = number of features
• $x^{(i)}$ = input features of the $i$-th training example
• $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example

$x_3^{(2)} = ?$   $x_3^{(4)} = ?$

Slide credit: Andrew Ng

Page 44:

Hypothesis

Previously: $h_\theta(x) = \theta_0 + \theta_1 x$

Now: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$

Slide credit: Andrew Ng

Page 45:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

• For convenience of notation, define $x_0 = 1$ (so $x_0^{(i)} = 1$ for all examples)

• $\boldsymbol{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}$,  $\boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$

• $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \boldsymbol{\theta}^\top \boldsymbol{x}$

Slide credit: Andrew Ng
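In code, the $x_0 = 1$ trick reduces the hypothesis to one dot product; a sketch with made-up parameter values:

```python
import numpy as np

# Prepend x0 = 1 so the intercept theta0 folds into the dot product.
x_raw = np.array([2104.0, 5.0, 1.0, 45.0])      # size, bedrooms, floors, age
x = np.concatenate(([1.0], x_raw))              # x in R^{n+1}, with x0 = 1
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])  # theta in R^{n+1} (made-up values)
h = theta @ x                                   # h_theta(x) = theta^T x
print(h)
```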

Page 46:

Gradient descent

• Previously ($n = 1$):

Repeat until convergence {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

}

• New algorithm ($n \geq 1$):

Repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

}

Simultaneously update $\theta_j$ for $j = 0, 1, \cdots, n$.

Slide credit: Andrew Ng

Page 47:

Gradient descent in practice: Feature scaling

• Idea: make sure features are on a similar scale (e.g., $-1 \leq x_i \leq 1$)

• E.g., $x_1$ = size (0–2000 feet²), $x_2$ = number of bedrooms (1–5)

[Contour plots of the cost over $(\theta_1, \theta_2)$: without scaling the contours are elongated ellipses and gradient descent zig-zags; after scaling they are closer to circles and descent converges faster.]

Slide credit: Andrew Ng
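One common way to realize this is mean normalization (standardization), sketched below; the slide's $-1 \leq x_i \leq 1$ range suggests a min-max variant would also fit:

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each column: (x - mean) / std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu/sigma to scale test inputs later

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
X_scaled, mu, sigma = scale_features(X)  # columns now on comparable scales
print(X_scaled)
```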

Page 48:

Gradient descent in practice: Learning rate

• Automatic convergence test: e.g., declare convergence when $J$ decreases by less than a small threshold in one iteration

• $\alpha$ too small: slow convergence

• $\alpha$ too large: may not converge

• To choose $\alpha$, try 0.001, …, 0.01, …, 0.1, …, 1

Image credit: CS231n@Stanford
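A sketch of this sweep on toy one-feature data (all numbers illustrative), reusing the batch updates from the earlier slides; the final cost makes divergence at large $\alpha$ obvious:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0   # toy data on the line y = 2x + 1

def final_cost(alpha, iters=50):
    """Run batch gradient descent and return the final cost J."""
    t0 = t1 = 0.0
    for _ in range(iters):
        err = t0 + t1 * x - y
        t0, t1 = t0 - alpha * err.mean(), t1 - alpha * (err * x).mean()
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * len(x))

for alpha in [0.001, 0.01, 0.1, 1.0]:
    print(alpha, final_cost(alpha))   # J blows up when alpha is too large
```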

Page 49:

House prices prediction

• $h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}$

• Area: $x = \text{frontage} \times \text{depth}$

• $h_\theta(x) = \theta_0 + \theta_1 x$

Slide credit: Andrew Ng

Page 50:

Polynomial regression

• $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 = \theta_0 + \theta_1 (\text{size}) + \theta_2 (\text{size})^2 + \theta_3 (\text{size})^3$

$x_1 = (\text{size})$, $x_2 = (\text{size})^2$, $x_3 = (\text{size})^3$

[Plot: Price ($) in 1000's vs. Size in feet², with a cubic curve fit to the data]

Slide credit: Andrew Ng
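A sketch of generating these polynomial features (toy sizes assumed); the scaling point from the previous slides matters even more here, since powers of size differ by many orders of magnitude:

```python
import numpy as np

size = np.array([852.0, 1416.0, 1534.0, 2104.0])
X = np.column_stack([size, size**2, size**3])   # x1, x2, x3 = size, size^2, size^3

# Scale each column so gradient descent behaves on the polynomial features.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```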

Page 51:

Linear Regression

• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation

Page 52:

($x_0$) | Size in feet² ($x_1$) | Number of bedrooms ($x_2$) | Number of floors ($x_3$) | Age of home in years ($x_4$) | Price ($) in 1000's (y)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178
… | … | … | … | … | …

$y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}$

$\theta = (X^\top X)^{-1} X^\top y$

Slide credit: Andrew Ng

Page 53:

Least squares solution

• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \| X\theta - y \|_2^2$

• Set $\frac{\partial}{\partial \theta} J(\theta) = 0$

• $\theta = (X^\top X)^{-1} X^\top y$
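A NumPy sketch on a small univariate design matrix (toy numbers); `np.linalg.solve` on the normal equations is used rather than forming the inverse explicitly, which is better numerically:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Design matrix with the x0 = 1 column and one feature (size).
X = np.array([[1.0, 2104], [1, 1416], [1, 1534], [1, 852]])
y = np.array([460.0, 232.0, 315.0, 178.0])
theta = normal_equation(X, y)   # intercept and slope of the fitted line
print(theta)
```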

Page 54:

Justification/interpretation 1

• Loss minimization:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} L_{ls}\!\left( h_\theta(x^{(i)}), y^{(i)} \right)$

• $L_{ls}(y, \hat{y}) = \frac{1}{2} \| y - \hat{y} \|_2^2$: least squares loss

• Empirical Risk Minimization (ERM):

$\frac{1}{m} \sum_{i=1}^{m} L_{ls}\!\left( y^{(i)}, \hat{y}^{(i)} \right)$

Page 55:

Justification/interpretation 2

• Probabilistic model
  • Assume a linear model with Gaussian errors:

$p_\theta\!\left( y^{(i)} \mid x^{(i)} \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$

• Solving for maximum likelihood:

$\operatorname{argmax}_\theta \prod_{i=1}^{m} p_\theta\!\left( y^{(i)} \mid x^{(i)} \right) = \operatorname{argmax}_\theta \log \prod_{i=1}^{m} p_\theta\!\left( y^{(i)} \mid x^{(i)} \right) = \operatorname{argmin}_\theta \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$

Image credit: CS 446@UIUC

Page 56:

Justification/interpretation 3

• Geometric interpretation

$X = \begin{bmatrix} 1 & \leftarrow \boldsymbol{x}^{(1)} \rightarrow \\ 1 & \leftarrow \boldsymbol{x}^{(2)} \rightarrow \\ \vdots & \vdots \\ 1 & \leftarrow \boldsymbol{x}^{(m)} \rightarrow \end{bmatrix} = \begin{bmatrix} \uparrow & \uparrow & \uparrow & & \uparrow \\ \boldsymbol{z}_0 & \boldsymbol{z}_1 & \boldsymbol{z}_2 & \cdots & \boldsymbol{z}_n \\ \downarrow & \downarrow & \downarrow & & \downarrow \end{bmatrix}$

• $X\boldsymbol{\theta}$ lies in the column space of $X$, i.e., $\operatorname{span}(\{\boldsymbol{z}_0, \boldsymbol{z}_1, \cdots, \boldsymbol{z}_n\})$

• The residual $X\boldsymbol{\theta} - \boldsymbol{y}$ is orthogonal to the column space of $X$

• $X^\top (X\boldsymbol{\theta} - \boldsymbol{y}) = 0 \;\Rightarrow\; (X^\top X)\boldsymbol{\theta} = X^\top \boldsymbol{y}$

[Figure: $\boldsymbol{y}$ projected onto the column space of $X$; the projection is $X\boldsymbol{\theta}$, with the residual $X\boldsymbol{\theta} - \boldsymbol{y}$ perpendicular to that subspace.]
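A quick numerical check of this orthogonality on toy data (illustrative numbers): at the least-squares solution, the residual is orthogonal to every column of $X$.

```python
import numpy as np

X = np.array([[1.0, 2104], [1, 1416], [1, 1534], [1, 852]])
y = np.array([460.0, 232.0, 315.0, 178.0])
theta = np.linalg.solve(X.T @ X, X.T @ y)   # normal-equation solution
residual = X @ theta - y
print(X.T @ residual)   # ~ [0, 0] up to floating-point error
```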

Page 57:

$m$ training examples, $n$ features

Gradient descent:
• Need to choose $\alpha$
• Needs many iterations
• Works well even when $n$ is large

Normal equation:
• No need to choose $\alpha$
• No need to iterate
• Must compute $(X^\top X)^{-1}$
• Slow if $n$ is very large

Slide credit: Andrew Ng

Page 58:

Things to remember

• Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Gradient descent for linear regression:
Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }

• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)

• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$

Page 59:

Next

• Naïve Bayes, Logistic regression, Regularization