Linear Regression
Jia-Bin Huang
ECE-5424G / CS-5824, Virginia Tech, Spring 2019
Administrative
• Office hours: Chen Gao, Shih-Yang Su
• Feedback (thanks!)
  • Notation?
  • More descriptive slides?
  • Video/audio recording?
  • TA hours (uniformly spread over the week)?
Recap: Machine learning algorithms
Supervised Learning vs. Unsupervised Learning:
• Discrete output: Classification (supervised) / Clustering (unsupervised)
• Continuous output: Regression (supervised) / Dimensionality reduction (unsupervised)
Recap: Nearest neighbor classifier
• Training data
(𝑥⁽¹⁾, 𝑦⁽¹⁾), (𝑥⁽²⁾, 𝑦⁽²⁾), ⋯, (𝑥⁽ᴺ⁾, 𝑦⁽ᴺ⁾)
• Learning
Do nothing.
• Testing
ℎ(𝑥) = 𝑦⁽ᵏ⁾, where 𝑘 = argminᵢ 𝐷(𝑥, 𝑥⁽ⁱ⁾)
Recap: Instance/Memory-based Learning
1. A distance metric: continuous? discrete? PDF? gene data? learn the metric?
2. How many nearby neighbors to look at? 1? 3? 5? 15?
3. A weighting function (optional): closer neighbors matter more
4. How to fit with the local points? Kernel regression (sketched in code below)
Slide credit: Carlos Guestrin
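To make the four choices above concrete, here is a minimal NumPy sketch of a nearest-neighbor predictor (Euclidean distance, k neighbors, optional inverse-distance weighting); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict a value for query x from its k nearest neighbors (regression)."""
    # 1. Distance metric: plain Euclidean here; swap in whatever suits the data.
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. How many neighbors: take the k closest.
    nearest = np.argsort(dists)[:k]
    if not weighted:
        return y_train[nearest].mean()          # unweighted average
    # 3. Weighting function: closer neighbors matter more.
    w = 1.0 / (dists[nearest] + 1e-8)
    # 4. Fit with the local points: weighted average, kernel-regression style.
    return np.dot(w, y_train[nearest]) / w.sum()
```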
Validation set
• Splitting the training set: hold out a "fake" test set to tune hyperparameters
Slide credit: CS231 @ Stanford
Cross-validation
• 5-fold cross-validation → split the training data into 5 equal folds
• Use 4 of them for training and 1 for validation; rotate the validation fold and average the scores (see the sketch below)
Slide credit: CS231 @ Stanford
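A minimal sketch of the 5-fold procedure, assuming NumPy arrays and a caller-supplied `train_and_score(X_tr, y_tr, X_val, y_val)` helper (a hypothetical name, not from the slides):

```python
import numpy as np

def cross_validate(X, y, train_and_score, n_folds=5, seed=0):
    """Average validation score over n_folds folds of the training data."""
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)              # 5 (near-)equal folds
    scores = []
    for i in range(n_folds):
        val = folds[i]                                # 1 fold for validation
        tr = np.concatenate(folds[:i] + folds[i+1:])  # 4 folds for training
        scores.append(train_and_score(X[tr], y[tr], X[val], y[val]))
    return np.mean(scores)                            # averaged over rotations
```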
Things to remember
• Supervised learning: training/testing data; classification/regression; hypothesis
• k-NN: the simplest learning algorithm; with sufficient data, this "strawman" approach is very hard to beat
• Kernel regression/classification: set k to n (the number of data points) and choose a kernel width; smoother than k-NN
• Problems with k-NN: curse of dimensionality; not robust to irrelevant features; slow NN search: must remember the (very large) dataset for prediction
Today’s plan: Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Training set → Learning Algorithm → Hypothesis ℎ
Size of house → ℎ → Estimated price
Regression: real-valued output
House pricing prediction
[Figure: training data scatter plot, Price ($) in 1000's (100-400) vs. Size in feet² (500-2500)]
• Notation:
  • 𝑚 = number of training examples
  • 𝑥 = input variable / features
  • 𝑦 = output variable / target variable
  • (𝑥, 𝑦) = one training example
  • (𝑥⁽ⁱ⁾, 𝑦⁽ⁱ⁾) = 𝑖-th training example
Training set:
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
𝑚 = 47
Examples: 𝑥⁽¹⁾ = 2104, 𝑥⁽²⁾ = 1416, 𝑦⁽¹⁾ = 460
Slide credit: Andrew Ng
Model representation
𝑦 = ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥 (shorthand: ℎ(𝑥))
Training set → Learning Algorithm → Hypothesis ℎ
Size of house → ℎ → Estimated price
[Figure: the training data with a fitted line, Price ($) in 1000's vs. Size in feet²]
Univariate linear regression
Slide credit: Andrew Ng
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Training set
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
• 𝜃₀, 𝜃₁: parameters/weights
• How to choose the 𝜃ᵢ's?
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
𝑚 = 47
Slide credit: Andrew Ng
ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
[Figure: three example lines for different parameter settings]
• 𝜃₀ = 1.5, 𝜃₁ = 0 (horizontal line at 1.5)
• 𝜃₀ = 0, 𝜃₁ = 0.5 (line through the origin, slope 0.5)
• 𝜃₀ = 1, 𝜃₁ = 0.5 (intercept 1, slope 0.5)
Slide credit: Andrew Ng
Cost function
• Idea: choose 𝜃₀, 𝜃₁ so that ℎ𝜃(𝑥) is close to 𝑦 for our training examples (𝑥, 𝑦)
[Figure: training data with a candidate line; the vertical distances ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾ are the errors being squared]
ℎ𝜃(𝑥⁽ⁱ⁾) = 𝜃₀ + 𝜃₁𝑥⁽ⁱ⁾
𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
minimize over 𝜃₀, 𝜃₁: (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)², i.e., minimize 𝐽(𝜃₀, 𝜃₁)
Cost function
Slide credit: Andrew Ng
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
• Parameters: 𝜃₀, 𝜃₁
• Cost function: 𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Goal: minimize 𝐽(𝜃₀, 𝜃₁) over 𝜃₀, 𝜃₁
Simplified (fix 𝜃₀ = 0):
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₁𝑥
• Parameter: 𝜃₁
• Cost function: 𝐽(𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Goal: minimize 𝐽(𝜃₁) over 𝜃₁
Slide credit: Andrew Ng
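As a quick sanity check, the cost 𝐽 translates directly to NumPy (a sketch; the example reuses the housing rows shown earlier, and the choice 𝜃₁ = 0.2 is arbitrary):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(x)
    h = theta0 + theta1 * x            # hypothesis evaluated on all m examples
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(cost(0.0, 0.2, x, y))            # J for theta0 = 0, theta1 = 0.2
```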
[Figure sequence over five slides: left, ℎ𝜃(𝑥) as a function of 𝑥 for fixed values of 𝜃₁; right, 𝐽(𝜃₁) as a function of 𝜃₁. Each choice of 𝜃₁ gives one line on the left and one point on the right; sweeping 𝜃₁ traces out a bowl-shaped 𝐽(𝜃₁) whose minimum is the best fit.]
Slide credit: Andrew Ng
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
• Parameters: 𝜃₀, 𝜃₁
• Cost function: 𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Goal: minimize 𝐽(𝜃₀, 𝜃₁) over 𝜃₀, 𝜃₁
Slide credit: Andrew Ng
Cost function
[Figure: surface and contour plots of 𝐽(𝜃₀, 𝜃₁), a bowl in two parameters]
Slide credit: Andrew Ng
How do we find good 𝜃₀, 𝜃₁ that minimize 𝐽(𝜃₀, 𝜃₁)?
Slide credit: Andrew Ng
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Gradient descent
Have some function 𝐽(𝜃₀, 𝜃₁); want argmin over 𝜃₀, 𝜃₁ of 𝐽(𝜃₀, 𝜃₁)
Outline:
• Start with some 𝜃₀, 𝜃₁
• Keep changing 𝜃₀, 𝜃₁ to reduce 𝐽(𝜃₀, 𝜃₁), until we hopefully end up at a minimum
Slide credit: Andrew Ng
[Figure: the surface of 𝐽(𝜃₀, 𝜃₁); gradient descent from different starting points can end up at different local minima]
Slide credit: Andrew Ng
Gradient descent
Repeat until convergence {
  𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁)   (for 𝑗 = 0 and 𝑗 = 1)
}
𝛼: learning rate (step size)
(∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁): partial derivative (rate of change)
Slide credit: Andrew Ng
Gradient descent
Incorrect: sequential update
  temp0 ≔ 𝜃₀ − 𝛼 (∂/∂𝜃₀) 𝐽(𝜃₀, 𝜃₁)
  𝜃₀ ≔ temp0
  temp1 ≔ 𝜃₁ − 𝛼 (∂/∂𝜃₁) 𝐽(𝜃₀, 𝜃₁)   (evaluated with the already-updated 𝜃₀)
  𝜃₁ ≔ temp1
Correct: simultaneous update
  temp0 ≔ 𝜃₀ − 𝛼 (∂/∂𝜃₀) 𝐽(𝜃₀, 𝜃₁)
  temp1 ≔ 𝜃₁ − 𝛼 (∂/∂𝜃₁) 𝐽(𝜃₀, 𝜃₁)
  𝜃₀ ≔ temp0
  𝜃₁ ≔ temp1
Slide credit: Andrew Ng
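In code, the simultaneous update just means evaluating both partial derivatives at the old parameters before overwriting either one. A sketch, using the linear-regression gradients derived a few slides below (the toy data and 𝛼 are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
theta0, theta1, alpha = 0.0, 0.0, 0.1

def dJ_dtheta0(t0, t1):
    # (1/m) sum_i (h(x_i) - y_i): the j = 0 partial derivative
    return np.mean(t0 + t1 * x - y)

def dJ_dtheta1(t0, t1):
    # (1/m) sum_i (h(x_i) - y_i) * x_i: the j = 1 partial derivative
    return np.mean((t0 + t1 * x - y) * x)

# Correct: evaluate BOTH derivatives at the old point, then assign together.
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
theta0, theta1 = temp0, temp1

# The incorrect version overwrites theta0 first, so dJ_dtheta1 is evaluated
# at a mixed point that is neither the old nor the new parameter vector.
```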
𝜃₁ ≔ 𝜃₁ − 𝛼 (∂/∂𝜃₁) 𝐽(𝜃₁)
[Figure: the bowl-shaped curve 𝐽(𝜃₁). To the right of the minimum, (∂/∂𝜃₁) 𝐽(𝜃₁) > 0 and the update moves 𝜃₁ left; to the left, (∂/∂𝜃₁) 𝐽(𝜃₁) < 0 and the update moves 𝜃₁ right. Either way, 𝜃₁ moves toward the minimum.]
Slide credit: Andrew Ng
Learning rate
Gradient descent for linear regression
Repeat until convergence {
  𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁)   (for 𝑗 = 0 and 𝑗 = 1)
}
• Linear regression model: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥, with cost
𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
Slide credit: Andrew Ng
Computing partial derivative
• (∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁) = (∂/∂𝜃ⱼ) (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)² = (∂/∂𝜃ⱼ) (1/2𝑚) Σᵢ₌₁ᵐ (𝜃₀ + 𝜃₁𝑥⁽ⁱ⁾ − 𝑦⁽ⁱ⁾)²
• 𝑗 = 0: (∂/∂𝜃₀) 𝐽(𝜃₀, 𝜃₁) = (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
• 𝑗 = 1: (∂/∂𝜃₁) 𝐽(𝜃₀, 𝜃₁) = (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
Slide credit: Andrew Ng
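Writing out the chain-rule step that the slide compresses (the factor of 2 from differentiating the square cancels the 1/2; for 𝑗 = 0 the inner derivative is 1 instead of 𝑥⁽ⁱ⁾):

```latex
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  = \frac{1}{2m} \sum_{i=1}^{m} 2\,\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)
    \cdot \frac{\partial}{\partial \theta_1}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
```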
Gradient descent for linear regression
Repeat until convergence {
  𝜃₀ ≔ 𝜃₀ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
  𝜃₁ ≔ 𝜃₁ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
}
Update 𝜃₀ and 𝜃₁ simultaneously
Slide credit: Andrew Ng
Batch gradient descent
• “Batch”: Each step of gradient descent uses all the training examples
Repeat until convergence {
  𝜃₀ ≔ 𝜃₀ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
  𝜃₁ ≔ 𝜃₁ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
}
𝑚: number of training examples
Slide credit: Andrew Ng
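Putting the two update rules into a runnable loop (a sketch: the 1/1000 rescaling of x, the learning rate, and the fixed iteration count are my choices, not from the slides):

```python
import numpy as np

x = np.array([2104, 1416, 1534, 852]) / 1000.0   # size, rescaled so alpha is easy to pick
y = np.array([460, 232, 315, 178], dtype=float)  # price in $1000s

theta0, theta1, alpha = 0.0, 0.0, 0.1
m = len(x)
for _ in range(5000):                  # "repeat until convergence" (fixed count here)
    h = theta0 + theta1 * x            # uses ALL m examples: batch gradient descent
    g0 = np.sum(h - y) / m
    g1 = np.sum((h - y) * x) / m
    theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1   # simultaneous update

print(theta0, theta1)                  # fitted intercept and slope (per 1000 ft^2)
```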
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Training dataset
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
Slide credit: Andrew Ng
Multiple features (input variables):
Size in feet² (𝑥₁) | Number of bedrooms (𝑥₂) | Number of floors (𝑥₃) | Age of home in years (𝑥₄) | Price ($) in 1000's (𝑦)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …
Notation:
• 𝑛 = number of features
• 𝑥⁽ⁱ⁾ = input features of the 𝑖-th training example
• 𝑥ⱼ⁽ⁱ⁾ = value of feature 𝑗 in the 𝑖-th training example
• 𝑥₃⁽²⁾ = ? (from the table: 2, the number of floors of the 2nd example)
• 𝑥₃⁽⁴⁾ = ? (from the table: 1)
Slide credit: Andrew Ng
Hypothesis
Previously: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
Now: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + 𝜃₃𝑥₃ + 𝜃₄𝑥₄
Slide credit: Andrew Ng
ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + ⋯ + 𝜃ₙ𝑥ₙ
• For convenience of notation, define 𝑥₀ = 1 (𝑥₀⁽ⁱ⁾ = 1 for all examples)
• 𝒙 = [𝑥₀, 𝑥₁, 𝑥₂, ⋯, 𝑥ₙ]⊤ ∈ ℝⁿ⁺¹, 𝜽 = [𝜃₀, 𝜃₁, 𝜃₂, ⋯, 𝜃ₙ]⊤ ∈ ℝⁿ⁺¹
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + ⋯ + 𝜃ₙ𝑥ₙ = 𝜽⊤𝒙
Slide credit: Andrew Ng
Gradient descent
• Previously (𝑛 = 1):
Repeat until convergence {
  𝜃₀ ≔ 𝜃₀ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
  𝜃₁ ≔ 𝜃₁ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
}
• New algorithm (𝑛 ≥ 1):
Repeat until convergence {
  𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥ⱼ⁽ⁱ⁾
}
Simultaneously update 𝜃ⱼ for 𝑗 = 0, 1, ⋯, 𝑛 (vectorized sketch below)
Slide credit: Andrew Ng
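With the 𝑥₀ = 1 trick, the whole update vectorizes; a sketch assuming X is the (m, n+1) design matrix with a leading column of ones (names and defaults are mine):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """X: (m, n+1) design matrix with a leading 1s column; y: (m,) targets."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) sum_i (h(x_i)-y_i) x_i, all j at once
        theta -= alpha * grad              # simultaneous update of every theta_j
    return theta
```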
Gradient descent in practice: Feature scaling
• Idea: make sure features are on a similar scale (e.g., −1 ≤ 𝑥ᵢ ≤ 1)
• E.g., 𝑥₁ = size (0-2000 feet²), 𝑥₂ = number of bedrooms (1-5)
[Figure: contours of 𝐽(𝜃₁, 𝜃₂): without scaling they are elongated ellipses and gradient descent zig-zags; with scaling they are near-circular and descent heads straight for the minimum]
Slide credit: Andrew Ng
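One standard way to get features onto a similar scale is standardization (zero mean, unit variance per feature); the slide's normalization to roughly [−1, 1] is a close variant. A sketch:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma to scale test data identically

# e.g., size (hundreds to thousands of ft^2) and bedrooms (1-5)
# end up on comparable scales:
X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
X_scaled, mu, sigma = standardize(X)
```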
Gradient descent in practice: Learning rate
• Automatic convergence test
• 𝛼 too small: slow convergence
• 𝛼 too large: may not converge
• To choose 𝛼, try 0.001, …, 0.01, …, 0.1, …, 1
Image credit: CS231n@Stanford
House price prediction
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁ × frontage + 𝜃₂ × depth
• Or define a single area feature: 𝑥 = frontage × depth
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
Slide credit: Andrew Ng
Polynomial regression
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + 𝜃₃𝑥₃ = 𝜃₀ + 𝜃₁(size) + 𝜃₂(size)² + 𝜃₃(size)³
𝑥₁ = (size), 𝑥₂ = (size)², 𝑥₃ = (size)³
[Figure: Price ($) in 1000's vs. Size in feet², with a cubic curve fit through the data]
Slide credit: Andrew Ng
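Polynomial regression is then ordinary linear regression on hand-built features; a sketch (degree 3 to match the slide, with the earlier scaling advice applied first):

```python
import numpy as np

size = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

size = size / size.max()                      # scale first: size^3 would otherwise dwarf size
X = np.column_stack([np.ones_like(size),      # x0 = 1
                     size,                    # x1 = size
                     size ** 2,               # x2 = size^2
                     size ** 3])              # x3 = size^3
theta, *_ = np.linalg.lstsq(X, y, rcond=None) # still linear in theta, so a linear fit works
```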
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
(𝑥₀) | Size in feet² (𝑥₁) | Number of bedrooms (𝑥₂) | Number of floors (𝑥₃) | Age of home in years (𝑥₄) | Price ($) in 1000's (𝑦)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178
… | … | … | … | … | …
Stack the feature rows (with the leading 1s) into 𝑋 and the targets into 𝑦 = [460, 232, 315, 178]⊤; then
𝜃 = (𝑋⊤𝑋)⁻¹𝑋⊤𝑦
Slide credit: Andrew Ng
Least squares solution
• 𝐽(𝜃) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
       = (1/2𝑚) Σᵢ₌₁ᵐ (𝜃⊤𝑥⁽ⁱ⁾ − 𝑦⁽ⁱ⁾)²
       = (1/2𝑚) ‖𝑋𝜃 − 𝑦‖₂²
• Setting (∂/∂𝜃) 𝐽(𝜃) = 0 gives
• 𝜃 = (𝑋⊤𝑋)⁻¹𝑋⊤𝑦
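In code it is better to solve the linear system (𝑋⊤𝑋)𝜃 = 𝑋⊤𝑦 than to form the inverse explicitly; a sketch on a subset of the housing columns (𝑋⊤𝑋 must be invertible, which requires at least 𝑛 + 1 linearly independent rows):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via a linear solve (no explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# 1s column, size, bedrooms (only 3 parameters, since we have just 4 examples)
X = np.array([[1, 2104, 5],
              [1, 1416, 3],
              [1, 1534, 3],
              [1,  852, 2]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
theta = normal_equation(X, y)
```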
Justification/interpretation 1
• Loss minimization:
𝐽(𝜃) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)² = (1/𝑚) Σᵢ₌₁ᵐ 𝐿𝑙𝑠(ℎ𝜃(𝑥⁽ⁱ⁾), 𝑦⁽ⁱ⁾)
• 𝐿𝑙𝑠(𝑦, ŷ) = ½ ‖𝑦 − ŷ‖₂²: the least squares loss
• Empirical Risk Minimization (ERM): minimize the average loss (1/𝑚) Σᵢ₌₁ᵐ 𝐿𝑙𝑠(𝑦⁽ⁱ⁾, ŷ⁽ⁱ⁾) over the training set
Justification/interpretation 2
• Probabilistic model: assume a linear model with Gaussian errors
𝑝𝜃(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾) = (1/√(2𝜋𝜎²)) exp(−(𝑦⁽ⁱ⁾ − 𝜃⊤𝑥⁽ⁱ⁾)² / (2𝜎²))
• Solving maximum likelihood:
argmax𝜃 Πᵢ₌₁ᵐ 𝑝𝜃(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾) = argmax𝜃 log(Πᵢ₌₁ᵐ 𝑝𝜃(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾)) = argmin𝜃 (1/2𝜎²) Σᵢ₌₁ᵐ (𝜃⊤𝑥⁽ⁱ⁾ − 𝑦⁽ⁱ⁾)²
• So maximum likelihood under Gaussian noise recovers least squares
Image credit: CS 446@UIUC
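Taking the log of the likelihood makes the equivalence explicit; 𝜎² only scales the objective, so it does not change the minimizer:

```latex
\log \prod_{i=1}^{m} p_\theta\bigl(y^{(i)} \mid x^{(i)}\bigr)
  = \sum_{i=1}^{m} \Bigl[ -\tfrac{1}{2}\log(2\pi\sigma^2)
      - \tfrac{1}{2\sigma^2}\bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^2 \Bigr]
\;\;\Longrightarrow\;\;
\arg\max_{\theta} \prod_{i=1}^{m} p_\theta
  = \arg\min_{\theta} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^2
```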
Justification/interpretation 3
• Geometric interpretation
• 𝑿 ∈ ℝ^(𝑚×(𝑛+1)): rows are the training examples 𝒙⁽¹⁾⊤, 𝒙⁽²⁾⊤, ⋯, 𝒙⁽ᵐ⁾⊤ (each with a leading 1); columns are 𝒛₀, 𝒛₁, ⋯, 𝒛ₙ
• 𝑿𝜽 lies in the column space of 𝑿, i.e., span({𝒛₀, 𝒛₁, ⋯, 𝒛ₙ})
• At the least-squares solution, the residual 𝑿𝜽 − 𝒚 is orthogonal to the column space of 𝑿
• 𝑿⊤(𝑿𝜽 − 𝒚) = 0 → (𝑿⊤𝑿)𝜽 = 𝑿⊤𝒚
[Figure: 𝒚 and its projection 𝑿𝜽 onto the column space of 𝑿; the residual 𝑿𝜽 − 𝒚 is perpendicular to that subspace]
𝑚 training examples, 𝑛 features
Gradient Descent
• Need to choose 𝛼
• Needs many iterations
• Works well even when 𝑛 is large
Normal Equation
• No need to choose 𝛼
• No need to iterate
• Need to compute (𝑋⊤𝑋)⁻¹
• Slow if 𝑛 is very large
Slide credit: Andrew Ng
Things to remember
• Model representation: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + ⋯ + 𝜃ₙ𝑥ₙ = 𝜃⊤𝑥
• Cost function: 𝐽(𝜃) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Gradient descent for linear regression: repeat until convergence { 𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥ⱼ⁽ⁱ⁾ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: 𝜃 = (𝑋⊤𝑋)⁻¹𝑋⊤𝑦
Next
• Naïve Bayes, Logistic regression, Regularization