Linear Regression
Jia-Bin Huang
ECE-5424G / CS-5824, Virginia Tech, Spring 2019
Administrative
• Office hours: Chen Gao, Shih-Yang Su
• Feedback (thanks!)
  • Notation?
  • More descriptive slides?
  • Video/audio recording?
  • TA hours (uniformly spread over the week)?
Recap: Machine learning algorithms
Supervised Learning vs. Unsupervised Learning:
• Discrete output: Classification (supervised) / Clustering (unsupervised)
• Continuous output: Regression (supervised) / Dimensionality reduction (unsupervised)
Recap: Nearest neighbor classifier
• Training data
(𝑥⁽¹⁾, 𝑦⁽¹⁾), (𝑥⁽²⁾, 𝑦⁽²⁾), ⋯, (𝑥⁽ᴺ⁾, 𝑦⁽ᴺ⁾)
• Learning
Do nothing.
• Testing
ℎ(𝑥) = 𝑦⁽ᵏ⁾, where 𝑘 = argminᵢ 𝐷(𝑥, 𝑥⁽ⁱ⁾)
Recap: Instance/Memory-based Learning
1. A distance metric: continuous? discrete? PDF? gene data? learn the metric?
2. How many nearby neighbors to look at? 1? 3? 5? 15?
3. A weighting function (optional): closer neighbors matter more
4. How to fit with the local points? Kernel regression (sketched in code below)
Slide credit: Carlos Guestrin
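To make the four choices above concrete, here is a minimal NumPy sketch of a nearest-neighbor predictor (Euclidean distance, k neighbors, optional inverse-distance weighting); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict a value for query x from its k nearest neighbors (regression)."""
    # 1. Distance metric: plain Euclidean here; swap in whatever suits the data.
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. How many neighbors: take the k closest.
    nearest = np.argsort(dists)[:k]
    if not weighted:
        return y_train[nearest].mean()          # unweighted average
    # 3. Weighting function: closer neighbors matter more.
    w = 1.0 / (dists[nearest] + 1e-8)
    # 4. Fit with the local points: weighted average, kernel-regression style.
    return np.dot(w, y_train[nearest]) / w.sum()
```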
Validation set
• Splitting the training set: hold out a "fake" test set to tune hyperparameters
Slide credit: CS231 @ Stanford
Cross-validation
• 5-fold cross-validation → split the training data into 5 equal folds
• Use 4 of them for training and 1 for validation; rotate the validation fold and average the scores (see the sketch below)
Slide credit: CS231 @ Stanford
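A minimal sketch of the 5-fold procedure, assuming NumPy arrays and a caller-supplied `train_and_score(X_tr, y_tr, X_val, y_val)` helper (a hypothetical name, not from the slides):

```python
import numpy as np

def cross_validate(X, y, train_and_score, n_folds=5, seed=0):
    """Average validation score over n_folds folds of the training data."""
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)              # 5 (near-)equal folds
    scores = []
    for i in range(n_folds):
        val = folds[i]                                # 1 fold for validation
        tr = np.concatenate(folds[:i] + folds[i+1:])  # 4 folds for training
        scores.append(train_and_score(X[tr], y[tr], X[val], y[val]))
    return np.mean(scores)                            # averaged over rotations
```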
Things to remember
• Supervised learning: training/testing data; classification/regression; hypothesis
• k-NN: the simplest learning algorithm; with sufficient data, this "strawman" approach is very hard to beat
• Kernel regression/classification: set k to n (the number of data points) and choose a kernel width; smoother than k-NN
• Problems with k-NN: curse of dimensionality; not robust to irrelevant features; slow NN search: must remember the (very large) dataset for prediction
Today’s plan: Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Training set → Learning Algorithm → Hypothesis ℎ
Size of house → ℎ → Estimated price
Regression: real-valued output
House pricing prediction
[Figure: training data scatter plot, Price ($) in 1000's (100-400) vs. Size in feet² (500-2500)]
• Notation:
  • 𝑚 = number of training examples
  • 𝑥 = input variable / features
  • 𝑦 = output variable / target variable
  • (𝑥, 𝑦) = one training example
  • (𝑥⁽ⁱ⁾, 𝑦⁽ⁱ⁾) = 𝑖-th training example
Training set:
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
𝑚 = 47
Examples: 𝑥⁽¹⁾ = 2104, 𝑥⁽²⁾ = 1416, 𝑦⁽¹⁾ = 460
Slide credit: Andrew Ng
Model representation
𝑦 = ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥 (shorthand: ℎ(𝑥))
Training set → Learning Algorithm → Hypothesis ℎ
Size of house → ℎ → Estimated price
[Figure: the training data with a fitted line, Price ($) in 1000's vs. Size in feet²]
Univariate linear regression
Slide credit: Andrew Ng
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Training set
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
• 𝜃₀, 𝜃₁: parameters/weights
• How to choose the 𝜃ᵢ's?
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
𝑚 = 47
Slide credit: Andrew Ng
ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
[Figure: three example lines for different parameter settings]
• 𝜃₀ = 1.5, 𝜃₁ = 0 (horizontal line at 1.5)
• 𝜃₀ = 0, 𝜃₁ = 0.5 (line through the origin, slope 0.5)
• 𝜃₀ = 1, 𝜃₁ = 0.5 (intercept 1, slope 0.5)
Slide credit: Andrew Ng
Cost function
• Idea: choose 𝜃₀, 𝜃₁ so that ℎ𝜃(𝑥) is close to 𝑦 for our training examples (𝑥, 𝑦)
[Figure: training data with a candidate line; the vertical distances ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾ are the errors being squared]
ℎ𝜃(𝑥⁽ⁱ⁾) = 𝜃₀ + 𝜃₁𝑥⁽ⁱ⁾
𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
minimize over 𝜃₀, 𝜃₁: (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)², i.e., minimize 𝐽(𝜃₀, 𝜃₁)
Cost function
Slide credit: Andrew Ng
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
• Parameters: 𝜃₀, 𝜃₁
• Cost function: 𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Goal: minimize 𝐽(𝜃₀, 𝜃₁) over 𝜃₀, 𝜃₁
Simplified (fix 𝜃₀ = 0):
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₁𝑥
• Parameter: 𝜃₁
• Cost function: 𝐽(𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Goal: minimize 𝐽(𝜃₁) over 𝜃₁
Slide credit: Andrew Ng
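As a quick sanity check, the cost 𝐽 translates directly to NumPy (a sketch; the example reuses the housing rows shown earlier, and the choice 𝜃₁ = 0.2 is arbitrary):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(x)
    h = theta0 + theta1 * x            # hypothesis evaluated on all m examples
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(cost(0.0, 0.2, x, y))            # J for theta0 = 0, theta1 = 0.2
```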
[Figure sequence over five slides: left, ℎ𝜃(𝑥) as a function of 𝑥 for fixed values of 𝜃₁; right, 𝐽(𝜃₁) as a function of 𝜃₁. Each choice of 𝜃₁ gives one line on the left and one point on the right; sweeping 𝜃₁ traces out a bowl-shaped 𝐽(𝜃₁) whose minimum is the best fit.]
Slide credit: Andrew Ng
• Hypothesis: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
• Parameters: 𝜃₀, 𝜃₁
• Cost function: 𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Goal: minimize 𝐽(𝜃₀, 𝜃₁) over 𝜃₀, 𝜃₁
Slide credit: Andrew Ng
Cost function
[Figure: surface and contour plots of 𝐽(𝜃₀, 𝜃₁), a bowl in two parameters]
Slide credit: Andrew Ng
How do we find good 𝜃₀, 𝜃₁ that minimize 𝐽(𝜃₀, 𝜃₁)?
Slide credit: Andrew Ng
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Gradient descent
Have some function 𝐽(𝜃₀, 𝜃₁); want argmin over 𝜃₀, 𝜃₁ of 𝐽(𝜃₀, 𝜃₁)
Outline:
• Start with some 𝜃₀, 𝜃₁
• Keep changing 𝜃₀, 𝜃₁ to reduce 𝐽(𝜃₀, 𝜃₁), until we hopefully end up at a minimum
Slide credit: Andrew Ng
[Figure: the surface of 𝐽(𝜃₀, 𝜃₁); gradient descent from different starting points can end up at different local minima]
Slide credit: Andrew Ng
Gradient descent
Repeat until convergence {
  𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁)   (for 𝑗 = 0 and 𝑗 = 1)
}
𝛼: learning rate (step size)
(∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁): partial derivative (rate of change)
Slide credit: Andrew Ng
Gradient descent
Incorrect: sequential update
  temp0 ≔ 𝜃₀ − 𝛼 (∂/∂𝜃₀) 𝐽(𝜃₀, 𝜃₁)
  𝜃₀ ≔ temp0
  temp1 ≔ 𝜃₁ − 𝛼 (∂/∂𝜃₁) 𝐽(𝜃₀, 𝜃₁)   (evaluated with the already-updated 𝜃₀)
  𝜃₁ ≔ temp1
Correct: simultaneous update
  temp0 ≔ 𝜃₀ − 𝛼 (∂/∂𝜃₀) 𝐽(𝜃₀, 𝜃₁)
  temp1 ≔ 𝜃₁ − 𝛼 (∂/∂𝜃₁) 𝐽(𝜃₀, 𝜃₁)
  𝜃₀ ≔ temp0
  𝜃₁ ≔ temp1
Slide credit: Andrew Ng
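In code, the simultaneous update just means evaluating both partial derivatives at the old parameters before overwriting either one. A sketch, using the linear-regression gradients derived a few slides below (the toy data and 𝛼 are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
theta0, theta1, alpha = 0.0, 0.0, 0.1

def dJ_dtheta0(t0, t1):
    # (1/m) sum_i (h(x_i) - y_i): the j = 0 partial derivative
    return np.mean(t0 + t1 * x - y)

def dJ_dtheta1(t0, t1):
    # (1/m) sum_i (h(x_i) - y_i) * x_i: the j = 1 partial derivative
    return np.mean((t0 + t1 * x - y) * x)

# Correct: evaluate BOTH derivatives at the old point, then assign together.
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
theta0, theta1 = temp0, temp1

# The incorrect version overwrites theta0 first, so dJ_dtheta1 is evaluated
# at a mixed point that is neither the old nor the new parameter vector.
```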
𝜃₁ ≔ 𝜃₁ − 𝛼 (∂/∂𝜃₁) 𝐽(𝜃₁)
[Figure: the bowl-shaped curve 𝐽(𝜃₁). To the right of the minimum, (∂/∂𝜃₁) 𝐽(𝜃₁) > 0 and the update moves 𝜃₁ left; to the left, (∂/∂𝜃₁) 𝐽(𝜃₁) < 0 and the update moves 𝜃₁ right. Either way, 𝜃₁ moves toward the minimum.]
Slide credit: Andrew Ng
Learning rate
Gradient descent for linear regression
Repeat until convergence {
  𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁)   (for 𝑗 = 0 and 𝑗 = 1)
}
• Linear regression model: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥, with cost
𝐽(𝜃₀, 𝜃₁) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
Slide credit: Andrew Ng
Computing partial derivative
• (∂/∂𝜃ⱼ) 𝐽(𝜃₀, 𝜃₁) = (∂/∂𝜃ⱼ) (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)² = (∂/∂𝜃ⱼ) (1/2𝑚) Σᵢ₌₁ᵐ (𝜃₀ + 𝜃₁𝑥⁽ⁱ⁾ − 𝑦⁽ⁱ⁾)²
• 𝑗 = 0: (∂/∂𝜃₀) 𝐽(𝜃₀, 𝜃₁) = (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
• 𝑗 = 1: (∂/∂𝜃₁) 𝐽(𝜃₀, 𝜃₁) = (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
Slide credit: Andrew Ng
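Writing out the chain-rule step that the slide compresses (the factor of 2 from differentiating the square cancels the 1/2; for 𝑗 = 0 the inner derivative is 1 instead of 𝑥⁽ⁱ⁾):

```latex
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  = \frac{1}{2m} \sum_{i=1}^{m} 2\,\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)
    \cdot \frac{\partial}{\partial \theta_1}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
```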
Gradient descent for linear regression
Repeat until convergence {
  𝜃₀ ≔ 𝜃₀ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
  𝜃₁ ≔ 𝜃₁ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
}
Update 𝜃₀ and 𝜃₁ simultaneously
Slide credit: Andrew Ng
Batch gradient descent
• “Batch”: Each step of gradient descent uses all the training examples
Repeat until convergence {
  𝜃₀ ≔ 𝜃₀ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
  𝜃₁ ≔ 𝜃₁ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
}
𝑚: number of training examples
Slide credit: Andrew Ng
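Putting the two update rules into a runnable loop (a sketch: the 1/1000 rescaling of x, the learning rate, and the fixed iteration count are my choices, not from the slides):

```python
import numpy as np

x = np.array([2104, 1416, 1534, 852]) / 1000.0   # size, rescaled so alpha is easy to pick
y = np.array([460, 232, 315, 178], dtype=float)  # price in $1000s

theta0, theta1, alpha = 0.0, 0.0, 0.1
m = len(x)
for _ in range(5000):                  # "repeat until convergence" (fixed count here)
    h = theta0 + theta1 * x            # uses ALL m examples: batch gradient descent
    g0 = np.sum(h - y) / m
    g1 = np.sum((h - y) * x) / m
    theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1   # simultaneous update

print(theta0, theta1)                  # fitted intercept and slope (per 1000 ft^2)
```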
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
Training dataset
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
Slide credit: Andrew Ng
Multiple features (input variables):
Size in feet² (𝑥₁) | Number of bedrooms (𝑥₂) | Number of floors (𝑥₃) | Age of home in years (𝑥₄) | Price ($) in 1000's (𝑦)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …
Notation:
• 𝑛 = number of features
• 𝑥⁽ⁱ⁾ = input features of the 𝑖-th training example
• 𝑥ⱼ⁽ⁱ⁾ = value of feature 𝑗 in the 𝑖-th training example
• 𝑥₃⁽²⁾ = ? (from the table: 2, the number of floors of the 2nd example)
• 𝑥₃⁽⁴⁾ = ? (from the table: 1)
Slide credit: Andrew Ng
Hypothesis
Previously: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
Now: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + 𝜃₃𝑥₃ + 𝜃₄𝑥₄
Slide credit: Andrew Ng
ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + ⋯ + 𝜃ₙ𝑥ₙ
• For convenience of notation, define 𝑥₀ = 1 (𝑥₀⁽ⁱ⁾ = 1 for all examples)
• 𝒙 = [𝑥₀, 𝑥₁, 𝑥₂, ⋯, 𝑥ₙ]⊤ ∈ ℝⁿ⁺¹, 𝜽 = [𝜃₀, 𝜃₁, 𝜃₂, ⋯, 𝜃ₙ]⊤ ∈ ℝⁿ⁺¹
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + ⋯ + 𝜃ₙ𝑥ₙ = 𝜽⊤𝒙
Slide credit: Andrew Ng
Gradient descent
• Previously (𝑛 = 1):
Repeat until convergence {
  𝜃₀ ≔ 𝜃₀ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)
  𝜃₁ ≔ 𝜃₁ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥⁽ⁱ⁾
}
• New algorithm (𝑛 ≥ 1):
Repeat until convergence {
  𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥ⱼ⁽ⁱ⁾
}
Simultaneously update 𝜃ⱼ for 𝑗 = 0, 1, ⋯, 𝑛 (vectorized sketch below)
Slide credit: Andrew Ng
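With the 𝑥₀ = 1 trick, the whole update vectorizes; a sketch assuming X is the (m, n+1) design matrix with a leading column of ones (names and defaults are mine):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """X: (m, n+1) design matrix with a leading 1s column; y: (m,) targets."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) sum_i (h(x_i)-y_i) x_i, all j at once
        theta -= alpha * grad              # simultaneous update of every theta_j
    return theta
```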
Gradient descent in practice: Feature scaling
• Idea: make sure features are on a similar scale (e.g., −1 ≤ 𝑥ᵢ ≤ 1)
• E.g., 𝑥₁ = size (0-2000 feet²), 𝑥₂ = number of bedrooms (1-5)
[Figure: contours of 𝐽(𝜃₁, 𝜃₂): without scaling they are elongated ellipses and gradient descent zig-zags; with scaling they are near-circular and descent heads straight for the minimum]
Slide credit: Andrew Ng
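One standard way to get features onto a similar scale is standardization (zero mean, unit variance per feature); the slide's normalization to roughly [−1, 1] is a close variant. A sketch:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma to scale test data identically

# e.g., size (hundreds to thousands of ft^2) and bedrooms (1-5)
# end up on comparable scales:
X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
X_scaled, mu, sigma = standardize(X)
```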
Gradient descent in practice: Learning rate
• Automatic convergence test
• 𝛼 too small: slow convergence
• 𝛼 too large: may not converge
• To choose 𝛼, try 0.001, …, 0.01, …, 0.1, …, 1
Image credit: CS231n@Stanford
House price prediction
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁ × frontage + 𝜃₂ × depth
• Or define a single area feature: 𝑥 = frontage × depth
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥
Slide credit: Andrew Ng
Polynomial regression
• ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + 𝜃₃𝑥₃ = 𝜃₀ + 𝜃₁(size) + 𝜃₂(size)² + 𝜃₃(size)³
𝑥₁ = (size), 𝑥₂ = (size)², 𝑥₃ = (size)³
[Figure: Price ($) in 1000's vs. Size in feet², with a cubic curve fit through the data]
Slide credit: Andrew Ng
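Polynomial regression is then ordinary linear regression on hand-built features; a sketch (degree 3 to match the slide, with the earlier scaling advice applied first):

```python
import numpy as np

size = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

size = size / size.max()                      # scale first: size^3 would otherwise dwarf size
X = np.column_stack([np.ones_like(size),      # x0 = 1
                     size,                    # x1 = size
                     size ** 2,               # x2 = size^2
                     size ** 3])              # x3 = size^3
theta, *_ = np.linalg.lstsq(X, y, rcond=None) # still linear in theta, so a linear fit works
```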
Linear Regression
• Model representation
• Cost function
• Gradient descent
• Features and polynomial regression
• Normal equation
(𝑥₀) | Size in feet² (𝑥₁) | Number of bedrooms (𝑥₂) | Number of floors (𝑥₃) | Age of home in years (𝑥₄) | Price ($) in 1000's (𝑦)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178
… | … | … | … | … | …
Stack the feature rows (with the leading 1s) into 𝑋 and the targets into 𝑦 = [460, 232, 315, 178]⊤; then
𝜃 = (𝑋⊤𝑋)⁻¹𝑋⊤𝑦
Slide credit: Andrew Ng
Least squares solution
• 𝐽(𝜃) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
       = (1/2𝑚) Σᵢ₌₁ᵐ (𝜃⊤𝑥⁽ⁱ⁾ − 𝑦⁽ⁱ⁾)²
       = (1/2𝑚) ‖𝑋𝜃 − 𝑦‖₂²
• Setting (∂/∂𝜃) 𝐽(𝜃) = 0 gives
• 𝜃 = (𝑋⊤𝑋)⁻¹𝑋⊤𝑦
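In code it is better to solve the linear system (𝑋⊤𝑋)𝜃 = 𝑋⊤𝑦 than to form the inverse explicitly; a sketch on a subset of the housing columns (𝑋⊤𝑋 must be invertible, which requires at least 𝑛 + 1 linearly independent rows):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via a linear solve (no explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# 1s column, size, bedrooms (only 3 parameters, since we have just 4 examples)
X = np.array([[1, 2104, 5],
              [1, 1416, 3],
              [1, 1534, 3],
              [1,  852, 2]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
theta = normal_equation(X, y)
```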
Justification/interpretation 1
• Loss minimization:
𝐽(𝜃) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)² = (1/𝑚) Σᵢ₌₁ᵐ 𝐿𝑙𝑠(ℎ𝜃(𝑥⁽ⁱ⁾), 𝑦⁽ⁱ⁾)
• 𝐿𝑙𝑠(𝑦, ŷ) = ½ ‖𝑦 − ŷ‖₂²: the least squares loss
• Empirical Risk Minimization (ERM): minimize the average loss (1/𝑚) Σᵢ₌₁ᵐ 𝐿𝑙𝑠(𝑦⁽ⁱ⁾, ŷ⁽ⁱ⁾) over the training set
Justification/interpretation 2
• Probabilistic model: assume a linear model with Gaussian errors
𝑝𝜃(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾) = (1/√(2𝜋𝜎²)) exp(−(𝑦⁽ⁱ⁾ − 𝜃⊤𝑥⁽ⁱ⁾)² / (2𝜎²))
• Solving maximum likelihood:
argmax𝜃 Πᵢ₌₁ᵐ 𝑝𝜃(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾) = argmax𝜃 log(Πᵢ₌₁ᵐ 𝑝𝜃(𝑦⁽ⁱ⁾ | 𝑥⁽ⁱ⁾)) = argmin𝜃 (1/2𝜎²) Σᵢ₌₁ᵐ (𝜃⊤𝑥⁽ⁱ⁾ − 𝑦⁽ⁱ⁾)²
• So maximum likelihood under Gaussian noise recovers least squares
Image credit: CS 446@UIUC
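Taking the log of the likelihood makes the equivalence explicit; 𝜎² only scales the objective, so it does not change the minimizer:

```latex
\log \prod_{i=1}^{m} p_\theta\bigl(y^{(i)} \mid x^{(i)}\bigr)
  = \sum_{i=1}^{m} \Bigl[ -\tfrac{1}{2}\log(2\pi\sigma^2)
      - \tfrac{1}{2\sigma^2}\bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^2 \Bigr]
\;\;\Longrightarrow\;\;
\arg\max_{\theta} \prod_{i=1}^{m} p_\theta
  = \arg\min_{\theta} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^2
```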
Justification/interpretation 3
• Geometric interpretation
• 𝑿 ∈ ℝ^(𝑚×(𝑛+1)): rows are the training examples 𝒙⁽¹⁾⊤, 𝒙⁽²⁾⊤, ⋯, 𝒙⁽ᵐ⁾⊤ (each with a leading 1); columns are 𝒛₀, 𝒛₁, ⋯, 𝒛ₙ
• 𝑿𝜽 lies in the column space of 𝑿, i.e., span({𝒛₀, 𝒛₁, ⋯, 𝒛ₙ})
• At the least-squares solution, the residual 𝑿𝜽 − 𝒚 is orthogonal to the column space of 𝑿
• 𝑿⊤(𝑿𝜽 − 𝒚) = 0 → (𝑿⊤𝑿)𝜽 = 𝑿⊤𝒚
[Figure: 𝒚 and its projection 𝑿𝜽 onto the column space of 𝑿; the residual 𝑿𝜽 − 𝒚 is perpendicular to that subspace]
𝑚 training examples, 𝑛 features
Gradient Descent
• Need to choose 𝛼
• Needs many iterations
• Works well even when 𝑛 is large
Normal Equation
• No need to choose 𝛼
• No need to iterate
• Need to compute (𝑋⊤𝑋)⁻¹
• Slow if 𝑛 is very large
Slide credit: Andrew Ng
Things to remember
• Model representation: ℎ𝜃(𝑥) = 𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂ + ⋯ + 𝜃ₙ𝑥ₙ = 𝜃⊤𝑥
• Cost function: 𝐽(𝜃) = (1/2𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²
• Gradient descent for linear regression: repeat until convergence { 𝜃ⱼ ≔ 𝜃ⱼ − 𝛼 (1/𝑚) Σᵢ₌₁ᵐ (ℎ𝜃(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾) 𝑥ⱼ⁽ⁱ⁾ }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: 𝜃 = (𝑋⊤𝑋)⁻¹𝑋⊤𝑦
Next
• Naïve Bayes, Logistic regression, Regularization