TRANSCRIPT
Linear Regression: Model and Algorithms
CSE 446: Machine Learning
Emily Fox, University of Washington
January 9, 2017
©2017 Emily Fox
Linear regression: The model
How much is my house worth?
I want to list my house for sale.
$$ ????
Data
(x1 = sq.ft., y1 = $)
(x2 = sq.ft., y2 = $)
(x3 = sq.ft., y3 = $)
(x4 = sq.ft., y4 = $)
(x5 = sq.ft., y5 = $)
…

Input vs. output:
• x = input (sq.ft.), y = output ($)
• y is the quantity of interest
• assume y can be predicted from x
Model – How we assume the world works

[Figure: scatterplot of price ($), y, vs. square feet (sq.ft.), x]

Regression model: yi = f(xi) + εi
“Essentially, all models are wrong, but some are useful.”
George Box, 1987.
Simple linear regression model
yi = w0 + w1 xi + εi
f(x) = w0 + w1 x
w0, w1: parameters (regression coefficients)
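As a quick illustrative sketch (not from the lecture), here is this simple model fit to made-up sales data with NumPy's least squares polynomial fit:

```python
import numpy as np

# Hypothetical data for illustration: house sizes (sq.ft.) and prices ($).
x = np.array([1000., 1500., 2000., 2500., 3000.])
y = np.array([250e3, 330e3, 410e3, 490e3, 560e3])

# np.polyfit returns coefficients highest degree first,
# so deg=1 gives (w1, w0).
w1, w0 = np.polyfit(x, y, deg=1)
print(f"f(x) = {w0:.0f} + {w1:.1f} x")
print("predicted price for 2200 sq.ft.:", w0 + w1 * 2200)
```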
What about a quadratic function?

f(x) = w0 + w1 x + w2 x^2
Even higher order polynomial
f(x) = w0 + w1 x + w2 x^2 + … + wp x^p
Polynomial regression
Model: yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi

Treat the powers of x as different features:
feature 1 = 1 (constant)   parameter 1 = w0
feature 2 = x              parameter 2 = w1
feature 3 = x^2            parameter 3 = w2
…                          …
feature p+1 = x^p          parameter p+1 = wp
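A minimal sketch of "treat as different features" in NumPy (illustrative; the helper `polynomial_features` is mine, not the course's):

```python
import numpy as np

def polynomial_features(x, p):
    """Map inputs x to the features [1, x, x^2, …, x^p], one row per xi."""
    # np.vander with increasing=True puts the constant feature first,
    # matching feature 1 = 1, feature 2 = x, …, feature p+1 = x^p.
    return np.vander(x, N=p + 1, increasing=True)

x = np.array([1.0, 2.0, 3.0])
print(polynomial_features(x, p=2))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```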
Generic basis expansion

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ_{j=0..D} wj hj(xi) + εi

wj = jth regression coefficient or weight; hj = jth feature

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x
feature 3 = h2(x) … e.g., x^2 or sin(2πx/12)
…
feature D+1 = hD(x) … e.g., x^p
Predictions just based on house size
[Figure: price ($), y, vs. square feet (sq.ft.), x, with fitted line]

Only 1 bathroom! Not the same as my 3 bathrooms!
Add more inputs
[Figure: price ($), y, vs. square feet, x[1], and # bathrooms, x[2]]

f(x) = w0 + w1 sq.ft. + w2 #bath
Many possible inputs
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
General notation
Output: y (scalar)
Inputs: x = (x[1], x[2], …, x[d]) (d-dim vector)

Notational conventions:
x[j] = jth input (scalar)
hj(x) = jth feature (scalar)
xi = input of ith data point (vector)
xi[j] = jth input of ith data point (scalar)
Generic linear regression model
Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ_{j=0..D} wj hj(xi) + εi

feature 1 = h0(x) … e.g., 1
feature 2 = h1(x) … e.g., x[1] = sq. ft.
feature 3 = h2(x) … e.g., x[2] = #bath, or log(x[7]) x[2] = log(#bed) x #bath
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]
Fitting the linear regression model
Step 1: Rewrite the regression model
Rewrite in matrix notation
For observation i
yi = Σ_{j=0..D} wj hj(xi) + εi
   = h(xi)^T w + εi

where h(xi) = [h0(xi), h1(xi), …, hD(xi)]^T and w = [w0, w1, …, wD]^T.
For all observations together:

y = Hw + ε

where y = [y1, …, yN]^T, ε = [ε1, …, εN]^T, and H is the N × (D+1) matrix whose ith row is h(xi)^T.
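A small sketch of building H from a list of feature functions (the names `features` and `build_H` are mine, for illustration):

```python
import numpy as np

# Hypothetical feature functions h0, …, hD: each maps an input vector x
# to a scalar, as in the generic model above.
features = [
    lambda x: 1.0,    # h0: constant
    lambda x: x[0],   # h1: e.g., sq.ft.
    lambda x: x[1],   # h2: e.g., #bath
]

def build_H(X, features):
    """Stack h(xi)^T as rows, giving the N x (D+1) matrix H."""
    return np.array([[h(x) for h in features] for x in X])

X = np.array([[1500., 2.], [2500., 3.]])  # made-up inputs
H = build_H(X, features)
print(H.shape)  # (2, 3): one row h(xi)^T per observation
```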
Step 2: Compute the cost
“Cost” of using a given line
[Figure: price ($), y, vs. square feet (sq.ft.), x, with residuals from a fitted line]

Residual sum of squares (RSS):
RSS(w0, w1) = Σ_{i=1..N} (yi − [w0 + w1 xi])^2
RSS for multiple regression
RSS(w) = Σ_{i=1..N} (yi − h(xi)^T w)^2
       = (y − Hw)^T (y − Hw)
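As a sanity-check sketch, both forms of RSS computed in NumPy on made-up numbers; they should agree:

```python
import numpy as np

def rss(w, H, y):
    """RSS(w) = (y − Hw)^T (y − Hw)."""
    r = y - H @ w
    return r @ r

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 3))   # made-up design matrix
y = rng.normal(size=10)
w = rng.normal(size=3)

# Elementwise form: sum of squared residuals over observations.
elementwise = sum((y[i] - H[i] @ w) ** 2 for i in range(len(y)))
assert np.isclose(rss(w, H, y), elementwise)
```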
Step 3: Take the gradient
Gradient of RSS
∇RSS(w) = ∇[(y − Hw)^T (y − Hw)]
        = −2H^T (y − Hw)

Why? By analogy to the 1D case: d/dw (y − hw)^2 = −2h(y − hw).
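Spelled out (a standard matrix-calculus derivation, not written on the slide), expanding RSS and using ∇w(w^T b) = b and ∇w(w^T A w) = 2Aw for symmetric A:

```latex
\begin{aligned}
\mathrm{RSS}(\mathbf{w})
  &= (\mathbf{y}-\mathbf{H}\mathbf{w})^{\top}(\mathbf{y}-\mathbf{H}\mathbf{w})
   = \mathbf{y}^{\top}\mathbf{y}
     - 2\,\mathbf{w}^{\top}\mathbf{H}^{\top}\mathbf{y}
     + \mathbf{w}^{\top}\mathbf{H}^{\top}\mathbf{H}\,\mathbf{w} \\
\nabla\,\mathrm{RSS}(\mathbf{w})
  &= -2\,\mathbf{H}^{\top}\mathbf{y} + 2\,\mathbf{H}^{\top}\mathbf{H}\,\mathbf{w}
   = -2\,\mathbf{H}^{\top}(\mathbf{y}-\mathbf{H}\mathbf{w})
\end{aligned}
```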
Step 4, Approach 1: Set the gradient = 0
Closed-form solution
∇RSS(w) = −2H^T (y − Hw) = 0

Solve for w:
H^T Hw = H^T y
ŵ = (H^T H)^-1 H^T y
Invertible if: H^T H is full rank, i.e., the D+1 features are linearly independent (requires N ≥ D+1 observations).
Complexity of inverse: O(D^3)
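A minimal sketch of the closed-form fit in NumPy on synthetic data. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is the numerically preferred route; it assumes H^T H is invertible:

```python
import numpy as np

def fit_closed_form(H, y):
    """Least squares: solve the normal equations H^T H w = H^T y."""
    return np.linalg.solve(H.T @ H, H.T @ y)

rng = np.random.default_rng(1)
H = rng.normal(size=(100, 4))                  # made-up design matrix
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = H @ w_true + 0.01 * rng.normal(size=100)   # synthetic noisy targets
print(fit_closed_form(H, y))                   # close to w_true
```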
Step 4, Approach 2: Gradient descent
Gradient descent
while not converged:
    w(t+1) ← w(t) − η ∇RSS(w(t))
           = w(t) + 2η H^T (y − Hw(t))
Interpreting elementwise
Update to jth feature weight:

wj(t+1) ← wj(t) + 2η Σ_{i=1..N} hj(xi)(yi − ŷi(w(t)))

where ŷi(w(t)) = h(xi)^T w(t) is the prediction for the ith observation under the current weights.
Summary of gradient descent for multiple regression

init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
    for j = 0, …, D:
        partial[j] = −2 Σ_{i=1..N} hj(xi)(yi − ŷi(w(t)))
        wj(t+1) ← wj(t) − η partial[j]
    t ← t + 1
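A minimal NumPy sketch of this loop on synthetic data; the step size η, tolerance ε, and data are illustrative choices, not the course's:

```python
import numpy as np

def fit_gradient_descent(H, y, eta=1e-3, eps=1e-6, max_iter=100_000):
    """Minimize RSS(w) by gradient descent, following the slide's loop."""
    w = np.zeros(H.shape[1])             # init w(1) = 0
    for _ in range(max_iter):
        grad = -2 * H.T @ (y - H @ w)    # gradient of RSS at w(t)
        if np.linalg.norm(grad) <= eps:  # converged: ||grad RSS|| <= eps
            break
        w = w - eta * grad               # w(t+1) <- w(t) - eta * grad
    return w

rng = np.random.default_rng(2)
H = rng.normal(size=(200, 3))
y = H @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=200)
print(fit_gradient_descent(H, y))  # should match the closed-form solution
```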
Why min RSS?
Assuming Gaussian noise
[Figure: price ($), y, vs. square feet (sq.ft.), x]

Model for εi: εi ~ N(0, σ^2)

Implied distribution on yi: yi ~ N(w0 + w1 xi, σ^2)
Maximum likelihood estimate of params
Maximize the log-likelihood with respect to w. Under the Gaussian noise model, this is equivalent to minimizing RSS(w), so the least squares solution ŵ is also the maximum likelihood estimate.
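Filling in the algebra (standard, though not written out in the transcript): the Gaussian log-likelihood is, up to constants, a negative multiple of RSS, so maximizing it over w is exactly minimizing RSS:

```latex
\begin{aligned}
\log p(\mathbf{y}\mid \mathbf{w},\sigma^2)
  &= \sum_{i=1}^{N}\log\mathcal{N}\!\big(y_i \,\big|\, h(x_i)^{\top}\mathbf{w},\,\sigma^2\big) \\
  &= -\tfrac{N}{2}\log\!\big(2\pi\sigma^2\big)
     \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - h(x_i)^{\top}\mathbf{w}\big)^2 \\
  &= \text{const} \;-\; \frac{1}{2\sigma^2}\,\mathrm{RSS}(\mathbf{w})
\end{aligned}
```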
Interpreting the fitted function
Interpreting the coefficients – Simple linear regression

ŷ = ŵ0 + ŵ1 x

[Figure: price ($), y, vs. square feet (sq.ft.), x, with fitted line]

ŵ1 = predicted change in price ($) per 1 sq. ft. increase
Interpreting the coefficients – Two linear features

ŷ = ŵ0 + ŵ1 x[1] + ŵ2 x[2]

[Figure: price ($), y, vs. square feet, x[1], and # bathrooms, x[2]]
[Figure: price ($), y, vs. # bathrooms, x[2], slice of the fit for fixed sq.ft.]

ŵ2 = predicted change in price ($) per 1 bathroom increase, for fixed # sq.ft.!
Interpreting the coefficients – Multiple linear features

ŷ = ŵ0 + ŵ1 x[1] + … + ŵj x[j] + … + ŵd x[d]

ŵj = predicted change in ŷ per unit change in x[j], holding all other features fixed
Interpreting the coefficients – Polynomial regression

ŷ = ŵ0 + ŵ1 x + … + ŵj x^j + … + ŵp x^p

Can't hold the other features fixed! They are all functions of the same x, so the individual coefficients have no simple interpretation.
Recap of concepts
What you can do now…
• Describe polynomial regression
• Write a regression model using multiple inputs or features thereof
• Cast both polynomial regression and regression with multiple inputs as regression with multiple features
• Calculate a goodness-of-fit metric (e.g., RSS)
• Estimate model parameters of a general multiple regression model to minimize RSS:
  - In closed form
  - Using an iterative gradient descent algorithm
• Interpret the coefficients of a non-featurized multiple regression fit
• Exploit the estimated model to form predictions