TRANSCRIPT
[Page 1]
CSE546: Linear Regression Bias / Variance Tradeoff
Winter 2012
Luke Zettlemoyer
Slides adapted from Carlos Guestrin
[Page 2]
Prediction of continuous variables • Billionaire says: Wait, that's not what I meant! • You say: Chill out, dude. • He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
• You say: I can regress that…
[Page 3]
Linear Regression
[Figure: two example datasets with fitted linear regression predictions, labeled "Prediction"]
[Page 4]
Ordinary Least Squares (OLS)
[Figure: scatter plot with fitted line; the vertical gap between each observation and its prediction is the error or "residual"]
[Page 5]
The regression problem • Instances: <xj, tj> • Learn: Mapping from x to t(x) • Hypothesis space:
– Given basis functions {h1,…,hk} – Find coefficients w={w1,…,wk}
– Why is this usually called linear regression? • The model is linear in the parameters • Can we estimate functions that are not lines??? Yes: the basis functions hi can be nonlinear in x.
• Precisely, minimize the residual squared error:
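Written out (reconstructed here to match the notation on the later slides), the objective is:

```latex
\mathbf{w}^{*} \;=\; \arg\min_{\mathbf{w}} \sum_{j=1}^{N} \Big( t_j - \sum_{i=1}^{k} w_i\, h_i(x_j) \Big)^{2}
```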
[Page 6]
Regression: matrix notation
With N data points and K basis functions, stack the measurements into an N×K design matrix H (one row per data point, one column per basis function), the N observed outputs into a vector t, and the weights into a K-vector w.
[Page 7]
Regression solution: simple matrix math
w* = (HᵀH)⁻¹(Hᵀt)
where HᵀH is a k×k matrix for k basis functions and Hᵀt is a k×1 vector.
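As a sketch of the closed-form solution in code (a hypothetical NumPy example; the function and variable names are illustrative, and `lstsq` is used in place of an explicit inverse for numerical stability):

```python
import numpy as np

def ols_fit(x, t, basis_funcs):
    """Least-squares fit: solve H w = t in the least-squares sense,
    where H[j, i] = h_i(x_j) is the N x k design matrix."""
    H = np.column_stack([h(x) for h in basis_funcs])
    # lstsq solves the normal equations (H^T H) w = H^T t stably
    w, *_ = np.linalg.lstsq(H, t, rcond=None)
    return w

# Illustrative: recover the line t = 2x + 1 with basis {1, x}
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2 * x + 1
w = ols_fit(x, t, [lambda x: np.ones_like(x), lambda x: x])
```

On this noise-free example the fit recovers the generating coefficients exactly.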
[Page 8]
But, why? • Billionaire (again) says: Why sum squared error???
• You say: Gaussians, Dr. Gateson, Gaussians…
• Model: prediction is a linear function plus Gaussian noise – t(x) = ∑i wi hi(x) + ε
• Learn w using MLE:
[Page 9]
Maximizing log-likelihood Maximize wrt w:
Least-squares Linear Regression is MLE for Gaussians!!!

Recap, MLE for a single Gaussian's parameters:

µMLE, σMLE = argmaxµ,σ P(D | µ, σ)

Setting the derivative with respect to µ to zero:
−∑i=1..N (xi − µ)/σ² = 0  ⇒  −∑i=1..N xi + Nµ = 0  ⇒  µMLE = (1/N) ∑i xi

Setting the derivative with respect to σ to zero:
−N/σ + ∑i=1..N (xi − µ)²/σ³ = 0  ⇒  σ²MLE = (1/N) ∑i (xi − µ)²

For linear regression:

wMLE = argmaxw ln[(1/(σ√2π))^N] + ∑j=1..N −[tj − ∑i wi hi(xj)]²/(2σ²)
     = argmaxw ∑j=1..N −[tj − ∑i wi hi(xj)]²/(2σ²)
     = argminw ∑j=1..N [tj − ∑i wi hi(xj)]²
[Page 10]
Bias-Variance tradeoff – Intuition
• Model too simple: does not fit the data well – A biased solution
• Model too complex: small changes to the data cause the solution to change a lot – A high-variance solution
[Figure: polynomial fits to data t vs. x on [0, 1]: M = 0 (too simple) and M = 9 (too complex)]
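The M = 0 vs. M = 9 contrast above can be sketched numerically (illustrative synthetic data after Bishop's curve-fitting example; the seed and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)  # noisy targets

def train_rmse(M):
    """Fit a degree-M polynomial by least squares; return training RMS error."""
    w = np.polyfit(x, t, M)
    return np.sqrt(np.mean((t - np.polyval(w, x)) ** 2))

# M = 0 (a constant) underfits; M = 9 can interpolate all 10 points
err_m0, err_m9 = train_rmse(0), train_rmse(9)
```

Training error alone always favors the complex model; the damage from overfitting only shows up on held-out data.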
[Page 11]
(Squared) Bias of learner • Given: dataset D with m samples • Learn: for different datasets D, you will get different
functions h(x) • Expected prediction (averaged over hypotheses): ED[h(x)] • Bias: difference between expected prediction and truth
– Measures how well you expect to represent the true solution
– Decreases with a more complex model
[Page 12]
Variance of learner • Given: dataset D with m samples • Learn: for different datasets D, you will get different
functions h(x) • Expected prediction (averaged over hypotheses): ED[h(x)] • Variance: difference between what you expect to learn and
what you learn from a particular dataset – Measures how sensitive the learner is to the specific dataset – Decreases with a simpler model
[Page 13]
Bias–Variance decomposition of error
• Consider a simple regression problem f: X → T, f(x) = g(x) + ε, where g is deterministic and ε is noise ~ N(0, σ²)
• Collect some data, and learn a function h(x) • What are the sources of prediction error?
[Page 14]
Sources of error 1 – noise
• What if we have a perfect learner and infinite data? – If our learning solution h(x) satisfies h(x) = g(x) – We still have remaining, unavoidable error of σ² due to the noise ε
f(x) = g(x) + ε
[Page 15]
Sources of error 2 – Finite data
• What if we have an imperfect learner, or only m training examples?
• What is our expected squared error per example? – Expectation taken over random training sets D of size m, drawn from the distribution P(X,T)
f(x) = g(x) + ε
[Page 16]
Bias-Variance Decomposition of Error Assume target function: t(x) = g(x) + ε
• Then the expected squared error over fixed-size training sets D drawn from P(X,T) can be expressed as the sum of three components:
Where:
Bishop Chapter 3
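The three components (a reconstruction of the standard result from Bishop, Chapter 3; hD denotes the hypothesis learned from training set D):

```latex
E_D\big[(t(x) - h_D(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(g(x) - E_D[h_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big[\big(h_D(x) - E_D[h_D(x)]\big)^2\big]}_{\text{variance}}
```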
[Page 17]
Bias-Variance Tradeoff • Choice of hypothesis class introduces learning bias – More complex class → less bias – More complex class → more variance
[Page 18]
Training set error
• Given a dataset (training data) • Choose a loss function
– e.g., squared error (L2) for regression • Training error: for a particular set of parameters, the loss function on the training data:
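In the notation of the earlier slides (a reconstruction; Ntrain is the number of training points):

```latex
\text{error}_{\text{train}}(\mathbf{w}) = \frac{1}{N_{\text{train}}} \sum_{j=1}^{N_{\text{train}}} \Big( t_j - \sum_i w_i\, h_i(x_j) \Big)^{2}
```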
[Page 19]
Training error as a function of model complexity
[Page 20]
Prediction error
• Training set error can be a poor measure of the "quality" of the solution
• Prediction error (true error): we really care about the error over all possibilities:
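Written as an expectation over the input distribution p(x) (a reconstruction consistent with the Monte Carlo slide that follows):

```latex
\text{error}_{\text{true}}(\mathbf{w}) = E_x\Big[\big(t(x) - \sum_i w_i\, h_i(x)\big)^2\Big] = \int \big(t(x) - \sum_i w_i\, h_i(x)\big)^2 \, p(x)\, dx
```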
[Page 21]
Prediction error as a function of model complexity
[Page 22]
Computing prediction error • To correctly predict the error, we face a hard integral! – May not know t(x) for every x, and may not know p(x)
• Monte Carlo integration (sampling approximation): • Sample a set of i.i.d. points {x1,…,xM} from p(x) • Approximate the integral with the sample average
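The sampling approximation can be sketched as follows (everything here is illustrative: the true function g, the hypothesis h, the noise level, and the uniform choice of p(x) are all assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):                 # true deterministic function (illustrative choice)
    return 2.0 * x

def h(x):                 # a learned hypothesis, off by a constant (illustrative)
    return 2.0 * x + 0.5

sigma = 0.1               # noise standard deviation
M = 100_000               # number of Monte Carlo samples

# Sample i.i.d. inputs from p(x) (here uniform on [0, 1]) and noisy targets
xs = rng.uniform(0.0, 1.0, M)
ts = g(xs) + rng.normal(0.0, sigma, M)

# The sample average approximates the integral of (t(x) - h(x))^2 p(x) dx
mc_error = np.mean((ts - h(xs)) ** 2)
# Analytically this should approach sigma^2 + 0.5^2 = 0.26
```

With this many samples the Monte Carlo estimate lands very close to the analytic value.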
[Page 23]
Why doesn't training set error approximate prediction error?
• Sampling approximation of prediction error:
• Training error:
• Very similar equations!!! – Why is the training set a bad measure of prediction error???
[Page 24]
Why doesn't training set error approximate prediction error?
• Sampling approximation of prediction error:
• Training error:
• Very similar equations!!! – Why is the training set a bad measure of prediction error???
Because you cheated!!!
Training error is a good estimate for a single w, but you optimized w with respect to the training error,
and found a w that is good for this particular set of samples.
Training error is an (optimistically) biased estimate of prediction error.
[Page 25]
Test set error • Given a dataset, randomly split it into two parts: – Training data – {x1,…, xNtrain} – Test data – {x1,…, xNtest}
• Use the training data to optimize the parameters w • Test set error: for the final solution w*, evaluate the error using:
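A sketch of the split and the resulting optimism of training error (synthetic data; the degree-9 polynomial, seed, and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

# Randomly split the dataset into training and test halves
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:20], idx[20:]

# Optimize the parameters w on the training data only
w = np.polyfit(x[train_idx], t[train_idx], 9)

def mse(i):
    return np.mean((t[i] - np.polyval(w, x[i])) ** 2)

# Training error is optimistically biased; test error estimates true error
train_err, test_err = mse(train_idx), mse(test_idx)
```

Because w was chosen to minimize error on the training half, the test half gives the more honest estimate.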
[Page 26]
Test set error as a function of model complexity
[Page 27]
Overfitting: this slide is so important we are looking at it again!
• Assume: – Data generated from distribution D(X,Y) – A hypothesis space H
• Define: errors for hypothesis h ∈ H – Training error: errortrain(h) – Data (true) error: errortrue(h)
• We say h overfits the training data if there exists an h' ∈ H such that:
errortrain(h) < errortrain(h') and errortrue(h) > errortrue(h')
[Page 28]
Summary: error estimators
• Gold standard: the true (prediction) error
• Training: optimistically biased
• Test: our final measure, unbiased?
[Page 29]
Error as a function of the number of training examples for a fixed model complexity
[Figure: error curves, from little data (left) to infinite data (right)]
[Page 30]
Summary: error estimators
• Gold standard: the true (prediction) error
• Training: optimistically biased
• Test: our final measure, unbiased?
Be careful!!!
The test set is only unbiased if you never, never, ever, ever do any learning on the test data.
For example, if you use the test set to select the degree of the polynomial… it is no longer unbiased!!! (We will address this problem later in the semester.)
[Page 31]
What you need to know • Regression
– Basis function = features – Optimizing sum squared error – Relationship between regression and Gaussians
• Bias-Variance trade-off • Play with the applet
– http://mste.illinois.edu/users/exner/java.f/leastsquares/
• True error, training error, test error – Never learn on the test data
• Overfitting