TRANSCRIPT
Linear Regression
Aarti Singh & Barnabas Poczos
Machine Learning 10-701/15-781, Jan 23, 2014
So far …
• Learning distributions – Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP)
• Learning classifiers – Naïve Bayes
Discrete to Continuous Labels
Classification: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (anemic cell vs. healthy cell)
Regression: Stock Market Prediction (X = Feb 01, Y = ?)
Regression Tasks
Weather Prediction: X = 7 pm (a time), Y = temperature
Estimating Contamination: X = new location, Y = sensor reading
Supervised Learning
Goal: a learning algorithm maps training data to a predictor f, judged by a loss function (performance measure).
Classification: probability of error, P(f(X) ≠ Y)  (e.g., X = Document, Y = Topic: Sports / Science / News)
Regression: mean squared error, E[(f(X) − Y)²]  (e.g., stock market prediction: X = Feb 01, Y = ?)
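As a concrete reference, here is one standard way to write these two performance measures and the goal (my own formalization in LaTeX, not a verbatim reproduction of the slide):

```latex
% Risk (expected loss) of a predictor f on a random labeled example (X, Y)
\begin{align*}
  \text{Classification (0/1 loss):} \quad & R_{0/1}(f) = \mathbb{P}\bigl(f(X) \neq Y\bigr) \\
  \text{Regression (squared loss):} \quad & R(f) = \mathbb{E}\bigl[(f(X) - Y)^2\bigr] \\
  \text{Goal:} \quad & f^\star = \arg\min_{f} \; R(f)
\end{align*}
```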
Regression algorithms
• Linear Regression
• Regularized Linear Regression – Ridge regression, Lasso
• Polynomial Regression
• Kernel Regression
• Regression Trees, Splines, Wavelet estimators, …
Replace Expectation with Empirical Mean
Optimal predictor: f* = arg min_f E[(f(X) − Y)²]
Law of Large Numbers: the empirical mean (1/n) Σ_{i=1}^n (f(X_i) − Y_i)² → E[(f(X) − Y)²] as n → ∞
Empirical Minimizer: f̂_n = arg min_f (1/n) Σ_{i=1}^n (f(X_i) − Y_i)²
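A quick numerical illustration of this replacement (my own sketch; the data-generating model, the fixed predictor f, and the sample sizes are arbitrary choices): for a fixed predictor, the empirical mean squared error computed from n samples settles near the expected squared error as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A fixed (arbitrary) predictor to evaluate.
    return 2.0 * x

def sample(n):
    # Data model for the illustration: Y = 3*X + Gaussian noise.
    x = rng.uniform(0, 1, size=n)
    y = 3.0 * x + rng.normal(0, 0.1, size=n)
    return x, y

for n in [10, 1_000, 100_000]:
    x, y = sample(n)
    empirical_mse = np.mean((f(x) - y) ** 2)
    print(f"n={n:>6}: empirical MSE = {empirical_mse:.4f}")
# As n grows, the printed values settle near the true expected squared error
# E[(f(X) - Y)^2] = E[X^2] + 0.01 = 1/3 + 0.01 (about 0.343) for this model.
```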
Restrict class of predictors
Optimal predictor: f* = arg min_{f ∈ F} E[(f(X) − Y)²]
Empirical Minimizer: f̂_n = arg min_{f ∈ F} (1/n) Σ_{i=1}^n (f(X_i) − Y_i)²
Why restrict to a class of predictors F? Overfitting! Without a restriction, the empirical loss is minimized by any function that simply interpolates the training data, i.e. any f with f(X_i) = Y_i for all i.
F = class of predictors, e.g.:
- class of linear functions
- class of polynomial functions
- class of nonlinear functions
Linear Regression
F = class of linear functions
Univariate case: f(X) = β1 + β2 X, where β1 = intercept and β2 = slope
Multivariate case: f(X) = β1 · 1 + β2 X^(1) + … + β_{p+1} X^(p) = Xβ, where X = [1, X^(1), …, X^(p)] and β = (β1, …, β_{p+1})^T
Least Squares Estimator
β̂ = arg min_β (1/n) Σ_{i=1}^n (f(X_i) − Y_i)²,   where f(X_i) = X_i β
In matrix form, with A the matrix whose i-th row is X_i and Y = (Y_1, …, Y_n)^T:  β̂ = arg min_β ‖Aβ − Y‖²  (the 1/n factor does not change the minimizer)
Normal Equations
Setting the gradient of ‖Aβ − Y‖² to zero gives the normal equations:  (A^T A) β = A^T Y    (dimensions: p×p, p×1, p×1)
If A^T A is invertible,  β̂ = (A^T A)^{-1} A^T Y
When is A^T A invertible? Recall: full-rank matrices are invertible. What is the rank of A^T A? What if A^T A is not invertible? Regularization (later).
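A minimal numerical sketch of the closed-form solution above (my own toy example; the data and variable names are made up). In practice one solves the linear system rather than forming the inverse explicitly, and np.linalg.lstsq is the more robust choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples, 2 input features, plus a leading column of 1s for the intercept.
n = 100
X = rng.normal(size=(n, 2))
A = np.hstack([np.ones((n, 1)), X])             # design matrix, one row per sample
beta_true = np.array([1.0, 2.0, -3.0])
Y = A @ beta_true + rng.normal(0, 0.1, size=n)  # signal plus zero-mean noise

# Normal equations: (A^T A) beta = A^T Y  (assumes A^T A is invertible)
beta_hat = np.linalg.solve(A.T @ A, A.T @ Y)
print(beta_hat)   # close to [1, 2, -3]
```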
Gradient Descent
Even when A^T A is invertible, computing (A^T A)^{-1} A^T Y can be computationally expensive if A is huge.
Alternative: treat least squares as an optimization problem. Observation: J(β) = ‖Aβ − Y‖² is convex in β.
[Figure: convex objective J(β1) in one dimension and the bowl-shaped surface J(β1, β2) in two dimensions]
How to find the minimizer?
Since J(β) is convex, move along the negative of the gradient:
Initialize: β^0
Update: β^{t+1} = β^t − α ∇J(β^t),   where α is the step size; the gradient ∇J(β^t) is 0 exactly when β^t is a minimizer.
Stop: when some criterion is met, e.g. a fixed number of iterations, or the change between iterates is < ε.
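A short sketch of this procedure for the least squares objective J(β) = (1/n)‖Aβ − Y‖² (my own illustration; the step size, stopping rule, and synthetic data are arbitrary choices, not taken from the lecture):

```python
import numpy as np

def gradient_descent_ls(A, Y, alpha=0.1, max_iters=10_000, eps=1e-10):
    """Minimize J(beta) = (1/n) * ||A @ beta - Y||^2 by gradient descent."""
    n = len(Y)
    beta = np.zeros(A.shape[1])                    # initialize at beta = 0
    for _ in range(max_iters):
        grad = (2.0 / n) * A.T @ (A @ beta - Y)    # gradient of J at the current beta
        beta_new = beta - alpha * grad             # step along the negative gradient
        if np.linalg.norm(beta_new - beta) < eps:  # stop when the update is tiny
            break
        beta = beta_new
    return beta

# Tiny check against the closed-form solution on synthetic data.
rng = np.random.default_rng(0)
A = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
Y = A @ np.array([1.0, 2.0, -3.0]) + rng.normal(0, 0.1, size=200)
print(gradient_descent_ls(A, Y))          # ~ [1, 2, -3]
print(np.linalg.solve(A.T @ A, A.T @ Y))  # same, via the normal equations
```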
Effect of step-size α
Large α => fast convergence, but larger residual error and possible oscillations.
Small α => slow convergence, but small residual error.
Least Squares and MLE
Intuition: signal plus (zero-mean) noise model,  Y = Xβ* + ε  with E[ε] = 0.
The Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian noise model!
(β̂_MLE = arg max_β log P(data | β), the log likelihood.)
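A sketch of why the two estimates coincide, assuming the noise terms are independent Gaussians, Y_i = X_i β + ε_i with ε_i ~ N(0, σ²) (my wording of the standard derivation):

```latex
% Log likelihood under Y_i = X_i beta + eps_i,  eps_i ~ N(0, sigma^2), independent
\begin{align*}
\log P(Y_{1:n} \mid X_{1:n}, \beta)
  &= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\Bigl(-\tfrac{(Y_i - X_i\beta)^2}{2\sigma^2}\Bigr) \\
  &= \text{const} \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - X_i\beta)^2 .
\end{align*}
% Maximizing over beta therefore minimizes the sum of squared errors,
% so the MLE equals the least squares estimator.
```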
Regularized Least Squares and MAP
What if A^T A is not invertible?
MAP estimate:  β̂_MAP = arg max_β [ log P(Y | X, β) + log P(β) ]    (log likelihood + log prior)
I) Gaussian prior: the prior belief that β is Gaussian with zero mean biases the solution toward “small” β.
⇒ Ridge Regression:   β̂_MAP = (A^T A + λI)^{-1} A^T Y
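A minimal sketch of this closed form on synthetic data where A^T A is singular (my own example; the duplicated feature and the value of λ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data where A^T A is singular: two identical feature columns.
n = 50
x = rng.normal(size=(n, 1))
A = np.hstack([np.ones((n, 1)), x, x])          # duplicated column -> rank-deficient A^T A
Y = A @ np.array([1.0, 1.0, 1.0]) + rng.normal(0, 0.1, size=n)

lam = 0.1
p = A.shape[1]
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)  # (A^T A + lambda I)^{-1} A^T Y
print(beta_ridge)   # well-defined even though A^T A alone is not invertible
```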
II) Laplace prior: the prior belief that β is Laplace with zero mean also biases the solution toward “small” β.
⇒ Lasso
Ridge Regression vs Lasso
Ridge Regression:  β̂ = arg min_β ‖Aβ − Y‖² + λ‖β‖₂²        Lasso:  β̂ = arg min_β ‖Aβ − Y‖² + λ‖β‖₁
Lasso (ℓ1 penalty) results in sparse solutions, i.e. vectors with more zero coordinates. Good for high-dimensional problems: you don't have to store all coordinates!
Ideally we would use an ℓ0 penalty (the number of nonzero coordinates), but the optimization becomes non-convex.
[Figure: level sets of J(β) in the (β1, β2) plane, together with the sets of β with constant ℓ1, ℓ2, and ℓ0 norm]
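To see the sparsity effect numerically, here is a small comparison using scikit-learn's Ridge and Lasso (my own illustration; the dimensions, true coefficients, and regularization strengths are arbitrary, and the exact support recovered by the Lasso will vary with them):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# High-dimensional toy problem: 100 features, only 5 of them actually matter.
n, p = 80, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ beta_true + rng.normal(0, 0.1, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 100
print("nonzero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically close to 5
```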
Beyond Linear Regression
• Polynomial regression
• Regression with nonlinear features
• Later … Kernel regression, Local/Weighted regression
Polynomial Regression
Univariate (1-dim) case: f(X) = β0 + β1 X + β2 X² + … + β_m X^m   (polynomial of degree m)
Multivariate (p-dim) case, degree m:
f(X) = β0 + β1 X^(1) + β2 X^(2) + … + βp X^(p)
       + Σ_{i=1}^p Σ_{j=1}^p β_{ij} X^(i) X^(j)
       + Σ_{i=1}^p Σ_{j=1}^p Σ_{k=1}^p β_{ijk} X^(i) X^(j) X^(k)
       + … terms up to degree m
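Polynomial regression is still linear regression, just on expanded features, so the least squares machinery above applies unchanged. A minimal univariate sketch (my own example; the target function and degree are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Univariate data from a nonlinear target.
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=30)

degree = 3
# Design matrix with columns 1, x, x^2, ..., x^degree.
A = np.vander(x, N=degree + 1, increasing=True)

# Ordinary least squares on the polynomial features.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)   # coefficients beta_0 ... beta_degree
```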
Polynomial Regression
[Figure: least squares polynomial fits of order k = 1, 2, 3, 7 to the same data]
Polynomial of order k, equivalently of degree up to k−1.
What is the right order? Recall overfitting! More later …
Regression with nonlinear features
In general, we can use any nonlinear features, e.g. e^X, log X, 1/X, sin(X), …
f(X) = β0 + Σ_j β_j φ_j(X),  where the φ_j(X) are nonlinear features and β_j is the weight of each feature.
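The same recipe works for arbitrary nonlinear features: build a design matrix whose columns are the transformed inputs and solve the usual least squares problem (a sketch with hand-picked features; the particular features and data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.1, 3.0, size=50)
y = 2.0 * np.log(x) + 0.5 * np.sin(x) + rng.normal(0, 0.05, size=50)

# Nonlinear features: 1, log(x), sin(x), 1/x  (each column gets its own weight beta_j).
A = np.column_stack([np.ones_like(x), np.log(x), np.sin(x), 1.0 / x])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)   # weights on each feature; approximately [0, 2, 0.5, 0]
```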
What you should know
• Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Probabilistic Interpretation (connection to MLE)
• Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
• Polynomial Regression, Regression with Nonlinear Features