OLS Coefficient Analysis
DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science
https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html
Carlos Fernandez-Granda
Prerequisites
Linear algebra (orthogonal matrices, singular-value decomposition)
Ordinary least squares
Goal
Analyze coefficients of OLS estimator
Singular-value decomposition
Every A ∈ R^{m×k}, m ≥ k, has a singular-value decomposition (SVD)

A = [u1 u2 · · · uk] diag(s1, s2, . . . , sk) [v1 v2 · · · vk]^T = USV^T
The singular values s1 ≥ s2 ≥ · · · ≥ sk are nonnegative
The left singular vectors u1, u2, . . . , uk ∈ R^m are orthonormal
The right singular vectors v1, v2, . . . , vk ∈ R^k are orthonormal
Linear maps
The SVD decomposes the action of a matrix A ∈ R^{m×k} on a vector w ∈ R^k into:
1. Rotation

   V^T w = ∑_{i=1}^{k} 〈vi, w〉 ei

2. Scaling

   S V^T w = ∑_{i=1}^{k} si 〈vi, w〉 ei

3. Rotation

   U S V^T w = ∑_{i=1}^{k} si 〈vi, w〉 ui
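The three steps can be verified with a short numpy sketch (the matrix A and vector w below are arbitrary illustrative values, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # A in R^{m x k}, m >= k
w = rng.standard_normal(3)

# Reduced SVD: U is 5x3, s holds s_1 >= s_2 >= s_3, Vt is V^T (3x3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 1 (rotation): coefficients <v_i, w> in the basis e_1, ..., e_k
step1 = Vt @ w

# Step 2 (scaling): multiply each coefficient by s_i
step2 = s * step1

# Step 3 (rotation): recombine using the left singular vectors u_i
step3 = U @ step2

# The composition equals the direct matrix-vector product Aw
assert np.allclose(step3, A @ w)
```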
Linear maps
[Figure: three-step illustration on vectors x and y; V^T rotates them onto the e1, e2 basis, S scales to s1 e1, s2 e2, and U rotates onto s1 u1, s2 u2]
Linear maps (s2 := 0)
[Figure: same three steps with s2 set to zero; the components of x and y along v2 are mapped to 0, so USV^T x = s1 〈v1, x〉 u1]
Regression
Goal: Estimate response (or dependent variable)
Data: Several observed variables, known as features (or covariates, or independent variables)
Data
Training data: (y1, x1), (y2, x2), . . . , (yn, xn), where yi ∈ R and xi ∈ Rp
We define a response vector y ∈ R^n and a feature matrix

X := [x1 x2 · · · xn]
OLS estimator
βOLS = (XX^T)^{-1} X y
     = (US^2 U^T)^{-1} USV^T y
     = US^{-2} U^T USV^T y
     = US^{-1} V^T y

(XX^T)^{-1} X is a left inverse (or pseudoinverse) of X^T because (XX^T)^{-1} XX^T = I
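This chain of identities can be checked numerically; the sketch below uses a synthetic feature matrix and response (assumed values, not the lecture's data):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 50
X = rng.standard_normal((p, n))   # feature matrix, p x n (features as rows)
y = rng.standard_normal(n)

# Direct closed form: (X X^T)^{-1} X y
beta_direct = np.linalg.inv(X @ X.T) @ X @ y

# Via the SVD X = U S V^T: U S^{-1} V^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = U @ ((Vt @ y) / s)

assert np.allclose(beta_direct, beta_svd)

# (X X^T)^{-1} X is a left inverse of X^T
assert np.allclose(np.linalg.inv(X @ X.T) @ X @ X.T, np.eye(p))
```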
Additive model
Model for MSE analysis

y = x^T βtrue + z

Problem: Ignores that the estimate is obtained from training data

Model for training data

ytrain := X^T βtrue + ztrain

▶ Feature matrix X ∈ R^{p×n} is deterministic
▶ Coefficients βtrue ∈ R^p are deterministic
▶ Noise ztrain is an n-dimensional iid Gaussian vector with zero mean and variance σ^2
Maximum likelihood
Under this model, OLS is equivalent to maximum likelihood
Assume we observe ytrain

L_ytrain(β) = (1 / √((2πσ^2)^n)) exp( −(1/(2σ^2)) ||ytrain − X^T β||_2^2 )

βML = arg max_β L_ytrain(β)
    = arg max_β log L_ytrain(β)
    = arg min_β ||ytrain − X^T β||_2^2
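A quick numerical sanity check of this equivalence, on synthetic data (dimensions, σ, and βtrue below are assumed for illustration): the least-squares solution attains the minimum of the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, sigma = 2, 30, 0.5
X = rng.standard_normal((p, n))
beta_true = np.array([1.0, -2.0])
y_train = X.T @ beta_true + sigma * rng.standard_normal(n)

def neg_log_likelihood(beta):
    # -log L up to the additive constant (n/2) log(2 pi sigma^2)
    return np.sum((y_train - X.T @ beta) ** 2) / (2 * sigma ** 2)

# Least-squares solution of min ||y_train - X^T beta||^2
beta_ols, *_ = np.linalg.lstsq(X.T, y_train, rcond=None)

# The OLS solution maximizes the likelihood: any perturbation increases -log L
for _ in range(100):
    perturbed = beta_ols + 0.1 * rng.standard_normal(p)
    assert neg_log_likelihood(beta_ols) <= neg_log_likelihood(perturbed)
```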
Decomposition of OLS cost function
arg min_β ||ytrain − X^T β||_2^2
= arg min_β ||ztrain − X^T(β − βtrue)||_2^2
= arg min_β (β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T (β − βtrue) + ztrain^T ztrain
= arg min_β (β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T β
Quadratic component (β − βtrue)^T XX^T (β − βtrue)

Contour lines are ellipsoids centered at βtrue:

c^2 = (β − βtrue)^T XX^T (β − βtrue) = (β − βtrue)^T US^2 U^T (β − βtrue)
    = ∑_{i=1}^{p} si^2 (ui^T (β − βtrue))^2

Axes of the ellipsoid are the left singular vectors
Stretch in the direction of each axis is determined by the singular values
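The expansion of the quadratic form in terms of singular values can be checked directly (synthetic X and displacement d := β − βtrue, assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 2, 40
X = rng.standard_normal((p, n))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = rng.standard_normal(p)                 # d := beta - beta_true
quad = d @ (X @ X.T) @ d                   # (β − βtrue)^T XX^T (β − βtrue)
via_svd = np.sum(s ** 2 * (U.T @ d) ** 2)  # Σ s_i^2 (u_i^T d)^2
assert np.allclose(quad, via_svd)
```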
(β − βtrue)^T XX^T (β − βtrue)
[Figure: contour lines of the quadratic component over (β[1], β[2]); ellipses centered at βtrue, contour levels from 0.10 to 7.00]
(β − βtrue)^T XX^T (β − βtrue)
[Figure: same contours with the ellipse axes c s1^{-1} u1 and c s2^{-1} u2 marked at βtrue]
−2 ztrain^T X^T β
[Figure: contour lines of the linear term for one noise realization; parallel straight lines]
(β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T β
[Figure: contours of the full cost function; the minimum βOLS is displaced from βtrue]
−2 ztrain^T X^T β
[Figure: linear term for a second noise realization]
(β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T β
[Figure: full cost function for the second realization; βOLS again displaced from βtrue]
−2 ztrain^T X^T β
[Figure: linear term for a third noise realization]
−3 −2 −1 0 1 2 3 4 5 6
β[1]
−2
−1
0
1
2
3
4
β[2] βOLSβtrue
0.25
0.25
0.50
0.50
1.00
1.00
2.00
2.00
4.00
4.00
7.00
7.00
Minima for 200 realizations
[Figure: scatter of the minima βOLS over 200 noise realizations, spread around βtrue]
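A sketch reproducing this experiment under the training-data model (σ, dimensions, and βtrue are assumed values): each realization of ztrain yields a different minimum βOLS, and their average lands near βtrue.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, sigma = 2, 40, 1.0
X = rng.standard_normal((p, n))
beta_true = np.array([2.0, 1.0])

estimates = []
for _ in range(200):
    # Fresh noise realization and corresponding training response
    z = sigma * rng.standard_normal(n)
    y_train = X.T @ beta_true + z
    # OLS minimum: (X X^T)^{-1} X y_train
    beta_ols = np.linalg.solve(X @ X.T, X @ y_train)
    estimates.append(beta_ols)
estimates = np.array(estimates)

# The estimates scatter around beta_true
print(estimates.mean(axis=0))   # close to [2, 1]
```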
Minima
βOLS = (XX^T)^{-1} X ytrain
     = (XX^T)^{-1} XX^T βtrue + (XX^T)^{-1} X ztrain
     = βtrue + (XX^T)^{-1} X ztrain
     = βtrue + US^{-1} V^T ztrain

Distribution?
Linear transformations of Gaussian random vectors
Let x be a Gaussian random vector of dimension d with mean µ and covariance matrix Σ

For any matrix A ∈ R^{m×d} and vector b ∈ R^m, y = Ax + b is Gaussian with mean Aµ + b and covariance matrix AΣA^T (as long as AΣA^T is full rank)
Minima
βOLS = (XX^T)^{-1} X ytrain
     = (XX^T)^{-1} XX^T βtrue + (XX^T)^{-1} X ztrain
     = βtrue + (XX^T)^{-1} X ztrain
     = βtrue + US^{-1} V^T ztrain

Distribution? Gaussian with mean βtrue and covariance matrix σ^2 US^{-2} U^T
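The covariance formula σ^2 US^{-2}U^T can be verified by simulation (synthetic X, σ, and βtrue, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, sigma = 2, 40, 1.0
X = rng.standard_normal((p, n))
beta_true = np.array([2.0, 1.0])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Many realizations of beta_OLS = beta_true + (X X^T)^{-1} X z_train
betas = np.array([
    np.linalg.solve(X @ X.T, X @ (X.T @ beta_true + sigma * rng.standard_normal(n)))
    for _ in range(20000)
])

empirical_cov = np.cov(betas.T)
theoretical_cov = sigma ** 2 * U @ np.diag(s ** -2.0) @ U.T
assert np.allclose(empirical_cov, theoretical_cov, atol=3e-3)
```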
Minima
[Figure: Gaussian density of βOLS centered at βtrue, with principal axes c s1^{-1} u1 and c s2^{-1} u2; log-scale color bar from 10^{-15} to 10^{-1}]
Conclusion
Coefficient error of the OLS estimator depends on the singular vectors and singular values of the feature matrix
If singular values are small, error explodes!
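A final sketch illustrating this conclusion: building X = USV^T with a controlled second singular value (all parameter values below are assumed for illustration) shows the coefficient error growing roughly like σ/si as si shrinks.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 40, 0.1
beta_true = np.array([1.0, 1.0])

# Random orthonormal factors for a controlled SVD X = U S V^T
U, _ = np.linalg.qr(rng.standard_normal((2, 2)))
V, _ = np.linalg.qr(rng.standard_normal((n, 2)))

results = {}
for s2 in [1.0, 1e-3]:
    X = U @ np.diag([5.0, s2]) @ V.T   # singular values s1 = 5, s2 varies
    errors = []
    for _ in range(500):
        y = X.T @ beta_true + sigma * rng.standard_normal(n)
        beta_ols = np.linalg.solve(X @ X.T, X @ y)
        errors.append(np.linalg.norm(beta_ols - beta_true))
    results[s2] = np.mean(errors)
    print(s2, results[s2])   # error grows roughly like sigma / s2
```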