OLS Coefficient Analysis
DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science
https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html
Carlos Fernandez-Granda
Prerequisites
Linear algebra (orthogonal matrices, singular-value decomposition)
Ordinary least squares
Goal
Analyze coefficients of OLS estimator
Singular-value decomposition
Every A ∈ R^{m×k}, m ≥ k, has a singular-value decomposition (SVD)

A = [u1 u2 · · · uk] diag(s1, s2, . . . , sk) [v1 v2 · · · vk]^T = USV^T
The singular values s1 ≥ s2 ≥ · · · ≥ sk are nonnegative
The left singular vectors u1, u2, . . . , uk ∈ R^m are orthonormal
The right singular vectors v1, v2, . . . , vk ∈ R^k are orthonormal
Linear maps
The SVD decomposes the action of a matrix A ∈ R^{m×k} on a vector w ∈ R^k into:
1. Rotation

   V^T w = ∑_{i=1}^{k} 〈vi, w〉 ei

2. Scaling

   S V^T w = ∑_{i=1}^{k} si 〈vi, w〉 ei

3. Rotation

   U S V^T w = ∑_{i=1}^{k} si 〈vi, w〉 ui
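The three steps can be verified with a short numpy sketch (the matrix A and vector w below are arbitrary illustrative values, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # A in R^{m x k}, m >= k
w = rng.standard_normal(3)

# Reduced SVD: U is 5x3, s holds s_1 >= s_2 >= s_3, Vt is V^T (3x3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 1 (rotation): coefficients <v_i, w> in the basis e_1, ..., e_k
step1 = Vt @ w

# Step 2 (scaling): multiply each coefficient by s_i
step2 = s * step1

# Step 3 (rotation): recombine using the left singular vectors u_i
step3 = U @ step2

# The composition equals the direct matrix-vector product Aw
assert np.allclose(step3, A @ w)
```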
Linear maps
[Figure: three-step illustration on vectors x and y; V^T rotates them onto the e1, e2 basis, S scales to s1 e1, s2 e2, and U rotates onto s1 u1, s2 u2]
Linear maps (s2 := 0)
[Figure: same three steps with s2 set to zero; the components of x and y along v2 are mapped to 0, so USV^T x = s1 〈v1, x〉 u1]
Regression
Goal: Estimate response (or dependent variable)
Data: Several observed variables, known as features (or covariates, or independent variables)
Data
Training data: (y1, x1), (y2, x2), . . . , (yn, xn), where yi ∈ R and xi ∈ Rp
We define a response vector y ∈ R^n and a feature matrix

X := [x1 x2 · · · xn]
OLS estimator
βOLS = (XX^T)^{-1} X y
     = (US^2 U^T)^{-1} USV^T y
     = US^{-2} U^T USV^T y
     = US^{-1} V^T y

(XX^T)^{-1} X is a left inverse (or pseudoinverse) of X^T because (XX^T)^{-1} XX^T = I
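This chain of identities can be checked numerically; the sketch below uses a synthetic feature matrix and response (assumed values, not the lecture's data):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 50
X = rng.standard_normal((p, n))   # feature matrix, p x n (features as rows)
y = rng.standard_normal(n)

# Direct closed form: (X X^T)^{-1} X y
beta_direct = np.linalg.inv(X @ X.T) @ X @ y

# Via the SVD X = U S V^T: U S^{-1} V^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = U @ ((Vt @ y) / s)

assert np.allclose(beta_direct, beta_svd)

# (X X^T)^{-1} X is a left inverse of X^T
assert np.allclose(np.linalg.inv(X @ X.T) @ X @ X.T, np.eye(p))
```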
Additive model
Model for MSE analysis

y = x^T βtrue + z

Problem: Ignores that the estimate is obtained from training data

Model for training data

ytrain := X^T βtrue + ztrain

▶ Feature matrix X ∈ R^{p×n} is deterministic
▶ Coefficients βtrue ∈ R^p are deterministic
▶ Noise ztrain is an n-dimensional iid Gaussian vector with zero mean and variance σ^2
Maximum likelihood
Under this model, OLS is equivalent to maximum likelihood
Assume we observe ytrain

L_ytrain(β) = (1 / √((2πσ^2)^n)) exp( −(1/(2σ^2)) ||ytrain − X^T β||_2^2 )

βML = arg max_β L_ytrain(β)
    = arg max_β log L_ytrain(β)
    = arg min_β ||ytrain − X^T β||_2^2
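A quick numerical sanity check of this equivalence, on synthetic data (dimensions, σ, and βtrue below are assumed for illustration): the least-squares solution attains the minimum of the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, sigma = 2, 30, 0.5
X = rng.standard_normal((p, n))
beta_true = np.array([1.0, -2.0])
y_train = X.T @ beta_true + sigma * rng.standard_normal(n)

def neg_log_likelihood(beta):
    # -log L up to the additive constant (n/2) log(2 pi sigma^2)
    return np.sum((y_train - X.T @ beta) ** 2) / (2 * sigma ** 2)

# Least-squares solution of min ||y_train - X^T beta||^2
beta_ols, *_ = np.linalg.lstsq(X.T, y_train, rcond=None)

# The OLS solution maximizes the likelihood: any perturbation increases -log L
for _ in range(100):
    perturbed = beta_ols + 0.1 * rng.standard_normal(p)
    assert neg_log_likelihood(beta_ols) <= neg_log_likelihood(perturbed)
```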
Decomposition of OLS cost function
arg min_β ||ytrain − X^T β||_2^2
= arg min_β ||ztrain − X^T(β − βtrue)||_2^2
= arg min_β (β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T (β − βtrue) + ztrain^T ztrain
= arg min_β (β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T β
Quadratic component (β − βtrue)^T XX^T (β − βtrue)

Contour lines are ellipsoids centered at βtrue:

c^2 = (β − βtrue)^T XX^T (β − βtrue) = (β − βtrue)^T US^2 U^T (β − βtrue)
    = ∑_{i=1}^{p} si^2 (ui^T (β − βtrue))^2

Axes of the ellipsoid are the left singular vectors
Stretch in the direction of each axis is determined by the singular values
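The expansion of the quadratic form in terms of singular values can be checked directly (synthetic X and displacement d := β − βtrue, assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 2, 40
X = rng.standard_normal((p, n))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = rng.standard_normal(p)                 # d := beta - beta_true
quad = d @ (X @ X.T) @ d                   # (β − βtrue)^T XX^T (β − βtrue)
via_svd = np.sum(s ** 2 * (U.T @ d) ** 2)  # Σ s_i^2 (u_i^T d)^2
assert np.allclose(quad, via_svd)
```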
(β − βtrue)^T XX^T (β − βtrue)
[Figure: contour lines of the quadratic component over (β[1], β[2]); ellipses centered at βtrue, contour levels from 0.10 to 7.00]
(β − βtrue)^T XX^T (β − βtrue)
[Figure: same contours with the ellipse axes c s1^{-1} u1 and c s2^{-1} u2 marked at βtrue]
−2 ztrain^T X^T β
[Figure: contour lines of the linear term for one noise realization; parallel straight lines]
(β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T β
[Figure: contours of the full cost function; the minimum βOLS is displaced from βtrue]
−2 ztrain^T X^T β
[Figure: linear term for a second noise realization]
(β − βtrue)^T XX^T (β − βtrue) − 2 ztrain^T X^T β
[Figure: full cost function for the second realization; βOLS again displaced from βtrue]
−2 ztrain^T X^T β
[Figure: linear term for a third noise realization]
−3 −2 −1 0 1 2 3 4 5 6
β[1]
−2
−1
0
1
2
3
4
β[2] βOLSβtrue
0.25
0.25
0.50
0.50
1.00
1.00
2.00
2.00
4.00
4.00
7.00
7.00
Minima for 200 realizations
[Figure: scatter of the minima βOLS over 200 noise realizations, spread around βtrue]
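A sketch reproducing this experiment under the training-data model (σ, dimensions, and βtrue are assumed values): each realization of ztrain yields a different minimum βOLS, and their average lands near βtrue.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, sigma = 2, 40, 1.0
X = rng.standard_normal((p, n))
beta_true = np.array([2.0, 1.0])

estimates = []
for _ in range(200):
    # Fresh noise realization and corresponding training response
    z = sigma * rng.standard_normal(n)
    y_train = X.T @ beta_true + z
    # OLS minimum: (X X^T)^{-1} X y_train
    beta_ols = np.linalg.solve(X @ X.T, X @ y_train)
    estimates.append(beta_ols)
estimates = np.array(estimates)

# The estimates scatter around beta_true
print(estimates.mean(axis=0))   # close to [2, 1]
```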
Minima
βOLS = (XX^T)^{-1} X ytrain
     = (XX^T)^{-1} XX^T βtrue + (XX^T)^{-1} X ztrain
     = βtrue + (XX^T)^{-1} X ztrain
     = βtrue + US^{-1} V^T ztrain

Distribution?
Linear transformations of Gaussian random vectors
Let x be a Gaussian random vector of dimension d with mean µ and covariance matrix Σ

For any matrix A ∈ R^{m×d} and vector b ∈ R^m, y = Ax + b is Gaussian with mean Aµ + b and covariance matrix AΣA^T (as long as AΣA^T is full rank)
Minima
βOLS = (XX^T)^{-1} X ytrain
     = (XX^T)^{-1} XX^T βtrue + (XX^T)^{-1} X ztrain
     = βtrue + (XX^T)^{-1} X ztrain
     = βtrue + US^{-1} V^T ztrain

Distribution? Gaussian with mean βtrue and covariance matrix σ^2 US^{-2} U^T
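The covariance formula σ^2 US^{-2}U^T can be verified by simulation (synthetic X, σ, and βtrue, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, sigma = 2, 40, 1.0
X = rng.standard_normal((p, n))
beta_true = np.array([2.0, 1.0])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Many realizations of beta_OLS = beta_true + (X X^T)^{-1} X z_train
betas = np.array([
    np.linalg.solve(X @ X.T, X @ (X.T @ beta_true + sigma * rng.standard_normal(n)))
    for _ in range(20000)
])

empirical_cov = np.cov(betas.T)
theoretical_cov = sigma ** 2 * U @ np.diag(s ** -2.0) @ U.T
assert np.allclose(empirical_cov, theoretical_cov, atol=3e-3)
```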
Minima
[Figure: Gaussian density of βOLS centered at βtrue, with principal axes c s1^{-1} u1 and c s2^{-1} u2; log-scale color bar from 10^{-15} to 10^{-1}]
Conclusion
Coefficient error of the OLS estimator depends on the singular vectors and singular values of the feature matrix
If singular values are small, error explodes!
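A final sketch illustrating this conclusion: building X = USV^T with a controlled second singular value (all parameter values below are assumed for illustration) shows the coefficient error growing roughly like σ/si as si shrinks.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 40, 0.1
beta_true = np.array([1.0, 1.0])

# Random orthonormal factors for a controlled SVD X = U S V^T
U, _ = np.linalg.qr(rng.standard_normal((2, 2)))
V, _ = np.linalg.qr(rng.standard_normal((n, 2)))

results = {}
for s2 in [1.0, 1e-3]:
    X = U @ np.diag([5.0, s2]) @ V.T   # singular values s1 = 5, s2 varies
    errors = []
    for _ in range(500):
        y = X.T @ beta_true + sigma * rng.standard_normal(n)
        beta_ols = np.linalg.solve(X @ X.T, X @ y)
        errors.append(np.linalg.norm(beta_ols - beta_true))
    results[s2] = np.mean(errors)
    print(s2, results[s2])   # error grows roughly like sigma / s2
```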