
  • The Singular Value Decomposition

    DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science

    Carlos Fernandez-Granda

    https://cims.nyu.edu/~cfgranda/pages/MTDS_spring19/index.html

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Dimensionality reduction

    Data with a large number of features can be difficult to analyze

    Data modeled as vectors in Rp (p very large)

    Aim: Reduce dimensionality of representation

    Intuition: Directions with very little variation are useless

    Problem: How to find directions of maximum/minimum variation

  • Regression

    The aim is to learn a function h that relates

    - a response or dependent variable y

    - to several observed variables ~x ∈ Rp, known as covariates, features or independent variables

    The response is assumed to be of the form

    y ≈ h (~x)

  • Linear regression

    The regression function h is assumed to be linear

    y (i) ≈ ~x (i)T ~β + β0, 1 ≤ i ≤ n

    We estimate ~β ∈ Rp from training dataset

    Strain := {(y(1), ~x(1)), (y(2), ~x(2)), . . . , (y(n), ~x(n))}

    A crucial question is how well we do on held-out test data

  • Linear regression

    [Figure: scatter plot of height (inches) vs. weight (pounds) with a fitted linear regression model]

  • Collaborative filtering

    Quantity y [i , j ] depends on indices i and j

    We observe examples and want to predict new instances

    For example, y[i, j] is the rating given to movie i by user j

  • Collaborative filtering

    [Figure: partially observed ratings matrix with missing entries marked "?"; figure courtesy of Mahdi Soltanolkotabi]

  • Rank-1 bilinear model

    - Some movies are more popular in general
    - Some users are more generous in general

    y[i, j] ≈ a[i] b[j]

    - a[i] quantifies popularity of movie i
    - b[j] quantifies generosity of user j

  • Rank-r bilinear model

    Certain people like certain movies: r factors

    y[i, j] ≈ ∑_{l=1}^{r} al[i] bl[j]

    For each factor l

    - al[i]: movie i is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l
    - bl[j]: user j is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l

    Singular-value decomposition can be used to fit the model

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Singular value decomposition

    Every rank-r real matrix A ∈ Rm×n has a singular-value decomposition (SVD) of the form

    A = U S V T,   where U = [~u1 ~u2 · · · ~ur],   S = diag(s1, s2, . . . , sr),   V = [~v1 ~v2 · · · ~vr]

  • Singular value decomposition

    - The singular values s1 ≥ s2 ≥ · · · ≥ sr are positive real numbers

    - The left singular vectors ~u1, ~u2, . . . , ~ur form an orthonormal set

    - The right singular vectors ~v1, ~v2, . . . , ~vr also form an orthonormal set

    - The SVD is unique if all the singular values are different

    - If si = si+1 = . . . = si+k, then ~ui, . . . , ~ui+k can be replaced by any orthonormal basis of their span (the same holds for ~vi, . . . , ~vi+k)

    - The SVD of an m × n matrix with m ≥ n can be computed in O(mn²)
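    A quick numerical sanity check of these properties (not part of the original slides; a sketch assuming NumPy and a synthetic matrix):

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # rank-3 matrix in R^{6x4}

        # Thin SVD: U is 6x4, s has 4 entries, Vt is 4x4
        U, s, Vt = np.linalg.svd(A, full_matrices=False)

        r = int(np.sum(s > 1e-10 * s[0]))                # numerical rank
        print("numerical rank:", r)                      # 3
        print("singular values:", s)                     # nonincreasing, last one ~ 0

        # Keep only the nonzero singular values: A = U_r diag(s_r) V_r^T
        Ur, sr, Vtr = U[:, :r], s[:r], Vt[:r, :]
        print(np.allclose(A, Ur @ np.diag(sr) @ Vtr))    # True

        # Left and right singular vectors are orthonormal
        print(np.allclose(Ur.T @ Ur, np.eye(r)))         # True
        print(np.allclose(Vtr @ Vtr.T, np.eye(r)))       # True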

  • Column and row space

    - The left singular vectors ~u1, ~u2, . . . , ~ur are a basis for the column space
    - The right singular vectors ~v1, ~v2, . . . , ~vr are a basis for the row space

    Proof:

    span(~u1, . . . , ~ur) ⊆ col(A)   because ~ui = A(si−1 ~vi)

    col(A) ⊆ span(~u1, . . . , ~ur)   because A:i = U(S V T ~ei)

  • Singular value decomposition

    A = [ ~u1 · · · ~ur   ~ur+1 · · · ~un ] S [ ~v1 · · · ~vr   ~vr+1 · · · ~vn ]T

    where S contains s1, . . . , sr on its diagonal and zeros elsewhere,
    ~u1, . . . , ~ur is a basis of range(A),
    ~v1, . . . , ~vr is a basis of row(A), and
    ~vr+1, . . . , ~vn is a basis of null(A)

  • Linear maps

    The SVD decomposes the action of a matrix A ∈ Rm×n on a vector ~x ∈ Rn into:

    1. Rotation:   V T~x = ∑_{i=1}^{n} 〈~vi, ~x〉 ~ei

    2. Scaling:    S V T~x = ∑_{i=1}^{n} si 〈~vi, ~x〉 ~ei

    3. Rotation:   U S V T~x = ∑_{i=1}^{n} si 〈~vi, ~x〉 ~ui

  • Linear maps

    [Figure: vectors ~x and ~y mapped successively by V T (rotation onto ~e1, ~e2), then S (scaling to s1~e1, s2~e2), then U (rotation onto s1~u1, s2~u2)]

  • Linear maps (s2 := 0)

    [Figure: the same mapping with s2 := 0, so the ~v2 component collapses to ~0 and the image lies along s1~u1]

  • Singular values

    The singular values satisfy

    s1 = max_{||~x||2=1, ~x∈Rn} ||A~x||2 = max_{||~y||2=1, ~y∈Rm} ||AT~y||2

    si = max_{||~x||2=1, ~x∈Rn, ~x⊥~v1,...,~vi−1} ||A~x||2
       = max_{||~y||2=1, ~y∈Rm, ~y⊥~u1,...,~ui−1} ||AT~y||2,   2 ≤ i ≤ min{m, n}

  • Singular vectors

    The right singular vectors satisfy

    ~v1 = argmax_{||~x||2=1, ~x∈Rn} ||A~x||2

    ~vi = argmax_{||~x||2=1, ~x∈Rn, ~x⊥~v1,...,~vi−1} ||A~x||2,   2 ≤ i ≤ min{m, n}

    and the left singular vectors satisfy

    ~u1 = argmax_{||~y||2=1, ~y∈Rm} ||AT~y||2

    ~ui = argmax_{||~y||2=1, ~y∈Rm, ~y⊥~u1,...,~ui−1} ||AT~y||2,   2 ≤ i ≤ min{m, n}

  • Proof

    ||A~vi||2² = 〈∑_{k=1}^{n} sk 〈~vk, ~vi〉 ~uk , ∑_{k=1}^{n} sk 〈~vk, ~vi〉 ~uk〉

             = ∑_{k=1}^{n} sk² 〈~vk, ~vi〉²

             = si²

  • Proof

    Still need to prove that no other vector achieves a larger value

    Consider ~x ∈ Rn such that ||~x||2 = 1 and, for a fixed 1 ≤ i ≤ n,

    ~x ⊥ ~v1, . . . , ~vi−1

    We decompose ~x into

    ~x = ∑_{j=i}^{n} αj ~vj + P_row(A)⊥ ~x

    where 1 = ||~x||2² ≥ ∑_{j=i}^{n} αj²

  • Proof

    ||A~x||2² = 〈∑_{k=1}^{n} sk 〈~vk, ~x〉 ~uk , ∑_{k=1}^{n} sk 〈~vk, ~x〉 ~uk〉

             = ∑_{k=1}^{n} sk² 〈~vk, ~x〉²                          because ~u1, . . . , ~un are orthonormal

             = ∑_{k=1}^{n} sk² 〈~vk, ∑_{j=i}^{n} αj ~vj + P_row(A)⊥ ~x〉²

             = ∑_{j=i}^{n} sj² αj²                                  because ~v1, . . . , ~vn are orthonormal

             ≤ si² ∑_{j=i}^{n} αj²                                  because si ≥ si+1 ≥ . . . ≥ sn

             ≤ si²

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Quantifying directional variation

    Goal: Quantify variation of a dataset embedded in a p-dimensional space

    Two possible perspectives:

    - Probabilistic: The data are samples from a p-dimensional random vector ~x

    - Geometric: The data are just a cloud of points in Rp

  • Probabilistic perspective (p = 1)

    A natural quantifier of variation is the variance of the distribution

    Var(x) := E((x − E(x))²)

  • Geometric perspective (p = 1)

    A natural quantifier of variation is the sample variance of the data

    var(x1, x2, . . . , xn) := (1 / (n − 1)) ∑_{i=1}^{n} (xi − av(x1, x2, . . . , xn))²

    av(x1, x2, . . . , xn) := (1/n) ∑_{i=1}^{n} xi

  • Connection

    Sample variance is an estimator of variance

    If {x1, x2, . . . , xn} have mean µ and variance σ2,

    E (av (x1, x2, . . . , xn)) = µ

    E (var (x1, x2, . . . , xn)) = σ2

    by linearity of expectation, so the estimator is unbiased

  • Connection

    If the samples are independent, again by linearity of expectation,

    E((av(x1, x2, . . . , xn) − µ)²) = σ²/n

    If the fourth moment is bounded,

    E((var(x1, x2, . . . , xn) − σ²)²) also scales as 1/n

  • Probabilistic perspective (p > 1)

    1. Center data by subtracting mean

    2. Variation in direction ~v can be quantified by Var(~v T~x)

  • Geometric perspective (p > 1)

    1. Center data by subtracting average

    2. Variation in direction ~v quantified by var(~v T~x1, ~v T~x2, . . . , ~v T~xn)

  • Covariance

    The covariance of two random variables x and y is

    Cov (x, y) := E ((x− E (x)) (y − E (y)))

  • Covariance matrix

    The covariance matrix of a random vector ~x is defined as

    Σ~x :=  [ Var(~x[1])          Cov(~x[1], ~x[2])   · · ·   Cov(~x[1], ~x[p])
              Cov(~x[2], ~x[1])   Var(~x[2])          · · ·   Cov(~x[2], ~x[p])
                 ···                  ···             · · ·       ···
              Cov(~x[p], ~x[1])   Cov(~x[p], ~x[2])   · · ·   Var(~x[p]) ]

         = E(~x ~xT) − E(~x) E(~x)T

    If the covariance matrix is diagonal, the entries are uncorrelated

  • Covariance matrix after a linear transformation

    For any matrix A ∈ Rm×n and n-dimensional random vector ~x, the covariance matrix of A~x equals

    ΣA~x = A Σ~x AT

    Proof:

    ΣA~x = E((A~x)(A~x)T) − E(A~x) E(A~x)T

         = A (E(~x ~xT) − E(~x) E(~x)T) AT

         = A Σ~x AT
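    A brief simulation check of this identity (my own sketch; it assumes a Gaussian ~x with a made-up covariance matrix):

        import numpy as np

        rng = np.random.default_rng(1)

        # Assumed ground-truth covariance of a 3-dimensional random vector ~x
        Sigma_x = np.array([[2.0, 0.5, 0.0],
                            [0.5, 1.0, 0.3],
                            [0.0, 0.3, 0.7]])
        A = rng.standard_normal((2, 3))

        # Sample ~x ~ N(0, Sigma_x) and apply the linear map A
        x = rng.multivariate_normal(np.zeros(3), Sigma_x, size=200_000)   # (200000, 3)
        Ax = x @ A.T                                                      # (200000, 2)

        empirical = np.cov(Ax, rowvar=False)   # sample covariance of A~x
        predicted = A @ Sigma_x @ A.T          # A Sigma_x A^T

        print(np.round(empirical, 3))
        print(np.round(predicted, 3))          # the two should agree up to sampling error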

  • Corollary: Variance in a fixed direction

    The variance of ~x in the direction of a unit-norm vector ~v equals

    Var(~v T~x) = ~v T Σ~x ~v

    Covariance matrix captures variance in every direction!

    Also, covariance matrices are positive semidefinite, i.e. for any ~v ,

    ~v TΣ~x~v ≥ 0

  • SVD of covariance matrix

    Because covariance matrices are symmetric and positive semidefinite

    Σ~x = U S UT = [~u1 ~u2 · · · ~un] diag(s1, s2, . . . , sn) [~u1 ~u2 · · · ~un]T

    so the left and right singular vectors coincide

  • Directions of maximum variance

    The SVD of the covariance matrix Σ~x of a random vector ~x satisfies

    s1 = max_{||~v||2=1} Var(~v T~x)

    ~u1 = argmax_{||~v||2=1} Var(~v T~x)

    sk = max_{||~v||2=1, ~v⊥~u1,...,~uk−1} Var(~v T~x)

    ~uk = argmax_{||~v||2=1, ~v⊥~u1,...,~uk−1} Var(~v T~x)

  • Proof

    Var(~v T~x) = ~v T Σ~x ~v = ~v T U S UT ~v = ||√S UT ~v||2²

  • Singular values

    The singular values of A satisfy

    s1 = max_{||~x||2=1, ~x∈Rn} ||A~x||2 = max_{||~y||2=1, ~y∈Rp} ||AT~y||2

    si = max_{||~x||2=1, ~x∈Rn, ~x⊥~v1,...,~vi−1} ||A~x||2
       = max_{||~y||2=1, ~y∈Rp, ~y⊥~u1,...,~ui−1} ||AT~y||2,   2 ≤ i ≤ min{m, n}

  • Singular vectors

    The left singular vectors of A satisfy

    ~u1 = argmax_{||~y||2=1, ~y∈Rp} ||AT~y||2

    ~ui = argmax_{||~y||2=1, ~y∈Rp, ~y⊥~u1,...,~ui−1} ||AT~y||2,   2 ≤ i ≤ n

  • Directions of maximum variance

    [Figure: three 2D datasets with their directions of maximum variance; panel values √s1 = 1.22, √s2 = 0.71; √s1 = 1, √s2 = 1; √s1 = 1.38, √s2 = 0.32]

  • Sample covariance

    For a data set (x1, y1), (x2, y2), . . . , (xn, yn)

    cov((x1, y1), . . . , (xn, yn)) := (1 / (n − 1)) ∑_{i=1}^{n} (xi − av(x1, . . . , xn)) (yi − av(y1, . . . , yn))

    If (x1, y1), (x2, y2), . . . , (xn, yn) are iid samples of x and y,

    E(cov((x1, y1), . . . , (xn, yn))) = Cov(x, y)

  • Sample covariance matrix

    The sample covariance matrix of {~x1, ~x2, . . . , ~xn} ⊂ Rp is

    Σ(~x1, . . . , ~xn) := (1 / (n − 1)) ∑_{i=1}^{n} (~xi − av(~x1, . . . , ~xn)) (~xi − av(~x1, . . . , ~xn))T

    av(~x1, ~x2, . . . , ~xn) := (1/n) ∑_{i=1}^{n} ~xi

    Σ(~x1, . . . , ~xn)ij = var(~x1[i], . . . , ~xn[i])                          if i = j
                          = cov((~x1[i], ~x1[j]), . . . , (~xn[i], ~xn[j]))      if i ≠ j
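    As a small check (not from the slides), the definition above matches NumPy's np.cov, which uses the same 1/(n − 1) normalization:

        import numpy as np

        rng = np.random.default_rng(2)
        X = rng.standard_normal((100, 3))          # n = 100 samples in R^3, one per row

        # Sample covariance matrix from the definition on this slide
        avg = X.mean(axis=0)
        C = (X - avg).T @ (X - avg) / (X.shape[0] - 1)

        print(np.allclose(C, np.cov(X, rowvar=False)))   # True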

  • Sample covariance converges to true covariance

    [Figure: sample vs. true covariance ellipses for n = 5, n = 20, n = 100; the sample covariance approaches the true covariance as n grows]

  • Variation in a certain direction

    For a unit vector ~v ∈ Rp

    var(~v T~x1, . . . , ~v T~xn) = (1 / (n − 1)) ∑_{i=1}^{n} (~v T~xi − av(~v T~x1, . . . , ~v T~xn))²

                                 = (1 / (n − 1)) ∑_{i=1}^{n} (~v T (~xi − av(~x1, . . . , ~xn)))²

                                 = ~v T ((1 / (n − 1)) ∑_{i=1}^{n} (~xi − av(~x1, . . . , ~xn)) (~xi − av(~x1, . . . , ~xn))T) ~v

                                 = ~v T Σ(~x1, . . . , ~xn) ~v

    Sample covariance matrix captures sample variance in every direction!

  • Principal component analysis

    Given data vectors ~x1, ~x2, . . . , ~xn ∈ Rd, compute the SVD of their sample covariance matrix

    The left singular vectors are the principal directions

    The principal values are the coefficients of the centered vectors in the basis of principal directions

  • Equivalently

    Center the data

    ~ci = ~xi − av(~x1, ~x2, . . . , ~xn),   1 ≤ i ≤ n

    Compute the SVD of the matrix

    C = [~c1 ~c2 · · · ~cn]

    The result is the same because the sample covariance matrix equals (1 / (n − 1)) C CT
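    A minimal NumPy sketch of this procedure (illustrative only; the function name and the synthetic data are my own):

        import numpy as np

        def pca(X):
            """PCA of data X with one sample per row (n x p).

            Returns the principal directions (columns of U), the singular values
            of the centered matrix, and the coefficients of each centered sample
            in the basis of principal directions."""
            C = X - X.mean(axis=0)                   # center the data
            # SVD of C^T (p x n), so the left singular vectors live in R^p
            U, s, _ = np.linalg.svd(C.T, full_matrices=False)
            coeffs = C @ U                           # coordinates in the principal-direction basis
            return U, s, coeffs

        rng = np.random.default_rng(3)
        X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [0.8, 0.3]])
        U, s, coeffs = pca(X)

        # Directional sample variance along the i-th principal direction equals
        # s_i**2 / (n - 1), where s_i is a singular value of the centered matrix
        print(s**2 / (X.shape[0] - 1))
        print(coeffs.var(axis=0, ddof=1))            # should match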

  • Directions of maximum variance

    The principal directions satisfy

    ~u1 = argmax_{||~v||2=1, ~v∈Rp} var(~v T~x1, . . . , ~v T~xn)

    ~ui = argmax_{||~v||2=1, ~v∈Rp, ~v⊥~u1,...,~ui−1} var(~v T~x1, . . . , ~v T~xn),   2 ≤ i ≤ k

  • Directions of maximum variance

    The associated singular values satisfy

    s1 / (n − 1) = max_{||~v||2=1, ~v∈Rp} var(~v T~x1, . . . , ~v T~xn)

    si / (n − 1) = max_{||~v||2=1, ~v∈Rp, ~v⊥~u1,...,~ui−1} var(~v T~x1, . . . , ~v T~xn),   2 ≤ i ≤ k

  • Proof

    For any vector ~v

    var(~v T~x1, . . . , ~v

    T~xn)

    = ~v TΣ (~x1, . . . , ~xn) ~v

  • Singular values

    The singular values of A satisfy

    s1 = max_{||~x||2=1, ~x∈Rn} ||A~x||2 = max_{||~y||2=1, ~y∈Rp} ||AT~y||2

    si = max_{||~x||2=1, ~x∈Rn, ~x⊥~v1,...,~vi−1} ||A~x||2
       = max_{||~y||2=1, ~y∈Rp, ~y⊥~u1,...,~ui−1} ||AT~y||2,   2 ≤ i ≤ min{m, n}

  • Singular vectors

    The left singular vectors of A satisfy

    ~u1 = argmax_{||~y||2=1, ~y∈Rp} ||AT~y||2

    ~ui = argmax_{||~y||2=1, ~y∈Rp, ~y⊥~u1,...,~ui−1} ||AT~y||2,   2 ≤ i ≤ n

  • PCA in 2D

    [Figure: three 2D datasets with principal directions ~u1, ~u2; panel values √(s1/(n − 1)) = 0.705, √(s2/(n − 1)) = 0.690; √(s1/(n − 1)) = 0.983, √(s2/(n − 1)) = 0.356; √(s1/(n − 1)) = 1.349, √(s2/(n − 1)) = 0.144]

  • PCA of faces

    Data set of 400 64× 64 images from 40 subjects (10 per subject)

    Each face is vectorized and interpreted as a vector in R4096

  • PCA of faces

    [Figure: the center face and principal directions, with the corresponding √(si/(n − 1)):
     PD 1–5: 330, 251, 192, 152, 130;
     PD 10, 15, 20, 30, 40, 50: 90.2, 70.8, 58.7, 45.1, 36.0, 30.8;
     PD 100, 150, 200, 250, 300, 359: 19.0, 13.7, 10.3, 8.01, 6.14, 3.06]

  • Projection onto first 7 principal directions

    [Figure: a face approximated as 8613 × Center − 2459 × PD 1 + 665 × PD 2 − 180 × PD 3 + 301 × PD 4 + 566 × PD 5 + 638 × PD 6 + 403 × PD 7]

  • Projection onto first k principal directions

    [Figure: a face reconstructed from its projection onto the first 5, 10, 20, 30, 50, 100, 150, 200, 250, 300, and 359 principal directions, compared with the original signal]

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Dimensionality reduction

    Data with a large number of features can be difficult to analyze or process

    Dimensionality reduction is a useful preprocessing step

    If data are modeled as vectors in Rp, we can reduce the dimension by projecting onto Rk, where k < p

    For orthogonal projections, the new representation is 〈~b1, ~x〉, 〈~b2, ~x〉, . . . , 〈~bk, ~x〉 for a basis ~b1, . . . , ~bk of the subspace that we project on

    Problem: How do we choose the subspace?

  • Optimal subspace for orthogonal projection

    Given a set of vectors ~a1, ~a2, . . . , ~an and a fixed dimension k ≤ n, the SVD of

    A := [~a1 ~a2 · · · ~an] ∈ Rm×n

    provides the k-dimensional subspace that captures the most energy:

    ∑_{i=1}^{n} ||P_span(~u1,~u2,...,~uk) ~ai||2² ≥ ∑_{i=1}^{n} ||PS ~ai||2²

    for any subspace S of dimension k
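    A small numerical illustration of this optimality claim (a sketch with synthetic vectors, not part of the slides): the span of the first k left singular vectors captures at least as much energy as randomly drawn k-dimensional subspaces.

        import numpy as np

        rng = np.random.default_rng(4)
        A = rng.standard_normal((8, 30))           # columns ~a_1, ..., ~a_30 in R^8
        k = 3

        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Uk = U[:, :k]                              # orthonormal basis of the optimal subspace

        def captured_energy(B, A):
            """Sum of squared norms of the projections of A's columns onto col(B) (B has orthonormal columns)."""
            return np.sum((B.T @ A) ** 2)

        best = captured_energy(Uk, A)
        for _ in range(5):
            Q, _ = np.linalg.qr(rng.standard_normal((8, k)))   # random k-dimensional subspace
            assert captured_energy(Q, A) <= best + 1e-9
        print("optimal energy:", best, "=", np.sum(s[:k] ** 2))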

  • Proof

    Because ~u1, ~u2, . . . , ~uk are orthonormal,

    ∑_{i=1}^{n} ||P_span(~u1,~u2,...,~uk) ~ai||2² = ∑_{i=1}^{n} ∑_{j=1}^{k} 〈~uj, ~ai〉² = ∑_{j=1}^{k} ||AT ~uj||2²

    We proceed by induction on k

    The base case k = 1 follows from

    ~u1 = argmax_{||~y||2=1, ~y∈Rm} ||AT~y||2

  • Proof

    Let S be a subspace of dimension k

    S ∩ span(~u1, . . . , ~uk−1)⊥ contains a nonzero vector ~b:

    if V has dimension n, S1, S2 ⊆ V and dim(S1) + dim(S2) > n, then dim(S1 ∩ S2) ≥ 1

    There exists an orthonormal basis ~b1, ~b2, . . . , ~bk for S such that ~bk := ~b is orthogonal to ~u1, ~u2, . . . , ~uk−1

  • Induction hypothesis

    ∑_{i=1}^{k−1} ||AT ~ui||2² = ∑_{i=1}^{n} ||P_span(~u1,~u2,...,~uk−1) ~ai||2²

                               ≥ ∑_{i=1}^{n} ||P_span(~b1,~b2,...,~bk−1) ~ai||2²

                               = ∑_{i=1}^{k−1} ||AT ~bi||2²

  • Proof

    Recall that

    ~uk = argmax_{||~y||2=1, ~y∈Rm, ~y⊥~u1,...,~uk−1} ||AT~y||2

    which implies ||AT ~uk||2² ≥ ||AT ~bk||2²

    We conclude

    ∑_{i=1}^{n} ||P_span(~u1,~u2,...,~uk) ~ai||2² = ∑_{i=1}^{k} ||AT ~ui||2² ≥ ∑_{i=1}^{k} ||AT ~bi||2² = ∑_{i=1}^{n} ||PS ~ai||2²

  • Nearest-neighbor classification

    Training set of points and labels {~x1, l1}, . . . , {~xn, ln}

    To classify a new data point ~y , find

    i∗ := argmin_{1≤i≤n} ||~y − ~xi||2

    and assign li∗ to ~y

    Cost: O (mnp) to classify m new points

  • Nearest neighbors in principal-component space

    Idea: Project onto the first k principal directions beforehand

    Cost:

    - O(p²n), if p < n, to compute the principal directions
    - knp operations to project the training set
    - kmp operations to project the test set
    - kmn to perform nearest-neighbor classification

    Faster if m > p
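    A compact sketch of this pipeline (illustrative; the helper names, shapes and synthetic data are assumptions of mine, not the course code):

        import numpy as np

        def pca_directions(X_train, k):
            """First k principal directions of the training data (one sample per row)."""
            C = X_train - X_train.mean(axis=0)
            U, _, _ = np.linalg.svd(C.T, full_matrices=False)
            return U[:, :k]                                   # p x k

        def nearest_neighbor_labels(X_train, labels, X_test, k):
            U = pca_directions(X_train, k)
            center = X_train.mean(axis=0)
            P_train = (X_train - center) @ U                  # n x k projected training set
            P_test = (X_test - center) @ U                    # m x k projected test set
            # For each test point, the index of the closest projected training point
            d = ((P_test[:, None, :] - P_train[None, :, :]) ** 2).sum(axis=-1)
            return labels[np.argmin(d, axis=1)]

        # Tiny synthetic usage example with two well-separated classes
        rng = np.random.default_rng(5)
        X_train = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(3, 1, (50, 20))])
        labels = np.array([0] * 50 + [1] * 50)
        X_test = np.vstack([rng.normal(0, 1, (5, 20)), rng.normal(3, 1, (5, 20))])
        print(nearest_neighbor_labels(X_train, labels, X_test, k=5))   # likely [0 0 0 0 0 1 1 1 1 1]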

  • Face recognition

    Training set: 360 64× 64 images from 40 different subjects (9 each)

    Test set: 1 new image from each subject

    We model each image as a vector in R4096 (m = 4096)

    To classify we:

    1. Project onto first k principal directions

    2. Apply nearest-neighbor classification using the ℓ2-norm distance in Rk

  • Performance

    [Figure: number of classification errors vs. number of principal components used (0 to 100)]

  • Nearest neighbor in R41

    [Figure: a test image, its projection, the closest projection among the training images, and the corresponding training image]

  • Dimensionality reduction for visualization

    Motivation: Visualize high-dimensional features projected onto 2D or 3D

    Example:

    Seeds from three different varieties of wheat: Kama, Rosa and Canadian

    Features:

    - Area
    - Perimeter
    - Compactness
    - Length of kernel
    - Width of kernel
    - Asymmetry coefficient
    - Length of kernel groove

  • Projection onto two first PDs

    [Figure: scatter plot of the seed data projected onto the first and second principal components]

  • Projection onto two last PDs

    [Figure: scatter plot of the seed data projected onto the (d−1)th and dth principal components]

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Regression

    The aim is to learn a function h that relates

    - a response or dependent variable y

    - to several observed variables ~x ∈ Rp, known as covariates, features or independent variables

    The response is assumed to be of the form

    y ≈ h (~x)

  • Linear regression

    The regression function h is assumed to be linear

    y (i) ≈ ~x (i)T ~β + β0, 1 ≤ i ≤ n

    We estimate ~β ∈ Rp from training dataset

    Strain := {(y(1), ~x(1)), (y(2), ~x(2)), . . . , (y(n), ~x(n))}

  • Linear regression

    In matrix form,

        [ y(1) ]     [ ~x(1)[1]   ~x(1)[2]   · · ·   ~x(1)[p] ] [ ~β[1] ]
        [ y(2) ]  ≈  [ ~x(2)[1]   ~x(2)[2]   · · ·   ~x(2)[p] ] [ ~β[2] ]
        [  ···  ]    [    ···        ···     · · ·      ···   ] [  ···  ]
        [ y(n) ]     [ ~x(n)[1]   ~x(n)[2]   · · ·   ~x(n)[p] ] [ ~β[p] ]

    Equivalently,

    ~y ≈ X ~β

  • Linear model for GDP

    State           GDP (millions)   Population   Unemployment rate
    North Dakota    52 089           757 952      2.4
    Alabama         204 861          4 863 300    3.8
    Mississippi     107 680          2 988 726    5.2
    Arkansas        120 689          2 988 248    3.5
    Kansas          153 258          2 907 289    3.8
    Georgia         525 360          10 310 371   4.5
    Iowa            178 766          3 134 693    3.2
    West Virginia   73 374           1 831 102    5.1
    Kentucky        197 043          4 436 974    5.2
    Tennessee       ???              6 651 194    3.0

  • Centering

    ~ycent = [−127 147   25 625   −71 556   −58 547   −25 978   346 124   −470   −105 862   17 807]T

    Xcent =
        −3 044 121   −1.68
         1 061 227   −0.28
          −813 346    1.12
          −813 825   −0.58
          −894 784   −0.28
         6 508 298    0.42
          −667 379   −0.88
        −1 970 971    1.02
           634 901    1.12

    av(~y) = 179 236      av(X) = [3 802 073   4.1]

  • Normalizing

    ~ynorm = [−0.321   0.065   −0.180   −0.148   −0.065   0.872   −0.001   −0.267   0.045]T

    Xnorm =
        −0.394   −0.600
         0.137   −0.099
        −0.105    0.401
        −0.105   −0.207
        −0.116   −0.099
         0.843    0.151
        −0.086   −0.314
        −0.255    0.366
         0.082    0.401

    std(~y) = 396 701      std(X) = [7 720 656   2.80]

  • Linear model for GDP

    Aim: find ~β ∈ R2 such that ~ynorm ≈ Xnorm ~β

    The estimate for the GDP of Tennessee will be

    ~yTen = av(~y) + std(~y) 〈~xTen_norm, ~β〉

    where ~xTen_norm is the Tennessee feature vector, centered using av(X) and normalized using std(X)

  • Least squares

    For fixed ~β we can evaluate the error using

    ∑_{i=1}^{n} (y(i) − ~x(i)T ~β)² = ||~y − X~β||2²

    The least-squares estimate ~βLS minimizes this cost function:

    ~βLS := argmin_~β ||~y − X~β||2 = (XTX)−1 XT~y

    if X is full rank and n ≥ p
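    A short sketch (synthetic data, not the course code) computing the least-squares estimate from the normal equations and, equivalently, from the SVD as derived on the next slides:

        import numpy as np

        rng = np.random.default_rng(6)
        n, p = 50, 3
        X = rng.standard_normal((n, p))
        beta_true = np.array([1.0, -2.0, 0.5])
        y = X @ beta_true + 0.1 * rng.standard_normal(n)

        # Normal equations: (X^T X)^{-1} X^T y  (fine when X is full rank and n >= p)
        beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

        # Via the SVD: beta_LS = V S^{-1} U^T y
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        beta_svd = Vt.T @ ((U.T @ y) / s)

        print(np.allclose(beta_ne, beta_svd))   # True
        print(beta_svd)                         # close to beta_true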

  • Least-squares solution

    Let X = U S V T and write

    ~y = U UT~y + (I − U UT) ~y

    By the Pythagorean theorem

    ||~y − X~β||2² = ||(I − U UT) ~y||2² + ||U UT~y − X~β||2²

    so

    argmin_~β ||~y − X~β||2² = argmin_~β ||U UT~y − X~β||2²

                             = argmin_~β ||U UT~y − U S V T~β||2²

                             = argmin_~β ||UT~y − S V T~β||2²

                             = V S−1 UT~y = (XTX)−1 XT~y

  • Linear model for GDP

    The least-squares estimate is

    ~βLS = [1.019   −0.111]T

    GDP seems to be proportional to population and inversely proportional to unemployment

  • Linear model for GDP

    State           GDP       Estimate
    North Dakota    52 089    46 241
    Alabama         204 861   239 165
    Mississippi     107 680   119 005
    Arkansas        120 689   145 712
    Kansas          153 258   136 756
    Georgia         525 360   513 343
    Iowa            178 766   158 097
    West Virginia   73 374    59 969
    Kentucky        197 043   194 829
    Tennessee       328 770   345 352

  • Geometric interpretation

    - Any vector X~β is in the span of the columns of X
    - The least-squares estimate is the closest vector to ~y that can be represented in this way
    - This is the projection of ~y onto the column space of X

    X ~βLS = X V S−1 UT~y = U S V T V S−1 UT~y = U UT~y

  • Geometric interpretation

  • Probabilistic interpretation

    Let y be a random scalar with zero mean and ~x a p-dimensional random vector

    Goal: Estimate y as a linear function of ~x

    Criterion: Mean square error

    MSE(~β) := E((y − ~xT~β)²)

            = E(y²) − 2 E(y~x)T ~β + ~βT E(~x ~xT) ~β

            = Var(y) − 2 ΣTy~x ~β + ~βT Σ~x ~β

  • Probabilistic interpretation

    ∇MSE(~β) = 2 Σ~x ~β − 2 Σy~x

    ~βMMSE = Σ~x−1 Σy~x

    Σ~x ≈ (1/n) XTX          Σy~x ≈ (1/n) XT~y

    ~βMMSE ≈ (XTX)−1 XT~y

  • Strange claim

    I found a cool way to predict the daily temperature in New York: It's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's almost perfect!

  • Temperature prediction via linear regression

    - Dataset of hourly temperatures measured at weather stations all over the US
    - Goal: Predict the temperature in Yosemite from the other temperatures
    - Response: Temperature in Yosemite
    - Features: Temperatures at 133 other stations (p = 133) in 2015
    - Test set: 10³ measurements
    - Additional test set: All measurements from 2016

  • Results

    [Figure: average error (deg Celsius) vs. number of training data (200 to 5000) for the training error, test error, 2016 test error, and a nearest-neighbor baseline]

  • Linear model

    To analyze the performance of the least-squares estimator we assume a linear model with additive noise

    ~ytrain := Xtrain~βtrue + ~ztrain

    The LS estimator equals

    ~βLS := argmin_~β ||~ytrain − Xtrain~β||2

  • Noise model

    We assume that ~ztrain is iid Gaussian with zero mean and variance σ2

    Under this assumption ~βLS is the maximum-likelihood estimate

  • Maximum likelihood estimate

    Likelihood of iid Gaussian samples with mean X~β and variance σ²

    L~y(~β) = (1 / √((2πσ²)n)) exp(−(1 / (2σ²)) ||~y − X~β||2²)

    To find the ML estimate, we maximize the log-likelihood

    ~βML = argmax_~β L~y(~β) = argmax_~β log L~y(~β) = argmin_~β ||~y − X~β||2²

  • Coefficient error

    If the training data follow the linear model and Xtrain is full rank,

    ~βLS − ~βtrue = (XTtrainXtrain)−1 XTtrain ~ytrain − ~βtrue

                 = (XTtrainXtrain)−1 XTtrain (Xtrain~βtrue + ~ztrain) − ~βtrue

                 = ~βtrue + (XTtrainXtrain)−1 XTtrain ~ztrain − ~βtrue

                 = (XTtrainXtrain)−1 XTtrain ~ztrain

  • Training error

    The training error is the projection of the noise onto the orthogonal complement of the column space of Xtrain

    ~ytrain − ~yLS = ~ytrain − Pcol(Xtrain) ~ytrain

                  = Xtrain~βtrue + ~ztrain − Pcol(Xtrain) (Xtrain~βtrue + ~ztrain)

                  = Xtrain~βtrue + ~ztrain − Xtrain~βtrue − Pcol(Xtrain) ~ztrain

                  = Pcol(Xtrain)⊥ ~ztrain

  • Concentration of projection of Gaussian vector

    Let S be a k-dimensional subspace of Rn and ~z ∈ Rn a vector of iid Gaussian noise with variance σ². For any ε ∈ (0, 1),

    σ √(k(1 − ε)) ≤ ||PS ~z||2 ≤ σ √(k(1 + ε))

    with probability at least 1 − 2 exp(−kε²/8)

  • Training error

    The dimension of the orthogonal complement of col(Xtrain) equals n − p, so

    Training RMSE := √(||~ytrain − ~yLS||2² / n) ≈ σ √(1 − p/n)

  • Temperature prediction via linear regression

    [Figure: training and test error (deg Celsius) vs. number of training data, together with the σ√(1 − p/n) and σ√(1 + p/n) reference curves]

  • Overfitting

    When p ≈ n, the training error is very low!

    Lower than error achieved by true coefficients

    Ideal training RMSE := √(||~ytrain − Xtrain~βtrue||2² / n) = √(||~ztrain||2² / n) ≈ σ

    This indicates overfitting of noise

  • Test error

    What we really care about is performance on held-out data

    ytest := 〈~xtest, ~βtrue〉+ ztest

    The least-squares estimate equals

    yLS := 〈~xtest, ~βLS〉

    where ~βLS is computed from the training data

  • Model

    Training and test noise are iid Gaussian with variance σ2

    Test feature vector ~xtest is a zero-mean random vector

    Training and test noise, and test features are all independent

  • Test error

    Test RMSE := √(E((ytest − yLS)²)) ≈ σ √(1 + p/n)

    as long as

    E(~xtest ~xTtest) ≈ (1/n) XTtrainXtrain
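    A simulation sketch checking the σ√(1 − p/n) and σ√(1 + p/n) approximations (my own example; it assumes iid standard Gaussian features and noise, so that E(~xtest ~xTtest) = I ≈ (1/n) XTtrainXtrain):

        import numpy as np

        rng = np.random.default_rng(7)
        n, p, sigma = 200, 50, 1.0
        beta_true = rng.standard_normal(p)

        def one_trial():
            X = rng.standard_normal((n, p))
            y = X @ beta_true + sigma * rng.standard_normal(n)
            beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
            train_rmse = np.sqrt(np.mean((y - X @ beta_ls) ** 2))
            x_test = rng.standard_normal(p)
            y_test = x_test @ beta_true + sigma * rng.standard_normal()
            return train_rmse, (y_test - x_test @ beta_ls) ** 2

        trials = np.array([one_trial() for _ in range(2000)])
        print("train RMSE:", trials[:, 0].mean(), "vs", sigma * np.sqrt(1 - p / n))
        print("test RMSE: ", np.sqrt(trials[:, 1].mean()), "vs", sigma * np.sqrt(1 + p / n))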

  • Temperature prediction via linear regression

    [Figure: training and test error (deg Celsius) vs. number of training data, together with the σ√(1 − p/n) and σ√(1 + p/n) reference curves]

  • Proof

    ytest − yLS = 〈~xtest, ~βtrue〉 + ztest − 〈~xtest, ~βLS〉

                = 〈~xtest, ~βtrue − ~βLS〉 + ztest

                = −〈~xtest, (XTtrainXtrain)−1 XTtrain ~ztrain〉 + ztest

                = −〈~xtest, X†~ztrain〉 + ztest

  • Proof

    E((ytest − yLS)²) = E(〈~xtest, X†~ztrain〉²) + E(ztest²)

                      = E(~xTtest X† ~ztrain ~zTtrain (X†)T ~xtest) + σ²

                      = E(E(~xTtest X† ~ztrain ~zTtrain (X†)T ~xtest | ~xtest)) + σ²

                      = E(~xTtest X† E(~ztrain ~zTtrain) (X†)T ~xtest) + σ²

                      = σ² E(~xTtest X† (X†)T ~xtest) + σ²

  • Proof

    X†(X†)T = (XTtrainXtrain)−1 XTtrainXtrain (XTtrainXtrain)−1 = (XTtrainXtrain)−1

    so

    E((ytest − yLS)²) = σ² E(~xTtest (XTtrainXtrain)−1 ~xtest) + σ²

  • Proof

    E(~xTtest (XTtrainXtrain)−1 ~xtest) = E(tr(~xTtest (XTtrainXtrain)−1 ~xtest))

                                       = E(tr((XTtrainXtrain)−1 ~xtest ~xTtest))

                                       = tr((XTtrainXtrain)−1 E(~xtest ~xTtest))

                                       ≈ (1/n) tr(I) = p/n

  • SVD analysis

    ~βLS − ~βtrue = (XTtrainXtrain)−1 XTtrain ~ztrain = Vtrain S−1train UTtrain ~ztrain = ∑_{i=1}^{p} (〈~ui, ~ztrain〉 / si) ~vi

    ytest − yLS = 〈~xtest, ~βtrue − ~βLS〉 + ztest = ztest − ∑_{i=1}^{p} 〈~xtest, ~vi〉 〈~ui, ~ztrain〉 / si

  • Temperature prediction via linear regression

    [Figure: singular values of the training matrix (log scale, 10⁻³ to 10¹) vs. number of training data]

  • SVD analysis

    If the sample covariance of the training matrix is close to the true covariance matrix,

    E(〈~xtest, ~vi〉²) = E(~vTi ~xtest ~xTtest ~vi)

                      = ~vTi E(~xtest ~xTtest) ~vi

                      ≈ (1/n) ~vTi XTtrainXtrain ~vi

                      = (1/n) ~vTi V S² V T ~vi

                      = si² / n

    Otherwise: noise amplification

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Temperature prediction via linear regression

    [Figure: least-squares coefficients for the stations at Moose, WY; Montrose, CO; and La Junta, CO vs. number of training data]

  • Motivation

    Overfitting is often reflected in large coefficients that cancel out to match the noise

    Solution: Penalize large-norm solutions when fitting the model

    Adding a penalty term to promote a particular structure is called regularization

  • Ridge regression

    For a fixed regularization parameter λ > 0

    ~βridge := argmin_~β ||~y − X~β||2² + λ ||~β||2²

             = (XTX + λI)−1 XT~y

    When λ → 0, ~βridge → ~βLS

    When λ → ∞, ~βridge → 0
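    A minimal sketch (synthetic data; not the course code) comparing the closed form above with the equivalent SVD expression that appears on the later slides:

        import numpy as np

        rng = np.random.default_rng(8)
        n, p, lam, sigma = 100, 20, 5.0, 0.5
        X = rng.standard_normal((n, p))
        beta_true = rng.standard_normal(p)
        y = X @ beta_true + sigma * rng.standard_normal(n)

        # Closed form: (X^T X + lambda I)^{-1} X^T y
        beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

        # Equivalent SVD form: V diag(s_i / (s_i^2 + lambda)) U^T y
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

        print(np.allclose(beta_ridge, beta_svd))   # True
        # Ridge shrinks the coefficients relative to least squares
        beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
        print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ls))   # True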

  • Proof

    ~βridge is the solution to a modified least-squares problem:

    ~βridge = argmin_~β || [~y; 0] − [X; √λ I] ~β ||2²

            = ([X; √λ I]T [X; √λ I])−1 [X; √λ I]T [~y; 0]

            = (XTX + λI)−1 XT~y

    where [X; √λ I] denotes X stacked on top of √λ I

  • Problem

    How to calibrate regularization parameter

    Cannot use coefficient error (we don’t know the true value!)

    Cannot minimize over training data (why?)

    Solution: Check fit on new data

  • Cross validation

    Given a set of examples (y(1), ~x(1)), (y(2), ~x(2)), . . . , (y(n), ~x(n)),

    1. Partition the data into a training set Xtrain ∈ Rntrain×p, ~ytrain ∈ Rntrain and a validation set Xval ∈ Rnval×p, ~yval ∈ Rnval

    2. Fit the model using the training set for every λ in a set Λ

       ~βridge(λ) := argmin_~β ||~ytrain − Xtrain~β||2² + λ ||~β||2²

       and evaluate the fitting error on the validation set

       err(λ) := ||~yval − Xval~βridge(λ)||2²

    3. Choose the value of λ that minimizes the validation-set error

       λcv := argmin_{λ∈Λ} err(λ)
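    A sketch of this hold-out procedure (synthetic data; the split sizes, grid of λ values and helper name are my own choices):

        import numpy as np

        def ridge(X, y, lam):
            return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

        rng = np.random.default_rng(9)
        n, p, sigma = 120, 40, 2.0
        X = rng.standard_normal((n, p))
        beta_true = rng.standard_normal(p) / np.sqrt(p)
        y = X @ beta_true + sigma * rng.standard_normal(n)

        # 1. Partition into a training set and a validation set
        n_train = 80
        X_tr, y_tr = X[:n_train], y[:n_train]
        X_val, y_val = X[n_train:], y[n_train:]

        # 2. Fit for every lambda in a grid and evaluate the validation error
        lambdas = 10.0 ** np.arange(-3, 4)
        errors = [np.sum((y_val - X_val @ ridge(X_tr, y_tr, lam)) ** 2) for lam in lambdas]

        # 3. Choose the lambda that minimizes the validation-set error
        lam_cv = lambdas[int(np.argmin(errors))]
        print("selected lambda:", lam_cv)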

  • Temperature prediction via linear regression (n = 202)

    [Figure: training and validation error (deg Celsius) vs. the regularization parameter λ/n, for n = 202]

  • Temperature prediction via linear regression (n = 202)

    [Figure: ridge-regression coefficients for Moose, WY; Montrose, CO; and La Junta, CO vs. the regularization parameter λ/n, for n = 202]

  • Temperature prediction via linear regression

    [Figure: ridge-regression coefficients for Moose, WY; Montrose, CO; and La Junta, CO vs. number of training data]

  • Temperature prediction via linear regression

    [Figure: cross-validated regularization parameter λ/n (log scale) vs. number of training data n]

  • Temperature prediction via linear regression

    [Figure: average error (deg Celsius) vs. number of training data for least squares (LS) and ridge regression (RR): training error, test error, and 2016 test error for each]

  • Ridge-regression estimator

    If ~ytrain := Xtrain~βtrue + ~ztrain,

    ~βRR = V diag(s1²/(s1² + λ), s2²/(s2² + λ), . . . , sp²/(sp² + λ)) V T ~βtrue
         + V diag(s1/(s1² + λ), s2/(s2² + λ), . . . , sp/(sp² + λ)) UT ~ztrain

    where U S V T is the SVD of Xtrain and s1, . . . , sp are the singular values

  • Proof

    ~βRR = (XTtrainXtrain + λI)−1 XTtrain (Xtrain~βtrue + ~ztrain)

         = (V S² V T + λ V V T)−1 (V S² V T ~βtrue + V S UT ~ztrain)

         = V (S² + λI)−1 V T (V S² V T ~βtrue + V S UT ~ztrain)

         = V (S² + λI)−1 S² V T ~βtrue + V (S² + λI)−1 S UT ~ztrain

  • Coefficient and test error

    ~βRR − ~βtrue = −λ V (S² + λI)−1 V T ~βtrue + V (S² + λI)−1 S UT ~ztrain

    ytest − yRR = 〈~xtest, ~βtrue − ~βRR〉 + ztest

                = ztest + ∑_{i=1}^{p} 〈~xtest, ~vi〉 (λ 〈~βtrue, ~vi〉 − si 〈~ui, ~ztrain〉) / (si² + λ)

    Compare with least squares:

    ytest − yLS = ztest − ∑_{i=1}^{p} 〈~xtest, ~vi〉 〈~ui, ~ztrain〉 / si

  • Motivating applications

    The singular-value decomposition

    Principal component analysis

    Dimensionality reduction

    Linear regression

    Ridge regression

    Low-rank matrix estimation

  • Singular value decomposition

    Every rank-r real matrix A ∈ R^{m×n} has a singular-value decomposition (SVD) of the form

    A = [ ~u1 ~u2 · · · ~ur ] diag(s1, s2, . . . , sr) [ ~v1 ~v2 · · · ~vr ]^T = U S V^T

  • Connection between rank and SVD

    Qualitative connection: rank = number of nonzero singular values

    Quantitative connection: decomposition into rank-1 matrices

    A = U S V^T = Σ_{i=1}^n si ~ui ~vi^T   (with si := 0 for i > r)
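    A minimal numpy sketch (synthetic low-rank matrix, not tied to the examples in these slides) illustrating both connections: the number of nonzero singular values equals the rank, and summing the rank-1 terms si ~ui ~vi^T recovers A.

      import numpy as np

      rng = np.random.default_rng(0)
      m, n, r = 8, 6, 3

      # Build a matrix of rank r by multiplying two random factors
      A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

      U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) V^T

      # Qualitative connection: rank = number of (numerically) nonzero singular values
      num_nonzero = int(np.sum(s > 1e-10 * s[0]))
      assert num_nonzero == np.linalg.matrix_rank(A) == r

      # Quantitative connection: A equals the sum of the rank-1 matrices s_i u_i v_i^T
      A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
      assert np.allclose(A, A_sum)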

  • Vector space of matrices

    Matrices in R^{m×n} form a vector space

    Inner product:

    〈A, B〉 := tr(A^T B),   A, B ∈ R^{m×n}

    Norm:

    ||A||_F := √tr(A^T A) = √( Σ_{i=1}^m Σ_{j=1}^n Aij^2 )
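    A quick numerical check (numpy, random matrices) that the trace inner product and the Frobenius norm defined above agree with the entrywise formulas:

      import numpy as np

      rng = np.random.default_rng(1)
      A = rng.standard_normal((5, 3))
      B = rng.standard_normal((5, 3))

      inner = np.trace(A.T @ B)                 # <A, B> := tr(A^T B)
      assert np.isclose(inner, np.sum(A * B))   # equals the sum of entrywise products

      fro = np.sqrt(np.trace(A.T @ A))          # ||A||_F := sqrt(tr(A^T A))
      assert np.isclose(fro, np.linalg.norm(A, "fro"))
      assert np.isclose(fro**2, np.sum(A**2))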

  • Orthonormal set

    The rank-1 matrices ~ui ~vi^T form an orthonormal set:

    ||~ui ~vi^T||_F^2 = tr( ~vi ~ui^T ~ui ~vi^T ) = ~vi^T ~vi ~ui^T ~ui = 1

    〈~ui ~vi^T, ~uj ~vj^T〉 = tr( ~vi ~ui^T ~uj ~vj^T ) = ~vi^T ~vj ~uj^T ~ui = 0,   if i ≠ j
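    Sketch (numpy, random matrix) verifying that the rank-1 components of the SVD are orthonormal with respect to the Frobenius inner product:

      import numpy as np

      rng = np.random.default_rng(2)
      A = rng.standard_normal((6, 4))
      U, s, Vt = np.linalg.svd(A, full_matrices=False)

      # Rank-1 components u_i v_i^T
      components = [np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]

      for i, Ci in enumerate(components):
          for j, Cj in enumerate(components):
              inner = np.trace(Ci.T @ Cj)         # Frobenius inner product
              target = 1.0 if i == j else 0.0     # orthonormality
              assert np.isclose(inner, target)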

  • Decomposition of matrix norm

    ||A||_F^2 = Σ_{i=1}^n || si ~ui ~vi^T ||_F^2,   by Pythagoras's theorem,

              = Σ_{i=1}^n si^2,   by homogeneity of norms.
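    Numerical check (numpy, random matrix) of the identity ||A||_F^2 = Σ si^2:

      import numpy as np

      rng = np.random.default_rng(3)
      A = rng.standard_normal((7, 5))
      s = np.linalg.svd(A, compute_uv=False)   # singular values only

      assert np.isclose(np.linalg.norm(A, "fro")**2, np.sum(s**2))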

  • Connection between rank and SVD

    SVD decomposes a matrix into n components

    Norm of each component equals the corresponding singular value

    A matrix is approximately rank r if the last n − r singular values are small with respect to the first r

    Error of approximation given by the last n − r singular values:

    || A − Σ_{i=1}^r si ~ui ~vi^T ||_F^2 = || Σ_{i=r+1}^n si ~ui ~vi^T ||_F^2 = Σ_{i=r+1}^n si^2
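    Sketch (numpy, random matrix) checking that the error of keeping only the first r rank-1 components equals the sum of the squared trailing singular values:

      import numpy as np

      rng = np.random.default_rng(4)
      A = rng.standard_normal((10, 8))
      U, s, Vt = np.linalg.svd(A, full_matrices=False)

      r = 3
      A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # first r components

      err_sq = np.linalg.norm(A - A_r, "fro")**2
      assert np.isclose(err_sq, np.sum(s[r:]**2))    # equals sum of trailing s_i^2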

  • Rank-r bilinear model

    Certain people like certain movies: r factors

    y[i, j] ≈ Σ_{l=1}^r al[i] bl[j]

    For each factor l

    I al[i]: movie i is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l

    I bl[j]: user j is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l

  • Rank-r bilinear model

    In matrix form,

    Y ≈ [ ~a1 ~a2 · · · ~ar ] [ ~b1 ~b2 · · · ~br ]^T = A B^T

    To fit the model, we would like to solve

    min_{A ∈ R^{m×r}, B ∈ R^{n×r}} ||Y − A B^T||_F   subject to ||A:,1||_2 = 1, . . . , ||A:,r||_2 = 1

  • Best rank-r approximation

    Let U S V^T be the SVD of a matrix A ∈ R^{m×n}

    The truncated SVD U:,1:r S1:r,1:r V:,1:r^T is the best rank-r approximation:

    U:,1:r S1:r,1:r V:,1:r^T = argmin_{Ã : rank(Ã) = r} || A − Ã ||_F
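    Sketch (numpy, random matrix) comparing the truncated SVD against another rank-r approximation, here obtained by projecting the columns onto a random r-dimensional subspace; the truncated SVD never does worse in Frobenius norm:

      import numpy as np

      rng = np.random.default_rng(5)
      A = rng.standard_normal((12, 9))
      r = 4

      U, s, Vt = np.linalg.svd(A, full_matrices=False)
      A_svd = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]      # truncated SVD

      # Competing rank-r approximation: orthogonal projection onto a random subspace
      Q, _ = np.linalg.qr(rng.standard_normal((12, r)))  # orthonormal columns
      A_rand = Q @ Q.T @ A                               # rank at most r

      assert np.linalg.norm(A - A_svd, "fro") <= np.linalg.norm(A - A_rand, "fro")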

  • Proof

    Let à be an arbitrary matrix in Rm×n with rank(Ã) = r

    Let Ũ ∈ Rm×k be a matrix with orthonormal columns such thatcol(Ũ) = col(Ã)

    ∣∣∣∣∣∣U:,1:rUT:,1:rA∣∣∣∣∣∣2F =n∑

    i=1

    ∣∣∣∣Pcol(U:,1:r ) A:,i ∣∣∣∣22

    ≥n∑

    i=1

    ∣∣∣∣∣∣Pcol(Ũ) A:,i ∣∣∣∣∣∣22=∣∣∣∣∣∣ŨŨTA∣∣∣∣∣∣2

    F

  • Optimal subspace for orthogonal projection

    Given a set of vectors ~a1, ~a2, . . . , ~an and a fixed dimension k ≤ n, the SVD of

    A := [ ~a1 ~a2 · · · ~an ] ∈ R^{m×n}

    provides the k-dimensional subspace that captures the most energy:

    Σ_{i=1}^n || P_span(~u1, ~u2, . . . , ~uk) ~ai ||_2^2 ≥ Σ_{i=1}^n || P_S ~ai ||_2^2

    for any subspace S of dimension k

  • Orthogonal column spaces

    If the column spaces of A, B ∈ R^{m×n} are orthogonal, then

    ||A + B||_F^2 = ||A||_F^2 + ||B||_F^2

    Corollary:

    ||A||_F^2 = ||A + B||_F^2 − ||B||_F^2

  • Proof

    col(A − Ũ Ũ^T A) is orthogonal to col(Ã) = col(Ũ), since Ũ Ũ^T A is the orthogonal projection of the columns of A onto col(Ũ)

    || A − Ã ||_F^2 = || A − Ũ Ũ^T A ||_F^2 + || Ã − Ũ Ũ^T A ||_F^2

                    ≥ || A − Ũ Ũ^T A ||_F^2

                    = ||A||_F^2 − || Ũ Ũ^T A ||_F^2

                    ≥ ||A||_F^2 − || U:,1:r U:,1:r^T A ||_F^2

                    = || A − U:,1:r U:,1:r^T A ||_F^2

  • Collaborative filtering

    I MovieLens data

    I Ratings between 1 and 10

    I The 100 users and 100 movies with the most ratings

    I 6,031 out of the 10^4 possible ratings are observed

    I Test set with 10^3 ratings

    I Validation set with min {ntrain, 400} ratings

  • Bilinear model

    Incorporates average rating µ

    Y ≈ AB + µ

    1. Set µ equal to average rating

    2. Subtract average rating from all entries, set unobserved entries to 0

    3. Compute SVD of centered matrix

    4. Set A := U:,1:r, B := S1:r,1:r V:,1:r^T (sketched below)
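    A minimal sketch of the four steps above (numpy, with a small synthetic ratings matrix standing in for the MovieLens data; the observation mask, sizes, and rank are made up for illustration):

      import numpy as np

      rng = np.random.default_rng(6)
      n_movies, n_users, r = 30, 40, 3

      # Synthetic ratings with an underlying rank-r structure plus an offset
      Y_full = 5.5 + rng.standard_normal((n_movies, r)) @ rng.standard_normal((r, n_users))
      observed = rng.random((n_movies, n_users)) < 0.6   # ~60% of the entries observed

      # 1. Set mu equal to the average observed rating
      mu = Y_full[observed].mean()

      # 2. Subtract the average rating, set unobserved entries to 0
      Y_centered = np.where(observed, Y_full - mu, 0.0)

      # 3. Compute the SVD of the centered matrix
      U, s, Vt = np.linalg.svd(Y_centered, full_matrices=False)

      # 4. Set A := U[:, :r], B := diag(s[:r]) V[:, :r]^T
      A = U[:, :r]
      B = np.diag(s[:r]) @ Vt[:r, :]

      Y_hat = A @ B + mu   # low-rank estimate of the ratings
      rmse = np.sqrt(np.mean((Y_hat[observed] - Y_full[observed]) ** 2))
      print("RMSE on observed entries:", rmse)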

  • SVD of matrix (ntrain := 4, 600)

    [Figure: singular values si (0 to 20) of the centered matrix vs. index i (0 to 100)]

  • Results (ntrain := 4, 600)

    [Figure: training error and validation error (rating scale 1-10) vs. rank (0 to 100)]

  • Selected rank

    [Figure: selected rank vs. number of training data (1000 to 4000)]

  • Results

    [Figure: error (rating scale 1-10) vs. number of training data (1000 to 4000); legend: Average ranking (total), Average ranking (movie), Low-rank model (test)]

  • Rank-r bilinear model

    Certain people like certain movies: r factors

    y[i, j] ≈ Σ_{l=1}^r al[i] bl[j]

    For each factor l

    I al[i]: movie i is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l

    I bl[j]: user j is positively (> 0), negatively (< 0) or not (≈ 0) associated to factor l

  • Rank-3 bilinear model

    [Figure: scatter plot of movie coefficients, u1 (horizontal axis) vs. u3 (vertical axis), both from -0.3 to 0.3]

  • Rank-3 bilinear model

      u1     u2     u3
    -0.03  -0.15   0.06   Forrest Gump
     0.14  -0.15  -0.10   Pulp Fiction
     0.05  -0.15  -0.11   The Shawshank Redemption
     0.04  -0.16  -0.12   Silence of the Lambs
     0.09  -0.22   0.06   Star Wars Ep. IV
    -0.12  -0.14   0.03   Jurassic Park
     0.10  -0.14   0.10   The Matrix
    -0.07  -0.10   0.03   Toy Story
     0.04  -0.13  -0.12   Schindler's List
    -0.01  -0.03   0.08   Terminator 2
     0.08  -0.20   0.22   Star Wars Ep. V
     0.00  -0.12  -0.10   Braveheart
     0.02  -0.12   0.00   Back to the Future
     0.07  -0.09  -0.27   Fargo
     0.04  -0.23   0.12   Raiders of the Lost Ark
     0.06  -0.13  -0.12   American Beauty
    -0.21   0.07   0.08   Independence Day
     0.06  -0.16   0.19   Star Wars Ep. VI
    -0.09  -0.08   0.01   Aladdin
    -0.04  -0.08  -0.02   The Fugitive
