
The Terms that You Have to Know!

Basis, linear independence, orthogonality; column space, row space, rank; linear combination; linear transformation; inner product; eigenvalue, eigenvector; projection.

Least Squares Problem

The normal equation for the LS problem: $A^T A x = A^T b$

Finding the projection of $b$ onto $\mathrm{col}(A)$: $Ax \approx b$

The projection matrix: $P = A(A^T A)^{-1} A^T \in \mathbb{R}^{m \times m}$

Let $A \in \mathbb{R}^{m \times n}$ be a matrix with full column rank.

If $A$ has orthonormal columns, then the LS problem becomes easy:

$Pb = AA^T b = \sum_{i=1}^{n} A_{\cdot i} A_{\cdot i}^T b$

Think of an orthonormal axis system.
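As a minimal numerical sketch of the normal equation and the projection matrix (the matrix A and vector b below are random, invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 6, 3                      # overdetermined: more equations than unknowns
    A = rng.standard_normal((m, n))  # full column rank (almost surely, for random data)
    b = rng.standard_normal(m)

    # Normal equations: A^T A x = A^T b
    x = np.linalg.solve(A.T @ A, A.T @ b)

    # Projection of b onto col(A): P = A (A^T A)^{-1} A^T
    P = A @ np.linalg.inv(A.T @ A) @ A.T
    print(np.allclose(P @ b, A @ x))   # the projection of b equals A x  -> True

Forming $(A^T A)^{-1}$ explicitly is done here only to mirror the formula for P; in practice one solves the linear system rather than inverting.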

Matrix Factorization

LU-Factorization: $A = LU$. Very useful for solving systems of linear equations; some row exchanges (pivoting) may be required.

QR-Factorization: $A = QR$, with $A \in \mathbb{R}^{m \times n}$, $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$

Every matrix $A \in \mathbb{R}^{m \times n}$ with linearly independent columns can be factored as $A = QR$. The columns of $Q$ are orthonormal, and $R$ is upper triangular and invertible. When $m = n$ and all matrices are square, $Q$ becomes an orthogonal matrix ($Q^T Q = I$).
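A quick check of these properties with NumPy's built-in QR routine (the tall matrix A is random and serves only as an example):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 3))          # tall matrix with (almost surely) independent columns

    Q, R = np.linalg.qr(A)                   # "reduced" QR: Q is 5x3, R is 3x3
    print(np.allclose(Q.T @ Q, np.eye(3)))   # orthonormal columns   -> True
    print(np.allclose(A, Q @ R))             # A = QR                -> True
    print(np.allclose(R, np.triu(R)))        # R is upper triangular -> True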

QR Factorization Simplifies Least Squares Problem

The normal equation for the LS problem: $A^T A x = A^T b$

$A^T A x = R^T Q^T Q R x = R^T R x = R^T Q^T b \;\Leftrightarrow\; Rx = Q^T b$ (since $R^T$ is invertible)

$A_{\cdot j} = Q \cdot R_{\cdot j} = \sum_{k=1}^{n} R_{kj} Q_{\cdot k}$

Note: the orthogonal matrix $Q$ constructs the column space of the matrix $A$.

LS problem: finding the projection of $b$ onto $\mathrm{col}(A)$.
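The derivation above suggests a direct recipe: factor A, then solve the triangular system $Rx = Q^T b$. A small sketch with made-up data:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 4))      # full column rank (almost surely)
    b = rng.standard_normal(8)

    # Solve the LS problem via QR: R x = Q^T b
    Q, R = np.linalg.qr(A)
    x_qr = np.linalg.solve(R, Q.T @ b)   # R is upper triangular and invertible

    # Compare with the normal-equation solution A^T A x = A^T b
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)
    print(np.allclose(x_qr, x_ne))       # True

A dedicated triangular solver (back substitution) would exploit the structure of R; the generic solve is used here only to keep the sketch short.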

Motivation for Computing QR of the Term-by-Doc Matrix

The basis vectors of the column space of $A$ can be used to describe the semantic content of the corresponding text collection.

Let $\theta_k$ be the angle between a query $q$ and the document vector $A_{\cdot k}$:

$\cos\theta_k = \dfrac{A_{\cdot k}^T q}{\|A_{\cdot k}\|_2\,\|q\|_2} = \dfrac{(Q R_{\cdot k})^T q}{\|Q R_{\cdot k}\|_2\,\|q\|_2} = \dfrac{R_{\cdot k}^T (Q^T q)}{\|R_{\cdot k}\|_2\,\|q\|_2}$

That means we can keep $Q$ and $R$ instead of $A$.

QR can also be applied to dimensionality reduction.
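A sketch of computing the query-document cosines from Q and R alone, using $\|A_{\cdot k}\|_2 = \|R_{\cdot k}\|_2$ (which holds because Q has orthonormal columns); the tiny term-document matrix and the query vector are invented for illustration:

    import numpy as np

    # Toy term-document matrix (terms x documents) and a query vector -- made-up counts
    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    q = np.array([1.0, 0.0, 1.0, 0.0])

    Q, R = np.linalg.qr(A)
    Qq = Q.T @ q                                     # compute Q^T q once per query

    # cos(theta_k) = R_k^T (Q^T q) / (||R_k||_2 ||q||_2)
    cos_qr = (R.T @ Qq) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

    # Same quantity computed directly from A
    cos_direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    print(np.allclose(cos_qr, cos_direct))           # True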

Recall Matrix Notations

Random vector $x = [X_1, X_2, \ldots, X_n]^T$, where each $X_i$ is a random variable describing the value of the i-th attribute.

Expectation: $E[x] = \mu$; covariance: $E[(x - \mu)(x - \mu)^T] = \Sigma$

Expectation of a projection: $E[w^T x] = E[\sum_i w_i X_i] = \sum_i w_i E[X_i] = w^T E[x] = w^T \mu$

Variance of a projection: $\mathrm{Var}(w^T x) = E[(w^T x - w^T \mu)^2] = E[(w^T x - w^T \mu)(w^T x - w^T \mu)] = E[w^T (x - \mu)(x - \mu)^T w] = w^T E[(x - \mu)(x - \mu)^T] w = w^T \Sigma w$

($w^T$: $1 \times n$, $x$: $n \times 1$)
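A Monte Carlo sanity check of the two identities above; the mean vector $\mu$, covariance $\Sigma$, and weight vector w are arbitrary example values:

    import numpy as np

    rng = np.random.default_rng(3)
    mu = np.array([1.0, -2.0, 0.5])
    Sigma = np.array([[2.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 0.5]])          # symmetric positive definite
    w = np.array([0.4, -1.0, 2.0])

    x = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of the random vector
    z = x @ w                                               # projections w^T x

    print(z.mean(), w @ mu)            # E[w^T x]   is approximately w^T mu
    print(z.var(), w @ Sigma @ w)      # Var(w^T x) is approximately w^T Sigma w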

Principal Components Analysis (PCA)

PCA does not use the output information. Find a mapping from the inputs in the original n-dimensional space to a new k-dimensional space (k < n) such that, when x is projected there, information loss is minimized.

The projection of x on the direction of w is $z = w^T x$. Find w such that $\mathrm{Var}(z)$ is maximized (after the projection, the differences between the sample points become most apparent).

For a unique solution, require $\|w\| = 1$.

(Figure: projection of a sample point x onto the unit direction w ($\|w\| = 1$), giving $w^T x$.)

The 1st Principal Component

Maximize $\mathrm{Var}(z) = w_1^T \Sigma w_1$ subject to $\|w_1\| = 1$, i.e., solve

$\max_{w_1} \; w_1^T \Sigma w_1 - \lambda (w_1^T w_1 - 1)$

Taking the derivative w.r.t. $w_1$ and setting it to 0, we have

$2\Sigma w_1 - 2\lambda w_1 = 0 \;\Rightarrow\; \Sigma w_1 = \lambda w_1$

That is, $w_1$ is an eigenvector of $\Sigma$ and $\lambda$ is the corresponding eigenvalue. Also,

$\mathrm{Var}(z) = w_1^T \Sigma w_1 = \lambda w_1^T w_1 = \lambda$

so we choose the largest eigenvalue for $\mathrm{Var}(z)$ to be maximum.

The 1st principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue, $\lambda_1 = \max_i \lambda_i$.
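A minimal sketch of this result: estimate the covariance from a sample, take the eigenvector with the largest eigenvalue, and check that the variance of the projection equals that eigenvalue (the 2-D data below are synthetic):

    import numpy as np

    rng = np.random.default_rng(4)
    # Correlated 2-D data, generated only for illustration
    X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=5000)

    S = np.cov(X, rowvar=False)              # sample covariance (estimator of Sigma)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues, orthonormal eigenvectors
    w1 = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue

    z = X @ w1                               # z = w1^T x for every sample
    print(z.var(ddof=1), eigvals[-1])        # Var(z) equals the largest eigenvalue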

The 2nd Principal Component

Maximize $\mathrm{Var}(z_2)$ subject to $\|w_2\| = 1$ and $w_2$ orthogonal to $w_1$:

$\max_{w_2} \; w_2^T \Sigma w_2 - \lambda (w_2^T w_2 - 1) - \mu (w_2^T w_1 - 0)$

Taking the derivative w.r.t. $w_2$ and setting it equal to 0, we have

$2\Sigma w_2 - 2\lambda w_2 - \mu w_1 = 0$

Premultiplying by $w_1^T$, we get

$2 w_1^T \Sigma w_2 - 2\lambda w_1^T w_2 - \mu w_1^T w_1 = 0$

Note that $w_1^T w_2 = 0$, and $w_1^T \Sigma w_2$ is a scalar, equal to its transpose, therefore

$w_1^T \Sigma w_2 = w_2^T \Sigma w_1 = \lambda_1 w_2^T w_1 = 0$

so $\mu = 0$, and we have

$\Sigma w_2 = \lambda w_2$

That is, $w_2$ is the eigenvector of $\Sigma$ with the second largest eigenvalue, $\lambda_2 = \lambda$, and so on.
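Continuing in the same spirit, a short numerical check that the second eigenvector has the properties derived above (synthetic 3-D data):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.multivariate_normal([0, 0, 0],
                                [[4.0, 1.0, 0.5],
                                 [1.0, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]], size=10_000)

    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    w1, w2 = eigvecs[:, -1], eigvecs[:, -2]       # largest and second-largest eigenvalue

    print(np.isclose(w1 @ w2, 0.0))               # w1 and w2 are orthogonal  -> True
    print(np.allclose(S @ w2, eigvals[-2] * w2))  # Sigma w2 = lambda_2 w2    -> True
    print((X @ w2).var(ddof=1), eigvals[-2])      # Var(z2) equals lambda_2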

Recall from Linear Algebra

Theorem: for a real symmetric matrix, eigenvectors associated with different eigenvalues are orthogonal to each other.

Theorem: a real symmetric matrix A can be transformed into a diagonal matrix by $P^{-1} A P = D$, where P has the eigenvectors of A as its columns.

Recall from Linear Algebra (cont.)

Def: positive definite bilinear form: $f(x, x) > 0$ for all $x \neq 0$. E.g., $f(x, y) = x^T A y$: if $x^T A x > 0$ for all $x \neq 0$, the $n \times n$ matrix A is called a positive definite matrix.

Def: positive semidefinite bilinear form: $f(x, x) \geq 0$ for all $x$. E.g., if $x^T A x \geq 0$ for all $x$, A is called a positive semidefinite matrix.

Theorem: a symmetric matrix A is positive definite if and only if all the eigenvalues of A are positive.
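A small illustration of the eigenvalue criterion for symmetric matrices (both test matrices are just examples):

    import numpy as np

    def is_positive_definite(A):
        """Check a symmetric matrix via its eigenvalues (all must be > 0)."""
        return bool(np.all(np.linalg.eigvalsh(A) > 0))

    A_pd  = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3  -> positive definite
    A_ind = np.array([[1.0,  2.0], [ 2.0, 1.0]])   # eigenvalues -1 and 3 -> not positive definite
    print(is_positive_definite(A_pd), is_positive_definite(A_ind))   # True False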

What PCA does

Consider an $\mathbb{R}^n \to \mathbb{R}^k$ transformation

$z = W^T (x - m)$ (just like $z = w_1^T x$, $z = w_2^T x$, ...)

where the k columns of W are the k leading eigenvectors of S (the estimator of $\Sigma$), and m is the sample mean. Note: if $k = n$, then $W W^T = W^T W = I$, so $W^{-1} = W^T$; if $k < n$, then $W^T W = I_{k \times k}$.

The transformation centers the data at the origin and rotates the axes to those eigenvectors, and the variances over the new dimensions are equal to the eigenvalues.
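A minimal PCA sketch following exactly this recipe, with synthetic 3-D data and k = 2 chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.multivariate_normal([5.0, -2.0, 1.0],
                                [[6.0, 2.0, 0.5],
                                 [2.0, 3.0, 0.4],
                                 [0.5, 0.4, 1.0]], size=2000)   # n = 3
    k = 2

    m = X.mean(axis=0)                           # sample mean
    S = np.cov(X, rowvar=False)                  # estimator of Sigma
    eigvals, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
    W = eigvecs[:, ::-1][:, :k]                  # k leading eigenvectors as columns

    Z = (X - m) @ W                              # z = W^T (x - m) for every sample
    print(Z.mean(axis=0).round(10))              # centered at the origin
    print(np.cov(Z, rowvar=False).round(4))      # diagonal, with the k largest eigenvalues
    print(np.sort(eigvals)[::-1][:k].round(4))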

Singular Value Decomposition (SVD)

$A = U \Sigma V^T$, with $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$

The columns of $U$ are eigenvectors of $AA^T$, and the columns of $V$ are eigenvectors of $A^T A$.

$\Sigma = \begin{bmatrix} \sigma_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_r \\ 0 & \cdots & 0 \end{bmatrix}$, with $r = \min(m, n)$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$

The singular values are the square roots of the nonzero eigenvalues of both $AA^T$ and $A^T A$.
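A quick numerical illustration of these relationships on a random matrix:

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((5, 3))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T (thin SVD)
    print(np.allclose(A, U @ np.diag(s) @ Vt))         # True

    # Singular values are square roots of the eigenvalues of A^T A (and of A A^T)
    eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]        # descending order
    print(np.allclose(s, np.sqrt(eigvals)))            # True

    # Columns of V are eigenvectors of A^T A: (A^T A) v_i = s_i^2 v_i
    V = Vt.T
    print(np.allclose(A.T @ A @ V, V * s**2))          # True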

Singular Value Decomposition (SVD)

$\begin{bmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix} = \begin{bmatrix} -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} \frac{\sqrt{6}}{6} & -\frac{\sqrt{6}}{3} & \frac{\sqrt{6}}{6} \\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} \end{bmatrix}$

$A = U \Sigma V^T$, with $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$

$AA^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \;\Rightarrow\; \mathrm{col}(A) = \mathrm{col}(U)$

$A^T A = V \Sigma^T \Sigma V^T \;\Rightarrow\; \mathrm{row}(A) = \mathrm{col}(V)$

$\begin{bmatrix} -1 \\ 2 \\ 2 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} -1 & 2 & 2 \\ 2 & -1 & 2 \\ 2 & 2 & -1 \end{bmatrix} \begin{bmatrix} 3 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 \end{bmatrix}$
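A numerical check of the 2-by-3 worked example above. NumPy may pick different signs for the singular vectors, so the check compares the singular values and the reconstruction rather than U and V entry by entry:

    import numpy as np

    A = np.array([[-1.0,  1.0, 0.0],
                  [ 0.0, -1.0, 1.0]])

    U, s, Vt = np.linalg.svd(A)                 # full SVD: U is 2x2, Vt is 3x3
    Sigma = np.zeros((2, 3))
    Sigma[:2, :2] = np.diag(s)

    print(np.allclose(s, [np.sqrt(3.0), 1.0]))  # singular values sqrt(3) and 1 -> True
    print(np.allclose(A, U @ Sigma @ Vt))       # A = U Sigma V^T               -> True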

Latent Semantic Indexing (LSI)

Basic idea: explore the correlation between words and documents

Two words are correlated when they co-occur many times.

Two documents are correlated when they have many words in common.

Latent Semantic Indexing (LSI)

Computation: using singular value decomposition (SVD)

(Figure: the concept space obtained from the truncated SVD of the term-document matrix X. The left factor gives the representation of the concepts in term space, the middle diagonal factor holds the singular values, and the right factor gives the representation of the concepts in document space; m is the number of concepts/topics.)

SVD: Example: m=2

(Figure, repeated over several slides: a worked SVD of a small term-document matrix X, keeping only the m = 2 largest singular values.)

SVD: Eigenvalues

Determining m is usually difficult

SVD: Orthogonality

The left singular vectors are mutually orthogonal, $u_1 \cdot u_2 = 0$, and so are the right singular vectors, $v_1 \cdot v_2 = 0$.

SVD: Properties

rank(S): the maximum number of row or column vectors of the matrix S that are linearly independent.

SVD produces the best low-rank approximation: here X has rank(X) = 9, while its approximation X' has rank(X') = 2.

SVD: Visualization

(Figure: block diagram of $X = U \Sigma V^T$.)

SVD tries to preserve the Euclidean distances between document vectors.
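A small end-to-end sketch tying these slides together: build a toy term-document matrix (the counts are invented), keep the m = 2 largest singular values, and compare the pairwise Euclidean distances between document vectors before and after the low-rank approximation:

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents (made-up counts)
    X = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2],
                  [1, 1, 1, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    m = 2                                            # number of concepts/topics kept
    X2 = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]       # best rank-2 approximation of X

    print(np.linalg.matrix_rank(X2))                 # 2
    print(np.linalg.norm(X - X2))                    # small Frobenius-norm error

    # Pairwise Euclidean distances between document vectors, before and after
    def doc_dists(M):
        D = M.T                                      # documents as rows
        return np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)

    print(doc_dists(X).round(2))
    print(doc_dists(X2).round(2))                    # roughly preserved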