
Page 1: B3-3

Latent Semantic Analysis by CUR matrix decomposition

Nov 1, 2012

Presentation

KORMS 2012 Fall

Minhoe Hur, Seokho Kang, Sungzoon Cho

Seoul National University

Page 2: B3-3

Preface

Abstract

Latent Semantic Analysis (LSA) assumes that terms or documents carry latent associations that are not apparent on the surface, and finds them using a term-frequency (TF) matrix that records how often each word appears in each document. By reducing the dimensionality of the original data, it identifies clusters of highly related documents or terms. CUR matrix decomposition is a matrix factorization method proposed to overcome the limitations of the conventional Singular Value Decomposition (SVD), and it has recently been applied in a variety of settings. In this study, we examine the feasibility of using CUR for LSA, and we discuss the strengths and weaknesses of CUR through a comparative analysis against conventional SVD-based LSA.

Acknowledgement

This work was supported by the second stage of the Brain Korea 21 Project in 2012, by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0030814), and by the Engineering Research Institute of SNU.

Page 3: B3-3

Contents

Research Background

– Latent Semantic Analysis(LSA)

– Singular Value Decomposition(SVD)

– CUR matrix decomposition

Experimental data

Method

– LSA

– Text categorization

Results

– LSA with SVD and CUR

– Document classification with SVD and CUR

Conclusion


Page 4: B3-3

Research Background

Latent Semantic Analysis(LSA, or LSI)

– A concept can be described in multiple ways

• Synonyms, people's language habits, different languages

ex) mobile, cellular phone, mobile phone …

• Causes low recall in information retrieval and high dimensionality

– Approach

• Assumption: there is some underlying latent semantic structure in the data

• By using a statistical technique, SVD, this latent structure can be estimated.

– Latent structure in the text

• ‘hidden concept space’: transformed space

• Associates syntactically different but semantically similar terms


< Term-Document relationship vs. Term-Document mediated by a latent space: terms T1…Tm connect to documents D1…Dn directly on the left, and through latent dimensions L1…Lk on the right >

Page 5: B3-3

Research Background

LSA by Singular Value Decomposition(SVD)

Matrix A

Let $A$ be the text collection matrix, m × n, where the number of distinct terms is m and the number of documents is n; each entry $A_{i,j}$ is the number of occurrences of term i in document j.

What SVD does is factorize the matrix $A$ into the product of three matrices:

$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V_{r\times n}^{T}$$

• $U_{m\times r}$: left singular vectors, associated with the r eigenvectors of $AA^{T}$; $U^{T}U = I$

• $V_{n\times r}$: right singular vectors, associated with the r eigenvectors of $A^{T}A$; $V^{T}V = I$

• $\Sigma_{r\times r}$: diagonal (square) matrix of non-negative singular values, $\Sigma_r = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$

Page 6: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Eliminating unnecessary rows/columns

• $r$ = rank of the matrix $A$ = the number of linearly independent rows/columns, so $r \le \min(\text{rows}, \text{columns})$

• If two or more terms or documents share the same vector, r is reduced.

– Dimension reduction

• Delete insignificant dimensions in the transformed space

• Significance of a dimension = magnitude of its singular value

$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V_{r\times n}^{T} \;\approx\; U_{m\times k}\,\Sigma_{k\times k}\,V_{k\times n}^{T} = A_k$$

Page 7: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Choosing optimal k in dimension reduction

• By considering the singular values on the diagonal:

$$\mathrm{Significance}(k) = \frac{\sigma_1 + \sigma_2 + \cdots + \sigma_k}{\sum_{i=1}^{r} \sigma_i} \quad (\approx \text{share of total variance})$$

< Plot: cumulative total variance (%) against the number of dimensions kept >

When dim = 50, we can explain 90% of the total variability of the original data!
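A minimal sketch of this selection rule, assuming numpy; the choose_k helper and the random test matrix are illustrative, not taken from the original experiments.

import numpy as np

def choose_k(singular_values, target=0.90):
    """Smallest k whose cumulative singular-value share reaches target."""
    share = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(share, target) + 1)

s = np.linalg.svd(np.random.rand(100, 60), compute_uv=False)
k = choose_k(s, target=0.90)
print(f"dim = {k} reaches 90% of the cumulative singular-value mass")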

Page 8: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Matrix Ak with various k values

< Reconstructions $A_k$ at dim = 104 (total var: 100%), dim = 80 (total var: 98.81%), dim = 50 (total var: 88.78%), and dim = 20 (total var: 66.05%) >

Page 9: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Finding similar terms in latent space

• The goal of LSA is to find semantically similar terms in the term-document matrix

• Clustering methods can be used in the new (transformed) space:

$$A_{m\times n} \approx U_{m\times k}\,\Sigma_{k\times k}\,V_{k\times n}^{T}$$

< Original vs. transformed space, for 2 dimensions: documents d1 and d2 plotted on the axes of the original space become d1′ and d2′ on the axes of the transformed space >

< Hierarchical clustering of terms: car, motor, computer, interface, tree >

Page 10: B3-3

Research Background

Singular Value Decomposition (SVD)

For a given $A \in \mathbb{R}^{m\times n}$, there exist orthogonal matrices $U = [u_1\,u_2 \ldots u_m] \in \mathbb{R}^{m\times m}$ and $V = [v_1\,v_2 \ldots v_n] \in \mathbb{R}^{n\times n}$ such that

$$U^{T}AV = \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_{\min\{m,n\}}) \in \mathbb{R}^{m\times n}, \qquad \sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge 0$$

If $k \le r = \mathrm{rank}(A)$ and we define $A_k = \sum_{t=1}^{k} \sigma_t u_t v_t^{T}$, then we get the best rank-k approximation to the data matrix A:

$$\|A - A_k\|_F^2 = \min_{X \in \mathbb{R}^{m\times n}:\ \mathrm{rank}(X) \le k} \|A - X\|_F^2, \qquad \text{where } \|A\|_F^2 = \sum_{ij} A_{ij}^2 \text{ (Frobenius norm)}$$

< First two eigenvectors in scatter plot >

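A small numeric check of this best-approximation property, assuming numpy; the random matrices are placeholders.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD
err_opt = np.linalg.norm(A - A_k, "fro")

# any other rank-k candidate does no better
X = rng.random((8, k)) @ rng.random((k, 6))
print(err_opt, "<=", np.linalg.norm(A - X, "fro"))

# the optimal error equals the mass of the discarded singular values
assert np.isclose(err_opt, np.sqrt(np.sum(s[k:] ** 2)))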

Page 11: B3-3

Research Background

Limitations of SVD analysis

– Lack of interpretability (not informative)

– Singular vectors are linear combinations of all inputs, not 'things' with a 'physical reality'

– Interpreting them requires an intimate knowledge of the field from which the data are drawn

– Cannot explain some data patterns well

e.g.) eigenvector 1 = 0.5 * age + 1.43 * height – 0.43 * income + 1.2 * footsize + …

< First two eigenvectors in scatter plot: bad cases >


Page 12: B3-3

Research Background

We desire a decomposition that

– Should have provable worst-case optimality and algorithmic properties

– Should have a natural statistical interpretation associated with its construction

– Should perform well in practice

CUR matrix decomposition

– Expressed in terms of a small number of actual rows/columns of original data

• C: consists of a small number of actual columns of A

• R: consists of a small number of actual rows of A

• U: carefully constructed matrix that guarantees that C*U*R is close to A

– C and R can be used in place of the eigencolumns and eigenrows


Page 13: B3-3

Research Background

Constructing C( and R )

– Calculate ‘Importance(statistical leverage) score’ for each column of A

– Randomly sample small number of columns from A by using that score as an importance sampling probability distribution

ColumnSelect procedure

– 1. Compute the top-k right singular vectors of A and the statistical leverage scores

$$\pi_j = \frac{1}{k} \sum_{\xi=1}^{k} \left(v_j^{\xi}\right)^2$$

where $v_j^{\xi}$ is the jth coordinate of the $\xi$th right singular vector.

– 2. Keep the jth column of A with probability $p_j = \min\{1,\, c\,\pi_j\}$ for all $j \in \{1, \ldots, n\}$, where $c = O(k \log k / \epsilon^2)$

– 3. Return the matrix C consisting of the selected columns of A (a sketch of the procedure follows below)

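A minimal sketch of the ColumnSelect procedure above, assuming numpy; the column_select name and the toy matrix are illustrative.

import numpy as np

def column_select(A, k, c, seed=0):
    """Sample columns of A with probability min(1, c * leverage score)."""
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # leverage score of column j: mean squared jth coordinate
    # over the top-k right singular vectors
    pi = np.sum(Vt[:k, :] ** 2, axis=0) / k
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * pi)
    idx = np.flatnonzero(keep)
    return idx, A[:, idx]

A = np.random.rand(20, 12)
idx, C = column_select(A, k=3, c=8)
print("selected columns:", idx)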

Page 14: B3-3

Research Background

CUR matrix decomposition

– For an m × n matrix A, a rank parameter k, and an error parameter $\epsilon$:

• 1. Run ColumnSelect on A with $c = O(k \log k / \epsilon^2)$ to choose columns of A and construct the matrix C

• 2. Run ColumnSelect on $A^{T}$ with $r = O(k \log k / \epsilon^2)$ to choose columns of $A^{T}$ and construct the matrix R

• 3. Define the matrix U as $U = C^{+} A R^{+}$, where $X^{+}$ is a Moore-Penrose generalized inverse of the matrix X

– And by CUR decomposition, we expect that

$$\|A - CUR\|_F \le (2 + \epsilon)\,\|A - A_k\|_F$$

– Example (k = 2, epsilon = 0.4):

A (4 × 3):
[  1  2  3 ]
[  4  5  6 ]
[  7  8  9 ]
[ 10 11 12 ]

C (4 × 5, actual columns of A, with repeats):
[  1.  1.  2.  3.  3. ]
[  4.  4.  5.  6.  6. ]
[  7.  7.  8.  9.  9. ]
[ 10. 10. 11. 12. 12. ]

U (5 × 5, $U = C^{+}AR^{+}$):
[ -0.38 -0.11 -0.11  0.17  0.17 ]
[ -0.38 -0.11 -0.11  0.17  0.17 ]
[ -0.02  0.00  0.00  0.02  0.01 ]
[  0.32  0.10  0.10 -0.14 -0.13 ]
[  0.32  0.10  0.10 -0.14 -0.13 ]

R (5 × 3, actual rows of A, with repeats):
[  4.  5.  6. ]
[  7.  8.  9. ]
[  7.  8.  9. ]
[ 10. 11. 12. ]
[ 10. 11. 12. ]

A ≈ C × U × R = A′:
[  1.  2.  3. ]
[  4.  5.  6. ]
[  7.  8.  9. ]
[ 10. 11. 12. ]

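A minimal sketch of the whole construction applied to the toy matrix above, assuming numpy; it reuses the leverage-score sampling from ColumnSelect, and the helper names are illustrative.

import numpy as np

def column_select(A, k, c, rng):
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    pi = np.sum(Vt[:k, :] ** 2, axis=0) / k          # leverage scores
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * pi)
    return A[:, np.flatnonzero(keep)]

def cur(A, k, eps, seed=0):
    rng = np.random.default_rng(seed)
    c = int(k * np.log(k + 1) / eps ** 2) + 1        # ~ O(k log k / eps^2)
    C = column_select(A, k, c, rng)                  # actual columns of A
    R = column_select(A.T, k, c, rng).T              # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)    # U = C^+ A R^+
    return C, U, R

A = np.arange(1.0, 13.0).reshape(4, 3)               # the 4 x 3 example matrix
C, U, R = cur(A, k=2, eps=0.4)
print(np.round(C @ U @ R, 2))                        # close to A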

Page 15: B3-3

Research Background

Finding Latent Semantics

– We can follow the same LSA procedure that we used with SVD.

– Compared side by side, for a term-document term-frequency matrix $A \in \mathbb{R}^{m\times n}$:

SVD: $A' = U\Sigma V^{T}$, transformed terms $D' = U\Sigma$

CUR decomposition: $A' = CUR$, where $U = C^{+}AR^{+}$, transformed terms $D' = CU$

– For each term in D′, do hierarchical clustering with the 'new features' generated by CU (a sketch follows below)

– Find semantic meanings for each cluster.
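A minimal sketch of this clustering step, assuming numpy and scipy. To keep it self-contained, uniform column/row sampling stands in for ColumnSelect, and the data are random placeholders.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
A = rng.random((50, 30))                     # 50 terms x 30 documents

# crude CUR stand-in: uniform sampling instead of leverage scores
cols = rng.choice(30, size=10, replace=False)
rows = rng.choice(50, size=10, replace=False)
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

D_prime = C @ U                              # one feature row per term
Z = linkage(D_prime, method="average", metric="cosine")
labels = fcluster(Z, t=5, criterion="maxclust")
print("cluster label of each term:", labels)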

Page 16: B3-3

Experimental data

Data

– Reuters-21578 Text Collection dataset, as a term-frequency matrix

• News corpus contains 8,293 documents and 65 categories

• Number of terms: 18,933

• Based on Reuters news in 1987

• Categories are manually assigned

• Widely used in text categorization studies

– For text categorization,

• Training: test = 5,946 : 2,347


Page 17: B3-3

Method

Latent Semantic Analysis

– SVD with Python program

• k = 200

• With the transformed matrix UΣ, we did hierarchical clustering for all terms

• # of clusters: 65

– CUR with Python program

• k = 100, epsilon = 0.01

• With the transformed matrix CU, we did hierarchical clustering for all terms

• # of clusters: 65

Text categorization

– Categorization method: k-NN(k = 1, 5, 10, 20)

• Distance measure: Euclidean, Cosine

• Evaluation criteria: F-measure(2 * Precision * Recall/(Precision + Recall))

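A minimal sketch of this evaluation loop, with scikit-learn standing in for the original Matlab/Python programs; the random features and labels are placeholders for the transformed Reuters data.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 50)), rng.integers(0, 3, 200)
X_test, y_test = rng.random((80, 50)), rng.integers(0, 3, 80)

for k in (1, 5, 10, 20):
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    # F-measure = 2 * Precision * Recall / (Precision + Recall), per class
    print(k, np.round(f1_score(y_test, pred, average=None), 3))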

Page 18: B3-3

Method

Text categorization(cont’d)

– SVD with Matlab program

• For the training dataset, decompose the matrix with k = 50, 100, 200, 500

• For the test dataset: $A_{test,k} = A_{test}^{T}\,U_k\,\Sigma_k^{-1}$, where $U_k$ and $\Sigma_k$ were decomposed in the training step and $A_{test,k}$ is the transformed matrix from the test set. After we get $A_{test,k}$, try categorization against the training dataset.

– CUR with Python program

• For the training dataset, decompose the matrix with k = 100, epsilon = 0.01

• For the test dataset: $A_{test} \approx C_{test}\,U_{test}\,R_{training}$, where $C_{test}$ and $R_{test}$ are composed of the same indices as $C_{training}$ and $R_{training}$. After we get $C_{test}\,U_{training}$, try categorization against the training dataset. (A sketch of both folding-in rules follows below.)
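A minimal sketch of both folding-in rules, assuming numpy. It takes documents as rows and terms as columns, so that 'same indices' means the term columns chosen on the training set; uniform sampling again stands in for ColumnSelect.

import numpy as np

rng = np.random.default_rng(0)
A_train = rng.random((200, 100))             # training documents x terms
A_test = rng.random((80, 100))               # test documents x terms

# SVD folding-in: project test documents into the training latent space
U, s, Vt = np.linalg.svd(A_train, full_matrices=False)
k = 10
A_test_k = A_test @ Vt[:k, :].T @ np.diag(1.0 / s[:k])

# CUR folding-in: reuse the training column/row choices and U
cols = rng.choice(100, size=15, replace=False)
rows = rng.choice(200, size=15, replace=False)
C, R = A_train[:, cols], A_train[rows, :]
U_cur = np.linalg.pinv(C) @ A_train @ np.linalg.pinv(R)

C_test = A_test[:, cols]                     # same term indices as training
features_test = C_test @ U_cur               # one feature row per test document
print(A_test_k.shape, features_test.shape)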

Page 19: B3-3

Results

Results of latent semantics by SVD and CUR decomposition

– Cluster results

• # of clusters: 64

• Cosine similarity

• Clusters that have fewer than 5 members were eliminated.

< LSA result by SVD > < LSA result by CUR >

Page 20: B3-3

Results

Results of Latent Semantics by SVD decomposition

– Cluster examples(selected)

Cluster 1 (Economy, Industry): sales, finance, half, fall, approved, committee, special, equipment, mining, reflected, stockholders, satisfactory, owner, submitted, recapitalization

Cluster 2 (Economy, Finance): prices, government, economic, reuters, amount, below, productivity, systems, economy, conditions, overfunded, country, continued, addition, governments, countrys, sharply, overall

Cluster 3 (Geology, World-nature): burned, following, late, majority, wash, cannon, started, mine, destroyed, discovery, delayed, unspecified, attack, lake

Page 21: B3-3

Results

Results of Latent Semantics by CUR decomposition

– Cluster examples(selected)

Cluster 1 (Science, Chemistry): additive, butane, carbozulia, zulia, constructed, carbone, cryogenic, guasare, mtbe

Cluster 2 (Nation, Government, Administration): britain, competition, representative, legislators, opposed, broad, aim, disputes, differentials, respond, presidential, diplomatic, hascompleted, ministerial, allies, lawsuits, meets

Cluster 3 (World-wide, International trading): world, set, export, sources, community, london, france, overseas, venezuela, german, switzerland, luxembourg, norway, cuba, irans, soy, zimbabwes, producing, quotas

Page 22: B3-3

Results

Results of text categorization

– Compared with CUR, SVD performs better in categorization.

Topic      Training  Test  | k-NN (Euclidean)                 | k-NN (Cosine)
                           | none    SVD     CUR-C   CUR-CU   | none    SVD     CUR-C   CUR-CU
earn       2673      1040  | 86.11%  96.89%  94.20%  88.31%   | 94.66%  97.00%  95.26%  87.21%
acq        1435      620   | 80.60%  91.59%  84.66%  70.98%   | 91.05%  92.07%  87.63%  72.48%
crude      223       98    | 67.52%  83.42%  69.77%  37.09%   | 88.32%  81.00%  77.78%  47.40%
trade      225       73    | 57.85%  75.64%  62.50%  52.17%   | 81.21%  80.82%  64.56%  57.47%
money-fx   176       69    | 40.34%  67.11%  45.07%  38.46%   | 78.43%  74.21%  58.39%  47.89%
interest   140       57    | 59.02%  74.78%  58.73%  44.68%   | 77.78%  79.28%  64.86%  43.48%
ship       107       35    | 22.73%  45.71%  40.78%  17.28%   | 64.41%  55.00%  48.65%  27.12%
sugar      90        24    | 82.35%  76.92%  58.82%  40.82%   | 72.13%  77.55%  64.29%  54.55%
coffee     89        21    | 59.65%  89.36%  62.86%  29.27%   | 90.91%  95.45%  66.67%  46.15%
gold       70        20    | 35.71%  95.00%  23.08%  -        | 91.89%  94.74%  33.33%  22.22%
Average                    | 59.19%  79.64%  60.05%  46.56%   | 83.08%  82.71%  66.14%  50.60%

(# of records per topic; cells are F-measures.)

Page 23: B3-3

Conclusion


– CUR matrix decomposition as a new approach

• To overcome cons of SVD

• Composing real columns and rows from original dataset

– Can CUR be a good alternative to SVD?

• LSA

• Good: we can find some good semantic clusters with CUR

• But terms whose vectors contain only zero values cannot be clustered

• How should LSA be evaluated?

• Text categorization

• Not good. SVD outperforms CUR

Future works

– More experiments by diversifying parameters of CUR

– Use a TF-IDF matrix in place of the raw term-frequency matrix


Page 24: B3-3

References


• Bing Liu, 'Web Data Mining', Springer, Dec 2006

• Michael W. Mahoney and Petros Drineas, 'CUR matrix decompositions for improved data analysis', PNAS, 2009

• P. Drineas et al., 'Relative-Error CUR Matrix Decompositions', SIAM J. Matrix Analysis and Applications, 30, 844-881, 2008

• pymf (Python Matrix Factorization Module), http://code.google.com/p/pymf/source/browse/trunk/lib/pymf/cur.py?r=34

• Edda Leopold and Jörg Kindermann, 'Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?', Machine Learning, 46, 423-444, 2002

• Thorsten Joachims, 'Text categorization with Support Vector Machines: Learning with many relevant features', Lecture Notes in Computer Science, Volume 1398, 137-142, 1998

Page 25: B3-3

End of Document


Thank you