
Page 1: B3-3

Latent Semantic Analysis by CUR matrix decomposition

Nov 1, 2012

Presentation

KORMS 2012 Fall

Minhoe Hur, Seokho Kang, Sungzoon Cho

Seoul National University

Page 2: B3-3

Preface

Abstract

Latent Semantic Analysis (LSA) assumes that terms or documents carry latent associations that are not apparent on the surface, and finds them using a term-frequency (TF) matrix that records how often each word appears in each document. By reducing the dimensionality of the original data, it identifies clusters of highly related documents or terms. CUR matrix decomposition is a matrix factorization method proposed to overcome the limitations of the conventional Singular Value Decomposition (SVD), and it has recently been applied in a variety of settings. In this study, we examine the feasibility of using CUR for LSA, and we discuss the strengths and weaknesses of CUR through a comparative analysis against conventional SVD-based LSA.

Acknowledgement

This work was supported by the second stage of the Brain Korea 21 Project in 2012, by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0030814), and by the Engineering Research Institute of SNU.

Page 3: B3-3

Contents

Research Background

– Latent Semantic Analysis(LSA)

– Singular Value Decomposition(SVD)

– CUR matrix decomposition

Experimental data

Method

– LSA

– Text categorization

Results

– LSA with SVD and CUR

– Document classification with SVD and CUR

Conclusion


Page 4: B3-3

Research Background

Latent Semantic Analysis(LSA, or LSI)

– A concept can be described in multiple ways

• Synonyms, people's language habits, different languages

ex) mobile, cellular phone, mobile phone …

• Causes low recall in information retrieval and high dimensionality

– Approach

• Assumption: there is some underlying latent semantic structure in the data

• By using a statistical technique, SVD, this latent structure can be estimated.

– Latent structure in the text

• ‘hidden concept space’: transformed space

• Associates syntactically different but semantically similar terms


< Term-Document relationship vs. Term-Document mediated by a latent space: terms T1…Tm connect to documents D1…Dn directly on the left, and through latent dimensions L1…Lk on the right >

Page 5: B3-3

Research Background

LSA by Singular Value Decomposition(SVD)

Matrix A

Let $A$ be the text collection matrix, m × n, where the number of distinct terms is m and the number of documents is n; each entry $A_{i,j}$ is the number of occurrences of term i in document j.

What SVD does is factorize the matrix $A$ into the product of three matrices:

$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V_{r\times n}^{T}$$

• $U_{m\times r}$: left singular vectors, associated with the r eigenvectors of $AA^{T}$; $U^{T}U = I$

• $V_{n\times r}$: right singular vectors, associated with the r eigenvectors of $A^{T}A$; $V^{T}V = I$

• $\Sigma_{r\times r}$: diagonal (square) matrix of non-negative singular values, $\Sigma_r = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$

Page 6: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Eliminating unnecessary rows/columns

• $r$ = rank of the matrix $A$ = the number of linearly independent rows/columns, so $r \le \min(\text{rows}, \text{columns})$

• If two or more terms or documents share the same vector, r is reduced.

– Dimension reduction

• Delete insignificant dimensions in the transformed space

• Significance of a dimension = magnitude of its singular value

$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V_{r\times n}^{T} \;\approx\; U_{m\times k}\,\Sigma_{k\times k}\,V_{k\times n}^{T} = A_k$$

Page 7: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Choosing optimal k in dimension reduction

• By considering the singular values on the diagonal:

$$\mathrm{Significance}(k) = \frac{\sigma_1 + \sigma_2 + \cdots + \sigma_k}{\sum_{i=1}^{r} \sigma_i} \quad (\approx \text{share of total variance})$$

< Plot: cumulative total variance (%) against the number of dimensions kept >

When dim = 50, we can explain 90% of the total variability of the original data!
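A minimal sketch of this selection rule, assuming numpy; the choose_k helper and the random test matrix are illustrative, not taken from the original experiments.

import numpy as np

def choose_k(singular_values, target=0.90):
    """Smallest k whose cumulative singular-value share reaches target."""
    share = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(share, target) + 1)

s = np.linalg.svd(np.random.rand(100, 60), compute_uv=False)
k = choose_k(s, target=0.90)
print(f"dim = {k} reaches 90% of the cumulative singular-value mass")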

Page 8: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Matrix Ak with various k values

< Reconstructions $A_k$ at dim = 104 (total var: 100%), dim = 80 (total var: 98.81%), dim = 50 (total var: 88.78%), and dim = 20 (total var: 66.05%) >

Page 9: B3-3

Research Background

LSA by Singular Value Decomposition(SVD) (cont’d)

– Finding similar terms in latent space

• The goal of LSA is to find semantically similar terms in the term-document matrix

• Clustering methods can be used in the new (transformed) space:

$$A_{m\times n} \approx U_{m\times k}\,\Sigma_{k\times k}\,V_{k\times n}^{T}$$

< Original vs. transformed space, for 2 dimensions: documents d1 and d2 plotted on the axes of the original space become d1′ and d2′ on the axes of the transformed space >

< Hierarchical clustering of terms: car, motor, computer, interface, tree >

Page 10: B3-3

Research Background

Singular Value Decomposition (SVD)

For a given $A \in \mathbb{R}^{m\times n}$, there exist orthogonal matrices $U = [u_1\,u_2 \ldots u_m] \in \mathbb{R}^{m\times m}$ and $V = [v_1\,v_2 \ldots v_n] \in \mathbb{R}^{n\times n}$ such that

$$U^{T}AV = \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_{\min\{m,n\}}) \in \mathbb{R}^{m\times n}, \qquad \sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge 0$$

If $k \le r = \mathrm{rank}(A)$ and we define $A_k = \sum_{t=1}^{k} \sigma_t u_t v_t^{T}$, then we get the best rank-k approximation to the data matrix A:

$$\|A - A_k\|_F^2 = \min_{X \in \mathbb{R}^{m\times n}:\ \mathrm{rank}(X) \le k} \|A - X\|_F^2, \qquad \text{where } \|A\|_F^2 = \sum_{ij} A_{ij}^2 \text{ (Frobenius norm)}$$

< First two eigenvectors in scatter plot >

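A small numeric check of this best-approximation property, assuming numpy; the random matrices are placeholders.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD
err_opt = np.linalg.norm(A - A_k, "fro")

# any other rank-k candidate does no better
X = rng.random((8, k)) @ rng.random((k, 6))
print(err_opt, "<=", np.linalg.norm(A - X, "fro"))

# the optimal error equals the mass of the discarded singular values
assert np.isclose(err_opt, np.sqrt(np.sum(s[k:] ** 2)))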

Page 11: B3-3

Research Background

Limitations of SVD analysis

– Lack of interpretability (not informative)

– Singular vectors are linear combinations of all inputs, not 'things' with a 'physical reality'

– Interpreting them requires an intimate knowledge of the field from which the data are drawn

– Cannot explain some data patterns well

e.g.) eigenvector 1 = 0.5 * age + 1.43 * height – 0.43 * income + 1.2 * footsize + …

< First two eigenvectors in scatter plot: bad cases >


Page 12: B3-3

Research Background

We desire a decomposition that

– Should have provable worst-case optimality and algorithmic properties

– Should have a natural statistical interpretation associated with its construction

– Should perform well in practice

CUR matrix decomposition

– Expressed in terms of a small number of actual rows/columns of original data

• C: consists of a small number of actual columns of A

• R: consists of a small number of actual rows of A

• U: carefully constructed matrix that guarantees that C*U*R is close to A

– C and R can be used in place of the eigencolumns and eigenrows


Page 13: B3-3

Research Background

Constructing C( and R )

– Calculate ‘Importance(statistical leverage) score’ for each column of A

– Randomly sample small number of columns from A by using that score as an importance sampling probability distribution

ColumnSelect procedure

– 1. Compute the top-k right singular vectors of A and the statistical leverage scores

$$\pi_j = \frac{1}{k} \sum_{\xi=1}^{k} \left(v_j^{\xi}\right)^2$$

where $v_j^{\xi}$ is the jth coordinate of the $\xi$th right singular vector.

– 2. Keep the jth column of A with probability $p_j = \min\{1,\, c\,\pi_j\}$ for all $j \in \{1, \ldots, n\}$, where $c = O(k \log k / \epsilon^2)$

– 3. Return the matrix C consisting of the selected columns of A (a sketch of the procedure follows below)

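A minimal sketch of the ColumnSelect procedure above, assuming numpy; the column_select name and the toy matrix are illustrative.

import numpy as np

def column_select(A, k, c, seed=0):
    """Sample columns of A with probability min(1, c * leverage score)."""
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # leverage score of column j: mean squared jth coordinate
    # over the top-k right singular vectors
    pi = np.sum(Vt[:k, :] ** 2, axis=0) / k
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * pi)
    idx = np.flatnonzero(keep)
    return idx, A[:, idx]

A = np.random.rand(20, 12)
idx, C = column_select(A, k=3, c=8)
print("selected columns:", idx)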

Page 14: B3-3

Research Background

CUR matrix decomposition

– For an m × n matrix A, a rank parameter k, and an error parameter $\epsilon$:

• 1. Run ColumnSelect on A with $c = O(k \log k / \epsilon^2)$ to choose columns of A and construct the matrix C

• 2. Run ColumnSelect on $A^{T}$ with $r = O(k \log k / \epsilon^2)$ to choose columns of $A^{T}$ and construct the matrix R

• 3. Define the matrix U as $U = C^{+} A R^{+}$, where $X^{+}$ is a Moore-Penrose generalized inverse of the matrix X

– And by CUR decomposition, we expect that

$$\|A - CUR\|_F \le (2 + \epsilon)\,\|A - A_k\|_F$$

– Example (k = 2, epsilon = 0.4):

A (4 × 3):
[  1  2  3 ]
[  4  5  6 ]
[  7  8  9 ]
[ 10 11 12 ]

C (4 × 5, actual columns of A, with repeats):
[  1.  1.  2.  3.  3. ]
[  4.  4.  5.  6.  6. ]
[  7.  7.  8.  9.  9. ]
[ 10. 10. 11. 12. 12. ]

U (5 × 5, $U = C^{+}AR^{+}$):
[ -0.38 -0.11 -0.11  0.17  0.17 ]
[ -0.38 -0.11 -0.11  0.17  0.17 ]
[ -0.02  0.00  0.00  0.02  0.01 ]
[  0.32  0.10  0.10 -0.14 -0.13 ]
[  0.32  0.10  0.10 -0.14 -0.13 ]

R (5 × 3, actual rows of A, with repeats):
[  4.  5.  6. ]
[  7.  8.  9. ]
[  7.  8.  9. ]
[ 10. 11. 12. ]
[ 10. 11. 12. ]

A ≈ C × U × R = A′:
[  1.  2.  3. ]
[  4.  5.  6. ]
[  7.  8.  9. ]
[ 10. 11. 12. ]

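A minimal sketch of the whole construction applied to the toy matrix above, assuming numpy; it reuses the leverage-score sampling from ColumnSelect, and the helper names are illustrative.

import numpy as np

def column_select(A, k, c, rng):
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    pi = np.sum(Vt[:k, :] ** 2, axis=0) / k          # leverage scores
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * pi)
    return A[:, np.flatnonzero(keep)]

def cur(A, k, eps, seed=0):
    rng = np.random.default_rng(seed)
    c = int(k * np.log(k + 1) / eps ** 2) + 1        # ~ O(k log k / eps^2)
    C = column_select(A, k, c, rng)                  # actual columns of A
    R = column_select(A.T, k, c, rng).T              # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)    # U = C^+ A R^+
    return C, U, R

A = np.arange(1.0, 13.0).reshape(4, 3)               # the 4 x 3 example matrix
C, U, R = cur(A, k=2, eps=0.4)
print(np.round(C @ U @ R, 2))                        # close to A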

Page 15: B3-3

Research Background

Finding Latent Semantics

– We can follow the same LSA procedure that we used with SVD.

– Compared side by side, for a term-document term-frequency matrix $A \in \mathbb{R}^{m\times n}$:

SVD: $A' = U\Sigma V^{T}$, transformed terms $D' = U\Sigma$

CUR decomposition: $A' = CUR$, where $U = C^{+}AR^{+}$, transformed terms $D' = CU$

– For each term in D′, do hierarchical clustering with the 'new features' generated by CU (a sketch follows below)

– Find semantic meanings for each cluster.
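A minimal sketch of this clustering step, assuming numpy and scipy. To keep it self-contained, uniform column/row sampling stands in for ColumnSelect, and the data are random placeholders.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
A = rng.random((50, 30))                     # 50 terms x 30 documents

# crude CUR stand-in: uniform sampling instead of leverage scores
cols = rng.choice(30, size=10, replace=False)
rows = rng.choice(50, size=10, replace=False)
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

D_prime = C @ U                              # one feature row per term
Z = linkage(D_prime, method="average", metric="cosine")
labels = fcluster(Z, t=5, criterion="maxclust")
print("cluster label of each term:", labels)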

Page 16: B3-3

Experimental data

Data

– Reuters-21578 Text Collection dataset, as a term-frequency matrix

• News corpus contains 8,293 documents and 65 categories

• Number of terms: 18,933

• Based on Reuters news in 1987

• Categories are manually assigned

• Widely used in text categorization studies

– For text categorization,

• Training: test = 5,946 : 2,347


Page 17: B3-3

Method

Latent Semantic Analysis

– SVD with Python program

• k = 200

• With the transformed matrix UΣ, we did hierarchical clustering for all terms

• # of clusters: 65

– CUR with Python program

• k = 100, epsilon = 0.01

• With the transformed matrix CU, we did hierarchical clustering for all terms

• # of clusters: 65

Text categorization

– Categorization method: k-NN(k = 1, 5, 10, 20)

• Distance measure: Euclidean, Cosine

• Evaluation criteria: F-measure(2 * Precision * Recall/(Precision + Recall))

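A minimal sketch of this evaluation loop, with scikit-learn standing in for the original Matlab/Python programs; the random features and labels are placeholders for the transformed Reuters data.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 50)), rng.integers(0, 3, 200)
X_test, y_test = rng.random((80, 50)), rng.integers(0, 3, 80)

for k in (1, 5, 10, 20):
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    # F-measure = 2 * Precision * Recall / (Precision + Recall), per class
    print(k, np.round(f1_score(y_test, pred, average=None), 3))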

Page 18: B3-3

Method

Text categorization(cont’d)

– SVD with Matlab program

• For the training dataset, decompose the matrix with k = 50, 100, 200, 500

• For the test dataset: $A_{test,k} = A_{test}^{T}\,U_k\,\Sigma_k^{-1}$, where $U_k$ and $\Sigma_k$ were decomposed in the training step and $A_{test,k}$ is the transformed matrix from the test set. After we get $A_{test,k}$, try categorization against the training dataset.

– CUR with Python program

• For the training dataset, decompose the matrix with k = 100, epsilon = 0.01

• For the test dataset: $A_{test} \approx C_{test}\,U_{test}\,R_{training}$, where $C_{test}$ and $R_{test}$ are composed of the same indices as $C_{training}$ and $R_{training}$. After we get $C_{test}\,U_{training}$, try categorization against the training dataset. (A sketch of both folding-in rules follows below.)
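A minimal sketch of both folding-in rules, assuming numpy. It takes documents as rows and terms as columns, so that 'same indices' means the term columns chosen on the training set; uniform sampling again stands in for ColumnSelect.

import numpy as np

rng = np.random.default_rng(0)
A_train = rng.random((200, 100))             # training documents x terms
A_test = rng.random((80, 100))               # test documents x terms

# SVD folding-in: project test documents into the training latent space
U, s, Vt = np.linalg.svd(A_train, full_matrices=False)
k = 10
A_test_k = A_test @ Vt[:k, :].T @ np.diag(1.0 / s[:k])

# CUR folding-in: reuse the training column/row choices and U
cols = rng.choice(100, size=15, replace=False)
rows = rng.choice(200, size=15, replace=False)
C, R = A_train[:, cols], A_train[rows, :]
U_cur = np.linalg.pinv(C) @ A_train @ np.linalg.pinv(R)

C_test = A_test[:, cols]                     # same term indices as training
features_test = C_test @ U_cur               # one feature row per test document
print(A_test_k.shape, features_test.shape)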

Page 19: B3-3

Results

Results of latent semantics by SVD and CUR decomposition

– Cluster results

• # of clusters: 64

• Cosine similarity

• Clusters that have fewer than 5 members were eliminated.

< LSA result by SVD > < LSA result by CUR >

Page 20: B3-3

Results

Results of Latent Semantics by SVD decomposition

– Cluster examples(selected)

Cluster 1 (Economy, Industry): sales, finance, half, fall, approved, committee, special, equipment, mining, reflected, stockholders, satisfactory, owner, submitted, recapitalization

Cluster 2 (Economy, Finance): prices, government, economic, reuters, amount, below, productivity, systems, economy, conditions, overfunded, country, continued, addition, governments, countrys, sharply, overall

Cluster 3 (Geology, World-nature): burned, following, late, majority, wash, cannon, started, mine, destroyed, discovery, delayed, unspecified, attack, lake

Page 21: B3-3

Results

Results of Latent Semantics by CUR decomposition

– Cluster examples(selected)

Cluster 1 (Science, Chemistry): additive, butane, carbozulia, zulia, constructed, carbone, cryogenic, guasare, mtbe

Cluster 2 (Nation, Government, Administration): britain, competition, representative, legislators, opposed, broad, aim, disputes, differentials, respond, presidential, diplomatic, hascompleted, ministerial, allies, lawsuits, meets

Cluster 3 (World-wide, International trading): world, set, export, sources, community, london, france, overseas, venezuela, german, switzerland, luxembourg, norway, cuba, irans, soy, zimbabwes, producing, quotas

Page 22: B3-3

Results

Results of text categorization

– Compared with CUR, SVD performs better in categorization.

Topic      Training  Test  | k-NN (Euclidean)                 | k-NN (Cosine)
                           | none    SVD     CUR-C   CUR-CU   | none    SVD     CUR-C   CUR-CU
earn       2673      1040  | 86.11%  96.89%  94.20%  88.31%   | 94.66%  97.00%  95.26%  87.21%
acq        1435      620   | 80.60%  91.59%  84.66%  70.98%   | 91.05%  92.07%  87.63%  72.48%
crude      223       98    | 67.52%  83.42%  69.77%  37.09%   | 88.32%  81.00%  77.78%  47.40%
trade      225       73    | 57.85%  75.64%  62.50%  52.17%   | 81.21%  80.82%  64.56%  57.47%
money-fx   176       69    | 40.34%  67.11%  45.07%  38.46%   | 78.43%  74.21%  58.39%  47.89%
interest   140       57    | 59.02%  74.78%  58.73%  44.68%   | 77.78%  79.28%  64.86%  43.48%
ship       107       35    | 22.73%  45.71%  40.78%  17.28%   | 64.41%  55.00%  48.65%  27.12%
sugar      90        24    | 82.35%  76.92%  58.82%  40.82%   | 72.13%  77.55%  64.29%  54.55%
coffee     89        21    | 59.65%  89.36%  62.86%  29.27%   | 90.91%  95.45%  66.67%  46.15%
gold       70        20    | 35.71%  95.00%  23.08%  -        | 91.89%  94.74%  33.33%  22.22%
Average                    | 59.19%  79.64%  60.05%  46.56%   | 83.08%  82.71%  66.14%  50.60%

(# of records per topic; cells are F-measures.)

Page 23: B3-3

Conclusion


– CUR matrix decomposition as a new approach

• To overcome cons of SVD

• Composing real columns and rows from original dataset

– Can CUR be a good alternative to SVD?

• LSA

• Good: we can find some good semantic clusters with CUR

• But terms whose vectors contain only zero values cannot be clustered

• How should LSA be evaluated?

• Text categorization

• Not good. SVD outperforms CUR

Future works

– More experiments by diversifying parameters of CUR

– Use a TF-IDF matrix in place of the raw term-frequency matrix


Page 24: B3-3

References


• Bing Liu, 'Web Data Mining', Springer, Dec 2006

• Michael W. Mahoney and Petros Drineas, 'CUR matrix decompositions for improved data analysis', PNAS, 2009

• P. Drineas et al., 'Relative-Error CUR Matrix Decompositions', SIAM J. Matrix Analysis and Applications, 30, 844-881, 2008

• pymf (Python Matrix Factorization Module), http://code.google.com/p/pymf/source/browse/trunk/lib/pymf/cur.py?r=34

• Edda Leopold and Jörg Kindermann, 'Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?', Machine Learning, 46, 423-444, 2002

• Thorsten Joachims, 'Text categorization with Support Vector Machines: Learning with many relevant features', Lecture Notes in Computer Science, Volume 1398, 137-142, 1998

Page 25: B3-3

End of Document


Thank you