Latent Semantic Analysis by CUR matrix decomposition
Nov 1, 2012
Presentation
KORMS 2012 Fall
Minhoe Hur, Seokho Kang, Sungzoon Cho
Seoul National University
Preface

Abstract
Latent Semantic Analysis (LSA) uses the term-frequency (TF) matrix, which records how often each word appears in each document, under the assumption that there are latent associations among words or among documents that are not apparent on the surface. It is a methodology that finds clusters of highly related documents or words by reducing the dimensionality of the original data. CUR matrix decomposition is a matrix factorization method proposed to overcome the limitations of the conventional Singular Value Decomposition (SVD), and it has recently been applied in a variety of settings. In this study, we examine the applicability of CUR to LSA, and we discuss the strengths and weaknesses of CUR through a comparative analysis with conventional SVD-based LSA.

Acknowledgement
This work was supported by the second stage of the Brain Korea 21 Project in 2012; the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0030814); and the Engineering Research Institute of SNU.
KORMS 2012, Minhoe Hur
Contents
Research Background
– Latent Semantic Analysis(LSA)
– Singular Value Decomposition(SVD)
– CUR matrix decomposition
Experiment data
Method
– LSA
– Text categorization
Results
– LSA with SVD and CUR
– Document classification with SVD and CUR
Conclusion
Research Background
Latent Semantic Analysis (LSA, or LSI)
– A concept can be described in multiple ways
• Synonyms, people’s language habits, different languages
e.g., mobile, cellular phone, mobile phone, …
• Causes low recall in information retrieval / high dimensionality
– Approach
• Assumption: there is some underlying latent semantic structure in the data
• By using a statistical technique, SVD, the latent structure can be estimated.
– Latent structure in the text
• ‘Hidden concept space’: a transformed space
• Associates syntactically different but semantically similar terms
[Diagram: terms T1 … Tm linked to documents D1 … Dn, directly (left) and through latent dimensions L1 … Lk (right)]
< Term-Document relationship > < Term-Document and latent space >
Research Background
LSA by Singular Value Decomposition(SVD)
Let A be the text collection matrix, of size m × n, where the number of distinct words is m and the number of documents is n:

A ∈ R^{m×n}, where each entry A_{i,j} is the number of occurrences of term i in document j
(rows t1, t2, …, tm index terms; columns d1, d2, …, dn index documents)

What SVD does is factorize the matrix A into the product of three matrices:

A_{m×n} = U_{m×r} Σ_{r×r} V_{r×n}^T

• U_{m×r}: left singular vectors, associated with the r eigenvectors of AA^T; U^T U = I
• V_{n×r}: right singular vectors, associated with the r eigenvectors of A^T A; V^T V = I
• Σ_{r×r}: diagonal (square) matrix composed of non-negative singular values,
  Σ = diag(σ1, σ2, …, σr), where σ1 ≥ σ2 ≥ … ≥ σr > 0
Research Background
LSA by Singular Value Decomposition(SVD) (cont’d)
– Eliminating unnecessary rows/columns
• r = rank(A) ≤ min(#rows, #columns) = number of independent rows/columns
• If two or more terms or documents have the same vector, r is reduced.
– Dimension reduction
• Delete insignificant dimensions in the transformed space
• Significance of a dimension = magnitude of its singular value

A_{m×n} = U_{m×r} Σ_{r×r} V_{r×n}^T  →  A_{m×n} ≈ A_k = U_{m×k} Σ_{k×k} V_{k×n}^T  (k < r)
Research Background
LSA by Singular Value Decomposition(SVD) (cont’d)
– Choosing the optimal k in dimension reduction
• Consider the singular values σ1, …, σr on the diagonal:

  Significance(k) = (σ1 + σ2 + … + σk) / (σ1 + σ2 + … + σr)   (fraction of total variance)

• When dim = 50, we can explain about 90% of the total variability of the original data!

[Plot: total variance (%) explained vs. number of dimensions]
Research Background
LSA by Singular Value Decomposition(SVD) (cont’d)
– Matrix A_k with various k values

[Figure: A_k visualized at four dimensionalities]
Dim = 104: total var 100%
Dim = 80:  total var 98.81%
Dim = 50:  total var 88.78%
Dim = 20:  total var 66.05%
Research Background
LSA by Singular Value Decomposition(SVD) (cont’d)
– Finding similar terms in the latent space
• The goal of LSA is to find semantically similar terms in the term-document matrix.
• Clustering methods can be used in the new (transformed) space.

A_{m×n} ≈ U_{m×k} Σ_{k×k} V_{k×n}^T

[Figure: for 2 dimensions, documents d1, d2 plotted on the axes of the original space and as d1', d2' on the axes of the transformed space]
< Hierarchical clustering of terms > (example terms: car, motor, computer, interface, tree)
Research Background
Singular Value Decomposition (SVD)
For a given A ∈ R^{m×n}, there exist orthogonal matrices
U = [u1 u2 … um] ∈ R^{m×m} and V = [v1 v2 … vn] ∈ R^{n×n} such that

U^T A V = Σ = diag(σ1, …, σ_{min{m,n}}) ∈ R^{m×n},  σ1 ≥ σ2 ≥ σ3 ≥ … ≥ σ_{min{m,n}} ≥ 0

If k ≤ r = rank(A) and we define

A_k = Σ_{t=1}^{k} σ_t u_t v_t^T

then we get the best rank-k approximation to the data matrix A:

min_{X ∈ R^{m×n}, rank(X) = k} ‖A − X‖_F^2 = ‖A − A_k‖_F^2

where ‖A‖_F^2 = Σ_{i,j} A_{i,j}^2 is the squared Frobenius norm.

< First two eigenvectors in scatter plot >
Research Background
Limitations of SVD analysis
– Lack of interpretability (not informative)
– Not ‘things’ and ‘physical reality’
– Requires intimate knowledge of the field from which the data are drawn
– Cannot explain some data patterns well

e.g., eigenvector 1 = 0.5 × age + 1.43 × height − 0.43 × income + 1.2 × footsize + …

< First two eigenvectors in scatter plot: bad cases >
Research Background
We desire a decomposition that
– has provable worst-case optimality and algorithmic properties
– has a natural statistical interpretation associated with its construction
– performs well in practice
CUR matrix decomposition
– Expressed in terms of a small number of actual rows/columns of the original data
• C: consists of a small number of actual columns of A
• R: consists of a small number of actual rows of A
• U: carefully constructed matrix that guarantees that C·U·R is close to A
– C and R can be used in place of the eigencolumns and eigenrows
Research Background
Constructing C (and R)
– Calculate an ‘importance (statistical leverage) score’ for each column of A
– Randomly sample a small number of columns from A, using those scores as an importance-sampling probability distribution

ColumnSelect procedure
– 1. Compute the top k right singular vectors of A and the statistical leverage scores
     π_j = (1/k) Σ_{ξ=1}^{k} (v_j^ξ)^2,
     where v_j^ξ is the j-th coordinate of the ξ-th right singular vector.
– 2. Keep the j-th column of A with probability p_j = min{1, c·π_j} for all j ∈ {1, …, n}, where c = O(k log k / ε²).
– 3. Return the matrix C consisting of the selected columns of A.
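The ColumnSelect procedure above can be sketched as follows. This is an illustrative toy, not the authors' code: the matrix is random, and `c` is just a small constant here rather than the O(k log k / ε²) value from the analysis.

```python
import numpy as np

def column_select(A, k, c, rng):
    """Keep column j of A with probability p_j = min(1, c * pi_j), where
    pi_j = (1/k) * sum of the squared j-th coordinates of the top-k right
    singular vectors (the statistical leverage score of column j)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    leverage = np.sum(Vt[:k, :] ** 2, axis=0) / k   # pi_j; sums to 1 over j
    p = np.minimum(1.0, c * leverage)
    kept = np.flatnonzero(rng.random(A.shape[1]) < p)
    return A[:, kept], kept

rng = np.random.default_rng(0)
A = rng.random((10, 12))
C, kept = column_select(A, k=2, c=6, rng=rng)
print(C.shape, kept)
```

Because the rows of V^T are orthonormal, the leverage scores always sum to 1, so they form a valid sampling distribution.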
Research Background
CUR matrix decomposition
– For an m × n matrix A, a rank parameter k, and an error parameter ε:
• 1. Run ColumnSelect on A with c = O(k log k / ε²) to choose columns of A and construct the matrix C.
• 2. Run ColumnSelect on A^T with r = O(k log k / ε²) to choose columns of A^T (i.e., rows of A) and construct the matrix R.
• 3. Define the matrix U as U = C⁺ A R⁺, where X⁺ is the Moore-Penrose generalized inverse of the matrix X.
– By the CUR decomposition, we expect that ‖A − CUR‖_F ≤ (2 + ε) ‖A − A_k‖_F.
– Example (k = 2, ε = 0.4): A ≈ C × U × R = A'

A (4×3):        C (4×5, sampled columns,   R (5×3, sampled rows,
[ 1  2  3]      with repeats):             with repeats):
[ 4  5  6]      [ 1  1  2  3  3]           [ 4  5  6]
[ 7  8  9]      [ 4  4  5  6  6]           [ 7  8  9]
[10 11 12]      [ 7  7  8  9  9]           [ 7  8  9]
                [10 10 11 12 12]           [10 11 12]
                                           [10 11 12]
U (5×5):
[-0.38 -0.11 -0.11  0.17  0.17]
[-0.38 -0.11 -0.11  0.17  0.17]
[-0.02  0.00  0.00  0.02  0.01]
[ 0.32  0.10  0.10 -0.14 -0.13]
[ 0.32  0.10  0.10 -0.14 -0.13]

A' = C U R:
[ 1  2  3]
[ 4  5  6]
[ 7  8  9]
[10 11 12]
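A compact end-to-end sketch of these three steps. Two deviations from the slide are worth flagging: the column/row choice here keeps the highest-leverage indices deterministically (instead of random sampling) so the example is reproducible, and the test matrix is an invented exactly-rank-2 product, for which CUR reconstructs A essentially exactly.

```python
import numpy as np

def cur_decompose(A, k, c, r):
    """CUR sketch: pick c columns and r rows by top-k leverage scores
    (deterministic top-score variant of ColumnSelect, for reproducibility),
    then set U = C^+ A R^+ using Moore-Penrose pseudoinverses."""
    def top_leverage(M, n_keep):
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        lev = np.sum(Vt[:k, :] ** 2, axis=0) / k
        return np.sort(np.argsort(lev)[::-1][:n_keep])

    cols = top_leverage(A, c)      # actual columns of A
    rows = top_leverage(A.T, r)    # actual rows of A
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(0)
A = rng.random((20, 2)) @ rng.random((2, 15))   # exactly rank-2 test matrix
C, U, R = cur_decompose(A, k=2, c=4, r=4)

rel_err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
print(rel_err)   # near 0: C spans col(A) and R spans row(A) here
```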
Research Background
Finding Latent Semantics
– We can follow the same LSA procedure that we used with SVD.
– For a term-document term-frequency matrix A ∈ R^{m×n}:

  SVD:                A = U Σ V^T,   D' = U Σ
  CUR decomposition:  A ≈ C U R,     D' = C U,  where U = C⁺ A R⁺

– For each term (row) in D', do hierarchical clustering with the ‘new features’ generated by C U.
– Find the semantic meaning of each cluster.
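The clustering step might look like this with SciPy, on toy data: with SVD the term features are the rows of UΣ, and the CUR analogue simply swaps in the rows of CU. The matrix, k, and cluster count below are all placeholders, not the experiment's values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
A = rng.random((40, 30))              # toy term-document matrix (40 terms)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
term_features = U[:, :k] * s[:k]      # D' = U_k Sigma_k: one row per term

# Average-linkage hierarchical clustering with cosine distance,
# cut into at most 8 flat clusters.
Z = linkage(term_features, method='average', metric='cosine')
labels = fcluster(Z, t=8, criterion='maxclust')
print(labels.shape, int(labels.max()))
```

Each resulting cluster would then be inspected by hand to assign a semantic label, as in the Results slides.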
Experimental data
Data
– Reuters-21578 Text Collection dataset, as a term-frequency matrix
• News corpus containing 8,293 documents and 65 categories
• Number of terms: 18,933
• Based on Reuters news from 1987
• Categories are manually assigned
• Widely used in text categorization studies
– For text categorization,
• Training : test = 5,946 : 2,347
Method
Latent Semantic Analysis
– SVD with a Python program
• k = 200
• With the transformed matrix U Σ, we did hierarchical clustering for all terms
• # of clusters: 65
– CUR with a Python program
• k = 100, ε = 0.01
• With the transformed matrix C U, we did hierarchical clustering for all terms
• # of clusters: 65
Text categorization
– Categorization method: k-NN (k = 1, 5, 10, 20)
• Distance measure: Euclidean, cosine
• Evaluation criterion: F-measure = 2 × Precision × Recall / (Precision + Recall)
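For concreteness, the evaluation criterion above is just the harmonic mean of precision and recall:

```python
def f_measure(precision: float, recall: float) -> float:
    """F-measure = 2 * P * R / (P + R), the criterion used in the experiments."""
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.8, 0.6))   # about 0.6857
```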
Method
Text categorization (cont’d)
– SVD with a Matlab program
• For the training dataset, decompose the matrix with k = 50, 100, 200, 500
• For the test dataset: A_{test,k} = A_test^T U_k Σ_k^{-1},
  where U_k and Σ_k are obtained in the training-step decomposition and
  A_{test,k} is the transformed matrix from the test set.
  After we get A_{test,k}, try categorization against the training dataset.
– CUR with a Python program
• For the training dataset, decompose the matrix with k = 100, ε = 0.01
• For the test dataset: A_test ≈ C_test U_training R_test,
  where C_test and R_test are composed of the same column/row indices as C_training and R_training.
  After we get C_test U_training, try categorization against the training dataset.
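The SVD folding-in step can be sketched as below (toy matrices, invented labels; terms as rows, documents as columns). A useful sanity check is that folding in the training matrix itself recovers the rows of V_k exactly, since A^T U_k Σ_k^{-1} = V_k.

```python
import numpy as np

rng = np.random.default_rng(0)
A_train = rng.random((50, 30))        # terms x training documents
A_test = rng.random((50, 5))          # terms x test documents

U, s, Vt = np.linalg.svd(A_train, full_matrices=False)
k = 10
Uk, sk = U[:, :k], s[:k]

train_docs = Vt[:k, :].T              # rows of V_k: training docs in latent space
test_docs = A_test.T @ Uk / sk        # folding-in: A_test^T U_k Sigma_k^{-1}

# Sanity check: folding in the training matrix recovers V_k.
print(np.allclose(A_train.T @ Uk / sk, train_docs))   # True

# 1-NN categorization by cosine similarity (labels invented for illustration).
labels = rng.integers(0, 3, size=30)
norm = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
pred = labels[np.argmax(norm(test_docs) @ norm(train_docs).T, axis=1)]
print(pred.shape)
```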
Results
Results of Latent Semantics by SVD decomposition
– Cluster results
• # of clusters: 64
• Similarity measure: cosine similarity
• Clusters with fewer than 5 members were eliminated.

< LSA result by SVD > < LSA result by CUR >
Results
Results of Latent Semantics by SVD decomposition
– Cluster examples (selected)

Cluster 1 (Economy, Industry): sales, finance, half, fall, approved, committee, special, equipment, mining, reflected, stockholders, satisfactory, owner, submitted, recapitalization
Cluster 2 (Economy, Finance): prices, government, economic, reuters, amount, below, productivity, systems, economy, conditions, overfunded, country, continued, addition, governments, countrys, sharply, overall
Cluster 3 (Geology, World-nature): burned, following, late, majority, wash, cannon, started, mine, destroyed, discovery, delayed, unspecified, attack, lake
Results
Results of Latent Semantics by CUR decomposition
– Cluster examples (selected)

Cluster 1 (Science, Chemistry): additive, butane, carbozulia, zulia, constructed, carbone, cryogenic, guasare, mtbe
Cluster 2 (Nation, Government, Administration): britain, competition, representative, legislators, opposed, broad, aim, disputes, differentials, respond, presidential, diplomatic, hascompleted, ministerial, allies, lawsuits, meets
Cluster 3 (World-wide, International trading): world, set, export, sources, community, london, france, overseas, venezuela, german, switzerland, luxembourg, norway, cuba, irans, soy, zimbabwes, producing, quotas
Results
Results of text categorization
– Compared to CUR, SVD performs better in categorization.

             # of records    F-measure, k-NN (Euclidean)       F-measure, k-NN (Cosine)
Topic name   Training Test   none    SVD     CUR-C   CUR-CU    none    SVD     CUR-C   CUR-CU
earn         2673     1040   86.11%  96.89%  94.20%  88.31%    94.66%  97.00%  95.26%  87.21%
acq          1435     620    80.60%  91.59%  84.66%  70.98%    91.05%  92.07%  87.63%  72.48%
crude        223      98     67.52%  83.42%  69.77%  37.09%    88.32%  81.00%  77.78%  47.40%
trade        225      73     57.85%  75.64%  62.50%  52.17%    81.21%  80.82%  64.56%  57.47%
money-fx     176      69     40.34%  67.11%  45.07%  38.46%    78.43%  74.21%  58.39%  47.89%
interest     140      57     59.02%  74.78%  58.73%  44.68%    77.78%  79.28%  64.86%  43.48%
ship         107      35     22.73%  45.71%  40.78%  17.28%    64.41%  55.00%  48.65%  27.12%
sugar        90       24     82.35%  76.92%  58.82%  40.82%    72.13%  77.55%  64.29%  54.55%
coffee       89       21     59.65%  89.36%  62.86%  29.27%    90.91%  95.45%  66.67%  46.15%
gold         70       20     35.71%  95.00%  23.08%  -         91.89%  94.74%  33.33%  22.22%
Average                      59.19%  79.64%  60.05%  46.56%    83.08%  82.71%  66.14%  50.60%
Conclusion
Conclusion
– CUR matrix decomposition as a new approach
• To overcome the drawbacks of SVD
• Composed of real columns and rows from the original dataset
– Can CUR be a good alternative to SVD?
• LSA
• Good: we can find some good semantic clusters by CUR
• But data that have only ‘zero’ values in the vector cannot be clustered
• How should LSA be evaluated?
• Text categorization
• Not good: SVD outperforms CUR
Future works
– More experiments, diversifying the parameters of CUR
– TF-IDF matrix
References
• Bing Liu, ‘Web Data Mining’, Springer, Dec 2006
• Michael W. Mahoney and Petros Drineas, ‘CUR matrix decompositions for improved data analysis’, PNAS, 2009
• P. Drineas et al., ‘Relative-Error CUR Matrix Decompositions’, SIAM J. Matrix Analysis and Applications, 30, 844-881, 2008
• Pymf (Python Matrix Factorization Module), http://code.google.com/p/pymf/source/browse/trunk/lib/pymf/cur.py?r=34
• Edda Leopold and Jörg Kindermann, ‘Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?’, Machine Learning, 46, 423-444, 2002
• Thorsten Joachims, ‘Text categorization with Support Vector Machines: Learning with many relevant features’, Lecture Notes in Computer Science, 1398, 137-142, 1998
End of Document
Thank you