Information Retrieval through Various Approximate Matrix Decompositions
DESCRIPTION
Information Retrieval through Various Approximate Matrix Decompositions. Kathryn Linehan. Advisor: Dr. Dianne O'Leary. Information retrieval: extracting information from databases. We need an efficient way of searching large amounts of data (example: a web search engine).
Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O’Leary
Information Retrieval
Extracting information from databases
We need an efficient way of searching large amounts of data
Example: web search engine
Querying a Document Database
We want to return documents that are relevant to entered search terms
Given data:
• Term-Document Matrix, A. Entry (i, j): importance of term i in document j.
• Query Vector, q. Entry (i): importance of term i in the query.
Term-Document Matrix. Entry (i, j): weight of term i in document j.

Example:

Term     | Doc 1 | Doc 2 | Doc 3 | Doc 4
---------|-------|-------|-------|------
Mark     |  15   |   0   |   0   |   0
Twain    |  15   |   0   |  20   |   0
Samuel   |   0   |  10   |   5   |   0
Clemens  |   0   |  20   |  10   |   0
Purple   |   0   |   0   |   0   |  20
Fairy    |   0   |   0   |   0   |  15

Example taken from [5]
Query Vector. Entry (i): weight of term i in the query.

Example: search for "Mark Twain"

Term     | q
---------|---
Mark     | 1
Twain    | 1
Samuel   | 0
Clemens  | 0
Purple   | 0
Fairy    | 0

Example taken from [5]
Document Scoring

Scores are computed as s = A^T q. With the example term-document matrix and the query vector for "Mark Twain":

s = A^T q = (30, 0, 20, 0)^T

Scores: Doc 1 = 30, Doc 2 = 0, Doc 3 = 20, Doc 4 = 0.

Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not.

Example taken from [5]
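The scoring step is a single matrix-vector product. A minimal NumPy sketch, using the values of the example from [5] (variable names are illustrative):

```python
import numpy as np

# Term-document matrix; rows: Mark, Twain, Samuel, Clemens, Purple, Fairy;
# columns: Doc 1..Doc 4.
A = np.array([
    [15,  0,  0,  0],
    [15,  0, 20,  0],
    [ 0, 10,  5,  0],
    [ 0, 20, 10,  0],
    [ 0,  0,  0, 20],
    [ 0,  0,  0, 15],
])

q = np.array([1, 1, 0, 0, 0, 0])  # query "Mark Twain"
scores = A.T @ q                  # one score per document
print(scores)                     # Doc1=30, Doc2=0, Doc3=20, Doc4=0
```

Note that Doc 2, which uses "Samuel Clemens" rather than "Mark Twain", scores zero under literal term matching; this is the gap the matrix approximations are meant to close.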
Can we do better if we replace the matrix by an approximation?

• Singular Value Decomposition (SVD): A ≈ U Σ V^T
• Nonnegative Matrix Factorization (NMF): A ≈ W H
• CUR Decomposition: A ≈ C U R
Nonnegative Matrix Factorization (NMF)

A ≈ W * H, where A is m x n, W is m x k, and H is k x n.

• W and H are nonnegative
• rank(WH) <= k

Storage: k(m + n) entries
NMF

Multiplicative update algorithm of Lee and Seung found in [1]:
• Find W, H to minimize (1/2)||A - WH||_F^2
• Random initialization for W, H
• Gradient descent method
• Slow due to matrix multiplications in each iteration
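The multiplicative updates can be sketched as follows. This is a minimal NumPy implementation of the Lee-Seung update rules described in [1]; the iteration count, epsilon guard, and initialization are illustrative choices, not taken from the project code:

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||A - WH||_F^2.

    A must be nonnegative; W (m x k) and H (k x n) stay nonnegative
    because each update multiplies by a ratio of nonnegative matrices.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update H with W fixed
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update W with H fixed
    return W, H

# Small dense test matrix, mirroring the 5 x 3 validation setup:
A = np.random.default_rng(1).random((5, 3))
W, H = nmf_multiplicative(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Each iteration costs a handful of dense matrix products, which is the slowness noted above.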
NMF Validation

[Figure: relative error (Frobenius norm) vs. k, where rank(WH) = k and rank(SVD) <= k, for NMF and SVD. Left panel: A, a 5 x 3 random dense matrix, k = 1..3. Right panel: B, a 500 x 200 random sparse matrix, k = 40..200. Averages over 5 runs.]
NMF Validation

[Figure: relative error (Frobenius norm) vs. iteration number for NMF, with the SVD error shown for reference. B: 500 x 200 random sparse matrix. Rank(NMF) = 80.]
CUR Decomposition

A ≈ C U R, where A is m x n, C is m x c, U is c x r, and R is r x n.

• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) <= k, where k is a rank parameter

Storage: nz(C) + cr + nz(R) entries
CUR Implementations

• CUR algorithm in [3] by Drineas, Kannan, and Mahoney: a linear time algorithm
• Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
• Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
• Other modifications: our ideas
• Deterministic CUR code by G. W. Stewart [2]
Sampling

Column (row) norm sampling [3]:
• Prob(col j) = ||A(:, j)||_2^2 / ||A||_F^2 (similar for row i)

Subspace sampling [4]:
• Uses the rank-k SVD of A for column probabilities: Prob(col j) = ||V_{A,k}(j, :)||_2^2 / k
• Uses the "economy size" SVD of C for row probabilities: Prob(row i) = ||U_C(i, :)||_2^2 / c

Sampling without replacement
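Column norm sampling can be sketched as follows; this is a minimal illustration of the scheme in [3] (the function name and test matrix are our own, and the 1/sqrt(c * p_j) rescaling is the standard one that makes C C^T unbiased for A A^T):

```python
import numpy as np

def column_norm_sample(A, c, rng=None):
    """Sample c columns of A (with replacement), Prob(col j) proportional
    to ||A(:, j)||_2^2, rescaling each sampled column by 1/sqrt(c * p_j)."""
    rng = rng or np.random.default_rng(0)
    p = (A * A).sum(axis=0) / (A * A).sum()  # Prob(col j) = ||A(:,j)||^2 / ||A||_F^2
    cols = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, cols] / np.sqrt(c * p[cols])
    return C, cols

# Small test matrix with strictly positive entries so every p_j > 0:
A = np.random.default_rng(2).random((6, 8)) + 0.1
C, cols = column_norm_sample(A, c=4)
```

Row sampling for R is the same idea applied to A^T.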
Computation of U

• Linear U [3]: UR approximately solves min over X of ||A - CX||_F
• Optimal U: solves min over U of ||A - CUR||_F^2, with solution U = C^+ A R^+ = (C^T C)^+ C^T A R^T (R R^T)^+
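The optimal U is a direct pseudoinverse computation. A minimal sketch (the sanity check below uses an exactly rank-2 matrix, for which CUR with the optimal U reproduces A whenever C and R span its column and row spaces):

```python
import numpy as np

def optimal_u(A, C, R):
    """Optimal U = argmin_U ||A - C U R||_F, given by U = C^+ A R^+."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

# Sanity check on an exactly low-rank matrix:
rng = np.random.default_rng(3)
A = rng.random((8, 2)) @ rng.random((2, 6))  # rank-2, 8 x 6
C, R = A[:, :3], A[:3, :]                    # 3 columns, 3 rows of A
U = optimal_u(A, C, R)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

For sampled-and-rescaled C and R the reconstruction is only approximate, but the optimal U is still the Frobenius-norm minimizer for those factors.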
Deterministic CUR

Code by G. W. Stewart [2]:
• Uses an RRQR algorithm that does not store Q
• We only need the permutation vector, which gives us the columns (rows) for C (R)
• Uses an optimal U
Compact Matrix Decomposition (CMD) Improvement

Remove repeated columns (rows) in C (R). This decreases storage while still achieving the same relative error [6].

                   | [3]      | [3] with CMD
-------------------|----------|-------------
Runtime (sec)      | 0.008060 | 0.007153
Storage (entries)  | 880.5    | 550.5
Relative Error     | 0.820035 | 0.820035

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
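The deduplication step can be sketched as follows. This is an illustrative helper (not the CMD authors' code) relying on the fact that duplicate sampled columns of C are identical, so keeping one copy scaled by sqrt(multiplicity) preserves C C^T:

```python
import numpy as np

def cmd_dedupe(C, cols):
    """CMD-style dedupe [6]: collapse d copies of a repeated sampled column
    into one copy scaled by sqrt(d), so that C_small @ C_small.T == C @ C.T."""
    cols = np.asarray(cols)
    uniq, first, counts = np.unique(cols, return_index=True, return_counts=True)
    C_small = C[:, first] * np.sqrt(counts)  # one column per unique index
    return C_small, uniq

# C built from sampled column indices, with index 2 drawn three times:
rng = np.random.default_rng(4)
A = rng.random((5, 6))
cols = np.array([2, 2, 5, 0, 2])
C = A[:, cols]
C_small, uniq = cmd_dedupe(C, cols)
```

Here C shrinks from 5 columns to 3 while the Gram matrix C C^T (and hence the approximation quality) is unchanged, which is why the table above shows smaller storage at identical relative error.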
CUR: Sampling with Replacement Validation

[Figure: relative error (Frobenius norm) vs. c/r, the number of columns/rows sampled, for column norm sampling (CN) and subspace sampling (S), each with linear (L) and optimal (O) U; SVD error shown for reference. A: 5 x 3 random dense matrix. Average over 5 runs. Legend: Sampling, U.]
Sampling without Replacement: Scaling vs. No Scaling

In the scaled variant, the inverse of the sampling scaling factor is applied to (C^T C)_k in the computation of U.
CUR: Sampling without Replacement Validation

[Figure: relative error (Frobenius norm) vs. k, where rank(CUR) <= k and rank(SVD) <= k, for sampling without replacement with linear U, with and without scaling (w/o R,L,Sc and w/o R,L,w/o Sc); SVD error shown for reference. Left panel: A, a 5 x 3 random dense matrix, r = 2k, c = k. Right panel: B, a 500 x 200 random sparse matrix, c = 3k, r = k. Average over 5 runs. Legend: Sampling, U, Scaling.]
CUR Comparison

[Figure: relative error (Frobenius norm) vs. k, where rank(CUR) <= k and rank(SVD) <= k, comparing CN,L; CN,O; S,L; S,O; w/o R,L,w/o Sc; w/o R,L,Sc; D; and SVD. Left panel: r = c = 2k. Right panel: r = c = k. B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling.]
Judging Success: Precision and Recall

Measurement of performance for document retrieval. We report average precision and recall, where the average is taken over all queries in the data set.

Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant.

• Precision: P(Retrieved) = RetRel / Retrieved
• Recall: R(Retrieved) = RetRel / Relevant
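These definitions translate directly to code. A minimal sketch for a single query (the averaging over all queries would wrap this; names are illustrative):

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision and recall for one query.

    retrieved_ids: documents the system returned.
    relevant_ids:  ground-truth relevant documents for the query.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    ret_rel = len(retrieved & relevant)  # RetRel
    precision = ret_rel / len(retrieved) if retrieved else 0.0
    recall = ret_rel / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. system returns docs 1 and 3; docs 1, 2, 3 are relevant:
p, r = precision_recall([1, 3], [1, 2, 3])  # p = 1.0, r = 2/3
```

Precision rewards returning only relevant documents; recall rewards returning all of them, so both are plotted against the number of documents retrieved.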
LSI Results

Term-document matrix size: 5831 x 1033. All matrix approximations are rank-100 approximations (CUR: r = c = k). Average query time is less than 10^(-3) seconds for all matrix approximations.

[Figure: average precision and average recall vs. number of documents retrieved (0 to 1033) for SVD, NMF, CUR:cn,lin, CUR:cn,opt, CUR:sub,lin, CUR:sub,opt, CUR:w/oR,no, CUR:w/oR,yes, CUR:GWS, and LTM.]
LSI Results

[Figure: average precision and average recall vs. number of documents retrieved (5 to 50) for SVD, NMF, CUR:cn,lin, CUR:cn,opt, CUR:sub,lin, CUR:sub,opt, CUR:w/oR,no, CUR:w/oR,yes, CUR:GWS, and LTM.]

Term-document matrix size: 5831 x 1033. All matrix approximations are rank-100 approximations (CUR: r = c = k).
Matrix Approximation Results

              | Rel. Error (F-norm) | Storage (nz) | Runtime (sec)
--------------|---------------------|--------------|--------------
SVD           | 0.8203              | 686500       | 22.5664
NMF           | 0.8409              | 686400       | 23.0210
CUR: cn,lin   | 1.4151              | 17242        | 0.1741
CUR: cn,opt   | 0.9724              | 16358        | 0.2808
CUR: sub,lin  | 1.2093              | 16175        | 48.7651
CUR: sub,opt  | 0.9615              | 16108        | 49.0830
CUR: w/oR,no  | 0.9931              | 17932        | 0.3466
CUR: w/oR,yes | 0.9957              | 17220        | 0.2734
CUR: GWS      | 0.9437              | 25020        | 2.2857
LTM           | --                  | 52003        | --
Conclusions

We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD.

We can achieve LSI results that are almost as good with cheaper approximations:
• Less storage
• Less computation time
Completed Project Goals

• Code/validate NMF and CUR
• Analyze relative error, runtime, and storage of NMF and CUR
• Improve CUR algorithm of [3]
• Analyze use of NMF and CUR in LSI
References

[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.

[2] M.W. Berry, S.A. Pulatova, and G.W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-34 CMSC TR-4591, University of Maryland, May 2004.

[3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.

[4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.

[5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.

[6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.