TRANSCRIPT
1
Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O'Leary
2
Information Retrieval
Extracting information from databases
We need an efficient way of searching large amounts of data
Example: web search engine
3
Querying a Document Database
The problem: return documents that are relevant to entered search terms
In real systems such as Google, the problem is formulated in terms of matrices:
Term-document matrix
Query vector
4
Term-Document Matrix
Entry (i, j): weight of term i in document j

Example:
           Document
Term       1   2   3   4
Mark      15   0   0   0
Twain     15   0  20   0
Samuel     0  10   5   0
Clemens    0  20  10   0
Purple     0   0   0  20
Fairy      0   0   0  15

Example taken from [3]
5
Query Vector
Entry (i): weight of term i in the query

Example: search for "Mark Twain"
Term
Mark      1
Twain     1
Samuel    0
Clemens   0
Purple    0
Fairy     0

Example taken from [3]
6
Literal Term Matching
Given: term-document matrix A, query vector q
Using A and q, we need to determine which documents are relevant to the query
Formulate score vector: s = q^T A
Return the highest scoring documents
7
Document Scoring

s = q^T A = [1 1 0 0 0 0] * A = [30  0  20  0]
(Scores: Doc 1 = 30, Doc 2 = 0, Doc 3 = 20, Doc 4 = 0)

Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not

Example taken from [3]
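The scoring step above can be reproduced in a few lines of NumPy (a sketch using the example matrix and query from the slides; NumPy itself is not part of the slides):

```python
import numpy as np

# Term-document matrix A: rows = terms (Mark, Twain, Samuel, Clemens,
# Purple, Fairy), columns = Documents 1-4 (example from [3]).
A = np.array([
    [15,  0,  0,  0],   # Mark
    [15,  0, 20,  0],   # Twain
    [ 0, 10,  5,  0],   # Samuel
    [ 0, 20, 10,  0],   # Clemens
    [ 0,  0,  0, 20],   # Purple
    [ 0,  0,  0, 15],   # Fairy
])

# Query vector for "Mark Twain": weight 1 on the terms Mark and Twain.
q = np.array([1, 1, 0, 0, 0, 0])

# Score vector s = q^T A: one relevance score per document.
s = q @ A
print(s)  # Doc 1 = 30, Doc 2 = 0, Doc 3 = 20, Doc 4 = 0
```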
8
Latent Semantic Indexing
Problem: We need a way to return relevant documents that may not contain the exact query terms
Solution: Latent Semantic Indexing (LSI)
Use an approximation to the term-document matrix
9
Rank-k SVD
Standard approximation used in LSI: rank-k SVD

SVD review: A = U Σ V^T
U: orthogonal, columns are the left singular vectors of A
Σ: diagonal, singular values in decreasing order
V: orthogonal, columns are the right singular vectors of A

Rank-k SVD: A_k = U_k Σ_k V_k^T
U_k: first k columns of U
Σ_k: diagonal, top k singular values in decreasing order
V_k: first k columns of V
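The truncation can be sketched with NumPy's SVD routine (an illustration, not code from the project; `rank_k_svd` is a hypothetical helper name):

```python
import numpy as np

def rank_k_svd(A, k):
    """Rank-k truncated SVD: A_k = U_k Sigma_k V_k^T."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first k singular triplets; np.linalg.svd returns the
    # singular values already sorted in decreasing order.
    return U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

# The example term-document matrix from the earlier slide.
A = np.array([[15., 0,  0,  0],
              [15., 0, 20,  0],
              [ 0, 10., 5,  0],
              [ 0, 20., 10, 0],
              [ 0,  0,  0, 20.],
              [ 0,  0,  0, 15.]])
Uk, Sk, Vkt = rank_k_svd(A, 2)
A2 = Uk @ Sk @ Vkt            # dense rank-2 approximation of A
```

By the Eckart-Young theorem, A_k is the best rank-k approximation to A in the Frobenius norm.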
10
Rank-k SVD
Example: rank-2 approximation to the original term-document matrix

[A_2: a dense 6 x 4 matrix of decimal entries (terms Mark, Twain, Samuel, Clemens, Purple, Fairy; Documents 1-4), shown next to the original matrix for comparison]

Example taken from [3]
11
Document Scoring

s = q^T A_2: Docs 1-3 all receive nonzero scores; Doc 4 scores 0

Doc 1, Doc 2, and Doc 3 will all be returned as relevant
(compare the original scores [30 0 20 0], where Doc 2 scored 0)

Example taken from [3]
12
Literal Term Matching vs. LSI
Literal term matching: use sparse (m x n) term-document matrix A
LSI: use approximation to A. For example, A_k = U_k Σ_k V_k^T, where U_k: m x k, Σ_k: k x k, V_k: n x k.
These factors can be dense
Computing A_k is a one-time expense

                        Storage          Query Time
Literal term matching   O(nz(A))         O(nz(A))
LSI                     O((m+n)k + k)    O((m+n)k + k)
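The query-time bound for LSI comes from evaluating s = q^T A_k left to right through the factors, never forming the dense A_k. A minimal sketch (`lsi_scores` is an illustrative helper name, not from the slides):

```python
import numpy as np

def lsi_scores(q, Uk, Sk, Vkt):
    """Score s = q^T A_k without forming the dense m x n matrix A_k.

    Evaluated left to right: (q^T U_k) costs O(mk), scaling by the k
    singular values costs O(k), and the product with V_k^T costs O(nk)
    -- the O((m+n)k + k) query time in the comparison table.
    """
    t = q @ Uk            # 1 x k
    t = t * np.diag(Sk)   # scale by the top k singular values
    return t @ Vkt        # 1 x n score vector

# Sanity check against the explicit product on a small random instance.
rng = np.random.default_rng(0)
A = rng.random((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
q = rng.random(6)
assert np.allclose(lsi_scores(q, Uk, Sk, Vkt), q @ (Uk @ Sk @ Vkt))
```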
13
Precision and Recall
We need a measurement of performance for document retrieval
Let Retrieved = number of documents retrieved,
Relevant = total number of relevant documents to the query,
RetRel = number of documents retrieved that are relevant.

Precision: P(Retrieved) = RetRel / Retrieved
Recall:    R(Retrieved) = RetRel / Relevant
14
Precision and Recall
Example:
Given: set of 25 documents, query, list of documents relevant to the query = {22, 1, 11, 9, 5}
Task: retrieve four documents in order of relevance score
Results: 22, 3, 9, 7
(documents 22 and 9 are relevant; 3 and 7 are not)

P(1) = 1, P(2) = 1/2, P(3) = 2/3, P(4) = 2/4
R(1) = 1/5, R(2) = 1/5, R(3) = 2/5, R(4) = 2/5
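The definitions and the worked example can be checked with a short function (a sketch; `precision_recall` is a hypothetical helper name):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall after each document in a ranked result list."""
    relevant = set(relevant)
    hits = 0                 # RetRel: retrieved documents that are relevant
    prec, rec = [], []
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        prec.append(hits / i)              # RetRel / Retrieved
        rec.append(hits / len(relevant))   # RetRel / Relevant
    return prec, rec

# The worked example above: relevant = {22, 1, 11, 9, 5}, results 22, 3, 9, 7.
P, R = precision_recall([22, 3, 9, 7], {22, 1, 11, 9, 5})
# P = [1, 1/2, 2/3, 2/4]; R = [1/5, 1/5, 2/5, 2/5]
```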
15
Validation
Three common information retrieval data sets found at www.cs.utk.edu/~lsi/
Listed under CISI, CRAN, MED
Each data set includes a term-document matrix, a term list, queries, and a list of relevant documents for each query
16
Project Goals
To use matrix approximations other than the rank-k SVD in LSI
Test performance using average precision and recall, where the average is taken over all queries in the data set
Investigate storage, computation time, and relative error of matrix approximations
To investigate improvement to the CUR matrix approximation algorithm
17
SDD
In [3], LSI using the semidiscrete decomposition (SDD) is investigated

A ≈ X_k * D_k * Y_k^T
(m x n)  (m x k) (k x k) (k x n)

• X_k and Y_k contain entries from the set {-1, 0, 1}
• D_k is diagonal with positive entries
• rank(X_k D_k Y_k^T) ≤ k
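A minimal sketch of the SDD's structure only: random factors obeying the stated value constraints. This illustrates the shapes and why the factors are cheap to store; it does not implement the greedy fitting algorithm of [3].

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 4, 2

# X_k and Y_k: entries restricted to {-1, 0, 1} (storable in ~2 bits each,
# vs. a floating-point number per entry of the dense SVD factors).
Xk = rng.choice([-1, 0, 1], size=(m, k))
Yk = rng.choice([-1, 0, 1], size=(n, k))

# D_k: diagonal with positive entries.
Dk = np.diag(rng.uniform(0.5, 2.0, size=k))

A_sdd = Xk @ Dk @ Yk.T                    # m x n approximation
assert np.linalg.matrix_rank(A_sdd) <= k  # rank(X_k D_k Y_k^T) <= k
```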
18
Nonnegative Matrix Factorization
Term-document matrix is nonnegative

A ≈ W * H
(m x n) (m x k) (k x n)

• W and H are nonnegative
• rank(WH) ≤ k
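As an illustration of computing such a factorization, here is a sketch of Lee-Seung multiplicative updates, one standard method from the NMF literature surveyed in [1] (not necessarily the algorithm used in this project; `nmf` is a hypothetical helper name):

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """A ~ W H with W, H >= 0 via multiplicative updates.

    Each update keeps the factors nonnegative (nonnegative times
    nonnegative ratio) and does not increase ||A - WH||_F.
    """
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# The example term-document matrix (nonnegative, as the slide notes).
A = np.array([[15., 0,  0,  0],
              [15., 0, 20,  0],
              [ 0, 10., 5,  0],
              [ 0, 20., 10, 0],
              [ 0,  0,  0, 20.],
              [ 0,  0,  0, 15.]])
W, H = nmf(A, k=3)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```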
19
CUR
Term-document matrix is sparse

A ≈ C * U * R
(m x n) (m x c) (c x r) (r x n)

• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter
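A toy sketch of the CUR idea: sample columns and rows with probability proportional to their squared norms and rescale them, as in Monte Carlo CUR methods. For simplicity the middle factor here is U = C⁺AR⁺ (pseudoinverses); the algorithm of [2] constructs U more cheaply from the sampled data, and `simple_cur` is a hypothetical helper name.

```python
import numpy as np

def simple_cur(A, c, r, rng):
    """Toy CUR approximation A ~ C U R with norm-squared sampling."""
    # Sampling probabilities proportional to squared column/row norms.
    pc = (A ** 2).sum(axis=0); pc = pc / pc.sum()
    pr = (A ** 2).sum(axis=1); pr = pr / pr.sum()
    cols = rng.choice(A.shape[1], size=c, p=pc)
    rows = rng.choice(A.shape[0], size=r, p=pr)
    # Rescale each sampled column/row by 1/sqrt(#samples * probability).
    C = A[:, cols] / np.sqrt(c * pc[cols])
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    # Simplified middle factor (the method of [2] avoids touching all of A).
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(2)
A = rng.random((20, 12))
C, U, R = simple_cur(A, c=8, r=8, rng=rng)
approx = C @ U @ R   # same shape as A
```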
20
Further Topics
Time permitting investigations:
Parallel implementations of matrix approximations
Testing performance of matrix approximations in forming a multidocument summary
2121
ReferencesReferencesMichael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix de-composition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[3]
[2]
[1]