TRANSCRIPT
1
Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O'Leary
2
Information Retrieval
Extracting information from databases
We need an efficient way of searching large amounts of data
Example: web search engine
3
Querying a Document Database
The problem: return documents that are relevant to entered search terms
In real systems such as Google, the problem is formulated in terms of matrices:
Term-document matrix
Query vector
4
Term-Document Matrix
Entry (i, j): weight of term i in document j

Example:
           Document
Term       1   2   3   4
Mark      15   0   0   0
Twain     15   0  20   0
Samuel     0  10   5   0
Clemens    0  20  10   0
Purple     0   0   0  20
Fairy      0   0   0  15

Example taken from [3]
5
Query Vector
Entry (i): weight of term i in the query

Example: search for "Mark Twain"
Term
Mark      1
Twain     1
Samuel    0
Clemens   0
Purple    0
Fairy     0

Example taken from [3]
6
Literal Term Matching
Given: term-document matrix A, query vector q
Using A and q, we need to determine which documents are relevant to the query
Formulate score vector: s = q^T A
Return the highest scoring documents
7
Document Scoring

s = q^T A = [1 1 0 0 0 0] * A = [30  0  20  0]
(Scores: Doc 1 = 30, Doc 2 = 0, Doc 3 = 20, Doc 4 = 0)

Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not

Example taken from [3]
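The scoring step above can be reproduced in a few lines of NumPy (a sketch using the example matrix and query from the slides; NumPy itself is not part of the slides):

```python
import numpy as np

# Term-document matrix A: rows = terms (Mark, Twain, Samuel, Clemens,
# Purple, Fairy), columns = Documents 1-4 (example from [3]).
A = np.array([
    [15,  0,  0,  0],   # Mark
    [15,  0, 20,  0],   # Twain
    [ 0, 10,  5,  0],   # Samuel
    [ 0, 20, 10,  0],   # Clemens
    [ 0,  0,  0, 20],   # Purple
    [ 0,  0,  0, 15],   # Fairy
])

# Query vector for "Mark Twain": weight 1 on the terms Mark and Twain.
q = np.array([1, 1, 0, 0, 0, 0])

# Score vector s = q^T A: one relevance score per document.
s = q @ A
print(s)  # Doc 1 = 30, Doc 2 = 0, Doc 3 = 20, Doc 4 = 0
```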
8
Latent Semantic Indexing
Problem: We need a way to return relevant documents that may not contain the exact query terms
Solution: Latent Semantic Indexing (LSI)
Use an approximation to the term-document matrix
9
Rank-k SVD
Standard approximation used in LSI: rank-k SVD

SVD review: A = U Σ V^T
U: orthogonal, columns are the left singular vectors of A
Σ: diagonal, singular values in decreasing order
V: orthogonal, columns are the right singular vectors of A

Rank-k SVD: A_k = U_k Σ_k V_k^T
U_k: first k columns of U
Σ_k: diagonal, top k singular values in decreasing order
V_k: first k columns of V
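The truncation can be sketched with NumPy's SVD routine (an illustration, not code from the project; `rank_k_svd` is a hypothetical helper name):

```python
import numpy as np

def rank_k_svd(A, k):
    """Rank-k truncated SVD: A_k = U_k Sigma_k V_k^T."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first k singular triplets; np.linalg.svd returns the
    # singular values already sorted in decreasing order.
    return U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

# The example term-document matrix from the earlier slide.
A = np.array([[15., 0,  0,  0],
              [15., 0, 20,  0],
              [ 0, 10., 5,  0],
              [ 0, 20., 10, 0],
              [ 0,  0,  0, 20.],
              [ 0,  0,  0, 15.]])
Uk, Sk, Vkt = rank_k_svd(A, 2)
A2 = Uk @ Sk @ Vkt            # dense rank-2 approximation of A
```

By the Eckart-Young theorem, A_k is the best rank-k approximation to A in the Frobenius norm.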
10
Rank-k SVD
Example: rank-2 approximation to the original term-document matrix

[A_2: a dense 6 x 4 matrix of decimal entries (terms Mark, Twain, Samuel, Clemens, Purple, Fairy; Documents 1-4), shown next to the original matrix for comparison]

Example taken from [3]
11
Document Scoring

s = q^T A_2: Docs 1-3 all receive nonzero scores; Doc 4 scores 0

Doc 1, Doc 2, and Doc 3 will all be returned as relevant
(compare the original scores [30 0 20 0], where Doc 2 scored 0)

Example taken from [3]
12
Literal Term Matching vs. LSI
Literal term matching: use sparse (m x n) term-document matrix A
LSI: use approximation to A. For example, A_k = U_k Σ_k V_k^T, where U_k: m x k, Σ_k: k x k, V_k: n x k.
These factors can be dense
Computing A_k is a one-time expense

                        Storage          Query Time
Literal term matching   O(nz(A))         O(nz(A))
LSI                     O((m+n)k + k)    O((m+n)k + k)
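The query-time bound for LSI comes from evaluating s = q^T A_k left to right through the factors, never forming the dense A_k. A minimal sketch (`lsi_scores` is an illustrative helper name, not from the slides):

```python
import numpy as np

def lsi_scores(q, Uk, Sk, Vkt):
    """Score s = q^T A_k without forming the dense m x n matrix A_k.

    Evaluated left to right: (q^T U_k) costs O(mk), scaling by the k
    singular values costs O(k), and the product with V_k^T costs O(nk)
    -- the O((m+n)k + k) query time in the comparison table.
    """
    t = q @ Uk            # 1 x k
    t = t * np.diag(Sk)   # scale by the top k singular values
    return t @ Vkt        # 1 x n score vector

# Sanity check against the explicit product on a small random instance.
rng = np.random.default_rng(0)
A = rng.random((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
q = rng.random(6)
assert np.allclose(lsi_scores(q, Uk, Sk, Vkt), q @ (Uk @ Sk @ Vkt))
```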
13
Precision and Recall
We need a measurement of performance for document retrieval
Let Retrieved = number of documents retrieved,
Relevant = total number of relevant documents to the query,
RetRel = number of documents retrieved that are relevant.

Precision: P(Retrieved) = RetRel / Retrieved
Recall:    R(Retrieved) = RetRel / Relevant
14
Precision and Recall
Example:
Given: set of 25 documents, query, list of documents relevant to the query = {22, 1, 11, 9, 5}
Task: retrieve four documents in order of relevance score
Results: 22, 3, 9, 7
(documents 22 and 9 are relevant; 3 and 7 are not)

P(1) = 1, P(2) = 1/2, P(3) = 2/3, P(4) = 2/4
R(1) = 1/5, R(2) = 1/5, R(3) = 2/5, R(4) = 2/5
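The definitions and the worked example can be checked with a short function (a sketch; `precision_recall` is a hypothetical helper name):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall after each document in a ranked result list."""
    relevant = set(relevant)
    hits = 0                 # RetRel: retrieved documents that are relevant
    prec, rec = [], []
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        prec.append(hits / i)              # RetRel / Retrieved
        rec.append(hits / len(relevant))   # RetRel / Relevant
    return prec, rec

# The worked example above: relevant = {22, 1, 11, 9, 5}, results 22, 3, 9, 7.
P, R = precision_recall([22, 3, 9, 7], {22, 1, 11, 9, 5})
# P = [1, 1/2, 2/3, 2/4]; R = [1/5, 1/5, 2/5, 2/5]
```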
15
Validation
Three common information retrieval data sets found at www.cs.utk.edu/~lsi/
Listed under CISI, CRAN, MED
Each data set includes a term-document matrix, a term list, queries, and a list of relevant documents for each query
16
Project Goals
To use matrix approximations other than the rank-k SVD in LSI
Test performance using average precision and recall, where the average is taken over all queries in the data set
Investigate storage, computation time, and relative error of matrix approximations
To investigate improvement to the CUR matrix approximation algorithm
17
SDD
In [3], LSI using the semidiscrete decomposition (SDD) is investigated

A ≈ X_k * D_k * Y_k^T
(m x n)  (m x k) (k x k) (k x n)

• X_k and Y_k contain entries from the set {-1, 0, 1}
• D_k is diagonal with positive entries
• rank(X_k D_k Y_k^T) ≤ k
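A minimal sketch of the SDD's structure only: random factors obeying the stated value constraints. This illustrates the shapes and why the factors are cheap to store; it does not implement the greedy fitting algorithm of [3].

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 4, 2

# X_k and Y_k: entries restricted to {-1, 0, 1} (storable in ~2 bits each,
# vs. a floating-point number per entry of the dense SVD factors).
Xk = rng.choice([-1, 0, 1], size=(m, k))
Yk = rng.choice([-1, 0, 1], size=(n, k))

# D_k: diagonal with positive entries.
Dk = np.diag(rng.uniform(0.5, 2.0, size=k))

A_sdd = Xk @ Dk @ Yk.T                    # m x n approximation
assert np.linalg.matrix_rank(A_sdd) <= k  # rank(X_k D_k Y_k^T) <= k
```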
18
Nonnegative Matrix Factorization
Term-document matrix is nonnegative

A ≈ W * H
(m x n) (m x k) (k x n)

• W and H are nonnegative
• rank(WH) ≤ k
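As an illustration of computing such a factorization, here is a sketch of Lee-Seung multiplicative updates, one standard method from the NMF literature surveyed in [1] (not necessarily the algorithm used in this project; `nmf` is a hypothetical helper name):

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """A ~ W H with W, H >= 0 via multiplicative updates.

    Each update keeps the factors nonnegative (nonnegative times
    nonnegative ratio) and does not increase ||A - WH||_F.
    """
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# The example term-document matrix (nonnegative, as the slide notes).
A = np.array([[15., 0,  0,  0],
              [15., 0, 20,  0],
              [ 0, 10., 5,  0],
              [ 0, 20., 10, 0],
              [ 0,  0,  0, 20.],
              [ 0,  0,  0, 15.]])
W, H = nmf(A, k=3)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```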
19
CUR
Term-document matrix is sparse

A ≈ C * U * R
(m x n) (m x c) (c x r) (r x n)

• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter
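A toy sketch of the CUR idea: sample columns and rows with probability proportional to their squared norms and rescale them, as in Monte Carlo CUR methods. For simplicity the middle factor here is U = C⁺AR⁺ (pseudoinverses); the algorithm of [2] constructs U more cheaply from the sampled data, and `simple_cur` is a hypothetical helper name.

```python
import numpy as np

def simple_cur(A, c, r, rng):
    """Toy CUR approximation A ~ C U R with norm-squared sampling."""
    # Sampling probabilities proportional to squared column/row norms.
    pc = (A ** 2).sum(axis=0); pc = pc / pc.sum()
    pr = (A ** 2).sum(axis=1); pr = pr / pr.sum()
    cols = rng.choice(A.shape[1], size=c, p=pc)
    rows = rng.choice(A.shape[0], size=r, p=pr)
    # Rescale each sampled column/row by 1/sqrt(#samples * probability).
    C = A[:, cols] / np.sqrt(c * pc[cols])
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    # Simplified middle factor (the method of [2] avoids touching all of A).
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(2)
A = rng.random((20, 12))
C, U, R = simple_cur(A, c=8, r=8, rng=rng)
approx = C @ U @ R   # same shape as A
```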
20
Further Topics
Time permitting investigations:
Parallel implementations of matrix approximations
Testing performance of matrix approximations in forming a multidocument summary
2121
ReferencesReferencesMichael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix de-composition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[3]
[2]
[1]