1/ 30. problems for classical ir models introduction & background(lsi,svd,..etc) example...

• Problems for classical IR models
• Introduction & Background (LSI, SVD, etc.)
• Example
• Standard query method
• Analysis of the standard query method
• Seeking the best
• Experimental results
• SVR vs. IRR
• SVR
• Conclusion
• Future work

2/ 30

• Problems for classical IR models
• LSI
• SVD

3/ 30

Problems for classical IR models

Synonymy: Various words and phrases refer to the same concept (lowers recall).

Polysemy: Individual words have more than one meaning (lowers precision).

Independence: No significance is given to two terms that frequently appear together.

4/ 30

Latent Semantic Analysis

• General idea
– Map documents (and terms) to a low-dimensional representation.
– Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product in the latent semantic space.

• Goals
– Similar terms map to similar locations in the low-dimensional space.
– Noise reduction by dimension reduction.
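The general idea above can be illustrated with a minimal NumPy sketch. The toy matrix, the term labels in the comments, and the choice k = 2 are assumptions made purely for illustration; they do not come from the slides. Two terms that never co-occur ("automobile" and "engine") end up with a positive inner product in the latent space because both co-occur with "car".

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
# The counts and term labels are made up for illustration.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "car"
    [1.0, 0.0, 0.0, 0.0],   # "automobile"
    [0.0, 1.0, 0.0, 0.0],   # "engine"
    [0.0, 0.0, 1.0, 1.0],   # "flower"
    [0.0, 0.0, 1.0, 0.0],   # "petal"
])

k = 2  # number of latent dimensions kept (assumed)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Coordinates of terms and documents in the k-dimensional latent space.
terms_latent = U[:, :k] * s[:k]      # one row per term
docs_latent = Vt[:k, :].T * s[:k]    # one row per document

# Similarity as the inner product in the latent space.
term_sim = terms_latent @ terms_latent.T
doc_sim = docs_latent @ docs_latent.T

print("raw inner product automobile/engine:", A[1] @ A[2])
print("latent inner product automobile/engine:", term_sim[1, 2])
print("latent inner product doc 0 / doc 2:", doc_sim[0, 2])
```

Documents 0 and 2 share no terms and stay (numerically) orthogonal in the latent space, while the two related terms move close together.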

5/ 30

Vector Model

6/ 30

• Set of documents: $D = \{D_1, D_2, \ldots, D_m\}$
• A finite set of terms: $T = \{t_1, t_2, \ldots, t_n\}$
• Every document can be represented as a vector: $d_j = (w_{1j}, w_{2j}, \ldots, w_{nj})$
• The same holds for the query: $q = (w_{1q}, w_{2q}, \ldots, w_{nq})$
• Similarity of query q and document d: $\mathrm{similarity}(q, d) = \cos(\theta) = \dfrac{(d, q)}{\|d\| \cdot \|q\|}$
• Given a threshold, all documents with similarity > threshold are retrieved.

[Figure: the vectors $d_j$ and $q$ in the term space with axes i and j, separated by the angle $\theta$]
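The retrieval rule of the vector model can be sketched in a few lines of Python. The function names `cosine_similarity` and `retrieve`, the toy weight vectors, and the threshold value are hypothetical choices for illustration only.

```python
import numpy as np

def cosine_similarity(d, q):
    """cos(theta) = (d, q) / (||d|| * ||q||); returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / denom) if denom > 0 else 0.0

def retrieve(doc_vectors, q, threshold):
    """Return the indices of documents whose similarity to q exceeds the threshold."""
    return [j for j, d in enumerate(doc_vectors)
            if cosine_similarity(d, q) > threshold]

# Three toy documents over three terms (one row per document), made-up weights.
docs = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 1.0, 0.0])

hits = retrieve(docs, query, threshold=0.4)
print(hits)
```

Here the third document shares no terms with the query, so its similarity is 0 and it falls below any positive threshold.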

SVD and low-rank approximations

7/ 30

Truncate the SVD by keeping k ≤ r terms: $A_k = U_k S_k V_k^T$

• $U_k$ ($V_k$): orthogonal matrix containing the top k left (right) singular vectors of A.
• $S_k$: diagonal matrix containing the top k singular values of A, ordered non-increasingly.
• r: rank of A, the number of non-zero singular values.
• $A_k$: the "best" matrix among all rank-k matrices with respect to the spectral and Frobenius norms.
• This optimality property is very useful in, e.g., Principal Component Analysis (PCA), LSI, etc.
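The optimality property can be checked numerically. The sketch below assumes a small random matrix and k = 2 (both arbitrary choices) and compares the approximation error of the truncated SVD with the discarded singular values: the spectral-norm error equals the (k+1)-th singular value, and the Frobenius-norm error equals the root of the sum of squares of all discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # arbitrary test matrix

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation

spectral_err = np.linalg.norm(A - A_k, 2)        # largest singular value of A - A_k
frobenius_err = np.linalg.norm(A - A_k, "fro")

print(spectral_err, s[k])                              # these two agree
print(frobenius_err, np.sqrt(np.sum(s[k:] ** 2)))      # and so do these
```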

10/ 30

11/ 30

12/ 30

13/ 30

14/ 30

• TREC-4 data set (http://trec.nist.gov/).
• 5305 documents were chosen at random.
• Tested with 20 queries.
• Stemming (Porter Stemmer, http://www.tartarus.org/~martin/PorterStemmer/) and stop-word removal (http://www.lextek.com/manuals/onix/stopwords1.html) were used.
• The term-by-document matrix was of dimension 16,571 x 5305 and was determined to have a full rank of 5305 through the SVD process.
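As a rough sketch of this setup on a toy scale, the code below builds a term-by-document count matrix after stop-word removal and determines its rank by counting the non-zero singular values. The three sample texts, the tiny stop-word set, and the helper name `term_document_matrix` are illustrative assumptions; stemming is omitted, and raw counts stand in for whatever term weighting was used in the experiments.

```python
import numpy as np

# Tiny illustrative stop-word set (an assumption, not the Onix list).
STOP_WORDS = {"the", "a", "of", "is", "and"}

def term_document_matrix(texts):
    """Return (sorted term list, term-by-document count matrix)."""
    tokenized = [[w for w in t.lower().split() if w not in STOP_WORDS]
                 for t in texts]
    terms = sorted({w for doc in tokenized for w in doc})
    index = {t: i for i, t in enumerate(terms)}
    A = np.zeros((len(terms), len(texts)))
    for j, doc in enumerate(tokenized):
        for w in doc:
            A[index[w], j] += 1.0
    return terms, A

texts = [
    "the rank of a matrix",
    "singular values and rank",
    "the matrix is sparse",
]
terms, A = term_document_matrix(texts)

# Rank = number of singular values above a small numerical tolerance.
s = np.linalg.svd(A, compute_uv=False)
tol = s.max() * max(A.shape) * np.finfo(A.dtype).eps
rank = int(np.sum(s > tol))

print(A.shape, rank)
```

When the rank equals the number of columns, as here, the matrix has full rank in the sense used on the slide.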

15/ 30


• T, the area between the IRP curve and the horizontal recall axis, represents the average interpolated precision over the full range ([0, 1]) of recall.
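One common way to compute such an area for a single query is sketched below. Whether it matches the exact definition of T used in the experiments is an assumption: the sketch treats the interpolated precision at a recall level as the highest precision reached at that or any higher recall, and integrates the resulting step curve over recall in [0, 1]. The function name and the sample ranking are hypothetical.

```python
def average_interpolated_precision(relevance, total_relevant):
    """Area under the interpolated precision curve over recall in [0, 1].

    relevance: ranked list of booleans/0-1 flags (True = relevant document).
    total_relevant: number of relevant documents for the query.
    """
    if total_relevant == 0:
        return 0.0
    # Precision measured at each rank where a relevant document is found.
    precisions = []
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    # Interpolation: precision at recall r is the max precision at recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Each relevant document covers a recall step of width 1 / total_relevant;
    # relevant documents that were never retrieved contribute zero precision.
    return sum(precisions) / total_relevant

# Relevant documents at ranks 1, 3 and 4; three relevant documents in total.
score = average_interpolated_precision([1, 0, 1, 1, 0], total_relevant=3)
print(score)
```

In this example the raw precisions are 1, 2/3 and 3/4; interpolation lifts the middle value to 3/4, so the area is (1 + 3/4 + 3/4) / 3.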

16/ 30

• LSI

• IRR

[Figure: LSI pipeline. Labels: term-by-document matrix A (Weight) → SVD → U (eigenvector), eigenvalue, V^T (eigenvector) → rescaling]

17/ 30

[Figure: IRR pipeline. Labels: term, doc, turn to term, sentence → IRR → U, V^T]

Put each document as a query to compute the similarity.

18/ 30

19/ 30

SVR (singular value rescaling) vs. IRR (iterative residual rescaling)

• What is scaled
– SVR: the singular values in matrix S.
– IRR: the residual vectors.

• Relation to SVD
– SVR: a technique built on top of SVD.
– IRR: independent of SVD.

• Number of scalings
– SVR: only once (though through some trial-and-error process) to produce all the basis vectors.
– IRR: multiple times, the number of which is determined by the number of desired basis vectors; each time, the scaling of residual vectors produces the next basis vector.
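To make the "scaled only once, on top of SVD" idea concrete, here is a minimal sketch of what an SVR-style step could look like. It assumes that rescaling means raising each retained singular value to an exponent `alpha`; the slides do not spell out the form of the rescaling, so the exponent, the function name `svr_document_vectors`, and the random test matrix are all hypothetical. IRR is not sketched, since the slides give too little detail about its residual update.

```python
import numpy as np

def svr_document_vectors(A, k, alpha):
    """Rank-k document vectors with singular values rescaled to s**alpha.

    ASSUMPTION: the exponent form s**alpha is illustrative only; the slides
    state just that SVR scales the singular values in S once.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_rescaled = s[:k] ** alpha                   # the single rescaling step
    return (s_rescaled[:, None] * Vt[:k, :]).T    # one row per document

rng = np.random.default_rng(1)
A = rng.random((8, 5))   # made-up 8-term, 5-document matrix

plain_lsi = svr_document_vectors(A, k=3, alpha=1.0)   # alpha = 1: no rescaling
rescaled = svr_document_vectors(A, k=3, alpha=1.5)    # larger weight on top dimensions

print(plain_lsi.shape, rescaled.shape)
```

With `alpha = 1.0` the function reduces to the usual dimension-reduced document vectors, which is consistent with SVR being an add-on to conventional LSI rather than a replacement for the SVD.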

20/ 30

21/ 30

Fig. 5: SVD of a 2x2 matrix

22/ 30

• Mathematical analysis showed that:
– The difference between the results of version A and version B is a factor of $S^2$, with S being the diagonal matrix of singular values in the dimension-reduced model.
– The retrieval results from version B and version B’ are always identical if the Equivalency Principle is satisfied.
– Version B (B’) should be a better option than version A.

23/ 30

• Experiments on the standardized TREC data set confirmed that:
– Using SVR in addition to the conventional LSI yields an improvement ratio of 5.9% over using the conventional LSI alone.
– SVR is computationally as efficient as the best standard query method, "Version B".
– SVR performs better than IRR.

24/ 30

• Applying SVR to other fields of IR such as image retrieval and video/audio retrieval.

• Seeking mathematical justification of SVR, including the relationship between the optimal rescaling factor S_exp and the characteristics of any particular data set.

25/ 30
