1 cs 430 / info 430 information retrieval lecture 9 latent semantic indexing

28
1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

Post on 22-Dec-2015

226 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

1

CS 430 / INFO 430 Information Retrieval

Lecture 9

Latent Semantic Indexing

Page 2: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

2

Course Administration

Page 3: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

3

Latent Semantic Indexing

Objective

Replace indexes that use sets of index terms by indexes that use concepts.

Approach

Map the term vector space into a lower dimensional space, using singular value decomposition.

Each dimension in the new space corresponds to a latent concept in the original data.

Page 4: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

4

Deficiencies with Conventional Automatic Indexing

Synonymy: Various words and phrases refer to the same concept (lowers recall).

Polysemy: Individual words have more than one meaning (lowers precision)

Independence: No significance is given to two terms that frequently appear together

Latent semantic indexing addresses the first of these (synonymy), and the third (dependence)

Page 5: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

5

Example

Query: "IDF in computer-based information look-up"

Index terms for a document: access, document, retrieval, indexing

How can we recognize that information look-up is related to retrieval and indexing?

Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?

Page 6: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

6

Technical Memo Example: Titles

 c1 Human machine interface for Lab ABC computer applications

 c2 A survey of user opinion of computer system response time

 c3 The EPS user interface management system

 c4 System and human system engineering testing of EPS

 c5 Relation of user-perceived response time to error measurement

m1 The generation of random, binary, unordered trees

m2 The intersection graph of paths in trees

m3 Graph minors IV: Widths of trees and well-quasi-ordering

 m4 Graph minors: A survey

Page 7: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

7

Technical Memo Example: Terms and Documents

Terms Documents

c1 c2 c3 c4 c5 m1 m2 m3 m4human 1 0 0 1 0 0 0 0 0interface 1 0 1 0 0 0 0 0 0computer 1 1 0 0 0 0 0 0 0user 0 1 1 0 1 0 0 0 0system 0 1 1 2 0 0 0 0 0response 0 1 0 0 1 0 0 0 0time 0 1 0 0 1 0 0 0 0EPS 0 0 1 1 0 0 0 0 0survey 0 1 0 0 0 0 0 0 1trees 0 0 0 0 0 1 1 1 0graph 0 0 0 0 0 0 1 1 1minors 0 0 0 0 0 0 0 1 1

Page 8: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

8

Technical Memo Example: Query

Query:

Find documents relevant to "human computer interaction"

Simple Term Matching:

Matches c1, c2, and c4Misses c3 and c5

Page 9: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

9

Models of Semantic Similarity

Proximity models: Put similar items together in some space or structure

• Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms. Most then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.]

• Factor analysis based on matrix of similarities between documents (single mode).

• Two-mode proximity methods. Start with rectangular matrix and construct explicit representations of both row and column objects.

Page 10: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

10

Selection of Two-mode Factor Analysis

Additional criterion:

Computationally efficient O(N2k3)

N is number of terms plus documentsk is number of dimensions

Page 11: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

11

t1

t2

t3

d1 d2

The space has as many dimensions as there are terms in the word list.

The term vector space

Page 12: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

12

Figure 1

• term

document

query

--- cosine > 0.9

Latent concept vector space

Page 13: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

13

Mathematical concepts

Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).

Singular Value Decomposition

For any matrix X, with t rows and d columns, there exist matrices T0, S0 and D0', such that:

X = T0S0D0'

T0 and D0 are the matrices of left and right singular vectorsT0 and D0 have orthonormal columns

S0 is the diagonal matrix of singular values

Page 14: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

14

Dimensions of matrices

X = T0

D0'S0

t x d t x m m x dm x m

m is the rank of X < min(t, d)

Page 15: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

15

Reduced Rank

S0 can be chosen so that the diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero.

Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives:

X X = TSD'

Interpretation

If value of k is selected well, expectation is that X retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.

~~ ^

^

Page 16: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

16

Selection of singular values

X =

t x d t x k k x dk x k

k is the number of singular values chosen to represent the concepts in the set of documents.

Usually, k « m.

T

S D'

^

Page 17: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

17

Comparing a Term and a Document

An individual cell of X is the number of occurrences of term i in document j.

X = TSD'

= TS(DS)'

where S is a diagonal matrix whose values are the square root of the corresponding elements of S.

^

- -

-

^

Page 18: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

18

Calculation Similarities in the Concept Space

Objective:

Calculate similarities between terms, documents, and queries, using the matrices T, S, and D.

Page 19: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

19

Mathematical Revision

A is a p x q matrixB is a r x q matrix

ai is the vector represented by row i of Abj is the vector represented by row j of B

The inner product ai.bj is element i, j of AB'

p

q

q

r

A

B'

ith row of A

jth row of B

Page 20: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

20

Comparing Two Terms

XX' = TSD'(TSD')'

= TSD'DS'T'

= TSS'T' Since D is orthonormal

= TS(TS)'

To calculate the i, j cell, take the dot product between the i and j rows of TS

Since S is diagonal, TS differs from T only by stretching the coordinate system

^

^

The dot product of two rows of X reflects the extent to which two terms have a similar pattern of occurrences.

^

Page 21: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

21

Comparing Two Documents

X'X = (TSD')'TSD'

= DS(DS)'

To calculate the i, j cell, take the dot product between the i and j columns of DS.

Since S is diagonal DS differs from D only by stretching the coordinate system

^^

The dot product of two columns of X reflects the extent to which two columns have a similar pattern of occurrences.

^

Page 22: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

22

Comparing a Query and a Document

A query can be expressed as a vector in the term-document vector space xq.

xqi = 1 if term i is in the query and 0 otherwise.

(Ignore query terms that are not in the term vector space.)

Let pqj be the inner product of the query xq with document dj in the term-document vector space.

pqj is the jth element in the product of xq'X. ^

Page 23: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

23

Comparing a Query and a Document

[pq1 ... pqj ... pqt] = [xq1 xq2 ... xqt]

X̂inner product of query q with document dj

query

document dj is column j of X

^

pq' = xq'X

= xq'TSD'

= xq'T(DS)'

similarity(q, dj) =

^

pqj

|xq| |dj|

cosine of angle is inner product divided by lengths of vectors

Page 24: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

24

Comparing a Query and a Document

In the reading, the authors treat the query as a pseudo-document in the concept space dq:

dq = xq'TS-1 [Note that S-1 stretches the vector]

To compare a query against document j, they extend the method used to compare document i with document j.

Take the jth element of the product of:

dqS and (DS)'

This is the jth element of product of:

xq'T (DS)' which is the same expression as before.

Note that with their notation dq is a row vector.

Page 25: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

25

Technical Memo Example: Query

Terms Query xq

human 1interface 0computer 0user 0system 1response 0time 0EPS 0survey 0trees 1graph 0minors 0

Query: "human system interactions on trees"

In term-document space, a query is represented by xq, a column vector with t elements.

In concept space, a query is represented by dq, a row vector with k elements.

Page 26: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

26

Experimental Results

Deerwester, et al. tried latent semantic indexing on two test collections, MED and CISI, where queries and relevant judgments were available.

Documents were full text of title and abstract.

Stop list of 439 words (SMART); no stemming, etc.

Comparison with: (a) simple term matching, (b) SMART, (c) Voorhees method.

Page 27: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

27

Experimental Results: 100 Factors

Page 28: 1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing

28

Experimental Results: Number of Factors