hinrich schütze and christina lioma lecture 6: scoring, term weighting, the vector space model

48
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model 1

Upload: jerry-calhoun

Post on 04-Jan-2016

53 views

Category:

Documents


1 download

DESCRIPTION

Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model. Overview. Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model. Outline. Why ranked retrieval? Weighted zone scoring Term frequency - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

Introduction to

Information Retrieval

Hinrich Schütze and Christina LiomaLecture 6: Scoring, Term Weighting, The Vector

Space Model

1

Page 2: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

Overview

❶ Why ranked retrieval?

❷ Weighted zone scoring

❸ Term frequency

❹ tf-idf weighting

❺ The vector space model

2

Page 3: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

Outline

3

❶ Why ranked retrieval?

❷ Weighted zone scoring

❸ Term frequency

❹ tf-idf weighting

❺ The vector space model

Page 4: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

4

Problem with Boolean search

Thus far, our queries have all been Boolean.• Documents either match or don’t.

Good for expert users with precise understanding of their needs and of the collection.

Not good for the majority of users• Most users are not capable of writing Boolean (or they are, but

they think it’s too much work.)• Most users don’t want to wade through 1000s of results.

4

Page 5: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

5

Boolean queries often result in either too few or too many (1000s) results.

Feast• Query : “standard user dlink 650” 200,000 hits

Famine• Query : “standard user dlink 650 no card found” 2 hits

In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits. AND gives too few, and OR gives too many.

Problem with Boolean search: Feast or famine

Page 6: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

6

Feast or famine: No problem in ranked retrieval

With ranking, large result sets are not an issue. Just show the top 10 results Doesn’t overwhelm the user Premise: the ranking algorithm works: More relevant results

are ranked higher than less relevant results.

6

Page 7: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

7

Query-document matching scores

How do we compute the score of a query-document pair? Let’s start with a one-term query. If the query term does not occur in the document: score

should be 0. The more frequent the query term in the document, the

higher the score (should be) We will look at a number of alternatives for doing this.

7

Page 8: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

8

Outline

8

❶ Why ranked retrieval?

❷ Weighted zone scoring

❸ Term frequency

❹ tf-idf weighting

❺ The vector space model

Page 9: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

9

Parametric and zone indexes

Metadata- we mean specific forms of data about a document, such as its author(s), title and data of publication.

The metadata would generally include fields such as the data of creation and the format of the document, as well the author and possibly the title of the document.

Zones are similar to fields, except the contents of a zone can be arbitrary free text.

9

Page 10: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

10

Parametric index

Parametric search

Page 11: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

11

Zone index

Basic zone index

Zone index in which the zone is encoded

steven

Page 12: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

12

Weighted zone scoring

Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval

Consider a set of documents contributes a Boolean value.

Page 13: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

13

Weighted zone scoring-Example

Consider the query steven in a collection in which each document has three zones: abstract, title and author. The Boolean score function for a zone takes on the value 1 if the query term steven is present in the zone, and 0 otherwise. Weighted zone scoring in such a collection would require three weights g1,g2 and g3, respectively corresponding to abstract, title and author zones. Suppose we set g1=0.2, g2=0.3, and g3=0.5

steven

Page 14: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

14

Algorithm for computing the weighted zone score from two postings lists

Page 15: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

15

Learning weights Given a query q and a document d, we use the given Boolean

match function to compute Boolean variables ST(d,q) and SB(d,q).

Train examples, each of which is a triple of the form

Page 16: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

16

Learning weights

Page 17: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

17

The optimal weight g

Page 18: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

18

The optimal weight g Total error=0.75

g= =0.25

Page 19: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

Outline

19

❶ Why ranked retrieval?

❷ Weighted zone scoring

❸ Term frequency

❹ tf-idf weighting

❺ The vector space model

Page 20: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

20

Bag of words model

The exact ordering of the terms in a document is ignored but the number of occurrence of each term is material.

We only retain information on the number of occurrences of each term.

“John is quicker than Mary” and “Mary is quicker than John” are represented the same way.

This is called a bag of words model. For now: bag of words model

20

Page 21: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

21

Term frequency tf The term frequency tft,d of term t in document d is defined

as the number of occurrences of term t in document d. We want to use tf when computing query-document match

scores. But how? Raw term frequency is not what we want because:• A document with tf = 10 occurrences of the term is

more relevant than a document with tf = 1 occurrence of the term.• But not 10 times more relevant.

Relevance does not increase proportionally with term frequency.

21

Page 22: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

22

Log-frequency weighting The log frequency weight of term t in d is

• tft,d → wft,d :

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. Score for a document-query pair: sum over terms t in both

q and d:Score

The score is 0 if none of the query terms is present in the document.

22

dqt dt ) tflog (1 ,

Page 23: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

Outline

23

❶ Why ranked retrieval?

❷ Weighted zone scoring

❸ Term frequency

❹ tf-idf weighting

❺ The vector space model

Page 24: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

24

Collection frequency vs. Document frequency

Collection frequency(cft)

• The total number of occurrences of a term t in the collection.

Document frequency(dft)

• The number of documents in the collection that contain a term t.

24

Page 25: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

25

Desired weight for rare terms

Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection

(e.g., ARACHNOCENTRIC). A document containing this term is very likely to be

relevant. → We want high weights for rare terms like

ARACHNOCENTRIC.

25

Page 26: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

26

Desired weight for frequent terms

Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the

collection (e.g., GOOD, INCREASE, LINE). A document containing this term is more likely to be

relevant than a document that doesn’t . . . . . . but words like GOOD, INCREASE and LINE are not sure

indicators of relevance. → For frequent terms like GOOD, INCREASE and LINE, we

want positive weights . . . . . . but lower weights than for rare terms.

26

Page 27: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

27

Inverse Document Frequency

We want high weights for rare terms like ARACHNOCENTRIC. We want low (positive) weights for frequent words like

GOOD, INCREASE and LINE. We will use document frequency to factor this into

computing the matching score.

27

Page 28: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

28

idf weight

dft is an inverse measure of the informativeness of term t.

We define the idft weight of term t as follows:

(N is the number of documents in the collection.) idft is a measure of the informativeness of the term.

28

Page 29: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

29

Examples for idf The Reuters collection of 806,791 documents

Compute idft using the formula:

29

Page 30: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

30

tf-idf weighting The tf-idf weighting scheme assigns to term t a weight in

document d given by

Highest when t occurs many times within a small number of documents.

Lower when the term occurs fewer times in a document, or occurs in many documents.

Lowest when the term occurs in virtually all documents.

30

Page 31: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

3131

Examples for tf-idf

Page 32: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

32

Variant tf-idf functions wf-idft,d weighting

32

Page 33: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

Outline

33

❶ Why ranked retrieval?

❷ Weighted zone scoring

❸ Term frequency

❹ tf-idf weighting

❺ The vector space model

Page 34: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

34

Binary incidence matrix

Each document is represented as a binary vector {0, 1}∈ |V|.

34

Anthony and Cleopatra

Julius Caesar

The Tempest

Hamlet Othello Macbeth . . .

ANTHONYBRUTUS CAESARCALPURNIACLEOPATRAMERCYWORSER. . .

1110111

1111000

0000011

0110011

0010011

1010010

Page 35: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

35

Count matrix

Each document is now represented as a count vector N∈ |V|.

35

Anthony and Cleopatra

Julius Caesar

The Tempest

Hamlet Othello Macbeth . . .

ANTHONYBRUTUS CAESARCALPURNIACLEOPATRAMERCYWORSER. . .

1574

2320

5722

73157227

10000

0000031

0220081

0010051

1000085

Page 36: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

36

Binary → count → weight matrix

Each document is now represented as a real-valued vector of wf-idf weights R∈ |V|.

36

Anthony and Cleopatra

Julius Caesar

The Tempest

Hamlet Othello Macbeth . . .

ANTHONYBRUTUS CAESARCALPURNIACLEOPATRAMERCYWORSER. . .

5.251.218.59

0.02.851.511.37

3.186.102.541.54

0.00.00.0

0.00.00.00.00.0

1.900.11

0.01.0

1.510.00.0

0.124.15

0.00.0

0.250.00.0

5.250.25

0.350.00.00.00.0

0.881.95

Page 37: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

37

Documents as vectors

Each document is now represented as a real-valued vector of tf-idf weights R∈ |V|.

So we have a |V|-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions

when you apply this to web search engines Each vector is very sparse - most entries are zero.

37

Page 38: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

38

Queries as vectors

Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space.

Key idea 2: Rank documents according to their proximity to the query.

proximity = similarity. proximity ≈ negative distance. Rank relevant documents higher than non-relevant

documents.

38

Page 39: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

39

How do we formalize vector space similarity?

First cut: distance between two points ( = distance between the end points of the two vectors) Euclidean distance? Euclidean distance is a bad idea because Euclidean

distance is large for vectors of different lengths.

39

Page 40: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

40

Use angle instead of distance Rank documents according to angle with query Thought experiment: take a document d and append it to

itself. Call this document d′. d ′ is twice as long as d. d and d′ have the same content. The angle between the two documents is 0, corresponding

to maximal similarity even though the Euclidean distance between the two documents can be quite large.

The following two notions are equivalent.• Rank documents according to the angle between query and

document in decreasing order• Rank documents according to cosine(query,document) in

increasing order40

Page 41: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

41

Length normalization How do we compute the cosine? A vector can be (length-) normalized by dividing each of its

components by its length – here we use the L2 norm:

This maps vectors onto the unit sphere . . . . . . since after normalization: As a result, longer documents and shorter documents have

weights of the same order of magnitude. Effect on the two documents d and d ′ (d appended to

itself) from earlier slide: they have identical vectors after length-normalization.

41

Page 42: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

42

Cosine similarity between query and document

qi is the tf-idf weight of term i in the query.

di is the tf-idf weight of term i in the document. | | and | | are the lengths of and This is the cosine similarity of and . . . . . . or,

equivalently, the cosine of the angle between and

42

Page 43: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

43

Cosine for normalized vectors

For normalized vectors, the cosine is equivalent to the dot product or scalar product.

(if and are length-normalized).

43

Page 44: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

44

Cosine similarity illustrated

44

Page 45: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

45

Cosine: Example1

45

v(Doc1)•v(Doc2)=0.897*0.076+0.126*0.787=0.167v(Doc1)•v(Doc3)=0.696v(Doc2)•v(Doc3)=0.478

Page 46: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

46

Cosine: Example2(N=1,000,000)

46

v(q,Doc1)=0*2.3+0.34*0+0.52*2+0.78*6=5.72 Length[Doc1]=6.73, scores[Doc1]=0.85v(q,Doc2)=0*4.6+0.34*1.3+0.52*4.0+0.78*0=2.52 Length[Doc2]=6.23, scores[Doc2]=0.40v(q,Doc3)=0*2.3+0.34*2.6+0.52*10.0+0.78*6.0=10.76 Length[Doc3]=12.17, scores[Doc3]=0.88Ranking=Doc3, Doc1, Doc2.

Page 47: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

47

Computing the cosine score

47

Page 48: Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Introduction to Information RetrievalIntroduction to Information Retrieval

48

Components of tf-idf weighting

48