Page 1

INFO 4300 / CS4300 Information Retrieval

slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 6: Ranking

Paul Ginsparg

Cornell University, Ithaca, NY

13 Sep 2011

Page 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/

Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep

Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11

Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452

Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment

Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email, use [email protected]

Course text at http://informationretrieval.org/: Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also Information Retrieval, S. Büttcher, C. Clarke, G. Cormack

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

Page 3

Administrativa

Reread assignment 1 instructions

The Midterm Examination is on Thu, Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. The topics to be examined are all the lectures and discussion class readings before the midterm break.

According to the registrar (http://registrar.sas.cornell.edu/Sched/EXFA.html), the final examination is Wed 14 Dec 7:00-9:30 pm (location TBD)

Page 4

Discussion 2, 20 Sep

For this class, read and be prepared to discuss the following:

K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval”. Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf

Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf

The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)

Page 5

Overview

1 Recap

2 Zones

3 Why rank?

4 More on cosine

5 Implementation

Page 6

Outline

1 Recap

2 Zones

3 Why rank?

4 More on cosine

5 Implementation

Page 7

Query Scores: S(q, d) = ∑_{t∈q} w(idf)_t · w(tf)_{t,d}   (ltn.lnn)

1. “A sentence is a document.”
2. “A document is a sentence and a sentence is a document.”
3. “This document is short.”
4. “This document is a sentence.”

tf_{t,d}   doc1  doc2  doc3  doc4
a            2     4     0     1
and          0     1     0     0
document     1     2     1     1
is           1     2     1     1
sentence     1     2     0     1
short        0     0     1     0
this         0     0     1     1

w(tf)_{t,d}  doc1  doc2  doc3  doc4
a            1.3   1.6    0     1
and           0     1     0     0
document      1    1.3    1     1
is            1    1.3    1     1
sentence      1    1.3    0     1
short         0     0     1     0
this          0     0     1     1

           df   w(idf)_t
a           3    .125
and         1    .6
document    4    0
is          4    0
sentence    2    .3
short       1    .6
this        2    .3

[ log(4/4) = 0, log(4/3) ≈ .125, log(4/2) ≈ .3, log(4/1) ≈ .6 ]

Query: “a sentence”

doc1: .125 ∗ 1.3 + .3 ∗ 1 = .46, doc2: .125 ∗ 1.6 + .3 ∗ 1.3 = .59

doc3: .125 ∗ 0 + .3 ∗ 0 = 0, doc4: .125 ∗ 1 + .3 ∗ 1 = .425

Query: “short sentence”

doc1: .6 ∗ 0 + .3 ∗ 1 = .3, doc2: .6 ∗ 0 + .3 ∗ 1.3 = .39

doc3: .6 ∗ 1 + .3 ∗ 0 = .6, doc4: .6 ∗ 0 + .3 ∗ 1 = .3
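Not from the slides: a minimal Python sketch (helper names are ad hoc) that recomputes the ltn.lnn scores above with base-10 logarithms.

```python
import math

# Toy corpus from the slide (N = 4 documents).
docs = [
    "a sentence is a document",
    "a document is a sentence and a sentence is a document",
    "this document is short",
    "this document is a sentence",
]
N = len(docs)

def w_tf(tf):
    """Sublinear tf scaling: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def w_idf(df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

# Term frequencies per document, then document frequencies.
tfs = [{} for _ in docs]
for tf, doc in zip(tfs, docs):
    for term in doc.split():
        tf[term] = tf.get(term, 0) + 1
dfs = {}
for tf in tfs:
    for term in tf:
        dfs[term] = dfs.get(term, 0) + 1

def score(query, tf):
    """ltn.lnn: idf-weighted query terms times log-weighted doc tf."""
    return sum(w_idf(dfs[t]) * w_tf(tf.get(t, 0)) for t in query.split())

for q in ("a sentence", "short sentence"):
    print(q, "->", [round(score(q, tf), 2) for tf in tfs])
# a sentence -> [0.46, 0.59, 0.0, 0.43]   (the slide rounds doc4 to .425)
# short sentence -> [0.3, 0.39, 0.6, 0.3]
```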

Page 8

Query Scores: S(q, d) = ∑_{t∈q} w(idf)_t · w(tf)_{t,d}   (ltn.lnc)

1. “A sentence is a document.”
2. “A document is a sentence and a sentence is a document.”
3. “This document is short.”
4. “This document is a sentence.”

w(tf)_{t,d}  doc1  doc2  doc3  doc4
a            1.3   1.6    0     1
and           0     1     0     0
document      1    1.3    1     1
is            1    1.3    1     1
sentence      1    1.3    0     1
short         0     0     1     0
this          0     0     1     1

w(tf)_{t,d}, cosine-normalized:
             doc1  doc2  doc3  doc4
a            .60   .54    0    .45
and           0    .34    0     0
document     .46   .44   .5    .45
is           .46   .44   .5    .45
sentence     .46   .44    0    .45
short         0     0    .5     0
this          0     0    .5    .45

           df   w(idf)_t
a           3    .125
and         1    .6
document    4    0
is          4    0
sentence    2    .3
short       1    .6
this        2    .3

lengths(doc1, . . . , doc4) = (2.17, 2.94, 2, 2.24)

Query: “a sentence”

doc1: .125 ∗ .6 + .3 ∗ .46 = .21, doc2: .125 ∗ .54 + .3 ∗ .44 = .20

doc3: .125 ∗ 0 + .3 ∗ 0 = 0, doc4: .125 ∗ .45 + .3 ∗ .45 = .19

Query: “short sentence”

doc1: .6 ∗ 0 + .3 ∗ .46 = .14, doc2: .6 ∗ 0 + .3 ∗ .44 = .133

doc3: .6 ∗ .5 + .3 ∗ 0 = .3, doc4: .6 ∗ 0 + .3 ∗ .45 = .134

Page 9

Cosine similarity between query and document

cos(~q, ~d) = sim(~q, ~d) = (~q/|~q|) · (~d/|~d|) = ∑_{i=1}^{|V|} q_i d_i / ( √(∑_{i=1}^{|V|} q_i²) · √(∑_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight (idf) of term i in the query.

d_i is the tf-idf weight (tf) of term i in the document.

|~q| and |~d| are the lengths of ~q and ~d.

~q/|~q| and ~d/|~d| are length-1 vectors (= normalized).
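The formula translates directly to code; here is a small sketch for sparse term-to-weight dictionaries (the function name is ours).

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = q·d / (|q| |d|) for sparse term-weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    len_q = math.sqrt(sum(w * w for w in q.values()))
    len_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (len_q * len_d) if len_q and len_d else 0.0
```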

Page 10

Cosine similarity illustrated

[Figure: query vector ~v(q) and document vectors ~v(d1), ~v(d2), ~v(d3) plotted in a two-dimensional term space with axes “rich” and “poor”; θ marks the angle between two of the vectors.]

Page 11

Variant tf-idf functions

We’ve considered sublinear tf scaling (wf_{t,d} = 1 + log tf_{t,d})

Or normalize instead by the maximum tf in the document, tf_max(d):

ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d)

where a ∈ [0, 1] (e.g., .4) is a smoothing term to avoid large swings in ntf due to small changes in tf.

This eliminates the repeated-content problem (d′ = d + d), but has other issues:

sensitive to change in stop word list

outlier terms with large tf

skewed tf distributions, in which many terms occur nearly as often as the most frequent term.
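A tiny sketch of this normalization, also showing why duplicating a document (d′ = d + d) leaves ntf unchanged:

```python
def ntf(tf, tf_max, a=0.4):
    """Augmented tf, normalized by the document's maximum tf."""
    return a + (1 - a) * tf / tf_max

# Doubling a document (d' = d + d) doubles every tf and tf_max,
# so ntf is unchanged:
assert ntf(3, 10) == ntf(6, 20)  # both 0.58
```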

Page 12

Components of tf.idf weighting

Term frequency:
  n (natural)     tf_{t,d}
  l (logarithm)   1 + log(tf_{t,d})
  a (augmented)   0.5 + 0.5 · tf_{t,d} / max_t(tf_{t,d})
  b (boolean)     1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)     (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)          1
  t (idf)         log(N / df_t)
  p (prob idf)    max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / √(w₁² + w₂² + . . . + w_M²)
  u (pivoted unique)  1/u
  b (byte size)       1 / CharLength^α, α < 1

Best known combination of weighting options

Default: no weighting

Page 13

tf.idf example

We often use different weightings for queries and documents.

Notation: qqq.ddd (term frequency / document frequency / normalization) for (query.document)

Example: ltn.lnc

query: logarithmic tf, idf, no normalization

document: logarithmic tf, no df weighting, cosine normalization

Isn’t it bad to not idf-weight the document?

Example query: “best car insurance”

Example document: “car insurance auto insurance”

Page 14

tf.idf example: ltn.lnc

Query: “best car insurance”. Document: “car insurance auto insurance”.

word       | query                                 | document                          | product
           | tf-raw  tf-wght  df     idf   weight | tf-raw  tf-wght  weight  n’lized |
auto       |   0       0      5000   2.3     0    |   1       1        1      0.52   |   0
best       |   1       1      50000  1.3    1.3   |   0       0        0      0      |   0
car        |   1       1      10000  2.0    2.0   |   1       1        1      0.52   |   1.04
insurance  |   1       1      1000   3.0    3.0   |   2       1.3      1.3    0.68   |   2.04

Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n’lized: document weights after cosine normalization, product: the product of final query weight and final document weight

Document length: √(1² + 0² + 1² + 1.3²) ≈ 1.92

1/1.92 ≈ 0.52; 1.3/1.92 ≈ 0.68

Final similarity score between query and document: ∑_i w_{q,i} · w_{d,i} = 0 + 0 + 1.04 + 2.04 = 3.08
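The table’s idf values are consistent with an assumed collection size N = 1,000,000 and idf = log10(N/df). Under that assumption, this sketch reproduces the score; keeping full precision gives 3.07, while the slide’s rounded intermediate weights (.52, .68) give 3.08.

```python
import math

# Per-term data from the table above: (query tf, df, document tf).
terms = {
    "auto":      (0, 5000,  1),
    "best":      (1, 50000, 0),
    "car":       (1, 10000, 1),
    "insurance": (1, 1000,  2),
}
N = 1_000_000  # assumed, consistent with the idf column

w_tf = lambda tf: 1 + math.log10(tf) if tf > 0 else 0.0

# ltn query weights: log tf x idf, no normalization.
q = {t: w_tf(tfq) * math.log10(N / df) for t, (tfq, df, _) in terms.items()}
# lnc document weights: log tf, no idf, cosine normalization.
d = {t: w_tf(tfd) for t, (_, _, tfd) in terms.items()}
length = math.sqrt(sum(w * w for w in d.values()))  # ~ 1.92
d = {t: w / length for t, w in d.items()}

print(round(sum(q[t] * d[t] for t in terms), 2))  # -> 3.07
```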

Page 15

Outline

1 Recap

2 Zones

3 Why rank?

4 More on cosine

5 Implementation

Page 16

Parametric and Zone indices

Digital documents have additional structure: metadata encoded in machine-parseable form (e.g., author, title, date of publication, . . .)

One parametric index for each field.

Fields: take finite set of values (e.g., dates of authorship)

Zones: arbitrary free text (e.g., titles, abstracts)

Permits searching for documents by Shakespeare written in 1601 containing the phrase “alas poor Yorick”, or finding documents with “merchant” in title and “william” in author list and the phrase “gentle rain” in body

Use separate indexes for each field and zone, or use

william.abstract, william.title, william.author

Permits weighted zone scoring

Page 17

Weighted Zone Scoring

Given boolean query q and document d, assign to the pair (q, d) a score in [0, 1] by computing a linear combination of zone scores.

Let g_1, . . . , g_ℓ ∈ [0, 1] such that ∑_{i=1}^{ℓ} g_i = 1.

For 1 ≤ i ≤ ℓ, let s_i be the score between q and the i-th zone.

Then the weighted zone score is defined as ∑_{i=1}^{ℓ} g_i s_i.

Example: three zones: author, title, body; g_1 = .2, g_2 = .5, g_3 = .3 (match in author zone least important)

Compute weighted zone scores directly from inverted indexes: instead of adding a document to the set of results as for a boolean AND query, now compute a score for each document.
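A minimal sketch of weighted zone scoring with the slide’s example weights (the function name is ours):

```python
def weighted_zone_score(zone_scores, weights):
    """Linear combination sum_i g_i * s_i of per-zone match scores."""
    assert abs(sum(weights) - 1.0) < 1e-9  # the g_i must sum to 1
    return sum(g * s for g, s in zip(weights, zone_scores))

# Boolean matches in (author, title, body) with the slide's weights:
print(weighted_zone_score([0, 1, 1], [0.2, 0.5, 0.3]))  # -> 0.8
```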

Page 18

Learning Weights

How to determine the weights gi for weighted zone scoring?

A. specified by expert

B. “learned” using training examples that have been judged editorially (machine-learned relevance)

1. given set of training examples [(q, d) plus relevance judgment (e.g., yes/no)]

2. set the weights g_i to best approximate the relevance judgments

Expensive component: labor-intensive assembly of user-generated relevance judgments, especially expensive in a rapidly changing collection (such as the Web).

Or use “passive collaborative feedback”? (clickthrough data)

Page 19

Page 20

Machine Learned Relevance

Given a table of Boolean matches s_T(d, q), s_B(d, q) (e.g., title and body zones), and relevance judgments r(d, q) (also, e.g., binary) of document d relevant to query q (see fig. 6.5 in text), compute a score for each of the training examples

score(d, q) = g · s_T(d, q) + (1 − g) · s_B(d, q)

and compare with r(d , q) using an error function

ε(g, Φ_j) = ( r(d_j, q_j) − score(d_j, q_j) )²

for each training example Φ_j.

Choose g to minimize the total error ∑_j ε(g, Φ_j) (a quadratic function of g, so elementary algebra in this case; more generally a sophisticated optimization problem).
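One way to do that elementary algebra, sketched with made-up toy data: setting the derivative of the total error to zero gives a closed-form g.

```python
def learn_g(sT, sB, r):
    """Least-squares fit of g in score = g*sT + (1-g)*sB.

    Setting d/dg of sum_j (r_j - sB_j - g*x_j)^2 to zero, with
    x_j = sT_j - sB_j, gives g = sum x_j*(r_j - sB_j) / sum x_j^2.
    """
    xs = [t - b for t, b in zip(sT, sB)]
    den = sum(x * x for x in xs)
    g = sum(x * (rj - b) for x, b, rj in zip(xs, sB, r)) / den if den else 0.5
    return min(1.0, max(0.0, g))  # keep the weight in [0, 1]

# Made-up toy training set: Boolean zone matches, binary relevance.
sT = [1, 1, 0, 0]; sB = [1, 0, 1, 0]; r = [1, 1, 1, 0]
print(learn_g(sT, sB, r))  # -> 0.5
```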

Page 21

Outline

1 Recap

2 Zones

3 Why rank?

4 More on cosine

5 Implementation

Page 22

Why is ranking so important?

Two lectures ago: Problems with unranked retrieval

Users want to look at a few results – not thousands.
It’s very hard to write queries that produce a few results, even for expert searchers.
→ Ranking is important because it effectively reduces a large set of results to a very small one.

Next: More data on “users only look at a few results”

Actually, in the vast majority of cases they only look at 1, 2,or 3 results.

Page 23

Empirical investigation of the effect of ranking

How can we measure how important ranking is?

Observe what searchers do when they are searching in acontrolled setting

Videotape them
Ask them to “think aloud”
Interview them
Eye-track them
Time them
Record and count their clicks

The following slides are from Dan Russell’s JCDL talk 2007

Dan Russell is the “Uber Tech Lead for Search Quality & User Happiness” at Google.

Page 24

Interview video

So . . . Did you notice the FTD official site?

To be honest I didn’t even look at that.

At first I saw “from $20” and $20 is what I was looking for.

To be honest, 1800-flowers is what I’m familiar with and why I went there next even though I kind of assumed they wouldn’t have $20 flowers.

And you knew they were expensive?

I knew they were expensive but I thought “hey, maybe they’ve got some flowers for under $20 here . . .”

But you didn’t notice the FTD?

No I didn’t, actually. . . that’s really funny.

Page 25

Page 26

Local work

Granka, L., Joachims, T., and Gay, G. (2004), “Eye-Tracking Analysis of User Behavior in WWW Search”, Proceedings of the 27th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR ’04)

http://www.cs.cornell.edu/People/tj/publications/granka_etal_04a.pdf

Page 27

Page 28

Page 29

Page 30

Page 31

Out of date?

Use of top and right margins

“instant” results

only for high bandwidth users?

mobile devices

Page 32

Importance of ranking: Summary

Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower ranked pages (7, 8, 9, 10).

Clicking: Distribution is even more skewed for clicking

In 1 out of 2 cases, users click on the top-ranked page.

Even if the top-ranked page is not relevant, 30% of users will click on it.

→ Getting the ranking right is very important.

→ Getting the top-ranked page right is most important.

Page 33

Outline

1 Recap

2 Zones

3 Why rank?

4 More on cosine

5 Implementation

Page 34

A problem for cosine normalization

Query q: “anti-doping rules Beijing 2008 olympics”

Compare three documents

d1: a short document on anti-doping rules at 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics

What ranking do we expect in the vector space model?

d2 is likely to be ranked below d3 . . .

. . . but d2 is more relevant than d3.

What can we do about this?

Page 35

Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).

Adjust cosine normalization by linear adjustment: “turning” the average normalization on the pivot

Effect: Similarities of short documents with query decrease; similarities of long documents with query increase.

This removes the unfair advantage that short documents have.
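The slides don’t give the formula; a commonly used form, assumed here after Singhal et al.’s pivoted document length normalization, replaces the cosine length by a linear function of it:

```python
def pivoted_length(doc_len, pivot, slope=0.75):
    """Pivoted normalization factor: (1 - slope)*pivot + slope*doc_len.
    Documents shorter than the pivot get a larger divisor than plain
    cosine normalization (scores drop); longer ones get a smaller
    divisor (scores rise).  Assumed form, not from the slides."""
    return (1.0 - slope) * pivot + slope * doc_len

# With pivot = average cosine length 10: a short doc (length 4) is
# divided by 5.5 instead of 4; a long doc (length 20) by 17.5, not 20.
print(pivoted_length(4, 10), pivoted_length(20, 10))  # -> 5.5 17.5
```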

Page 36

Predicted and true probability of relevance

source: Lillian Lee

Page 37

Pivot normalization

source: Lillian Lee

Page 38

Outline

1 Recap

2 Zones

3 Why rank?

4 More on cosine

5 Implementation

Page 39

Now we also need term frequencies in the index

Brutus −→ (1, 2) (7, 3) (83, 1) (87, 2) . . .

Caesar −→ (1, 1) (5, 1) (13, 1) (17, 1) . . .

Calpurnia −→ (7, 1) (8, 2) (40, 1) (97, 3)

Each posting is a (docID, term frequency) pair.

We also need positions. Not shown here.

Page 40

Term frequencies in the inverted index

In each posting, store tf_{t,d} in addition to docID d

As an integer frequency, not as a (log-)weighted real number . . .

. . . because real numbers are difficult to compress.

Unary code is effective for encoding term frequencies.

Why?

Overall, additional space requirements are small: much less than a byte per posting.
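For the “Why?” above, a sketch of the unary convention used in the course text (n encoded as n 1-bits plus a terminating 0-bit): it is short exactly when tf is small, and most tf values are small.

```python
def unary(n):
    """Unary code from the course text: n encoded as n 1-bits
    followed by a terminating 0-bit."""
    return "1" * n + "0"

# Small term frequencies dominate postings lists, and unary spends
# only tf + 1 bits on them:
print(unary(1), unary(3))  # -> 10 1110
```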

Page 41

How do we compute the top k in ranking?

In many applications, we don’t need a complete ranking.

We just need the top k for a small k (e.g., k = 100).

If we don’t need a complete ranking, is there an efficient wayof computing just the top k?

Naive:

Compute scores for all N documents
Sort
Return the top k

What’s bad about this?

Alternative?

Page 42

Use min heap for selecting top k out of N

Use a binary min heap

A binary min heap is a binary tree in which each node’s value is less than the values of its children.

Takes O(N log k) operations to construct (where N is the number of documents) . . .

. . . then read off k winners in O(k log k) steps

Essentially linear in N for small k and large N.

Page 43

Binary min heap

            0.6
      0.85        0.7
   0.9   0.97   0.8   0.95

Page 44

Selecting top k scoring documents in O(N log k)

Goal: Keep the top k documents seen so far

Use a binary min heap

To process a new document d ′ with score s ′:

Get current minimum h_m of heap (O(1))
If s′ ≤ h_m, skip to next document
If s′ > h_m, heap-delete-root (O(log k))
Heap-add d′/s′ (O(log k))
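A minimal sketch of this loop using Python’s heapq (names are ours):

```python
import heapq

def top_k(scored_docs, k):
    """Keep the k best (score, docID) pairs with a size-k min heap.

    The heap root is always the smallest retained score, so each new
    document costs an O(1) comparison and at most O(log k) heap work.
    """
    heap = []
    for score, doc in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:               # beats current minimum
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True)          # read off the winners

print(top_k([(0.3, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")], 2))
# -> [(0.9, 'd2'), (0.7, 'd4')]
```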

Page 45

Even more efficient computation of top k?

Ranking has time complexity O(N) where N is the number ofdocuments.

Optimizations reduce the constant factor, but they are still O(N) — and 10^10 < N < 10^11!

Are there sublinear algorithms?

Ideas?

What we’re doing in effect: solving the k-nearest neighbor (kNN) problem for the query vector (= query point).

There are no general solutions to this problem that are sublinear.

We will revisit when we do kNN classification

Page 46

Cluster pruning

Cluster docs in preprocessing step

Pick √N “leaders”

For non-leaders, find nearest leader (expect ≈ √N per leader)

For query q, find closest leader L (√N computations)

Rank L and followers

or generalize: b1 closest leaders, and then b2 leaders closest to query
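A sketch of the basic scheme (random leader selection is one common choice; all names are ours):

```python
import math
import random

def build_clusters(vecs, sim):
    """Preprocessing: pick sqrt(N) random leaders, attach every other
    document to its nearest leader.  `vecs` maps docID to a vector;
    `sim` is a similarity function such as the cosine."""
    leaders = random.sample(list(vecs), round(math.sqrt(len(vecs))))
    followers = {l: [] for l in leaders}
    for d in vecs:
        if d not in followers:
            nearest = max(leaders, key=lambda l: sim(vecs[d], vecs[l]))
            followers[nearest].append(d)
    return followers

def search(q, vecs, followers, sim):
    """Query time: find the closest leader (sqrt(N) comparisons),
    then rank only that leader and its followers."""
    leader = max(followers, key=lambda l: sim(q, vecs[l]))
    candidates = [leader] + followers[leader]
    return sorted(candidates, key=lambda d: sim(q, vecs[d]), reverse=True)
```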

Page 47

Page 48

Even more efficient computation of top k

Idea 1: Reorder postings lists

Instead of ordering according to docID . . .
. . . order according to some measure of “expected relevance”.

Idea 2: Heuristics to prune the search space

Not guaranteed to be correct . . .
. . . but fails rarely.
In practice, close to constant time.
For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing.
