
Page 1: Information Retrieval

Information Retrieval

CSE 8337 (Part B), Spring 2009

Some material for these slides obtained from:
• Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  http://www.sims.berkeley.edu/~hearst/irbook/
• Data Mining Introductory and Advanced Topics by Margaret H. Dunham
  http://www.engr.smu.edu/~mhd/book
• Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
  http://informationretrieval.org

Page 2: Information Retrieval

CSE 8337 Outline
• Introduction
• Simple Text Processing
• Boolean Queries
• Web Searching/Crawling
• Indexes
• Vector Space Model
• Matching
• Evaluation


Page 4: Information Retrieval

Modeling TOC (Vector Space and Other Models)
• Introduction
• Classic IR Models
  • Boolean Model
  • Vector Model
  • Probabilistic Model
• Extended Boolean Model
• Vector Space Scoring
• Vector Model and Web Search

Page 5: Information Retrieval

IR Models (taxonomy)

User task:
• Retrieval: ad hoc, filtering
• Browsing

Models:
• Classic Models: Boolean, Vector, Probabilistic
  • Set Theoretic extensions: Fuzzy, Extended Boolean
  • Algebraic extensions: Generalized Vector, Latent Semantic Indexing, Neural Networks
  • Probabilistic extensions: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext

Page 6: Information Retrieval

The Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
  • precise semantics and neat formalism
• Terms are either present or absent; thus wij ∈ {0,1}
• Consider q = ka ∧ (kb ∨ ¬kc)
  • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • qcc = (1,1,0) is a conjunctive component
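
A small illustrative sketch (not from the slides) of this DNF evaluation in Python; the term names and query are the ka/kb/kc example above:

```python
# Evaluate q = ka AND (kb OR NOT kc) via its disjunctive normal form,
# qdnf = (1,1,1) OR (1,1,0) OR (1,0,0), against binary term incidences.
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}  # conjunctive components over (ka, kb, kc)

def sim(doc_terms):
    # gi(dj): 1 if term ki appears in dj, else 0
    g = tuple(int(k in doc_terms) for k in ("ka", "kb", "kc"))
    return 1 if g in QDNF else 0

print(sim({"ka", "kb"}))  # 1: matches the conjunctive component (1,1,0)
print(sim({"kb", "kc"}))  # 0: ka is absent
```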

Page 7: Information Retrieval

The Boolean Model

q = ka ∧ (kb ∨ ¬kc)

sim(q,dj) = 1 if ∃qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc))
          = 0 otherwise

[Figure: Venn diagram over Ka, Kb, Kc marking the conjunctive components (1,1,1), (1,1,0), and (1,0,0).]

Page 8: Information Retrieval

Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no notion of partial matching
• No ranking of the documents is provided
• Information need has to be translated into a Boolean expression
• The Boolean queries formulated by users are most often too simplistic
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Page 9: Information Retrieval

The Vector Model
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching

Page 10: Information Retrieval

The Vector Model
• wij > 0 whenever ki appears in dj
• wiq ≥ 0 associated with the pair (ki,q)
• dj = (w1j, w2j, ..., wtj)
• q = (w1q, w2q, ..., wtq)
• To each term ki is associated a unit vector i
• The unit vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unit vectors form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors

Page 11: Information Retrieval

The Vector Model

sim(q,dj) = cos(θ)
          = (dj • q) / (|dj| × |q|)
          = Σi (wij × wiq) / (|dj| × |q|)

• Since wij ≥ 0 and wiq ≥ 0, we have 0 ≤ sim(q,dj) ≤ 1
• A document is retrieved even if it matches the query terms only partially

[Figure: vectors dj and q in term space, separated by the angle θ.]

Page 12: Information Retrieval

Weights wij and wiq?
• One approach is to examine the frequency of occurrence of a word in a document:
• Absolute frequency:
  • tf factor, the term frequency within a document
  • freqi,j: raw frequency of ki within dj
  • both high-frequency and low-frequency terms may not actually be significant
• Relative frequency: tf divided by the number of words in the document
• Normalized frequency: fi,j = freqi,j / (maxl freql,j)

Page 13: Information Retrieval

Inverse Document Frequency
• Importance of a term may depend more on how well it distinguishes between documents
• Quantification of inter-document separation
• Dissimilarity, not similarity
• idf factor, the inverse document frequency

Page 14: Information Retrieval

IDF
• Let N be the total number of docs in the collection
• Let ni be the number of docs which contain ki
• The idf factor is computed as
  idfi = log(N/ni)
• The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
• Example (base-10 logs): N=1000, n1=100, n2=500, n3=800
  idf1 = 3 - 2 = 1
  idf2 = 3 - 2.7 = 0.3
  idf3 = 3 - 2.9 = 0.1
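
A quick check of the worked example, assuming the base-10 logarithm the slide's numbers imply:

```python
import math

def idf(N, n_i):
    return math.log10(N / n_i)  # log base 10, matching the example

for n in (100, 500, 800):  # N = 1000
    print(round(idf(1000, n), 1))  # 1.0, 0.3, 0.1
```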

Page 15: Information Retrieval

The Vector Model
• The best term-weighting schemes take both tf and idf into account:
  wij = fi,j × log(N/ni)
• This strategy is called a tf-idf weighting scheme
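
A minimal sketch of this weighting, combining the max-normalized tf from the earlier slide with the idf above (an illustration, not reference code; assumes non-empty token lists):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns a {term: weight} dict per doc,
    using w_ij = f_ij * log10(N / n_i) with f_ij = freq_ij / max_l freq_lj."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_f = max(freq.values())
        weights.append({t: (f / max_f) * math.log10(N / df[t])
                        for t, f in freq.items()})
    return weights
```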

Page 16: Information Retrieval

The Vector Model
• For the query term weights, a suggestion is
  wiq = (0.5 + 0.5 × freqi,q / maxl freql,q) × log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy for general collections
• The vector model is usually as good as any known ranking alternative
• It is also simple and fast to compute

Page 17: Information Retrieval

The Vector Model
• Advantages:
  • term-weighting improves quality of the answer set
  • partial matching allows retrieval of docs that approximate the query conditions
  • cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
  • assumes independence of index terms; not clear that this is bad, though

Page 18: Information Retrieval

The Vector Model: Example I

     k1  k2  k3  |  q • dj
d1    1   0   1  |   2
d2    1   0   0  |   1
d3    0   1   1  |   2
d4    1   0   0  |   1
d5    1   1   1  |   3
d6    1   1   0  |   2
d7    0   1   0  |   1
q     1   1   1

[Figure: documents d1 through d7 plotted in the k1-k2-k3 term space.]

Page 19: Information Retrieval

The Vector Model: Example II

     k1  k2  k3  |  q • dj
d1    1   0   1  |   4
d2    1   0   0  |   1
d3    0   1   1  |   5
d4    1   0   0  |   1
d5    1   1   1  |   6
d6    1   1   0  |   3
d7    0   1   0  |   2
q     1   2   3

[Figure: documents d1 through d7 plotted in the k1-k2-k3 term space.]

Page 20: Information Retrieval

The Vector Model: Example III

     k1  k2  k3  |  q • dj
d1    2   0   1  |   5
d2    1   0   0  |   1
d3    0   1   3  |  11
d4    2   0   0  |   2
d5    1   2   4  |  17
d6    1   2   0  |   5
d7    0   5   0  |  10
q     1   2   3

[Figure: documents d1 through d7 plotted in the k1-k2-k3 term space.]
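
A few lines verifying Example III's q • dj column (an illustrative check, not part of the slides):

```python
docs = {"d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
        "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0)}
q = (1, 2, 3)

# Rank by the (unnormalized) dot product used in these examples
scores = {d: sum(w * wq for w, wq in zip(vec, q)) for d, vec in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, s)  # d5 17, d3 11, d7 10, d1 5, d6 5, d4 2, d2 1
```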

Page 21: Information Retrieval

Probabilistic Model
• Objective: to capture the IR problem using a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying is a specification of the properties of this ideal answer set (clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve by iteration

Page 22: Information Retrieval

Probabilistic Model
• An initial set of documents is retrieved somehow
• User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
• IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Keep in mind the need to guess, at the very beginning, the description of the ideal answer set
• Description of the ideal answer set is modeled in probabilistic terms

Page 23: Information Retrieval

Probabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find document dj interesting (i.e., relevant). The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
• But how do we compute these probabilities? What is the sample space?

Page 24: Information Retrieval

The Ranking
• Probabilistic ranking computed as:
  sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
• This is the odds of the document dj being relevant
• Taking the odds minimizes the probability of an erroneous judgement
• Definitions:
  • wij ∈ {0,1}
  • P(R | dj): probability that the given doc is relevant
  • P(¬R | dj): probability that the given doc is not relevant

Page 25: Information Retrieval

The Ranking

sim(dj,q) = P(R | dj) / P(¬R | dj)
          = [P(dj | R) × P(R)] / [P(dj | ¬R) × P(¬R)]
          ~ P(dj | R) / P(dj | ¬R)

• P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents

Page 26: Information Retrieval

The Ranking

sim(dj,q) ~ P(dj | R) / P(dj | ¬R)
          ~ [ Π(gi=1) P(ki | R) × Π(gi=0) P(¬ki | R) ] / [ Π(gi=1) P(ki | ¬R) × Π(gi=0) P(¬ki | ¬R) ]

• P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents

Page 27: Information Retrieval

The Ranking

sim(dj,q) ~ log { [ Π P(ki | R) × Π P(¬ki | R) ] / [ Π P(ki | ¬R) × Π P(¬ki | ¬R) ] }

          ~ K × Σi [ log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) ]

where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)

Page 28: Information Retrieval

The Initial Ranking

sim(dj,q) ~ Σi wiq × wij × [ log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) ]

• How do we get the probabilities P(ki | R) and P(ki | ¬R)?
• Estimates based on assumptions:
  • P(ki | R) = 0.5
  • P(ki | ¬R) = ni / N
• Use this initial guess to retrieve an initial ranking
• Improve upon this initial ranking

Page 29: Information Retrieval

Improving the Initial Ranking
• Let
  • V: set of docs initially retrieved
  • Vi: subset of docs retrieved that contain ki
• Reevaluate estimates:
  • P(ki | R) = Vi / V
  • P(ki | ¬R) = (ni - Vi) / (N - V)
• Repeat recursively

Page 30: Information Retrieval

Improving the Initial Ranking
• To avoid problems with V=1 and Vi=0:
  • P(ki | R) = (Vi + 0.5) / (V + 1)
  • P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1)
• Also,
  • P(ki | R) = (Vi + ni/N) / (V + 1)
  • P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
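
A hedged sketch of the per-term weight after one feedback round, using the 0.5-adjusted estimates above:

```python
import math

def bim_term_weight(V, Vi, N, ni):
    """V: #docs initially retrieved; Vi: #retrieved docs containing ki;
    N: collection size; ni: #docs containing ki."""
    p = (Vi + 0.5) / (V + 1)           # P(ki | R)
    u = (ni - Vi + 0.5) / (N - V + 1)  # P(ki | not R)
    # the log-odds term each matching ki contributes to sim(dj,q)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)
```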

Page 31: Information Retrieval

Pluses and Minuses
• Advantages:
  • docs ranked in decreasing order of probability of relevance
• Disadvantages:
  • need to guess initial estimates for P(ki | R)
  • method does not take into account tf and idf factors

Page 32: Information Retrieval

Brief Comparison of Classic Models
• The Boolean model does not provide for partial matches and is considered the weakest classic model
• Salton and Buckley did a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
• This also seems to be the view of the research community

Page 33: Information Retrieval

Extended Boolean Model
• The Boolean model is simple and elegant, but provides no ranking
• As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
• Extend the Boolean model with the notions of partial matching and term weighting
• Combine characteristics of the vector model with properties of Boolean algebra

Page 34: Information Retrieval

The Idea
• The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
• Let q = kx ∧ ky
• wxj = fxj × idfx / maxi idfi is the weight associated with [kx, dj]
• Further, let wxj = x and wyj = y

Page 35: Information Retrieval

The Idea: AND

qand = kx ∧ ky; wxj = x and wyj = y

sim(qand,dj) = 1 - sqrt( ((1-x)² + (1-y)²) / 2 )

[Figure: document dj at point (x,y) = (wxj, wyj) in the kx-ky plane; (1,1) is the ideal point for the AND query.]

Page 36: Information Retrieval

The Idea: OR

qor = kx ∨ ky; wxj = x and wyj = y

sim(qor,dj) = sqrt( (x² + y²) / 2 )

[Figure: document dj at point (x,y) in the kx-ky plane; (0,0) is the worst point for the OR query.]

Page 37: Information Retrieval

Generalizing the Idea
• We can extend the previous model to consider Euclidean distances in a t-dimensional space
• This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter

Page 38: Information Retrieval

Generalizing the Idea
• A generalized disjunctive query is given by
  qor = k1 ∨p k2 ∨p ... ∨p kt
• A generalized conjunctive query is given by
  qand = k1 ∧p k2 ∧p ... ∧p kt

sim(qor,dj) = ( (x1^p + x2^p + ... + xm^p) / m )^(1/p)

sim(qand,dj) = 1 - ( ((1-x1)^p + (1-x2)^p + ... + (1-xm)^p) / m )^(1/p)
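
A minimal sketch of the two p-norm similarities, with x the list of term weights x1..xm in [0,1]:

```python
def sim_or(x, p):
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1 / p)

def sim_and(x, p):
    m = len(x)
    return 1 - (sum((1 - xi) ** p for xi in x) / m) ** (1 / p)

# p = 1 reduces both to the average (vector-like); a large p approaches
# max/min (fuzzy-like), e.g. sim_or([0.2, 0.9], 100) ~ 0.9
```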

Page 39: Information Retrieval

Properties
• If p = 1 (vector-like):
  sim(qor,dj) = sim(qand,dj) = (x1 + ... + xm) / m
• If p = ∞ (fuzzy-like):
  sim(qor,dj) = max(wxj)
  sim(qand,dj) = min(wxj)
• By varying p, we can make the model behave as a vector model, as a fuzzy model, or as an intermediate model

Page 40: Information Retrieval

Properties
• This is quite powerful and is a good argument in favor of the extended Boolean model
• q = (k1 ∧ k2) ∨ k3
  • k1 and k2 are to be used as in vector retrieval, while the presence of k3 is required
• With p = 2:
  sim(q,dj) = sqrt( ( (1 - sqrt( ((1-x1)² + (1-x2)²) / 2 ))² + x3² ) / 2 )

Page 41: Information Retrieval

Conclusions
• The model is quite powerful
• Properties are interesting and might be useful
• Computation is somewhat complex
• However, distributivity does not hold for the ranking computation:
  q1 = (k1 ∨ k2) ∧ k3
  q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
  sim(q1,dj) ≠ sim(q2,dj)

Page 42: Information Retrieval

Vector Space Scoring
• First cut: distance between two points (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea...
• ...because Euclidean distance is large for vectors of different lengths

Page 43: Information Retrieval

Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

[Accompanying figure omitted from the transcript.]

Page 44: Information Retrieval

Use angle instead of distance
• Thought experiment: take a document d and append it to itself. Call this document d′.
• "Semantically" d and d′ have the same content
• The Euclidean distance between the two documents can be quite large
• The angle between the two documents is 0, corresponding to maximal similarity
• Key idea: rank documents according to angle with the query

Page 45: Information Retrieval

From angles to cosines
• The following two notions are equivalent:
  • rank documents in increasing order of the angle between query and document
  • rank documents in decreasing order of cosine(query,document)
• Cosine is a monotonically decreasing function on the interval [0°, 180°]

Page 46: Information Retrieval

Length normalization
• A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:
  ‖x‖2 = sqrt( Σi xi² )
• Dividing a vector by its L2 norm makes it a unit (length) vector
• Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization

Page 47: Information Retrieval

cosine(query,document)

cos(q,d) = (q • d) / (‖q‖ ‖d‖) = (q/‖q‖) • (d/‖d‖) = Σi qi di / ( sqrt(Σi qi²) × sqrt(Σi di²) )

• q • d is the dot product; q/‖q‖ and d/‖d‖ are unit vectors
• qi is the tf-idf weight of term i in the query
• di is the tf-idf weight of term i in the document
• cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d
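
A direct transcription of this formula as code (illustrative; assumes dense, equal-length weight vectors):

```python
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)
```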

Page 48: Information Retrieval

Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Page 49: Information Retrieval

3 documents example contd.

Log frequency weighting (1 + log10 tf):

term        SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30     0  1.78
wuthering     0     0  2.58

After length normalization:

term         SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335      0  0.405
wuthering      0      0  0.588

cos(SaS,PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0 + 0×0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
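
A sketch reproducing these numbers (assuming base-10 logs for the 1 + log tf weighting, which matches the table):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def unit_vector(doc):
    w = {t: log_weight(tf) for t, tf in doc.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}

vecs = {name: unit_vector(doc) for name, doc in counts.items()}

def cos(a, b):  # dot product of unit vectors
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```

As for the question: SaS and PaP put most of their weight on affection and jealous and both lack wuthering, so their unit vectors point in nearly the same direction.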

Page 50: Information Retrieval

tf-idf weighting has many variants

[Table of SMART weighting variants omitted from the transcript. Columns headed 'n' are acronyms for weight schemes.]

Why is the base of the log in idf immaterial?

Page 51: Information Retrieval

Weighting may differ in queries vs documents
• Many search engines allow different weightings for queries vs documents
• To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table
• Example: ltn.lnc means:
  • query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
  • document: logarithmic tf, no idf, and cosine normalization
• Is this a bad idea?

Page 52: Information Retrieval

tf-idf example: ltn.lnc

Document: car insurance auto insurance
Query: best car insurance

Term       | Query: tf-raw  tf-wt  df     idf  wt   | Doc: tf-raw  tf-wt  wt    n'lized | Prod
auto       |        0       0      5000   2.3  0    |      1       1      1     0.52    | 0
best       |        1       1      50000  1.3  1.3  |      0       0      0     0       | 0
car        |        1       1      10000  2.0  2.0  |      1       1      1     0.52    | 1.04
insurance  |        1       1      1000   3.0  3.0  |      2       1.3    1.3   0.68    | 2.04

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 1.04 + 2.04 = 3.08
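
A sketch of ltn.lnc scoring for this example. N is not given on the slide (that is the exercise); N = 1,000,000 is the value consistent with the idf column (e.g. log10(1,000,000/10,000) = 2.0):

```python
import math

N = 1_000_000  # inferred from the idf column, not stated on the slide
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

query = {"best": 1, "car": 1, "insurance": 1}
doc = {"car": 1, "insurance": 2, "auto": 1}

# Query side, ltn: log tf * idf, no normalization
q_wt = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query.items()}
# Document side, lnc: log tf, no idf, cosine normalization
d_wt = {t: log_tf(tf) for t, tf in doc.items()}
length = math.sqrt(sum(w * w for w in d_wt.values()))  # ~1.92
d_wt = {t: w / length for t, w in d_wt.items()}

score = sum(w * d_wt.get(t, 0.0) for t, w in q_wt.items())
print(round(score, 2))  # ~3.07 (the slide rounds intermediate values, giving 3.08)
```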

Page 53: Information Retrieval

Summary – vector space ranking
• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K (e.g., K = 10) to the user

Page 54: Information Retrieval

Vector Model and Web Search
• Speeding up vector space ranking
• Putting together a complete search system
• Will require learning about a number of miscellaneous topics and heuristics

Page 55: Information Retrieval

Efficient cosine ranking
• Find the K docs in the collection "nearest" to the query: the K largest query-doc cosines
• Efficient ranking means:
  • computing a single cosine efficiently
  • choosing the K largest cosine values efficiently
• Can we do this without computing all N cosines?

Page 56: Information Retrieval

Efficient cosine ranking
• What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
• In general, we do not know how to do this efficiently for high-dimensional spaces
• But it is solvable for short queries, and standard indexes support this well

Page 57: Information Retrieval

Special case – unweighted queries
• No weighting on query terms
• Assume each query term occurs only once
• Then for ranking, we don't need to normalize the query vector

Page 58: Information Retrieval

Faster cosine: unweighted query

[Algorithm figure omitted from the transcript: term-at-a-time cosine scoring with an unweighted query (the Fig 7.1 algorithm referenced on a later slide).]
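
A sketch of such a procedure; the postings and doc_length structures are assumptions:

```python
import heapq
from collections import defaultdict

def cosine_score_unweighted(query_terms, postings, doc_length, K=10):
    """postings: term -> list of (doc_id, w_td); doc_length: doc_id -> |d|."""
    scores = defaultdict(float)
    for t in query_terms:                # each query term counted once, weight 1
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] += w_td       # accumulate term-at-a-time
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]  # normalize by document length only
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
```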

Page 59: Information Retrieval

Computing the K largest cosines: selection vs. sorting
• Typically we want to retrieve the top K docs (in the cosine ranking for the query)
  • not to totally order all docs in the collection
• Can we pick off the docs with the K highest cosines?
• Let J = number of docs with nonzero cosines
• We seek the K best of these J

Page 60: Information Retrieval

Use heap for selecting top K
• Binary tree in which each node's value > the values of its children
• Takes 2J operations to construct; then each of the K "winners" is read off in 2 log J steps
• For J = 1M, K = 100, this is about 10% of the cost of sorting

[Figure: example binary max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1.]
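
The same idea with Python's standard-library heapq (a min-heap, so scores are negated); illustrative only:

```python
import heapq

def top_k(scores, K):
    """scores: doc_id -> cosine. Heapify is O(J); each winner pops in O(log J)."""
    heap = [(-s, doc) for doc, s in scores.items()]
    heapq.heapify(heap)
    winners = []
    for _ in range(min(K, len(heap))):
        neg, doc = heapq.heappop(heap)
        winners.append((doc, -neg))
    return winners
```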

Page 61: Information Retrieval

Bottlenecks
• Primary computational bottleneck in scoring: cosine computation
• Can we avoid all this computation?
• Yes, but we may sometimes get it wrong
  • a doc not in the top K may creep into the list of K output docs
• Is this such a bad thing?

Page 62: Information Retrieval

Cosine similarity is only a proxy
• User has a task and a query formulation
• Cosine matches docs to query
• Thus cosine is anyway a proxy for user happiness
• If we get a list of K docs "close" to the top K by cosine measure, that should be OK

Page 63: Information Retrieval

Generic approach
• Find a set A of contenders, with K < |A| << N
  • A does not necessarily contain the top K, but has many docs from among the top K
• Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other (non-cosine) scoring functions
• Will look at several schemes following this approach

Page 64: Information Retrieval

Index elimination
• The basic algorithm of Fig 7.1 only considers docs containing at least one query term
• Take this further:
  • only consider high-idf query terms
  • only consider docs containing many query terms

Page 65: Information Retrieval

High-idf query terms only
• For a query such as "catcher in the rye"
• Only accumulate scores from catcher and rye
• Intuition: "in" and "the" contribute little to the scores and don't alter rank-ordering much
• Benefit:
  • postings of low-idf terms have many docs, and these (many) docs get eliminated from A

Page 66: Information Retrieval

Docs containing many query terms
• Any doc with at least one query term is a candidate for the top K output list
• For multi-term queries, only compute scores for docs containing several of the query terms
  • say, at least 3 out of 4
  • imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
• Easy to implement in postings traversal

Page 67: Information Retrieval

3 of 4 query terms

Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32

Scores are only computed for docs 8, 16 and 32.

Page 68: Information Retrieval

Champion lists
• Precompute, for each dictionary term t, the r docs of highest weight in t's postings
  • call this the champion list for t (aka fancy list or top docs for t)
• Note that r has to be chosen at index time
• At query time, only compute scores for docs in the champion list of some query term
  • pick the K top-scoring docs from amongst these
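
A sketch of champion lists; the postings layout is an assumption:

```python
import heapq

def build_champion_lists(postings, r):
    """postings: term -> list of (doc_id, weight). Keep only the r
    highest-weight postings per term (r is fixed at index time)."""
    return {t: heapq.nlargest(r, pl, key=lambda e: e[1])
            for t, pl in postings.items()}

def candidate_docs(query_terms, champions):
    # At query time, score only the union of the query terms' champion lists
    return {doc for t in query_terms for doc, _ in champions.get(t, [])}
```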

Page 69: Information Retrieval

Static quality scores
• We want top-ranking documents to be both relevant and authoritative
• Relevance is being modeled by cosine scores
• Authority is typically a query-independent property of a document
• Examples of authority signals:
  • Wikipedia among websites
  • articles in certain newspapers
  • a paper with many citations
  • many diggs, Y!buzzes or del.icio.us marks
  • PageRank

Page 70: Information Retrieval

Modeling authority
• Assign to each document d a query-independent quality score in [0,1]
  • denote this by g(d)
• Thus, a quantity like the number of citations is scaled into [0,1]
• Exercise: suggest a formula for this.

Page 71: Information Retrieval

Net score
• Consider a simple total score combining cosine relevance and authority:
  net-score(q,d) = g(d) + cosine(q,d)
• Can use some other linear combination than an equal weighting
• Indeed, any function of the two "signals" of user happiness (more later)
• Now we seek the top K docs by net score

Page 72: Information Retrieval

Top K by net score – fast methods
• First idea: order all postings by g(d)
• Key: this is a common ordering for all postings
• Thus, we can concurrently traverse query terms' postings for:
  • postings intersection
  • cosine score computation
• Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

Page 73: Information Retrieval

Why order postings by g(d)?
• Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
• In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
  • short of computing scores for all docs in postings

Page 74: Information Retrieval

Champion lists in g(d)-ordering
• Can combine champion lists with g(d)-ordering
• Maintain for each term a champion list of the r docs with highest g(d) + tf-idft,d
• Seek top-K results from only the docs in these champion lists

Page 75: Information Retrieval

High and low lists
• For each term, we maintain two postings lists called high and low
  • think of high as the champion list
• When traversing postings on a query, only traverse the high lists first
  • if we get more than K docs, select the top K and stop
  • else proceed to get docs from the low lists
• Can be used even for simple cosine scores, without global quality g(d)
• A means for segmenting the index into two tiers

Page 76: Information Retrieval

Impact-ordered postings
• We only want to compute scores for docs whose wft,d is high enough
• We sort each postings list by wft,d
• Now: not all postings are in a common order!
• How do we compute scores in order to pick off the top K?
  • two ideas follow

Page 77: Information Retrieval

1. Early termination
• When traversing t's postings, stop early after either
  • a fixed number of r docs, or
  • wft,d drops below some threshold
• Take the union of the resulting sets of docs
  • one from the postings of each query term
• Compute scores only for docs in this union

Page 78: Information Retrieval

2. idf-ordered terms
• When considering the postings of query terms
• Look at them in order of decreasing idf
  • high-idf terms are likely to contribute most to the score
• As we update the score contribution from each query term
  • stop if doc scores are relatively unchanged
• Can apply to cosine or some other net scores

Page 79: Information Retrieval

Cluster pruning: preprocessing
• Pick √N docs at random: call these leaders
• For every other doc, pre-compute its nearest leader
  • docs attached to a leader are its followers
  • likely: each leader has ~√N followers

Page 80: Information Retrieval

Cluster pruning: query processing
• Process a query as follows:
  • given query Q, find its nearest leader L
  • seek the K nearest docs from among L's followers
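
A sketch of both phases under this scheme (√N random leaders; sim is any similarity, e.g. the cosine from earlier):

```python
import math, random

def preprocess(docs, sim):
    """docs: doc_id -> vector. Pick sqrt(N) random leaders and attach every
    doc to its nearest leader."""
    leaders = random.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
    followers = {L: [] for L in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda L: sim(docs[d], docs[L]))
        followers[nearest].append(d)
    return leaders, followers

def answer(q, docs, leaders, followers, sim, K):
    L = max(leaders, key=lambda lead: sim(q, docs[lead]))
    return sorted(followers[L], key=lambda d: sim(q, docs[d]), reverse=True)[:K]
```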

Page 81: Information Retrieval

Visualization

[Figure: leaders, their followers, and a query point in the vector space; the query is routed to its nearest leader and matched against that leader's followers.]

Page 82: Information Retrieval

Why use random sampling?
• Fast
• Leaders reflect the data distribution

Page 83: Information Retrieval

General variants
• Have each follower attached to b1 = 3 (say) nearest leaders
• From the query, find b2 = 4 (say) nearest leaders and their followers
• Can recur on the leader/follower construction

Page 84: Information Retrieval


Putting it all together