Gertjan van Noord 2014

Zoekmachines

Lecture 4

Chapter 6 Overview

Scoring

• Parametric and zone indexing
• Weighted zone scoring (skip 6.1.2 and 6.1.3)
• Term frequency and weighting
• Vector model, queries as vectors
• Computing scores (skip 6.4)

Formulas and symbols

• Σ : sum   Π : product

• √ : (square) root: √x = y means y² = x

√1 = 1   √4 = 2   √10 = 3.16…   √100 = 10

• log : logarithm (base 10): log x = y means 10^y = x

log 1 = 0   log 4 = 0.6…   log 10 = 1   log 100 = 2
(logarithm values increase slowly)

• ∑_{i=1}^{n} a[i]   ∏_{i=1}^{n} a[i] : sum / multiply the array elements

• |x| stands for the number of elements of x
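These conventions can be checked with a short script (a sketch in Python; the slide itself contains no code):

```python
import math

a = [2, 3, 4]

total = sum(a)          # Σ_{i=1}^{n} a[i]
product = math.prod(a)  # Π_{i=1}^{n} a[i]
assert total == 9 and product == 24

# √x = y means y² = x
assert math.sqrt(100) == 10
assert abs(math.sqrt(10) - 3.16) < 0.01

# log (base 10): log x = y means 10**y = x
assert math.log10(100) == 2
assert abs(math.log10(4) - 0.6) < 0.01  # log values grow slowly

# |x|: number of elements
assert len(a) == 3
```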

Parametric indexes

To a digital document, metadata can be attached
fields: format, date, author, title, keywords, …
Dublin Core: a set of metadata fields for digital libraries

Extra parametric indexes are built for these fields
each field has its own - limited - dictionary

The user interface should accommodate querying of metadata next to text queries

The system must combine the query parts for retrieval

Zone indexes

In the (free) text of a document itself, different zones can be recognized:

• Author
• Title
• Date of publication
• Abstract
• Keywords
• Body
• …

Difference between parametric index and zone index

Zone indexes

– If documents have a common outline, zones can be separated in the index in 2 ways:

1. different index terms: william.author, william.body

2. different postings (each doc zone): william -> [1.author -> 2.author, 2.body -> …]

Again: user interface and retrieval must be adapted

Weighted zone scoring

Zones can be used by the system for a ranked Boolean retrieval (without zone queries)

With a query word in the title the doc is more relevant

Each zone gets a partial weight (sum=1)

Score of each zone of a doc for a query is Boolean score (0/1) * zone weight

The score of the doc for the query is the sum of its zone scores (0 <= docscore <= 1), so docs can be ordered

Example

• Query: pie AND cream
• Zones: 1: title, 2: abstract; weights g1 = 0.6, g2 = 0.4; s = Boolean AND score

• Document D1:

  zone         content     g     s
  1. Title     apple pie   0.6 × 0 = 0
  2. Abstract  pie cream   0.4 × 1 = 0.4

  Total score(D1, q) = 0.4

• What if one word is in the title and the other in the abstract?

score(q, d) = ∑_{i=1}^{ℓ} g_i s_i

(ℓ zones; g_i = weight of zone i, s_i = Boolean score of zone i)
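The D1 computation above can be sketched as follows (a minimal illustration in Python; the zone names, weights, and AND semantics are taken from the example):

```python
# Weighted zone scoring: doc score = sum over zones of
# zone_weight * boolean_score(zone), as in the D1 example.
def weighted_zone_score(query_terms, zones, weights):
    """zones: {zone_name: text}; weights: {zone_name: g}, with sum(g) = 1."""
    score = 0.0
    for name, text in zones.items():
        tokens = text.lower().split()
        s = 1 if all(t in tokens for t in query_terms) else 0  # Boolean AND
        score += weights[name] * s
    return score

d1 = {"title": "apple pie", "abstract": "pie cream"}
g = {"title": 0.6, "abstract": 0.4}
print(weighted_zone_score(["pie", "cream"], d1, g))  # 0.4
```

If the query were just "pie", both zones would match and the score would be 0.6 + 0.4 = 1.0, the maximum.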

Summary simple boolean model

• only store presence of a term in a document
• conjunctive query: select docs that contain all query terms
• disjunctive query: select docs that contain at least one of the query terms
• arbitrary boolean query: select docs in which the query term distribution satisfies the query, using the operators AND, OR and NOT
• binary decision: documents are selected or not, no ranking

How can we improve the system?
We need methods to rank the results containing some or all query terms.

Is there more than presence/absence?

Is each term equally important?

Vector model

• terms in a document (and a query) receive a score
• scores are combined to calculate for each document the similarity with the query
• using the similarity scores, the result set of documents containing one or more query terms can be ranked

How can we calculate a score for a term?For a term in a document? A term in general?

Term weighting

First idea: use the frequency of the query terms in the document

Documents as vectors

Documents are represented as a vector of weights for each term of the dictionary

V(Doc1) = (8, 2, 2, 10, 2, ...)
(only a few of the dictionary terms are shown below)

Using term frequencies as weights, which document scores best for the query: food of ape?

        ape  child  food  of  panther
Doc1     8     2     2    10     2
Doc2     1     5     9    20     0
Docn    ..    ..    ..    ..     ..
Query    1     0     1     1     0
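With raw term frequencies as weights, the question can be answered by summing the tf values of the query terms (a sketch; it also shows why idf is introduced next: the uninformative "of" dominates the score):

```python
# Raw term-frequency scoring: score = sum of tf values for the query terms.
doc1 = {"ape": 8, "child": 2, "food": 2, "of": 10, "panther": 2}
doc2 = {"ape": 1, "child": 5, "food": 9, "of": 20, "panther": 0}
query = ["food", "of", "ape"]

for name, d in [("Doc1", doc1), ("Doc2", doc2)]:
    print(name, sum(d[t] for t in query))
# Doc1 20, Doc2 30 -- Doc2 wins mainly because of its 20 tokens of "of"
```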


Advanced Scores for Terms in Documents

• tf_t,d : term frequency of t in d: how many tokens of term t in document d
• df_t : document frequency of t: how many documents contain term t?
• df_t / N : relative document frequency of t: docs with term t / total number of docs (N)

BUT: high document frequency => important term?

• idf_t : inverse document frequency of t: N/df_t; if df_t is higher, the inverse is lower
• tf_t,d · idf_t : score for a term in a document

Score for a query – doc combination

score(q, d) = ∑_{t∈q} tf_t,d × idf_t

q        query
d        document
t ∈ q    each term t that is an element of query q
tf_t,d   term frequency of term t in doc d
idf_t    inverse document frequency of t: log N/df_t
N        number of documents in the collection

Example

assume: idf_ape = 5   idf_food = 2   idf_of = 0.001

score("food of ape", Doc1) = ?
score("food of ape", Doc2) = ?

        ape  child  food  of  panther
Doc1     8     2     2    10     2
Doc2     1     5     9    20     0
Docn    ..    ..    ..    ..     ..
Query    1     0     1     1     0

Example

score("food of ape", Doc1) = 8×5 + 2×2 + 10×0.001 ≈ 44

score("food of ape", Doc2) = 1×5 + 9×2 + 20×0.001 ≈ 23

(the contribution of "of" is negligible because of its tiny idf)

        ape  child  food  of  panther
Doc1     8     2     2    10     2
Doc2     1     5     9    20     0
Docn    ..    ..    ..    ..     ..
Query    1     0     1     1     0

score(q, d) = ∑_{t∈q} tf_t,d × idf_t
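The same computation, sketched in Python (term frequencies and idf values copied from the example; the tiny idf of "of" is why its contribution all but vanishes):

```python
# tf-idf score: score(q, d) = sum over query terms t of tf[t, d] * idf[t]
idf = {"ape": 5, "food": 2, "of": 0.001}

doc1 = {"ape": 8, "child": 2, "food": 2, "of": 10, "panther": 2}
doc2 = {"ape": 1, "child": 5, "food": 9, "of": 20, "panther": 0}

def score(query_terms, tf):
    return sum(tf.get(t, 0) * idf[t] for t in query_terms)

q = ["food", "of", "ape"]
print(round(score(q, doc1), 2))  # 44.01
print(round(score(q, doc2), 2))  # 23.02
```

Doc1 now outranks Doc2: the many occurrences of "of" in Doc2 no longer help it, while the 8 tokens of the rare term "ape" carry Doc1.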


What about the query terms?

The query vector can also have weights; then we multiply the 2 weights for each term:

score(q, d) = ∑_{t∈q} (tf_t,q × idf_t) · (tf_t,d × idf_t)

Often the query and the document do not use the same weighting, so in general:

score(q, d) = ∑_{t∈q} w_t,q · w_t,d


• Problem: current approach does not take into account document length

• Solution: use vector term weights instead of term frequencies

• For vector length: Euclidean length

√(∑_{i=1}^{n} V_i²)

Length of a vector V with n elements = the square root of (the sum of the squares of all the elements)

Example: compute normalized frequencies for Figure 6.9

Next couple of sheets from: http://www.stanford.edu/class/cs276/

Documents as vectors

So we have a |V|-dimensional vector space

Terms are axes of the space

Documents are points or vectors in this space

Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine

These are very sparse vectors - most entries are zero.

Sec. 6.3

Queries as vectors

Key idea 1: Do the same for queries: represent them as vectors in the space

Key idea 2: Rank documents according to their proximity to the query in this space

proximity = similarity of vectors

proximity ≈ inverse of distance

Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.

Instead: rank more relevant documents higher than less relevant documents


Formalizing vector space proximity

First cut: distance between two points

(= distance between the end points of the two vectors)
Euclidean distance?

Euclidean distance is a bad idea . . .

. . . because Euclidean distance is large for vectors of different lengths.


Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.


Use angle instead of distance

Thought experiment: take a document d and append it to itself. Call this document d′.

“Semantically” d and d′ have the same content

The Euclidean distance between the two documents can be quite large

The angle between the two documents is 0, corresponding to maximal similarity.

Key idea: Rank documents according to angle with query.


Length normalization

A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:

Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere)

Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.

Long and short documents now have comparable weights

‖x⃗‖₂ = √(∑ x_i²)
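A minimal sketch of L2 length-normalization, illustrating the d vs. d′ thought experiment (vectors are plain Python lists here):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d = [3.0, 4.0]
d_doubled = [6.0, 8.0]  # d appended to itself: every term frequency doubles

print(l2_normalize(d))          # [0.6, 0.8]
print(l2_normalize(d_doubled))  # [0.6, 0.8] -- identical after normalization
```

After normalization both vectors lie on the unit hypersphere, so the long document no longer looks different from the short one.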


cosine(query,document)

cos(q⃗, d⃗) = (q⃗ · d⃗) / (|q⃗| |d⃗|) = (q⃗/|q⃗|) · (d⃗/|d⃗|) = ∑ q_i d_i / (√∑ q_i² · √∑ d_i²)

(the numerator is the dot product; q⃗/|q⃗| and d⃗/|d⃗| are unit vectors)

q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document

cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.


Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

for q, d length-normalized.


cos(q⃗, d⃗) = q⃗ · d⃗ = ∑ q_i d_i

Example: compute weights for Figure 6.13, for the query “jealous gossip”

(v(q) = (0,1,1) not normalized)

v(q) = (0,0.707,0.707) (why?)

v(SaS) = (0.996,0.087,0.017)

v(PaP) = (0.993,0.120,0)

v(WH) = (0.874,0.466,0.254)
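The cosine ranking for this example can be sketched as follows (vectors copied from the slide; since they are already length-normalized, the similarity is just the dot product; the three dimensions correspond to the terms of Figure 6.13):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Length-normalized vectors from the slide
q   = (0, 0.707, 0.707)  # query "jealous gossip", normalized: 1/sqrt(2) = 0.707
SaS = (0.996, 0.087, 0.017)
PaP = (0.993, 0.120, 0)
WH  = (0.874, 0.466, 0.254)

for name, d in [("SaS", SaS), ("PaP", PaP), ("WH", WH)]:
    print(name, round(dot(q, d), 3))
# SaS 0.074, PaP 0.085, WH 0.509 -> WH ranks highest
```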