![Page 1: Information Retrieval Lecture 2 Introduction to Information Retrieval (Manning et al. 2007) Chapter 6 & 7 For the MSc Computer Science Programme Dell Zhang](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ec85503460f94bd50c7/html5/thumbnails/1.jpg)
Information Retrieval
Lecture 2
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 6 & 7
For the MSc Computer Science Programme
Dell Zhang
Birkbeck, University of London
Boolean Search
- Strength
  - Docs either match or not.
  - Good for expert users with precise understanding of their needs and the corpus.
- Weakness
  - Not good for (the majority of) users with poor Boolean formulation of their needs.
  - Applications may consume 1000’s of results, but most users don’t want to wade through 1000’s of results (cf. use of Web search engines).
Beyond Boolean Search
- Solution: Ranking
  - We wish to return, in order, the documents most likely to be useful to the searcher.
  - How can we rank/order the docs in the corpus with respect to a query?
  - Assign a score, say in [0,1], to each doc for each query.
Document Scoring
- Idea: More is Better
  - If a document talks about a topic more, then it is a better match.
  - That is to say, a document is more relevant if it has more relevant terms.
  - This leads to the problem of term weighting.
Bag-Of-Words (BOW) Model
- Term-Document Count Matrix
  - Each document corresponds to a vector in ℕ^v, i.e., a column below.

| | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 157 | 73 | 0 | 0 | 0 | 0 |
| Brutus | 4 | 157 | 0 | 1 | 0 | 0 |
| Caesar | 232 | 227 | 0 | 2 | 1 | 1 |
| Calpurnia | 0 | 10 | 0 | 0 | 0 | 0 |
| Cleopatra | 57 | 0 | 0 | 0 | 0 | 0 |
| mercy | 2 | 0 | 3 | 5 | 5 | 1 |
| worser | 2 | 0 | 1 | 1 | 1 | 0 |

The matrix element A(i,j) is the number of occurrences of the i-th term in the j-th doc.
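As a rough illustration of how such a count matrix could be built, the sketch below tokenizes by lowercasing and splitting on whitespace. That tokenizer is an assumption (the slides do not specify one), and the function name `count_matrix` and the toy documents are hypothetical.

```python
from collections import Counter

def count_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in doc j."""
    # One bag of words per document (naive whitespace tokenizer, an assumption)
    counts = [Counter(text.lower().split()) for text in docs]
    # Rows are the sorted vocabulary; columns follow the order of docs
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Toy documents, made up for illustration
docs = ["caesar praised brutus", "brutus killed caesar caesar fell"]
terms, A = count_matrix(docs)
for term, row in zip(terms, A):
    print(term, row)
```

Each column of `A` is the count vector of one document, matching the layout of the table above.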
Bag-Of-Words (BOW) Model
- Simplification
  - In the BOW model, the doc “John is quicker than Mary.” is indistinguishable from the doc “Mary is quicker than John.”
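A minimal sketch of this point, assuming a naive tokenizer (lowercase, drop the trailing full stop, split on whitespace); the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    # Word order is discarded: only the multiset of tokens is kept
    return Counter(text.lower().rstrip(".").split())

a = bag_of_words("John is quicker than Mary.")
b = bag_of_words("Mary is quicker than John.")
print(a == b)  # the two bags are identical
```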
Term Frequency (TF)
- Digression: Terminology
  - WARNING: In a lot of IR literature, “frequency” is used to mean “count”.
  - Thus term frequency in IR literature means the number of occurrences of a term in a document, not divided by document length (which would actually make it a frequency).
  - We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.
Term Frequency (TF)
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc,
  - 1 vs. 2 occurrences,
  - 2 vs. 3 occurrences, ...?
- Can just use raw tf.
- While it seems that more is better, a lot isn’t proportionally better than a few. So another option commonly used in practice:

$$wf_{t,d} = \begin{cases} 0 & \text{if } tf_{t,d} = 0 \\ 1 + \log tf_{t,d} & \text{otherwise} \end{cases}$$
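The log-scaled weight can be sketched as follows. The slides do not fix the base of the logarithm, so the natural log used here is an assumption; the function name `wf` simply mirrors the symbol on the slide.

```python
import math

def wf(tf):
    """Log-scaled term frequency: 0 if tf == 0, else 1 + log(tf)."""
    # Natural log is an assumption; another base only rescales the weights
    return 1 + math.log(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(wf(tf), 2))
```

The weight grows with the count, but far more slowly than the raw count does, which is the "a lot isn't proportionally better than a few" behaviour.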
Term Frequency (TF)
- The score of a document d for a query q:

$$\sum_{t \in q} tf_{t,d}$$

- 0 if no query terms in document
- wf can be used instead of tf in the above
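A small sketch of this scoring rule, assuming whitespace tokenization and treating the query as a set of distinct terms; `tf_score` and the sample texts are hypothetical.

```python
from collections import Counter

def tf_score(query, doc):
    """Sum of raw term counts tf_{t,d} over the distinct query terms t."""
    tf = Counter(doc.lower().split())
    # Counter returns 0 for absent terms, so a doc with no query terms scores 0
    return sum(tf[t] for t in set(query.lower().split()))

print(tf_score("ides of march", "beware the ides of march"))
print(tf_score("ides of march", "full fathom five thy father lies"))
```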
Term Frequency (TF)
- Is TF good enough for weighting?
- Ignorance of document length
  - Long docs are favored because they’re more likely to contain query terms.
  - This can be fixed to some extent by normalizing for document length. [discussed later]
Term Frequency (TF)
- Is TF good enough for weighting?
- Ignorance of term rarity in corpus
  - Consider the query “ides of march”.
    - Julius Caesar has 5 occurrences of “ides”, while no other play has “ides”.
    - “march” occurs in over a dozen plays.
    - All the plays contain “of”.
  - By this weighting scheme, the top-scoring play is likely to be the one with the most occurrences of “of”.
Document/Collection Frequency
- Which of these tells you more about a doc?
  - 5 occurrences of “of”?
  - 5 occurrences of “march”?
  - 5 occurrences of “ides”?
- We’d like to attenuate the weight of a common term. But what is “common”?
  - Collection Frequency (CF): the number of occurrences of the term in the corpus
  - Document Frequency (DF): the number of docs in the corpus containing the term
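The difference between the two statistics can be made concrete with a short sketch. The whitespace tokenizer, the function name `cf_and_df`, and the three toy documents are assumptions for illustration only.

```python
from collections import Counter

def cf_and_df(docs):
    """Return (cf, df): total occurrences per term, and number of docs containing it."""
    cf, df = Counter(), Counter()
    for text in docs:
        tokens = text.lower().split()
        cf.update(tokens)        # every occurrence counts towards CF
        df.update(set(tokens))   # each doc counts at most once towards DF
    return cf, df

# Toy corpus, made up for illustration
docs = ["of of march", "of ides of march", "of"]
cf, df = cf_and_df(docs)
for term in ("of", "march", "ides"):
    print(term, cf[term], df[term])
```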
Document/Collection Frequency
- DF may be better than CF

| Word | CF | DF |
|---|---|---|
| try | 10422 | 8760 |
| insurance | 10440 | 3997 |

- So how do we make use of DF?
Inverse Document Frequency (IDF)
- Could just be the reciprocal of DF (idf_i = 1/df_i).
- But by far the most commonly used version is:

$$idf_i = \log\left(\frac{n}{df_i}\right)$$
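A minimal sketch of the log-based IDF, applied to the DF values quoted earlier for “try” and “insurance”. The slides give neither the corpus size nor the log base, so `N_DOCS = 1_000_000` and base 10 are both assumptions chosen only for illustration.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency log(n / df); base 10 is an assumption."""
    return math.log10(n_docs / df)

N_DOCS = 1_000_000  # hypothetical corpus size, not given in the slides
for word, df in [("try", 8760), ("insurance", 3997)]:
    print(word, round(idf(N_DOCS, df), 3))
```

Whatever the corpus size, the term found in fewer documents ("insurance") receives the larger IDF, even though the two words have almost the same collection frequency.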
Inverse Document Frequency (IDF)
Prof Karen Spärck Jones
1935-2007
TFxIDF
- The TFxIDF weighting scheme combines:
  - Term Frequency (TF): a measure of term density in a doc
  - Inverse Document Frequency (IDF): a measure of the informativeness of a term, i.e., its rarity across the whole corpus
TFxIDF
- Each term i in each document d is assigned a TFxIDF weight

$$w_{i,d} = tf_{i,d} \times \log(n / df_i)$$

- where
  - tf_{i,d} = frequency of term i in document d
  - n = total number of documents
  - df_i = the number of documents that contain term i
- Increases with the number of occurrences within a doc.
- Increases with the rarity of the term across the whole corpus.
- What is the weight of a term that occurs in all of the docs?
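The weight formula can be sketched directly. Base-10 log and the function name `tfidf` are assumptions; the numbers in the usage lines are made up.

```python
import math

def tfidf(tf, df, n_docs):
    """TFxIDF weight tf * log(n / df), with 0 for an absent term."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(n_docs / df)  # base 10 is an assumption

print(tfidf(tf=3, df=1, n_docs=100))    # rare term, several occurrences
print(tfidf(tf=50, df=100, n_docs=100)) # term present in every doc
```

The second call shows what the formula implies for a term that occurs in every document: n / df is 1, so the log (and hence the weight) is 0 regardless of how often the term occurs.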
Term-Document Matrix (Real-Valued)
| | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 13.1 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| Brutus | 3.0 | 8.3 | 0.0 | 1.0 | 0.0 | 0.0 |
| Caesar | 2.3 | 2.3 | 0.0 | 0.5 | 0.3 | 0.3 |
| Calpurnia | 0.0 | 11.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| Cleopatra | 17.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mercy | 0.5 | 0.0 | 0.7 | 0.9 | 0.9 | 0.3 |
| worser | 1.2 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0 |

The matrix element A(i,j) is the log-scaled TFxIDF weight.
Note: weights can be > 1.
Vector Space Model
- Docs as Vectors
  - Each doc j can now be viewed as a vector of TFxIDF values, one component for each term.
- So we have a vector space
  - Terms are axes
  - Docs live in this space
  - May have 20,000+ dimensions, even with stemming
Vector Space Model
Prof Gerard Salton
1927-1995
The SMART information retrieval system
Vector Space Model
- First application: Query-By-Example (QBE)
  - Given a doc d, find others “like” it.
  - Now that d is a vector, find vectors (docs) “near” it.
- Postulate: Documents that are “close together” in the vector space talk about the same things.

[Figure: document vectors d1 to d5 drawn against term axes t1, t2, t3, with angles θ and φ marked.]
Vector Space Model
- Queries as Vectors
  - Regard a query as a (very short) document.
  - Return the docs ranked by the closeness of their vectors to the query, also represented as a vector.
Desiderata for Proximity
If d1 is near d2, then d2 is near d1.
If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
No doc is closer to d than d itself.
Euclidean Distance
- Distance between d_j and d_k is

$$dist(d_j, d_k) = |d_j - d_k| = \sqrt{\sum_{i=1}^{n} (d_{i,j} - d_{i,k})^2}$$

- Why is this not a great idea?
  - We still haven’t dealt with the issue of length normalization: long documents would be more similar to each other by virtue of length, not topic.
  - However, we can implicitly normalize by looking at angles instead.
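To see the length problem concretely, the sketch below compares a toy document vector with a copy of itself whose counts are all doubled (modelling the document appended to itself) and with an unrelated vector of similar length. All vectors and the function name `euclidean` are made up for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d = [3.0, 1.0, 0.0]                 # toy term weights
d_doubled = [2 * x for x in d]      # same content, twice as long
other = [2.0, 0.0, 2.0]             # different term mix, similar length

print(euclidean(d, d_doubled))      # large, although the topic is identical
print(euclidean(d, other))          # smaller, although the topic differs
```

The doubled document points in exactly the same direction as the original, yet Euclidean distance ranks it as farther away than the unrelated vector; the angle between vectors does not have this problem.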
Cosine Similarity
- Vector Normalization
  - A vector can be normalized (given a length of 1) by dividing each of its components by its length. Here we use the L2 norm:

$$|d_j| = \sqrt{\sum_{i=1}^{m} w_{i,j}^2}$$

  - This maps vectors onto the unit sphere.
  - Then longer documents don’t get more weight.
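A minimal sketch of L2 normalization; the function name `l2_normalize` and the sample vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in vec))
    if norm == 0:
        return list(vec)
    return [w / norm for w in vec]

short_doc = [3.0, 4.0]
long_doc = [30.0, 40.0]   # same proportions, ten times the counts
print(l2_normalize(short_doc))
print(l2_normalize(long_doc))
```

Both vectors map to the same point on the unit circle, so the longer document no longer gets extra weight.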
Cosine Similarity
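- Cosine of angle between two vectors

$$sim(d_j, d_k) = \cos(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{m} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{m} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{m} w_{i,k}^2}}$$

- The denominator involves the lengths of the vectors. This means normalization.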
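The formula translates almost line by line into code. This is a plain-Python sketch over dense lists; the function name `cosine`, the zero-vector convention, and the sample vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors (an assumption)
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))            # no terms in common
```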
Cosine Similarity
- The similarity between d_j and d_k is captured by the cosine of the angle between their vectors.
- No triangle inequality for similarity.

[Figure: vectors d1 and d2 drawn against term axes t1, t2, t3, with the angle θ between them.]
Cosine Similarity - Exercise
- Rank the following by decreasing cosine similarity:
  - Two docs that have only frequent words (the, a, an, of) in common.
  - Two docs that have no words in common.
  - Two docs that have many rare words in common (wingspan, tailfin).
Cosine Similarity - Exercise
Show that, for normalized vectors, the Euclidean distance measure gives the same proximity ordering as the cosine similarity measure.
Cosine Similarity - Example
- Docs
  - Austen's Sense and Sensibility (SaS)
  - Austen's Pride and Prejudice (PaP)
  - Bronte's Wuthering Heights (WH)

Term counts:

| | SaS | PaP | WH |
|---|---|---|---|
| affection | 115 | 58 | 20 |
| jealous | 10 | 7 | 11 |
| gossip | 2 | 0 | 6 |

Length-normalized weights:

| | SaS | PaP | WH |
|---|---|---|---|
| affection | 0.996 | 0.993 | 0.847 |
| jealous | 0.087 | 0.120 | 0.466 |
| gossip | 0.017 | 0.000 | 0.254 |

- cos(SaS, PaP) = 0.996 × 0.993 + 0.087 × 0.120 + 0.017 × 0.000 = 0.999
- cos(SaS, WH) = 0.996 × 0.847 + 0.087 × 0.466 + 0.017 × 0.254 = 0.889
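The example can be reproduced from the raw counts. The sketch below assumes the second table is obtained by L2-normalizing the raw counts (which matches the figures shown); the helper names are hypothetical.

```python
import math

# Raw counts for (affection, jealous, gossip), taken from the example
counts = {
    "SaS": [115, 10, 2],
    "PaP": [58, 7, 0],
    "WH": [20, 11, 6],
}

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# After normalization the cosine is just the dot product
unit = {name: l2_normalize(vec) for name, vec in counts.items()}
cos_sas_pap = dot(unit["SaS"], unit["PaP"])
cos_sas_wh = dot(unit["SaS"], unit["WH"])
print(round(cos_sas_pap, 3), round(cos_sas_wh, 3))  # 0.999 0.889
```

The two Austen novels come out closer to each other than either is to Wuthering Heights, as in the example.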
Vector Space Model - Summary
- What’s the real point of using vector space?
  - Every query can be viewed as a (very short) doc.
  - Every query becomes a vector in the same space as the docs.
  - Can measure each doc’s proximity to the query.
  - It provides a natural measure of scores/ranking, no longer Boolean.
- Docs (and queries) are expressed as bags of words.
Vector Space Model - Exercise
- How would you augment the inverted index built in previous lectures to support cosine ranking computations?
- Walk through the steps of serving a query using the Vector Space Model.
Efficient Cosine Ranking
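- Computing a single cosine
  - For every term t and each doc d containing it, add tf_{t,d} to the postings lists.
    - There are tradeoffs on whether to store the term count, the term weight, or the weight scaled by IDF.
  - At query time, accumulate the component-wise sum, where w_{i,q} and w_{i,d} are the length-normalized weights of term i in q and d:

$$sim(q, d) = \cos(q, d) = \sum_{i=1}^{m} w_{i,q}\, w_{i,d}$$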
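One way to picture the query-time accumulation is a term-at-a-time loop over the postings lists. This is a sketch under stated assumptions, not the lecture's implementation: postings are assumed to store unnormalized term weights, document lengths are kept in a separate table, and the query length is ignored because it is the same for every doc and so does not change the ranking. The names `cosine_scores`, `postings`, and `doc_norms`, and all the numbers, are hypothetical.

```python
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_norms):
    """Accumulate w_q * w_d per doc over the query terms, then divide by doc length."""
    scores = defaultdict(float)
    for term, w_q in query_weights.items():
        # Only docs that appear in this term's postings list are touched
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] += w_q * w_d
    return {doc_id: s / doc_norms[doc_id] for doc_id, s in scores.items()}

# Toy index: term -> [(doc_id, weight), ...]
postings = {
    "ides": [(1, 2.0)],
    "march": [(1, 1.0), (2, 3.0)],
    "of": [(1, 2.0), (2, 4.0)],
}
doc_norms = {1: 3.0, 2: 5.0}  # L2 lengths of the two toy doc vectors
scores = cosine_scores({"ides": 1.0, "march": 1.0}, postings, doc_norms)
print(scores)
```

Docs that share no term with the query never receive an accumulator, which is what makes the inverted index useful here.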
Efficient Cosine Ranking
- Computing the k largest cosines
- Search as a kNN problem
  - Find the k docs “nearest” to the query (with largest query-doc cosines) in the vector space.
  - Do not need to totally order all docs in the corpus.
- Use a heap for selecting the top k docs
  - Binary tree in which each node’s value > values of children
  - Takes 2n operations to construct, then each of the k “winners” is read off in on the order of log n steps.
  - For n=1M, k=100, this is about 10% of the cost of complete sorting.
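The heap-based selection can be sketched with Python's standard `heapq` module. Because `heapq` implements a min-heap, scores are negated so that the largest cosine is popped first; the function name `top_k` and the sample scores are hypothetical.

```python
import heapq

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the largest scores, best first."""
    # Build the heap in linear time, then pop only k winners
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)  # each pop costs O(log n)
        winners.append((doc_id, -neg_score))
    return winners

scores = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}
print(top_k(scores, 2))  # [(2, 0.9), (4, 0.7)]
```

Only k pops are performed after construction, so the remaining docs are never fully ordered. `heapq.nlargest` offers a ready-made alternative with the same result.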
Efficient Cosine Ranking
- Heuristics
  - Avoid computing cosines from the query to each of the n docs, but may occasionally get an answer wrong.
  - For example, cluster pruning.
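The slide only names cluster pruning, so the following is a sketch of one common formulation rather than the lecture's own procedure: pick about √n docs at random as leaders, attach every doc to its most similar leader, and at query time score only the followers of the leader closest to the query. Vectors are assumed to be length-normalized so that the dot product equals the cosine; all function names and data are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_clusters(doc_vecs, seed=0):
    """doc_vecs: {doc_id: unit vector}. Returns {leader_id: [follower ids]}."""
    rng = random.Random(seed)
    ids = sorted(doc_vecs)
    n_leaders = max(1, round(math.sqrt(len(ids))))
    leaders = rng.sample(ids, n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id in ids:
        # Attach each doc (leaders included) to its most similar leader
        nearest = max(leaders, key=lambda l: dot(doc_vecs[doc_id], doc_vecs[l]))
        clusters[nearest].append(doc_id)
    return clusters

def pruned_search(query_vec, doc_vecs, clusters, k):
    """Score only the followers of the leader nearest to the query."""
    leader = max(clusters, key=lambda l: dot(query_vec, doc_vecs[l]))
    candidates = clusters[leader]
    ranked = sorted(candidates, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

docs = {0: [1.0, 0.0], 1: [0.8, 0.6], 2: [0.0, 1.0], 3: [0.6, 0.8]}
clusters = build_clusters(docs)
print(pruned_search([1.0, 0.0], docs, clusters, k=2))
```

Because only one cluster is inspected, a true nearest neighbour that was attached to a different leader can be missed, which is the "occasionally get an answer wrong" tradeoff.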
Take Home Messages
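- TFxIDF

$$w_{i,d} = tf_{i,d} \times \log(n / df_i)$$

- Vector Space Model
  - docs and queries as vectors
  - cosine similarity
  - efficient cosine ranking

$$sim(d_j, d_k) = \cos(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{m} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{m} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{m} w_{i,k}^2}}$$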