Improvements and extras
Paul Thomas, CSIRO
Overview of the lectures
1.Introduction to information retrieval (IR)
2.Ranked retrieval
3.Probabilistic retrieval
4.Evaluating IR systems
5.Improvements and extras
Problems matching terms
“It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents”
(Blair and Maron 1985)
Query refinement
It's hard to get queries right, especially if you don't know:
What you're searching for; or
What you're searching in

We can refine a query:
Manually
Automatically
Automatic refinement
Relevance feedback
Assume that relevant documents have something in common.
Then if we have some documents we know are relevant, we can find more like those.
1.Return the documents we think are relevant;
2.User provides feedback on one or more;
3.Return a new set, taking that feedback into account.
An example
In vector space
A query can be represented as a vector; so can all documents, relevant or not.
We want to adjust the query vector so it's:
Closer to the centroid of the relevant documents
And away from the centroid of the non-relevant documents
Moving a query vector
Rocchio's algorithm
q′ = α·q + β·(1/|Dr|)·Σ_{d ∈ Dr} d − γ·(1/|Dn|)·Σ_{d ∈ Dn} d
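The update can be sketched directly in code. This is a minimal illustration over plain term-weight vectors; the function name and the default weights (α=1.0, β=0.75, γ=0.15, common textbook choices) are illustrative assumptions, not values from the slide:

```python
# A minimal sketch of the Rocchio update over plain term-weight vectors.
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of the relevant
    documents (rel) and away from the centroid of the non-relevant
    documents (nonrel)."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q))]
    cr, cn = centroid(rel), centroid(nonrel)
    # q' = alpha*q + beta*centroid(Dr) - gamma*centroid(Dn)
    return [alpha * qi + beta * cr[i] - gamma * cn[i]
            for i, qi in enumerate(q)]
```

For a two-term vocabulary, `rocchio([1.0, 0.0], rel=[[0.0, 1.0]], nonrel=[[1.0, 0.0]])` moves weight from the first term toward the second, as the relevant centroid pulls and the non-relevant centroid pushes.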
In probabilistic retrieval
With real relevance judgements, we can make better estimates of the probability P(rel|q,d).

p_i ≈ (w + 0.5) / (w + y + 1)

Or, to get smoother estimates:

p_i′ ≈ (w + κ·p_i) / (w + y + κ)
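As a sketch, assuming w counts judged-relevant documents that contain term i and y those that do not (one plausible reading of the slide's symbols), the two estimators are:

```python
def p_hat(w, y):
    """Point estimate of p_i from relevance judgements: w relevant
    docs contain the term, y relevant docs do not (assumed reading)."""
    return (w + 0.5) / (w + y + 1)

def p_smooth(w, y, p_prior, kappa):
    """Smoother estimate: shrink the observed proportion toward a
    prior estimate p_prior; kappa controls the prior's strength."""
    return (w + kappa * p_prior) / (w + y + kappa)
```

With very large κ the smoothed estimate stays near the prior; with κ = 0 it reduces to the raw proportion w/(w+y).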
In Lucene

score(d,q) = coord(d,q) × queryNorm(q) × Σ_{t ∈ q} [ √tf_{t,d} × (1 + log(N / (df_t + 1))) × boost(t) × 1/√‖d‖ ]

Query.setBoost(float b)
term^boost
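The per-term sum can be sketched as follows. This is an illustration of the formula on the slide only, not Lucene itself; it treats coord() and queryNorm() as 1 (they only rescale scores), and the function name and argument layout are assumptions:

```python
import math

def lucene_style_score(query_terms, doc_tf, doc_len, N, df, boost=None):
    """Sketch of the slide's classic TF-IDF scoring sum.
    doc_tf: term -> frequency in the document; df: term -> document
    frequency across the N-document collection; boost: optional
    per-term boosts (term^boost in query syntax)."""
    boost = boost or {}
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue                                # term absent: no contribution
        idf = 1 + math.log(N / (df[t] + 1))         # 1 + log(N / (df_t + 1))
        score += (math.sqrt(tf) * idf * boost.get(t, 1.0)
                  / math.sqrt(doc_len))             # length normalisation
    return score
```

Doubling a term's boost doubles that term's contribution, which is exactly what `term^2` or `Query.setBoost(2.0f)` expresses in Lucene.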
Pseudo-relevance feedback
We can assume the top k ranked documents are relevant.
Less accurate (probably), but less effort (definitely).
Or an in-between option: use implicit relevance feedback.
For example, use clicks to refine future ranking.
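One simple way to act on the top-k assumption is query expansion: treat the top-k documents as relevant and append their most frequent new terms to the query. This sketch (names and parameters are illustrative, not from the slide) shows the idea:

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, k=3, n_terms=2):
    """Pseudo-relevance feedback sketch: assume the top-k ranked
    documents (each a list of terms) are relevant, and add their
    most frequent terms not already in the query."""
    counts = Counter()
    for doc in ranked_docs[:k]:
        counts.update(doc)
    seen = set(query_terms)
    new = [t for t, _ in counts.most_common() if t not in seen]
    return list(query_terms) + new[:n_terms]
```

No user effort is needed, which is the appeal; the risk is drift when the top-k documents are not actually relevant.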
When does it work?
Have to have a good enough first query.
Have to have relevant documents which are similar to each other.
Users have to be willing to provide feedback.
Web search
Why is the web different?
Scale
Authorship
Document types
Markup
Link structure
The web graph
[Diagram: a small web graph of pages (Paul's home page, CSIRO, ANU, Research School, Collaborative projects, Past projects) joined by links; the anchor text reads "…I work at the CSIRO as a researcher in information retrieval…"]
Making use of link structure
Text in (or near) the link: treat this as part of the target document
Indegree
Graph-theoretic measures: centrality, betweenness, …
PageRank
PageRank
[Diagram: the same web graph (Paul's home page, CSIRO, ANU, Research School, Collaborative projects, Past projects), used to illustrate PageRank]
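PageRank itself can be computed by power iteration over the link graph. A minimal sketch, assuming the common damping factor d = 0.85 and a fixed number of iterations (both illustrative choices, not from the slides):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration sketch of PageRank.
    links: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # random-jump share
        for p, outs in links.items():
            if outs:
                share = pr[p] / len(outs)      # split rank over out-links
                for q in outs:
                    new[q] += d * share
            else:                              # dangling page: spread evenly
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr
```

A page's rank grows when many pages (especially high-rank pages) link to it, which is why indegree alone is a cruder signal.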
Incorporating PageRank
PageRank is query-independent evidence: it is the same for any query.
Can simply combine this with query-dependent evidence such as probability of relevance, cosine distance, term counts, …
score(d,q) = α PageRank(d) + (1-α) similarity(d,q)
Other forms of web evidence
Trust in the host (or domain, or domain owner, or network block)
Reports of spam or malware
Frequency of updates
Related queries which lead to the same place
URLs
Page length
Language
…
Machine learning for IR
Machine learning is a set of techniques for discovering rules from data.
In information retrieval, we use machine learning for:
Choosing parameters
Classifying text
Ranking documents
Classifiers
Naive Bayes:
Find category c such that P(c|d) is maximised

P(c|d) ∝ ∏_{t_i ∈ d} P(t_i|c)

Support vector machines (SVM):
Find a separating hyperplane
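A Naive Bayes text classifier is short enough to sketch in full. This illustration also includes the class prior P(c), which the proportionality on the slide folds in, and uses add-one smoothing and log probabilities (standard practical choices, assumed here rather than taken from the slide):

```python
import math
from collections import Counter

def naive_bayes(train, doc):
    """Pick argmax_c P(c) * prod_{t in doc} P(t|c).
    train: list of (category, terms); doc: list of terms.
    Uses add-one smoothing and log space to avoid underflow."""
    cats = set(c for c, _ in train)
    vocab = set(t for _, ts in train for t in ts)
    best, best_lp = None, float("-inf")
    for c in cats:
        docs_c = [ts for cc, ts in train if cc == c]
        counts = Counter(t for ts in docs_c for t in ts)
        total = sum(counts.values())
        lp = math.log(len(docs_c) / len(train))          # prior P(c)
        for t in doc:
            lp += math.log((counts[t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Smoothing matters: without it, a single unseen term would make P(c|d) zero for that category.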
Learning parameters
[Plot: documents in a two-feature space, feature α (e.g. PageRank) against feature β (e.g. cosine), with a learned decision boundary score(α,β) = θ]
Ranking
Ranking SVM
Instead of classifying one document into {relevant, not relevant}:
Classify a pair of documents into {first better, second better}
RankNet, LambdaNet, LambdaMART, …
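The pairwise trick above is a data transformation: each pair of documents with different relevance becomes one training example whose features are the difference of the two feature vectors. A sketch (names are illustrative):

```python
def pairwise_examples(docs):
    """docs: list of (feature_vector, relevance_grade).
    Emit (f_i - f_j, label) training pairs, label +1 if the first
    document is more relevant, -1 otherwise; equal grades are skipped."""
    out = []
    for i, (fi, ri) in enumerate(docs):
        for fj, rj in docs[i + 1:]:
            if ri == rj:
                continue
            diff = [a - b for a, b in zip(fi, fj)]
            out.append((diff, 1 if ri > rj else -1))
    return out
```

Any binary classifier (e.g. a linear SVM) trained on these difference vectors then induces a ranking: score each document with the learned weights and sort.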
What we covered today
It's hard to write a good query: query rewriting
Manual
Automatic: spelling correction, thesauri, relevance feedback, pseudo-relevance feedback
Web retrieval
Has to cope with large scale, antagonistic authors
But can make use of new features, e.g. the web graph
Machine learning
Makes it possible to “learn” how to classify or rank at scale, with lots of features
Recap lecture 1
Retrieval system = indexer + query processor
Indexer (normally) writes an inverted file
Query processor uses the index
Recap lecture 2
Ranking search results: why it's important
Term frequency and “bag of words”
tf.idf
Cosine similarity and the vector space model
Recap lecture 3
Probabilistic retrieval uses probability theory to deal with the uncertainty of relevance
Ranking by P(rel | d) is optimal (under some assumptions)
We can turn this into a sum of term weights and use an index and accumulators
Very popular, very influential, and still in vogue
Recap lecture 4
Why should we evaluate? Efficiency and effectiveness
Some ways to evaluate: observation, lab studies, log files, test collections
Effectiveness measures
Now…
There's a lab starting a bit after 11, in the Computer Science building (N114):
Getting started with Lucene
Working with trec_eval