Ranked Queries over Sources with Boolean Query Interfaces without Ranking Support
Vagelis Hristidis, Florida International University
Yuheng Hu, Arizona State University
Panos Ipeirotis, New York University

Motivation: PubMed (and USPTO, and LinkedIn, and…)
• PubMed offers ranking only by date, author, title, or journal
• Users usually prefer ranking by relevance
  – Measured by an IR ranking function such as tf.idf

Problem Definition
• Input
  – Query Q contains terms t1, …, tn
  – Database D contains documents d1, …, dm
• Output
  – Top-k documents ranked according to a relevance score function
• Example of a ranking function: tf.idf
• Baseline: submit a disjunctive query with all query keywords, retrieve all matching documents, and re-rank them locally
• Problem with the baseline method: too many results!
  – “immunodeficiency virus structure”: 1,451,446 results
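
The local re-ranking step of the baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the documents have already been downloaded and tokenized, uses a smoothed idf chosen here for convenience, and the name `tfidf_top_k` is hypothetical.

```python
import math
from collections import Counter

def tfidf_top_k(docs, query_terms, k):
    """Rank docs (lists of tokens) by the sum of tf * idf over the query terms.

    Returns up to k (doc_index, score) pairs, best first.
    The smoothed idf below is an assumption, not the slides' formula.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    # Smoothing keeps idf finite for terms that match no document.
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) for t in query_terms}
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        # OR semantics: keep any document with at least one query term.
        if any(tf[t] > 0 for t in query_terms):
            scored.append((i, sum(tf[t] * idf[t] for t in query_terms)))
    # Highest score first; ties broken by document index for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

The cost problem is visible in the loop over `docs`: every document matching the disjunctive query has to be fetched before a single score can be computed.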

Query Relaxation Approach
• A tf.idf query has OR semantics
• Using queries with AND semantics returns promising documents earlier on
• Gradual query relaxation allows fast execution
• Key questions:
  – Which (conjunctive) queries to execute?
  – When to stop?
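
To make the space of relaxations concrete, the sketch below enumerates every conjunctive query over the query terms, from most to least restrictive. The function name and the `" AND "` syntax are assumptions for illustration; the fixed enumeration order is not the selection strategy of the talk, which picks queries by estimated benefit.

```python
from itertools import combinations

def relaxation_candidates(terms):
    """Yield conjunctive queries over `terms`, most restrictive first.

    Each query is a string such as "a AND b"; the Boolean syntax is an
    assumption and would need adapting to the target interface.
    """
    for size in range(len(terms), 0, -1):
        # All subsets of the given size, in the input order of the terms.
        for subset in combinations(terms, size):
            yield " AND ".join(subset)
```

For n query terms there are 2^n - 1 candidates, so with a daily query quota it matters which of them are actually sent.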

Problem Setting and Challenges
• Boolean query interface (e.g., PubMed)
• Limited data access through a web service (quota per day)
• No useful ranking functions
• No indices to rely on
• No statistics exported from the database

Probabilistic Approach
• Document score
  – idf (easy part)
  – tf (challenging part)
• Estimate tf (and scores) probabilistically:
  – The tf of a term across the documents of a database tends to follow a Poisson distribution
  – Document scores also follow a Poisson distribution
  – λ: the tf parameter of the Poisson distribution for the term in the database
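
The quantities named on this slide can be written out as follows. This is a sketch assuming the standard additive tf.idf form; the symbols tf(t, d), idf(t), df(t) and λ_t are notational assumptions, and the exact idf variant used by the authors is not stated here.

```latex
\mathrm{Score}(d, Q) = \sum_{t \in Q} tf(t, d) \cdot idf(t),
\qquad idf(t) = \log \frac{m}{df(t)}

\Pr\{ tf(t, d) = x \} = \frac{e^{-\lambda_t} \, \lambda_t^{x}}{x!},
\qquad x = 0, 1, 2, \dots
```

Here m is the number of documents in D and df(t) is the number of documents containing t. Under this reading, only one parameter λ_t per query term has to be estimated to describe the tf part of the score.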

Probabilistic top-k with query relaxation
• Querying strategy
  – How to pick a good query candidate?
  – A good query should have good “benefit”
• Benefit: the probability that a document in the results of a relaxed query q belongs in the top-k
    Pr{Score_Q(D, q) > τ}
  – q: the query candidate
  – τ: the k-th highest score so far
  – The score follows a Poisson distribution that is a function of the λ parameters of the query terms in Q
• We choose the query candidate q with the maximum probability
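
One possible reading of the benefit Pr{Score > τ} is sketched below. It is not the authors' estimator: instead of approximating the score itself as Poisson, it convolves the per-term Poisson distributions directly, and it assumes that terms in the candidate query q appear at least once (tf ≥ 1) while the remaining terms of Q are unconstrained. The names `poisson_pmf`, `benefit`, `in_query` and `max_tf` are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """Probability that a Poisson(lam) variable equals x."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def benefit(lams, idfs, in_query, tau, max_tf=30):
    """Pr{ sum_t idf_t * tf_t > tau } with tf_t ~ Poisson(lam_t).

    in_query[t] is True when term t is part of the conjunctive candidate,
    in which case tf_t is conditioned on being >= 1 (lam_t must be > 0).
    tf values above max_tf are ignored (truncation, an assumption).
    """
    dist = {0.0: 1.0}  # maps partial score -> probability
    for lam, idf, required in zip(lams, idfs, in_query):
        start = 1 if required else 0
        # Renormalize when tf = 0 is excluded by the conjunctive query.
        norm = -math.expm1(-lam) if required else 1.0
        nxt = {}
        for score, p in dist.items():
            for x in range(start, max_tf + 1):
                s = score + idf * x
                nxt[s] = nxt.get(s, 0.0) + p * poisson_pmf(x, lam) / norm
        dist = nxt
    return sum(p for s, p in dist.items() if s > tau)
```

The enumeration grows as `max_tf` to the power of the number of terms, which is acceptable for short keyword queries but not for long ones.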

Estimation of Poisson Parameters
• Sample-based estimation: fetch documents, construct a sample, and use the estimates from the sample
  – Needs a very large sample for reliable estimates
• Query-based estimation: combine sampling and query execution
  – Every query generates a sample and provides candidate top-k documents
  – Main challenge: adjust the estimates to compensate for querying bias (we are looking for top-k documents, not performing random sampling)

Query-based Sampling
• The document sample returned for each query is not random!
• The sample is “conditional” on the query terms (which are guaranteed to appear)
  – Estimates need to account for the fact that queries are issued to find the top-k documents, not to draw a random sample
• Without correction, the estimates are significantly off
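
The slides do not spell out the correction. As one illustration of why the naive estimate is off, the sketch below assumes that the tf values of a query term in the returned documents follow a zero-truncated Poisson distribution (the term is guaranteed to appear, so tf ≥ 1) and recovers λ from the observed mean by bisection. Both function names are hypothetical, and this handles only the "guaranteed to appear" part of the bias.

```python
import math

def naive_lambda(tfs):
    """Sample mean of the observed tf values (biased upward here)."""
    return sum(tfs) / len(tfs)

def corrected_lambda(tfs, iters=100):
    """Solve lam / (1 - exp(-lam)) = mean(tfs) for lam by bisection.

    The left-hand side is the mean of a zero-truncated Poisson(lam).
    It increases with lam and always exceeds lam, so the root lies
    strictly between 0 and the observed mean.
    """
    mean = sum(tfs) / len(tfs)
    if mean <= 1.0:
        return 0.0  # all observed tf values are 1: consistent with lam -> 0
    lo, hi = 1e-12, mean
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For observed tf values with mean 1.5, the corrected λ is noticeably smaller than 1.5, because documents with tf = 0 never show up in the results of a query that requires the term.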

Top-k algorithm using query relaxation
1. Send a conjunctive query with all query terms to the database
2. Update the statistics for each term using estimates from the biased sample
3. Compute the benefit of each possible query relaxation
4. If the highest benefit (i.e., the probability of finding a top-k document) is below the threshold, stop; otherwise send the relaxation with the highest benefit and repeat from step 2
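
The four steps can be sketched as a loop. This is a skeleton under stated assumptions, not the authors' code: `search`, `score` and `benefit` are hypothetical caller-supplied hooks, the statistics update of step 2 is assumed to happen inside `benefit` (which receives everything seen so far), and each relaxation is sent at most once.

```python
from itertools import combinations

def top_k_by_relaxation(terms, k, search, score, benefit, threshold):
    """Return (top-k (doc_id, score) pairs, list of queries issued).

    search(query_terms) -> iterable of (doc_id, doc)    [hypothetical]
    score(doc) -> float                                 [hypothetical]
    benefit(query_terms, tau, seen) -> float in [0, 1]  [hypothetical]
    """
    # Proper, non-empty subsets of the terms, largest first.
    candidates = [c for size in range(len(terms) - 1, 0, -1)
                  for c in combinations(terms, size)]
    seen = {}               # doc_id -> score of every document fetched
    issued = []
    query = tuple(terms)    # step 1: start with all terms
    while query is not None:
        issued.append(query)
        for doc_id, doc in search(query):
            seen[doc_id] = score(doc)
        ranked = sorted(seen.items(), key=lambda p: (-p[1], p[0]))
        # tau is the k-th highest score so far (none yet if fewer than k).
        tau = ranked[k - 1][1] if len(ranked) >= k else float("-inf")
        # Steps 2-3: re-estimate and compute the benefit of each candidate.
        scored = [(benefit(c, tau, seen), c) for c in candidates]
        query = None
        if scored:
            best_benefit, best = max(scored, key=lambda p: p[0])
            # Step 4: continue only if the best candidate is worth sending.
            if best_benefit >= threshold:
                query = best
                candidates.remove(best)
    return ranked[:k], issued
```

Passing the hooks in keeps the loop independent of any particular source; for a real interface, `search` would also have to respect the daily quota.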

Experiments
• Datasets
  – PubMed
  – TREC
• Quality measure
  – Spearman’s footrule
• Algorithms
  – Baseline
  – Summary-based
  – Query-based
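
For reference, Spearman's footrule sums the absolute differences between the positions of each item in two rankings. The sketch below is a minimal version for top-k lists; placing an item that is missing from one list just past the end of that list is a common convention assumed here, and the slides do not say whether the reported values are normalized.

```python
def footrule(ranking_a, ranking_b):
    """Spearman's footrule distance between two ranked lists of ids.

    Positions are 1-based. An item absent from a list is treated as if
    it sat at position len(list) + 1 (an assumed convention).
    """
    pos_a = {d: i for i, d in enumerate(ranking_a, start=1)}
    pos_b = {d: i for i, d in enumerate(ranking_b, start=1)}
    miss_a, miss_b = len(ranking_a) + 1, len(ranking_b) + 1
    items = set(pos_a) | set(pos_b)
    return sum(abs(pos_a.get(d, miss_a) - pos_b.get(d, miss_b)) for d in items)
```

A distance of 0 means the two lists agree exactly; comparing an approximate top-k against the baseline's top-k in this way gives a single number per query.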

Experiments: Quality
• Compared footrule distance relative to the baseline (baseline = retrieve everything, fetch locally, re-rank)
• Lower values are better
• Query-based sampling is consistently better than the alternatives

Experiments: Efficiency
• Measured the number of documents, the number of queries, and the execution time of the alternative techniques

Conclusion
• Technique for top-k queries on top of document databases without ranking support
• Introduction of an exploration-exploitation framework for building the necessary statistics on the fly, during query execution
• Order-of-magnitude efficiency improvements with small losses in quality

Thank you!
Questions?